GIR-Bench: Versatile Benchmark for Generating Images with Reasoning
Reasoning-centric evaluation of multimodal unified models across Understanding – Generation Consistency (UGC), Text-to-Image, and Editing, revealing the persistent gap between reasoning and faithful generation.
Highlights
- Three views: Reasoning-driven Understanding ↔ Generation Consistency / T2I / Editing
- Objective, task-specific, interpretable evaluation metrics that reduce LMM-as-a-Judge bias
- Evaluation of 21 models with in-depth analysis; exposes the reasoning → generation/editing transfer bottleneck
What’s inside
- Understanding ↔ Generation Consistency: 300 real-world entities across zoology, botany, and geography
- T2I: 300 logical reasoning entities across numerical, layout, and text rendering
- Editing: 370 planning- and reasoning-driven entities across puzzle, sudoku, and reasoning perception
Motivation
Most existing evaluations focus on shallow text–image alignment and lean on “LMM-as-a-Judge.” GIR-Bench instead probes complex, implicit reasoning in unified multimodal models across Understanding–Generation Consistency, Text-to-Image, and Editing, using objective, task-specific, interpretable metrics for a more comprehensive and fair assessment. Large-scale analyses show that while unified models outperform generation-only systems on reasoning-heavy tasks, they still struggle to reliably translate reasoning into faithful visual outputs.
Benchmark Overview
The three sub-benchmarks test whether models can parse knowledge and constraints and then reliably inject that reasoning into generation and editing.
Goal: Compare understanding vs generation for the same real-world entity. Implicit descriptions trigger generation; real images are used for understanding evaluation.
- Domains: zoology, botany, geography (world knowledge)
- Metric: DINOv3 similarity (generation); VQA accuracy (understanding)
Goal: Reasoning-driven T2I (numerical, layout, text rendering).
- Numerical: logic/counting; exact-match on detected counts
- Layout: verify rules via detected boxes (e.g., left↔right)
- Text: implicit slogan → OCR substring score (s_wc)
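The exact definition of s_wc is given in the paper; as a rough, non-official illustration, OCR'd text can be scored against the target slogan by word coverage. The function name and normalization below are assumptions.

```python
import re


def word_coverage_score(target_slogan: str, ocr_text: str) -> float:
    """Fraction of target-slogan words found in the OCR output.

    Illustrative only: the official s_wc metric may normalize or match differently.
    """
    def normalize(s: str) -> str:
        return re.sub(r"[^a-z0-9 ]", " ", s.lower())

    target_words = normalize(target_slogan).split()
    if not target_words:
        return 0.0
    ocr_norm = normalize(ocr_text)
    return sum(w in ocr_norm for w in target_words) / len(target_words)


# Example: the prompt implied the slogan "JUST DO IT" but OCR read "JUST DO 1T".
print(word_coverage_score("JUST DO IT", "just do 1t"))  # -> 0.666...
```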
Goal: Reasoning-driven editing (puzzle reassembly, sudoku completion, reasoning perception segmentation).
- Puzzle: normalized FID
- Sudoku: OCR digits then grid-level accuracy
- Perception: IoU of predicted mask vs ground truth

Qualitative Results
Preview typical cases for the three sub-benchmarks.

Understanding ↔ Generation
- Implicit prompt triggers generation
- 300 real-world entities sourced from the Internet and open datasets
- Verify whether models use shared knowledge for understanding and generation
- Domains: zoology, botany, geography
Evaluation
- Real image used for understanding
- DINOv3 similarity (generation)
- VQA accuracy (understanding)
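A minimal sketch of the generation-side score, assuming a Hugging Face DINOv3 checkpoint and CLS-token pooling (both assumptions; the released evaluation code may use a different variant or pooling): embed the generated image and a real reference photo, then take cosine similarity.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint id; swap in the DINOv3 weights used by the official evaluation.
CKPT = "facebook/dinov3-vitb16-pretrain-lvd1689m"
processor = AutoImageProcessor.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT).eval()


@torch.no_grad()
def dino_similarity(generated_path: str, reference_path: str) -> float:
    """Cosine similarity between global DINOv3 features of two images."""
    images = [Image.open(p).convert("RGB") for p in (generated_path, reference_path)]
    inputs = processor(images=images, return_tensors="pt")
    feats = model(**inputs).last_hidden_state[:, 0]  # CLS token as global descriptor
    feats = F.normalize(feats, dim=-1)
    return float(feats[0] @ feats[1])


# Example: score a generated image of an implicitly described entity against a real photo.
# print(dino_similarity("generated.png", "reference.jpg"))
```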

Reasoning-driven T2I
- 300 complex logical reasoning entities
- Requires not only retrieving world knowledge but also applying precise logical reasoning to meet specified constraints
Evaluation
- Numerical: exact-match on detected counts
- Layout: verify rules via detected boxes (e.g., left↔right)
- Text: implicit slogan → OCR substring score
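Once an object detector has produced boxes, the numerical and layout checks reduce to simple rules. A minimal sketch, assuming (x1, y1, x2, y2) boxes keyed by category; the detector itself and these helper names are illustrative, and the OCR coverage score is sketched earlier on this page.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def count_exact_match(detections: Dict[str, List[Box]], expected: Dict[str, int]) -> float:
    """1.0 iff every category's detected count equals the count implied by the prompt."""
    return float(all(len(detections.get(cat, [])) == n for cat, n in expected.items()))


def is_left_of(box_a: Box, box_b: Box) -> bool:
    """Layout rule: object A lies to the left of object B, compared by box centers."""
    return (box_a[0] + box_a[2]) / 2 < (box_b[0] + box_b[2]) / 2


# Example: a prompt that implies 3 apples with the cup to the left of the plate.
dets = {"apple": [(0, 0, 10, 10)] * 3, "cup": [(5, 40, 25, 60)], "plate": [(60, 40, 90, 70)]}
print(count_exact_match(dets, {"apple": 3}))         # 1.0
print(is_left_of(dets["cup"][0], dets["plate"][0]))  # True
```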

Reasoning-driven Editing
- 370 planning- and reasoning-driven entities
- Evaluate spatial perception, logical reasoning, and complex reasoning chains
Evaluation
- Puzzle: normalized FID
- Sudoku: OCR digits → grid accuracy
- Perception: IoU of prediction vs GT
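The sudoku and perception scores are straightforward once OCR digits and masks have been extracted. A minimal sketch, assuming a (9, 9) digit array from OCR and binary masks; reading grid accuracy as per-cell accuracy over the originally empty cells is an assumption, not necessarily the paper's definition.

```python
import numpy as np


def sudoku_grid_accuracy(pred: np.ndarray, solution: np.ndarray, given: np.ndarray) -> float:
    """Fraction of originally empty cells whose OCR'd digit matches the solution.

    pred / solution: (9, 9) integer digit grids; given: boolean mask of pre-filled cells.
    """
    empty = ~given
    return float((pred[empty] == solution[empty]).mean())


def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union between a predicted binary mask and the ground truth."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return float(np.logical_and(pred, gt).sum() / union)
```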
Quantitative Results
Unified models outperform generation-only baselines overall, yet reliable transfer from reasoning to generation remains the key bottleneck.
Leaderboard
Model | Type | UGC-U | UGC-G | T2I | Edit |
---|---|---|---|---|---|
Qwen2.5-VL-7B | Und | 0.978 | - | - | - |
Qwen2.5-VL-32B | Und | 0.976 | - | - | - |
GPT-5 | Und | 0.994 | - | - | - |
Gemini-2.5-Flash | Und | 0.997 | - | - | - |
SD-3.5-Large | Gen | - | 0.288 | 0.134 | - |
HiDream-I1-Full | Gen | - | 0.378 | 0.157 | - |
FLUX.1-schnell | Gen | - | 0.292 | 0.159 | - |
FLUX.1-Kontext-dev | Edit | - | - | - | 0.105 |
ICEdit | Edit | - | - | - | 0.095 |
Step1X-Edit | Edit | - | - | - | 0.071 |
Show-o2-7B | Unified | 0.935 | 0.198 | 0.023 | - |
Janus-Pro-7B | Unified | 0.874 | 0.211 | 0.038 | - |
BLIP3o-NEXT-SFT-3B | Unified | 0.974 | 0.263 | 0.159 | - |
Ovis-U1-3B | Unified | 0.909 | 0.244 | 0.171 | 0.093 |
OmniGen2 | Unified | 0.952 | 0.294 | 0.143 | 0.073 |
UniPic2-Metaquery-9B | Unified | - | 0.301 | 0.139 | 0.132 |
UniWorld-V1 | Unified | - | 0.302 | 0.138 | 0.054 |
BAGEL-7B | Unified | 0.937 | 0.295 | 0.169 | 0.098 |
BAGEL-7B w/ CoT | Unified | 0.968 | 0.341 | 0.276 | 0.140 |
Qwen-Image | Unified | - | 0.429 | 0.224 | - |
Qwen-Image-Edit | Unified | - | - | - | 0.158 |
Gemini-2.5-Flash-Image | Unified | - | 0.593 | 0.650 | 0.343 |
GPT-Image-1 | Unified | - | 0.689 | 0.622 | 0.351 |
Resources
BibTeX
@article{li2025gir-bench,
  title={GIR-Bench: Versatile Benchmark for Generating Images with Reasoning},
  author={Hongxiang Li and Yaowei Li and Bin Lin and Yuwei Niu and Yuhang Yang and Xiaoshuang Huang and Jiayin Cai and Xiaolong Jiang and Yao Hu and Long Chen},
  journal={arXiv preprint arXiv:2510.11026},
  year={2025}
}