GIR-Bench: Versatile Benchmark for Generating Images with Reasoning
Reasoning-centric evaluation of multimodal unified models across Understanding – Generation Consistency (UGC), Text-to-Image, and Editing, revealing the persistent gap between reasoning and faithful generation.
Highlights
- Three views: Reasoning-driven Understanding ↔ Generation Consistency / T2I / Editing
- Objective, task-specific, interpretable evaluation metrics that reduce LMM-as-a-Judge bias
- Evaluation of 21 models with in-depth analysis; exposes the reasoning → generation/editing transfer bottleneck
What’s inside
- Understanding ↔ Generation Consistency: 300 real-world entities across zoology, botany, and geography
- T2I: 300 logical reasoning entities across numerical, layout, and text rendering
- Editing: 370 planning- and reasoning-driven entities across puzzle, sudoku, and reasoning perception
Motivation
Most existing evaluations focus on shallow text–image alignment and lean on “LMM-as-a-Judge.” GIR-Bench instead probes complex, implicit reasoning in unified multimodal models across Understanding–Generation Consistency, Text-to-Image, and Editing, using objective, task-specific, interpretable metrics for a more comprehensive and fair assessment. Large-scale analyses show that while unified models outperform generation-only systems on reasoning-heavy tasks, they still struggle to reliably translate reasoning into faithful visual outputs.
Benchmark Overview
The three sub-benchmarks test whether models can parse knowledge and constraints and then reliably inject that reasoning into generation and editing.
Goal: Compare understanding vs generation for the same real-world entity. Implicit descriptions trigger generation; real images are used for understanding evaluation.
- Domains: zoology, botany, geography (world knowledge)
- Metric: DINOv3 similarity (generation); VQA accuracy (understanding)
Goal: Reasoning-driven T2I (numerical, layout, text rendering).
- Numerical: logic/counting; exact-match on detected counts
- Layout: verify rules via detected boxes (e.g., left↔right)
- Text: implicit slogan → OCR substring score (s_wc)
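The exact definition of s_wc is given in the paper; as a rough, non-official illustration, OCR'd text can be scored against the target slogan by word coverage. The function name and normalization below are assumptions.

```python
import re


def word_coverage_score(target_slogan: str, ocr_text: str) -> float:
    """Fraction of target-slogan words found in the OCR output.

    Illustrative only: the official s_wc metric may normalize or match differently.
    """
    def normalize(s: str) -> str:
        return re.sub(r"[^a-z0-9 ]", " ", s.lower())

    target_words = normalize(target_slogan).split()
    if not target_words:
        return 0.0
    ocr_norm = normalize(ocr_text)
    return sum(w in ocr_norm for w in target_words) / len(target_words)


# Example: the prompt implied the slogan "JUST DO IT" but OCR read "JUST DO 1T".
print(word_coverage_score("JUST DO IT", "just do 1t"))  # -> 0.666...
```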
Goal: Reasoning-driven editing (puzzle reassembly, sudoku completion, reasoning perception segmentation).
- Puzzle: normalized FID
- Sudoku: OCR digits then grid-level accuracy
- Perception: IoU of predicted mask vs ground truth

Qualitative Results
Preview typical cases for the three sub-benchmarks.

Understanding ↔ Generation
- Implicit prompt triggers generation
- 300 real-world entities sourced from the Internet and open datasets
- Verify whether models use shared knowledge for understanding and generation
- Domains: zoology, botany, geography
Evaluation
- Real image used for understanding
- DINOv3 similarity (generation)
- VQA accuracy (understanding)
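A minimal sketch of the generation-side score, assuming a Hugging Face DINOv3 checkpoint and CLS-token pooling (both assumptions; the released evaluation code may use a different variant or pooling): embed the generated image and a real reference photo, then take cosine similarity.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint id; swap in the DINOv3 weights used by the official evaluation.
CKPT = "facebook/dinov3-vitb16-pretrain-lvd1689m"
processor = AutoImageProcessor.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT).eval()


@torch.no_grad()
def dino_similarity(generated_path: str, reference_path: str) -> float:
    """Cosine similarity between global DINOv3 features of two images."""
    images = [Image.open(p).convert("RGB") for p in (generated_path, reference_path)]
    inputs = processor(images=images, return_tensors="pt")
    feats = model(**inputs).last_hidden_state[:, 0]  # CLS token as global descriptor
    feats = F.normalize(feats, dim=-1)
    return float(feats[0] @ feats[1])


# Example: score a generated image of an implicitly described entity against a real photo.
# print(dino_similarity("generated.png", "reference.jpg"))
```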

Reasoning-driven T2I
- 300 complex logical reasoning entities
- Requires not only retrieving world knowledge but also applying precise logical reasoning to meet specified constraints
Evaluation
- Numerical: exact-match on detected counts
- Layout: verify rules via detected boxes (e.g., left↔right)
- Text: implicit slogan → OCR substring score
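Once an object detector has produced boxes, the numerical and layout checks reduce to simple rules. A minimal sketch, assuming (x1, y1, x2, y2) boxes keyed by category; the detector itself and these helper names are illustrative, and the OCR coverage score is sketched earlier on this page.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def count_exact_match(detections: Dict[str, List[Box]], expected: Dict[str, int]) -> float:
    """1.0 iff every category's detected count equals the count implied by the prompt."""
    return float(all(len(detections.get(cat, [])) == n for cat, n in expected.items()))


def is_left_of(box_a: Box, box_b: Box) -> bool:
    """Layout rule: object A lies to the left of object B, compared by box centers."""
    return (box_a[0] + box_a[2]) / 2 < (box_b[0] + box_b[2]) / 2


# Example: a prompt that implies 3 apples with the cup to the left of the plate.
dets = {"apple": [(0, 0, 10, 10)] * 3, "cup": [(5, 40, 25, 60)], "plate": [(60, 40, 90, 70)]}
print(count_exact_match(dets, {"apple": 3}))         # 1.0
print(is_left_of(dets["cup"][0], dets["plate"][0]))  # True
```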

Reasoning-driven Editing
- 370 planning- and reasoning-driven entities
- Evaluate spatial perception, logical reasoning, and complex reasoning chains
Evaluation
- Puzzle: normalized FID
- Sudoku: OCR digits → grid accuracy
- Perception: IoU of prediction vs GT
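The sudoku and perception scores are straightforward once OCR digits and masks have been extracted. A minimal sketch, assuming a (9, 9) digit array from OCR and binary masks; reading grid accuracy as per-cell accuracy over the originally empty cells is an assumption, not necessarily the paper's definition.

```python
import numpy as np


def sudoku_grid_accuracy(pred: np.ndarray, solution: np.ndarray, given: np.ndarray) -> float:
    """Fraction of originally empty cells whose OCR'd digit matches the solution.

    pred / solution: (9, 9) integer digit grids; given: boolean mask of pre-filled cells.
    """
    empty = ~given
    return float((pred[empty] == solution[empty]).mean())


def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union between a predicted binary mask and the ground truth."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return float(np.logical_and(pred, gt).sum() / union)
```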
Quantitative Results
Unified models outperform generation-only baselines overall, yet reliable transfer from reasoning to generation remains the key bottleneck.
Leaderboard
Model | Type | UGC-U | UGC-G | T2I | Edit |
---|---|---|---|---|---|
Qwen2.5-VL-7B | Und | 0.978 | - | - | - |
Qwen2.5-VL-32B | Und | 0.976 | - | - | - |
GPT-5 | Und | 0.994 | - | - | - |
Gemini-2.5-Flash | Und | 0.997 | - | - | - |
SD-3.5-Large | Gen | - | 0.288 | 0.134 | - |
HiDream-I1-Full | Gen | - | 0.378 | 0.157 | - |
FLUX.1-schnell | Gen | - | 0.292 | 0.159 | - |
FLUX.1-Kontext-dev | Edit | - | - | - | 0.105 |
ICEdit | Edit | - | - | - | 0.095 |
Step1X-Edit | Edit | - | - | - | 0.071 |
Show-o2-7B | Unified | 0.935 | 0.198 | 0.023 | - |
Janus-Pro-7B | Unified | 0.874 | 0.211 | 0.038 | - |
BLIP3o-NEXT-SFT-3B | Unified | 0.974 | 0.263 | 0.159 | - |
Ovis-U1-3B | Unified | 0.909 | 0.244 | 0.171 | 0.093 |
OmniGen2 | Unified | 0.952 | 0.294 | 0.143 | 0.073 |
UniPic2-Metaquery-9B | Unified | - | 0.301 | 0.139 | 0.132 |
UniWorld-V1 | Unified | - | 0.302 | 0.138 | 0.054 |
BAGEL-7B | Unified | 0.937 | 0.295 | 0.169 | 0.098 |
BAGEL-7B w/ CoT | Unified | 0.968 | 0.341 | 0.276 | 0.140 |
Qwen-Image | Unified | - | 0.429 | 0.224 | - |
Qwen-Image-Edit | Unified | - | - | - | 0.158 |
Gemini-2.5-Flash-Image | Unified | - | 0.593 | 0.650 | 0.343 |
GPT-Image-1 | Unified | - | 0.689 | 0.622 | 0.351 |
Resources
BibTeX
@article{li2025gir-bench,
  title={GIR-Bench: Versatile Benchmark for Generating Images with Reasoning},
  author={Hongxiang Li and Yaowei Li and Bin Lin and Yuwei Niu and Yuhang Yang and Xiaoshuang Huang and Jiayin Cai and Xiaolong Jiang and Yao Hu and Long Chen},
  journal={arXiv preprint arXiv:2510.11026},
  year={2025}
}