
GIR-Bench: Versatile Benchmark for Generating Images with Reasoning

Reasoning-centric evaluation of multimodal unified models across Understanding – Generation Consistency (UGC), Text-to-Image, and Editing, revealing the persistent gap between reasoning and faithful generation.

Highlights

  • Three views: Reasoning-driven Understanding ↔ Generation Consistency / T2I / Editing
  • Objective, task-specific, and interpretable evaluation metrics that reduce LMM-as-a-Judge bias
  • 21 models evaluated with detailed analysis; reveals a reasoning → generation/editing transfer bottleneck

What’s inside

  • Understanding ↔ Generation Consistency: 300 real-world entities across zoology, botany, and geography
  • T2I: 300 logical reasoning entities across numerical, layout, and text rendering
  • Editing: 370 planning and reasoning-driven entities across puzzle, sudoku, and reasoning perception

Motivation

Most existing evaluations focus on shallow text–image alignment and lean on “LMM-as-a-Judge.” GIR-Bench instead probes complex, implicit reasoning in unified multimodal models across Understanding–Generation Consistency, Text-to-Image, and Editing, using objective, task-specific, interpretable metrics for a more comprehensive and fair assessment. Large-scale analyses show that while unified models outperform generation-only systems on reasoning-heavy tasks, they still struggle to reliably translate reasoning into faithful visual outputs.

Benchmark Overview

Three sub-benchmarks test whether models can parse knowledge and constraints and reliably inject reasoning into generation and editing.

Understanding ↔ Generation Consistency
Goal: Compare understanding and generation for the same real-world entity. Implicit descriptions trigger generation; real images are used for understanding evaluation.

  • Domains: zoology, botany, geography (world knowledge)
  • Metric: DINOv3 similarity (generation); VQA accuracy (understanding)

Reasoning-driven T2I
Goal: Generate images from prompts that require numerical, layout, and text-rendering reasoning.

  • Numerical: logic/counting; exact-match on detected counts
  • Layout: verify rules via detected boxes (e.g., left↔right)
  • Text: implicit slogan → OCR substring score (swc)

Reasoning-driven Editing
Goal: Edit images through planning and reasoning (puzzle reassembly, sudoku completion, reasoning-perception segmentation).

  • Puzzle: normalized FID
  • Sudoku: OCR digits then grid-level accuracy
  • Perception: IoU of predicted mask vs ground truth
Overview Figure

Qualitative Results

Preview typical cases for the three sub-benchmarks.

UGC: understanding vs generation case

Understanding ↔ Generation

  • Implicit prompt triggers generation
  • 300 real-world entities from the Internet and open datasets
  • Verify whether models apply shared knowledge to both understanding and generation
  • Domains: zoology, botany, geography

Evaluation

  • A real reference image is used for understanding evaluation
  • DINOv3 similarity (generation); see the sketch below
  • VQA accuracy (understanding)
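
The generation side of UGC scores how close the generated image is to a real reference of the same entity in DINOv3 feature space. Below is a minimal sketch assuming a DINOv3 checkpoint loadable through Hugging Face transformers; the checkpoint ID, CLS-token pooling, and preprocessing are illustrative assumptions, not the official evaluation code.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint name; substitute whatever DINOv3 weights you actually use.
MODEL_ID = "facebook/dinov3-vitb16-pretrain-lvd1689m"

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def dino_embedding(path: str) -> torch.Tensor:
    """L2-normalized global feature (CLS token) for a single image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    cls = model(**inputs).last_hidden_state[:, 0]
    return F.normalize(cls, dim=-1)

def dino_similarity(generated_path: str, reference_path: str) -> float:
    """Cosine similarity in [-1, 1]; higher means the generated image is
    closer to the real image of the target entity."""
    gen = dino_embedding(generated_path)
    ref = dino_embedding(reference_path)
    return float((gen * ref).sum())
```
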
T2I: numerical, layout, and text rendering

Reasoning-driven T2I

  • 300 complex logical-reasoning entities
  • Requires not only retrieving world knowledge but also applying precise logical reasoning to satisfy the specified constraints

Evaluation

  • Numerical: exact-match on detected counts
  • Layout: verify rules via detected boxes (e.g., left↔right)
  • Text: implicit slogan → OCR substring score (see the sketch below)
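
A minimal sketch of the three T2I checks, assuming an upstream detector and OCR engine have already produced object counts, boxes, and recognized text. The box format, the left-of rule check, and the normalized longest-common-substring stand-in for the swc score are assumptions, not the official scoring code.

```python
from difflib import SequenceMatcher

def count_exact_match(detected_count: int, expected_count: int) -> float:
    """Numerical: 1.0 only if the detected object count equals the target count."""
    return 1.0 if detected_count == expected_count else 0.0

def is_left_of(box_a, box_b) -> bool:
    """Layout: boxes are (x1, y1, x2, y2); A is left of B if A's center-x is smaller."""
    return (box_a[0] + box_a[2]) / 2 < (box_b[0] + box_b[2]) / 2

def substring_score(ocr_text: str, target_slogan: str) -> float:
    """Text: longest common substring between the OCR output and the target slogan,
    normalized by the slogan length (an illustrative stand-in for the swc score)."""
    a, b = ocr_text.lower(), target_slogan.lower()
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(b), 1)
```
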
Editing: puzzle, sudoku, and perception

Reasoning-driven Editing

  • 370 planning and reasoning-driven entities
  • Evaluate spatial perception, logical reasoning, and complex reasoning chains

Evaluation

  • Puzzle: normalized FID
  • Sudoku: OCR digits → grid accuracy
  • Perception: IoU of prediction vs. GT (see the sketch below)
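
A minimal sketch of two editing metrics, assuming the sudoku grid has already been OCR-parsed into digits and the segmentation masks are binary arrays of the same shape. Treating grid-level accuracy as whole-grid exact match is an assumption; the benchmark could also score per cell.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Reasoning perception: IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def sudoku_grid_correct(ocr_grid: list[list[int]], solution: list[list[int]]) -> float:
    """Sudoku: 1.0 only if every OCR-read digit matches the ground-truth solution;
    grid-level accuracy is then the mean of this score over all puzzles."""
    return float(ocr_grid == solution)
```
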

Quantitative Results

Unified models outperform generation-only baselines overall, yet reliable transfer from reasoning to generation remains the key bottleneck.

Leaderboard

UGC = Understanding – Generation Consistency; UGC-U = understanding score, UGC-G = generation score.

| Model | Type | UGC-U | UGC-G | T2I | Edit |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | Und | 0.978 | - | - | - |
| Qwen2.5-VL-32B | Und | 0.976 | - | - | - |
| GPT-5 | Und | 0.994 | - | - | - |
| Gemini-2.5-Flash | Und | 0.997 | - | - | - |
| SD-3.5-Large | Gen | - | 0.288 | 0.134 | - |
| HiDream-I1-Full | Gen | - | 0.378 | 0.157 | - |
| FLUX.1-schnell | Gen | - | 0.292 | 0.159 | - |
| FLUX.1-Kontext-dev | Edit | - | - | - | 0.105 |
| ICEdit | Edit | - | - | - | 0.095 |
| Step1X-Edit | Edit | - | - | - | 0.071 |
| Show-o2-7B | Unified | 0.935 | 0.198 | 0.023 | - |
| Janus-Pro-7B | Unified | 0.874 | 0.211 | 0.038 | - |
| BLIP3o-NEXT-SFT-3B | Unified | 0.974 | 0.263 | 0.159 | - |
| Ovis-U1-3B | Unified | 0.909 | 0.244 | 0.171 | 0.093 |
| OmniGen2 | Unified | 0.952 | 0.294 | 0.143 | 0.073 |
| UniPic2-Metaquery-9B | Unified | - | 0.301 | 0.139 | 0.132 |
| UniWorld-V1 | Unified | - | 0.302 | 0.138 | 0.054 |
| BAGEL-7B | Unified | 0.937 | 0.295 | 0.169 | 0.098 |
| BAGEL-7B w/ CoT | Unified | 0.968 | 0.341 | 0.276 | 0.140 |
| Qwen-Image | Unified | - | 0.429 | 0.224 | - |
| Qwen-Image-Edit | Unified | - | - | - | 0.158 |
| Gemini-2.5-Flash-Image | Unified | - | 0.593 | 0.650 | 0.343 |
| GPT-Image-1 | Unified | - | 0.689 | 0.622 | 0.351 |

Resources

BibTeX


@article{li2025gir-bench,
  title={GIR-Bench: Versatile Benchmark for Generating Images with Reasoning},
  author={Hongxiang Li and Yaowei Li and Bin Lin and Yuwei Niu and Yuhang Yang and Xiaoshuang Huang and Jiayin Cai and Xiaolong Jiang and Yao Hu and Long Chen},
  journal={arXiv preprint arXiv:2510.11026},
  year={2025}
}