SPAR
Semantic-Pixel Self-Alignment and Adaptive Routing
for Unified Multimodal Models
¹The Hong Kong University of Science and Technology ²Xiaohongshu Inc.
*Equal contribution †Corresponding author
Semantic encoders are lossy: they discard high-frequency detail that pixel reconstruction needs, producing blurry artifacts in the generative space.
A lightweight semantic stream anchors discriminative features while a Transformer-augmented pixel stream recovers fine-grained details; the two streams are decoupled by design.
The optimized tokenizer becomes its own alignment teacher for the diffusion model, with no external feature network required.
Each token adaptively aggregates multi-layer MLLM features based on its distinct semantic role, better harnessing hierarchical representations for generation.
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: (1) endowing semantic encoders with high-fidelity reconstruction capabilities, and (2) effectively aligning generative models with semantic spaces without relying on external teachers.
To this end, we propose a novel unified multimodal framework featuring Semantic-Pixel self-alignment and Adaptive Routing (SPAR). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR establishes the state-of-the-art for unified architectures, achieving exceptional generation and reconstruction quality while preserving foundational visual understanding capabilities.
Method
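As a rough illustration of the Dynamic Token Routing idea (each token aggregating multi-layer MLLM features with its own weights), the following is a minimal sketch, not the authors' implementation. It assumes a softmax gate computed from each token's last-layer state; the function name `route_tokens`, the gate matrix `gate_w`, and the choice of routing query are all hypothetical.

```python
import numpy as np

def route_tokens(layer_feats: np.ndarray, gate_w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-token soft routing over MLLM layers (illustrative assumption).

    layer_feats: (L, T, D) hidden states from L layers for T tokens.
    gate_w:      (D, L) hypothetical gate projection.
    Returns fused features (T, D) and routing weights (T, L).
    """
    # Assumption: the last layer's state of each token serves as the routing query.
    query = layer_feats[-1]                            # (T, D)
    logits = query @ gate_w                            # (T, L)
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)     # each row sums to 1
    # Weighted sum over layers, computed separately for every token.
    fused = np.einsum("tl,ltd->td", weights, layer_feats)
    return fused, weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 6, 8))   # 4 layers, 6 tokens, 8 dims
gate = rng.normal(size=(8, 4))
fused, weights = route_tokens(feats, gate)
```

With an all-zero gate the weights become uniform and the result reduces to a plain average over layers, which makes the per-token gate the only source of adaptivity in this sketch.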
Results
SPAR sets a new state-of-the-art among unified tokenizers and is highly competitive with generation-only tokenizers, reaching rFID 0.27, PSNR 26.65, and SSIM 0.856.
| Model | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|
| Generative Only Tokenizer | | | | |
| VAR | 16 | 1.00 | 22.63 | 0.755 |
| RAE | 16 | 0.49 | 19.23 | 0.620 |
| SD-VAE | 16 | 2.64 | 22.13 | 0.590 |
| DC-AE | 32 | 0.69 | 23.85 | 0.660 |
| VA-VAE | 16 | 0.28 | 27.96 | 0.790 |
| Unified Tokenizer | | | | |
| VILA-U | 16 | 1.80 | - | - |
| TokenFlow | 16 | 1.37 | 21.41 | 0.687 |
| DualToken | 16 | 0.54 | 23.56 | 0.742 |
| EMU2 | 14 | 3.27 | 13.49 | 0.420 |
| UniLIP | 32 | 0.79 | 22.99 | 0.747 |
| SPAR | 32 | 0.27 | 26.65 | 0.856 |
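For reference, PSNR is conventionally derived from the mean squared error between an image and its reconstruction. The sketch below is a generic implementation, assuming 8-bit images with a peak value of 255; the source does not state its evaluation protocol, and `psnr` and `max_value` are illustrative names.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

ref = np.zeros((4, 4), dtype=np.uint8)
rec = ref + 1              # every pixel off by one gray level
score = psnr(ref, rec)     # about 48.13 dB
```

Under this definition, halving the per-pixel error raises PSNR by roughly 6 dB, which gives a sense of scale for the differences between rows in the table.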
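Even with the visual encoder unfrozen during training, SPAR shows essentially no degradation in understanding: both variants match or slightly exceed their InternVL3 baselines on nearly every benchmark (the one exception is MMMU for SPAR-1B, 43.2 vs. 43.4), which we attribute to the dual-stream self-alignment.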
| Model | LLM | MME-P | MMB | MMMU | MM-Vet | SEED | MMVP |
|---|---|---|---|---|---|---|---|
| Understanding Only | | | | | | | |
| LLaVA-OV | 1B | 1238 | 52.1 | 31.4 | 29.1 | 65.5 | - |
| InternVL3-1B | 1B | 1492 | 72.6 | 43.4 | 59.5 | 71.1 | 67.3 |
| InternVL3-2B | 2B | 1633 | 80.6 | 48.2 | 62.2 | 75.0 | 72.7 |
| Qwen2.5-VL-3B | 3B | - | 79.1 | 53.1 | 61.8 | - | - |
| Emu3-Chat-8B | 8B | 1244 | 58.5 | 31.6 | 37.2 | 68.2 | 36.6 |
| Understanding & Generation | | | | | | | |
| Janus-Pro-7B | 7B | 1567 | 79.2 | 41.0 | 50.0 | 72.1 | - |
| BAGEL-7B | 3B | 1610 | 79.2 | 43.2 | 48.2 | - | 54.7 |
| BLIP3-o-4B | 4B | 1528 | 78.6 | 46.6 | 60.1 | 73.8 | - |
| Tar-7B | 7B | 1571 | 74.4 | 39.0 | - | 73.0 | - |
| SPAR-1B | 1B | 1500 | 73.0 | 43.2 | 59.8 | 71.5 | 68.9 |
| SPAR-3B | 2B | 1638 | 80.7 | 48.7 | 62.2 | 75.1 | 73.3 |
SPAR-3B reaches an overall GenEval score of 0.91, the highest in the table, and an overall WISE score of 0.64, surpassing even the much larger generation-only FLUX.1-dev (12B).
| Model | Params | GenEval Count | GenEval Pos. | GenEval Overall | WISE Cultural | WISE Biology | WISE Overall |
|---|---|---|---|---|---|---|---|
| Generation Only | | | | | | | |
| FLUX.1-dev | 12B | 0.75 | 0.68 | 0.82 | 0.48 | 0.42 | 0.50 |
| SD3-Medium | 2B | 0.72 | 0.33 | 0.74 | 0.42 | 0.39 | 0.42 |
| Understanding & Generation | | | | | | | |
| Janus-Pro | 7B | 0.59 | 0.79 | 0.80 | 0.30 | 0.36 | 0.35 |
| BAGEL | 7B | 0.81 | 0.64 | 0.82 | 0.44 | 0.44 | 0.52 |
| OpenUni-L | 3B | 0.77 | 0.75 | 0.85 | 0.51 | 0.48 | 0.52 |
| Tar | 7B | 0.83 | 0.80 | 0.84 | - | - | - |
| SPAR-1B | 1B | 0.83 | 0.85 | 0.89 | 0.54 | 0.51 | 0.57 |
| SPAR-3B | 3B | 0.84 | 0.87 | 0.91 | 0.67 | 0.61 | 0.64 |
SPAR-3B achieves an overall editing score of 4.01, the highest among open-source methods and close to the proprietary GPT-4o (4.20), a +0.57 improvement over the second-best open-source model, OmniGen2 (3.44).
| Model | Add | Adj. | Ext. | Repl. | Rmv. | Bkg. | Style | Hyb. | Act. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 4.61 | 4.33 | 2.90 | 4.35 | 3.66 | 4.57 | 4.93 | 3.96 | 4.89 | 4.20 |
| Open-source | | | | | | | | | | |
| Step1X-Edit | 3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16 | 4.63 | 2.64 | 2.52 | 3.06 |
| BAGEL | 3.56 | 3.31 | 1.70 | 3.30 | 2.62 | 3.24 | 4.49 | 2.38 | 4.17 | 3.20 |
| UniWorld-V1 | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99 | 4.21 | 2.96 | 2.74 | 3.26 |
| OmniGen2 | 3.57 | 3.06 | 1.77 | 3.74 | 3.20 | 3.57 | 4.81 | 2.52 | 4.68 | 3.44 |
| SPAR-3B | 4.31 | 3.93 | 2.32 | 4.52 | 4.15 | 4.20 | 4.87 | 3.12 | 4.69 | 4.01 |
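The margins quoted for the editing results can be recomputed directly from the Overall column. The snippet below is only a bookkeeping sketch over the values in the table; the variable names are illustrative.

```python
# Overall editing scores copied from the table above.
overall = {
    "GPT-4o": 4.20,
    "Step1X-Edit": 3.06,
    "BAGEL": 3.20,
    "UniWorld-V1": 3.26,
    "OmniGen2": 3.44,
    "SPAR-3B": 4.01,
}

# GPT-4o is the only proprietary entry; rank the remaining models.
open_source = {name: s for name, s in overall.items() if name != "GPT-4o"}
ranked = sorted(open_source, key=open_source.get, reverse=True)
best, runner_up = ranked[0], ranked[1]

margin = round(open_source[best] - open_source[runner_up], 2)
gap_to_gpt4o = round(overall["GPT-4o"] - overall[best], 2)
print(best, runner_up, margin, gap_to_gpt4o)
```

This ranks SPAR-3B first and OmniGen2 second among the open-source entries, with a margin of 0.57 between them and a remaining gap of 0.19 to GPT-4o.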
Qualitative Gallery
BibTeX
@inproceedings{li2026spar,
title = {SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing
for Unified Multimodal Models},
author = {Li, Hongxiang and Chen, Hongxu and Zhu, Chenyang and
Huang, Xiaoshuang and Cai, Jiayin and Jiang, Xiaolong and
Hu, Yao and Chen, Long},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}