SPAR
Semantic-Pixel Self-Alignment and Adaptive Routing
for Unified Multimodal Models

Hongxiang Li¹,*, Hongxu Chen¹,*, Chenyang Zhu¹, Xiaoshuang Huang², Jiayin Cai², Xiaolong Jiang², Yao Hu², Long Chen¹,†

¹The Hong Kong University of Science and Technology ²Xiaohongshu Inc.

*Equal contribution †Corresponding author


🔥🔥🔥 Why SPAR? 🔥🔥🔥

🚫 Semantics ≠ Pixels

Semantic encoders are lossy: they discard high-frequency detail that pixel reconstruction needs, producing blurry artifacts in the generative space.

🧩 Asymmetric dual-stream

A lightweight semantic stream anchors discriminative features while a Transformer-augmented pixel stream recovers fine-grained details — decoupled by design.

🚀 Self-aligned generation

The optimized tokenizer becomes its own alignment teacher for the diffusion model — no external feature network required.

🔭 Dynamic token routing

Each token adaptively aggregates multi-layer MLLM features based on its distinct semantic role, better harnessing hierarchical representations for generation.

SPAR motivation overview — **(a) Image Reconstruction:** existing methods that model directly in the semantic representation space suffer from lossy compression and struggle to preserve high-frequency detail; SPAR effectively recovers pixel-level details. **(b) Representation Alignment Paradigm:** unlike approaches that rely on external semantic encoders to guide the generative model, SPAR natively employs the unified tokenizer itself as the alignment teacher.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: (1) endowing semantic encoders with high-fidelity reconstruction capabilities, and (2) effectively aligning generative models with semantic spaces without relying on external teachers.

To this end, we propose a novel unified multimodal framework featuring Semantic-Pixel self-alignment and Adaptive Routing (SPAR). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR sets a new state of the art among unified architectures, achieving exceptional generation and reconstruction quality while preserving foundational visual understanding capabilities.

Method

Semantic-Pixel Self-Aligned Unified Tokenizer

Our tokenizer explicitly decouples semantic preservation from pixel reconstruction. A lightweight semantic stream maps features into a compact latent space, strictly anchored by the frozen encoder to prevent catastrophic forgetting (\(\mathcal{L}_{s}\)). Concurrently, a Transformer-augmented pixel stream bridges the dimensional gap by aligning with the native pixel latent space (\(\mathcal{L}_{p}\)) to recover high-frequency spatial details. Both streams are fused and decoded to reconstruct the image (\(\mathcal{L}_{r}\)).
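
The three terms above can be combined into a single training objective. The sketch below is only an illustration under stated assumptions: it uses plain mean squared error for every term and equal weights, whereas the page names the losses (L_s, L_p, L_r) without giving their exact form or weighting. The function names and the weights `w_s`, `w_p`, `w_r` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat lists of floats."""
    assert len(a) == len(b) and len(a) > 0
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def tokenizer_loss(sem_pred, sem_target, pix_pred, pix_target,
                   recon, image, w_s=1.0, w_p=1.0, w_r=1.0):
    """Weighted sum of the three tokenizer terms (assumed form, not the paper's).

    sem_pred / sem_target: semantic-stream output vs. frozen-encoder features (L_s)
    pix_pred / pix_target: pixel-stream output vs. native pixel latents (L_p)
    recon / image:         decoded image vs. input image (L_r)
    """
    loss_s = mse(sem_pred, sem_target)   # semantic anchoring
    loss_p = mse(pix_pred, pix_target)   # pixel-latent alignment
    loss_r = mse(recon, image)           # image reconstruction
    total = w_s * loss_s + w_p * loss_p + w_r * loss_r
    return total, {"L_s": loss_s, "L_p": loss_p, "L_r": loss_r}


if __name__ == "__main__":
    total, parts = tokenizer_loss(
        sem_pred=[1.0, 2.0], sem_target=[1.0, 2.0],
        pix_pred=[0.0, 0.0], pix_target=[1.0, 1.0],
        recon=[0.5], image=[0.0],
    )
    print(total, parts)
```

Keeping the per-term values alongside the total makes it easy to monitor whether semantic anchoring or pixel recovery dominates during training.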

Unified Multimodal Model

The frozen MLLM processes multimodal inputs and learnable queries. Dynamic Token Routing (DTR) adaptively aggregates multi-layer MLLM hidden states based on distinct token semantics to condition the DiT. The optimized tokenizer serves as an internal alignment teacher, establishing a self-alignment paradigm that eliminates reliance on external teachers.
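
One way to picture per-token routing over layers is a softmax gate: each token scores every MLLM layer, and its conditioning feature is the weighted sum of its hidden states across layers. The sketch below is a minimal illustration under assumptions that are not stated on this page: the gate is a single hypothetical vector per layer (`gate_weights`), and the routing query is the token's last-layer feature. The actual DTR design may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def route_tokens(hidden_states, gate_weights):
    """Fuse multi-layer features with per-token layer weights.

    hidden_states[l][t] is the feature vector of token t at layer l.
    gate_weights[l] is a hypothetical gate vector for layer l; the logit for
    (token t, layer l) is its dot product with the token's last-layer feature.
    Returns (fused, weights): one fused vector and one weight list per token.
    """
    num_layers = len(hidden_states)
    num_tokens = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    fused, weights = [], []
    for t in range(num_tokens):
        query = hidden_states[-1][t]
        logits = [sum(g * q for g, q in zip(gate_weights[l], query))
                  for l in range(num_layers)]
        w = softmax(logits)  # this token's distribution over layers
        fused.append([sum(w[l] * hidden_states[l][t][d] for l in range(num_layers))
                      for d in range(dim)])
        weights.append(w)
    return fused, weights


if __name__ == "__main__":
    hidden = [[[0.0, 0.0], [2.0, 2.0]],   # layer 0: two tokens, dim 2
              [[2.0, 4.0], [0.0, 0.0]]]   # layer 1
    gates = [[1.0, 0.0], [0.0, 0.0]]
    fused, weights = route_tokens(hidden, gates)
    print(weights)  # tokens receive different layer mixtures
```

With an all-zero gate every token falls back to a uniform average over layers, which is a convenient baseline to compare against learned routing.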

Results

Image Reconstruction — ImageNet 50k (256×256)

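SPAR sets a new state of the art among unified tokenizers and is highly competitive with generation-only tokenizers, reaching rFID 0.27, PSNR 26.65, and SSIM 0.856. Across both groups it has the best rFID and SSIM in the table; only VA-VAE reports a higher PSNR (27.96).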

Model	Ratio	rFID ↓	PSNR ↑	SSIM ↑
Generation-Only Tokenizer
VAR	16	1.00	22.63	0.755
RAE	16	0.49	19.23	0.620
SD-VAE	16	2.64	22.13	0.590
DC-AE	32	0.69	23.85	0.660
VA-VAE	16	0.28	27.96	0.790
Unified Tokenizer
VILA-U	16	1.80	–	–
TokenFlow	16	1.37	21.41	0.687
DualToken	16	0.54	23.56	0.742
EMU2	14	3.27	13.49	0.420
UniLIP	32	0.79	22.99	0.747
SPAR	32	0.27	26.65	0.856
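
For readers less familiar with the PSNR column, the standard definition is PSNR = 10 · log10(MAX² / MSE), where higher is better. The sketch below computes it for a single image given as a flat list of pixel values. The pixel range (`max_value`) and how scores are averaged over the 50k images are not stated on this page, so both are assumptions here.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equal-length flat pixel lists.

    max_value is the largest possible pixel value (255 for 8-bit images,
    1.0 for images scaled to [0, 1]); which one applies is an assumption.
    """
    assert len(reference) == len(reconstruction) and len(reference) > 0
    err = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


if __name__ == "__main__":
    original = [0.0, 0.5, 1.0, 0.25]
    decoded = [0.1, 0.5, 0.9, 0.25]
    print(psnr(original, decoded, max_value=1.0))
```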

Multimodal Understanding

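Even with the visual encoder unfrozen during training, SPAR preserves understanding performance: compared with the InternVL3 models of matching LLM size, both variants match or slightly exceed the baseline on nearly every benchmark (the one exception is MMMU for SPAR-1B, 43.2 vs. 43.4). We attribute this to the dual-stream self-alignment.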

Model	LLM	MME-P	MMB	MMMU	MM-Vet	SEED	MMVP
Understanding Only
LLaVA-OV	1B	1238	52.1	31.4	29.1	65.5	–
InternVL3-1B	1B	1492	72.6	43.4	59.5	71.1	67.3
InternVL3-2B	2B	1633	80.6	48.2	62.2	75.0	72.7
Qwen2.5-VL-3B	3B	–	79.1	53.1	61.8	–	–
Emu3-Chat-8B	8B	1244	58.5	31.6	37.2	68.2	36.6
Understanding & Generation
Janus-Pro-7B	7B	1567	79.2	41.0	50.0	72.1	–
BAGEL-7B	3B	1610	79.2	43.2	48.2	–	54.7
BLIP3-o-4B	4B	1528	78.6	46.6	60.1	73.8	–
Tar-7B	7B	1571	74.4	39.0	–	73.0	–
SPAR-1B	1B	1500	73.0	43.2	59.8	71.5	68.9
SPAR-3B	2B	1638	80.7	48.7	62.2	75.1	73.3

Text-to-Image Generation — GenEval & WISE

SPAR-3B reaches an overall GenEval 0.91 (ranking first) and WISE 0.64, surpassing even much larger generation-only models.

Model	Params	GenEval Count	GenEval Pos.	GenEval Overall	WISE Cultural	WISE Biology	WISE Overall
Generation Only
FLUX.1-dev	12B	0.75	0.68	0.82	0.48	0.42	0.50
SD3-Medium	2B	0.72	0.33	0.74	0.42	0.39	0.42
Understanding & Generation
Janus-Pro	7B	0.59	0.79	0.80	0.30	0.36	0.35
BAGEL	7B	0.81	0.64	0.82	0.44	0.44	0.52
OpenUni-L	3B	0.77	0.75	0.85	0.51	0.48	0.52
Tar	7B	0.83	0.80	0.84	–	–	–
SPAR-1B	1B	0.83	0.85	0.89	0.54	0.51	0.57
SPAR-3B	3B	0.84	0.87	0.91	0.67	0.61	0.64

Image Editing — ImgEdit-Bench

SPAR-3B achieves an overall score of 4.01, the highest among open-source methods and close to the proprietary GPT-4o (4.20), a +0.57 improvement over the second-best open-source model, OmniGen2 (3.44).

Model	Add	Adj.	Ext.	Repl.	Rmv.	Bkg.	Style	Hyb.	Act.	Overall
GPT-4o	4.61	4.33	2.90	4.35	3.66	4.57	4.93	3.96	4.89	4.20
Open-source
Step1X-Edit	3.88	3.14	1.76	3.40	2.41	3.16	4.63	2.64	2.52	3.06
BAGEL	3.56	3.31	1.70	3.30	2.62	3.24	4.49	2.38	4.17	3.20
UniWorld-V1	3.82	3.64	2.27	3.47	3.24	2.99	4.21	2.96	2.74	3.26
OmniGen2	3.57	3.06	1.77	3.74	3.20	3.57	4.81	2.52	4.68	3.44
SPAR-3B	4.31	3.93	2.32	4.52	4.15	4.20	4.87	3.12	4.69	4.01

Qualitative Gallery

Text-to-Image Generation

Qualitative results of SPAR image generation: accurate structure, fine-grained textures, and faithful prompt following.
