SPAR

Semantic-Pixel Self-Alignment and Adaptive Routing
for Unified Multimodal Models

Hongxiang Li1,*, Hongxu Chen1,*, Chenyang Zhu1, Xiaoshuang Huang2, Jiayin Cai2, Xiaolong Jiang2, Yao Hu2, Long Chen1,†

1The Hong Kong University of Science and Technology   2Xiaohongshu Inc.

*Equal contribution    †Corresponding author

🔥🔥🔥 Why SPAR? 🔥🔥🔥
🚫 Semantics ≠ Pixels

Semantic encoders are lossy: they discard high-frequency detail that pixel reconstruction needs, producing blurry artifacts in the generative space.

🧩 Asymmetric dual-stream

A lightweight semantic stream anchors discriminative features while a Transformer-augmented pixel stream recovers fine-grained details; the two are decoupled by design.

🚀 Self-aligned generation

The optimized tokenizer becomes its own alignment teacher for the diffusion model, so no external feature network is required.

🔭 Dynamic token routing

Each token adaptively aggregates multi-layer MLLM features based on its distinct semantic role, better harnessing hierarchical representations for generation.

(a) Image Reconstruction: existing methods that model directly in the semantic representation space suffer from lossy compression and struggle to preserve high-frequency detail; SPAR effectively recovers pixel-level details. (b) Representation Alignment Paradigm: unlike approaches that rely on external semantic encoders to guide the generative model, SPAR natively employs the unified tokenizer itself as the alignment teacher.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding but remain constrained in visual generation due to the fundamental feature discrepancy between semantic perception and pixel-level reconstruction. Bridging this gap requires overcoming two core challenges: (1) endowing semantic encoders with high-fidelity reconstruction capabilities, and (2) effectively aligning generative models with semantic spaces without relying on external teachers.

To this end, we propose a novel unified multimodal framework featuring Semantic-Pixel self-alignment and Adaptive Routing (SPAR). First, to reconcile semantic perception with pixel-level reconstruction, we introduce an asymmetric dual-stream unified tokenizer. A lightweight semantic stream anchors discriminative features, while a Transformer-augmented pixel stream recovers fine-grained visual details into a unified compact latent space. Second, to eliminate external dependencies, we propose a self-aligned generation paradigm that natively leverages this optimized tokenizer as an internal alignment teacher for the diffusion model. Furthermore, we introduce Dynamic Token Routing, which enables each token to adaptively aggregate multi-layer MLLM features based on its distinct semantic demands. Extensive experiments demonstrate that SPAR sets a new state of the art among unified architectures, achieving strong generation and reconstruction quality while preserving foundational visual understanding capabilities.

Method

Semantic-Pixel Self-Aligned Unified Tokenizer
Our tokenizer explicitly decouples semantic preservation from pixel reconstruction. A lightweight semantic stream maps features into a compact latent space, strictly anchored by the frozen encoder to prevent catastrophic forgetting (\(\mathcal{L}_{s}\)). Concurrently, a Transformer-augmented pixel stream bridges the dimensional gap by aligning with the native pixel latent space (\(\mathcal{L}_{p}\)) to recover high-frequency spatial details. Both streams are fused and decoded to reconstruct the image (\(\mathcal{L}_{r}\)).
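As a rough illustration of how the three objectives could be combined, the sketch below computes a weighted sum of \(\mathcal{L}_{s}\), \(\mathcal{L}_{p}\), and \(\mathcal{L}_{r}\) on plain Python lists. The specific loss forms (cosine distance for the semantic term, mean squared error for the other two), the weights `w_s`, `w_p`, `w_r`, and all function names are assumptions made for illustration; they are not taken from the SPAR implementation.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def tokenizer_loss(sem_feat, frozen_feat, pix_latent, target_latent,
                   recon, image, w_s=1.0, w_p=1.0, w_r=1.0):
    """Weighted sum of the three tokenizer objectives (weights are assumed)."""
    loss_s = cosine_distance(sem_feat, frozen_feat)  # L_s: stay close to the frozen encoder
    loss_p = mse(pix_latent, target_latent)          # L_p: match the pixel latent space
    loss_r = mse(recon, image)                       # L_r: reconstruct the input image
    total = w_s * loss_s + w_p * loss_p + w_r * loss_r
    return total, (loss_s, loss_p, loss_r)
```

Returning the individual terms alongside the total makes it easy to log each objective separately, which helps when checking that the semantic anchor is not being traded away for reconstruction quality.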
Unified Multimodal Model
The frozen MLLM processes multimodal inputs and learnable queries. Dynamic Token Routing (DTR) adaptively aggregates multi-layer MLLM hidden states based on distinct token semantics to condition the DiT. The optimized tokenizer serves as an internal alignment teacher, establishing a self-alignment paradigm that eliminates reliance on external teachers.
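The two ideas in this paragraph can be sketched in a few lines of plain Python. The first function mixes each token's hidden states across layers with a per-token softmax gate, which is one plausible reading of "adaptively aggregates multi-layer MLLM hidden states". The second computes a cosine-based alignment term between DiT features and tokenizer latents. The gating form, the `router_weights` parameter, the cosine loss, and every function name here are assumptions for illustration, not the design used in SPAR.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(hidden_states, router_weights):
    """Aggregate multi-layer features per token.

    hidden_states[l][t] is the feature vector of token t at layer l.
    router_weights[l] is a hypothetical learned vector that scores layer l.
    Returns (routed, gates): one mixed vector and one layer distribution per token.
    """
    num_layers = len(hidden_states)
    num_tokens = len(hidden_states[0])
    routed, gates = [], []
    for t in range(num_tokens):
        # Score every layer for this token, then normalise across layers.
        logits = [
            sum(w * h for w, h in zip(router_weights[l], hidden_states[l][t]))
            for l in range(num_layers)
        ]
        g = softmax(logits)
        dim = len(hidden_states[0][t])
        mixed = [
            sum(g[l] * hidden_states[l][t][d] for l in range(num_layers))
            for d in range(dim)
        ]
        routed.append(mixed)
        gates.append(g)
    return routed, gates


def cosine_similarity(a, b):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def alignment_loss(dit_feats, teacher_latents):
    """Average (1 - cosine similarity) between DiT features and tokenizer latents."""
    sims = [cosine_similarity(d, t) for d, t in zip(dit_feats, teacher_latents)]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    hidden = [
        [[1.0, 0.0], [0.0, 2.0]],  # layer 0: token 0, token 1
        [[3.0, 0.0], [0.0, 4.0]],  # layer 1: token 0, token 1
    ]
    router = [[10.0, 0.0], [0.0, 0.0]]
    routed, gates = route_tokens(hidden, router)
    print(gates)  # token 0 favours layer 0; token 1 splits evenly
```

Because the gate is computed per token, two tokens in the same sequence can draw on different layers, which is the behaviour the routing description calls for. In this sketch the teacher latents would come from the tokenizer itself rather than from a separate pretrained encoder.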

Results

Image Reconstruction: ImageNet 50k (256×256)

SPAR sets a new state-of-the-art among unified tokenizers and is highly competitive with generation-only tokenizers, reaching rFID 0.27, PSNR 26.65, and SSIM 0.856.

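| Model | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|
| Generative Only Tokenizer | | | | |
| VAR | 16 | 1.00 | 22.63 | 0.755 |
| RAE | 16 | 0.49 | 19.23 | 0.620 |
| SD-VAE | 16 | 2.64 | 22.13 | 0.590 |
| DC-AE | 32 | 0.69 | 23.85 | 0.660 |
| VA-VAE | 16 | 0.28 | 27.96 | 0.790 |
| Unified Tokenizer | | | | |
| VILA-U | 16 | 1.80 | – | – |
| TokenFlow | 16 | 1.37 | 21.41 | 0.687 |
| DualToken | 16 | 0.54 | 23.56 | 0.742 |
| EMU2 | 14 | 3.27 | 13.49 | 0.420 |
| UniLIP | 32 | 0.79 | 22.99 | 0.747 |
| SPAR | 32 | 0.27 | 26.65 | 0.856 |
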
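For readers less familiar with the PSNR column, the sketch below shows the standard definition, \(10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})\), on flat lists of pixel values. The function name and the default peak value of 255 are assumptions for illustration; the evaluation code behind the table is not shown on this page and may use a different value range.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no reconstruction error at all
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ref = [100, 100, 100, 100]
    rec = [110, 90, 110, 90]  # every pixel is off by 10, so MSE = 100
    print(round(psnr(ref, rec), 2))  # about 28.13 dB
```

Higher is better: the value grows by 10 dB each time the mean squared error drops by a factor of ten.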
Multimodal Understanding

Even with the visual encoder unfrozen during training, SPAR shows no meaningful degradation in understanding: it matches or improves on the InternVL3 baselines of the same LLM size on most benchmarks, thanks to the dual-stream self-alignment.

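| Model | LLM | MME-P | MMB | MMMU | MM-Vet | SEED | MMVP |
|---|---|---|---|---|---|---|---|
| Understanding Only | | | | | | | |
| LLaVA-OV | 1B | 1238 | 52.1 | 31.4 | 29.1 | 65.5 | – |
| InternVL3-1B | 1B | 1492 | 72.6 | 43.4 | 59.5 | 71.1 | 67.3 |
| InternVL3-2B | 2B | 1633 | 80.6 | 48.2 | 62.2 | 75.0 | 72.7 |
| Qwen2.5-VL-3B | 3B | – | 79.1 | 53.1 | 61.8 | – | – |
| Emu3-Chat-8B | 8B | 1244 | 58.5 | 31.6 | 37.2 | 68.2 | 36.6 |
| Understanding & Generation | | | | | | | |
| Janus-Pro-7B | 7B | 1567 | 79.2 | 41.0 | 50.0 | 72.1 | – |
| BAGEL-7B | 3B | 1610 | 79.2 | 43.2 | 48.2 | – | 54.7 |
| BLIP3-o-4B | 4B | 1528 | 78.6 | 46.6 | 60.1 | 73.8 | – |
| Tar-7B | 7B | 1571 | 74.4 | 39.0 | – | 73.0 | – |
| SPAR-1B | 1B | 1500 | 73.0 | 43.2 | 59.8 | 71.5 | 68.9 |
| SPAR-3B | 2B | 1638 | 80.7 | 48.7 | 62.2 | 75.1 | 73.3 |
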
Text-to-Image Generation: GenEval & WISE

SPAR-3B reaches an overall GenEval 0.91 (ranking first) and WISE 0.64, surpassing even much larger generation-only models.

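| Model | Params | GenEval Count | GenEval Pos. | GenEval Overall | WISE Cultural | WISE Biology | WISE Overall |
|---|---|---|---|---|---|---|---|
| Generation Only | | | | | | | |
| FLUX.1-dev | 12B | 0.75 | 0.68 | 0.82 | 0.48 | 0.42 | 0.50 |
| SD3-Medium | 2B | 0.72 | 0.33 | 0.74 | 0.42 | 0.39 | 0.42 |
| Understanding & Generation | | | | | | | |
| Janus-Pro | 7B | 0.59 | 0.79 | 0.80 | 0.30 | 0.36 | 0.35 |
| BAGEL | 7B | 0.81 | 0.64 | 0.82 | 0.44 | 0.44 | 0.52 |
| OpenUni-L | 3B | 0.77 | 0.75 | 0.85 | 0.51 | 0.48 | 0.52 |
| Tar | 7B | 0.83 | 0.80 | 0.84 | – | – | – |
| SPAR-1B | 1B | 0.83 | 0.85 | 0.89 | 0.54 | 0.51 | 0.57 |
| SPAR-3B | 3B | 0.84 | 0.87 | 0.91 | 0.67 | 0.61 | 0.64 |
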
Image Editing: ImgEdit-Bench

SPAR-3B achieves an overall score of 4.01, the highest among open-source methods and close to the proprietary GPT-4o (4.20), a +0.57 improvement over the second-best open-source method, OmniGen2 (3.44).

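| Model | Add | Adj. | Ext. | Repl. | Rmv. | Bkg. | Style | Hyb. | Act. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 4.61 | 4.33 | 2.90 | 4.35 | 3.66 | 4.57 | 4.93 | 3.96 | 4.89 | 4.20 |
| Open-source | | | | | | | | | | |
| Step1X-Edit | 3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16 | 4.63 | 2.64 | 2.52 | 3.06 |
| BAGEL | 3.56 | 3.31 | 1.70 | 3.30 | 2.62 | 3.24 | 4.49 | 2.38 | 4.17 | 3.20 |
| UniWorld-V1 | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99 | 4.21 | 2.96 | 2.74 | 3.26 |
| OmniGen2 | 3.57 | 3.06 | 1.77 | 3.74 | 3.20 | 3.57 | 4.81 | 2.52 | 4.68 | 3.44 |
| SPAR-3B | 4.31 | 3.93 | 2.32 | 4.52 | 4.15 | 4.20 | 4.87 | 3.12 | 4.69 | 4.01 |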

BibTeX

🥺 Cite this work if it's helpful
@inproceedings{li2026spar,
  title     = {SPAR: Semantic-Pixel Self-Alignment and Adaptive Routing
               for Unified Multimodal Models},
  author    = {Li, Hongxiang and Chen, Hongxu and Zhu, Chenyang and
               Huang, Xiaoshuang and Cai, Jiayin and Jiang, Xiaolong and
               Hu, Yao and Chen, Long},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}