AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

¹The Hong Kong University of Science and Technology

ECCV 2026

Abstract

Audio-video generation has recently attracted considerable research attention. Its goal is to synthesize high-quality sounding video content with fine-grained synchronization and semantic alignment between the auditory and visual components. Prior methods predominantly adopt a dual-branch design with separate tokenization and generation modules per modality, which neglects the representation gap between modalities and demands intensive computational resources for proper training. Inspired by recent advances in one-dimensional visual tokenization, we present AVTok, a novel unified tokenizer designed for holistic audio-video generation. AVTok features a dual-stream transformer-based architecture with a shared encoder-decoder and modal-specific learnable queries, which efficiently and effectively encodes an audio-video pair into a compact one-dimensional latent representation with a unified codebook. To cope with the heterogeneous information imbalance that hinders AVTok from exploiting aligned audio-visual information, we devise a hierarchical training strategy that progressively builds reconstruction capability for each modality. Extensive experiments demonstrate that AVTok excels both in audio-video reconstruction and when integrated into downstream pipelines for audio-to-video, video-to-audio, and class-conditional joint audio-video generation. AVTok paves the way for joint audio-video tokenization and provides a potential direction for building unified large multimodal models for audio-video generation.

Motivation

Motivation illustration. Left: Previous audio-video generation models typically adopt a separate pretrained tokenizer per modality and overlook the representation gap between their learned embedding spaces. Right: We instead design a unified tokenizer that jointly encodes both modalities into a shared token space. Video and audio embeddings are colored by their respective classes.

Highlights

1. 1D Unified Audio-Video Tokenization. We propose a novel task that aims at jointly encoding both auditory and visual components into a single 1D latent representation, facilitating efficient and effective audio-video reconstruction and downstream generation.

2. Dual-stream Tokenizer. A video is patchified into spatio-temporal patch embeddings, while audio is first converted to a normalized mel-spectrogram and patchified like a grayscale image. Each modality uses its own learnable holistic and patch queries plus normalization layers, but shares the encoder-decoder weights, capturing modal-specific information while still fusing the two streams implicitly.

3. Video-First-Audio-Later (VFAL). Because visual information dominates auditory information, training from scratch suppresses the audio stream. VFAL instead trains the video stream first to establish a strong latent space, then attaches and trains the audio-specific modules, and finally fine-tunes the decoders for refined unified reconstruction. This is complemented by a representation alignment objective and an AR prior that together promote cross-modal correspondence and an AR-friendly token space.

4. Downstream Generation. Extensive experiments highlight that AVTok excels not only in unified audio-video reconstruction but also in downstream tasks, including audio-to-video, video-to-audio, and class-conditional joint audio-video generation.
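
The Video-First-Audio-Later schedule from the highlights can be pictured as a three-stage plan that decides which parameter groups receive gradient updates. The following is a minimal sketch, not the authors' training code. The group names (`shared_encoder`, `shared_decoder`, `video_queries_norms`, `audio_queries_norms`) are hypothetical, and the choice of what is frozen in each stage is an assumption inferred from the description above; in particular, whether the shared weights stay frozen while the audio modules are attached is not stated on this page.

```python
from dataclasses import dataclass

# Hypothetical parameter-group names; the page names the components,
# not these identifiers.
GROUPS = (
    "shared_encoder",
    "shared_decoder",
    "video_queries_norms",
    "audio_queries_norms",
)


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # groups that are updated during this stage


# ASSUMPTION: the freeze/unfreeze pattern below is one plausible reading of
# "video first, audio later, then fine-tune the decoders".
VFAL_STAGES = (
    # Stage 1: train the video stream to establish the latent space.
    Stage("video_first", frozenset(
        {"shared_encoder", "shared_decoder", "video_queries_norms"})),
    # Stage 2: attach the audio-specific modules and train only them.
    Stage("audio_later", frozenset({"audio_queries_norms"})),
    # Stage 3: fine-tune the decoder side for unified reconstruction.
    Stage("decoder_finetune", frozenset({"shared_decoder"})),
)


def trainable_flags(stage):
    """Map every parameter group to True (updated) or False (frozen)."""
    return {group: group in stage.trainable for group in GROUPS}


for stage in VFAL_STAGES:
    print(stage.name, trainable_flags(stage))
```

In a real training loop, flags like these would typically be applied by toggling gradient tracking on the matching parameters before building the optimizer for each stage.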

Method

Method illustration. (a) AVTok features a dual-stream transformer-based architecture; each stream's forward pass is shown in (b). It jointly learns video (blue stream) and audio (green stream) reconstruction in a unified holistic scheme, using separate learnable queries and normalization layers per modality while sharing the remaining parameters for implicit cross-modal interaction. Beyond the reconstruction losses, a representation-alignment objective aligns AVTok's continuous tokens with an audio-visual foundation model, and an AR prior model encourages an AR-friendly discrete latent space — enabling downstream (c) audio-to-video, (d) video-to-audio, and (e) class-conditional joint audio-video generation.
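
To make the dual-stream idea concrete, here is a small NumPy sketch of the encoder side only. It patchifies a video into spatio-temporal patches and a mel-spectrogram like a grayscale image, appends modality-specific learnable queries, applies a modality-specific layer norm, and runs both streams through the same attention weights. Everything specific is an assumption made for illustration: the tensor sizes, the `Stream` class, the single attention layer standing in for the shared transformer, the per-modality input projection, the standardization used as "normalization", and the final concatenation of the two latent sequences. The decoder, the patch queries, quantization, and all losses are omitted.

```python
import numpy as np


def patchify_video(video, pt, ps):
    """(T, H, W, C) -> (num_patches, pt*ps*ps*C) spatio-temporal patches."""
    T, H, W, C = video.shape
    x = video.reshape(T // pt, pt, H // ps, ps, W // ps, ps, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # block indices first, then patch contents
    return x.reshape(-1, pt * ps * ps * C)


def patchify_mel(mel, ps):
    """(F, N) mel-spectrogram -> (num_patches, ps*ps), like a grayscale image."""
    F, N = mel.shape
    x = mel.reshape(F // ps, ps, N // ps, ps).transpose(0, 2, 1, 3)
    return x.reshape(-1, ps * ps)


def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta


def shared_block(tokens, w_qkv, w_out):
    """One self-attention layer; a stand-in for the shared transformer."""
    q, k, v = np.split(tokens @ w_qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(-1, keepdims=True)
    return tokens + (attn @ v) @ w_out


class Stream:
    """Hypothetical container for modality-specific parameters."""

    def __init__(self, rng, patch_dim, dim, num_queries):
        self.proj = rng.normal(0, 0.02, (patch_dim, dim))       # patch embedding
        self.queries = rng.normal(0, 0.02, (num_queries, dim))  # learnable queries
        self.gamma = np.ones(dim)                                # norm scale
        self.beta = np.zeros(dim)                                # norm shift


def encode(patches, stream, w_qkv, w_out):
    """Return one latent token per query for a single modality."""
    tokens = np.concatenate([patches @ stream.proj, stream.queries], axis=0)
    tokens = layer_norm(tokens, stream.gamma, stream.beta)  # modality-specific norm
    tokens = shared_block(tokens, w_qkv, w_out)             # shared weights
    return tokens[-len(stream.queries):]                    # keep query positions


rng = np.random.default_rng(0)
dim = 32
w_qkv = rng.normal(0, 0.02, (dim, 3 * dim))  # shared by both streams
w_out = rng.normal(0, 0.02, (dim, dim))      # shared by both streams

video = rng.random((8, 32, 32, 3))           # toy clip: 8 frames of 32x32 RGB
mel = rng.random((64, 128))                  # toy mel-spectrogram: 64 bins x 128 frames
mel = (mel - mel.mean()) / (mel.std() + 1e-8)  # assumed normalization

v_patches = patchify_video(video, pt=4, ps=8)
a_patches = patchify_mel(mel, ps=16)

video_stream = Stream(rng, v_patches.shape[1], dim, num_queries=16)
audio_stream = Stream(rng, a_patches.shape[1], dim, num_queries=8)

z_video = encode(v_patches, video_stream, w_qkv, w_out)
z_audio = encode(a_patches, audio_stream, w_qkv, w_out)
latent = np.concatenate([z_video, z_audio], axis=0)  # assumed joint 1D sequence
print(v_patches.shape, a_patches.shape, latent.shape)
```

The point of the sketch is the parameter split: `w_qkv` and `w_out` are reused by both calls to `encode`, while the queries, norm parameters, and input projection live in each `Stream`.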

Reconstruction Results

Quantitative comparison of reconstruction. Results are grouped into video-only (VO), audio-only (AO), and joint audio-video (AV) tokenization. W/M denote waveform/mel-spectrogram audio input. Best and second-best are highlighted.

Qualitative comparison of reconstruction results.

Generation Results

Comparison of generation results across audio-to-video, video-to-audio, and class-conditional joint audio-video generation. Diff / AR / FM denote diffusion, autoregressive, and flow-matching paradigms. Best and second-best are highlighted.

Qualitative results for downstream generation tasks.

@inproceedings{pham2026avtok,
  title     = {AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation},
  author    = {Pham, Kien T. and Chen, I Chieh and Chen, Qifeng and Chen, Long},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026}
}