AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

The Hong Kong University of Science and Technology
ECCV 2026
Highlights. (a) AVTok is a unified tokenizer with a dual-stream transformer architecture that jointly encodes an audio-video pair into a single compact 1D latent representation. (b) It is competitive with state-of-the-art unimodal 1D video tokenizers (top) and audio codecs (bottom). (c) AVTok plugs into AR generative models for audio-to-video, video-to-audio, and joint audio-video generation.

Abstract

Audio-video generation has recently attracted considerable research attention. It aims to synthesize high-quality sounding video with fine-grained synchronization and semantic alignment between the auditory and visual components. Prior methods predominantly adopt a dual-branch design with separate tokenization and generation modules per modality, which neglects the representation gap between modalities and demands substantial computational resources for training. Inspired by recent advances in one-dimensional visual tokenization, we present AVTok, a novel unified tokenizer designed for holistic audio-video generation. AVTok features a dual-stream transformer-based architecture with a shared encoder-decoder and modality-specific learnable queries that efficiently encodes an audio-video pair into a compact one-dimensional latent representation with a unified codebook. To cope with the information imbalance between the heterogeneous modalities, which hinders AVTok from exploiting aligned audio-visual information, we devise a hierarchical training strategy that progressively builds up reconstruction capability for each modality. Extensive experiments demonstrate that AVTok excels both in audio-video reconstruction and when integrated into downstream pipelines for audio-to-video, video-to-audio, and class-conditional joint audio-video generation. AVTok opens up the problem of joint audio-video tokenization and offers a potential direction for building unified large multimodal models for audio-video generation.

Motivation

Motivation illustration. Left: Previous audio-video generation models typically adopt a separate pretrained tokenizer per modality and overlook the representation gap between their learned embedding spaces. Right: We instead design a unified tokenizer that jointly encodes both modalities into a shared token space. Video and audio embeddings are colored by their respective classes.

Highlights

  • 1D Unified Audio-Video Tokenization. We introduce the task of jointly encoding the auditory and visual components into a single 1D latent representation, enabling efficient audio-video reconstruction and downstream generation.
  • Dual-stream Tokenizer. A video is patchified into spatio-temporal patch embeddings, while audio is first converted to a normalized mel-spectrogram and then patchified like a grayscale image. Each modality has its own learnable holistic and patch queries and its own normalization layers, but both share the encoder-decoder weights. This captures modality-specific information while still fusing the two streams implicitly.
  • Video-First-Audio-Later (VFAL). Because visual information dominates auditory information, training both streams from scratch suppresses the audio stream. VFAL instead trains the video stream first to establish a strong latent space, then attaches and trains the audio-specific modules, and finally fine-tunes the decoders for refined unified reconstruction. This is complemented by a representation alignment objective and an AR prior that together promote cross-modal correspondence and an AR-friendly token space.
  • Downstream Generation. Experiments show that AVTok excels not only in unified audio-video reconstruction but also in downstream tasks, including audio-to-video, video-to-audio, and class-conditional joint audio-video generation.
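
The three VFAL phases can be read as a freeze/unfreeze schedule over parameter groups. The Python sketch below is one plausible reading, not the authors' training code: the group names, the `Stage` class, the `requires_grad` helper, and the exact choice of which groups stay frozen in each phase are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical parameter groups; the names are illustrative, not from AVTok.
GROUPS = (
    "shared_encoder",
    "shared_decoder",
    "video_queries",
    "video_norms",
    "audio_queries",
    "audio_norms",
)


@dataclass(frozen=True)
class Stage:
    """One training phase and the parameter groups it updates."""
    name: str
    trainable: frozenset


# Assumed interpretation of Video-First-Audio-Later:
#   1. train the video stream together with the shared weights,
#   2. attach the audio-specific modules and train only those,
#   3. fine-tune only the decoder side for unified reconstruction.
VFAL_STAGES = (
    Stage("video_first", frozenset(
        {"shared_encoder", "shared_decoder", "video_queries", "video_norms"})),
    Stage("audio_later", frozenset({"audio_queries", "audio_norms"})),
    Stage("decoder_finetune", frozenset({"shared_decoder"})),
)


def requires_grad(stage):
    """Map every group to True (updated) or False (frozen) for a stage."""
    return {group: group in stage.trainable for group in GROUPS}


if __name__ == "__main__":
    for stage in VFAL_STAGES:
        active = [g for g, on in requires_grad(stage).items() if on]
        print(f"{stage.name}: {', '.join(active)}")
```

In a real training loop, the returned flags would typically be applied to the matching parameters before building the optimizer for each phase. Whether the shared weights are fully frozen while the audio modules are trained is not stated above, so that part of the schedule should be treated as a guess.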

Method

Method illustration. (a) AVTok features a dual-stream transformer-based architecture; each stream's forward pass is shown in (b). It jointly learns video (blue stream) and audio (green stream) reconstruction in a unified holistic scheme, using separate learnable queries and normalization layers per modality while sharing the remaining parameters for implicit cross-modal interaction. Beyond the reconstruction losses, a representation-alignment objective aligns AVTok's continuous tokens with an audio-visual foundation model, and an AR prior model encourages an AR-friendly discrete latent space — enabling downstream (c) audio-to-video, (d) video-to-audio, and (e) class-conditional joint audio-video generation.
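
The parameter-sharing idea in the caption (separate normalization per modality, shared remaining weights) can be illustrated with a toy example. This is a minimal pure-Python sketch under stated assumptions, not the AVTok architecture: `DualStreamBlock`, its identity `shared_weight`, and the helper functions are hypothetical stand-ins for a full transformer block.

```python
import math


def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize one token vector, then apply a gain and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = math.sqrt(var + eps)
    return [g * (v - mean) / scale + b for v, g, b in zip(x, gain, bias)]


def linear(x, weight):
    """Apply a projection; `weight` is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


class DualStreamBlock:
    """Toy block with one shared weight matrix and one norm per modality."""

    def __init__(self, dim):
        # Shared by both streams (identity matrix, purely for illustration).
        self.shared_weight = [
            [1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)
        ]
        # Modality-specific normalization parameters.
        self.norm = {
            "video": {"gain": [1.0] * dim, "bias": [0.0] * dim},
            "audio": {"gain": [1.0] * dim, "bias": [0.0] * dim},
        }

    def forward(self, tokens, modality):
        """Route tokens through their own norm, then the shared projection."""
        p = self.norm[modality]
        return [
            linear(layer_norm(t, p["gain"], p["bias"]), self.shared_weight)
            for t in tokens
        ]


if __name__ == "__main__":
    block = DualStreamBlock(dim=4)
    block.norm["audio"]["bias"] = [1.0] * 4  # only the audio norm differs
    tokens = [[1.0, 2.0, 3.0, 4.0]]
    print(block.forward(tokens, "video"))
    print(block.forward(tokens, "audio"))
```

Both calls pass through the same `shared_weight`, so any difference between the two outputs comes only from the per-modality normalization parameters. The learnable queries are omitted here because their exact layout is not described in this section.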

Reconstruction Results

Quantitative comparison of reconstruction. Results are grouped into video-only (VO), audio-only (AO), and joint audio-video (AV) tokenization. W/M denote waveform/mel-spectrogram audio input. Best and second-best are highlighted.
Qualitative comparison of reconstruction results.

Generation Results

Comparison of generation results across audio-to-video, video-to-audio, and class-conditional joint audio-video generation. Diff / AR / FM denote diffusion, autoregressive, and flow-matching paradigms. Best and second-best are highlighted.
Qualitative results for downstream generation tasks.

Sounding-Video Samples

BibTeX

@inproceedings{pham2026avtok,
  title     = {AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation},
  author    = {Pham, Kien T. and Chen, I Chieh and Chen, Qifeng and Chen, Long},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026}
}