Towards Unified Video-Text-to-Audio Generation
Yusheng Dai2,3, Zehua Chen1,3†, Yuxuan Jiang1,3, Baolong Gao1,3,
Qiuhong Ke2, Jun Zhu1,3†, Jianfei Cai2
1Tsinghua University, Beijing, China 2Monash University, Melbourne, Australia 3Shengshu AI, Beijing, China
Training a unified model for video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant flexibility but faces critical, underexplored challenges. In this paper, we identify two foundational problems: cross-task competition among the heterogeneous generation tasks, and intra-task competition in the form of modality bias under joint video-text conditioning.
In this work, we introduce SoundAtlas, a large-scale dataset of 470k audio-caption pairs and, to our knowledge, the first whose caption quality surpasses not only existing datasets but also human-expert annotations. Its construction relies on a novel multi-turn agentic annotation pipeline powered by Gemini-2.5 Pro and Qwen-2.5-VL (Figure 2). Specifically, we employ Vision-to-Language Compression to mitigate hallucinations caused by visual bias (Figure 1), alongside a Junior-Senior Agent Handoff mechanism that reduces annotation cost by 5x, followed by post-hoc filtering to ensure fidelity. Derived from VGGSound and AudioSet via this pipeline, SoundAtlas exhibits tight V-A-T alignment, delivering semantically rich captions that can even correct labeling errors in human-annotated benchmarks.
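The sketch below illustrates one way such a junior-senior handoff with post-hoc filtering could be orchestrated. The model names follow the pipeline described above, but the helper functions, confidence-based routing rule, and thresholds are hypothetical placeholders rather than the actual SoundAtlas implementation.

```python
# Hypothetical sketch of a Junior-Senior Agent Handoff with post-hoc filtering.
# The helper functions, routing rule, and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Caption:
    text: str
    confidence: float  # estimated caption confidence in [0, 1]

def caption_with_junior(clip_path: str) -> Caption:
    # Placeholder for a cheap first-pass annotator (e.g., Qwen-2.5-VL).
    return Caption(text=f"draft caption for {clip_path}", confidence=0.6)

def caption_with_senior(clip_path: str, draft: Caption) -> Caption:
    # Placeholder for an expensive refinement pass (e.g., Gemini-2.5 Pro).
    return Caption(text=f"refined caption for {clip_path}", confidence=0.95)

def passes_filter(caption: Caption) -> bool:
    # Placeholder post-hoc fidelity check (e.g., an audio-text similarity score).
    return caption.confidence >= 0.8

def annotate(clip_path: str, handoff_threshold: float = 0.7) -> Caption | None:
    draft = caption_with_junior(clip_path)
    # Escalate only low-confidence drafts to the senior agent, reducing API cost.
    if draft.confidence < handoff_threshold:
        draft = caption_with_senior(clip_path, draft)
    # Discard captions that fail the post-hoc fidelity filter.
    return draft if passes_filter(draft) else None
```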
Building on SoundAtlas, we propose Omni2Sound, a diffusion-based unified model that supports flexible input modalities while maintaining both fine-grained audio-visual synchronization and high-fidelity generation. To address the identified cross-task and intra-task competition, we design a three-stage progressive training schedule that departs from naive joint training. The schedule first establishes a robust T2A prior, then leverages high-quality VT2A data to map the distinct conditional spaces into a unified joint embedding, effectively converting cross-task competition into cooperation. A final decoupled robustness-training stage with push-pull synergistic augmentations mitigates intra-task modality bias, ensuring both audio-visual (A-V) alignment and faithful generation of off-screen sounds.
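The following sketch outlines how such a three-stage progressive schedule could be organized around a standard conditional denoising loss. The stage ordering mirrors the description above, while the data loaders, condition-dropout probabilities, model signature, and step counts are illustrative assumptions rather than the actual Omni2Sound recipe.

```python
# Hypothetical sketch of a three-stage progressive training schedule for a
# conditional diffusion model; loaders, dropout rates, and step counts are
# illustrative assumptions, not the actual Omni2Sound configuration.
import torch

def diffusion_loss(model, audio_latent, text_cond, video_cond):
    """Simplified denoising objective with joint (possibly dropped) conditions."""
    noise = torch.randn_like(audio_latent)
    t = torch.rand(audio_latent.shape[0], device=audio_latent.device)
    noisy = audio_latent + t.view(-1, 1, 1) * noise   # toy noising schedule
    pred = model(noisy, t, text_cond, video_cond)     # hypothetical model signature
    return torch.nn.functional.mse_loss(pred, noise)

def train_stage(model, loader, steps, drop_text_p=0.0, drop_video_p=0.0, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _, (audio_latent, text_cond, video_cond) in zip(range(steps), loader):
        # Push-pull style augmentation: randomly drop a condition so the model
        # neither ignores text nor over-relies on video (illustrative only).
        if torch.rand(()) < drop_text_p:
            text_cond = torch.zeros_like(text_cond)
        if torch.rand(()) < drop_video_p:
            video_cond = torch.zeros_like(video_cond)
        loss = diffusion_loss(model, audio_latent, text_cond, video_cond)
        opt.zero_grad()
        loss.backward()
        opt.step()

def progressive_training(model, t2a_loader, vt2a_loader):
    # Stage 1: establish a robust text-to-audio prior (video condition zeroed out).
    train_stage(model, t2a_loader, steps=100_000, drop_video_p=1.0)
    # Stage 2: high-quality VT2A data maps both conditional spaces into a
    # unified joint embedding, turning cross-task competition into cooperation.
    train_stage(model, vt2a_loader, steps=100_000)
    # Stage 3: decoupled robustness training with condition-dropout augmentations
    # to mitigate intra-task modality bias.
    train_stage(model, vt2a_loader, steps=50_000, drop_text_p=0.3, drop_video_p=0.3)
```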
As a result, Omni2Sound achieves unified state-of-the-art performance across V2A, T2A, and VT2A tasks on the comprehensive VGGSound-Omni benchmark, surpassing both previous unified frameworks and specialized baselines. Extensive evaluations further demonstrate its strong generalization on external benchmarks (e.g., Kling-Audio-Eval and Video-LLaMA-generated captions).
Qualitative Demonstrations and Comparisons
We introduce SoundAtlas, the first large-scale, human-expert-level audio caption dataset, augmenting VGGSound and AudioSet with semantically rich and temporally detailed captions. It features tight visual-audio-text (V-A-T) alignment and markedly higher text-audio faithfulness than prior datasets.
We propose Omni2Sound, a diffusion-based unified model supporting flexible input modalities while maintaining both fine-grained audio-visual synchronization and high-fidelity generation.
Video-Text-to-Audio (VT2A): joint conditioning on video and text for precise semantic control; a minimal conditioning sketch is given below.
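Below is a minimal sketch of one way joint video-text conditioning could be realized, assuming both encoders emit token sequences that are projected into a shared space and concatenated into a single cross-attention context; the module names and dimensions are illustrative, not the actual Omni2Sound architecture.

```python
# Hypothetical sketch of joint video-text conditioning via cross-attention.
# Encoder dimensions and the concatenation scheme are assumptions for illustration.
import torch
import torch.nn as nn

class JointConditioner(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, cond_dim=512):
        super().__init__()
        # Project both modalities into a shared conditioning space.
        self.video_proj = nn.Linear(video_dim, cond_dim)
        self.text_proj = nn.Linear(text_dim, cond_dim)

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, Tv, video_dim) frame features, kept in temporal order
        # text_tokens:  (B, Tt, text_dim) caption token embeddings
        video = self.video_proj(video_tokens)
        text = self.text_proj(text_tokens)
        # Concatenate along the sequence axis to form one cross-attention context.
        return torch.cat([video, text], dim=1)  # (B, Tv + Tt, cond_dim)

class DenoiserBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_tokens, cond_tokens):
        # Audio latents attend to themselves, then to the joint video-text context.
        x = audio_tokens + self.self_attn(audio_tokens, audio_tokens, audio_tokens)[0]
        x = x + self.cross_attn(x, cond_tokens, cond_tokens)[0]
        return x + self.ff(x)
```

Zeroing out one modality's tokens at inference would recover V2A- or T2A-style conditioning within the same network, consistent with the flexible-input behavior described above.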