Omni2Sound

Towards Unified Video-Text-to-Audio Generation

Yusheng Dai2,3, Zehua Chen1,3†, Yuxuan Jiang1,3, Baolong Gao1,3,
Qiuhong Ke2, Jun Zhu1,3†, Jianfei Cai2

1Tsinghua University, Beijing, China    2Monash University, Melbourne, Australia    3Shengshu AI, Beijing, China

Challenges: High-Quality Data Scarcity and Joint Training Competition

Training a unified model for video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant flexibility but faces critical, underexplored challenges. In this paper, we identify two foundational problems:

  1. Data Scarcity and Semantic Conflict (Figure 1): High-quality captions with tight Audio-Visual-Text (A-V-T) alignment are scarce. Relying on audio-only captions introduces ambiguity (e.g., confusing "fireworks" with "tennis hits") and creates severe semantic conflicts between the video and text conditions. Conversely, native multimodal models suffer from visual bias, often hallucinating silent objects or ignoring off-screen sounds. These mismatches cause unstable convergence and degraded faithfulness.
  2. Cross-Task and Intra-Task Competition (Figure 3): Joint training triggers complex competitive dynamics. Cross-task competition manifests as an adverse performance trade-off between V2A and T2A due to modality heterogeneity, hindering joint optimization. Intra-task competition within VT2A creates modality bias: text bias compromises audio-visual synchronization, while video bias degrades faithfulness in off-screen scenarios such as background music (see the sketch below).
Challenges in unified audio generation
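
To make the competitive dynamics above concrete, here is a minimal, hypothetical sketch of the naive joint-training setup they arise in. The function and variable names are illustrative only and do not reflect the paper's implementation.

```python
import random

# Hypothetical sketch (not the paper's code): naive joint training draws a
# task at random and routes whichever conditions that task uses into a single
# shared denoiser. Cross-task competition arises because heterogeneous
# condition sets compete for the same weights; intra-task competition arises
# inside VT2A when one modality dominates the other.

TASKS = ("v2a", "t2a", "vt2a")

def naive_joint_step(model, batch, diffusion_loss):
    task = random.choice(TASKS)  # uniform task sampling, no curriculum
    video = batch["video"] if task in ("v2a", "vt2a") else None
    text = batch["text"] if task in ("t2a", "vt2a") else None
    # One shared objective over all tasks: gradients from V2A, T2A, and VT2A
    # are mixed without any mechanism to resolve their conflicts.
    return diffusion_loss(model, batch["audio_latent"], video=video, text=text)
```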

SoundAtlas: Large-scale Human-Expert-Level Audio Captions with A-V-T Alignment

SoundAtlas annotation pipeline

In this work, we introduce SoundAtlas, a large-scale dataset of 470k audio-caption pairs. It is the first dataset whose captions not only substantially outperform those of existing datasets but also surpass human-expert annotations in quality. Its construction relies on a novel multi-turn agentic annotation pipeline powered by Gemini-2.5 Pro and Qwen-2.5-VL (Figure 2). Specifically, we employ Vision-to-Language Compression to mitigate hallucinations caused by visual bias (Figure 1), together with a Junior-Senior Agent Handoff mechanism that achieves a 5× cost reduction, followed by post-hoc filtering to ensure fidelity. Derived from VGGSound and AudioSet through this pipeline, SoundAtlas exhibits tight A-V-T alignment and delivers semantically rich captions that can even correct errors in human-annotated benchmarks.
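
The sketch below shows one way the Junior-Senior Agent Handoff and post-hoc filtering could be wired together. The captioner and judge interfaces (describe_video, describe_audio, refine, score) and the threshold are assumptions for illustration, not the released pipeline.

```python
def annotate_clip(clip, junior, senior, judge, threshold=0.8):
    """Two-tier agentic annotation: a cheap junior agent drafts every caption,
    and the expensive senior agent is only invoked on low-confidence drafts,
    which is where the cost reduction comes from."""
    # Vision-to-Language Compression: summarize the video as text first, so
    # the audio captioner is less prone to hallucinating silent visual objects.
    visual_summary = junior.describe_video(clip)               # assumed interface
    draft = junior.describe_audio(clip, context=visual_summary)

    # Junior-Senior Handoff: escalate only when the judge is not confident.
    if judge.score(clip, draft) < threshold:
        draft = senior.refine(clip, context=visual_summary, draft=draft)

    # Post-hoc filtering: discard captions that still fail the fidelity check.
    return draft if judge.score(clip, draft) >= threshold else None
```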

Omni2Sound: Unified VT2A Foundation Model with Unified SOTA Performance

Omni2Sound model architecture

Building on SoundAtlas, we propose Omni2Sound, a diffusion-based unified model that supports flexible input modalities while maintaining both fine-grained audio-visual synchronization and high-fidelity generation. To address the identified cross-task and intra-task competition, we design a three-stage progressive training schedule that departs from naive joint training. This strategy first establishes a robust T2A prior and leverages high-quality VT2A data to map distinct conditional spaces into a unified joint embedding, effectively converting cross-task competition into a cooperative dynamic. Furthermore, it employs a decoupled robustness training stage with push-pull synergistic augmentations to mitigate intra-task modality bias, ensuring both A-V alignment and faithfulness in off-screen audio generation.
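
As a rough sketch of how such a schedule could be organized (the stage contents follow the description above; the trainer interface and the concrete form of the push-pull augmentation are assumptions for illustration):

```python
import random

def drop_or_perturb(conditions, p_drop=0.1):
    """Assumed push-pull-style augmentation: stochastically drop a modality so
    the model cannot over-rely on either text or video."""
    return {k: (None if random.random() < p_drop else v)
            for k, v in conditions.items()}

def train_omni2sound(model, t2a_data, vt2a_data, trainer):
    # Stage 1: establish a robust text-to-audio prior.
    trainer.fit(model, t2a_data, conditions=("text",))

    # Stage 2: high-quality VT2A data maps the video and text condition spaces
    # into a shared joint embedding, turning cross-task competition into
    # cooperation.
    trainer.fit(model, vt2a_data, conditions=("video", "text"))

    # Stage 3: decoupled robustness training with condition augmentation, so
    # the model keeps A-V synchronization without sacrificing faithfulness on
    # off-screen sounds.
    trainer.fit(model, vt2a_data, conditions=("video", "text"),
                augment=drop_or_perturb)
```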

As a result, Omni2Sound achieves unified state-of-the-art performance across V2A, T2A, and VT2A tasks on the comprehensive VGGSound-Omni benchmark, surpassing both previous unified frameworks and specialized baselines. Extensive evaluations further demonstrate its strong generalization on external benchmarks (e.g., Kling-Audio-Eval and Video-LLaMA-generated captions).

Results

Qualitative Demonstrations and Comparisons

Audio Caption Quality Comparison

We introduce SoundAtlas, the first large-scale, human-expert-level audio caption dataset, augmenting VGGSound and AudioSet with semantically rich and temporally detailed captions. It features tight audio–visual–text (A–V–T) alignment and markedly higher text–audio faithfulness than prior datasets.

Omni2Sound: Generation Quality

We propose Omni2Sound, a diffusion-based unified model supporting flexible input modalities while maintaining both fine-grained audio-visual synchronization and high-fidelity generation.

Video-Text-to-Audio: Joint conditioning on Video and Text for precise semantic control.