Latent Swap Joint Diffusion for Long-Form Audio Generation

Anonymous



Swap Forward (SaFa) generates seamless and coherent long-form output (audio and panorama images) through multi-view joint diffusion. Its forward-only design uses just two step-wise latent swap operators, which keeps it simple, adds no extra cost, and makes it compatible with diffusion architectures such as U-Net and DiT.

Introduction


At its core, SaFa relies on two operators. First, the bidirectional Self-Loop Latent Swap operator performs frame-level swaps on the overlapping regions of adjacent subviews during each denoising step. Operating at the frame level lets it adapt to high-frequency variability in both spectrum and image latents, achieving higher time-frequency resolution with less distortion than latent averaging.
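The frame-level exchange between adjacent subviews can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `self_loop_latent_swap`, the alternating (every other frame) mask, the `(n_views, channels, freq, frames)` layout, and the `wrap` flag that optionally pairs the last subview with the first are all assumptions made for illustration.

```python
import numpy as np

def self_loop_latent_swap(views, overlap, wrap=False):
    """Exchange alternating frames between overlaps of adjacent subviews.

    views   : array of shape (n_views, channels, freq, frames)
    overlap : number of frames shared by neighbouring subviews
    wrap    : hypothetical flag; if True the last subview is paired
              with the first one
    Returns a new array; the input is left untouched.
    """
    out = views.copy()
    n_views = out.shape[0]
    # Assumed mask: every second frame of the overlap is exchanged.
    swap = np.arange(overlap) % 2 == 1
    n_pairs = n_views if wrap else n_views - 1
    for i in range(n_pairs):
        j = (i + 1) % n_views
        tail = out[i, ..., -overlap:]  # right edge of subview i (a view)
        head = out[j, ..., :overlap]   # left edge of subview j (a view)
        tail_frames = tail[..., swap].copy()
        # Bidirectional: both subviews receive frames from the other.
        tail[..., swap] = head[..., swap]
        head[..., swap] = tail_frames
    return out
```

In a joint diffusion loop, a function like this would be called once per denoising step on the stacked subview latents. Unlike averaging, no frame is blended: each frame in the overlap comes entirely from one subview or the other.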

Second, the unidirectional Reference-Guided Latent Swap operates between a reference and each subview during the early denoising steps. It uses an independent reference denoising process to guide the other subviews, providing centralized guidance that keeps subviews consistent without making them repetitive. Because spectrum and image latents are flattened to 1D sequences in different orders, the Reference-Guided Latent Swap is applied along different directions for spectrum and image generation, enabling segment-wise swaps while avoiding excessive repetition.
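The one-way, early-step, segment-wise behaviour described here can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the published method: the function name `reference_guided_swap`, the `early_fraction` cutoff, the `segment` width, the choice of every other segment, and the `axis` parameter used to pick the swap direction are all hypothetical.

```python
import numpy as np

def reference_guided_swap(views, reference, step, total_steps,
                          early_fraction=0.3, segment=4, axis=-1):
    """Copy alternating segments of a reference latent into every subview.

    views     : array of shape (n_views, channels, height, width)
    reference : array of shape (channels, height, width), denoised
                independently of the subviews
    axis      : axis of `reference` along which segments are taken
                (assumed to differ between spectrum and image latents)
    The swap is unidirectional: `reference` is never modified.
    """
    # Assumed schedule: guidance is applied only during the early steps.
    if step >= int(early_fraction * total_steps):
        return views.copy()
    length = reference.shape[axis]
    # Assumed pattern: every other segment is taken from the reference.
    take = (np.arange(length) // segment) % 2 == 0
    shape = [1] * reference.ndim
    shape[axis] = length
    mask = take.reshape(shape)
    # Masked positions come from the reference, the rest stay per-subview.
    return np.where(mask, reference[None], views)
```

Because only part of each subview is overwritten, the subviews share the reference's content without collapsing into copies of it. Changing `axis` is how this sketch models applying the swap along different directions for spectrum and image latents.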

Swap Forward generates seamless and coherent long-form outputs while demonstrating strong generalization (it adapts to both spectrum and image generation, U-Net and DiT architectures, and fixed or flexible attention window sizes), simplicity (two simple swap operators applied in a forward-only manner), and efficiency (up to 11 times faster, with no extra time or computation overhead).

Result Gallery


Comparison

The continuous gurgling sound of bubbles in the water.

SaFa (Ours)
Casino Ambience, electronic slot machines.

SaFa (Ours)
MultiDiffusion (MD) [Bar-Tal et al. 2023]
MultiDiffusion (MD) [Bar-Tal et al. 2023]
Movie Gen (MD*) [Meta 2024]
Movie Gen (MD*) [Meta 2024]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]

The audience's enthusiastic and passionate cheers and loud whistles in the stadium.

SaFa (Ours)

The sound of waves crashing on the beach, with distant kids playing ...

SaFa (Ours)
MultiDiffusion (MD) [Bar-Tal et al. 2023]
MultiDiffusion (MD) [Bar-Tal et al. 2023]
Movie Gen (MD*) [Meta 2024]
Movie Gen (MD*) [Meta 2024]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]
Drum and bass track with rapid beats and energetic bass lines.

SaFa (Ours)
Smooth, soulful saxophone in a relaxed jazz tune for quiet evenings.

SaFa (Ours)
MultiDiffusion (MD) [Bar-Tal et al. 2023]
MultiDiffusion (MD) [Bar-Tal et al. 2023]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]

A modern synthesizer creating futuristic soundscapes.

SaFa (Ours)

Acoustic ballad with heartfelt lyrics and soft piano.

SaFa (Ours)
MultiDiffusion (MD) [Bar-Tal et al. 2023]
MultiDiffusion (MD) [Bar-Tal et al. 2023]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]
A photo of a snowy mountain peak with skiers.

SaFa (Ours)
A photo of a city skyline at night.

SaFa (Ours)
MultiDiffusion (MD) [Bar-Tal et al. 2023]
MultiDiffusion (MD) [Bar-Tal et al. 2023]
SyncDiffusion [Lee et al. 2023]
SyncDiffusion [Lee et al. 2023]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]

A photo of a mountain range at twilight.

SaFa (Ours)

Natural landscape in anime style illustration.

SaFa (Ours)
MultiDiffusion (MD) [Bar-Tal et al. 2023]
MultiDiffusion (MD) [Bar-Tal et al. 2023]
SyncDiffusion [Lee et al. 2023]
SyncDiffusion [Lee et al. 2023]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]
Merge-Attend-Diffuse (MAD) [Quattrini et al. 2024]

Audio & Music Generation

The sound of waves crashing on the beach, with distant kids playing accompanied with the sound of seagull.
A woman is speaking.
The audience's enthusiastic and passionate cheers and loud whistles in the stadium.
Casino Ambience, electronic slot machines.
Smooth, soulful saxophone in a relaxed jazz tune for quiet evenings.
This song features a harp playing the main melody. This melody is calming and relaxing. This song can be played in a meditation center. There are no other instruments in this song. There are no voices in this song.
This instrumental song features a flute playing a high pitched melody. The melody starts off with one high pitched staccato note. After a brief pause, the flute plays two ascending notes in which the second note is sustained. Then a third higher note is played followed by a descending run of four more notes and ending on one higher note. There are no other instruments in this song. There is no percussion in this song. This song has a relaxing mood. This song can be played at a meditation center.
A modern synthesizer creating futuristic soundscapes.
The continuous gurgling sound of bubbles in the water.
The dog's bark is clear and powerful, often indicating alertness or excitement, with varying pitch.
Someone is Whistling.
A synth pad is playing a drone sound in the lower mid range. Cymbals are creating atmosphere while a flute/string/brass sound is playing a melody. The whole recording is full of reverb. This song may be playing in a forest documentary.

Panorama Generation

A photo of a mountain range at twilight.
A photo of a snowy mountain peak with skiers.
A photo of the dolomites.
Majestic red rock formations glowing in the sunset.
Skyline of New York City.
Create a vibrant landscape inspired by 'Qingming Riverside Scene', with riverside life, farmers, tourists, mountains, and traditional buildings.
Skyline of New York City.
A photo of a city skyline at night.
A photo of Chinese ink a vibrant landscape with farmers, tourists, mountains, traditional buildings and animal.
Silhouette wallpaper of a dreamy scene with shooting stars.
Natural landscape in anime style illustration.
Cartoon panorama of spring summer beautiful nature.
A photo of a beautiful ocean with coral reef.
A photo of a lake under the northern lights.
A serene sunrise over a misty lake, with soft colors reflecting on the water's surface.
Magnificent canyon vista with sheer cliffs and winding rivers.
A photo of a forest with a misty fog.
A photo of a grassland with animals.
Community garden with raised beds and flowers.
Serene mountain valley carpeted in vibrant fall foliage.

Potential Application Scenario

Meditation background music.
Introductory or transitional background music.
Movie soundscapes.
Movie soundscapes.

BibTeX