FrameBridge: Improving Image-to-Video
Generation with Bridge Models



Yuji Wang*, Zehua Chen*, Xiaoyu Chen, Jun Zhu, Jianfei Chen

Tsinghua University

*Equal Contribution


Abstract

Image-to-video (I2V) generation is gaining increasing attention for its wide application in video synthesis. Recently, diffusion-based I2V models have achieved remarkable progress owing to novel designs in network architecture, cascaded frameworks, and motion representation. However, restricted by their noise-to-data generation process, diffusion-based methods inevitably struggle to generate video samples with both appearance consistency and temporal coherence from uninformative Gaussian noise, which may limit their synthesis quality. In this work, we present FrameBridge, which takes the given static image as the prior of the video target and establishes a tractable bridge model between them. By formulating I2V synthesis as a frames-to-frames generation task and modelling it with a data-to-data process, we fully exploit the information in the input image and make it easier for the generative model to learn the image animation process. For the two popular settings of training I2V models, namely fine-tuning a pre-trained text-to-video (T2V) model or training from scratch, we further propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which respectively improve the efficiency of fine-tuning diffusion-based T2V models into FrameBridge and the synthesis quality of bridge-based I2V models. Experiments conducted on WebVid-2M and UCF-101 demonstrate that: (1) FrameBridge achieves superior I2V quality in comparison with its diffusion counterpart (zero-shot FVD 83 vs. 176 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101); (2) the proposed SAF and neural prior effectively enhance the ability of bridge-based I2V models in the fine-tuning and from-scratch training scenarios.



Innovations

  • FrameBridge formulates I2V synthesis as a frames-to-frames generation task and models it with a data-to-data generative framework.
  • SNR-Aligned Fine-tuning (SAF) aligns the marginal distributions of the bridge process and the diffusion process, making it easier to leverage pre-trained text-to-video (T2V) diffusion models.
  • The neural prior provides a stronger prior than the static image for the video target, further improving the performance of FrameBridge.

    Figure 1: Overview of FrameBridge and diffusion-based I2V models. The sampling process of FrameBridge (upper) starts from the informative given image, while diffusion models (lower) synthesize videos from an uninformative noisy representation.
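    To make the frames-to-frames formulation concrete, here is a minimal sketch of how a video-shaped prior can be built by repeating the input image latent across the temporal axis, in contrast to the pure-noise starting point of diffusion-based I2V (the helper name make_frame_prior is illustrative, not from the paper):

```python
import torch

# Illustrative sketch: FrameBridge starts sampling from an informative,
# video-shaped prior built from the conditioning image, whereas diffusion-based
# I2V starts from Gaussian noise of the same shape.

def make_frame_prior(image_latent: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Repeat a static image latent (B, C, H, W) into a video latent (B, F, C, H, W)."""
    return image_latent.unsqueeze(1).repeat(1, num_frames, 1, 1, 1)

# Diffusion-based I2V:  x_T = torch.randn(B, F, C, H, W)           (uninformative)
# FrameBridge:          x_T = make_frame_prior(image_latent, F)    (informative)
```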



    FrameBridge vs. Diffusion Counterpart



    Figure 2: Visualization of the mean value of the marginal distributions. We visualize the decoded mean value $D(\mu_t(\cdot))$ of the bridge process and the diffusion process, where $D$ is the pre-trained VAE decoder. As shown, the prior and target of FrameBridge naturally fit I2V synthesis: data information is preserved along the bridge process, while it gradually vanishes in the forward diffusion process.
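    As a simplified reference (the exact schedules used by FrameBridge may differ), the contrast in Figure 2 can be written with a Brownian-bridge style marginal between the target video $x_0$ and the replicated-image prior $x_T$, versus the variance-preserving diffusion forward process:

    $$\mu_t^{\text{bridge}} = \Big(1-\tfrac{t}{T}\Big)\,x_0 + \tfrac{t}{T}\,x_T, \qquad \mu_t^{\text{diffusion}} = \alpha_t\,x_0,$$

    where $\alpha_t$ decreases from 1 towards 0. The bridge mean interpolates between two data points and stays data-like for every $t$, while the diffusion mean shrinks towards pure noise as $t \to T$, which is the behaviour visualized above.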

  • Training from scratch on UCF-101: two samples, each showing the static image, the class condition ("Boxing Speedbag" and "Basketball"), and the videos generated by the bridge model and the diffusion counterpart.
  • Fine-tuning pre-trained VideoCrafter-1 (Text2Video) on WebVid and testing FrameBridge in a zero-shot manner: each sample shows the static image, the text condition, and the videos generated by FrameBridge, DynamiCrafter, and SEINE.
    Text condition 1: "One girl is waving hand, one girl, solo, sexy pose, pensive woman, voluminous dress, intricate lace, embroidered gloves, feathered hat, curled hairdo, pale skin"
    Text condition 2: "sun rising, mountain, path, masterpiece"


    Ablation on SNR-Aligned Fine-tuning



    Figure 3: SNR-Aligned Fine-tuning for FrameBridge. The SAF technique aligns the marginal distributions of FrameBridge with those of the pre-trained Gaussian diffusion model to improve fine-tuning efficiency.
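    The core idea can be sketched as follows (a minimal sketch under assumed notation, not the paper's exact formulation; the function and coefficient names are hypothetical). Given a bridge state with marginal $x_t = a_t x_0 + b_t x_T + c_t \epsilon$ and a pre-trained diffusion model trained on marginals $\alpha_s x_0 + \sigma_s \epsilon$, the bridge state is re-expressed so that the pre-trained network receives inputs with a familiar signal-to-noise ratio:

```python
import torch

# Minimal sketch of SNR alignment (hypothetical names; the paper's exact SAF
# formulation may differ). Bridge marginal: x_t = a_t*x_0 + b_t*x_T + c_t*eps.
# Pre-trained diffusion marginal: z_s = alpha_s*x_0 + sigma_s*eps.

def align_bridge_to_diffusion(x_t, a_t, b_t, c_t, x_prior, alphas, sigmas):
    # Remove the deterministic prior component, leaving a_t*x_0 + c_t*eps.
    x_centered = x_t - b_t * x_prior

    # Signal-to-noise ratio of the centered bridge state.
    bridge_snr = (a_t / c_t) ** 2

    # Pick the diffusion timestep whose SNR (alpha_s/sigma_s)^2 is closest.
    s = torch.argmin(((alphas / sigmas) ** 2 - bridge_snr).abs())

    # Rescale so the state matches the diffusion marginal at step s:
    # (alpha_s/a_t) * (a_t*x_0 + c_t*eps) = alpha_s*x_0 + (~sigma_s)*eps.
    x_aligned = x_centered * (alphas[s] / a_t)
    return x_aligned, s

# During fine-tuning, (x_aligned, s) would be fed to the pre-trained T2V
# backbone in place of the raw bridge state and bridge timestep.
```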

  • Zero-shot testing on MSR-VTT: two samples (text conditions "earth from outer space" and "electronic piano playing performance"), each showing the static image and the videos generated with and without SAF.
  • Zero-shot testing on UCF-101: two samples (class conditions "playing sitar" and "band marching"), each showing the static image and the videos generated with and without SAF.
  • Zero-shot testing with other prompts: two samples (text conditions "time-lapse of a blooming flower with leaves and a stem" and "a bonfire is lit in the middle of a field"), each showing the static image and the videos generated with and without SAF.


    Ablation on Neural Prior



    Figure 4: A case of the neural prior. The neural prior provides more motion information than the static image and is closer to the reference video in the dataset.
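    A minimal sketch of this idea (the class NeuralPrior and its architecture below are illustrative, not the paper's design): a lightweight network regresses a coarse video from the static image, and its output replaces the naive repeated-frame prior when constructing the bridge.

```python
import torch
import torch.nn as nn

# Illustrative neural-prior sketch (hypothetical architecture): predict a coarse
# video latent from the static image latent; this serves as a stronger bridge
# prior than simply repeating the input frame.

class NeuralPrior(nn.Module):
    def __init__(self, channels=4, num_frames=16, hidden=64):
        super().__init__()
        self.num_frames = num_frames
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, channels * num_frames, 3, padding=1),
        )

    def forward(self, image_latent):
        # image_latent: (B, C, H, W) latent of the static image.
        b, c, h, w = image_latent.shape
        residual = self.net(image_latent).view(b, self.num_frames, c, h, w)
        # Predict a per-frame residual on top of the repeated static frame.
        repeated = image_latent.unsqueeze(1).repeat(1, self.num_frames, 1, 1, 1)
        return repeated + residual

# The prior network can be trained to regress the target video (e.g. with an
# L2 loss); its output is then used as the bridge prior x_T instead of the
# repeated static image.
```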

  • Training from scratch on UCF-101: two samples (class conditions "Wall Push-up" and "Fencing"), each showing the static image, the neural prior, and the generation result.