FrameBridge: Improving Image-to-Video Generation with Bridge Models
Yuji Wang*, Zehua Chen*, Xiaoyu Chen,
Jun Zhu, Jianfei Chen
Tsinghua University
*Equal Contribution
Abstract
Image-to-video (I2V) generation is gaining increasing attention due to its wide applications in video synthesis.
Recently, diffusion-based I2V models have achieved remarkable progress with novel designs in network architecture, cascaded frameworks, and motion representation.
However, restricted by their noise-to-data generation process, diffusion-based methods inevitably struggle to generate video samples
with both appearance consistency and temporal coherence from uninformative Gaussian noise, which may limit their synthesis quality.
In this work, we present FrameBridge, which takes the given static image as the prior of the video target and establishes a tractable bridge model between them.
By formulating I2V synthesis as a frames-to-frames generation task and modelling it with a data-to-data process,
we fully exploit the information in the input image and help the generative model learn the image animation process.
For the two popular settings of training I2V models, namely fine-tuning a pre-trained text-to-video (T2V) model and training from scratch,
we further propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which respectively improve the efficiency of fine-tuning pre-trained diffusion-based T2V models into FrameBridge and the synthesis quality of bridge-based I2V models trained from scratch.
Experiments conducted on WebVid-2M and UCF-101 demonstrate that: (1) FrameBridge achieves superior I2V quality compared with its diffusion counterpart (zero-shot FVD 83 vs. 176 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101);
(2) the proposed SAF and neural prior effectively enhance bridge-based I2V models in the fine-tuning and training-from-scratch scenarios, respectively.
Innovations
FrameBridge formulates I2V synthesis as a frames-to-frames generation task and models it with a data-to-data generative framework (see the first sketch below).
SNR-Aligned Fine-tuning (SAF) aligns the marginal distribution of the bridge process with that of the diffusion process, making it easier to leverage pre-trained text-to-video (T2V) diffusion models (see the second sketch below).
The neural prior provides a stronger prior than the static image for the video target, further improving the performance of FrameBridge when training from scratch (see the third sketch below).
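To make the frames-to-frames formulation concrete, here is a minimal PyTorch sketch of one bridge training step. It assumes a Brownian-bridge style marginal between the target video and a prior obtained by tiling the input image along the time axis, together with an x0-prediction loss; the function names and the choice of schedule are illustrative assumptions, not the released implementation.

```python
import torch

def image_to_prior(image: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Tile a static image (B, C, H, W) into a video prior (B, F, C, H, W)."""
    return image.unsqueeze(1).expand(-1, num_frames, -1, -1, -1)

def bridge_marginal(x0: torch.Tensor, x1: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0, x_1) for a Brownian bridge on t in (0, 1):
    x_t = (1 - t) * x_0 + t * x_1 + sqrt(t * (1 - t)) * eps."""
    while t.dim() < x0.dim():
        t = t.unsqueeze(-1)  # broadcast t over frames, channels, pixels
    eps = torch.randn_like(x0)
    return (1.0 - t) * x0 + t * x1 + torch.sqrt(t * (1.0 - t)) * eps

def bridge_training_loss(denoiser, video, image, cond):
    """One data-to-data training step: corrupt the video toward the tiled
    image prior and regress the clean video from the intermediate sample."""
    batch, num_frames = video.shape[:2]
    prior = image_to_prior(image, num_frames)
    t = torch.rand(batch, device=video.device)   # uniform time in (0, 1)
    x_t = bridge_marginal(video, prior, t)
    pred = denoiser(x_t, t, cond)                # network predicts x_0
    return torch.mean((pred - video) ** 2)
```

Unlike a diffusion model, the endpoint of this process is the tiled input image rather than pure Gaussian noise, so sampling starts from an informative prior and only the animation has to be generated.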
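Next, a simplified sketch of SNR-Aligned Fine-tuning. The idea is to rescale the bridge sample x_t and remap its timestep so that the pre-trained diffusion network sees inputs whose signal coefficient and noise level match what it saw during T2V pre-training. The linear-beta VP schedule, the unit-variance treatment of the prior-plus-noise term, and the grid search over timesteps are simplifying assumptions for illustration, not the paper's exact derivation.

```python
import torch

def alpha_bar(s: torch.Tensor, beta_min=0.1, beta_max=20.0) -> torch.Tensor:
    """alpha_bar(s) of a continuous VP diffusion with linear beta, s in [0, 1]."""
    integral = beta_min * s + 0.5 * (beta_max - beta_min) * s ** 2
    return torch.exp(-integral)

def snr_aligned_inputs(x_t: torch.Tensor, t: torch.Tensor):
    """Map a bridge sample (x_t, t) to (x_in, s) for the pre-trained diffusion
    network by matching SNR between the two processes.

    Assuming a unit-variance prior, the bridge marginal
    x_t = (1 - t) x_0 + t x_1 + sqrt(t (1 - t)) eps has non-signal variance
    t^2 + t (1 - t) = t, so SNR_bridge(t) = (1 - t)^2 / t."""
    snr_bridge = (1.0 - t).clamp(min=1e-5) ** 2 / t.clamp(min=1e-5)
    # Invert SNR_diff(s) = alpha_bar / (1 - alpha_bar) on a dense time grid.
    grid = torch.linspace(1e-4, 1.0, 1000, device=t.device)
    a = alpha_bar(grid)
    log_snr_diff = torch.log(a / (1.0 - a))
    idx = torch.argmin(
        (log_snr_diff.unsqueeze(0) - torch.log(snr_bridge).unsqueeze(1)).abs(),
        dim=1,
    )
    s = grid[idx]
    # Rescale so the signal coefficient becomes sqrt(alpha_bar(s)); with the
    # SNRs matched, the noise level then equals sqrt(1 - alpha_bar(s)) as well.
    scale = torch.sqrt(alpha_bar(s)) / (1.0 - t).clamp(min=1e-5)
    while scale.dim() < x_t.dim():
        scale = scale.unsqueeze(-1)
    return scale * x_t, s
```

During fine-tuning, the pre-trained network then receives (x_in, s) in place of (x_t, t), so its input statistics stay close to those of diffusion pre-training.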
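Finally, a toy sketch of the neural prior. Instead of tiling the static image, a lightweight predictor maps it to a coarse, motion-aware multi-frame prior, and the bridge from the first sketch is built between this stronger prior and the target video. `PriorNet` below is a hypothetical placeholder (a single convolution predicting per-frame residuals); the actual predictor architecture and its training target are not specified here.

```python
import torch
import torch.nn as nn

class PriorNet(nn.Module):
    """Predict a coarse F-frame video prior from one image (B, C, H, W)."""

    def __init__(self, channels: int = 3, num_frames: int = 16):
        super().__init__()
        self.num_frames = num_frames
        # Toy predictor: one conv emits a residual per output frame.
        self.net = nn.Conv2d(channels, channels * num_frames,
                             kernel_size=3, padding=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        b, c, h, w = image.shape
        residual = self.net(image).view(b, self.num_frames, c, h, w)
        # Coarse prior = tiled image plus predicted per-frame motion residual.
        return image.unsqueeze(1) + residual
```

With this in place, the bridge endpoint becomes `prior_net(image)` instead of the tiled image in `bridge_marginal(video, prior, t)`, so the process starts closer to the target video and training from scratch becomes easier.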
FrameBridge vs. Diffusion Counterpart
Training from scratch on UCF-101:
Sample 1 (class condition: Boxing Speedbag): static input image, with videos generated by the bridge model and by its diffusion counterpart.
Sample 2 (class condition: Basketball): static input image, with videos generated by the bridge model and by its diffusion counterpart.
Fine-tuning pre-trained VideoCrafter-1 (text-to-video) on WebVid and testing FrameBridge in a zero-shot manner:
Sample 1 (text condition: "One girl is waving hand, one girl, solo, sexy pose, pensive woman, voluminous dress, intricate lace, embroidered gloves, feathered hat, curled hairdo, pale skin"): static input image, with videos generated by FrameBridge, DynamiCrafter, and SEINE.
Sample 2 (text condition: "sun rising, mountain, path, masterpiece"): static input image, with videos generated by FrameBridge, DynamiCrafter, and SEINE.
Ablation on SNR-Aligned Fine-tuning
Zero-shot Testing on MSR-VTT:
Sample 1 (text condition: "earth from outer space"): static input image, with videos generated with and without SAF.
Sample 2 (text condition: "electronic piano playing performance"): static input image, with videos generated with and without SAF.
Zero-shot Testing on UCF-101:
Sample 1 (class condition: playing sitar): static input image, with videos generated with and without SAF.
Sample 2 (class condition: band marching): static input image, with videos generated with and without SAF.
Zero-shot Testing with Other Prompts:
Sample 1 (text condition: "time-lapse of a blooming flower with leaves and a stem"): static input image, with generated videos.
Sample 2: static input image, with generated videos.