arXiv 2024
VidMaestro: Towards Photo-realistic and High-dynamic Video Generations

Binbin Yang1, Kangyang Xie2, Xinyu Xiao3, Meng Wang3, Yang Liu1, Jingdong Chen3, Ming Yang3, Liang Lin1

1Sun Yat-sen University, 2Zhejiang University, 3Ant Group

 

 

VidMaestro takes a reference image (the leftmost frame) together with a pair of appearance and motion descriptions as prompts to generate a video.

 

Abstract


Recent advances in diffusion models have greatly propelled the progress of text-to-image (T2I) generation. However, generating videos that are both high-fidelity and high-dynamic poses greater challenges due to the high-dimensional latent space, the intricate spatial-temporal relationships, and the strong reliance on high-quality training data. Prior works have sought to extend a T2I diffusion model to a text/image-to-video model by incorporating temporal convolution modules. While such temporal operations can enhance temporal consistency, the resulting models often suffer from limited object animation and unsatisfactory motion patterns. An underlying cause is the straightforward reuse of the frame-wise spatial feature transformation and semantic alignment from the text-to-image backbone, which is insufficient to comprehensively capture the intricate spatial-temporal dynamics. In this work, we present VidMaestro, a video diffusion model that generates high-definition videos with controllable motion by separately guiding the appearance and motion information. Specifically, our method takes as input an appearance prompt, comprising a reference image and a textual description, and a motion prompt detailing the movement and actions within the video. Unlike previous works that solely use spatial 2D self-attention and cross-attention to align each frame individually with the appearance prompt, VidMaestro introduces a motion-aware 3D attention module to comprehensively capture video dynamics in the spatial-temporal space and thereby improve semantic alignment with the input motion cues. By explicitly guiding the spatial and temporal content with these two cues, VidMaestro generates controllable, high-dynamic motions rather than minimal animations. Extensive experiments demonstrate the superior spatial-temporal generative performance of our method, especially in temporal consistency and motion controllability.
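The dual-prompt conditioning described above can be sketched as a denoising block with two cross-attention branches, one for the appearance prompt and one for the motion prompt. This is a minimal illustrative sketch; the class name, argument names, and dimensions below are assumptions, not VidMaestro's actual API.

```python
import torch
import torch.nn as nn

class DualPromptBlock(nn.Module):
    """Toy denoising block with separate cross-attention branches for an
    appearance prompt and a motion prompt (illustrative, not the paper's code)."""
    def __init__(self, dim: int, ctx_dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.appearance_attn = nn.MultiheadAttention(
            dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.motion_attn = nn.MultiheadAttention(
            dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)

    def forward(self, x, app_ctx, mot_ctx):
        # x:       (B, N, dim)      video latent tokens
        # app_ctx: (B, La, ctx_dim) appearance-prompt embeddings (image + text)
        # mot_ctx: (B, Lm, ctx_dim) motion-prompt embeddings
        x = x + self.appearance_attn(self.norm1(x), app_ctx, app_ctx)[0]
        x = x + self.motion_attn(self.norm2(x), mot_ctx, mot_ctx)[0]
        return x

block = DualPromptBlock(dim=64, ctx_dim=32)
out = block(torch.randn(2, 128, 64),   # video tokens
            torch.randn(2, 10, 32),    # appearance-prompt context
            torch.randn(2, 5, 32))     # motion-prompt context
print(out.shape)  # torch.Size([2, 128, 64])
```

Keeping the two branches separate lets each conditioning signal steer the latent independently, which mirrors the paper's stated goal of guiding appearance and motion with distinct cues.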

 

 

Spatial 2D cross-attention vs. motion-aware 3D cross-attention

Spatial 2D cross-attention fails to correctly associate the "roaring" token with the appropriate region. By contrast, our motion-aware 3D cross-attention successfully localizes the mouth of the mechanical white tiger and captures its moving trajectory.

 

 

2D Cross-attention vs. 3D Cross-attention
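The difference visualized above comes down to how the video latent is flattened before cross-attention: 2D attention folds frames into the batch, while 3D attention keeps all frames in one token sequence. The tensor shapes below are assumed purely for illustration.

```python
import torch

B, F, H, W, C = 1, 8, 16, 16, 64    # batch, frames, height, width, channels
video = torch.randn(B, F, H, W, C)  # video latent (shapes are illustrative)
motion_tok = torch.randn(B, 1, C)   # embedding of one motion token, e.g. "roaring"

# Spatial 2D cross-attention: frames are folded into the batch, so a text
# token attends within each frame independently and cannot relate a region
# in frame t to the same region in frame t+1.
tokens_2d = video.reshape(B * F, H * W, C)   # (8, 256, 64)

# Motion-aware 3D cross-attention: all frames form one token sequence, so a
# motion token attends jointly over space *and* time and can follow a moving
# region (e.g. the tiger's mouth) across frames.
tokens_3d = video.reshape(B, F * H * W, C)   # (1, 2048, 64)
scores = torch.softmax(
    motion_tok @ tokens_3d.transpose(1, 2) / C ** 0.5, dim=-1)
print(scores.shape)  # torch.Size([1, 1, 2048]) — one distribution over space-time
```

Because `scores` is a single softmax over all F·H·W positions, the motion token's attention mass can shift spatially from frame to frame, which is what lets it trace a trajectory rather than re-localize per frame.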


Model Architecture


 

 

Gallery