MotionClone: Training-Free Motion Cloning for Controllable Video Generation  

Pengyang Ling*1,4 Jiazi Bu*2,4 Pan Zhang4✝ Xiaoyi Dong4 Yuhang Zang4 Tong Wu3 Huaian Chen1 Jiaqi Wang4 Yi Jin1✝
*Equal Contribution. ✝Corresponding Author.
1University of Science and Technology of China 2Shanghai Jiao Tong University 3The Chinese University of Hong Kong 4Shanghai AI Laboratory

[Paper]     [Github]     [BibTeX]


Click to Play the Animations!


Generated with AnimateDiff and CivitAI model: Realistic Vision V5.1
The first row shows the reference videos, and the second row shows the corresponding videos generated by MotionClone.

Methodology

Motion-based controllable text-to-video generation uses motion cues to control video generation. Previous methods typically require training dedicated models to encode motion cues (e.g., VideoComposer) or fine-tuning video diffusion models (e.g., Tune-A-Video and VMC), and these approaches often yield suboptimal motion when applied outside the trained domain. In this work, we propose MotionClone, a training-free framework that clones motion from a reference video for versatile motion-controlled video generation, including text-to-video and image-to-video. Based on the observation that the dominant components in temporal-attention maps drive motion synthesis, while the rest mainly capture noisy or very subtle motions, MotionClone uses sparse temporal-attention weights as motion representations for motion guidance, facilitating diverse motion transfer across varying scenarios. Moreover, MotionClone extracts the motion representation directly through a single denoising step, bypassing cumbersome inversion processes and thus improving both efficiency and flexibility. Extensive experiments demonstrate that MotionClone handles both global camera motion and local object motion, with notable superiority in motion fidelity, textual alignment, and temporal consistency.
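
To make the sparse-attention idea concrete, below is a minimal PyTorch sketch (not the released implementation) of turning a temporal-attention map into a sparse motion representation. Here `attn_map` is assumed to be a temporal-attention map of shape [batch × spatial positions, frames, frames], and `top_k` is a hypothetical hyperparameter controlling how many dominant weights are kept per query position.

```python
import torch

def sparse_motion_representation(attn_map: torch.Tensor, top_k: int = 1):
    """Keep only the dominant temporal-attention weights as the motion representation.

    The remaining entries are assumed to encode noisy or very subtle motion
    and are zeroed out.
    """
    # Indices of the k largest attention weights along the key (last) dimension.
    topk_idx = attn_map.topk(top_k, dim=-1).indices
    mask = torch.zeros_like(attn_map, dtype=torch.bool).scatter_(-1, topk_idx, True)
    # Sparse representation: dominant weights kept, everything else set to zero.
    sparse_repr = torch.where(mask, attn_map, torch.zeros_like(attn_map))
    return sparse_repr, mask
```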

As illustrated in the framework below, given a real reference video, the corresponding motion representation is obtained by performing a single noise-adding and denoising step. During video generation, the initial latent is sampled from a standard Gaussian distribution and then undergoes iterative denoising with a pre-trained video diffusion model, guided by both classifier-free guidance and the proposed motion guidance. Since the image structure is determined in the early denoising steps, and motion fidelity primarily depends on the structure of each frame, motion guidance is applied only in the early denoising steps; this leaves sufficient flexibility for semantic adjustment and thus enables high-quality video generation with compelling motion fidelity and precise textual alignment.
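
As a rough illustration of this pipeline, the sketch below shows a plausible sampling loop under stated assumptions; it is not the official code. `pipe` is a hypothetical wrapper around a pre-trained video diffusion model exposing a diffusers-style scheduler, an epsilon prediction `eps_model(latent, t, cond)`, and `temporal_attn(latent, t, cond)` returning temporal-attention maps; `motion_scale` and `motion_steps` are assumed hyperparameters.

```python
import torch

def extract_motion_representation(pipe, ref_latent, cond, t_ref, top_k=1):
    """Single noise-adding and denoising step on the reference video latent,
    returning its (sparse) temporal-attention motion representation."""
    noise = torch.randn_like(ref_latent)
    noisy = pipe.scheduler.add_noise(ref_latent, noise, t_ref)   # one noise-adding step
    attn = pipe.temporal_attn(noisy, t_ref, cond)                # one denoising pass
    topk_idx = attn.topk(top_k, dim=-1).indices
    mask = torch.zeros_like(attn, dtype=torch.bool).scatter_(-1, topk_idx, True)
    return attn.detach(), mask

@torch.no_grad()
def sample_with_motion_guidance(pipe, ref_attn, ref_mask, cond, uncond,
                                num_steps=50, cfg_scale=7.5,
                                motion_scale=2000.0, motion_steps=20):
    """Iterative denoising driven by classifier-free guidance plus motion guidance
    that is applied only during the early denoising steps."""
    latent = torch.randn(pipe.latent_shape, device=pipe.device)  # standard Gaussian init
    pipe.scheduler.set_timesteps(num_steps)
    for i, t in enumerate(pipe.scheduler.timesteps):
        # Classifier-free guidance: combine conditional and unconditional predictions.
        eps_c = pipe.eps_model(latent, t, cond)
        eps_u = pipe.eps_model(latent, t, uncond)
        eps = eps_u + cfg_scale * (eps_c - eps_u)

        if i < motion_steps:                                     # early steps only
            with torch.enable_grad():
                lat = latent.detach().requires_grad_(True)
                cur = pipe.temporal_attn(lat, t, cond)
                # Match only the dominant (masked) temporal-attention weights of the reference.
                loss = ((cur - ref_attn)[ref_mask] ** 2).mean()
                grad = torch.autograd.grad(loss, lat)[0]
            eps = eps + motion_scale * grad                      # steer toward the reference motion

        latent = pipe.scheduler.step(eps, t, latent).prev_sample
    return latent
```

Restricting the guidance loss to the masked entries reflects the dominant-component observation above, and limiting it to the first few denoising steps leaves the later steps free for semantic refinement.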


Object Motion Cloning Gallery

Here we demonstrate our results for object motion cloning.
Click to play the following animations.


Camera Motion Cloning Gallery

Here we demonstrate our results for camera motion cloning.
Click to play the following animations.


Versatile Applications of MotionClone

Here we demonstrate our results for additional applications, including Image-to-Video (I2V) and Sketch-to-Video.
Click to play the following animations.


BibTeX

@article{ling2024motionclone,
  title={MotionClone: Training-Free Motion Cloning for Controllable Video Generation},
  author={Ling, Pengyang and Bu, Jiazi and Zhang, Pan and Dong, Xiaoyi and Zang, Yuhang and Wu, Tong and Chen, Huaian and Wang, Jiaqi and Jin, Yi},
  journal={arXiv preprint arXiv:2406.05338},
  year={2024}
}

The project page template is borrowed from DreamBooth.