Generated with AnimateDiff and the CivitAI model Realistic Vision V5.1.
The first row shows the reference videos; the second row shows the corresponding videos generated by MotionClone.
Methodology
Motion-based controllable text-to-video generation uses motion cues to guide video synthesis. Previous methods typically require training dedicated models to
encode motion cues (e.g., VideoComposer) or fine-tuning video diffusion models (e.g., Tune-A-Video and VMC). However, these approaches often produce suboptimal motion when applied outside their training domain.
In this work, we propose MotionClone, a training-free framework that clones motion from reference videos to enable versatile motion-controlled video generation, including text-to-video and image-to-video. Based on the observation
that the dominant components in temporal-attention maps drive motion synthesis, while the remaining components mainly capture noise or very subtle motion, MotionClone uses sparse temporal attention weights as the motion representation for motion guidance,
facilitating diverse motion transfer across varying scenarios. Moreover, MotionClone extracts this motion representation directly through a single denoising step, bypassing cumbersome inversion processes and thus improving
both efficiency and flexibility. Extensive experiments demonstrate that MotionClone handles both global camera motion and local object motion, with notable superiority in motion fidelity, textual alignment, and temporal consistency.
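As a rough illustration of the sparse motion representation (not the authors' released code), the PyTorch-style sketch below keeps only the top-k temporal-attention weights per spatial location and masks out the rest; the tensor shape and the helper name sparse_motion_representation are assumptions made here for clarity.

import torch

def sparse_motion_representation(temporal_attn: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Hypothetical sketch: keep only the dominant temporal-attention components.

    temporal_attn: temporal-attention map of shape (batch*heads, spatial, frames, frames),
    i.e. for every spatial location, how each frame attends to every other frame.
    Only the top-k entries along the key-frame dimension are retained; the rest are
    zeroed out, since they mostly capture noise or very subtle motion.
    """
    _, topk_idx = temporal_attn.topk(k, dim=-1)                      # dominant components
    mask = torch.zeros_like(temporal_attn).scatter_(-1, topk_idx, 1.0)
    return temporal_attn * mask                                      # sparse motion representation

During generation, the same sparsity pattern selects which entries of the model's own temporal attention are compared against the reference, so only the dominant motion components steer the sample.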
As illustrated in the framework below, given a real reference video, the corresponding motion representation is obtained by performing a single noise-adding and denoising step.
During video generation, the initial latent is sampled from a standard Gaussian distribution and then iteratively denoised by a pre-trained video diffusion
model, guided by both classifier-free guidance and the proposed motion guidance. Since the image structure is determined in the early denoising steps, and
motion fidelity primarily depends on the structure of each frame, motion guidance is applied only in the early denoising steps. This leaves sufficient flexibility for semantic adjustment, enabling high-quality video generation with compelling motion fidelity and precise textual alignment.
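The following schematic sketch outlines this pipeline under stated assumptions: it reuses the sparse_motion_representation helper from the sketch above, and the names pipe.unet, pipe.scheduler, pipe.add_noise, pipe.latent_shape, and get_temporal_attention are hypothetical placeholders for the underlying video diffusion pipeline (e.g., AnimateDiff). The motion-guidance update here follows a generic classifier-guidance-style gradient step; the exact update rule in the paper may differ.

import torch

@torch.no_grad()
def extract_reference_motion(pipe, ref_latents, t_ref):
    """Single noise-adding + denoising step on the reference video (no inversion)."""
    noisy_ref = pipe.add_noise(ref_latents, torch.randn_like(ref_latents), t_ref)
    pipe.unet(noisy_ref, t_ref)                        # one denoising pass
    ref_attn = get_temporal_attention(pipe.unet)       # temporal-attention maps cached via hooks
    return sparse_motion_representation(ref_attn)      # sparse motion representation (see above)

def sample_with_motion_guidance(pipe, prompt_emb, null_emb, ref_motion,
                                num_steps=50, guidance_scale=7.5,
                                motion_scale=2000.0, motion_steps_ratio=0.4):
    latents = torch.randn(pipe.latent_shape)           # start from standard Gaussian noise
    for i, t in enumerate(pipe.scheduler.timesteps[:num_steps]):
        # classifier-free guidance
        with torch.no_grad():
            eps_uncond = pipe.unet(latents, t, null_emb)
            eps_text = pipe.unet(latents, t, prompt_emb)
        eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)

        # motion guidance: applied only in the early denoising steps,
        # where the frame structure (and hence motion) is decided
        if i < int(motion_steps_ratio * num_steps):
            latents = latents.detach().requires_grad_(True)
            pipe.unet(latents, t, prompt_emb)          # attention maps assumed cached with gradients
            cur_motion = sparse_motion_representation(get_temporal_attention(pipe.unet))
            loss = torch.nn.functional.mse_loss(cur_motion, ref_motion)
            grad = torch.autograd.grad(loss, latents)[0]
            eps = eps + motion_scale * grad            # steer the sample toward the reference motion

        latents = pipe.scheduler.step(eps, t, latents.detach()).prev_sample
    return latents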
Object Motion Cloning Gallery
Here we demonstrate our results for object motion cloning. Click to play the following animations.
Camera Motion Cloning Gallery
Here we demonstrate our results for camera motion cloning. Click to play the following animations.
Versatile Applications of MotionClone
Here we demonstrate our results for more applications, including Image-to-Video (I2V) and Sketch-to-Video. Click to play the following animations.
BibTeX
@article{ling2024motionclone,
title={MotionClone: Training-Free Motion Cloning for Controllable Video Generation},
author={Ling, Pengyang and Bu, Jiazi and Zhang, Pan and Dong, Xiaoyi and Zang, Yuhang and Wu, Tong and Chen, Huaian and Wang, Jiaqi and Jin, Yi},
journal={arXiv preprint arXiv:2406.05338},
year={2024}
}
The project page template is borrowed from DreamBooth.