Authors:
(1) Yuwei Guo, The Chinese University of Hong Kong;
(2) Ceyuan Yang, Shanghai Artificial Intelligence Laboratory (Corresponding Author);
(3) Anyi Rao, Stanford University;
(4) Zhengyang Liang, Shanghai Artificial Intelligence Laboratory;
(5) Yaohui Wang, Shanghai Artificial Intelligence Laboratory;
(6) Yu Qiao, Shanghai Artificial Intelligence Laboratory;
(7) Maneesh Agrawala, Stanford University;
(8) Dahua Lin, Shanghai Artificial Intelligence Laboratory;
(9) Bo Dai, The Chinese University of Hong Kong.
Table of Links
Abstract and 1 Introduction
2 Related Work
3 Preliminary
4 AnimateDiff
4.1 Alleviate Negative Effects from Training Data with Domain Adapter
4.2 Learn Motion Priors with Motion Module
4.3 Adapt to New Motion Patterns with MotionLoRA
4.4 AnimateDiff in Practice
5 Experiments and 5.1 Qualitative Results
5.2 Quantitative Comparison
5.3 Ablative Study
5.4 Controllable Generation
6 Conclusion
7 Ethics Statement
8 Reproducibility Statement, Acknowledgement and References
4.3 ADAPT TO NEW MOTION PATTERNS WITH MOTIONLORA
While the pre-trained motion module captures general motion priors, a question remains: how can it be adapted efficiently to new motion patterns, such as camera zooming, panning, and rolling, with only a small number of reference videos and training iterations? Such efficiency is essential for users who cannot afford expensive pre-training but would like to fine-tune the motion module for specific effects. This is the last stage of AnimateDiff, dubbed MotionLoRA (Fig. 3), an efficient fine-tuning approach for motion personalization. Considering the architecture of the motion module and the limited number of reference videos, we add LoRA layers to the self-attention layers of the motion module in the inflated model described in Sec. 4.2, then train these LoRA layers on the reference videos of the new motion patterns, as sketched below.
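To make the idea concrete, the following is a minimal PyTorch sketch of injecting LoRA layers into the self-attention projections of a motion module. The module and attribute names (motion_module, to_q, to_k, to_v, to_out) and the rank/scaling defaults are illustrative assumptions, not the exact AnimateDiff implementation.

```python
# Minimal sketch: wrap the q/k/v/out projections of the motion module's
# (temporal) self-attention with trainable low-rank residuals, keeping the
# pre-trained weights frozen. Names and defaults are assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank residual: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pre-trained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)    # A
        self.up = nn.Linear(rank, base.out_features, bias=False)     # B
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)                   # residual starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


def inject_motion_lora(motion_module: nn.Module, rank: int = 32) -> nn.Module:
    """Replace attention projections with LoRA-wrapped versions; only LoRA params train."""
    for block in list(motion_module.modules()):
        for name in ("to_q", "to_k", "to_v", "to_out"):  # assumed projection names
            proj = getattr(block, name, None)
            if isinstance(proj, nn.Linear):
                setattr(block, name, LoRALinear(proj, rank=rank))
    return motion_module
```

Only the LoRA parameters (down/up matrices) receive gradients during fine-tuning, which is what keeps both the training cost and the size of the shared checkpoint small.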
We experiment with several shot types and obtain the reference videos via rule-based data augmentation. For instance, to obtain videos with zooming effects, we gradually reduce (zoom-in) or enlarge (zoom-out) the cropping area of the video frames along the temporal axis. We demonstrate that MotionLoRA achieves promising results with as few as 20 ∼ 50 reference videos, 2,000 training iterations (around 1 ∼ 2 hours), and about 30M of storage, enabling efficient model tuning and sharing among users. Benefiting from its low-rank property, MotionLoRA also has composition capability: individually trained MotionLoRA models can be combined at inference time to achieve composed motion effects. A sketch of the rule-based zoom augmentation follows.
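Below is a minimal sketch of the rule-based zoom augmentation: the center-crop window shrinks (zoom-in) or grows (zoom-out) linearly along the temporal axis, and each crop is resized back to the original resolution. The frame layout (T, H, W, C), the scale range, and the function name are assumptions for illustration.

```python
# Minimal sketch of rule-based zoom augmentation for reference videos.
# frames: uint8 array of shape (T, H, W, C); returns an array of the same shape.
import numpy as np
import cv2


def zoom_augment(frames: np.ndarray, zoom_in: bool = True, min_scale: float = 0.5) -> np.ndarray:
    T, H, W, _ = frames.shape
    # Crop scale goes 1.0 -> min_scale for zoom-in, min_scale -> 1.0 for zoom-out.
    scales = np.linspace(1.0, min_scale, T) if zoom_in else np.linspace(min_scale, 1.0, T)
    out = []
    for frame, s in zip(frames, scales):
        ch, cw = int(H * s), int(W * s)
        y0, x0 = (H - ch) // 2, (W - cw) // 2            # center crop
        crop = frame[y0:y0 + ch, x0:x0 + cw]
        out.append(cv2.resize(crop, (W, H), interpolation=cv2.INTER_LINEAR))
    return np.stack(out)
```

For composition, since each MotionLoRA contributes only an additive low-rank residual on top of the same frozen projections, residuals from individually trained MotionLoRA models (e.g., zoom-in and panning) can simply be summed, optionally with per-model scaling, at inference time.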