Table of Links
Abstract and 1. Introduction
2 Related Work
3 Preliminaries
4 Method
4.1 Key Sample and Joint Editing
4.2 Edit Propagation Via TokenFlow
5 Results
5.1 Qualitative Evaluation and 5.2 Quantitative Evaluation
5.3 Ablation Study
6 Discussion
7 Acknowledgement and References
A Implementation Details
ABSTRACT
The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos.
1 INTRODUCTION
The evolution of text-to-image models has recently facilitated advances in image editing and content creation, allowing users to control various properties of both generated and real images. Nevertheless, expanding this exciting progress to video still lags behind. A surge of large-scale text-to-video generative models has emerged, demonstrating impressive results in generating clips solely from textual descriptions. However, despite the progress made in this area, existing video models are still in their infancy, being limited in resolution, video length, or the complexity of video dynamics they can represent.

In this paper, we harness the power of a state-of-the-art pre-trained text-to-image model for the task of text-driven editing of natural videos. Specifically, our goal is to generate high-quality videos that adhere to the target edit expressed by an input text prompt, while preserving the spatial layout and motion of the original video.

The main challenge in leveraging an image diffusion model for video editing is to ensure that the edited content is consistent across all video frames: ideally, each physical point in the 3D world undergoes coherent modifications across time. Existing and concurrent video editing methods that are based on image diffusion models have demonstrated that global appearance coherency across the edited frames can be achieved by extending the self-attention module to include multiple frames (Wu et al., 2022; Khachatryan et al., 2023b; Ceylan et al., 2023; Qi et al., 2023). Nevertheless, this approach is insufficient for achieving the desired level of temporal consistency, as motion in the video is only implicitly preserved through the attention module. Consequently, professional or semi-professional users often resort to elaborate video editing pipelines that entail additional manual work.

In this work, we propose a framework that tackles this challenge by explicitly enforcing the original inter-frame correspondences on the edit. Intuitively, natural videos contain redundant information across frames, e.g., they depict similar appearance and shared visual elements. Our key observation is that the internal representation of the video in the diffusion model exhibits similar properties. That is, the level of redundancy and temporal consistency of the frames in the RGB space and in the diffusion feature space are tightly correlated. Based on this observation, the pillar of our approach is to achieve a consistent edit by ensuring that the features of the edited video are consistent across frames. Specifically, we enforce that the edited features convey the same inter-frame correspondences and redundancy as the original video features. To do so, we leverage the original inter-frame feature correspondences, which are readily available in the model. This leads to an effective method that directly propagates the edited diffusion features based on the original video dynamics. This approach allows us to harness the generative prior of a state-of-the-art image diffusion model without additional training or fine-tuning, and it can work in conjunction with an off-the-shelf diffusion-based image editing method (e.g., Meng et al. (2022); Hertz et al. (2022); Zhang & Agrawala (2023); Tumanyan et al. (2023)).
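To make the propagation idea concrete, the following is a minimal sketch, not the authors' implementation, of how edited diffusion features could be copied from edited keyframes to all frames along nearest-neighbor correspondences computed on the original video's features. The function names, tensor shapes, and the nearest-keyframe selection rule are illustrative assumptions; the actual method (Section 4) operates on tokens inside the extended self-attention layers and blends between neighboring keyframes.

```python
# Illustrative sketch (assumed shapes and helpers, not the paper's code):
# propagate edited keyframe features to every frame using nearest-neighbor
# correspondences measured on the ORIGINAL video's diffusion features.
import torch
import torch.nn.functional as F

def nearest_neighbor_field(src_feats, tgt_feats):
    """For each token in src_feats, return the index of its nearest neighbor
    (cosine similarity) among the tokens of tgt_feats.
    src_feats, tgt_feats: (num_tokens, dim) features of a single frame."""
    src = F.normalize(src_feats, dim=-1)
    tgt = F.normalize(tgt_feats, dim=-1)
    sim = src @ tgt.t()              # (num_src_tokens, num_tgt_tokens)
    return sim.argmax(dim=-1)        # correspondence index per source token

def propagate_edited_features(orig_feats, edited_key_feats, key_ids):
    """orig_feats:       list of (num_tokens, dim) original features per frame
       edited_key_feats: dict {keyframe_id: (num_tokens, dim)} edited features
       key_ids:          sorted list of keyframe indices
       Returns per-frame edited features obtained by copying edited keyframe
       tokens along correspondences found in the original features."""
    out = []
    for i, feats in enumerate(orig_feats):
        # Simplification: use the single nearest keyframe; blending features
        # from the two surrounding keyframes is a natural alternative.
        k = min(key_ids, key=lambda kid: abs(kid - i))
        nn = nearest_neighbor_field(feats, orig_feats[k])
        out.append(edited_key_feats[k][nn])   # propagate edited tokens
    return out
```

Because the correspondences are computed from the original video, the propagated edit inherits the original motion and redundancy structure, which is the intuition behind the consistency argument above.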
To summarize, we make the following key contributions:
• A technique, dubbed TokenFlow, that enforces semantic correspondences of diffusion features across frames, significantly increasing temporal consistency in videos generated by a text-to-image diffusion model.
• Novel empirical analysis studying the properties of diffusion features across a video.
• State-of-the-art editing results on diverse videos, depicting complex motions.
Authors:
(1) Michal Geyer, Weizmann Institute of Science (equal contribution);
(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);
(3) Shai Bagon, Weizmann Institute of Science;
(4) Tali Dekel, Weizmann Institute of Science.