Authors:
(1) Dan Kondratyuk, Google Research (Equal contribution);
(2) Lijun Yu, Google Research, Carnegie Mellon University (Equal contribution);
(3) Xiuye Gu, Google Research (Equal contribution);
(4) Jose Lezama, Google Research (Equal contribution);
(5) Jonathan Huang, Google Research (Equal contribution);
(6) Grant Schindler, Google Research;
(7) Rachel Hornung, Google Research;
(8) Vighnesh Birodkar, Google Research;
(9) Jimmy Yan, Google Research;
(10) Krishna Somandepalli, Google Research;
(11) Hassan Akbari, Google Research;
(12) Yair Alon, Google Research;
(13) Yong Cheng, Google DeepMind;
(14) Josh Dillon, Google Research;
(15) Agrim Gupta, Google Research;
(16) Meera Hahn, Google Research;
(17) Anja Hauth, Google Research;
(18) David Hendon, Google Research;
(19) Alonso Martinez, Google Research;
(20) David Minnen, Google Research;
(21) Mikhail Sirotenko, Google Research;
(22) Kihyuk Sohn, Google Research;
(23) Xuan Yang, Google Research;
(24) Hartwig Adam, Google Research;
(25) Ming-Hsuan Yang, Google Research;
(26) Irfan Essa, Google Research;
(27) Huisheng Wang, Google Research;
(28) David A. Ross, Google Research;
(29) Bryan Seybold, Google Research (Equal contribution);
(30) Lu Jiang, Google Research (Equal contribution).
Table of Links
Abstract and 1 Introduction
2. Related Work
3. Model Overview and 3.1. Tokenization
3.2. Language Model Backbone and 3.3. Super-Resolution
4. LLM Pretraining for Generation
4.1. Task Prompt Design
4.2. Training Strategy
5. Experiments
5.1. Experimental Setup
5.2. Pretraining Task Analysis
5.3. Comparison with the State-of-the-Art
5.4. LLM’s Diverse Capabilities in Video Generation and 5.5. Limitations
6. Conclusion, Acknowledgements, and References
A. Appendix
4.1. Task Prompt Design
We design a mixture of pretraining tasks, each with a defined prefix input and output. The model conditions on the prefix and applies the loss solely to the output. Fig. 2 shows a typical input-output sequence layout. For each task, the input sequence may include three types of values: text embeddings (T5), visual tokens (MAGVIT-v2), and audio tokens (SoundStream). The model outputs two types of tokens: visual tokens and audio tokens. To facilitate training, VideoPoet employs special tokens, as listed in Appendix Table 4. In the following, we describe the key designs for the task prompts.
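To make this layout concrete, below is a minimal sketch of how such a loss-masked training example could be assembled. The token ids, special tokens, and stub tokenizers are illustrative placeholders standing in for T5, MAGVIT-v2, and SoundStream; they are not VideoPoet's actual vocabulary or API.

```python
from typing import List, Tuple

# Hypothetical special tokens and <task> id (placeholders, not the real vocabulary).
TASK_T2V = 3
BOS_V, EOS_V = 1, 2

def embed_text(text: str) -> List[int]:
    # Stand-in for T5 text embeddings, reduced to integer ids for simplicity.
    return [100 + (ord(c) % 26) for c in text[:8]]

def tokenize_video(num_frames: int) -> List[int]:
    # Stand-in for MAGVIT-v2 visual tokens (one token per frame here).
    return [200 + i for i in range(num_frames)]

def build_example(text: str, num_frames: int) -> Tuple[List[int], List[int]]:
    prefix = [TASK_T2V] + embed_text(text)                    # conditioning: no loss
    output = [BOS_V] + tokenize_video(num_frames) + [EOS_V]   # supervised tokens
    sequence = prefix + output
    loss_mask = [0] * len(prefix) + [1] * len(output)         # loss only on the output
    return sequence, loss_mask

sequence, loss_mask = build_example("a cat surfing", num_frames=4)
print(len(sequence), sum(loss_mask))  # total positions vs. supervised positions
```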
Pretraining tasks. We consider the following tasks (a configuration sketch follows this list):
Unconditioned video generation: Generate video frames without conditioning on an input.
Text-to-video (T2V): Generate video from a text prompt.
Video future prediction (FP): Given an input video of variable length, predict future frames.
Image-to-video (I2V): Given the first frame of a video as an input image, predict the future frames.
Video inpainting/outpainting (Painting): Given a masked video, predict the video with the masked contents filled in.
Video stylization: Given text, optical flow, and depth, predict the video frames (Section 4.1).
Audio-to-video: Given an input audio waveform, predict the corresponding video.
Video-to-audio: Given an input video, predict the corresponding audio waveform.
Audio-video continuation (AVCont): Given an input frame and its audio, predict the rest of the video and audio.
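As a rough summary of the list above (a paraphrase for illustration, not VideoPoet's internal configuration), the task mixture can be viewed as a mapping from each task to its conditioning inputs and supervised outputs:

```python
# Illustrative registry of the pretraining tasks; the names and modality labels
# are assumptions made for this sketch.
PRETRAINING_TASKS = {
    "unconditional_video":      {"inputs": [],                        "outputs": ["video"]},
    "text_to_video":            {"inputs": ["text"],                  "outputs": ["video"]},
    "future_prediction":        {"inputs": ["video_prefix"],          "outputs": ["video"]},
    "image_to_video":           {"inputs": ["first_frame"],           "outputs": ["video"]},
    "inpainting_outpainting":   {"inputs": ["masked_video"],          "outputs": ["video"]},
    "stylization":              {"inputs": ["text", "flow", "depth"], "outputs": ["video"]},
    "audio_to_video":           {"inputs": ["audio"],                 "outputs": ["video"]},
    "video_to_audio":           {"inputs": ["video"],                 "outputs": ["audio"]},
    "audio_video_continuation": {"inputs": ["first_frame", "audio"],  "outputs": ["video", "audio"]},
}

for name, spec in PRETRAINING_TASKS.items():
    print(f"{name}: {spec['inputs']} -> {spec['outputs']}")
```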
To indicate the type of task, we condition on the <task> token, which has a unique value for each unique output. We note that not all input variations need a new <task>; the model adapts to different context signals for identical outputs. For instance, text-to-video, image-to-video, and unconditioned video generation share the same <task>. If a modality is absent from a task, the related input/output tokens and special tokens are excluded, shortening the sequence.
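A sketch of this design choice: tasks with the same output type can share a single <task> value, and modality blocks that are absent simply never appear in the sequence. The token names and values here are assumptions for illustration only.

```python
# Hypothetical special tokens marking the begin/end of each modality block.
BOT, EOT = 10, 11   # text
BOV, EOV = 12, 13   # visual
BOA, EOA = 14, 15   # audio

# One <task> id per unique output type (values are illustrative).
TASK_BY_OUTPUT = {"video": 3, "audio": 4, "video_and_audio": 5}

def build_prefix(output_type, text_tokens=None, visual_tokens=None, audio_tokens=None):
    # Text-to-video, image-to-video, and unconditioned generation all produce
    # video, so they share TASK_BY_OUTPUT["video"]; only the context differs.
    prefix = [TASK_BY_OUTPUT[output_type]]
    if text_tokens:      # absent modality: its block and special tokens are dropped
        prefix += [BOT] + text_tokens + [EOT]
    if visual_tokens:
        prefix += [BOV] + visual_tokens + [EOV]
    if audio_tokens:
        prefix += [BOA] + audio_tokens + [EOA]
    return prefix

print(build_prefix("video"))                               # unconditioned generation
print(build_prefix("video", text_tokens=[101, 102, 103]))  # text-to-video
print(build_prefix("video", visual_tokens=[201]))          # image-to-video
```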
Representing an image as a video. In text-to-image pretraining, we omit the <eos> and <eos_o> tokens from the input sequence, allowing the model to keep generating tokens at inference time to produce longer videos. This approach blurs the boundary between the video and image generation tasks, enhancing cross-modal information sharing. This design leads to higher-quality initial frames and reduces errors and artifacts in subsequent frames.
Video token format. We generate video tokens at two resolutions, 128×128 and 128×224, each available in two lengths: 17 frames and 41 frames, both encoded at 8 frames per second. Special conditioning tokens are used to signal the desired resolutions and durations for video generation. Images are a special case of a 1-frame video, which we tokenize at 128×128 resolution.
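For illustration, the supported formats can be expressed as a small lookup from resolution and length to conditioning tokens. The token names below are invented for this sketch; only the resolutions, frame counts, and 8 fps rate come from the text above.

```python
# Hypothetical conditioning-token names; sizes, lengths, and frame rate follow the text.
FPS = 8
RESOLUTION_TOKENS = {(128, 128): "<res_128x128>", (128, 224): "<res_128x224>"}
LENGTH_TOKENS = {1: "<frames_1>", 17: "<frames_17>", 41: "<frames_41>"}

def conditioning_tokens(height: int, width: int, num_frames: int):
    # Images are the 1-frame special case and are tokenized at 128x128.
    if num_frames == 1:
        assert (height, width) == (128, 128), "images are tokenized at 128x128"
    return [RESOLUTION_TOKENS[(height, width)], LENGTH_TOKENS[num_frames]]

print(conditioning_tokens(128, 224, 41))  # ['<res_128x224>', '<frames_41>'], ~5 s at 8 fps
```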
Video stylization. For video stylization, we adopt a method motivated by (Zhang et al., 2023b; Chen et al., 2023b; Esser et al., 2023), predicting videos from text, optical flow, and depth signals. During training, the stylization task is to reconstruct the ground-truth video from the given optical flow, depth, and text. During inference, we estimate optical flow and depth from an input video and vary the text prompt to generate a new style, e.g., “cartoon.” Similar to (Esser et al., 2023), the text dictates the output “content” or appearance, while optical flow and depth guide its “structure.”
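The inference path can be sketched as follows. The flow/depth estimators, tokenizers, and the <task> handling are placeholders consistent with the earlier sketches, standing in for off-the-shelf estimation models and the MAGVIT-v2 tokenizer rather than actual APIs.

```python
# Placeholder estimators/tokenizers: in practice these would be off-the-shelf
# optical-flow and depth models plus the video tokenizer.
def estimate_flow(video):  return [300 + i for i in range(len(video))]
def estimate_depth(video): return [400 + i for i in range(len(video))]
def tokenize(signal):      return list(signal)
def embed_text(text):      return [100 + (ord(c) % 26) for c in text[:8]]

TASK_STYLIZE = 6  # hypothetical <task> id for stylization

def stylize_prefix(input_video, style_prompt="cartoon"):
    # Structure comes from the input video (flow + depth); appearance from the text prompt.
    flow_tokens = tokenize(estimate_flow(input_video))
    depth_tokens = tokenize(estimate_depth(input_video))
    return [TASK_STYLIZE] + embed_text(style_prompt) + flow_tokens + depth_tokens

prefix = stylize_prefix(input_video=[0, 1, 2, 3], style_prompt="cartoon")
print(len(prefix))  # the model would then decode visual tokens conditioned on this prefix
```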