Authors:
(1) Dan Kondratyuk, Google Research (equal contribution);
(2) Lijun Yu, Google Research and Carnegie Mellon University (equal contribution);
(3) Xiuye Gu, Google Research (equal contribution);
(4) Jose Lezama, Google Research (equal contribution);
(5) Jonathan Huang, Google Research (equal contribution);
(6) Grant Schindler, Google Research;
(7) Rachel Hornung, Google Research;
(8) Vighnesh Birodkar, Google Research;
(9) Jimmy Yan, Google Research;
(10) Krishna Somandepalli, Google Research;
(11) Hassan Akbari, Google Research;
(12) Yair Alon, Google Research;
(13) Yong Cheng, Google DeepMind;
(14) Josh Dillon, Google Research;
(15) Agrim Gupta, Google Research;
(16) Meera Hahn, Google Research;
(17) Anja Hauth, Google Research;
(18) David Hendon, Google Research;
(19) Alonso Martinez, Google Research;
(20) David Minnen, Google Research;
(21) Mikhail Sirotenko, Google Research;
(22) Kihyuk Sohn, Google Research;
(23) Xuan Yang, Google Research;
(24) Hartwig Adam, Google Research;
(25) Ming-Hsuan Yang, Google Research;
(26) Irfan Essa, Google Research;
(27) Huisheng Wang, Google Research;
(28) David A. Ross, Google Research;
(29) Bryan Seybold, Google Research (equal contribution);
(30) Lu Jiang, Google Research (equal contribution).
Table of Links
Abstract and 1 Introduction
2. Related Work
3. Model Overview and 3.1. Tokenization
3.2. Language Model Backbone and 3.3. Super-Resolution
4. LLM Pretraining for Generation
4.1. Task Prompt Design
4.2. Training Strategy
5. Experiments
5.1. Experimental Setup
5.2. Pretraining Task Analysis
5.3. Comparison with the State-of-the-Art
5.4. LLM’s Diverse Capabilities in Video Generation and 5.5. Limitations
6. Conclusion, Acknowledgements, and References
A. Appendix
3.2. Language Model Backbone
After converting the image, video, and audio modalities into discrete tokens within a shared vocabulary, we can directly leverage a language model to generate video and audio in the token space. We use a prefix language model with a decoder-only architecture as the backbone. By constructing different patterns of input and output tokens during training, we control which tasks the model can perform, as explained in Section 4.
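To make the prefix-LM setup concrete, the sketch below constructs the attention mask typically used in a decoder-only prefix language model: conditioning (prefix) tokens attend bidirectionally among themselves, while target tokens attend causally. This is an illustrative example, not the authors' implementation; the prefix length and token layout are assumptions for the demo.

```python
import numpy as np

def prefix_lm_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """Boolean attention mask for a decoder-only prefix LM.

    mask[i, j] is True if position i may attend to position j.
    Prefix (conditioning) tokens attend bidirectionally within the prefix;
    the remaining target tokens attend causally to all earlier positions.
    """
    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))
    # Allow full bidirectional attention inside the prefix.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Hypothetical example: 4 conditioning tokens (e.g. a task prompt and text)
# followed by 6 target tokens (e.g. video tokens to be generated).
# The training loss would typically be computed only on the target positions.
mask = prefix_lm_mask(prefix_len=4, total_len=10)
print(mask.astype(int))
```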
3.3. Super-Resolution
Generating high-resolution (HR) videos with an autoregressive transformer entails heavy computational costs due to the increase in sequence length. For example, the video tokenizer of Section 3.1 operating on a 17 × 896 × 512 video produces a sequence of 35,840 tokens, making autoregressive sampling highly impractical. To enable efficient and high-quality generative video upsampling, we develop a custom spatial super-resolution (SR) non-autoregressive video transformer (Yu et al., 2023a) that operates in token space on top of the language model output. To mitigate the computational requirements of the very long sequences involved, and in particular the quadratic memory cost of the self-attention layers, our design incorporates windowed local attention (Gupta et al., 2022). Specifically, our SR transformer is composed of blocks of three transformer layers, each of which performs self-attention in a local window aligned with one of three axes (Tu et al., 2022): spatial vertical, spatial horizontal, and temporal. The cross-attention layers attend to the low-resolution (LR) token sequence and are also divided into local windows, isomorphic to those of the self-attention layers. All blocks also include cross-attention to T5 XL text embeddings. See Fig. 3 for a schematic representation of the custom transformer architecture.
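The sketch below illustrates the idea of axis-aligned windowed self-attention on a (T, H, W) token grid: tokens attend only within a local window along a single axis, so each layer scales linearly in the number of windows rather than quadratically in the full sequence. This is a minimal single-head toy, not the paper's implementation; shapes, window size, and the plain-NumPy attention are assumptions made for illustration.

```python
import numpy as np

def axis_window_attention(x: np.ndarray, axis: int, window: int) -> np.ndarray:
    """Toy single-head self-attention restricted to local windows along one axis.

    x has shape (T, H, W, D); `axis` is 0 (temporal), 1 (spatial vertical),
    or 2 (spatial horizontal). Tokens attend only to the `window` tokens that
    share the same coordinates on the other two axes.
    """
    # Move the attended axis next to the channel dim; fold the other two axes
    # and the window index into a batch dimension.
    x = np.moveaxis(x, axis, 2)                       # (A, B, L, D)
    a, b, length, d = x.shape
    assert length % window == 0
    xw = x.reshape(a * b * (length // window), window, d)

    # Standard scaled dot-product attention inside each window.
    scores = xw @ xw.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = weights @ xw

    # Restore the original (T, H, W, D) layout.
    out = out.reshape(a, b, length, d)
    return np.moveaxis(out, 2, axis)

# A block of the kind described above would apply this three times, once per
# axis, interleaved with windowed cross-attention to the LR tokens and
# cross-attention to the text embeddings.
tokens = np.random.randn(8, 16, 16, 32)   # hypothetical (T, H, W, D) latent grid
print(axis_window_attention(tokens, axis=1, window=8).shape)  # (8, 16, 16, 32)
```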
Similar to Yu et al. (2023c), we train the SR transformer using token factorization (with k = 2 factors) to account for the large vocabulary size. The LR token sequences are obtained by tokenizing bicubic-downsampled versions of the ground truth videos and applying noise augmentation (Ho et al., 2022a) in the discrete latent space. Specifically, we randomly resample the values of a random subset of the LR tokens and independently drop the LR condition and text embeddings for 10% of the training samples. During inference, we use non-autoregressive sampling (Chang et al., 2022; Yu et al., 2023a) with classifier-free guidance applied independently to the LR condition and the text embeddings (Brooks et al., 2023). We use a cascade of two 2× stages to generate videos of 896×512 resolution from the 224×128 base output of VideoPoet. More implementation details can be found in the appendix.
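As a minimal sketch of classifier-free guidance over two independent conditions, the snippet below combines logits in the two-scale style of Brooks et al. (2023): one scale strengthens adherence to the LR tokens and the other to the text embeddings. The `model` callable, its signature, and the guidance scales are hypothetical placeholders, not the paper's API.

```python
import numpy as np

def guided_logits(model, tokens, lr_tokens, text_emb,
                  s_lr: float = 1.5, s_text: float = 7.5) -> np.ndarray:
    """Two-condition classifier-free guidance on next-token logits.

    `model(tokens, lr_tokens, text_emb)` is a hypothetical callable returning
    logits; passing None drops a condition, mirroring the 10% condition
    dropout used during training.
    """
    l_uncond = model(tokens, None, None)          # neither condition
    l_lr = model(tokens, lr_tokens, None)         # LR condition only
    l_full = model(tokens, lr_tokens, text_emb)   # both conditions

    # Compose guidance terms for the LR and text conditions independently.
    return (l_uncond
            + s_lr * (l_lr - l_uncond)
            + s_text * (l_full - l_lr))
```

In a non-autoregressive sampler, logits combined this way would be used at each iterative refinement step to choose which masked token positions to commit, before the remaining positions are re-predicted in the next step.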