Authors:
(1) Dan Kondratyuk, Google Research (equal contribution);
(2) Lijun Yu, Google Research and Carnegie Mellon University (equal contribution);
(3) Xiuye Gu, Google Research (equal contribution);
(4) Jose Lezama, Google Research (equal contribution);
(5) Jonathan Huang, Google Research (equal contribution);
(6) Grant Schindler, Google Research;
(7) Rachel Hornung, Google Research;
(8) Vighnesh Birodkar, Google Research;
(9) Jimmy Yan, Google Research;
(10) Krishna Somandepalli, Google Research;
(11) Hassan Akbari, Google Research;
(12) Yair Alon, Google Research;
(13) Yong Cheng, Google DeepMind;
(14) Josh Dillon, Google Research;
(15) Agrim Gupta, Google Research;
(16) Meera Hahn, Google Research;
(17) Anja Hauth, Google Research;
(18) David Hendon, Google Research;
(19) Alonso Martinez, Google Research;
(20) David Minnen, Google Research;
(21) Mikhail Sirotenko, Google Research;
(22) Kihyuk Sohn, Google Research;
(23) Xuan Yang, Google Research;
(24) Hartwig Adam, Google Research;
(25) Ming-Hsuan Yang, Google Research;
(26) Irfan Essa, Google Research;
(27) Huisheng Wang, Google Research;
(28) David A. Ross, Google Research;
(29) Bryan Seybold, Google Research (equal contribution);
(30) Lu Jiang, Google Research (equal contribution).
Table of Links
Abstract and 1 Introduction
2. Related Work
3. Model Overview and 3.1. Tokenization
3.2. Language Model Backbone and 3.3. Super-Resolution
4. LLM Pretraining for Generation
4.1. Task Prompt Design
4.2. Training Strategy
5. Experiments
5.1. Experimental Setup
5.2. Pretraining Task Analysis
5.3. Comparison with the State-of-the-Art
5.4. LLM’s Diverse Capabilities in Video Generation and 5.5. Limitations
6. Conclusion, Acknowledgements, and References
A. Appendix
3.2. Language Model Backbone
After converting the image, video, and audio modalities into discrete tokens within a shared vocabulary, we can directly leverage a language model to generate video and audio in the token space. We use a prefix language model with a decoder-only architecture as the backbone. By constructing different patterns of input and output tokens during training, we control which tasks the model can perform, as explained in Section 4.
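To make the prefix-LM setup concrete, the sketch below constructs the attention mask typically used in a decoder-only prefix language model: conditioning (prefix) tokens attend bidirectionally among themselves, while target tokens attend causally. This is an illustrative example, not the authors' implementation; the prefix length and token layout are assumptions for the demo.

```python
import numpy as np

def prefix_lm_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """Boolean attention mask for a decoder-only prefix LM.

    mask[i, j] is True if position i may attend to position j.
    Prefix (conditioning) tokens attend bidirectionally within the prefix;
    the remaining target tokens attend causally to all earlier positions.
    """
    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))
    # Allow full bidirectional attention inside the prefix.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Hypothetical example: 4 conditioning tokens (e.g. a task prompt and text)
# followed by 6 target tokens (e.g. video tokens to be generated).
# The training loss would typically be computed only on the target positions.
mask = prefix_lm_mask(prefix_len=4, total_len=10)
print(mask.astype(int))
```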
3.3. Super-Resolution
Generating high-resolution (HR) videos with an autoregressive transformer entails heavy computational costs due to the increase in sequence length. For example, the video tokenizer of Section 3.1 operating on a 17 × 896 × 512 video produces a sequence of 35,840 tokens, making autoregressive sampling highly impractical. To enable efficient and high-quality generative video upsampling, we develop a custom spatial super-resolution (SR) non-autoregressive video transformer (Yu et al., 2023a) that operates in token space on top of the language model output. To mitigate the computational requirements of the very long sequences involved, and in particular the quadratic memory cost of the self-attention layers, our design incorporates windowed local attention (Gupta et al., 2022). Specifically, our SR transformer is composed of blocks of three transformer layers, each of which performs self-attention in a local window aligned with one of three axes (Tu et al., 2022): spatial vertical, spatial horizontal, and temporal. The cross-attention layers attend to the low-resolution (LR) token sequence and are also divided into local windows, isomorphic to those of the self-attention layers. All blocks also include cross-attention to T5 XL text embeddings. See Fig. 3 for a schematic representation of the custom transformer architecture.
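The sketch below illustrates the idea of axis-aligned windowed self-attention on a (T, H, W) token grid: tokens attend only within a local window along a single axis, so each layer scales linearly in the number of windows rather than quadratically in the full sequence. This is a minimal single-head toy, not the paper's implementation; shapes, window size, and the plain-NumPy attention are assumptions made for illustration.

```python
import numpy as np

def axis_window_attention(x: np.ndarray, axis: int, window: int) -> np.ndarray:
    """Toy single-head self-attention restricted to local windows along one axis.

    x has shape (T, H, W, D); `axis` is 0 (temporal), 1 (spatial vertical),
    or 2 (spatial horizontal). Tokens attend only to the `window` tokens that
    share the same coordinates on the other two axes.
    """
    # Move the attended axis next to the channel dim; fold the other two axes
    # and the window index into a batch dimension.
    x = np.moveaxis(x, axis, 2)                       # (A, B, L, D)
    a, b, length, d = x.shape
    assert length % window == 0
    xw = x.reshape(a * b * (length // window), window, d)

    # Standard scaled dot-product attention inside each window.
    scores = xw @ xw.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = weights @ xw

    # Restore the original (T, H, W, D) layout.
    out = out.reshape(a, b, length, d)
    return np.moveaxis(out, 2, axis)

# A block of the kind described above would apply this three times, once per
# axis, interleaved with windowed cross-attention to the LR tokens and
# cross-attention to the text embeddings.
tokens = np.random.randn(8, 16, 16, 32)   # hypothetical (T, H, W, D) latent grid
print(axis_window_attention(tokens, axis=1, window=8).shape)  # (8, 16, 16, 32)
```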
Similar to Yu et al. (2023c), we train the SR transformer using token factorization (with k = 2 factors) to account for the large vocabulary size. The LR token sequences are obtained by tokenizing bicubic-downsampled versions of the ground truth videos and applying noise augmentation (Ho et al., 2022a) in the discrete latent space. Specifically, we randomly resample the values of a random subset of the LR tokens and independently drop the LR condition and text embeddings for 10% of the training samples. During inference, we use non-autoregressive sampling (Chang et al., 2022; Yu et al., 2023a) with classifier-free guidance applied independently to the LR condition and the text embeddings (Brooks et al., 2023). We use a cascade of two 2× stages to generate videos of 896×512 resolution from the 224×128 base output of VideoPoet. More implementation details can be found in the appendix.
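As a minimal sketch of classifier-free guidance over two independent conditions, the snippet below combines logits in the two-scale style of Brooks et al. (2023): one scale strengthens adherence to the LR tokens and the other to the text embeddings. The `model` callable, its signature, and the guidance scales are hypothetical placeholders, not the paper's API.

```python
import numpy as np

def guided_logits(model, tokens, lr_tokens, text_emb,
                  s_lr: float = 1.5, s_text: float = 7.5) -> np.ndarray:
    """Two-condition classifier-free guidance on next-token logits.

    `model(tokens, lr_tokens, text_emb)` is a hypothetical callable returning
    logits; passing None drops a condition, mirroring the 10% condition
    dropout used during training.
    """
    l_uncond = model(tokens, None, None)          # neither condition
    l_lr = model(tokens, lr_tokens, None)         # LR condition only
    l_full = model(tokens, lr_tokens, text_emb)   # both conditions

    # Compose guidance terms for the LR and text conditions independently.
    return (l_uncond
            + s_lr * (l_lr - l_uncond)
            + s_text * (l_full - l_lr))
```

In a non-autoregressive sampler, logits combined this way would be used at each iterative refinement step to choose which masked token positions to commit, before the remaining positions are re-predicted in the next step.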