VideoGPT: Video Generation using VQ-VAE and Transformers

[Paper][Website][Colab][Gradio Demo]

We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models.

Approach

[Figure: VideoGPT approach overview]

Installation

Change the cudatoolkit version to one that is compatible with your machine.

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install git+https://github.com/wilson1yan/VideoGPT.git

Sparse Attention (Optional)

For limited compute scenarios, it may be beneficial to use sparse attention.

$ sudo apt-get install llvm-9-dev
$ DS_BUILD_SPARSE_ATTN=1 pip install deepspeed

After installing deepspeed, you can train a sparse transformer by setting the flag --attn_type sparse in scripts/train_videogpt.py. The default supported sparsity configuration is an N-d strided sparsity layout; however, you can write your own arbitrary layouts to use.

Dataset

The default code accepts data either as an HDF5 file in the format specified in videogpt/data.py, or as a directory with the following structure:

video_dataset/
    train/
        class_0/
            video1.mp4
            video2.mp4
            ...
        class_1/
            video1.mp4
            ...
        ...
        class_n/
            ...
    test/
        class_0/
            video1.mp4
            video2.mp4
            ...
        class_1/
            video1.mp4
            ...
        ...
        class_n/
            ...

An example of such a dataset can be constructed from UCF-101 data by running the script

sh scripts/preprocess/create_ucf_dataset.sh datasets/ucf101

You may need to install unrar and unzip for the code to work correctly.

If you do not care about classes, the class folders are not necessary and the dataset file structure can be collapsed into train and test directories of just videos.
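Before training, it can be useful to check that a dataset folder follows the layout above. The following is a minimal standard-library sketch, not part of the VideoGPT code: the helper names `index_split` and `summarize` are hypothetical, and the assumption that only `.mp4` files count as videos is ours. Videos placed directly under train or test (the collapsed layout without class folders) are grouped under an empty class name.

```python
from pathlib import Path
import tempfile

# Assumption: only .mp4 files are treated as videos; extend this set as needed.
VIDEO_EXTS = {".mp4"}


def index_split(split_dir):
    """Map class name -> sorted list of video paths for one split directory.

    Videos that sit directly in the split directory (no class folders)
    are grouped under the empty class name "".
    """
    split_dir = Path(split_dir)
    index = {}
    for path in sorted(split_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
            rel = path.relative_to(split_dir)
            cls = rel.parts[0] if len(rel.parts) > 1 else ""
            index.setdefault(cls, []).append(path)
    return index


def summarize(root):
    """Return {split: {class: video count}} and fail if a split is missing."""
    root = Path(root)
    summary = {}
    for split in ("train", "test"):
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        videos = index_split(split_dir)
        summary[split] = {cls: len(paths) for cls, paths in videos.items()}
    return summary


# Demo on a throwaway directory that mimics the documented structure.
with tempfile.TemporaryDirectory() as tmp:
    for rel in [
        "train/class_0/video1.mp4",
        "train/class_0/video2.mp4",
        "train/class_1/video1.mp4",
        "test/class_0/video1.mp4",
    ]:
        video = Path(tmp, rel)
        video.parent.mkdir(parents=True, exist_ok=True)
        video.touch()
    demo_summary = summarize(tmp)

print(demo_summary)
```

To check a real dataset, call `summarize('datasets/ucf101')` (or whichever folder you created) and inspect the per-class counts.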

Using Pretrained VQ-VAEs

There are four available pre-trained VQ-VAE models. All strides listed with each model are downsampling amounts across THW for the encoders.

  • bair_stride4x2x2: trained on 16 frame 64 x 64 videos from the BAIR Robot Pushing dataset
  • ucf101_stride4x4x4: trained on 16 frame 128 x 128 videos from UCF-101
  • kinetics_stride4x4x4: trained on 16 frame 128 x 128 videos from Kinetics-600
  • kinetics_stride2x4x4: trained on 16 frame 128 x 128 videos from Kinetics-600, with 2x larger temporal latent codes (achieves slightly better reconstruction)

import torch
from torchvision.io import read_video
from videogpt import load_vqvae
from videogpt.data import preprocess

video_filename = 'path/to/video_file.mp4'
sequence_length = 16
resolution = 128
device = torch.device('cuda')

# Load the pretrained VQ-VAE and move it to the same device as the input video
vqvae = load_vqvae('kinetics_stride2x4x4').to(device)

# read_video returns (video frames, audio frames, metadata); keep only the video
video = read_video(video_filename, pts_unit='sec')[0]
video = preprocess(video, resolution, sequence_length).unsqueeze(0).to(device)

# Encode to discrete latent codes, then decode back to a reconstruction
encodings = vqvae.encode(video)
video_recon = vqvae.decode(encodings)
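To make the stride naming concrete, the sketch below computes the latent grid size implied by each pretrained model name. It is plain arithmetic and not part of the VideoGPT API: `parse_stride` and `latent_shape` are hypothetical helpers, and they assume that each axis of the input is divided exactly by the corresponding T, H, W stride.

```python
import re


def parse_stride(model_name):
    """Extract the (T, H, W) strides from a name such as 'kinetics_stride2x4x4'."""
    match = re.search(r"stride(\d+)x(\d+)x(\d+)$", model_name)
    if match is None:
        raise ValueError(f"no stride suffix found in {model_name!r}")
    return tuple(int(group) for group in match.groups())


def latent_shape(sequence_length, resolution, stride):
    """Latent grid (T, H, W), assuming each axis is divided by its stride."""
    t, h, w = stride
    if sequence_length % t or resolution % h or resolution % w:
        raise ValueError("input size must be divisible by the stride")
    return (sequence_length // t, resolution // h, resolution // w)


# Input sizes are the ones listed for each pretrained model above.
for name, frames, res in [
    ("bair_stride4x2x2", 16, 64),
    ("ucf101_stride4x4x4", 16, 128),
    ("kinetics_stride4x4x4", 16, 128),
    ("kinetics_stride2x4x4", 16, 128),
]:
    print(name, latent_shape(frames, res, parse_stride(name)))
```

Under these assumptions, kinetics_stride2x4x4 yields an 8 x 32 x 32 latent grid versus 4 x 32 x 32 for kinetics_stride4x4x4, which is the "2x larger temporal latent codes" mentioned in the list.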

Training VQ-VAE

Use the scripts/train_vqvae.py script to train a VQ-VAE. Execute python scripts/train_vqvae.py -h for information on all available training settings. A subset of the more relevant settings is listed below, along with their default values.

VQ-VAE Specific Settings

  • --embedding_dim: number of dimensions for codebook embeddings
  • --n_codes 2048: number of codes in the codebook
  • --n_hiddens 240: number of hidden features in the residual blocks
  • --n_res_layers 4: number of residual blocks
  • --downsample 4 4 4: T H W downsampling stride of the encoder

Training Settings

  • --gpus 2: number of gpus for distributed training
  • --sync_batchnorm: uses SyncBatchNorm instead of BatchNorm3d when using > 1 gpu
  • --gradient_clip_val 1: gradient clipping threshold for training
  • --batch_size 16: batch size per gpu
  • --num_workers 8: number of workers for each DataLoader

Dataset Settings

  • --data_path: path to an HDF5 file or a folder containing train and test folders with subdirectories of videos
  • --resolution 128: spatial resolution to train on
  • --sequence_length 16: temporal resolution, or video clip length
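As an illustration of how the settings above combine on the command line, here is a small sketch that assembles an invocation of scripts/train_vqvae.py from a dictionary of values. The `build_command` helper is hypothetical, and it assumes that flags listed without a value (such as --sync_batchnorm) are boolean switches and that --downsample takes three space-separated integers, as the list suggests. Run the script with -h to confirm the actual argument handling.

```python
import shlex


def build_command(script, data_path, **settings):
    """Build an argv list; True becomes a bare switch, tuples expand to several values."""
    argv = ["python", script, "--data_path", str(data_path)]
    for name, value in settings.items():
        flag = f"--{name}"
        if value is True:
            argv.append(flag)  # assumed boolean switch
        elif value is False or value is None:
            continue  # leave the script's own default in place
        elif isinstance(value, (list, tuple)):
            argv.append(flag)
            argv.extend(str(item) for item in value)
        else:
            argv.extend([flag, str(value)])
    return argv


cmd = build_command(
    "scripts/train_vqvae.py",
    "datasets/ucf101",
    n_codes=2048,
    downsample=(4, 4, 4),
    sync_batchnorm=True,
    gpus=2,
)
print(shlex.join(cmd))
# To launch training: import subprocess; subprocess.run(cmd, check=True)
```

Building the argument list explicitly, instead of formatting one long string, avoids quoting problems when the dataset path contains spaces.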

Training VideoGPT

You can download a pretrained VQ-VAE, or train your own. Afterwards, use the scripts/train_videogpt.py script to train a VideoGPT model for sampling. Execute python scripts/train_videogpt.py -h for information on all available training settings. A subset of the more relevant settings is listed below, along with their default values.

VideoGPT Specific Settings

  • --vqvae kinetics_stride4x4x4: path to a vqvae checkpoint file, OR a pretrained model name to download. Available pretrained models are: bair_stride4x2x2, ucf101_stride4x4x4, kinetics_stride4x4x4, kinetics_stride2x4x4. BAIR was trained on 64 x 64 videos, and the rest on 128 x 128 videos
  • --n_cond_frames 0: number of frames to condition on. 0 represents a non-frame conditioned model
  • --class_cond: trains a class conditional model if activated
  • --hidden_dim 576: number of transformer hidden features
  • --heads 4: number of heads for multihead attention
  • --layers 8: number of transformer layers
  • --dropout 0.2: dropout probability applied to features after attention and positionwise feedforward layers
  • --attn_type full: full or sparse attention. Refer to the Installation section for how to install sparse attention
  • --attn_dropout 0.3: dropout probability applied to the attention weight matrix
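To get a rough feel for what these defaults imply, the sketch below estimates the transformer sequence length and the size of a full attention matrix. It is back-of-the-envelope arithmetic, not code from the repository: the `attention_budget` helper is hypothetical, and it assumes one token per latent code, a 4 x 32 x 32 latent grid (a stride-4x4x4 VQ-VAE on 16 frame 128 x 128 clips), no conditioning frames, and a hidden size that splits evenly across heads.

```python
import math


def attention_budget(latent_shape, hidden_dim, heads):
    """Estimate token count, per-head width, and full attention matrix size."""
    if hidden_dim % heads != 0:
        raise ValueError("hidden_dim is assumed to be divisible by heads")
    n_tokens = math.prod(latent_shape)  # assumption: one token per latent code
    return {
        "tokens": n_tokens,
        "head_dim": hidden_dim // heads,
        "full_attention_entries_per_head": n_tokens ** 2,
    }


budget = attention_budget((4, 32, 32), hidden_dim=576, heads=4)
print(budget)
```

With these assumptions the sequence has 4096 tokens, so a full attention matrix holds 4096 x 4096 = 16,777,216 entries per head and layer. That quadratic growth is the reason sparse attention can help in limited compute scenarios.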

Training Settings

  • --gpus 2: number of gpus for distributed training
  • --sync_batchnorm: uses SyncBatchNorm instead of BatchNorm3d when using > 1 gpu
  • --gradient_clip_val 1: gradient clipping threshold for training
  • --batch_size 16: batch size per gpu
  • --num_workers 8: number of workers for each DataLoader

Dataset Settings

  • --data_path: path to an HDF5 file or a folder containing train and test folders with subdirectories of videos
  • --resolution 128: spatial resolution to train on
  • --sequence_length 16: temporal resolution, or video clip length

Sampling VideoGPT

After training, the VideoGPT model can be sampled using scripts/sample_videogpt.py. You may need to install ffmpeg first:

$ sudo apt-get install ffmpeg
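If you are unsure whether ffmpeg is already available, a quick check from Python can save a failed sampling run. This is a generic standard-library sketch; the `missing_tools` helper is hypothetical and only tests whether an executable with the given name is on the PATH.

```python
import shutil


def missing_tools(names):
    """Return the subset of command names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools(["ffmpeg"])
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("ffmpeg found on PATH")
```

The same helper can be reused for the unrar and unzip tools needed by the UCF-101 preprocessing script.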

Reproducing Paper Results

Note that this repo is primarily designed for simplicity and for building on our method. The full paper results can be reproduced using code found in a separate repo. However, be aware that that code is not as clean.

Citation

Please consider using the following citation when using our code:

@misc{yan2021videogpt,
      title={VideoGPT: Video Generation using VQ-VAE and Transformers}, 
      author={Wilson Yan and Yunzhi Zhang and Pieter Abbeel and Aravind Srinivas},
      year={2021},
      eprint={2104.10157},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}