Using VideoBERT to tackle video prediction

Overview

VideoBERT

This repo reproduces the results of VideoBERT (https://arxiv.org/pdf/1904.01766.pdf). Inspiration was taken from https://github.com/MDSKUL/MasterProject, but this repo tackles video prediction rather than captioning and masked language modeling. On a side note, since this model is extremely small, the results displayed here are very basic. Feel free to increase the model size to match your computational resources, and to modify the inference file to add temperature sampling if needed (as of now temperature is not implemented; a sketch is given in Step 6). Here are all the steps taken:

Step 1: Download 47k videos from the HowTo100M dataset

Using the HowTo100M dataset (https://www.di.ens.fr/willow/research/howto100m/), select the cooking videos and download them for feature extraction. The dataset is also used to extract an image for each feature vector. The ids of the videos used are listed in the ids.txt file.
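
A hypothetical download loop might look like the following (yt-dlp here is just an example downloader, not part of this repo):

import subprocess

# Read the video ids from ids.txt and fetch each video with the yt-dlp
# CLI; any downloader that accepts YouTube ids works the same way.
with open("ids.txt") as f:
    for video_id in (line.strip() for line in f if line.strip()):
        subprocess.run(
            ["yt-dlp", "-o", f"videos/{video_id}.%(ext)s",
             f"https://www.youtube.com/watch?v={video_id}"],
            check=False,  # skip over unavailable videos instead of stopping
        )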

Step 2: Do feature extraction with the I3D model

The I3D model is used to extract a feature vector for every 1.5 seconds of video, while also saving the median frame of each 1.5-second clip. I3D model used: https://tfhub.dev/deepmind/i3d-kinetics-600/1. Note that CUDA should be used to decrease the runtime. Here is the usage for the extraction script (a rough sketch of the extraction loop follows the help text):

$ python3 VideoBERT/VideoBERT/I3D/batch_extract.py -h
usage: batch_extract.py [-h] -f FILE_LIST_PATH -r ROOT_VIDEO_PATH -s FEATURES_SAVE_PATH -i IMGS_SAVE_PATH

optional arguments:
  -h, --help            show this help message and exit
  -f FILE_LIST_PATH, --file-list-path FILE_LIST_PATH
                        path to file containing video file names
  -r ROOT_VIDEO_PATH, --root-video-path ROOT_VIDEO_PATH
                        root directory containing video files
  -s FEATURES_SAVE_PATH, --features-save-path FEATURES_SAVE_PATH
                        directory in which to save features
  -i IMGS_SAVE_PATH, --imgs-save-path IMGS_SAVE_PATH
                        directory in which to save images

Step 3: Hierarchical Minibatch K-means

To find the centroids for the feature vectors, minibatch k-means is applied hierarchically to save time and memory. After this, the nearest feature vector to each centroid is found, and the corresponding image is chosen to represent that centroid. To use hierarchical minibatch k-means independently for another project, consider the Python package hkmeans-minibatch, which is also used in this VideoBERT project (https://github.com/ammesatyajit/hierarchical-minibatch-kmeans).
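
The idea itself is easy to sketch with scikit-learn's MiniBatchKMeans (purely illustrative; the repo uses hkmeans-minibatch). With h levels of k clusters each, you end up with up to k**h centroids:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def hierarchical_minibatch_kmeans(features, k=12, levels=2, batch_size=1024):
    # Recursively split the data: cluster into k groups, then cluster
    # each group again, for `levels` rounds.
    groups = [features]
    for _ in range(levels):
        next_groups = []
        for g in groups:
            if len(g) <= k:          # too small to split further
                next_groups.append(g)
                continue
            km = MiniBatchKMeans(n_clusters=k, batch_size=batch_size).fit(g)
            next_groups.extend(g[km.labels_ == c] for c in range(k))
        groups = next_groups
    # One centroid (the mean) per non-empty leaf group.
    return np.stack([g.mean(axis=0) for g in groups if len(g)])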

Here is the usage for the kmeans code:

$ python3 VideoBERT/VideoBERT/I3D/minibatch_hkmeans.py -h 
usage: minibatch_hkmeans.py [-h] -r ROOT_FEATURE_PATH -p FEATURES_PREFIX [-b BATCH_SIZE] -s SAVE_DIR -c CENTROID_DIR

optional arguments:
  -h, --help            show this help message and exit
  -r ROOT_FEATURE_PATH, --root-feature_path ROOT_FEATURE_PATH
                        path to folder containing all the video folders with the features
  -p FEATURES_PREFIX, --features-prefix FEATURES_PREFIX
                        prefix that is common between the desired files to read
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        batch_size to use for the minibatch kmeans
  -s SAVE_DIR, --save-dir SAVE_DIR
                        save directory for hierarchical kmeans vectors
  -c CENTROID_DIR, --centroid-dir CENTROID_DIR
                        directory to save the centroids in

Note that after this step the centroids will need to be concatenated for ease of use.
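
For example (the file layout here is an assumption):

import glob
import numpy as np

# Stack the per-branch centroid files produced above into one array.
files = sorted(glob.glob("centroid_dir/*.npy"))
centroids = np.concatenate([np.load(f) for f in files], axis=0)
np.save("centroids.npy", centroids)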

After doing kmeans, the image representing each centroid needs to be found so the predicted video can be displayed during inference. Conceptually this is a nearest-neighbour lookup (sketched after the help text).

$ python3 VideoBERT/VideoBERT/data/centroid_to_img.py -h 
usage: centroid_to_img.py [-h] -f ROOT_FEATURES -i ROOT_IMGS -c CENTROID_FILE -s SAVE_FILE

optional arguments:
  -h, --help            show this help message and exit
  -f ROOT_FEATURES, --root-features ROOT_FEATURES
                        path to folder containing all the video folders with the features
  -i ROOT_IMGS, --root-imgs ROOT_IMGS
                        path to folder containing all the video folders with the images corresponding to the features
  -c CENTROID_FILE, --centroid-file CENTROID_FILE
                        the .npy file containing all the centroids
  -s SAVE_FILE, --save-file SAVE_FILE
                        json file to save the centroid to image dictionary in
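
A minimal sketch of that lookup (file names are placeholders; for large arrays you would chunk the distance computation):

import json
import numpy as np

centroids = np.load("centroids.npy")              # (n_centroids, d)
features = np.load("all_features.npy")            # (n_vectors, d)
img_paths = json.load(open("img_paths.json"))     # one image path per feature vector

# For every centroid, find the closest feature vector and keep the
# image that was saved alongside that vector.
nearest = np.linalg.norm(centroids[:, None] - features[None], axis=-1).argmin(axis=1)
mapping = {str(i): img_paths[j] for i, j in enumerate(nearest)}

with open("centroid_to_img.json", "w") as f:
    json.dump(mapping, f)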

Step 4: Label and group data

Using the centroids, the videos are tokenized and the text captions are punctuated. Using the timestamps of each caption, the corresponding video token ids are extracted and paired with the text captions in the training data file. The captions can be found here: https://www.rocq.inria.fr/amiech/howto100m/.

The Python file below tokenizes the videos (a rough sketch of the assignment follows the help text):

$ python3 VideoBERT/VideoBERT/data/label_data.py -h     
usage: label_data.py [-h] -f ROOT_FEATURES -c CENTROID_FILE -s SAVE_FILE

optional arguments:
  -h, --help            show this help message and exit
  -f ROOT_FEATURES, --root-features ROOT_FEATURES
                        path to folder containing all the video folders with the features
  -c CENTROID_FILE, --centroid-file CENTROID_FILE
                        the .npy file containing all the centroids
  -s SAVE_FILE, --save-file SAVE_FILE
                        json file to save the labelled data to
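
Tokenization itself amounts to a nearest-centroid assignment per 1.5-second feature vector, roughly (placeholder file names):

import numpy as np

centroids = np.load("centroids.npy")       # (n_centroids, d)
video_feats = np.load("video_feats.npy")   # (n_clips, d) for one video

# Each clip becomes the id of its nearest centroid.
tokens = np.linalg.norm(video_feats[:, None] - centroids[None], axis=-1).argmin(axis=1)
print(tokens.tolist())                     # a sequence of cluster ids for the video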

After that, the following file can be run to punctuate the text and group it with the corresponding video tokens. It uses the Punctuator module, which requires a .pcl model file to punctuate the data; a short usage sketch follows the help text.

$ python3 VideoBERT/VideoBERT/data/punctuate_text.py -h 
usage: punctuate_text.py [-h] -c CAPTIONS_PATH -p PUNCTUATOR_MODEL -l LABELLED_DATA -f ROOT_FEATURES -s SAVE_PATH

optional arguments:
  -h, --help            show this help message and exit
  -c CAPTIONS_PATH, --captions-path CAPTIONS_PATH
                        path to filtered captions
  -p PUNCTUATOR_MODEL, --punctuator-model PUNCTUATOR_MODEL
                        path to punctuator .pcl model
  -l LABELLED_DATA, --labelled-data LABELLED_DATA
                        path to labelled data json file
  -f ROOT_FEATURES, --root-features ROOT_FEATURES
                        directory with all the video features
  -s SAVE_PATH, --save-path SAVE_PATH
                        json file to save training data to
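
If you are unfamiliar with the Punctuator package, its use is roughly as follows (the model file name is a placeholder, and the exact output depends on the model):

from punctuator import Punctuator

p = Punctuator("model.pcl")   # the .pcl model file mentioned above
# Inserts punctuation into raw ASR-style text, e.g. commas and periods.
print(p.punctuate("add the eggs to the bowl then mix until smooth"))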

If desired, an evaluation data file can be created by splitting the training data file.
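
For instance (assuming the file holds a list of examples; adjust to its actual structure):

import json
import random

with open("train_data.json") as f:
    data = json.load(f)

random.seed(0)
random.shuffle(data)
cut = int(0.9 * len(data))                         # 90/10 split

json.dump(data[:cut], open("train_split.json", "w"))
json.dump(data[cut:], open("eval_split.json", "w"))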

Step 5: Training

The training data from before is used to train a next-token-prediction transformer. The saved model and tokenizer are used for inference in the next step. Here is the usage of the train.py file (an example invocation follows the help text):

$ python3 VideoBERT/VideoBERT/train/train.py -h
usage: train.py [-h] --output_dir OUTPUT_DIR [--should_continue] [--model_name_or_path MODEL_NAME_OR_PATH] [--train_data_path TRAIN_DATA_PATH] [--eval_data_path EVAL_DATA_PATH] [--config_name CONFIG_NAME] [--block_size BLOCK_SIZE]
                [--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS] [--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY] [--adam_epsilon ADAM_EPSILON]
                [--max_grad_norm MAX_GRAD_NORM] [--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS] [--log_dir LOG_DIR] [--warmup_steps WARMUP_STEPS] [--local_rank LOCAL_RANK] [--logging_steps LOGGING_STEPS]
                [--save_steps SAVE_STEPS] [--save_total_limit SAVE_TOTAL_LIMIT] [--overwrite_output_dir] [--overwrite_cache] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --output_dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written.
  --should_continue     Whether to continue from latest checkpoint in output_dir
  --model_name_or_path MODEL_NAME_OR_PATH
                        The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.
  --train_data_path TRAIN_DATA_PATH
                        The json file for training the model
  --eval_data_path EVAL_DATA_PATH
                        The json file for evaluating the model
  --config_name CONFIG_NAME
                        Optional pretrained config name or path if not the same as model_name_or_path. If both are None, initialize a new config.
  --block_size BLOCK_SIZE
                        Optional input sequence length after tokenization. The training dataset will be truncated into blocks of this size for training. Defaults
                        to the model max input length for single sentence inputs (taking special tokens into account).
  --per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE
                        Batch size per GPU/CPU for training.
  --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                        Number of updates steps to accumulate before performing a backward/update pass.
  --learning_rate LEARNING_RATE
                        The initial learning rate for Adam.
  --weight_decay WEIGHT_DECAY
                        Weight decay if we apply some.
  --adam_epsilon ADAM_EPSILON
                        Epsilon for Adam optimizer.
  --max_grad_norm MAX_GRAD_NORM
                        Max gradient norm.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Total number of training epochs to perform.
  --max_steps MAX_STEPS
                        If > 0: set total number of training steps to perform. Override num_train_epochs.
  --log_dir LOG_DIR     Directory to store the logs.
  --warmup_steps WARMUP_STEPS
                        Linear warmup over warmup_steps.
  --local_rank LOCAL_RANK
                        For distributed training: local_rank
  --logging_steps LOGGING_STEPS
                        Log every X updates steps.
  --save_steps SAVE_STEPS
                        Save checkpoint every X updates steps.
  --save_total_limit SAVE_TOTAL_LIMIT
                        Limit the total amount of checkpoints, delete the older checkpoints in the output_dir, does not delete by default
  --overwrite_output_dir
                        Overwrite the content of the output directory
  --overwrite_cache     Overwrite the cached training and evaluation sets
  --seed SEED           random seed for initialization
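
A minimal example run (paths and epoch count are placeholders; all flags are taken from the help text above):

$ python3 VideoBERT/VideoBERT/train/train.py \
    --output_dir out \
    --train_data_path train_data.json \
    --eval_data_path eval_data.json \
    --num_train_epochs 3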

Step 6: Inference

The model is used to predict video sequences, and the results can be inspected visually. Note that since the model uses vector-quantized images as tokens, it only understands the actions and the approximate background of a scene, not the exact person or dish. Here are some samples:

[Sample outputs out1 through out5: see the images in the repository.]

Here is the usage for the inference file; an example run and a temperature-sampling sketch follow the help text. Feel free to modify it to suit any specific needs:

$ python3 VideoBERT/VideoBERT/evaluation/inference.py -h 
usage: inference.py [-h] [--model_name_or_path MODEL_NAME_OR_PATH] --output_dir OUTPUT_DIR [--example_id EXAMPLE_ID] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --model_name_or_path MODEL_NAME_OR_PATH
                        The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.
  --output_dir OUTPUT_DIR
                        The output directory where the checkpoint is.
  --example_id EXAMPLE_ID
                        The index of the eval set for evaluating the model
  --seed SEED           random seed for initialization
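
A minimal example run (placeholder paths; flags from the help above):

$ python3 VideoBERT/VideoBERT/evaluation/inference.py \
    --model_name_or_path out \
    --output_dir out \
    --example_id 0

As noted in the overview, temperature is not implemented. If you want it, here is a hedged sketch of standard temperature sampling (assuming PyTorch logits; this is not the repo's code):

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    # Divide the logits by the temperature before the softmax: values < 1
    # sharpen the distribution, values > 1 flatten it.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()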