A PyTorch implementation of VIOLET

Last update: Dec 30, 2022

Overview

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

A PyTorch implementation of VIOLET

Overview

VIOLET is an implementation of
"VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling"
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu

VIOLET contains 3 components: Video Swin Transformer (VT) computes video features; Language Embedder (LE) extracts word embeddings; Cross-modal Transformer (CT) performs cross-modal fusion. To benefit from large-scale data, we incorporate 3 pretraining tasks: Masked Language Modeling (MVM) predicts the masked word tokens; Masked Visual-token Modeling (MVM) recovers the masked video patches; Visual-Text Matching (VTM) learns the alignments between video and text modality.

Requirements

This code is implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Usage

Data preprocessing

As using outer datasets (cannot be shared by us), we provide preprocessing tools to extract sparse-sampled video frames into our compressed format.

cd _tools

# We use 4 frames during pretraining and 5 frames for downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use DALL-E to extract VQ tokens for MVM pretraining
wget https://cdn.openai.com/dall-e/encoder.pkl # download trained dall-e encoder
python extract_vq.py --path=msrvtt --frame=224 # output: msrvtt_vq.pkl

# We adopt file.seek() instead of loading entire data to reduce the memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx

There are parital examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to help formulate the input data.

Pretraining

Put pretrained VT in ./_snapshot. This script pretrains on both video (WebVid2.5M) and image (CC3M) data via single-node multi-gpu distributed training.

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py

Here is our best pretrained checkpoint (YT180M+WebVid2.5M+CC3M).

Downstream

Multiple-Choice Question Answering (TGIF-Action, TGIF-Transition, MSRVTT-MC, and LSMDC-MC)

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qamc.py _data/args_tgif-action.json

Open-Ended Question Answering (TGIF-Frame, MSRVTT-QA, LSMDC-FiB, and MSVD-QA)

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qaoe.py _data/args_msvd-qa.json

Text-to-Video Retrieval (MSRVTT, DiDeMo, YouCook2, and LSMDC)

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_retrieval.py _data/args_msrvtt-retrieval.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval.py _data/args_msrvtt-retrieval.json

We also provide all trained downstream checkpoints.

Citation

@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu}, 
  title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}, 
  booktitle = {arXiv:2111.1268}, 
  year = {2021} 
}

A PyTorch implementation of VIOLET

Related tags

Overview

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

Overview

Requirements

Usage

Data preprocessing

Pretraining

Downstream

Citation

Owner

Tsu-Jui Fu

Open-World Entity Segmentation

profile tools for pytorch nn models

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

The aim of this task is to predict someone's English proficiency based on a text input.

Applied Natural Language Processing in the Enterprise - An O'Reilly Media Publication

Guide to using pre-trained large language models of source code

本插件是pcrjjc插件的重置版，可以独立于后端api运行

A collection of models for image - text generation in ACM MM 2021.

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

基于百度的语音识别，用python实现，pyaudio+pyqt

pytorch implementation of Attention is all you need

To be a next-generation DL-based phenotype prediction from genome mutations.

State-of-the-art NLP through transformer models in a modular design and consistent APIs.

NLP Core Library and Model Zoo based on PaddlePaddle 2.0

Residual2Vec: Debiasing graph embedding using random graphs

Code release for "COTR: Correspondence Transformer for Matching Across Images"

A cross platform OCR Library based on PaddleOCR & OnnxRuntime

🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.

A library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.

Score-Based Point Cloud Denoising (ICCV'21)