Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Overview

ClipBERT

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

Jie Lei*, Linjie Li*, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipBERT is designed based on 2D CNNs and transformers, and uses a sparse sampling strategy to enable efficient end-to-end video-and-language learning. In this repository, we support end-to-end pretraining and finetuning for the following tasks:

  • Image-text pretraining on COCO and VG captions.
  • Text-to-video retrieval finetuning on MSRVTT, DiDeMo, and ActivityNet Captions.
  • Video-QA finetuning on TGIF-QA and MSRVTT-QA.
  • Image-QA finetuning on VQA 2.0.

It is also feasible and easy to add other image-text or video-text tasks for pretraining and finetuning.

Requirements

We provide a Docker image for easier reproduction. Please install the following:

Our scripts require the user to have the docker group membership so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training hence GPUs with Tensor Cores are recommended.

Getting Started

General

  1. Create a folder that stores pretrained models, all the data, and results.

    PATH_TO_STORAGE=/path/to/your/data/
    mkdir -p $PATH_TO_STORAGE/txt_db  # annotations
    mkdir -p $PATH_TO_STORAGE/vis_db  # image and video 
    mkdir -p $PATH_TO_STORAGE/finetune  # finetuning results
    mkdir -p $PATH_TO_STORAGE/pretrained  # pretrained models
  2. Download pretrained models.

    Our e2e pretrained ClipBERT model (849MB), can be downloaded with the following command.

    bash scripts/download_pretrained.sh $PATH_TO_STORAGE

    This pretrained model can be used for finetuning on video-text tasks and image-text tasks. For your convenience, this script will also download bert-base-uncased and grid-feat-vqa model weights, which are used as initialization for pretraining.

  3. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

    The launch script respects $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /clipbert instead of built into the image so that user modification will be reflected without re-building the image. (Data folders are mounted into the container separately for flexibility on folder structures.)

Downstream Task Finetuning

Text-to-Video Retrieval

Tasks: MSRVTT retrieval, DiDeMo and ActivityNet Captions paragprah-to-video retrieval, MSRVTT MC Test.

  1. Download data.

    # outside the container  
    # download videos + annotations for $DSET
    bash scripts/download_$DSET.sh $PATH_TO_STORAGE

    $DSET can be one of msrvtt, didemo, anet.

  2. Finetuning.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR
    
    # for single GPU
    python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR

    $CONFIG_PATH should be set to one of the .json config files available at src/configs prefixed with _ret. For example, you can use src/configs/msrvtt_ret_base_resnet50.json for MSRVTT retrieval.

  3. Run inference.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

    $STEP is an integer, it tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are path to annotation file and video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_retrieval_val.jsonl and IMG_DB=/img/msrvtt for inference on MSRVTT retrieval val split. The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS for inference, such as 1 or 16. Using more clips will have a large impact on inference speed and memory usage. You may want to use smaller batch sizes if larger values are set.

    After MSRVTT retrieval model is trained, you can use it for inference for the MSRVTT MC Test task as well, which is essentially a retrieval task in a multiple-choice task setup.

    # inside the container
    horovodrun -np 4 python src/tasks/run_msrvtt_mc.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db /txt/downstream/msrvtt_retrieval_mc/msrvtt_retrieval_mc_test.jsonl \
      --inference_img_db /img/msrvtt --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

Video Question Answering

Tasks: TGIF-QA action, transition, and frameQA tasks; MSRVTT-QA.

  1. Download data.

    # outside the container  
    # download MSRVTT videos, and QA + retrieval annotations
    bash scripts/download_msrvtt.sh $PATH_TO_STORAGE  
    # download TGIF-QA videos and annotations
    bash scripts/download_tgif_qa.sh $PATH_TO_STORAGE  
  2. Finetuning.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_qa.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR

    $CONFIG_PATH should be set to one of the .json config files available at src/configs contains the substring _qa. For example, you can use src/configs/msrvtt_qa_base_resnet50.json for MSRVTT-QA.

  3. Run inference.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_qa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

    $STEP is an integer, which tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are path to annotation file and video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_qa_val.jsonl and IMG_DB=/img/msrvtt for inference on MSRVTT QA val split.

    The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS for inference, such as 1 or 16. Using more clips will have a large impact on inference speed and memory usage. You may want to use smaller batch sizes if larger values are set.

Image Question Answering (VQA)

  1. Download data

    # outside the container
    # download COCO and VG data
    bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
    # download VQA annotations
    bash scripts/download_vqa.sh $PATH_TO_STORAGE
  2. Finetuning

    # inside the container
    horovodrun -np 4 python src/tasks/run_vqa.py \
        --config src/configs/vqa_base_resnet50.json \
        --output_dir $OUTPUT_DIR
  3. Inference

    # inside the container
    horovodrun -np 4 python src/tasks/run_vqa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB \
      --inference_batch_size 64

Pretraining

  1. Download data

    # outside the container
    bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
  2. Pretraining

    #inside the container
    horovodrun -np 8 python src/pretrain/run_pretrain.py \
        --config src/configs/pretrain_indomain_base_resnet50_mlm_itm.json \
        --output_dir $OUTPUT_DIR 

Data Preprocessing

ClipBERT takes raw video and text as inputs, there is no need to do feature extraction. However, to improve data loading speed, we use LMDB to store the raw image and video files. You can use the following script to convert a list of videos with file extensions mp4 and avi into LMDB:

# outside the container
python src/preprocessing/file2lmdb.py \
    --data_root /path/to/videos \
    --lmdb_save_dir /path/to/save/lmdb \
    --ext avi mp4 \
    --file_type video 

For images, use appropriate file extensions for --ext and --file_type image. Text annotation files are reorganized into jsonl files, see example preprocessed files downloaded by the scripts in scripts/.

Citation

If you find this code useful for your research, please consider citing:

@article{lei2021less,
  title={Less is More: ClipBERT for Video-and-Language Learningvia Sparse Sampling},
  author={Lei, Jie and Li, Linjie and Zhou, Luowei and Gan, Zhe and Berg, Tamara L. and Bansal, Mohit and Liu, Jingjing},
  journal={arXiv},
  year={2021}
}

Acknowledgement

We thank Yen-Chun Chen and Ruotian Luo for suggestions on the implementation. We also thank other members and interns at Microsoft Multimodal AI for their helpful discussions.

This code used resources from transformers, UNITER, HERO, grid-feats-vqa, SlowFast, Detectron2. The code is implemented using PyTorch, with multi-GPU support from Horovod and mixed precision support from apex. We thank the authors for open-sourcing their awesome projects.

License

MIT

Owner
Jie Lei 雷杰
UNC CS PhD student, vision+language.
Jie Lei 雷杰
A Chinese to English Neural Model Translation Project

ZH-EN NMT Chinese to English Neural Machine Translation This project is inspired by Stanford's CS224N NMT Project Dataset used in this project: News C

Zhenbang Feng 29 Nov 26, 2022
Twewy-discord-chatbot - Build a Discord AI Chatbot that Speaks like Your Favorite Character

Build a Discord AI Chatbot that Speaks like Your Favorite Character! This is a Discord AI Chatbot that uses the Microsoft DialoGPT conversational mode

Lynn Zheng 231 Dec 30, 2022
Library for Russian imprecise rhymes generation

TOM RHYMER Library for Russian imprecise rhymes generation. Quick Start Generate rhymes by any given rhyme scheme (aabb, abab, aaccbb, etc ...): from

Alexey Karnachev 6 Oct 18, 2022
Searching keywords in PDF file folders

keyword_searching Steps to use this Python scripts: (1)Paste this script into the file folder containing the PDF files you need to search from; (2)Thi

1 Nov 08, 2021
Chinese Grammatical Error Diagnosis

nlp-CGED Chinese Grammatical Error Diagnosis 中文语法纠错研究 基于序列标注的方法 所需环境 Python==3.6 tensorflow==1.14.0 keras==2.3.1 bert4keras==0.10.6 笔者使用了开源的bert4keras

12 Nov 25, 2022
Japanese NLP Library

Japanese NLP Library Back to Home Contents 1 Requirements 1.1 Links 1.2 Install 1.3 History 2 Libraries and Modules 2.1 Tokenize jTokenize.py 2.2 Cabo

Pulkit Kathuria 144 Dec 27, 2022
SpikeX - SpaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.

Erre Quadro Srl 384 Dec 12, 2022
The proliferation of disinformation across social media has led the application of deep learning techniques to detect fake news.

Fake News Detection Overview The proliferation of disinformation across social media has led the application of deep learning techniques to detect fak

Kushal Shingote 1 Feb 08, 2022
GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning GrammarTagger is an open-source toolkit for grammatical profiling for lan

Octanove Labs 27 Jan 05, 2023
Python module (C extension and plain python) implementing Aho-Corasick algorithm

pyahocorasick pyahocorasick is a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find mult

Wojciech Muła 763 Dec 27, 2022
A PyTorch-based model pruning toolkit for pre-trained language models

English | 中文说明 TextPruner是一个为预训练语言模型设计的模型裁剪工具包,通过轻量、快速的裁剪方法对模型进行结构化剪枝,从而实现压缩模型体积、提升模型速度。 其他相关资源: 知识蒸馏工具TextBrewer:https://github.com/airaria/TextBrewe

Ziqing Yang 231 Jan 08, 2023
Higher quality textures for the Metal Gear Solid series.

Metal Gear Solid: HD Textures Higher quality textures for the Metal Gear Solid series. The goal is to maximize the quality of assets that the engine w

Samantha 6 Dec 06, 2022
NLP: SLU tagging

NLP: SLU tagging

北海若 3 Jan 14, 2022
Official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

This repository is the official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

vanint 101 Dec 30, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
⚖️ A Statutory Article Retrieval Dataset in French.

A Statutory Article Retrieval Dataset in French This repository contains the Belgian Statutory Article Retrieval Dataset (BSARD), as well as the code

Maastricht Law & Tech Lab 19 Nov 17, 2022
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

13.2k Jul 07, 2021
Word Bot for JKLM Bomb Party

Word Bot for JKLM Bomb Party A bot for Bomb Party on https://www.jklm.fun (Only English) Requirements pynput pyperclip pyautogui Usage: Step 1: Run th

Nicolas 7 Oct 30, 2022
A Python/Pytorch app for easily synthesising human voices

Voice Cloning App A Python/Pytorch app for easily synthesising human voices Documentation Discord Server Video guide Voice Sharing Hub FAQ's System Re

Ben Andrew 840 Jan 04, 2023
iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT Official PyTorch implementation and pretrained models for paper iBOT: Image BERT Pre-Training with Online Tokenizer.

Bytedance Inc. 435 Jan 06, 2023