Align and Prompt: Video-and-Language Pre-training with Entity Prompts

ALPRO

Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C.H. Hoi

Official PyTorch code for ALPRO. This repository supports pre-training as well as finetuning on the following downstream tasks:

  • Text-Video Retrieval on MSRVTT and DiDeMo.
  • Video Question Answering on MSRVTT and MSVD.

Requirements

Our implementation is tested on Ubuntu 20.04.1 with NVIDIA A100 GPUs. Other platforms and hardware may work, but are not guaranteed. To install the required packages:

cd env && bash install_pkg.sh

Data Preparation

  1. Download Annotations and Pre-trained Checkpoints

  2. Download raw videos of downstream datasets.

    • MSRVTT:
      • download train_val_videos.zip and test_videos.zip from e.g. here.

      • check md5sum:

        51f2394d279cf84f1642defd9a651e6f  train_val_videos.zip
        0af68454cec9d586e92805739f3911d0  test_videos.zip
      • unzip all the videos into data/msrvtt_ret/videos (10k in total).

      • create the following soft link:

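        # Run from the repository root. The link target is resolved relative to data/msrvtt_qa/, hence the ../ prefix.
        ln -s ../msrvtt_ret/videos data/msrvtt_qa/videos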
    • MSVD:
      • download from official release:

        wget -nc https://www.cs.utexas.edu/users/ml/clamp/videoDescription/YouTubeClips.tar
      • check md5sum:

        9bdb20fcf14d59524a6febca9f6a8d89  YouTubeClips.tar
      • extract all the videos to data/msvd_qa/videos (1,970 videos in total).

        mkdir -p data/msvd_qa/videos/
        tar xvf YouTubeClips.tar -C data/msvd_qa/videos --strip-components=1
    • DiDeMo:
      • Follow the instructions and download from the official release here;
      • unzip all the videos into data/didemo_ret/videos.
      • Note that a few videos might be missing. See here to download them. As they account for only a small portion of the training set, they can safely be ignored.
      • Convert all the DiDeMo videos into *.mp4 format using e.g. ffmpeg.
      • We obtained 10,463 videos following these steps (with one video [email protected]_5753455690_1e04ccb364 missing).
  3. The directory is expected to be in the structure below:

    .
    |-config_release  # configuration files
    |-data  # text annotations and raw videos
    |---didemo_ret
    |-----txt
    |-----videos
    |---msrvtt_qa/...
    |---msrvtt_ret/...
    |---msvd_qa/...
    |-env  # scripts to install packages
    |-ext  # external resources, e.g. bert tokenizer
    |-output  # checkpoints for pre-trained/finetuned models
    |---downstreams
    |-----didemo_ret
    |-------public
    |---------ckpt # official finetuned checkpoints
    |---------log # inference log
    |---------results_test
    |-----------step_best_1_mean
    |-----msrvtt_qa/...
    |-----msrvtt_ret/...
    |-----msvd_qa/...
    |-run_scripts  # bash scripts to launch experiments
    |-src  # source code
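
Before moving on, it can help to confirm that the downloads and folders match the steps above. The following is a minimal sketch, not part of the official scripts: the checksums and folder names are copied from this section, while the helper names (`md5_of`, `check_downloads`, `missing_video_dirs`) are hypothetical and the archives are assumed to sit in a single download folder.

```python
import hashlib
from pathlib import Path

# Checksums copied from the MSRVTT and MSVD steps above.
EXPECTED_MD5 = {
    "train_val_videos.zip": "51f2394d279cf84f1642defd9a651e6f",
    "test_videos.zip": "0af68454cec9d586e92805739f3911d0",
    "YouTubeClips.tar": "9bdb20fcf14d59524a6febca9f6a8d89",
}

# Video folders named in the steps above, relative to the repository root.
VIDEO_DIRS = [
    "data/msrvtt_ret/videos",
    "data/msrvtt_qa/videos",
    "data/msvd_qa/videos",
    "data/didemo_ret/videos",
]


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(download_dir):
    """Map each archive found in download_dir to True if its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        archive = Path(download_dir) / name
        if archive.is_file():
            results[name] = md5_of(archive) == expected
    return results


def missing_video_dirs(repo_root):
    """List the expected video folders that do not exist under repo_root."""
    return [d for d in VIDEO_DIRS if not (Path(repo_root) / d).is_dir()]


if __name__ == "__main__":
    print(check_downloads("."))     # archives assumed to be in the current folder
    print(missing_video_dirs("."))  # run from the repository root
```

`Path.is_dir()` follows symbolic links, so a broken `data/msrvtt_qa/videos` link is reported as missing rather than silently accepted.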

Inference with Official Checkpoints

cd run_scripts
bash inf_msrvtt_ret.sh
# {'text2video': {'r1': 33.9, 'r5': 60.7, 'r10': 73.2, 'medianR': 3.0, 'meanR': 27.404}}
bash inf_didemo_ret.sh
# {'text2video': {'r1': 35.9, 'r5': 67.5, 'r10': 78.8, 'medianR': 3.0, 'meanR': 19.125}}
bash inf_msrvtt_qa.sh
# {'ratios': {'what_ratio': [68.48, 49872], 'who_ratio': [27.99, 20385], 'how_ratio': [2.25, 1640], 'where_ratio': [0.34, 250], 'when_ratio': [0.93, 677]}, 'overall_acc': 42.12, 'what_acc': 36.05, 'who_acc': 52.24, 'how_acc': 85.67, 'where_acc': 42.8, 'when_acc': 78.88}
bash inf_msvd_qa.sh
# {'ratios': {'what_ratio': [61.93, 8150], 'who_ratio': [34.6, 4554], 'how_ratio': [2.81, 370], 'where_ratio': [0.21, 28], 'when_ratio': [0.44, 58]}, 'overall_acc': 45.91, 'what_acc': 37.02, 'who_acc': 58.59, 'how_acc': 81.62, 'where_acc': 46.43, 'when_acc': 72.41}
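
To help read the QA output, the sketch below recomputes `overall_acc` from the per-type numbers printed by inf_msvd_qa.sh. It assumes that the second value in each `*_ratio` pair is the number of questions of that type and that the overall accuracy is the count-weighted mean of the per-type accuracies; this interpretation is ours, although it is consistent with the reported figures. The function name `weighted_accuracy` is hypothetical.

```python
# Per-type [ratio %, question count] and accuracy values copied from the
# inf_msvd_qa.sh output above.
msvd_qa = {
    "ratios": {
        "what_ratio": [61.93, 8150],
        "who_ratio": [34.6, 4554],
        "how_ratio": [2.81, 370],
        "where_ratio": [0.21, 28],
        "when_ratio": [0.44, 58],
    },
    "overall_acc": 45.91,
    "what_acc": 37.02,
    "who_acc": 58.59,
    "how_acc": 81.62,
    "where_acc": 46.43,
    "when_acc": 72.41,
}


def weighted_accuracy(result):
    """Average the per-type accuracies, weighted by question counts."""
    total = 0
    weighted_sum = 0.0
    for key, (_ratio, count) in result["ratios"].items():
        qtype = key[: -len("_ratio")]  # "what_ratio" -> "what"
        weighted_sum += count * result[f"{qtype}_acc"]
        total += count
    return weighted_sum / total


print(round(weighted_accuracy(msvd_qa), 2))  # 45.91
```

Because "what" and "who" questions make up most of the test set, the overall accuracy sits much closer to those two values than to the higher "how" and "when" accuracies.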

Downstream Task Finetuning

  • To finetune on downstream tasks with the pre-trained checkpoint output/pretrain/alpro_pretrained_ckpt.pt:

    cd run_scripts
    bash ft_msrvtt_ret.sh
    bash ft_didemo_ret.sh
    bash ft_msrvtt_qa.sh
    bash ft_msvd_qa.sh

    For example, with MSRVTT retrieval:

    cd ALPRO/
    
    export PYTHONPATH="$PYTHONPATH:$PWD"
    echo $PYTHONPATH
    
    CONFIG_PATH='config_release/msrvtt_ret.json'
    
    # Set -np to the number of GPUs available.
    # Change --output_dir to a local path for storing finetuning checkpoints and logs.
    horovodrun -np 8 python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir /export/home/workspace/experiments/alpro/finetune/msrvtt_ret/$(date '+%Y%m%d%H%M%S')
  • Run inference with locally-finetuned checkpoints.

     cd ALPRO/
    
     export PYTHONPATH="$PYTHONPATH:$PWD"
     echo $PYTHONPATH
    
     STEP='best'
    
     CONFIG_PATH='config_release/msrvtt_ret.json'
     OUTPUT_DIR='[INPUT_YOUR_OUTPUT_PATH_HERE]'
    
     TXT_DB='data/msrvtt_ret/txt/test.jsonl'
     IMG_DB='data/msrvtt_ret/videos'
    
     horovodrun -np 8 python src/tasks/run_video_retrieval.py \
         --do_inference 1 \
         --inference_split test \
         --inference_model_step $STEP \
         --inference_txt_db $TXT_DB \
         --inference_img_db $IMG_DB \
         --inference_batch_size 64 \
         --output_dir $OUTPUT_DIR \
         --config $CONFIG_PATH
    • OUTPUT_DIR is the path after the --output_dir option in the finetuning script.
    • $STEP is a string, which tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference.
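
The checkpoint naming rule above can be sketched as follows. This is an illustration rather than code from the repository: the helper names and the example output directory are hypothetical, and only the `$OUTPUT_DIR/ckpt/model_step_$STEP.pt` pattern comes from the description above.

```python
from pathlib import Path


def checkpoint_path(output_dir, step="best"):
    """Build $OUTPUT_DIR/ckpt/model_step_$STEP.pt as described above."""
    return Path(output_dir) / "ckpt" / f"model_step_{step}.pt"


def available_steps(output_dir):
    """List the STEP values for which a checkpoint file exists."""
    ckpt_dir = Path(output_dir) / "ckpt"
    prefix = "model_step_"
    return sorted(p.stem[len(prefix):] for p in ckpt_dir.glob(f"{prefix}*.pt"))


if __name__ == "__main__":
    # Hypothetical output directory, used only for illustration.
    path = checkpoint_path("output/finetune/msrvtt_ret/example_run", "best")
    print(path, "exists" if path.is_file() else "is missing")
```

Checking that the file exists before calling horovodrun avoids launching all workers only to fail at checkpoint loading.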

Pretraining

  1. Download WebVid2M and CC-3M.

    • Put WebVid2M videos under data/webvid2m;
    • 💡 We downsample the WebVid2M videos to 10% of their original FPS to speed up video loading;
    • Update data/cc3m/txt/cc3m.json with your local image paths.
  2. Training Prompter:

    cd run_scripts && bash pt_prompter.sh
  3. Training video-language model:

    cd run_scripts && bash pt_alpro.sh

    To use custom prompter weights, change teacher_weights_path in config_release/pretrain_alpro.json.

  4. To finetune with pre-trained checkpoints, please change e2e_weights_path in the finetuning config files, e.g. config_release/msrvtt_ret.json.
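
Instead of editing the config files by hand, the weight paths can be overridden with a small script. The sketch below assumes the config files are plain JSON and that `e2e_weights_path` (or `teacher_weights_path`) is a top-level key; neither assumption is confirmed here, so the function raises an error if the key is absent. The function name and the output file name are hypothetical.

```python
import json
from pathlib import Path


def write_config_with_weights(src_config, dst_config, key, weights_path):
    """Copy a JSON config, overriding a single weights-path entry."""
    config = json.loads(Path(src_config).read_text())
    if key not in config:
        # Assumption check: the key is expected at the top level of the file.
        raise KeyError(f"{key!r} not found in {src_config}")
    config[key] = str(weights_path)
    Path(dst_config).write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    src = Path("config_release/msrvtt_ret.json")
    if src.is_file():
        write_config_with_weights(
            src,
            "config_release/msrvtt_ret_custom.json",  # hypothetical output name
            "e2e_weights_path",
            "output/pretrain/alpro_pretrained_ckpt.pt",
        )
```

Writing to a new file keeps the released config untouched; point CONFIG_PATH at the new file when launching finetuning.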

Citation

If you find ALPRO useful for your research, please consider citing:

  @inproceedings{li2021align,
    title={Align and Prompt: Video-and-Language Pre-training with Entity Prompts},
    author={Dongxu Li and Junnan Li and Hongdong Li and Juan Carlos Niebles and Steven C.H. Hoi},
    booktitle={arxiv},
    year={2021}
  }

Acknowledgement

We thank members at Salesforce Research for their helpful discussions.

The implementation of ALPRO relies on resources from ClipBERT, transformers, and TimeSformer. The code is implemented using PyTorch, with multi-GPU support from Horovod and gradient-checkpoint. We thank the original authors for open-sourcing their work and encourage ALPRO users to cite their works when applicable.
