[ICCV 2021 Oral] Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Overview

This repository provides the code for our paper, including:

  • Data downloading instructions, including our released iVQA and HowToVQA69M datasets
  • Data preprocessing and feature extraction scripts, as well as preprocessed data and features
  • VideoQA automatic generation pipeline
  • Training scripts and pretrained checkpoints, both for pretraining and downstream VideoQA datasets
  • Evaluation scripts

Paths and Requirements

Fill the empty paths in the file global_parameters.py.
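
Before launching any script, it can help to confirm that no path was left empty. The following is a minimal sketch, not part of the repository: the variable names are those mentioned elsewhere in this README, and the assumption that global_parameters.py defines exactly these names as module-level strings is ours. The helper find_unset_paths is hypothetical.

```python
# Path variables named in this README. Whether global_parameters.py defines
# exactly these names is an assumption made for this sketch.
EXPECTED_NAMES = [
    "DEFAULT_CKPT_DIR",
    "DEFAULT_DATASET_DIR",
    "DEFAULT_MODEL_DIR",
    "HOWTO_PATH",
    "HOWTO_FEATURES_PATH",
    "SSD_PATH",
    "QG_REPO_DIR",
]


def find_unset_paths(params):
    """Return the expected names that are missing or empty in `params`.

    `params` is any mapping from variable name to value, for example
    vars(global_parameters) after importing the module.
    """
    return [name for name in EXPECTED_NAMES if not params.get(name)]
```

For instance, after `import global_parameters`, calling `find_unset_paths(vars(global_parameters))` would list the names that still need a value.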

To install requirements, run:

pip install -r requirements.txt

Quick Start

Follow this section if you wish to start VideoQA training or inference quickly.

For downstream datasets

To download pretrained checkpoints, pre-processed data and features, run:

bash download/download_checkpoints.sh <DEFAULT_CKPT_DIR>
bash download/download_downstream.sh <DEFAULT_DATASET_DIR>

This requires about 8 GB of free space in DEFAULT_CKPT_DIR and 3.6 GB in DEFAULT_DATASET_DIR.
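
The free-space requirement can be checked before starting the downloads. This is a small sketch using only the Python standard library; the helper name has_free_space is hypothetical and the sizes to pass in are the approximate figures quoted above.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3
```

For example, `has_free_space(ckpt_dir, 8)` and `has_free_space(dataset_dir, 3.6)` could be checked before running the two download scripts, where `ckpt_dir` and `dataset_dir` are the directories you chose.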

For HowToVQA69M Pretraining

If you want to reproduce the pretraining, download HowToVQA69M:

bash download/download_howtovqa.sh <DEFAULT_DATASET_DIR>

This requires about 6 GB of free space in DEFAULT_DATASET_DIR. You will also need to download the HowTo100M video features from the data providers into HOWTO_FEATURES_PATH.

Long Start

Follow this section if you wish to reproduce the data preprocessing, video feature extraction or HowToVQA69M generation procedure.

Download Raw Data

The following folders should be created in DEFAULT_DATASET_DIR. Each folder should also contain a video subfolder holding the videos downloaded for that dataset.

HowToVQA69M: We provide the HowToVQA69M dataset at this link. The HowToVQA69M folder should contain howtovqa.pkl, train_howtovqa.csv and val_howtovqa.csv.

iVQA: We provide the iVQA dataset at this link. The iVQA folder should contain train.csv, val.csv and test.csv.

MSRVTT-QA: Download it from the data providers. The MSRVTT-QA folder should contain train_qa.json, val_qa.json, test_qa.json, and also train_val_videodatainfo.json and test_videodatainfo.json. The last two files are from the MSR-VTT dataset, and are used to filter out video IDs in HowTo100M that are in the validation and test sets of MSRVTT-QA.

MSVD-QA: Download it from the data providers. The MSVD-QA folder should contain train_qa.json, val_qa.json, test_qa.json and youtube_mapping.txt. The last file is used to filter out video IDs in HowTo100M that are in the validation and test sets of MSVD-QA.

ActivityNet-QA: Download it from the data providers. The ActivityNet-QA folder should contain train_q.json, train_a.json, val_q.json, val_a.json, test_q.json and test_a.json.

How2QA: Download it from the data providers. The How2QA folder should contain how2QA_train_release.csv and how2QA_val_release.csv.

HowTo100M: Download it from the data providers. The HowTo100M folder should contain caption_howto100m_with_stopwords.pkl and s3d_features.csv. Note that for the VQA-T pretraining on HowTo100M baseline, we also do zero-shot validation on YouCook2 and MSR-VTT video retrieval. We followed MIL-NCE for the preprocessing of these datasets. You should have in the YouCook2 folder a pickle file with processed data and features youcook_unpooled_val.pkl, and in the MSR-VTT folder a file of processed data MSRVTT_JSFUSION_test.csv and a file of features msrvtt_test_unpooled_s3d_features.pth.
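
The expected layout above can be verified with a short script. This is a sketch under assumptions: the folder names are taken literally from the descriptions above, the file lists cover only the main files named for each dataset (not the video subfolders or the YouCook2 and MSR-VTT retrieval files), and the helper missing_files is hypothetical.

```python
from pathlib import Path

# Files each dataset folder should contain, as listed in this README.
# Folder names are assumed to match the dataset names used above.
EXPECTED_FILES = {
    "HowToVQA69M": ["howtovqa.pkl", "train_howtovqa.csv", "val_howtovqa.csv"],
    "iVQA": ["train.csv", "val.csv", "test.csv"],
    "MSRVTT-QA": ["train_qa.json", "val_qa.json", "test_qa.json",
                  "train_val_videodatainfo.json", "test_videodatainfo.json"],
    "MSVD-QA": ["train_qa.json", "val_qa.json", "test_qa.json",
                "youtube_mapping.txt"],
    "ActivityNet-QA": ["train_q.json", "train_a.json", "val_q.json",
                       "val_a.json", "test_q.json", "test_a.json"],
    "How2QA": ["how2QA_train_release.csv", "how2QA_val_release.csv"],
    "HowTo100M": ["caption_howto100m_with_stopwords.pkl", "s3d_features.csv"],
}


def missing_files(dataset_dir):
    """Return (folder, file name) pairs that are not present under `dataset_dir`."""
    root = Path(dataset_dir)
    missing = []
    for folder, names in EXPECTED_FILES.items():
        for name in names:
            if not (root / folder / name).is_file():
                missing.append((folder, name))
    return missing
```

An empty result means every listed file was found; otherwise the returned pairs indicate what remains to be downloaded.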

Data Preprocessing

VideoQA: To process data for each VideoQA dataset, use:

python preproc/preproc_ivqa.py
python preproc/preproc_msrvttqa.py
python preproc/preproc_msvdqa.py
python preproc/preproc_activitynetqa.py
python preproc/preproc_how2qa.py

This will save train, validation and test dataframe files (train.csv, val.csv, test.csv), and the vocabulary map (vocab.json) in the open-ended setting, in each dataset folder. Note that the How2QA preprocessing script should be used after feature extraction (see below) and will also merge features into one file.

HowTo100M: To preprocess HowTo100M by removing potential intersection with the validation and test sets of VideoQA datasets, and removing repetition in the ASR data, use:

python preproc/howto100m_remove_intersec.py
python preproc/howto100m_remove_repet.py

This will save caption_howto100m_sw_nointersec.pickle, caption_howto100m_sw_nointersec_norepeat.pickle and s3d_features_nointersec.csv in HOWTO_PATH.
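
The two cleaning steps can be illustrated on toy data. This is only a sketch of the idea, assuming captions are stored as a dictionary from video ID to a list of ASR lines and that repetition means identical consecutive lines; the actual scripts may use a different data layout and different criteria. Both function names are hypothetical.

```python
def remove_intersection(captions, held_out_ids):
    """Drop every video whose ID appears in a validation or test set."""
    held_out = set(held_out_ids)
    return {vid: lines for vid, lines in captions.items() if vid not in held_out}


def remove_consecutive_repeats(lines):
    """Collapse runs of identical consecutive ASR lines into a single line."""
    cleaned = []
    for line in lines:
        if not cleaned or line != cleaned[-1]:
            cleaned.append(line)
    return cleaned
```

Applying the first function with the video IDs of the downstream validation and test sets, then the second to each remaining video, mirrors the order of the two commands above.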

Extract video features

We provide in the extract folder the code to extract features with the S3D feature extractor. It requires downloading the S3D model weights available at this repository. The s3d_howto100m.pth checkpoint and s3d_dict.npy dictionary should be in DEFAULT_MODEL_DIR.

Extraction: For each dataset, prepare a csv with the columns video_path (typically of the form <dataset_path>/video/<video_path>) and feature_path (typically of the form <dataset_path>/features/<video_path>.npy). Then use (you may launch this script on multiple GPUs to speed up the extraction):

python extract/extract.py --csv <csv_path>
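
The csv expected by the extraction script can be generated from the contents of the video subfolder. This is a sketch assuming a flat video folder and the typical path patterns described above; the function name write_extraction_csv is hypothetical.

```python
import csv
from pathlib import Path


def write_extraction_csv(dataset_path, csv_path):
    """Write a csv with columns video_path and feature_path; return the number of videos."""
    dataset = Path(dataset_path)
    # Assumes every regular file directly inside <dataset_path>/video is a video.
    videos = sorted(p for p in (dataset / "video").iterdir() if p.is_file())
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video_path", "feature_path"])
        for video in videos:
            # Follows the pattern <dataset_path>/features/<video_path>.npy.
            feature = dataset / "features" / (video.name + ".npy")
            writer.writerow([str(video), str(feature)])
    return len(videos)
```

The resulting file can then be passed to the extraction script through its --csv argument.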

Merging: To merge the extracted features into a single file for each VideoQA dataset, use (for ActivityNet-QA, which contains long videos, add --pad 120):

python extract/merge_features.py --folder <features_path> \
--output_path <DEFAULT_DATASET_DIR>/s3d.pth --dataset <dataset>

For HowTo100M, the features should be stored in HOWTO_FEATURES_PATH, one file per video. SSD_PATH should preferably be located on an SSD, so that on-the-fly reads during pretraining are as fast as possible.

HowToVQA69M Generation

This requires downloading the pretrained BRNN model weights from Punctuator2. The INTERSPEECH-T-BRNN.pcl file should be in DEFAULT_MODEL_DIR.

Punctuating: First, we punctuate the speech data at the video level and split each video into clips temporally aligned with the inferred sentences (you may launch this script on multiple CPUs to speed up the process):

python videoqa_generation/punctuate.py

Merging inferred speech sentences: Second, we merge the punctuated data into one file:

python videoqa_generation/merge_punctuations.py

Extracting answers: Third, we extract answers from speech transcripts. This requires having cloned this repository in QG_REPO_DIR. Then use (you may launch this script on multiple GPUs to speed up the process):

python videoqa_generation/extract_answers.py

Merging extracted answers: Fourth, we merge the extracted answers into one file:

python videoqa_generation/merge_answers.py

Generating questions: Fifth, we generate question-answer pairs from speech and extracted answers. Use (you may launch this script on multiple GPUs to speed up the process):

python videoqa_generation/generate_questions.py

Merging generated question-answer pairs: Finally, we merge the generated question-answer pairs into one file (this will save howtovqa.pkl, train_howtovqa.csv and val_howtovqa.csv):

python videoqa_generation/merge_qas.py
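
The six steps above must run in order, since each merge consumes the output of the step before it. The following sketch chains them in a single process; it is an illustration only and ignores the multi-CPU and multi-GPU launching suggested above. The function run_generation and its runner parameter are hypothetical.

```python
import subprocess

# The six scripts of the generation pipeline, in the order given above.
GENERATION_STEPS = [
    "videoqa_generation/punctuate.py",
    "videoqa_generation/merge_punctuations.py",
    "videoqa_generation/extract_answers.py",
    "videoqa_generation/merge_answers.py",
    "videoqa_generation/generate_questions.py",
    "videoqa_generation/merge_qas.py",
]


def run_generation(python="python", runner=subprocess.run):
    """Run each step sequentially; with the default runner, stop at the first failure."""
    for script in GENERATION_STEPS:
        # check=True makes subprocess.run raise if a script exits with an error.
        runner([python, script], check=True)
```

The runner argument exists so that the sequence can be inspected or dry-run without executing anything, for example by passing a function that only prints the command.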

Training

Pretraining

DistilBERT tokenizer and model checkpoints will be automatically downloaded from Hugging Face in DEFAULT_MODEL_DIR/transformers.

Training VQA-T on HowToVQA69M: To train on HowToVQA69M with a contrastive loss and a masked language modeling (MLM) loss (it takes less than 48 hours on 8 NVIDIA Tesla V100 GPUs), run:

python main_howtovqa.py --dataset="howtovqa" --epochs=10 --checkpoint_dir="pthowtovqa" \
--batch_size=128 --batch_size_val=256 --n_pair=32 --freq_display=10

Note that it runs a validation once per epoch, which consists of retrieving the correct answer within the batch, given the video and the question.
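
In-batch answer retrieval can be pictured as follows. This is a sketch under assumptions, not the repository's validation code: it supposes that a similarity score is available between every video-question pair and every answer in the batch, that the correct answer for pair i is answer i, and that the metric is top-1 accuracy. The function name is hypothetical and plain Python lists stand in for tensors.

```python
def in_batch_retrieval_accuracy(scores):
    """Top-1 accuracy of retrieving answer i for video-question pair i.

    scores[i][j] is the similarity between pair i and answer j; the answers
    of the other samples in the batch act as distractors.
    """
    if not scores:
        return 0.0
    correct = 0
    for i, row in enumerate(scores):
        # Index of the highest-scoring answer for this pair (first one on ties).
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            correct += 1
    return correct / len(scores)
```

With this definition, a larger validation batch makes the task harder, because each pair competes against more distractor answers.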

Baselines: To pretrain QA-T on HowToVQA69M, use the previous command and add --baseline qa. To train VQA-T on HowTo100M with MLM and cross-modal matching objectives (it takes less than 2 days on 8 NVIDIA Tesla V100 GPUs), run:

python main_htm.py --dataset="howto100m" --epochs=10 --checkpoint_dir="pthtm" \
--batch_size=128 --batch_size_val=3500 --n_pair=32 --freq_display=10

Note that the previous command runs a zero-shot video retrieval validation on YouCook2 and MSR-VTT once per epoch.

Training on downstream VideoQA datasets

Finetuning: To finetune a pretrained model on a downstream VideoQA dataset (for MSRVTT-QA, which is the largest downstream dataset, it takes less than 4 hours on 4 NVIDIA Tesla V100 GPUs), run:

python main_videoqa.py --checkpoint_dir=ft<dataset> --dataset=<dataset> --lr=0.00001 \
--pretrain_path=<CKPT_PATH>

Training from scratch: To train VQA-T from scratch, run the previous script without setting --pretrain_path.

Available checkpoints

Training data                  iVQA   MSRVTT-QA   MSVD-QA   ActivityNet-QA   How2QA   url     size
HowToVQA69M                    12.2   2.9         7.5       12.2             51.1     Drive   600MB
HowToVQA69M + iVQA             35.4   -           -         -                -        Drive   600MB
HowToVQA69M + MSRVTT-QA        -      41.5        -         -                -        Drive   600MB
HowToVQA69M + MSVD-QA          -      -           43.6      -                -        Drive   600MB
HowToVQA69M + ActivityNet-QA   -      -           -         38.9             -        Drive   600MB
HowToVQA69M + How2QA           -      -           -         -                84.4     Drive   600MB

Inference

Evaluating on downstream VideoQA datasets

VQA-T: To evaluate VQA-T on a downstream VideoQA dataset, run (for zero-shot VideoQA, simply use the checkpoint trained on HowToVQA69M only):

python main_videoqa.py --checkpoint_dir=ft<dataset> --dataset=<dataset> \
--pretrain_path=<CKPT_PATH> --test 1

Baselines: For QA-T, use the command above with the corresponding checkpoint and add --baseline qa. For zero-shot VideoQA with VQA-T pretrained on HowTo100M, run:

python eval_videoqa_cm.py --checkpoint_dir=pthtmzeroshot<dataset> --dataset=<dataset> \
--pretrain_path=<CKPT_PATH>

Detailed evaluation

To evaluate a trained checkpoint with results broken down by question type and by answer quartile, use:

python eval_videoqa.py --dataset <dataset> --pretrain_path <CKPT_PATH>
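
A breakdown of this kind amounts to grouping samples by an attribute and computing accuracy within each group. The sketch below illustrates the idea only; the record fields "prediction" and "answer", the grouping key, and the function name are all hypothetical, and how the evaluation script assigns question types or answer quartiles is not shown here.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return a mapping from each value of record[key] to the accuracy within that group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["prediction"] == record["answer"])
    return {group: hits[group] / totals[group] for group in totals}
```

The same function would serve for both breakdowns by changing the key, for example a question-type field in one call and a quartile field in the other.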

VideoQA Demo

Using a trained checkpoint, you can also run a VideoQA example on a video file and a question of your choice. For that, use (the dataset indicated here is only used to define the answer vocabulary):

python demo_videoqa.py --dataset <dataset> --pretrain_path <CKPT_PATH> \
--question_example <question> --video_example <video_path>

Note that we also host an online demo at this link.

Misc.

In the folder misc, you can find a notebook with code for the plots and data statistics shown in the paper.

You can also find there the html code used for iVQA data collection on Amazon Mechanical Turk.

Moreover, you can find the manually evaluated samples from generated data at this link.

Finally, you can find the html and python code for the online demo.

Acknowledgements

The video feature extraction code is inspired by this repository. The model implementation of our multi-modal transformer (as well as the masked language modeling setup) is inspired by Hugging Face. The comparison with Heilman et al. was done using the original Java implementation.

Citation

If you found this work useful, consider giving this repository a star and citing our paper as follows:

@InProceedings{Yang_2021_ICCV,
    author    = {Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
    title     = {Just Ask: Learning To Answer Questions From Millions of Narrated Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {1686-1697}
}
Owner
Antoine Yang
PhD Student in Computer Vision and Machine Learning, focusing on learning multimodal video representations using vision and language