Official code for "Bridging Video-text Retrieval with Multiple Choice Questions", CVPR 2022 (Oral).

Related tags

Computer VisionMCQ
Overview

Bridging Video-text Retrieval with Multiple Choice Questions, CVPR 2022 (Oral)

Paper | Project Page | Pre-trained Model | CLIP-Initialized Pre-trained Model image

News

2022-04-17 We release the pre-trained model initialized from CLIP (ViT-B/32) and its usage (text-to-video retrieval and video feature extraction).

2022-04-08 We release the pre-training and downstream evaluation code, and the pre-trained model.

Main Results on Downstream Tasks

Text-to-video Retrieval on MSR-VTT

image

Text-to-video Retrieval on MSVD, LSMDC and DiDeMo

image

Visualization

Answer Noun Questions

We visualize cross-modality attention between the text tokens of noun questions and video tokens from BridgeFormer. In the second and fifth column, the noun phrase marked in blue (Q1) is erased as the question, and in the third and sixth column, the noun phrase marked in green (Q2) is erased as the question. BridgeFormer attends to video patches with specific object information to answer noun questions.

image

Answer Verb Questions

We visualize cross-modality attention between the text tokens of verb questions and video tokens from BridgeFormer. Three frames sampled from a video are shown and the verb phrase marked in blue (Q) is erased as the question. BridgeFormer focuses on object motions of video tokens to answer verb questions.

image

Dependencies and Installation

Installation

  1. Clone repo

    git clone https://github.com/TencentARC/MCQ.git
    cd MCQ
  2. Install dependent packages

    pip install -r requirements.txt
  3. Download the DistilBERT base model from Hugging Face in hugging face or in distilbert-base-uncased. Put "distilbert-base-uncased" under the directory of this repo.

Data Preparation

Please refer to DATA.md for pre-training and downstream evaluation datasets.

Pre-training

We adopt the curriculum learning to train the model, which pre-trains the model on the image dataset CC3M and video dataset WebVid-2M using 1 frame, and then on the video dataset WebVid-2M using 4 frames.

  1. For 1-frame pre-training, since a single frame does not contain temporal dynamics to correspond to verb phrases, we train the model to answer only noun questions.

    bash sctripts/train_1frame_mask_noun.sh
    

    When the training loss converges, we get model "MCQ_1frame.pth".

  2. For 4-frame pre-training, to save computation cost to enable a comparatively large batch size for contrastive learning, we train the model to anwer noun and verb questions sequentially. We first train the model to answer noun questions with "MCQ_1frame.pth" loaded in "configs/dist-4frame-mask-noun.json".

    bash sctripts/train_4frame_mask_noun.sh
    

    When the training loss converges, we get model "MCQ_4frame_noun.pth". We then train the model to answer verb questions with "MCQ_4frame_noun.pth" loaded in "configs/dist-4frame-mask-verb.json".

    bash sctripts/train_4frame_mask_verb.sh
    

    When the training loss converges, we get the final model.

  3. Our repo adopts Multi-Machine and Multi-GPU training, with 32 A100 GPU for 1-frame pre-training and 40 A100 GPU for 4-frame pre-training.

Pre-trained Model

Our pre-trained model can be downloaded in Pre-trained Model, which contains the weights of VideoFormer, TextFormer and BridgeFormer. For downstream evaluation, you only need to load the weights of VideoFormer and TextFormer, with BridgeFormer removed.

Downstream Retrieval (Zero-shot on MSR-VTT)

  1. Download our pre-trained model in Pre-trained Model (Or use your own pre-traind model).

  2. Load the pre-trained model in "configs/zero_msrvtt_4f_i21k.json".

    bash sctripts/test_retrieval.sh
    

CLIP-initialized Pre-trained Model

We also initialize our model from CLIP weights to pre-train a model with MCQ. Specifically, we use the pre-trained CLIP (ViT-B/32) as the backbone of VideoFormer and TextFormer, and randomly initialize BridgeFormer. Our VideoFormer does not incur any additional parameters compared to the ViT of CLIP, with a parameter-free modification to allow for the input of video frames with variable length.

To evaluate the performance of the CLIP-initialized pre-trained model on text-to-video retrieval,

  1. Download the model in CLIP-Initialized Pre-trained Model.

  2. Load the pre-trained model in "configs/zero_msrvtt_4f_i21k_clip.json".

    bash sctripts/test_retrieval_CLIP.sh
    

We also provide a script to extract video features of any given videos from the CLIP-initialized pre-trained model,

python extract_video_features_clip.py

To Do

  • Release pre-training code
  • Release pre-trained model
  • Release downstream evaluation code
  • Release CLIP-initialized model
  • Release video representation extraction code

License

MCQ is released under BSD 3-Clause License.

Acknowledgement

Our code is based on the implementation of "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" https://github.com/m-bain/frozen-in-time.git.

Citation

If our code is helpful to your work, please cite:

@article{ge2022bridgeformer,
  title={BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions},
  author={Ge, Yuying and Ge, Yixiao and Liu, Xihui and Li, Dian and Shan, Ying and Qie, Xiaohu and Luo, Ping},
  journal={arXiv preprint arXiv:2201.04850},
  year={2022}
}
Owner
Applied Research Center (ARC), Tencent PCG
Applied Research Center (ARC), Tencent PCG
Detect the mathematical formula from the given picture and the same formula is extracted and converted into the latex code

Mathematical formulae extractor The goal of this project is to create a learning based system that takes an image of a math formula and returns corres

6 May 22, 2022
Simple app for visual editing of Page XML files

Name nw-page-editor - Simple app for visual editing of Page XML files. Version: 2021.02.22 Description nw-page-editor is an application for viewing/ed

Mauricio Villegas 27 Jun 20, 2022
kaldi-asr/kaldi is the official location of the Kaldi project.

Kaldi Speech Recognition Toolkit To build the toolkit: see ./INSTALL. These instructions are valid for UNIX systems including various flavors of Linux

Kaldi 12.3k Jan 05, 2023
Read Japanese manga inside browser with selectable text.

mokuro Read Japanese manga with selectable text inside a browser. See demo: https://kha-white.github.io/manga-demo mokuro_demo.mp4 Demo contains excer

Maciej Budyś 170 Dec 27, 2022
A simple component to display annotated text in Streamlit apps.

Annotated Text Component for Streamlit A simple component to display annotated text in Streamlit apps. For example: Installation First install Streaml

Thiago Teixeira 312 Dec 30, 2022
Learn computer graphics by writing GPU shaders!

This repo contains a selection of projects designed to help you learn the basics of computer graphics. We'll be writing shaders to render interactive two-dimensional and three-dimensional scenes.

Eric Zhang 1.9k Jan 02, 2023
The official code for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates".

SpeechDrivesTemplates The official repo for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates". [arxiv

Qian Shenhan 53 Dec 23, 2022
a Deep Learning Framework for Text

DeLFT DeLFT (Deep Learning Framework for Text) is a Keras and TensorFlow framework for text processing, focusing on sequence labelling (e.g. named ent

Patrice Lopez 350 Dec 19, 2022
Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"

SEE: Towards Semi-Supervised End-to-End Scene Text Recognition Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text

Christian Bartz 572 Jan 05, 2023
Source code of RRPN ---- Arbitrary-Oriented Scene Text Detection via Rotation Proposals

Paper source Arbitrary-Oriented Scene Text Detection via Rotation Proposals https://arxiv.org/abs/1703.01086 News We update RRPN in pytorch 1.0! View

428 Nov 22, 2022
Read-only mirror of https://gitlab.gnome.org/GNOME/ocrfeeder

================================= OCRFeeder - A Complete OCR Suite ================================= OCRFeeder is a complete Optical Character Recogn

GNOME Github Mirror 81 Dec 23, 2022
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

CUTIE TensorFlow implementation of the paper "CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor." Xiaohu

Zhao,Xiaohui 147 Dec 20, 2022
This is the implementation of the paper "Gated Recurrent Convolution Neural Network for OCR"

Gated Recurrent Convolution Neural Network for OCR This project is an implementation of the GRCNN for OCR. For details, please refer to the paper: htt

90 Dec 22, 2022
Crop regions in napari manually

napari-crop Crop regions in napari manually Usage Create a new shapes layer to annotate the region you would like to crop: Use the rectangle tool to a

Robert Haase 4 Sep 29, 2022
Go package for OCR (Optical Character Recognition), by using Tesseract C++ library

gosseract OCR Golang OCR package, by using Tesseract C++ library. OCR Server Do you just want OCR server, or see the working example of this package?

Hiromu OCHIAI 1.9k Dec 28, 2022
A simple Security Camera created using Opencv in Python where images gets saved in realtime in your Dropbox account at every 5 seconds

Security Camera using Opencv & Dropbox This is a simple Security Camera created using Opencv in Python where images gets saved in realtime in your Dro

Arpit Rath 1 Jan 31, 2022
Convert PDF/Image to TXT using EasyOcr - the best OCR engine available!

PDFImage2TXT - DOWNLOAD INSTALLER HERE What can you do with it? Convert scanned PDFs to TXT. Convert scanned Documents to TXT. No coding required!! In

Hans Alemão 2 Feb 22, 2022
This repo contains several opencv projects done while learning opencv in python.

opencv-projects-python This repo contains both several opencv projects done while learning opencv by python and opencv learning resources [Basic conce

Fatin Shadab 2 Nov 03, 2022
Perspective recovery of text using transformed ellipses

unproject_text Perspective recovery of text using transformed ellipses. See full writeup at https://mzucker.github.io/2016/10/11/unprojecting-text-wit

Matt Zucker 111 Nov 13, 2022
PSENet - Shape Robust Text Detection with Progressive Scale Expansion Network.

News Python3 implementations of PSENet [1], PAN [2] and PAN++ [3] are released at https://github.com/whai362/pan_pp.pytorch. [1] W. Wang, E. Xie, X. L

1.1k Dec 24, 2022