Official code for "Bridging Video-text Retrieval with Multiple Choice Questions", CVPR 2022 (Oral).

Related tags

Computer VisionMCQ
Overview

Bridging Video-text Retrieval with Multiple Choice Questions, CVPR 2022 (Oral)

Paper | Project Page | Pre-trained Model | CLIP-Initialized Pre-trained Model image

News

2022-04-17 We release the pre-trained model initialized from CLIP (ViT-B/32) and its usage (text-to-video retrieval and video feature extraction).

2022-04-08 We release the pre-training and downstream evaluation code, and the pre-trained model.

Main Results on Downstream Tasks

Text-to-video Retrieval on MSR-VTT

image

Text-to-video Retrieval on MSVD, LSMDC and DiDeMo

image

Visualization

Answer Noun Questions

We visualize cross-modality attention between the text tokens of noun questions and video tokens from BridgeFormer. In the second and fifth column, the noun phrase marked in blue (Q1) is erased as the question, and in the third and sixth column, the noun phrase marked in green (Q2) is erased as the question. BridgeFormer attends to video patches with specific object information to answer noun questions.

image

Answer Verb Questions

We visualize cross-modality attention between the text tokens of verb questions and video tokens from BridgeFormer. Three frames sampled from a video are shown and the verb phrase marked in blue (Q) is erased as the question. BridgeFormer focuses on object motions of video tokens to answer verb questions.

image

Dependencies and Installation

Installation

  1. Clone repo

    git clone https://github.com/TencentARC/MCQ.git
    cd MCQ
  2. Install dependent packages

    pip install -r requirements.txt
  3. Download the DistilBERT base model from Hugging Face in hugging face or in distilbert-base-uncased. Put "distilbert-base-uncased" under the directory of this repo.

Data Preparation

Please refer to DATA.md for pre-training and downstream evaluation datasets.

Pre-training

We adopt the curriculum learning to train the model, which pre-trains the model on the image dataset CC3M and video dataset WebVid-2M using 1 frame, and then on the video dataset WebVid-2M using 4 frames.

  1. For 1-frame pre-training, since a single frame does not contain temporal dynamics to correspond to verb phrases, we train the model to answer only noun questions.

    bash sctripts/train_1frame_mask_noun.sh
    

    When the training loss converges, we get model "MCQ_1frame.pth".

  2. For 4-frame pre-training, to save computation cost to enable a comparatively large batch size for contrastive learning, we train the model to anwer noun and verb questions sequentially. We first train the model to answer noun questions with "MCQ_1frame.pth" loaded in "configs/dist-4frame-mask-noun.json".

    bash sctripts/train_4frame_mask_noun.sh
    

    When the training loss converges, we get model "MCQ_4frame_noun.pth". We then train the model to answer verb questions with "MCQ_4frame_noun.pth" loaded in "configs/dist-4frame-mask-verb.json".

    bash sctripts/train_4frame_mask_verb.sh
    

    When the training loss converges, we get the final model.

  3. Our repo adopts Multi-Machine and Multi-GPU training, with 32 A100 GPU for 1-frame pre-training and 40 A100 GPU for 4-frame pre-training.

Pre-trained Model

Our pre-trained model can be downloaded in Pre-trained Model, which contains the weights of VideoFormer, TextFormer and BridgeFormer. For downstream evaluation, you only need to load the weights of VideoFormer and TextFormer, with BridgeFormer removed.

Downstream Retrieval (Zero-shot on MSR-VTT)

  1. Download our pre-trained model in Pre-trained Model (Or use your own pre-traind model).

  2. Load the pre-trained model in "configs/zero_msrvtt_4f_i21k.json".

    bash sctripts/test_retrieval.sh
    

CLIP-initialized Pre-trained Model

We also initialize our model from CLIP weights to pre-train a model with MCQ. Specifically, we use the pre-trained CLIP (ViT-B/32) as the backbone of VideoFormer and TextFormer, and randomly initialize BridgeFormer. Our VideoFormer does not incur any additional parameters compared to the ViT of CLIP, with a parameter-free modification to allow for the input of video frames with variable length.

To evaluate the performance of the CLIP-initialized pre-trained model on text-to-video retrieval,

  1. Download the model in CLIP-Initialized Pre-trained Model.

  2. Load the pre-trained model in "configs/zero_msrvtt_4f_i21k_clip.json".

    bash sctripts/test_retrieval_CLIP.sh
    

We also provide a script to extract video features of any given videos from the CLIP-initialized pre-trained model,

python extract_video_features_clip.py

To Do

  • Release pre-training code
  • Release pre-trained model
  • Release downstream evaluation code
  • Release CLIP-initialized model
  • Release video representation extraction code

License

MCQ is released under BSD 3-Clause License.

Acknowledgement

Our code is based on the implementation of "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" https://github.com/m-bain/frozen-in-time.git.

Citation

If our code is helpful to your work, please cite:

@article{ge2022bridgeformer,
  title={BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions},
  author={Ge, Yuying and Ge, Yixiao and Liu, Xihui and Li, Dian and Shan, Ying and Qie, Xiaohu and Luo, Ping},
  journal={arXiv preprint arXiv:2201.04850},
  year={2022}
}
Owner
Applied Research Center (ARC), Tencent PCG
Applied Research Center (ARC), Tencent PCG
A Tensorflow model for text recognition (CNN + seq2seq with visual attention) available as a Python package and compatible with Google Cloud ML Engine.

Attention-based OCR Visual attention-based OCR model for image recognition with additional tools for creating TFRecords datasets and exporting the tra

Ed Medvedev 933 Dec 29, 2022
Code for the paper "DewarpNet: Single-Image Document Unwarping With Stacked 3D and 2D Regression Networks" (ICCV '19)

DewarpNet This repository contains the codes for DewarpNet training. Recent Updates [May, 2020] Added evaluation images and an important note about Ma

<a href=[email protected]"> 354 Jan 01, 2023
Autonomous Driving project for Euro Truck Simulator 2

hope-autonomous-driving Autonomous Driving project for Euro Truck Simulator 2 Video: How is it working ? In this video, the program processes the imag

Umut Görkem Kocabaş 36 Nov 06, 2022
An application of high resolution GANs to dewarp images of perturbed documents

Docuwarp This project is focused on dewarping document images through the usage of pix2pixHD, a GAN that is useful for general image to image translat

Thomas Huang 97 Dec 25, 2022
https://arxiv.org/abs/1904.01941

Character-Region-Awareness-for-Text-Detection- https://arxiv.org/abs/1904.01941 Train You can train SynthText data use python source/train_SynthText.p

DayDayUp 120 Dec 28, 2022
零样本学习测评基准,中文版

ZeroCLUE 零样本学习测评基准,中文版 零样本学习是AI识别方法之一。 简单来说就是识别从未见过的数据类别,即训练的分类器不仅仅能够识别出训练集中已有的数据类别, 还可以对于来自未见过的类别的数据进行区分。 这是一个很有用的功能,使得计算机能够具有知识迁移的能力,并无需任何训练数据, 很符合现

CLUE benchmark 27 Dec 10, 2022
Python-based tools for document analysis and OCR

ocropy OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do so

OCRopus 3.2k Dec 31, 2022
A python script based on opencv and paddleocr, which can automatically pick up tasks, make cookies, and receive rewards in the Destiny 2 Dawning Oven

A python script based on opencv and paddleocr, which can automatically pick up tasks, make cookies, and receive rewards in the Destiny 2 Dawning Oven

1 Dec 22, 2021
Toolbox for OCR post-correction

Ochre Ochre is a toolbox for OCR post-correction. Please note that this software is experimental and very much a work in progress! Overview of OCR pos

National Library of the Netherlands / Research 117 Nov 10, 2022
Responsive Doc. scanner using U^2-Net, Textcleaner and Tesseract

Responsive Doc. scanner using U^2-Net, Textcleaner and Tesseract Toolset U^2-Net is used for background removal Textcleaner is used for image cleaning

3 Jul 13, 2022
PyQT5 app that colorize black & white pictures using CNN(use pre-trained model which was made with OpenCV)

About PyQT5 app that colorize black & white pictures using CNN(use pre-trained model which was made with OpenCV) Colorizor Приложение для проекта Yand

1 Apr 04, 2022
Implementation of our paper 'PixelLink: Detecting Scene Text via Instance Segmentation' in AAAI2018

Code for the AAAI18 paper PixelLink: Detecting Scene Text via Instance Segmentation, by Dan Deng, Haifeng Liu, Xuelong Li, and Deng Cai. Contributions

758 Dec 22, 2022
Bu uygulamada Python ve Opencv kullanarak bilgisayar kamerasından yüz tespiti yapıyoruz.

opencv_yuz_bulma Bu uygulamada Python ve Opencv kullanarak bilgisayar kamerasından yüz tespiti yapıyoruz. Bilgisarın kendi kamerasını kullanmak için;

Ahmet Haydar Ornek 6 Apr 16, 2022
The code of "Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes"

Mask TextSpotter A Pytorch implementation of Mask TextSpotter along with its extension can be find here Introduction This is the official implementati

Pengyuan Lyu 261 Nov 21, 2022
SRA's seminar on Introduction to Computer Vision Fundamentals

Introduction to Computer Vision This repository includes basics to : Python Numpy: A python library Git Computer Vision. The aim of this repository is

Society of Robotics and Automation 147 Dec 04, 2022
Python Computer Vision from Scratch

This repository explores the variety of techniques commonly used to analyze and interpret images. It also describes challenging real-world applications where vision is being successfully used, both f

Milaan Parmar / Милан пармар / _米兰 帕尔马 221 Dec 26, 2022
PAGE XML format collection for document image page content and more

PAGE-XML PAGE XML format collection for document image page content and more For an introduction, please see the following publication: http://www.pri

PRImA Research Lab 46 Nov 14, 2022
Multi-choice answer sheet correction system using computer vision with opencv & python.

Multi choice answer correction 🔴 5 answer sheet samples with a specific solution for detecting answers and sheet correction. 🔴 By running the soluti

Reza Firouzi 7 Mar 07, 2022
Resizing Canny Countour In Python

Resizing_Canny_Countour Install Visual Studio Code , https://code.visualstudio.com/download Select Python and install with terminal( pip install openc

Walter Ng 1 Nov 07, 2021
Forked from argman/EAST for the ICPR MTWI 2018 CHALLENGE

EAST_ICPR: EAST for ICPR MTWI 2018 CHALLENGE Introduction This is a repository forked from argman/EAST for the ICPR MTWI 2018 CHALLENGE. Origin Reposi

Haozheng Li 157 Aug 23, 2022