Code and data for ImageCoDe, a contextual vision-and-language benchmark

Overview

This repository contains code and data for ImageCoDe: Image Retrieval from Contextual Descriptions.

Data

All collected descriptions for the training and validation set are under data/train_data.json and data/valid_data.json.

Image sets can be downloaded from Zenodo or Google Drive and should be unzipped in data/.

You can download them from the command line via:

wget https://zenodo.org/record/6518944/files/image-sets.zip

For ViLBERT experiments, download a pretrained ViLBERT checkpoint from the volta repository by clicking on ViLBERT in its table of models. Save the downloaded file as baselines/vilbert/vilbert-pretrained.bin. Since ViLBERT uses image features from Faster R-CNN, you also have to download these features for all ImageCoDe images from Google Drive. Save the file as data/rcnn-features36-36.lmdb. The same procedure applies for UNITER.

The format for data/train_data.json looks like this:

{
  "MSR-VTT-videoTrainValVideo_video2044-shot1_0": {
    "6": "a mom holding her babies in the middle of the picture, no other image intervenes with the image.",
    "7": "The image is fading between a woman holding a baby and a woman sitting with a red background. The hands of the woman sitting aren't visible."
  },
  "video-storytelling-videochristmas_56Nm66j-i5Q-shot14_2": {
  "..."
  }
}
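As a rough sketch of how this file might be read in Python, the helpers below flatten the nested mapping into (image set, index, description) triples. They assume the inner keys ("6", "7") are the indices of the described images within their set, which the example suggests but does not state. The names `load_annotations` and `iter_examples` are hypothetical and not part of this repository.

```python
import json
from typing import Dict, Iterator, Tuple

# Assumed layout: {image_set_name: {image_index_as_string: description}}
Annotations = Dict[str, Dict[str, str]]


def load_annotations(path: str) -> Annotations:
    """Read an annotation file such as data/train_data.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_examples(annotations: Annotations) -> Iterator[Tuple[str, int, str]]:
    """Yield (image_set, image_index, description) triples.

    Assumption: each inner key is the index of the described image,
    stored as a string, so it is converted to int here.
    """
    for image_set, descriptions in annotations.items():
        for index, text in descriptions.items():
            yield image_set, int(index), text
```

With the data in place, `list(iter_examples(load_annotations("data/train_data.json")))` would give one triple per description.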

The images under data/ are organized as follows. Each folder contains 10 images. If the images are video frames, the number X in imgX.jpg indicates the frame number:

  .
  ├── MSR-VTT-videoTrainValVideo_video2044-shot1_0
  │   ├── img0.jpg
  │   ├── img7.jpg
  │   ├── ...
  ├── video-storytelling-videochristmas_56Nm66j-i5Q-shot14_2
  │   ├── ...
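Because X is a frame number rather than a zero-padded counter, a plain string sort would place img12.jpg before img7.jpg. The sketch below orders file names numerically. The helper names are hypothetical, and whether the position in this ordering corresponds to the image indices 0 to 9 used in the annotations is an assumption that should be checked against the baseline code.

```python
import re
from typing import Iterable, List, Optional

# Matches names of the form imgX.jpg, where X is a non-negative integer.
_IMG_PATTERN = re.compile(r"img(\d+)\.jpg")


def frame_number(filename: str) -> Optional[int]:
    """Return X for a name of the form imgX.jpg, or None for other files."""
    match = _IMG_PATTERN.fullmatch(filename)
    return int(match.group(1)) if match else None


def sort_by_frame(filenames: Iterable[str]) -> List[str]:
    """Keep only imgX.jpg names and order them by X numerically."""
    numbered = [(frame_number(name), name) for name in filenames]
    kept = [(number, name) for number, name in numbered if number is not None]
    return [name for _, name in sorted(kept)]
```

For one image set, this could be called as `sort_by_frame(p.name for p in pathlib.Path(set_dir).iterdir())`, where `set_dir` is the path of a folder under data/.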

Leaderboard

With this data you can train your model and evaluate it on the unlabeled test set, which has the following format:

{
  "MSR-VTT-videoTestVideo_video7763-shot2_1": [
    "The team name on shirt is visible without a number, but all letters can be seen for team name.",
    "the player can be seen with him on the left close to the logo on the pitch on the right and can be clearly seen"
  ],
  "...":
  ["..."]
}

To appear on the leaderboard, please submit your results in the following format:

{
  "MSR-VTT-videoTestVideo_video7763-shot2_1": [
    1,
    2
  ],
  "...":
  ["..."]
}

Here, the example values 1 and 2 are predicted image indices, which range from 0 to 9. You can submit to the leaderboard by sending your test set file (or a download link) to [email protected], and we will update the leaderboard within 1-2 days. The leaderboard is maintained on the project website, and its submission procedure might change at some point.
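Before sending a file, it may help to check it against the format above. The sketch below is one possible check, not an official validator. It assumes one predicted index per test description, in the same order, which the two examples suggest; `validate_predictions` and `dump_predictions` are hypothetical names.

```python
import json
from typing import Dict, List


def validate_predictions(
    test_data: Dict[str, List[str]],
    predictions: Dict[str, List[int]],
    num_images: int = 10,
) -> None:
    """Raise ValueError if predictions do not line up with the test set."""
    for image_set, descriptions in test_data.items():
        if image_set not in predictions:
            raise ValueError(f"missing image set: {image_set}")
        indices = predictions[image_set]
        # Assumption: exactly one prediction per description.
        if len(indices) != len(descriptions):
            raise ValueError(f"wrong number of predictions for {image_set}")
        for index in indices:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(index, bool) or not isinstance(index, int):
                raise ValueError(f"non-integer index in {image_set}: {index!r}")
            if not 0 <= index < num_images:
                raise ValueError(f"index out of range in {image_set}: {index}")


def dump_predictions(predictions: Dict[str, List[int]]) -> str:
    """Serialize predictions as JSON text in the layout shown above."""
    return json.dumps(predictions, indent=2)
```

The string returned by `dump_predictions` can be written to a file and sent as the submission.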

Installations

Run install.sh to set up the CLIP experiments. For ViLBERT, follow the installation instructions for volta.

Code

Code for CLIP is under baselines/clip, and code for ViLBERT/UNITER is under baselines/crossencoders.

For detailed commands to run each model variant shown in the paper, see the README in baselines.

For example, to train the best-performing model, CLIP+TemporalEmbeddings, run:

python3 contextual.py --lr 2e-6 --lr_head 1e-4 -b 36 -m ViT-B/16 --fusion mult -a gelu --logit_scale 1000 --finetuned_checkpoint_path checkpoints/CONTRA_clip_best__36_4e-06_30_1395526.pt --add_input --frozen_clip --positional

Data Analysis

Our manual annotation of various phenomena (negation, nuances, ...) in the validation set can be found under data/manual_annotation_valid.yaml.

License

This work is licensed under the MIT license. See LICENSE for details. Third-party software and data sets are subject to their respective licenses.
If you want to cite our paper, please use:

@inproceedings{krojer_contextual_2022,
  address = {Online},
  title = {Image Retrieval from Contextual Descriptions},
  booktitle = {Proceedings of the 60th {Annual} {Meeting} of the {Association} for {Computational} {Linguistics}},
  publisher = {Association for Computational Linguistics},
  author = {Krojer, Benno and Adlakha, Vaibhav and Vineet, Vibhav and Goyal, Yash and Ponti, Edoardo and Reddy, Siva},
  month = may,
  year = {2022},
}

Acknowledgement

Our data (specifically the image sets) are built upon three video datasets and Open Images.

We also use the volta repository for the ViLBERT and UNITER baseline variants.

For questions or feedback, don't hesitate to contact the author: [email protected]

Owner
McGill NLP
Research group within McGill University and Mila focusing on various topics in natural language processing.