TCL: Vision-Language Pre-Training with Triple Contrastive Learning (CVPR 2022)

Overview


News

(03/16/2022) Uploaded retrieval checkpoints fine-tuned on COCO and Flickr.


This is the official PyTorch implementation of TCL.
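TCL pre-trains with three contrastive objectives: cross-modal alignment (CMA) between image and text embeddings, intra-modal contrastive (IMC) learning within each modality, and local MI maximization (LMI) between local and global representations. For orientation only, here is a minimal InfoNCE sketch of how a CMA term and an IMC term combine; all names, shapes, and the toy augmentation are assumptions for illustration, not the repository's actual API:

import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # a, b: (batch, dim) L2-normalized embeddings; matched pairs share a row index.
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric cross-entropy over the a->b and b->a directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# hypothetical embeddings standing in for the vision/text encoder outputs
img_emb = F.normalize(torch.randn(8, 256), dim=-1)
txt_emb = F.normalize(torch.randn(8, 256), dim=-1)
img_aug = F.normalize(img_emb + 0.01 * torch.randn_like(img_emb), dim=-1)  # stand-in for an augmented view

loss = info_nce(img_emb, txt_emb) + info_nce(img_emb, img_aug)  # CMA term + one IMC term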


Requirements:

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
pip install transformers==4.8.1
pip install timm==0.4.9
conda install ruamel_yaml
pip install opencv-python
pip install --upgrade Pillow
pip install einops
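After installing, you can optionally sanity-check the environment:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import transformers, timm; print(transformers.__version__, timm.__version__)"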

Pre-training Datasets:

Downstream-task Datasets:

JSON Files for Pre-training and Downstream Tasks:

  • refer to the Download section in ALBEF
  • update the image paths in the JSON files to point at your downloaded images (see the helper sketch after this list)
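A minimal helper for the path rewrite; the 'image' key and a flat list of records are assumptions based on ALBEF-style annotation files, so adapt it to the actual JSON schema:

import json
import sys

def remap_image_paths(json_file, old_root, new_root):
    # Load one annotation file and swap the image-path prefix in every record.
    with open(json_file) as f:
        data = json.load(f)
    for item in data:
        item['image'] = item['image'].replace(old_root, new_root)  # 'image' key assumed
    with open(json_file, 'w') as f:
        json.dump(data, f)

if __name__ == '__main__':
    remap_image_paths(sys.argv[1], sys.argv[2], sys.argv[3])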

Pre-trained checkpoint:

Pre-training:

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Pretrain.py \
--config ./configs/Pretrain.yaml \
--output_dir output/pretrain
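The commands here assume 8 GPUs. To run on fewer, lower --nproc_per_node; you will presumably also want to adjust the per-GPU batch size in the corresponding config to keep the effective batch size comparable. For example, on 4 GPUs:

python -m torch.distributed.launch --nproc_per_node=4 \
--use_env Pretrain.py \
--config ./configs/Pretrain.yaml \
--output_dir output/pretrain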

Downstream Tasks:

Image-Text Retrieval

# zero-shot coco 
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_coco.yaml \
--output_dir output/pretrain_e30_Retrieval_coco_zeroshot \
--checkpoint output/pretrain/checkpoint_29.pth \
--evaluate

# fine-tune flickr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/pretrain_e30_Retrieval_flickr \
--checkpoint output/pretrain/checkpoint_29.pth

# fine-tune coco
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_coco.yaml \
--output_dir output/pretrain_e30_Retrieval_coco \
--checkpoint output/pretrain/checkpoint_29.pth

# zero-shot flickr 
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/pretrain_e30_Retrieval_flickr_zeroshot \
--checkpoint output/pretrain_e30_Retrieval_coco/checkpoint_best.pth \
--evaluate
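For reference, retrieval evaluation amounts to ranking an image-text similarity matrix and reporting Recall@K. A self-contained sketch with dummy embeddings (not the repository's evaluation code):

import torch

def recall_at_k(sim, k=1):
    # sim: (num_images, num_texts); ground-truth pairs sit on the diagonal here.
    ranks = sim.argsort(dim=1, descending=True)
    gt = torch.arange(sim.size(0)).unsqueeze(1)
    return (ranks[:, :k] == gt).any(dim=1).float().mean().item()

sim = torch.randn(100, 100)  # stand-in for img_emb @ txt_emb.t()
print({f"R@{k}": recall_at_k(sim, k) for k in (1, 5, 10)})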

VQA

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env VQA.py \
--config ./configs/VQA.yaml \
--output_dir output/pretrain_e30_vqa \
--checkpoint output/pretrain/checkpoint_29.pth

Visual Entailment

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env VE.py \
--config ./configs/VE.yaml \
--output_dir output/pretrain_e30_VE \
--checkpoint output/pretrain/checkpoint_29.pth

NLVR2

Following ALBEF, the model first gets an extra pre-training step to adapt it to paired-image inputs; the resulting checkpoint is then fine-tuned on NLVR2.

# pre-train nlvr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Pretrain_nlvr.py \
--config ./configs/NLVR_pretrain.yaml \
--output_dir output/pretrain_e30_NLVR_pretrain \
--checkpoint output/pretrain/checkpoint_29.pth

# fine-tune nlvr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env NLVR.py \
--config ./configs/NLVR.yaml \
--output_dir output/pretrain_e30_NLVR \
--checkpoint output/pretrain_e30_NLVR_pretrain/checkpoint_00.pth

Citation:

@inproceedings{yang2022vision,
  title={Vision-Language Pre-Training with Triple Contrastive Learning},
  author={Yang, Jinyu and Duan, Jiali and Tran, Son and Xu, Yi and Chanda, Sampath and Chen, Liqun and Zeng, Belinda and Chilimbi, Trishul and Huang, Junzhou},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}

Acknowledgement: our code is largely borrowed from ALBEF.
