project page for VinVL

Last update: Jan 09, 2023

Related tags

Deep Learning VinVL

Overview

VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

02/28/2021: Project page built.

Introduction

This repository is the project page for VinVL, containing necessary instructions to reproduce the results presented in the paper. We presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (code), the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR (code), and utilize an improved approach to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks.

Performance

Task	t2i	t2i	i2t	i2t	IC	IC	IC	IC	NoCaps	NoCaps	VQA	NLVR2	GQA
Metric	[email protected]	[email protected]	[email protected]	[email protected]	[email protected]	M	C	S	C	S	test-std	test-P	test-std
SoTA_S	39.2	68.0	56.6	84.5	38.9	29.2	129.8	22.4	61.5	9.2	70.92	58.80	63.17
SoTA_B	54.0	80.8	70.0	91.1	40.5	29.7	137.6	22.8	86.58	12.38	73.67	79.30	61.62
SoTA_L	57.5	82.8	73.5	92.2	41.7	30.6	140.0	24.5	-	-	74.93	81.47	-
-----	---	---	---	---	---	---	---	---	---	---	---	---	---
VinVL_B	58.1	83.2	74.6	92.6	40.9	30.9	140.6	25.1	92.46	13.07	76.12	83.08	64.65
VinVL_L	58.8	83.5	75.4	92.9	41.0	31.1	140.9	25.2	-	-	76.62	83.98	-
gain	1.3	0.7	1.9	0.6	-0.7	0.5	0.9	0.7	5.9	0.7	1.69	2.51	1.48

t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO.

Leaderboard results

VinVL has achieved top-position in several VL leaderboards, including Visual Question Answering (VQA), Microsoft COOC Image Captioning, Novel Object Captioning (nocaps), and Visual Commonsense Reasoning (VCR).

Comparison with image features from bottom-up and top-down model (code).

We observe uniform improvements on seven VL tasks by replacing visual features from bottom-up and top-down model with ours. The NoCaps baseline is from VIVO, and our results are obtained by directly replacing the visual features. The baselines for rest tasks are from OSCAR, and our results are obtained by replacing the visual features and performing OSCAR+ pre-training. All models are BERT-Base size. As analyzed in Section 5.2 in the VinVL paper, the new visual features contributes 95% of the improvement.

Task	t2i	t2i	i2t	i2t	IC	IC	IC	IC	NoCaps	NoCaps	VQA	NLVR2	GQA
metric	[email protected]	[email protected]	[email protected]	[email protected]	[email protected]	M	C	S	C	S	test-std	test-P	test-std
bottom-up and top-down model	54.0	80.8	70.0	91.1	40.5	29.7	137.6	22.8	86.58	12.38	73.16	78.07	61.62
VinVL (ours)	58.1	83.2	74.6	92.6	40.9	30.9	140.6	25.1	92.46	13.07	75.95	83.08	64.65
gain	4.1	2.4	4.6	1.5	0.4	1.2	3.0	2.3	5.9	0.7	2.79	4.71	3.03

Please see the following two figures for visual comparison.

Source code

Pretrained Faster-RCNN model and feature extraction

The pretrained X152-C4 object-attribute detection can be downloaded here. With code from our Scene Graph Benchmark Repo (to be released soon), one can extract features with following command:

python tools/test_sg_net.py --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "../maskrcnn-benchmark-1/datasets1" TEST.IGNORE_BOX_REGRESSION True MODEL.ATTRIBUTE_ON True TEST.OUTPUT_FEATURE True

The output feature will be encoded as base64.

Find more pretrained models in DOWNLOAD.

Pre-exacted Image Features

For ease-of-use, we make pretrained features and predictions available for all pretraining datasets and downstream tasks. Please find the instructions to download them in DOWNLOAD.

Pretraind Oscar+ models and VL downstream tasks

The code to produce all vision-language results (both pretraining and downstream task finetuning) can be found in our OSCAR repo. One can find the model zoo for vision-language tasks here.

Citations

Please consider citing this paper if you use the code:

@article{li2020oscar,
  title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
  author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
  journal={ECCV 2020},
  year={2020}
}

@article{zhang2021vinvl,
  title={VinVL: Making Visual Representations Matter in Vision-Language Models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  journal={CVPR 2021},
  year={2021}
}

project page for VinVL

Related tags

Overview

VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

Introduction

Performance

Leaderboard results

Comparison with image features from bottom-up and top-down model (code).

Source code

Pretrained Faster-RCNN model and feature extraction

Pre-exacted Image Features

Pretraind Oscar+ models and VL downstream tasks

Citations

Owner

TC-GNN with Pytorch integration

Official Repsoitory for "Mish: A Self Regularized Non-Monotonic Neural Activation Function" [BMVC 2020]

Tensorflow implementation of Swin Transformer model.

Advancing Self-supervised Monocular Depth Learning with Sparse LiDAR

✨风纪委员会自动投票脚本，利用Github Action帮你进行裁决操作（为了让其他风纪委员有案件可判，本程序从中午12点才开始运行，有需要请自己修改运行时间）

Regularized Frank-Wolfe for Dense CRFs: Generalizing Mean Field and Beyond

Text to image synthesis using thought vectors

Contra is a lightweight, production ready Tensorflow alternative for solving time series prediction challenges with AI

StarGAN - Official PyTorch Implementation (CVPR 2018)

Hyperparameters tuning and features selection are two common steps in every machine learning pipeline.

Unofficial PyTorch implementation of the Adaptive Convolution architecture for image style transfer

Official implementation of DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representations in TensorFlow 2

Official implementation for the paper "Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection"

All-in-one Docker container that allows a user to explore Nautobot in a lab environment.

code for CVPR paper Zero-shot Instance Segmentation

You can draw the corresponding bounding box into the image and save it according to the result file (txt format) run by the tracker.

This is the formal code implementation of the CVPR 2022 paper 'Federated Class Incremental Learning'.

GLaRA: Graph-based Labeling Rule Augmentation for Weakly Supervised Named Entity Recognition

Interactive Visualization to empower domain experts to align ML model behaviors with their knowledge.

The code succinctly shows how our ensemble learning based on deep learning CNN is used for LAM-avulsion-diagnosis.