ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Last update: Dec 23, 2022

Overview

ROSITA

News & Updates

(24/08/2021)

Released a demo that performs fine-grained semantic alignment with the pretrained ROSITA model.

(15/08/2021)

Released the basic framework for ROSITA, including the pretrained base ROSITA model and the scripts for fine-tuning and evaluation on three downstream tasks (VQA, REC, and ITR) over six datasets.

Introduction

This repository contains the source code necessary to reproduce the results presented in our ACM MM paper ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration. ROSITA encodes the cROSs- and InTrA-modal prior knowledge in a unified scene graph to perform knowledge-guided vision-and-language pretraining. Compared with existing counterparts, ROSITA learns better fine-grained semantic alignments across different modalities, thus improving the capability of the pretrained model.

Performance

We compare ROSITA against existing state-of-the-art VLP methods on three downstream tasks. For a fair comparison, all methods use the base-size Transformer model. The trained checkpoints needed to reproduce these results are provided in finetune.md.

Task	Dataset	Splits / metrics	ROSITA	SoTA-base
VQA	VQAv2	dev | std	73.91 | 73.97	73.59 | 73.67
REC	RefCOCO	val | testA | testB	84.79 | 87.99 | 78.28	81.56 | 87.40 | 74.48
REC	RefCOCO+	val | testA | testB	76.06 | 82.01 | 67.40	76.05 | 81.65 | 65.70
REC	RefCOCOg	val | test	78.23 | 78.25	75.90 | 75.93
ITR	IR-COCO	R@1 | R@5 | R@10	54.40 | 80.92 | 88.60	54.00 | 80.80 | 88.50
ITR	TR-COCO	R@1 | R@5 | R@10	71.26 | 91.62 | 95.58	70.00 | 91.10 | 95.50
ITR	IR-Flickr	R@1 | R@5 | R@10	74.08 | 92.44 | 96.08	74.74 | 92.86 | 95.82
ITR	TR-Flickr	R@1 | R@5 | R@10	88.90 | 98.10 | 99.30	86.60 | 97.90 | 99.20
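
To make the comparison above easier to read, the per-column gap between ROSITA and SoTA-base can be computed directly from the reported numbers. The snippet below is only an illustrative sketch and is not part of the repository: the names `RESULTS` and `score_deltas` are hypothetical, the scores are copied from the table, and the three retrieval columns are read as R@1, R@5, and R@10.

```python
# Illustrative only: scores are copied from the table above as
# (ROSITA, SoTA-base) pairs. RESULTS and score_deltas are hypothetical names.
RESULTS = {
    "VQAv2 dev": (73.91, 73.59), "VQAv2 std": (73.97, 73.67),
    "RefCOCO val": (84.79, 81.56), "RefCOCO testA": (87.99, 87.40),
    "RefCOCO testB": (78.28, 74.48),
    "RefCOCO+ val": (76.06, 76.05), "RefCOCO+ testA": (82.01, 81.65),
    "RefCOCO+ testB": (67.40, 65.70),
    "RefCOCOg val": (78.23, 75.90), "RefCOCOg test": (78.25, 75.93),
    "IR-COCO R@1": (54.40, 54.00), "IR-COCO R@5": (80.92, 80.80),
    "IR-COCO R@10": (88.60, 88.50),
    "TR-COCO R@1": (71.26, 70.00), "TR-COCO R@5": (91.62, 91.10),
    "TR-COCO R@10": (95.58, 95.50),
    "IR-Flickr R@1": (74.08, 74.74), "IR-Flickr R@5": (92.44, 92.86),
    "IR-Flickr R@10": (96.08, 95.82),
    "TR-Flickr R@1": (88.90, 86.60), "TR-Flickr R@5": (98.10, 97.90),
    "TR-Flickr R@10": (99.30, 99.20),
}


def score_deltas(results):
    """Return ROSITA minus SoTA-base for every column, rounded to 2 decimals."""
    return {name: round(rosita - sota, 2) for name, (rosita, sota) in results.items()}


if __name__ == "__main__":
    deltas = score_deltas(RESULTS)
    for name, delta in deltas.items():
        print(f"{name:16s} {delta:+.2f}")
    # Columns where the reported ROSITA score is below SoTA-base.
    print("below SoTA-base:", [name for name, d in deltas.items() if d < 0])
```

On the reported numbers, ROSITA is higher in 20 of the 22 columns; the two exceptions are IR-Flickr R@1 and R@5. The largest gains are on the REC datasets (for example, RefCOCO testB).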

Installation

Software and Hardware Requirements

We recommend a workstation with 4 GPUs (>= 24GB, e.g., RTX 3090 or V100), 120GB of memory, and 50GB of free disk space. We strongly recommend using an SSD to guarantee high-speed I/O. You should also install the following packages first:

Python >= 3.6
PyTorch >= 1.4 with CUDA >= 10.2
torchvision >= 0.5.0
Cython
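
The requirements above can be checked before building. The following is a minimal sketch and not a script shipped with the repository: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the minimum versions are copied from the list above. It relies only on the standard library plus the `__version__` attributes of the packages and `torch.version.cuda`, and reports a package as failing if it cannot be imported.

```python
import re
import sys
from importlib import import_module

# Minimum versions copied from the requirements list above.
MIN_PYTHON = (3, 6)
MIN_CUDA = (10, 2)
MIN_PACKAGES = {"torch": (1, 4), "torchvision": (0, 5, 0), "Cython": (0,)}


def parse_version(text):
    """Turn a string such as '1.10.0+cu102' into (1, 10, 0)."""
    parts = []
    for piece in str(text).split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Compare two version tuples after padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    pad = lambda v: tuple(v) + (0,) * (width - len(v))
    return pad(found) >= pad(minimum)


def check_environment():
    """Return {name: bool}; False means missing or older than required."""
    report = {"python": meets_minimum(tuple(sys.version_info[:3]), MIN_PYTHON)}
    for name, minimum in MIN_PACKAGES.items():
        try:
            module = import_module(name)
            report[name] = meets_minimum(parse_version(module.__version__), minimum)
        except ImportError:
            report[name] = False
    try:
        cuda = import_module("torch").version.cuda  # None on CPU-only builds
        report["cuda"] = cuda is not None and meets_minimum(parse_version(cuda), MIN_CUDA)
    except ImportError:
        report["cuda"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Running the file directly prints one line per requirement. The check only compares version numbers; it does not verify that the GPUs are visible or that the build step below succeeds.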

# git clone
$ git clone https://github.com/MILVLG/rosita.git 

# build essential utils
$ cd rosita/rosita/utils/rec
$ python setup.py build
$ cp build/lib*/bbox.cpython*.so .

Dataset Setup

See datasets.md for instructions on downloading the datasets required to run this project.

Pretraining

Please check pretrain.md for details on ROSITA pretraining. We currently provide only the pretrained model, which can be used for finetuning on downstream tasks. The pretraining code will be released later.

Finetuning

Please check finetune.md for details on finetuning on downstream tasks. We provide the finetuning scripts as well as trained models that can be evaluated directly to reproduce the results.

Demo

We provide Jupyter notebooks for reproducing the visualization results shown in our paper.

Acknowledgment

We are grateful to well-known open-source projects such as LXMERT, UNITER, OSCAR, and Huggingface, which helped us a lot when writing our code.

Yuhao Cui (@cuiyuhao1996) and Tong-An Luo (@Zoroaster97) are the main contributors to this repository. Please contact them if you find any issues.

Citations

Please consider citing this paper if you use the code:

@inProceedings{cui2021rosita,
  title={ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration},
  author={Cui, Yuhao and Yu, Zhou and Wang, Chunqi and Zhao, Zhongzhou and Zhang, Ji and Wang, Meng and Yu, Jun},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  year={2021}
}

Owner

Vision and Language Group@ MIL
