ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Overview

ROSITA

News & Updates

(24/08/2021)

  • Release the demo to perform fine-grained semantic alignments using the pretrained ROSITA model.

(15/08/2021)

  • Release the basic framework for ROSITA, including the pretrained base ROSITA model, as well as the scripts to run the fine-tuning and evaluation on three downstream tasks (i.e., VQA, REC, ITR) over six datasets.

Introduction

This repository contains source code necessary to reproduce the results presented in our ACM MM paper ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration, which encodes the cROSs- and InTrA-model prior knowledge in a in a unified scene graph to perform knowledge-guided vision-and-language pretraining. Compared with existing counterparts, ROSITA learns better fine-grained semantic alignments across different modalities, thus improving the capability of the pretrained model.

Performance

We compare ROSITA against existing state-of-the-art VLP methods on three downstream tasks. All methods use the base model of Transformer for a fair comparison. The trained checkpoints to reproduce these results are provided in finetune.md.

Tasks VQA REC ITR
Datasets VQAv2
dev | std
RefCOCO
val | testA | testB
RefCOCO+
val | testA | testB
RefCOCOg
val | test
IR-COCO
[email protected] | [email protected] | [email protected]
TR-COCO
[email protected] | [email protected] | [email protected]
IR-Flickr
[email protected] | [email protected] | [email protected]
TR-Flickr
[email protected] | [email protected] | [email protected]
ROSITA 73.91 | 73.97 84.79 | 87.99 | 78.28 76.06 | 82.01 | 67.40 78.23 | 78.25 54.40 | 80.92 | 88.60 71.26 | 91.62 | 95.58 74.08 | 92.44 | 96.08 88.90 | 98.10 | 99.30
SoTA-base 73.59 | 73.67 81.56 | 87.40 | 74.48 76.05 | 81.65 | 65.70 75.90 | 75.93 54.00 | 80.80 | 88.50 70.00 | 91.10 | 95.50 74.74 | 92.86 | 95.82 86.60 | 97.90 | 99.20

Installation

Software and Hardware Requirements

We recommand a workstation with 4 GPU (>= 24GB, e.g., RTX 3090 or V100), 120GB memory and 50GB free disk space. We strongly recommend to use a SSD drive to guarantee high-speed I/O. Also, you should first install some necessary package as follows:

  • Python >= 3.6
  • PyTorch >= 1.4 with Cuda >=10.2
  • torchvision >= 0.5.0
  • Cython
# git clone
$ git clone https://github.com/MILVLG/rosita.git 

# build essential utils
$ cd rosita/rosita/utils/rec
$ python setup.py build
$ cp build/lib*/bbox.cpython*.so .

Dataset Setup

To download the required datasets to run this project, please check datasets.md for details.

Pretraining

Please check pretrain.md for the details for ROSITA pretraining. We currently only provide the pretrained model to run finetuning on downstream tasks. The codes to run pretraining will be released later.

Finetuning

Please check finetune.md for the details for finetuning on downstream tasks. Scripts to run finetuning on downstream tasks are provided. Also, we provide trained models that can be directly evaluated to reproduce the results.

Demo

We provide the Jupyter notebook scripts for reproducing the visualization results shown in our paper.

Acknowledgment

We appreciate the well-known open-source projects such as LXMERT, UNITER, OSCAR, and Huggingface, which help us a lot when writing our codes.

Yuhao Cui (@cuiyuhao1996) and Tong-An Luo (@Zoroaster97) are the main contributors to this repository. Please kindly contact them if you find any issue.

Citations

Please consider citing this paper if you use the code:

@inProceedings{cui2021rosita,
  title={ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration},
  author={Cui, Yuhao and Yu, Zhou and Wang, Chunqi and Zhao, Zhongzhou and Zhang, Ji and Wang, Meng and Yu, Jun},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  year={2021}
}
Owner
Vision and Language Group@ MIL
Hangzhou Dianzi University
Vision and Language Group@ MIL
Expressive Power of Invariant and Equivaraint Graph Neural Networks (ICLR 2021)

Expressive Power of Invariant and Equivaraint Graph Neural Networks In this repository, we show how to use powerful GNN (2-FGNN) to solve a graph alig

Marc Lelarge 36 Dec 12, 2022
BABEL: Bodies, Action and Behavior with English Labels [CVPR 2021]

BABEL is a large dataset with language labels describing the actions being performed in mocap sequences. BABEL labels about 43 hours of mocap sequences from AMASS [1] with action labels.

113 Dec 28, 2022
Research on controller area network Intrusion Detection Systems

Group members information Member 1: Lixue Liang Member 2: Yuet Lee Chan Member 3: Xinruo Zhang Member 4: Yifei Han User Manual Generate Attack Packets

Roche 4 Aug 30, 2022
Framework for evaluating ANNS algorithms on billion scale datasets.

Billion-Scale ANN http://big-ann-benchmarks.com/ Install The only prerequisite is Python (tested with 3.6) and Docker. Works with newer versions of Py

Harsha Vardhan Simhadri 132 Dec 24, 2022
Udacity's CS101: Intro to Computer Science - Building a Search Engine

Udacity's CS101: Intro to Computer Science - Building a Search Engine All soluti

Phillip 0 Feb 26, 2022
Mask-invariant Face Recognition through Template-level Knowledge Distillation

Mask-invariant Face Recognition through Template-level Knowledge Distillation This is the official repository of "Mask-invariant Face Recognition thro

Fadi Boutros 35 Dec 06, 2022
Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks This repository contains a TensorFlow implementation of "

Jingwei Zheng 5 Jan 08, 2023
Traditional deepdream with VQGAN+CLIP and optical flow. Ready to use in Google Colab

VQGAN-CLIP-Video cat.mp4 policeman.mp4 schoolboy.mp4 forsenBOG.mp4

23 Oct 26, 2022
Open-source Monocular Python HawkEye for Tennis

Tennis Tracking 🎾 Objectives Track the ball Detect court lines Detect the players To track the ball we used TrackNet - deep learning network for trac

ArtLabs 188 Jan 08, 2023
Code for the paper "Curriculum Dropout", ICCV 2017

Curriculum Dropout Dropout is a very effective way of regularizing neural networks. Stochastically "dropping out" units with a certain probability dis

Pietro Morerio 21 Jan 02, 2022
[CVPR 2022] Unsupervised Image-to-Image Translation with Generative Prior

GP-UNIT - Official PyTorch Implementation This repository provides the official PyTorch implementation for the following paper: Unsupervised Image-to-

Shuai Yang 125 Jan 03, 2023
[Pedestron] Generalizable Pedestrian Detection: The Elephant In The Room. @ CVPR2021

Pedestron Pedestron is a MMdetection based repository, that focuses on the advancement of research on pedestrian detection. We provide a list of detec

Irtiza Hasan 594 Jan 05, 2023
Realtime_Multi-Person_Pose_Estimation

Introduction Multi Person PoseEstimation By PyTorch Results Require Pytorch Installation git submodule init && git submodule update Demo Download conv

tensorboy 1.3k Jan 05, 2023
An Efficient Training Approach for Very Large Scale Face Recognition or FΒ²C for simplicity.

Fast Face Classification (FΒ²C) This is the code of our paper An Efficient Training Approach for Very Large Scale Face Recognition or FΒ²C for simplicit

33 Jun 27, 2021
Easy and comprehensive assessment of predictive power, with support for neuroimaging features

Documentation: https://raamana.github.io/neuropredict/ News As of v0.6, neuropredict now supports regression applications i.e. predicting continuous t

Pradeep Reddy Raamana 93 Nov 29, 2022
[ICML 2021] Towards Understanding and Mitigating Social Biases in Language Models

Towards Understanding and Mitigating Social Biases in Language Models This repo contains code and data for evaluating and mitigating bias from generat

Paul Liang 42 Jan 03, 2023
A Real-Time-Strategy game for Deep Learning research

Description DeepRTS is a high-performance Real-TIme strategy game for Reinforcement Learning research. It is written in C++ for performance, but provi

Centre for Artificial Intelligence Research (CAIR) 156 Dec 19, 2022
A repository with exploration into using transformers to predict DNA ↔ transcription factor binding

Transcription Factor binding predictions with Attention and Transformers A repository with exploration into using transformers to predict DNA ↔ transc

Phil Wang 62 Dec 20, 2022
ΠšΠΎΠ½Ρ‚Ρ€ΠΎΠ»ΡŒΠ½Π°Ρ Ρ€Π°Π±ΠΎΡ‚Π° ΠΏΠΎ матСматичСским ΠΌΠ΅Ρ‚ΠΎΠ΄Π°ΠΌ машинного обучСния

ML-MathMethods-Test ΠšΠΎΠ½Ρ‚Ρ€ΠΎΠ»ΡŒΠ½Π°Ρ Ρ€Π°Π±ΠΎΡ‚Π° ΠΏΠΎ матСматичСским ΠΌΠ΅Ρ‚ΠΎΠ΄Π°ΠΌ машинного обучСния. ВычислСниС основных статистик, Π΄ΠΈΠ°Π³Ρ€Π°ΠΌΠΌ ΠΈ Π³Ρ€Π°Ρ„ΠΈΠΊΠΎΠ², ΠΏΡ€ΠΎΠ²Π΅Ρ€ΠΊΠ° Ρ€Π°Π·Π»

Stas Ivanovskii 1 Jan 06, 2022
Combinatorially Hard Games where the levels are procedurally generated

puzzlegen Implementation of two procedurally simulated environments with gym interfaces. IceSlider: the agent needs to reach and stop on the pink squa

Autonomous Learning Group 3 Jun 26, 2022