ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

News & Updates

(24/08/2021)

  • Release the demo to perform fine-grained semantic alignments using the pretrained ROSITA model.

(15/08/2021)

  • Release the basic framework for ROSITA, including the pretrained base ROSITA model, as well as the scripts to run the fine-tuning and evaluation on three downstream tasks (i.e., VQA, REC, ITR) over six datasets.

Introduction

This repository contains the source code needed to reproduce the results presented in our ACM MM paper ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration, which encodes cROSs- and InTrA-modal prior knowledge in a unified scene graph to perform knowledge-guided vision-and-language pretraining. Compared with existing counterparts, ROSITA learns better fine-grained semantic alignments across modalities, which improves the capability of the pretrained model.

Performance

We compare ROSITA against existing state-of-the-art VLP methods on three downstream tasks. All methods use the base-size Transformer model for a fair comparison. The trained checkpoints for reproducing these results are provided in finetune.md.

Task  Dataset    Split / Metric        ROSITA                 SoTA-base
VQA   VQAv2      dev | std             73.91 | 73.97          73.59 | 73.67
REC   RefCOCO    val | testA | testB   84.79 | 87.99 | 78.28  81.56 | 87.40 | 74.48
REC   RefCOCO+   val | testA | testB   76.06 | 82.01 | 67.40  76.05 | 81.65 | 65.70
REC   RefCOCOg   val | test            78.23 | 78.25          75.90 | 75.93
ITR   IR-COCO    R@1 | R@5 | R@10      54.40 | 80.92 | 88.60  54.00 | 80.80 | 88.50
ITR   TR-COCO    R@1 | R@5 | R@10      71.26 | 91.62 | 95.58  70.00 | 91.10 | 95.50
ITR   IR-Flickr  R@1 | R@5 | R@10      74.08 | 92.44 | 96.08  74.74 | 92.86 | 95.82
ITR   TR-Flickr  R@1 | R@5 | R@10      88.90 | 98.10 | 99.30  86.60 | 97.90 | 99.20
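
The ITR results above are recall-at-K scores: the percentage of queries for which a correct match appears among the top K retrieved candidates. The following is a minimal sketch of that metric, not the evaluation code of this repository; the function name `recall_at_k` and the toy data are assumptions made for illustration.

```python
from typing import Dict, Sequence, Set


def recall_at_k(
    ranked: Sequence[Sequence[int]],
    relevant: Sequence[Set[int]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Return {k: percentage of queries with a relevant item in the top k}.

    ranked[i]   -- candidate ids for query i, best match first
    relevant[i] -- ids that count as correct for query i
    """
    if len(ranked) != len(relevant):
        raise ValueError("ranked and relevant must have the same length")
    if not ranked:
        raise ValueError("at least one query is required")

    scores = {}
    for k in ks:
        # A query is a hit if any of its top-k candidates is relevant.
        hits = sum(
            1
            for candidates, gold in zip(ranked, relevant)
            if any(c in gold for c in candidates[:k])
        )
        scores[k] = 100.0 * hits / len(ranked)
    return scores


# Toy example (hypothetical ids): three queries, three candidates each.
ranked = [[3, 0, 7], [1, 2, 5], [4, 6, 2]]
relevant = [{0}, {1}, {9}]
scores = recall_at_k(ranked, relevant, ks=(1, 2))
print(scores)
```

In the toy data only the second query is matched at rank 1 and the third query is never matched, so R@1 is one third and R@2 is two thirds of the queries.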

Installation

Software and Hardware Requirements

We recommend a workstation with 4 GPUs (>= 24GB, e.g., RTX 3090 or V100), 120GB of memory, and 50GB of free disk space. We strongly recommend using an SSD to guarantee high-speed I/O. You should also install the following packages first:

  • Python >= 3.6
  • PyTorch >= 1.4 with CUDA >= 10.2
  • torchvision >= 0.5.0
  • Cython

Then clone the repository and build the essential utilities:

# git clone
$ git clone https://github.com/MILVLG/rosita.git

# build essential utils
$ cd rosita/rosita/utils/rec
$ python setup.py build
$ cp build/lib*/bbox.cpython*.so .
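
Before moving on, it can help to confirm that the environment meets the requirements listed above. The script below is a rough sketch and is not part of this repository; the helper names (`parse_version`, `version_at_least`, `installed_version`, `check_environment`) are assumptions made for illustration. It checks only the Python, PyTorch, torchvision, and Cython requirements and the free disk space. GPU count, GPU memory, and the CUDA version are not checked.

```python
import importlib
import re
import shutil
import sys
from typing import Dict, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.10.0+cu113' into (1, 10, 0); build suffixes are ignored."""
    public = text.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def version_at_least(found: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding with zeros."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(module_name: str) -> Optional[str]:
    """Return the module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:  # missing package or a broken native dependency
        return None
    return getattr(module, "__version__", None)


def check_environment(path: str = ".") -> Dict[str, bool]:
    """Check the requirements listed in this README (minimums taken from it)."""
    report = {"python": sys.version_info >= (3, 6)}
    for name, minimum in (("torch", "1.4"), ("torchvision", "0.5.0")):
        found = installed_version(name)
        report[name] = found is not None and version_at_least(found, minimum)
    report["cython"] = installed_version("Cython") is not None
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    report["disk_50gb"] = free_gb >= 50
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item:12s} {'ok' if ok else 'missing or below the minimum'}")
```

Running the script from the directory where the datasets will be stored makes the disk-space check refer to the right drive.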

Dataset Setup

To download the required datasets to run this project, please check datasets.md for details.

Pretraining

Please check pretrain.md for details on ROSITA pretraining. We currently provide only the pretrained model for finetuning on downstream tasks; the pretraining code will be released later.

Finetuning

Please check finetune.md for details on finetuning on downstream tasks. Finetuning scripts are provided, along with trained models that can be evaluated directly to reproduce the results.

Demo

We provide Jupyter notebooks for reproducing the visualization results shown in our paper.

Acknowledgment

We are grateful to well-known open-source projects such as LXMERT, UNITER, OSCAR, and Huggingface, which helped us a lot when writing our code.

Yuhao Cui (@cuiyuhao1996) and Tong-An Luo (@Zoroaster97) are the main contributors to this repository. Please contact them if you find any issues.

Citations

Please consider citing this paper if you use the code:

@inProceedings{cui2021rosita,
  title={ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration},
  author={Cui, Yuhao and Yu, Zhou and Wang, Chunqi and Zhao, Zhongzhou and Zhang, Ji and Wang, Meng and Yu, Jun},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  year={2021}
}
Owner
Vision and Language Group@ MIL
Hangzhou Dianzi University