Oscar and VinVL

Overview

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

05/28/2020: Released finetuned models for downstream tasks; see MODEL_ZOO.md.
05/15/2020: Released pretrained models, datasets, and code for downstream task finetuning.
01/13/2021: Our new work, VinVL, proposes Oscar+, an improved version of Oscar, and provides a better object-attribute detection model for extracting features for V+L tasks. VinVL achieved SOTA performance on all seven V+L tasks listed here. Please stay tuned for the model and code release.
03/08/2021: Released the Oscar+ pretraining code; see the last section of VinVL_MODEL_ZOO.md. All VinVL image features and model checkpoints are also released. Please check VinVL for details.
04/13/2021: Released our Scene Graph Benchmark repo. You are welcome to use the code there to extract image features with VinVL pretrained models.

Introduction

This repository contains the source code needed to reproduce the results presented in the paper Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. We propose a new cross-modal pre-training method, Oscar (Object-Semantics Aligned Pre-training). It uses object tags detected in images as anchor points to significantly ease the learning of image-text alignments. We pre-train Oscar on a public corpus of 6.5 million text-image pairs and fine-tune it on downstream tasks, setting new state-of-the-art results on six well-established vision-language understanding and generation tasks. For more on this project, see the Microsoft Research Blog post.

Performance

| Task | t2i | t2i | i2t | i2t | IC | IC | IC | IC | NoCaps | NoCaps | VQA | NLVR2 | GQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metric | R@1 | R@5 | R@1 | R@5 | B@4 | M | C | S | C | S | test-std | test-P | test-std |
| SoTA_S | 39.2 | 68.0 | 56.6 | 84.5 | 38.9 | 29.2 | 129.8 | 22.4 | 61.5 | 9.2 | 70.92 | 58.80 | 63.17 |
| SoTA_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58 | 12.38 | 73.67 | 79.30 | - |
| SoTA_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | - | - | 74.93 | 81.47 | - |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Oscar_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 78.8 | 11.7 | 73.44 | 78.36 | 61.62 |
| Oscar_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | 80.9 | 11.3 | 73.82 | 80.05 | - |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VinVL_B | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46 | 13.07 | 76.12 | 83.08 | 64.65 |
| VinVL_L | 58.8 | 83.5 | 75.4 | 92.9 | 41.0 | 31.1 | 140.9 | 25.2 | - | - | 76.62 | 83.98 | - |
| gain | 1.3 | 0.7 | 1.9 | 0.6 | -0.7 | 0.5 | 0.9 | 0.7 | 5.9 | 0.7 | 1.69 | 2.51 | 1.48 |

t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO.
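
The README does not say how the gain row is computed. The sketch below assumes it is the best VinVL score minus the best prior SoTA score in each column, and checks that reading against the NoCaps, VQA, NLVR2, and GQA columns. The names `best_per_column` and `gains` are illustrative and not part of the Oscar codebase.

```python
from decimal import Decimal

# Scores copied from the Performance table; None marks a "-" entry.
# Column order: NoCaps C, NoCaps S, VQA test-std, NLVR2 test-P, GQA test-std.
COLUMNS = ["NoCaps C", "NoCaps S", "VQA", "NLVR2", "GQA"]
PRIOR_SOTA = {
    "SoTA_S": ["61.5", "9.2", "70.92", "58.80", "63.17"],
    "SoTA_B": ["86.58", "12.38", "73.67", "79.30", None],
    "SoTA_L": [None, None, "74.93", "81.47", None],
}
VINVL = {
    "VinVL_B": ["92.46", "13.07", "76.12", "83.08", "64.65"],
    "VinVL_L": [None, None, "76.62", "83.98", None],
}


def best_per_column(rows):
    """Return the highest reported score in each column, skipping missing entries."""
    best = []
    for scores in zip(*rows.values()):
        reported = [Decimal(s) for s in scores if s is not None]
        best.append(max(reported))
    return best


def gains(prior, new):
    """Assumed definition: best new score minus best prior score, per column."""
    return [n - p for p, n in zip(best_per_column(prior), best_per_column(new))]


for name, gain in zip(COLUMNS, gains(PRIOR_SOTA, VINVL)):
    print(f"{name}: {gain:+}")
```

Under this assumption the NoCaps differences come out as 5.88 and 0.69, which agree with the table's 5.9 and 0.7 after rounding to one decimal place; the VQA, NLVR2, and GQA differences match the gain row exactly.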

Download

We have released pre-trained models, datasets, VinVL image features, and the Oscar+ pretraining corpus for downstream tasks. Please check VinVL_DOWNLOAD.md for details.

To download checkpoints for the vanilla Oscar, please check DOWNLOAD.md.

Installation

Check INSTALL.md for installation instructions.

Model Zoo

Check MODEL_ZOO.md for scripts to run Oscar downstream finetuning.

Check VinVL_MODEL_ZOO.md for scripts to run Oscar+ pretraining and downstream finetuning.

Citations

Please consider citing these papers if you use the code:

@article{li2020oscar,
  title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
  author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
  journal={ECCV 2020},
  year={2020}
}

@article{zhang2021vinvl,
  title={VinVL: Making Visual Representations Matter in Vision-Language Models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  journal={CVPR 2021},
  year={2021}
}

License

Oscar is released under the MIT license. See LICENSE for details.
