[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Related tags

Deep Learningsoho
Overview

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [CVPR'21, Oral]

By Zhicheng Huang*, Zhaoyang Zeng*, Yupan Huang*, Bei Liu, Dongmei Fu and Jianlong Fu

Introduction

This is the official implementation of the paper. In this paper, we propose SOHO to "See Out of tHe bOx" that takes a whole image as input, and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations which enables inference 10 times faster than region-based approaches.

Architecture

Release Progress

  • VQA Codebase

  • Pre-training Codebase

  • Other Downstream Tasks

Installation

conda create -n soho python=3.7
conda activate soho
git clone https://github.com/researchmm/soho.git
cd soho
bash tools/install.sh

Getting Started

  1. Download the training, validation and test data

    mkdir -p $SOHO_ROOT/data/coco
    cd $SOHO_ROOT/data/coco
    # need to update
    wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/train2014.zip
    wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/val2014.zip
    wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/test2015.zip
    wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/train_data_qa_caption_new_box.json
    wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/val_data_qa_caption_new_box.json
    wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/test_data_qa.json
  2. Download the Pre-training models

    cd $SOHO_ROOT
    mkdir -p $SOHO_ROOT/pretrained
    cd $SOHO_ROOT/pretrained
    # the following need to update
    wget 
  3. Training a VQA model

    cd $SOHO_ROOT
    #use 8 GPUS to train the model
    bash tools/dist_train.sh configs/VQA/soho_res18_vqa.py 8
  4. Evaluate a VQA model

    bash tools/dist_test_vqa.sh configs/VQA/soho_res18_vqa.py 18 8

Citation

If you find this repo useful in your research, please consider citing the following papers:

@inproceedings{huang2021seeing,
  title={Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Huang, Yupan and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}

@article{huang2020pixel,
  title={Pixel-bert: Aligning image pixels with text by deep multi-modal transformers},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  journal={arXiv preprint arXiv:2004.00849},
  year={2020}
}

Acknowledgements

We would like to thank mmcv and mmdetection. Our commons lib is based on mmcv.

Owner
Multimedia Research
Multimedia Research at Microsoft Research Asia
Multimedia Research
Instance-Dependent Partial Label Learning

Instance-Dependent Partial Label Learning Installation pip install -r requirements.txt Run the Demo benchmark-random mnist python -u main.py --gpu 0 -

17 Dec 29, 2022
Draw like Bob Ross using the power of Neural Networks (With PyTorch)!

Draw like Bob Ross using the power of Neural Networks! (+ Pytorch) Learning Process Visualization Getting started Install dependecies Requires python3

Kendrick Tan 116 Mar 07, 2022
The Official TensorFlow Implementation for SPatchGAN (ICCV2021)

SPatchGAN: Official TensorFlow Implementation Paper "SPatchGAN: A Statistical Feature Based Discriminator for Unsupervised Image-to-Image Translation"

39 Dec 30, 2022
PySOT - SenseTime Research platform for single object tracking, implementing algorithms like SiamRPN and SiamMask.

PySOT is a software system designed by SenseTime Video Intelligence Research team. It implements state-of-the-art single object tracking algorit

STVIR 4.1k Dec 29, 2022
This program will stylize your photos with fast neural style transfer.

Neural Style Transfer (NST) Using TensorFlow Demo TensorFlow TensorFlow is an end-to-end open source platform for machine learning. It has a comprehen

Ismail Boularbah 1 Aug 08, 2022
Repository for RNNs using TensorFlow and Keras - LSTM and GRU Implementation from Scratch - Simple Classification and Regression Problem using RNNs

RNN 01- RNN_Classification Simple RNN training for classification task of 3 signal: Sine, Square, Triangle. 02- RNN_Regression Simple RNN training for

Nahid Ebrahimian 13 Dec 13, 2022
PyTorch implementation HoroPCA: Hyperbolic Dimensionality Reduction via Horospherical Projections

HoroPCA This code is the official PyTorch implementation of the ICML 2021 paper: HoroPCA: Hyperbolic Dimensionality Reduction via Horospherical Projec

HazyResearch 52 Nov 14, 2022
Ivy is a templated deep learning framework which maximizes the portability of deep learning codebases.

Ivy is a templated deep learning framework which maximizes the portability of deep learning codebases. Ivy wraps the functional APIs of existing frameworks. Framework-agnostic functions, libraries an

Ivy 8.2k Jan 02, 2023
Tensorflow 2.x implementation of Vision-Transformer model

Vision Transformer Unofficial Tensorflow 2.x implementation of the Transformer based Image Classification model proposed by the paper AN IMAGE IS WORT

Soumik Rakshit 16 Jul 20, 2022
Winners of the Facebook Image Similarity Challenge

Winners of the Facebook Image Similarity Challenge

DrivenData 111 Jan 05, 2023
Behind the Curtain: Learning Occluded Shapes for 3D Object Detection

Behind the Curtain: Learning Occluded Shapes for 3D Object Detection Acknowledgement We implement our model, BtcDet, based on [OpenPcdet 0.3.0]. Insta

Qiangeng Xu 163 Dec 19, 2022
KUIELAB-MDX-Net got the 2nd place on the Leaderboard A and the 3rd place on the Leaderboard B in the MDX-Challenge ISMIR 2021

KUIELAB-MDX-Net got the 2nd place on the Leaderboard A and the 3rd place on the Leaderboard B in the MDX-Challenge ISMIR 2021

IELab@ Korea University 74 Dec 28, 2022
MapReader: A computer vision pipeline for the semantic exploration of maps at scale

MapReader A computer vision pipeline for the semantic exploration of maps at scale MapReader is an end-to-end computer vision (CV) pipeline designed b

Living with Machines 25 Dec 26, 2022
Implementation of Gans

GAN Generative Adverserial Networks are an approach to generative data modelling using Deep learning methods. I have currently implemented : DCGAN on

Sibam Parida 5 Sep 07, 2021
3D mesh stylization driven by a text input in PyTorch

Text2Mesh [Project Page] Text2Mesh is a method for text-driven stylization of a 3D mesh, as described in "Text2Mesh: Text-Driven Neural Stylization fo

Threedle (University of Chicago) 649 Dec 27, 2022
A gesture recognition system powered by OpenPose, k-nearest neighbours, and local outlier factor.

OpenHands OpenHands is a gesture recognition system powered by OpenPose, k-nearest neighbours, and local outlier factor. Currently the system can iden

Paul Treanor 12 Jan 10, 2022
Tools for robust generative diffeomorphic slice to volume reconstruction

RGDSVR Tools for Robust Generative Diffeomorphic Slice to Volume Reconstructions (RGDSVR) This repository provides tools to implement the methods in t

Lucilio Cordero-Grande 0 Oct 29, 2021
DeepRec is a recommendation engine based on TensorFlow.

DeepRec Introduction DeepRec is a recommendation engine based on TensorFlow 1.15, Intel-TensorFlow and NVIDIA-TensorFlow. Background Sparse model is a

Alibaba 676 Jan 03, 2023
NVIDIA Deep Learning Examples for Tensor Cores

NVIDIA Deep Learning Examples for Tensor Cores Introduction This repository provides State-of-the-Art Deep Learning examples that are easy to train an

NVIDIA Corporation 10k Dec 31, 2022
Reference implementation for Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Diffusion Probabilistic Models This repository provides a reference implementation of the method described in the paper: Deep Unsupervised Learning us

Jascha Sohl-Dickstein 238 Jan 02, 2023