PyTorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering".

Overview

TRAnsformer Routing Networks (TRAR)

This is the official implementation of the ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering". It currently includes the code for training TRAR on the VQA2.0 and CLEVR datasets. Our TRAR model for the REC task is coming soon.

Updates

  • (2021/10/10) Release our TRAR-VQA project.
  • (2021/08/31) Release our pretrained CLEVR TRAR model on train split: TRAR CLEVR Pretrained Models.
  • (2021/08/18) Release our pretrained TRAR model on train+val split and train+val+vg split: VQA-v2 TRAR Pretrained Models
  • (2021/08/16) Release our train2014, val2014 and test2015 data. Please check our dataset setup page DATA.md for more details.
  • (2021/08/15) Release our pretrained weight on train split. Please check our model page MODEL.md for more details.
  • (2021/08/13) The project page for TRAR is available.

Introduction

TRAR vs Standard Transformer

TRAR Overall

Table of Contents

  1. Installation
  2. Dataset setup
  3. Config Introduction
  4. Training
  5. Validation and Testing
  6. Models

Installation

  • Clone this repo
git clone https://github.com/rentainhe/TRAR-VQA.git
cd TRAR-VQA
  • Create a conda virtual environment and activate it
conda create -n trar python=3.7 -y
conda activate trar
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
  • Install spaCy and initialize the GloVe vectors as follows:
pip install -r requirements.txt
wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
pip install en_vectors_web_lg-2.1.0.tar.gz

Dataset setup

See DATA.md.

Config Introduction

The trar.yml config contains the following TRAR-specific settings:

ORDERS: [0, 1, 2, 3]
IMG_SCALE: 8 
ROUTING: 'hard' # {'soft', 'hard'}
POOLING: 'attention' # {'attention', 'avg', 'fc'}
TAU_POLICY: 1 # {0: 'SLOW', 1: 'FAST', 2: 'FINETUNE'}
TAU_MAX: 10
TAU_MIN: 0.1
BINARIZE: False
  • ORDERS=list: sets the local attention window sizes used for routing; 0 means global attention.
  • IMG_SCALE=int: must equal the image feature size used for training. For example, set IMG_SCALE: 16 for 16 × 16 training features.
  • ROUTING={'hard', 'soft'}: sets the routing block type in the TRAR model.
  • POOLING={'attention', 'avg', 'fc'}: sets the downsampling strategy used in the routing block.
  • TAU_POLICY={0, 1, 2}: sets the temperature schedule for training TRAR when using ROUTING: 'hard'.
  • TAU_MAX=float: sets the maximum temperature in training.
  • TAU_MIN=float: sets the minimum temperature in training.
  • BINARIZE=bool: whether to binarize the predicted alphas (the probability of choosing each path). When True, only the maximum alpha is kept at test time and the others are set to zero. When False, all alphas are kept and the outputs of the different routing paths are combined as a sum weighted by the alphas. This setting does not affect training; it only makes a small difference at test time.
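As a rough illustration of how ORDERS and IMG_SCALE interact, the sketch below builds a boolean attention mask over an IMG_SCALE × IMG_SCALE feature grid. It assumes that an order k lets each grid cell attend to cells at most k steps away horizontally and vertically (a square window), and that order 0 means no restriction. The function name local_attention_mask and the exact window shape are illustrative assumptions, not taken from the repository code.

```python
def local_attention_mask(img_scale, order):
    """Return an N x N matrix of booleans (N = img_scale ** 2).

    mask[q][k] is True when query cell q may attend to key cell k.
    Assumption: order 0 means global attention; order k > 0 means a
    square window of radius k around the query cell.
    """
    n = img_scale * img_scale
    if order == 0:
        return [[True] * n for _ in range(n)]
    mask = []
    for q in range(n):
        q_row, q_col = divmod(q, img_scale)
        row = []
        for k in range(n):
            k_row, k_col = divmod(k, img_scale)
            # Chebyshev distance between the two grid cells.
            distance = max(abs(q_row - k_row), abs(q_col - k_col))
            row.append(distance <= order)
        mask.append(row)
    return mask


# One mask per entry in ORDERS, here for the default 8 x 8 grid.
masks = [local_attention_mask(8, order) for order in [0, 1, 2, 3]]
```

Under these assumptions a larger order widens the window, so the routing block chooses among progressively larger attention spans, with order 0 as the unrestricted fallback.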

Note: set BINARIZE=False when ROUTING='soft'; there is no need to binarize the path probabilities in the soft routing block.
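The effect of BINARIZE at test time can be sketched on plain Python lists. The helper below is hypothetical (combine_paths is not a function from this repository): it takes one alpha per routing path and one output vector per path, and either sums the outputs weighted by all alphas or keeps only the path with the largest alpha. Keeping the winning alpha at its original value, rather than rounding it to 1, is an assumption based on the description above.

```python
def combine_paths(alphas, path_outputs, binarize=False):
    """Combine per-path outputs using the path probabilities (alphas).

    alphas:       list of floats, one per routing path
    path_outputs: list of equal-length vectors, one per routing path
    binarize:     if True, zero out every alpha except the largest one
    """
    if binarize:
        best = max(range(len(alphas)), key=lambda i: alphas[i])
        alphas = [a if i == best else 0.0 for i, a in enumerate(alphas)]
    dim = len(path_outputs[0])
    # Weighted sum over paths, computed independently for each dimension.
    return [
        sum(a * out[d] for a, out in zip(alphas, path_outputs))
        for d in range(dim)
    ]


alphas = [0.1, 0.6, 0.3]
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weighted = combine_paths(alphas, outputs, binarize=False)
single_path = combine_paths(alphas, outputs, binarize=True)
```

With binarize=False every path contributes in proportion to its alpha; with binarize=True only the second path (alpha 0.6) contributes.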

TAU_POLICY visualization

For MAX_EPOCH=13 with WARMUP_EPOCH=3 we have the following policy strategy:

Training

Train model on VQA-v2 with default hyperparameters:

python3 run.py --RUN='train' --DATASET='vqa' --MODEL='trar'

and the training log will be saved to:

results/log/log_run_ .txt


Args:

  • --DATASET={'vqa', 'clevr'}: chooses the task for training.
  • --GPU=str, e.g. --GPU='2': trains the model on a specific GPU device.
  • --SPLIT={'train', 'train+val', 'train+val+vg'}: selects which training datasets are combined. The default training split is train.
  • --MAX_EPOCH=int: sets the total number of training epochs.
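When launching several runs from a script, it can help to assemble the command from the documented flags and reject unsupported values early. The helper below is a hypothetical convenience wrapper, not part of this repository; it only uses the flag names and choices listed above, and it omits the shell quotes around values because they are not needed in an argument list.

```python
import shlex

# Choices documented in the Args list above.
DATASETS = {"vqa", "clevr"}
SPLITS = {"train", "train+val", "train+val+vg"}


def build_train_command(dataset, gpu=None, split=None, max_epoch=None):
    """Return the argument list for a TRAR training run."""
    if dataset not in DATASETS:
        raise ValueError(f"unsupported DATASET: {dataset!r}")
    if split is not None and split not in SPLITS:
        raise ValueError(f"unsupported SPLIT: {split!r}")
    args = ["python3", "run.py", "--RUN=train",
            f"--DATASET={dataset}", "--MODEL=trar"]
    if gpu is not None:
        args.append(f"--GPU={gpu}")
    if split is not None:
        args.append(f"--SPLIT={split}")
    if max_epoch is not None:
        args.append(f"--MAX_EPOCH={int(max_epoch)}")
    return args


# Print a command line that can be pasted into a shell.
print(shlex.join(build_train_command("vqa", gpu="0,1", split="train+val", max_epoch=13)))
```

The returned list can also be passed directly to subprocess.run from the repository root.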

Resume Training

Resume training from specific saved model weights:

python3 run.py --RUN='train' --DATASET='vqa' --MODEL='trar' --RESUME=True --CKPT_V=str --CKPT_E=int
  • --CKPT_V=str: the specific checkpoint version
  • --CKPT_E=int: the resumed epoch number

Multi-GPU Training and Gradient Accumulation

  1. Multi-GPU Training: Append --GPU='0,1,2,3...' to the training command.
python3 run.py --RUN='train' --DATASET='vqa' --MODEL='trar' --GPU='0,1,2,3'

The batch is split automatically, so each GPU processes BATCH_SIZE/GPUs samples.

  2. Gradient Accumulation: Append --ACCU=n to the training command.
python3 run.py --RUN='train' --DATASET='vqa' --MODEL='trar' --ACCU=2

This makes the optimizer accumulate gradients over n mini-batches before updating the model weights once. BATCH_SIZE must be divisible by n.
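The idea behind gradient accumulation can be checked on a toy problem without any deep learning framework. The sketch below is an illustration only, not the repository's training loop: it fits y ≈ w·x with a mean squared error loss, and the names grad_mse and accumulated_grad are made up for this example. It shows that averaging the gradients of n equally sized mini-batches gives the same gradient as one pass over the full batch, which is why BATCH_SIZE has to be divisible by n.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ≈ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accu):
    """Accumulate gradients over `accu` equally sized mini-batches."""
    if len(xs) % accu != 0:
        raise ValueError("batch size must be divisible by the accumulation steps")
    sub = len(xs) // accu
    total = 0.0
    for step in range(accu):
        chunk_x = xs[step * sub:(step + 1) * sub]
        chunk_y = ys[step * sub:(step + 1) * sub]
        # Scale each mini-batch gradient by 1/accu so the sum is an average.
        total += grad_mse(w, chunk_x, chunk_y) / accu
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, accu=2)
```

Here full and accumulated agree up to floating-point rounding, while each accumulation step only has to hold half of the batch in memory.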

Validation and Testing

Warning: The args --MODEL and --DATASET should be set to the same values as those in the training stage.

Validate on Local Machine

Offline evaluation currently supports only the coco_2014_val dataset.

  1. Use saved checkpoint
python3 run.py --RUN='val' --MODEL='trar' --DATASET='{vqa, clevr}' --CKPT_V=str --CKPT_E=int
  2. Use the absolute path
python3 run.py --RUN='val' --MODEL='trar' --DATASET='{vqa, clevr}' --CKPT_PATH=str

Online Testing

Evaluation on the test sets of the VQA-v2 and CLEVR benchmarks can be run as follows:

python3 run.py --RUN='test' --MODEL='trar' --DATASET='{vqa, clevr}' --CKPT_V=str --CKPT_E=int

Result files are saved at:

results/result_test/result_run_ _ .json

You can upload the resulting JSON file to EvalAI to obtain the scores.

Models

We provide our pretrained models and training logs; please see MODEL.md.

Acknowledgements

Citation

If TRAR is helpful for your research, or you wish to refer to the baseline results published here, we'd really appreciate it if you could cite this paper:

@InProceedings{Zhou_2021_ICCV,
    author    = {Zhou, Yiyi and Ren, Tianhe and Zhu, Chaoyang and Sun, Xiaoshuai and Liu, Jianzhuang and Ding, Xinghao and Xu, Mingliang and Ji, Rongrong},
    title     = {TRAR: Routing the Attention Spans in Transformer for Visual Question Answering},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {2074-2084}
}