Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome

Overview

bottom-up-attention

This code implements a bottom-up attention model, based on multi-gpu training of Faster R-CNN with ResNet-101, using object and attribute annotations from Visual Genome.

The pretrained model generates output features corresponding to salient image regions. These bottom-up attention features can typically be used as a drop-in replacement for CNN features in attention-based image captioning and visual question answering (VQA) models. This approach was used to achieve state-of-the-art image captioning performance on MSCOCO (CIDEr 117.9, BLEU_4 36.9) and to win the 2017 VQA Challenge (70.3% overall accuracy), as described in:

Some example object and attribute predictions for salient image regions are illustrated below.

teaser-bike teaser-oven

Note: This repo only includes code for training the bottom-up attention / Faster R-CNN model (section 3.1 of the paper). The actual captioning model (section 3.2) is available in a separate repo here.

Reference

If you use our code or features, please cite our paper:

@inproceedings{Anderson2017up-down,
  author = {Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang},
  title = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},
  booktitle={CVPR},
  year = {2018}
}

Disclaimer

This code is modified from py-R-FCN-multiGPU, which is in turn modified from py-faster-rcnn code. Please refer to these links for further README information (for example, relating to other models and datasets included in the repo) and appropriate citations for these works. This README only relates to Faster R-CNN trained on Visual Genome.

License

bottom-up-attention is released under the MIT License (refer to the LICENSE file for details).

Pretrained features

For ease-of-use, we make pretrained features available for the entire MSCOCO dataset. It is not necessary to clone or build this repo to use features downloaded from the links below. Features are stored in tsv (tab-separated-values) format that can be read with tools/read_tsv.py.

LINKS HAVE BEEN UPDATED TO GOOGLE CLOUD STORAGE (14 Feb 2021)

10 to 100 features per image (adaptive):

36 features per image (fixed):

Both sets of features can be recreated by using tools/generate_tsv.py with the appropriate pretrained model and with MIN_BOXES/MAX_BOXES set to either 10/100 or 36/36 respectively - refer Demo.

Contents

  1. Requirements: software
  2. Requirements: hardware
  3. Basic installation
  4. Demo
  5. Training
  6. Testing

Requirements: software

  1. Important Please use the version of caffe contained within this repository.

  2. Requirements for Caffe and pycaffe (see: Caffe installation instructions)

Note: Caffe must be built with support for Python layers and NCCL!

# In your Makefile.config, make sure to have these lines uncommented
WITH_PYTHON_LAYER := 1
USE_NCCL := 1
# Unrelatedly, it's also recommended that you use CUDNN
USE_CUDNN := 1
  1. Python packages you might not have: cython, python-opencv, easydict
  2. Nvidia's NCCL library which is used for multi-GPU training https://github.com/NVIDIA/nccl

Requirements: hardware

Any NVIDIA GPU with 12GB or larger memory is OK for training Faster R-CNN ResNet-101.

Installation

  1. Clone the repository
git clone https://github.com/peteanderson80/bottom-up-attention/
  1. Build the Cython modules

    cd $REPO_ROOT/lib
    make
  2. Build Caffe and pycaffe

    cd $REPO_ROOT/caffe
    # Now follow the Caffe installation instructions here:
    #   http://caffe.berkeleyvision.org/installation.html
    
    # If you're experienced with Caffe and have all of the requirements installed
    # and your Makefile.config in place, then simply do:
    make -j8 && make pycaffe

Demo

  1. Download pretrained model, and put it under data\faster_rcnn_models.

  2. Run tools/demo.ipynb to show object and attribute detections on demo images.

  3. Run tools/generate_tsv.py to extract bounding box features to a tab-separated-values (tsv) file. This will require modifying the load_image_ids function to suit your data locations. To recreate the pretrained feature files with 10 to 100 features per image, set MIN_BOXES=10 and MAX_BOXES=100. To recreate the pretrained feature files with 36 features per image, set MIN_BOXES=36 and MAX_BOXES=36 use this alternative pretrained model instead. The alternative pretrained model was trained for fewer iterations but performance is similar.

Training

  1. Download the Visual Genome dataset. Extract all the json files, as well as the image directories VG_100K and VG_100K_2 into one folder $VGdata.

  2. Create symlinks for the Visual Genome dataset

    cd $REPO_ROOT/data
    ln -s $VGdata vg
  3. Generate xml files for each image in the pascal voc format (this will take some time). This script will extract the top 2500/1000/500 objects/attributes/relations and also does basic cleanup of the visual genome data. Note however, that our training code actually only uses a subset of the annotations in the xml files, i.e., only 1600 object classes and 400 attribute classes, based on the hand-filtered vocabs found in data/genome/1600-400-20. The relevant part of the codebase is lib/datasets/vg.py. Relation labels can be included in the data layers but are currently not used.

    cd $REPO_ROOT
    ./data/genome/setup_vg.py
  4. Please download the ImageNet-pre-trained ResNet-100 model manually, and put it into $REPO_ROOT/data/imagenet_models

  5. You can train your own model using ./experiments/scripts/faster_rcnn_end2end_multi_gpu_resnet_final.sh (see instructions in file). The train (95k) / val (5k) / test (5k) splits are in data/genome/{split}.txt and have been determined using data/genome/create_splits.py. To avoid val / test set contamination when pre-training for MSCOCO tasks, for images in both datasets these splits match the 'Karpathy' COCO splits.

    Trained Faster-RCNN snapshots are saved under:

    output/faster_rcnn_resnet/vg/
    

    Logging outputs are saved under:

    experiments/logs/
    
  6. Run tools/review_training.ipynb to visualize the training data and predictions.

Testing

  1. The model will be tested on the validation set at the end of training, or models can be tested directly using tools/test_net.py, e.g.:

    ./tools/test_net.py --gpu 0 --imdb vg_1600-400-20_val --def models/vg/ResNet-101/faster_rcnn_end2end_final/test.prototxt --cfg experiments/cfgs/faster_rcnn_end2end_resnet.yml --net data/faster_rcnn_models/resnet101_faster_rcnn_final.caffemodel > experiments/logs/eval.log 2<&1
    

    Mean AP is reported separately for object prediction and attibute prediction (given ground-truth object detections). Test outputs are saved under:

    output/faster_rcnn_resnet/vg_1600-400-20_val/<network snapshot name>/
    

Expected detection results for the pretrained model

objects [email protected] objects weighted [email protected] attributes [email protected] attributes weighted [email protected]
Faster R-CNN, ResNet-101 10.2% 15.1% 7.8% 27.8%

Note that mAP is relatively low because many classes overlap (e.g. person / man / guy), some classes can't be precisely located (e.g. street, field) and separate classes exist for singular and plural objects (e.g. person / people). We focus on performance in downstream tasks (e.g. image captioning, VQA) rather than detection performance.

Fast Neural Representations for Direct Volume Rendering

Fast Neural Representations for Direct Volume Rendering Sebastian Weiss, Philipp Hermüller, Rüdiger Westermann This repository contains the code and s

Sebastian Weiss 20 Dec 03, 2022
RADIal is available now! Check the download section

Latest news: RADIal is available now! Check the download section. However, because we are currently working on the data anonymization, we provide for

valeo.ai 55 Jan 03, 2023
Computer Vision and Pattern Recognition, NUS CS4243, 2022

CS4243_2022 Computer Vision and Pattern Recognition, NUS CS4243, 2022 Cloud Machine #1 : Google Colab (Free GPU) Follow this Notebook installation : h

Xavier Bresson 142 Dec 15, 2022
Code for paper 'Hand-Object Contact Consistency Reasoning for Human Grasps Generation' at ICCV 2021

GraspTTA Hand-Object Contact Consistency Reasoning for Human Grasps Generation (ICCV 2021). Project Page with Videos Demo Quick Results Visualization

Hanwen Jiang 47 Dec 09, 2022
PoolFormer: MetaFormer is Actually What You Need for Vision

PoolFormer: MetaFormer is Actually What You Need for Vision (arXiv) This is a PyTorch implementation of PoolFormer proposed by our paper "MetaFormer i

Sea AI Lab 1k Dec 30, 2022
This repo contains implementation of different architectures for emotion recognition in conversations.

Emotion Recognition in Conversations Updates 🔥 🔥 🔥 Date Announcements 03/08/2021 🎆 🎆 We have released a new dataset M2H2: A Multimodal Multiparty

Deep Cognition and Language Research (DeCLaRe) Lab 1k Dec 30, 2022
Pytorch code for "DPFM: Deep Partial Functional Maps" - 3DV 2021 (Oral)

DPFM Code for "DPFM: Deep Partial Functional Maps" - 3DV 2021 (Oral) Installation This implementation runs on python = 3.7, use pip to install depend

Souhaib Attaiki 29 Oct 03, 2022
Experimenting with computer vision techniques to generate annotated image datasets from gameplay recordings automatically.

Experimenting with computer vision techniques to generate annotated image datasets from gameplay recordings automatically. The collected data will then be used to train a deep neural network that can

Martin Valchev 3 Apr 24, 2022
Official PyTorch implementation of paper: Standardized Max Logits: A Simple yet Effective Approach for Identifying Unexpected Road Obstacles in Urban-Scene Segmentation (ICCV 2021 Oral Presentation)

SML (ICCV 2021, Oral) : Official Pytorch Implementation This repository provides the official PyTorch implementation of the following paper: Standardi

SangHun 61 Dec 27, 2022
Source code for "UniRE: A Unified Label Space for Entity Relation Extraction.", ACL2021.

UniRE Source code for "UniRE: A Unified Label Space for Entity Relation Extraction.", ACL2021. Requirements python: 3.7.6 pytorch: 1.8.1 transformers:

Wang Yijun 109 Nov 29, 2022
The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

PRIMER The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization. PRIMER is a pre-trained model for mu

AI2 114 Jan 06, 2023
Official PyTorch Implementation of paper EAN: Event Adaptive Network for Efficient Action Recognition

Official PyTorch Implementation of paper EAN: Event Adaptive Network for Efficient Action Recognition

TianYuan 27 Nov 07, 2022
PyTorch Implementation of NCSOFT's FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

FastPitchFormant - PyTorch Implementation PyTorch Implementation of FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis. Qu

Keon Lee 63 Jan 02, 2023
Dogs classification with Deep Metric Learning using some popular losses

Tsinghua Dogs classification with Deep Metric Learning 1. Introduction Tsinghua Dogs dataset Tsinghua Dogs is a fine-grained classification dataset fo

QuocThangNguyen 45 Nov 09, 2022
Official Pytorch Implementation of Length-Adaptive Transformer (ACL 2021)

Length-Adaptive Transformer This is the official Pytorch implementation of Length-Adaptive Transformer. For detailed information about the method, ple

Clova AI Research 93 Dec 28, 2022
Multi-Object Tracking in Satellite Videos with Graph-Based Multi-Task Modeling

TGraM Multi-Object Tracking in Satellite Videos with Graph-Based Multi-Task Modeling, Qibin He, Xian Sun, Zhiyuan Yan, Beibei Li, Kun Fu Abstract Rece

Qibin He 6 Nov 25, 2022
An open source Python package for plasma science that is under development

PlasmaPy PlasmaPy is an open source, community-developed Python 3.7+ package for plasma science. PlasmaPy intends to be for plasma science what Astrop

PlasmaPy 444 Jan 07, 2023
Python code to fuse multiple RGB-D images into a TSDF voxel volume.

Volumetric TSDF Fusion of RGB-D Images in Python This is a lightweight python script that fuses multiple registered color and depth images into a proj

Andy Zeng 845 Jan 03, 2023
This is the PyTorch implementation of GANs N’ Roses: Stable, Controllable, Diverse Image to Image Translation

Official PyTorch repo for GAN's N' Roses. Diverse im2im and vid2vid selfie to anime translation.

1.1k Jan 01, 2023