A unified framework to jointly model images, text, and human attention traces.

Overview

connect-caption-and-trace

This repository contains the reference code for our paper Connecting What to Say With Where to Look by Modeling Human Attention Traces (CVPR2021).

example results

Requirements

  • Python 3
  • PyTorch 1.5+ (along with torchvision)
  • coco-caption (Remember to follow initialization steps in coco-caption/README.md)

Prepare data

Our experiments cover all four datasets included in Localized Narratives: COCO2017, Flickr30k, Open Images and ADE20k. For each dataset, we need four things: (1) json file containing image info and word tokens. (DATASET_LN.json) (2) h5 file containing caption labels (DATASET_LN_label.h5) (3) The trace labels extracted from Localized Narratives (DATASET_LN_trace_box/) (4) json file for coco-caption evaluation (captions_DATASET_LN_test.json) (5) Image features (with bounding boxes) extracted by a Mask-RCNN pretrained on Visual Genome.

You can download (1--4) from here: (make a folder named data and put (1--3) in it, and put (4) under coco-caption/annotaions/)

To get (5), you can use Detectron2. First, install Detectron2, then follow Prepare COCO-style annotations for Visual Genome (We use the pre-trained Resnet101-C4 model provided there). After that you can utilize tools/extract_feats.py in Detectron2 to extract features. Finally, run scripts/prepare_feats_boxes_from_npz.py in this repo to prepare features and bounding boxes in seperate folders for training.

For COCO dataest you can also directly use the features provided by Peter Anderson here. The performance is almost the same (with around 0.2% difference.)

Training

The dataset can be chosen from the four datasets. The --task can be chosen from trace, caption, c_joint_t and pred_both. The --eval_task can be chosen from trace, caption, and pred_both.

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

python tools/train.py --language_eval 0 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

Open image: training of generating caption and trace at the same time (N=1 layers, evaluated on predicting both)

python tools/train.py --language_eval 0 --id transformer_LN_openimg  --caption_model transformer --input_json data/openimg_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/openimg_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/openimg_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task pred_both --eval_task pred_both --dataset_choice=openimg

Flickr30k: training of controlled caption generation alone (N=1 layer)

python tools/train.py --language_eval 0 --id transformer_LN_flk30k  --caption_model transformer --input_json data/flk30k_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/flk30k_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/flk30k_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task caption --eval_task caption --dataset_choice=flk30k

ADE20k: training of controlled trace generation alone (N=1 layer)

python tools/train.py --language_eval 0 --id transformer_LN_ade20k  --caption_model transformer --input_json data/ade20k_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/ade20k_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/ade20k_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task trace --eval_task trace --dataset_choice=ade20k

Evaluating

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

python tools/train.py --language_eval 1 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 2 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 5 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on trace generation)

python tools/train.py --language_eval 1 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task trace --dataset_choice=coco

Open image: training of generating caption and trace at the same time (N=1 layers, evaluated on predicting both)

python tools/train.py --language_eval 1 --id transformer_LN_openimg  --caption_model transformer --input_json data/openimg_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/openimg_LN_label.h5 --batch_size 2 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/openimg_LN_trace_box --use_trace_feat 0 --beam_size 5 --val_images_use -1 --num_layers 1 --task pred_both --eval_task pred_both --dataset_choice=openimg

Acknowledgements

Some components of this repo were built from Ruotian Luo's ImageCaptioning.pytorch.

Owner
Meta Research
Meta Research
Official PyTorch implementation of the paper "Likelihood Training of Schrödinger Bridge using Forward-Backward SDEs Theory (SB-FBSDE)"

Official PyTorch implementation of the paper "Likelihood Training of Schrödinger Bridge using Forward-Backward SDEs Theory (SB-FBSDE)" which introduces a new class of deep generative models that gene

Guan-Horng Liu 43 Jan 03, 2023
PyTorch implementation of the REMIND method from our ECCV-2020 paper "REMIND Your Neural Network to Prevent Catastrophic Forgetting"

REMIND Your Neural Network to Prevent Catastrophic Forgetting This is a PyTorch implementation of the REMIND algorithm from our ECCV-2020 paper. An ar

Tyler Hayes 72 Nov 27, 2022
[CVPR 2021] "The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models" Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, Zhangyang Wang

The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models Codes for this paper The Lottery Tickets Hypo

VITA 59 Dec 28, 2022
A Multi-attribute Controllable Generative Model for Histopathology Image Synthesis

A Multi-attribute Controllable Generative Model for Histopathology Image Synthesis This is the pytorch implementation for our MICCAI 2021 paper. A Mul

Jiarong Ye 7 Apr 04, 2022
Unofficial implementation of MLP-Mixer: An all-MLP Architecture for Vision

MLP-Mixer: An all-MLP Architecture for Vision This repo contains PyTorch implementation of MLP-Mixer: An all-MLP Architecture for Vision. Usage : impo

Rishikesh (ऋषिकेश) 175 Dec 23, 2022
Code repo for "RBSRICNN: Raw Burst Super-Resolution through Iterative Convolutional Neural Network" (Machine Learning and the Physical Sciences workshop in NeurIPS 2021).

RBSRICNN: Raw Burst Super-Resolution through Iterative Convolutional Neural Network An official PyTorch implementation of the RBSRICNN network as desc

Rao Muhammad Umer 6 Nov 14, 2022
This is an open source python repository for various python tests

Welcome to Py-tests This is an open source python repository for various python tests. This is in response to the hacktoberfest2021 challenge. It is a

Yada Martins Tisan 3 Oct 31, 2021
Feup-csr - Repository holding my group's submission to the CSR project competition

CSR Competições de Swarm Robotics Swarm Robotics Competitions This repository holds the files submitted for the CSR project competition. Project group

Nuno Pereira 1 Jan 04, 2022
pytorch implementation of ABC : Auxiliary Balanced Classifier for Class-imbalanced Semi-supervised Learning

ABC:Auxiliary Balanced Classifier for Class-imbalanced Semi-supervised Learning, NeurIPS 2021 pytorch implementation of ABC : Auxiliary Balanced Class

Hyuck Lee 25 Dec 22, 2022
TensorFlow Implementation of "Show, Attend and Tell"

Show, Attend and Tell Update (December 2, 2016) TensorFlow implementation of Show, Attend and Tell: Neural Image Caption Generation with Visual Attent

Yunjey Choi 902 Nov 29, 2022
Code for "Learning Graph Cellular Automata"

Learning Graph Cellular Automata This code implements the experiments from the NeurIPS 2021 paper: "Learning Graph Cellular Automata" Daniele Grattaro

Daniele Grattarola 37 Oct 26, 2022
IPATool-py: download ipa easily

IPATool-py Python version of IPATool! Installation pip3 install -r requirements.txt Usage Quickstart: download app with specific bundleId into DIR: p

159 Dec 30, 2022
Pytorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)

Vision Transformer Pytorch reimplementation of Google's repository for the ViT model that was released with the paper An Image is Worth 16x16 Words: T

Eunkwang Jeon 1.4k Dec 28, 2022
Supplementary code for TISMIR paper "Sliding-Window Pitch-Class Histograms as a Means of Modeling Musical Form"

Sliding-Window Pitch-Class Histograms as a Means of Modeling Musical Form This is supplementary code for the TISMIR paper Sliding-Window Pitch-Class H

1 Nov 27, 2021
Adversarial Attacks are Reversible via Natural Supervision

Adversarial Attacks are Reversible via Natural Supervision ICCV2021 Citation @InProceedings{Mao_2021_ICCV, author = {Mao, Chengzhi and Chiquier

Computer Vision Lab at Columbia University 20 May 22, 2022
Focal and Global Knowledge Distillation for Detectors

FGD Paper: Focal and Global Knowledge Distillation for Detectors Install MMDetection and MS COCO2017 Our codes are based on MMDetection. Please follow

Mesopotamia 261 Dec 23, 2022
7th place solution of Human Protein Atlas - Single Cell Classification on Kaggle

kaggle-hpa-2021-7th-place-solution Code for 7th place solution of Human Protein Atlas - Single Cell Classification on Kaggle. A description of the met

8 Jul 09, 2021
TSIT: A Simple and Versatile Framework for Image-to-Image Translation

TSIT: A Simple and Versatile Framework for Image-to-Image Translation This repository provides the official PyTorch implementation for the following p

Liming Jiang 255 Nov 23, 2022
Pytorch implementation of the paper "Enhancing Content Preservation in Text Style Transfer Using Reverse Attention and Conditional Layer Normalization"

Pytorch implementation of the paper "Enhancing Content Preservation in Text Style Transfer Using Reverse Attention and Conditional Layer Normalization"

Dongkyu Lee 4 Sep 18, 2022
🐦 Quickly annotate data from the comfort of your Jupyter notebook

🐦 pigeon - Quickly annotate data on Jupyter Pigeon is a simple widget that lets you quickly annotate a dataset of unlabeled examples from the comfort

Anastasis Germanidis 647 Jan 05, 2023