A unified framework to jointly model images, text, and human attention traces.

Overview

connect-caption-and-trace

This repository contains the reference code for our paper Connecting What to Say With Where to Look by Modeling Human Attention Traces (CVPR2021).

example results

Requirements

  • Python 3
  • PyTorch 1.5+ (along with torchvision)
  • coco-caption (Remember to follow initialization steps in coco-caption/README.md)

Prepare data

Our experiments cover all four datasets included in Localized Narratives: COCO2017, Flickr30k, Open Images and ADE20k. For each dataset, we need four things: (1) json file containing image info and word tokens. (DATASET_LN.json) (2) h5 file containing caption labels (DATASET_LN_label.h5) (3) The trace labels extracted from Localized Narratives (DATASET_LN_trace_box/) (4) json file for coco-caption evaluation (captions_DATASET_LN_test.json) (5) Image features (with bounding boxes) extracted by a Mask-RCNN pretrained on Visual Genome.

You can download (1--4) from here: (make a folder named data and put (1--3) in it, and put (4) under coco-caption/annotaions/)

To get (5), you can use Detectron2. First, install Detectron2, then follow Prepare COCO-style annotations for Visual Genome (We use the pre-trained Resnet101-C4 model provided there). After that you can utilize tools/extract_feats.py in Detectron2 to extract features. Finally, run scripts/prepare_feats_boxes_from_npz.py in this repo to prepare features and bounding boxes in seperate folders for training.

For COCO dataest you can also directly use the features provided by Peter Anderson here. The performance is almost the same (with around 0.2% difference.)

Training

The dataset can be chosen from the four datasets. The --task can be chosen from trace, caption, c_joint_t and pred_both. The --eval_task can be chosen from trace, caption, and pred_both.

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

python tools/train.py --language_eval 0 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

Open image: training of generating caption and trace at the same time (N=1 layers, evaluated on predicting both)

python tools/train.py --language_eval 0 --id transformer_LN_openimg  --caption_model transformer --input_json data/openimg_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/openimg_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/openimg_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task pred_both --eval_task pred_both --dataset_choice=openimg

Flickr30k: training of controlled caption generation alone (N=1 layer)

python tools/train.py --language_eval 0 --id transformer_LN_flk30k  --caption_model transformer --input_json data/flk30k_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/flk30k_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/flk30k_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task caption --eval_task caption --dataset_choice=flk30k

ADE20k: training of controlled trace generation alone (N=1 layer)

python tools/train.py --language_eval 0 --id transformer_LN_ade20k  --caption_model transformer --input_json data/ade20k_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/ade20k_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/ade20k_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task trace --eval_task trace --dataset_choice=ade20k

Evaluating

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

python tools/train.py --language_eval 1 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 2 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 5 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on trace generation)

python tools/train.py --language_eval 1 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task trace --dataset_choice=coco

Open image: training of generating caption and trace at the same time (N=1 layers, evaluated on predicting both)

python tools/train.py --language_eval 1 --id transformer_LN_openimg  --caption_model transformer --input_json data/openimg_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/openimg_LN_label.h5 --batch_size 2 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/openimg_LN_trace_box --use_trace_feat 0 --beam_size 5 --val_images_use -1 --num_layers 1 --task pred_both --eval_task pred_both --dataset_choice=openimg

Acknowledgements

Some components of this repo were built from Ruotian Luo's ImageCaptioning.pytorch.

Owner
Meta Research
Meta Research
Block-wisely Supervised Neural Architecture Search with Knowledge Distillation (CVPR 2020)

DNA This repository provides the code of our paper: Blockwisely Supervised Neural Architecture Search with Knowledge Distillation. Illustration of DNA

Changlin Li 215 Dec 19, 2022
Pydantic models for pywttr and aiopywttr.

Pydantic models for pywttr and aiopywttr.

Almaz 2 Dec 08, 2022
Official implementation for Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos

Multi-modal Interaction Graph Convolutioal Network for Temporal Language Localization in Videos Official implementation for Multi-Modal Interaction Gr

Zongmeng Zhang 15 Oct 18, 2022
Data Augmentation with Variational Autoencoders

Documentation Pyraug This library provides a way to perform Data Augmentation using Variational Autoencoders in a reliable way even in challenging con

112 Nov 30, 2022
TF2 implementation of knowledge distillation using the "function matching" hypothesis from the paper Knowledge distillation: A good teacher is patient and consistent by Beyer et al.

FunMatch-Distillation TF2 implementation of knowledge distillation using the "function matching" hypothesis from the paper Knowledge distillation: A g

Sayak Paul 67 Dec 20, 2022
Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Evolutionary Scale Modeling This repository contains code and pre-trained weights for Transformer protein language models from Facebook AI Research, i

Meta Research 1.6k Jan 09, 2023
This Jupyter notebook shows one way to implement a simple first-order low-pass filter on sampled data in discrete time.

How to Implement a First-Order Low-Pass Filter in Discrete Time We often teach or learn about filters in continuous time, but then need to implement t

Joshua Marshall 4 Aug 24, 2022
Generic template to bootstrap your PyTorch project with PyTorch Lightning, Hydra, W&B, and DVC.

NN Template Generic template to bootstrap your PyTorch project. Click on Use this Template and avoid writing boilerplate code for: PyTorch Lightning,

Luca Moschella 520 Dec 30, 2022
DvD-TD3: Diversity via Determinants for TD3 version

DvD-TD3: Diversity via Determinants for TD3 version The implementation of paper Effective Diversity in Population Based Reinforcement Learning. Instal

3 Feb 11, 2022
This tutorial repository is to introduce the functionality of KGTK to first-time users

Welcome to the KGTK notebook tutorial The goal of this tutorial repository is to introduce the functionality of KGTK to first-time users. The Knowledg

USC ISI I2 58 Dec 21, 2022
Start-to-finish tutorial for interactive music co-creation in PyTorch and Tensorflow.js

Start-to-finish tutorial for interactive music co-creation in PyTorch and Tensorflow.js

Chris Donahue 98 Dec 14, 2022
Single-Stage 6D Object Pose Estimation, CVPR 2020

Overview This repository contains the code for the paper Single-Stage 6D Object Pose Estimation. Yinlin Hu, Pascal Fua, Wei Wang and Mathieu Salzmann.

CVLAB @ EPFL 89 Dec 26, 2022
Easy to use Audio Tagging in PyTorch

Audio Classification, Tagging & Sound Event Detection in PyTorch Progress: Fine-tune on audio classification Fine-tune on audio tagging Fine-tune on s

sithu3 15 Dec 22, 2022
R-package accompanying the paper "Dynamic Factor Model for Functional Time Series: Identification, Estimation, and Prediction"

dffm The goal of dffm is to provide functionality to apply the methods developed in the paper “Dynamic Factor Model for Functional Time Series: Identi

Sven Otto 3 Dec 09, 2022
Minecraft agent to farm resources using reinforcement learning

BarnyardBot CS 175 group project using Malmo download BarnyardBot.py into the python examples directory and run 'python BarnyardBot.py' in the console

0 Jul 26, 2022
Image-retrieval-baseline - MUGE Multimodal Retrieval Baseline

MUGE Multimodal Retrieval Baseline This repo is implemented based on the open_cl

47 Dec 16, 2022
Moiré Attack (MA): A New Potential Risk of Screen Photos [NeurIPS 2021]

Moiré Attack (MA): A New Potential Risk of Screen Photos [NeurIPS 2021] This repository is the official implementation of Moiré Attack (MA): A New Pot

Dantong Niu 22 Dec 24, 2022
Code for the CIKM 2019 paper "DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting".

Dual Self-Attention Network for Multivariate Time Series Forecasting 20.10.26 Update: Due to the difficulty of installation and code maintenance cause

Kyon Huang 223 Dec 16, 2022
Semantic Segmentation of images using PixelLib with help of Pascalvoc dataset trained with Deeplabv3+ framework.

CARscan- Approach 1 - Segmentation of images by detecting contours. It failed because in images with elements along with cars were also getting detect

Padmanabha Banerjee 5 Jul 29, 2021
Complete-IoU (CIoU) Loss and Cluster-NMS for Object Detection and Instance Segmentation (YOLACT)

Complete-IoU Loss and Cluster-NMS for Improving Object Detection and Instance Segmentation. Our paper is accepted by IEEE Transactions on Cybernetics

290 Dec 25, 2022