Implementation of the Object Relation Transformer for Image Captioning

Overview

Object Relation Transformer

This is a PyTorch implementation of the Object Relation Transformer published in NeurIPS 2019. You can find the paper here. This repository is largely based on code from Ruotian Luo's Self-critical Sequence Training for Image Captioning GitHub repo, which can be found here.

The primary additions are as follows:

  • Relation transformer model
  • Script to create reports for runs on MSCOCO

Requirements

  • Python 2.7 (because there is no coco-caption version for Python 3)
  • PyTorch 0.4+ (along with torchvision)
  • h5py
  • scikit-image
  • typing
  • pyemd
  • gensim
  • cider (already added as a submodule). See .gitmodules and clone the referenced repo into the object_relation_transformer folder.
  • The coco-caption library, which is used for generating different evaluation metrics. To set it up, clone the repo into the object_relation_transformer folder. Make sure to keep the cloned repo folder name as coco-caption and also to run the get_stanford_models.sh script from within that repo.

Data Preparation

Download ResNet101 weights for feature extraction

Download the file resnet101.pth from here. Copy the weights to a folder imagenet_weights within the data folder:

mkdir data/imagenet_weights
cp /path/to/downloaded/weights/resnet101.pth data/imagenet_weights

Download and preprocess the COCO captions

Download the preprocessed COCO captions from Karpathy's homepage. Extract dataset_coco.json from the zip file and copy it in to data/. This file provides preprocessed captions and also standard train-val-test splits.

Then run:

$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk

prepro_labels.py will map all words that occur <= 5 times to a special UNK token, and create a vocabulary for all the remaining words. The image information and vocabulary are dumped into data/cocotalk.json and discretized caption data are dumped into data/cocotalk_label.h5.

Next run:

$ python scripts/prepro_ngrams.py --input_json data/dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train

This will preprocess the dataset and get the cache for calculating cider score.

Download the COCO dataset and pre-extract the image features

Download the COCO images from the MSCOCO website. We need 2014 training images and 2014 validation images. You should put the train2014/ and val2014/ folders in the same directory, denoted as $IMAGE_ROOT:

mkdir $IMAGE_ROOT
pushd $IMAGE_ROOT
wget http://images.cocodataset.org/zips/train2014.zip
unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
unzip val2014.zip
popd
wget https://msvocds.blob.core.windows.net/images/262993_z.jpg
mv 262993_z.jpg $IMAGE_ROOT/train2014/COCO_train2014_000000167126.jpg

The last two commands are needed to address an issue with a corrupted image in the MSCOCO dataset (see here). The prepro script will fail otherwise.

Then run:

$ python scripts/prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root $IMAGE_ROOT

prepro_feats.py extracts the ResNet101 features (both fc feature and last conv feature) of each image. The features are saved in data/cocotalk_fc and data/cocotalk_att, and resulting files are about 200GB. Running this script may take a day or more, depending on hardware.

(Check the prepro scripts for more options, like other ResNet models or other attention sizes.)

Download the Bottom-up features

Download the pre-extracted features from here. For the paper, the adaptive features were used.

Do the following:

mkdir data/bu_data; cd data/bu_data
wget https://imagecaption.blob.core.windows.net/imagecaption/trainval.zip
unzip trainval.zip

The .zip file is around 22 GB. Then return to the base directory and run:

python scripts/make_bu_data.py --output_dir data/cocobu

This will create data/cocobu_fc, data/cocobu_att and data/cocobu_box.

Generate the relative bounding box coordinates for the Relation Transformer

Run the following:

python scripts/prepro_bbox_relative_coords.py --input_json data/dataset_coco.json --input_box_dir data/cocobu_box --output_dir data/cocobu_box_relative --image_root $IMAGE_ROOT

This should take a couple hours or so, depending on hardware.

Model Training and Evaluation

Standard cross-entropy loss training

python train.py --id relation_transformer_bu --caption_model relation_transformer --input_json data/cocotalk.json --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --input_label_h5 data/cocotalk_label.h5 --checkpoint_path log_relation_transformer_bu --noamopt --noamopt_warmup 10000 --label_smoothing 0.0 --batch_size 15 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --max_epochs 30 --use_box 1

The train script will dump checkpoints into the folder specified by --checkpoint_path (default = save/). We only save the best-performing checkpoint on validation and the latest checkpoint to save disk space.

To resume training, you can specify --start_from option to be the path saving infos.pkl and model.pth (usually you could just set --start_from and --checkpoint_path to be the same).

If you have tensorflow, the loss histories are automatically dumped into --checkpoint_path, and can be visualized using tensorboard.

The current command uses scheduled sampling. You can also set scheduled_sampling_start to -1 to disable it.

If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to validation cross entropy loss, use --language_eval 1 option, but don't forget to download the coco-caption code into coco-caption directory.

For more options, see opts.py.

The above training script should achieve a CIDEr-D score of about 115.

Self-critical RL training

After training using cross-entropy loss, additional self-critical training produces signficant gains in CIDEr-D score.

First, copy the model from the pretrained model using cross entropy. (It's not mandatory to copy the model, just for back-up)

$ bash scripts/copy_model.sh relation_transformer_bu relation_transformer_bu_rl

Then:

python train.py --id relation_transformer_bu_rl --caption_model relation_transformer --input_json data/cocotalk.json --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_label_h5 data/cocotalk_label.h5  --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --input_label_h5 data/cocotalk_label.h5 --checkpoint_path log_relation_transformer_bu_rl --label_smoothing 0.0 --batch_size 10 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --start_from log_transformer_bu_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30 --max_epochs 60 --use_box 1

The above training script should achieve a CIDEr-D score of about 128.

Evaluate on Karpathy's test split

To evaluate the cross-entropy model, run:

python eval.py --dump_images 0 --num_images 5000 --model log_relation_transformer_bu/model.pth --infos_path log_relation_transformer_bu/infos_relation_transformer_bu-best.pkl --image_root $IMAGE_ROOT --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5  --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --use_box 1 --language_eval 1

and for cross-entropy+RL run:

python eval.py --dump_images 0 --num_images 5000 --model log_relation_transformer_bu_rl/model.pth --infos_path log_relation_transformer_bu_rl/infos_relation_transformer_bu-best.pkl --image_root $IMAGE_ROOT --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5  --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --language_eval 1

Visualization

Visualize caption predictions

Place all your images of interest into a folder, e.g. images, and run the eval script:

$ python eval.py --dump_images 1 --num_images 10 --model log_relation_transformer_bu/model.pth --infos_path log_relation_transformer_bu/infos_relation_transformer_bu-best.pkl --image_root $IMAGE_ROOT --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5  --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative

This tells the eval script to run up to 10 images from the given folder. If you have a big GPU you can speed up the evaluation by increasing batch_size. Use --num_images -1 to process all images. The eval script will create an vis.json file inside the vis folder, which can then be visualized with the provided HTML interface:

$ cd vis
$ python -m SimpleHTTPServer

Now visit localhost:8000 in your browser and you should see your predicted captions.

Generate reports from runs on MSCOCO

The create_report.py script can be used in order to generate HTML reports containing results from different runs. Please see the script for specific usage examples.

The script takes as input one or more pickle files containing results from runs on the MSCOCO dataset. It reads in the pickle files and creates a set of HTML files with tables and graphs generated from the different captioning evaluation metrics, as well as the generated image captions and corresponding metrics for individual images.

If more than one pickle file with results is provided as input, the script will also generate a report containing a comparison between the metrics generated by each pair of methods.

Model Zoo and Results

The table below presents links to our pre-trained models, as well as results from our paper on the Karpathy test split. Similar results should be obtained by running the respective commands in neurips_training_runs.sh. As learning rate scheduling was not fully optimized, these values should only serve as a reference/expectation rather than what can be achieved with additional tuning.

The models are Copyright Verizon Media, licensed under the terms of the CC-BY-4.0 license. See associated license file.

Algorithm CIDEr-D SPICE BLEU-1 BLEU-4 METEOR ROUGE-L
Up-Down + LSTM * 106.6 19.9 75.6 32.9 26.5 55.4
Up-Down + Transformer 111.0 20.9 75.0 32.8 27.5 55.6
Up-Down + Object Relation Transformer 112.6 20.8 75.6 33.5 27.6 56.0
Up-Down + Object Relation Transformer + Beamsize 2 115.4 21.2 76.6 35.5 28.0 56.6
Up-Down + Object Relation Transformer + Self-Critical + Beamsize 5 128.3 22.6 80.5 38.6 28.7 58.4

* Note that the pre-trained Up-Down + LSTM model above produces slightly better results than reported, as it came from a different training run. We kept the older LSTM results in the table above for consistency with our paper.

Comparative Analysis

In addition, in the paper we also present a head-to-head comparison of the Object Relation Transformer against the "Up-Down + Transformer" model. (Results from the latter model are also included in the table above). In the paper, we refer to this latter model as "Baseline Transformer", as it does not make use of geometry in its attention definition. The idea of the head-to-head comparison is to better understand the improvement obtained by adding geometric attention to the Transformer, both quantitatively and qualitatively. The comparison consists of a set of evaluation metrics computed for each model on a per-image basis, as well as aggregated over all images. It includes the results of paired t-tests, which test for statistically significant differences between the evaluation metrics resulting from each of the models. This comparison can be generated by running the commands in neurips_report_comands.sh. The commands first run the two aforementioned models on the MSCOCO test set and then generate the corresponding report containing the complete comparative analysis.

Citation

If you find this repo useful, please consider citing (no obligation at all):

@article{herdade2019image,
  title={Image Captioning: Transforming Objects into Words},
  author={Herdade, Simao and Kappeler, Armin and Boakye, Kofi and Soares, Joao},
  journal={arXiv preprint arXiv:1906.05963},
  year={2019}
}

Of course, please cite the original paper of models you are using (you can find references in the model files).

Contribute

Please refer to the contributing.md file for information about how to get involved. We welcome issues, questions, and pull requests.

Please be aware that we (the maintainers) are currently busy with other projects, so it make take some days before we are able to get back to you. We do not foresee big changes to this repository going forward.

Maintainers

Kofi Boakye: [email protected]

Simao Herdade: [email protected]

Joao Soares: [email protected]

License

This project is licensed under the terms of the MIT open source license. Please refer to LICENSE for the full terms.

Acknowledgments

Thanks to Ruotian Luo for the original code.

Owner
Yahoo
This organization is the home to many of the active open source projects published by engineers at Yahoo Inc.
Yahoo
A simple application that calculates the probability distribution of a normal distribution

probability-density-function General info An application that calculates the probability density and cumulative distribution of a normal distribution

1 Oct 25, 2022
A simple machine learning python sign language detection project.

SST Coursework 2022 About the app A python application that utilises the tensorflow object detection algorithm to achieve automatic detection of ameri

Xavier Koh 2 Jun 30, 2022
Retrieve annotated intron sequences and classify them as minor (U12-type) or major (U2-type)

(intron I nterrogator and C lassifier) intronIC is a program that can be used to classify intron sequences as minor (U12-type) or major (U2-type), usi

Graham Larue 4 Jul 26, 2022
使用数学和计算机知识投机倒把

偷鸡不成项目集锦 坦率地讲,涉及金融市场的好策略如果公开,必然导致使用的人多,最后策略变差。所以这个仓库只收集我目前失败了的案例。 加密货币组合套利 中国体育彩票预测 我赚不上钱的项目,也许可以帮助更有能力的人去赚钱。

Roy 28 Dec 29, 2022
MaD GUI is a basis for graphical annotation and computational analysis of time series data.

MaD GUI Machine Learning and Data Analytics Graphical User Interface MaD GUI is a basis for graphical annotation and computational analysis of time se

Machine Learning and Data Analytics Lab FAU 10 Dec 19, 2022
In this Repo a simple Sklearn Model will be trained and pushed to MLFlow

SKlearn_to_MLFLow In this Repo a simple Sklearn Model will be trained and pushed to MLFlow Install This Repo is based on poetry python3 -m venv .venv

1 Dec 13, 2021
My capstone project for Udacity's Machine Learning Nanodegree

MLND-Capstone My capstone project for Udacity's Machine Learning Nanodegree Lane Detection with Deep Learning In this project, I use a deep learning-b

Michael Virgo 407 Dec 12, 2022
ml4h is a toolkit for machine learning on clinical data of all kinds including genetics, labs, imaging, clinical notes, and more

ml4h is a toolkit for machine learning on clinical data of all kinds including genetics, labs, imaging, clinical notes, and more

Broad Institute 65 Dec 20, 2022
Simple, fast, and parallelized symbolic regression in Python/Julia via regularized evolution and simulated annealing

Parallelized symbolic regression built on Julia, and interfaced by Python. Uses regularized evolution, simulated annealing, and gradient-free optimization.

Miles Cranmer 924 Jan 03, 2023
Backtesting an algorithmic trading strategy using Machine Learning and Sentiment Analysis.

Trading Tesla with Machine Learning and Sentiment Analysis An interactive program to train a Random Forest Classifier to predict Tesla daily prices us

Renato Votto 31 Nov 17, 2022
The project's goal is to show a real world application of image segmentation using k means algorithm

The project's goal is to show a real world application of image segmentation using k means algorithm

2 Jan 22, 2022
30 Days Of Machine Learning Using Pytorch

Objective of the repository is to learn and build machine learning models using Pytorch. 30DaysofML Using Pytorch

Mayur 119 Nov 24, 2022
决策树分类与回归模型的实现和可视化

DecisionTree 决策树分类与回归模型,以及可视化 DecisionTree ID3 C4.5 CART 分类 回归 决策树绘制 分类树 回归树 调参 剪枝 ID3 ID3决策树是最朴素的决策树分类器: 无剪枝 只支持离散属性 采用信息增益准则 在data.py中,我们记录了一个小的西瓜数据

Welt Xing 10 Oct 22, 2022
MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data

MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data. We demonstrate its use

Pachter Lab 26 Nov 29, 2022
This is an auto-ML tool specialized in detecting of outliers

Auto-ML tool specialized in detecting of outliers Description This tool will allows you, with a Dash visualization, to compare 10 models of machine le

1 Nov 03, 2021
Distributed Evolutionary Algorithms in Python

DEAP DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data stru

Distributed Evolutionary Algorithms in Python 4.9k Jan 05, 2023
Traingenerator 🧙 A web app to generate template code for machine learning ✨

Traingenerator 🧙 A web app to generate template code for machine learning ✨ 🎉 Traingenerator is now live! 🎉

Johannes Rieke 1.2k Jan 07, 2023
This repository contains full machine learning pipeline of the Zillow Houses competition on Kaggle platform.

Zillow-Houses This repository contains full machine learning pipeline of the Zillow Houses competition on Kaggle platform. Pipeline is consists of 10

2 Jan 09, 2022
PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.

PyNNDescent PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors. It provides a python implementation of Nearest Neighbo

Leland McInnes 699 Jan 09, 2023