More Grounded Image Captioning by Distilling Image-Text Matching Model

Last update: Dec 16, 2022

Related tags

Deep Learning Grounded-Image-Captioning

Overview

More Grounded Image Captioning by Distilling Image-Text Matching Model

Requirements

Python 3.7
Pytorch 1.2

Prepare data

Clone this repository with git clone --recurse-submodules and follow the initialization steps in coco-caption/README.md. Then download the Flickr30k reference file and place it under coco-caption/annotations. Also download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools/ directory.
Download the preprocessed dataset from this link and extract it to data/.
For Flickr30k-Entities, download the bottom-up visual features extracted by Anderson's extractor (Zhou's extractor) from this link (link) and place the uncompressed folders under data/flickrbu/. For MSCOCO, follow this instruction to prepare the bottom-up features and place them under data/mscoco/.
Download the pretrained models from here and extract them to log/.
Download the pretrained SCAN models from this link and extract them to misc/SCAN/runs.
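
Before running evaluation or training, it can help to confirm that the directory layout described above is in place. The following is a minimal sketch and not part of the repository: the helper name missing_paths and the EXPECTED_DIRS list are hypothetical, and the list contains only the directories named in the steps above.

```python
from pathlib import Path

# Directories named in the data-preparation steps above.
# This list is an assumption made for illustration; data/mscoco is only
# needed for the MSCOCO experiments, so trim the list to match your setup.
EXPECTED_DIRS = [
    "coco-caption/annotations",
    "tools",
    "data",
    "data/flickrbu",
    "data/mscoco",
    "log",
    "misc/SCAN/runs",
]


def missing_paths(root, expected=EXPECTED_DIRS):
    """Return the entries of `expected` that are not directories under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_paths(".")
    if missing:
        print("Missing directories:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected directories are present.")
```

The check only confirms that the directories exist. It does not verify that the downloaded archives were extracted completely.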

Evaluation

To reproduce the results reported in the paper, run

bash eval_flickr.sh

for Flickr30k-Entities and

bash eval_coco.sh

for MSCOCO.

Training

For the first training stage, run:

python train.py --id CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 --val_images_use -1 --max_epochs 30 --att_supervise True --att_supervise_weight 0.1

For the second training stage, run:

python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 --language_eval 1 --val_images_use -1 --self_critical_after 30 --max_epochs 110 --cider_reward_weight 1 --ground_reward_weight 1
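
The two stages share their data options and differ in the learning rate, checkpoint handling, and the stage-specific flags (attention supervision in the first, self-critical rewards in the second). The sketch below assembles both command lines from one shared dictionary so the differences are easier to see. The helper build_command and the three dictionaries are hypothetical conveniences, not part of the repository; every flag and value is copied from the two commands above.

```python
import shlex

# Options common to both stages, copied from the commands above.
SHARED = {
    "caption_model": "topdown",
    "input_json": "data/flickrtalk.json",
    "input_fc_dir": "data/flickrbu/flickrbu_fc",
    "input_att_dir": "data/flickrbu/flickrbu_att",
    "input_box_dir": "data/flickrbu/flickrbu_box",
    "input_label_h5": "data/flickrtalk_label.h5",
    "batch_size": 29,
    "save_checkpoint_every": 1000,
    "val_images_use": -1,
}

# First stage: flags that appear only in the first command.
STAGE1 = {
    "id": "CE-scan-sup-0.1kl",
    "learning_rate": "5e-4",
    "learning_rate_decay_start": 0,
    "scheduled_sampling_start": 0,
    "checkpoint_path": "log/CE-scan-sup-0.1kl",
    "max_epochs": 30,
    "att_supervise": "True",
    "att_supervise_weight": 0.1,
}

# Second stage: starts from the first-stage checkpoint directory.
STAGE2 = {
    "id": "sc-ground-CE-scan-sup-0.1kl",
    "learning_rate": "5e-5",
    "start_from": "log/CE-scan-sup-0.1kl",
    "checkpoint_path": "log/sc-ground-CE-scan-sup-0.1kl",
    "language_eval": 1,
    "self_critical_after": 30,
    "max_epochs": 110,
    "cider_reward_weight": 1,
    "ground_reward_weight": 1,
}


def build_command(stage_options, shared=SHARED):
    """Return the train.py argument list for one stage (hypothetical helper)."""
    options = {**shared, **stage_options}
    args = ["python", "train.py"]
    for name, value in options.items():
        args += ["--" + name, str(value)]
    return args


if __name__ == "__main__":
    for stage in (STAGE1, STAGE2):
        # shlex.quote keeps the printed line safe to paste into a shell.
        print(" ".join(shlex.quote(arg) for arg in build_command(stage)))
```

The returned list can also be passed directly to subprocess.run(build_command(STAGE1), check=True) instead of being printed.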

Citation

@inproceedings{zhou2020grounded,
  title={More Grounded Image Captioning by Distilling Image-Text Matching Model},
  author={Zhou, Yuanen and Wang, Meng and Liu, Daqing and Hu, Zhenzhen and Zhang, Hanwang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2020}
}

Acknowledgements

This repository builds on self-critical.pytorch, SCAN, and grounded-video-description. Thanks to their authors for releasing the code.


Owner

YE Zhou
