Image Captioning using CNN and Transformers

Overview

Image-Captioning

Keras/Tensorflow Image Captioning application using CNN and Transformer as encoder/decoder.
In particulary, the architecture consists of three models:

  1. A CNN: used to extract the image features. In this application, it used EfficientNetB0 pre-trained on imagenet.
  2. A TransformerEncoder: the extracted image features are then passed to a Transformer based encoder that generates a new representation of the inputs.
  3. A TransformerDecoder: this model takes the encoder output and the text data sequence as inputs and tries to learn to generate the caption.

Dataset

The model has been trained on 2014 Train/Val COCO dataset. You can download the dataset here. Note that test images are not required for this code to work.

Original dataset has 82783 train images and 40504 validation images; for each image there is a number of captions between 1 and 6. I have preprocessing the dataset per to keep only images that have exactly 5 captions. In fact, the model has been trained to ensure that 5 captions are assigned for each image. After this filtering, the final dataset has 68363 train images and 33432 validation images.
Finally, I serialized the dataset into two json files which you can find in:

COCO_dataset/captions_mapping_train.json
COCO_dataset/captions_mapping_valid.json

Each element in the captions_mapping_train.json file has such a structure :
"COCO_dataset/train2014/COCO_train2014_000000318556.jpg": ["caption1", "caption2", "caption3", "caption4", "caption5"], ...

In same way in the captions_mapping_valid.json :
"COCO_dataset/val2014/COCO_val2014_000000203564.jpg": ["caption1", "caption2", "caption3", "caption4", "caption5"], ...

Dependencies

I have used the following versions for code work:

  • python==3.8.8
  • tensorflow==2.4.1
  • tensorflow-gpu==2.4.1
  • numpy==1.19.1
  • h5py==2.10.0

Training

To train the model you need to follow the following steps :

  1. you have to make sure that the training set images are in the folder COCO_dataset/train2014/ and that validation set images are in COCO_dataset/val2014/.
  2. you have to enter all the parameters necessary for the training in the settings.py file.
  3. start the model training with python3 training.py

My settings

For my training session, I have get best results with this settings.py file :

# Desired image dimensions
IMAGE_SIZE = (299, 299)
# Max vocabulary size
MAX_VOCAB_SIZE = 2000000
# Fixed length allowed for any sequence
SEQ_LENGTH = 25
# Dimension for the image embeddings and token embeddings
EMBED_DIM = 512
# Number of self-attention heads
NUM_HEADS = 6
# Per-layer units in the feed-forward network
FF_DIM = 1024
# Shuffle dataset dim on tf.data.Dataset
SHUFFLE_DIM = 512
# Batch size
BATCH_SIZE = 64
# Numbers of training epochs
EPOCHS = 14

# Reduce Dataset
# If you want reduce number of train/valid images dataset, set 'REDUCE_DATASET=True'
# and set number of train/valid images that you want.
#### COCO dataset
# Max number train dataset images : 68363
# Max number valid dataset images : 33432
REDUCE_DATASET = False
# Number of train images -> it must be a value between [1, 68363]
NUM_TRAIN_IMG = None
# Number of valid images -> it must be a value between [1, 33432]
NUM_VALID_IMG = None
# Data augumention on train set
TRAIN_SET_AUG = True
# Data augmention on valid set
VALID_SET_AUG = False

# Load train_data.json pathfile
train_data_json_path = "COCO_dataset/captions_mapping_train.json"
# Load valid_data.json pathfile
valid_data_json_path = "COCO_dataset/captions_mapping_valid.json"
# Load text_data.json pathfile
text_data_json_path  = "COCO_dataset/text_data.json"

# Save training files directory
SAVE_DIR = "save_train_dir/"

I have training model on full dataset (68363 train images and 33432 valid images) but you can train the model on a smaller number of images by changing the NUM_TRAIN_IMG / NUM_VALID_IMG parameters to reduce the training time and hardware resources required.

Data augmention

I applied data augmentation on the training set during the training to reduce the generalization error, with this transformations (this code is write in dataset.py) :

trainAug = tf.keras.Sequential([
    	tf.keras.layers.experimental.preprocessing.RandomContrast(factor=(0.05, 0.15)),
    	tf.keras.layers.experimental.preprocessing.RandomTranslation(height_factor=(-0.10, 0.10), width_factor=(-0.10, 0.10)),
	tf.keras.layers.experimental.preprocessing.RandomZoom(height_factor=(-0.10, 0.10), width_factor=(-0.10, 0.10)),
	tf.keras.layers.experimental.preprocessing.RandomRotation(factor=(-0.10, 0.10))
])

You can customize your data augmentation by changing this code or disable data augmentation setting TRAIN_SET_AUG = False in setting.py.

My results

This is results of my best training :

Epoch 1/13
1069/1069 [==============================] - 1450s 1s/step - loss: 17.3777 - acc: 0.3511 - val_loss: 13.9711 - val_acc: 0.4819
Epoch 2/13
1069/1069 [==============================] - 1453s 1s/step - loss: 13.7338 - acc: 0.4850 - val_loss: 12.7821 - val_acc: 0.5133
Epoch 3/13
1069/1069 [==============================] - 1457s 1s/step - loss: 12.9772 - acc: 0.5069 - val_loss: 12.3980 - val_acc: 0.5229
Epoch 4/13
1069/1069 [==============================] - 1452s 1s/step - loss: 12.5683 - acc: 0.5179 - val_loss: 12.2659 - val_acc: 0.5284
Epoch 5/13
1069/1069 [==============================] - 1450s 1s/step - loss: 12.3292 - acc: 0.5247 - val_loss: 12.1828 - val_acc: 0.5316
Epoch 6/13
1069/1069 [==============================] - 1443s 1s/step - loss: 12.1614 - acc: 0.5307 - val_loss: 12.1410 - val_acc: 0.5341
Epoch 7/13
1069/1069 [==============================] - 1453s 1s/step - loss: 12.0461 - acc: 0.5355 - val_loss: 12.1234 - val_acc: 0.5354
Epoch 8/13
1069/1069 [==============================] - 1440s 1s/step - loss: 11.9533 - acc: 0.5407 - val_loss: 12.1086 - val_acc: 0.5367
Epoch 9/13
1069/1069 [==============================] - 1444s 1s/step - loss: 11.8838 - acc: 0.5427 - val_loss: 12.1235 - val_acc: 0.5373
Epoch 10/13
1069/1069 [==============================] - 1443s 1s/step - loss: 11.8114 - acc: 0.5460 - val_loss: 12.1574 - val_acc: 0.5367
Epoch 11/13
1069/1069 [==============================] - 1444s 1s/step - loss: 11.7543 - acc: 0.5486 - val_loss: 12.1518 - val_acc: 0.5371

These are good results considering that for each image given as input to the model during training, the error and the accuracy are averaged over 5 captions. However, I spent little time doing model selection and you can improve the results by trying better settings.
For example, you could :

  1. change CNN architecture.
  2. change SEQ_LENGTH, EMBED_DIM, NUM_HEADS, FF_DIM, BATCH_SIZE (etc...) parameters.
  3. change data augmentation transformations/parameters.
  4. etc...

N.B. I have saved my best training results files in the directory save_train_dir/.

Inference

After training and saving the model, you can restore it in a new session to inference captions on new images.
To generate a caption from a new image, you must :

  1. insert the parameters in the file settings_inference.py
  2. run python3 inference.py --image={image_path_file}

Results example

Examples of image output taken from the validation set.

a large passenger jet flying through the sky
a man in a white shirt and black shorts playing tennis
a person on a snowboard in the snow
a boy on a skateboard in the street
a black bear is walking through the grass
a train is on the tracks near a station
Owner
I love computer vision and NLP. I love artificial intelligence. Machine Learning and Big Data master's degree student.
[ICLR 2021] HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark

HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark Accepted as a spotlight paper at ICLR 2021. Table of content File structure Prerequi

72 Jan 03, 2023
šŸ¤— Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX.

English | 简体中文 | 繁體中文 State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow šŸ¤— Transformers provides thousands of pretrained mo

Hugging Face 77.2k Jan 02, 2023
Food recognition model using convolutional neural network & computer vision

Food recognition model using convolutional neural network & computer vision. The goal is to match or beat the DeepFood Research Paper

Hemanth Chandran 1 Jan 13, 2022
A PyTorch implementation of "Signed Graph Convolutional Network" (ICDM 2018).

SGCN ā € A PyTorch implementation of Signed Graph Convolutional Network (ICDM 2018). Abstract Due to the fact much of today's data can be represented as

Benedek Rozemberczki 251 Nov 30, 2022
[ICML 2021] ā€œ Self-Damaging Contrastive Learningā€, Ziyu Jiang, Tianlong Chen, Bobak Mortazavi, Zhangyang Wang

Self-Damaging Contrastive Learning Introduction The recent breakthrough achieved by contrastive learning accelerates the pace for deploying unsupervis

VITA 51 Dec 29, 2022
Fully Convolutional Refined Auto Encoding Generative Adversarial Networks for 3D Multi Object Scenes

Fully Convolutional Refined Auto-Encoding Generative Adversarial Networks for 3D Multi Object Scenes This repository contains the source code for Full

Yu Nishimura 106 Nov 21, 2022
Image-to-image regression with uncertainty quantification in PyTorch

Image-to-image regression with uncertainty quantification in PyTorch. Take any dataset and train a model to regress images to images with rigorous, distribution-free uncertainty quantification.

Anastasios Angelopoulos 25 Dec 26, 2022
Dynamica causal Bayesian optimisation

Dynamic Causal Bayesian Optimization This is a Python implementation of Dynamic Causal Bayesian Optimization as presented at NeurIPS 2021. Abstract Th

nd308 18 Nov 22, 2022
Point detection through multi-instance deep heatmap regression for sutures in endoscopy

Suture detection PyTorch This repo contains the reference implementation of suture detection model in PyTorch for the paper Point detection through mu

artificial intelligence in the area of cardiovascular healthcare 3 Jul 16, 2022
A Python-based development platform for automated trading systems - from backtesting to optimisation to livetrading.

AutoTrader AutoTrader is Python-based platform intended to help in the development, optimisation and deployment of automated trading systems. From sim

Kieran Mackle 485 Jan 09, 2023
This provides the R code and data to replicate results in "The USS Trustee’s risky strategy"

USSBriefs2021 This provides the R code and data to replicate results in "The USS Trustee’s risky strategy" by Neil M Davies, Jackie Grant and Chin Yan

1 Oct 30, 2021
Interpolation-based reduced-order models

Interpolation-reduced-order-models Interpolation-based reduced-order models High-fidelity computational fluid dynamics (CFD) solutions are time consum

Donovan Blais 1 Jan 10, 2022
TLDR; Train custom adaptive filter optimizers without hand tuning or extra labels.

AutoDSP TLDR; Train custom adaptive filter optimizers without hand tuning or extra labels. About Adaptive filtering algorithms are commonplace in sign

Jonah Casebeer 48 Sep 19, 2022
LLVM-based compiler for LightGBM gradient-boosted trees. Speeds up prediction by ≄10x.

LLVM-based compiler for LightGBM gradient-boosted trees. Speeds up prediction by ≄10x.

Simon Boehm 183 Jan 02, 2023
Face and other object detection using OpenCV and ML Yolo

Object-and-Face-Detection-Using-Yolo- Opencv and YOLO object and face detection is implemented. You only look once (YOLO) is a state-of-the-art, real-

Happy N. Monday 3 Feb 15, 2022
Quantization library for PyTorch. Support low-precision and mixed-precision quantization, with hardware implementation through TVM.

HAWQ: Hessian AWare Quantization HAWQ is an advanced quantization library written for PyTorch. HAWQ enables low-precision and mixed-precision uniform

Zhen Dong 293 Dec 30, 2022
A dual benchmarking study of visual forgery and visual forensics techniques

A dual benchmarking study of facial forgery and facial forensics In recent years, visual forgery has reached a level of sophistication that humans can

8 Jul 06, 2022
Code for "Solving Graph-based Public Good Games with Tree Search and Imitation Learning"

Code for "Solving Graph-based Public Good Games with Tree Search and Imitation Learning" This is the code for the paper Solving Graph-based Public Goo

Victor-Alexandru Darvariu 3 Dec 05, 2022
[NeurIPS 2021] Better Safe Than Sorry: Preventing Delusive Adversaries with Adversarial Training

Better Safe Than Sorry: Preventing Delusive Adversaries with Adversarial Training Code for NeurIPS 2021 paper "Better Safe Than Sorry: Preventing Delu

Lue Tao 29 Sep 20, 2022
Notepy is a full-featured Notepad Python app

Notepy A full featured python text-editor Notable features Autocompletion for parenthesis and quote Auto identation Syntax highlighting Compile and ru

Mirko Rovere 11 Sep 28, 2022