Simple image captioning model - CLIP prefix captioning.

Overview

CLIP prefix captioning.


Inference Notebook:

🥳 New: 🥳 Integrated to Huggingface Spaces with Gradio. See demo: Hugging Face Spaces

🥳 New: 🥳 Run it in the browser using replicate.ai UI

Description

Image captioning is a complicated task, where usually a pretrained detection network is used, requires additional supervision in the form of object annotation. The features of the detected objects are then fed to an additional network that is trained to output the correct caption. We present a new approach that does not requires additional information (i.e. requires only images and captions), thus can be applied to any data. In addition, our model's training time is much faster than similar methods while achieving close to state-of-the-art results, even for the Conceptual Captions dataset contains over 3M images.

In our work, we use the CLIP model, which was already trained over an extremely large number of images, thus is capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences we fine-tune a pretrained language model, which has been proven to be successful for other natural language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions by employing a simple Multi-Layer Perceptron (MLP) over the raw encoding, and then fine-tune our language model to generate a valid caption.

COCO Examples

A couple of people standing next to an elephant. A wooden table sitting in front of a window. A bunch of bananas sitting on top of a table.
A woman holding a plate with a piece of cake in front of her face. A wooden table topped with lots of wooden utensils. A red motorcycle parked on top of a dirt field.

Conceptual Captions Examples

3D render of a man holding a globe. Students enjoing the cherry blossoms Green leaf of lettuce on a white plate.
The hotel and casino on the waterfront. The triangle is a symbol of the soul. Cartoon boy in the bath.

Inference Notebooks

To help visualize the results we provide a Colab notebook found in notebooks/clip_prefix_captioning_inference.ipynb.
The notebook will download the pretrained models and run inference on a sample images or on images of your choosing. It is recommended to run this in Google Colab. Both COCO and Conceptual Captions pretrained models are available.

Inference GUI

Run it in the browser using replicate.ai UI.

COCO training

Clone, create environment and install dependencies:

git clone https://github.com/rmokady/CLIP_prefix_caption && cd CLIP_prefix_caption
conda env create -f environment.yml
conda activate clip_prefix_caption

Download train_captions to data/coco/annotations.

Download training images and validation images and unzip (We use Karpathy et el. split).

Extract CLIP features using (output is data/coco/oscar_split_train.pkl):

python parse_coco.py

Train:

python train.py --data ./data/coco/oscar_split_train.pkl --out_dir ./coco_train/

Qualitative results

COCO dataset

Method [email protected] [email protected] [email protected] [email protected] METEOR ROUGE-L CIDEr SPICE
Oscar* 75.59 60.09 46.89 36.58 30.40 58.56 124.12 23.17
Ours 74.12 57.40 43.11 32.15 27.10 55.02 108.35 20.12

* uses additional object annotations for training.

Conceptual Captions dataset

Method ROUGE-L CIDEr SPICE
VLP 24.35 77.57 16.59
Ours 26.71 87.26 18.5

Acknowledgments

This project was created by Ron Mokady and Amir Hertz for the Advanced-NLP course by Omer Levy @ TAU. This repository is heavily based on CLIP and Hugging-faces repositories. For training we used the data of COCO dataset and Conceptual Captions. The project was also inspired from this paper.

Contact

For any inquiry please contact us at our email addresses: [email protected] or [email protected].

Deep Learning pipeline for motor-imagery classification.

BCI-ToolBox 1. Introduction BCI-ToolBox is deep learning pipeline for motor-imagery classification. This repo contains five models: ShallowConvNet, De

DongHee 18 Oct 31, 2022
Evaluating Cross-lingual Sentence Representations

XNLI: The Cross-Lingual NLI Corpus XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages. New:

Meta Research 395 Dec 19, 2022
PlaidML is a framework for making deep learning work everywhere.

A platform for making deep learning work everywhere. Documentation | Installation Instructions | Building PlaidML | Contributing | Troubleshooting | R

PlaidML 4.5k Jan 02, 2023
robomimic: A Modular Framework for Robot Learning from Demonstration

robomimic [Homepage]   [Documentation]   [Study Paper]   [Study Website]   [ARISE Initiative] Latest Updates [08/09/2021] v0.1.0: Initial code and pap

ARISE Initiative 178 Jan 05, 2023
Supervised multi-SNE (S-multi-SNE): Multi-view visualisation and classification

S-multi-SNE Supervised multi-SNE (S-multi-SNE): Multi-view visualisation and classification A repository containing the code to reproduce the findings

Theodoulos Rodosthenous 3 Apr 15, 2022
Pythonic particle-based (super-droplet) warm-rain/aqueous-chemistry cloud microphysics package with box, parcel & 1D/2D prescribed-flow examples in Python, Julia and Matlab

PySDM PySDM is a package for simulating the dynamics of population of particles. It is intended to serve as a building block for simulation systems mo

Atmospheric Cloud Simulation Group @ Jagiellonian University 32 Oct 18, 2022
Repo for "TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets" at [email protected]

TableParser Repo for "TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets" at DS3 Lab 11 Dec 13, 2022

Code implementation from my Medium blog post: [Transformers from Scratch in PyTorch]

transformer-from-scratch Code for my Medium blog post: Transformers from Scratch in PyTorch Note: This Transformer code does not include masked attent

Frank Odom 27 Dec 21, 2022
Cooperative Driving Dataset: a dataset for multi-agent driving scenarios

Cooperative Driving Dataset (CODD) The Cooperative Driving dataset is a synthetic dataset generated using CARLA that contains lidar data from multiple

Eduardo Henrique Arnold 124 Dec 28, 2022
CVAT is free, online, interactive video and image annotation tool for computer vision

Computer Vision Annotation Tool (CVAT) CVAT is free, online, interactive video and image annotation tool for computer vision. It is being used by our

OpenVINO Toolkit 8.6k Jan 04, 2023
TransZero++: Cross Attribute-guided Transformer for Zero-Shot Learning

TransZero++ This repository contains the testing code for the paper "TransZero++: Cross Attribute-guided Transformer for Zero-Shot Learning" submitted

Shiming Chen 6 Aug 16, 2022
This repository includes the official project for the paper: TransMix: Attend to Mix for Vision Transformers.

TransMix: Attend to Mix for Vision Transformers This repository includes the official project for the paper: TransMix: Attend to Mix for Vision Transf

Jie-Neng Chen 130 Jan 01, 2023
Finding Donors for CharityML

Finding-Donors-for-CharityML - Investigated factors that affect the likelihood of charity donations being made based on real census data.

Moamen Abdelkawy 1 Dec 30, 2021
Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices, ACM Multimedia 2021

Codes for ECBSR Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices Xindong Zhang, Hui Zeng, Lei Zhang ACM Multimedia 202

xindong zhang 236 Dec 26, 2022
ICRA 2021 "Towards Precise and Efficient Image Guided Depth Completion"

PENet: Precise and Efficient Depth Completion This repo is the PyTorch implementation of our paper to appear in ICRA2021 on "Towards Precise and Effic

232 Dec 25, 2022
Vision-Language Pre-training for Image Captioning and Question Answering

VLP This repo hosts the source code for our AAAI2020 work Vision-Language Pre-training (VLP). We have released the pre-trained model on Conceptual Cap

Luowei Zhou 373 Jan 03, 2023
Automated Melanoma Recognition in Dermoscopy Images via Very Deep Residual Networks

Introduction This repository contains the modified caffe library and network architectures for our paper "Automated Melanoma Recognition in Dermoscopy

Lequan Yu 47 Nov 24, 2022
This program automatically runs Python code copied in clipboard

CopyRun This program runs Python code which is copied in clipboard WARNING!! USE AT YOUR OWN RISK! NO GUARANTIES IF ANYTHING GETS BROKEN. DO NOT COPY

vertinski 4 Sep 10, 2021
Self-Supervised Image Denoising via Iterative Data Refinement

Self-Supervised Image Denoising via Iterative Data Refinement Yi Zhang1, Dasong Li1, Ka Lung Law2, Xiaogang Wang1, Hongwei Qin2, Hongsheng Li1 1CUHK-S

Zhang Yi 72 Jan 01, 2023
CONetV2: Efficient Auto-Channel Size Optimization for CNNs

CONetV2: Efficient Auto-Channel Size Optimization for CNNs Exciting News! CONetV2: Efficient Auto-Channel Size Optimization for CNNs has been accepted

Mahdi S. Hosseini 3 Dec 13, 2021