Cross-modal Retrieval using Transformer Encoder Reasoning Networks (TERN), with Metric Learning and FAISS for fast similarity search on GPU

Last update: Nov 05, 2022

Overview

Cross-modal Retrieval using Transformer Encoder Reasoning Networks

This project reimplements the idea from "Transformer Reasoning Network for Image-Text Matching and Retrieval". To solve cross-modal retrieval, representative features from both modalities are extracted with separate pipelines and then projected into the same embedding space. Because the features are sequences of vectors, a Transformer-based model is a natural fit. The main contributions of this repo are:

Reimplement the TERN module, which applies Transformer encoders to bottom-up attention features and BERT features.
Use facebookresearch's FAISS for efficient similarity search and clustering of dense vectors.
Experiment with various metric learning loss objectives from KevinMusgrave's PyTorch Metric Learning.
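Once both modalities live in the same embedding space, retrieval reduces to nearest-neighbour search. The sketch below is a minimal NumPy illustration of exact cosine-similarity search, assuming toy 3-d embeddings; the names `l2_normalize` and `top_k_search` are hypothetical and not part of this repo. FAISS's `IndexFlatIP` performs the same exact inner-product search much faster on large collections; which index type this repo actually uses is not stated here.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def top_k_search(queries: np.ndarray, gallery: np.ndarray, k: int):
    """Exact top-k search; returns (scores, indices) with the best match first."""
    q = l2_normalize(queries.astype(np.float32))
    g = l2_normalize(gallery.astype(np.float32))
    sims = q @ g.T                              # shape: (num_queries, num_gallery)
    idx = np.argsort(-sims, axis=1)[:, :k]      # highest similarity first
    scores = np.take_along_axis(sims, idx, axis=1)
    return scores, idx

# Toy data (assumed): 4 "image" embeddings and 2 "text" queries in a shared space.
image_emb = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [1.0, 1.0, 0.0]])
text_emb = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.2, 0.8]])

scores, indices = top_k_search(text_emb, image_emb, k=2)
print(indices)
```

Normalising first means the ranking depends only on direction, not on vector length, which is the usual convention when embeddings are compared by cosine similarity.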

The figure below shows the overview of the architecture

Datasets

I trained TERN on the Flickr30k dataset, which contains 31,000 images collected from Flickr, each with 5 reference sentences provided by human annotators. For each sample, visual and text features are pre-extracted and stored as NumPy files.
Some samples from the dataset:

Images	Captions
	1. An elderly man is setting the table in front of an open door that leads outside to a garden. 2. The guy in the black sweater is looking onto the table below. 3. A man in a black jacket picking something up from a table. 4. An old man wearing a black jacket is looking on the table. 5. The gray-haired man is wearing a sweater.
	1. Two men are working on a bicycle on the side of the road. 2. Three men working on a bicycle on a cobblestone street. 3. Two men wearing shorts are working on a blue bike. 4. Three men inspecting a bicycle on a street. 5. Three men examining a bicycle.

Execution

Installation

pip install -r requirements.txt
apt install libomp-dev
pip install faiss-gpu

Specify dataset paths and configuration in the config file
For training

PYTHONPATH=. python tools/train.py
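
Training optimises a metric learning objective over matching image-caption pairs. The sketch below shows one common choice for image-text matching, a hinge-based triplet loss with in-batch hardest negatives, written in plain NumPy. It is an illustrative assumption, not necessarily the loss configured in this repo; the function name and the margin value are hypothetical.

```python
import numpy as np

def hardest_negative_triplet_loss(img: np.ndarray, txt: np.ndarray, margin: float = 0.2) -> float:
    """img[i] and txt[i] are a matching pair; every other pair in the batch is a negative."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sims = img @ txt.T                     # sims[i, j] = cosine(image i, caption j)
    pos = np.diag(sims)                    # similarity of the matching pairs
    # Mask the diagonal so a positive is never selected as its own negative.
    masked = np.where(np.eye(len(sims), dtype=bool), -np.inf, sims)
    hardest_txt = masked.max(axis=1)       # hardest wrong caption for each image
    hardest_img = masked.max(axis=0)       # hardest wrong image for each caption
    loss_i2t = np.maximum(0.0, margin + hardest_txt - pos)
    loss_t2i = np.maximum(0.0, margin + hardest_img - pos)
    return float((loss_i2t + loss_t2i).mean())

# Perfectly aligned pairs: every positive beats its negatives by more than the margin.
aligned = np.eye(3)
loss_good = hardest_negative_triplet_loss(aligned, aligned)

# Swapped pairs: each image is closest to the wrong caption.
img_bad = np.array([[1.0, 0.0], [0.0, 1.0]])
txt_bad = np.array([[0.0, 1.0], [1.0, 0.0]])
loss_bad = hardest_negative_triplet_loss(img_bad, txt_bad)

print(loss_good, loss_bad)
```

The loss is zero once every matching pair is more similar than its hardest negative by at least the margin, and grows as wrong pairs become more similar than the right ones.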

For evaluation

PYTHONPATH=. python tools/eval.py \
                --top_k=<top k similarity> \
                --weight=<model checkpoint>
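
The `--top_k` option controls how many nearest neighbours are retrieved per query. A common way to score such a ranking is recall at k: the fraction of queries with at least one correct item among the top k results. The sketch below is a minimal NumPy version under that assumption; the function name `recall_at_k` is hypothetical and the metric computed by tools/eval.py may differ.

```python
import numpy as np

def recall_at_k(sims: np.ndarray, relevant: list, k: int) -> float:
    """Fraction of queries with at least one relevant item among the top-k results.

    sims[q, g] is the similarity between query q and gallery item g;
    relevant[q] is the set of gallery indices that match query q.
    """
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = [bool(set(row.tolist()) & relevant[q]) for q, row in enumerate(top_k)]
    return float(np.mean(hits))

# Toy image-to-text case (assumed data): 2 images, 4 captions, 2 captions per image.
sims = np.array([[0.9, 0.1, 0.3, 0.2],    # image 0: top caption is 0 (correct)
                 [0.8, 0.7, 0.6, 0.1]])   # image 1: correct captions are 2 and 3
relevant = [{0, 1}, {2, 3}]

r1 = recall_at_k(sims, relevant, k=1)
r3 = recall_at_k(sims, relevant, k=3)
print(r1, r3)
```

On Flickr30k each image has 5 captions, so for image-to-text retrieval each `relevant[q]` would hold 5 indices, while for text-to-image retrieval it would hold a single image index.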

For inference
- See tools/inference.py script

Notebooks

Run TERN inference on the Flickr30k dataset
Use FasterRCNN to extract Bottom Up embeddings
Use BERT to extract text embeddings

Results

Validation results on the Flickr30k dataset (trained for 100 epochs):

Model	Weights	i2t (image-to-text)	t2i (text-to-image)
TERN	link	0.5174	0.7496

Some visualizations

Query text: Two dogs are running along the street

Query text: The woman is holding a violin

Query text: Young boys are playing baseball

Query text: A man is standing, looking at a lake

Paper References

@misc{messina2021transformer,
      title={Transformer Reasoning Network for Image-Text Matching and Retrieval}, 
      author={Nicola Messina and Fabrizio Falchi and Andrea Esuli and Giuseppe Amato},
      year={2021},
      eprint={2004.09144},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{anderson2018bottomup,
      title={Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering}, 
      author={Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang},
      year={2018},
      eprint={1707.07998},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{JDH17,
  title={Billion-scale similarity search with GPUs},
  author={Johnson, Jeff and Douze, Matthijs and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1702.08734},
  year={2017}
}

Owner

Minh-Khoi Pham
