Cross-modal Retrieval using Transformer Encoder Reasoning Networks (TERN). With use of Metric Learning and FAISS for fast similarity search on GPU

Overview

Cross-modal Retrieval using Transformer Encoder Reasoning Networks

This project reimplements the idea from "Transformer Reasoning Network for Image-Text Matching and Retrieval". To solve the task of cross-modal retrieval, representative features from both modal are extracted using distinctive pipeline and then projected into the same embedding space. Because the features are sequence of vectors, Transformer-based model can be utilised to work best. In this repo, my highlight contribution is:

  • Reimplement TERN module, which exploits the effectiveness of using Transformer on bottom-up attention features and bert features.
  • Take advantage of facebookresearch's FAISS for efficient similarity search and clustering of dense vectors.
  • Experiment various metric learning loss objectives from KevinMusgrave's Pytorch Metric Learning

The figure below shows the overview of the architecture

screen

Datasets

  • I trained TERN on Flickr30k dataset which contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators for each image. For each sample, visual and text features are pre-extracted as numpy files

  • Some samples from the dataset:

Images Captions
screen 1. An elderly man is setting the table in front of an open door that leads outside to a garden.
2. The guy in the black sweater is looking onto the table below.
3. A man in a black jacket picking something up from a table.
4. An old man wearing a black jacket is looking on the table.
5. The gray-haired man is wearing a sweater.
screen 1. Two men are working on a bicycle on the side of the road.
2. Three men working on a bicycle on a cobblestone street.
3. Two men wearing shorts are working on a blue bike.
4. Three men inspecting a bicycle on a street.
5. Three men examining a bicycle.

Execution

  • Installation
pip install -r requirements.txt
apt install libomp-dev
pip install faiss-gpu
  • Specify dataset paths and configuration in the config file

  • For training

PYTHONPATH=. python tools/train.py 
  • For evaluation
PYTHONPATH=. python tools/eval.py \
                --top_k= <top k similarity> \
                --weight= <model checkpoint> \

Notebooks

  • Notebook Inference TERN on Flickr30k dataset
  • Notebook Use FasterRCNN to extract Bottom Up embeddings
  • Notebook Use BERT to extract text embeddings

Results

  • Validation m on Flickr30k dataset (trained for 100 epochs):
Model Weights i2t/[email protected] t2i/[email protected]
TERN link 0.5174 0.7496
  • Some visualization
Query text: Two dogs are running along the street
screen
Query text: The woman is holding a violin
screen
Query text: Young boys are playing baseball
screen
Query text: A man is standing, looking at a lake
screen

Paper References

@misc{messina2021transformer,
      title={Transformer Reasoning Network for Image-Text Matching and Retrieval}, 
      author={Nicola Messina and Fabrizio Falchi and Andrea Esuli and Giuseppe Amato},
      year={2021},
      eprint={2004.09144},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{anderson2018bottomup,
      title={Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering}, 
      author={Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang},
      year={2018},
      eprint={1707.07998},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@article{JDH17,
  title={Billion-scale similarity search with GPUs},
  author={Johnson, Jeff and Douze, Matthijs and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1702.08734},
  year={2017}
}

Code References

Owner
Minh-Khoi Pham
Passionate Machine Learner
Minh-Khoi Pham
ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

Xtra Computing Group 647 Jan 04, 2023
MohammadReza Sharifi 27 Dec 13, 2022
An adaptive hierarchical energy management strategy for hybrid electric vehicles

An adaptive hierarchical energy management strategy This project contains the source code of an adaptive hierarchical EMS combining heuristic equivale

19 Dec 13, 2022
natural image generation using ConvNets

The Eyescream Project Generating Natural Images using Neural Networks. For our research summary on this work, please read the Arxiv paper: http://arxi

Meta Archive 601 Nov 23, 2022
Implementation of the paper: "SinGAN: Learning a Generative Model from a Single Natural Image"

SinGAN This is an unofficial implementation of SinGAN from someone who's been sitting right next to SinGAN's creator for almost five years. Please ref

35 Nov 10, 2022
Stock-Prediction - prediction of stock market movements using sentiment analysis and deep learning.

Stock-Prediction- In this project, we aim to enhance the prediction of stock market movements using sentiment analysis and deep learning. We divide th

5 Jan 25, 2022
DCSL - Generalizable Crowd Counting via Diverse Context Style Learning

DCSL Generalizable Crowd Counting via Diverse Context Style Learning Requirement

3 Jun 13, 2022
Tzer: TVM Implementation of "Coverage-Guided Tensor Compiler Fuzzing with Joint IR-Pass Mutation (OOPSLA'22)“.

Artifact • Reproduce Bugs • Quick Start • Installation • Extend Tzer Coverage-Guided Tensor Compiler Fuzzing with Joint IR-Pass Mutation This is the s

12 Dec 29, 2022
improvement of CLIP features over the traditional resnet features on the visual question answering, image captioning, navigation and visual entailment tasks.

CLIP-ViL In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features over the traditional resnet fea

310 Dec 28, 2022
🗣️ Microsoft Edge TTS for Home Assistant, no need for app_key

Microsoft Edge TTS for Home Assistant This component is based on the TTS service of Microsoft Edge browser, no need to apply for app_key. Install Down

152 Dec 31, 2022
DeconvNet : Learning Deconvolution Network for Semantic Segmentation

DeconvNet: Learning Deconvolution Network for Semantic Segmentation Created by Hyeonwoo Noh, Seunghoon Hong and Bohyung Han at POSTECH Acknowledgement

Hyeonwoo Noh 325 Oct 20, 2022
The fastai deep learning library

Welcome to fastai fastai simplifies training fast and accurate neural nets using modern best practices Important: This documentation covers fastai v2,

fast.ai 23.2k Jan 07, 2023
FNet Implementation with TensorFlow & PyTorch

FNet Implementation with TensorFlow & PyTorch. TensorFlow & PyTorch implementation of the paper "FNet: Mixing Tokens with Fourier Transforms". Overvie

Abdelghani Belgaid 1 Feb 12, 2022
Reusable constraint types to use with typing.Annotated

annotated-types PEP-593 added typing.Annotated as a way of adding context-specific metadata to existing types, and specifies that Annotated[T, x] shou

125 Dec 26, 2022
Age and Gender prediction using Keras

cnn_age_gender Age and Gender prediction using Keras Dataset example : Description : UTKFace dataset is a large-scale face dataset with long age span

XN3UR0N 58 May 03, 2022
Deep Sea Treasure Environment for Multi-Objective Optimization Research

DeepSeaTreasure Environment Installation In order to get started with this environment, you can install it using the following command: python3 -m pip

imec IDLab 6 Nov 14, 2022
A minimalist environment for decision-making in autonomous driving

highway-env A collection of environments for autonomous driving and tactical decision-making tasks An episode of one of the environments available in

Edouard Leurent 1.6k Jan 07, 2023
百度2021年语言与智能技术竞赛机器阅读理解Pytorch版baseline

项目说明: 百度2021年语言与智能技术竞赛机器阅读理解Pytorch版baseline 比赛链接:https://aistudio.baidu.com/aistudio/competition/detail/66?isFromLuge=true 官方的baseline版本是基于paddlepadd

周俊贤 54 Nov 23, 2022
K Closest Points and Maximum Clique Pruning for Efficient and Effective 3D Laser Scan Matching (To appear in RA-L 2022)

KCP The official implementation of KCP: k Closest Points and Maximum Clique Pruning for Efficient and Effective 3D Laser Scan Matching, accepted for p

Yu-Kai Lin 109 Dec 14, 2022
Official Implementation of CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback

CoSMo.pytorch Official Implementation of CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback, Seungmin Lee*, Dongwan Kim*, Bohyung

Seung Min Lee 54 Dec 08, 2022