This repository contains the database and code used in the paper Embedding Arithmetic for Text-driven Image Transformation

Related tags

Deep LearningSIMAT
Overview

This repository contains the database and code used in the paper Embedding Arithmetic for Text-driven Image Transformation (Guillaume Couairon, Holger Schwenk, Matthijs Douze, Matthieu Cord)

The inspiration for this work are the geometric properties of word embeddings, such as Queen ~ Woman + (King - Man). We extend this idea to multimodal embedding spaces (like CLIP), which let us semantically edit images via "delta vectors".

Transformed images can then be retrieved in a dataset of images.

The SIMAT Dataset

We build SIMAT, a dataset to evaluate the task of text-driven image transformation, for simple images that can be characterized by a single subject-relation-object annotation. A transformation query is a pair (image, query) where the query asks to change the subject, the relation or the object in the input image. SIMAT contains ~6k images and an average of 3 transformation queries per image.

The goal is to retrieve an image in the dataset that corresponds to the query specifications. We use OSCAR as an oracle to check whether retrieved images are correct with respect to the expected modifications.

Examples

Below are a few examples that are in the dataset, and images that were retrieved for our best-performing algorithm.

Download dataset

The SIMAT database is composed of crops of images from Visual Genome. You first need to install Visual Genome and then run the following command :

python prepare_dataset.py --VG_PATH=/path/to/visual/genome

Perform inference with CLIP ViT-B/32

In this example, we use the CLIP ViT-B/32 model to edit an image. Note that the dataset of clip embeddings is pre-computed.

import clip
from torchvision import datasets
from PIL import Image
from IPython.display import display

#hack to normalize tensors easily
torch.Tensor.normalize = lambda x:x/x.norm(dim=-1, keepdim=True)

# database to perform the retrieval step
dataset = datasets.ImageFolder('simat_db/images/')
db = torch.load('data/clip_simat.pt').float()

model, prep = clip.load('ViT-B/32', device='cuda:0', jit=False)

image = Image.open('simat_db/images/A cat sitting on a grass/98316.jpg')
img_enc = model.encode_image(prep(image).unsqueeze(0).to('cuda:0')).float().cpu().detach().normalize()

txt = ['cat', 'dog']
txt_enc = model.encode_text(clip.tokenize(txt).to('cuda:0')).float().cpu().detach().normalize()

# optionally, we can apply a linear layer on top of the embeddings
heads = torch.load(f'data/head_clip_t=0.1.pt')
img_enc = heads['img_head'](img_enc).normalize()
txt_enc = heads['txt_head'](txt_enc).normalize()
db = heads['img_head'](db).normalize()


# now we perform the transformation step
lbd = 1
target_enc = img_enc + lbd * (txt_enc[1] - txt_enc[0])


retrieved_idx = (db @ target_enc.float().T).argmax(0).item()


display(dataset[retrieved_idx][0])

Compute SIMAT scores with CLIP

You can run the evaluation script with the following command:

python eval.py --backbone clip --domain dev --tau 0.01 --lbd 1 2

It automatically load the adaptation layer relative to the value of tau.

Train adaptation layers on COCO

In this part, you can train linear layers after the CLIP encoder on the COCO dataset, to get a better alignment. Here is an example :

python adaptation.py --backbone ViT-B/32 --lr 0.001 --tau 0.1 --batch_size 512

Citation

If you find this paper or dataset useful for your research, please use the following.

@article{gco1embedding,
  title={Embedding Arithmetic for text-driven Image Transformation},
  author={Guillaume Couairon, Matthieu Cord, Matthijs Douze, Holger Schwenk},
  journal={arXiv preprint arXiv:2112.03162},
  year={2021}
}

References

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, OpenAI 2021

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, IJCV 2017

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao, Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020

License

The SIMAT is released under the MIT license. See LICENSE for details.

Owner
Meta Research
Meta Research
A geometric deep learning pipeline for predicting protein interface contacts.

A geometric deep learning pipeline for predicting protein interface contacts.

44 Dec 30, 2022
SurfEmb (CVPR 2022) - SurfEmb: Dense and Continuous Correspondence Distributions

SurfEmb SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings Rasmus Laurvig Haugard, A

Rasmus Haugaard 56 Nov 19, 2022
Convert Pytorch model to onnx or tflite, and the converted model can be visualized by Netron

Convert Pytorch model to onnx or tflite, and the converted model can be visualized by Netron

Roxbili 5 Nov 19, 2022
CvT-ASSD: Convolutional vision-Transformerbased Attentive Single Shot MultiBox Detector (ICTAI 2021 CCF-C 会议)The 33rd IEEE International Conference on Tools with Artificial Intelligence

CvT-ASSD including extra CvT, CvT-SSD, VGG-ASSD models original-code-website: https://github.com/albert-jin/CvT-SSD new-code-website: https://github.c

金伟强 -上海大学人工智能小渣渣~ 5 Mar 07, 2022
Official repository of Semantic Image Matting

Semantic Image Matting This is the official repository of Semantic Image Matting (CVPR2021). Overview Natural image matting separates the foreground f

192 Dec 29, 2022
OneShot Learning-based hotword detection.

EfficientWord-Net Hotword detection based on one-shot learning Home assistants require special phrases called hotwords to get activated (eg:"ok google

ANT-BRaiN 102 Dec 25, 2022
BLEURT is a metric for Natural Language Generation based on transfer learning.

BLEURT: a Transfer Learning-Based Metric for Natural Language Generation BLEURT is an evaluation metric for Natural Language Generation. It takes a pa

Google Research 492 Jan 05, 2023
Towards Long-Form Video Understanding

Towards Long-Form Video Understanding Chao-Yuan Wu, Philipp Krähenbühl, CVPR 2021 [Paper] [Project Page] [Dataset] Citation @inproceedings{lvu2021,

Chao-Yuan Wu 69 Dec 26, 2022
The full training script for Enformer (Tensorflow Sonnet) on TPU clusters

Enformer TPU training script (wip) The full training script for Enformer (Tensorflow Sonnet) on TPU clusters, in an effort to migrate the model to pyt

Phil Wang 10 Oct 19, 2022
Multi-Scale Aligned Distillation for Low-Resolution Detection (CVPR2021)

MSAD Multi-Scale Aligned Distillation for Low-Resolution Detection Lu Qi*, Jason Kuen*, Jiuxiang Gu, Zhe Lin, Yi Wang, Yukang Chen, Yanwei Li, Jiaya J

Jia Research Lab 115 Dec 23, 2022
Some useful blender add-ons for SMPL skeleton's poses and global translation.

Blender add-ons for SMPL skeleton's poses and trans There are two blender add-ons for SMPL skeleton's poses and trans.The first is for making an offli

犹在镜中 154 Jan 04, 2023
A Repository of Community-Driven Natural Instructions

A Repository of Community-Driven Natural Instructions TLDR; this repository maintains a community effort to create a large collection of tasks and the

AI2 244 Jan 04, 2023
Code for NeurIPS 2020 article "Contrastive learning of global and local features for medical image segmentation with limited annotations"

Contrastive learning of global and local features for medical image segmentation with limited annotations The code is for the article "Contrastive lea

Krishna Chaitanya 152 Dec 22, 2022
Reverse engineering recurrent neural networks with Jacobian switching linear dynamical systems

Reverse engineering recurrent neural networks with Jacobian switching linear dynamical systems This repository is the official implementation of Rever

6 Aug 25, 2022
Train Yolov4 using NBX-Jobs

yolov4-trainer-nbox Train Yolov4 using NBX-Jobs. Use the powerfull functionality available in nbox-SDK repo to train a tiny-Yolo v4 model on Pascal VO

Yash Bonde 1 Jan 12, 2022
Simple PyTorch hierarchical models.

A python package adding basic hierarchal networks in pytorch for classification tasks. It implements a simple hierarchal network structure based on feed-backward outputs.

Rajiv Sarvepalli 5 Mar 06, 2022
Template repository for managing machine learning research projects built with PyTorch-Lightning

Tutorial Repository with a minimal example for showing how to deploy training across various compute infrastructure.

Sidd Karamcheti 3 Feb 11, 2022
Code for our NeurIPS 2021 paper Mining the Benefits of Two-stage and One-stage HOI Detection

CDN Code for our NeurIPS 2021 paper "Mining the Benefits of Two-stage and One-stage HOI Detection". Contributed by Aixi Zhang*, Yue Liao*, Si Liu, Mia

71 Dec 14, 2022
A Tensorflow implementation of CapsNet based on Geoffrey Hinton's paper Dynamic Routing Between Capsules

CapsNet-Tensorflow A Tensorflow implementation of CapsNet based on Geoffrey Hinton's paper Dynamic Routing Between Capsules Notes: The current version

Huadong Liao 3.8k Dec 29, 2022
SPTAG: A library for fast approximate nearest neighbor search

SPTAG: A library for fast approximate nearest neighbor search SPTAG SPTAG (Space Partition Tree And Graph) is a library for large scale vector approxi

Microsoft 4.3k Jan 01, 2023