OpenAI CLIP text encoders for multiple languages!

Last update: Dec 30, 2022

Multilingual-CLIP

OpenAI CLIP text encoders for any language

Overview

OpenAI recently released the paper Learning Transferable Visual Models From Natural Language Supervision, in which they present the CLIP (Contrastive Language–Image Pre-training) model. This model is trained to connect text and images by matching their corresponding vector representations using a contrastive learning objective. CLIP consists of two separate models, a visual encoder and a text encoder, which were trained on a whopping 400 million images and corresponding captions. OpenAI has since released a set of their smaller CLIP models, which can be found on the official CLIP GitHub.

We propose a fine-tuning method that replaces the original English text encoder with a pre-trained text model in any language. This makes it possible to adapt the powerful CLIP model to any language in roughly 24 GPU hours.

This repository contains

PyTorch inference code
TensorFlow training code
Pre-trained CLIP-Text encoders for multiple languages
Training data and pre-computed CLIP text encodings for a large portion of the image captions of GCC + MSCOCO + VizWiz

Requirements

Other versions may work equally well, but we have worked with the following:

Python = 3.6.9
Transformers = 4.1.1
Model Weights

Usage

Download CLIP Model

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the appropriate CUDA version on your machine, or with cpuonly when installing on a machine without a GPU. For more information, please see the official CLIP repository.

Download Linear Weights

# Linear Model Weights
$ bash get-weights.sh

Inference

from src import multilingual_clip

print(multilingual_clip.AVAILABLE_MODELS.keys())

model = multilingual_clip.load_model('M-BERT-Distil-40')

embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])

For a more elaborate example that compares the textual embeddings to the CLIP image embeddings, see this Colab notebook.
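As a rough illustration of such a comparison, the sketch below ranks images by cosine similarity to a text embedding. It is only a sketch under stated assumptions: the function names cosine_similarity and rank_images are made up for this example, the 3-dimensional vectors are toy stand-ins for real embeddings, and the notebook may compute the similarity differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return image indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, img) for img in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors (assumption): real text and image embeddings would come from
# the multilingual text encoder and the CLIP visual encoder respectively.
text = [1.0, 0.0, 0.0]
images = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_images(text, images)
print(ranking)
```

In practice the same ranking would be computed on tensors in one batched matrix product; the plain-Python version is used here only to keep the example dependency-free.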

Pre-trained Models

Every text encoder is a transformer available on Huggingface, with an additional linear layer on top. None of the models has been extensively tested, but for more information and qualitative test results for a specific model, click the model name to see its model card.

*** Make sure to update to the most recent version of the repository when downloading a new model, and re-run the shell script to download the Linear Weights. ***
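To make the "transformer plus linear layer" structure concrete, here is a minimal sketch of the step between the two. It assumes the token vectors are averaged under the attention mask before the linear layer, as in the forward pass quoted in one of the issues further down; the helper names masked_mean_pool and linear_head, and all numbers, are hypothetical toy values rather than real model weights.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def linear_head(x, weight, bias):
    """Apply y = W x + b, with one weight row per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


# Toy sentence with two real tokens and one padding token (mask value 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)

# Toy head mapping 2 input features to 3 output features
# (the real heads map the transformer width to the CLIP embedding width).
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.5, -1.0]
projected = linear_head(pooled, weight, bias)
print(pooled, projected)
```

The output width is set by the number of weight rows, so a head with a different number of rows yields embeddings of a different size.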

M-BERT-Base-ViT-B

Name	Model Base	Vision Model	Pre-trained Languages	Target Languages	#Parameters
Multilingual
M-BERT Distil 40	M-BERT Distil	RN50x4	101 Languages	40 Languages	66 M
M-BERT Base 69	M-BERT Base	RN50x4	101 Languages	68 Languages	110 M
M-BERT Base ViT-B	M-BERT Base	ViT-B/32	101 Languages	68 Languages	110 M
Monolingual
Swe-CLIP 500k	KB-BERT	RN50x4	Swedish	Swedish	110 M
Swe-CLIP 2M	KB-BERT	RN50x4	Swedish	Swedish	110 M

Training a new model

This folder contains the code used for training the above models. If you wish to train your own model, you must do the following:

Prepare a set of translated sentence pairs from English -> Your Language(s).
Compute regular CLIP-Text embeddings for the English sentences.
Edit Training.py to load your data.
Train a new CLIP-Text encoder via Teacher Learning.
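The teacher-learning idea in the steps above can be sketched as follows: the student encoder reads the translated sentence and is trained to reproduce the embedding that the original CLIP text encoder (the teacher) produced for the English sentence. This is a toy sketch, not the code in Training.py. It assumes a mean-squared-error objective, which the text above does not specify, and it replaces the transformer with a single hypothetical weight matrix acting on made-up sentence features.

```python
def student_forward(weight, features):
    """Toy student: a single linear map from sentence features to an embedding."""
    return [sum(w * f for w, f in zip(row, features)) for row in weight]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_step(weight, features, target, lr):
    """One gradient-descent step on the MSE between student and teacher output."""
    pred = student_forward(weight, features)
    n = len(pred)
    # d(MSE)/d(weight[i][j]) = (2 / n) * (pred[i] - target[i]) * features[j]
    return [
        [w - lr * (2.0 / n) * (pred[i] - target[i]) * f for w, f in zip(row, features)]
        for i, row in enumerate(weight)
    ]


def total_loss(weight, pairs):
    return sum(mse(student_forward(weight, x), y) for x, y in pairs)


# Hypothetical data: (features of a translated sentence,
#                     teacher embedding of the English original).
pairs = [([1.0, 0.0], [0.5, -0.5]), ([0.0, 1.0], [1.0, 2.0])]
weight = [[0.0, 0.0], [0.0, 0.0]]

loss_before = total_loss(weight, pairs)
for _ in range(200):
    for features, teacher_embedding in pairs:
        weight = train_step(weight, features, teacher_embedding, lr=0.1)
loss_after = total_loss(weight, pairs)
print(loss_before, loss_after)
```

Because the teacher embeddings are fixed targets, they can be computed once and stored, which is why pre-computed CLIP-Text embeddings are useful for this setup.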

Pre-computed CLIP Embeddings & Translation Data

This Google Drive folder contains pre-computed CLIP-Text embeddings for a large portion of the image captions of GCC + MSCOCO + VizWiz.

The Google Drive folder also contains the translation data used to train the currently available models. Good luck!

Contribution

If you have trained a CLIP-Text encoder specific to your language, or another model covering a language not supported here, please feel free to contact us and we will either upload your model and credit you, or simply link to your already uploaded model.

Contact

If you have questions regarding the code or otherwise related to this GitHub page, please open an issue.

For other purposes, feel free to contact me directly at: [email protected]

Acknowledgements

License

Distributed under the MIT License. See LICENSE for more information.

Comments

1024 dim embedding model needed

Dear authors, thanks for your great work! I'm working with AudioCLIP, whose embeddings have 1024 dimensions, but the models you've released have at most 768 dimensions. Could you please release a model that produces 1024-dimensional embeddings? Here is AudioCLIP: https://github.com/AndreyGuzhov/AudioCLIP Best wishes, and looking forward to hearing from you!

opened by ithanwu 4
Release 1.0.0

Merge this only after doing this:

when you create a "Release x.y.z" it will release

you need to add a secret called PYPI_PASSWORD in the github repo and put inside a token you create at https://pypi.org/manage/account/token/

https://github.com/FreddeFrallan/Multilingual-CLIP/settings/secrets/actions/new

Choose the option "Squash and merge" on github when merging to create a single commit

opened by rom1504 4
Bibtex Citation

Amazing repo! I'd love to cite it. Do you have a desired bibtex by chance?

Perhaps:

@misc{multilingual-clip,
    author = {Carlsson, Fredrik},
    title = {Multilingual CLIP},
    year = 2021,
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/SajjjadAyobi/CLIPfa}},
}

opened by Zasder3 1
Training a model for ViT-L/14 image embeddings

Hey, thanks for providing this awesome multilingual CLIP-aligned text encoder. We used it to filter the 3 billion (image, text) pairs of laion5B https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ and it worked well. I'm also using this model to provide a multilingual search in https://rom1504.github.io/clip-retrieval/. For laion400m we used the ViT-B/32 model of OpenAI to produce the index, but for laion5B we went with ViT-L/14, which is much more powerful. To provide the same multilingual search feature, it would be really helpful if I had a CLIP ViT-L/14-aligned multilingual text encoder.

Would you advise running https://github.com/FreddeFrallan/Multilingual-CLIP#training-a-new-model (and now I'm writing it, I guess I could use a subset of the multilingual set of laion5B for this) to align such a text encoder ?

opened by rom1504 1
XLM-Roberta Feature Request

Hi,

Great repo! Are you planning to release a model with XLMR (with ViT-B) anytime soon? It was better for low-resource languages than the multilingual BERT.

opened by mezig351 1
Redo packaging

when you create a "Release x.y.z" it will release

you need to add a secret called PYPI_PASSWORD in the github repo and put inside a token you create at https://pypi.org/manage/account/token/

https://github.com/FreddeFrallan/Multilingual-CLIP/settings/secrets/actions/new

opened by rom1504 0
Data leak

Hello! According to the XTD-10 repo, the test set contains 800 images from the MSCOCO train set. During training you also use the MSCOCO train set, so it seems you have a data leak. Or maybe I am misunderstanding something.

opened by kimihailv 1

model_type 'M-CLIP' is not in CONFIG_MAPPING

from transformers import AutoConfig

kwargs = {'_from_auto': True}
pretrained_model_name_or_path = 'M-CLIP/XLM-Roberta-Large-Vit-L-14'
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)

Hi, I installed the required transformers==4.8.1 and ran the above code, which gives the following error.

    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/anaconda3/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 448, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
KeyError: 'M-CLIP'

It seems that model_type 'M-CLIP' is not in CONFIG_MAPPING. Can anyone help figure this out?

opened by wxywb 1

Issue in M-Bert-Base-ViT-B clip head linear layer size

I tried the following piece of code from the repo, located at https://github.com/FreddeFrallan/Multilingual-CLIP/blob/main/src/multilingual_clip.py

The only change I made was adding print statements in between.

import pickle

import torch
import transformers

AVAILABLE_MODELS = {
    'M-BERT-Distil-40': {
        'model_name': 'M-CLIP/M-BERT-Distil-40',
        'tokenizer_name': 'M-CLIP/M-BERT-Distil-40',
        'head_name': 'M-BERT Distil 40 Linear Weights.pkl'
    },

    'M-BERT-Base-69': {
        'model_name': 'M-CLIP/M-BERT-Base-69',
        'tokenizer_name': 'M-CLIP/M-BERT-Base-69',
        'head_name': 'M-BERT-Base-69 Linear Weights.pkl'
    },

    'Swe-CLIP-500k': {
        'model_name': 'M-CLIP/Swedish-500k',
        'tokenizer_name': 'M-CLIP/Swedish-500k',
        'head_name': 'Swedish-500k Linear Weights.pkl'
    },

    'Swe-CLIP-2M': {
        'model_name': 'M-CLIP/Swedish-2M',
        'tokenizer_name': 'M-CLIP/Swedish-2M',
        'head_name': 'Swedish-2M Linear Weights.pkl'
    },

    'M-BERT-Base-ViT-B': {
        'model_name': 'M-CLIP/M-BERT-Base-ViT-B',
        'tokenizer_name': 'M-CLIP/M-BERT-Base-ViT-B',
        'head_name': 'M-BERT-Base-69-ViT Linear Weights.pkl'
    },
}


class MultilingualClip2(torch.nn.Module):
    def __init__(self, model_name, tokenizer_name, head_name, weights_dir='data/weights/'):
        super().__init__()
        self.model_name = model_name
        self.tokenizer_name = tokenizer_name
        self.head_path = weights_dir + head_name

        self.tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_name)
        self.transformer = transformers.AutoModel.from_pretrained(model_name)
        self.clip_head = torch.nn.Linear(in_features=768, out_features=640)
        self._load_head()

    def forward(self, txt):
        # `device` is not defined in the posted snippet.
        txt_tok = self.tokenizer(txt, padding=True, return_tensors='pt').to(device)
        embs = self.transformer(**txt_tok)[0]
        print('embs_text')
        print(embs.size())
        att = txt_tok['attention_mask']
        print('att_text')
        print(att.size())
        # Mean of the token embeddings, ignoring padded positions.
        embs = (embs * att.unsqueeze(2)).sum(dim=1) / att.sum(dim=1)[:, None]
        print('embs_text')
        print(embs.size())
        p = self.clip_head(embs)
        print('clip head obj')
        print(self.clip_head)
        print('cliphed_text')
        print(p.size())
        return p

    def _load_head(self):
        with open(self.head_path, 'rb') as f:
            lin_weights = pickle.loads(f.read())
        # Overwrites the layer's parameters with whatever shape the file holds.
        self.clip_head.weight = torch.nn.Parameter(torch.tensor(lin_weights[0]).float().t())
        self.clip_head.bias = torch.nn.Parameter(torch.tensor(lin_weights[1]).float())
        print('ok')
        print(self.clip_head.weight.size())
        print(self.clip_head.bias.size())


def load_model2(name):
    config = AVAILABLE_MODELS[name]
    return MultilingualClip2(**config)


# `Query` is not defined in the posted snippet.
mod = load_model2('M-BERT-Base-ViT-B')
z = mod(Query[0])

Output for this code:

ok
torch.Size([512, 768])
torch.Size([512])
embs_text
torch.Size([1, 6, 768])
att_text
torch.Size([1, 6])
embs_text
torch.Size([1, 768])
clip head obj
Linear(in_features=768, out_features=640, bias=True)
cliphed_text
torch.Size([1, 512])

This output suggests that the weights in 'M-BERT-Base-69-ViT Linear Weights.pkl' have shape 512 x 768 rather than 640 x 768.

Is there an issue with the config?

opened by shreyajain4 2

some questions about fine-tuning

I have fine-tuned the text encoder using 300,000 texts and their embeddings, but the results are very poor. Could you give me some advice on how to improve them?

opened by Soulscb 1

some confusion about "Pre-trained CLIP-Text encoders for multiple languages"

If I have <other-language text, image, label> data, can I directly use 'distilbert-base-multilingual-cased' to pre-train a CLIP model? Why re-train a model from English to other languages?

opened by moluchase 1

Releases(1.0.10)

1.0.10 (Jun 2, 2022)
1.0.8 (Jun 2, 2022)
1.0.7 (Jun 2, 2022)
1.0.6 (Jun 2, 2022)
1.0.5 (Jun 2, 2022)
1.0.4 (Jun 2, 2022)
1.0.3 (Jun 2, 2022)
1.0.2 (Jun 2, 2022)
1.0.1 (Jun 2, 2022)
1.0.0 (Jun 2, 2022)

Owner

Fredrik Carlsson

