Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

Overview

smaller-LaBSE

LaBSE(Language-agnostic BERT Sentence Embedding) is a very good method to get sentence embeddings across languages. But it is hard to fine-tune due to the parameter size(~=471M) of this model. For instance, if I fine-tune this model with Adam optimizer, I need the GPU that has VRAM at least 7.5GB = 471M * (parameters 4 bytes + gradients 4 bytes + momentums 4 bytes + variances 4 bytes). So I applied "Load What You Need: Smaller Multilingual Transformers" method to LaBSE to reduce parameter size since most of this model's parameter is the word embedding table(~=385M).

The smaller version of LaBSE is evaluated for 14 languages using tatoeba dataset. It shows we can reduce LaBSE's parameters to 47% without a big performance drop.

If you need the PyTorch version, see https://github.com/Geotrend-research/smaller-transformers. I followed most of the steps in the paper.

Model #param(transformer) #param(word embedding) #param(model) vocab size
tfhub_LaBSE 85.1M 384.9M 470.9M 501,153
15lang_LaBSE 85.1M 133.1M 219.2M 173,347

Used Languages

  • English (en or eng)
  • French (fr or fra)
  • Spanish (es or spa)
  • German (de or deu)
  • Chinese (zh, zh_classical or cmn)
  • Arabic (ar or ara)
  • Italian (it or ita)
  • Japanese (ja or jpn)
  • Korean (ko or kor)
  • Dutch (nl or nld)
  • Polish (pl or pol)
  • Portuguese (pt or por)
  • Thai (th or tha)
  • Turkish (tr or tur)
  • Russian (ru or rus)

I selected the languages multilingual-USE supports.

Scripts

A smaller version of the vocab was constructed based on the frequency of tokens using Wikipedia dump data. I followed most of the algorithms in the paper to extract proper vocab for each language and rewrite it for TensorFlow.

Convert weight

mkdir -p downloads/labse-2
curl -L https://tfhub.dev/google/LaBSE/2?tf-hub-format=compressed -o downloads/labse-2.tar.gz
tar -xf downloads/labse-2.tar.gz -C downloads/labse-2/
python save_as_weight_from_saved_model.py

Select vocabs

./download_dataset.sh
python select_vocab.py

Make smaller LaBSE

./make_smaller_labse.py

Evaluate tatoeba

./download_tatoeba_dataset.sh
# evaluate TFHub LaBSE
./evaluate_tatoeba.sh
# evaluate the smaller LaBSE
./evaluate_tatoeba.sh \
    --model models/LaBSE_en-fr-es-de-zh-ar-zh_classical-it-ja-ko-nl-pl-pt-th-tr-ru/1/ \
    --preprocess models/LaBSE_en-fr-es-de-zh-ar-zh_classical-it-ja-ko-nl-pl-pt-th-tr-ru_preprocess/1/

Results

Tatoeba

Model fr es de zh ar it ja ko nl pl pt th tr ru avg
tfHub_LaBSE(en→xx) 95.90 98.10 99.30 96.10 90.70 95.30 96.40 94.10 97.50 97.90 95.70 82.85 98.30 95.30 95.25
tfHub_LaBSE(xx→en) 96.00 98.80 99.40 96.30 91.20 94.00 96.50 92.90 97.00 97.80 95.40 83.58 98.50 95.30 95.19
15lang_LaBSE(en→xx) 95.20 98.00 99.20 96.10 90.50 95.20 96.30 93.50 97.50 97.90 95.80 82.85 98.30 95.40 95.13
15lang_LaBSE(xx→en) 95.40 98.70 99.40 96.30 91.10 94.00 96.30 92.70 96.70 97.80 95.40 83.58 98.50 95.20 95.08
  • Accuracy(%) of the Tatoeba datasets.
  • If the strategy to select vocabs is changed or the corpus used in the selection step is changed to the corpus similar to the evaluation dataset, it is expected to reduce the performance drop.

References

You might also like...
Comments
  • Training time  and  Machine configuration

    Training time and Machine configuration

    Hi, thanks for your sharing model. I want to make a smaller model, just contains two languages(en, zh). And I want to know the kind of machine GPU and how long does it need to cost?

    opened by QzzIsCoding 2
  • Publish model to HuggingFace Model Hub?

    Publish model to HuggingFace Model Hub?

    I migrated the full LaBSE model from TF to PyTorch and uploaded them to the HuggingFace model hub. I saw this model on the TF hub and started migrating it for uploading to the HF Hub. I realized then that this wasn't published by Google but by @jeongukjae, so wanted to check with you before uploading it.

    I have exported the model locally. I'm happy to check the changes in and upload the exported model if that's fine for you :).

    opened by setu4993 2
Owner
Jeong Ukjae
Jeong Ukjae
Pytorch version of BERT-whitening

BERT-whitening This is the Pytorch implementation of "Whitening Sentence Representations for Better Semantics and Faster Retrieval". BERT-whitening is

Weijie Liu 255 Dec 27, 2022
Official PyTorch implementation of SegFormer

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers Figure 1: Performance of SegFormer-B0 to SegFormer-B5. Project page

NVIDIA Research Projects 1.4k Dec 29, 2022
Data and code to support "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley)

anlp21 Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley) Syllabus: http://people.ischool.berkeley.edu/~dba

David Bamman 48 Dec 06, 2022
VMD Audio/Text control with natural language

This repository is a proof of principle for performing Molecular Dynamics analysis, in this case with the program VMD, via natural language commands.

Andrew White 13 Jun 09, 2022
NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

pretrain4ir_tutorial NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking 用作NLPIR实验室, Pre-training

ZYMa 12 Apr 07, 2022
Crie tokens de autenticação íntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) é uma bilioteca criada para ser utilizada na geração de tokens seguros e íntegros, ou seja, nã

Jaedson Silva 0 Nov 29, 2022
AI-Broad-casting - AI Broad casting with python

Basic Code 1. Use The Code Configuration Environment conda create -n code_base p

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

LancoPKU 105 Jan 03, 2023
2021 AI CUP Competition on Traditional Chinese Scene Text Recognition - Intermediate Contest

繁體中文場景文字辨識 程式碼說明 組別:這就是我 成員:蔣明憲 唐碩謙 黃玥菱 林冠霆 蕭靖騰 目錄 環境套件 安裝方式 資料夾布局 前處理-製作偵測訓練註解檔 前處理-製作分類訓練樣本 part.py : 從 json 裁切出分類訓練樣本 Class.py : 將切出來的樣本按照文字分類到各資料夾

HuanyueTW 3 Jan 14, 2022
A python framework to transform natural language questions to queries in a database query language.

__ _ _ _ ___ _ __ _ _ / _` | | | |/ _ \ '_ \| | | | | (_| | |_| | __/ |_) | |_| | \__, |\__,_|\___| .__/ \__, | |_| |_| |___/

Machinalis 1.2k Dec 18, 2022
This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular intervals.It sends out the most recent news at random!

Nepali-news-notifier This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular in

Sachit Yadav 1 Feb 11, 2022
[AAAI 21] Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

◥ Curriculum Labeling ◣ Revisiting Pseudo-Labeling for Semi-Supervised Learning Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, Vicente Ordonez. In the

UVA Computer Vision 113 Dec 15, 2022
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

Chung-Ming Chien 1k Dec 30, 2022
A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer Token id [0, 20000) are image tokens. Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == 'unk', ice

THUDM 42 Dec 27, 2022
Official code repository of the paper Linear Transformers Are Secretly Fast Weight Programmers.

Linear Transformers Are Secretly Fast Weight Programmers This repository contains the code accompanying the paper Linear Transformers Are Secretly Fas

Imanol Schlag 77 Dec 19, 2022
Scikit-learn style model finetuning for NLP

Scikit-learn style model finetuning for NLP Finetune is a library that allows users to leverage state-of-the-art pretrained NLP models for a wide vari

indico 665 Dec 17, 2022
Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"

ERNIE Source code and dataset for "ERNIE: Enhanced Language Representation with Informative Entities" Reqirements: Pytorch=0.4.1 Python3 tqdm boto3 r

THUNLP 1.3k Dec 30, 2022
A sentence aligner for comparable corpora

About Yalign is a tool for extracting parallel sentences from comparable corpora. Statistical Machine Translation relies on parallel corpora (eg.. eur

Machinalis 128 Aug 24, 2022
The implementation of Parameter Differentiation based Multilingual Neural Machine Translation

The implementation of Parameter Differentiation based Multilingual Neural Machine Translation .

Qian Wang 21 Dec 17, 2022
Kestrel Threat Hunting Language

Kestrel Threat Hunting Language What is Kestrel? Why we need it? How to hunt with XDR support? What is the science behind it? You can find all the ans

Open Cybersecurity Alliance 201 Dec 16, 2022