Language-Agnostic SEntence Representations

Last update: Jan 04, 2023

Overview

LASER Language-Agnostic SEntence Representations

LASER is a library to calculate and use multilingual sentence embeddings.

NEWS

2019/11/08 CCMatrix is available: Mining billions of high-quality parallel sentences on the WEB [8]
2019/07/31 Gilles Bodard and Jérémy Rapin provided a Docker environment to use LASER
2019/07/11 WikiMatrix is available: bitext extraction for 1620 language pairs in Wikipedia [7]
2019/03/18 switch to BSD license
2019/02/13 The code to perform bitext mining is now available

CURRENT VERSION:

We now provide an encoder which was trained on 93 languages, written in 23 different alphabets [6]. This includes all European languages, many Asian and Indian languages, Arabic, Persian, Hebrew, ..., as well as various minority languages and dialects.
We provide a test set for more than 100 languages based on the Tatoeba corpus.
Switch to PyTorch 1.0

All these languages are encoded by the same BiLSTM encoder, and there is no need to specify the input language (but tokenization is language specific). In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words in several different languages.

We also have some evidence that the encoder can generalize to languages which were not seen during training, provided they belong to a language family that is covered by other training languages.

A detailed description of how the multilingual sentence embeddings are trained can be found in [6], together with an extensive experimental evaluation.

Dependencies

Python 3.6
PyTorch 1.0
NumPy, tested with 1.15.4
Cython, needed by Python wrapper of FastBPE, tested with 0.29.6
Faiss, for fast similarity search and bitext mining
transliterate 1.10.2, only used for Greek (pip install transliterate)
jieba 0.39, Chinese segmenter (pip install jieba)
mecab 0.996, Japanese segmenter
tokenization from the Moses encoder (installed automatically)
FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)
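
As a quick sanity check of the list above, the following minimal sketch reports which Python packages cannot be found in the current environment. The import names (torch, faiss, MeCab, ...) are assumptions about how these packages are commonly imported; they are not stated in this README, and the helper name missing_modules is hypothetical. The tools that are installed automatically are left out.

```python
import importlib.util

# Assumed import names for the dependencies listed above (not taken from
# this README): adjust them if your installation uses different names.
LASER_MODULES = ["torch", "numpy", "Cython", "faiss", "transliterate", "jieba", "MeCab"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling missing_modules(LASER_MODULES) returns an empty list when every listed package can be found, and otherwise the names that still need to be installed.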

Installation

Set the environment variable 'LASER' to the root of the installation, e.g. export LASER="${HOME}/projects/laser"
Download the encoders from Amazon S3 by running bash ./install_models.sh
Download third-party software by running bash ./install_external_tools.sh
Download the data used in the example tasks (see the description of each task)
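
Scripts that rely on the 'LASER' environment variable may want to fail early when it is unset. A minimal sketch of such a check, assuming only that the variable should point to an existing directory (the function name laser_root is hypothetical and not part of LASER):

```python
import os
from pathlib import Path


def laser_root(environ=os.environ):
    """Return the LASER installation root, or raise if it is unset or missing."""
    root = environ.get("LASER")
    if not root:
        raise RuntimeError("Environment variable LASER is not set")
    path = Path(root).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"LASER points to a missing directory: {path}")
    return path
```

The environ parameter defaults to the process environment and exists only so that the check can be exercised with a plain dictionary.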

Applications

We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").

Cross-lingual document classification using the MLDoc corpus [2,6]
WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7]
Bitext mining using the BUCC corpus [3,5]
Cross-lingual NLI using the XNLI corpus [4,5,6]
Multilingual similarity search [1,6]
Sentence embedding of text files: an example of how to calculate sentence embeddings for arbitrary text files in any of the supported languages.

For all tasks, we use exactly the same multilingual encoder, without any task specific optimization or fine-tuning.
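
To illustrate the idea behind multilingual similarity search, here is a minimal brute-force sketch in NumPy. It is not the code from the "tasks" directory: the toy 3-dimensional vectors stand in for real sentence embeddings, the function name nearest_neighbors is hypothetical, and Faiss would normally take the place of the dense matrix product for large collections.

```python
import numpy as np


def nearest_neighbors(queries, candidates):
    """For each query row, return the index and cosine similarity of the closest candidate row."""
    # L2-normalize the rows so that a dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T
    best = sims.argmax(axis=1)
    return best, sims[np.arange(len(q)), best]


# Toy vectors (assumption): in practice these would be sentence embeddings
# of the same sentences written in two different languages.
source = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], dtype=np.float32)
target = np.array([[0.0, 2.0, 2.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]], dtype=np.float32)

idx, score = nearest_neighbors(source, target)
print(idx)  # → [1 0]
```

Because all languages share one embedding space, the same search works whether the two sets of sentences are in the same language or in different ones.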

License

LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.

Supported languages

Our model was trained on the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.

Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.

References

[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017.

[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.

[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB.

Owner

Facebook Research