BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

Overview

BPEmb

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

WebsiteUsageDownloadMultiBPEmbPaper (pdf)Citing BPEmb

Usage

Install BPEmb with pip:

pip install bpemb

Embeddings and SentencePiece models will be downloaded automatically the first time you use them.

>>> from bpemb import BPEmb
# load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
>>> bpemb_en = BPEmb(lang="en", dim=50)
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz

You can do two main things with BPEmb. The first is subword segmentation:

>> bpemb_zh = BPEmb(lang="zh", vs=100000) # apply Chinese BPE subword segmentation model >>> bpemb_zh.encode("这是一个中文句子") # "This is a Chinese sentence." ['▁这是一个', '中文', '句子'] # ["This is a", "Chinese", "sentence"] ">
# apply English BPE subword segmentation model
>>> bpemb_en.encode("Stratford")
['▁strat', 'ford']
# load Chinese BPEmb model with vocabulary size 100k and default (100-dim) embeddings
>>> bpemb_zh = BPEmb(lang="zh", vs=100000)
# apply Chinese BPE subword segmentation model
>>> bpemb_zh.encode("这是一个中文句子")  # "This is a Chinese sentence."
['▁这是一个', '中文', '句子']  # ["This is a", "Chinese", "sentence"]

If / how a word gets split depends on the vocabulary size. Generally, a smaller vocabulary size will yield a segmentation into many subwords, while a large vocabulary size will result in frequent words not being split:

vocabulary size segmentation
1000 ['▁str', 'at', 'f', 'ord']
3000 ['▁str', 'at', 'ford']
5000 ['▁str', 'at', 'ford']
10000 ['▁strat', 'ford']
25000 ['▁stratford']
50000 ['▁stratford']
100000 ['▁stratford']
200000 ['▁stratford']

The second purpose of BPEmb is to provide pretrained subword embeddings:

>> type(bpemb_en.vectors) numpy.ndarray >>> bpemb_en.vectors.shape (10000, 50) >>> bpemb_zh.vectors.shape (100000, 100) ">
# Embeddings are wrapped in a gensim KeyedVectors object
>>> type(bpemb_zh.emb)
gensim.models.keyedvectors.Word2VecKeyedVectors
# You can use BPEmb objects like gensim KeyedVectors
>>> bpemb_en.most_similar("ford")
[('bury', 0.8745079040527344),
 ('ton', 0.8725000619888306),
 ('well', 0.871537446975708),
 ('ston', 0.8701574206352234),
 ('worth', 0.8672043085098267),
 ('field', 0.859795331954956),
 ('ley', 0.8591548204421997),
 ('ington', 0.8126075267791748),
 ('bridge', 0.8099068999290466),
 ('brook', 0.7979353070259094)]
>>> type(bpemb_en.vectors)
numpy.ndarray
>>> bpemb_en.vectors.shape
(10000, 50)
>>> bpemb_zh.vectors.shape
(100000, 100)

To use subword embeddings in your neural network, either encode your input into subword IDs:

>> bpemb_zh.vectors[ids].shape (3, 100) ">
>>> ids = bpemb_zh.encode_ids("这是一个中文句子")
[25950, 695, 20199]
>>> bpemb_zh.vectors[ids].shape
(3, 100)

Or use the embed method:

# apply Chinese subword segmentation and perform embedding lookup
>>> bpemb_zh.embed("这是一个中文句子").shape
(3, 100)

Downloads for each language

ab (Abkhazian)ace (Achinese)ady (Adyghe)af (Afrikaans)ak (Akan)als (Alemannic)am (Amharic)an (Aragonese)ang (Old English)ar (Arabic)arc (Official Aramaic)arz (Egyptian Arabic)as (Assamese)ast (Asturian)atj (Atikamekw)av (Avaric)ay (Aymara)az (Azerbaijani)azb (South Azerbaijani)

ba (Bashkir)bar (Bavarian)bcl (Central Bikol)be (Belarusian)bg (Bulgarian)bi (Bislama)bjn (Banjar)bm (Bambara)bn (Bengali)bo (Tibetan)bpy (Bishnupriya)br (Breton)bs (Bosnian)bug (Buginese)bxr (Russia Buriat)

ca (Catalan)cdo (Min Dong Chinese)ce (Chechen)ceb (Cebuano)ch (Chamorro)chr (Cherokee)chy (Cheyenne)ckb (Central Kurdish)co (Corsican)cr (Cree)crh (Crimean Tatar)cs (Czech)csb (Kashubian)cu (Church Slavic)cv (Chuvash)cy (Welsh)

da (Danish)de (German)din (Dinka)diq (Dimli)dsb (Lower Sorbian)dty (Dotyali)dv (Dhivehi)dz (Dzongkha)

ee (Ewe)el (Modern Greek)en (English)eo (Esperanto)es (Spanish)et (Estonian)eu (Basque)ext (Extremaduran)

fa (Persian)ff (Fulah)fi (Finnish)fj (Fijian)fo (Faroese)fr (French)frp (Arpitan)frr (Northern Frisian)fur (Friulian)fy (Western Frisian)

ga (Irish)gag (Gagauz)gan (Gan Chinese)gd (Scottish Gaelic)gl (Galician)glk (Gilaki)gn (Guarani)gom (Goan Konkani)got (Gothic)gu (Gujarati)gv (Manx)

ha (Hausa)hak (Hakka Chinese)haw (Hawaiian)he (Hebrew)hi (Hindi)hif (Fiji Hindi)hr (Croatian)hsb (Upper Sorbian)ht (Haitian)hu (Hungarian)hy (Armenian)

ia (Interlingua)id (Indonesian)ie (Interlingue)ig (Igbo)ik (Inupiaq)ilo (Iloko)io (Ido)is (Icelandic)it (Italian)iu (Inuktitut)

ja (Japanese)jam (Jamaican Creole English)jbo (Lojban)jv (Javanese)

ka (Georgian)kaa (Kara-Kalpak)kab (Kabyle)kbd (Kabardian)kbp (Kabiyè)kg (Kongo)ki (Kikuyu)kk (Kazakh)kl (Kalaallisut)km (Central Khmer)kn (Kannada)ko (Korean)koi (Komi-Permyak)krc (Karachay-Balkar)ks (Kashmiri)ksh (Kölsch)ku (Kurdish)kv (Komi)kw (Cornish)ky (Kirghiz)

la (Latin)lad (Ladino)lb (Luxembourgish)lbe (Lak)lez (Lezghian)lg (Ganda)li (Limburgan)lij (Ligurian)lmo (Lombard)ln (Lingala)lo (Lao)lrc (Northern Luri)lt (Lithuanian)ltg (Latgalian)lv (Latvian)

mai (Maithili)mdf (Moksha)mg (Malagasy)mh (Marshallese)mhr (Eastern Mari)mi (Maori)min (Minangkabau)mk (Macedonian)ml (Malayalam)mn (Mongolian)mr (Marathi)mrj (Western Mari)ms (Malay)mt (Maltese)mwl (Mirandese)my (Burmese)myv (Erzya)mzn (Mazanderani)

na (Nauru)nap (Neapolitan)nds (Low German)ne (Nepali)new (Newari)ng (Ndonga)nl (Dutch)nn (Norwegian Nynorsk)no (Norwegian)nov (Novial)nrm (Narom)nso (Pedi)nv (Navajo)ny (Nyanja)

oc (Occitan)olo (Livvi)om (Oromo)or (Oriya)os (Ossetian)

pa (Panjabi)pag (Pangasinan)pam (Pampanga)pap (Papiamento)pcd (Picard)pdc (Pennsylvania German)pfl (Pfaelzisch)pi (Pali)pih (Pitcairn-Norfolk)pl (Polish)pms (Piemontese)pnb (Western Panjabi)pnt (Pontic)ps (Pushto)pt (Portuguese)

qu (Quechua)

rm (Romansh)rmy (Vlax Romani)rn (Rundi)ro (Romanian)ru (Russian)rue (Rusyn)rw (Kinyarwanda)

sa (Sanskrit)sah (Yakut)sc (Sardinian)scn (Sicilian)sco (Scots)sd (Sindhi)se (Northern Sami)sg (Sango)sh (Serbo-Croatian)si (Sinhala)sk (Slovak)sl (Slovenian)sm (Samoan)sn (Shona)so (Somali)sq (Albanian)sr (Serbian)srn (Sranan Tongo)ss (Swati)st (Southern Sotho)stq (Saterfriesisch)su (Sundanese)sv (Swedish)sw (Swahili)szl (Silesian)

ta (Tamil)tcy (Tulu)te (Telugu)tet (Tetum)tg (Tajik)th (Thai)ti (Tigrinya)tk (Turkmen)tl (Tagalog)tn (Tswana)to (Tonga)tpi (Tok Pisin)tr (Turkish)ts (Tsonga)tt (Tatar)tum (Tumbuka)tw (Twi)ty (Tahitian)tyv (Tuvinian)

udm (Udmurt)ug (Uighur)uk (Ukrainian)ur (Urdu)uz (Uzbek)

ve (Venda)vec (Venetian)vep (Veps)vi (Vietnamese)vls (Vlaams)vo (Volapük)

wa (Walloon)war (Waray)wo (Wolof)wuu (Wu Chinese)

xal (Kalmyk)xh (Xhosa)xmf (Mingrelian)

yi (Yiddish)yo (Yoruba)

za (Zhuang)zea (Zeeuws)zh (Chinese)zu (Zulu)

MultiBPEmb

multi (multilingual)

Citing BPEmb

If you use BPEmb in academic work, please cite:

@InProceedings{heinzerling2018bpemb,
  author = {Benjamin Heinzerling and Michael Strube},
  title = "{BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
  }
An open-source NLP research library, built on PyTorch.

An Apache 2.0 NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks. Quic

AI2 11.4k Jan 01, 2023
Non-Autoregressive Predictive Coding

Non-Autoregressive Predictive Coding This repository contains the implementation of Non-Autoregressive Predictive Coding (NPC) as described in the pre

Alexander H. Liu 43 Nov 15, 2022
The source code of HeCo

HeCo This repo is for source code of KDD 2021 paper "Self-supervised Heterogeneous Graph Neural Network with Co-contrastive Learning". Paper Link: htt

Nian Liu 106 Dec 27, 2022
NLTK Source

Natural Language Toolkit (NLTK) NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting

Natural Language Toolkit 11.4k Jan 04, 2023
This project uses word frequency and Term Frequency-Inverse Document Frequency to summarize a text.

Text Summarizer This project uses word frequency and Term Frequency-Inverse Document Frequency to summarize a text. Team Members This mini-project was

1 Nov 16, 2021
AI-Broad-casting - AI Broad casting with python

Basic Code 1. Use The Code Configuration Environment conda create -n code_base p

Awesome Treasure of Transformers Models Collection

💁 Awesome Treasure of Transformers Models for Natural Language processing contains papers, videos, blogs, official repo along with colab Notebooks. 🛫☑️

Ashish Patel 577 Jan 07, 2023
A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

Jeong Ukjae 20 Jul 11, 2022
Search Git commits in natural language

NaLCoS - NAtural Language COmmit Search Search commit messages in your repository in natural language. NaLCoS (NAtural Language COmmit Search) is a co

Pushkar Patel 50 Mar 22, 2022
🧪 Cutting-edge experimental spaCy components and features

spacy-experimental: Cutting-edge experimental spaCy components and features This package includes experimental components and features for spaCy v3.x,

Explosion 65 Dec 30, 2022
2021 AI CUP Competition on Traditional Chinese Scene Text Recognition - Intermediate Contest

繁體中文場景文字辨識 程式碼說明 組別:這就是我 成員:蔣明憲 唐碩謙 黃玥菱 林冠霆 蕭靖騰 目錄 環境套件 安裝方式 資料夾布局 前處理-製作偵測訓練註解檔 前處理-製作分類訓練樣本 part.py : 從 json 裁切出分類訓練樣本 Class.py : 將切出來的樣本按照文字分類到各資料夾

HuanyueTW 3 Jan 14, 2022
Full Spectrum Bioinformatics - a free online text designed to introduce key topics in Bioinformatics using the Python

Full Spectrum Bioinformatics is a free online text designed to introduce key topics in Bioinformatics using the Python programming language. The text is written in interactive Jupyter Notebooks, whic

Jesse Zaneveld 33 Dec 28, 2022
Ray-based parallel data preprocessing for NLP and ML.

Wrangl Ray-based parallel data preprocessing for NLP and ML. pip install wrangl # for latest pip install git+https://github.com/vzhong/wrangl See exa

Victor Zhong 33 Dec 27, 2022
A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset 台達閱讀理解資料集 Delta Reading Comprehension Dataset (DRCD) 屬於通用領域繁體中文機器閱讀理解資料集。 本資料集期望成為適用於遷移學習之標準中文閱讀理解資料集。 本資料集從2,108篇

272 Dec 15, 2022
Accurately generate all possible forms of an English word e.g "election" --> "elect", "electoral", "electorate" etc.

Accurately generate all possible forms of an English word Word forms can accurately generate all possible forms of an English word. It can conjugate v

Dibya Chakravorty 570 Dec 31, 2022
The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.

VAENAR-TTS This repo contains code accompanying the paper "VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis". Sa

THUHCSI 138 Oct 28, 2022
Shellcode antivirus evasion framework

Schrodinger's Cat Schrodinger'sCat is a Shellcode antivirus evasion framework Technical principle Please visit my blog https://idiotc4t.com/ How to us

idiotc4t 27 Jul 09, 2022
The PyTorch based implementation of continuous integrate-and-fire (CIF) module.

CIF-PyTorch This is a PyTorch based implementation of continuous integrate-and-fire (CIF) module for end-to-end (E2E) automatic speech recognition (AS

Minglun Han 24 Dec 29, 2022
NLP command-line assistant powered by OpenAI

NLP command-line assistant powered by OpenAI

Axel 16 Dec 09, 2022