ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

Overview

LOVE was accepted to the ACL22 main conference as a long paper (oral). This is a PyTorch implementation of our paper.

What is LOVE?

LOVE, Learning Out-of-Vocabulary Embeddings, is the name given to our beautiful model by Fabian Suchanek.

LOVE can produce word embeddings for arbitrary words, including out-of-vocabulary words such as misspelled words, rare words, and domain-specific words.

Specifically, LOVE follows the principle of mimick-like models [2]: it generates vectors for unseen words by learning the behavior of pre-trained embeddings using only the surface form of words, as shown in the figure below.

[Figure: mimic_model]

To the best of our knowledge, LOVE is the first model to use contrastive learning for word-level representations. The framework, shown in the figure below, uses various data augmentations to generate positive samples. Another distinction is that LOVE adopts a novel, fully attention-based encoder named PAM to mimic the vectors from pre-trained embeddings. All details are in our paper.

[Figure: mimic_model]
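As a rough illustration of how positive samples could be generated from surface forms alone, the sketch below applies simple character-level edits (swap, drop, insert) to a word. The function names and this particular set of edits are assumptions made for illustration; they are not taken from this repository, and the paper describes the augmentations actually used.

```python
import random
import string


def swap_adjacent(word, rng):
    """Swap two neighbouring characters, e.g. 'word' -> 'wrod'."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def drop_char(word, rng):
    """Delete one character; single-character words are left unchanged."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def insert_char(word, rng):
    """Insert one random lowercase letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]


# Hypothetical augmentation set, chosen only for this sketch.
AUGMENTATIONS = (swap_adjacent, drop_char, insert_char)


def make_positive_pair(word, rng):
    """Return (original, augmented). A contrastive objective would pull the
    embeddings of the two forms together and push other words away."""
    augment = rng.choice(AUGMENTATIONS)
    return word, augment(word, rng)


rng = random.Random(0)  # seeded so repeated runs give the same pairs
for w in ["misspelling", "embedding", "vocabulary"]:
    print(make_positive_pair(w, rng))
```

Each printed pair consists of a clean word and a corrupted variant that stands in for a typo-like OOV form.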

The benefits of LOVE?

1. Impute vectors for unseen words

Pre-trained embeddings like FastText use a fixed-size vocabulary, so performance drops substantially on OOV words.

LOVE can mimic the behavior of pre-trained language models (including BERT) and impute vectors for any word.

For example, mispleling is a misspelled word, and LOVE can impute a reasonable vector for it:

from produce_emb import produce

oov_word = 'mispleling'
emb = produce(oov_word)
print(emb[oov_word][:10])

## output
# [-0.0582502  -0.11268596 -0.12599416  0.09926333  0.02513208  0.01140639
#  -0.02326127 -0.007608    0.01973115  0.12448607]

2. Make LMs robust with little cost

LOVE can be used in a plug-and-play fashion with FastText and BERT, where it significantly improves their robustness. For example, LOVE, with 6.5M parameters, can work together with FastText (900+M) and improve its robustness, as shown in the figure:

[Figure: mimic_model]
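One way to picture the plug-and-play setup is a lookup that returns the pre-trained vector for in-vocabulary words and falls back to an imputed vector otherwise. The sketch below is an assumption about how such a wrapper could look, not the repository's integration code: `pretrained` is a toy dictionary standing in for FastText, and `toy_impute` is a placeholder for the LOVE model.

```python
from typing import Callable, Dict, List

Vector = List[float]


def lookup(word: str,
           pretrained: Dict[str, Vector],
           impute: Callable[[str], Vector]) -> Vector:
    """Use the pre-trained vector when available, otherwise impute one."""
    if word in pretrained:
        return pretrained[word]
    return impute(word)


# Toy stand-in for a pre-trained embedding table (invented values).
pretrained = {"spelling": [0.1, 0.3], "error": [0.2, -0.1]}


def toy_impute(word: str) -> Vector:
    """Placeholder for LOVE: builds a dummy vector from the characters."""
    return [len(word) / 10.0, (sum(map(ord, word)) % 7) / 10.0]


sentence = ["spelling", "eror"]  # "eror" is out of vocabulary
vectors = [lookup(w, pretrained, toy_impute) for w in sentence]
print(vectors)
```

With the `produce` function from the earlier example, the fallback could presumably be written as `lambda w: produce(w)[w]`, since that example indexes the returned mapping by the word itself.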

The usage of LOVE

Clone the repository and set up the environment via requirements.txt. We use Python 3.6.

pip install -r requirements.txt

Data preparation

In our experiments, we use FastText as the target vectors [1] (Download). After downloading, put the embedding file under data/.
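For inspecting an embedding file before training, the following sketch parses the plain-text `.vec` layout that FastText distributes: an optional `<count> <dim>` header line followed by one word and its values per line. The `load_vec` helper is hypothetical and is not part of this repository; it is demonstrated on an in-memory sample rather than a real file.

```python
import io


def load_vec(handle, limit=None):
    """Parse text-format word vectors into a {word: [floats]} dictionary."""
    vectors = {}
    for line_no, line in enumerate(handle):
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # A first line made of two integers is treated as the header.
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# In-memory sample in the same layout (invented values).
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n")
embeddings = load_vec(sample)
print(len(embeddings), len(embeddings["hello"]))  # 2 3
```

For a real file, the same function can be called on a handle opened with `open(path, encoding="utf-8")`, using `limit` to read only the first few entries of a large file.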

Training

First, you can use -help to show the arguments:

python train.py -help

Once data preparation and environment setup are complete, train the model via train.py. We also provide sample datasets, so you can run the model without downloading anything.

python train.py -dataset data/wiki_100.vec

Evaluation

To reproduce the intrinsic results of our model, run the following command. We provide the trained model used in our paper.

python evaluate.py

## expected output
model parameters:~6.5M
[RareWord]: [plugin], 42.6476207426462 
[MEN  ]: [plugin], 68.47815031602434 
[SimLex]: [plugin], 35.02258000865248 
[rel353]: [plugin], 55.8950046345804 
[simverb]: [plugin], 28.7233237185531 
[muturk]: [plugin], 63.77020916555088 
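Scores on word-similarity benchmarks of this kind are conventionally Spearman rank correlations, scaled by 100, between human similarity ratings and the cosine similarity of the model's word vectors. Assuming evaluate.py follows that convention, which the output above does not state, the computation can be sketched as below. The vectors and ratings are invented toy data, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented toy data: word vectors and human ratings for four word pairs.
vectors = {
    "cat": [1.0, 0.0], "dog": [0.9, 0.1],
    "car": [0.0, 1.0], "bus": [0.2, 0.8], "tree": [0.7, 0.7],
}
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.5),
         ("cat", "car", 1.0), ("cat", "tree", 5.0)]

predicted = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
gold = [rating for _, _, rating in pairs]
print(round(100 * spearman(predicted, gold), 2))
```

In this toy case the model's similarity ordering matches the human ordering exactly, so the score is at its maximum; real benchmark scores, like those listed above, are lower because the two orderings only partly agree.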

References

[1] Bojanowski, Piotr, et al. "Enriching word vectors with subword information." Transactions of the Association for Computational Linguistics 5 (2017): 135-146.

[2] Pinter, Yuval, Robert Guthrie, and Jacob Eisenstein. "Mimicking Word Embeddings using Subword RNNs." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.

Owner
Lihu Chen
A PhD student of IP Paris! Enjoy Coding!