[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

Overview

Mirror-BERT

Code repo for the EMNLP 2021 paper:
Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders
by Fangyu Liu, Ivan Vulić, Anna Korhonen, and Nigel Collier.

Mirror-BERT is an unsupervised contrastive learning method that converts pretrained language models (PLMs) into universal text encoders. It takes a PLM and a txt file containing raw text as input, and output a strong text embedding model, in just 20-30 seconds. It works well for not only sentence, but also word and phrase representation learning.

Hugginface pretrained models

Sentence enocders:

model STS avg.
baseline: sentence-bert (supervised) 74.89
mirror-bert-base-uncased-sentence 74.51
mirror-roberta-base-sentence 75.08
mirror-bert-base-uncased-sentence-drophead 75.16
mirror-roberta-base-sentence-drophead 76.67

Word encoder:

model Multi-SimLex (ENG)
baseline: fasttext 52.80
mirror-bert-base-uncased-word 55.60

(Note that the released models would not replicate the exact numbers in the paper, since the reported numbers in the paper are average of three runs.)

Train

For training sentence representations:

>> ./mirror_scripts/mirror_sentence_bert.sh 0,1

where 0,1 are GPU indices. This script should complete in 20-30 seconds on two NVIDIA 2080Ti/3090 GPUs. If you encounter out-of-memory error, consider reducing max_length in the script. Scripts for replicating other models are availible in mirror_scripts/.

Custom data: For training with your custom corpus, simply set --train_dir in the script to your own txt file (one sentence per line). When you do have raw sentences from your target domain, we recommend you always use the in-domain data for optimal performance. E.g., if you aim to create a conversational encoder, sample 10k utterances to train your model!

Supervised training: Organise your training data in the format of text1||text2 and store them one pair per line in a txt file. Then turn on the --pairwise option. text1 and text2 will be regarded as a positive pair in contrastive learning. You can be creative in finding such training pairs and it would be the best if they are from your application domain. E.g., to build an e-commerce QA encoder, the question||answer pairs from the Amazon quesrion-answer dataset could work quite well. Example training script: mirror_scripts/mirror_sentence_roberta_supervised_amazon_qa.sh. Note that when tuned on your in-domain data, you shouldn't expect the model to be good at STS. Instead, the models need to be evaluated on your in-domain task.

Word-level training: Use mirror_scripts/mirror_word_bert.sh.

Encode

It's easy to compute your own sentence embeddings:

from src.mirror_bert import MirrorBERT

model_name = "cambridgeltl/mirror-roberta-base-sentence-drophead"
mirror_bert = MirrorBERT()
mirror_bert.load_model(path=model_name, use_cuda=True)

embeddings = mirror_bert.get_embeddings([
    "I transform pre-trained language models into universal text encoders.",
], agg_mode="cls")
print (embeddings.shape)

Evaluate

Evaluate sentence representations:

>> python evaluation/eval.py \
	--model_dir "cambridgeltl/mirror-roberta-base-sentence-drophead" \
	--agg_mode "cls" \
	--dataset sent_all

Evaluate word representations:

>> python evaluation/eval.py \
	--model_dir "cambridgeltl/mirror-bert-base-uncased-word" \
	--agg_mode "cls" \
	--dataset multisimlex_ENG

To test models on other languages, replace ENG to your custom languages. See here for all supported languages on Multi-SimLex.

Citation

@inproceedings{liu2021fast,
  title={Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders},
  author={Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
  booktitle={EMNLP 2021},
  year={2021}
}
Owner
Cambridge Language Technology Lab
Cambridge Language Technology Lab
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
Contact Extraction with Question Answering.

contactsQA Extraction of contact entities from address blocks and imprints with Extractive Question Answering. Goal Input: Dr. Max Mustermann Hauptstr

Jan 2 Apr 20, 2022
A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. X-Ray supports 18 languages.

WordDumb A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. Languages X-Ray supp

172 Dec 29, 2022
Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

AI-BOT Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

Thempra 2 Dec 21, 2022
New Modeling The Background CodeBase

Modeling the Background for Incremental Learning in Semantic Segmentation This is the updated official PyTorch implementation of our work: "Modeling t

Fabio Cermelli 9 Dec 28, 2022
A natural language modeling framework based on PyTorch

Overview PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapi

Meta Research 6.4k Jan 08, 2023
Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository contaning my implementa

Utkarsh Jain 1 Feb 17, 2022
A paper list of pre-trained language models (PLMs).

Large-scale pre-trained language models (PLMs) such as BERT and GPT have achieved great success and become a milestone in NLP.

RUCAIBox 124 Jan 02, 2023
PG-19 Language Modelling Benchmark

PG-19 Language Modelling Benchmark This repository contains the PG-19 language modeling benchmark. It includes a set of books extracted from the Proje

DeepMind 161 Oct 30, 2022
Code voor mijn Master project omtrent VideoBERT

Code voor masterproef Deze repository bevat de code voor het project van mijn masterproef omtrent VideoBERT. De code in deze repository is gebaseerd o

35 Oct 18, 2021
Python interface for converting Penn Treebank trees to Stanford Dependencies and Universal Depenencies

PyStanfordDependencies Python interface for converting Penn Treebank trees to Universal Dependencies and Stanford Dependencies. Example usage Start by

David McClosky 64 May 08, 2022
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing Trankit is a light-weight Transformer-based Pyth

652 Jan 06, 2023
An end to end ASR Transformer model training repo

END TO END ASR TRANSFORMER 本项目基于transformer 6*encoder+6*decoder的基本结构构造的端到端的语音识别系统 Model Instructions 1.数据准备: 自行下载数据,遵循文件结构如下: ├── data │ ├── train │

旷视天元 MegEngine 10 Jul 19, 2022
Code for ACL 2020 paper "Rigid Formats Controlled Text Generation"

SongNet SongNet: SongCi + Song (Lyrics) + Sonnet + etc. @inproceedings{li-etal-2020-rigid, title = "Rigid Formats Controlled Text Generation",

Piji Li 212 Dec 17, 2022
Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

Yase Yet Another Sequence Encoder - encode sequences to vector of vectors in python ! Why Yase ? Yase enable you to encode any sequence which can be r

Pierre PACI 12 Aug 19, 2021
BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

OpenBMB 377 Jan 02, 2023
Code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

This repository contains the code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

Chenhe Dong 28 Nov 10, 2022
This repo stores the codes for topic modeling on palliative care journals.

This repo stores the codes for topic modeling on palliative care journals. Data Preparation You first need to download the journal papers. bash 1_down

3 Dec 20, 2022
Text Classification Using LSTM

Text classification is the task of assigning a set of predefined categories to free text. Text classifiers can be used to organize, structure, and categorize pretty much anything. For example, new ar

KrishArul26 3 Jan 03, 2023
Text preprocessing, representation and visualization from zero to hero.

Text preprocessing, representation and visualization from zero to hero. From zero to hero • Installation • Getting Started • Examples • API • FAQ • Co

Jonathan Besomi 2.7k Jan 08, 2023