SentAugment is a data augmentation technique for semi-supervised learning in NLP.

Overview

SentAugment

SentAugment is a data augmentation technique for semi-supervised learning in NLP. It uses state-of-the-art sentence embeddings to structure the information of a very large bank of sentences. The large-scale sentence embedding space is then used to retrieve in-domain unannotated sentences for any language understanding task such that semi-supervised learning techniques like self-training and knowledge-distillation can be leveraged. This means you do not need to assume the presence of unannotated sentences to use semi-supervised learning techniques. In our paper Self-training Improves Pre-training for Natural Language Understanding, we show that SentAugment provides strong gains on multiple language understanding tasks when used in combination with self-training or knowledge distillation.

Model

Dependencies

I. The large-scale bank of sentences

Our approach is based on a large bank of CommonCrawl web sentences. We use SentAugment to filter domain-specific unannotated data for semi-supervised learning NLP methods. This data can be found here and can be recovered from CommonCrawl by the ccnet repository. It consists of 5 billion sentences, each file containing 100M sentences. As an example, we are going to use 100M sentences from the first file:

mkdir data && cd data
wget http://www.statmt.org/cc-english/x01.cc.5b.tar.gz

Then untar files and put all sentences into a single file:

tar -xvf *.tar.gz
cat *.5b > keys.txt

Then, for fast indexing, create a memory map (mmap) of this text file:

python src/compress_text.py --input data/keys.txt &

We will use this data as the bank of sentences.

II. The SentAugment sentence embedding space (SASE)

Our sentence encoder is based on the Transformer implementation of XLM. It obtains state-of-the-art performance on several STS benchmarks. To use it, first clone XLM:

git clone https://github.com/facebookresearch/XLM

Then, download the SentAugment sentence encoder (SASE), and its sentencepiece model:

cd data
wget https://dl.fbaipublicfiles.com/sentaugment/sase.pth
wget https://dl.fbaipublicfiles.com/sentaugment/sase.spm

Then to embed sentences, you can run for instance:

input=data/keys.txt  # input text file
output=data/keys.pt  # output pytorch file

# Encode sentence from $input file and save it to $output
python src/sase.py --input $input --model data/sase.pth --spm_model data/sase.spm --batch_size 64 --cuda "True" --output $output

This will output a torch file containing sentence embeddings (dim=256).

III. Retrieving nearest neighbor sentences from a query

Now that you have constructed a sentence embedding space by encoding many sentences from CommonCrawl, you can leverage that "bank of sentences" with similarity search. From an input query sentence, you can retrieve nearest neighbors from the bank by running:

nn.txt & ">
bank=data/keys.txt.ref.bin64  # compressed text file (bank)
emb=data/keys.pt  # embeddings of sentences (keys)
K=10000  # number of sentences to retrieve per query

## encode input sentences as sase embedding
input=sentence.txt  # input file containing a few (query) sentences
python src/sase.py --input $input --model data/sase.pth --spm_model data/sase.spm --batch_size 64 --cuda "True" --output $input.pt

## use embedding to retrieve nearest neighbors
input=sentence.txt  # input file containing a few (query) sentences
python src/flat_retrieve.py --input $input.pt --bank $bank --emb data/keys.pt --K $K > nn.txt &

Sentences in nn.txt can be used for semi-supervised learning as unannotated in-domain data. They also provide good paraphrases (use the cosine similarity score to filter good paraphrase pairs).

In the next part, we provide fast nearest-neighbor indexes for faster retrieval of similar sentences.

IV. Fast K-nearest neighbor search

Fast K-nearest neighbor search is particularly important when considering a large bank of sentences. We use FAISS indexes to optimize the memory usage and query time.

IV.1 - The KNN index bestiary

For fast nearest-neighbor search, we provide pretrained FAISS indexes (see Table below). Each index enables fast NN search based on different compression schemes. The embeddings are compressed using for instance scalar quantization (SQ4 or SQ8), PCA reduction (PCAR: 14, 40, 256), and search is sped up with k-means clustering (32k or 262k). Please consider looking at the FAISS documentation for more information on indexes and how to train them.

FAISS index #Sentences #Clusters Quantization #PCAR Machine Size
100M_1GPU_16GB 100M 32768 SQ4 256 1GPU16 14GiB
100M_1GPU_32GB 100M 32768 SQ8 256 1GPU32 26GiB
1B_1GPU_16GB 1B 262144 SQ4 14 1GPU16 15GiB
1B_1GPU_32GB 1B 262144 SQ4 40 1GPU32 28GiB
1B_8GPU_32GB 1B 262144 SQ4 256 8GPU32 136GiB

We provide indexes that fit either on 1 GPU with 16GiB memory (1GPU16) up to a larger index that fits on 1 GPU with 32 GiB memory (1GPU32) and one that fits on 8 GPUs (32GB). Indexes that use 100M sentences are built from the first file "x01.cc.5b.tar.gz", and 1B indexes use the first ten files. All indexes are based on SASE embeddings.

IV.2 - How to use an index to query nearest neighbors

You can get K nearest neighbors for each sentence of an input text file by running:

nn.txt & ">
## encode input sentences as sase embedding
input=sentence.txt  # input file containing a few (query) sentences
python src/sase.py --input $input --model data/sase.pth --spm_model data/sase.spm --batch_size 64 --cuda "True" --output $input.pt

index=data/100M_1GPU_16GB.faiss.idx  # FAISS index path
input=sentences.pt  # embeddings of input sentences
bank=data/keys.txt  # text file with all the data (the compressed file keys.ref.bin64 should also be present in the same folder)
K=10  # number of sentences to retrieve per query
NPROBE=1024 # number of probes for querying the index

python src/faiss_retrieve.py --input $input --bank $bank --index $index --K $K --nprobe $NPROBE --gpu "True" > nn.txt &

This can also be used for paraphrase mining.

Reference

If you found the resources here useful, please consider citing our paper:

@article{du2020self,
  title={Self-training Improves Pre-training for Natural Language Understanding},
  author={Du, Jingfei and Grave, Edouard and Gunel, Beliz and Chaudhary, Vishrav and Celebi, Onur and Auli, Michael and Stoyanov, Ves and Conneau, Alexis},
  journal={arXiv preprint arXiv:2010.02194},
  year={2020}
}

License

See the LICENSE file for more details. The majority of SentAugment is licensed under CC-BY-NC. However, license information for PyTorch code is available at https://github.com/pytorch/pytorch/blob/master/LICENSE

Owner
Meta Research
Meta Research
A python package for deep multilingual punctuation prediction.

This python library predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

Oliver Guhr 27 Dec 22, 2022
BERT, LDA, and TFIDF based keyword extraction in Python

BERT, LDA, and TFIDF based keyword extraction in Python kwx is a toolkit for multilingual keyword extraction based on Google's BERT and Latent Dirichl

Andrew Tavis McAllister 41 Dec 27, 2022
Extract Keywords from sentence or Replace keywords in sentences.

FlashText This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm. Install

Vikash Singh 5.3k Jan 01, 2023
This simple Python program calculates a love score based on your and your crush's full names in English

This simple Python program calculates a love score based on your and your crush's full names in English. There is no logic or reason in the calculation behind the love score. The calculation could ha

p.katekomol 1 Jan 24, 2022
Implementation of Multistream Transformers in Pytorch

Multistream Transformers Implementation of Multistream Transformers in Pytorch. This repository deviates slightly from the paper, where instead of usi

Phil Wang 47 Jul 26, 2022
Programme de chiffrement et de déchiffrement inverse d'un message en python3.

Chiffrement Inverse En Python3 Programme de chiffrement et de déchiffrement inverse d'un message en python3. Explication du chiffrement inverse avec c

Malik Makkes 2 Mar 26, 2022
Natural Language Processing Specialization

Natural Language Processing Specialization In this folder, Natural Language Processing Specialization projects and notes can be found. WHAT I LEARNED

Kaan BOKE 3 Oct 06, 2022
HAIS_2GNN: 3D Visual Grounding with Graph and Attention

HAIS_2GNN: 3D Visual Grounding with Graph and Attention This repository is for the HAIS_2GNN research project. Tao Gu, Yue Chen Introduction The motiv

Yue Chen 1 Nov 26, 2022
This repository details the steps in creating a Part of Speech tagger using Trigram Hidden Markov Models and the Viterbi Algorithm without using external libraries.

POS-Tagger This repository details the creation of a Part-of-Speech tagger using Trigram Hidden Markov Models to predict word tags in a word sequence.

Raihan Ahmed 1 Dec 09, 2021
基于GRU网络的句子判断程序/A program based on GRU network for judging sentences

SentencesJudger SentencesJudger 是一个基于GRU神经网络的句子判断程序,基本的功能是判断文章中的某一句话是否为一个优美的句子。 English 如何使用SentencesJudger 确认Python运行环境 安装pyTorch与LTP python3 -m pip

8 Mar 24, 2022
Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

smaller-LaBSE LaBSE(Language-agnostic BERT Sentence Embedding) is a very good method to get sentence embeddings across languages. But it is hard to fi

Jeong Ukjae 13 Sep 02, 2022
Abhijith Neil Abraham 2 Nov 05, 2021
Understand Text Summarization and create your own summarizer in python

Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent

Sreekanth M 1 Oct 18, 2022
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 05, 2022
LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation

LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation Tasks | Datasets | LongLM | Baselines | Paper Introduction LOT is a ben

46 Dec 28, 2022
中文医疗信息处理基准CBLUE: A Chinese Biomedical LanguageUnderstanding Evaluation Benchmark

English | 中文说明 CBLUE AI (Artificial Intelligence) is playing an indispensabe role in the biomedical field, helping improve medical technology. For fur

452 Dec 30, 2022
SentAugment is a data augmentation technique for semi-supervised learning in NLP.

SentAugment SentAugment is a data augmentation technique for semi-supervised learning in NLP. It uses state-of-the-art sentence embeddings to structur

Meta Research 363 Dec 30, 2022
A music comments dataset, containing 39,051 comments for 27,384 songs.

Music Comments Dataset A music comments dataset, containing 39,051 comments for 27,384 songs. For academic research use only. Introduction This datase

Zhang Yixiao 2 Jan 10, 2022
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing 🎉 🎉 🎉 We released the 2.0.0 version with TF2 Support. 🎉 🎉 🎉 If you

Eliyar Eziz 2.3k Dec 29, 2022
Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech

epub2audiobook Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech Input examples qual a pasta do seu

7 Aug 25, 2022