source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

Last update: Dec 17, 2022

WhiteningBERT

Source code and data for the paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

Preparation

git clone https://github.com/Jun-jie-Huang/WhiteningBERT.git
cd WhiteningBERT
pip install -r requirements.txt
cd examples/evaluation

Usage

Datasets

We use seven STS datasets: STSBenchmark, SICK-Relatedness, STS12, STS13, STS14, STS15, and STS16.

The processed data can be found in ./examples/datasets/.
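STS datasets pair sentences with human-annotated similarity scores, and a common way to score embeddings on them is the Spearman correlation between cosine similarities and the gold scores. The sketch below illustrates that idea with NumPy only. The function names and the toy data are assumptions for illustration, and the metric used by the evaluation scripts should be checked in the repository itself.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, dim) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return (a * b).sum(axis=1) / norms

def rank(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return np.corrcoef(rank(predicted), rank(gold))[0, 1]

# Toy data (made up): embeddings of four sentence pairs and their gold scores.
first = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 2.0]])
second = np.array([[1.0, 0.1], [1.0, 0.5], [1.0, 0.0], [2.0, 1.0]])
gold_scores = [4.8, 3.9, 0.5, 3.0]

similarities = cosine_similarity(first, second)
print(spearman(similarities, gold_scores))
```

Rank correlation only asks whether the predicted similarities order the pairs the same way the annotators did, so it is insensitive to the scale of the cosine values.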

Run

To run a quick demo:

python evaluation_stsbenchmark.py \
			--pooling aver \
			--layer_num 1,12 \
			--whitening \
			--encoder_name bert-base-cased

Set --pooling to cls or aver to choose between using the [CLS] token and averaging all tokens. Use --layer_num to specify the layers to combine, separated by commas.
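As a rough illustration of what these options control, the sketch below uses NumPy, with random arrays standing in for encoder outputs. The function names (`pool`, `combine_layers`, `whiten`), the choice to average the selected layers, and the SVD-based whitening are assumptions made for illustration, not the repository's code.

```python
import numpy as np

def pool(hidden, mask, pooling="aver"):
    """Pool token vectors of shape (batch, tokens, dim) into (batch, dim)."""
    if pooling == "cls":
        return hidden[:, 0, :]  # vector at the first ([CLS]) position
    weights = mask[:, :, None].astype(hidden.dtype)
    # Mean over non-padding tokens only
    return (hidden * weights).sum(axis=1) / weights.sum(axis=1)

def combine_layers(layer_outputs, layer_nums, mask, pooling="aver"):
    """Average the pooled vectors of the selected layers (assumed rule)."""
    pooled = [pool(layer_outputs[n], mask, pooling) for n in layer_nums]
    return np.mean(pooled, axis=0)

def whiten(embeddings):
    """Center the embeddings and transform them to identity covariance."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    cov = centered.T @ centered / len(embeddings)
    u, s, _ = np.linalg.svd(cov)
    # Dividing column j of u by sqrt(s[j]) equals u @ diag(1 / sqrt(s))
    return centered @ (u / np.sqrt(s))

rng = np.random.default_rng(0)
batch, tokens, dim = 200, 6, 8
# Hypothetical stand-in for encoder outputs: layer index -> hidden states
layer_outputs = {n: rng.normal(size=(batch, tokens, dim)) for n in range(13)}
mask = np.ones((batch, tokens), dtype=int)
mask[:, 4:] = 0  # pretend the last two positions are padding

sentence_vecs = combine_layers(layer_outputs, [1, 12], mask, pooling="aver")
whitened = whiten(sentence_vecs)
print(sentence_vecs.shape, whitened.shape)
```

In this sketch the whitening statistics are estimated from the same sentences that are being embedded, so the number of sentences should exceed the embedding dimension to keep the covariance matrix invertible.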

To enumerate all possible combinations of two layers and evaluate each combination in turn:

python evaluation_stsbenchmark_layer2.py \
			--pooling aver \
			--whitening \
			--encoder_name bert-base-cased

To enumerate all possible combinations of N layers:

python evaluation_stsbenchmark_layerN.py \
			--pooling aver \
			--whitening \
			--encoder_name bert-base-cased \
			--combination_num 4
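The layer enumeration described above can be sketched with `itertools.combinations`. This is a minimal illustration under stated assumptions: layers are numbered 1 to 12 for a base-size encoder, and `toy_score` is a hypothetical placeholder for an actual STS evaluation run. Neither detail is taken from the scripts themselves.

```python
from itertools import combinations

def layer_combinations(num_layers, combination_num):
    """Every set of `combination_num` layers drawn from layers 1..num_layers."""
    return combinations(range(1, num_layers + 1), combination_num)

def best_combination(num_layers, combination_num, score_fn):
    """Return the (score, layers) pair with the highest score under score_fn."""
    return max(
        (score_fn(layers), layers)
        for layers in layer_combinations(num_layers, combination_num)
    )

def toy_score(layers):
    """Hypothetical score; a real run would evaluate on an STS dataset."""
    return -abs(sum(layers) / len(layers) - 6.5)

pairs = list(layer_combinations(12, 2))
quads = list(layer_combinations(12, 4))
print(len(pairs), len(quads))  # 66 495
print(best_combination(12, 2, toy_score))
```

The search space grows quickly with the number of combined layers (66 pairs versus 495 four-layer sets for 12 layers), which is worth keeping in mind when choosing --combination_num for larger encoders.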

You can also save the sentence embeddings:

python evaluation_stsbenchmark_save_embed.py \
			--pooling aver \
			--layer_num 1,12 \
			--whitening \
			--encoder_name bert-base-cased \
			--summary_dir ./save_embeddings
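The on-disk format written by this script is not described here, so the following is only a sketch of one way to store sentences alongside their vectors in a directory. The file names `embeddings.npy` and `sentences.txt` and both helper functions are hypothetical.

```python
import os
import tempfile

import numpy as np

def save_embeddings(sentences, embeddings, summary_dir):
    """Write one vector per sentence plus the sentences themselves."""
    os.makedirs(summary_dir, exist_ok=True)
    np.save(os.path.join(summary_dir, "embeddings.npy"), np.asarray(embeddings))
    with open(os.path.join(summary_dir, "sentences.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(sentences))  # assumes sentences contain no newlines

def load_embeddings(summary_dir):
    """Read back the sentences and their embedding matrix."""
    embeddings = np.load(os.path.join(summary_dir, "embeddings.npy"))
    with open(os.path.join(summary_dir, "sentences.txt"), encoding="utf-8") as f:
        sentences = f.read().split("\n")
    return sentences, embeddings

# Round trip with made-up data in a temporary directory
demo_sentences = ["A man is playing a guitar.", "A woman is slicing an onion."]
demo_vectors = np.arange(8, dtype=float).reshape(2, 4)
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "save_embeddings")
    save_embeddings(demo_sentences, demo_vectors, target)
    loaded_sentences, loaded_vectors = load_embeddings(target)
```

Keeping the sentences in a separate text file, in the same order as the rows of the matrix, makes it easy to look up which vector belongs to which sentence later.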

A list of PLMs you can select:

bert-base-uncased, bert-large-uncased
roberta-base, roberta-large
bert-base-multilingual-uncased
sentence-transformers/LaBSE
albert-base-v1, albert-large-v1
microsoft/layoutlm-base-uncased, microsoft/layoutlm-large-uncased
SpanBERT/spanbert-base-cased, SpanBERT/spanbert-large-cased
microsoft/deberta-base, microsoft/deberta-large
google/electra-base-discriminator
google/mobilebert-uncased
microsoft/DialogRPT-human-vs-rand
distilbert-base-uncased
......

Acknowledgements

The code is adapted from the repositories of the EMNLP19 paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks and the EMNLP20 paper An Unsupervised Sentence Embedding Method by Mutual Information Maximization.
