Pytorch version of BERT-whitening

Overview

BERT-whitening

This is the Pytorch implementation of "Whitening Sentence Representations for Better Semantics and Faster Retrieval".

BERT-whitening is very practical in text semantic search, in which the whitening operation not only improves the performance of unsupervised semantic vector matching, but also reduces the vector dimension, which is beneficial to reduce memory usage and improve retrieval efficiency for vector search engines, e.g., FAISS.

This method was first proposed by Jianlin Su in his blog[1].

Reproduce the experimental results

Preparation

Download datasets

$ cd data/
$ ./download_datasets.sh
$ cd ../

Download models

$ cd model/
$ ./download_models.sh
$ cd ../

After the datasets and models are downloaded, the data/ and model/ directories are as follows:

├── data
│   ├── AllNLI.tsv
│   ├── download_datasets.sh
│   └── downstream
│       ├── COCO
│       ├── CR
│       ├── get_transfer_data.bash
│       ├── MPQA
│       ├── MR
│       ├── MRPC
│       ├── SICK
│       ├── SNLI
│       ├── SST
│       ├── STS
│       ├── SUBJ
│       ├── tokenizer.sed
│       └── TREC
├── model
│   ├── bert-base-nli-mean-tokens
│   ├── bert-base-uncased
│   ├── bert-large-nli-mean-tokens
│   ├── bert-large-uncased
│   └── download_models.sh

BERT without whitening

$ python3 ./eval_without_whitening.py

Results:

Model STS-12 STS-13 STS-14 STS-15 STS-16 SICK-R STS-B
BERTbase-cls 0.3062 0.2638 0.2765 0.3605 0.5180 0.4242 0.2029
BERTbase-first_last_avg 0.5785 0.6196 0.6250 0.7096 0.6979 0.6375 0.5904
BERTlarge-cls 0.3240 0.2621 0.2629 0.3554 0.4439 0.4343 0.2675
BERTlarge-first_last_avg 0.5773 0.6116 0.6117 0.6806 0.7030 0.6034 0.5959

BERT with whitening(target)

$ python3 ./eval_with_whitening\(target\).py

Results:

Model STS-12 STS-13 STS-14 STS-15 STS-16 SICK-R STS-B
BERTbase-whiten-256(target) 0.6390 0.7375 0.6909 0.7459 0.7442 0.6223 0.7143
BERTlarge-whiten-384(target) 0.6435 0.7460 0.6964 0.7468 0.7594 0.6081 0.7247
SBERTbase-nli-whiten-256(target) 0.6912 0.7931 0.7805 0.8165 0.7958 0.7500 0.8074
SBERTlarge-nli-whiten-384(target) 0.7126 0.8061 0.7852 0.8201 0.8036 0.7402 0.8199

BERT with whitening(NLI)

$ python3 ./eval_with_whitening\(nli\).py

Results:

Model STS-12 STS-13 STS-14 STS-15 STS-16 SICK-R STS-B
BERTbase-whiten(nli) 0.6169 0.6571 0.6605 0.7516 0.7320 0.6829 0.6365
BERTbase-whiten-256(nli) 0.6148 0.6672 0.6622 0.7483 0.7222 0.6757 0.6496
BERTlarge-whiten(nli) 0.6254 0.6737 0.6715 0.7503 0.7636 0.6865 0.6250
BERTlarge-whiten-348(nli) 0.6231 0.6784 0.6701 0.7548 0.7546 0.6866 0.6381
SBERTbase-nli-whiten(nli) 0.6868 0.7646 0.7626 0.8230 0.7964 0.7896 0.7653
SBERTbase-nli-whiten-256(nli) 0.6891 0.7703 0.7658 0.8229 0.7828 0.7880 0.7678
SBERTlarge-nli-whiten(nli) 0.7074 0.7756 0.7720 0.8285 0.8080 0.7910 0.7589
SBERTlarge-nli-whiten-384(nli) 0.7123 0.7893 0.7790 0.8355 0.8057 0.8037 0.7689

Semantic retrieve with FAISS

An important function of BERT-whitening is that it can not only improve the effect of semantic similarity retrieval, but also reduce memory usage and increase retrieval speed. In this experiment, we use Quora Duplicate Questions Dataset and FAISS, a vector retrieval engine, to measure the retrieval effect and efficiency of different models. The dataset contains more than 400,000 pairs of question1-question2, and it is marked whether they are similar. We extract all the semantic vectors of question2 and store them in FAISS (299,364 vectors in total), and then use the semantic vectors of question1 to retrieve them in FAISS (290,654 vectors in total). [email protected] is used to measure the effect of retrieval, Average Retrieve Time (ms) is used to measure retrieval efficiency, and Memory Usage (GB) is used to measure memory usage. FAISS is configured in CPU mode, nlist = 1024'' and nprobe = 5'', and the CPU is Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz.

Modify model_name'' in qqp_search_with_faiss.py'', and then execute:

$ python3 qqp_search_with_faiss.py

The experimental results of different models are as follows:

Model [email protected] Average Retrieve Time (ms) Memory Usage (GB)
BERTbase-XX
BERTbase-first_last_avg 0.5531 0.7488 0.8564
BERTbase-whiten(nli) 0.5571 0.9735 0.8564
BERTbase-whiten-256(nli) 0.5616 0.2698 0.2854
BERTbase-whiten(target) 0.6104 0.8436 0.8564
BERTbase-whiten-256(target) 0.5957 0.1910 0.2854
BERTlarge-XX
BERTlarge-first_last_avg 0.5667 1.2015 1.1419
BERTlarge-whiten(nli) 0.5783 1.3458 1.1419
BERTlarge-whiten-384(nli) 0.5798 0.4118 0.4282
BERTlarge-whiten(target) 0.6178 1.1418 1.1419
BERTlarge-whiten-384(target) 0.6194 0.3301 0.4282

From the experimental results, the use of whitening to reduce the vector sizes of BERTbase and BERTlarge to 256 and 384, respectively, can significantly reduce memory usage and retrieval time, while improving retrieval results. The memory usage is strictly proportional to the vector dimension, while the average retrieval time is not strictly proportional to the vector dimension. This is because FAISS has a difference in clustering question2, which will cause some fluctuations in retrieval efficiency, but in general, the lower its dimensionality, the higher the retrieval efficiency.

References

[1] 苏剑林, 你可能不需要BERT-flow:一个线性变换媲美BERT-flow, 2020.

[2] 苏剑林, Keras版本BERT-whitening, 2020.

Owner
Weijie Liu
NLP and KG
Weijie Liu
A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

A Python package implementing a new model for text classification with visualization tools for Explainable AI 🍣 Online live demos: http://tworld.io/s

Sergio Burdisso 285 Jan 02, 2023
TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (S

InstaDeep Ltd 72 Dec 09, 2022
This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project

Common Voice Utils This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project. It aims t

Francis Tyers 40 Dec 20, 2022
[EMNLP 2021] LM-Critic: Language Models for Unsupervised Grammatical Error Correction

LM-Critic: Language Models for Unsupervised Grammatical Error Correction This repo provides the source code & data of our paper: LM-Critic: Language M

Michihiro Yasunaga 98 Nov 24, 2022
ASCEND Chinese-English code-switching dataset

ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong.

CAiRE 11 Dec 09, 2022
pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation

pkuseg:一个多领域中文分词工具包 (English Version) pkuseg 是基于论文[Luo et. al, 2019]的工具包。其简单易用,支持细分领域分词,有效提升了分词准确度。 目录 主要亮点 编译和安装 各类分词工具包的性能对比 使用方式 论文引用 作者 常见问题及解答 主要

LancoPKU 6k Dec 29, 2022
Tool to check whether a GCP bucket is public or not.

Tool to check publicly accessible GCP bucket. Blog https://justm0rph3u5.medium.com/gcp-inspector-auditing-publicly-exposed-gcp-bucket-ac6cad55618c Wha

DIVYANSHU SHUKLA 7 Nov 24, 2022
Official implementation of Meta-StyleSpeech and StyleSpeech

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang This is an official code

min95 169 Jan 05, 2023
A curated list of efficient attention modules

awesome-fast-attention A curated list of efficient attention modules

Sepehr Sameni 891 Dec 22, 2022
a CTF web challenge about making screenshots

screenshotter (web) A CTF web challenge about making screenshots. It is inspired by a bug found in real life. The challenge was created by @LiveOverfl

219 Jan 02, 2023
An easy to use Natural Language Processing library and framework for predicting, training, fine-tuning, and serving up state-of-the-art NLP models.

Welcome to AdaptNLP A high level framework and library for running, training, and deploying state-of-the-art Natural Language Processing (NLP) models

Novetta 407 Jan 03, 2023
Ecommerce product title recognition package

revizor This package solves task of splitting product title string into components, like type, brand, model and article (or SKU or product code or you

Bureaucratic Labs 16 Mar 03, 2022
AI-powered literature discovery and review engine for medical/scientific papers

AI-powered literature discovery and review engine for medical/scientific papers paperai is an AI-powered literature discovery and review engine for me

NeuML 819 Dec 30, 2022
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism This repository is the official PyTorch implementation of our AAAI-2022 paper, in

Jinglin Liu 829 Jan 07, 2023
ZUNIT - Toward Zero-Shot Unsupervised Image-to-Image Translation

ZUNIT Dependencies you can install all the dependencies by pip install -r requirements.txt Datasets Download CUB dataset. Unzip the birds.zip at ./da

Chen Yuanqi 9 Jun 24, 2022
Just a basic Telegram AI chat bot written in Python using Pyrogram.

Nikko ChatBot Just a basic Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher. A bot token. Installation $ https

ʀᴇxɪɴᴀᴢᴏʀ 2 Oct 21, 2022
nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Bernhard Liebl 2 Jun 10, 2022
Official code repository of the paper Linear Transformers Are Secretly Fast Weight Programmers.

Linear Transformers Are Secretly Fast Weight Programmers This repository contains the code accompanying the paper Linear Transformers Are Secretly Fas

Imanol Schlag 77 Dec 19, 2022
[ICCV 2021] Instance-level Image Retrieval using Reranking Transformers

Instance-level Image Retrieval using Reranking Transformers Fuwen Tan, Jiangbo Yuan, Vicente Ordonez, ICCV 2021. Abstract Instance-level image retriev

UVA Computer Vision 86 Dec 28, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Dec 30, 2022