Pytorch version of BERT-whitening

Overview

BERT-whitening

This is the Pytorch implementation of "Whitening Sentence Representations for Better Semantics and Faster Retrieval".

BERT-whitening is very practical in text semantic search, in which the whitening operation not only improves the performance of unsupervised semantic vector matching, but also reduces the vector dimension, which is beneficial to reduce memory usage and improve retrieval efficiency for vector search engines, e.g., FAISS.

This method was first proposed by Jianlin Su in his blog[1].

Reproduce the experimental results

Preparation

Download datasets

$ cd data/
$ ./download_datasets.sh
$ cd ../

Download models

$ cd model/
$ ./download_models.sh
$ cd ../

After the datasets and models are downloaded, the data/ and model/ directories are as follows:

├── data
│   ├── AllNLI.tsv
│   ├── download_datasets.sh
│   └── downstream
│       ├── COCO
│       ├── CR
│       ├── get_transfer_data.bash
│       ├── MPQA
│       ├── MR
│       ├── MRPC
│       ├── SICK
│       ├── SNLI
│       ├── SST
│       ├── STS
│       ├── SUBJ
│       ├── tokenizer.sed
│       └── TREC
├── model
│   ├── bert-base-nli-mean-tokens
│   ├── bert-base-uncased
│   ├── bert-large-nli-mean-tokens
│   ├── bert-large-uncased
│   └── download_models.sh

BERT without whitening

$ python3 ./eval_without_whitening.py

Results:

Model STS-12 STS-13 STS-14 STS-15 STS-16 SICK-R STS-B
BERTbase-cls 0.3062 0.2638 0.2765 0.3605 0.5180 0.4242 0.2029
BERTbase-first_last_avg 0.5785 0.6196 0.6250 0.7096 0.6979 0.6375 0.5904
BERTlarge-cls 0.3240 0.2621 0.2629 0.3554 0.4439 0.4343 0.2675
BERTlarge-first_last_avg 0.5773 0.6116 0.6117 0.6806 0.7030 0.6034 0.5959

BERT with whitening(target)

$ python3 ./eval_with_whitening\(target\).py

Results:

Model STS-12 STS-13 STS-14 STS-15 STS-16 SICK-R STS-B
BERTbase-whiten-256(target) 0.6390 0.7375 0.6909 0.7459 0.7442 0.6223 0.7143
BERTlarge-whiten-384(target) 0.6435 0.7460 0.6964 0.7468 0.7594 0.6081 0.7247
SBERTbase-nli-whiten-256(target) 0.6912 0.7931 0.7805 0.8165 0.7958 0.7500 0.8074
SBERTlarge-nli-whiten-384(target) 0.7126 0.8061 0.7852 0.8201 0.8036 0.7402 0.8199

BERT with whitening(NLI)

$ python3 ./eval_with_whitening\(nli\).py

Results:

Model STS-12 STS-13 STS-14 STS-15 STS-16 SICK-R STS-B
BERTbase-whiten(nli) 0.6169 0.6571 0.6605 0.7516 0.7320 0.6829 0.6365
BERTbase-whiten-256(nli) 0.6148 0.6672 0.6622 0.7483 0.7222 0.6757 0.6496
BERTlarge-whiten(nli) 0.6254 0.6737 0.6715 0.7503 0.7636 0.6865 0.6250
BERTlarge-whiten-348(nli) 0.6231 0.6784 0.6701 0.7548 0.7546 0.6866 0.6381
SBERTbase-nli-whiten(nli) 0.6868 0.7646 0.7626 0.8230 0.7964 0.7896 0.7653
SBERTbase-nli-whiten-256(nli) 0.6891 0.7703 0.7658 0.8229 0.7828 0.7880 0.7678
SBERTlarge-nli-whiten(nli) 0.7074 0.7756 0.7720 0.8285 0.8080 0.7910 0.7589
SBERTlarge-nli-whiten-384(nli) 0.7123 0.7893 0.7790 0.8355 0.8057 0.8037 0.7689

Semantic retrieve with FAISS

An important function of BERT-whitening is that it can not only improve the effect of semantic similarity retrieval, but also reduce memory usage and increase retrieval speed. In this experiment, we use Quora Duplicate Questions Dataset and FAISS, a vector retrieval engine, to measure the retrieval effect and efficiency of different models. The dataset contains more than 400,000 pairs of question1-question2, and it is marked whether they are similar. We extract all the semantic vectors of question2 and store them in FAISS (299,364 vectors in total), and then use the semantic vectors of question1 to retrieve them in FAISS (290,654 vectors in total). [email protected] is used to measure the effect of retrieval, Average Retrieve Time (ms) is used to measure retrieval efficiency, and Memory Usage (GB) is used to measure memory usage. FAISS is configured in CPU mode, nlist = 1024'' and nprobe = 5'', and the CPU is Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz.

Modify model_name'' in qqp_search_with_faiss.py'', and then execute:

$ python3 qqp_search_with_faiss.py

The experimental results of different models are as follows:

Model [email protected] Average Retrieve Time (ms) Memory Usage (GB)
BERTbase-XX
BERTbase-first_last_avg 0.5531 0.7488 0.8564
BERTbase-whiten(nli) 0.5571 0.9735 0.8564
BERTbase-whiten-256(nli) 0.5616 0.2698 0.2854
BERTbase-whiten(target) 0.6104 0.8436 0.8564
BERTbase-whiten-256(target) 0.5957 0.1910 0.2854
BERTlarge-XX
BERTlarge-first_last_avg 0.5667 1.2015 1.1419
BERTlarge-whiten(nli) 0.5783 1.3458 1.1419
BERTlarge-whiten-384(nli) 0.5798 0.4118 0.4282
BERTlarge-whiten(target) 0.6178 1.1418 1.1419
BERTlarge-whiten-384(target) 0.6194 0.3301 0.4282

From the experimental results, the use of whitening to reduce the vector sizes of BERTbase and BERTlarge to 256 and 384, respectively, can significantly reduce memory usage and retrieval time, while improving retrieval results. The memory usage is strictly proportional to the vector dimension, while the average retrieval time is not strictly proportional to the vector dimension. This is because FAISS has a difference in clustering question2, which will cause some fluctuations in retrieval efficiency, but in general, the lower its dimensionality, the higher the retrieval efficiency.

References

[1] 苏剑林, 你可能不需要BERT-flow:一个线性变换媲美BERT-flow, 2020.

[2] 苏剑林, Keras版本BERT-whitening, 2020.

Owner
Weijie Liu
NLP and KG
Weijie Liu
A simple Speech Emotion Recognition (SER) API created using Flask and running in a Docker container.

keyword_searching Steps to use this Python scripts: (1)Paste this script into the file folder containing the PDF files you need to search from; (2)Thi

2 Nov 11, 2022
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub.

Nathan Cooper 2.3k Jan 01, 2023
Diaformer: Automatic Diagnosis via Symptoms Sequence Generation

Diaformer Diaformer: Automatic Diagnosis via Symptoms Sequence Generation (AAAI 2022) Diaformer is an efficient model for automatic diagnosis via symp

Junying Chen 20 Dec 13, 2022
Data and code to support "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley)

anlp21 Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley) Syllabus: http://people.ischool.berkeley.edu/~dba

David Bamman 48 Dec 06, 2022
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

20.5k Jan 08, 2023
1 Jun 28, 2022
The PyTorch based implementation of continuous integrate-and-fire (CIF) module.

CIF-PyTorch This is a PyTorch based implementation of continuous integrate-and-fire (CIF) module for end-to-end (E2E) automatic speech recognition (AS

Minglun Han 24 Dec 29, 2022
An ActivityWatch watcher to pose questions to the user and record her answers.

aw-watcher-ask An ActivityWatch watcher to pose questions to the user and record her answers. This watcher uses Zenity to present dialog boxes to the

Bernardo Chrispim Baron 33 Dec 03, 2022
Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Chenyang Huang 37 Jan 04, 2023
Backend for the Autocomplete platform. An AI assisted coding platform.

Introduction A custom predictor allows you to deploy your own prediction implementation, useful when the existing serving implementations don't fit yo

Tatenda Christopher Chinyamakobvu 1 Jan 31, 2022
NLP tool to extract emotional phrase from tweets 🤩

Emotional phrase extractor Extract phrase in the given text that is used to express the sentiment. Capturing sentiment in language is important in the

Shahul ES 38 Oct 17, 2022
The guide to tackle with the Text Summarization

The guide to tackle with the Text Summarization

Takahiro Kubo 1.2k Dec 30, 2022
构建一个多源(公众号、RSS)、干净、个性化的阅读环境

2C 构建一个多源(公众号、RSS)、干净、个性化的阅读环境 作为一名微信公众号的重度用户,公众号一直被我设为汲取知识的地方。随着使用程度的增加,相信大家或多或少会有一个比较头疼的问题——广告问题。 假设你关注的公众号有十来个,若一个公众号两周接一次广告,理论上你会面临二十多次广告,实际上会更多,运

howie.hu 678 Dec 28, 2022
This repository contains the code for running the character-level Sandwich Transformers from our ACL 2020 paper on Improving Transformer Models by Reordering their Sublayers.

Improving Transformer Models by Reordering their Sublayers This repository contains the code for running the character-level Sandwich Transformers fro

Ofir Press 53 Sep 26, 2022
Fine-tune GPT-3 with a Google Chat conversation history

Google Chat GPT-3 This repo will help you fine-tune GPT-3 with a Google Chat conversation history. The trained model will be able to converse as one o

Nate Baer 7 Dec 10, 2022
뉴스 도메인 질의응답 시스템 (21-1학기 졸업 프로젝트)

뉴스 도메인 질의응답 시스템 본 프로젝트는 뉴스기사에 대한 질의응답 서비스 를 제공하기 위해서 진행한 프로젝트입니다. 약 3개월간 ( 21. 03 ~ 21. 05 ) 진행하였으며 Transformer 아키텍쳐 기반의 Encoder를 사용하여 한국어 질의응답 데이터셋으로

TaegyeongEo 4 Jul 08, 2022
A simple Streamlit App to classify swahili news into different categories.

Swahili News Classifier Streamlit App A simple app to classify swahili news into different categories. Installation Install all streamlit requirements

Davis David 4 May 01, 2022
天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

zxx飞翔的鱼 751 Dec 30, 2022
Implementation of TTS with combination of Tacotron2 and HiFi-GAN

Tacotron2-HiFiGAN-master Implementation of TTS with combination of Tacotron2 and HiFi-GAN for Mandarin TTS. Inference In order to inference, we need t

SunLu Z 7 Nov 11, 2022