SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

Last update: Jan 02, 2023

This is the repository for SNCSE.

SNCSE aims to alleviate feature suppression in contrastive learning for unsupervised sentence embedding. In this field, feature suppression means that models fail to distinguish and decouple textual similarity from semantic similarity. As a result, they may overestimate the semantic similarity of any pair of sentences with similar wording, regardless of the actual semantic difference between them, and they may underestimate the semantic similarity of pairs with fewer words in common. (Please refer to Section 5 of our paper for several instances and a detailed analysis.) To this end, we propose to take the negation of the original sentences as soft negative samples and to introduce them into the traditional contrastive learning framework through a bidirectional margin loss (BML). The structure of SNCSE is as follows:

[Figure: structure of SNCSE]
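To make the idea of a bidirectional margin concrete, here is a minimal sketch in plain Python. It assumes the loss bounds the gap between the anchor-to-soft-negative similarity and the anchor-to-positive similarity from both sides, so that a negated sentence is kept less similar than the positive but not pushed arbitrarily far away. The function names, the margin names `alpha` and `beta`, and their default values are illustrative assumptions, not the settings used in this repository; please refer to the paper and the training code for the actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bidirectional_margin_loss(anchor, positive, soft_negative,
                              alpha=0.1, beta=0.3):
    """Hypothetical BML sketch: keep the similarity gap inside [-beta, -alpha].

    alpha and beta are assumed margins with alpha <= beta.
    """
    # Gap between the soft negative and the positive, relative to the anchor.
    delta = cosine(anchor, soft_negative) - cosine(anchor, positive)
    # Penalize soft negatives that are too close to the anchor (delta > -alpha).
    too_close = max(0.0, delta + alpha)
    # Penalize soft negatives that are pushed too far away (delta < -beta).
    too_far = max(0.0, -delta - beta)
    return too_close + too_far
```

With these assumed margins, a soft negative whose similarity is 0.2 below the positive incurs no loss, while one that is as similar as the positive, or one that points in the opposite direction, is penalized.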

The performance of SNCSE on the STS tasks with different encoders is:

[Figure: STS performance of SNCSE with different encoders]

To reproduce the above results, please download the files and unzip them to replace the original file folder. Then download the models, modify the file path variables, and run:

python bert_prediction.py
python roberta_prediction.py

To train SNCSE, please download the training file and put it in /SNCSE/data. You can either run:

python generate_soft_negative_samples.py

to generate soft negative samples, or use our file at /Files/soft_negative_samples.txt. Then modify and run train_SNCSE.sh.
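For readers who want a feel for what a negated soft negative sample looks like, the toy sketch below inserts "not" after the first auxiliary verb it finds. This is only an illustration under simplifying assumptions (whitespace tokenization, a small hand-picked auxiliary list); it is not the logic of generate_soft_negative_samples.py, and the names `AUXILIARIES` and `negate` are hypothetical.

```python
# Hand-picked auxiliaries for illustration only; a real rule set would be larger.
AUXILIARIES = {"is", "are", "was", "were", "can", "could",
               "will", "would", "should", "must"}


def negate(sentence):
    """Return a negated copy of the sentence, or None if no rule applies."""
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if token.lower() in AUXILIARIES:
            # Insert "not" right after the first auxiliary verb.
            return " ".join(tokens[:i + 1] + ["not"] + tokens[i + 1:])
    # Sentences without a listed auxiliary are left unhandled in this toy sketch.
    return None


if __name__ == "__main__":
    print(negate("The cat is sleeping."))
    print(negate("Birds fly south."))
```

The negated sentence shares almost all of its words with the original while carrying a different meaning, which is the property that makes it useful as a soft negative.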

To evaluate the checkpoints saved during training on the development set of the STS-B task, please run:

python bert_evaluation.py
python roberta_evaluation.py
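STS benchmarks are commonly scored with the Spearman correlation between the model's predicted similarities and the human-annotated scores. The sketch below shows that computation in plain Python; it assumes this metric for illustration and is not taken from the evaluation scripts above, which may compute or report results differently. The helper names are hypothetical.

```python
import math


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(predicted), ranks(gold))
```

Here `predicted` would hold the cosine similarities of the sentence-pair embeddings and `gold` the annotated similarity scores for the same pairs.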

Feel free to contact the authors at [email protected] for any questions.

Please cite SNCSE as:

Hao Wang, Yangguang Li, Zhen Huang, Yong Dou, Lingpeng Kong, Jing Shao. SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples. CoRR, abs/2201.05979, 2022.

Owner

Sense-GVT
