State of the art Semantic Sentence Embeddings

Last update: Dec 30, 2022

Contrastive Tension

Published Paper · Huggingface Models · Report Bug

Overview

This is the official code accompanying the paper Semantic Re-Tuning via Contrastive Tension.
The paper was accepted at ICLR-2021 and official reviews and responses can be found at OpenReview.

Contrastive Tension (CT) is a fully self-supervised algorithm for re-tuning already pre-trained transformer Language Models, and achieves State-Of-The-Art (SOTA) sentence embeddings for Semantic Textual Similarity (STS). All that is required is a pre-trained model and a modestly large text corpus. The results presented in the paper used text data sampled from Wikipedia.

This repository contains:

Tensorflow 2 implementation of the CT algorithm
State of the art pre-trained STS models
Tensorflow 2 inference code
PyTorch inference code

Requirements

While other versions may work equally well, we have worked with the following:

Python = 3.6.9
Transformers = 4.1.1

Usage

All the models and tokenizers are available via the Huggingface interface, and can be loaded for both Tensorflow and PyTorch:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('Contrastive-Tension/RoBerta-Large-CT-STSb')

TF_model = transformers.TFAutoModel.from_pretrained('Contrastive-Tension/RoBerta-Large-CT-STSb')
PT_model = transformers.AutoModel.from_pretrained('Contrastive-Tension/RoBerta-Large-CT-STSb')

Inference

To perform inference with the pre-trained models (or other Huggingface models), please see the script ExampleBatchInference.py.
The most important thing to remember when running inference is to apply the attention_mask to the batch output vectors before mean pooling, as is done in the example script.
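
To illustrate why the mask matters, here is a minimal NumPy sketch of masked mean pooling. It is not the code from ExampleBatchInference.py; the function name masked_mean_pool and the toy arrays are assumptions made for illustration. With a real model, token_embeddings would be the last hidden state and attention_mask would come from the tokenizer.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   0/1 array of shape (batch, seq_len)
    (Hypothetical helper, not part of the repository.)
    """
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = np.asarray(attention_mask, dtype=float)[:, :, None]
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sentence; clip to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: two "sentences", three token slots, hidden size 2.
# The last slot of the first sentence is padding and holds junk values.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)
naive = tokens.mean(axis=1)  # unmasked mean, polluted by the padding slot
```

In this toy batch the masked result for the first sentence averages only its two real tokens, while the naive mean is dragged towards the padding values. The same pattern carries over to PyTorch or Tensorflow tensors.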

CT Training

To run CT on your own models and text data, see ExampleTraining.py for a comprehensive example. This file currently creates a dummy corpus of random text. Simply replace it with whatever corpus you like.

Pre-trained Models

Note that these models are not trained with exactly the same hyperparameters as those disclosed in the original CT paper. Rather, the parameters are from a short follow-up paper currently under review, which once again pushes the SOTA.

All evaluation is done using the SentEval framework, and results are reported as Pearson / Spearman correlations.
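
For readers unfamiliar with this kind of scoring, the following is a rough sketch of how sentence-pair similarities can be compared against gold scores. It is not SentEval code. The helper names (cosine_similarity, pearson, spearman) and all numbers are made up for illustration, and the simple rank computation assumes there are no tied values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, hidden) embedding arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation: Pearson computed on ranks.

    Assumption: no tied values, so a double argsort yields valid ranks.
    """
    def rank(values):
        return np.argsort(np.argsort(values))
    return pearson(rank(x), rank(y))

# Hypothetical embeddings for the two sides of two sentence pairs.
left = np.array([[1.0, 0.0], [1.0, 0.0]])
right = np.array([[1.0, 0.0], [0.0, 1.0]])
pair_similarities = cosine_similarity(left, right)

# Hypothetical gold similarity scores and model similarities for four pairs.
gold = np.array([5.0, 3.0, 1.0, 0.0])
predicted = np.array([0.9, 0.5, 0.6, 0.1])

pearson_score = pearson(predicted, gold)
spearman_score = spearman(predicted, gold)
```

Pearson measures how linearly the predicted similarities track the gold scores, while Spearman only looks at whether the pairs are ranked in the same order.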

Unsupervised / Zero-Shot

As both the pre-training of BERT and CT itself are fully self-supervised, the models tuned only with CT require no labeled data whatsoever.
The NLI models, however, are first fine-tuned on a natural language inference task, which requires labeled data.

Model	Avg Unsupervised STS	STS-b	#Parameters
Fully Unsupervised
BERT-Distil-CT	75.12 / 75.04	78.63 / 77.91	66 M
BERT-Base-CT	73.55 / 73.36	75.49 / 73.31	108 M
BERT-Large-CT	77.12 / 76.93	80.75 / 79.82	334 M
Using NLI Data
BERT-Distil-NLI-CT	76.65 / 76.63	79.74 / 81.01	66 M
BERT-Base-NLI-CT	76.05 / 76.28	79.98 / 81.47	108 M
BERT-Large-NLI-CT	77.42 / 77.41	80.92 / 81.66	334 M

Supervised

These models are fine-tuned directly with STS data, using a modified version of the supervised training objective proposed by S-BERT.
To our knowledge, our RoBerta-Large-CT-STSb is the current SOTA model for STS via sentence embeddings.

Model	STS-b	#Parameters
BERT-Distil-CT-STSb	84.85 / 85.46	66 M
BERT-Base-CT-STSb	85.31 / 85.76	108 M
BERT-Large-CT-STSb	85.86 / 86.47	334 M
RoBerta-Large-CT-STSb	87.56 / 88.42	334 M

Other Languages

Model	Language	#Parameters
BERT-Base-Swe-CT-STSb	Swedish	108 M

License

Distributed under the MIT License. See LICENSE for more information.

Contact

If you have questions regarding the paper, please consider creating a comment via the official OpenReview submission.
If you have questions regarding the code or otherwise related to this Github page, please open an issue.

For other purposes, feel free to contact me directly at: [email protected]

Owner

Fredrik Carlsson