Overview

KR-BERT-SimCSE

Implementing SimCSE (paper, official repository) using TensorFlow 2 and KR-BERT.

Training

Unsupervised

python train_unsupervised.py --mixed_precision
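The --mixed_precision flag presumably toggles TF2's global mixed-precision policy. A minimal sketch of what that amounts to (an assumption, not an excerpt from the training script):

```python
import tensorflow as tf

# Run most ops in float16 while keeping variables in float32. With a Keras
# optimizer, loss scaling is handled automatically under this policy.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```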

I used the Korean Wikipedia corpus, which is split into sentences in advance. (Check out the tfds-korean catalog page for details.)

  • Settings
    • KR-BERT character
    • peak learning rate 3e-5
    • batch size 64
    • total steps 25,000
    • warmup rate 0.05, with a linear decay learning rate scheduler
    • temperature 0.05
    • evaluate on KLUE STS and KorSTS every 250 steps
    • max sequence length 64
    • pooled outputs for training, [CLS] token representations for inference

The hyperparameters were not tuned and mostly followed the values in the paper.
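For reference, here is a minimal sketch of the unsupervised SimCSE objective implied by these settings. `encode` is a hypothetical stand-in for KR-BERT's pooled output; the actual implementation lives in train_unsupervised.py.

```python
import tensorflow as tf

def unsupervised_simcse_loss(encode, sentences, temperature=0.05):
    # Encode the same batch twice; dropout makes the two "views" differ,
    # so each sentence serves as its own positive pair.
    z1 = encode(sentences, training=True)  # [batch, hidden]
    z2 = encode(sentences, training=True)  # [batch, hidden]

    # Cosine similarity between every pair in the batch, scaled by temperature.
    z1 = tf.math.l2_normalize(z1, axis=-1)
    z2 = tf.math.l2_normalize(z2, axis=-1)
    sim = tf.matmul(z1, z2, transpose_b=True) / temperature  # [batch, batch]

    # Diagonal entries are positives; the rest of each row are in-batch negatives.
    labels = tf.range(tf.shape(sim)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, sim, from_logits=True)
    return tf.reduce_mean(loss)
```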

Supervised

python train_supervised.py --mixed_precision

I used KorNLI for supervised training. (Check out the tfds-korean catalog page for details.)

  • Settings
    • KR-BERT character
    • batch size 128
    • 3 epochs
    • peak learning rate 5e-5
    • warmup rate 0.05, with a linear decay learning rate scheduler
    • temperature 0.05
    • evaluate on KLUE STS and KorSTS every 125 steps
    • max sequence length 48
    • pooled outputs for training, [CLS] token representations for inference

The hyperparameters were not tuned and mostly followed the values in the paper.
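Again for reference, a sketch of the supervised objective: KorNLI yields (premise, entailment hypothesis, contradiction hypothesis) triples, so entailment hypotheses act as positives and contradiction hypotheses as in-batch hard negatives. Names here are illustrative; the actual implementation lives in train_supervised.py.

```python
import tensorflow as tf

def supervised_simcse_loss(h_premise, h_entail, h_contra, temperature=0.05):
    # Normalize so dot products become cosine similarities.
    h_premise = tf.math.l2_normalize(h_premise, axis=-1)
    h_entail = tf.math.l2_normalize(h_entail, axis=-1)
    h_contra = tf.math.l2_normalize(h_contra, axis=-1)

    # Similarities to positives (entailment) and hard negatives (contradiction).
    sim_pos = tf.matmul(h_premise, h_entail, transpose_b=True)    # [b, b]
    sim_neg = tf.matmul(h_premise, h_contra, transpose_b=True)    # [b, b]
    logits = tf.concat([sim_pos, sim_neg], axis=1) / temperature  # [b, 2b]

    # The i-th premise should match the i-th entailment hypothesis.
    labels = tf.range(tf.shape(h_premise)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    return tf.reduce_mean(loss)
```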

Results
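Scores are 100 × the correlation between the cosine similarity of the two sentence embeddings and the gold similarity labels. A minimal sketch of that evaluation, assuming a hypothetical `embed` that returns the [CLS]-token representations used for inference here:

```python
import numpy as np
from scipy import stats

def sts_scores(embed, sents_a, sents_b, gold_scores):
    a = embed(sents_a)  # [n, hidden]
    b = embed(sents_b)  # [n, hidden]
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    spearman, _ = stats.spearmanr(cos, gold_scores)  # KorSTS metric
    pearson, _ = stats.pearsonr(cos, gold_scores)    # KLUE STS metric
    return 100 * spearman, 100 * pearson
```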

KorSTS (dev set results)

| model | training | encoding | 100 × Spearman correlation |
|---|---|---|---|
| KR-BERT base SimCSE | unsupervised | bi encoding | 79.99 |
| KR-BERT base SimCSE-supervised | trained on KorNLI | bi encoding | 84.88 |
| SRoBERTa base* | unsupervised | bi encoding | 63.34 |
| SRoBERTa base* | trained on KorNLI | bi encoding | 76.48 |
| SRoBERTa base* | trained on KorSTS | bi encoding | 83.68 |
| SRoBERTa base* | trained on KorNLI -> KorSTS | bi encoding | 83.54 |
| SRoBERTa large* | trained on KorNLI | bi encoding | 77.95 |
| SRoBERTa large* | trained on KorSTS | bi encoding | 84.74 |
| SRoBERTa large* | trained on KorNLI -> KorSTS | bi encoding | 84.21 |

KorSTS (test set results)

| model | training | encoding | 100 × Spearman correlation |
|---|---|---|---|
| KR-BERT base SimCSE | unsupervised | bi encoding | 73.25 |
| KR-BERT base SimCSE-supervised | trained on KorNLI | bi encoding | 80.72 |
| SRoBERTa base* | unsupervised | bi encoding | 48.96 |
| SRoBERTa base* | trained on KorNLI | bi encoding | 74.19 |
| SRoBERTa base* | trained on KorSTS | bi encoding | 78.94 |
| SRoBERTa base* | trained on KorNLI -> KorSTS | bi encoding | 80.29 |
| SRoBERTa large* | trained on KorNLI | bi encoding | 75.46 |
| SRoBERTa large* | trained on KorSTS | bi encoding | 79.55 |
| SRoBERTa large* | trained on KorNLI -> KorSTS | bi encoding | 80.49 |
| SRoBERTa base* | trained on KorSTS | cross encoding | 83.00 |
| SRoBERTa large* | trained on KorSTS | cross encoding | 85.27 |

KLUE STS (dev set results)

| model | training | encoding | 100 × Pearson correlation |
|---|---|---|---|
| KR-BERT base SimCSE | unsupervised | bi encoding | 74.45 |
| KR-BERT base SimCSE-supervised | trained on KorNLI | bi encoding | 79.42 |
| KR-BERT base* | supervised | cross encoding | 87.50 |

References

@misc{gao2021simcse,
    title={SimCSE: Simple Contrastive Learning of Sentence Embeddings},
    author={Tianyu Gao and Xingcheng Yao and Danqi Chen},
    year={2021},
    eprint={2104.08821},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{ham2020kornli,
    title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
    author={Jiyeon Ham and Yo Joong Choe and Kyubyong Park and Ilji Choi and Hyungjoon Soh},
    year={2020},
    eprint={2004.03289},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}