Korean Sentence Embedding Repository

Overview

Korean-Sentence-Embedding

Korean sentence embedding repository. You can download the pre-trained models and run inference right away, and the repository also provides an environment in which you can train your own models.

Baseline Models

Baseline models used for Korean sentence embedding: KLUE-PLMs

Model              Embedding size  Hidden size  # Layers  # Heads
KLUE-BERT-base     768             768          12        12
KLUE-RoBERTa-base  768             768          12        12

NOTE: All the pre-trained models are uploaded to the Hugging Face Model Hub. See https://huggingface.co/klue.

How to start

  • Get the datasets for training or testing.
bash get_model_dataset.sh
  • To run inference quickly, download the pre-trained checkpoints; you can then run the downstream task scripts.
bash get_model_checkpoint.sh
cd KoSBERT/
python SemanticSearch.py

Available Models

  1. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (SBERT, EMNLP 2019)
  2. SimCSE: Simple Contrastive Learning of Sentence Embeddings (SimCSE, EMNLP 2021)

KoSentenceBERT

  • 🤗 Model Training
  • Dataset
    • Train: snli_1.0_train.ko.tsv (first phase: training on NLI), sts-train.tsv (second phase: continued training on STS)
    • Valid: sts-dev.tsv
    • Test: sts-test.tsv

KoSimCSE

  • 🤗 Model Training
  • Dataset
    • Train: snli_1.0_train.ko.tsv + multinli.train.ko.tsv
    • Valid: sts-dev.tsv
    • Test: sts-test.tsv
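
As background for the training setup above: SimCSE is trained with a contrastive objective, in which each sentence embedding is pulled toward its positive example and pushed away from the other examples in the same batch. The following is a minimal sketch of that in-batch loss on toy vectors, not the training code of this repository. The function names, the toy vectors, and the temperature value of 0.05 are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchors, positives, temperature=0.05):
    """Mean in-batch contrastive loss: positives[i] is the target for anchors[i],
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, pos) / temperature for pos in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Hypothetical 2-dimensional embeddings (a real model produces 768 dimensions).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor

loss_aligned = contrastive_loss(anchors, aligned)
loss_swapped = contrastive_loss(anchors, swapped)
print(f"aligned pairs: {loss_aligned:.4f}")
print(f"swapped pairs: {loss_swapped:.4f}")
```

The loss is close to zero when every anchor is most similar to its own positive and grows large when the pairing is wrong, which is the signal that drives the embeddings of related sentences together.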

Performance

  • Semantic Textual Similarity test set results
Model                    Cosine Pearson  Cosine Spearman  Euclidean Pearson  Euclidean Spearman  Manhattan Pearson  Manhattan Spearman  Dot Pearson  Dot Spearman
KoSBERT† (SKT)           78.81           78.47            77.68              77.78               77.71              77.83               75.75        75.22
KoSBERT (base)           82.13           82.25            80.67              80.75               80.69              80.78               77.96        77.90
KoSRoBERTa (base)        80.70           81.03            80.97              81.06               80.84              80.97               79.20        78.93
KoSimCSE-BERT† (SKT)     82.12           82.56            81.84              81.63               81.99              81.74               79.55        79.19
KoSimCSE-BERT (base)     82.73           83.51            82.32              82.78               82.43              82.88               77.86        76.70
KoSimCSE-RoBERTa (base)  83.64           84.05            83.32              83.84               83.33              83.79               80.92        79.84
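
Each column above pairs a similarity function (cosine, Euclidean, Manhattan, or dot product) with a correlation statistic (Pearson or Spearman). As a rough illustration of what one such cell measures, the sketch below scores a few sentence pairs with cosine similarity and correlates those scores with gold similarity labels. The embeddings and gold scores are made up, the helper names are hypothetical, and this is not the repository's evaluation script. The table values appear to be correlations scaled by 100, which the sketch mirrors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical embeddings for four sentence pairs and their gold scores (0-5 scale).
pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    ([0.0, 1.0, 0.0], [0.5, 0.5, 0.0]),
    ([0.0, 0.0, 1.0], [1.0, 0.0, 0.1]),
    ([1.0, 1.0, 0.0], [1.0, 0.8, 0.1]),
]
gold = [4.8, 3.0, 0.5, 4.5]

predicted = [cosine(u, v) for u, v in pairs]
cosine_pearson = pearson(predicted, gold)
cosine_spearman = spearman(predicted, gold)
print(f"Cosine Pearson:  {100 * cosine_pearson:.2f}")
print(f"Cosine Spearman: {100 * cosine_spearman:.2f}")
```

Pearson rewards a linear relationship between predicted and gold scores, while Spearman only checks that the ranking agrees, so a model can score higher on one than on the other.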

Downstream Tasks

  • KoSBERT: Semantic Search, Clustering
python SemanticSearch.py
python Clustering.py
  • KoSimCSE: Semantic Search
python SemanticSearch.py

Semantic Search (KoSBERT)

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 솦으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.']

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['한 남자가 파스타를 먹는다.',
           '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
           '치타가 들판을 가로 질러 먹이를 쫓는다.']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    # Use np.argpartition so that only the top_k results are sorted
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
  • Results are as follows:

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6141)
한 남자가 빵 한 조각을 먹는다. (Score: 0.5952)
한 남자가 말을 탄다. (Score: 0.1231)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0752)
두 남자가 수레를 숲 솦으로 밀었다. (Score: 0.0486)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6656)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.2988)
한 여자가 바이올린을 연주한다. (Score: 0.1566)
한 남자가 말을 탄다. (Score: 0.1112)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0262)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7570)
두 남자가 수레를 숲 솦으로 밀었다. (Score: 0.3658)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.3583)
한 남자가 말을 탄다. (Score: 0.0505)
그 여자가 아이를 돌본다. (Score: -0.0087)
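
The ranking step of the semantic search can be tried without downloading a checkpoint. The sketch below is a simplified stand-in, assuming hand-written 3-dimensional vectors in place of model embeddings; it uses heapq.nlargest from the standard library rather than np.argpartition, and the labels, vectors, and the helper name top_k_search are all hypothetical.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_search(query_vec, corpus_vecs, k):
    """Return (index, score) pairs for the k most similar corpus vectors,
    ordered from most to least similar."""
    scores = [cosine(query_vec, vec) for vec in corpus_vecs]
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(i, scores[i]) for i in best]

# Hypothetical toy "embeddings"; a real model would produce 768 dimensions.
toy_corpus = {
    "eating food": [0.9, 0.1, 0.0],
    "eating bread": [0.8, 0.2, 0.1],
    "riding a horse": [0.1, 0.9, 0.1],
    "playing drums": [0.0, 0.1, 0.9],
}
labels = list(toy_corpus)
vectors = list(toy_corpus.values())

query = [1.0, 0.0, 0.0]  # stands in for the embedding of "eating pasta"
results = top_k_search(query, vectors, k=2)
for idx, score in results:
    print(f"{labels[idx]} (Score: {score:.4f})")
```

As in the real output above, the two food-related entries rank first because their vectors point in nearly the same direction as the query.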

Clustering (KoSBERT)

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 솦으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.',
          '한 남자가 파스타를 먹는다.',
          '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
          '치타가 들판을 가로 질러 먹이를 쫓는다.']

corpus_embeddings = embedder.encode(corpus)

# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")
  • Results are as follows:
Cluster  1
['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.', '한 남자가 파스타를 먹는다.']

Cluster  2
['원숭이 한 마리가 드럼을 연주한다.', '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.']

Cluster  3
['한 남자가 말을 탄다.', '두 남자가 수레를 숲 솦으로 밀었다.', '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.']

Cluster  4
['치타 한 마리가 먹이 뒤에서 달리고 있다.', '치타가 들판을 가로 질러 먹이를 쫓는다.']

Cluster  5
['그 여자가 아이를 돌본다.', '한 여자가 바이올린을 연주한다.']
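
To see what the k-means step does with the embeddings, here is a minimal sketch of the underlying assign-and-update loop on toy 2-dimensional points. It is a simplification for illustration, not what sklearn's KMeans does internally: the centroids simply start at the first k points (an assumption made so the result is deterministic), and the points and the function name are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm. Centroids start at the first k points,
    an assumption made here for determinism."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Hypothetical 2-dimensional "embeddings" forming two obvious groups.
toy_embeddings = [
    [0.0, 0.1], [5.0, 5.1], [0.2, 0.0],
    [5.2, 4.9], [0.1, 0.2], [4.9, 5.0],
]
num_clusters = 2
cluster_assignment = kmeans(toy_embeddings, num_clusters)

# Group sentence ids by cluster, as the script above does with sentences.
clustered_ids = [[] for _ in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_ids[cluster_id].append(sentence_id)

for i, cluster in enumerate(clustered_ids):
    print("Cluster ", i + 1, cluster)
```

Because k-means depends on its starting centroids, the cluster numbering (and sometimes the grouping) from KMeans can differ between runs unless a random_state is fixed.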

References

@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}
@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
Owner
Self-softmax