Korean Sentence Embedding Repository

Overview

Korean-Sentence-Embedding

🍭 Korean sentence embedding repository. You can download the pre-trained models and run inference right away, and the repository also provides an environment for training your own models.

Baseline Models

Baseline models used for Korean sentence embedding: KLUE-PLMs

Model             | Embedding size | Hidden size | # Layers | # Heads
KLUE-BERT-base    | 768            | 768         | 12       | 12
KLUE-RoBERTa-base | 768            | 768         | 12       | 12

NOTE: All the pretrained models are uploaded to the Hugging Face Model Hub. See https://huggingface.co/klue.

How to start

  • Get datasets to train or test.
bash get_model_dataset.sh
  • For quick inference, download the pre-trained checkpoints and then run one of the downstream task scripts.
bash get_model_checkpoint.sh
cd KoSBERT/
python SemanticSearch.py

Available Models

  1. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks [SBERT]-[EMNLP 2019]
  2. SimCSE: Simple Contrastive Learning of Sentence Embeddings [SimCSE]-[EMNLP 2021]

KoSentenceBERT

  • 🤗 Model Training
  • Dataset
    • Train: snli_1.0_train.ko.tsv (first phase: NLI training), sts-train.tsv (second phase: continued training on STS)
    • Valid: sts-dev.tsv
    • Test: sts-test.tsv

KoSimCSE

  • 🤗 Model Training
  • Dataset
    • Train: snli_1.0_train.ko.tsv + multinli.train.ko.tsv
    • Valid: sts-dev.tsv
    • Test: sts-test.tsv
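
KoSimCSE is based on SimCSE, which trains the encoder with a contrastive objective: each sentence embedding is pulled toward its paired sentence and pushed away from the other sentences in the batch. The sketch below illustrates that loss in plain Python on made-up 2-d vectors. It is not this repository's training code; the function names and toy data are hypothetical, and the temperature of 0.05 is taken from the SimCSE paper as an assumption about a reasonable default.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def simcse_loss(anchors, positives, temperature=0.05):
    """Mean contrastive (InfoNCE) loss with in-batch negatives.

    anchors[i] and positives[i] form a positive pair; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log of the softmax probability assigned to the true pair.
        total += log_denominator - logits[i]
    return total / len(anchors)

# Hypothetical 2-d "embeddings" standing in for encoder outputs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

print("aligned pairs:  %.4f" % simcse_loss(anchors, positives))
print("shuffled pairs: %.4f" % simcse_loss(anchors, positives[::-1]))
```

The loss is close to zero when each anchor is most similar to its own positive and grows when the pairing is shuffled, which is the signal the encoder is trained on.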

Performance

  • Semantic Textual Similarity test set results
Model                | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman
KoSBERT†SKT          | 78.81          | 78.47           | 77.68             | 77.78              | 77.71             | 77.83              | 75.75       | 75.22
KoSBERTbase          | 82.13          | 82.25           | 80.67             | 80.75              | 80.69             | 80.78              | 77.96       | 77.90
KoSRoBERTabase       | 80.70          | 81.03           | 80.97             | 81.06              | 80.84             | 80.97              | 79.20       | 78.93
KoSimCSE-BERT†SKT    | 82.12          | 82.56           | 81.84             | 81.63              | 81.99             | 81.74              | 79.55       | 79.19
KoSimCSE-BERTbase    | 82.73          | 83.51           | 82.32             | 82.78              | 82.43             | 82.88              | 77.86       | 76.70
KoSimCSE-RoBERTabase | 83.64          | 84.05           | 83.32             | 83.84              | 83.33             | 83.79              | 80.92       | 79.84
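
Each column above pairs a similarity measure between the two sentence embeddings (cosine, Euclidean, Manhattan, dot product) with a correlation coefficient (Pearson or Spearman) against the gold STS scores, presumably scaled by 100. As a rough illustration of how one such number is obtained, the sketch below computes the cosine Pearson and cosine Spearman figures on made-up vectors and labels. The data and helper names are hypothetical, and this is not the repository's evaluation script.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy sentence-pair embeddings (hypothetical 3-d vectors) and gold scores on a 0-5 scale.
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.1, 0.0]),
    ([1.0, 0.0, 0.0], [0.5, 0.5, 0.0]),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    ([1.0, 0.0, 0.0], [-1.0, 0.2, 0.0]),
]
gold = [4.8, 3.0, 1.2, 0.1]

predicted = [cosine(u, v) for u, v in pairs]
print("Cosine Pearson:  %.2f" % (100 * pearson(predicted, gold)))
print("Cosine Spearman: %.2f" % (100 * spearman(predicted, gold)))
```

Spearman only looks at the ordering of the pairs, so it reaches its maximum as soon as the predicted similarities rank the pairs in the same order as the gold scores; Pearson additionally rewards a linear relationship.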

Downstream Tasks

  • KoSBERT: Semantic Search, Clustering
python SemanticSearch.py
python Clustering.py
  • KoSimCSE: Semantic Search
python SemanticSearch.py

Semantic Search (KoSBERT)

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',                        # A man is eating food.
          '한 남자가 빵 한 조각을 먹는다.',                  # A man is eating a piece of bread.
          '그 여자가 아이를 돌본다.',                        # The woman is taking care of a child.
          '한 남자가 말을 탄다.',                            # A man is riding a horse.
          '한 여자가 바이올린을 연주한다.',                  # A woman is playing the violin.
          '두 남자가 수레를 숲 속으로 밀었다.',              # Two men pushed a cart into the woods.
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',  # A man is riding a white horse on ground enclosed by a wall.
          '원숭이 한 마리가 드럼을 연주한다.',               # A monkey is playing the drums.
          '치타 한 마리가 먹이 뒤에서 달리고 있다.']         # A cheetah is running behind its prey.

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['한 남자가 파스타를 먹는다.',                           # A man is eating pasta.
           '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',    # Someone in a gorilla costume is playing the drums.
           '치타가 들판을 가로 질러 먹이를 쫓는다.']               # A cheetah chases its prey across a field.

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    # np.argpartition with kth=range(top_k) puts the top_k highest scores,
    # in descending order, in the first top_k positions without a full sort
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop %d most similar sentences in corpus:" % top_k)

    for idx in top_results:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
  • Results are as follows:

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6141)
한 남자가 빵 한 조각을 먹는다. (Score: 0.5952)
한 남자가 말을 탄다. (Score: 0.1231)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0752)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.0486)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6656)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.2988)
한 여자가 바이올린을 연주한다. (Score: 0.1566)
한 남자가 말을 탄다. (Score: 0.1112)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0262)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7570)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.3658)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.3583)
한 남자가 말을 탄다. (Score: 0.0505)
그 여자가 아이를 돌본다. (Score: -0.0087)
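
The ranking step in the script above boils down to scoring every corpus embedding against the query by cosine similarity and keeping the k best. The following is a minimal, dependency-free sketch of that step, with small made-up vectors standing in for the model's 768-dimensional embeddings. The function name `top_k_search` and the toy data are hypothetical and not part of the repository.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_search(query_vec, corpus_vecs, k):
    """Return (index, score) pairs for the k most similar corpus vectors, best first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    # heapq.nlargest selects the k best entries without sorting the whole list.
    best = heapq.nlargest(k, scored)
    return [(idx, score) for score, idx in best]

# Hypothetical 2-d vectors standing in for sentence embeddings.
corpus_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.1]

for idx, score in top_k_search(query_vec, corpus_vecs, k=2):
    print("corpus[%d]" % idx, "(Score: %.4f)" % score)
```

Selecting only the top k avoids a full sort of the scores, which matters little for nine sentences but does for a large corpus.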

Clustering (KoSBERT)

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',                            # A man is eating food.
          '한 남자가 빵 한 조각을 먹는다.',                      # A man is eating a piece of bread.
          '그 여자가 아이를 돌본다.',                            # The woman is taking care of a child.
          '한 남자가 말을 탄다.',                                # A man is riding a horse.
          '한 여자가 바이올린을 연주한다.',                      # A woman is playing the violin.
          '두 남자가 수레를 숲 속으로 밀었다.',                  # Two men pushed a cart into the woods.
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',      # A man is riding a white horse on ground enclosed by a wall.
          '원숭이 한 마리가 드럼을 연주한다.',                   # A monkey is playing the drums.
          '치타 한 마리가 먹이 뒤에서 달리고 있다.',             # A cheetah is running behind its prey.
          '한 남자가 파스타를 먹는다.',                          # A man is eating pasta.
          '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',   # Someone in a gorilla costume is playing the drums.
          '치타가 들판을 가로 질러 먹이를 쫓는다.']              # A cheetah chases its prey across a field.

corpus_embeddings = embedder.encode(corpus)

# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")
  • Results are as follows:
Cluster  1
['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.', '한 남자가 파스타를 먹는다.']

Cluster  2
['원숭이 한 마리가 드럼을 연주한다.', '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.']

Cluster  3
['한 남자가 말을 탄다.', '두 남자가 수레를 숲 속으로 밀었다.', '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.']

Cluster  4
['치타 한 마리가 먹이 뒤에서 달리고 있다.', '치타가 들판을 가로 질러 먹이를 쫓는다.']

Cluster  5
['그 여자가 아이를 돌본다.', '한 여자가 바이올린을 연주한다.']
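
For readers unfamiliar with what k-means does with the embeddings, here is a small dependency-free sketch of the underlying procedure (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its members, and repeat. The 2-d points and the fixed starting centroids are made up for illustration, and the function is a simplified stand-in, not scikit-learn's implementation.

```python
import math

def kmeans(points, initial_centroids, iterations=10):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical 2-d "embeddings" forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[2]])

# Group the points by cluster, mirroring the clustered_sentences loop above.
clusters = [[] for _ in centroids]
for point, label in zip(points, labels):
    clusters[label].append(point)
for i, cluster in enumerate(clusters):
    print("Cluster ", i + 1)
    print(cluster)
```

scikit-learn's KMeans picks its starting centroids randomly, so the cluster numbering (and occasionally the grouping) can differ between runs; passing a fixed `random_state` to `KMeans` makes the output reproducible.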

References

@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}
@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
Owner
Self-softmax