Korean Sentence Embedding Repository

Overview


๐Ÿญ Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides environments where individuals can train models.

Baseline Models

Baseline models used for Korean sentence embedding: KLUE PLMs.

Model               Embedding size   Hidden size   # Layers   # Heads
KLUE-BERT-base      768              768           12         12
KLUE-RoBERTa-base   768              768           12         12

NOTE: All pretrained models are available on the Hugging Face Model Hub; see https://huggingface.co/klue.
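
For instance, the baselines can be loaded directly with the transformers library (a minimal sketch; klue/bert-base is the checkpoint name on the Hub page above):

import torch
from transformers import AutoModel, AutoTokenizer

# Load the KLUE-BERT-base baseline from the Hugging Face Model Hub
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModel.from_pretrained("klue/bert-base")

# Encode one Korean sentence; the hidden size matches the table above (768)
inputs = tokenizer("한 남자가 음식을 먹는다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])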

How to start

  • Get the datasets for training and evaluation:
bash get_model_dataset.sh
  • To run inference right away, download the pre-trained checkpoints and then start one of the downstream tasks:
bash get_model_checkpoint.sh
cd KoSBERT/
python SemanticSearch.py

Available Models

  1. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks [SBERT]-[EMNLP 2019]
  2. SimCSE: Simple Contrastive Learning of Sentence Embeddings [SimCSE]-[EMNLP 2021]

KoSentenceBERT

  • 🤗 Model Training
  • Dataset
    • Train: snli_1.0_train.ko.tsv (first phase: NLI training), then sts-train.tsv (second phase: continued training on STS); a training sketch follows this list
    • Valid: sts-dev.tsv
    • Test: sts-test.tsv
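
A minimal sketch of this two-phase recipe with the sentence-transformers library. This is not the repository's exact training script: read_tsv is a hypothetical loader that assumes simple tab-separated rows, and the batch sizes and epoch counts are illustrative.

import csv
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

def read_tsv(path):
    # Hypothetical loader: assumes plain tab-separated rows; adapt to the real file layout
    with open(path, encoding="utf-8") as f:
        yield from csv.reader(f, delimiter="\t")

# SBERT encoder: KLUE-BERT-base with mean pooling on top
word_embedding = models.Transformer("klue/bert-base")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Phase 1: classification objective on NLI pairs
label2id = {"entailment": 0, "neutral": 1, "contradiction": 2}
nli_examples = [InputExample(texts=[premise, hypothesis], label=label2id[gold])
                for premise, hypothesis, gold in read_tsv("snli_1.0_train.ko.tsv")]
nli_loss = losses.SoftmaxLoss(model,
                              sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                              num_labels=3)
model.fit(train_objectives=[(DataLoader(nli_examples, shuffle=True, batch_size=16), nli_loss)],
          epochs=1, warmup_steps=100)

# Phase 2: regression objective on STS pairs, gold scores rescaled from [0, 5] to [0, 1]
sts_examples = [InputExample(texts=[sent_a, sent_b], label=float(score) / 5.0)
                for sent_a, sent_b, score in read_tsv("sts-train.tsv")]
model.fit(train_objectives=[(DataLoader(sts_examples, shuffle=True, batch_size=16),
                             losses.CosineSimilarityLoss(model))],
          epochs=4, warmup_steps=100)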

KoSimCSE

  • 🤗 Model Training
  • Dataset
    • Train: snli_1.0_train.ko.tsv + multinli.train.ko.tsv (contrastive training on NLI triples; a loss sketch follows this list)
    • Valid: sts-dev.tsv
    • Test: sts-test.tsv
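
Supervised SimCSE optimizes a contrastive objective over NLI triples, pulling each premise toward its entailment and pushing it away from its contradiction and from in-batch negatives. A simplified PyTorch sketch of that loss (the function name and temperature value are illustrative; see the SimCSE paper for the exact formulation):

import torch
import torch.nn.functional as F

def supervised_simcse_loss(anchor, positive, negative, temperature=0.05):
    # anchor/positive/negative: (batch, dim) embeddings of the premise,
    # its entailment (positive), and its contradiction (hard negative)
    anchor = F.normalize(anchor, dim=-1)
    candidates = F.normalize(torch.cat([positive, negative], dim=0), dim=-1)
    sim = anchor @ candidates.T / temperature  # cosine similarities, (batch, 2 * batch)
    # The matching entailment for anchor i sits at column i; every other column is a negative
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(sim, labels)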

Performance

  • Semantic Textual Similarity test set results
Model                   Cos-P   Cos-S   Euc-P   Euc-S   Man-P   Man-S   Dot-P   Dot-S
KoSBERT†-SKT            78.81   78.47   77.68   77.78   77.71   77.83   75.75   75.22
KoSBERT-base            82.13   82.25   80.67   80.75   80.69   80.78   77.96   77.90
KoSRoBERTa-base         80.70   81.03   80.97   81.06   80.84   80.97   79.20   78.93
KoSimCSE-BERT†-SKT      82.12   82.56   81.84   81.63   81.99   81.74   79.55   79.19
KoSimCSE-BERT-base      82.73   83.51   82.32   82.78   82.43   82.88   77.86   76.70
KoSimCSE-RoBERTa-base   83.64   84.05   83.32   83.84   83.33   83.79   80.92   79.84

(P = Pearson, S = Spearman; Cos/Euc/Man/Dot = cosine, Euclidean, Manhattan, and dot-product similarity)
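
Each column is the Pearson or Spearman correlation between the gold STS labels and one similarity measure computed on the sentence-pair embeddings. A minimal sketch of how the two cosine columns are computed (variable names are illustrative):

import numpy as np
from scipy.stats import pearsonr, spearmanr

def cosine_sts_correlations(emb_a, emb_b, gold):
    # emb_a, emb_b: (n, dim) embeddings of each test pair; gold: (n,) gold scores
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = (emb_a * emb_b).sum(axis=1)
    return pearsonr(cos, gold)[0], spearmanr(cos, gold)[0]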

Downstream Tasks

  • KoSBERT: Semantic Search, Clustering
python SemanticSearch.py
python Clustering.py
  • KoSimCSE: Semantic Search
python SemanticSearch.py

Semantic Search (KoSBERT)

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus of Korean example sentences
# (e.g., 'A man is eating food.', 'A cheetah is running behind its prey.')
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.']

corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['한 남자가 파스타를 먹는다.',
           '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
           '치타가 들판을 가로 질러 먹이를 쫓는다.']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    # Use np.argpartition to only partially sort the top_k results
    top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx in top_results[0:top_k]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
  • Results are as follows:

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6141)
한 남자가 빵 한 조각을 먹는다. (Score: 0.5952)
한 남자가 말을 탄다. (Score: 0.1231)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0752)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.0486)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6656)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.2988)
한 여자가 바이올린을 연주한다. (Score: 0.1566)
한 남자가 말을 탄다. (Score: 0.1112)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0262)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7570)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.3658)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.3583)
한 남자가 말을 탄다. (Score: 0.0505)
그 여자가 아이를 돌본다. (Score: -0.0087)

Clustering (KoSBERT)

from sentence_transformers import SentenceTransformer

model_path = '../Checkpoint/KoSBERT/kosbert-klue-bert-base'

embedder = SentenceTransformer(model_path)

# Corpus with example sentences
corpus = ['한 남자가 음식을 먹는다.',
          '한 남자가 빵 한 조각을 먹는다.',
          '그 여자가 아이를 돌본다.',
          '한 남자가 말을 탄다.',
          '한 여자가 바이올린을 연주한다.',
          '두 남자가 수레를 숲 속으로 밀었다.',
          '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
          '원숭이 한 마리가 드럼을 연주한다.',
          '치타 한 마리가 먹이 뒤에서 달리고 있다.',
          '한 남자가 파스타를 먹는다.',
          '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
          '치타가 들판을 가로 질러 먹이를 쫓는다.']

corpus_embeddings = embedder.encode(corpus)

# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")
  • Results are as follows:
Cluster  1
['한 남자가 음식을 먹는다.', '한 남자가 빵 한 조각을 먹는다.', '한 남자가 파스타를 먹는다.']

Cluster  2
['원숭이 한 마리가 드럼을 연주한다.', '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.']

Cluster  3
['한 남자가 말을 탄다.', '두 남자가 수레를 숲 속으로 밀었다.', '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.']

Cluster  4
['치타 한 마리가 먹이 뒤에서 달리고 있다.', '치타가 들판을 가로 질러 먹이를 쫓는다.']

Cluster  5
['그 여자가 아이를 돌본다.', '한 여자가 바이올린을 연주한다.']
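
Note that KMeans clusters by Euclidean distance while the semantic-search example ranks by cosine similarity; L2-normalizing the embeddings first makes the two notions agree (a small sketch with scikit-learn, applied to the script above):

from sklearn.preprocessing import normalize

# Unit-length rows turn Euclidean k-means into (approximately) spherical k-means
corpus_embeddings = normalize(embedder.encode(corpus))
clustering_model = KMeans(n_clusters=num_clusters).fit(corpus_embeddings)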

References

@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@inproceedings{gao2021simcse,
    title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
    author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
    booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
    year={2021}
}
@article{ham2020kornli,
    title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
    author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
    journal={arXiv preprint arXiv:2004.03289},
    year={2020}
}
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}