๐Ÿฆ… Pretrained BigBird Model for Korean (up to 4096 tokens)

Overview

Pretrained BigBird Model for Korean

What is BigBird โ€ข How to Use โ€ข Pretraining โ€ข Evaluation Result โ€ข Docs โ€ข Citation

ํ•œ๊ตญ์–ด | English

Apache 2.0 Issues linter DOI

What is BigBird?

BigBird: Transformers for Longer Sequences์—์„œ ์†Œ๊ฐœ๋œ sparse-attention ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ๋กœ, ์ผ๋ฐ˜์ ์ธ BERT๋ณด๋‹ค ๋” ๊ธด sequence๋ฅผ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿฆ… Longer Sequence - ์ตœ๋Œ€ 512๊ฐœ์˜ token์„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋Š” BERT์˜ 8๋ฐฐ์ธ ์ตœ๋Œ€ 4096๊ฐœ์˜ token์„ ๋‹ค๋ฃธ

โฑ๏ธ Computational Efficiency - Full attention์ด ์•„๋‹Œ Sparse Attention์„ ์ด์šฉํ•˜์—ฌ O(n2)์—์„œ O(n)์œผ๋กœ ๊ฐœ์„ 

How to Use

  • ๐Ÿค— Huggingface Hub์— ์—…๋กœ๋“œ๋œ ๋ชจ๋ธ์„ ๊ณง๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:)
  • ์ผ๋ถ€ ์ด์Šˆ๊ฐ€ ํ•ด๊ฒฐ๋œ transformers>=4.11.0 ์‚ฌ์šฉ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. (MRC ์ด์Šˆ ๊ด€๋ จ PR)
  • BigBirdTokenizer ๋Œ€์‹ ์— BertTokenizer ๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. (AutoTokenizer ์‚ฌ์šฉ์‹œ BertTokenizer๊ฐ€ ๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค.)
  • ์ž์„ธํ•œ ์‚ฌ์šฉ๋ฒ•์€ BigBird Tranformers documentation์„ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer

Pretraining

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Pretraining BigBird] ์ฐธ๊ณ 

Hardware Max len LR Batch Train Step Warmup Step
KoBigBird-BERT-Base TPU v3-8 4096 1e-4 32 2M 20k
  • ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜, ํ•œ๊ตญ์–ด ์œ„ํ‚ค, Common Crawl, ๋‰ด์Šค ๋ฐ์ดํ„ฐ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต
  • ITC (Internal Transformer Construction) ๋ชจ๋ธ๋กœ ํ•™์Šต (ITC vs ETC)

Evaluation Result

1. Short Sequence (<=512)

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetune on Short Sequence Dataset] ์ฐธ๊ณ 

NSMC
(acc)
KLUE-NLI
(acc)
KLUE-STS
(pearsonr)
Korquad 1.0
(em/f1)
KLUE MRC
(em/rouge-w)
KoELECTRA-Base-v3 91.13 86.87 93.14 85.66 / 93.94 59.54 / 65.64
KLUE-RoBERTa-Base 91.16 86.30 92.91 85.35 / 94.53 69.56 / 74.64
KoBigBird-BERT-Base 91.18 87.17 92.61 87.08 / 94.71 70.33 / 75.34

2. Long Sequence (>=1024)

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetune on Long Sequence Dataset] ์ฐธ๊ณ 

TyDi QA
(em/f1)
Korquad 2.1
(em/f1)
Fake News
(f1)
Modu Sentiment
(f1-macro)
KLUE-RoBERTa-Base 76.80 / 78.58 55.44 / 73.02 95.20 42.61
KoBigBird-BERT-Base 79.13 / 81.30 67.77 / 82.03 98.85 45.42

Docs

Citation

KoBigBird๋ฅผ ์‚ฌ์šฉํ•˜์‹ ๋‹ค๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ์ธ์šฉํ•ด์ฃผ์„ธ์š”.

@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird๋Š” Tensorflow Research Cloud (TFRC) ํ”„๋กœ๊ทธ๋žจ์˜ Cloud TPU ์ง€์›์œผ๋กœ ์ œ์ž‘๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ ๋ฉ‹์ง„ ๋กœ๊ณ ๋ฅผ ์ œ๊ณตํ•ด์ฃผ์‹  Seyun Ahn๋‹˜๊ป˜ ๊ฐ์‚ฌ๋ฅผ ์ „ํ•ฉ๋‹ˆ๋‹ค.

You might also like...
KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

Generating Korean Slogans with phonetic and structural repetition
Generating Korean Slogans with phonetic and structural repetition

LexPOS_ko Generating Korean Slogans with phonetic and structural repetition Generating Slogans with Linguistic Features LexPOS is a sequence-to-sequen

Korean extractive summarization. 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ
Korean extractive summarization. 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ

korean extractive summarization 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ Leaderboard Notice Text Summarization with Pretrained Encoders์— ๋‚˜์˜ค๋Š” bertsumext๋ชจ๋ธ(ext

Training code for Korean multi-class sentiment analysis

KoSentimentAnalysis Bert implementation for the Korean multi-class sentiment analysis ์™œ ํ•œ๊ตญ์–ด ๊ฐ์ • ๋‹ค์ค‘๋ถ„๋ฅ˜ ๋ชจ๋ธ์€ ๊ฑฐ์˜ ์—†๋Š” ๊ฒƒ์ผ๊นŒ?์—์„œ ์‹œ์ž‘๋œ ํ”„๋กœ์ ํŠธ Environment: Pytorch, Da

Korean Sentence Embedding Repository

Korean-Sentence-Embedding ๐Ÿญ Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in a matter of minutes. Based on our experiments with a wide range of benchmarks, ProteinBERT usually achieves state-of-the-art performance. ProteinBERT is built on TenforFlow/Keras.

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet ๐Ÿฆ ๐Ÿ‡ฎ๐Ÿ‡ฉ 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).
BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

Crie tokens de autenticaรงรฃo รญntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) รฉ uma bilioteca criada para ser utilizada na geraรงรฃo de tokens seguros e รญntegros, ou seja, nรฃ

Comments
  • Pretraining Epoch ์งˆ๋ฌธ

    Pretraining Epoch ์งˆ๋ฌธ

    Checklist

    • [x] I've searched the project's issues

    โ“ Question

    ์•ˆ๋…•ํ•˜์„ธ์š” ์ €๋Š” ํ˜„์žฌ ์นœ๊ตฌ๋“ค๊ณผ ํ•จ๊ป˜ 4096 ํ† ํฐ์„ ์ž…๋ ฅ๋ฐ›์•„ ์š”์•ฝ ํƒœ์Šคํฌ๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ฒ˜์Œ์—” ๋น…๋ฒ„๋“œ + ๋ฒ„ํŠธ ์กฐํ•ฉ์œผ๋กœ ํ•ด๋ณด๋ ค๊ณ  ํ–ˆ๋Š”๋ฐ, ์ด๋ฏธ monologg ๋‹˜๊ป˜์„œ ๋งŒ๋“ค์–ด์ฃผ์…จ๋”๋ผ๊ตฌ์š” ใ…Žใ…Ž ๊ทธ๋ž˜์„œ ๋กฑํฌ๋จธ + ๋ฐ”ํŠธ + ํŽ˜๊ฐ€์ˆ˜์Šค ์กฐํ•ฉ์œผ๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ ค ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. pretrained๋œ KoBart๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์–ดํ…์…˜์„ ๋กฑํฌ๋จธ๋กœ ๋ฐ”๊พผ ํ›„, ํŽ˜๊ฐ€์ˆ˜์Šค task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ตฌ์กฐ๋กœ ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

    ํ˜„์žฌ 13GB์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ์•„์„œ ์ „์ฒ˜๋ฆฌ์™€ ๋ฐ์ดํ„ฐ๋กœ๋” ์ž‘์„ฑ, ๋ชจ๋ธ ์ฝ”๋“œ๊นŒ์ง€๋Š” ์™„๋ฃŒํ•œ ์ƒํƒœ์ž…๋‹ˆ๋‹ค. ์ด๋ฒˆ ์ฃผ ๋‚ด๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ ค ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

    ์ €ํฌ๊ฐ€ ๊ฐ€์ง„ GPU๋กœ๋Š” ๋Œ€๋žต ์ดํ‹€์ด๋ฉด 1 ์—ํฌํฌ๋ฅผ ๋Œ ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์€๋ฐ, monologg๋‹˜๊ป˜์„œ๋Š” KoBirBird ๋ชจ๋ธ ๊ฐœ๋ฐœ ์‹œ ์—ํฌํฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๋„์…จ๋Š”์ง€ ์—ฌ์ญค๋ณด๊ณ  ์‹ถ์Šต๋‹ˆ๋‹ค.

    ์•„๋ฌด๋ž˜๋„ pretrained ๋œ ๋ชจ๋ธ์„ ๊ฐ€์ ธ๋‹ค ์“ฐ๋‹ค๋ณด๋‹ˆ ์—ํฌํฌ๋ฅผ ๋งŽ์ด ๋Œ ํ•„์š”๋Š” ์—†์„ ๊ฒƒ ๊ฐ™์€๋ฐ, ๊ธฐ์ค€์ ์œผ๋กœ ์‚ผ๊ณ  ์‹ถ์–ด์„œ์š”!

    ๋ง์ด ๊ธธ์–ด์กŒ๋Š”๋ฐ ์š”์•ฝํ•˜์ž๋ฉด, KoBirBird ํ•™์Šต ์‹œ ์—ํฌํฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ฃผ์…จ๋Š”์ง€ ๊ถ๊ธˆํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ๊ทธ ๊ธฐ์ค€์€ ๋ฌด์—‡์œผ๋กœ ์‚ผ์œผ์…จ๋Š”์ง€๋„ ๊ถ๊ธˆํ•ฉ๋‹ˆ๋‹ค.

    question 
    opened by KimJaehee0725 2
  • Specific information about this model.

    Specific information about this model.

    Checklist

    • [ x ] I've searched the project's issues

    โ“ Question

    • You mentioned "๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜, ํ•œ๊ตญ์–ด ์œ„ํ‚ค, Common Crawl, ๋‰ด์Šค ๋ฐ์ดํ„ฐ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต" and I want to know the size of total corpus for pre-training.

    • Also I want to know the vocab size of this model.

    ๐Ÿ“Ž Additional context

    question 
    opened by midannii 2
  • Fix some minors

    Fix some minors

    Description

    ์ฝ”๋“œ์™€ ์ฃผ์„ ๋“ฑ์„ ์ฝ๋‹ค๊ฐ€ ๋ณด์ธ ์ž‘์€ ์˜คํƒ€ ๋“ฑ์„ ์ˆ˜์ •ํ–ˆ์Šต๋‹ˆ๋‹ค

    ๋‹ค์–‘ํ•œ ๋…ธํ•˜์šฐ๋ฅผ ์•„๋‚Œ์—†์ด ๊ณต์œ ํ•ด์ฃผ์‹  @monologg , @donggyukimc ์—๊ฒŒ ๊ฐ์‚ฌ์˜ ๋ง์”€๋“œ๋ฆฝ๋‹ˆ๋‹ค.

    ์ดํ›„์—๋Š” GPU ํ™˜๊ฒฝ์—์„œ finetuning์„ ํ…Œ์ŠคํŠธํ•ด ๋ณผ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค ๊ณ ๋ง™์Šต๋‹ˆ๋‹ค.

    Related Issue

    chore 
    opened by sackoh 0
Releases(v1.0.0)
Labelling platform for text using distant supervision

With DataQA, you can label unstructured text documents using rule-based distant supervision.

245 Aug 05, 2022
A script that automatically creates a branch name using google translation api and jira api

About google translation api์™€ jira api์„ ์‚ฌ์šฉํ•˜์—ฌ ์ž๋™์œผ๋กœ ๋ธŒ๋žœ์น˜ ์ด๋ฆ„์„ ๋งŒ๋“ค์–ด์ฃผ๋Š” ์Šคํฌ๋ฆฝํŠธ Setup ํ™˜๊ฒฝ๋ณ€์ˆ˜์— ๋‹ค์Œ 3๊ฐ€์ง€๋ฅผ ๋“ฑ๋กํ•ด์•ผ ํ•œ๋‹ค. JIRA_USER : JIRA email (ex: hyunwook.kim 2 Dec 20, 2021

HAN2HAN : Hangul Font Generation

HAN2HAN : Hangul Font Generation

Changwoo Lee 36 Dec 28, 2022
The training code for the 4th place model at MDX 2021 leaderboard A.

The training code for the 4th place model at MDX 2021 leaderboard A.

Chin-Yun Yu 32 Dec 18, 2022
code for modular summarization work published in ACL2021 by Krishna et al

This repository contains the code for running modular summarization pipelines as described in the publication Krishna K, Khosla K, Bigham J, Lipton ZC

Approximately Correct Machine Intelligence (ACMI) Lab 21 Nov 24, 2022
The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Neural Machine Translation communication system The model is basically direct to convert one source language to another targeted language using encode

Nishant Banjade 7 Sep 22, 2022
In this Notebook I've build some machine-learning and deep-learning to classify corona virus tweets, in both multi class classification and binary classification.

Hello, This Notebook Contains Example of Corona Virus Tweets Multi Class Classification. - Classes is: Extremely Positive, Positive, Extremely Negativ

Khaled Tofailieh 3 Dec 06, 2022
Official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

This repository is the official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

vanint 101 Dec 30, 2022
IEEEXtreme15.0 Questions And Answers

IEEEXtreme15.0 Questions And Answers IEEEXtreme is a global challenge in which teams of IEEE Student members โ€“ advised and proctored by an IEEE member

Dilan Perera 15 Oct 24, 2022
Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration This is the official repository for the EMNLP 2021 long pa

70 Dec 11, 2022
CPC-big and k-means clustering for zero-resource speech processing

The CPC-big model and k-means checkpoints used in Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing.

Benjamin van Niekerk 5 Nov 23, 2022
Finding Label and Model Errors in Perception Data With Learned Observation Assertions

Finding Label and Model Errors in Perception Data With Learned Observation Assertions This is the project page for Finding Label and Model Errors in P

Stanford Future Data Systems 17 Oct 14, 2022
Contains the code and data for our #ICSE2022 paper titled as "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences"

CodeFill This repository contains the code for our paper titled as "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Namin

Software Analytics Lab 11 Oct 31, 2022
Awesome Treasure of Transformers Models Collection

๐Ÿ’ Awesome Treasure of Transformers Models for Natural Language processing contains papers, videos, blogs, official repo along with colab Notebooks. ๐Ÿ›ซโ˜‘๏ธ

Ashish Patel 577 Jan 07, 2023
Semantic search for quotes.

squote A semantic search engine that takes some input text and returns some (questionably) relevant (questionably) famous quotes. Built with: bert-as-

cjwallace 11 Jun 25, 2022
GSoC'2021 | TensorFlow implementation of Wav2Vec2

GSoC'2021 | TensorFlow implementation of Wav2Vec2

Vasudev Gupta 73 Nov 28, 2022
Pytorch implementation of Tacotron

Tacotron-pytorch A pytorch implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model. Requirements Install python 3 Install pytorc

soobin seo 203 Dec 02, 2022
nlpcommon is a python Open Source Toolkit for text classification.

nlpcommon nlpcommon, Python Text Tool. Guide Feature Install Usage Dataset Contact Cite Reference Feature nlpcommon is a python Open Source

xuming 3 May 29, 2022
Based on 125GB of data leaked from Twitch, you can see their monthly revenues from 2019-2021

Twitch Revenues Bu script'i kullanarak istediฤŸiniz yayฤฑncฤฑlarฤฑn, Twitch'den sฤฑzdฤฑrฤฑlan 125 GB'lik veriye dayanarak, 2019-2021 arasฤฑ aylฤฑk gelirlerini

4 Nov 11, 2021
Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS)

Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS) Yoonhyung Lee, Joongbo Shin, Kyomin Jung Abstract: Although early

LEE YOON HYUNG 147 Dec 05, 2022