🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)


What is BigBird?

A sparse-attention-based model introduced in BigBird: Transformers for Longer Sequences, capable of handling much longer sequences than standard BERT.

🦅 Longer Sequence - handles up to 4096 tokens, 8x the 512-token maximum of standard BERT

⏱️ Computational Efficiency - uses sparse attention instead of full attention, improving complexity from O(n²) to O(n)
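
In 🤗 Transformers, this sparse attention pattern is exposed through BigBirdConfig. A minimal sketch follows; the values below are illustrative assumptions, not KoBigBird's actual configuration, which ships with the Hub checkpoint:

from transformers import BigBirdConfig, BigBirdModel

# Illustrative values only; KoBigBird's real config comes with the checkpoint.
config = BigBirdConfig(
    attention_type="block_sparse",  # sparse attention instead of full O(n^2) attention
    block_size=64,                  # token block size used by the sparse pattern
    num_random_blocks=3,            # random blocks each query block attends to
    max_position_embeddings=4096,   # maximum sequence length
)
model = BigBirdModel(config)  # randomly initialized here, purely for illustration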

How to Use

  • You can use the model uploaded to the 🤗 Huggingface Hub right away :)
  • We recommend transformers>=4.11.0, which fixes some known issues. (PR for the MRC issue)
  • Use BertTokenizer instead of BigBirdTokenizer. (AutoTokenizer loads BertTokenizer automatically.)
  • See the BigBird Transformers documentation for detailed usage.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer
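
As a quick sanity check, a long document can be encoded with the model and tokenizer loaded above (a minimal sketch; the sample text is a placeholder):

import torch

text = "한국어 문서 " * 1000  # placeholder long input
inputs = tokenizer(text, max_length=4096, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)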

Pretraining

See [Pretraining BigBird] for details.

|                     | Hardware | Max len | LR   | Batch | Train Step | Warmup Step |
| ------------------- | -------- | ------- | ---- | ----- | ---------- | ----------- |
| KoBigBird-BERT-Base | TPU v3-8 | 4096    | 1e-4 | 32    | 2M         | 20k         |
  • Trained on a variety of data, including the Modu Corpus, Korean Wikipedia, Common Crawl, and news data
  • Trained as an ITC (Internal Transformer Construction) model (ITC vs ETC)
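
For reference, here is a hedged sketch of how the table's hyperparameters would map onto Hugging Face TrainingArguments; this is not the authors' actual TPU pretraining script, and the output path is hypothetical:

from transformers import TrainingArguments

# Hedged mapping of the hyperparameter table above; the real TPU setup may differ
# (e.g. whether the batch size of 32 is global or per device).
args = TrainingArguments(
    output_dir="kobigbird-pretraining",  # hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    max_steps=2_000_000,  # 2M train steps
    warmup_steps=20_000,  # 20k warmup steps
)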

Evaluation Result

1. Short Sequence (<=512)

See [Finetune on Short Sequence Dataset] for details.

|                     | NSMC (acc) | KLUE-NLI (acc) | KLUE-STS (pearsonr) | Korquad 1.0 (em/f1) | KLUE MRC (em/rouge-w) |
| ------------------- | ---------- | -------------- | ------------------- | ------------------- | --------------------- |
| KoELECTRA-Base-v3   | 91.13      | 86.87          | 93.14               | 85.66 / 93.94       | 59.54 / 65.64         |
| KLUE-RoBERTa-Base   | 91.16      | 86.30          | 92.91               | 85.35 / 94.53       | 69.56 / 74.64         |
| KoBigBird-BERT-Base | 91.18      | 87.17          | 92.61               | 87.08 / 94.71       | 70.33 / 75.34         |
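
A minimal sketch of a short-sequence setup such as NSMC sentiment classification; num_labels=2 and the sample sentence are assumptions for illustration, and the classification head is untrained until fine-tuned:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "monologg/kobigbird-bert-base", num_labels=2  # assumed binary sentiment labels
)

inputs = tokenizer("영화 정말 재밌었어요!", truncation=True, max_length=512, return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2); meaningful only after fine-tuning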

2. Long Sequence (>=1024)

See [Finetune on Long Sequence Dataset] for details.

|                     | TyDi QA (em/f1) | Korquad 2.1 (em/f1) | Fake News (f1) | Modu Sentiment (f1-macro) |
| ------------------- | --------------- | ------------------- | -------------- | ------------------------- |
| KLUE-RoBERTa-Base   | 76.80 / 78.58   | 55.44 / 73.02       | 95.20          | 42.61                     |
| KoBigBird-BERT-Base | 79.13 / 81.30   | 67.77 / 82.03       | 98.85          | 45.42                     |
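
A hedged sketch of long-sequence extractive QA in the style of Korquad 2.1; the question and context strings are placeholders, and the QA head is randomly initialized until fine-tuned, so the decoded span is only illustrative:

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")
model = AutoModelForQuestionAnswering.from_pretrained("monologg/kobigbird-bert-base")

question, context = "질문은 무엇인가요?", "아주 긴 문서 " * 500  # placeholders
inputs = tokenizer(
    question, context, max_length=4096, truncation="only_second", return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)

start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])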

Docs

Citation

If you use KoBigBird, please cite it as follows.

@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird was built with Cloud TPU support from the TensorFlow Research Cloud (TFRC) program.

We also thank Seyun Ahn for the wonderful logo.

Comments
  • Question about pretraining epochs

    Checklist

    • [x] I've searched the project's issues

    ❓ Question

    Hello! My friends and I are currently building a model that takes 4096-token inputs and performs a summarization task. At first we planned to try a BigBird + BERT combination, but monologg had already built one :) So we are now training a Longformer + BART + Pegasus combination: starting from the pretrained KoBART, we replace its attention with Longformer attention and train it on the Pegasus objective.

    We have collected 13 GB of data and finished the preprocessing, the data loader, and the model code; we plan to start training within this week.

    With our GPUs, one epoch should take roughly two days, so I would like to ask how many epochs you ran when developing the KoBigBird model.

    Since we start from a pretrained model, we probably don't need many epochs, but I would like a reference point!

    To summarize: I am curious how many epochs you trained KoBigBird for, and also what criterion you used to decide.

    question 
    opened by KimJaehee0725 2
  • Specific information about this model.

    Checklist

    • [x] I've searched the project's issues

    ❓ Question

    • You mentioned training on "a variety of data, including the Modu Corpus, Korean Wikipedia, Common Crawl, and news data," and I want to know the total size of the pretraining corpus.

    • I also want to know the vocab size of this model.

    📎 Additional context

    question 
    opened by midannii 2
  • Fix some minors

    Description

    I fixed some small typos and similar issues that I noticed while reading the code and comments.

    Many thanks to @monologg and @donggyukimc for generously sharing their know-how.

    Next, I plan to test finetuning in a GPU environment. Thank you.

    Related Issue

    chore 
    opened by sackoh 0
Releases (v1.0.0)