KR-FinBert And KR-FinBert-SC

Overview

KR-FinBert & KR-FinBert-SC

Much progress has been made in the NLP (Natural Language Processing) field, with numerous studies showing that domain adaptation using small-scale corpus and fine-tuning with labeled data is effective for overall performance improvement. we proposed KR-FinBert for the financial domain by further pre-training it on a financial corpus and fine-tuning it for sentiment analysis. As many studies have shown, the performance improvement through adaptation and conducting the downstream task was also clear in this experiment.

KR-FinBert

Data

The training data for this model is expanded from those of KR-BERT-MEDIUM, texts from Korean Wikipedia, general news articles, legal texts crawled from the National Law Information Center and Korean Comments dataset. For the transfer learning, corporate related economic news articles from 72 media sources such as the Financial Times, The Korean Economy Daily, etc and analyst reports from 16 securities companies such as Kiwoom Securities, Samsung Securities, etc are added. Included in the dataset is 440,067 news titles with their content and 11,237 analyst reports. The total data size is about 13.22GB. For mlm training, we split the data line by line and the total no. of lines is 6,379,315. KR-FinBert is trained for 5.5M steps with the maxlen of 512, training batch size of 32, and learning rate of 5e-5, taking 67.48 hours to train the model using NVIDIA TITAN XP.

Models

Requirements

  • transformers v 4.0.0
  • Pytorch_lightning v1.3.0

MLM model

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-FinBert")
model = AutoModelForMaskedLM.from_pretrained("snunlp/KR-FinBert")
python3 run_mlm.py \
    --model_name_or_path snunlp/KR-Medium \
    --train_file newsdata_line.txt \
    --do_train \
    --output_dir ./test-mlm \
    --line_by_line True \
    --max_seq_length 512
TruthfulQA: Measuring How Models Imitate Human Falsehoods

TruthfulQA: Measuring How Models Imitate Human Falsehoods

69 Dec 25, 2022
AI-Broad-casting - AI Broad casting with python

Basic Code 1. Use The Code Configuration Environment conda create -n code_base p

Speach Recognitions

easy_meeting Добро пожаловать в интерфейс сервиса автопротоколирования совещаний Easy Meeting. Website - http://cf5c-62-192-251-83.ngrok.io/ Принципиа

Maksim 3 Feb 18, 2022
Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

Maksim Terpilowski 49 Dec 30, 2022
2021海华AI挑战赛·中文阅读理解·技术组·第三名

文字是人类用以记录和表达的最基本工具,也是信息传播的重要媒介。透过文字与符号,我们可以追寻人类文明的起源,可以传播知识与经验,读懂文字是认识与了解的第一步。对于人工智能而言,它的核心问题之一就是认知,而认知的核心则是语义理解。

21 Dec 26, 2022
Natural language Understanding Toolkit

Natural language Understanding Toolkit TOC Requirements Installation Documentation CLSCL NER References Requirements To install nut you need: Python 2

Peter Prettenhofer 119 Oct 08, 2022
Uncomplete archive of files from the European Nopsled Team

European Nopsled CTF Archive This is an archive of collected material from various Capture the Flag competitions that the European Nopsled team played

European Nopsled 4 Nov 24, 2021
Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Introduction This is a PyTorch implementation of the following research papers: (1) Hierarchical Text Generation and Planning for Strategic Dialogue (

Facebook Research 1.4k Dec 29, 2022
TextFlint is a multilingual robustness evaluation platform for natural language processing tasks,

TextFlint is a multilingual robustness evaluation platform for natural language processing tasks, which unifies general text transformation, task-specific transformation, adversarial attack, sub-popu

TextFlint 587 Dec 20, 2022
This is the code for the EMNLP 2021 paper AEDA: An Easier Data Augmentation Technique for Text Classification

The baseline code is for EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks

Akbar Karimi 81 Dec 09, 2022
A large-scale (194k), Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

MedMCQA MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering A large-scale, Multiple-Choice Question Answe

MedMCQA 24 Nov 30, 2022
Retraining OpenAI's GPT-2 on Discord Chats

Train OpenAI's GPT-2 on Discord Chats Retraining a Text Generation Model on Discord Chats using gpt-2-simple that wraps existing model fine-tuning and

Ayush Mishra 4 Oct 27, 2022
Comprehensive-E2E-TTS - PyTorch Implementation

A Non-Autoregressive End-to-End Text-to-Speech (text-to-wav), supporting a family of SOTA unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultima

Keon Lee 114 Nov 13, 2022
Text Classification Using LSTM

Text classification is the task of assigning a set of predefined categories to free text. Text classifiers can be used to organize, structure, and categorize pretty much anything. For example, new ar

KrishArul26 3 Jan 03, 2023
NLP, Machine learning

Netflix-recommendation-system NLP, Machine learning About Recommendation algorithms are at the core of the Netflix product. It provides their members

Harshith VH 6 Jan 12, 2022
CoSENT、STS、SentenceBERT

CoSENT_Pytorch 比Sentence-BERT更有效的句向量方案

102 Dec 07, 2022
Ceaser-Cipher - The Caesar Cipher technique is one of the earliest and simplest method of encryption technique

Ceaser-Cipher The Caesar Cipher technique is one of the earliest and simplest me

Lateefah Ajadi 2 May 12, 2022
End-to-End Speech Processing Toolkit

ESPnet: end-to-end speech processing toolkit system/pytorch ver. 1.0.1 1.1.0 1.2.0 1.3.1 1.4.0 1.5.1 1.6.0 1.7.1 1.8.1 ubuntu18/python3.8/pip ubuntu18

ESPnet 5.9k Jan 03, 2023
Perform sentiment analysis and keyword extraction on Craigslist listings

craiglist-helper synopsis Perform sentiment analysis and keyword extraction on Craigslist listings Background I love Craigslist. I've found most of my

Mark Musil 1 Nov 08, 2021
A Streamlit web app that generates Rick and Morty stories using GPT2.

Rick and Morty Story Generator This project uses a pre-trained GPT2 model, which was fine-tuned on Rick and Morty transcripts, to generate new stories

₸ornike 33 Oct 13, 2022