Transformer Based Korean Sentence Spacing Corrector

Overview

TKOrrector is a Transformer-based Korean sentence spacing corrector.

Architecture

License Summary

This solution is made available under the Apache 2.0 license. See the LICENSE file.

Minimum Requirements

It is recommended that you run the training on a machine with an NVIDIA GPU, with the GPU drivers and CUDA installed.

Prerequisites

  1. Clone this repo and cd into it.

  2. Install dependencies, preferably in a virtual environment.

    a. Optional: Create a new virtual environment. A Conda example is below.
    conda create --name TKOrrector python=3.9 -y
    conda activate TKOrrector

    b. Install PyTorch with CUDA
    conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

    or

    b. Install PyTorch without GPU
    conda install pytorch torchvision torchaudio cpuonly -c pytorch

    c. Install dependencies
    pip install -r requirements.txt
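
After installation, it can be useful to confirm which PyTorch backend the environment actually picked up. The snippet below is a minimal sketch and not part of this repo; the helper name `describe_torch_backend` is hypothetical. It only uses `torch.cuda.is_available()` and `torch.cuda.get_device_name()`, and it falls back gracefully if PyTorch is not installed.

```python
import importlib.util


def describe_torch_backend() -> str:
    """Return a short description of the PyTorch backend that is available."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"

    import torch  # imported lazily so the check also runs without PyTorch

    if torch.cuda.is_available():
        # Device 0 is the first visible GPU.
        return f"torch {torch.__version__} with CUDA ({torch.cuda.get_device_name(0)})"
    return f"torch {torch.__version__} on CPU only"


if __name__ == "__main__":
    print(describe_torch_backend())
```

If this reports "on CPU only" on a GPU machine, the CPU-only install from step b was probably used, or the driver and CUDA setup needs checking.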

Run

You can run the pretrained model without training.

Download the pretrained model and extract it into the current directory (tar zxvf TKOrrector.tar.gz).

sh demo.sh

Example demo run screen and results.
Example Demo Run

Train

Download the Corpus

  1. Go to NIKL Corpus Download Site and apply for a new license.

    The corpus is free of charge, but you need to sign an agreement. It is recommended that you upload the corpus file to an object storage service such as GCS, so that it can be downloaded quickly onto additional machines, for example a GCP GCE VM with a GPU that you start for training only when needed, which avoids a large upfront cost. Edit src/download_corpus.sh to download the corpus file and expand it into the designated directory.

    cd src
    sh download_corpus.sh

Run the data prep stage

Change lines 51 and 53 in prepare_corpus_with_tokenizer.sh to increase the training dataset size.  
The second argument is the number of files to include in the training set plus 1.  
`get_corpus "../data/$CORPUS1/*" 10`  
The command above includes 9 files from the Newspaper corpus (the manual PDF file is skipped).
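
The "plus 1" rule can be illustrated with a small Python sketch. This is not the code in prepare_corpus_with_tokenizer.sh: the function `select_corpus_files` and the file names are hypothetical. It assumes that the limit is applied to the directory listing first and that the manual PDF falls inside that listing and is then dropped.

```python
from pathlib import PurePath


def select_corpus_files(paths, limit):
    """Take the first `limit` directory entries, then drop PDF files.

    Assumption: this mirrors the "plus 1" rule described above; the real
    get_corpus shell function may be implemented differently.
    """
    head = sorted(paths)[:limit]
    return [p for p in head if PurePath(p).suffix.lower() != ".pdf"]


# Hypothetical directory listing: one manual PDF plus twelve corpus files.
listing = ["manual.pdf"] + [f"news_{i:02d}.json" for i in range(12)]
selected = select_corpus_files(listing, 10)
print(len(selected))  # 9
```

Under these assumptions, a limit of 10 yields 9 corpus files, which matches the behavior described for the Newspaper corpus.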
  1. Run the data prep command.

    sh prepare_corpus_with_tokenizer.sh

Run the training stage

  1. Run the training command.

    sh train.sh

Run the Evaluation

  1. After training is done, evaluate the model on the test dataset with batch translations by running the command below.

    sh calculate_metrics.sh
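
The metrics that calculate_metrics.sh reports are not described here. As a rough illustration of how a spacing corrector could be scored, the sketch below computes precision, recall, and F1 over space positions between a reference sentence and a predicted sentence. The function names and the example sentences are assumptions made for illustration, not the repo's actual evaluation code.

```python
def space_positions(sentence):
    """Return the character offsets (in the space-free text) that are followed by a space."""
    positions = set()
    index = 0
    for ch in sentence.strip():
        if ch == " ":
            positions.add(index)
        else:
            index += 1
    return positions


def spacing_scores(reference, prediction):
    """Compute (precision, recall, f1) over space positions."""
    if reference.replace(" ", "") != prediction.replace(" ", ""):
        raise ValueError("sentences differ in more than spacing")
    ref = space_positions(reference)
    pred = space_positions(prediction)
    true_pos = len(ref & pred)
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(ref) if ref else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The prediction misses the space before the last word.
precision, recall, f1 = spacing_scores("나는 학교에 간다", "나는 학교에간다")
print(precision, recall, round(f1, 3))  # 1.0 0.5 0.667
```

Scoring space positions rather than whole sentences gives partial credit when only some of the spaces in a sentence are restored correctly.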

Detailed Dataflow Diagram

Detailed Architecture

Owner
Paul Hyung Yuel Kim