Code for producing Japanese GPT-2 provided by rinna Co., Ltd.

Last update: Jan 07, 2023

Overview

japanese-gpt2

This repository provides the code for training Japanese GPT-2 models. The code was used to produce japanese-gpt2-medium, which rinna released on the Hugging Face model hub.

Please open an issue (in English or 日本語) if you encounter any problem using the code or using our models via Hugging Face.

Train a Japanese GPT-2 from scratch on your own machine

Download the Japanese CC-100 training corpus and extract the ja.txt file.
Either move ja.txt or edit src/corpus/jp_cc100/config.py, so that the location of ja.txt matches self.raw_data_dir in the config file.
Split ja.txt into smaller files by running:

cd src/
python -m corpus.jp_cc100.split_to_small_files

Train a medium-sized GPT-2 on 4 GPUs by running:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m task.pretrain.train --n_gpus 4 --save_model True --enable_log True
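For readers who want a feel for what the corpus-splitting step above amounts to, here is a minimal sketch. It is not the code in corpus.jp_cc100.split_to_small_files. The function name, the chunk size, and the numbered output file names are all assumptions made for illustration, and the real script may split on document boundaries rather than on a fixed line count.

```python
from pathlib import Path


def split_into_small_files(src_path, out_dir, lines_per_file=100_000):
    """Split a large text file into numbered chunks of at most lines_per_file lines.

    Hypothetical helper: the name, chunk size, and output naming are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    # Stream the input line by line so the whole corpus never sits in memory.
    with open(src_path, encoding="utf-8") as src:
        for line_no, line in enumerate(src):
            if line_no % lines_per_file == 0:
                # Start a new chunk file every lines_per_file lines.
                if out is not None:
                    out.close()
                path = out_dir / f"{len(written):06d}.txt"
                out = open(path, "w", encoding="utf-8")
                written.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return written
```

Streaming keeps memory use flat regardless of corpus size, which matters for a file as large as ja.txt.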

Interact with the trained model

Assume you have run the training script and saved your medium-sized GPT-2 to data/model/gpt2-medium-xxx.checkpoint. Run the following command from src/ to complete text on one GPU, sampling with top-p (nucleus) filtering at p=0.95 combined with top-k filtering at k=40:

CUDA_VISIBLE_DEVICES=0 python -m task.pretrain.interact --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --gen_type top --top_p 0.95 --top_k 40
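To illustrate what combining top-k and top-p filtering means, the sketch below works on a plain list of logits using only the standard library. It is an illustration of the general technique, not the code in task.pretrain.interact. The function names are hypothetical, and the order of operations (top-k first, then top-p over the renormalized survivors) is one common convention that the repository may or may not follow.

```python
import math
import random


def top_k_top_p_filter(logits, top_k=40, top_p=0.95):
    """Return {token_id: probability} after top-k, then top-p, filtering.

    Hypothetical helper; expects a non-empty list of logits.
    """
    # Rank token ids by logit, highest first, and keep at most top_k of them.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Softmax weights over the survivors (subtract the max for stability).
    max_logit = logits[ranked[0]]
    weights = [math.exp(logits[i] - max_logit) for i in ranked]
    total = sum(weights)

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, weight in zip(ranked, weights):
        kept.append((token_id, weight))
        cumulative += weight / total
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(weight for _, weight in kept)
    return {token_id: weight / kept_total for token_id, weight in kept}


def sample_token(logits, top_k=40, top_p=0.95, rng=random):
    """Draw one token id from the filtered distribution."""
    dist = top_k_top_p_filter(logits, top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

With k=40 and p=0.95, at most 40 candidate tokens survive, and low-probability tokens in the tail of those 40 are dropped once the cumulative mass reaches 0.95.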

Prepare files for uploading to Hugging Face

Create a Hugging Face account, create a model repo, and clone it to your local machine.
Create model and config files from a checkpoint by running:

python -m task.pretrain.checkpoint2huggingface --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint --save_dir {huggingface's model repo directory}

Validate the created files by running:

python -m task.pretrain.check_huggingface --model_dir {huggingface's model repo directory}

Add files, commit, and push to your Hugging Face repo.
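Before committing, a quick sanity check that the model repo directory contains the files you expect can save a failed upload. The sketch below is not what task.pretrain.check_huggingface does. The helper name and the list of expected file names are assumptions, so adjust them to whatever the conversion step actually writes.

```python
from pathlib import Path

# Assumed file names for a PyTorch model repo. Adjust to match what
# checkpoint2huggingface actually produces (for example model.safetensors).
EXPECTED_FILES = ("config.json", "pytorch_model.bin")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in model_dir."""
    model_dir = Path(model_dir)
    missing = []
    for name in expected:
        path = model_dir / name
        # A zero-byte file usually means the write was interrupted.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty return value means every expected file is present and non-empty. It says nothing about whether the weights load correctly, which is what the validation command above is for.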

Customize your training script

Check available arguments by running:

python -m task.pretrain.train --help
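The training command above passes booleans as explicit strings (--save_model True). As a rough sketch of how such arguments can be declared with argparse, consider the following. The flag names come from the commands in this README, but the types, defaults, and the str2bool helper are assumptions, and the actual definitions in task.pretrain.train may differ.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False'-style strings, as in `--save_model True`."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Hypothetical parser: defaults here are illustrative, not the repo's.
    parser = argparse.ArgumentParser(description="Hypothetical training options")
    parser.add_argument("--n_gpus", type=int, default=1)
    parser.add_argument("--save_model", type=str2bool, default=False)
    parser.add_argument("--enable_log", type=str2bool, default=False)
    return parser


args = build_parser().parse_args(
    ["--n_gpus", "4", "--save_model", "True", "--enable_log", "True"]
)
print(args.n_gpus, args.save_model, args.enable_log)  # prints: 4 True True
```

A dedicated converter is used because type=bool would treat any non-empty string, including "False", as True.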

License

The MIT license
