This repository documents a toy attempt at creating a generative text model in Catalan, based on GPT-2.

Last update: Jan 28, 2022

GPT-2 in Catalan

This repository documents a toy attempt at creating a generative text model in Catalan, based on GPT-2. In other words, this is more of a prototype and a personal playground than a serious attempt to build a fully functional GPT-2 in Catalan.

Nevertheless, I hope this can also help someone else train their own GPT-2 model and provide some pointers on how to do so.

Suggestions and constructive criticism are always welcome!

1. GPT-2 📝
- 1.1. What is GPT-2 ❓
- 1.2. Why GPT-2 ❔
2. Training 🔨
Testing the model 🐱
3. Questions ❓ ❔
4. TO-DO 🚧

1. GPT-2 📝

1.1. What is GPT-2 ❓

GPT-2 (Generative Pre-trained Transformer 2) is a transformer-based language model trained on large volumes of data, without a specific task in mind. Nevertheless, it has probably been used mostly for generating new text.

A better and more detailed explanation can be found here (http://jalammar.github.io/illustrated-gpt2/).

1.2. Why GPT-2 ❔

GPT-2 undeniably played a large role and became very popular when it came out, although it also created some controversy. That aside, GPT-2 was a big step forward in text generation, and it is also "faster" to train on custom data than its next-generation sibling, GPT-3.

2. Training 🔨

2.1. Requirements 📎

You will need a powerful GPU; if you do not have one, reduce the batch size. You can also use a VM from a cloud service such as Google Colab or Microsoft Azure.
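Reducing the batch size lowers the memory needed per step, but it also means more steps per epoch. The sketch below shows only that arithmetic. The corpus size is a made-up number for illustration, and `steps_per_epoch` is a hypothetical helper, not part of this repository.

```python
import math


def steps_per_epoch(n_examples: int, batch_size: int) -> int:
    """Number of training steps needed to see every example once."""
    if n_examples < 0 or batch_size <= 0:
        raise ValueError("n_examples must be >= 0 and batch_size must be > 0")
    return math.ceil(n_examples / batch_size)


# Hypothetical corpus size, chosen only for illustration.
N_EXAMPLES = 10_000

for batch_size in (16, 8, 4):
    # Halving the batch size doubles the number of steps per epoch.
    print(f"batch_size={batch_size}: {steps_per_epoch(N_EXAMPLES, batch_size)} steps")
```

With these numbers, a batch size of 4 needs 2500 steps per epoch, which is worth keeping in mind when choosing values for `--eval_steps` and `--save_steps`.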

2.2. Training Script 📈

The training is implemented in the train_GPT2.py script, which serves as a skeleton. You can run it from the command line, passing all the arguments.

e.g.

cd src
./train_GPT2.py \
    --model DeepESP/gpt2-spanish \
    --tokenizer DeepESP/gpt2-spanish \
    --train_path ../data/catalan_corpus_train.csv \
    --test_path ../data/catalan_corpus_test.csv \
    --n_epochs 1 \
    --train_batch_size 4 \
    --eval_batch_size 8 \
    --eval_steps 100 \
    --save_steps 1000 \
    --warmup_steps 100 \
    --output gpt2-catalan
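For readers adapting the skeleton, here is a minimal sketch of how a script could declare the flags used in the command above with `argparse`. It is not the actual contents of train_GPT2.py: the flag names come from the example, while the types, defaults and help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags from the example command (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Fine-tune a GPT-2 checkpoint (sketch).")
    parser.add_argument("--model", required=True, help="Base model name or path")
    parser.add_argument("--tokenizer", required=True, help="Tokenizer name or path")
    parser.add_argument("--train_path", required=True, help="CSV file with training text")
    parser.add_argument("--test_path", required=True, help="CSV file with evaluation text")
    parser.add_argument("--n_epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=4)
    parser.add_argument("--eval_batch_size", type=int, default=8)
    parser.add_argument("--eval_steps", type=int, default=100)
    parser.add_argument("--save_steps", type=int, default=1000)
    parser.add_argument("--warmup_steps", type=int, default=100)
    parser.add_argument("--output", required=True, help="Directory for the trained model")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--model", "DeepESP/gpt2-spanish",
    "--tokenizer", "DeepESP/gpt2-spanish",
    "--train_path", "../data/catalan_corpus_train.csv",
    "--test_path", "../data/catalan_corpus_test.csv",
    "--output", "gpt2-catalan",
])
print(args.model, args.n_epochs, args.train_batch_size)
```

Giving the numeric flags defaults keeps short invocations possible, while the paths and model names stay required so a run cannot silently use the wrong data.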

2.3. About the data used 📂

The data used has mostly been the WikiCorpus data provided by the Computer Science department @ FIB, UPC (Facultat d'Informàtica de Barcelona, Universitat Politècnica de Catalunya).

You can download it using the datasets library from Huggingface:

from datasets import load_dataset

dataset = load_dataset("wikicorpus", "raw_ca")

Or you can use the download_wikicorpus.py file in this repository, which also splits the data into train and test sets and can create a smaller subset for testing, if desired.
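The idea of a train/test split with an optional smaller subset can be sketched with the standard library alone. This is not the code of download_wikicorpus.py: the function names, the 10% test fraction, the fixed seed and the `text` column name are all assumptions made for illustration.

```python
import csv
import random


def split_train_test(texts, test_fraction=0.1, seed=42, max_examples=None):
    """Shuffle texts, optionally keep a subset, and split into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(texts)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    if max_examples is not None:
        shuffled = shuffled[:max_examples]  # smaller subset for quick experiments
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, texts, column="text"):
    """Write one text per row under a single header column (assumed layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([column])
        writer.writerows([t] for t in texts)


# Toy documents standing in for the real corpus.
documents = [f"document {i}" for i in range(100)]
train, test = split_train_test(documents, test_fraction=0.25)
print(len(train), len(test))
```

Calling `write_csv` on each part would then produce files in the spirit of catalan_corpus_train.csv and catalan_corpus_test.csv.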

2.3.1. WikiCorpus PROs 👍

Well, the data is already obtained. That's always a pro.

2.3.2. WikiCorpus CONs 👎

We are limiting the knowledge of the language model to Wikipedia data. Therefore, the model will probably be more error-prone with informal text inputs, such as chat messages, colloquialisms and text from social media.

Additionally, the size of the data is tiny with respect to what it should be.

Further training for specific tasks ⚡

Once the model is trained in Catalan and we have a base, we can further train it with a specific task in mind.

A couple of proofs of concept (PoCs) have been done using data gathered from Twitter and from Catalan songs.

Testing the model 🐱

We can test the trained model easily using the script test_generation.py.

cd src
python ./test_generation.py -t DeepESP/gpt2-spanish -m ../data/gpt2-catalan -i generation_test.txt
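If you prefer to try generation from Python, the sketch below shows one possible way using the `pipeline` helper from the transformers library. It is not the code of test_generation.py. The one-prompt-per-line file format, the function names and the sampling settings are assumptions.

```python
from pathlib import Path


def load_prompts(path):
    """Read prompts from a text file, one per non-empty line (assumed format)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def generate(prompts, tokenizer_name, model_path, max_new_tokens=50):
    """Generate one continuation per prompt with a text-generation pipeline."""
    # Imported here so load_prompts works even without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_path, tokenizer=tokenizer_name)
    outputs = []
    for prompt in prompts:
        # Sampling settings are illustrative, not tuned values.
        result = generator(prompt, max_new_tokens=max_new_tokens, do_sample=True, top_k=50)
        outputs.append(result[0]["generated_text"])
    return outputs


# Example call, mirroring the command-line arguments above:
# texts = generate(load_prompts("generation_test.txt"),
#                  "DeepESP/gpt2-spanish", "../data/gpt2-catalan")
```

The example call is left commented out because it needs the trained model directory and the transformers package to be available.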

3. Questions ❓ ❔

3.1. Why Catalan ❓

Artificial Intelligence should not be only for widely spoken languages, such as English or even Spanish. Catalan, a minority language, is my mother tongue and it's always fun to see something you work with also operating in your own language. So why not?

3.2. Why use a Pretrained model in Spanish ❔

Although Spanish and Catalan are different languages, they share a lot of expressions, vocabulary and grammatical structures. Therefore, basing a Catalan model on a previously trained model in a close language such as Spanish is not unreasonable.

Transferring the knowledge from it to our model is better than starting from zero, especially to save computational time.

3.3. Can I use another data/language ❓

Even though the scripts were all written with the Catalan language in mind, they should work with any text data, not only Catalan from the WikiCorpus.

Feel free to change the CatalanDataset class or swap it with your own, since the formatting of the input text is probably the aspect that varies most between projects.
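As a rough idea of what a replacement could look like, here is a minimal dataset class using only the standard library. `TextLineDataset` is a hypothetical name and does not reproduce CatalanDataset. The `text` column and the `<|endoftext|>` markers (the default GPT-2 delimiter) are assumptions, and a real version would likely subclass `torch.utils.data.Dataset` and tokenize the text.

```python
import csv


class TextLineDataset:
    """Hypothetical stand-in for CatalanDataset: one formatted string per example."""

    def __init__(self, texts, bos_token="<|endoftext|>", eos_token="<|endoftext|>"):
        # Project-specific formatting lives here: strip whitespace, skip empty
        # rows, and wrap each text in the (assumed) start and end markers.
        self.examples = [
            f"{bos_token}{text.strip()}{eos_token}" for text in texts if text.strip()
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index):
        return self.examples[index]

    @classmethod
    def from_csv(cls, path, column="text", **kwargs):
        """Build the dataset from one column of a CSV file (assumed layout)."""
        with open(path, newline="", encoding="utf-8") as f:
            return cls([row[column] for row in csv.DictReader(f)], **kwargs)


dataset = TextLineDataset(["Bon dia", "   ", "Adéu"])
print(len(dataset), dataset[0])
```

Keeping the formatting inside the constructor means that switching to another corpus or language only requires changing this one class.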

Be sure to change the base model as well: if you want to train a model for another language (e.g. German), basing it on a model pre-trained in Spanish will not work well.

4. TO-DO 🚧

Since we are using a transfer learning approach and relying on a model previously pre-trained in Spanish, the model is probably not as accurate as it could be.

More varied data should also be used during training, because the current data is heavily biased towards informative text (for obvious reasons).

Owner

Laura
