Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

Overview

背景

TrenTrans是一个统一的端到端的多语言多任务预训练平台,支持多种预训练方式,以及序列生成和自然语言理解任务。

安装教程

git clone [email protected]:baijunji/Teg-Tentrans.git
pip install -r requirements.txt 

Tentrans是一个基于Pytorch的轻量级工具包,安装十分方便。

快速上手

(一)预训练模型

TenTrans支持多种预训练模型,包括基于编码器的预训练(e.g. MLM)和基于seq2seq结构的生成式预训练方法(e.g. Mass)。 此外, Tentrans还支持大规模的多语言机器翻译预训练。

我们将从最简单的MLM预训练开始,让您快速熟悉TenTrans的运行逻辑。

  1. 数据处理

在预训练MLM模型时,我们需要对单语训练文件进行二进制化。您可以使用以下命令, 词表的格式为一行一词,执行该命令后会生成train.bpe.en.pth。

python process.py vocab file  lang [shard_id](optional)

当数据规模不大时,您可以使用纯文本格式的csv作为训练文件。csv的文件格式为

seq1 lang1
This is a positive sentence. en
This is a negtive sentence. en
This is a sentence. en
  1. 参数配置

TenTrans是通过yaml文件的方式读取训练参数的, 我们提供了一系列的适应各个任务的训练配置文件模版(见 run/ 文件夹),您只要改动很小的一部分参数即可。

# base config
langs: [en]
epoch: 15
update_every_epoch:  1   # 每轮更新多少step
dumpdir: ./dumpdir       # 模型及日志文件保存的地方
share_all_task_model: True # 是否共享所有任务的模型参数
save_intereval: 1      # 模型保存间隔
log_interval: 10       # 打印日志间隔



#全局设置开始, 如果tasks内没有定义特定的参数,则将使用全局设置
optimizer: adam 
learning_rate: 0.0001
learning_rate_warmup: 4000
scheduling: warmupexponentialdecay
max_tokens: 2000
group_by_size: False   # 是否对语料对长度排序
max_seq_length: 260    # 模型所能接受的最大句子长度
weight_decay: 0.01
eps: 0.000001
adam_betas: [0.9, 0.999]

sentenceRep:           # 模型编码器设置
  type: transformer #cbow, rnn
  hidden_size: 768
  ff_size: 3072
  dropout: 0.1
  attention_dropout: 0.1
  encoder_layers: 12
  num_lang: 1
  num_heads: 12
  use_langembed: False
  embedd_size: 768
  learned_pos: True
  pretrain_embedd: 
  activation: gelu
#全局设置结束


tasks:                #任务定义, TenTrans支持多种任务联合训练,包括分类,MLM和seq2seq联合训练。
  en_mlm:             #任务ID,  您可以随意定义有含义的标识名
    task_name: mlm    #任务名,  TenTrans会根据指定的任务名进行训练
    data:
        data_folder: your_data_folder
        src_vocab: vocab.txt
        # train_valid_test: [train.bpe.en.csv, valid.bpe.en.csv, test.bpe.en.csv]
        train_valid_test: [train.bpe.en.pth, valid.bpe.en.pth, test.bpe.en.pth]
        stream_text: False  # 是否启动文本流训练
        p_pred_mask_kepp_rand: [0.15, 0.8, 0.1, 0.1]

    target:           # 输出层定义
        sentence_rep_dim: 768
        dropout: 0.1
        share_out_embedd: True
  1. 启动训练

单机多卡

export NPROC_PER_NODE=8;
python -m torch.distributed.launch \
                --nproc_per_node=$NPROC_PER_NODE main.py \
                --config run/xlm.yaml --multi_gpu True

(二)机器翻译

本节您将快速学会如何训练一个基于Transformer的神经机器翻译模型,我们以WMT14 英-德为例(下载数据)。

  1. 数据处理

与处理单语训练文件相同,您也需要对翻译的平行语料进行二进制化。

python process.py vocab.bpe.32000 train.bpe.de de
python process.py vocab.bpe.32000 train.bpe.en en
  1. 参数配置
# base config
langs: [en, de]
epoch: 50
update_every_epoch: 5000
dumpdir: ./exp/tentrans/wmt14ende_template

share_all_task_model: True
optimizer: adam 
learning_rate: 0.0007
learning_rate_warmup: 4000
scheduling: warmupexponentialdecay
max_tokens: 8000
max_seq_length: 512
save_intereval: 1
weight_decay: 0
adam_betas: [0.9, 0.98]

clip_grad_norm: 0
label_smoothing: 0.1
accumulate_gradients: 2
share_all_embedd: True
patience: 10
#share_out_embedd: False

tasks:
  wmtende_mt:
    task_name: seq2seq
    reload_checkpoint:
    data:
        data_folder:  /train_data/wmt16_ende/
        src_vocab: vocab.bpe.32000
        tgt_vocab: vocab.bpe.32000
        train_valid_test: [train.bpe.en.pth:train.bpe.de.pth, valid.bpe.en.pth:valid.bpe.de.pth, test.bpe.en.pth:test.bpe.de.pth]
        group_by_size: True
        max_len: 200

    sentenceRep:
      type: transformer 
      hidden_size: 512
      ff_size: 2048
      attention_dropout: 0.1
      encoder_layers: 6
      num_heads: 8
      embedd_size: 512
      dropout: 0.1
      learned_pos: True
      activation: relu

    target:
      type: transformer 
      hidden_size: 512
      ff_size: 2048
      attention_dropout: 0.1
      decoder_layers: 6
      num_heads: 8
      embedd_size: 512
      dropout: 0.1
      learned_pos: True
      activation: relu
  1. 模型解码

大约训练更新20万步之后(8张M40,大约耗时四十小时), 我们可以使用TenTrans提供的脚本对平均最后几个模型来获得更好的效果。

path=model_save_path
python  scripts/average_checkpoint.py --inputs  $path/checkpoint_seq2seq_ldc_mt_40 \
    $path/checkpoint_seq2seq_ldc_mt_39 $path/checkpoint_seq2seq_ldc_mt_38 \
    $path/checkpoint_seq2seq_ldc_mt_37 $path/checkpoint_seq2seq_ldc_mt_36 \
    $path/checkpoint_seq2seq_ldc_mt_35 $path/checkpoint_seq2seq_ldc_mt_34 \
    --output $path/average.pt

我们可以使用平均之后的模型进行翻译解码,

python -u infer/translation_infer.py \
        --src train_data/wmt16_ende/test.bpe.en \
        --src_vocab train_data/wmt16_ende/vocab.bpe.32000 \
        --tgt_vocab train_data/wmt16_ende/vocab.bpe.32000 \
        --src_lang en \
        --tgt_lang de --batch_size 50 --beam 4 --length_penalty 0.6 \
        --model_path model_save_path/average.pt | \
        grep "Target_" | cut -f2- -d " " | sed -r 's/(@@ )|(@@ ?$)//g' > predict.ende

cat  train_data/wmt16_ende/test.tok.de |  perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' > generate.ref
cat  predict.ende | perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' > generate.sys
perl ../scripts/multi-bleu.perl generate.ref < generate.sys
  1. 翻译结果
WMT14-ende BLEU
Attention is all you need(beam=4) 27.30
TenTrans(beam=4, 8gpus, updates=200k, gradient_accu=1) 27.54
TenTrans(beam=4, 8gpus, updates=125k, gradient_accu=2) 27.64
TenTrans(beam=4, 24gpus, updates=90k, gradient_accu=1) 27.67

(三)文本分类

您同样可以使用我们所提供的预训练模型来进行下游任务, 本节我们将以SST2任务为例, 让你快速上手使用预训练模型进行微调下游任务。

  1. 数据处理

我们推荐使用文本格式进行文本分类的训练,因为这更轻量和快速。我们将SST2的数据处理为如下格式(见sample_data 文件夹):

seq1 label1 lang1
This is a positive sentence. postive en
This is a negtive sentence. negtive en
This is a sentence. unknow en
  1. 参数配置
# base config
langs: [en]
epoch: 200
update_every_epoch: 1000
share_all_task_model: False
batch_size: 8 
save_interval: 20
dumpdir: ./dumpdir/sst2

sentenceRep:
  type: transformer
  pretrain_rep: ../tentrans_pretrain/model_mlm2048.tt

tasks:
  sst2_en:
    task_name: classification
    data:
        data_folder:  sample_data/sst2
        src_vocab: vocab_en
        train_valid_test: [train.csv, dev.csv, test.csv]
        label1: [0, 1]
        feature: [seq1, label1, lang1]
    lr_e: 0.000005  # encoder学习率
    lr_p: 0.000125  # target 学习率
    target:
      sentence_rep_dim: 2048
      dropout: 0.1
    weight_training: False # 是否采用数据平衡
  1. 分类解码
python -u classification_infer.py \
         --model model_path \
         --vocab  sample_data/sst2/vocab_en \
         --src test.txt \
         --lang en --threhold 0.5  > predict.out.label
python scripts/eval_recall.py  test.en.label predict.out.label

TenTrans 进阶

1. 多语言机器翻译

2. 跨语言预训练

Owner
Tencent Minority-Mandarin Translation Team
Tencent Minority-Mandarin Translation Team
💫 Industrial-strength Natural Language Processing (NLP) in Python

spaCy: Industrial-strength NLP spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest researc

Explosion 24.9k Jan 02, 2023
A 10000+ hours dataset for Chinese speech recognition

A 10000+ hours dataset for Chinese speech recognition

309 Dec 16, 2022
⛵️The official PyTorch implementation for "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing" (EMNLP 2020).

BERT-of-Theseus Code for paper "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing". BERT-of-Theseus is a new compressed BERT by progre

Kevin Canwen Xu 284 Nov 25, 2022
Deep learning for NLP crash course at ABBYY.

Deep NLP Course at ABBYY Deep learning for NLP crash course at ABBYY. Suggested textbook: Neural Network Methods in Natural Language Processing by Yoa

Dan Anastasyev 597 Dec 18, 2022
Harvis is designed to automate your C2 Infrastructure.

Harvis Harvis is designed to automate your C2 Infrastructure, currently using Mythic C2. 📌 What is it? Harvis is a python tool to help you create mul

Thiago Mayllart 99 Oct 06, 2022
A method for cleaning and classifying text using transformers.

NLP Translation and Classification The repository contains a method for classifying and cleaning text using NLP transformers. Overview The input data

Ray Chamidullin 0 Nov 15, 2022
BeautyNet is an AI powered model which can tell you whether you're beautiful or not.

BeautyNet BeautyNet is an AI powered model which can tell you whether you're beautiful or not. Download Dataset from here:https://www.kaggle.com/gpios

Ansh Gupta 0 May 06, 2022
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

Keon Lee 67 Nov 14, 2022
Sequence Modeling with Structured State Spaces

Structured State Spaces for Sequence Modeling This repository provides implementations and experiments for the following papers. S4 Efficiently Modeli

HazyResearch 902 Jan 06, 2023
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 🐦 🇮🇩 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

IndoLEM 40 Nov 30, 2022
Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application with a focus on embedded systems.

Welcome to Spokestack Python! This library is intended for developing voice interfaces in Python. This can include anything from Raspberry Pi applicat

Spokestack 133 Sep 20, 2022
This repository implements a brute-force spellchecker utilizing the Damerau-Levenshtein edit distance.

About spellchecker.py Implementing a highly-accurate, brute-force, and dynamically programmed spellchecking program that utilizes the Damerau-Levensht

Raihan Ahmed 1 Dec 11, 2021
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.

ParlAI (pronounced “par-lay”) is a python framework for sharing, training and testing dialogue models, from open-domain chitchat, to task-oriented dia

Facebook Research 9.7k Jan 09, 2023
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

TextDistance TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Features: 30+ algorithms Pure pyt

Life4 3k Jan 06, 2023
Intent parsing and slot filling in PyTorch with seq2seq + attention

PyTorch Seq2Seq Intent Parsing Reframing intent parsing as a human - machine translation task. Work in progress successor to torch-seq2seq-intent-pars

Sean Robertson 159 Apr 04, 2022
Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

Tensor2Tensor Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and ac

12.9k Jan 07, 2023
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
Use the state-of-the-art m2m100 to translate large data on CPU/GPU/TPU. Super Easy!

Easy-Translate is a script for translating large text files in your machine using the M2M100 models from Facebook/Meta AI. We also privide a script fo

Iker García-Ferrero 41 Dec 15, 2022
Mkdocs + material + cool stuff

Modern-Python-Doc-Example mkdocs + material + cool stuff Doc is live here Features out of the box amazing good looking website thanks to mkdocs.org an

Francesco Saverio Zuppichini 61 Oct 26, 2022
A single model that parses Universal Dependencies across 75 languages.

A single model that parses Universal Dependencies across 75 languages. Given a sentence, jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.

Dan Kondratyuk 189 Nov 29, 2022