Learning to Rewrite for Non-Autoregressive Neural Machine Translation

Overview

RewriteNAT

This repo provides the code for reproducing our proposed RewriteNAT in EMNLP 2021 paper entitled "Learning to Rewrite for Non-Autoregressive Neural Machine Translation". RewriteNAT is a iterative NAT model which utilizes a locator component to explicitly learn to rewrite the erroneous translation pieces during iterative decoding.

Dependencies

Preprocessing

All the datasets are tokenized using the scripts from Moses except for Chinese with Jieba tokenizer, and splitted into subword units using BPE. The tokenized datasets are binaried using the script binaried.sh as follows:

python preprocess.py \
    --source-lang ${src} --target-lang ${tgt} \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/${dataset} --thresholdtgt 0 --thresholdsrc 0 \ 
    --workers 64 --joined-dictionary

Train

All the models are run on 8 Tesla V100 GPUs for 300,000 updates with an effective batch size of 128,000 tokens apart from En→Fr where we make 500,000 updates to account for the data size. The training scripts train.rewrite.nat.sh is configured as follows:

python train.py \
    data-bin/${dataset} \
    --source-lang ${src} --target-lang ${tgt} \
    --save-dir ${save_dir} \
    --ddp-backend=no_c10d \
    --task translation_lev \
    --criterion rewrite_nat_loss \
    --arch rewrite_nonautoregressive_transformer \
    --noise full_mask \
    ${share_all_embeddings} \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --length-loss-factor 0.1 \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \ 
    --max-tokens 4000 \
    --save-interval-updates 10000 \
    --max-update ${step} \
    --update-freq 4 \ 
    --fp16 \
    --save-interval ${save_interval} \
    --discriminator-layers 6 \ 
    --train-max-iter ${max_iter} \
    --roll-in-g sample \
    --roll-in-d oracle \
    --imitation-g \
    --imitation-d \
    --discriminator-loss-factor ${discriminator_weight} \
    --no-share-discriminator \
    --generator-scale ${generator_scale} \
    --discriminator-scale ${discriminator_scale} \

Evaluation

We evaluate performance with BLEU for all language pairs, except for En→>Zh, where we use SacreBLEU. The testing scripts test.rewrite.nat.sh is utilized to generate the translations, as follows:

python generate.py \                                            
    data-bin/${dataset} \                                          
    --source-lang ${src} --target-lang ${tgt} \                    
    --gen-subset ${subset} \                                       
    --task translation_lev \                                       
    --path ${save_dir}/${dataset}/checkpoint_average_${suffix}.pt \
    --iter-decode-max-iter ${max_iter} \                           
    --iter-decode-with-beam ${beam} \                              
    --iter-decode-p ${iter_p} \                                    
    --beam 1 --remove-bpe \                                        
    --batch-size 50\                                               
    --print-step \                                                 
    --quiet 

Citation

Please cite as:

@inproceedings{geng-etal-2021-learning,
    title = "Learning to Rewrite for Non-Autoregressive Neural Machine Translation",
    author = "Geng, Xinwei and Feng, Xiaocheng and Qin, Bing",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.265",
    pages = "3297--3308",
}
Owner
Xinwei Geng
Ph.D. student working on improving Neural Machine Translation with Reinforcement Learning @HIT-SCIR
Xinwei Geng
Pre-Training with Whole Word Masking for Chinese BERT

Pre-Training with Whole Word Masking for Chinese BERT

Yiming Cui 7.7k Dec 31, 2022
LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

Felipe Maia Polo 125 Dec 20, 2022
Correctly generate plurals, ordinals, indefinite articles; convert numbers to words

NAME inflect.py - Correctly generate plurals, singular nouns, ordinals, indefinite articles; convert numbers to words. SYNOPSIS import inflect p = in

Jason R. Coombs 762 Dec 29, 2022
A list of NLP(Natural Language Processing) tutorials

NLP Tutorial A list of NLP(Natural Language Processing) tutorials built on PyTorch. Table of Contents A step-by-step tutorial on how to implement and

Allen Lee 1.3k Dec 25, 2022
Multilingual finetuning of Machine Translation model on low-resource languages. Project for Deep Natural Language Processing course.

Low-resource-Machine-Translation This repository contains the code for the project relative to the course Deep Natural Language Processing. The goal o

Andrea Cavallo 3 Jun 22, 2022
ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)

ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python) 日本語は以下に続きます (Japanese follows) English: This book is written in Japanese and primaril

Ryuichi Yamamoto 189 Dec 29, 2022
Weaviate demo with the text2vec-openai module

Weaviate demo with the text2vec-openai module This repository contains an example of how to use the Weaviate text2vec-openai module. When using this d

SeMI Technologies 11 Nov 11, 2022
Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

Google 6.4k Jan 01, 2023
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 13.8k Jan 02, 2023
The guide to tackle with the Text Summarization

The guide to tackle with the Text Summarization

Takahiro Kubo 1.2k Dec 30, 2022
This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs".

CrossSum This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summ

BUET CSE NLP Group 29 Nov 19, 2022
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-t

Facebook Research 5.1k Dec 26, 2022
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP prod

VinAI Research 109 Dec 02, 2022
Weakly-supervised Text Classification Based on Keyword Graph

Weakly-supervised Text Classification Based on Keyword Graph How to run? Download data Our dataset follows previous works. For long texts, we follow C

Hello_World 20 Dec 29, 2022
A highly sophisticated sequence-to-sequence model for code generation

CoderX A proof-of-concept AI system by Graham Neubig (June 30, 2021). About CoderX CoderX is a retrieval-based code generation AI system reminiscent o

Graham Neubig 39 Aug 03, 2021
NLP: SLU tagging

NLP: SLU tagging

北海若 3 Jan 14, 2022
A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer Token id [0, 20000) are image tokens. Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == 'unk', ice

THUDM 42 Dec 27, 2022
Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

SEW (Squeezed and Efficient Wav2vec) The repo contains the code of the paper "Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speec

ASAPP Research 67 Dec 01, 2022
The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.

VAENAR-TTS This repo contains code accompanying the paper "VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis". Sa

THUHCSI 138 Oct 28, 2022
Library for fast text representation and classification.

fastText fastText is a library for efficient learning of word representations and sentence classification. Table of contents Resources Models Suppleme

Facebook Research 24.1k Jan 05, 2023