Implementation of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Overview

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

This is the implementation of our paper:

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation
Zhiwei He*, Xing Wang, Rui Wang, Shuming Shi, Zhaopeng Tu
ACL 2022 (long paper, main conference)

Our code is based heavily on the original implementations of XLM and MASS.

Dependencies

  • Python 3

  • PyTorch 1.7.1

    pip3 install torch==1.7.1+cu110
  • fastBPE

  • Apex

    git clone https://github.com/NVIDIA/apex
    cd apex
    git reset --hard 0c2c6ee
    pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
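A quick sanity check (not part of the original repo) confirms that CUDA-enabled PyTorch and Apex built correctly:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python3 -c "from apex import amp"   # imports cleanly once the extension build succeeded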

Data ready

We prepared the data following the instructions from XLM (Section III), using their released scripts, BPE codes and vocabularies. However, our setup differs from theirs in a few ways:

  • All available data is used, not just 5,000,000 sentences per language

  • For Romanian, we augmented the training data with the monolingual data from WMT16.

  • Noisy sentences are removed (see the sketch after this list):

    python3 filter_noisy_data.py --input all.en --lang en --output clean.en
  • For English-German, we used the processed data provided by Kaitao Song.
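
Putting the steps together, per-language preparation might look like the sketch below (the fastBPE binary path and file names are assumptions; the BPE codes and vocabulary are the ones released with XLM, and preprocess.py is XLM's binarization script):

python3 filter_noisy_data.py --input all.en --lang en --output clean.en   # drop noisy lines
./fastBPE/fast applybpe train.en.bpe clean.en codes vocab.en              # apply the released BPE codes
python3 preprocess.py vocab.en train.en.bpe                               # binarize for training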

Since data preparation can take a very long time, we provide the processed data for download.

Pre-trained models

We adopted the released XLM and MASS models for all language pairs. To better reproduce the MASS results on En-De, we continued pre-training the released MASS model on monolingual data for 300 epochs and selected the best checkpoint by perplexity (PPL) on the validation set.
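
As a rough guide, this continued pre-training can be launched through the MASS/XLM train.py interface. The sketch below is illustrative only; the checkpoint names and the validation-metric name are assumptions, so verify them against train.py --help before use.

# Illustrative flags (names are assumptions; verify with `python3 train.py --help`):
#   --mass_steps runs the MASS objective on both languages;
#   --reload_model initializes encoder and decoder from the released MASS checkpoint;
#   --validation_metrics selects the best epoch by validation perplexity.
python3 train.py \
  --exp_name continue_mass_ende \
  --data_path ./data/processed/de-en \
  --lgs 'en-de' \
  --mass_steps 'en,de' \
  --reload_model 'mass_ende.pth,mass_ende.pth' \
  --max_epoch 300 \
  --validation_metrics '_valid_mass_ppl'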

Here are the pre-trained models we used:

Languages          XLM      MASS
English-French     Model    Model
English-German     Model    Model
English-Romanian   Model    Model

Model training

We provide training scripts and trained models for the UNMT baseline and for our approach with online self-training.

Training scripts

Train UNMT model with online self-training and XLM initialization:

cd scripts
sh run-xlm-unmt-st-ende.sh

Note: remember to modify the path variables in the header of the shell script.
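
For reference, the variables at the top of the script look roughly like this (the variable names here are hypothetical placeholders):

# --- header of run-xlm-unmt-st-ende.sh (illustrative placeholders) ---
DATA_PATH=/path/to/processed/en-de   # processed data from the "Data ready" section
MODEL_PATH=/path/to/mlm_ende.pth     # XLM pre-trained model used for initialization
DUMP_PATH=/path/to/experiments       # where checkpoints and logs are written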

Trained model

We selected the best model by BLEU score on the validation set for both directions. Therefore, we release En-X and X-En models for each experiment.

Approach   XLM             MASS
UNMT       En-Fr  Fr-En    En-Fr  Fr-En
           En-De  De-En    En-De  De-En
           En-Ro  Ro-En    En-Ro  Ro-En
UNMT-ST    En-Fr  Fr-En    En-Fr  Fr-En
           En-De  De-En    En-De  De-En
           En-Ro  Ro-En    En-Ro  Ro-En

Evaluation

Generate translations

Input sentences must use the same tokenization and BPE codes as the ones used during model training.
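
If you start from raw text, apply the same preprocessing first; a minimal sketch assuming the Moses tokenizer and fastBPE's stream mode (script paths are assumptions):

cat input.en.raw | \
perl tokenizer.perl -l en | \
./fastBPE/fast applybpe_stream codes vocab.en > input.en.bpe

Then translate: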

cat input.en.bpe | \
python3 translate.py \
  --exp_name translate  \
  --src_lang en --tgt_lang de \
  --model_path trained_model.pth  \
  --output_path output.de.bpe \
  --batch_size 8

Remove BPE

sed -r 's/(@@ )|(@@ ?$)//g' output.de.bpe > output.de.tok

Evaluate

BLEU_SCRIPT_PATH=src/evaluation/multi-bleu.perl
perl $BLEU_SCRIPT_PATH ref.de.tok < output.de.tok
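
For convenience, the three evaluation steps can be chained into one script assembled from the commands above (file names are placeholders):

#!/bin/bash
set -e
# 1. translate BPE-encoded input with a trained checkpoint
cat input.en.bpe | python3 translate.py \
  --exp_name translate --src_lang en --tgt_lang de \
  --model_path trained_model.pth --output_path output.de.bpe --batch_size 8
# 2. restore word-level tokens
sed -r 's/(@@ )|(@@ ?$)//g' output.de.bpe > output.de.tok
# 3. score against the tokenized reference
perl src/evaluation/multi-bleu.perl ref.de.tok < output.de.tok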