Code repository for the paper "Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation" with instructions to reproduce the results.

Overview

Doubly Trained Neural Machine Translation System for Adversarial Attack and Data Augmentation

Languages Experimented:

  • Data Overview:

    Source Target Training Data Valid1 Valid2 Test data
    ZH EN WMT17 without UN corpus WMT2017 newstest WMT2018 newstest WMT2020 newstest
    DE EN WMT17 WMT2017 newstest WMT2018 newstest WMT2014 newstest
    FR EN WMT14 without UN corpus WMT2015 newsdiscussdev WMT2015 newsdiscusstest WMT2014 newstest
  • Corpus Statistics:

    Lang-pair Data Type #Sentences #tokens (English side)
    zh-en Train 9355978 161393634
    Valid1 2001 47636
    Valid2 3981 98308
    test 2000 65561
    de-en Train 4001246 113777884
    Valid1 2941 74288
    Valid2 2970 78358
    test 3003 78182
    fr-en Train 23899064 73523616
    Valid1 1442 30888
    Valid2 1435 30215
    test 3003 81967

Scripts (as shown in paper's appendix)

  • Set-up:

    • To execute the scripts shown below, it's required that fairseq version 0.9 is installed along with COMET. The way to easily install them after cloning this repo is executing following commands (under root of this repo):
      cd fairseq-0.9.0
      pip install --editable ./
      cd ../COMET
      pip install .
    • It's also possible to directly install COMET through pip: pip install unbabel-comet, but the recent version might have different dependency on other packages like fairseq. Please check COMET's official website for the updated information.
    • To make use of script that relies on COMET model (in case of dual-comet), a model from COMET should be downloaded. It can be easily done by running following script:
      from comet.models import download_model
      download_model("wmt-large-da-estimator-1719")
  • Pretrain the model:

    fairseq-train $DATADIR \
        --source-lang $src \
        --target-lang $tgt \
        --save-dir $SAVEDIR \
        --share-decoder-input-output-embed \
        --arch transformer_wmt_en_de \
        --optimizer adam --adam-betas ’(0.9, 0.98)’ --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 2048 --update-freq 16 \
        --seed 2 
  • Adversarial Attack:

    fairseq-train $DATADIR \
        --source-lang $src \
        --target-lang $tgt \
        --save-dir $SAVEDIR \
        --share-decoder-input-output-embed \
        --train-subset valid \
        --arch transformer_wmt_en_de \
        --optimizer adam --adam-betas ’(0.9, 0.98)’ --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion dual_bleu --mrt-k 16 \
        --batch-size 2 --update-freq 64 \
        --seed 2 \
        --restore-file $PREETRAIN_MODEL \
        --reset-optimizer \
        --reset-dataloader 
  • Data Augmentation:

    fairseq-train $DATADIR \
        -s $src -t $tgt \
        --train-subset valid \
        --valid-subset valid1 \
        --left-pad-source False \
        --share-decoder-input-output-embed \
        --encoder-embed-dim 512 \
        --arch transformer_wmt_en_de \
        --dual-training \
        --auxillary-model-path $AUX_MODEL \
        --auxillary-model-save-dir $AUX_MODEL_SAVE \
        --optimizer adam --adam-betas ’(0.9, 0.98)’ --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 0.000001 --warmup-updates 1000 \
        --lr 0.00001 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion dual_comet/dual_mrt --mrt-k 8 \
        --comet-route $COMET_PATH \
        --batch-size 4 \
        --skip-invalid-size-inputs-valid-test \
        --update-freq 1 \
        --on-the-fly-train --adv-percent 30 \
        --seed 2 \
        --restore-file $PRETRAIN_MODEL \
        --reset-optimizer \
        --reset-dataloader \
        --save-dir $CHECKPOINT_FOLDER 

Generation and Test:

  • For Chinese-English, we use sentencepiece to perform the BPE so it's required to be removed in generation step. For all test we use beam size = 5. Noitce that we modified the code in fairseq-gen to use sacrebleu.tokenizers.TokenizerZh() to tokenize Chinese when the direction is en-zh.

    fairseq-generate $DATA-FOLDER \
        -s zh -t en \
        --task translation \
        --gen-subset $file \
        --path $CHECKPOINT \
        --batch-size 64 --quiet \
        --lenpen 1.0 \
        --remove-bpe sentencepiece \
        --sacrebleu \
        --beam 5
  • For French-Enlish, German-English, we modified the script to detokenize the moses tokenizer (which we used to preprocess the data). To reproduce the result, use following script:

    fairseq-generate $DATA-FOLDER \
        -s de/fr -t en \
        --task translation \
        --gen-subset $file \
        --path $CHECKPOINT \
        --batch-size 64 --quiet \
        --lenpen 1.0 \
        --remove-bpe \
        ---detokenize-moses \
        --sacrebleu \
        --beam 5

    Here --detokenize-moses would call detokenizer during the generation step and detokenize predictions before evaluating it. It would slow the generation step. Another way to manually do this is to retrieve prediction and target sentences from output file of fairseq and manually apply detokenizer from detokenizer.perl.

BibTex

@misc{tan2021doublytrained,
      title={Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation}, 
      author={Weiting Tan and Shuoyang Ding and Huda Khayrallah and Philipp Koehn},
      year={2021},
      eprint={2110.05691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Owner
Steven Tan
Johns Hopkins 21' Computer Science & Applied Mathematics and Statistics Major
Steven Tan
Instance-level Image Retrieval using Reranking Transformers

Instance-level Image Retrieval using Reranking Transformers Fuwen Tan, Jiangbo Yuan, Vicente Ordonez, ICCV 2021. Abstract Instance-level image retriev

UVA Computer Vision 87 Jan 03, 2023
PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, EfficientNetV2, NFNet, Vision Transformer, MixNet, MobileNet-V3/V2, RegNet, DPN, CSPNet, and more

PyTorch Image Models Sponsors What's New Introduction Models Features Results Getting Started (Documentation) Train, Validation, Inference Scripts Awe

Ross Wightman 22.9k Jan 09, 2023
PlaidML is a framework for making deep learning work everywhere.

A platform for making deep learning work everywhere. Documentation | Installation Instructions | Building PlaidML | Contributing | Troubleshooting | R

PlaidML 4.5k Jan 02, 2023
Multiple custom object count and detection using YOLOv3-Tiny method

Electronic-Component-YOLOv3 Introduce This project created to detect, count, and recognize multiple custom object using YOLOv3-Tiny method. The target

Derwin Mahardika 2 Nov 14, 2022
Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)

DocFormer - PyTorch Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for t

171 Jan 06, 2023
Curating a dataset for bioimage transfer learning

CytoImageNet A large-scale pretraining dataset for bioimage transfer learning. Motivation In past few decades, the increase in speed of data collectio

Stanley Z. Hua 9 Jun 20, 2022
Multi-Person Extreme Motion Prediction

Multi-Person Extreme Motion Prediction Implementation for paper Wen Guo, Xiaoyu Bie, Xavier Alameda-Pineda, Francesc Moreno-Noguer, Multi-Person Extre

GUO-W 38 Nov 15, 2022
Spiking Neural Network for Computer Vision using SpikingJelly framework and Pytorch-Lightning

Spiking Neural Network for Computer Vision using SpikingJelly framework and Pytorch-Lightning

Sami BARCHID 2 Oct 20, 2022
Instant neural graphics primitives: lightning fast NeRF and more

Instant Neural Graphics Primitives Ever wanted to train a NeRF model of a fox in under 5 seconds? Or fly around a scene captured from photos of a fact

NVIDIA Research Projects 10.6k Jan 01, 2023
MARS: Learning Modality-Agnostic Representation for Scalable Cross-media Retrieva

Introduction This is the source code of our TCSVT 2021 paper "MARS: Learning Modality-Agnostic Representation for Scalable Cross-media Retrieval". Ple

7 Aug 24, 2022
Collection of generative models, e.g. GAN, VAE in Pytorch and Tensorflow.

Generative Models Collection of generative models, e.g. GAN, VAE in Pytorch and Tensorflow. Also present here are RBM and Helmholtz Machine. Note: Gen

Agustinus Kristiadi 7k Jan 02, 2023
Neural Turing Machines (NTM) - PyTorch Implementation

PyTorch Neural Turing Machine (NTM) PyTorch implementation of Neural Turing Machines (NTM). An NTM is a memory augumented neural network (attached to

Guy Zana 519 Dec 21, 2022
WSDM2022 "A Simple but Effective Bidirectional Extraction Framework for Relational Triple Extraction"

BiRTE WSDM2022 "A Simple but Effective Bidirectional Extraction Framework for Relational Triple Extraction" Requirements The main requirements are: py

9 Dec 27, 2022
Real-time ground filtering algorithm of cloud points acquired using Terrestrial Laser Scanner (TLS)

This repository contains tools to simulate the ground filtering process of a registered point cloud. The repository contains two filtering methods. The first method uses a normal vector, and fit to p

5 Aug 25, 2022
Generate saved_model, tfjs, tf-trt, EdgeTPU, CoreML, quantized tflite and .pb from .tflite.

tflite2tensorflow Generate saved_model, tfjs, tf-trt, EdgeTPU, CoreML, quantized tflite and .pb from .tflite. 1. Supported Layers No. TFLite Layer TF

Katsuya Hyodo 214 Dec 29, 2022
Database Reasoning Over Text project for ACL paper

Database Reasoning over Text This repository contains the code for the Database Reasoning Over Text paper, to appear at ACL2021. Work is performed in

Facebook Research 320 Dec 12, 2022
using yolox+deepsort for object-tracker

YOLOX_deepsort_tracker yolox+deepsort实现目标跟踪 最新的yolox尝尝鲜~~(yolox正处在频繁更新阶段,因此直接链接yolox仓库作为子模块) Install Clone the repository recursively: git clone --rec

245 Dec 26, 2022
This is the official pytorch implementation of Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation(TESKD)

Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation (TESKD) By Zheng Li[1,4], Xiang Li[2], Lingfeng Yang[2,4], Jian Yang[2], Zh

Zheng Li 9 Sep 26, 2022
Code for Multiple Instance Active Learning for Object Detection, CVPR 2021

MI-AOD Language: 简体中文 | English Introduction This is the code for Multiple Instance Active Learning for Object Detection (The PDF is not available tem

Tianning Yuan 269 Dec 21, 2022
Yoga - Yoga asana classifier for python

Yoga Asana Classifier Description Hi welcome to my new deep learning project "Yo

Programminghut 35 Dec 12, 2022