The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Last update: Dec 22, 2022

Overview

Cutoff: A Simple Data Augmentation Approach for Natural Language

This repository contains source code necessary to reproduce the results presented in the following paper:

A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation

This project is maintained by Dinghan Shen. Feel free to contact [email protected] for any relevant issues.

Natural Language Undertanding (e.g. GLUE tasks, etc.)

Prerequisite:

CUDA, cudnn
Python 3.7
PyTorch 1.4.0

Run

Install Huggingface Transformers according to the instructions here: https://github.com/huggingface/transformers.
Download the datasets from the GLUE benchmark:

python download_glue_data.py --data_dir glue_data --tasks all

Fine-tune the RoBERTa-base or RoBERTa-large model with the Cutoff data augmentation strategies:

>>> chmod +x run_glue.sh
>>> ./run_glue.sh

Options: different settings and hyperparameters can be selected and specified in the run_glue.sh script:

do_aug: whether augmented examples are used for training.
aug_type: the specific strategy to synthesize Cutoff samples, which can be chosen from: 'span_cutoff', 'token_cutoff' and 'dim_cutoff'.
aug_cutoff_ratio: the ratio corresponding to the span length, token number or number of dimensions to be cut.
aug_ce_loss: the coefficient for the cross-entropy loss over the cutoff examples.
aug_js_loss: the coefficient for the Jensen-Shannon (JS) Divergence consistency loss over the cutoff examples.
TASK_NAME: the downstream GLUE task for fine-tuning.
model_name_or_path: the pre-trained for initialization (both RoBERTa-base or RoBERTa-large models are supported).
output_dir: the folder results being saved to.

Natural Language Generation (e.g. Translation, etc.)

Please refer to Neural Machine Translation with Data Augmentation for more details

IWSLT'14 German to English (Transformers)

Task	Setting	Approach	BLEU
iwslt14 de-en	transformer-small	w/o cutoff	36.2
iwslt14 de-en	transformer-small	w/ cutoff	37.6

WMT'14 English to German (Transformers)

Task	Setting	Approach	BLEU
wmt14 en-de	transformer-base	w/o cutoff	28.6
wmt14 en-de	transformer-base	w/ cutoff	29.1
wmt14 en-de	transformer-big	w/o cutoff	29.5
wmt14 en-de	transformer-big	w/ cutoff	30.3

Citation

Please cite our paper in your publications if it helps your research:

@article{shen2020simple,
  title={A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation},
  author={Shen, Dinghan and Zheng, Mingzhi and Shen, Yelong and Qu, Yanru and Chen, Weizhu},
  journal={arXiv preprint arXiv:2009.13818},
  year={2020}
}

The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Related tags

Overview

Cutoff: A Simple Data Augmentation Approach for Natural Language

Natural Language Undertanding (e.g. GLUE tasks, etc.)

Prerequisite:

Run

Natural Language Generation (e.g. Translation, etc.)

IWSLT'14 German to English (Transformers)

WMT'14 English to German (Transformers)

Citation

Owner

Dinghan Shen

Gym-TORCS is the reinforcement learning (RL) environment in TORCS domain with OpenAI-gym-like interface.

Source code for the paper "Periodic Traveling Waves in an Integro-Difference Equation With Non-Monotonic Growth and Strong Allee Effect"

Imaginaire - NVIDIA's Deep Imagination Team's PyTorch Library

An implementation of the Contrast Predictive Coding (CPC) method to train audio features in an unsupervised fashion.

Code and data for "Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning" (EMNLP 2021).

Deploy optimized transformer based models on Nvidia Triton server

Computations and statistics on manifolds with geometric structures.

Data labels and scripts for fastMRI.org

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )

Keras implementation of the GNM model in paper ’Graph-Based Semi-Supervised Learning with Nonignorable Nonresponses‘

Source code for our Paper "Learning in High-Dimensional Feature Spaces Using ANOVA-Based Matrix-Vector Multiplication"

RTSeg: Real-time Semantic Segmentation Comparative Study

Little tool in python to watch anime from the terminal (the better way to watch anime)

PyTorch implementation of DUL (Data Uncertainty Learning in Face Recognition, CVPR2020)

Research code of ICCV 2021 paper "Mesh Graphormer"

Code for the TASLP paper "PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation".

Vision-Language Pre-training for Image Captioning and Question Answering

The Turing Change Point Detection Benchmark: An Extensive Benchmark Evaluation of Change Point Detection Algorithms on real-world data

Collection of TensorFlow2 implementations of Generative Adversarial Network varieties presented in research papers.

TensorFlow 2 implementation of the Yahoo Open-NSFW model