Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Last update: Dec 31, 2022

Overview

Data Augmentation using Pre-trained Transformer Models

Code associated with the Data Augmentation using Pre-trained Transformer Models paper

Code contains implementation of the following data augmentation methods

EDA (Baseline)
Backtranslation (Baseline)
CBERT (Baseline)
BERT Prepend (Our paper)
GPT-2 Prepend (Our paper)
BART Prepend (Our paper)

DataSets

In paper, we use three datasets from following resources

Low-data regime experiment setup

Run src/utils/download_and_prepare_datasets.sh file to prepare all datsets.
download_and_prepare_datasets.sh performs following steps

Download data from github
Replace numeric labels with text for STSA-2 and TREC dataset
For a given dataset, creates 15 random splits of train and dev data.

Dependencies

To run this code, you need following dependencies

Pytorch 1.5
fairseq 0.9
transformers 2.9

How to run

To run data augmentation experiment for a given dataset, run bash script in scripts folder. For example, to run data augmentation on snips dataset,

run scripts/bart_snips_lower.sh for BART experiment
run scripts/bert_snips_lower.sh for rest of the data augmentation methods

How to cite

@inproceedings{kumar-etal-2020-data,
    title = "Data Augmentation using Pre-trained Transformer Models",
    author = "Kumar, Varun  and
      Choudhary, Ashutosh  and
      Cho, Eunah",
    booktitle = "Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.lifelongnlp-1.3",
    pages = "18--26",
}

Contact

Please reachout to [email protected] for any questions related to this code.

License

This project is licensed under the Creative Common Attribution Non-Commercial 4.0 license.

Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Related tags

Overview

Data Augmentation using Pre-trained Transformer Models

DataSets

Low-data regime experiment setup

Dependencies

How to run

How to cite

Contact

License

Owner

PG-19 Language Modelling Benchmark

Code for EMNLP20 paper: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training"

Nested Named Entity Recognition for Chinese Biomedical Text

The projects lets you extract glossary words and their definitions from a given piece of text automatically using NLP techniques

结巴中文分词

This is a project of data parallel that running on NLP tasks.

Mlcode - Continuous ML API Integrations

XLNet: Generalized Autoregressive Pretraining for Language Understanding

End-to-end MLOps pipeline of a BERT model for emotion classification.

A Persian Image Captioning model based on Vision Encoder Decoder Models of the transformers🤗.

Text Normalization（文本正则化）

ChatterBot is a machine learning, conversational dialog engine for creating chat bots

The NewSHead dataset is a multi-doc headline dataset used in NHNet for training a headline summarization model.

End-2-end speech synthesis with recurrent neural networks

Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

基于pytorch+bert的中文事件抽取

Chatbot with Pytorch, Python & Nextjs

Create a semantic search engine with a neural network (i.e. BERT) whose knowledge base can be updated

Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers