T-DNA

Source code for the ACL-IJCNLP 2021 paper entitled "Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation".

Our implementation is built on the source code of the Hugging Face transformers library.

Model

We aim to adapt a generic pre-trained model to a new domain with a relatively small amount of domain-specific data. We demonstrate that by explicitly incorporating multi-granularity information about unseen and domain-specific words via the adaptation of (word-based) n-grams, the performance of a generic pre-trained model can be greatly improved. Specifically, we introduce a Transformer-based Domain-aware N-gram Adaptor, T-DNA, to effectively learn and incorporate the semantic representations of different combinations of words in the new domain. T-DNA achieves significant improvements over existing methods on most tasks, using limited data and with lower computational costs.

The overall architecture of T-DNA is shown in the figure below.

[figure: T-DNA architecture]
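As a rough mental model (not the exact implementation in ./TDNA), n-gram representations can be fused with token representations along these lines: matched n-grams receive their own embeddings, are encoded by a small Transformer layer, and are aggregated back onto the tokens they cover. The sketch below only illustrates this idea; the tensor shapes, the single encoder layer, and the additive fusion are simplifying assumptions, so please refer to ./TDNA for the actual model.

# Simplified, illustrative sketch of n-gram fusion (NOT the exact T-DNA code).
# Assumptions: token hidden states come from a pre-trained encoder, matched
# n-grams have their own embedding table, and match_matrix records which
# n-gram covers which token position.
import torch
import torch.nn as nn

class NgramFusionSketch(nn.Module):
    def __init__(self, hidden_size, ngram_vocab_size, nhead=8):
        super().__init__()
        # hidden_size must be divisible by nhead
        self.ngram_embeddings = nn.Embedding(ngram_vocab_size, hidden_size)
        self.ngram_encoder = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=nhead)

    def forward(self, token_hidden, ngram_ids, match_matrix):
        # token_hidden:  (batch, seq_len, hidden)       encoder outputs
        # ngram_ids:     (batch, num_ngrams)             ids of the matched n-grams
        # match_matrix:  (batch, seq_len, num_ngrams)    1 if an n-gram covers a token
        ngram_hidden = self.ngram_embeddings(ngram_ids)                # (batch, num_ngrams, hidden)
        # TransformerEncoderLayer expects (num_ngrams, batch, hidden)
        ngram_hidden = self.ngram_encoder(ngram_hidden.transpose(0, 1)).transpose(0, 1)
        # aggregate, for every token, the representations of the n-grams covering it
        fused = match_matrix.float() @ ngram_hidden                    # (batch, seq_len, hidden)
        return token_hidden + fused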

Requirements

Our code works with the following environment.

  • python=3.7.9
  • pytorch=1.4.0

To install the necessary packages for the project, please run: pip install -r requirements.txt.

Quick Start (For reproducing results)

  1. To run RoBERTa+T-DNA+FT, please refer to auto_FT.sh; simply run CUDA_VISIBLE_DEVICES=<GPU_ID> bash auto_FT.sh to get the expected results:
09/08/2021 19:56:58 - INFO - __main__ -   ***** Test results ag *****
09/08/2021 19:56:58 - INFO - __main__ -     eval_loss = 0.4393280267715454
09/08/2021 19:56:58 - INFO - __main__ -     eval_acc_and_f1 = {'acc': 0.8889473684210526, 'f1': 0.8889374532466023, 'acc_and_f1': 0.8889424108338275}
  2. To run RoBERTa+T-DNA+TAPT, please refer to auto_TAPT.sh; simply run CUDA_VISIBLE_DEVICES=<GPU_ID> bash auto_TAPT.sh to get the expected results:
09/08/2021 19:47:03 - INFO - __main__ -   ***** Test results ag *****
09/08/2021 19:47:03 - INFO - __main__ -     eval_loss = 0.48006332549609637
09/08/2021 19:47:03 - INFO - __main__ -     eval_acc_and_f1 = {'acc': 0.8943421052631579, 'f1': 0.8939718422143115, 'acc_and_f1': 0.8941569737387347}
  3. Important arguments:
    • task_name: ag, amazon, citation_intent, chemprot, hyperpartisan_news, imdb, rct-20k, sciie
    • data_dir: path of processed data
    • output_dir: path of saved results

Datasets

Following Gururangan et al. (2020), we conduct our experiments on eight classification tasks from four domains: biomedical sciences, computer science, news, and reviews. They are:

  • ChemProt: a manually annotated chemical–protein interaction dataset extracted from 5,031 abstracts for relation classification;
  • RCT: contains approximately 200,000 abstracts from PubMed, with the role of each sentence clearly identified;
  • CitationIntent: contains around 2,000 citations annotated for their function;
  • SciERC: consists of 500 scientific abstracts annotated for relation classification;
  • HyperPartisan: contains 645 hyperpartisan news articles with either an extreme left-wing or right-wing standpoint, used for partisanship classification;
  • AGNews: consists of 127,600 categorized articles from more than 2,000 news sources, used for topic classification;
  • Amazon: consists of 145,251 reviews on Women’s and Men’s Clothing & Accessories, each representing users’ implicit feedback on items with a binary label signifying whether the majority of customers found the review helpful;
  • IMDB: 50,000 balanced positive and negative reviews from the Internet Movie Database, used for sentiment classification.

The datasets can be downloaded from the code repository associated with the Don't Stop Pretraining (ACL 2020) paper. Please create a folder ./data in the root directory and put the downloaded datasets into it. After downloading, please convert them to *.tsv files following the script convert_dont_stop_corpus.py. Note that to create a low-resource setting, we constrain all datasets to thousands of examples. To do so, we randomly select a subset of RCT, AG, Amazon, and IMDB with ratios of 1%, 1%, 1%, and 10%, respectively.
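The subsampling itself is just a seeded random selection of training lines; the sketch below illustrates the idea (the file paths and the helper name are illustrative, not part of the released scripts).

# Minimal sketch of the low-resource subsampling step (illustrative only):
# keep a random fraction of the training lines, with a fixed seed.
import random

def subsample_tsv(in_path, out_path, ratio=0.01, seed=42):
    random.seed(seed)
    with open(in_path, encoding="utf-8") as f:
        lines = f.readlines()
    kept = random.sample(lines, max(1, int(len(lines) * ratio)))
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(kept)

# e.g., keep 1% of the AG News training set (paths are illustrative)
# subsample_tsv("./data/ag/train_full.tsv", "./data/ag/train.tsv", ratio=0.01)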

To extract n-grams for datasets, please run pmi_ngram.py with the following parameters:

  • --dataset: the path of training data file
  • --output_dir: the path of output directory
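For example, an invocation might look like python pmi_ngram.py --dataset ./data/<task_name>/train.tsv --output_dir ./ngram/ (the exact paths depend on where you placed the data; the run commands below assume the n-gram files live under ./ngram/). For intuition, PMI-based extraction scores an n-gram by how much more often its words co-occur than chance would predict. Below is a minimal bigram-only sketch of the scoring, for illustration only; pmi_ngram.py is the actual script.

# Minimal sketch of PMI scoring for bigrams (illustrative; see pmi_ngram.py
# for the actual extraction). Assumes whitespace-tokenized sentences.
import math
from collections import Counter

def pmi_bigrams(sentences, min_count=5):
    unigrams, bigrams, total = Counter(), Counter(), 0
    for sent in sentences:
        tokens = sent.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        total += len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / total
        p_x, p_y = unigrams[w1] / total, unigrams[w2] / total
        # PMI = log p(x, y) / (p(x) * p(y))
        scores[(w1, w2)] = math.log(p_xy / (p_x * p_y))
    return scores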

Use with your own data

In this repo, we conducted experiments on the eight classification tasks described in the paper. In addition, the code supports any other classification task with only minor adjustments to your dataset and configuration. Here are the instructions for running experiments with your own data.

First, please format your data as follows and put it into the corresponding path (a small formatting sketch follows the lists below).

Task adaptive pre-training:

Input dataset (./data/tapt_data/<task_name>/):

  • train: text \t label per line
  • dev: text \t label per line

Output: the trained model will be saved automatically to the specified output directory (e.g., ./models/<task_name>_TAPT/ in the commands below), and the loss will be printed.

Fine-tuning:

Input dataset (./data/<task_name>/):

  • train: text \t label per line
  • dev: text \t label per line
  • test: text \t label per line
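Each file is thus a plain tab-separated file with one example per line. A tiny sketch of writing data in this format follows (the task name my_task, the texts, and the labels are made up for illustration).

# Tiny sketch of writing examples in the expected "text \t label" format.
# The task name, texts, and labels below are made up for illustration.
examples = [
    ("the screen is bright and the battery lasts all day", "1"),
    ("stopped working after a week, very disappointed", "2"),
]
with open("./data/my_task/train.tsv", "w", encoding="utf-8") as f:
    for text, label in examples:
        f.write(f"{text}\t{label}\n")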

Then, please modify the configuration file at ./TDNA/config.py:

  1. define the desired evaluation metric in glue_compute_metrics(), e.g.,
elif task_name == "ag":
    return {"acc_and_f1": acc_and_f1(preds, labels)}
  2. create a new processor specifying the labels, e.g.,
class agProcessor(generalProcessor):
    def get_labels(self):
        return ['1', '2', '3', '4']
  3. specify the number of labels, e.g.,
glue_tasks_num_labels = {
    "citation_intent": 6,
    "ag": 4,
    "amazon": 2,
    "chemprot": 13,
    "hyperpartisan_news": 2,
    "imdb": 2,
    "rct-20k": 5,
    "sciie": 7,
    "SST2": 2
}
  4. include the new processor in glue_processors, e.g.,
glue_processors = {
    "citation_intent": citation_intentProcessor,
    "ag": agProcessor,
    "amazon": amazonProcessor,
    "chemprot": chemprotProcessor,
    "hyperpartisan_news": hyperpartisan_newsProcessor,
    "imdb": imdbProcessor,
    "rct-20k": rct_20kProcessor,
    "sciie": sciieProcessor,
    "SST2": SST2Processor
}
  5. specify the output mode in glue_output_modes, e.g.,
glue_output_modes = {
    "citation_intent": "classification",
    "ag": "classification",
    "amazon": "classification",
    "chemprot": "classification",
    "hyperpartisan_news": "classification",
    "imdb": "classification",
    "rct-20k": "classification",
    "sciie": "classification",
    "SST2": "classification"
}
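Putting these five steps together, registering a hypothetical binary task (called my_task here purely for illustration) would amount to edits of roughly the following shape in ./TDNA/config.py, or equivalently adding the corresponding entries directly to the dict literals above.

# Hypothetical additions to ./TDNA/config.py for a new binary task "my_task".
# The task name, labels, and metric choice are illustrative assumptions.

# 1. in glue_compute_metrics():
#     elif task_name == "my_task":
#         return {"acc_and_f1": acc_and_f1(preds, labels)}

# 2. a processor listing the label set
class my_taskProcessor(generalProcessor):
    def get_labels(self):
        return ['0', '1']

# 3. number of labels
glue_tasks_num_labels["my_task"] = 2

# 4. register the processor
glue_processors["my_task"] = my_taskProcessor

# 5. output mode
glue_output_modes["my_task"] = "classification"

The new task should then be usable as --task_name in the run commands below, with its data placed under ./data/my_task/.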

Run

For FT,

python ./examples/run_classification.py --model_name_or_path roberta-base \
--task_name <task_name> --max_seq_length 256 --per_device_train_batch_size 16 \
--learning_rate 4e-5 --num_train_epochs 3.0 --output_dir ./results/<task_name>_FT/ \
--data_dir ./data/<task_name>/ --Ngram_path ./ngram/pmi_<task_name>_ngram.txt \
--fasttext_model_path ./ngram/<task_name>.npy --overwrite_output_dir

For TAPT + FT,

python ./examples/run_language_modeling.py \
--output_dir=./models/<task_name>_TAPT/ --model_type=roberta  --overwrite_output_dir \
--model_name_or_path=roberta-base --train_data_file=./data/tapt_data/<task_name>/train.tsv \
--eval_data_file=./data/tapt_data/<task_name>/dev.tsv --mlm --line_by_line \
--Ngram_path ./ngram/pmi_<task_name>_ngram.txt --num_train_epochs 10.0 \
--fasttext_model_path ./ngram/<task_name>.npy --learning_rate 4e-5

python ./examples/run_classification.py \
--model_name_or_path ./models/<task_name>_TAPT \
--task_name <task_name> --max_seq_length 256 --per_device_train_batch_size 16 \
--learning_rate 2e-5 --num_train_epochs 5.0 --output_dir ./results/<task_name>_TAPT_FT/ \
--data_dir ./data/<task_name>/ --Ngram_path ./ngram/pmi_<task_name>_ngram.txt --overwrite_output_dir --save_steps 5000

Output:

The run_classification.py program will automatically save the trained model to the results folder and print out the loss, accuracy, and F1 score. In addition, you can find the prediction results in args.output_dir/test_pred_{task_name}.txt. Take test_pred_ag.txt as an example:

input   label   pred
Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul. 3       3
SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.      4       4
...
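If you want to analyze the predictions programmatically, the file is tab-separated with a header row, so a rough loader could look like the sketch below (the path and the column handling are assumptions based on the example above).

# Rough sketch for loading a prediction file written by run_classification.py.
# Assumes a tab-separated file with a header row; the path below is illustrative.
path = "./results/ag_FT/test_pred_ag.txt"
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()[1:]          # skip the "input  label  pred" header
pairs = [line.split("\t")[-2:] for line in lines if line.strip()]
correct = sum(1 for label, pred in pairs if label == pred)
print(f"accuracy on {len(pairs)} examples: {correct / len(pairs):.4f}")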

Contact information

For help or issues using T-DNA, please submit a GitHub issue.

For personal communication related to T-DNA, please contact Shizhe Diao ([email protected]).

Citation

If you use or extend our work, please cite the following paper:

@inproceedings{DXSJSZ2021,
    title = "Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation",
    author = "Diao, Shizhe  and
      Xu, Ruijia  and
      Su, Hongjin  and
      Jiang, Yilei  and
      Song, Yan  and
      Zhang, Tong",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.259",
    doi = "10.18653/v1/2021.acl-long.259",
    pages = "3336--3349",
}