Pre-train or Annotate? Domain Adaptation with a Constrained Budget

This repo contains the code and data associated with the EMNLP 2021 paper "Pre-train or Annotate? Domain Adaptation with a Constrained Budget".

@inproceedings{bai-etal-2021-pre,
    title = "Pre-train or Annotate? Domain Adaptation with a Constrained Budget",
    author = "Bai, Fan  and
              Ritter, Alan  and
              Xu, Wei",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
}

Installation

git clone https://github.com/bflashcp3f/ProcBERT.git
cd ProcBERT
conda env create -f environment.yml
conda activate procbert

Data & Model Checkpoints

Three procedural-text datasets (WLP, PubMed and ChemSyn) can be downloaded here, and model checkpoints (ProcBERT and Proc-RoBERTa) are accessible through HuggingFace.
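As a rough sketch of how the checkpoints might be loaded with the `transformers` library: the Hub IDs `fbaigt/procbert` and `fbaigt/proc_roberta` and both helper names below are assumptions, not something this README states, so verify the IDs on the HuggingFace Hub before relying on them.

```python
# Assumed HuggingFace Hub IDs for the two checkpoints; verify before use.
CHECKPOINTS = {
    "procbert": "fbaigt/procbert",
    "proc_roberta": "fbaigt/proc_roberta",
}


def resolve_checkpoint(name):
    """Map a short model name to its (assumed) HuggingFace Hub ID."""
    if name not in CHECKPOINTS:
        raise ValueError(
            f"unknown checkpoint {name!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[name]


def load_checkpoint(name):
    """Download the tokenizer and encoder for a checkpoint (needs network access)."""
    # Imported lazily so resolve_checkpoint works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    model_id = resolve_checkpoint(name)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling `load_checkpoint("procbert")` should return a tokenizer and an encoder that can be used like any other BERT-style model in `transformers`. The training scripts in this repo select the model through `--lm_model` instead, so this helper is only useful for standalone experiments.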

Experiment

Setup

# After downloading the data, update the DATA_PATH variable in code/utils.py
DATA_PATH=<DATA_PATH>

Budget-aware Domain Adaptation Experiments (with EasyAdapt)

# Named Entity Recognition (NER)
python code/ner_da_budget.py \
  --lm_model procbert \
  --src_data pubmed \
  --tgt_data chemsyn \
  --gpu_ids 0,1 \
  --output_dir ./output/da/pubmed_chemsyn \
  --learning_rate 1e-5 \
  --task_name fa_ner \
  --batch_size 16 \
  --max_len 512 \
  --epochs 25 \
  --budget 700 \
  --alpha 1 \
  --save_model

# Relation Extraction (RE)
python code/rel_da_budget.py \
  --lm_model procbert \
  --src_data pubmed \
  --tgt_data chemsyn \
  --gpu_ids 0,1 \
  --output_dir ./output/da/pubmed_chemsyn \
  --learning_rate 1e-5 \
  --task_name fa_rel \
  --batch_size 48 \
  --max_len 256 \
  --epochs 5 \
  --budget 700 \
  --alpha 1 \
  --down_sample \
  --down_sample_rate 0.4 \
  --save_model
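EasyAdapt in the heading above presumably refers to the feature-augmentation method of Daumé III (2007), in which every feature vector is expanded into a shared copy, a source-only copy and a target-only copy. The sketch below shows that classic formulation on plain Python lists. The function name is hypothetical, and how `ner_da_budget.py` and `rel_da_budget.py` apply the idea to encoder outputs is not covered here.

```python
def easyadapt_augment(features, domain):
    """Return the EasyAdapt-augmented copy of a feature vector.

    Output layout: [shared | source-only | target-only].
    Source examples fill the shared and source-only blocks;
    target examples fill the shared and target-only blocks.
    """
    features = list(features)
    zeros = [0.0] * len(features)
    if domain == "source":
        return features + features + zeros
    if domain == "target":
        return features + zeros + features
    raise ValueError(f"domain must be 'source' or 'target', got {domain!r}")


source_vec = easyadapt_augment([1.0, 2.0], "source")
target_vec = easyadapt_augment([1.0, 2.0], "target")
print(source_vec)  # [1.0, 2.0, 1.0, 2.0, 0.0, 0.0]
print(target_vec)  # [1.0, 2.0, 0.0, 0.0, 1.0, 2.0]
```

A classifier trained on the augmented vectors can put weight on the shared block for patterns that transfer across domains and on the domain-specific blocks for patterns that do not.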

To obtain ProcBERT results with different budgets under six domain adaptation settings:

# NER
sh script/ner/run_ner_da_budget_all.sh

# RE
sh script/rel/run_rel_da_budget_all.sh
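The six settings presumably correspond to the ordered source/target pairs of the three datasets (3 x 2 = 6). As a minimal sketch under that assumption, the snippet below enumerates the pairs and builds the NER command shown earlier for each one. The helper names are hypothetical, `wlp` as a dataset identifier is a guess (only `pubmed` and `chemsyn` appear in the commands above), and the budget is left to the caller because this README only shows 700.

```python
from itertools import permutations

# "pubmed" and "chemsyn" appear in the commands above; "wlp" is assumed.
DATASETS = ["wlp", "pubmed", "chemsyn"]


def da_settings(datasets=DATASETS):
    """Return every ordered (source, target) pair of distinct datasets."""
    return list(permutations(datasets, 2))


def ner_da_command(src, tgt, budget, lm_model="procbert"):
    """Build the argument list for one budget-aware NER adaptation run."""
    return [
        "python", "code/ner_da_budget.py",
        "--lm_model", lm_model,
        "--src_data", src,
        "--tgt_data", tgt,
        "--gpu_ids", "0,1",
        "--output_dir", f"./output/da/{src}_{tgt}",
        "--learning_rate", "1e-5",
        "--task_name", "fa_ner",
        "--batch_size", "16",
        "--max_len", "512",
        "--epochs", "25",
        "--budget", str(budget),
        "--alpha", "1",
        "--save_model",
    ]


for src, tgt in da_settings():
    # Print each command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to launch the run.
    print(" ".join(ner_da_command(src, tgt, budget=700)))
```

The provided shell scripts remain the reference way to reproduce the results; this loop is only a convenience for inspecting or customizing individual runs.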

Budget-aware Target-domain-only Experiments

# Named Entity Recognition (NER) 
python code/ner_budget.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name ner \
  --batch_size 16 \
  --max_len 512 \
  --epochs 25 \
  --budget 700 \
  --save_model

# Relation Extraction (RE)
python code/rel_budget.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name rel \
  --batch_size 48 \
  --max_len 256 \
  --epochs 5 \
  --budget 700 \
  --down_sample \
  --down_sample_rate 0.4 \
  --save_model

To obtain ProcBERT results with different budgets on three datasets:

# NER
sh script/ner/run_ner_budget_all.sh

# RE
sh script/rel/run_rel_budget_all.sh

Auxiliary Experiments

# Named Entity Recognition (NER) 
python code/ner.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name ner \
  --batch_size 16 \
  --max_len 512 \
  --epochs 20 \
  --save_model

# Relation Extraction (RE)
python code/rel.py \
  --lm_model procbert \
  --data_name chemsyn \
  --gpu_ids 0,1  \
  --output_dir ./output/chemsyn \
  --learning_rate 1e-5 \
  --task_name rel \
  --batch_size 48 \
  --max_len 256 \
  --epochs 5 \
  --down_sample \
  --down_sample_rate 0.4 \
  --save_model

To obtain ProcBERT results on all three datasets:

# NER
sh script/ner/run_ner_all.sh

# RE
sh script/rel/run_rel_all.sh