dont-stop-pretraining

Code associated with the Don't Stop Pretraining ACL 2020 paper.

Citation

@inproceedings{dontstoppretraining2020,
 author = {Suchin Gururangan and Ana Marasović and Swabha Swayamdipta and Kyle Lo and Iz Beltagy and Doug Downey and Noah A. Smith},
 title = {Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
 year = {2020},
 booktitle = {Proceedings of ACL},
}

Installation

conda env create -f environment.yml
conda activate domains

Working with the latest allennlp version

This repository works with a pinned allennlp version for reproducibility purposes. The pinned allennlp version relies on pytorch-transformers==1.2.0, which requires you to manually download custom transformer models to disk.

To run this code with the latest allennlp/transformers versions (and use the Hugging Face model repository to its full capacity), check out the branch latest-allennlp. Note that we haven't tested all models on this branch, so your results may vary from what we report in the paper.

If you'd like to use the pinned allennlp version, read on. Otherwise, check out latest-allennlp.

Available Pretrained Models

We've uploaded DAPT and TAPT models to the Hugging Face model hub.

DAPT models

Available DAPT models:

allenai/cs_roberta_base
allenai/biomed_roberta_base
allenai/reviews_roberta_base
allenai/news_roberta_base

TAPT models

Available TAPT models:

allenai/dsp_roberta_base_dapt_news_tapt_ag_115K
allenai/dsp_roberta_base_tapt_ag_115K
allenai/dsp_roberta_base_dapt_reviews_tapt_amazon_helpfulness_115K
allenai/dsp_roberta_base_tapt_amazon_helpfulness_115K
allenai/dsp_roberta_base_dapt_biomed_tapt_chemprot_4169
allenai/dsp_roberta_base_tapt_chemprot_4169
allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
allenai/dsp_roberta_base_tapt_citation_intent_1688
allenai/dsp_roberta_base_dapt_news_tapt_hyperpartisan_news_5015
allenai/dsp_roberta_base_dapt_news_tapt_hyperpartisan_news_515
allenai/dsp_roberta_base_tapt_hyperpartisan_news_5015
allenai/dsp_roberta_base_tapt_hyperpartisan_news_515
allenai/dsp_roberta_base_dapt_reviews_tapt_imdb_20000
allenai/dsp_roberta_base_dapt_reviews_tapt_imdb_70000
allenai/dsp_roberta_base_tapt_imdb_20000
allenai/dsp_roberta_base_tapt_imdb_70000
allenai/dsp_roberta_base_dapt_biomed_tapt_rct_180K
allenai/dsp_roberta_base_tapt_rct_180K
allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500
allenai/dsp_roberta_base_tapt_rct_500
allenai/dsp_roberta_base_dapt_cs_tapt_sciie_3219
allenai/dsp_roberta_base_tapt_sciie_3219

The final number in each model name above is the size of the TAPT pretraining dataset. Where two sizes exist for the same task (e.g. imdb_70000 vs. imdb_20000), the larger one is the curated TAPT model; curated TAPT models exist only for imdb, rct, and hyperpartisan_news.
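
These checkpoints can also be loaded directly from the Hugging Face hub with a recent version of the transformers library (rather than the pinned pytorch-transformers used by the rest of this repository). A minimal sketch, assuming transformers is installed:

from transformers import AutoModel, AutoTokenizer

# Any of the DAPT/TAPT model names listed above can be substituted here.
model_name = "allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # RoBERTa-base weights with DAPT + TAPT applied

Note that these are pretrained language models only; you still need to fine-tune them on the downstream task (e.g. with scripts/train.py below).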

Downloading Pretrained models

You can download a pretrained model using the scripts/download_model.py script.

Just supply a model type and serialization directory, like so:

python -m scripts.download_model \
        --model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
        --serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688

This will download the allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 model (trained for the Citation Intent corpus) into $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688.

Downloading data

All task data is available at public S3 URLs; see environments/datasets.py.

If you run the scripts/train.py command (see the next step), the relevant dataset(s) will be downloaded automatically using the URLs in environments/datasets.py. However, if you'd like to use the data outside of this repository, you will have to curl each dataset file individually, e.g. for chemprot:

curl -Lo train.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/train.jsonl
curl -Lo dev.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/dev.jsonl
curl -Lo test.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/test.jsonl
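
Each split is a JSON Lines file (one JSON object per line). Below is a minimal sketch for loading a split in Python; it assumes the usual text/label field convention, which may differ per dataset:

import json

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts (one example per line)."""
    examples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples

train = load_jsonl("train.jsonl")
print(len(train), "training examples")
print(train[0].keys())  # field names (e.g. text, label) may vary by dataset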

Example commands

Run basic RoBERTa model

The following command will train a RoBERTa classifier on the Citation Intent corpus. Check environments/datasets.py for other datasets you can pass to the --dataset flag.

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation_intent_base \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model roberta-base \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test

You can supply other downloaded models to this script by providing a path to the model:

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation-intent-dapt-dapt \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test

Perform hyperparameter search

First, install allentune: https://github.com/allenai/allentune

Modify search_space/classifier.jsonnet accordingly.

Then run:

allentune search \
            --experiment-name ag_search \
            --num-cpus 56 \
            --num-gpus 4 \
            --search-space search_space/classifier.jsonnet \
            --num-samples 100 \
            --base-config training_config/classifier.jsonnet  \
            --include-package dont_stop_pretraining

Modify --num-gpus and --num-samples accordingly.
