DEMix Layers for Modular Language Modeling

Related tags

Deep Learningdemix
Overview

DEMix

This repository contains modeling utilities for "DEMix Layers: Disentangling Domains for Modular Language Modeling" (Gururangan et. al, 2021).

This code is a fork of Fairseq. It is based on Python 3.8, CUDA 11 and includes PyTorch 1.8.0, NCCL 2.8.4 and apex.

Dataset

The multidomain dataset scripts are housed in another repository, located here. Clone that repository and follow instructions to setup data to train on.

Follow that tutorial to generate data-bins on eight (small) example domains.

Make sure to set the DATA_DIR accordingly.

Fairseq Installation

If you've already made an environment from the dataset creation phase, just use that. Otherwise:

conda create env --name demix
cd demix/
pip install --editable .

Additionally, please make sure you have the dependencies above installed (check Fairseq documentation for more information).

Tutorial

Here we will follow a tutorial to train on the example domains from the tutorial in the DEMix-data repository. Note that the model that results from this tutorial is pretty bad, because we're working with very small amounts of data and also a small LM. This tutorial is there to help you quickly understand the pipeline, and ensure that each script completes successfully.

To replicate the DEMix paper, with a GPT-3 model, follow the instructions here.

Basic Training

After setting up the example domains, run the following to train a small language model. Note that the scripts in this paper assume you are running on a multi-node GPU cluster with SLURM.

First, allocate some nodes, with GPUs with at least 32GB of RAM. Here we allocate 1 node with 8 volta32GB GPUs.

salloc --gpus-per-node 8 --nodes 1  -C 'volta32gb' --ntasks-per-node 8 --cpus-per-task 10 --mem 400G --time XXX --partition YYY

Then run:

export NUM_GPUS=8
export DISTRIBUTED_PORT=12345
export MODEL=transformer_lm
export EXPERIMENT=demix
# $DATA_DIR was set in DEMix-data tutorial.
export DATA_BIN=${DATA_DIR}/data-bin/
export EXPERIMENT_SUFFIX=tutorial
export SERIALIZATION_DIR=$(pwd)/demix_tutorial_model
bash tutorial/train.sh $NUM_GPUS \
                    $DISTRIBUTED_PORT \
                    $MODEL \
                    $EXPERIMENT \
                    $DATA_BIN \
                    $SERIALIZATION_DIR \
                    $EXPERIMENT_SUFFIX

This will output a trained language model in ${SERIALIZATION_DIR}

To train balanced dense LM, set export EXPERIMENT=dense, to train unbalanced dense LM, set export EXPERIMENT=unbalanced, to train "+Domain Token" LM , set export EXPERIMENT=domain_token.

We have provided a simple script demix/train.sh, with the same interface, with all hyperparameter preset to help replicate results in the paper.

Evaluation

We have two ways to evaluate the demix language model: with and without mixing experts.

Evaluating without mixing experts

To evaluate the language model without mixing experts, you can supply the checkpoint from a GPU on a particular rank (to specify the use of the domain expert that was trained on that GPU):

export DATA_BIN=${DATA_DIR}/data-bin/
export GPU_RANK=0
export PATH_TO_CHECKPOINT=${SERIALIZATION_DIR}/checkpoint_last-rank-${GPU_RANK}.pt
export OUTPUT_PATH=eval_output.jsonl
export SPLIT=valid
export DOMAIN=imdb
bash tutorial/eval_lm.sh $DATA_BIN $PATH_TO_CHECKPOINT $OUTPUT_PATH $SPLIT $DOMAIN

To evaluate on test data, set export SPLIT=test

The same script is used for the other baselines.

For the +domain token model, you can additionally supply a domain token to use at test time:

export DOMAIN_TOKEN=XXX
bash tutorial/eval_lm.sh $DATA_BIN $PATH_TO_CHECKPOINT $OUTPUT_PATH $SPLIT $DOMAIN $DOMAIN_TOKEN

Evaluating with mixing experts

First, we estimate the posterior distribution on 100 sequences of validation data of the domain using the following command:

export DATA_BIN=${DATA_DIR}/data-bin
export DOMAIN=imdb
export DEV_POSTERIOR_OUTPUT=dev_posteriors.jsonl
# set NUM_EVALUATION_GPUS equal to the number of experts you'd like to ensemble.
export NUM_EVALUATION_GPUS=8;
bash tutorial/mix_eval_lm.sh $NUM_EVALUATION_GPUS $DATA_BIN  ${SERIALIZATION_DIR}/checkpoint_last-rank-0.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-1.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-2.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-3.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-4.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-5.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-6.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-7.pt $DOMAIN $DEV_POSTERIOR_OUTPUT estimate;

Then, we open $POSTERIOR_OUTPUT, extracting the exp_avg_posterior value of the last line in that file:

export POSTERIOR=$(tail -n 1 $DEV_POSTERIOR_OUTPUT | jq -rc '.exp_avg_posterior | join(",")')

We use this posterior as the domain prior (supplied as a string) when evaluating on test data, like so:

bash tutorial/mix_eval_lm.sh $NUM_EVALUATION_GPUS $DATA_BIN  ${SERIALIZATION_DIR}/checkpoint_last-rank-0.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-1.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-2.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-3.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-4.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-5.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-6.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-7.pt $DOMAIN $DEV_POSTERIOR_OUTPUT eval $POSTERIOR cached_prior;

Adapting the Language Model

We additionally provide scripts to adapt the language model to a new domain.

DEMix DAPT

In this tutorial, we just adapt one of the existing experts to a new example domain in the demix-data project, located in /path/to/demix-data/new_example_domains.

First, we need to figure out which domain expert has the most affinity to the target domain we want to adapt to:

export NEW_DATA_BIN=/private/home/suching/demix-data/new_example_domains/data-bin/
export NEW_DOMAIN=acl_papers
export DEV_POSTERIOR_OUTPUT=${NEW_DOMAIN}_posterior.jsonl
# set NUM_EVALUATION_GPUS equal to the number of experts you'd like to ensemble.
export NUM_EVALUATION_GPUS=8;
bash tutorial/mix_eval_lm.sh $NUM_EVALUATION_GPUS $NEW_DATA_BIN  ${SERIALIZATION_DIR}/checkpoint_last-rank-0.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-1.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-2.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-3.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-4.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-5.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-6.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-7.pt $NEW_DOMAIN $DEV_POSTERIOR_OUTPUT estimate;
export POSTERIOR=$(tail -n 1 $DEV_POSTERIOR_OUTPUT | jq -rc '.exp_avg_posterior | join(",")')

Here, we find that the most likely expert is expert number 5.

export POSTERIOR=$(tail -n 1 $DEV_POSTERIOR_OUTPUT | jq -rc '.exp_avg_posterior | join(",")')
echo $POSTERIOR

We then adapt expert 5 to the target domain using the tutorial/dapt.sh script, using DEMix DAPT:

export PATH_TO_CHECKPOINT=${SERIALIZATION_DIR}/checkpoint_last-rank-5.pt
export UNFREEZE_PARAMETERS=feedforward
export NEW_SERIALIZATION_DIR=$(pwd)/${NEW_DOMAIN}_demix_dapt
export EXPERIMENT_SUFFIX=test
bash tutorial/dapt.sh $NEW_DATA_BIN $NEW_DOMAIN $PATH_TO_CHECKPOINT $UNFREEZE_PARAMETERS $NEW_SERIALIZATION_DIR $EXPERIMENT_SUFFIX

Once this is trained, you can add that expert to your ensemble when evaluating on new data:

export NEW_DATA_BIN=/path/to/demix-data/new_example_domains/data-bin/
export NEW_DOMAIN=acl_papers
export DEV_POSTERIOR_OUTPUT=${NEW_DOMAIN}_posterior.jsonl
# set NUM_EVALUATION_GPUS equal to the number of experts you'd like to ensemble.
export NUM_EVALUATION_GPUS=8;
export PATH_TO_NEW_EXPERT=${NEW_SERIALIZATION_DIR}/checkpoint_last-rank-0.pt
bash tutorial/mix_eval_lm.sh $NUM_EVALUATION_GPUS $NEW_DATA_BIN  ${SERIALIZATION_DIR}/checkpoint_last-rank-0.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-1.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-2.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-3.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-4.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-5.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-6.pt:${PATH_TO_NEW_EXPERT} $NEW_DOMAIN $DEV_POSTERIOR_OUTPUT estimate;
export POSTERIOR=$(tail -n 1 $DEV_POSTERIOR_OUTPUT | jq -rc '.exp_avg_posterior | join(",")')

Dense DAPT

If you wanted to do Dense DAPT instead, just change the environment variables:

export PATH_TO_CHECKPOINT=/path/to/dense/model/checkpoint_last.pt
export FEEDFORWARD_OR_FULL=full
export SERIALIZATION_DIR=$(pwd)/${NEW_DOMAIN}_dense_dapt
export EXPERIMENT_SUFFIX=test
bash tutorial/dapt.sh $NEW_DATA_BIN $NEW_DOMAIN $PATH_TO_CHECKPOINT $FEEDFORWARD_OR_FULL $SERIALIZATION_DIR $EXPERIMENT_SUFFIX
Owner
Suchin
Allen Institute for AI / Facebook AI
Suchin
Code for NeurIPS 2020 article "Contrastive learning of global and local features for medical image segmentation with limited annotations"

Contrastive learning of global and local features for medical image segmentation with limited annotations The code is for the article "Contrastive lea

Krishna Chaitanya 152 Dec 22, 2022
An implementation of DeepMind's Relational Recurrent Neural Networks in PyTorch.

relational-rnn-pytorch An implementation of DeepMind's Relational Recurrent Neural Networks (Santoro et al. 2018) in PyTorch. Relational Memory Core (

Sang-gil Lee 241 Nov 18, 2022
A minimalist environment for decision-making in autonomous driving

highway-env A collection of environments for autonomous driving and tactical decision-making tasks An episode of one of the environments available in

Edouard Leurent 1.6k Jan 07, 2023
Lolviz - A simple Python data-structure visualization tool for lists of lists, lists, dictionaries; primarily for use in Jupyter notebooks / presentations

lolviz By Terence Parr. See Explained.ai for more stuff. A very nice looking javascript lolviz port with improvements by Adnan M.Sagar. A simple Pytho

Terence Parr 785 Dec 30, 2022
Temporal Segment Networks (TSN) in PyTorch

TSN-Pytorch We have released MMAction, a full-fledged action understanding toolbox based on PyTorch. It includes implementation for TSN as well as oth

1k Jan 03, 2023
Official Implementation and Dataset of "PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask and Group-Level Consistency", CVPR 2021

Portrait Photo Retouching with PPR10K Paper | Supplementary Material PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask an

184 Dec 11, 2022
DexterRedTool - Dexter's Red Team Tool that creates cronjob/task scheduler to consistently creates users

DexterRedTool Author: Dexter Delandro CSEC 473 - Spring 2022 This tool persisten

2 Feb 16, 2022
Source code for Fixed-Point GAN for Cloud Detection

FCD: Fixed-Point GAN for Cloud Detection PyTorch source code of Nyborg & Assent (2020). Abstract The detection of clouds in satellite images is an ess

Joachim Nyborg 8 Dec 22, 2022
A colab notebook for training Stylegan2-ada on colab, transfer learning onto your own dataset.

Stylegan2-Ada-Google-Colab-Starter-Notebook A no thrills colab notebook for training Stylegan2-ada on colab. transfer learning onto your own dataset h

Harnick Khera 66 Dec 16, 2022
GuideDog is an AI/ML-based mobile app designed to assist the lives of the visually impaired, 100% voice-controlled

Guidedog Authors: Kyuhee Jo, Steven Gunarso, Jacky Wang, Raghav Sharma GuideDog is an AI/ML-based mobile app designed to assist the lives of the visua

Kyuhee Jo 5 Nov 24, 2021
This repository accompanies the ACM TOIS paper "What can I cook with these ingredients?" - Understanding cooking-related information needs in conversational search

In this repository you find data that has been gathered when conducting in-situ experiments in a conversational cooking setting. These data include tr

6 Sep 22, 2022
Official Pytorch implementation of "Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video", CVPR 2021

TCMR: Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video Qualtitative result Paper teaser video Introduction This r

Hongsuk Choi 215 Jan 06, 2023
Fast and Simple Neural Vocoder, the Multiband RNNMS

Multiband RNN_MS Fast and Simple vocoder, Multiband RNN_MS. Demo Quick training How to Use System Details Results References Demo ToDO: Link super gre

tarepan 5 Jan 11, 2022
Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".

AST: Audio Spectrogram Transformer Introduction Citing Getting Started ESC-50 Recipe Speechcommands Recipe AudioSet Recipe Pretrained Models Contact I

Yuan Gong 603 Jan 07, 2023
MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research

MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research

Facebook Research 338 Dec 29, 2022
Source codes for "Structure-Aware Abstractive Conversation Summarization via Discourse and Action Graphs"

Structure-Aware-BART This repo contains codes for the following paper: Jiaao Chen, Diyi Yang:Structure-Aware Abstractive Conversation Summarization vi

GT-SALT 56 Dec 08, 2022
CLIP (Contrastive Language–Image Pre-training) for Italian

Italian CLIP CLIP (Radford et al., 2021) is a multimodal model that can learn to represent images and text jointly in the same space. In this project,

Italian CLIP 114 Dec 29, 2022
Sharpened cosine similarity torch - A Sharpened Cosine Similarity layer for PyTorch

Sharpened Cosine Similarity A layer implementation for PyTorch Install At your c

Brandon Rohrer 203 Nov 30, 2022
High-fidelity 3D Model Compression based on Key Spheres

High-fidelity 3D Model Compression based on Key Spheres This repository contains the implementation of the paper: Yuanzhan Li, Yuqi Liu, Yujie Lu, Siy

5 Oct 11, 2022
Official code for 'Pixel-wise Energy-biased Abstention Learning for Anomaly Segmentationon Complex Urban Driving Scenes'

PEBAL This repo contains the Pytorch implementation of our paper: Pixel-wise Energy-biased Abstention Learning for Anomaly Segmentationon Complex Urba

Yu Tian 115 Dec 29, 2022