DEMix Layers for Modular Language Modeling


DEMix

This repository contains modeling utilities for "DEMix Layers: Disentangling Domains for Modular Language Modeling" (Gururangan et al., 2021).

This code is a fork of Fairseq. It is based on Python 3.8 and CUDA 11, and it depends on PyTorch 1.8.0, NCCL 2.8.4, and apex.

Dataset

The multidomain dataset scripts are housed in another repository, located here. Clone that repository and follow its instructions to set up the data to train on.

Follow that tutorial to generate data-bins for eight (small) example domains.

Make sure to set the DATA_DIR accordingly.
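
For example, assuming the DEMix-data tutorial wrote its output under a directory of your choosing (the path below is a placeholder, not a real location):

# placeholder path; point this at the output of the DEMix-data tutorial
export DATA_DIR=/path/to/demix-data/example_domains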

Fairseq Installation

If you've already made an environment from the dataset creation phase, just use that. Otherwise:

conda create --name demix python=3.8
conda activate demix
cd demix/
pip install --editable .

Additionally, make sure you have the dependencies listed above (PyTorch 1.8.0, NCCL 2.8.4, and apex) installed; check the Fairseq documentation for more information.
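
The exact setup depends on your cluster; as a minimal sketch (the CUDA 11.1 wheel tag and the apex build flags below are assumptions, not part of this repository's scripts), the dependencies could be installed along these lines:

# PyTorch 1.8.0 built for CUDA 11.x; NCCL ships inside the PyTorch binary
pip install torch==1.8.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

# NVIDIA apex with CUDA/C++ extensions (requires a local CUDA toolkit matching PyTorch)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..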

Tutorial

Here we follow a tutorial for training on the example domains from the DEMix-data repository. Note that the model that results from this tutorial is pretty bad, because we're working with very small amounts of data and a small LM. This tutorial is meant to help you quickly understand the pipeline and ensure that each script completes successfully.

To replicate the DEMix paper with a GPT-3 model, follow the instructions here.

Basic Training

After setting up the example domains, run the following to train a small language model. Note that the scripts in this repository assume you are running on a multi-node GPU cluster with SLURM.

First, allocate some nodes with GPUs that have at least 32GB of memory. Here we allocate one node with eight 32GB Volta GPUs.

salloc --gpus-per-node 8 --nodes 1  -C 'volta32gb' --ntasks-per-node 8 --cpus-per-task 10 --mem 400G --time XXX --partition YYY

Then run:

export NUM_GPUS=8
export DISTRIBUTED_PORT=12345
export MODEL=transformer_lm
export EXPERIMENT=demix
# $DATA_DIR was set in DEMix-data tutorial.
export DATA_BIN=${DATA_DIR}/data-bin/
export EXPERIMENT_SUFFIX=tutorial
export SERIALIZATION_DIR=$(pwd)/demix_tutorial_model
bash tutorial/train.sh $NUM_GPUS \
                    $DISTRIBUTED_PORT \
                    $MODEL \
                    $EXPERIMENT \
                    $DATA_BIN \
                    $SERIALIZATION_DIR \
                    $EXPERIMENT_SUFFIX

This will output a trained language model in ${SERIALIZATION_DIR}.
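
With EXPERIMENT=demix, each GPU rank trains its own domain expert, so the serialization directory should contain one checkpoint per rank (the listing below is illustrative):

ls ${SERIALIZATION_DIR}
# expect per-rank expert checkpoints, e.g.
# checkpoint_last-rank-0.pt  checkpoint_last-rank-1.pt  ...  checkpoint_last-rank-7.pt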

To train the balanced dense LM, set export EXPERIMENT=dense; to train the unbalanced dense LM, set export EXPERIMENT=unbalanced; to train the "+Domain Token" LM, set export EXPERIMENT=domain_token.
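
For example, to train the balanced dense baseline with the same interface (writing to a separate serialization directory, which is our assumption here so the runs don't overwrite each other):

export EXPERIMENT=dense
export SERIALIZATION_DIR=$(pwd)/dense_tutorial_model
bash tutorial/train.sh $NUM_GPUS \
                    $DISTRIBUTED_PORT \
                    $MODEL \
                    $EXPERIMENT \
                    $DATA_BIN \
                    $SERIALIZATION_DIR \
                    $EXPERIMENT_SUFFIX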

We have also provided a simple script, demix/train.sh, with the same interface and all hyperparameters preset, to help replicate the results in the paper.

Evaluation

There are two ways to evaluate the DEMix language model: with and without mixing experts.

Evaluating without mixing experts

To evaluate the language model without mixing experts, supply the checkpoint saved by the GPU at a particular rank (this selects the domain expert that was trained on that GPU):

export DATA_BIN=${DATA_DIR}/data-bin/
export GPU_RANK=0
export PATH_TO_CHECKPOINT=${SERIALIZATION_DIR}/checkpoint_last-rank-${GPU_RANK}.pt
export OUTPUT_PATH=eval_output.jsonl
export SPLIT=valid
export DOMAIN=imdb
bash tutorial/eval_lm.sh $DATA_BIN $PATH_TO_CHECKPOINT $OUTPUT_PATH $SPLIT $DOMAIN

To evaluate on test data, set export SPLIT=test.

The same script is used for the other baselines.

For the +Domain Token model, you can additionally supply a domain token to use at test time:

export DOMAIN_TOKEN=XXX
bash tutorial/eval_lm.sh $DATA_BIN $PATH_TO_CHECKPOINT $OUTPUT_PATH $SPLIT $DOMAIN $DOMAIN_TOKEN

Evaluating with mixing experts

First, we estimate the posterior distribution over domain experts using 100 sequences of validation data from the target domain, with the following command:

export DATA_BIN=${DATA_DIR}/data-bin
export DOMAIN=imdb
export DEV_POSTERIOR_OUTPUT=dev_posteriors.jsonl
# set NUM_EVALUATION_GPUS equal to the number of experts you'd like to ensemble.
export NUM_EVALUATION_GPUS=8;
bash tutorial/mix_eval_lm.sh $NUM_EVALUATION_GPUS $DATA_BIN  ${SERIALIZATION_DIR}/checkpoint_last-rank-0.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-1.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-2.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-3.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-4.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-5.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-6.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-7.pt $DOMAIN $DEV_POSTERIOR_OUTPUT estimate;

Then we read $DEV_POSTERIOR_OUTPUT, extracting the exp_avg_posterior value from the last line of that file:

export POSTERIOR=$(tail -n 1 $DEV_POSTERIOR_OUTPUT | jq -rc '.exp_avg_posterior | join(",")')
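
The result is a single comma-separated string with one probability per expert; it is worth printing it before reusing it:

# one value per ensembled expert (8 in this tutorial), summing to roughly 1
echo $POSTERIOR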

We use this posterior as the domain prior (supplied as a string) when evaluating on test data, like so:

bash tutorial/mix_eval_lm.sh $NUM_EVALUATION_GPUS $DATA_BIN  ${SERIALIZATION_DIR}/checkpoint_last-rank-0.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-1.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-2.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-3.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-4.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-5.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-6.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-7.pt $DOMAIN $DEV_POSTERIOR_OUTPUT eval $POSTERIOR cached_prior;

Adapting the Language Model

We additionally provide scripts to adapt the language model to a new domain.

DEMix DAPT

In this tutorial, we just adapt one of the existing experts to a new example domain in the demix-data project, located in /path/to/demix-data/new_example_domains.

First, we need to figure out which domain expert has the most affinity to the target domain we want to adapt to:

export NEW_DATA_BIN=/path/to/demix-data/new_example_domains/data-bin/
export NEW_DOMAIN=acl_papers
export DEV_POSTERIOR_OUTPUT=${NEW_DOMAIN}_posterior.jsonl
# set NUM_EVALUATION_GPUS equal to the number of experts you'd like to ensemble.
export NUM_EVALUATION_GPUS=8;
bash tutorial/mix_eval_lm.sh $NUM_EVALUATION_GPUS $NEW_DATA_BIN  ${SERIALIZATION_DIR}/checkpoint_last-rank-0.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-1.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-2.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-3.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-4.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-5.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-6.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-7.pt $NEW_DOMAIN $DEV_POSTERIOR_OUTPUT estimate;
export POSTERIOR=$(tail -n 1 $DEV_POSTERIOR_OUTPUT | jq -rc '.exp_avg_posterior | join(",")')
echo $POSTERIOR

Here, we find that the most likely expert is expert number 5.
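
If you prefer not to eyeball the probabilities, a convenience one-liner (not part of the provided scripts) can pick out the index of the largest posterior entry:

# print the 0-indexed rank of the expert with the highest estimated posterior
tail -n 1 $DEV_POSTERIOR_OUTPUT | jq -r '.exp_avg_posterior | index(max)'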

We then adapt expert 5 to the target domain with DEMix DAPT, using the tutorial/dapt.sh script:

export PATH_TO_CHECKPOINT=${SERIALIZATION_DIR}/checkpoint_last-rank-5.pt
export UNFREEZE_PARAMETERS=feedforward
export NEW_SERIALIZATION_DIR=$(pwd)/${NEW_DOMAIN}_demix_dapt
export EXPERIMENT_SUFFIX=test
bash tutorial/dapt.sh $NEW_DATA_BIN $NEW_DOMAIN $PATH_TO_CHECKPOINT $UNFREEZE_PARAMETERS $NEW_SERIALIZATION_DIR $EXPERIMENT_SUFFIX

Once this is trained, you can add that expert to your ensemble when evaluating on new data:

export NEW_DATA_BIN=/path/to/demix-data/new_example_domains/data-bin/
export NEW_DOMAIN=acl_papers
export DEV_POSTERIOR_OUTPUT=${NEW_DOMAIN}_posterior.jsonl
# set NUM_EVALUATION_GPUS equal to the number of experts you'd like to ensemble.
export NUM_EVALUATION_GPUS=8;
export PATH_TO_NEW_EXPERT=${NEW_SERIALIZATION_DIR}/checkpoint_last-rank-0.pt
bash tutorial/mix_eval_lm.sh $NUM_EVALUATION_GPUS $NEW_DATA_BIN  ${SERIALIZATION_DIR}/checkpoint_last-rank-0.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-1.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-2.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-3.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-4.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-5.pt:${SERIALIZATION_DIR}/checkpoint_last-rank-6.pt:${PATH_TO_NEW_EXPERT} $NEW_DOMAIN $DEV_POSTERIOR_OUTPUT estimate;
export POSTERIOR=$(tail -n 1 $DEV_POSTERIOR_OUTPUT | jq -rc '.exp_avg_posterior | join(",")')

Dense DAPT

If you want to do Dense DAPT instead, just change the environment variables:

export PATH_TO_CHECKPOINT=/path/to/dense/model/checkpoint_last.pt
export FEEDFORWARD_OR_FULL=full
export SERIALIZATION_DIR=$(pwd)/${NEW_DOMAIN}_dense_dapt
export EXPERIMENT_SUFFIX=test
bash tutorial/dapt.sh $NEW_DATA_BIN $NEW_DOMAIN $PATH_TO_CHECKPOINT $FEEDFORWARD_OR_FULL $SERIALIZATION_DIR $EXPERIMENT_SUFFIX