Code for the paper "Are Sixteen Heads Really Better than One?"

Last update: Dec 14, 2022

Overview

Are Sixteen Heads Really Better than One?

This repository contains code to reproduce the experiments in our paper "Are Sixteen Heads Really Better than One?"

Prerequisite

First, you will need Python >= 3.6 with PyTorch >= 1.0. Then, clone our forks of fairseq (for the MT experiments) and pytorch-pretrained-BERT (for the BERT experiments):

# Fairseq
git clone https://github.com/pmichel31415/fairseq
# Pytorch pretrained BERT
git clone https://github.com/pmichel31415/pytorch-pretrained-BERT
cd pytorch-pretrained-BERT
git checkout paul
cd ..

If you run into issues with pytorch-pretrained-BERT (for instance, because you have another version installed globally), check out this workaround (thanks @insop).

You will also need sacrebleu to evaluate BLEU scores (pip install sacrebleu).

Ablation experiments

BERT

Running

bash experiments/BERT/heads_ablation.sh MNLI

will fine-tune a pretrained BERT on MNLI (the model is stored in ./models/MNLI) and perform the individual head ablation experiment from Section 3.1 of the paper. Alternatively, you can run the experiment with CoLA, MRPC or SST-2 as the task in place of MNLI.
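As a rough illustration of what "individual head ablation" means, the sketch below masks one head at a time and records the change in score relative to the unmasked model. It is not the code run by heads_ablation.sh: the function ablate_heads_individually, the toy_evaluate scorer and the TOY_CONTRIBUTION table are hypothetical stand-ins for a real model evaluation.

```python
from typing import Callable, Dict, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def ablate_heads_individually(
    evaluate: Callable[[Set[Head]], float],
    num_layers: int,
    num_heads: int,
) -> Dict[Head, float]:
    """Return, for each head, the score change when only that head is masked."""
    baseline = evaluate(set())  # score with every head active
    deltas: Dict[Head, float] = {}
    for layer in range(num_layers):
        for head in range(num_heads):
            # Mask exactly one head and compare against the baseline.
            deltas[(layer, head)] = evaluate({(layer, head)}) - baseline
    return deltas


# Hypothetical toy scorer: each head adds a fixed amount to the score,
# and masking a head removes its contribution. A real run would evaluate
# the fine-tuned model on the task's dev set instead.
TOY_CONTRIBUTION = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.25, (1, 1): 0.125}


def toy_evaluate(masked: Set[Head]) -> float:
    return sum(v for h, v in TOY_CONTRIBUTION.items() if h not in masked)


deltas = ablate_heads_individually(toy_evaluate, num_layers=2, num_heads=2)
for head, delta in sorted(deltas.items()):
    print(head, delta)
```

A delta close to zero suggests the head can be removed at test time with little effect on the score, while a large negative delta marks a head the model relies on.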

MT

You can obtain the pretrained WMT model from this link (which replaces the earlier link from the fairseq repo). Use the Moses tokenizer and subword-nmt in conjunction with the BPE codes provided with the pretrained model to prepare any input file you want. Then run:

bash experiments/MT/wmt_ablation.sh $BPE_SEGMENTED_SRC_FILE $DETOKENIZED_REF_FILE

Systematic Pruning Experiments

BERT

To iteratively prune 10% of the heads at a time, in order of increasing importance, run:

bash experiments/BERT/heads_pruning.sh MNLI --normalize_pruning_by_layer

This will reuse the fine-tuned BERT model if you have already run the ablation experiment (otherwise it will fine-tune one for you). The output is very verbose, but you can get the gist of the results by running grep "strategy\|results" -A1 on it.
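The pruning order described above can be sketched as follows. This is a minimal illustration rather than the repository's implementation: it assumes importance scores are already available and stay fixed between steps, and it assumes that --normalize_pruning_by_layer divides each score by the L2 norm of the scores in its layer. The names normalize_by_layer, pruning_schedule and toy_scores are hypothetical.

```python
import math
from typing import Dict, Iterator, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def normalize_by_layer(scores: Dict[Head, float]) -> Dict[Head, float]:
    """Divide each score by the L2 norm of its layer (assumed flag behavior)."""
    sq_norms: Dict[int, float] = {}
    for (layer, _), score in scores.items():
        sq_norms[layer] = sq_norms.get(layer, 0.0) + score * score
    return {
        head: score / (math.sqrt(sq_norms[head[0]]) or 1.0)  # guard all-zero layers
        for head, score in scores.items()
    }


def pruning_schedule(
    scores: Dict[Head, float], step: float = 0.1
) -> Iterator[Set[Head]]:
    """Yield growing sets of masked heads, least important heads first."""
    ranked = sorted(scores, key=scores.get)
    per_step = max(1, round(step * len(ranked)))
    # Stop before every head is masked, so at least one head always remains.
    for end in range(per_step, len(ranked), per_step):
        yield set(ranked[:end])


# Hypothetical importance scores for a 2-layer, 5-head model.
toy_scores = {
    (0, 0): 0.9, (0, 1): 0.1, (0, 2): 0.5, (0, 3): 0.7, (0, 4): 0.3,
    (1, 0): 0.2, (1, 1): 0.8, (1, 2): 0.4, (1, 3): 0.6, (1, 4): 1.0,
}

for masked in pruning_schedule(toy_scores):
    print(len(masked), sorted(masked))
```

To mimic the normalized variant, pass normalize_by_layer(toy_scores) to pruning_schedule instead of the raw scores. In a real run, the model would be evaluated after each step with the yielded heads masked.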

WMT

Similarly, just run:

bash experiments/MT/prune_wmt.sh $BPE_SEGMENTED_SRC_FILE $DETOKENIZED_REF_FILE

You might want to change the paths in the experiment files to point to the binarized fairseq dataset on which you want to estimate importance scores.

Owner

Paul Michel
