Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Overview

Transformer Quantization

This repository contains the implementation and experiments for the paper presented in

Yelysei Bondarenko1, Markus Nagel1, Tijmen Blankevoort1, "Understanding and Overcoming the Challenges of Efficient Transformer Quantization", EMNLP 2021. [ACL Anthology] [ArXiv]

1 Qualcomm AI Research (Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.)

Reference

If you find our work useful, please cite

@inproceedings{bondarenko-etal-2021-understanding,
    title = "Understanding and Overcoming the Challenges of Efficient Transformer Quantization",
    author = "Bondarenko, Yelysei  and
      Nagel, Markus  and
      Blankevoort, Tijmen",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.627",
    pages = "7947--7969",
    abstract = "Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges {--} namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme {--} per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss. Our source code is available at \url{https://github.com/qualcomm-ai-research/transformer-quantization}.",
}

How to install

First, ensure locale variables are set as follows:

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

Second, make sure to have Python ≥3.6 (tested with Python 3.6.8) and ensure the latest version of pip (tested with 21.2.4):

pip install --upgrade --no-deps pip

Next, install PyTorch 1.4.0 with the appropriate CUDA version (tested with CUDA 10.0, CuDNN 7.6.3):

pip install torch==1.4.0 torchvision==0.5.0 -f https://download.pytorch.org/whl/torch_stable.html

Finally, install the remaining dependencies using pip:

pip install -r requirements.txt

To run the code, the project root directory needs to be added to your pythonpath:

export PYTHONPATH="${PYTHONPATH}:/path/to/this/dir"

Running experiments

The main run file to reproduce all experiments is main.py. It contains 4 commands to train and validate FP32 and quantized model:

Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  train-baseline
  train-quantized
  validate-baseline
  validate-quantized

You can see the full list of options for each command using python main.py [COMMAND] --help.

A. FP32 fine-tuning

To start with, you need to get the fune-tuned model(s) for the GLUE task of interest. Example run command for fine-tuning:

python main.py train-baseline --cuda --save-model --model-name bert_base_uncased --task rte \
    --learning-rate 3e-05 --batch-size 8 --eval-batch-size 8 --num-epochs 3 --max-seq-length 128 \
    --seed 1000 --output-dir /path/to/output/dir/

You can also do it directly using HuggingFace library [examples]. In all experiments we used seeds 1000 - 1004 and reported the median score. The sample output directory looks as follows:

/path/to/output/dir
├── config.out
├── eval_results_rte.txt
├── final_score.txt
├── out
│   ├── config.json  # Huggingface model config
│   ├── pytorch_model.bin  # PyTorch model checkpoint
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json  # Huggingface tokenizer config
│   ├── training_args.bin
│   └── vocab.txt  # Vocabulary
└── tb_logs  # TensorBoard logs
    ├── 1632747625.1250594
    │   └── events.out.tfevents.*
    └── events.out.tfevents.*

For validation (both full-precision and quantized), it is assumed that these output directories with the fine-tuned checkpoints are aranged as follows (you can also use a subset of GLUE tasks):

/path/to/saved_models/
├── rte/rte_model_dir
│   ├── out
│   │   ├── config.json  # Huggingface model config
│   │   ├── pytorch_model.bin  # PyTorch model checkpoint
│   │   ├── tokenizer_config.json  # Huggingface tokenizer config
│   │   ├── vocab.txt  # Vocabulary
│   │   ├── (...)
├── cola/cola_model_dir
│   ├── out
│   │   ├── (...)
├── mnli/mnli_model_dir
│   ├── out
│   │   ├── (...)
├── mrpc/mrpc_model_dir
│   ├── out
│   │   ├── (...)
├── qnli/qnli_model_dir
│   ├── out
│   │   ├── (...)
├── qqp/qqp_model_dir
│   ├── out
│   │   ├── (...)
├── sst2/sst2_model_dir
│   ├── out
│   │   ├── (...)
└── stsb/stsb_model_dir
    ├── out
    │   ├── (...)

Note, that you have to create this file structure manually.

The model can then be validated as follows:

python main.py validate-baseline --eval-batch-size 32 --seed 1000 --model-name bert_base_uncased \
    --model-path /path/to/saved_models/ --task rte

You can also validate multiple or all checkpoints by specifying --task --task [...] or --task all, respectively.

B. Post-training quantization (PTQ)

1) Standard (naïve) W8A8 per-tensor PTQ / base run command for all PTQ experiments

python main.py validate-quantized --act-quant --weight-quant --no-pad-to-max-length \
	--est-ranges-no-pad --eval-batch-size 16 --seed 1000 --model-path /path/to/saved_models/ \
	--task rte --n-bits 8 --n-bits-act 8 --qmethod symmetric_uniform \
	--qmethod-act asymmetric_uniform --weight-quant-method MSE --weight-opt-method golden_section \
	--act-quant-method current_minmax --est-ranges-batch-size 1 --num-est-batches 1 \
	--quant-setup all

Note that the range estimation settings are slightly different for each task.

2) Mixed precision W8A{8,16} PTQ

Specify --quant-dict "{'y': 16, 'h': 16, 'x': 16}":

  • 'x': 16 will set FFN's input to 16-bit
  • 'h': 16 will set FFN's output to 16-bit
  • 'y': 16 will set FFN's residual sum to 16-bit

For STS-B regression task, you will need to specify --quant-dict "{'y': 16, 'h': 16, 'x': 16, 'P': 16, 'C': 16}" and --quant-setup MSE_logits, which will also quantize pooler and the final classifier to 16-bit and use MSE estimator for the output.

3) Per-embedding and per-embedding-group (PEG) activation quantization

  • --per-embd -- Per-embedding quantization for all activations
  • --per-groups [N_GROUPS] -- PEG quantization for all activations, no permutation
  • --per-groups [N_GROUPS] --per-groups-permute -- PEG quantization for all activations, apply range-based permutation (separate for each quantizer)
  • --quant-dict "{'y': 'ng6', 'h': 'ng6', 'x': 'ng6'}" -- PEG quantization using 6 groups for FFN's input, output and residual sum, no permutation
  • --quant-dict "{'y': 'ngp6', 'h': 'ngp6', 'x': 'ngp6'}" --per-groups-permute-shared-h -- PEG quantization using 6 groups for FFN's input, output and residual sum, apply range-based permutation (shared between tensors in the same layer)

4) W4A32 PTQ with AdaRound

python main.py validate-quantized --weight-quant --no-act-quant --no-pad-to-max-length \
	--est-ranges-no-pad --eval-batch-size 16 --seed 1000 --model-path /path/to/saved_models/ \
	--task rte --qmethod symmetric_uniform --qmethod-act asymmetric_uniform --n-bits 4 \
	--weight-quant-method MSE --weight-opt-method grid --num-candidates 100 --quant-setup all \
	--adaround all --adaround-num-samples 1024 --adaround-init range_estimator \
	--adaround-mode learned_hard_sigmoid --adaround-asym --adaround-iters 10000 \
	--adaround-act-quant no_act_quant

C. Quantization-aware training (QAT)

Base run command for QAT experiments (using W4A8 for example):

python main.py train-quantized --cuda --do-eval --logging-first-step --weight-quant --act-quant \
	--pad-to-max-length --learn-ranges --tqdm --batch-size 8 --seed 1000 \
	--model-name bert_base_uncased --learning-rate 5e-05 --num-epochs 6 --warmup-steps 186 \
	--weight-decay 0.0 --attn-dropout 0.0 --hidden-dropout 0.0 --max-seq-length 128 --n-bits 4 \
	--n-bits-act 8 --qmethod symmetric_uniform --qmethod-act asymmetric_uniform \
	--weight-quant-method MSE --weight-opt-method golden_section --act-quant-method current_minmax \
	--est-ranges-batch-size 16 --num-est-batches 1 --quant-setup all \
	--model-path /path/to/saved_models/rte/out --task rte --output-dir /path/to/qat_output/dir

Note that the settings are slightly different for each task (see Appendix).

To run mixed-precision QAT with 2-bit embeddings and 4-bit weights, add --quant-dict "{'Et': 2}".

Owner
An initiative of Qualcomm Technologies, Inc.
code for "Feature Importance-aware Transferable Adversarial Attacks"

Feature Importance-aware Attack(FIA) This repository contains the code for the paper: Feature Importance-aware Transferable Adversarial Attacks (ICCV

Hengchang Guo 44 Nov 24, 2022
DeFMO: Deblurring and Shape Recovery of Fast Moving Objects (CVPR 2021)

Evaluation, Training, Demo, and Inference of DeFMO DeFMO: Deblurring and Shape Recovery of Fast Moving Objects (CVPR 2021) Denys Rozumnyi, Martin R. O

Denys Rozumnyi 139 Dec 26, 2022
3D dataset of humans Manipulating Objects in-the-Wild (MOW)

MOW dataset [Website] This repository maintains our 3D dataset of humans Manipulating Objects in-the-Wild (MOW). The dataset contains 512 images in th

Zhe Cao 28 Nov 06, 2022
A 3D sparse LBM solver implemented using Taichi

taichi_LBM3D Background Taichi_LBM3D is a 3D lattice Boltzmann solver with Multi-Relaxation-Time collision scheme and sparse storage structure impleme

Jianhui Yang 121 Jan 06, 2023
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022
Fbone (Flask bone) is a Flask (Python microframework) starter/template/bootstrap/boilerplate application.

Fbone (Flask bone) is a Flask (Python microframework) starter/template/bootstrap/boilerplate application.

Wilson 1.7k Dec 30, 2022
LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context This Repository contains the code on AVA of our ACM MM 2021 paper: LSTC: Boosting

Tencent YouTu Research 9 Oct 11, 2022
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

PyTorch Large-Scale Language Model A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset Latest Results 39.98 Perp

Ryan Spring 114 Nov 04, 2022
CTC segmentation python package

CTC segmentation CTC segmentation can be used to find utterances alignments within large audio files. This repository contains the ctc-segmentation py

Ludwig Kürzinger 217 Jan 04, 2023
Supplementary code for the AISTATS 2021 paper "Matern Gaussian Processes on Graphs".

Matern Gaussian Processes on Graphs This repo provides an extension for gpflow with Matérn kernels, inducing variables and trainable models implemente

41 Dec 17, 2022
Merlion: A Machine Learning Framework for Time Series Intelligence

Merlion: A Machine Learning Library for Time Series Table of Contents Introduction Installation Documentation Getting Started Anomaly Detection Foreca

Salesforce 2.8k Dec 30, 2022
This was initially the repo for the project of [email protected] of Asaf Mazar, Millad Kassaie and Georgios Chochlakis named "Powered by the Will? Exploring Lay Theories of Behavior Change through Social Media"

Subreddit Analysis This repo includes tools for Subreddit analysis, originally developed for our class project of PSYC 626 in USC, titled "Powered by

Georgios Chochlakis 1 Dec 17, 2021
Local-Global Stratified Transformer for Efficient Video Recognition

DualFormer This repo is the implementation of our manuscript entitled "Local-Global Stratified Transformer for Efficient Video Recognition". Our model

Sea AI Lab 19 Dec 07, 2022
TensorFlow implementation of Elastic Weight Consolidation

Elastic weight consolidation Introduction A TensorFlow implementation of elastic weight consolidation as presented in Overcoming catastrophic forgetti

James Stokes 67 Oct 11, 2022
List of content farm sites like g.penzai.com.

内容农场网站清单 Google 中文搜索结果包含了相当一部分的内容农场式条目,比如「小 X 知识网」「小 X 百科网」。此种链接常会 302 重定向其主站,页面内容为自动生成,大量堆叠关键字,揉杂一些爬取到的内容,完全不具可读性和参考价值。 尤为过分的是,该类网站可能有成千上万个分身域名被 Goog

WDMPA 541 Jan 03, 2023
[NeurIPS 2021] The PyTorch implementation of paper "Self-Supervised Learning Disentangled Group Representation as Feature"

IP-IRM [NeurIPS 2021] The PyTorch implementation of paper "Self-Supervised Learning Disentangled Group Representation as Feature". Codes will be relea

Wang Tan 67 Dec 24, 2022
Driller: augmenting AFL with symbolic execution!

Driller Driller is an implementation of the driller paper. This implementation was built on top of AFL with angr being used as a symbolic tracer. Dril

Shellphish 791 Jan 06, 2023
Official pytorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion This repository contains a pytorch implementation of "Learning to Listen: Modeling

50 Dec 17, 2022
一些经典的CTR算法的复现; LR, FM, FFM, AFM, DeepFM,xDeepFM, PNN, DCN, DCNv2, DIFM, AutoInt, FiBiNet,AFN,ONN,DIN, DIEN ... (pytorch, tf2.0)

CTR Algorithm 根据论文, 博客, 知乎等方式学习一些CTR相关的算法 理解原理并自己动手来实现一遍 pytorch & tf2.0 保持一颗学徒的心! Schedule Model pytorch tensorflow2.0 paper LR ✔️ ✔️ \ FM ✔️ ✔️ Fac

luo han 149 Dec 20, 2022
NAACL'2021: Factual Probing Is [MASK]: Learning vs. Learning to Recall

OptiPrompt This is the PyTorch implementation of the paper Factual Probing Is [MASK]: Learning vs. Learning to Recall. We propose OptiPrompt, a simple

Princeton Natural Language Processing 150 Dec 20, 2022