A library for researching neural networks compression and acceleration methods.

Overview

Model Compression Research Package

This package was developed to enable scalable, reusable and reproducable research of weight pruning, quantization and distillation methods with ease.

Installation

To install the library clone the repository and install using pip

git clone https://github.com/IntelLabs/Model-Compression-Research-Package
cd Model-Compression-Research-Package
pip install [-e] .

Add -e flag to install an editable version of the library.

Quick Tour

This package contains implementations of several weight pruning methods, knowledge distillation and quantization-aware training. Here we will show how to easily use those implementations with your existing model implementation and training loop. It is also possible to combine several methods together in the same training process. Please refer to the packages examples.

Weight Pruning

Weight pruning is a method to induce zeros in a models weight while training. There are several methods to prune a model and it is a widely explored research field.

To list the existing weight pruning implemtations in the package use model_compression_research.list_methods(). For example, applying unstructured magnitude pruning while training your model can be done with a few single lines of code

from model_compression_research import IterativePruningConfig, IterativePruningScheduler

training_args = get_training_args()
model = get_model()
dataloader = get_dataloader()
criterion = get_criterion()

# Initialize a pruning configuration and a scheduler and apply it on the model
pruning_config = IterativePruningConfig(
    pruning_fn="unstructured_magnitude",
    pruning_fn_default_kwargs={"target_sparsity": 0.9}
)
pruning_scheduler = IterativePruningScheduler(model, pruning_config)

# Initialize optimizer after initializing the pruning scheduler
optimizer = get_optimizer()

# Training loop
for e in range(training_args.epochs):
    for batch in dataloader:
        inputs, labels = 
        model.train()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # Call pruning scheduler step
        pruning_schduler.step()
        optimizer.zero_grad()

# At the end of training rmeove the pruning parts and get the resulted pruned model
pruning_scheduler.remove_pruning()

For using knowledge distillation with HuggingFace/transformers dedicated transformers Trainer see the implementation of HFTrainerPruningCallback in api_utils.py.

Knowledge Distillation

Model distillation is a method to distill the knowledge learned by a teacher to a smaller student model. A method to do that is to compute the difference between the student's and teacher's output distribution using KL divergence. In this package you can find a simple implementation that does just that.

Assuming that your teacher and student models' outputs are of the same dimension, you can use the implementation in this package as follows:

from model_compression_research import TeacherWrapper, DistillationModelWrapper

training_args = get_training_args()
teacher = get_teacher_trained_model()
student = get_student_model()
dataloader = get_dataloader()
criterion = get_criterion()

# Wrap teacher model with TeacherWrapper and set loss scaling factor and temperature
teacher = TeacherWrapper(teacher, ce_alpha=0.5, ce_temperature=2.0)
# Initialize the distillation model with the student and teacher
distillation_model = DistillationModelWrapper(student, teacher, alpha_student=0.5)

optimizer = get_optimizer()

# Training loop
for e in range(training_args.epochs):
    for batch in dataloader:
        inputs, labels = batch
        distillation_model.train()
        # Calculate student loss w.r.t labels as you usually do
        student_outputs = distillation_model(inputs)
        loss_wrt_labels = criterion(student_outputs, labels)
        # Add knowledge distillation term
        loss = distillation_model.compute_loss(loss_wrt_labels, student_outputs)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

For using knowledge distillation with HuggingFace/transformers see the implementation of HFTeacherWrapper and hf_add_teacher_to_student in api_utils.py.

Quantization-Aware Training

Quantization-Aware Training is a method for training models that will be later quantized at the inference stage, as opposed to other post-training quantization methods where models are trained without any adaptation to the error caused by model quantization.

A similar quantization-aware training method to the one introduced in Q8BERT: Quantized 8Bit BERT generelized to custom models is implemented in this package:

from model_compression_research import QuantizerConfig, convert_model_for_qat

training_args = get_training_args()
model = get_model()
dataloader = get_dataloader()
criterion = get_criterion()

# Initialize quantizer configuration
qat_config = QuantizerConfig()
# Convert model to quantization-aware training model
qat_model = convert_model_for_qat(model, qat_config)

optimizer = get_optimizer()

# Training loop
for e in range(training_args.epochs):
    for batch in dataloader:
        inputs, labels = 
        model.train()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Papers Implemented in Model Compression Research Package

Methods from the following papers were implemented in this package and are ready for use:

Citation

If you want to cite our paper and library, you can use the following:

@article{zafrir2021prune,
  title={Prune Once for All: Sparse Pre-Trained Language Models},
  author={Zafrir, Ofir and Larey, Ariel and Boudoukh, Guy and Shen, Haihao and Wasserblat, Moshe},
  journal={arXiv preprint arXiv:2111.05754},
  year={2021}
}
@software{zafrir_ofir_2021_5721732,
  author       = {Zafrir, Ofir},
  title        = {Model-Compression-Research-Package by Intel Labs},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.1.0},
  doi          = {10.5281/zenodo.5721732},
  url          = {https://doi.org/10.5281/zenodo.5721732}
}
Comments
  • Uniform magnitude pruning implementation problem

    Uniform magnitude pruning implementation problem

    Hello, when the uniform magnitude pruning method is set to "pruning_fn_default_kwargs": { "block_size": 8, "target_sparsity": 0.85 }, The model ends up retaining the parameter 0.75, why?

    opened by LYF915 13
  • Difference between end_pruning_step and policy_end_step

    Difference between end_pruning_step and policy_end_step

    Hi, Could you please clarify the difference between end_pruning_step and policy_end_step in the pruning config file (for example: https://github.com/IntelLabs/Model-Compression-Research-Package/blob/main/examples/transformers/language-modeling/config/iterative_unstructured_magnitude_90_config.json)?

    opened by eldarkurtic 6
  • Issue of max_seq_length in MLM pretraining data preprocessing

    Issue of max_seq_length in MLM pretraining data preprocessing

    Hi, I find that in the functions segment_pair_nsp_process and doc_sentences_process in examples/transformers/language-modeling/dataset_processing.py, the sequence length of the processed data is actually max_seq_length - tokenizer.num_special_tokens_to_add(pair=False) since variable max_seq_length is replaced by this value and have been passed to the tokenizer.prepare_for_model function. Such as user set max_seq_length=128, and the processed data will have a sequence length of 125. I'm not sure is it the standard way of pretraining data preprocessing?

    opened by XinyuYe-Intel 5
  • How to save QAT quantized model?

    How to save QAT quantized model?

    Hi, thank you for your model compression package. I am a little confused about how to save QAT quantized model. Do you have an official website or documentation for this package?

    opened by OctoberKat 4
  • LR scheduler clarification

    LR scheduler clarification

    Hi, Running the Language Modelling example (https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/examples/transformers/language-modeling) ends with a slightly different LR schedule compared to the one presented in the Figure 2.b of the "Prune Once For All" paper. (particularly the warmup phase seems to be a bit different)

    train/learning_rate logged by Weights&Biases: Screenshot 2021-12-20 at 11 25 39

    Learning rate in the paper, Figure 2.b: Screenshot 2021-12-20 at 11 31 35

    opened by eldarkurtic 4
  • Sparse models available for download?

    Sparse models available for download?

    Hello :-)

    I found your Prune-Once-For-All paper very interesting and would like to play with the sparse models that it produced. Are you going to open-source them soon?

    I have noticed you have open-sourced the sparse-pretrained models, but I couldn't find the corresponding models finetuned on downstream tasks (SQuAD, MNLI, QQP, etc.).

    opened by eldarkurtic 2
  • How to interpret hyperparams?

    How to interpret hyperparams?

    Hi, I have a few questions about hyperparams in the Table 6:

    1. Since there are three models: {BERT-Base, BERT-Large, DistilBERT}, how to interpret learning rate for SQuAD with only two values: {1.5e-4, 1.8e-4}?
    2. I assume that for GLUE {1e-4, 1.2e-4, 1.5e-5} are learning rate values for each model respectively. Is this correct?
    3. Since weight decay row has only two values {0, 0.01}, I assume 0 is for all models on SQuAD and 0.01 is for all models on GLUE?
    4. Since warmup ratio row has three values {0, 0.01, 0.1}, I assume these are for each model respectively, no matter which dataset is used?
    5. Does "Epochs {3, 6, 9}" for GLUE mean BERT-base tuned for 3 epochs, BERT-Large for 6 and DistilBERT for 9 epochs?
    opened by eldarkurtic 2
  • Upstream pruning

    Upstream pruning

    Hi! First of all, thanks for open-sourcing your code for the "Prune Once for All" paper. I would like to ask a few questions:

    1. Are you planning to release your teacher model for upstream task? I have noticed that at https://huggingface.co/Intel , only the sparse checkpoints have been released. I would like to run some experiments with your compression package.
    2. From the published scripts, I have noticed that you have been using only English Wikipedia dataset for pruning at upstream tasks (MLM and NSP) but the bert-base-uncased model you use as a starting point is pre-trained on BookCorpus and English Wikipedia. Is there any specific reason why you haven't included BookCorpus dataset too?
    opened by eldarkurtic 1
  • Code analysis identified several places where objects were either not

    Code analysis identified several places where objects were either not

    declared or were declared as None which could result in an unsupported operation error from python.

    Change descriptions:

    • added forward declarations of 4 variables in both the modeling_bert and modeling_roberta
    • removed assignment of all_hidden_states to None if output_hidden_states is none
    • removed assignment of all_attentions to None if output_attentions is none
    • removed assignment of all_self_attentions to None if output_attentions is None
    • removed assignment of all_cross_attentions to Non if output_attentions is None
    opened by michaelbeale-IL 0
  • Fix distillation of different HF/transformers models

    Fix distillation of different HF/transformers models

    Until now, if the teacher had a different signature than the student, transformers.trainer would delete the input that is not matching to the student's signature leading to the teacher not getting all the input it needs.

    For example, training a DistilBERT student with a BERT-Base teacher will not work properly since BERT-Base requires token_type_ids which DistilBERT doesn't require. The trainer deletes the token_type_ids from the input and BERT teacher would get an all zeros token type ids leading to wrong predictions.

    This PR fixes this issue.

    opened by ofirzaf 0
  • Small optimizations

    Small optimizations

    • Implement fast threshold compute: Execute best threshold compute according to target hardware (cpu/cuda) and implement fast compute using histogram
    • Refactor block pruning computation: move computation to utils and reuse in the rest of the pruning methods
    opened by ofirzaf 0
Releases(v0.1.0)
  • v0.1.0(Nov 23, 2021)

    First release of Intel Labs' Model Compression Research Package, the current version includes model compression methods from previous published papers and our own research papers implementations:

    • Pruning, quantization and knowledge distillation methods and schedulers that may fit various PyTorch models out-of-the-box
    • Integration to HuggingFace/transformers library for most of the available methods
    • Various examples showing how to use the library
    • Prune Once for All: Sparse Pre-Trained Language Models reproduction guide and scripts
    Source code(tar.gz)
    Source code(zip)
Owner
Intel Labs
Intel Labs
novel deep learning research works with PaddlePaddle

Research 发布基于飞桨的前沿研究工作,包括CV、NLP、KG、STDM等领域的顶会论文和比赛冠军模型。 目录 计算机视觉(Computer Vision) 自然语言处理(Natrual Language Processing) 知识图谱(Knowledge Graph) 时空数据挖掘(Spa

1.5k Dec 29, 2022
PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

Libo Qin 25 Sep 06, 2022
the official implementation of the paper "Isometric Multi-Shape Matching" (CVPR 2021)

Isometric Multi-Shape Matching (IsoMuSh) Paper-CVF | Paper-arXiv | Video | Code Citation If you find our work useful in your research, please consider

Maolin Gao 9 Jul 17, 2022
Code for our paper "Sematic Representation for Dialogue Modeling" in ACL2021

AMR-Dialogue An implementation for paper "Semantic Representation for Dialogue Modeling". You may find our paper here. Requirements python 3.6 pytorch

xfbai 45 Dec 26, 2022
Pixel-wise segmentation on VOC2012 dataset using pytorch.

PiWiSe Pixel-wise segmentation on the VOC2012 dataset using pytorch. FCN SegNet PSPNet UNet RefineNet For a more complete implementation of segmentati

Bodo Kaiser 378 Dec 30, 2022
Simple cross-platform application for DaVinci surgical video frame annotation

About DaVid is a simple cross-platform GUI for annotating robotic and endoscopic surgical actions for use in deep-learning research. Features Simple a

Cyril Zakka 4 Oct 09, 2021
S2-BNN: Bridging the Gap Between Self-Supervised Real and 1-bit Neural Networks via Guided Distribution Calibration (CVPR 2021)

S2-BNN (Self-supervised Binary Neural Networks Using Distillation Loss) This is the official pytorch implementation of our paper: "S2-BNN: Bridging th

Zhiqiang Shen 52 Dec 24, 2022
This repo contains research materials released by members of the Google Brain team in Tokyo.

Brain Tokyo Workshop 🧠 🗼 This repo contains research materials released by members of the Google Brain team in Tokyo. Past Projects Weight Agnostic

Google 1.2k Jan 02, 2023
This repository contains the implementation of the paper Contrastive Instance Association for 4D Panoptic Segmentation using Sequences of 3D LiDAR Scans

Contrastive Instance Association for 4D Panoptic Segmentation using Sequences of 3D LiDAR Scans This repository contains the implementation of the pap

Photogrammetry & Robotics Bonn 40 Dec 01, 2022
A toolset of Python programs for signal modeling and indentification via sparse semilinear autoregressors.

SPAAR Description A toolset of Python programs for signal modeling via sparse semilinear autoregressors. References Vides, F. (2021). Computing Semili

Fredy Vides 0 Oct 30, 2021
Neural Radiance Fields Using PyTorch

This project is a PyTorch implementation of Neural Radiance Fields (NeRF) for reproduction of results whilst running at a faster speed.

Vedant Ghodke 1 Feb 11, 2022
Supporting code for "Autoregressive neural-network wavefunctions for ab initio quantum chemistry".

naqs-for-quantum-chemistry This repository contains the codebase developed for the paper Autoregressive neural-network wavefunctions for ab initio qua

Tom Barrett 24 Dec 23, 2022
A python script to convert images to animated sus among us crewmate twerk jifs as seen on r/196

img_sussifier A python script to convert images to animated sus among us crewmate twerk jifs as seen on r/196 Examples How to use install python pip i

41 Sep 30, 2022
Code for paper PairRE: Knowledge Graph Embeddings via Paired Relation Vectors.

PairRE Code for paper PairRE: Knowledge Graph Embeddings via Paired Relation Vectors. This implementation of PairRE for Open Graph Benchmak datasets (

Alipay 65 Dec 19, 2022
Tackling the Class Imbalance Problem of Deep Learning Based Head and Neck Organ Segmentation

Info This is the code repository of the work Tackling the Class Imbalance Problem of Deep Learning Based Head and Neck Organ Segmentation from Elias T

2 Apr 20, 2022
Train Dense Passage Retriever (DPR) with a single GPU

Gradient Cached Dense Passage Retrieval Gradient Cached Dense Passage Retrieval (GC-DPR) - is an extension of the original DPR library. We introduce G

Luyu Gao 92 Jan 02, 2023
This is the repository for CVPR2021 Dynamic Metric Learning: Towards a Scalable Metric Space to Accommodate Multiple Semantic Scales

Intro This is the repository for CVPR2021 Dynamic Metric Learning: Towards a Scalable Metric Space to Accommodate Multiple Semantic Scales Vehicle Sam

39 Jul 21, 2022
Visualizing lattice vibration information from phonon dispersion to atoms (For GPUMD)

Phonon-Vibration-Viewer (For GPUMD) Visualizing lattice vibration information from phonon dispersion for primitive atoms. In this tutorial, we will in

Liangting 6 Dec 10, 2022
Official implementation of Sparse Transformer-based Action Recognition

STAR Official implementation of S parse T ransformer-based A ction R ecognition Dataset download NTU RGB+D 60 action recognition of 2D/3D skeleton fro

Chonghan_Lee 15 Nov 02, 2022
Rohit Ingole 2 Mar 24, 2022