PyTorch implementation of Patient Knowledge Distillation for BERT Model Compression

Last update: Dec 19, 2022

Overview

Patient Knowledge Distillation for BERT Model Compression

Knowledge distillation for BERT model

Installation

Run the commands below to install the environment:

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
pip install -r requirements.txt

Training

Objective Function

L = (1 - \alpha) L_{CE} + \alpha L_{DS} + \beta L_{PT},

where L_{CE} is the cross-entropy loss, L_{DS} is the usual distillation loss, and L_{PT} is the proposed patient loss. Please see our paper (cited below) for more details.
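
As a rough illustration of how the three terms combine, here is a minimal pure-Python sketch. Only the weighted sum comes from the formula above. The forms used for the distillation term (cross-entropy between temperature-softened teacher and student distributions) and the patient term (squared distance between L2-normalized hidden vectors of matched layers) are assumptions based on common practice and the paper's description, and all function names are hypothetical, not this repository's API.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """L_CE: negative log-probability of the true label."""
    return -math.log(softmax(logits)[label])


def distillation_loss(student_logits, teacher_logits, temperature):
    """L_DS (assumed form): cross-entropy between softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


def patient_loss(student_hidden, teacher_hidden):
    """L_PT (assumed form): squared distance between L2-normalized
    hidden vectors, summed over matched (student, teacher) layer pairs.
    Assumes no vector is all zeros."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    total = 0.0
    for h_s, h_t in zip(student_hidden, teacher_hidden):
        total += sum((a - b) ** 2 for a, b in zip(normalize(h_s), normalize(h_t)))
    return total


def pkd_objective(l_ce, l_ds, l_pt, alpha, beta):
    """L = (1 - alpha) * L_CE + alpha * L_DS + beta * L_PT."""
    return (1 - alpha) * l_ce + alpha * l_ds + beta * l_pt


# Toy example: 2-class logits and one matched layer pair.
student_logits, teacher_logits = [0.2, 1.1], [0.0, 2.0]
l_ce = cross_entropy(student_logits, label=1)
l_ds = distillation_loss(student_logits, teacher_logits, temperature=5.0)
l_pt = patient_loss([[3.0, 4.0]], [[6.0, 8.0]])  # same direction, so the term vanishes
loss = pkd_objective(l_ce, l_ds, l_pt, alpha=0.5, beta=100.0)
```

Setting alpha = 0 and beta = 0 reduces the objective to plain fine-tuning with L_CE, and beta = 0 alone gives vanilla knowledge distillation. Because pkd_objective uses only arithmetic operators, the same function also works on PyTorch scalar tensors.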

Data Preprocess

Modify HOME_DATA_FOLDER in envs.py and put all data under it (the default is ./data). The RTE data is included in the repository for your convenience.

The folders under HOME_DATA_FOLDER should be:
- data_raw: stores the raw data for all tasks; put the downloaded raw data here
  - MRPC
  - RTE
  - ... (other tasks)
- data_feat: stores the tokenized data (optional)
  - MRPC
  - RTE
  - ...
- models
  - pretrained: put the downloaded pretrained model (bert-base-uncased) here
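
The layout above can be created with a short script. This is a minimal sketch: the helper name make_data_layout is hypothetical, the task list is limited to the two tasks named above, it assumes models sits directly under HOME_DATA_FOLDER alongside data_raw and data_feat, and the demo writes into a temporary directory rather than ./data.

```python
import tempfile
from pathlib import Path


def make_data_layout(home, tasks=("MRPC", "RTE")):
    """Create the expected folder tree under `home` and return the
    created folders as sorted paths relative to `home`."""
    home = Path(home)
    folders = [home / "data_raw" / task for task in tasks]
    folders += [home / "data_feat" / task for task in tasks]
    folders.append(home / "models" / "pretrained")
    for folder in folders:
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
    return sorted(folder.relative_to(home).as_posix() for folder in folders)


# Demo in a throwaway directory; pass your own HOME_DATA_FOLDER instead.
with tempfile.TemporaryDirectory() as tmp:
    layout = make_data_layout(tmp)
    for name in layout:
        print(name)
```

To set up the real tree, call make_data_layout with the same path you configured as HOME_DATA_FOLDER in envs.py, then copy the raw task data and the bert-base-uncased files into the matching folders.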

Predefined Training

Run NLI_KD_training.py to start training. Set DEBUG = True to run with one of the predefined argument sets:

Set argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher') or argv = get_predefine_argv('glue', 'RTE', 'finetune_student') to start normal fine-tuning.
Run run_glue_benchmark.py to get the teacher's predictions for KD or PKD.
  - Set output_all_layers = True for the patient teacher.
  - Set output_all_layers = False for the normal teacher.
Set argv = get_predefine_argv('glue', 'RTE', 'kd') to start vanilla KD.
Set argv = get_predefine_argv('glue', 'RTE', 'kd.cls') to start patient KD (PKD).

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Citation

If you find this code useful for your research, please consider citing:

@article{sun2019patient,
title={Patient Knowledge Distillation for BERT Model Compression},
author={Sun, Siqi and Cheng, Yu and Gan, Zhe and Liu, Jingjing},
journal={arXiv preprint arXiv:1908.09355},
year={2019}
}

The paper is available on arXiv (arXiv:1908.09355).
