meProp: Sparsified Back Propagation for Accelerated Deep Learning (ICML 2017)

Overview

meProp

The codes were used for the paper meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting (ICML 2017) [pdf] by Xu Sun, Xuancheng Ren, Shuming Ma, Houfeng Wang.

Based on meProp, we further simplify the model by eliminating the rows or columns that are seldom updated, which will reduce the computational cost both in the training and decoding, and potentially accelerate decoding in real-world applications. We name this method meSimp (minimal effort simplification). For more details, please see the paper Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method [pdf]. The codes are at [here].

Introduction

We propose a simple yet effective technique to simplify the training of neural networks. The technique is based on the top-k selection of the gradients in back propagation.

In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k elements (in terms of magnitude) are kept. As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction in the computational cost. We name this method meProp (minimal effort back propagation).

Surprisingly, experimental results demonstrate that most of time we only need to update fewer than 5% of the weights at each back propagation pass. More interestingly, the proposed method improves the accuracy of the resulting models rather than degrades the accuracy, and a detailed analysis is given.

The following figure is an illustration of the idea of meProp.

An illustration of the idea of meProp.

TL;DR: Training with meProp is significantly faster than the original back propagation, and has better accuracy on all of the three tasks we used, Dependency Parsing, POS Tagging and MNIST respectively. The method works with different neural models (MLP and LSTM), with different optimizers (we tested AdaGrad and Adam), with DropOut, and with more hidden layers. The top-k selection works better than the random k-selection, and better than normally-trained k-dimensional network.

Update: Results on test set (please refer to the paper for detailed results and experimental settings):

Method (Adam, CPU) Backprop Time (s) Test (%)
Parsing (MLP 500d) 9,078 89.80
Parsing (meProp top-20) 489 (18.6x) 88.94 (+0.04)
POS-Tag (LSTM 500d) 16,167 97.22
POS-Tag (meProp top-10) 436 (37.1x) 97.25 (+0.03)
MNIST (MLP 500d) 170 98.20
MNIST (meProp top-80) 29 (5.9x) 98.27 (+0.07)

The effect of k, selection (top-k vs. random), and network dimension (top-k vs. k-dimensional):

Effect of k

To achieve speedups on GPUs, a slight change is made to unify the top-k pattern across the mini-batch. The original meProp will cause different top-k patterns across examples of a mini-batch, which will require sparse matrix multiplication. However, sparse matrix multiplication is not very efficient on GPUs compared to dense matrix multiplication on GPUs. Hence, by unifying the top-k pattern, we can extract the parts of the matrices that need computation (dense matrices), get the results, and reconstruct them to the appropriate size for further computation. This leads to actual speedups on GPUs, although we believe if a better method is designed, the speedups on GPUs can be better.

See [pdf] for more details, experimental results, and analysis.

Usage

PyTorch

Requirements

  • Python 3.5
  • PyTorch v0.1.12+ - v0.3.1
  • torchvision
  • CUDA 8.0

Dataset

MNIST: The code will automatically download the dataset and process the dataset (using torchvision). See function get_mnist in the pytorch code for more information.

Run

python3.5 main.py

The code runs unified meProp by default. You could change the lines at the bottom of the main.py to run meProp using sparse matrix multiplication. Or you could pass the arguments through command line.

usage: main.py [-h] [--n_epoch N_EPOCH] [--d_hidden D_HIDDEN]
               [--n_layer N_LAYER] [--d_minibatch D_MINIBATCH]
               [--dropout DROPOUT] [--k K] [--unified] [--no-unified]
               [--random_seed RANDOM_SEED]

optional arguments:
  -h, --help            show this help message and exit
  --n_epoch N_EPOCH     number of training epochs
  --d_hidden D_HIDDEN   dimension of hidden layers
  --n_layer N_LAYER     number of layers, including the output layer
  --d_minibatch D_MINIBATCH
                        size of minibatches
  --dropout DROPOUT     dropout rate
  --k K                 k in meProp (if invalid, e.g. 0, do not use meProp)
  --unified             use unified meProp
  --no-unified          do not use unified meProp
  --random_seed RANDOM_SEED
                        random seed

The results will be written to stdout by default, but you could change the argument file when initializing the TestGroup to write the results to a file.

The code supports simple unified meProp in addition. Please notice, this code will use GPU 0 by default.

C#

Requirements

  • Targeting Microsoft .NET Framework 4.6.1+
  • Compatible versions of Mono should work fine (tested Mono 5.0.1)
  • Developed with Microsoft Visual Studio 2017

Dataset

MNIST: Download from link. Extract the files, and place them at the same location with the executable.

Run

Compile the code first, or use the executable provided in releases.

Then

nnmnist.exe <config.json>

or

mono nnmnist.exe <config.json>

where <config.json> is a configuration file. There is an example configuration file in the source codes. The example configuration file runs the baseline model. Change the NetType to mlptop for experimenting with meProp, and to mlpvar for experimenting with meSimp. The output will be written to a file at the same location with the executable.

The code supports random k selection in addition.

Citation

bibtex:

@InProceedings{sun17meprop,
  title = 	 {me{P}rop: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting},
  author = 	 {Xu Sun and Xuancheng Ren and Shuming Ma and Houfeng Wang},
  booktitle = 	 {Proceedings of the 34th International Conference on Machine Learning},
  pages = 	 {3299--3308},
  year = 	 {2017},
  volume = 	 {70},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {International Convention Centre, Sydney, Australia}
}
You might also like...
[ICML 2021, Long Talk] Delving into Deep Imbalanced Regression
[ICML 2021, Long Talk] Delving into Deep Imbalanced Regression

Delving into Deep Imbalanced Regression This repository contains the implementation code for paper: Delving into Deep Imbalanced Regression Yuzhe Yang

Code for the ICML 2021 paper
Code for the ICML 2021 paper "Bridging Multi-Task Learning and Meta-Learning: Towards Efficient Training and Effective Adaptation", Haoxiang Wang, Han Zhao, Bo Li.

Bridging Multi-Task Learning and Meta-Learning Code for the ICML 2021 paper "Bridging Multi-Task Learning and Meta-Learning: Towards Efficient Trainin

 Neural-Pull: Learning Signed Distance Functions from Point Clouds by Learning to Pull Space onto Surfaces(ICML 2021)
Neural-Pull: Learning Signed Distance Functions from Point Clouds by Learning to Pull Space onto Surfaces(ICML 2021)

Neural-Pull: Learning Signed Distance Functions from Point Clouds by Learning to Pull Space onto Surfaces(ICML 2021) This repository contains the code

Some tentative models that incorporate label propagation to graph neural networks for graph representation learning in nodes, links or graphs.

Some tentative models that incorporate label propagation to graph neural networks for graph representation learning in nodes, links or graphs.

Back to the Feature: Learning Robust Camera Localization from Pixels to Pose (CVPR 2021)
Back to the Feature: Learning Robust Camera Localization from Pixels to Pose (CVPR 2021)

Back to the Feature with PixLoc We introduce PixLoc, a neural network for end-to-end learning of camera localization from an image and a 3D model via

The PyTorch improved version of TPAMI 2017 paper: Face Alignment in Full Pose Range: A 3D Total Solution.
The PyTorch improved version of TPAMI 2017 paper: Face Alignment in Full Pose Range: A 3D Total Solution.

Face Alignment in Full Pose Range: A 3D Total Solution By Jianzhu Guo. [Updates] 2020.8.30: The pre-trained model and code of ECCV-20 are made public

An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.

Bottom-Up and Top-Down Attention for Visual Question Answering An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge. The

Fader Networks: Manipulating Images by Sliding Attributes - NIPS 2017
Fader Networks: Manipulating Images by Sliding Attributes - NIPS 2017

FaderNetworks PyTorch implementation of Fader Networks (NIPS 2017). Fader Networks can generate different realistic versions of images by modifying at

PyTorch implementation of NIPS 2017 paper Dynamic Routing Between Capsules
PyTorch implementation of NIPS 2017 paper Dynamic Routing Between Capsules

Dynamic Routing Between Capsules - PyTorch implementation PyTorch implementation of NIPS 2017 paper Dynamic Routing Between Capsules from Sara Sabour,

Comments
  • Regarding the demonstration for faster acceleration results in pytorch

    Regarding the demonstration for faster acceleration results in pytorch

    Hi lancopku,

    I'm currently implementing your meProp code to understand the flow of the architecture in detail.

    However, I couln't see the improved acceleration speed of meprop compared to that of conventional MLP.

    In the table 7 and 8 of paper Sun et al., 2017, pytorch based GPU computation can achieve more faster back-propagation procedure.

    Could you please let me know how to implement meprop to show faster backprop computation?

    Best, Seul-Ki

    opened by seulkiyeom 3
  • Deeper MLP?

    Deeper MLP?

    Have you tried on deeper models?

    Since each step of backprops, gradients are removed with specific portions(like 5%), Will not the gradient vanish in a deeper neural network model?

    Any thoughts?

    opened by ildoonet 1
  • Error RuntimeError: 2D tensors expected, got 1D

    Error RuntimeError: 2D tensors expected, got 1D

    I am trying to integrate meProp into my work, but getting such error. Do you have any idea about this?

        return linearUnified(self.k)(x, self.w, self.b)
     line 39, in forward
        y.addmm_(0, 1, x, w)
    RuntimeError: 2D tensors expected, got 1D, 2D tensors at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:258
    
    opened by kayuksel 1
Releases(v0.2.0)
Owner
LancoPKU
Language Computing and Machine Learning Group (Xu Sun's group) at Peking University
LancoPKU
Python scripts to detect faces in Python with the BlazeFace Tensorflow Lite models

Python scripts to detect faces using Python with the BlazeFace Tensorflow Lite models. Tested on Windows 10, Tensorflow 2.4.0 (Python 3.8).

Ibai Gorordo 46 Nov 17, 2022
Code for the paper A Theoretical Analysis of the Repetition Problem in Text Generation

A Theoretical Analysis of the Repetition Problem in Text Generation This repository share the code for the paper "A Theoretical Analysis of the Repeti

Zihao Fu 37 Nov 21, 2022
Code for the SIGIR 2022 paper "Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion"

MKGFormer Code for the SIGIR 2022 paper "Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion" Model Architecture Illu

ZJUNLP 68 Dec 28, 2022
One-line your code easily but still with the fun of doing so!

One-liner-iser One-line your code easily but still with the fun of doing so! Have YOU ever wanted to write one-line Python code, but don't have the sa

5 May 04, 2022
Scalable and Elastic Deep Reinforcement Learning Using PyTorch. Please star. 🔥

ElegantRL “小雅”: Scalable and Elastic Deep Reinforcement Learning ElegantRL is developed for researchers and practitioners with the following advantage

AI4Finance Foundation 2.5k Jan 05, 2023
Unifying Global-Local Representations in Salient Object Detection with Transformer

GLSTR (Global-Local Saliency Transformer) This is the official implementation of paper "Unifying Global-Local Representations in Salient Object Detect

11 Aug 24, 2022
Official pytorch implementation of Active Learning for deep object detection via probabilistic modeling (ICCV 2021)

Active Learning for Deep Object Detection via Probabilistic Modeling This repository is the official PyTorch implementation of Active Learning for Dee

NVIDIA Research Projects 130 Jan 06, 2023
Minimal implementation of Denoised Smoothing: A Provable Defense for Pretrained Classifiers in TensorFlow.

Denoised-Smoothing-TF Minimal implementation of Denoised Smoothing: A Provable Defense for Pretrained Classifiers in TensorFlow. Denoised Smoothing is

Sayak Paul 19 Dec 11, 2022
PyTorch implementation for STIN

STIN This repository contains PyTorch implementation for STIN. Abstract: In single-photon LiDAR, photon-efficient imaging captures the 3D structure of

Yiweins 2 Nov 22, 2022
Ready-to-use code and tutorial notebooks to boost your way into few-shot image classification.

Easy Few-Shot Learning Ready-to-use code and tutorial notebooks to boost your way into few-shot image classification. This repository is made for you

Sicara 399 Jan 08, 2023
Simple keras FCN Encoder/Decoder model for MS-COCO (food subset) segmentation

FCN_MSCOCO_Food_Segmentation Simple keras FCN Encoder/Decoder model for MS-COCO (food subset) segmentation Input data: [http://mscoco.org/dataset/#ove

Alexander Kalinovsky 11 Jan 08, 2019
Learning from History: Modeling Temporal Knowledge Graphs with Sequential Copy-Generation Networks

CyGNet This repository reproduces the AAAI'21 paper “Learning from History: Modeling Temporal Knowledge Graphs with Sequential Copy-Generation Network

CunchaoZ 89 Jan 03, 2023
[CVPR'20] TTSR: Learning Texture Transformer Network for Image Super-Resolution

TTSR Official PyTorch implementation of the paper Learning Texture Transformer Network for Image Super-Resolution accepted in CVPR 2020. Contents Intr

Multimedia Research 689 Dec 28, 2022
Codes for NAACL 2021 Paper "Unsupervised Multi-hop Question Answering by Question Generation"

Unsupervised-Multi-hop-QA This repository contains code and models for the paper: Unsupervised Multi-hop Question Answering by Question Generation (NA

Liangming Pan 70 Nov 27, 2022
PyTorch implementation of the paper The Lottery Ticket Hypothesis for Object Recognition

LTH-ObjectRecognition The Lottery Ticket Hypothesis for Object Recognition Sharath Girish*, Shishira R Maiya*, Kamal Gupta, Hao Chen, Larry Davis, Abh

16 Feb 06, 2022
Implementation of popular SOTA self-supervised learning algorithms as Fastai Callbacks.

Self Supervised Learning with Fastai Implementation of popular SOTA self-supervised learning algorithms as Fastai Callbacks. Install pip install self-

Kerem Turgutlu 276 Dec 23, 2022
The PyTorch implementation for paper "Neural Texture Extraction and Distribution for Controllable Person Image Synthesis" (CVPR2022 Oral)

ArXiv | Get Start Neural-Texture-Extraction-Distribution The PyTorch implementation for our paper "Neural Texture Extraction and Distribution for Cont

Ren Yurui 111 Dec 10, 2022
Drone detection using YOLOv5

This drone detection system uses YOLOv5 which is a family of object detection architectures and we have trained the model on Drone Dataset. Overview I

Tushar Sarkar 27 Dec 20, 2022
Implementation of Deformable Attention in Pytorch from the paper "Vision Transformer with Deformable Attention"

Deformable Attention Implementation of Deformable Attention from this paper in Pytorch, which appears to be an improvement to what was proposed in DET

Phil Wang 128 Dec 24, 2022
Bald-to-Hairy Translation Using CycleGAN

GANiry: Bald-to-Hairy Translation Using CycleGAN Official PyTorch implementation of GANiry. GANiry: Bald-to-Hairy Translation Using CycleGAN, Fidan Sa

Fidan Samet 10 Oct 27, 2022