Official Repository for "Mish: A Self Regularized Non-Monotonic Neural Activation Function" [BMVC 2020]

Overview

Mish: Self Regularized
Non-Monotonic Activation Function

BMVC 2020 (Official Paper)



Notes:
  • A considerably faster version based on CUDA can be found here - Mish CUDA (All credits to Thomas Brandon for the same)
  • Memory Efficient Experimental version of Mish can be found here
  • Faster variants for Mish and H-Mish by Yashas Samaga can be found here - ConvolutionBuildingBlocks
  • Alternative (experimental improved) variant of H-Mish developed by Páll Haraldsson can be found here - H-Mish (Available in Julia)
  • Variance based initialization method for Mish (experimental) by Federico Andres Lois can be found here - Mish_init

News/ Media Coverage:

  • (02/2020): Talk on Mish and Non-Linear Dynamics at Sicara is out now.
  • (07/2020): CROWN: A comparison of morphology for Mish, Swish and ReLU produced in collaboration with Javier Ideami.
  • (12/2020): Talk on From Smooth Activations to Robustness to Catastrophic Forgetting at Weights & Biases Salon is out now.

MILA/ CIFAR 2020 DLRLSS

Contents:
  1. Mish
    a. Loss landscape
  2. ImageNet Scores
  3. MS-COCO
  4. Variation of Parameter Comparison
    a. MNIST
    b. CIFAR10
  5. Significance Level
  6. Results
    a. Summary of Results (Vision Tasks)
    b. Summary of Results (Language Tasks)
  7. Try It!
  8. Future Work
  9. Acknowledgements
  10. Cite this work

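Mish:

$$f(x) = x \tanh(\operatorname{softplus}(x)) = x \tanh(\ln(1 + e^{x}))$$

The minimum of f(x) is observed to be ≈ -0.30884 at x ≈ -1.1924.
Mish has a parametric order of continuity of: $C^{\infty}$

Derivative of Mish with respect to Swish and Δ(x) preconditioning:

$$f'(x) = \operatorname{sech}^{2}(\operatorname{softplus}(x)) \, x \, \operatorname{sigmoid}(x) + \frac{f(x)}{x}$$

Further simplifying:

$$f'(x) = \Delta(x) \, \operatorname{swish}(x) + \frac{f(x)}{x}$$

Alternative derivative form:

$$f'(x) = \frac{e^{x} \omega}{\delta^{2}}$$

where:

$$\omega = 4(x + 1) + 4e^{2x} + e^{3x} + e^{x}(4x + 6)$$
$$\delta = 2e^{x} + e^{2x} + 2$$

We hypothesize that Δ(x) = sech²(softplus(x)) acts as a preconditioner that makes the gradient smoother. Further details are provided in the paper.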
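
The definition and the two derivative forms named above can be checked numerically. The following is a minimal sketch using only the Python standard library. It assumes Mish is f(x) = x · tanh(softplus(x)) and takes Δ(x) = sech²(softplus(x)); the closed form e^x · ω / δ² follows from expanding tanh(ln(1 + e^x)). The helper names (`softplus`, `mish`, `mish_grad`, `mish_grad_alt`) are illustrative and are not taken from this repository.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def mish_grad(x: float) -> float:
    """Derivative written as Delta(x) * swish(x) + tanh(softplus(x))."""
    t = math.tanh(softplus(x))   # equals f(x) / x for x != 0
    delta = 1.0 - t * t          # sech^2(softplus(x)), the assumed Delta(x)
    swish = x * sigmoid(x)
    return delta * swish + t


def mish_grad_alt(x: float) -> float:
    """Closed form e^x * omega / delta^2 (overflows for very large x)."""
    ex = math.exp(x)
    omega = 4.0 * (x + 1.0) + 4.0 * ex**2 + ex**3 + ex * (4.0 * x + 6.0)
    d = 2.0 * ex + ex**2 + 2.0
    return ex * omega / (d * d)


# Coarse grid search for the minimum of Mish on [-3, 0] (step 1e-4).
grid = [i / 10000.0 - 3.0 for i in range(30001)]
x_min = min(grid, key=mish)
print(f"argmin ~ {x_min:.4f}, minimum ~ {mish(x_min):.5f}")

# Compare the two derivative forms at a few moderate inputs.
for x in (-4.0, -1.0, 0.0, 0.5, 3.0):
    print(x, mish_grad(x), mish_grad_alt(x))
```

The grid search should land close to the minimum quoted above, and the two derivative functions should agree to floating-point precision for moderate inputs. The closed form is convenient for analysis but is not a good choice for large x because of the e^(3x) term.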

Loss Landscape:

To visit the interactive Loss Landscape visualizer, click here.

Loss landscape visualizations for a ResNet-20 for CIFAR 10 using ReLU, Mish and Swish (from L-R) for 200 epochs training:


Mish provides better accuracy, lower overall loss, and a smoother, better-conditioned loss landscape that is easier to optimize than those of both Swish and ReLU. For all loss landscape visualizations please visit this readme.

We also investigate the output landscape of randomly initialized neural networks as shown below. Mish has a much smoother profile than ReLU.

ImageNet Scores:

To install the DarkNet framework, please refer to darknet (AlexeyAB).

For PyTorch based ImageNet scores, please refer to this readme

| Network | Activation | Top-1 Accuracy | Top-5 Accuracy | cfg | Weights | Hardware |
|---|---|---|---|---|---|---|
| ResNet-50 | Mish | 74.244% | 92.406% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |
| DarkNet-53 | Mish | 77.01% | 93.75% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |
| DenseNet-201 | Mish | 76.584% | 93.47% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |
| ResNext-50 | Mish | 77.182% | 93.318% | cfg | weights | AWS p3.16x large, 8 Tesla V100 |

| Network | Activation | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| CSPResNet-50 | Leaky ReLU | 77.1% | 94.1% |
| CSPResNet-50 | Mish | 78.1% | 94.2% |
| Pelee Net | Leaky ReLU | 70.7% | 90% |
| Pelee Net | Mish | 71.4% | 90.4% |
| Pelee Net | Swish | 71.5% | 90.7% |
| CSPPelee Net | Leaky ReLU | 70.9% | 90.2% |
| CSPPelee Net | Mish | 71.2% | 90.3% |

Results on CSPResNext-50:

MixUp CutMix Mosaic Blur Label Smoothing Leaky ReLU Swish Mish Top-1 Accuracy Top-5 Accuracy cfg weights
✔️ 77.9%(=) 94%(=)
✔️ ✔️ 77.2%(-) 94%(=)
✔️ ✔️ 78%(+) 94.3%(+)
✔️ ✔️ 78.1%(+) 94.5%(+)
✔️ ✔️ 77.5%(-) 93.8%(-)
✔️ ✔️ 78.1%(+) 94.4%(+)
✔️ 64.5%(-) 86%(-)
✔️ 78.9%(+) 94.5%(+)
✔️ ✔️ ✔️ ✔️ 78.5%(+) 94.8%(+)
✔️ ✔️ ✔️ ✔️ 79.8%(+) 95.2%(+) cfg weights

Results on CSPResNet-50:

CutMix Mosaic Label Smoothing Leaky ReLU Mish Top-1 Accuracy Top-5 Accuracy cfg weights
✔️ 76.6%(=) 93.3%(=)
✔️ ✔️ ✔️ ✔️ 77.1%(+) 94.1%(+)
✔️ ✔️ ✔️ ✔️ 78.1%(+) 94.2%(+) cfg weights

Results on CSPDarkNet-53:

CutMix Mosaic Label Smoothing Leaky ReLU Mish Top-1 Accuracy Top-5 Accuracy cfg weights
✔️ 77.2%(=) 93.6%(=)
✔️ ✔️ ✔️ ✔️ 77.8%(+) 94.4%(+)
✔️ ✔️ ✔️ ✔️ 78.7%(+) 94.8%(+) cfg weights

Results on SpineNet-49:

CutMix Mosaic Label Smoothing ReLU Swish Mish Top-1 Accuracy Top-5 Accuracy cfg weights
✔️ 77%(=) 93.3%(=) - -
✔️ ✔️ 78.1%(+) 94%(+) - -
✔️ ✔️ ✔️ ✔️ 78.3%(+) 94.6%(+) - -

MS-COCO:

For PyTorch based MS-COCO scores, please refer to this readme

| Model | Mish | AP50...95 | mAP50 | CPU - 90 Watt - FP32 (Intel Core i7-6700K, 4GHz, 8 logical cores) OpenCV-DLIE, FPS | VPU - 2 Watt - FP16 (Intel MyriadX) OpenCV-DLIE, FPS | GPU - 175 Watt - FP32/16 (Nvidia GeForce RTX 2070) DarkNet-cuDNN, FPS |
|---|---|---|---|---|---|---|
| CSPDarkNet-53 (512 x 512) | | 42.4% | 64.5% | 3.5 | 1.23 | 43 |
| CSPDarkNet-53 (512 x 512) | ✔️ | 43% | 64.9% | - | - | 41 |
| CSPDarkNet-53 (608 x 608) | ✔️ | 43.5% | 65.7% | - | - | 26 |

Architecture Mish CutMix Mosaic Label Smoothing Size AP AP50 AP75
CSPResNext50-PANet-SPP 512 x 512 42.4% 64.4% 45.9%
CSPResNext50-PANet-SPP ✔️ ✔️ ✔️ 512 x 512 42.3% 64.3% 45.7%
CSPResNext50-PANet-SPP ✔️ ✔️ ✔️ ✔️ 512 x 512 42.3% 64.2% 45.8%
CSPDarkNet53-PANet-SPP ✔️ ✔️ ✔️ 512 x 512 42.4% 64.5% 46%
CSPDarkNet53-PANet-SPP ✔️ ✔️ ✔️ ✔️ 512 x 512 43% 64.9% 46.5%

Credits to AlexeyAB, Wong Kin-Yiu and Glenn Jocher for all the help with benchmarking MS-COCO and ImageNet.

Variation of Parameter Comparison:

MNIST:

To observe how increasing the number of layers in a network, while keeping other parameters constant, affects test accuracy, fully connected networks of varying depths were trained on MNIST, with each layer having 500 neurons. Residual connections were not used because they enable the training of arbitrarily deep networks. BatchNorm was used to lessen the dependence on initialization, along with a dropout of 25%. The networks were optimized using SGD with a batch size of 128 and, for a fair comparison, the same learning rate was maintained for each activation function. In the experiments, all 3 activations maintained nearly the same test accuracy for a 15-layer network. Beyond 15 layers, test accuracy decreased sharply for Swish and ReLU, whereas Mish outperformed both in larger networks, where optimization becomes difficult.

Mish also consistently provided better test top-1 accuracy than Swish and ReLU as the batch size was increased for a ResNet v2-20 trained on CIFAR-10 for 50 epochs, with all other network parameters kept constant for a fair comparison.

Gaussian noise with varying standard deviation was added to the input for MNIST classification using a simple conv net, to observe the trend in decreasing test top-1 accuracy for Mish and compare it to that of ReLU and Swish. Mish mostly maintained a consistent lead over Swish and ReLU (lower than ReLU in just 1 instance and lower than Swish in 3 instances) as shown below. The trend for test loss was also observed following the same procedure (Mish has better loss than both Swish and ReLU except in 1 instance).

CIFAR10:

Significance Level:

P-values were computed for different activation functions in comparison to Mish, in terms of the top-1 test accuracy of a SqueezeNet model trained on CIFAR-10 for 50 epochs, over 23 runs, using the Adam optimizer at a learning rate of 0.001 and a batch size of 128. Mish beats most of the activation functions at a high significance level across the 23 runs; specifically, it beats ReLU at P < 0.0001. Mish also had a comparatively lower standard deviation across the 23 runs, which indicates consistent performance.

| Activation Function | Mean Accuracy | Mean Loss | Standard Deviation of Accuracy | P-value | Cohen's d Score | 95% CI |
|---|---|---|---|---|---|---|
| Mish | 87.48% | 4.13% | 0.3967 | - | - | - |
| Swish-1 | 87.32% | 4.22% | 0.414 | P = 0.1973 | 0.386 | -0.3975 to 0.0844 |
| E-Swish (β=1.75) | 87.49% | 4.156% | 0.411 | P = 0.9075 | 0.034444 | -0.2261 to 0.2539 |
| GELU | 87.37% | 4.339% | 0.472 | P = 0.4003 | 0.250468 | -0.3682 to 0.1499 |
| ReLU | 86.66% | 4.398% | 0.584 | P < 0.0001 | 1.645536 | -1.1179 to -0.5247 |
| ELU(α=1.0) | 86.41% | 4.211% | 0.3371 | P < 0.0001 | 2.918232 | -1.2931 to -0.8556 |
| Leaky ReLU(α=0.3) | 86.85% | 4.112% | 0.4569 | P < 0.0001 | 1.47632 | -0.8860 to -0.3774 |
| RReLU | 86.87% | 4.138% | 0.4478 | P < 0.0001 | 1.444091 | -0.8623 to -0.3595 |
| SELU | 83.91% | 4.831% | 0.5995 | P < 0.0001 | 7.020812 | -3.8713 to -3.2670 |
| SoftPlus(β = 1) | 83.004% | 5.546% | 1.4015 | P < 0.0001 | 4.345453 | -4.7778 to -4.1735 |
| HardShrink(λ = 0.5) | 75.03% | 7.231% | 0.98345 | P < 0.0001 | 16.601747 | -12.8948 to -12.0035 |
| Hardtanh | 82.78% | 5.209% | 0.4491 | P < 0.0001 | 11.093842 | -4.9522 to -4.4486 |
| LogSigmoid | 81.98% | 5.705% | 1.6751 | P < 0.0001 | 4.517156 | -6.2221 to -4.7753 |
| PReLU | 85.66% | 5.101% | 2.2406 | P = 0.0004 | 1.128135 | -2.7715 to -0.8590 |
| ReLU6 | 86.75% | 4.355% | 0.4501 | P < 0.0001 | 1.711482 | -0.9782 to -0.4740 |
| CELU(α=1.0) | 86.23% | 4.243% | 0.50941 | P < 0.0001 | 2.741669 | -1.5231 to -0.9804 |
| Sigmoid | 74.82% | 8.127% | 5.7662 | P < 0.0001 | 3.098289 | -15.0915 to -10.2337 |
| Softshrink(λ = 0.5) | 82.35% | 5.4915% | 0.71959 | P < 0.0001 | 8.830541 | -5.4762 to -4.7856 |
| Tanhshrink | 82.35% | 5.446% | 0.94508 | P < 0.0001 | 7.083564 | -5.5646 to -4.7032 |
| Tanh | 83.15% | 5.161% | 0.6887 | P < 0.0001 | 7.700198 | -4.6618 to -3.9938 |
| Softsign | 82.66% | 5.258% | 0.6697 | P < 0.0001 | 8.761157 | -5.1493 to -4.4951 |
| Aria-2(β = 1, α=1.5) | 81.31% | 6.0021% | 2.35475 | P < 0.0001 | 3.655362 | -7.1757 to -5.1687 |
| Bent's Identity | 85.03% | 4.531% | 0.60404 | P < 0.0001 | 4.80211 | -2.7576 to -2.1502 |
| SQNL | 83.44% | 5.015% | 0.46819 | P < 0.0001 | 9.317237 | -4.3009 to -3.7852 |
| ELisH | 87.38% | 4.288% | 0.47731 | P = 0.4283 | 0.235784 | -0.3643 to 0.1573 |
| Hard ELisH | 85.89% | 4.431% | 0.62245 | P < 0.0001 | 3.048849 | -1.9015 to -1.2811 |
| SReLU | 85.05% | 4.541% | 0.5826 | P < 0.0001 | 4.883831 | -2.7306 to -2.1381 |
| ISRU (α=1.0) | 86.85% | 4.669% | 0.1106 | P < 0.0001 | 5.302987 | -4.4855 to -3.5815 |
| Flatten T-Swish | 86.93% | 4.459% | 0.40047 | P < 0.0001 | 1.378742 | -0.7865 to -0.3127 |
| SineReLU (ε = 0.001) | 86.48% | 4.396% | 0.88062 | P < 0.0001 | 1.461675 | -1.4041 to -0.5924 |
| Weighted Tanh (Weight = 1.7145) | 80.66% | 5.985% | 1.19868 | P < 0.0001 | 7.638298 | -7.3502 to -6.2890 |
| LeCun's Tanh | 82.72% | 5.322% | 0.58256 | P < 0.0001 | 9.551812 | -5.0566 to -4.4642 |
| Soft Clipping (α=0.5) | 55.21% | 18.518% | 10.831994 | P < 0.0001 | 4.210373 | -36.8255 to -27.7154 |
| ISRLU (α=1.0) | 86.69% | 4.231% | 0.5788 | P < 0.0001 | 1.572874 | -1.0753 to -0.4856 |

Values are rounded, which might cause slight deviations when reproducing the statistical values from these tests.
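
As a rough illustration of how the effect size and confidence interval columns relate to the reported means and standard deviations, the sketch below recomputes them for the Mish vs. Swish-1 row. It rests on assumptions the text does not confirm: that a two-sample Student's t-test with pooled variance and 23 runs per group was used, and that the 95% critical value for 44 degrees of freedom is about 2.0154. The function names are hypothetical, and because the inputs are the rounded table values, the output only approximates the table.

```python
import math


def pooled_sd(sd_a: float, sd_b: float) -> float:
    """Pooled standard deviation for two groups of equal size."""
    return math.sqrt((sd_a**2 + sd_b**2) / 2.0)


def cohens_d(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Absolute standardized mean difference (Cohen's d)."""
    return abs(mean_a - mean_b) / pooled_sd(sd_a, sd_b)


def mean_diff_ci(mean_a, sd_a, mean_b, sd_b, n, t_crit):
    """Confidence interval for (mean_b - mean_a) with n runs per group."""
    diff = mean_b - mean_a
    se = pooled_sd(sd_a, sd_b) * math.sqrt(2.0 / n)
    return diff - t_crit * se, diff + t_crit * se


# Rounded table values: Mish (87.48, 0.3967) vs. Swish-1 (87.32, 0.414).
N_RUNS = 23
T_CRIT_DF44 = 2.0154  # assumed two-sided 95% critical value, df = 2 * 23 - 2

d = cohens_d(87.48, 0.3967, 87.32, 0.414)
low, high = mean_diff_ci(87.48, 0.3967, 87.32, 0.414, N_RUNS, T_CRIT_DF44)
print(f"Cohen's d ~ {d:.3f}, 95% CI ~ {low:.4f} to {high:.4f}")
```

Under these assumptions the result is close to, but not identical with, the Swish-1 row (d = 0.386, CI -0.3975 to 0.0844), which is consistent with the rounding note above. The interval spans zero, matching the non-significant P-value reported for Swish-1.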

Results:

News: Ajay Arasanipalai recently submitted a benchmark for CIFAR-10 training to the Stanford DAWN Benchmark using a custom ResNet-9 + Mish, which achieved 94.05% accuracy in just 10.7 seconds in 14 epochs on the HAL Computing Cluster. This is the current fastest training of CIFAR-10 on 4 GPUs and the 2nd fastest training of CIFAR-10 overall in the world.

Summary of Results (Vision Tasks):

Comparison is based on the highest-priority metric for each task: Top-1 accuracy for image classification, and the loss for generative networks and image segmentation. Therefore, for the latter, Mish > Baseline indicates a better loss, and vice versa. For embeddings, the AUC metric is considered.

| Activation Function | Mish > Baseline Model | Mish < Baseline Model |
|---|---|---|
| ReLU | 55 | 20 |
| Swish-1 | 53 | 22 |
| SELU | 26 | 1 |
| Sigmoid | 24 | 0 |
| TanH | 24 | 0 |
| HardShrink(λ = 0.5) | 23 | 0 |
| Tanhshrink | 23 | 0 |
| PReLU(Default Parameters) | 23 | 2 |
| Softsign | 22 | 1 |
| Softshrink (λ = 0.5) | 22 | 1 |
| Hardtanh | 21 | 2 |
| ELU(α=1.0) | 21 | 7 |
| LogSigmoid | 20 | 4 |
| GELU | 19 | 3 |
| E-Swish (β=1.75) | 19 | 7 |
| CELU(α=1.0) | 18 | 5 |
| SoftPlus(β = 1) | 17 | 7 |
| Leaky ReLU(α=0.3) | 17 | 8 |
| Aria-2(β = 1, α=1.5) | 16 | 2 |
| ReLU6 | 16 | 8 |
| SQNL | 13 | 1 |
| Weighted TanH (Weight = 1.7145) | 12 | 1 |
| RReLU | 12 | 11 |
| ISRU (α=1.0) | 11 | 1 |
| Le Cun's TanH | 10 | 2 |
| Bent's Identity | 10 | 5 |
| Hard ELisH | 9 | 1 |
| Flatten T-Swish | 9 | 3 |
| Soft Clipping (α=0.5) | 9 | 3 |
| SineReLU (ε = 0.001) | 9 | 4 |
| ISRLU (α=1.0) | 9 | 4 |
| ELisH | 7 | 3 |
| SReLU | 7 | 6 |
| Hard Sigmoid | 1 | 0 |
| Thresholded ReLU(θ=1.0) | 1 | 0 |

Summary of Results (Language Tasks):

Comparison is done based on the best metric score (Test accuracy) across 3 runs.

| Activation Function | Mish > Baseline Model | Mish < Baseline Model |
|---|---|---|
| Penalized TanH | 5 | 0 |
| ELU | 5 | 0 |
| Sigmoid | 5 | 0 |
| SReLU | 4 | 0 |
| TanH | 4 | 1 |
| Swish | 3 | 2 |
| ReLU | 2 | 3 |
| Leaky ReLU | 2 | 3 |
| GELU | 1 | 2 |

Try It!

| Torch | DarkNet | Julia | FastAI | TensorFlow | Keras | CUDA |
|---|---|---|---|---|---|---|
| Source | Source | Source | Source | Source | Source | Source |

Future Work:
  • Comparison of Convergence Rates.
  • Normalizing constant for Mish to eliminate the use of Batch Norm.
  • Regularizing effect of the first derivative of Mish with respect to Swish.

Acknowledgments:

Thanks to all the people who have helped and supported me massively through this project who include:

  1. Sparsha Mishra
  2. Alexandra Deis
  3. Alexey Bochkovskiy
  4. Chien-Yao Wang
  5. Thomas Brandon
  6. Less Wright
  7. Manjunath Bhat
  8. Ajay Uppili Arasanipalai
  9. Federico Lois
  10. Javier Ideami
  11. Ioannis Anifantakis
  12. George Christopoulos
  13. Miklos Toth

And many more including the Fast AI community, Weights and Biases Community, TensorFlow Addons team, SpaCy/Thinc team, Sicara team, Udacity scholarships team to name a few. Apologies if I missed out anyone.

Cite this work:

@article{misra2019mish,
  title={Mish: A self regularized non-monotonic neural activation function},
  author={Misra, Diganta},
  journal={arXiv preprint arXiv:1908.08681},
  year={2019}
}