Summary of common activation functions for deep learning
2022-07-26 09:03:00 【Weiyaner】
1 Why do we need activation functions
First of all, real-world data distributions are nonlinear, while the basic computations in a neural network are linear. Activation functions introduce nonlinearity into the network and strengthen its capacity to learn. Nonlinearity is therefore the defining property of an activation function.
Different activation functions suit different applications, depending on their characteristics.
Sigmoid and tanh bound their outputs to (0, 1) and (-1, 1) respectively, which makes them suitable for producing probability-like values, for example the various gates in an LSTM.
ReLU cannot play that role, because it has no upper bound and may produce large values. By the same token, ReLU is well suited to training deep networks, while sigmoid and tanh are not, because their gradients vanish.
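As a quick illustration of the gradient-vanishing claim, here is a minimal NumPy sketch (my own example, not code from the original post). Backpropagation multiplies local derivatives layer by layer, so a derivative capped at 0.25 shrinks the gradient geometrically with depth, while ReLU's derivative of 1 on the active side passes it through unchanged:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.5    # pre-activation value, assumed identical at every layer for simplicity
depth = 20 # number of layers the gradient must flow through

sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))  # at most 0.25
relu_grad = 1.0 if x > 0 else 0.0             # exactly 1 on the active side

print(sigmoid_grad ** depth)  # ~3e-13: the gradient has effectively vanished
print(relu_grad ** depth)     # 1.0: the gradient survives intact
```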
2 Common activation functions
1 Sigmoid
The sigmoid function is also called the logistic function, because it can be derived from logistic regression (LR); it is also the activation function prescribed by the LR model.
The sigmoid function maps its input to the range (0, 1), which makes the network's output easy to interpret.
| Activation function | Expression | Derivative | Range | Typical use |
|---|---|---|---|---|
| Sigmoid | $f(x) = \dfrac{1}{1+e^{-x}}$ | $f' = f(1-f)$ | (0, 1) | Computing probability-like values |
Advantages and disadvantages:
- Advantages:
  - Smooth and easy to differentiate; the derivative $f' = f(1-f)$ is expressed in terms of the output itself.
  - The output lies in (0, 1) and can be read as a probability, as in logistic regression.
- Disadvantages:
  - The activation is computationally expensive (both forward and backward passes involve exponentiation and division);
  - Computing the error gradient in backpropagation involves division;
  - The derivative of sigmoid lies in (0, 0.25]; because backpropagation multiplies these derivatives layer by layer (the chain rule), gradients vanish easily.
  - The output of sigmoid is not zero-centered: neurons in the next layer receive non-zero-mean signals as input, and as the network deepens this shifts the original distribution of the data.
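A minimal NumPy sketch of sigmoid and its derivative (my own illustration, not code from the original post). It uses the standard numerically stable formulation, and confirms that the derivative peaks at 0.25 at $x = 0$, matching the (0, 0.25] range cited above:

```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid: avoids overflow in exp() for large |x|."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    exp_x = np.exp(x[~pos])           # safe: x is negative on this branch
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

def sigmoid_grad(x):
    """f'(x) = f(x) * (1 - f(x)): the derivative is written via the output itself."""
    f = sigmoid(x)
    return f * (1.0 - f)

x = np.linspace(-6, 6, 7)
print(sigmoid(x))             # all values lie in (0, 1)
print(sigmoid_grad(x).max())  # 0.25, attained at x = 0
```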
2 Tanh
tanh is the hyperbolic tangent function. Like sigmoid, it is a saturating activation function; the difference is that its output range is (-1, 1) instead of (0, 1). In fact, tanh can be viewed as a shifted and stretched sigmoid: $\tanh(x) = 2\,\sigma(2x) - 1$.
| Activation function | Expression | Derivative | Range | Typical use |
|---|---|---|---|---|
| tanh | $f(x) = \dfrac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$ | $f' = 1 - f^{2}$ | (-1, 1) | Zero-centered alternative to sigmoid |
Characteristics of tanh:
- Advantages:
  - The output range is (-1, 1), which solves sigmoid's non-zero-centered output problem.
- Disadvantages:
  - The cost of exponentiation remains;
  - The derivative of tanh lies in (0, 1], versus (0, 0.25] for sigmoid, so gradient vanishing is alleviated but not eliminated.
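A minimal NumPy sketch of tanh and its derivative (my own illustration, not from the original post). It shows the derivative reaching 1 at $x = 0$ (four times sigmoid's peak of 0.25) and verifies the shift-and-stretch identity mentioned above:

```python
import numpy as np

def tanh(x):
    return np.tanh(x)  # NumPy's built-in hyperbolic tangent

def tanh_grad(x):
    """f'(x) = 1 - tanh(x)^2, so the gradient lies in (0, 1]."""
    t = np.tanh(x)
    return 1.0 - t * t

x = np.linspace(-3, 3, 7)
print(tanh(x))        # zero-centered outputs in (-1, 1)
print(tanh_grad(0.0)) # 1.0: four times sigmoid's peak derivative of 0.25

# The identity tanh(x) = 2*sigmoid(2x) - 1 ties the two functions together:
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(np.allclose(tanh(x), 2 * sigmoid(2 * x) - 1))  # True
```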
3 ReLU and its variants (AlexNet, 2012)
To address the gradient-vanishing problem of the activation functions above, the rectified linear unit (ReLU) came into wide use, popularized by AlexNet in 2012.
| Activation function | Expression | Derivative | Range | Typical use |
|---|---|---|---|---|
| ReLU | $f(x) = \max(0, x)$ | $f'(x) = 1$ for $x > 0$; $0$ for $x < 0$ | $[0, +\infty)$ | Avoids vanishing gradients; suited to deep networks |
| Leaky ReLU ($a_i = 0.01$) / PReLU ($a_i$ learnable) | $f(x) = \begin{cases} a_i x, & x < 0 \\ x, & x \ge 0 \end{cases}$ | $f'(x) = \begin{cases} a_i, & x < 0 \\ 1, & x \ge 0 \end{cases}$ | $(-\infty, +\infty)$ | Replaces ReLU's zero gradient for $x < 0$ with a small negative slope, preventing dead neurons |
| RReLU | $f(x) = \begin{cases} a_{ji} x, & x < 0 \\ x, & x \ge 0 \end{cases}$ | $f'(x) = \begin{cases} a_{ji}, & x < 0 \\ 1, & x \ge 0 \end{cases}$ | $(-\infty, +\infty)$ | The negative-side slope $a_{ji}$ is drawn from a uniform distribution $U(l, u)$ during training |
Summary:
- In Leaky ReLU, $a_i$ is a constant, typically set to 0.01. It often works slightly better than plain ReLU, but the improvement is not consistent, so Leaky ReLU is not widely used in practice.
- In PReLU (parametric rectified linear unit), $a_i$ is a learnable parameter that is updated during training.
- RReLU (randomized rectified linear unit) is another variant of Leaky ReLU. In RReLU, the negative-side slope is random during training and becomes fixed at test time. Its highlight is that during training, $a_{ji}$ is drawn from a uniform distribution $U(l, u)$.
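A minimal NumPy sketch of the ReLU family discussed above (my own illustration; the post does not give the $U(l, u)$ bounds, so the RReLU sketch assumes the commonly used defaults of $1/8$ and $1/3$):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    """Leaky ReLU: a is a fixed small constant, 0.01 by convention."""
    return np.where(x >= 0, x, a * x)

def prelu(x, a):
    """PReLU: same form as Leaky ReLU, but a is a learnable parameter.
    Here it is simply passed in; a real framework would update it by
    gradient descent alongside the weights."""
    return np.where(x >= 0, x, a * x)

def rrelu(x, lower=1/8, upper=1/3, training=True,
          rng=np.random.default_rng()):
    """RReLU: the negative-side slope is drawn from U(lower, upper) during
    training and fixed to its expectation (lower + upper) / 2 at test time.
    The (1/8, 1/3) bounds are common defaults, not from the original post."""
    a = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))                    # [0.  0.  0.  1.5]
print(leaky_relu(x))              # small negative slope keeps gradients alive
print(rrelu(x, training=False))   # deterministic slope at test time
```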