Deep learning parameter initialization (II) Kaiming initialization with code
2022-07-19 12:21:00 【Xiaoshu Xiaoshu】
Catalog

III. Kaiming Initialization Assumptions
IV. Simple Formula Derivation of Kaiming Initialization
V. PyTorch Implementation

Deep learning parameter initialization series:
(I) Xavier initialization with code
(II) Kaiming initialization with code
I. Introduction

Kaiming initialization paper: https://arxiv.org/abs/1502.01852

Xavier initialization does not perform well with ReLU layers, mainly because ReLU maps negative inputs to 0, which changes the overall variance. In addition, the activation functions that Xavier initialization applies to are limited: they must be symmetric about 0 and roughly linear. ReLU does not meet these conditions, and experiments also confirm that Xavier initialization is not suitable for the ReLU activation function. Kaiming He therefore improved on it and proposed Kaiming initialization, which was at first used mainly in computer vision and convolutional networks.
II. Basic Knowledge
1. If random variables X and Y are independent of each other, then:

$$E(XY) = E(X)E(Y) \tag{1}$$

2. Variance can be computed from expectations: the variance equals the expectation of the square minus the square of the expectation:

$$Var(X) = E(X^2) - (E(X))^2 \tag{2}$$

3. Variance of a product of independent random variables:

$$Var(XY) = Var(X)Var(Y) + Var(X)(E(Y))^2 + Var(Y)(E(X))^2 \tag{3}$$

4. If a continuous random variable X has probability density function f(x) and the integral converges absolutely, its expectation is:

$$E(X) = \int_{-\infty}^{+\infty} x f(x)\, dx \tag{4}$$
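Formula (3) is the workhorse of the derivation below. As a quick sanity check (a standalone sketch, not from the original post; the distributions and sample size are arbitrary), a few lines of NumPy can compare the empirical variance of a product of independent samples against the right-hand side of (3):

```python
import numpy as np

# Sanity check of formula (3): Var(XY) for independent X and Y.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)   # E(X) = 1,    Var(X) = 4
y = rng.normal(loc=-0.5, scale=3.0, size=1_000_000)  # E(Y) = -0.5, Var(Y) = 9

empirical = np.var(x * y)
predicted = (np.var(x) * np.var(y)
             + np.var(x) * np.mean(y) ** 2
             + np.var(y) * np.mean(x) ** 2)

print(empirical, predicted)  # the two values should be close
```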
III. Kaiming Initialization Assumptions
Like Xavier initialization, Kaiming initialization also adopts the Glorot conditions: the initialization strategy should keep the variance of the activations and of the state gradients consistent across layers during propagation. The Kaiming-initialized parameters still have zero mean, and the weights are assumed to keep a mean of 0 during updates.

Unlike Xavier initialization, Kaiming initialization no longer requires the mean output of each layer to be 0 (an activation function such as ReLU cannot achieve this), and accordingly it no longer requires f'(0) = 1.

In Kaiming initialization, forward propagation and back propagation each use their own initialization strategy, but each one ensures that the variance of the layer activations in forward propagation, or of the gradients in back propagation, stays at 1 across layers.
IV. Simple Formula Derivation of Kaiming Initialization

We derive the result for a convolutional layer, with ReLU as the activation function.
1. Forward propagation
For one convolutional layer we have:

$$y_i = \sum_{j=1}^{n_i} w_{ij} x_{ij} + b_i \tag{5}$$

where $y_i$ is the output before the activation function, $n_i$ is the number of weights contributing to one output element (for a $k \times k$ kernel with $c$ input channels, $n_i = k^2 c$), $w$ are the weights, and $x$ is the input. All the $w_{ij}$ (and likewise all the $x_{ij}$) are assumed i.i.d., so below we write $Var(w_i)$, $Var(x_i)$, $E(x_i)$ for a single representative element.
According to (3), formula (5) can be expanded as:

$$Var(y_i) = n_i\,Var(w_i x_i) = n_i\big[Var(w_i)Var(x_i) + Var(w_i)(E(x_i))^2 + Var(x_i)(E(w_i))^2\big] \tag{6}$$

By assumption $E(w_i) = 0$; however, $x_i$ comes from the previous layer through ReLU, so $E(x_i) \neq 0$. Then:

$$Var(y_i) = n_i\big[Var(w_i)Var(x_i) + Var(w_i)(E(x_i))^2\big] \tag{7}$$
By (2), $Var(x_i) + (E(x_i))^2 = E(x_i^2)$, so (7) can be reduced to:

$$Var(y_i) = n_i\,Var(w_i)\,E(x_i^2) \tag{8}$$
According to the expectation formula (4), we evaluate this expectation through the output of layer $i-1$: we have $x_i = f(y_{i-1})$, where $f$ denotes the ReLU function:

$$E(x_i^2) = E\big(f^2(y_{i-1})\big) = \int_{-\infty}^{+\infty} f^2(y_{i-1})\,p(y_{i-1})\,dy_{i-1} \tag{9}$$

where $p$ denotes the probability density function. Because $f(y_{i-1}) = 0$ when $y_{i-1} < 0$, the range below 0 can be dropped, and because $f(y_{i-1}) = y_{i-1}$ when $y_{i-1} > 0$, we obtain:

$$E(x_i^2) = \int_{0}^{+\infty} y_{i-1}^2\,p(y_{i-1})\,dy_{i-1} \tag{10}$$
Because $w_{i-1}$ is assumed to be symmetrically distributed around 0 with mean 0, $y_{i-1}$ is also symmetrically distributed around 0 with mean 0 (assuming the bias is 0), so:

$$\int_{0}^{+\infty} y_{i-1}^2\,p(y_{i-1})\,dy_{i-1} = \frac{1}{2}\int_{-\infty}^{+\infty} y_{i-1}^2\,p(y_{i-1})\,dy_{i-1} \tag{11}$$

Therefore the expectation $E(x_i^2)$ is:

$$E(x_i^2) = \frac{1}{2}E\big(y_{i-1}^2\big) \tag{12}$$
According to formula (2), because the expectation of $y_{i-1}$ equals 0, we have $E(y_{i-1}^2) = Var(y_{i-1})$, so (12) becomes:

$$E(x_i^2) = \frac{1}{2}Var(y_{i-1}) \tag{13}$$

Substituting (13) into (8):

$$Var(y_i) = \frac{1}{2}\,n_i\,Var(w_i)\,Var(y_{i-1}) \tag{14}$$
Applying this recursively from the first layer, the variance of layer $L$ is:

$$Var(y_L) = Var(y_1)\prod_{i=2}^{L}\frac{1}{2}\,n_i\,Var(w_i)$$

Here $y_1$ corresponds to the input samples, which we normalize, so $Var(y_1) = 1$. Now require the output variance of every layer to stay equal to 1, i.e.:

$$\frac{1}{2}\,n_i\,Var(w_i) = 1$$

$$Var(w_i) = \frac{2}{n_i}$$
So for forward propagation, Kaiming initialization can be implemented as the uniform distribution

$$W \sim U\left[-\sqrt{\frac{6}{n_i}},\;\sqrt{\frac{6}{n_i}}\right]$$

or as the Gaussian distribution

$$W \sim N\left(0,\;\frac{2}{n_i}\right)$$
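The condition $Var(w_i) = 2/n_i$ can also be checked numerically. The sketch below is my own illustration, not part of the original post; the layer width n_i and batch size are arbitrary. It draws a pre-activation $y_{i-1} \sim N(0, 1)$, applies ReLU, and then a weight vector sampled from $N(0, 2/n_i)$; the variance of the resulting output should stay close to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n_i = 1024        # number of weights feeding one output unit (arbitrary)
batch = 10_000    # number of simulated samples (arbitrary)

# Pre-activation of the previous layer, assumed N(0, 1) as in the derivation.
y_prev = rng.normal(0.0, 1.0, size=(batch, n_i))
x = np.maximum(y_prev, 0.0)                      # ReLU: E(x^2) = Var(y_prev) / 2

# Forward-mode Kaiming weights: Var(w) = 2 / n_i.
w = rng.normal(0.0, np.sqrt(2.0 / n_i), size=n_i)

y = x @ w                                        # one output unit per sample
print(np.var(y))                                 # should be close to 1
```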
2. Back propagation
For back propagation we have:

$$\Delta x_i = \hat{w}_i\,\Delta y_i \tag{15}$$

where $\Delta$ denotes the derivative of the loss function with respect to the quantity that follows, and $\hat{w}$ denotes the layer's weights as they appear in back propagation.

According to (3):

$$Var(\Delta x_i) = \hat{n}_i\,Var(\hat{w}_i\,\Delta y_i) = \hat{n}_i\big[Var(\hat{w}_i)Var(\Delta y_i) + Var(\hat{w}_i)(E(\Delta y_i))^2 + Var(\Delta y_i)(E(\hat{w}_i))^2\big]$$

With $E(\hat{w}_i) = 0$ and $E(\Delta y_i) = 0$, and using the same ReLU argument as in the forward pass ($Var(\Delta y_i) = \frac{1}{2}Var(\Delta x_{i+1})$, since the gradient is either passed through or zeroed with equal probability), this simplifies to:

$$Var(\Delta x_i) = \frac{1}{2}\,\hat{n}_i\,Var(\hat{w}_i)\,Var(\Delta x_{i+1})$$
where $\hat{n}_i$ is the number of connections seen in back propagation (for a $k \times k$ kernel with $d$ output channels, $\hat{n}_i = k^2 d$). Requiring the gradient variance to be preserved in every layer finally gives:

$$\frac{1}{2}\,\hat{n}_i\,Var(\hat{w}_i) = 1$$

$$Var(\hat{w}_i) = \frac{2}{\hat{n}_i}$$
So for back propagation, Kaiming initialization can be implemented as the uniform distribution

$$W \sim U\left[-\sqrt{\frac{6}{\hat{n}_i}},\;\sqrt{\frac{6}{\hat{n}_i}}\right]$$

or as the Gaussian distribution

$$W \sim N\left(0,\;\frac{2}{\hat{n}_i}\right)$$
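In practice only one of the two conditions is applied at a time (PyTorch exposes them as mode='fan_in' and mode='fan_out'). The helper below is a hypothetical illustration of my own, just to make the two quantities concrete: it computes $n_i$ (fan_in) and $\hat{n}_i$ (fan_out) from a 4-D conv weight shape, together with the corresponding Kaiming std and uniform bound:

```python
import math

def kaiming_fan_and_bounds(weight_shape):
    """weight_shape = (out_channels, in_channels, k_h, k_w), the PyTorch conv weight layout."""
    out_channels, in_channels, k_h, k_w = weight_shape
    receptive = k_h * k_w
    fan_in = in_channels * receptive     # n_i: used by the forward-propagation condition
    fan_out = out_channels * receptive   # n_hat_i: used by the back-propagation condition
    return {
        'fan_in': fan_in,
        'fan_out': fan_out,
        'normal_std_fan_in': math.sqrt(2.0 / fan_in),
        'normal_std_fan_out': math.sqrt(2.0 / fan_out),
        'uniform_bound_fan_in': math.sqrt(6.0 / fan_in),
        'uniform_bound_fan_out': math.sqrt(6.0 / fan_out),
    }

# Example: a 3x3 convolution with 64 input channels and 128 output channels.
print(kaiming_fan_and_bounds((128, 64, 3, 3)))
```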
V. PyTorch Implementation
import torch


class DemoNet(torch.nn.Module):
    def __init__(self):
        super(DemoNet, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 1, 3)
        print('random init:', self.conv1.weight)
        '''
        kaiming_uniform_ samples from the uniform distribution U(-bound, bound),
        where bound = sqrt(6 / ((1 + a^2) * fan_mode)).
        a is the negative slope of the activation function (0 for ReLU).
        mode: 'fan_in' keeps the variance consistent in forward propagation;
              'fan_out' keeps the variance consistent in back propagation.
        nonlinearity: 'relu' or 'leaky_relu'; the default is 'leaky_relu'.
        '''
        torch.nn.init.kaiming_uniform_(self.conv1.weight, a=0, mode='fan_out')
        print('kaiming_uniform_:', self.conv1.weight)
        '''
        kaiming_normal_ samples from the zero-mean normal distribution N(0, std^2),
        where std = sqrt(2 / ((1 + a^2) * fan_mode)).
        a is the negative slope of the activation function (0 for ReLU).
        mode: 'fan_in' keeps the variance consistent in forward propagation;
              'fan_out' keeps the variance consistent in back propagation.
        nonlinearity: 'relu' or 'leaky_relu'; the default is 'leaky_relu'.
        '''
        torch.nn.init.kaiming_normal_(self.conv1.weight, a=0, mode='fan_out')
        print('kaiming_normal_:', self.conv1.weight)


if __name__ == '__main__':
    demoNet = DemoNet()
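As a quick check that the built-in initializer matches the derivation (my own sketch, not from the original post; the 64-to-128-channel 3x3 layer is an arbitrary choice made only to give enough weights for a stable estimate), the empirical standard deviation after kaiming_normal_ with mode='fan_out' should be close to sqrt(2 / fan_out):

```python
import math
import torch

# fan_out for a Conv2d weight of shape (128, 64, 3, 3) is out_channels * k * k = 128 * 9.
conv = torch.nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
torch.nn.init.kaiming_normal_(conv.weight, a=0, mode='fan_out', nonlinearity='relu')

fan_out = 128 * 3 * 3
print(conv.weight.std().item(), math.sqrt(2.0 / fan_out))  # the two values should be close
```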