
Deep learning parameter initialization (II) Kaiming initialization with code

2022-07-19 12:21:00 Xiaoshu Xiaoshu

Contents

I. Introduction

II. Basic knowledge

III. Kaiming initialization assumptions

IV. Simple derivation of Kaiming initialization

1. Forward propagation

2. Back propagation

V. PyTorch implementation


Deep learning parameter initialization series:

(I) Xavier initialization with code

(II) Kaiming initialization with code

I. Introduction

Kaiming initialization paper: https://arxiv.org/abs/1502.01852

Xavier initialization performs poorly with ReLU layers, mainly because ReLU maps negative inputs to 0, which changes the overall variance. Moreover, the activation functions that the Xavier method assumes are restrictive: they must be symmetric about 0 and (approximately) linear. ReLU satisfies neither condition, and experiments also confirm that Xavier initialization is not suitable for the ReLU activation function. Kaiming He therefore improved on it and proposed Kaiming initialization, which was at first mainly applied to computer vision and convolutional networks.

II. Basic knowledge

1. Suppose random variables X and Y are independent of each other; then

Var(X+Y)=Var(X)+Var(Y)                (1)

2. Computing variance from expectations: the variance equals the expectation of the square minus the square of the expectation.

Var(X)=E(X^{2})-(E(X))^{2}                (2)

3. Variance of a product of independent variables:

Var(XY)=Var(X)Var(Y)+Var(X)(E(Y))^{2}+Var(Y)(E(X))^{2}        (3)

4. If a continuous random variable X has probability density function f(x) and the integral converges absolutely, its expectation is:

E(X)=\int_{-\infty }^{\infty }xf(x)dx                        (4)
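These identities can be checked numerically. Below is a quick Monte Carlo sanity check (added for illustration, not part of the original text; the sample size and the example distributions are arbitrary choices):

import torch

# Monte Carlo sanity check of formulas (1)-(3) using independent samples
torch.manual_seed(0)
n = 1_000_000
x = torch.randn(n) * 2.0 + 1.0   # Var(X) = 4,  E(X) = 1
y = torch.randn(n) * 3.0 - 0.5   # Var(Y) = 9,  E(Y) = -0.5

# (1) Var(X + Y) = Var(X) + Var(Y)
print((x + y).var().item(), (x.var() + y.var()).item())

# (2) Var(X) = E(X^2) - (E(X))^2
print(x.var().item(), ((x ** 2).mean() - x.mean() ** 2).item())

# (3) Var(XY) = Var(X)Var(Y) + Var(X)(E(Y))^2 + Var(Y)(E(X))^2
print((x * y).var().item(),
      (x.var() * y.var() + x.var() * y.mean() ** 2 + y.var() * x.mean() ** 2).item())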

III. Kaiming initialization assumptions

Similar to Xavier initialization, Kaiming initialization also follows the Glorot condition: the initialization strategy should keep the variance of the activations and the variance of the state gradients consistent across layers during propagation. The Kaiming-initialized parameters still have zero mean, and the mean of the weights remains 0 throughout the update process.

Unlike Xavier initialization, Kaiming initialization no longer requires the mean output of each layer to be 0 (an activation function such as ReLU cannot achieve this); accordingly, it no longer requires f′(0)=1 either.

In Kaiming initialization, forward propagation and back propagation each use their own initialization strategy, but each guarantees that the variance of the activations in forward propagation, and of the gradients in back propagation, is preserved from layer to layer (the per-layer scaling factor equals 1).

IV. Simple derivation of Kaiming initialization

We carry out the derivation for a convolutional layer, using ReLU as the activation function.

1. Forward propagation

For one convolutional layer, we have:

Var(y_{i})=n_{i}Var(w_{i}\cdot x_{i})                        (5)

where y_{i} is the output before the activation function, n_{i} is the number of weights (the fan-in), w_{i} is a weight, and x_{i} is an input.

Using formula (3), formula (5) can be expanded as:

Var(y_{i})=n_{i}[Var(w_{i})Var(x_{i})+Var(w_{i})(E(x_{i}))^{2}+(E(w_{i}))^{2}Var(x_{i})]        (6)

By assumption E(w_{i})=0, but x_{i} is produced by the previous layer through ReLU, so E(x_{i})\neq 0; therefore:

Var(y_{i})=n_{i}[Var(w_{i})Var(x_{i})+Var(w_{i})(E(x_{i}))^{2}]

=n_{i}Var(w_{i})(Var(x_{i})+(E(x_{i}))^{2})                (7)

From formula (2) we have Var(x_{i})+(E(x_{i}))^{2}=E(x_{i}^{2}), so formula (7) becomes:

Var(y_{i})=n_{i}Var(w_{i})E(x_{i}^{2})                        (8)

According to the expectation formula (4), this expectation can be computed from the output of layer i-1: we have x_{i}=f(y_{i-1}), where f denotes the ReLU function.

E(x_{i}^{2})=E(f^{2}(y_{i-1}))=\int_{-\infty }^{\infty }f^{2}(y_{i-1})p(y_{i-1})dy_{i-1}                        (9)

where p(y_{i-1}) is the probability density function of y_{i-1}. Since f(y_{i-1})=0 for y_{i-1}\in (-\infty ,0), the part of the integral below 0 contributes nothing, and f(y_{i-1})=y_{i-1} for y_{i-1}>0, which gives:

E(x_{i}^{2})=\int_{0 }^{\infty }y_{i-1}^{2}p(y_{i-1})dy_{i-1}                   (10)

Because w_{i-1} is assumed to be distributed symmetrically around 0 with mean 0, y_{i-1} is also distributed symmetrically around 0 with mean 0 (assuming the bias is 0), so

\int_{-\infty }^{0 }y_{i-1}^{2}p(y_{i-1})dy_{i-1}=\int_{0 }^{\infty }y_{i-1}^{2}p(y_{i-1})dy_{i-1}       (11)

Therefore the expectation of x_{i}^{2} is:

E(x_{i}^{2})=\frac{1}{2}(\int_{-\infty }^{0 }y_{i-1}^{2}p(y_{i-1})dy_{i-1}+\int_{0 }^{\infty }y_{i-1}^{2}p(y_{i-1})dy_{i-1})

=\frac{1}{2}\int_{-\infty }^{\infty }y_{i-1}^{2}p(y_{i-1})dy_{i-1}=\frac{1}{2}E(y_{i-1}^{2})              (12)

According to formula (2), since the expectation of y_{i-1} equals 0, we have:

Var(y_{i-1})=E(y_{i-1}^{2})

So formula (12) becomes:

E(x_{i}^{2})=\frac{1}{2}E(y_{i-1}^{2})=\frac{1}{2}Var(y_{i-1})                        (13)
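As a quick numerical check of (13) (again an illustrative sketch, not from the original post): for a zero-mean, symmetric y, the mean of ReLU(y)^{2} should be half of Var(y).

import torch

# Check E[ReLU(y)^2] = Var(y) / 2 for a zero-mean symmetric distribution
torch.manual_seed(0)
y = torch.randn(1_000_000) * 1.7            # zero mean, Var(y) ≈ 2.89
print((torch.relu(y) ** 2).mean().item(),   # ≈ 1.445
      0.5 * y.var().item())                 # ≈ 1.445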

Substituting (13) into (8):

Var(y_{i})=\frac{1}{2}n_{i}Var(w_{i})Var(y_{i-1})                        (14)

Applying this recursively from the first layer forward, the variance at layer L is:

Var(y_{L})=Var(y_{1})(\prod_{i=2}^{L}\frac{1}{2}n_{i}Var(w_{i}))

Here x_{1} is the input sample, which we normalize so that Var(x_{1})=1. To keep the output variance the same in every layer, each factor of the product should equal 1, i.e.:

\frac{1}{2}n_{i}Var(w_{i})=1

Var(w_{i})=\frac{2}{n_{i}}

So for forward propagation, Kaiming initialization can be implemented either with the following uniform distribution (a uniform distribution U[-b,b] has variance b^{2}/3, so b=\sqrt{6/n_{i}} gives variance 2/n_{i}):

W\sim U[-\sqrt{\frac{6}{n_{i}}},\sqrt{\frac{6}{n_{i}}}]

or with the Gaussian distribution:

W\sim N[0,\frac{2}{n_{i}}]
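To see the effect of the forward rule Var(w_{i})=2/n_{i}, here is a minimal sketch (added for illustration; it uses a stack of fully connected layers rather than convolutions, a hand-rolled helper kaiming_normal_fan_in, and arbitrary widths and depth): the pre-activation variance stays bounded with depth instead of exploding or vanishing.

import torch

torch.manual_seed(0)

def kaiming_normal_fan_in(fan_in, fan_out):
    # w ~ N(0, 2 / fan_in), i.e. Var(w_i) = 2 / n_i as derived above
    return torch.randn(fan_out, fan_in) * (2.0 / fan_in) ** 0.5

h = torch.randn(4096, 512)          # normalized input, Var ≈ 1
for layer in range(1, 51):
    w = kaiming_normal_fan_in(512, 512)
    y = h @ w.t()                   # pre-activation of this layer
    h = torch.relu(y)               # activation fed to the next layer
    if layer % 10 == 0:
        # the pre-activation variance stays bounded (around 2 in this setup)
        # rather than growing or shrinking exponentially with depth
        print(f'layer {layer:2d}: Var(y) = {y.var().item():.3f}')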

2. Back propagation

For back propagation we have

\Delta x_{i}=\hat{w_{i}}\Delta y_{i}                (15)

where \Delta denotes the gradient of the loss with respect to the quantity that follows, and \hat{w}_{i} denotes the weights (rearranged for the backward pass).

According to formula (3):

Var(\Delta x_{i})=\hat{n}_{i}Var(\hat{w}_{i}\Delta y_{i})

=\hat{n}_{i}[Var(\hat{w}_{i})Var(\Delta y_{i})+Var(\hat{w}_{i})(E(\Delta y_{i}))^{2}+Var(\Delta y_{i})(E(\hat{w}_{i}))^{2}]

=\hat{n}_{i}Var(\hat{w}_{i})Var(\Delta y_{i})=\frac{1}{2}\hat{n}_{i}Var(\hat{w}_{i})Var(\Delta x_{i+1})

where \hat{n}_{i} is the fan-out used in back propagation (for a convolution, the kernel size times the number of output channels). The last line holds because E(\hat{w}_{i})=0 and E(\Delta y_{i})=0 by assumption, and because \Delta y_{i}=f'(y_{i})\Delta x_{i+1}, where for ReLU f'(y_{i}) is 0 or 1 with equal probability, so Var(\Delta y_{i})=\frac{1}{2}Var(\Delta x_{i+1}). Requiring the gradient variance to stay the same across layers finally gives

\frac{1}{2}\hat{n}_{i}Var(w_{i})=1

Var(w_{i})=\frac{2}{\hat{n}_{i}}

So for back propagation, Kaiming initialization can be implemented either with the following uniform distribution (again using the fact that U[-b,b] has variance b^{2}/3):

W\sim U[-\sqrt{\frac{6}{\hat{n}_{i}}},\sqrt{\frac{6}{\hat{n}_{i}}}]

or with the Gaussian distribution:

W\sim N[0,\frac{2}{\hat{n}_{i}}]
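For reference, the fan-in and fan-out variants correspond to mode='fan_in' and mode='fan_out' of PyTorch's built-in Kaiming initializers, which are used in the next section. A small check (with an arbitrarily chosen Conv2d of 64 input channels, 128 output channels and a 3x3 kernel) compares the empirical standard deviation of the initialized weights with \sqrt{2/n_{i}} and \sqrt{2/\hat{n}_{i}}:

import torch

conv = torch.nn.Conv2d(64, 128, kernel_size=3)
fan_in = 64 * 3 * 3       # n_i     = in_channels  * kernel_h * kernel_w
fan_out = 128 * 3 * 3     # fan-out = out_channels * kernel_h * kernel_w

torch.nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
print(conv.weight.std().item(), (2.0 / fan_in) ** 0.5)    # both ≈ 0.0589

torch.nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
print(conv.weight.std().item(), (2.0 / fan_out) ** 0.5)   # both ≈ 0.0417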

V. PyTorch implementation

import torch

class DemoNet(torch.nn.Module):
    def __init__(self):
        super(DemoNet, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 1, 3)
        print('random init:', self.conv1.weight)
        '''
            kaiming_uniform_ samples from the uniform distribution U(-bound, bound),
            where bound = sqrt(6 / ((1 + a^2) * fan)).
            a: negative slope of the activation function (0 for ReLU).
            mode: 'fan_in' keeps the variance consistent during forward propagation;
                  'fan_out' keeps the variance consistent during back propagation.
            nonlinearity: 'relu' or 'leaky_relu'; the default is 'leaky_relu'.
        '''
        torch.nn.init.kaiming_uniform_(self.conv1.weight, a=0, mode='fan_out')
        print('kaiming_uniform_:', self.conv1.weight)

        '''
            kaiming_normal_ samples from the zero-mean normal distribution N(0, std^2),
            where std = sqrt(2 / ((1 + a^2) * fan)).
            a: negative slope of the activation function (0 for ReLU).
            mode: 'fan_in' keeps the variance consistent during forward propagation;
                  'fan_out' keeps the variance consistent during back propagation.
            nonlinearity: 'relu' or 'leaky_relu'; the default is 'leaky_relu'.
        '''
        torch.nn.init.kaiming_normal_(self.conv1.weight, a=0, mode='fan_out')
        print('kaiming_normal_:', self.conv1.weight)


if __name__ == '__main__':
    demoNet = DemoNet()
