Li Hongyi Machine Learning 2022.7.15 -- Gradient Descent
2022-07-19 15:09:00 【ww9878】
Introduction to gradient descent
In solving an optimization problem, we need to find the set of parameters θ* that makes the loss function as small as possible.
First, pick arbitrary initial values w^0 and b^0, then update them to obtain w^1 and b^1:
w^1 = w^0 − η ∂L/∂w, b^1 = b^0 − η ∂L/∂b, with the partial derivatives evaluated at (w^0, b^0).
η is the learning rate and is set manually. Repeat this step, updating w^i and b^i until they no longer change. This iterative process of updating w^i and b^i is gradient descent.
The partial-derivative terms in the update rule are the gradient.
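To make the update rule concrete, here is a minimal sketch of gradient descent for a linear model y = b + w·x with a squared-error loss; the data values, learning rate, and iteration count are made up purely for illustration.

```python
# Minimal gradient-descent sketch for a linear model y = b + w * x
# with squared-error loss. Data and hyperparameters are illustrative only.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 5.0, 7.2, 8.9]   # roughly y = 1 + 2x

w, b = 0.0, 0.0            # w^0, b^0: arbitrary starting values
eta = 0.01                 # learning rate, set manually

for step in range(1000):
    # Partial derivatives of L(w, b) = sum_n (y_n - (b + w * x_n))^2
    grad_w = sum(-2.0 * (yn - (b + w * xn)) * xn for xn, yn in zip(x, y))
    grad_b = sum(-2.0 * (yn - (b + w * xn)) for xn, yn in zip(x, y))
    # Move opposite to the gradient, scaled by the learning rate
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)  # converges near the least-squares fit, roughly w ≈ 2, b ≈ 1.1
```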
Learning rate η
The learning rate η must be tuned carefully. If it is too large, the update easily overshoots and gets stuck bouncing around some position without reaching the minimum; if it is too small, each step moves only a short distance and the result comes out very slowly. As pictured:
Usually at the start, when we are still far from the lowest point, a larger value of η can be chosen; as we move closer and closer to the lowest point, the value of η can be lowered appropriately. Different parameters also need different learning rates.
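One simple way to realize this is a time-decaying learning rate; the 1/√(t+1) decay below is the schedule that also appears in the Adagrad derivation, and the starting value 0.1 is just an illustrative choice.

```python
import math

eta = 0.1  # initial learning rate (illustrative value)

def eta_t(t):
    # Decaying learning rate: large steps at the start, smaller steps later
    return eta / math.sqrt(t + 1)

print([round(eta_t(t), 4) for t in range(5)])
# [0.1, 0.0707, 0.0577, 0.05, 0.0447]
```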
Adagrad Algorithm
Adagrad divides each parameter's learning rate by the root mean square of that parameter's previous derivatives: w^{t+1} = w^t − (η^t / σ^t) g^t. Here σ^t is the root mean square of all the previous derivatives of the parameter, and it is different for every parameter.
The larger the gradient g^t, the larger the step; but the larger σ^t (the accumulated past gradients), the smaller the step. The two seem to contradict each other. Considering the comparison across parameters, the best step size should be (first derivative) / (second derivative): proportional to the first derivative and inversely proportional to the second derivative, so the larger the second derivative, the smaller the parameter update should be. Only by taking the second derivative into account can the step truly reflect the distance to the lowest point; Adagrad uses the root mean square of the past first derivatives to approximate the effect of the second derivative without computing it.
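Below is a minimal Adagrad sketch for a single parameter, assuming the usual simplified form w^{t+1} = w^t − η·g^t / √(Σ (g^i)^2); the loss function L(w) = (w − 3)^2 and all values are made up for illustration.

```python
import math

def grad(w):
    # Derivative of the illustrative loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0
eta = 1.0
sum_g2 = 0.0  # accumulated squared gradients of this parameter

for t in range(200):
    g = grad(w)
    sum_g2 += g * g
    sigma = math.sqrt(sum_g2)   # grows with the history of gradients
    w -= (eta / sigma) * g      # larger past gradients -> smaller step

print(w)  # approaches the minimum at w = 3
```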
Stochastic gradient descent
Stochastic gradient descent is faster than the gradient descent above. Pick one example x^n, either randomly or in order, compute the loss on that single example, and update the gradient immediately.
Where ordinary gradient descent takes one step after seeing all the examples, stochastic gradient descent has already taken many steps.
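A sketch of the difference, reusing the illustrative linear model from the earlier gradient-descent example: instead of summing the gradient over all examples before one update, the parameters are updated after every single example.

```python
import random

x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 5.0, 7.2, 8.9]
w, b, eta = 0.0, 0.0, 0.01

for epoch in range(200):
    examples = list(zip(x, y))
    random.shuffle(examples)     # visit examples in random order
    for xn, yn in examples:
        err = yn - (b + w * xn)
        # Update immediately from the loss on this single example
        w += eta * 2.0 * err * xn
        b += eta * 2.0 * err

print(w, b)  # hovers near the full-batch solution, roughly w ≈ 2, b ≈ 1.1
```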
Feature scaling
Scale the ranges of the input features so that different features have the same range.
When the input value xi is large, the same change in wi has a large influence on the output. As the diagram shows, x2 has a large influence on the loss function, so the loss surface is steep in the w2 direction.
Scaling method

For the i-th dimension, compute the mean mi and the standard deviation σi. Then for the r-th example, take the i-th input x_i^r, subtract the mean mi, and divide by the standard deviation σi: x_i^r ← (x_i^r − mi) / σi. After this, the mean of every dimension is 0 and every variance is 1.
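A minimal sketch of this standardization, assuming the data is stored as a list of examples, each a list of feature values (the layout and the numbers are illustrative).

```python
# Standardize every feature dimension to mean 0 and variance 1.
data = [
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 600.0],
]  # illustrative examples; the second feature has a much larger range

n_dims = len(data[0])
for i in range(n_dims):
    column = [example[i] for example in data]
    m_i = sum(column) / len(column)                                        # mean of dimension i
    sigma_i = (sum((v - m_i) ** 2 for v in column) / len(column)) ** 0.5   # std of dimension i
    for example in data:
        example[i] = (example[i] - m_i) / sigma_i                          # (x_i^r - m_i) / sigma_i

print(data)  # each column now has mean 0 and variance 1
```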
Mathematical theory behind gradient descent
(There are some parts I still don't fully understand; these are provisional notes.)
Taylor expansion
If h(x) has derivatives of all orders in a neighbourhood of the point x = x0 (i.e. it is infinitely differentiable), then within that neighbourhood:
h(x) = h(x0) + h′(x0)(x − x0) + h″(x0)/2! · (x − x0)^2 + … = Σ_{k=0}^{∞} h^(k)(x0)/k! · (x − x0)^k
When x is very close to x0, we have h(x) ≈ h(x0) + h′(x0)(x − x0). This formula is the expansion of h(x) in powers of (x − x0) near the point x = x0, also known as the Taylor expansion.
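A quick numerical check of this first-order approximation, using an arbitrary smooth function h(x) = sin(x) around x0 = 1 (the function and the point are chosen only for illustration).

```python
import math

# First-order Taylor approximation: h(x) ≈ h(x0) + h'(x0) * (x - x0)
h = math.sin
h_prime = math.cos   # derivative of sin is cos
x0 = 1.0

for x in (1.1, 1.01, 1.001):
    approx = h(x0) + h_prime(x0) * (x - x0)
    print(x, h(x), approx, abs(h(x) - approx))  # the error shrinks as x gets closer to x0
```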
Multivariate Taylor expansion
For a function of two variables, near the point (x0, y0):
h(x, y) ≈ h(x0, y0) + ∂h(x0, y0)/∂x · (x − x0) + ∂h(x0, y0)/∂y · (y − y0)
Based on the Taylor expansion, inside a sufficiently small red circle centred at the point (a, b), the loss function can be simplified by the Taylor expansion:
L(θ) ≈ s + u(θ1 − a) + v(θ2 − b), where s = L(a, b), u = ∂L(a, b)/∂θ1, and v = ∂L(a, b)/∂θ2.
To minimize the loss function inside the circle, (θ1 − a, θ2 − b) should point in the direction opposite to (u, v), with its length limited by the radius of the circle.
Substituting u and v back in gives exactly the gradient descent update:
θ1 = a − η·u = a − η ∂L(a, b)/∂θ1, θ2 = b − η·v = b − η ∂L(a, b)/∂θ2.
Derivation process via the Taylor expansion:
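Using the notation s, u, v introduced above, here is a brief sketch of the derivation (minimizing the local approximation over a small circle of radius d; the constant s does not affect the minimization):

```latex
% Inside a circle of radius d centred at (a, b), write
% \Delta\theta_1 = \theta_1 - a and \Delta\theta_2 = \theta_2 - b, so that
% L(\theta) \approx s + u\,\Delta\theta_1 + v\,\Delta\theta_2.
% Minimizing the inner product (u, v)\cdot(\Delta\theta_1, \Delta\theta_2)
% subject to \Delta\theta_1^2 + \Delta\theta_2^2 \le d^2 picks the vector of
% length d pointing opposite to (u, v):
\[
\begin{aligned}
\begin{bmatrix}\Delta\theta_1\\ \Delta\theta_2\end{bmatrix}
  &= -\eta\begin{bmatrix}u\\ v\end{bmatrix},
  \qquad \eta = \frac{d}{\sqrt{u^2+v^2}},\\[4pt]
\begin{bmatrix}\theta_1\\ \theta_2\end{bmatrix}
  &= \begin{bmatrix}a\\ b\end{bmatrix} - \eta\begin{bmatrix}u\\ v\end{bmatrix}
   = \begin{bmatrix}a\\ b\end{bmatrix}
     - \eta\begin{bmatrix}\partial L(a,b)/\partial\theta_1\\[2pt]
                          \partial L(a,b)/\partial\theta_2\end{bmatrix}.
\end{aligned}
\]
```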

Limitations of gradient descent
During gradient descent, the update stops wherever the partial derivatives equal 0, so it can easily stop at a point that is not the minimum (for example a local minimum, a saddle point, or a plateau where the gradient is close to 0).
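A tiny illustration of this failure mode, on the made-up function L(w) = w^3: its derivative 3w^2 vanishes at w = 0 even though w = 0 is not a minimum, so gradient descent started nearby grinds to a halt there.

```python
# Gradient descent can stop where the derivative is 0 even if it is not a minimum.
# Illustrative loss: L(w) = w**3, whose derivative 3*w**2 is 0 at w = 0.
w = 1.0
eta = 0.1
for _ in range(10000):
    w -= eta * 3.0 * w ** 2   # the step stalls as w nears the inflection point at 0

print(w)  # close to 0, which is not a minimum of w**3
```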