
What is the relationship between softmax and cross-entropy?

2022-07-19 12:22:00 Xiaobai learns vision


Source | Zhihu     Author | Dong Xin

https://www.zhihu.com/question/294679135/answer/885285177

This article is shared for academic purposes only; the copyright belongs to the author. If there is any infringement, please contact us for removal.


softmax looks simple, but there are actually quite a few details worth discussing.

Let's go through them one by one.

1. What is Softmax?

First, the job of softmax is to turn a sequence of numbers into probabilities. Given an input sequence x = (x_1, x_2, ..., x_n), softmax maps it to

softmax(x)_i = exp(x_i) / Σ_j exp(x_j)

It guarantees that:

  1. every value lies in [0, 1] (as a probability must);

  2. all the values sum to 1.

Explained in probabilistic terms, softmax assigns to element i the probability

P(y = i | x) = exp(x_i) / Σ_j exp(x_j)
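
As a quick numeric check of the guarantees above, here is a minimal NumPy sketch (the example vector (1, 2, 3) is my own choice):

import numpy as np

x = np.array([1.0, 2.0, 3.0])        # an arbitrary example sequence
p = np.exp(x) / np.exp(x).sum()      # softmax as defined above
print(p)                             # ~[0.090, 0.245, 0.665], every entry in [0, 1]
print(p.sum())                       # sums to 1 (up to float rounding)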

2. A Small Pitfall in the Softmax Documentation

Here is a small "pitfall": the documentation of many deep learning frameworks (PyTorch, TensorFlow) describes softmax like this:

take logits and produce probabilities

Clearly, the logits here are the output of the fully connected layer (with or without an activation), and the probabilities are the output of softmax. In some places the logits are also called unscaled log probabilities. This is interesting: "unscaled probability" is easy to accept, but why should whatever comes straight out of a fully connected layer have anything to do with a log?


There are two reasons:

  1. The output of a fully connected layer is unbounded (it can be positive or negative), which does not fit the definition of a probability; but if you view it as the log of a probability, it makes sense (a small numeric sketch follows after this list).

  2. The role of softmax, as we all know, is to normalize probabilities. Inside softmax, each input x_i is exponentiated to exp(x_i), so it is natural to think of each x_i as the log of a probability.
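
To see why the name "unscaled log probabilities" fits, here is a small sketch (the logit values are my own): taking the log of the softmax output gives back the logits shifted by a single shared constant, so the logits behave like log probabilities that simply have not been normalized.

import numpy as np

logits = np.array([2.0, -1.0, 0.5])        # raw fully connected output: unbounded, can be negative
p = np.exp(logits) / np.exp(logits).sum()  # softmax turns the logits into probabilities
print(np.log(p) - logits)                  # the same constant (-log Σ exp(logits)) for every entry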

3. Softmax is a Soft Version of ArgMax

 

OK, let's get back to softmax.

As the name suggests, softmax is a soft version of argmax. Let's see why.

For example, suppose the input to softmax is

(1, 2, 3)

Then the softmax result is roughly

(0.09, 0.24, 0.67)

Now change the input slightly: make the 3 a bit bigger, say 5, so the input becomes

(1, 2, 5)

and the softmax result becomes roughly

(0.02, 0.05, 0.94)

So softmax shows a very clear "Matthew effect": the strong (large) get stronger (larger), the weak (small) get weaker (smaller). If you simply pick the single largest number, that is actually what we would call hardmax. softmax, then, really is a soft version of max: it picks the maximum with a certain probability. In hardmax, the truly largest number is chosen with probability 1 (100%), and the other values have no chance at all. In softmax, every value has some chance of being selected as the maximum; it is just that, because of the "Matthew effect", the second-largest number, even if it is only slightly smaller than the truly largest one, has a much smaller probability of being picked.

So, as I said before, "the job of softmax is to turn a sequence into probabilities." This probability is nothing other than the probability of being chosen as the max.
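
To make the "soft versus hard" contrast concrete, here is a small sketch (the example vector is my own) comparing softmax with the one-hot "hardmax":

import numpy as np

x = np.array([1.0, 2.0, 5.0])
exps = np.exp(x - x.max())
soft = exps / exps.sum()              # softmax: every entry keeps some chance of being the max
hard = np.zeros_like(x)
hard[np.argmax(x)] = 1.0              # hardmax: all probability mass on the largest entry
print(soft)                           # ~[0.017, 0.047, 0.936]
print(hard)                           # [0, 0, 1]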

This soft version of max is useful in many places, because the hard version of max, good as it is, has a serious gradient problem: its gradient is extremely sparse (think of max pooling in neural networks). After a hardmax, only the selected variable receives any gradient; everything else gets none. For some tasks (such as text generation) this is almost unacceptable. So you either use a variant of hard max, such as Gumbel:

Categorical Reparameterization with Gumbel-Softmax

Link: https://arxiv.org/abs/1611.01144

or ARSM:

ARSM: Augment-REINFORCE-Swap-Merge Estimator for Gradient Backpropagation Through Categorical Variables

Link: http://proceedings.mlr.press/v97/yin19c.html

or you simply use softmax directly.
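
For reference, here is a minimal sketch of the Gumbel-Softmax idea mentioned above (the temperature tau and the example numbers are my own illustration, not code from the paper): add Gumbel noise to the logits, divide by a temperature, and take softmax, which gives a nearly one-hot sample through which gradients can still flow.

import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.5):
    """Soft, differentiable sample: softmax((logits + Gumbel noise) / tau)."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

print(gumbel_softmax(np.array([1.0, 2.0, 5.0])))  # close to one-hot, but every entry stays differentiable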

4. Softmax and Numerical Stability

 

Implementing softmax in code looks straightforward; you can just translate the formula directly:

import numpy as np

def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)

But this implementation is very unstable, because it exponentiates its input. As soon as the input gets even moderately large, say an entry on the order of 1000, the denominator has to evaluate something like exp(1000), which certainly overflows in floating point. The fix is simple: multiply the numerator and the denominator by the same constant C, which shrinks the values while keeping the result unchanged:

softmax(x)_i = exp(x_i) / Σ_j exp(x_j) = C·exp(x_i) / (C·Σ_j exp(x_j))

Absorbing the constant C into the exponent gives

softmax(x)_i = exp(x_i + log C) / Σ_j exp(x_j + log C) = exp(x_i + D) / Σ_j exp(x_j + D)

Here D = log C can be chosen freely; the usual choice is

D = −max(x_1, x_2, ..., x_n)

The concrete implementation can be written as follows

def stablesoftmax(x):
    """Compute the softmax of vector x in a numerically stable way."""
    shiftx = x - np.max(x)
    exps = np.exp(shiftx)
    return exps / np.sum(exps)
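
A quick usage sketch (the input values are my own) showing where the naive version breaks while the shifted version does not:

x = np.array([10.0, 1000.0, 2000.0])
print(softmax(x))         # naive version: exp(1000) and exp(2000) overflow, the output contains nan
print(stablesoftmax(x))   # shifted version: ~[0, 0, 1], no overflow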

This shifted version is much more numerically stable, but problems can still appear when the input values are extremely far apart from one another. In such cases even the method above may still end up reporting NaN. That, however, is a limitation of the math itself; just be aware of it when you use softmax.

One possible alternative is to use LogSoftmax (and take exp afterwards if you really need the probabilities), whose numerical stability is better than softmax:

log(softmax(x)_i) = (x_i − max(x)) − log Σ_j exp(x_j − max(x))

As you can see, LogSoftmax saves one exponentiation and one division, so it is numerically better behaved. In fact, this is also how Softmax_Cross_Entropy is implemented internally.
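
A minimal sketch of that trick, mirroring the stablesoftmax above:

import numpy as np

def log_softmax(x):
    """Compute log(softmax(x)) directly: (x - max(x)) - log(sum(exp(x - max(x))))."""
    shiftx = x - np.max(x)
    return shiftx - np.log(np.sum(np.exp(shiftx)))

print(log_softmax(np.array([10.0, 1000.0, 2000.0])))  # finite log probabilities, no overflow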

5. The Gradient of Softmax

Now let's look at the gradient of softmax. Every operation inside softmax is differentiable, so the gradient is simple to derive; it is just basic calculus. Writing p = softmax(x), the result is

∂p_i/∂x_i = p_i (1 − p_i)

∂p_i/∂x_j = −p_i p_j    (for i ≠ j)

So if a variable ends up very small after softmax, say a value close to 0, then its gradient is also very small, essentially no gradient at all. Sometimes this makes the gradient very sparse, and the optimization simply does not move.
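
A small numeric check of these formulas (the example vector is mine), comparing the analytic Jacobian with a finite-difference approximation, using the stablesoftmax defined earlier:

x = np.array([1.0, 2.0, 5.0])
p = stablesoftmax(x)
analytic = np.diag(p) - np.outer(p, p)     # dp_i/dx_j = p_i * (delta_ij - p_j)

eps = 1e-6
numeric = np.stack([(stablesoftmax(x + eps * np.eye(3)[j]) - p) / eps
                    for j in range(3)], axis=1)
print(np.allclose(analytic, numeric, atol=1e-4))   # True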

6. The Relationship Between Softmax and Cross-Entropy

Let's state the conclusion first:

softmax and cross-entropy do not, by themselves, have that deep a relationship; it is just that implementing the two together is faster to compute and more numerically stable.

Cross-entropy is not a concept unique to machine learning; essentially it measures how similar two probability distributions are. A simple way to understand it (and this really is only a rough intuition!) is the following.

Suppose you have two sets of variables, for example (my own numbers) y = (1, 0, 0) and ŷ = (1, 3, 5). Their L2 distance is quite large, yet the cross-entropy −Σ_i y_i·log(ŷ_i) between them is 0, because with a one-hot y only the coordinate where y is 1 contributes. So cross-entropy is in fact the more "flexible" measure.
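
Using the made-up numbers above:

import numpy as np

y    = np.array([1.0, 0.0, 0.0])
yhat = np.array([1.0, 3.0, 5.0])
print(np.linalg.norm(y - yhat))       # L2 distance: ~5.83
print(-(y * np.log(yhat)).sum())      # cross-entropy: 0 (only the coordinate where y is 1 contributes)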

So we know that cross-entropy measures the distance between two probability distributions, and softmax turns things into a probability distribution, so naturally the two are often used together. But once you work through the algebra, you will find that softmax + cross entropy is like saying

"walk five meters east, then ten meters west".

Why not just say

"walk five meters west"?

The cross-entropy formula is

H(y, p) = −Σ_i y_i · log(p_i)

and the log(p_i) here is exactly the LogSoftmax we talked about earlier. It is cheaper to compute than softmax and slightly better behaved numerically, so why not just compute it directly?

This is why PyTorch provides torch.nn.CrossEntropyLoss (whose input is the logits discussed earlier, that is, whatever comes straight out of the fully connected layer). This CrossEntropyLoss is in fact equal to torch.nn.LogSoftmax + torch.nn.NLLLoss.
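
A quick sketch verifying that equivalence (the logits and labels are random, shapes of my own choosing):

import torch
import torch.nn as nn

logits = torch.randn(4, 3)                  # raw fully connected output: 4 samples, 3 classes
target = torch.tensor([0, 2, 1, 1])         # class indices

ce  = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
print(torch.allclose(ce, nll))              # True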

Copyright notice: this article was created by [Xiaobai learns vision]. When reposting, please include a link to the original: https://yzsam.com/2022/200/202207171611250582.html