What is the relationship between softmax and cross-entropy?
2022-07-19 12:22:00 【Xiaobai learns vision】
Source: Zhihu | Author: Dong Xin
https://www.zhihu.com/question/294679135/answer/885285177
This article is shared for academic purposes only. Copyright belongs to the author; if there is any infringement, please contact us and it will be removed.

Softmax looks simple, but it has many details worth discussing.
Let's go through them one by one.
1. What is Softmax?
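First, softmax turns a sequence of numbers into probabilities. For an input $x = (x_1, \dots, x_n)$:

$$S_i = \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$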
It guarantees that:
All values lie in [0, 1] (as probabilities must)
All values sum to 1
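Both guarantees are easy to check numerically. The following is a minimal sketch using NumPy; the input values are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x):
    """Softmax of a 1-D array: exponentiate, then normalize."""
    exps = np.exp(x)
    return exps / np.sum(exps)

scores = np.array([2.0, -1.0, 0.5])  # hypothetical input, not from the article
probs = softmax(scores)

print(probs)        # every entry lies in [0, 1]
print(probs.sum())  # equals 1 up to floating-point rounding
```

Note that the inputs may be negative; only the outputs are constrained.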
In probabilistic terms, writing $S_j$ for the $j$-th softmax output, this means

$$S_j = P(y = j \mid x)$$
2. A pitfall in how the documentation describes softmax
Here is a small pitfall. The documentation of many deep learning frameworks (PyTorch, TensorFlow) describes softmax like this:
take logits and produce probabilities
Clearly, the logits here are the output of the fully connected layer (with or without an activation), and the probabilities are the output of softmax. In some places the logits are also called unscaled log probabilities. This is interesting: "unscaled probability" is easy to understand, but why would the raw output of a fully connected layer have anything to do with a log?

There are two reasons:
The output of a fully connected layer is unbounded (it can be positive or negative), which does not fit the definition of a probability. But if you treat it as the log of a probability, it makes sense.
The role of softmax, as we all know, is to normalize probabilities. Inside softmax, every input $x_i$ is exponentiated to $e^{x_i}$, so it is natural to think of the inputs as logs of (unnormalized) probabilities.
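One way to make the "unscaled log probabilities" reading concrete: take a distribution, take its log, shift it by an arbitrary constant, and feed the result to softmax. The original distribution comes back. This is only a sketch; the distribution `p` and the constant `shift` are made-up values.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

p = np.array([0.7, 0.2, 0.1])   # hypothetical probability distribution
shift = 1.0                     # arbitrary constant: the "unscaled" part
logits = np.log(p) + shift      # unbounded values, some positive, some negative

recovered = softmax(logits)     # normalization removes the shift
print(logits)
print(recovered)
```

Any value of `shift` gives the same output, which is why logits are only defined up to an additive constant.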
3. Softmax is the soft version of argmax
Now, back to softmax itself.
As the name suggests, softmax is a soft version of argmax. Let's see why.
For example, suppose the input to softmax is:

and the softmax output is:

Now change the input slightly by raising the 3 to 5, so the input becomes:

and the softmax output is:

So softmax shows a very clear "Matthew effect": the strong (large) get stronger (larger), and the weak (small) get weaker (smaller). Simply picking the largest number is called hardmax. Softmax really is a soft version of max: it picks the maximum with a certain probability. In hardmax, the truly largest number is chosen with probability 1 (100%), and the other values have no chance at all. In softmax, every value has some chance of being chosen as the maximum. However, because of the Matthew effect, even a modest gap between the largest and the second largest value turns into a large gap in probability, since the ratio of two softmax outputs grows exponentially with the difference of their inputs.
So, as I said earlier, "softmax turns a sequence into probabilities." That probability is precisely the probability of being chosen as the max.
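The Matthew effect can be seen in a few lines of NumPy. The sketch below assumes the input [1, 2, 3] purely for illustration and then raises the 3 to 5; `hardmax` is a helper defined here for comparison, not a library function.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

def hardmax(x):
    """One-hot vector that puts probability 1 on the largest entry."""
    one_hot = np.zeros_like(x, dtype=float)
    one_hot[np.argmax(x)] = 1.0
    return one_hot

before = np.array([1.0, 2.0, 3.0])  # assumed example input
after = np.array([1.0, 2.0, 5.0])   # same input with the 3 raised to 5

print(np.round(softmax(before), 3))  # roughly 0.090, 0.245, 0.665
print(np.round(softmax(after), 3))   # roughly 0.017, 0.047, 0.936
print(hardmax(after))                # all mass on the last entry
```

Raising one input by 2 moves its probability from about two thirds to well over nine tenths, while hardmax gives the same one-hot answer in both cases.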
This soft version of max is useful in many places. The hard version of max is fine in itself, but it has a serious gradient problem: the gradient of the function is extremely sparse (max pooling in neural networks is an example). After hardmax, only the selected variable receives a gradient; all the others get none. For some tasks (such as text generation) this is almost unacceptable. So you either use a variant of hard max, such as Gumbel-Softmax,
Categorical Reparameterization with Gumbel-Softmax
Link: https://arxiv.org/abs/1611.01144
or ARSM,
ARSM: Augment-REINFORCE-Swap-Merge Estimator for Gradient Backpropagation Through Categorical Variables
Link: http://proceedings.mlr.press/v97/yin19c.html
or simply use softmax directly.
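To give a rough idea of the Gumbel-Softmax variant mentioned above, here is a minimal NumPy sketch of drawing one relaxed sample: add Gumbel noise to the logits, divide by a temperature, and apply softmax. The function name, the logits, and the temperatures are illustrative assumptions, and the sketch covers only the sampling step, not gradient estimation.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample for the given logits."""
    # Uniform noise kept away from 0 so that log(u) stays finite.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    return softmax((logits + gumbel_noise) / temperature)

rng = np.random.default_rng(0)
logits = np.array([1.0, 2.0, 5.0])  # hypothetical logits

soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(logits, temperature=0.1, rng=rng)
print(soft)
print(sharp)
```

Lower temperatures tend to push a sample closer to a one-hot vector, while every entry still stays differentiable with respect to the logits.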
4. Softmax and numerical stability
A softmax implementation looks simple: just write the formula directly.
import numpy as np

def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)  # overflows to inf for large entries of x
    return exps / np.sum(exps)
But this approach is numerically very unstable. Because it exponentiates its input, the computation breaks as soon as the input is even moderately large, for example:

The denominator is

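Clearly, this is bound to overflow. The fix is simple: multiply both the numerator and the denominator by a constant $C$, which leaves the result unchanged but lets us shrink the values involved:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} = \frac{C e^{x_i}}{C \sum_j e^{x_j}}$$

Absorbing the constant $C$ into the exponent gives

$$\mathrm{softmax}(x)_i = \frac{e^{x_i + \log C}}{\sum_j e^{x_j + \log C}} = \frac{e^{x_i + D}}{\sum_j e^{x_j + D}}, \qquad D = \log C$$

Here $D$ can be chosen freely. A common choice is

$$D = -\max(x_1, x_2, \dots, x_n)$$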
A concrete implementation can be written as follows:
import numpy as np

def stablesoftmax(x):
    """Compute the softmax of vector x in a numerically stable way."""
    shiftx = x - np.max(x)  # the largest exponent becomes 0
    exps = np.exp(shiftx)
    return exps / np.sum(exps)
This is much better for numerical stability, but problems remain. For example, when the input values differ too much, such as

In this case the method above can still run into trouble: the entries far below the maximum underflow to exactly 0, and any later log of those probabilities yields -inf or NaN. This is a limitation of the arithmetic itself, so watch out for it in practice.
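The difference between the two implementations shows up as soon as the inputs are large. The sketch below uses made-up inputs in the thousands; `naive_softmax` and `stable_softmax` mirror the two approaches discussed in this section.

```python
import numpy as np

def naive_softmax(x):
    exps = np.exp(x)
    return exps / np.sum(exps)

def stable_softmax(x):
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

x = np.array([1000.0, 2000.0, 3000.0])  # hypothetical large inputs

# Silence the overflow warnings so the result itself can be inspected.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(x)   # exp overflows to inf, and inf / inf is nan

stable = stable_softmax(x)     # largest exponent is 0, so nothing overflows

with np.errstate(divide="ignore"):
    log_probs = np.log(stable)  # the underflowed zeros become -inf

print(naive)      # all nan
print(stable)     # [0. 0. 1.]
print(log_probs)  # [-inf -inf 0.]
```

The stable version avoids the overflow, but the small entries still underflow to exactly zero, which becomes a problem once a logarithm is applied.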
One possible alternative is to use LogSoftmax (and take exp afterwards if probabilities are needed), which is numerically more stable than softmax:

$$\log \mathrm{softmax}(x)_i = \log \frac{e^{x_i}}{\sum_j e^{x_j}} = x_i - \log \sum_j e^{x_j}$$

As you can see, it saves one exponentiation and one division, and it is numerically more stable. In fact, LogSoftmax is what Softmax_Cross_Entropy uses internally.
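A log-softmax can be computed without ever forming the probabilities, by combining the max shift with a log-sum-exp. This is a minimal NumPy sketch; `log_softmax` is a helper defined here, and the inputs are hypothetical.

```python
import numpy as np

def log_softmax(x):
    """log(softmax(x)) computed as shifted - log(sum(exp(shifted)))."""
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))

large = np.array([1000.0, 2000.0, 3000.0])  # hypothetical large inputs
small = np.array([1.0, 2.0, 5.0])           # hypothetical ordinary inputs

print(log_softmax(large))          # finite: [-2000. -1000. 0.]
print(np.exp(log_softmax(small)))  # exp recovers the usual softmax output
```

For the large inputs, taking the log of a softmax output would give -inf for the first two entries, while the direct computation keeps them finite.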
5. Gradient of softmax
Now let's look at the gradient of softmax. Every operation inside softmax is differentiable, so the gradient follows from the basic differentiation rules. Writing $S_i$ for the $i$-th softmax output, the result is:

$$\frac{\partial S_i}{\partial x_j} = \begin{cases} S_i (1 - S_i), & i = j \\ -S_i S_j, & i \neq j \end{cases}$$

So if a variable's softmax output $S_i$ is very small (close to 0), its gradient is also very small; it receives almost no gradient. Sometimes this makes the gradients very sparse, and optimization stalls.
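The gradient can be written as a Jacobian matrix, `diag(S) - S S^T`, and checked against finite differences. The sketch below is one way to do that in NumPy; the helper names and the input are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

def softmax_jacobian(x):
    """J[i, j] = dS_i / dx_j = S_i * (delta_ij - S_j)."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite differences, one input coordinate at a time."""
    n = x.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return jac

x = np.array([1.0, 2.0, 10.0])  # hypothetical input with one dominant entry
analytic = softmax_jacobian(x)
numeric = numerical_jacobian(softmax, x)

print(np.max(np.abs(analytic - numeric)))  # tiny: the two Jacobians agree
print(np.abs(analytic[0]).max())           # gradients of the smallest output are near 0
```

The row belonging to the smallest output contains only very small numbers, which is the sparse-gradient effect described above.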
6. The relationship between softmax and cross-entropy
The conclusion first:
softmax and cross-entropy are closely related. Computing the two together is faster and also more numerically stable.
Cross-entropy is not a concept unique to machine learning. In essence, it measures how similar two probability distributions are. A rough intuition (and only a rough intuition!) is this:
Suppose you have two sets of variables:

If you compute the L2 distance, the two are far apart, but if you apply cross-entropy to them, the distance is 0. So cross-entropy is in a sense more "flexible".
So cross-entropy measures the distance between two probability distributions, and softmax turns anything into a probability distribution, so the two are naturally used together. But work through the derivation and you will find that softmax + cross-entropy is like
"walking five meters east, then ten meters west."
Why not just
"walk five meters west"?
The cross-entropy formula is

$$L = -\sum_i t_i \log S_i$$

where $t$ is the target distribution and $S_i$ is the $i$-th softmax output. The term $\log S_i$ is exactly the LogSoftmax discussed earlier. It is simpler to compute than softmax and somewhat more numerically stable, so why not compute it directly?
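To see the shortcut explicitly, here is a short derivation under the common assumption that the target $t$ is one-hot, with $t_k = 1$ for the correct class $k$ (the symbols $t$ and $k$ are notation chosen here):

```latex
\begin{aligned}
L &= -\sum_i t_i \log S_i = -\log S_k \\
  &= -\log \frac{e^{x_k}}{\sum_j e^{x_j}} \\
  &= -x_k + \log \sum_j e^{x_j}
\end{aligned}
```

The loss depends on the logits only through $x_k$ and a log-sum-exp, so the division and the separate logarithm of a softmax followed by cross-entropy never need to be carried out.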
This is why PyTorch has torch.nn.CrossEntropyLoss (its input is the logits discussed earlier, that is, the raw output of the fully connected layer). CrossEntropyLoss is in fact equivalent to torch.nn.LogSoftmax + torch.nn.NLLLoss.
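The same decomposition can be sketched in plain NumPy: a log-softmax step followed by a negative log-likelihood step. The functions below are illustrative stand-ins written for this example, not PyTorch's implementation; they average over the batch, as PyTorch's default `reduction='mean'` does, and the logits and targets are made up.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax for a (batch, classes) array."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

def nll_loss(log_probs, targets):
    """Mean negative log-probability of the target class in each row."""
    rows = np.arange(targets.shape[0])
    return -np.mean(log_probs[rows, targets])

def cross_entropy(logits, targets):
    """Cross-entropy computed directly from logits."""
    return nll_loss(log_softmax(logits), targets)

logits = np.array([[1.0, 2.0, 5.0],
                   [1000.0, 2000.0, 3000.0]])  # hypothetical logits
targets = np.array([2, 0])                     # hypothetical class indices

loss = cross_entropy(logits, targets)
print(loss)  # finite, even though the second row would overflow a naive softmax
```

Because the probabilities are never formed explicitly, the second row contributes a large but finite loss instead of inf or NaN.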