Class imbalance in classification tasks
2022-07-19 11:00:00 【TT ya】
I'm a beginner. I write these posts as notes on what I've learned, and I hope they help others who are just starting out. Corrections from more experienced readers are welcome. If anything here infringes on your rights, let me know and I will delete it.
I. Problem definition
In a classification task, the number of training samples differs greatly from one class to another.
II. Solutions
1. Background
A linear classifier produces the prediction
$$y = w^{T}x + b .$$
To classify a new sample, we compare the predicted value $y$ with a threshold. The usual threshold is 0.5, which assumes that a sample is equally likely to be a true positive or a true negative example. That is, the prediction is positive when
$$\frac{y}{1-y} > 1 .$$
2. Ideal solution
When the training set contains different numbers of positive and negative examples, the observed odds are $\frac{m^{+}}{m^{-}}$, where $m^{+}$ is the number of positive examples and $m^{-}$ is the number of negative examples. We usually assume that the training set is an unbiased sample of the true population, so the observed odds stand in for the true odds. The decision rule then becomes: predict positive when
$$\frac{y}{1-y} > \frac{m^{+}}{m^{-}},$$
and negative otherwise.
Equivalently, the output can be rescaled so that the usual threshold of 0.5 still applies:
$$\frac{y'}{1-y'} = \frac{y}{1-y} \times \frac{m^{-}}{m^{+}} .$$
This is a basic strategy in class-imbalance learning known as "rescaling". It is similar to cost-sensitive learning, where $\frac{m^{-}}{m^{+}}$ is replaced by $\frac{cost^{+}}{cost^{-}}$ and $cost^{+}$ is the cost of misclassifying a positive example as negative.
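The threshold adjustment described above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function names `rescale` and `predict_positive` and the parameters `m_pos` and `m_neg` (the counts of positive and negative training examples) are hypothetical names chosen for this example, and `y` is assumed to be a classifier output in the range [0, 1).

```python
def rescale(y: float, m_pos: int, m_neg: int) -> float:
    """Return y' such that y' / (1 - y') = y / (1 - y) * (m_neg / m_pos)."""
    if not 0.0 <= y < 1.0:
        raise ValueError("y must lie in [0, 1)")
    if m_pos <= 0 or m_neg <= 0:
        raise ValueError("both class counts must be positive")
    # Odds of the raw prediction, corrected by the class ratio.
    odds = y / (1.0 - y) * (m_neg / m_pos)
    # Convert the corrected odds back to a value in [0, 1).
    return odds / (1.0 + odds)


def predict_positive(y: float, m_pos: int, m_neg: int) -> bool:
    """Apply the usual 0.5 threshold to the rescaled output."""
    return rescale(y, m_pos, m_neg) > 0.5


# Hypothetical training set: 10 positive and 90 negative examples.
print(predict_positive(0.2, m_pos=10, m_neg=90))   # raw output is below 0.5
print(predict_positive(0.05, m_pos=10, m_neg=90))
```

With balanced classes (`m_pos == m_neg`) the correction factor is 1, so `rescale` leaves `y` unchanged and the rule reduces to the plain 0.5 threshold.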
3. Practical solutions
The solution above assumes that "the training set is an unbiased sample of the true population". This assumption does not necessarily hold, so in practice it is difficult to infer the true odds from the observed odds.
There are therefore three approaches used in practice:
(1) Undersample the negative class in the training set: remove some negative samples so that the numbers of positive and negative samples are close. Random undersampling can be repeated several times to train several classifiers, with the final prediction being the label that most of these classifiers predict. This way less information is lost.
(2) Oversample the positive class in the training set: add positive samples so that the numbers of positive and negative samples are close. The extra samples can be generated by interpolating between existing positive samples. Simply duplicating samples is not advisable, because it easily leads to overfitting.
(3) Use the ideal solution above, i.e. threshold moving.
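Approaches (1) and (2) can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the positive class is taken to be the minority class, samples are plain lists of floats, and the function names `undersample`, `majority_vote` and `oversample_by_interpolation` are hypothetical. The interpolation step picks random pairs of positive samples, which is a simplification of methods such as SMOTE that interpolate toward nearest neighbours.

```python
import random
from collections import Counter


def undersample(pos, neg, rng):
    """Keep every positive sample and a random, equally sized subset of negatives."""
    return pos, rng.sample(neg, k=min(len(pos), len(neg)))


def majority_vote(predictions):
    """Return the label predicted by the most classifiers (ties go to the first seen)."""
    return Counter(predictions).most_common(1)[0][0]


def oversample_by_interpolation(pos, n_new, rng):
    """Create n_new synthetic positives on line segments between two real positives."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(pos, k=2)   # two distinct positive samples
        t = rng.random()              # position along the segment from a to b
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic


rng = random.Random(0)
pos = [[1.0, 1.0], [2.0, 2.0], [1.5, 2.5]]          # 3 positive samples
neg = [[float(i), -float(i)] for i in range(30)]    # 30 negative samples

# (1) One round of undersampling; repeat with fresh draws to train several classifiers.
_, neg_subset = undersample(pos, neg, rng)

# (2) Generate 27 synthetic positives so that both classes have 30 samples.
new_pos = oversample_by_interpolation(pos, n_new=27, rng=rng)

# Combine the labels from three hypothetical classifiers trained on different subsets.
print(majority_vote([1, 0, 1]))
```

Each synthetic point lies between two real positive samples, so it stays inside the region the positive class already occupies instead of being an exact copy of an existing sample.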
Criticism and corrections are welcome in the comments. Thank you very much!