Class imbalance in classification tasks
2022-07-19 11:00:00 【TT ya】
I'm a beginner, and I write posts like this as notes on what I've learned. I hope they also help others who are just getting started, and I welcome corrections from more experienced readers. If this post infringes on any rights, let me know and I will delete it.
I. Problem definition
Class imbalance refers to classification tasks in which the number of training samples differs greatly between classes.
II. Solutions
1. Introduction
A linear classifier computes $y = w^T x + b$ and classifies a new sample by comparing the predicted value $y$ with a threshold. The threshold is usually 0.5, which treats positive and negative examples as equally likely. In other words, the prediction is positive when
$$\frac{y}{1-y} > 1.$$
2. Ideal solution
When the training set contains different numbers of positive and negative examples, the observed odds are $\frac{m^+}{m^-}$, where $m^+$ is the number of positive examples and $m^-$ is the number of negative examples. We usually assume that the training set is an unbiased sample of the true population, so the observed odds stand in for the true odds. The prediction is then positive when
$$\frac{y}{1-y} > \frac{m^+}{m^-},$$
and negative otherwise. Equivalently, keep the usual decision rule but rescale the prediction first:
$$\frac{y'}{1-y'} = \frac{y}{1-y} \times \frac{m^-}{m^+}.$$
This is "rescaling", a basic strategy in class-imbalance learning. It resembles cost-sensitive learning, where $\frac{m^-}{m^+}$ is replaced by $\frac{cost^+}{cost^-}$; here $cost^+$ is the cost of misclassifying a positive example as negative, and $cost^-$ is the cost of misclassifying a negative example as positive.
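As a rough illustration of the rescaling rule, the sketch below maps a predicted value y to a rescaled value y' whose odds equal the odds of y multiplied by (number of negatives / number of positives), and then applies the usual 0.5 threshold. The function names `rescale_probability` and `predict_positive` and the parameters `n_pos` and `n_neg` are made up for this example, and y is assumed to lie in [0, 1].

```python
def rescale_probability(y, n_pos, n_neg):
    """Return y' such that y' / (1 - y') == y / (1 - y) * (n_neg / n_pos).

    y     : predicted value in [0, 1] (assumed to behave like a probability)
    n_pos : number of positive training examples (hypothetical name)
    n_neg : number of negative training examples (hypothetical name)
    """
    if not 0.0 <= y <= 1.0:
        raise ValueError("y must be in [0, 1]")
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one example")
    # Algebraically equivalent to multiplying the odds, but safe when y == 1.
    return y * n_neg / (y * n_neg + (1.0 - y) * n_pos)


def predict_positive(y, n_pos, n_neg):
    """Predict positive when the rescaled value exceeds 0.5."""
    return rescale_probability(y, n_pos, n_neg) > 0.5


if __name__ == "__main__":
    # 10 positives vs. 90 negatives: a raw score of 0.2 already counts as positive.
    print(rescale_probability(0.2, 10, 90))
    print(predict_positive(0.2, 10, 90))
    print(predict_positive(0.05, 10, 90))
```

Comparing the rescaled value with 0.5 is the same as comparing the raw y with n_pos / (n_pos + n_neg), so for balanced classes the rule reduces to the ordinary 0.5 threshold.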
3. Practical solutions
The solution above assumes that the training set is an unbiased sample of the true population. This assumption often fails, which makes it hard to infer the true odds from the observed odds.
In practice there are three approaches:
(1) Undersample the negative class in the training set: remove some negative samples so that the numbers of positive and negative samples are close. To limit the information lost, random undersampling can be repeated several times to train multiple classifiers, and the final prediction is the label that most of these classifiers predict.
(2) Oversample the positive class in the training set: add positive samples so that the numbers of positive and negative samples are close. The additional samples can be generated by interpolating between positive samples; simply duplicating existing samples easily leads to overfitting.
(3) Threshold shifting: train on the original training set and apply the rescaling rule from the ideal solution above when making predictions.
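Approaches (1) and (2) can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: all function names are invented for the example, samples are assumed to be lists of floats, and the negative class is assumed to be the larger one. The interpolation step is deliberately simplified: it picks random pairs of positive samples, whereas SMOTE-style methods interpolate toward nearest neighbors.

```python
import random
from collections import Counter


def balanced_subsets(positives, negatives, n_subsets, seed=0):
    """Approach (1): build several balanced training sets.

    Each set keeps all positives plus a different random subset of the
    negatives, so one classifier can be trained per set.
    """
    rng = random.Random(seed)
    n_keep = min(len(positives), len(negatives))
    return [(positives, rng.sample(negatives, n_keep)) for _ in range(n_subsets)]


def majority_vote(predictions):
    """Combine the classifiers' predictions by taking the most common label.

    Ties are resolved in favor of the label that appears first.
    """
    return Counter(predictions).most_common(1)[0][0]


def interpolate_oversample(positives, n_new, seed=0):
    """Approach (2): create n_new synthetic positives by linear interpolation.

    Each new sample lies on the segment between two distinct, randomly
    chosen positive samples (requires at least two positives).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(positives, 2)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic


if __name__ == "__main__":
    pos = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    neg = [[float(i), 5.0] for i in range(30)]
    for _, sampled_neg in balanced_subsets(pos, neg, n_subsets=4):
        print(len(pos), len(sampled_neg))
    print(majority_vote([1, 0, 1]))
    print(interpolate_oversample(pos, n_new=2))
```

Training one classifier per balanced subset and voting means that different negatives are seen by different classifiers, which is why less information is lost than with a single round of undersampling.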
Criticism and corrections are welcome in the comments. Thank you very much!