Ranking-based evaluation measures (especially for recommender systems and multi-label learning)

2022-07-26 09:18:00 【Min fan】

Abstract: Some learners for recommender systems or multi-label learning output real-valued predictions. For example, they predict that the rating of user $i$ for item $j$ is $4.2$, or that the probability that label $j$ of sample $i$ is positive is $0.46$. How should the quality of such predictions be evaluated? This article describes the motivation and intuitive meaning of several ranking-based evaluation measures.

1. Non-ranking-based evaluation measures

This section describes several evaluation measures that are not based on ranking and points out their shortcomings.

1.1 Mean absolute error (MAE)

Let the actual rating be $r_{ij}$, the predicted rating be $\hat{r}_{ij}$, and the set of user-item pairs whose ratings are unknown (and need to be predicted) be $\Omega$. Then MAE is
$\sum_{(i, j) \in \Omega} \vert r_{ij} - \hat{r}_{ij}\vert / |\Omega|\tag{1}$
It is the average absolute difference between the predicted and the actual ratings.
Advantage: simple and direct.
Shortcoming: Suppose each user is always recommended $10$ items. A counterexample is easy to construct: the user's $10$ favorite items are ranked at the top (a perfect recommendation), yet the error is large, e.g. the real ratings are 5 but the predicted ratings are only 3.6–3.9 (with all other items predicted below 3.6). The opposite counterexample also exists: MAE is small but the recommendation list is poor (the user's favorite item, rated 5, is predicted as 4.4, while the second-favorite item, rated 4, is predicted as 4.5).
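The two counterexamples can be checked numerically. The following is a minimal sketch on made-up ratings for four items; the helper names `mae` and `ranking` and all the numbers are assumptions chosen for illustration only.

```python
def mae(actual, predicted):
    """Mean absolute error over paired lists of ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def ranking(scores):
    """Item indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Hypothetical actual ratings of four items by one user.
actual = [5, 4, 3, 2]

# Case A: every prediction is far too low, but the order is perfect.
pred_a = [3.9, 3.0, 2.0, 1.0]
# Case B: predictions are close, but the top two items are swapped.
pred_b = [4.4, 4.5, 3.0, 2.0]

mae_a = mae(actual, pred_a)
mae_b = mae(actual, pred_b)

print("case A:", round(mae_a, 3), ranking(pred_a))
print("case B:", round(mae_b, 3), ranking(pred_b))
```

Case A has the larger MAE although its ranking matches the actual one, while case B has the smaller MAE although it places the second-favorite item first.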

1.2 Root mean square error (RMSE)

The same reasoning as for MAE applies.
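For reference, the original text does not write the formula out. With the notation of Section 1.1, the usual textbook definition would be
$\sqrt{\sum_{(i, j) \in \Omega} (r_{ij} - \hat{r}_{ij})^2 / |\Omega|}$
Squaring penalizes large errors more heavily than MAE does, but the measure still compares values rather than order, so the same counterexamples carry over.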

1.3 Accuracy

Multi-label learning (which can be viewed as an extension of binary classification) is used as the example here.
Let the actual label be $y_{ij} \in \{0, 1\}$, the predicted label be $\hat{y}_{ij}$, the number of test samples be $n$, and the number of labels be $q$. Then the accuracy is
$\frac{nq - \sum_{i, j} |y_{ij} - \hat{y}_{ij}|}{nq}$
Advantage: simple and direct; it is the proportion of correct predictions.
Shortcoming 1: Because the raw prediction is a real value (such as the $0.46$ in the abstract), a threshold is needed to convert it into a discrete value $0/1$. Naively using the threshold 0.5 often gives poor results.
Shortcoming 2: Because of class imbalance, negative labels (actual value $0$) far outnumber positive labels (actual value $1$). In some extreme multi-label datasets more than 99% of the labels are negative, so predicting every label as negative already yields a very high accuracy, which obviously has no practical meaning.
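Shortcoming 2 is easy to reproduce. The sketch below uses a made-up label matrix with 10 samples and 100 labels (one positive label per sample, so 99% of the entries are negative) and scores a predictor that outputs all zeros. The function name `accuracy` and the toy data are assumptions for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of sample-label pairs predicted correctly: (nq - errors) / (nq)."""
    n, q = len(actual), len(actual[0])
    errors = sum(
        abs(y - y_hat)
        for row_y, row_pred in zip(actual, predicted)
        for y, y_hat in zip(row_y, row_pred)
    )
    return (n * q - errors) / (n * q)


# Hypothetical toy data: 10 samples, 100 labels, one positive label per sample,
# so 99% of the sample-label pairs are negative.
n, q = 10, 100
actual = [[1 if j == i else 0 for j in range(q)] for i in range(n)]

# A useless predictor that marks every label as negative.
all_negative = [[0] * q for _ in range(n)]

acc = accuracy(actual, all_negative)
print(f"accuracy of the all-negative predictor: {acc:.2f}")
```

The all-negative predictor reaches an accuracy of 0.99 here without finding a single positive label.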

1.4 F1

F1-score mainly addresses shortcoming 2 of Accuracy. See "Misclassification cost and class imbalance data" and "F-measure and cost-sensitive evaluation index".
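To see why F1 helps with class imbalance, here is a small sketch using the standard definition F1 = 2TP / (2TP + FP + FN), which the text above does not spell out. The convention of returning 0 when the denominator is 0 is an assumption, as are the helper names and the toy labels.

```python
def f1_score(actual, predicted):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if the denominator is 0 (assumed convention)."""
    tp = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(actual, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(actual, predicted):
    """Fraction of labels predicted correctly."""
    return sum(1 for y, p in zip(actual, predicted) if y == p) / len(actual)


# Hypothetical toy data: 1 positive label among 100.
actual = [1] + [0] * 99

all_negative = [0] * 100        # never predicts a positive
one_hit = [1, 1] + [0] * 98     # finds the positive, plus one false alarm

for name, pred in [("all_negative", all_negative), ("one_hit", one_hit)]:
    print(name, accuracy(actual, pred), round(f1_score(actual, pred), 3))
```

Both predictors have the same accuracy (0.99), but F1 is 0 for the all-negative predictor and 2/3 for the one that actually finds the positive label.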

2. Ranking-based evaluation measures

This section describes several ranking-based evaluation measures.

2.1 Peak-F1

Sort all sample-label pairs in descending order of predicted value (a real number between 0 and 1). At step $k$, treat the top $k$ sample-label pairs as positive and compute F1. Plot the resulting F1 curve and take its maximum, which is called Peak-F1.
Advantage: it addresses shortcoming 1 of Accuracy in Section 1.3, since no threshold has to be chosen (every cut-off is evaluated).
Shortcoming: only the peak is recorded. The top of the ranking may be of high quality while the rest is poor. This is only a minor weakness.
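The procedure can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `peak_f1`, the toy data, and the tie-breaking rule (ties keep their input order) are assumptions.

```python
def peak_f1(actual, scores):
    """Maximum F1 over all cut-offs k, treating the top-k scored pairs as positive."""
    # Indices of sample-label pairs, highest predicted value first.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(actual)
    best, tp = 0.0, 0
    for k, idx in enumerate(order, start=1):
        tp += actual[idx]
        # With the top k predicted positive: FP = k - tp and FN = total_pos - tp,
        # so F1 = 2*tp / (2*tp + FP + FN) = 2*tp / (k + total_pos).
        f1 = 2 * tp / (k + total_pos)
        best = max(best, f1)
    return best


# Hypothetical flattened sample-label pairs: actual labels and predicted values.
actual = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]

peak = peak_f1(actual, scores)
print(round(peak, 3))
```

On this toy data the F1 values for $k = 1, \dots, 5$ are 2/3, 1/2, 4/5, 2/3 and 4/7, so the peak is 0.8 at $k = 3$.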

2.2 ROC curve and AUC

Sort all sample-label pairs in descending order of predicted value (a real number between 0 and 1). Start from the point (0, 0) in the plane and process the pairs in order: if the current pair is actually positive, move up one step; otherwise move right one step. An upward step has length $1/P$ and a rightward step has length $1/N$, where $P$ ($N$) is the total number of actually positive (negative) labels. The resulting curve is called the ROC curve; see Receiver operating characteristic curve.
AUC (Area Under Curve) is the area under this curve, usually a decimal below 1 (AUC = 1, a perfect ranking, is rarely achieved).
Advantage 1: the same as for Peak-F1.
Characteristic 1: it measures the ranking as a whole. If overall performance matters, this is an advantage over Peak-F1. If only the top few positions matter (as in recommender systems), it may become a disadvantage.
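The step-by-step construction of the ROC curve translates directly into code. The sketch below is illustrative only: the function names `roc_points` and `auc` and the toy data are assumptions, and it assumes at least one positive and one negative label and ignores tied predicted values.

```python
def roc_points(actual, scores):
    """Walk the ranked list: step up 1/P for a positive, right 1/N for a negative."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    p = sum(actual)          # number of actually positive labels
    n = len(actual) - p      # number of actually negative labels
    x, y = 0.0, 0.0
    points = [(x, y)]
    for idx in order:
        if actual[idx] == 1:
            y += 1 / p
        else:
            x += 1 / n
        points.append((x, y))
    return points


def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoid rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical flattened sample-label pairs: actual labels and predicted values.
actual = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]

points = roc_points(actual, scores)
auc_value = auc(points)
print(round(auc_value, 4))
```

Here $P = 2$ and $N = 3$, and the area is 5/6. This equals the fraction of (positive, negative) pairs in which the positive has the higher predicted value: 5 of the 6 pairs.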