Neural Networks and Deep Learning - 6 - Support Vector Machines (Part 1) - PyTorch
2022-07-26 09:25:00 【Bai Xiaosheng in Ming Dynasty】
Preface
SVM (support vector machine) is a classic binary classification model in machine learning.
It is defined as the linear model with the largest margin in the feature space.
This series covers linear and nonlinear support vector machines, hard margins and soft margins, and SMO (the sequential minimal optimization algorithm).
References:
- *Statistical Learning Methods*
- "On positive definite and positive semidefinite matrices" - Zhihu
Note that a Gram matrix is always positive semidefinite.
Contents
- Linearly separable support vector machines and hard-margin maximization
- Linear support vector machines and soft-margin maximization
- Nonlinear support vector machines and kernel functions
1 Linearly separable support vector machines and hard-margin maximization
1.1 Definition
Given a linearly separable training set, the separating hyperplane obtained by maximizing the margin (or, equivalently, by solving the corresponding convex quadratic program) is

    w^* \cdot x + b^* = 0

and the corresponding classification decision function is

    f(x) = \mathrm{sign}(w^* \cdot x + b^*)
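The decision function above is straightforward to evaluate once (w*, b*) are known. A minimal sketch, using hypothetical learned parameters chosen only for illustration:

```python
import numpy as np

# Hypothetical learned parameters (w*, b*), for illustration only.
w_star = np.array([0.5, 0.5])
b_star = -2.0

def decide(x):
    """Classification decision function f(x) = sign(w* . x + b*)."""
    return int(np.sign(w_star @ x + b_star))

print(decide(np.array([3.0, 3.0])))  # point on the positive side -> 1
print(decide(np.array([1.0, 1.0])))  # point on the negative side -> -1
```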
1.2 Functional margin and geometric margin
The distance between a point and the hyperplane indicates the confidence of the classification prediction.

Functional margin:
The functional margin of a hyperplane (w, b) with respect to a sample point (x_i, y_i) is

    \hat{\gamma}_i = y_i (w \cdot x_i + b)

The functional margin with respect to the whole training set is the minimum over all sample points (the point closest to the hyperplane):

    \hat{\gamma} = \min_{i=1,\dots,N} \hat{\gamma}_i

Geometric margin:
The geometric margin of a hyperplane (w, b) with respect to a sample point (x_i, y_i) is

    \gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right)

When \|w\| = 1, the functional margin and the geometric margin are equal.
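The two margins differ only by the factor \|w\|. A small numpy sketch computing both on a toy hyperplane and three toy points (the numbers are illustrative only):

```python
import numpy as np

# Toy hyperplane and sample points; numbers are illustrative only.
w = np.array([0.5, 0.5])
b = -2.0
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

functional = y * (X @ w + b)                 # hat{gamma}_i = y_i (w . x_i + b)
geometric = functional / np.linalg.norm(w)   # gamma_i = hat{gamma}_i / ||w||

print(functional)       # per-point functional margins
print(geometric.min())  # geometric margin of the training set
```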
1.3 Margin maximization
Finding the hyperplane with the largest geometric margin for the training data set means not just separating the positive and negative examples, but separating even the hardest instance points (those closest to the hyperplane) with the greatest possible confidence.

The optimization problem for a linearly separable SVM is

    \min_{w,b} \ \frac{1}{2}\|w\|^2
    s.t. \ y_i(w \cdot x_i + b) - 1 \geq 0, \quad i = 1, 2, \dots, N

This is a convex quadratic programming problem:
- the objective \frac{1}{2}\|w\|^2 is a convex function;
- each constraint is an affine function of (w, b).
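Because the primal is a small convex QP, it can be handed to a generic constrained solver. A sketch using SciPy's SLSQP on the three-point example from *Statistical Learning Methods* (positives (3,3) and (4,3), negative (1,1)), whose known solution is w* = (1/2, 1/2), b* = -2:

```python
import numpy as np
from scipy.optimize import minimize

# Three-point worked example; decision variables are z = (w1, w2, b).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

objective = lambda z: 0.5 * (z[0] ** 2 + z[1] ** 2)     # (1/2) ||w||^2
constraints = [{"type": "ineq",                          # y_i (w . x_i + b) - 1 >= 0
                "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
               for i in range(3)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w_opt, b_opt = res.x[:2], res.x[2]
print(w_opt, b_opt)  # approximately [0.5 0.5] and -2.0
```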
1.4 Maximum-margin algorithm
Input: a linearly separable data set T = {(x_1, y_1), ..., (x_N, y_N)}
Output: the maximum-margin separating hyperplane and the classification decision function
Step 1: construct and solve the constrained optimization problem above, obtaining the solution (w*, b*).
Step 2: the separating hyperplane is w^* \cdot x + b^* = 0.
Step 3: the classification decision function is f(x) = \mathrm{sign}(w^* \cdot x + b^*).
1.5 The dual algorithm
Define the Lagrangian

    L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i, \quad \alpha_i \geq 0

Solution: first minimize L over w and b, then maximize over the Lagrange multipliers \alpha.

1: Minimize over w and b. Setting the partial derivatives with respect to w and b to zero gives

    w = \sum_{i=1}^{N} \alpha_i y_i x_i, \quad \sum_{i=1}^{N} \alpha_i y_i = 0

Substituting back into L:

    \min_{w,b} L(w, b, \alpha) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{N} \alpha_i

2: Maximize over \alpha. This yields the dual problem (stated equivalently as a minimization):

    \min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i
    s.t. \ \sum_{i=1}^{N} \alpha_i y_i = 0, \quad \alpha_i \geq 0, \ i = 1, 2, \dots, N

3: Recover (w*, b*) using the KKT conditions.
The sample points (x_i, y_i) whose multipliers satisfy \alpha_i^* > 0 are called support vectors; only they play a role in classification. For a support vector the constraint is active:

    y_i (w^* \cdot x_i + b^*) = 1
Linearly separable SVM learning algorithm (dual form)
Input: a linearly separable data set T = {(x_1, y_1), ..., (x_N, y_N)}
Output: the maximum-margin separating hyperplane and the classification decision function
Step 1: construct and solve the constrained optimization problem

    \min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i
    s.t. \ \sum_{i=1}^{N} \alpha_i y_i = 0, \quad \alpha_i \geq 0

obtaining the optimal \alpha^* = (\alpha_1^*, \dots, \alpha_N^*).
Step 2: compute

    w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i

then choose a positive component \alpha_j^* > 0 of \alpha^* and compute

    b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x_j)

Step 3: the separating hyperplane is w^* \cdot x + b^* = 0.
Step 4: the classification decision function is f(x) = \mathrm{sign}(w^* \cdot x + b^*).
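Step 2 can be checked on the worked example from *Statistical Learning Methods*: for positives (3,3), (4,3) and negative (1,1), the dual solution is \alpha^* = (1/4, 0, 1/4). A numpy sketch recovering w* and b* from that \alpha^*:

```python
import numpy as np

# Worked example: positives (3,3), (4,3), negative (1,1);
# the dual solution is alpha* = (1/4, 0, 1/4).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25])

# Step 2: w* = sum_i alpha_i* y_i x_i
w = (alpha * y) @ X

# Choose a positive component alpha_j* > 0 (here j = 0) and compute b*.
j = 0
b = y[j] - (alpha * y) @ (X @ X[j])

print(w, b)  # [0.5 0.5] -2.0
```

Note that only the two support vectors (\alpha_i^* > 0) contribute to w* and b*; the point (4,3) plays no role.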
2 Linear support vector machines and soft-margin maximization
Some special points in the training data cannot satisfy the constraint that the functional margin be at least 1, so slack variables \xi_i \geq 0 are introduced.
2.1 The primal convex quadratic programming problem

    \min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i
    s.t. \ y_i(w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \ i = 1, 2, \dots, N
2.2 The dual algorithm
The Lagrangian of the primal problem is

    L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left( y_i (w \cdot x_i + b) - 1 + \xi_i \right) - \sum_{i=1}^{N} \mu_i \xi_i

First take the partial derivatives of L with respect to w, b, and \xi_i and set them to zero:

    w = \sum_{i=1}^{N} \alpha_i y_i x_i
    \sum_{i=1}^{N} \alpha_i y_i = 0
    C - \alpha_i - \mu_i = 0

Substituting back, exactly as in the hard-margin case, gives the dual problem

    \min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i
    s.t. \ \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C, \ i = 1, 2, \dots, N

Theorem: suppose \alpha^* = (\alpha_1^*, \dots, \alpha_N^*) is a solution of the dual problem. If there exists a component \alpha_j^* with 0 < \alpha_j^* < C, then the solution of the primal problem is

    w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i
    b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x_j)

Proof: the primal problem is a convex quadratic program, so the KKT conditions hold:

    \nabla_w L = w^* - \sum_i \alpha_i^* y_i x_i = 0
    \nabla_b L = -\sum_i \alpha_i^* y_i = 0
    \nabla_{\xi_i} L = C - \alpha_i^* - \mu_i^* = 0
    \alpha_i^* \left( y_i (w^* \cdot x_i + b^*) - 1 + \xi_i^* \right) = 0
    \mu_i^* \xi_i^* = 0, \quad \alpha_i^* \geq 0, \ \mu_i^* \geq 0, \ \xi_i^* \geq 0

For a component with 0 < \alpha_j^* < C we have \mu_j^* = C - \alpha_j^* > 0, hence \xi_j^* = 0, so y_j (w^* \cdot x_j + b^*) = 1, which yields the expression for b^*.
2.3 Learning algorithm
Input: a training data set T = {(x_1, y_1), ..., (x_N, y_N)}, where x_i \in R^n and y_i \in \{+1, -1\}
Output: the separating hyperplane and the classification decision function
1: Choose a penalty parameter C > 0, then construct and solve the convex quadratic programming problem

    \min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i
    s.t. \ \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C

obtaining \alpha^*.
2: Compute w^* = \sum_i \alpha_i^* y_i x_i; select a component \alpha_j^* with 0 < \alpha_j^* < C and compute b^* = y_j - \sum_i \alpha_i^* y_i (x_i \cdot x_j).
3: The separating hyperplane is w^* \cdot x + b^* = 0.
4: The classification decision function is f(x) = \mathrm{sign}(w^* \cdot x + b^*).
2.4 Support vectors
In the soft-margin case the support vectors are the sample points with \alpha_i^* > 0. Depending on the slack \xi_i, a support vector either lies exactly on the margin boundary (\xi_i = 0), lies between the margin boundary and the hyperplane (0 < \xi_i < 1), lies on the hyperplane (\xi_i = 1), or is misclassified (\xi_i > 1).
2.5 Hinge loss function
There is another interpretation: the linear soft-margin SVM minimizes the loss function

    L = \sum_i [1 - y_i (w \cdot x_i + b)]_+ + \lambda \|w\|^2

The first term is the hinge loss:

    [1 - y (w \cdot x + b)]_+

where

    [z]_+ = \begin{cases} z, & z > 0 \\ 0, & z \leq 0 \end{cases}
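This loss view leads directly to gradient-style training, which is presumably where the series' PyTorch code comes in. A minimal numpy sketch of minimizing the hinge-loss objective by subgradient descent; the toy data and hyperparameters are invented for illustration:

```python
import numpy as np

# Sketch: train a linear SVM by subgradient descent on
#   L = sum_i [1 - y_i (w . x_i + b)]_+ + lam * ||w||^2
# Two well-separated Gaussian blobs; data and hyperparameters are illustrative.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

w, b, lam, lr = np.zeros(2), 0.0, 0.01, 0.1
for _ in range(200):
    margins = y * (X @ w + b)
    active = margins < 1                      # points with nonzero hinge loss
    # Subgradient of the objective with respect to (w, b).
    grad_w = -(y[active][:, None] * X[active]).sum(axis=0) + 2 * lam * w
    grad_b = -y[active].sum()
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = (np.sign(X @ w + b) == y).mean()
print(accuracy)  # training accuracy on this separable toy set
```

For only-positive hinge terms the subgradient is the sum over "active" points with margin below 1, which is exactly what the mask computes.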
3 Nonlinear support vector machines and kernel functions
3.1 The nonlinear classification problem
First use a transformation to map the data from the original input space to a new feature space; then, in the new space, apply a linear classification learning method to learn the classification model from the training data.
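The idea above can be sketched on a 1-D toy problem: points whose labels alternate around the origin are not linearly separable on the line, but become separable after a hand-picked mapping \phi(x) = (x, x^2) (the data, mapping, and threshold are illustrative only):

```python
import numpy as np

# 1-D points whose negatives sit between positives: not linearly separable
# in the original space. Data are illustrative only.
x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
y = np.array([1, -1, -1, -1, 1])

# Map to a new space with phi(x) = (x, x^2); there the classes are separated
# by the line x2 = 2, i.e. w = (0, 1), b = -2.
phi = np.column_stack([x, x ** 2])
preds = np.sign(phi @ np.array([0.0, 1.0]) - 2.0)
print(preds)  # [ 1. -1. -1. -1.  1.]
```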
3.2 Kernel function definition
Let \mathcal{X} be the input space (a subset of Euclidean space, or a discrete set), and let H be the feature space (a Hilbert space). If there exists a mapping

    \phi(x): \mathcal{X} \to H

such that for all x, z \in \mathcal{X}

    K(x, z) = \phi(x) \cdot \phi(z)

then K(x, z) is called a kernel function and \phi(x) its mapping function.
3.3 Positive definite kernels
Necessary and sufficient condition for K(x, z) to be a positive definite kernel:
Suppose K(x, z) is a symmetric function defined on \mathcal{X} \times \mathcal{X}. Then K(x, z) is a positive definite kernel if and only if, for any x_1, \dots, x_m \in \mathcal{X}, the Gram matrix of K(x, z),

    K = [K(x_i, x_j)]_{m \times m}

is positive semidefinite. The sufficiency proof constructs a Hilbert space H from K(x, z) in three steps:
1. define a mapping and form a vector space S;
2. define an inner product on S, making it an inner product space;
3. complete S into a Hilbert space.

Step 1: define the mapping and form the vector space S.
Define the mapping

    \phi: x \mapsto K(\cdot, x)

According to this mapping, for any x_i \in \mathcal{X} and \alpha_i \in R, i = 1, 2, \dots, m, define the linear combination

    f(\cdot) = \sum_{i=1}^{m} \alpha_i K(\cdot, x_i)

Consider the set S of elements formed by such linear combinations. Since S is closed under addition and scalar multiplication, S forms a vector space.

Step 2: define an inner product on S, making it an inner product space.
Define an operation * on S: for any f, g \in S with

    f(\cdot) = \sum_{i=1}^{m} \alpha_i K(\cdot, x_i), \quad g(\cdot) = \sum_{j=1}^{l} \beta_j K(\cdot, z_j)

define

    f * g = \sum_{i=1}^{m} \sum_{j=1}^{l} \alpha_i \beta_j K(x_i, z_j)

One then verifies that * satisfies the axioms of an inner product; in particular, f * f \geq 0 follows from the positive semidefiniteness of the Gram matrix.

Step 3: complete the vector space S into a Hilbert space.
Completing S with respect to the norm \|f\| = \sqrt{f * f} yields a Hilbert space H.

3.4 Common kernel functions
The kernel method is a technique for handling such problems: data that are not linearly separable in a low-dimensional space can become linearly separable in a high-dimensional space, but computing directly in the high-dimensional space is very expensive. The kernel trick completes the high-dimensional computation through a computation in the low-dimensional space plus a transformation, i.e. by evaluating K(x, z) directly without constructing \phi. Common kernels include the polynomial kernel K(x, z) = (x \cdot z + 1)^p and the Gaussian (RBF) kernel K(x, z) = \exp(-\|x - z\|^2 / (2\sigma^2)).
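The positive-semidefiniteness condition above can be checked numerically. A sketch building the Gram matrix of the Gaussian (RBF) kernel on a few random points and verifying that all its eigenvalues are nonnegative (the data and \sigma are arbitrary):

```python
import numpy as np

# Gram matrix of the RBF kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
# on a few random points; data and sigma are arbitrary.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
gram = np.exp(-sq_dists / (2 * sigma ** 2))

eigvals = np.linalg.eigvalsh(gram)  # eigvalsh: eigenvalues of a symmetric matrix
print(np.all(eigvals >= -1e-10))    # True: the Gram matrix is PSD
```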
Review of linear algebra
1: Positive definite matrix (positive definite and positive semi-definite)
Given an [n, n] real symmetric matrix A, if for any nonzero vector x of length n

    x^T A x > 0

holds, then A is a positive definite matrix.
1.1 Property: all eigenvalues of A are greater than 0.

    Ax = \lambda x, \ x \neq 0 \ \Rightarrow \ x^T A x = \lambda \|x\|^2 > 0 \ \Rightarrow \ \lambda > 0

Conversely, a symmetric matrix whose eigenvalues are all greater than 0 must be positive definite.
1.2 Example: the identity matrix I is positive definite, since x^T I x = \|x\|^2 > 0 for any x \neq 0.
2: Positive semidefinite matrix
Given an [n, n] real symmetric matrix A, if for any vector x of length n

    x^T A x \geq 0

holds, then A is a positive semidefinite matrix.
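By the eigenvalue characterization above, definiteness can be tested numerically. A small sketch (the example matrices are chosen for illustration):

```python
import numpy as np

# Classify a real symmetric matrix via its eigenvalues; example matrices only.
def classify(A, tol=1e-10):
    eigvals = np.linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semidefinite"
    return "indefinite"

print(classify(np.array([[2.0, -1.0], [-1.0, 2.0]])))  # eigenvalues 1, 3
print(classify(np.array([[1.0, 1.0], [1.0, 1.0]])))    # eigenvalues 0, 2
print(classify(np.array([[0.0, 1.0], [1.0, 0.0]])))    # eigenvalues -1, 1
```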