2022-07-19 08:04:00 【Novice Alchemist】
Feature Engineering
1.1 The meaning of Feature Engineering
It is said that one cannot make bricks without straw. In machine learning, data and features are the "rice", while models and algorithms are the "cook": for a given problem, the data and features largely determine the upper limit of the results, and the choice of model and algorithm only approaches that upper limit.
Feature engineering, as the name suggests, is a series of engineering steps applied to raw data that refine it into features suitable as input for algorithms and models. In essence, feature engineering is a process of representing and presenting data.
1.2 Feature normalization
To eliminate the effect of differing units and scales between features, we need to normalize them so that different indicators become comparable.
- For example, suppose we analyze the effect of a person's height and weight on health, using meters (m) and kilograms (kg) as units
- Most height values will fall within the range 1.6~1.8
- while most weights will fall within the range 50~100
- The analysis will then be biased toward the weight feature, which has the larger numerical spread
- So to obtain more accurate results, we need to bring the features to the same order of magnitude before analysis
There are two common normalization methods:
① Min-max normalization (linear scaling)
It applies a linear transformation to the raw data, mapping the result into the range $[0,1]$ and rescaling the data proportionally:
$$X_{norm}=\frac{X-X_{min}}{X_{max}-X_{min}}$$
- where $X$ is a raw value, and $X_{min}$ and $X_{max}$ are the minimum and maximum of the raw data, respectively
② Z-score normalization (zero-mean normalization)
It maps the raw data onto a distribution with mean 0 and standard deviation 1:
$$z=\frac{x-\mu}{\sigma}$$
- where $\mu$ is the mean of the raw data and $\sigma$ is its standard deviation
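As a concrete sketch, here are both methods in NumPy (the height values are made up for illustration):

```python
# A minimal sketch of both normalization methods (synthetic data).
import numpy as np

x = np.array([1.62, 1.70, 1.75, 1.80])          # heights in meters

# Min-max normalization: maps values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)   # values in [0, 1]
print(x_zscore)   # mean ~0, std ~1
```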
So why should we normalize at all?
Stochastic gradient descent offers a good illustration of why normalization matters.
- Suppose we have two numerical features: $x_1$ with values in $[0,10]$ and $x_2$ with values in $[0,3]$
- The contours of the loss surface are then elongated ellipses
- With the same learning rate, $x_1$ is updated faster than $x_2$, and many iterations are needed to find the optimal solution
- But if we normalize, the contours become circles
- The update speeds of $x_1$ and $x_2$ become consistent, and the optimum is found more easily (see the numeric sketch after this list)
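To make this concrete, below is a minimal numeric sketch (synthetic data and hypothetical helper names, not from the original text). Depending on the learning rate, the badly scaled problem either converges slowly or, as here, diverges outright, while the normalized one converges quickly:

```python
# A minimal sketch of how feature scales affect gradient descent on a
# least-squares loss (all data and names here are illustrative).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 200)   # feature with range [0, 10]
x2 = rng.uniform(0, 3, 200)    # feature with range [0, 3]
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 1.0 * x2 + rng.normal(0, 0.1, 200)

def gd(X, y, lr=0.1, tol=1e-6, max_iter=10_000):
    """Batch gradient descent on mean-squared error."""
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)
        w_new = w - lr * grad
        if np.abs(w_new).max() > 1e10:          # weights blowing up
            return f"diverged after {i + 1} steps"
        if np.linalg.norm(w_new - w) < tol:     # update below tolerance
            return f"converged in {i + 1} steps"
        w = w_new
    return "hit max_iter"

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score normalization
print("raw features:       ", gd(X, y))         # diverges at this learning rate
print("normalized features:", gd(X_norm, y))    # converges in a few dozen steps
```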
Of course, normalization is not a cure-all. In practice, models solved by gradient descent usually require normalization, but it does not apply to decision trees, whose splits are unaffected by monotonic rescaling of a feature.
1.3 Categorical features
Categorical features, such as gender or blood type, take values from a limited set of options. Their raw input is usually a string. Apart from decision trees and a few other models that can consume string input directly, models such as logistic regression require categorical features to be converted into numerical features before they can work correctly.
There are usually three ways to encode them:
① Ordinal encoding
- Ordinal encoding is usually used for categorical data whose categories have an inherent order
- For example, grades can be divided into high, medium, and low, with high > medium > low
- We can then encode high = 3, medium = 2, low = 1, so the ordering is preserved (a sketch follows below)
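A minimal sketch of ordinal encoding with a plain Python mapping (using the grade labels from the example above):

```python
# A minimal sketch of ordinal encoding (grade labels from the example above).
grades = ["high", "low", "medium", "high"]
order = {"low": 1, "medium": 2, "high": 3}   # encoding preserves low < medium < high
encoded = [order[g] for g in grades]
print(encoded)   # [3, 1, 2, 3]
```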
② One-hot encoding
- One-hot encoding is usually used for categorical data without an inherent order
- For example, blood type has four values: A, B, AB, and O
- We can encode $A=(1,0,0,0)$, $B=(0,1,0,0)$, $AB=(0,0,1,0)$, $O=(0,0,0,1)$
- However, note the following when using one-hot encoding (see the sketch after this list):
  - Use sparse vectors to save storage space
  - Pair it with feature selection to reduce dimensionality, since high-dimensional features bring several problems:
    - In the k-nearest-neighbor algorithm, the distance between two points becomes hard to measure meaningfully in high dimensions
    - In the logistic regression model, the number of parameters grows with the dimension, making overfitting easy
    - Usually only some of the dimensions are helpful for prediction
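A minimal sketch using scikit-learn's OneHotEncoder (assuming scikit-learn >= 1.2, where the sparse-matrix flag is named `sparse_output`; earlier versions call it `sparse`):

```python
# A minimal sketch of one-hot encoding with scikit-learn
# (assumes sklearn >= 1.2 for the `sparse_output` flag).
from sklearn.preprocessing import OneHotEncoder
import numpy as np

blood = np.array([["A"], ["B"], ["AB"], ["O"], ["A"]])
enc = OneHotEncoder(sparse_output=True)   # sparse output saves storage
X = enc.fit_transform(blood)
print(enc.categories_)   # learned category order: [array(['A', 'AB', 'B', 'O'], ...)]
print(X.toarray())       # one row per sample, one column per category
```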
③ Binary encoding
- Binary encoding has two steps: first assign each category an ID, then use the binary representation of that ID as the encoding
- Take blood type as an example:
| Blood type | Category ID | Binary encoding | One-hot encoding |
|---|---|---|---|
| A | 1 | 001 | 1000 |
| B | 2 | 010 | 0100 |
| AB | 3 | 011 | 0010 |
| O | 4 | 100 | 0001 |
- Compared with one-hot encoding, binary encoding has a lower dimension and saves storage space (see the sketch below)
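A minimal sketch of the two-step procedure, with category IDs assigned as in the table above:

```python
# A minimal sketch of binary encoding: assign each category an ID,
# then use the ID's binary digits as features.
import math

cats = ["A", "B", "AB", "O"]
ids = {c: i + 1 for i, c in enumerate(cats)}   # A=1, B=2, AB=3, O=4
width = math.ceil(math.log2(len(cats) + 1))    # bits needed to hold the largest ID: 3

def binary_encode(cat):
    return [int(b) for b in format(ids[cat], f"0{width}b")]

for c in cats:
    print(c, binary_encode(c))
# A [0, 0, 1], B [0, 1, 0], AB [0, 1, 1], O [1, 0, 0]
```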
1.4 Handling high-dimensional combined features
To improve the ability to fit complex relationships, feature engineering often combines discrete features pairwise into higher-order combined features.
Combining features naively, however, brings problems.
Suppose one feature has $m$ categories and the other has $n$.
If we cross the two features directly, we get $m \times n$ combinations. Taking logistic regression as an example, we also get $m \times n$ parameters; when $m$ and $n$ are both very large, those parameters become almost impossible to learn:
$$Y=\mathrm{sigmoid}\Big(\sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij}\,\langle x_i, x_j\rangle\Big)$$
An effective remedy is to represent each of the two features by a $k$-dimensional low-dimensional vector ($k \ll m$, $k \ll n$); the number of parameters then becomes $m \times k + n \times k$:
$$Y=\mathrm{sigmoid}\Big(\sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij}\,\langle x_i, x_j\rangle\Big)$$
where $w_{ij}=x_i' \cdot x_j'$, and $x_i'$ and $x_j'$ are the low-dimensional vectors corresponding to $x_i$ and $x_j$; this is essentially equivalent to matrix factorization. A parameter-count sketch follows.
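A back-of-the-envelope sketch of the parameter savings (the sizes and the random embedding initialization below are made up for illustration):

```python
# A minimal sketch of replacing an m*n cross-feature weight matrix with
# k-dimensional embeddings, so that w_ij = <u_i, v_j>.
import numpy as np

m, n, k = 10_000, 5_000, 16
rng = np.random.default_rng(0)
U = rng.normal(0, 0.01, (m, k))   # embeddings for feature 1's categories
V = rng.normal(0, 0.01, (n, k))   # embeddings for feature 2's categories

i, j = 42, 7                      # an observed category pair
w_ij = U[i] @ V[j]                # interaction weight from the two embeddings

print("full cross parameters:", m * n)          # 50,000,000
print("factorized parameters:", m * k + n * k)  # 240,000
```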
1.5 Text representation models
A text representation model is, as the name suggests, a model used to represent text data.
① The bag-of-words model
The bag-of-words model treats each article as a bag of words, ignoring the order in which the words appear.
Concretely, the whole text is segmented into words, and each article can then be represented as a long vector.
Each dimension of the vector corresponds to one word.
The weight in each dimension reflects how important that word is in the article, and is commonly computed with TF-IDF:
$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t)$$
where $\text{tf}(t,d)$ is the frequency of word $t$ in document $d$,
and $\text{idf}(t)$ is the inverse document frequency, which measures how important word $t$ is for expressing semantics:
$$\text{idf}(t)=\log\frac{\text{total number of articles}}{\text{number of articles containing word } t+1}$$
Intuitively, if a word appears in very many articles, it is probably a generic word, such as "today", and its contribution to distinguishing meaning is relatively low (see the sketch below).
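A minimal sketch with scikit-learn's TfidfVectorizer (note that sklearn's exact idf smoothing differs slightly from the formula above; the toy corpus is invented):

```python
# A minimal TF-IDF sketch with scikit-learn (toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)          # documents as rows, words as columns
print(vec.get_feature_names_out())     # the vocabulary, one word per dimension
print(X.toarray().round(2))            # tf-idf weight of each word in each document
```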
② The N-gram model
- Splitting text at the level of single words is sometimes inadequate: splitting **natural language processing** into three separate words gives meanings quite different from the phrase as a whole
- Therefore, phrases formed by $n$ consecutive words ($n \le N$) are added to the vector representation as single features; this is the N-gram model
- Also, in English the same word can take many inflected forms with similar meanings, so in practice words are usually reduced to their stems, unifying the different forms into one (a short sketch follows)
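A minimal sketch of both ideas, n-grams via scikit-learn's CountVectorizer and stemming via NLTK's PorterStemmer (assuming both libraries are installed):

```python
# A minimal sketch: unigrams plus bigrams, then stemming inflected forms.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

vec = CountVectorizer(ngram_range=(1, 2))   # keep 1-grams and 2-grams
X = vec.fit_transform(["natural language processing is fun"])
print(vec.get_feature_names_out())
# includes 'natural language' and 'language processing' as single features

stem = PorterStemmer()
print([stem.stem(w) for w in ["processing", "processed", "processes"]])
# all three reduce to the same stem 'process'
```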
③ Topic models
The original book covers this later, so I'll lazily skip it for now.
④ Word embedding and deep learning models
- Word embedding is a class of models that vectorize words; the core idea is to map every word to a dense vector in a low-dimensional space (usually $K=50\sim300$ dimensions)
- Each of the $K$ dimensions can also be viewed as an implicit topic, just a less intuitive one
- Since word embedding maps each word to a $K$-dimensional vector, an article containing $N$ words can be represented by an $N \times K$ matrix
1.6 Word2Vec
Word2Vec, released by Google in 2013, is one of the most commonly used word embedding models. It is a shallow neural network model with two network structures: CBOW and Skip-gram.
CBOW predicts the generation probability of the current word from the words appearing in its context, while Skip-gram predicts the generation probability of each context word from the current word. Their general structures are shown below:

where $w_t$ is the current word of interest and $w_{t-2}, \dots, w_{t+2}$ are the context words; the sliding window size here is 2.
Both CBOW and Skip-gram can be represented as a neural network consisting of an input layer, a projection layer, and an output layer.
Each word in the input layer is one-hot encoded: every word is represented as an $N$-dimensional vector, where $N$ is the total number of words in the vocabulary.
In the projection layer, the values of the $K$ hidden units are computed from the $N$-dimensional input vector and the $N \times K$ weight matrix connecting the input and hidden layers; in CBOW, the hidden units computed from the individual input words are additionally summed.
Likewise, the output-layer vector is computed from the $K$-dimensional hidden-layer vector and the $K \times N$ weight matrix connecting the hidden and output layers; the output layer is again an $N$-dimensional vector, with each dimension corresponding to a word in the vocabulary.
Finally, a Softmax activation can be applied to the output-layer vector to compute the generation probability of each word. The Softmax function is defined as:
$$P(y=w_n \mid X)=\frac{e^{x_n}}{\sum\limits_{k=1}^{N} e^{x_k}}$$
- where $x$ denotes the $N$-dimensional raw output vector and $x_n$ is the value in the dimension corresponding to word $w_n$
The remaining task is to train the network weights so as to maximize the overall generation probability of all words in the corpus.
After training yields the $N \times K$ and $K \times N$ weight matrices, either one can be chosen as the $K$-dimensional vector representation of the $N$ words (a usage sketch follows).
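If you just want the vectors rather than implementing the network yourself, here is a minimal usage sketch with gensim (assuming gensim >= 4.0, where the embedding-size parameter is named `vector_size`; the corpus and parameters are illustrative):

```python
# A minimal Word2Vec usage sketch with gensim (toy corpus).
from gensim.models import Word2Vec

sentences = [
    ["machine", "learning", "needs", "features"],
    ["word", "embeddings", "map", "words", "to", "vectors"],
    ["features", "determine", "the", "upper", "limit"],
]
model = Word2Vec(
    sentences,
    vector_size=50,   # K: embedding dimension
    window=2,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = Skip-gram, 0 = CBOW
)
vec = model.wv["features"]   # the K-dimensional vector for a word
print(vec.shape)             # (50,)
```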
1.7 Handling insufficient image data
In machine learning, most models need large amounts of data for training, but in practice data is often in short supply. Take image classification, one of the most basic computer-vision tasks: what should we do when only a small number of training samples are available?
- The information a model can draw on comes from two sources: the information contained in the training data, and the prior information people supply while building the model
- When training data is insufficient, the model obtains little information from the raw data; to still guarantee the model's performance, we need to bring in more prior information
- We can also adjust, transform, or expand the training data according to specific assumptions, so that it exposes more, and more useful, information for the model to learn from
For the image classification task specifically, insufficient training data tends to cause overfitting, and the countermeasures fall into two categories:
- One acts on the model: simplifying the model, adding constraints to shrink the hypothesis space (regularization), ensemble learning, the Dropout hyperparameter, and so on
- The other acts on the data, mainly through data augmentation: transforming the original data appropriately, while preserving the label-relevant information, so as to enlarge the dataset (see the sketch after this list)
  - Images can be randomly rotated, translated, zoomed, cropped, padded, flipped left-right, etc.
  - Pixel-level noise can be added, such as salt-and-pepper noise or white Gaussian noise
  - Colors can be transformed
  - Brightness, clarity, contrast, sharpness, etc. can be changed
  - Features can be extracted from the image and transformations applied in feature space
  - Images can be generated directly with a generative adversarial network (GAN)
  - A pretrained model can be fine-tuned, and so on
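As an illustration of the geometric and color transformations listed above, here is a minimal augmentation pipeline with torchvision (the specific parameter values are arbitrary choices for the sketch, not recommendations from the text):

```python
# A minimal image-augmentation sketch with torchvision.transforms.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),       # random rotation
    transforms.RandomResizedCrop(224),           # random scale + crop
    transforms.RandomHorizontalFlip(),           # left-right flip
    transforms.ColorJitter(brightness=0.2,       # brightness / contrast /
                           contrast=0.2,         # saturation perturbations
                           saturation=0.2),
    transforms.ToTensor(),                       # PIL image -> tensor
])
# Applied to a PIL image: tensor = augment(pil_image)
```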