2022-07-19 08:04:00 【Novice Alchemist】
Feature Engineering
1.1 The meaning of Feature Engineering
It is said that one cannot make bricks without straw. In machine learning, data and features are the "rice", while models and algorithms are the "cook": for a given problem, the data and features largely determine the upper limit of the results, and the choice of model and algorithm only approaches that upper limit.
Feature engineering, as the name suggests, is a series of engineering steps applied to raw data that refine it into features suitable as input for algorithms and models. In essence, feature engineering is a process of representing and presenting data.
1.2 Feature normalization
To eliminate the effect of differing units and scales between features, we need to normalize them so that different indicators become comparable.
- For example, suppose we analyze the effect of a person's height and weight on health, using meters (m) and kilograms (kg) as units
- Most height values will fall within the range 1.6~1.8
- while most weights will fall within the range 50~100
- The analysis will then be biased toward the weight feature, which has the larger numerical spread
- So to obtain more accurate results, we need to bring the features to the same order of magnitude before analysis
There are two common normalization methods:
① Min-max normalization (linear scaling)
It applies a linear transformation to the raw data, mapping the result into the range $[0,1]$ and rescaling the data proportionally:
$$X_{norm}=\frac{X-X_{min}}{X_{max}-X_{min}}$$
- where $X$ is a raw value, and $X_{min}$ and $X_{max}$ are the minimum and maximum of the raw data, respectively
② Z-score normalization (zero-mean normalization)
It maps the raw data onto a distribution with mean 0 and standard deviation 1:
$$z=\frac{x-\mu}{\sigma}$$
- where $\mu$ is the mean of the raw data and $\sigma$ is its standard deviation
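As a concrete sketch, here are both methods in NumPy (the height values are made up for illustration):

```python
# A minimal sketch of both normalization methods (synthetic data).
import numpy as np

x = np.array([1.62, 1.70, 1.75, 1.80])          # heights in meters

# Min-max normalization: maps values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)   # values in [0, 1]
print(x_zscore)   # mean ~0, std ~1
```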
So why should we normalize at all?
Stochastic gradient descent offers a good illustration of why normalization matters.
- Suppose we have two numerical features: $x_1$ with values in $[0,10]$ and $x_2$ with values in $[0,3]$
- The contours of the loss surface are then elongated ellipses
- With the same learning rate, $x_1$ is updated faster than $x_2$, and many iterations are needed to find the optimal solution
- But if we normalize, the contours become circles
- The update speeds of $x_1$ and $x_2$ become consistent, and the optimum is found more easily (see the numeric sketch after this list)
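To make this concrete, below is a minimal numeric sketch (synthetic data and hypothetical helper names, not from the original text). Depending on the learning rate, the badly scaled problem either converges slowly or, as here, diverges outright, while the normalized one converges quickly:

```python
# A minimal sketch of how feature scales affect gradient descent on a
# least-squares loss (all data and names here are illustrative).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 200)   # feature with range [0, 10]
x2 = rng.uniform(0, 3, 200)    # feature with range [0, 3]
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 1.0 * x2 + rng.normal(0, 0.1, 200)

def gd(X, y, lr=0.1, tol=1e-6, max_iter=10_000):
    """Batch gradient descent on mean-squared error."""
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)
        w_new = w - lr * grad
        if np.abs(w_new).max() > 1e10:          # weights blowing up
            return f"diverged after {i + 1} steps"
        if np.linalg.norm(w_new - w) < tol:     # update below tolerance
            return f"converged in {i + 1} steps"
        w = w_new
    return "hit max_iter"

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score normalization
print("raw features:       ", gd(X, y))         # diverges at this learning rate
print("normalized features:", gd(X_norm, y))    # converges in a few dozen steps
```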
Of course, normalization is not a cure-all. In practice, models solved by gradient descent usually require normalization, but it does not apply to decision trees, whose splits are unaffected by monotonic rescaling of a feature.
1.3 Categorical features
Categorical features, such as gender or blood type, take values from a limited set of options. Their raw input is usually a string. Apart from decision trees and a few other models that can consume string input directly, models such as logistic regression require categorical features to be converted into numerical features before they can work correctly.
There are usually three ways to encode them:
① Ordinal encoding
- Ordinal encoding is usually used for categorical data whose categories have an inherent order
- For example, grades can be divided into high, medium, and low, with high > medium > low
- We can then encode high = 3, medium = 2, low = 1, so the ordering is preserved (a sketch follows below)
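A minimal sketch of ordinal encoding with a plain Python mapping (using the grade labels from the example above):

```python
# A minimal sketch of ordinal encoding (grade labels from the example above).
grades = ["high", "low", "medium", "high"]
order = {"low": 1, "medium": 2, "high": 3}   # encoding preserves low < medium < high
encoded = [order[g] for g in grades]
print(encoded)   # [3, 1, 2, 3]
```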
② One-hot encoding
- One-hot encoding is usually used for categorical data without an inherent order
- For example, blood type has four values: A, B, AB, and O
- We can encode $A=(1,0,0,0)$, $B=(0,1,0,0)$, $AB=(0,0,1,0)$, $O=(0,0,0,1)$
- However, note the following when using one-hot encoding (see the sketch after this list):
  - Use sparse vectors to save storage space
  - Pair it with feature selection to reduce dimensionality, since high-dimensional features bring several problems:
    - In the k-nearest-neighbor algorithm, the distance between two points becomes hard to measure meaningfully in high dimensions
    - In the logistic regression model, the number of parameters grows with the dimension, making overfitting easy
    - Usually only some of the dimensions are helpful for prediction
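A minimal sketch using scikit-learn's OneHotEncoder (assuming scikit-learn >= 1.2, where the sparse-matrix flag is named `sparse_output`; earlier versions call it `sparse`):

```python
# A minimal sketch of one-hot encoding with scikit-learn
# (assumes sklearn >= 1.2 for the `sparse_output` flag).
from sklearn.preprocessing import OneHotEncoder
import numpy as np

blood = np.array([["A"], ["B"], ["AB"], ["O"], ["A"]])
enc = OneHotEncoder(sparse_output=True)   # sparse output saves storage
X = enc.fit_transform(blood)
print(enc.categories_)   # learned category order: [array(['A', 'AB', 'B', 'O'], ...)]
print(X.toarray())       # one row per sample, one column per category
```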
③ Binary encoding
- Binary encoding has two steps: first assign each category an ID, then use the binary representation of that ID as the encoding
- Take blood type as an example:
| Blood type | Category ID | Binary encoding | One-hot encoding |
|---|---|---|---|
| A | 1 | 001 | 1000 |
| B | 2 | 010 | 0100 |
| AB | 3 | 011 | 0010 |
| O | 4 | 100 | 0001 |
- Compared with one-hot encoding, binary encoding has a lower dimension and saves storage space (see the sketch below)
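A minimal sketch of the two-step procedure, with category IDs assigned as in the table above:

```python
# A minimal sketch of binary encoding: assign each category an ID,
# then use the ID's binary digits as features.
import math

cats = ["A", "B", "AB", "O"]
ids = {c: i + 1 for i, c in enumerate(cats)}   # A=1, B=2, AB=3, O=4
width = math.ceil(math.log2(len(cats) + 1))    # bits needed to hold the largest ID: 3

def binary_encode(cat):
    return [int(b) for b in format(ids[cat], f"0{width}b")]

for c in cats:
    print(c, binary_encode(c))
# A [0, 0, 1], B [0, 1, 0], AB [0, 1, 1], O [1, 0, 0]
```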
1.4 Handling high-dimensional combined features
To improve the ability to fit complex relationships, feature engineering often combines discrete features pairwise into higher-order combined features.
Combining features naively, however, brings problems.
Suppose one feature has $m$ categories and the other has $n$.
If we cross the two features directly, we get $m \times n$ combinations. Taking logistic regression as an example, we also get $m \times n$ parameters; when $m$ and $n$ are both very large, those parameters become almost impossible to learn:
$$Y=\mathrm{sigmoid}\Big(\sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij}\,\langle x_i, x_j\rangle\Big)$$
An effective remedy is to represent each of the two features by a $k$-dimensional low-dimensional vector ($k \ll m$, $k \ll n$); the number of parameters then becomes $m \times k + n \times k$:
$$Y=\mathrm{sigmoid}\Big(\sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij}\,\langle x_i, x_j\rangle\Big)$$
where $w_{ij}=x_i' \cdot x_j'$, and $x_i'$ and $x_j'$ are the low-dimensional vectors corresponding to $x_i$ and $x_j$; this is essentially equivalent to matrix factorization. A parameter-count sketch follows.
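A back-of-the-envelope sketch of the parameter savings (the sizes and the random embedding initialization below are made up for illustration):

```python
# A minimal sketch of replacing an m*n cross-feature weight matrix with
# k-dimensional embeddings, so that w_ij = <u_i, v_j>.
import numpy as np

m, n, k = 10_000, 5_000, 16
rng = np.random.default_rng(0)
U = rng.normal(0, 0.01, (m, k))   # embeddings for feature 1's categories
V = rng.normal(0, 0.01, (n, k))   # embeddings for feature 2's categories

i, j = 42, 7                      # an observed category pair
w_ij = U[i] @ V[j]                # interaction weight from the two embeddings

print("full cross parameters:", m * n)          # 50,000,000
print("factorized parameters:", m * k + n * k)  # 240,000
```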
1.5 Text representation models
A text representation model is, as the name suggests, a model used to represent text data.
① The bag-of-words model
The bag-of-words model treats each article as a bag of words, ignoring the order in which the words appear.
Concretely, the whole text is segmented into words, and each article can then be represented as a long vector.
Each dimension of the vector corresponds to one word.
The weight in each dimension reflects how important that word is in the article, and is commonly computed with TF-IDF:
$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t)$$
where $\text{tf}(t,d)$ is the frequency of word $t$ in document $d$,
and $\text{idf}(t)$ is the inverse document frequency, which measures how important word $t$ is for expressing semantics:
$$\text{idf}(t)=\log\frac{\text{total number of articles}}{\text{number of articles containing word } t+1}$$
Intuitively, if a word appears in very many articles, it is probably a generic word, such as "today", and its contribution to distinguishing meaning is relatively low (see the sketch below).
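A minimal sketch with scikit-learn's TfidfVectorizer (note that sklearn's exact idf smoothing differs slightly from the formula above; the toy corpus is invented):

```python
# A minimal TF-IDF sketch with scikit-learn (toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)          # documents as rows, words as columns
print(vec.get_feature_names_out())     # the vocabulary, one word per dimension
print(X.toarray().round(2))            # tf-idf weight of each word in each document
```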
② The N-gram model
- Splitting text at the level of single words is sometimes inadequate: splitting **natural language processing** into three separate words gives meanings quite different from the phrase as a whole
- Therefore, phrases formed by $n$ consecutive words ($n \le N$) are added to the vector representation as single features; this is the N-gram model
- Also, in English the same word can take many inflected forms with similar meanings, so in practice words are usually reduced to their stems, unifying the different forms into one (a short sketch follows)
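A minimal sketch of both ideas, n-grams via scikit-learn's CountVectorizer and stemming via NLTK's PorterStemmer (assuming both libraries are installed):

```python
# A minimal sketch: unigrams plus bigrams, then stemming inflected forms.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

vec = CountVectorizer(ngram_range=(1, 2))   # keep 1-grams and 2-grams
X = vec.fit_transform(["natural language processing is fun"])
print(vec.get_feature_names_out())
# includes 'natural language' and 'language processing' as single features

stem = PorterStemmer()
print([stem.stem(w) for w in ["processing", "processed", "processes"]])
# all three reduce to the same stem 'process'
```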
③ Topic models
The original book covers this later, so I'll lazily skip it for now.
④ Word embedding and deep learning models
- Word embedding is a class of models that vectorize words; the core idea is to map every word to a dense vector in a low-dimensional space (usually $K=50\sim300$ dimensions)
- Each of the $K$ dimensions can also be viewed as an implicit topic, just a less intuitive one
- Since word embedding maps each word to a $K$-dimensional vector, an article containing $N$ words can be represented by an $N \times K$ matrix
1.6 Word2Vec
Word2Vec, released by Google in 2013, is one of the most commonly used word embedding models. It is a shallow neural network model with two network structures: CBOW and Skip-gram.
CBOW predicts the generation probability of the current word from the words appearing in its context, while Skip-gram predicts the generation probability of each context word from the current word. Their general structures are shown below:

where $w_t$ is the current word of interest and $w_{t-2}, \dots, w_{t+2}$ are the context words; the sliding window size here is 2.
Both CBOW and Skip-gram can be represented as a neural network consisting of an input layer, a projection layer, and an output layer.
Each word in the input layer is one-hot encoded: every word is represented as an $N$-dimensional vector, where $N$ is the total number of words in the vocabulary.
In the projection layer, the values of the $K$ hidden units are computed from the $N$-dimensional input vector and the $N \times K$ weight matrix connecting the input and hidden layers; in CBOW, the hidden units computed from the individual input words are additionally summed.
Likewise, the output-layer vector is computed from the $K$-dimensional hidden-layer vector and the $K \times N$ weight matrix connecting the hidden and output layers; the output layer is again an $N$-dimensional vector, with each dimension corresponding to a word in the vocabulary.
Finally, a Softmax activation can be applied to the output-layer vector to compute the generation probability of each word. The Softmax function is defined as:
$$P(y=w_n \mid X)=\frac{e^{x_n}}{\sum\limits_{k=1}^{N} e^{x_k}}$$
- where $x$ denotes the $N$-dimensional raw output vector and $x_n$ is the value in the dimension corresponding to word $w_n$
The remaining task is to train the network weights so as to maximize the overall generation probability of all words in the corpus.
After training yields the $N \times K$ and $K \times N$ weight matrices, either one can be chosen as the $K$-dimensional vector representation of the $N$ words (a usage sketch follows).
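If you just want the vectors rather than implementing the network yourself, here is a minimal usage sketch with gensim (assuming gensim >= 4.0, where the embedding-size parameter is named `vector_size`; the corpus and parameters are illustrative):

```python
# A minimal Word2Vec usage sketch with gensim (toy corpus).
from gensim.models import Word2Vec

sentences = [
    ["machine", "learning", "needs", "features"],
    ["word", "embeddings", "map", "words", "to", "vectors"],
    ["features", "determine", "the", "upper", "limit"],
]
model = Word2Vec(
    sentences,
    vector_size=50,   # K: embedding dimension
    window=2,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = Skip-gram, 0 = CBOW
)
vec = model.wv["features"]   # the K-dimensional vector for a word
print(vec.shape)             # (50,)
```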
1.7 Handling insufficient image data
In machine learning, most models need large amounts of data for training, but in practice data is often in short supply. Take image classification, one of the most basic computer-vision tasks: what should we do when only a small number of training samples are available?
- The information a model can draw on comes from two sources: the information contained in the training data, and the prior information people supply while building the model
- When training data is insufficient, the model obtains little information from the raw data; to still guarantee the model's performance, we need to bring in more prior information
- We can also adjust, transform, or expand the training data according to specific assumptions, so that it exposes more, and more useful, information for the model to learn from
For the image classification task specifically, insufficient training data tends to cause overfitting, and the countermeasures fall into two categories:
- One acts on the model: simplifying the model, adding constraints to shrink the hypothesis space (regularization), ensemble learning, the Dropout hyperparameter, and so on
- The other acts on the data, mainly through data augmentation: transforming the original data appropriately, while preserving the label-relevant information, so as to enlarge the dataset (see the sketch after this list)
  - Images can be randomly rotated, translated, zoomed, cropped, padded, flipped left-right, etc.
  - Pixel-level noise can be added, such as salt-and-pepper noise or white Gaussian noise
  - Colors can be transformed
  - Brightness, clarity, contrast, sharpness, etc. can be changed
  - Features can be extracted from the image and transformations applied in feature space
  - Images can be generated directly with a generative adversarial network (GAN)
  - A pretrained model can be fine-tuned, and so on
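As an illustration of the geometric and color transformations listed above, here is a minimal augmentation pipeline with torchvision (the specific parameter values are arbitrary choices for the sketch, not recommendations from the text):

```python
# A minimal image-augmentation sketch with torchvision.transforms.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),       # random rotation
    transforms.RandomResizedCrop(224),           # random scale + crop
    transforms.RandomHorizontalFlip(),           # left-right flip
    transforms.ColorJitter(brightness=0.2,       # brightness / contrast /
                           contrast=0.2,         # saturation perturbations
                           saturation=0.2),
    transforms.ToTensor(),                       # PIL image -> tensor
])
# Applied to a PIL image: tensor = augment(pil_image)
```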