Using PCA to simplify data
2022-07-19 07:50:00 【The soul is on the way】
Principal component analysis
1. What is principal component analysis?
Overview
- Principal Component Analysis (PCA): put simply, it finds the most important features of the data and then analyzes them.
Definition
- The process of transforming high-dimensional data into low-dimensional data; along the way, some of the original variables may be discarded and new variables created.
Purpose
- Compress the data by reducing its dimensionality (complexity) as much as possible while losing only a small amount of information.
Applications
- Regression analysis and cluster analysis.
2. PCA principle
- Find the direction of the first principal component, i.e. the direction of the largest variance in the data.
- Find the direction of the second principal component, i.e. the direction of the second-largest variance, which must be orthogonal to the direction of the first principal component (in two-dimensional space, orthogonal simply means perpendicular).
- All remaining principal component directions are found in the same way.
- These principal components are obtained from an eigenvalue analysis of the data's covariance matrix.
- Once the eigenvalues and eigenvectors of the covariance matrix are available, we keep the N eigenvectors with the largest eigenvalues. These eigenvectors capture the structure of the N most important features, and multiplying the data by these N eigenvectors transforms it into the new space (a minimal sketch follows this list).
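To make the steps above concrete, here is a minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix (center, compute covariance, decompose, keep the top eigenvectors, project). The function name pca_eig and the parameter n_keep are illustrative choices, not from the original post.

import numpy as np

def pca_eig(X, n_keep):
    """Project X (n_samples x n_features) onto its top n_keep principal components."""
    # Center the data so the covariance matrix describes variance around the mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (n_features x n_features)
    cov = np.cov(X_centered, rowvar=False)
    # Eigen-decomposition; eigh is used because the covariance matrix is symmetric
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # Sort eigenvectors by descending eigenvalue and keep the largest n_keep
    order = np.argsort(eig_vals)[::-1][:n_keep]
    components = eig_vecs[:, order]          # n_features x n_keep
    # Multiply the centered data by the kept eigenvectors to enter the new space
    return X_centered @ components

# Example: reduce 5-dimensional random data to 2 dimensions
X = np.random.rand(100, 5)
X_low = pca_eig(X, n_keep=2)
print(X_low.shape)   # (100, 2)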
Why orthogonality?
- Orthogonality minimizes the loss of information in the data.
- Another reason is that the eigenvectors of a symmetric matrix such as the covariance matrix are mutually orthogonal (a quick check is sketched below).
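As a quick check of the second point, the eigenvectors that numpy.linalg.eigh returns for a symmetric matrix (such as a covariance matrix) are mutually orthogonal; the toy snippet below verifies this on made-up data.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
cov = np.cov(X, rowvar=False)              # symmetric 4x4 covariance matrix

_, eig_vecs = np.linalg.eigh(cov)
# Columns are eigenvectors; for a symmetric matrix they are orthonormal,
# so V^T V should be (numerically) the identity matrix.
print(np.allclose(eig_vecs.T @ eig_vecs, np.eye(4)))   # True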
PCA advantages and disadvantages
- Advantages: reduces the complexity of the data and identifies the most important features.
- Disadvantages: it is not always necessary, and useful information may be lost.
- Applicable data type: numerical data.
3. The idea behind PCA
As its name suggests, PCA finds the main aspects of the data and uses them to replace the original data. Concretely, suppose our dataset has m samples, each n-dimensional. We want to reduce these m samples from n dimensions to n' dimensions, and we want the m n'-dimensional samples to represent the original dataset as well as possible. Going from n dimensions down to n' dimensions inevitably loses information, but we want the loss to be as small as possible. So how can the n'-dimensional data represent the original data as faithfully as possible?
Consider the simplest case first: n = 2, n' = 1, i.e. reducing the data from two dimensions to one. We want to find one direction that can represent the two-dimensional data. Two candidate vector directions are drawn, u1 and u2; which one represents the original dataset better? Intuitively, u1 is better than u2.
Why is u1 better than u2? There are two explanations: first, the sample points are close enough to this line; second, the projections of the sample points onto this line are spread as far apart as possible.
If we extend n' from 1 to an arbitrary dimension, the criterion for dimensionality reduction becomes: the sample points are close enough to the hyperplane, or equivalently, the projections of the sample points onto the hyperplane are spread as far apart as possible (a small numerical illustration follows).
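To illustrate the "projections as spread out as possible" criterion numerically, the sketch below builds made-up 2-D points stretched along one direction and compares the variance of their projections onto two unit vectors playing the roles of u1 and u2. The data and directions here are illustrative, not taken from the original figure.

import numpy as np

rng = np.random.default_rng(0)
# 2-D points stretched along the 45-degree line: a strong shared component
# plus a little independent noise in each coordinate
t = rng.normal(0.0, 3.0, size=500)              # shared component
noise = rng.normal(0.0, 0.5, size=(500, 2))     # small per-axis noise
X = np.column_stack([t, t]) + noise
X -= X.mean(axis=0)

u1 = np.array([1.0, 1.0]) / np.sqrt(2)   # along the stretched direction
u2 = np.array([1.0, -1.0]) / np.sqrt(2)  # perpendicular direction

# Variance of the 1-D projections onto each candidate direction
print("variance along u1:", np.var(X @ u1))   # large
print("variance along u2:", np.var(X @ u2))   # small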
4. sklearn API
- sklearn.decomposition.PCA(n_components=None)
  - Projects the data into a lower-dimensional space
  - n_components:
    - float: the fraction of variance (information) to retain
    - integer: the number of components to reduce to
- PCA.fit_transform(X), where X is numpy-array data of shape [n_samples, n_features]
  - Return value: the transformed array with the specified number of components (a usage sketch follows this list)
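A short usage sketch of this API, showing both ways of setting n_components; the random data and sizes here are placeholders for illustration.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 10)          # 200 samples, 10 features

# Integer: keep exactly 3 components
pca_int = PCA(n_components=3)
print(pca_int.fit_transform(X).shape)             # (200, 3)

# Float: keep as many components as needed to retain 90% of the variance
pca_frac = PCA(n_components=0.9)
X_reduced = pca_frac.fit_transform(X)
print(X_reduced.shape)                            # (200, k), k chosen automatically
print(pca_frac.explained_variance_ratio_.sum())   # >= 0.9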
5. Case study: exploring users' preferences for item categories (segmentation and dimensionality reduction)
P.S.: the data comes from a Kaggle competition.
The data files are as follows:
order_products__prior.csv: order and product information
- Fields: order_id, product_id, add_to_cart_order, reordered
products.csv: product information
- Fields: product_id, product_name, aisle_id, department_id
orders.csv: users' order information
- Fields: order_id, user_id, eval_set, order_number, …
aisles.csv: the specific item category (aisle) each product belongs to
- Fields: aisle_id, aisle


import pandas as pd
from sklearn.decomposition import PCA

# Read the four tables
prior = pd.read_csv("./data/instacart/order_products__prior.csv")
products = pd.read_csv("./data/instacart/products.csv")
orders = pd.read_csv("./data/instacart/orders.csv")
aisles = pd.read_csv("./data/instacart/aisles.csv")

# Merge the four tables on their shared keys
mt = pd.merge(prior, products, on='product_id')
mt1 = pd.merge(mt, orders, on='order_id')
mt2 = pd.merge(mt1, aisles, on='aisle_id')

# pd.crosstab builds a user x aisle frequency table (how many times each user bought from each aisle)
cross = pd.crosstab(mt2['user_id'], mt2['aisle'])

# PCA: keep enough components to retain 95% of the variance
pc = PCA(n_components=0.95)
data = pc.fit_transform(cross)
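As an optional sanity check (my addition, not part of the original post), you can compare the shapes before and after the reduction and confirm how much variance the kept components explain:

# cross has one column per aisle; after PCA with n_components=0.95 only the
# components needed to keep 95% of the variance remain.
print(cross.shape)
print(data.shape)
print(pc.explained_variance_ratio_.sum())   # should be at least 0.95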
6. PCA algorithm summary
To summarize PCA: as an unsupervised dimensionality-reduction method, it only requires an eigenvalue decomposition to compress and denoise the data, so it is widely used in practice. To overcome some of PCA's shortcomings, many variants have been proposed, such as KPCA for nonlinear dimensionality reduction, Incremental PCA for working within memory limits, and Sparse PCA for dimensionality reduction on sparse data (a sketch of the corresponding scikit-learn classes follows the lists below).
Advantages of PCA:
1) Information content is measured only by variance, so the result is not affected by factors outside the dataset.
2) The principal components are orthogonal, which eliminates interactions between the components of the original data.
3) The computation is simple; the main operation is an eigenvalue decomposition, which is easy to implement.
Disadvantages of PCA:
1) The meaning of each principal-component dimension is somewhat ambiguous and less interpretable than the original features.
2) Discarded components with small variance may still contain important information about differences between samples, so dropping them during dimensionality reduction may affect subsequent processing.
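For reference, the variants mentioned above have scikit-learn counterparts; the sketch below only shows how they are constructed, with placeholder parameter values rather than recommended settings.

from sklearn.decomposition import KernelPCA, IncrementalPCA, SparsePCA

# KPCA: kernel PCA for nonlinear dimensionality reduction
kpca = KernelPCA(n_components=2, kernel='rbf')

# Incremental PCA: fits in mini-batches to work around memory limits
ipca = IncrementalPCA(n_components=2, batch_size=100)

# Sparse PCA: components with sparse loadings, suited to sparse data
spca = SparsePCA(n_components=2)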