Using PCA to simplify data
2022-07-19 07:50:00 【The soul is on the way】
Principal component analysis
1. What is principal component analysis?
Overview
- Principal Component Analysis (PCA): put simply, it finds the most important features of the data and then analyzes them.
Definition
- The process of transforming high-dimensional data into low-dimensional data; along the way, some of the original variables may be discarded and new variables created.
Purpose
- Compress the data by reducing its dimensionality (complexity) as much as possible while losing only a small amount of information.
Applications
- Regression analysis and cluster analysis.
2. PCA principle
- Find the direction of the first principal component, i.e. the direction of the largest variance in the data.
- Find the direction of the second principal component, i.e. the direction of the second-largest variance, which must be orthogonal to the direction of the first principal component (in two-dimensional space, orthogonal simply means perpendicular).
- All remaining principal component directions are found in the same way.
- These principal components are obtained from an eigenvalue analysis of the data's covariance matrix.
- Once the eigenvalues and eigenvectors of the covariance matrix are available, we keep the N eigenvectors with the largest eigenvalues. These eigenvectors capture the structure of the N most important features, and multiplying the data by these N eigenvectors transforms it into the new space (a minimal sketch follows this list).
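To make the steps above concrete, here is a minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix (center, compute covariance, decompose, keep the top eigenvectors, project). The function name pca_eig and the parameter n_keep are illustrative choices, not from the original post.

import numpy as np

def pca_eig(X, n_keep):
    """Project X (n_samples x n_features) onto its top n_keep principal components."""
    # Center the data so the covariance matrix describes variance around the mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (n_features x n_features)
    cov = np.cov(X_centered, rowvar=False)
    # Eigen-decomposition; eigh is used because the covariance matrix is symmetric
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # Sort eigenvectors by descending eigenvalue and keep the largest n_keep
    order = np.argsort(eig_vals)[::-1][:n_keep]
    components = eig_vecs[:, order]          # n_features x n_keep
    # Multiply the centered data by the kept eigenvectors to enter the new space
    return X_centered @ components

# Example: reduce 5-dimensional random data to 2 dimensions
X = np.random.rand(100, 5)
X_low = pca_eig(X, n_keep=2)
print(X_low.shape)   # (100, 2)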
Why orthogonality?
- Orthogonality minimizes the loss of information in the data.
- Another reason is that the eigenvectors of a symmetric matrix such as the covariance matrix are mutually orthogonal (a quick check is sketched below).
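As a quick check of the second point, the eigenvectors that numpy.linalg.eigh returns for a symmetric matrix (such as a covariance matrix) are mutually orthogonal; the toy snippet below verifies this on made-up data.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
cov = np.cov(X, rowvar=False)              # symmetric 4x4 covariance matrix

_, eig_vecs = np.linalg.eigh(cov)
# Columns are eigenvectors; for a symmetric matrix they are orthonormal,
# so V^T V should be (numerically) the identity matrix.
print(np.allclose(eig_vecs.T @ eig_vecs, np.eye(4)))   # True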
PCA advantages and disadvantages
- Advantages: reduces the complexity of the data and identifies the most important features.
- Disadvantages: it is not always necessary, and useful information may be lost.
- Applicable data type: numerical data.
3. The idea behind PCA
As its name suggests, PCA finds the main aspects of the data and uses them to replace the original data. Concretely, suppose our dataset has m samples, each n-dimensional. We want to reduce these m samples from n dimensions to n' dimensions, and we want the m n'-dimensional samples to represent the original dataset as well as possible. Going from n dimensions down to n' dimensions inevitably loses information, but we want the loss to be as small as possible. So how can the n'-dimensional data represent the original data as faithfully as possible?
Consider the simplest case first: n = 2, n' = 1, i.e. reducing the data from two dimensions to one. We want to find one direction that can represent the two-dimensional data. Two candidate vector directions are drawn, u1 and u2; which one represents the original dataset better? Intuitively, u1 is better than u2.
Why is u1 better than u2? There are two explanations: first, the sample points are close enough to this line; second, the projections of the sample points onto this line are spread as far apart as possible.
If we extend n' from 1 to an arbitrary dimension, the criterion for dimensionality reduction becomes: the sample points are close enough to the hyperplane, or equivalently, the projections of the sample points onto the hyperplane are spread as far apart as possible (a small numerical illustration follows).
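To illustrate the "projections as spread out as possible" criterion numerically, the sketch below builds made-up 2-D points stretched along one direction and compares the variance of their projections onto two unit vectors playing the roles of u1 and u2. The data and directions here are illustrative, not taken from the original figure.

import numpy as np

rng = np.random.default_rng(0)
# 2-D points stretched along the 45-degree line: a strong shared component
# plus a little independent noise in each coordinate
t = rng.normal(0.0, 3.0, size=500)              # shared component
noise = rng.normal(0.0, 0.5, size=(500, 2))     # small per-axis noise
X = np.column_stack([t, t]) + noise
X -= X.mean(axis=0)

u1 = np.array([1.0, 1.0]) / np.sqrt(2)   # along the stretched direction
u2 = np.array([1.0, -1.0]) / np.sqrt(2)  # perpendicular direction

# Variance of the 1-D projections onto each candidate direction
print("variance along u1:", np.var(X @ u1))   # large
print("variance along u2:", np.var(X @ u2))   # small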
4. sklearn API
- sklearn.decomposition.PCA(n_components=None)
  - Projects the data into a lower-dimensional space
  - n_components:
    - float: the fraction of variance (information) to retain
    - integer: the number of components to reduce to
- PCA.fit_transform(X), where X is numpy-array data of shape [n_samples, n_features]
  - Return value: the transformed array with the specified number of components (a usage sketch follows this list)
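A short usage sketch of this API, showing both ways of setting n_components; the random data and sizes here are placeholders for illustration.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 10)          # 200 samples, 10 features

# Integer: keep exactly 3 components
pca_int = PCA(n_components=3)
print(pca_int.fit_transform(X).shape)             # (200, 3)

# Float: keep as many components as needed to retain 90% of the variance
pca_frac = PCA(n_components=0.9)
X_reduced = pca_frac.fit_transform(X)
print(X_reduced.shape)                            # (200, k), k chosen automatically
print(pca_frac.explained_variance_ratio_.sum())   # >= 0.9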
5. Case study: exploring users' preferences for item categories (segmentation and dimensionality reduction)
P.S.: the data comes from a Kaggle competition.
The data files are as follows:
order_products__prior.csv: order and product information
- Fields: order_id, product_id, add_to_cart_order, reordered
products.csv: product information
- Fields: product_id, product_name, aisle_id, department_id
orders.csv: users' order information
- Fields: order_id, user_id, eval_set, order_number, …
aisles.csv: the specific item category (aisle) each product belongs to
- Fields: aisle_id, aisle


import pandas as pd
from sklearn.decomposition import PCA

# Read the four tables
prior = pd.read_csv("./data/instacart/order_products__prior.csv")
products = pd.read_csv("./data/instacart/products.csv")
orders = pd.read_csv("./data/instacart/orders.csv")
aisles = pd.read_csv("./data/instacart/aisles.csv")

# Merge the four tables on their shared keys
mt = pd.merge(prior, products, on='product_id')
mt1 = pd.merge(mt, orders, on='order_id')
mt2 = pd.merge(mt1, aisles, on='aisle_id')

# pd.crosstab builds a user x aisle frequency table (how many times each user bought from each aisle)
cross = pd.crosstab(mt2['user_id'], mt2['aisle'])

# PCA: keep enough components to retain 95% of the variance
pc = PCA(n_components=0.95)
data = pc.fit_transform(cross)
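As an optional sanity check (my addition, not part of the original post), you can compare the shapes before and after the reduction and confirm how much variance the kept components explain:

# cross has one column per aisle; after PCA with n_components=0.95 only the
# components needed to keep 95% of the variance remain.
print(cross.shape)
print(data.shape)
print(pc.explained_variance_ratio_.sum())   # should be at least 0.95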
6. PCA algorithm summary
To summarize PCA: as an unsupervised dimensionality-reduction method, it only requires an eigenvalue decomposition to compress and denoise the data, so it is widely used in practice. To overcome some of PCA's shortcomings, many variants have been proposed, such as KPCA for nonlinear dimensionality reduction, Incremental PCA for working within memory limits, and Sparse PCA for dimensionality reduction on sparse data (a sketch of the corresponding scikit-learn classes follows the lists below).
Advantages of PCA:
1) Information content is measured only by variance, so the result is not affected by factors outside the dataset.
2) The principal components are orthogonal, which eliminates interactions between the components of the original data.
3) The computation is simple; the main operation is an eigenvalue decomposition, which is easy to implement.
Disadvantages of PCA:
1) The meaning of each principal-component dimension is somewhat ambiguous and less interpretable than the original features.
2) Discarded components with small variance may still contain important information about differences between samples, so dropping them during dimensionality reduction may affect subsequent processing.
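For reference, the variants mentioned above have scikit-learn counterparts; the sketch below only shows how they are constructed, with placeholder parameter values rather than recommended settings.

from sklearn.decomposition import KernelPCA, IncrementalPCA, SparsePCA

# KPCA: kernel PCA for nonlinear dimensionality reduction
kpca = KernelPCA(n_components=2, kernel='rbf')

# Incremental PCA: fits in mini-batches to work around memory limits
ipca = IncrementalPCA(n_components=2, batch_size=100)

# Sparse PCA: components with sparse loadings, suited to sparse data
spca = SparsePCA(n_components=2)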