Guide: plotting PCA explained variance in Python


In this post, you will learn about the concept of explained variance, one of the key concepts related to principal component analysis (PCA). Explained variance is illustrated with Python code examples. For the underlying concepts of eigenvalues and eigenvectors, see the post Why & when to use Eigenvalue and Eigenvectors.

What is explained variance?
Explained variance using Python code
  - Sklearn PCA class for determining explained variance
  - Custom Python code (without using sklearn PCA) for determining explained variance
References
Conclusions

What is explained variance?

Explained variance is a statistical measure of how much of the variation in a dataset can be attributed to each of the principal components (eigenvectors) generated by principal component analysis (PCA). In other words, it tells us how much of the total variance is "explained" by each component. This is important because it allows us to rank the components in order of importance and to focus on the most important ones when interpreting the results of our analysis. The term is also used for predictive models: for example, if you build a machine learning model to predict housing prices, the explained variance tells us how much of the variation in housing prices is accounted for by the model, and a higher value means the model is doing a better job of predicting housing prices.

In general, the larger the variance explained by a principal component, the more important that component is. PCA is a technique used to reduce the dimensionality of data. It does this by finding the directions of maximum variance in the data and projecting the data onto those directions. The amount of variance explained by each direction is called the "explained variance." Explained variance can be used to choose the number of dimensions to keep in a reduced dataset. It can also be used to assess the quality of a machine learning model: in general, a model with high explained variance will have good predictive power, while a model with low explained variance may not be as accurate.
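As a minimal sketch of choosing the number of dimensions from explained variance, scikit-learn's PCA accepts a float `n_components` between 0 and 1 and keeps the fewest components whose cumulative explained variance exceeds that fraction. The synthetic data and the 0.95 target below are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 200 samples, 6 features
# driven by 2 hidden factors plus a little noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = factors @ mixing + 0.05 * rng.normal(size=(200, 6))

X_std = StandardScaler().fit_transform(X)

# A float n_components in (0, 1) keeps the fewest components whose
# cumulative explained variance exceeds that fraction
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_std)

print("components kept:", pca.n_components_)
print("variance retained:", pca.explained_variance_ratio_.sum())
```

The number of components kept is then available as `pca.n_components_`, so it does not have to be picked by hand.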

Let's understand explained variance with an example. Suppose we have a dataset with 100 samples and 10 features and want to reduce it to two dimensions using PCA. How the variance splits between the components depends on the data and cannot be read off the number of features, but the first component always explains at least as much as the second; we might find, for instance, that the first component explains about 86% of the variance and the second most of the remainder. Explained variance can also be used to compare different PCA models. For example, if two models both reduce a dataset from 10 dimensions to 2, but one explains 80% of the variance and the other explains 95%, we would say that the latter is better at representing the data.
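The following sketch illustrates that the variance retained by two components depends on the data. Both datasets are synthetic assumptions: one has 10 independent features, the other has 10 features driven by 2 hidden factors. The helper `variance_kept` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Dataset A (illustrative assumption): 10 independent features, no shared structure
X_noise = rng.normal(size=(500, 10))

# Dataset B (illustrative assumption): 10 features driven by 2 hidden factors
factors = rng.normal(size=(500, 2))
X_structured = factors @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))


def variance_kept(X, n_components=2):
    """Fraction of the total variance retained by the first n_components."""
    pca = PCA(n_components=n_components).fit(X)
    return pca.explained_variance_ratio_.sum()


kept_noise = variance_kept(X_noise)
kept_structured = variance_kept(X_structured)
print(f"independent features: {kept_noise:.2f}")
print(f"two hidden factors:   {kept_structured:.2f}")
```

Two components capture nearly all the variance of the structured dataset but only a small share of the independent one, so the same 10-to-2 reduction can be a good or a poor representation depending on the data.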

Explained variance can be calculated as a function of the ratio of the related eigenvalue to the sum of the eigenvalues of all eigenvectors. If there are N eigenvectors, the explained variance for the i-th eigenvector (principal component) is the ratio of its eigenvalue \(\lambda_i\) to the sum of all eigenvalues \((\lambda_1 + \lambda_2 + \dots + \lambda_N)\):

\(\text{Explained variance}_i = \frac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_N}\)
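As a small worked example of this ratio, the sketch below divides each eigenvalue by the sum of all eigenvalues. The four eigenvalues are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, chosen only for illustration
eigenvalues = np.array([4.0, 3.0, 2.0, 1.0])

# Explained variance ratio: each eigenvalue divided by the sum of all eigenvalues
explained_variance_ratio = eigenvalues / eigenvalues.sum()

print(explained_variance_ratio)  # [0.4 0.3 0.2 0.1]
```

Here the first component accounts for 4 / 10 = 40% of the total variance, and the ratios add up to 1.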

Explained variance using Python code

Sklearn PCA class for determining explained variance

This section uses the PCA class of sklearn.decomposition, whose result corresponds to an eigendecomposition of the covariance matrix computed from X_train_std in the example given below. Here is a snapshot of the data after being cleaned up.

Fig. Data used for analyzing explained variance

Note the following in the Python code given below:

The explained_variance_ratio_ attribute of PCA is used to get the ratio of variance (eigenvalue / total eigenvalues).
A bar chart is used to represent the individual explained variances.
A step plot is used to represent the cumulative variance explained by the principal components.
The data needs to be scaled before applying the PCA technique.
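To illustrate why scaling matters, the sketch below fits PCA on the same data with and without StandardScaler. The data is a synthetic assumption: three independent features, one of them measured on a much larger scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): three independent features,
# the first measured on a much larger scale than the other two
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3)) * np.array([1000.0, 1.0, 1.0])

ratio_raw = PCA().fit(X).explained_variance_ratio_
ratio_scaled = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print("without scaling:", np.round(ratio_raw, 3))
print("with scaling:   ", np.round(ratio_scaled, 3))
```

Without scaling, the large-scale feature dominates the first component purely because of its units; after scaling, the variance is spread roughly evenly, which reflects that the features are in fact independent.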

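#
# Imports (X_train and X_test are assumed to be the already-split feature matrices)
#
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
#
# Scale the dataset; This is very important before you apply PCA
#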
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Instantiate PCA
#
pca = PCA()
#
# Determine transformed features
#
X_train_pca = pca.fit_transform(X_train_std)
#
# Determine explained variance using explained_variance_ratio_ attribute
#
exp_var_pca = pca.explained_variance_ratio_
#
# Cumulative sum of explained variance ratios; This will be used to create step plot
# for visualizing the variance explained by each principal component.
#
cum_sum_eigenvalues = np.cumsum(exp_var_pca)
#
# Create the visualization plot
#
plt.bar(range(0,len(exp_var_pca)), exp_var_pca, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(cum_sum_eigenvalues)), cum_sum_eigenvalues, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

The Python code given above results in the following plot.

Fig 2. Explained variance using sklearn PCA

Custom Python code (without using sklearn PCA) for determining explained variance

In this section, you will learn how to determine explained variance without using sklearn PCA. Note the following in the code given below:

The training data is scaled.
The eigh function of the numpy.linalg module is used.
The covariance matrix of the training dataset is created.
The eigenvalues and eigenvectors of the covariance matrix are determined.
The explained variance is calculated.
A plot is created to visualize the explained variance.

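#
# Imports (X_train and X_test are assumed to be the already-split feature matrices)
#
import numpy as np
from sklearn.preprocessing import StandardScaler
#
# Scale the dataset; This is very important before you apply PCA
#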
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Import eigh function for calculating eigenvalues and eigenvectors
#
from numpy.linalg import eigh
#
# Determine covariance matrix
#
cov_matrix = np.cov(X_train_std, rowvar=False)
#
# Determine eigenvalues and eigenvectors
#
egnvalues, egnvectors = eigh(cov_matrix)
#
# Determine explained variance
#
total_egnvalues = sum(egnvalues)
var_exp = [(i/total_egnvalues) for i in sorted(egnvalues, reverse=True)]
#
# Plot the explained variance against cumulative explained variance
#
import matplotlib.pyplot as plt
cum_sum_exp = np.cumsum(var_exp)
plt.bar(range(0,len(var_exp)), var_exp, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(cum_sum_exp)), cum_sum_exp, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

The code above produces a bar-and-step plot of the individual and cumulative explained variance.
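As a quick consistency check, the sketch below compares the custom eigenvalue-based ratios with the explained_variance_ratio_ reported by sklearn PCA. The post's dataset is not reproduced here, so a synthetic stand-in for X_train is assumed.

```python
import numpy as np
from numpy.linalg import eigh
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train (an assumption; the post's dataset is not reproduced here)
rng = np.random.default_rng(7)
X_train = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))
X_train_std = StandardScaler().fit_transform(X_train)

# Custom route: eigenvalues of the covariance matrix, largest first
egnvalues, _ = eigh(np.cov(X_train_std, rowvar=False))
var_exp = np.sort(egnvalues)[::-1] / egnvalues.sum()

# sklearn route: ratios reported by the fitted PCA instance
exp_var_pca = PCA().fit(X_train_std).explained_variance_ratio_

print("custom :", np.round(var_exp, 4))
print("sklearn:", np.round(exp_var_pca, 4))
print("match  :", np.allclose(var_exp, exp_var_pca))
```

Note that eigh returns eigenvalues in ascending order, which is why they are sorted in descending order before being compared with sklearn's output.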

References

Eigenvectors & eigenvalues with Python examples
Why & when to use Eigenvalue and Eigenvectors

Conclusions

Here are the conclusions / learning from this post:

Explained variance represents the information explained by the different principal components (eigenvectors).
Explained variance is calculated as the ratio of the eigenvalue of a particular principal component (eigenvector) to the sum of all eigenvalues.
Explained variance can be obtained from the explained_variance_ratio_ attribute of a PCA instance created using the sklearn.decomposition PCA class.

Author

Check out my latest book, titled First Principles Thinking: Building winning products using first principles thinking.

Ajitesh Kumar

I have recently been working in the area of data analytics, including data science and machine learning / deep learning. I am also passionate about different technologies, including programming languages such as Java/JEE, JavaScript, Python, R and Julia. For the latest updates and blogs, follow us on Twitter. I would love to connect with you on LinkedIn.

What does PCA explained variance ratio return?

The explained_variance_ratio_ attribute of a fitted PCA instance returns a vector of the fraction of variance explained by each dimension.

How much variance should PCA explain?

Some criteria say that the retained components together should explain between 70% and 80% of the total variance, which in the case being discussed would mean about four to five components.
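A minimal sketch of applying such a threshold: given explained variance ratios sorted from largest to smallest, count how many components are needed for the cumulative sum to reach the target. The ratios below are hypothetical values and `components_needed` is a helper name introduced here for illustration.

```python
import numpy as np

# Hypothetical explained variance ratios, sorted from largest to smallest
exp_var = np.array([0.35, 0.25, 0.15, 0.10, 0.08, 0.07])
cum_var = np.cumsum(exp_var)


def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # Never report more components than exist
    return min(count, len(cumulative))


print(components_needed(cum_var, 0.70))  # 3
print(components_needed(cum_var, 0.80))  # 4
```

With these ratios the cumulative sums are roughly 0.35, 0.60, 0.75 and 0.85 after the first four components, so three components reach 70% and four reach 80%.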

What is PCA components_ in sklearn?

Principal component analysis (PCA) in sklearn is a linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD. The components_ attribute of a fitted PCA instance holds the principal axes in feature space, one row per component, sorted by decreasing explained variance.
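The sketch below inspects components_ on a fitted PCA instance and reproduces the projection by hand, under the default setting whiten=False. The data is a synthetic assumption used only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 100 samples, 4 features
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4)) + 5.0

pca = PCA(n_components=2).fit(X)

# components_ has one row per principal axis and one column per original feature
print(pca.components_.shape)  # (2, 4)

# The rows are orthonormal, so their Gram matrix is the identity
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(2)))

# The projection centers the data with the fitted mean_ and multiplies by components_
X_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(X_manual, pca.transform(X)))
```

The manual projection subtracts the mean but does not divide by the standard deviation, which is the "centered but not scaled" behaviour described above.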

What is cumulative explained variance in PCA?

The cumulative explained variance shows the accumulation of variance up to each principal component number. The individual explained variance describes the variance of each principal component.