1. PCA Transformation & Visualization
1.1. To prepare our 30 numeric features for PCA, we first standardize them so that each feature has mean 0 and unit variance
1.1.1. from sklearn.preprocessing import StandardScaler
1.1.2. scaler = StandardScaler()
1.1.3. scaler.fit(df)
1.1.4. scaled_data = scaler.transform(df)
1.1.5. At this point we can observe that scaled_data includes 569 sample records and all 30 features
1.1.5.1. scaled_data.shape
1.1.5.1.1. (569, 30)
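As a quick sanity check, the scaling step above can be reproduced end-to-end. This is a minimal sketch assuming the same sklearn breast-cancer dataset used throughout these notes; it verifies that standardization leaves the shape unchanged and gives each feature mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Load the same 30-feature dataset used throughout these notes
cancer = load_breast_cancer()

scaler = StandardScaler()
scaled_data = scaler.fit_transform(cancer['data'])  # fit + transform in one call

# Shape is unchanged; only the scale of each feature changes
print(scaled_data.shape)                           # (569, 30)
print(np.allclose(scaled_data.mean(axis=0), 0))    # True: each feature centred
print(np.allclose(scaled_data.std(axis=0), 1))     # True: unit variance
```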
1.2. Next we use PCA to reduce our scaled (standardized) dataset to just 2 principal components
1.2.1. from sklearn.decomposition import PCA
1.2.2. pca = PCA(n_components=2)
1.2.3. pca.fit(scaled_data)
1.2.4. x_pca = pca.transform(scaled_data)
1.2.5. At this point we can observe that x_pca includes 569 sample records and just 2 features
1.2.5.1. x_pca.shape
1.2.5.1.1. (569, 2)
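It is also worth checking how much of the total variance those 2 components retain; `explained_variance_ratio_` is the standard PCA attribute for this. A minimal sketch on the same dataset (the exact percentages may vary slightly between sklearn versions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaled_data = StandardScaler().fit_transform(cancer['data'])

pca = PCA(n_components=2)
x_pca = pca.fit_transform(scaled_data)

# Fraction of total variance captured by each component, largest first
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

On this dataset the first two components together typically retain roughly 60-65% of the variance, which is why the 2D scatterplot still separates the classes so well.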
1.3. Now that we've used PCA to reduce 30 dimensions to just 2 dimensions, we can visualize the data on a scatterplot
1.3.1. plt.figure(figsize=(8,6))
1.3.2. plt.scatter(x_pca[:,0], x_pca[:,1], c=cancer['target'], cmap='plasma')
1.3.3. plt.xlabel('First principal component')
1.3.4. plt.ylabel('Second principal component')
1.3.4.1. Here's the plot
1.3.4.2. Note that PCA itself never uses the labels; we can colour the points by class here only because this dataset happens to be labelled. Doing so shows how effective PCA can be: the 2 principal components derived from the 30 original dimensions still separate malignant from benign tumors
2. Interpreting components
2.1. This is considered the most challenging aspect of PCA
2.1.1. The problem is that when looking at a visualization of the first principal component vs the second principal component (see linked scatterplot for example), the components do not correspond one-to-one to the original features
2.1.1.1. Each component is in fact a linear combination of all 30 original features
2.2. The components are stored as an attribute of the PCA object
2.2.1. pca.components_
2.2.1.1. Returns a NumPy array of shape (n_components, n_features), here (2, 30), where each row holds one component's loadings on the original features
2.2.1.1.1. Here's the result
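Since the actual array output isn't reproduced here, a short self-contained sketch (assuming the same 2-component PCA fit) shows the array's orientation: one row per component, one column per original feature:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaled_data = StandardScaler().fit_transform(cancer['data'])
pca = PCA(n_components=2).fit(scaled_data)

# One row per component, one column per original feature
print(pca.components_.shape)   # (2, 30)
```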
2.3. We can visualize the relationship between each of the PCA components and the features by using a heat map
2.3.1. df_comp = pd.DataFrame(pca.components_,columns=cancer['feature_names'])
2.3.2. plt.figure(figsize=(12,6))
2.3.3. sns.heatmap(df_comp,cmap='plasma')
2.3.4. Here's the heat map
2.3.4.1. We can see all 30 features on the x-axis
2.3.4.2. The cells with the most extreme colours mark the features with the largest absolute loadings, i.e. the features that contribute most to each component; where the two rows contrast strongly, those features are doing the most work in separating one class from the other
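The same reading of the heat map can be done numerically by ranking the features by the magnitude of their loadings in each row of df_comp. A self-contained sketch that rebuilds df_comp and prints the strongest features per component:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaled_data = StandardScaler().fit_transform(cancer['data'])
pca = PCA(n_components=2).fit(scaled_data)

df_comp = pd.DataFrame(pca.components_, columns=cancer['feature_names'])

# For each component, rank features by the magnitude (absolute value)
# of their loading; sign only indicates direction along the component
for i in range(2):
    top = df_comp.iloc[i].abs().sort_values(ascending=False).head(5)
    print(f"Component {i} strongest features:\n{top}\n")
```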
3. Next steps
3.1. The idea is to take your x_pca array (i.e. the data after being transformed onto the reduced, rotated axes) and feed it into a predictive classification algorithm such as logistic regression or an SVM
3.1.1. Effectively, we have used PCA to compress the 30 original features into 2, and those 2 components, together with the original malignant/benign labels, can be fed into supervised learning models
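A minimal sketch of that hand-off, using logistic regression as the downstream classifier (the split proportions and random_state are illustrative choices, not from the lecture):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaled_data = StandardScaler().fit_transform(cancer['data'])
x_pca = PCA(n_components=2).fit_transform(scaled_data)

# Train a classifier on the 2 components, using the original labels as targets
X_train, X_test, y_train, y_test = train_test_split(
    x_pca, cancer['target'], test_size=0.3, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # typically > 0.9 on this dataset
```

Note that for a rigorous evaluation the scaler and PCA should be fit on the training split only and then applied to the test split (e.g. via an sklearn Pipeline) to avoid leaking test information; fitting them on the full dataset is kept here only to mirror the lecture's flow.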
4. Theory
4.1. PCA is an unsupervised statistical technique used to examine the interrelations among a set of variables in order to identify the underlying structure of those variables
4.1.1. The technique is usually used for exploratory analysis of data rather than as a fully deployable model (i.e. it is not itself used to predict new data)
4.2. The idea is to calculate "best fit" lines through the data, where the number of lines depends on the number of variables (or features)
4.2.1. Each line is orthogonal to the others, i.e. at right angles to them in n-dimensional space
4.2.1.1. If we have 4 variables, the orthogonal lines live in 4-dimensional space, which is not possible to draw in a diagram
4.2.2. The idea is that each orthogonal line explains a decreasing amount of variance
4.2.2.1. Here's a visualization of PCA
4.2.2.1.1. The first PCA line explains 70% of the variance and the second explains a further 28% (leaving 2% unexplained)
4.2.2.1.2. For 3 variables, we must imagine a third orthogonal line cutting through 3-dimensional space
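The orthogonality claim above can be verified numerically: the component vectors sklearn returns are orthonormal, so their Gram matrix is the identity. A small sketch on the same dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaled_data = StandardScaler().fit_transform(cancer['data'])
C = PCA(n_components=2).fit(scaled_data).components_

# Each component is a unit vector, and the two are at right angles:
# C @ C.T should be the 2x2 identity matrix
print(np.allclose(C @ C.T, np.eye(2)))  # True
```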
4.3. PCA is used to transform data
4.3.1. Here is a visualization of PCA transformation process
4.3.1.1. provided by Udemy course lecture
5. Data profiling
5.1. In the Udemy course, we used sklearn to bring in some high-dimensional data (a breast cancer dataset with 30 features)
5.1.1. The features of this dataset are all numeric and relate to tumors
5.1.2. The target boils down to just two categories that matter in relation to the tumors: malignant vs benign
5.1.3. The idea with PCA is to reduce this dataset to a small number of derived features that explain the majority of the variance (i.e. reduce 30 dimensions to something we can visualize in 2D or 3D)
5.1.3.1. In this case, we want to reduce it to just two "principal component" features that we can visualize in a scatterplot
5.1.4. First, we import our libraries for the PCA
5.1.4.1. import matplotlib.pyplot as plt
5.1.4.2. import pandas as pd
5.1.4.3. import numpy as np
5.1.4.4. import seaborn as sns
5.1.4.5. %matplotlib inline
5.1.5. Next, we bring in the highly dimensional cancer dataset
5.1.5.1. from sklearn.datasets import load_breast_cancer
5.1.5.2. cancer = load_breast_cancer()
5.1.5.2.1. The data type returned is sklearn.utils.Bunch, which is similar to a dictionary
5.1.5.3. cancer.keys()
5.1.5.3.1. Reveals different components of the dataset
5.1.5.4. print(cancer['DESCR'])
5.1.5.4.1. Returns a formatted, textual description of the dataset
5.1.5.5. df = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
5.1.5.5.1. Grabs the features from the 'data' key and the column names from the 'feature_names' key into a Pandas dataframe
5.1.5.6. df.head()
5.1.5.6.1. Preview the data
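A quick profile of the resulting frame confirms the shape and all-numeric dtypes described above; a short self-contained sketch:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])

# 569 tumor records, 30 numeric measurements; the labels live separately
# under cancer['target'] and are not part of the feature frame
print(df.shape)  # (569, 30)
print(df.dtypes.unique())
```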
5.2. See Python and logistic regression map for more info on data profiling in general
6. Exploratory Data Analysis (EDA)
6.1. In the Udemy course lecture there was no real EDA, but visualization is covered above as part of the PCA Transformation & Visualization process
6.2. See Python and logistic regression map for more info on EDA in general
7. Data cleaning
7.1. In the Udemy course lecture, we didn't do any data cleaning prior to the PCA (PCA itself transforms the data in readiness for downstream modelling, such as the classification step described under Next steps)
7.1.1. See Python and logistic regression map for more info on data cleaning in general