Principal component analysis with Python


Uses of PCA:

  • It is used to find relationships between variables in the data.
  • It is used to interpret and visualize data.
  • It reduces the number of variables, which simplifies further analysis.
  • It is often used to visualize genetic distance and the relationship between populations.

PCA is mainly performed on a square symmetric matrix. This can be a pure sums-of-squares-and-cross-products (SSCP) matrix, a covariance matrix, or a correlation matrix. The correlation matrix is used when the variances of the individual variables differ greatly.
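To make the distinction concrete, here is a small NumPy sketch (the data array is purely illustrative and not part of the article): with variables on very different scales, the covariance matrix is dominated by the large-scale variable, while the correlation matrix is scale-free.

import numpy as np

# hypothetical data: 5 observations of 3 variables with very different scales
data = np.array([[1.0, 200.0, 0.01],
                 [2.0, 180.0, 0.05],
                 [3.0, 240.0, 0.02],
                 [4.0, 210.0, 0.08],
                 [5.0, 260.0, 0.03]])

cov = np.cov(data, rowvar=False)        # covariance matrix (scale-dependent)
corr = np.corrcoef(data, rowvar=False)  # correlation matrix (scale-free)
print(cov)
print(corr)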

Objectives of PCA:

  • It is basically a non-dependent procedure that reduces the attribute space from a large number of variables to a smaller number of factors.
  • PCA is essentially a dimensionality-reduction process, but there is no guarantee that the resulting dimensions are interpretable.
  • The main task in PCA is to select a subset of variables from a larger set, based on how strongly the original variables correlate with the principal components.

Principal Axis Method: PCA looks for a linear combination of the variables that extracts the maximum variance from them. Once this first component has been extracted, PCA removes that variance and looks for a second linear combination that explains the maximum proportion of the remaining variance, and so on, which results in orthogonal (uncorrelated) factors. In this method, we analyze the total variance.
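As a minimal sketch of this idea (assuming a standardized data matrix X_std, a hypothetical name not used in the listings below), one can eigendecompose the covariance matrix, sort the directions by the variance they explain, and project the data onto the leading ones:

import numpy as np

def pca_via_eigendecomposition(X_std, n_components=2):
    cov = np.cov(X_std, rowvar=False)        # covariance of the standardized data
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenpairs of the symmetric matrix
    order = np.argsort(eigvals)[::-1]        # sort by variance explained, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]   # each column: a linear combination of variables
    return X_std @ components, eigvals

# usage (hypothetical data): scores, eigvals = pca_via_eigendecomposition(X_std)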

Eigenvector: a nonzero vector that remains parallel to itself after matrix multiplication. A vector x of dimension r is an eigenvector of an r × r matrix M if Mx is parallel to x. To find the eigenvectors and eigenvalues we solve Mx = λx, where both x and the scalar λ are unknown.
In terms of eigenvectors, the principal components reflect both the common and unique variance of the variables. PCA is a variance-focused approach that aims to reproduce both the total variance and the correlations using all components. The principal components are linear combinations of the input variables, weighted by their contribution to explaining the variance in a particular orthogonal dimension.
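The definition Mx = λx can be checked numerically with NumPy's np.linalg.eig; the 2 × 2 matrix below is only an illustration:

import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(M)   # columns of eigvecs are the eigenvectors

for lam, x in zip(eigvals, eigvecs.T):
    print(np.allclose(M @ x, lam * x))   # True: M x stays parallel to x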

Eigenvalues: also known as characteristic roots. An eigenvalue measures the amount of variance across all variables that is accounted for by a factor. The eigenvalue ratio expresses the explanatory importance of a factor with respect to the variables: if it is low, the factor contributes little to explaining the variables. In short, an eigenvalue measures how much of the total variance in the data set a factor accounts for. The eigenvalue of a factor can be computed as the sum of its squared factor loadings over all variables.
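The last statement can be checked numerically. Using the eigvals and eigvecs from the small example above, and defining loadings as eigenvectors scaled by the square roots of their eigenvalues (a common convention, assumed here), the sum of squared loadings of a factor over all variables recovers its eigenvalue, and dividing each eigenvalue by their total gives the proportion of variance it explains:

# loadings: column j holds the loadings of factor j on all variables
loadings = eigvecs * np.sqrt(eigvals)
print(np.allclose((loadings ** 2).sum(axis=0), eigvals))   # True
print(eigvals / eigvals.sum())   # explained-variance proportion per factor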

Now, let's look at principal component analysis with Python.

The dataset used in the implementation (wines.csv) contains 13 feature columns and a class label in the last column.

Step 1: Importing Libraries

# import required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step 2: Import the dataset

Import the dataset and split it into the X (features) and y (target) components for data analysis.

# import or load the dataset
dataset = pd.read_csv('wines.csv')

# distribute the dataset into the two components X and y
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

Step 3: Splitting the dataset into training and test sets

# split X and y into
# training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Step 4: Feature scaling

Perform preprocessing on the training and test sets, in this case by fitting a standard scaler.

# perform preprocessing on the training and test sets
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Step 5: Applying the PCA function

Apply the PCA function to the training and test sets for analysis.

# applying the PCA function to the X component
# of the training and test sets
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

explained_variance = pca.explained_variance_ratio_
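As a quick check, you can print the explained variance ratio to see what fraction of the total variance each of the two retained components captures; the exact numbers depend on the wines.csv data.

# fraction of the total variance explained by each of the two components
print(explained_variance)
print(explained_variance.sum())   # total variance retained by the two components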

Step 6: Fitting logistic regression to the training set

# fitting logistic regression to the training set
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

Step 7: Predicting the test set results

# predict the test set results using the
# predict method of LogisticRegression
y_pred = classifier.predict(X_test)

Step 8: Create a confusion matrix

# creating a confusion matrix between the
# test set y and the predicted values
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
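To get a quick feel for how well the classifier performs in the two-dimensional PCA space, you can print the confusion matrix and an overall accuracy score (accuracy_score is an additional scikit-learn helper, not part of the original listing):

# print the confusion matrix and overall accuracy on the test set
from sklearn.metrics import accuracy_score

print(cm)
print(accuracy_score(y_test, y_pred))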

Step 9: Visualizing the training set results

# visualizing the training set results
# with a scatter plot of the decision regions
from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train

X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1, step=0.01))

plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
             X2.ravel()]).T).reshape(X1.shape), alpha=0.75,
             cmap=ListedColormap(('yellow', 'white', 'aquamarine')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)

plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')  # x-axis label
plt.ylabel('PC2')  # y-axis label
plt.legend()       # show legend

# show scatter plot
plt.show()

Step 10: Visualizing the test set results

# visualizing the test set results
# with a scatter plot of the decision regions
from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test

X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1, step=0.01))

plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
             X2.ravel()]).T).reshape(X1.shape), alpha=0.75,
             cmap=ListedColormap(('yellow', 'white', 'aquamarine')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)

# title for the plot
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')  # x-axis label
plt.ylabel('PC2')  # y-axis label
plt.legend()

# show scatter plot
plt.show()




