Implementing Agglomerative Clustering Using Sklearn

Assumption: The clustering method assumes that the data points are similar enough to one another that a hierarchy can be built over them: each point starts as its own cluster, and the most similar clusters are merged step by step until everything ends up in a single cluster.
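To make this bottom-up merging concrete, here is a minimal, self-contained sketch on a hypothetical toy array: scipy's linkage function records each merge, starting from singleton clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two tight pairs of 1-D points
toy = np.array([[0.0], [0.1], [5.0], [5.1]])

# Each row of the result records one merge:
# (cluster i, cluster j, merge distance, size of the new cluster)
print(linkage(toy, method='ward'))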

Step 1: Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as shc

Step 2: Load and clean the data

# Change the working directory to the location of the file
cd C:\Users\Dev\Desktop\Kaggle\Credit_Card

X = pd.read_csv('CC_GENERAL.csv')

# Drop the CUST_ID column from the data
X = X.drop('CUST_ID', axis=1)

# Handle missing values by forward-filling
X.fillna(method='ffill', inplace=True)

Step 3: Data preprocessing

# Scale the data so that all features are comparable
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalize the scaled data so that each sample has unit norm
X_normalized = normalize(X_scaled)

# Convert the numpy array back to a pandas DataFrame
X_normalized = pd.DataFrame(X_normalized)

Step 4: Reduce the dimensionality of the data


pca = PCA(n_components=2)
X_principal = pca.fit_transform(X_normalized)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']

A dendrogram visualizes the hierarchy of merges; cutting it at a chosen height divides the data into clusters.

Step 5: Visualize the dendrogram


plt.figure(figsize=(8, 8))
plt.title('Visualizing the data')
Dendrogram = shc.dendrogram(shc.linkage(X_principal, method='ward'))
plt.show()

To determine the optimal number of clusters from the dendrogram, look for the longest vertical line that no horizontal merge line crosses, and draw a horizontal cut through that stretch; the number of vertical lines the cut intersects is the suggested number of clusters.

The dendrogram above suggests that the optimal number of clusters for this data is 2.
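As a cross-check, the same cut can be made programmatically with scipy's fcluster. This is only a sketch: the distance threshold of 6 below is illustrative, not taken from the article; read the actual cut height off your own dendrogram.

# Cut the hierarchy at a fixed distance (the threshold 6 is illustrative)
Z = shc.linkage(X_principal, method='ward')
labels = shc.fcluster(Z, t=6, criterion='distance')
print(np.unique(labels).size, 'clusters at this cut')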

Step 6: Build and visualize clustering models for different values of k

a) k = 2

ac2 = AgglomerativeClustering(n_clusters=2)

# Visualize the clustering
plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac2.fit_predict(X_principal), cmap='rainbow')
plt.show()

b) k = 3

ac3 = AgglomerativeClustering(n_clusters=3)

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac3.fit_predict(X_principal), cmap='rainbow')
plt.show()

c) k = 4

ac4 = AgglomerativeClustering(n_clusters=4)

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac4.fit_predict(X_principal), cmap='rainbow')
plt.show()

d) k = 5

ac5 = AgglomerativeClustering(n_clusters=5)

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac5.fit_predict(X_principal), cmap='rainbow')
plt.show()

e) k = 6 

ac6 = AgglomerativeClustering(n_clusters=6)

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac6.fit_predict(X_principal), cmap='rainbow')
plt.show()

Now let's determine the optimal number of clusters mathematically. Here we will use the silhouette score, s(i), which measures how similar a point is to its own cluster compared with the nearest other cluster; it ranges from -1 to 1, and a higher average score indicates better-separated clusters.

Step 7: Evaluate the models and visualize the results

k = [2, 3, 4, 5, 6]

# Collect the silhouette score of each model in a list
silhouette_scores = []
silhouette_scores.append(
    silhouette_score(X_principal, ac2.fit_predict(X_principal)))
silhouette_scores.append(
    silhouette_score(X_principal, ac3.fit_predict(X_principal)))
silhouette_scores.append(
    silhouette_score(X_principal, ac4.fit_predict(X_principal)))
silhouette_scores.append(
    silhouette_score(X_principal, ac5.fit_predict(X_principal)))
silhouette_scores.append(
    silhouette_score(X_principal, ac6.fit_predict(X_principal)))

# Plot a bar chart to compare the results
plt.bar(k, silhouette_scores)
plt.xlabel('Number of clusters', fontsize=20)
plt.ylabel('S(i)', fontsize=20)
plt.show()
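Since the five score computations above follow an identical pattern, they can also be written compactly as a loop over k; this sketch is equivalent to the step-by-step code above.

# Equivalent loop form: fit and score one model per value of k
silhouette_scores = [
    silhouette_score(X_principal,
                     AgglomerativeClustering(n_clusters=n).fit_predict(X_principal))
    for n in k]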

Thus, based on the silhouette scores, we conclude that the optimal number of clusters for this data and clustering technique is 2.
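With the number of clusters chosen, a final model can be fit and its labels attached back to the data for further analysis. A minimal sketch; the CLUSTER column name here is an arbitrary choice, not part of the original walkthrough:

# Fit the chosen 2-cluster model and store its labels (column name is arbitrary)
final_model = AgglomerativeClustering(n_clusters=2)
X['CLUSTER'] = final_model.fit_predict(X_principal)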




