Implementing Agglomerative Clustering Using Sklearn

Assumption: Agglomerative clustering works bottom-up. Each data point starts in its own cluster, and the most similar clusters are merged step by step until only the requested number of clusters (ultimately a single cluster) remains.
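As a minimal illustration of this merging behavior (on made-up toy points, not the tutorial's dataset), fitting `AgglomerativeClustering` on two obvious groups assigns each group its own cluster:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two obvious groups of toy points: one near (1, 1), one near (8, 8)
X_toy = np.array([[1.0, 1.0], [1.5, 1.0], [8.0, 8.0], [8.0, 8.5]])

# Bottom-up merging stops once the requested 2 clusters remain
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X_toy)
# Points within each group end up sharing a label
```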

Step 1: Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as shc

Step 2: Load and clean the data

# Change the working directory to the file location
cd C:\Users\Dev\Desktop\Kaggle\Credit_Card

X = pd.read_csv('CC_GENERAL.csv')

# Remove the CUST_ID column from the data
X = X.drop('CUST_ID', axis=1)

# Handle missing values by forward-filling
X = X.ffill()

Step 3: Data preprocessing

# Scale the data so that all features are comparable
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Rescale each row (sample) to unit L2 norm
X_normalized = normalize(X_scaled)

# Convert the numpy array back to a pandas DataFrame
X_normalized = pd.DataFrame(X_normalized)
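To see what this preprocessing actually does, here is a small sketch on made-up numbers (not the credit-card data): `StandardScaler` gives each column zero mean and unit variance, and `normalize` then rescales each row to unit L2 norm:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Toy data with features on very different scales
X_demo = np.array([[1.0, 200.0], [2.0, 500.0], [4.0, 300.0]])

X_demo_scaled = StandardScaler().fit_transform(X_demo)  # columns: mean 0, std 1
X_demo_norm = normalize(X_demo_scaled)                  # rows: unit L2 norm
```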

Step 4: Reduce the data to two dimensions

pca = PCA(n_components=2)
X_principal = pca.fit_transform(X_normalized)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']
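It can be worth checking how much variance the two principal components retain; `explained_variance_ratio_` reports this. A sketch on random stand-in data, since the actual ratio depends on the dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_rand = rng.normal(size=(100, 6))  # stand-in for any preprocessed dataset

pca_demo = PCA(n_components=2)
X_2d = pca_demo.fit_transform(X_rand)  # shape (100, 2)

# Fraction of total variance kept by the two components
kept = pca_demo.explained_variance_ratio_.sum()
```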

A dendrogram shows the order and distance at which clusters are merged, and is used to decide how many clusters the data should be split into.

Step 5: Visualize the dendrogram

plt.figure(figsize=(8, 8))
plt.title('Visualizing the data')
dendrogram = shc.dendrogram(shc.linkage(X_principal, method='ward'))
plt.show()

To read the optimal number of clusters from the dendrogram, find the largest vertical distance that no horizontal (merge) line crosses, and draw a horizontal cut through that gap; the number of vertical lines the cut intersects is the suggested number of clusters.

The dendrogram above suggests that the optimal number of clusters for this data is 2.
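The same cut can also be found programmatically: the third column of the linkage matrix stores merge distances in increasing order, and the largest jump between successive merge distances marks where to cut. A sketch on synthetic two-blob data (not the credit-card dataset):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

# Two well-separated synthetic blobs
X_blobs, _ = make_blobs(n_samples=100, centers=[[0, 0], [10, 10]],
                        cluster_std=1.0, random_state=42)

Z = linkage(X_blobs, method='ward')
merge_dists = Z[:, 2]           # merge distances, in increasing order
gaps = np.diff(merge_dists)     # jump between successive merges

# Cutting at the largest jump leaves this many clusters
n_clusters = len(merge_dists) - int(np.argmax(gaps))
```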

Step 6: Build and visualize clustering models for different values of k

a) k = 2

ac2 = AgglomerativeClustering(n_clusters=2)

# Visualize the clustering
plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac2.fit_predict(X_principal), cmap='rainbow')
plt.show()

b) k = 3

ac3 = AgglomerativeClustering(n_clusters=3)

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac3.fit_predict(X_principal), cmap='rainbow')
plt.show()

c) k = 4

ac4 = AgglomerativeClustering(n_clusters=4)

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac4.fit_predict(X_principal), cmap='rainbow')
plt.show()

d) k = 5

ac5 = AgglomerativeClustering(n_clusters=5)

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac5.fit_predict(X_principal), cmap='rainbow')
plt.show()

e) k = 6 

ac6 = AgglomerativeClustering(n_clusters=6)

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac6.fit_predict(X_principal), cmap='rainbow')
plt.show()

Now let's determine the optimal number of clusters with a quantitative measure. Here we use the silhouette score.

Step 7: Evaluate different models and visualize the results.

k = [2, 3, 4, 5, 6]

# Collect the silhouette score of each model
silhouette_scores = []
for ac in [ac2, ac3, ac4, ac5, ac6]:
    silhouette_scores.append(
        silhouette_score(X_principal, ac.fit_predict(X_principal)))

# Build a bar chart to compare the results
plt.bar(k, silhouette_scores)
plt.xlabel('Number of clusters', fontsize=20)
plt.ylabel('S(i)', fontsize=20)
plt.show()

Thus, based on the silhouette scores, we conclude that the optimal number of clusters for this data and clustering technique is 2.
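Instead of reading the bar chart by eye, the best k can be picked programmatically with an argmax over the silhouette scores. A self-contained sketch on synthetic two-blob data (the variable names here are illustrative, not from the tutorial):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with two well-separated groups
X_demo, _ = make_blobs(n_samples=200, centers=[[0, 0], [10, 10]],
                       cluster_std=1.0, random_state=0)

scores = {}
for n in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=n).fit_predict(X_demo)
    scores[n] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
```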
