ML | Spectral Clustering



Spectral clustering is a clustering algorithm that has been growing in popularity because, in many cases, it performs better than traditional clustering algorithms. It treats each data point as a node in a graph, thus turning the clustering problem into a graph-partitioning problem. A typical implementation consists of three main stages:

  1. Building a similarity graph: in this stage, a similarity graph is built in the form of an adjacency matrix, represented by A. It can be built in the following ways (a short code sketch of all three constructions follows this list of methods):
    • Epsilon-neighborhood graph: a parameter epsilon is fixed beforehand, and each point is connected to all points that lie within its epsilon-radius. If all pairwise distances are on a similar scale, the edge weights (i.e., the distances between the two points) are typically not stored, since they provide no additional information. In this case, the constructed graph is undirected and unweighted.
    • K-nearest neighbors graph: a parameter k is fixed beforehand. Then, for two vertices u and v, an edge is directed from u to v only if v is among the k-nearest neighbors of u. Note that this yields a weighted, directed graph, because it is not always the case that when v is among the k-nearest neighbors of u, u is also among the k-nearest neighbors of v. To make this graph undirected, one of the following approaches is used:
      1. Direct an edge from u to v and from v to u if either v is among the k-nearest neighbors of u or u is among the k-nearest neighbors of v.
      2. Direct an edge from u to v and from v to u if v is among the k-nearest neighbors of u and u is among the k-nearest neighbors of v.
    • Fully connected graph: to build this graph, each point is connected to every other point by an undirected edge weighted by the distance between the two points. Since this approach is used to model local neighborhood relationships, the Gaussian similarity metric is typically used to compute the weights.
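To make these constructions concrete, here is a minimal sketch (not part of the original tutorial) of all three graph types. The toy data X and the variable names (A_eps, A_knn, A_or, A_and, A_full) are illustrative assumptions; the helpers kneighbors_graph, radius_neighbors_graph, and rbf_kernel come from sklearn.

import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph
from sklearn.metrics.pairwise import rbf_kernel

# Illustrative sample data: two loose groups of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Epsilon-neighborhood graph: connect points within radius eps
# (undirected and unweighted by construction)
A_eps = radius_neighbors_graph(X, radius=1.0, mode='connectivity')

# K-nearest-neighbors graph: directed in general, so symmetrize it
A_knn = kneighbors_graph(X, n_neighbors=5, mode='connectivity')
A_or = A_knn.maximum(A_knn.T)    # approach 1: edge if u->v OR v->u
A_and = A_knn.minimum(A_knn.T)   # approach 2: edge if u->v AND v->u (mutual kNN)

# Fully connected graph with Gaussian similarity weights:
# w_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)), here with sigma = 1
A_full = rbf_kernel(X, gamma=0.5)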
  2. Projecting the data onto a lower-dimensional space: this step accounts for the possibility that members of the same cluster may be far apart in the original space. The dimensionality is therefore reduced so that those points become closer in the reduced space and can be grouped together by a traditional clustering algorithm. This is done by computing the graph Laplacian matrix. To compute it, the degree of a node must first be defined. The degree of the i-th node is defined as

        d_i = Σ_{j : (i, j) ∈ E} w_ij

    Note that w_ij is the weight of the edge between nodes i and j, as defined in the adjacency matrix above.

    The degree matrix is defined as the diagonal matrix holding the node degrees:

        D_ij = d_i if i = j, and 0 otherwise

    Thus, the graph Laplacian matrix is defined as:

        L = D − A

    This matrix is then normalized for mathematical efficiency (for example, via the symmetric normalization L_sym = D^(-1/2) L D^(-1/2)). To reduce the dimensionality, the eigenvalues and corresponding eigenvectors are computed. If the number of clusters is k, the first k eigenvalues and their eigenvectors are taken and stacked into a matrix such that the eigenvectors are the columns. A minimal sketch of this stage follows.
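Continuing the illustrative sketch above (with A_full as the symmetric similarity matrix), the degree matrix, the Laplacian, its normalization, and the eigenvector embedding can be computed with numpy as follows; this is an assumption-laden sketch, not the tutorial's own code.

# W: a symmetric weighted adjacency matrix, e.g. A_full from the sketch above
W = np.asarray(A_full)

# Degree of each node: d_i = sum_j w_ij
d = W.sum(axis=1)
D = np.diag(d)

# Unnormalized graph Laplacian: L = D - W
L = D - W

# Symmetric normalization: L_sym = D^(-1/2) L D^(-1/2)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = D_inv_sqrt @ L @ D_inv_sqrt

# np.linalg.eigh returns eigenvalues in ascending order; keep the
# eigenvectors of the k smallest eigenvalues, stacked as columns
k = 2
eigvals, eigvecs = np.linalg.eigh(L_sym)
U = eigvecs[:, :k]   # shape (n, k): row i is the embedding of node i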

  3. Data clustering: this step involves clustering the reduced data using any traditional clustering technique, typically k-means clustering. First, each node is assigned the corresponding row of the normalized graph Laplacian (eigenvector) matrix. This data is then clustered with any conventional technique. To map the clustering result back to the original data, the node identifier is retained. (See the sketch after the properties list below.)

    Properties:

    1. Assumption-free: unlike many traditional methods, this clustering technique does not assume that the data satisfies some specific property (such as convex cluster shapes). This makes it applicable to a more general class of clustering problems.
    2. Ease of implementation and speed: this algorithm is easier to implement than many other clustering algorithms and is also very fast, since it mainly consists of matrix computations.
    3. Not scalable: because it involves building matrices and computing eigenvalues and eigenvectors, it becomes slow for large, dense datasets.
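Continuing the sketch from stage 2, the final clustering stage can look like this. The per-row normalization follows the common Ng-Jordan-Weiss variant, which is an assumption here, not something the text above prescribes.

from sklearn.cluster import KMeans

# Normalize each row of U to unit length, then cluster the rows;
# row i of U is the low-dimensional embedding of the i-th data point
rows = U / np.linalg.norm(U, axis=1, keepdims=True)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(rows)
# labels[i] is the cluster assigned to the i-th original point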

The following steps demonstrate how to implement spectral clustering using sklearn. The data for the following steps is the credit card data, which can be downloaded from Kaggle.

Step 1: Import required libraries

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

Step 2: Load and clean the data

# Changing the working directory to the data location
cd "C:\Users\Dev\Desktop\Kaggle\Credit_Card"

# Loading the data
X = pd.read_csv('CC_GENERAL.csv')

# Dropping the CUST_ID column from the data
X = X.drop('CUST_ID', axis=1)

# Handling the missing values, if any
X.fillna(method='ffill', inplace=True)

X.head()

Step 3: Preprocess the data to make it visualizable

# Preprocessing the data to make it visualizable

 
# Scaling data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalizing the data
X_normalized = normalize(X_scaled)

# Converting the numpy array into a pandas DataFrame
X_normalized = pd.DataFrame(X_normalized)

# Reducing the dimensionality of the data
pca = PCA(n_components=2)
X_principal = pca.fit_transform(X_normalized)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']

X_principal.head()

Step 4: Build the clustering models and visualize the clustering

In the following steps, two different spectral clustering models are built with different values of the affinity parameter. You can read the documentation of the SpectralClustering class here.

a) affinity = 'rbf'

# Building the clustering model
spectral_model_rbf = SpectralClustering(n_clusters=2, affinity='rbf')

# Training the model and storing the predicted cluster labels
labels_rbf = spectral_model_rbf.fit_predict(X_principal)

# Building the label-to-colour mapping
colors = {}
colors[0] = 'b'
colors[1] = 'y'

# Building the colour vector for each data point
cvec = [colors[label] for label in labels_rbf]

# Plotting the clustered scatter plot

b = plt.scatter(X_principal['P1'], X_principal['P2'], color='b')
y = plt.scatter(X_principal['P1'], X_principal['P2'], color='y')

plt.figure(figsize=(9, 9))
plt.scatter(X_principal['P1'], X_principal['P2'], c=cvec)
plt.legend((b, y), ('Label 0', 'Label 1'))
plt.show()

b) affinity = 'nearest_neighbors'

# Building the clustering model
spectral_model_nn = SpectralClustering(n_clusters=2, affinity='nearest_neighbors')

# Training the model and storing the predicted cluster labels
labels_nn = spectral_model_nn.fit_predict(X_principal)

Step 5: Performance Evaluation

# List of different values of affinity
affinity = ['rbf', 'nearest_neighbors']

# List of silhouette scores
s_scores = []

# Evaluating the performance
s_scores.append(silhouette_score(X, labels_rbf))
s_scores.append(silhouette_score(X, labels_nn))

print(s_scores)


Step 6: Compare performances

# Plotting a bar chart to compare the models
plt.bar(affinity, s_scores)
plt.xlabel('Affinity')
plt.ylabel('Silhouette Score')
plt.title('Comparison of different Clustering Models')
plt.show()