Change language

ML | Spectral clustering

| |

Spectral Clustering — it is a growing clustering algorithm that performs better than many traditional clustering algorithms in many cases. It treats each data point as a node in the graph and thus turns the clustering problem into a graph splitting problem. A typical implementation consists of three main stages:

  1. Building a similarity graph: at this stage builds a similarity graph in the form of an adjacency matrix, which is represented by A. ways: —
    • Epsilon-neighborhood graph: the epsilon parameter is fixed in advance. Then each point connects to all points that lie in its epsilon radius. If all the distances between any two points are the same on the scale, then, as a rule, the weights of the edges, i.e. the distance between two points is not saved as they do not provide any additional information. Thus, in this case, the constructed graph is an undirected and unweighted graph.
    • K-nearest neighbors The parameter k is fixed in advance. Then, for two vertices u and v, the edge is directed from u to v only if v is among the k-nearest neighbors of u. Note that this results in a weighted and directed graph, because it is not always the case that for every u having v as one of its k-nearest neighbors, it will be the same for v having u among its k-nearest neighbors neighbors. To make this graph non-directional, one of the following approaches is used:
      1. A straight edge from U to V and from V to and, if either v is one of the K-nearest neighbors and or is one of the k-nearest neighbors of st.
      2. A straight edge from and to V and from V to and, if v is one of the K-nearest neighbors of both and and is one of K-nearest neighbors st.
    • Fully linked plot. To plot this plot, each point is associated with an undirected edge weighted by the distance between two points to any other point. Since this approach is used to model local neighborhood relationships, so it is common to use the Gaussian similarity metric to compute distance.
  2. Projecting data into a lower dimensional space. This step is done to account for the possibility that members of the same cluster may be far away in a given dimensional space. Thus, the dimensional space is reduced so that these points are closer in the reduced dimension space and thus can be grouped together using the traditional clustering algorithm. This is done by calculating the Laplace Graf matrix . To calculate this first, you need to determine the degree of the knot. The degree of the i-th node is defined as

    Note that this is the edge between nodes i and j as defined in the adjacency matrix above.

    The degree matrix is ​​defined as follows:

    Thus, the Laplace Graf matrix is ​​defined as:

    This matrix is ​​then normalized for mathematical efficiency. To reduce the size, the eigenvalues ​​and the corresponding eigenvectors are first computed. If the number of clusters is k, then the first eigenvalues ​​and their eigenvectors are taken and fit into the matrix in such a way that the eigenvectors are columns.

  3. Data clustering. This the process basically involves clustering the abbreviated data using any traditional clustering method — usually K-Means Clustering. First, each node is assigned a row of the normalized Laplace Graf matrix. This data is then grouped using any conventional technique. To transform the clustering result, the node ID is stored.

    Properties :

    1. No assumptions: this clustering method, unlike other traditional methods , does not imply that the data matches some property. Thus, this makes this method the answer to a more general class of clustering problems.
    2. Ease of implementation and speed: This algorithm is easier to implement than other clustering algorithms and is also very fast. since it mainly consists of math calculations.
    3. Not scalable: because it involves matrix construction and computation of eigenvalues ​​and eigenvectors, it takes a long time for dense datasets.

The following steps demonstrate how to implement spectral clustering using Sklearn. Data for the next steps — this is credit card data which can be downloaded from Kaggle .

Step 1: Import required libraries

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.cluster import SpectralClustering

from sklearn.preprocessing import StandardScaler, normalize

from sklearn.decomposition import PCA

from sklearn.metrics import silhouette_score

Step 2: Load and clear data

# Change workplace to data location

cd "C: UsersDevDesktopKaggleCredit_Card"

 
# Loading data

X = pd.read_csv ( ’CC_GENERAL.csv’ )

  
# Remove the CUST_ID column from the data

X = X.drop ( ’CUST_ID’ , axis = 1 )

  
# Handle missing values, if any

X.fillna (method = ’ffill’ , inplace = True )

 
X.head ()

Step 3: Pre-processing data to make the data rendered

# Pre-processing data to make it renderable

 
# Scaling data

scaler = StandardScaler ()

X_scaled = scaler.fit_transform (X)

 
# Data normalization

X_normalized = normalize (X_scaled)

 
# Convert numpy array to panda DataFrame

X_normalized = pd.DataFrame (X_normalized)

 
# Reducing data size

pca = PCA (n_components = 2 )

X_principal = pca.fit_transform (X_normalized)

X_principal = pd.DataFrame (X_principal)

X_principal.columns = [ ’P1’ , ’ P2’ ]

  
X_principal.head ()

Step 4. Post swarm clustering models and visualize clustering.

In the following steps, two different spectral clustering models with different affinity values. You can read about the documentation of the Spectral Clustering class here .

a ) affinity = & # 39; rbf & # 39;

# Building a clustering model

spectral_model_rbf = SpectralClustering (n_clusters = 2 , affinity = ’ rbf’ )

 
# Train the model and save the predicted cluster labels

labels_rbf = spectral_model_rbf.fit_predict (X_princ ipal)

# Create a label for color matching

colors = {}

colors [ 0 ] = ’b’

colors [ 1 ] = ’y’

  
# Build a color vector for each data point

cvec = [colors [label] for label in < / code> labels_rbf]

 
# Building a cluster scatterplot

 

b = plt.scatter (X_principal [ ’P1’ ], X_principal [ ’ P2’ ], color = ’b’ ); 

y = plt.scatter (X_principal [ ’P1’ ], X_principal [ ’ P2’ ], color = ’y’ ); 

 

plt.figure (figsize = ( 9 , 9 ))

plt.scatter (X_principal [ ’P1’ ], X_principal [ ’ P2’ ], c = cvec)

plt.legend ((b, y), ( ’Label 0’ , ’Label 1’ ))

plt.show ()

b) similarity = & # 39; next_leagues & # 39;

# Building a clustering model

spectral_model_nn = SpectralClustering (n_clusters = 2 , affinity = ’nearest_neighbors’ )

  
# Train the model and save the predicted cluster labels

labels_nn = spectral_model_nn.fit_predict (X_principal)

Step 5: Performance Evaluation

# List of different affinity values ​​

affinity = [ ’rbf’ , ’nearest-neighbors’ ]

 
# Silhouette points list

s_scores = []

 
# Performance score
s_scores.append (silhouette_score (X, labels_rbf))
s_scores.append (silhouette_score (X, labels_nn))

 

print (s_scores)

< p>

Step 6: Compare performances

# Building a histogram for comparing models
plt.bar (affinity, s_scores )

plt.xlabel ( ’Affinity’ )

plt.ylabel ( ’Silhouette Score’ )

plt.title ( ’ Comparison of different Clustering Models’ )

plt.show ()

Shop

Learn programming in R: courses

$

Best Python online courses for 2022

$

Best laptop for Fortnite

$

Best laptop for Excel

$

Best laptop for Solidworks

$

Best laptop for Roblox

$

Best computer for crypto mining

$

Best laptop for Sims 4

$

Latest questions

NUMPYNUMPY

Common xlabel/ylabel for matplotlib subplots

12 answers

NUMPYNUMPY

How to specify multiple return types using type-hints

12 answers

NUMPYNUMPY

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

12 answers

NUMPYNUMPY

Flake8: Ignore specific warning for entire file

12 answers

NUMPYNUMPY

glob exclude pattern

12 answers

NUMPYNUMPY

How to avoid HTTP error 429 (Too Many Requests) python

12 answers

NUMPYNUMPY

Python CSV error: line contains NULL byte

12 answers

NUMPYNUMPY

csv.Error: iterator should return strings, not bytes

12 answers

News


Wiki

Python | How to copy data from one Excel sheet to another

Common xlabel/ylabel for matplotlib subplots

Check if one list is a subset of another in Python

sin

How to specify multiple return types using type-hints

exp

Printing words vertically in Python

exp

Python Extract words from a given string

Cyclic redundancy check in Python

Finding mean, median, mode in Python without libraries

cos

Python add suffix / add prefix to strings in a list

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

Python - Move item to the end of the list

Python - Print list vertically