Change language

# ML | Spectral clustering

| |

Spectral Clustering — it is a growing clustering algorithm that performs better than many traditional clustering algorithms in many cases. It treats each data point as a node in the graph and thus turns the clustering problem into a graph splitting problem. A typical implementation consists of three main stages:

1. Building a similarity graph: at this stage builds a similarity graph in the form of an adjacency matrix, which is represented by A. ways: —
• Epsilon-neighborhood graph: the epsilon parameter is fixed in advance. Then each point connects to all points that lie in its epsilon radius. If all the distances between any two points are the same on the scale, then, as a rule, the weights of the edges, i.e. the distance between two points is not saved as they do not provide any additional information. Thus, in this case, the constructed graph is an undirected and unweighted graph.
• K-nearest neighbors The parameter k is fixed in advance. Then, for two vertices u and v, the edge is directed from u to v only if v is among the k-nearest neighbors of u. Note that this results in a weighted and directed graph, because it is not always the case that for every u having v as one of its k-nearest neighbors, it will be the same for v having u among its k-nearest neighbors neighbors. To make this graph non-directional, one of the following approaches is used:
1. A straight edge from U to V and from V to and, if either v is one of the K-nearest neighbors and or is one of the k-nearest neighbors of st.
2. A straight edge from and to V and from V to and, if v is one of the K-nearest neighbors of both and and is one of K-nearest neighbors st.
• Fully linked plot. To plot this plot, each point is associated with an undirected edge weighted by the distance between two points to any other point. Since this approach is used to model local neighborhood relationships, so it is common to use the Gaussian similarity metric to compute distance.
2. Projecting data into a lower dimensional space. This step is done to account for the possibility that members of the same cluster may be far away in a given dimensional space. Thus, the dimensional space is reduced so that these points are closer in the reduced dimension space and thus can be grouped together using the traditional clustering algorithm. This is done by calculating the Laplace Graf matrix . To calculate this first, you need to determine the degree of the knot. The degree of the i-th node is defined as

Note that this is the edge between nodes i and j as defined in the adjacency matrix above.

The degree matrix is ​​defined as follows:

Thus, the Laplace Graf matrix is ​​defined as:

This matrix is ​​then normalized for mathematical efficiency. To reduce the size, the eigenvalues ​​and the corresponding eigenvectors are first computed. If the number of clusters is k, then the first eigenvalues ​​and their eigenvectors are taken and fit into the matrix in such a way that the eigenvectors are columns.

3. Data clustering. This the process basically involves clustering the abbreviated data using any traditional clustering method — usually K-Means Clustering. First, each node is assigned a row of the normalized Laplace Graf matrix. This data is then grouped using any conventional technique. To transform the clustering result, the node ID is stored.

Properties :

1. No assumptions: this clustering method, unlike other traditional methods , does not imply that the data matches some property. Thus, this makes this method the answer to a more general class of clustering problems.
2. Ease of implementation and speed: This algorithm is easier to implement than other clustering algorithms and is also very fast. since it mainly consists of math calculations.
3. Not scalable: because it involves matrix construction and computation of eigenvalues ​​and eigenvectors, it takes a long time for dense datasets.

The following steps demonstrate how to implement spectral clustering using Sklearn. Data for the next steps — this is credit card data which can be downloaded from Kaggle .

Step 1: Import required libraries

 ` import ` ` pandas as pd ` ` import ` ` matplotlib.pyplot as plt ` ` from ` ` sklearn.cluster ` ` import ` ` SpectralClustering ` ` from ` ` sklearn.preprocessing ` ` import ` ` StandardScaler, normalize ` ` from ` ` sklearn.decomposition ` ` import ` ` PCA ` ` from ` ` sklearn.metrics ` ` import ` ` silhouette_score `

Step 2: Load and clear data

 ` # Change workplace to data location ` ` cd ` ` "C: UsersDevDesktopKaggleCredit_Card" `   ` # Loading data ` ` X ` ` = ` ` pd.read_csv (` ` ’CC_GENERAL.csv’ ` `) ` ` `  ` # Remove the CUST_ID column from the data ` ` X ` ` = ` X.drop ( ` ’CUST_ID’ ` `, axis ` ` = ` ` 1 ` `) ` ` `  ` # Handle missing values, if any ` ` X.fillna (method ` ` = ` ` ’ffill’ ` `, inplace ` ` = ` ` True ` `) `   ` X.head () `

Step 3: Pre-processing data to make the data rendered

 ` # Pre-processing data to make it renderable `   ` # Scaling data ` ` scaler ` ` = ` ` StandardScaler () ` ` X_scaled ` ` = ` ` scaler.fit_transform (X) `   ` # Data normalization ` ` X_normalized ` ` = ` ` normalize (X_scaled) `   ` # Convert numpy array to panda DataFrame ` ` X_normalized ` ` = ` ` pd.DataFrame (X_normalized) `   # Reducing data size ` pca ` ` = ` ` PCA (n_components ` ` = ` ` 2 ` `) ` ` X_principal ` ` = ` ` pca.fit_transform (X_normalized) ` ` X_principal ` ` = ` ` pd.DataFrame (X_principal) ` ` X_principal.columns ` ` = ` ` [` ` ’P1’ ` `, ` `’ P2’ ` `] ` ` `  ` X_principal.head () `

Step 4. Post swarm clustering models and visualize clustering.

In the following steps, two different spectral clustering models with different affinity values. You can read about the documentation of the Spectral Clustering class here .

a ) affinity = & # 39; rbf & # 39;

 ` # Building a clustering model ` ` spectral_model_rbf ` ` = ` ` SpectralClustering (n_clusters ` ` = ` ` 2 ` `, affinity ` ` = ` `’ rbf’ ` `) `   ` # Train the model and save the predicted cluster labels ` ` labels_rbf ` ` = ` ` spectral_model_rbf.fit_predict (X_princ ipal) `

` `

` # Create a label for color matching colors = {} colors [ 0 ] = ’b’ colors [ 1 ] = ’y’    # Build a color vector for each data point cvec = [colors [label] for label in < / code> labels_rbf]   # Building a cluster scatterplot   b = plt.scatter (X_principal [ ’P1’ ], X_principal [ ’ P2’ ], color = ’b’ );  y = plt.scatter (X_principal [ ’P1’ ], X_principal [ ’ P2’ ], color = ’y’ );    plt.figure (figsize = ( 9 , 9 )) plt.scatter (X_principal [ ’P1’ ], X_principal [ ’ P2’ ], c = cvec) plt.legend ((b, y), ( ’Label 0’ , ’Label 1’ )) plt.show () `

` `

b) similarity = & # 39; next_leagues & # 39;

 ` # Building a clustering model ` ` spectral_model_nn ` ` = ` ` SpectralClustering (n_clusters ` ` = ` ` 2 ` `, affinity ` ` = ` ` ’nearest_neighbors’ ` `) ` ` `  ` # Train the model and save the predicted cluster labels ` ` labels_nn ` ` = ` ` spectral_model_nn.fit_predict (X_principal) `

Step 5: Performance Evaluation

 ` # List of different affinity values ​​` ` affinity ` ` = ` ` [` ` ’rbf’ ` `, ` ` ’nearest-neighbors’ ` `] `   ` # Silhouette points list ` ` s_scores ` ` = ` ` [] `   ` # Performance score ` ` s_scores.append (silhouette_score (X, labels_rbf)) ` ` s_scores.append (silhouette_score (X, labels_nn)) `   ` print ` ` (s_scores) `
< p>

Step 6: Compare performances

 ` # Building a histogram for comparing models ` ` plt.bar (affinity, s_scores ) ` ` plt.xlabel (` ` ’Affinity’ ` `) ` ` plt.ylabel (` ` ’Silhouette Score’ ` `) ` ` plt.title (` `’ Comparison of different Clustering Models’ ` `) ` ` plt.show () `

## Shop

Learn programming in R: courses

\$

Best Python online courses for 2022

\$

Best laptop for Fortnite

\$

Best laptop for Excel

\$

Best laptop for Solidworks

\$

Best laptop for Roblox

\$

Best computer for crypto mining

\$

Best laptop for Sims 4

\$

Latest questions

NUMPYNUMPY

Common xlabel/ylabel for matplotlib subplots

NUMPYNUMPY

How to specify multiple return types using type-hints

NUMPYNUMPY

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

NUMPYNUMPY

Flake8: Ignore specific warning for entire file

NUMPYNUMPY

glob exclude pattern

NUMPYNUMPY

How to avoid HTTP error 429 (Too Many Requests) python

NUMPYNUMPY

Python CSV error: line contains NULL byte

NUMPYNUMPY

csv.Error: iterator should return strings, not bytes

## Wiki

Python | How to copy data from one Excel sheet to another

Common xlabel/ylabel for matplotlib subplots

Check if one list is a subset of another in Python

How to specify multiple return types using type-hints

Printing words vertically in Python

Python Extract words from a given string

Cyclic redundancy check in Python

Finding mean, median, mode in Python without libraries