# ML | V-measure for evaluating the effectiveness of clustering

Computing the V-measure first requires computing two terms:

1. Homogeneity: a perfectly homogeneous clustering is one in which each cluster contains only data points belonging to the same class label. Homogeneity describes how close the clustering algorithm comes to this ideal.
2. Completeness: a perfectly complete clustering is one in which all data points belonging to the same class are assigned to the same cluster. Completeness describes how close the clustering algorithm comes to this ideal.

Trivial homogeneity: the case in which the number of clusters equals the number of data points, so each point sits in its own cluster. This is an extreme case where homogeneity is maximal and completeness is minimal. Trivial completeness: the case in which all data points are grouped into a single cluster. This is an extreme case where homogeneity is minimal and completeness is maximal.

Note: Homogeneity differs from completeness in which direction it examines the labeling. For homogeneity the reference unit is the cluster: we check whether every data point within each cluster has the same class label. For completeness the reference unit is the class label: we check whether all data points of each class label end up in the same cluster. A clustering can be perfectly homogeneous without being complete: if every cluster is pure but the points of one class are split across several clusters, homogeneity is perfect while completeness is not. Conversely, a clustering can be perfectly complete without being homogeneous: if each class sits entirely inside one cluster but some cluster mixes points of several classes, completeness is perfect while homogeneity is not.
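A small check with scikit-learn's `homogeneity_score` and `completeness_score` makes the asymmetry concrete (the toy label lists below are made up for illustration):

```python
from sklearn.metrics import homogeneity_score, completeness_score

# Ground-truth class labels for six data points (hypothetical toy data)
y_true = [0, 0, 1, 1, 2, 2]

# Homogeneous but not complete: every cluster is pure,
# but class 0 is split across clusters 0 and 1
pred_a = [0, 1, 2, 2, 3, 3]
print(homogeneity_score(y_true, pred_a))   # 1.0
print(completeness_score(y_true, pred_a))  # < 1.0

# Complete but not homogeneous: each class sits in one cluster,
# but cluster 0 mixes classes 0 and 1
pred_b = [0, 0, 0, 0, 1, 1]
print(homogeneity_score(y_true, pred_b))   # < 1.0
print(completeness_score(y_true, pred_b))  # 1.0
```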

Suppose there are $N$ data samples, $C$ different class labels, $K$ clusters, and $n_{c,k}$ data points belonging to class $c$ and cluster $k$. Homogeneity $h$ is then defined as

$$h = 1 - \frac{H(C|K)}{H(C)}$$

where

$$H(C|K) = -\sum_{k=1}^{K}\sum_{c=1}^{C}\frac{n_{c,k}}{N}\log\left(\frac{n_{c,k}}{n_k}\right) \quad\text{and}\quad H(C) = -\sum_{c=1}^{C}\frac{n_c}{N}\log\left(\frac{n_c}{N}\right)$$

Completeness $c$ is defined as

$$c = 1 - \frac{H(K|C)}{H(K)}$$

where

$$H(K|C) = -\sum_{c=1}^{C}\sum_{k=1}^{K}\frac{n_{c,k}}{N}\log\left(\frac{n_{c,k}}{n_c}\right) \quad\text{and}\quad H(K) = -\sum_{k=1}^{K}\frac{n_k}{N}\log\left(\frac{n_k}{N}\right)$$

Here $n_k = \sum_c n_{c,k}$ is the size of cluster $k$ and $n_c = \sum_k n_{c,k}$ is the size of class $c$. The weighted V-measure $V_\beta$ is then defined as

$$V_\beta = \frac{(1+\beta)\,h\,c}{\beta\,h + c}$$

The factor $\beta$ can be adjusted to favor either the homogeneity or the completeness of the clustering algorithm.
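The definitions above can be implemented from scratch and checked against scikit-learn's `v_measure_score`. The sketch below is illustrative (the helper name `v_measure` and the toy labels are assumptions, not library API):

```python
import numpy as np
from sklearn.metrics import v_measure_score

def v_measure(y_true, y_pred, beta=1.0):
    # Build the contingency table n_ck[c, k]: points of class c in cluster k
    _, y_true = np.unique(y_true, return_inverse=True)
    _, y_pred = np.unique(y_pred, return_inverse=True)
    N = len(y_true)
    n_ck = np.zeros((y_true.max() + 1, y_pred.max() + 1))
    np.add.at(n_ck, (y_true, y_pred), 1)
    n_c = n_ck.sum(axis=1)          # class sizes
    n_k = n_ck.sum(axis=0)          # cluster sizes

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    nz = n_ck > 0                   # skip empty cells to avoid log(0)
    H_C_given_K = -np.sum((n_ck / N)[nz] * np.log((n_ck / n_k)[nz]))
    H_K_given_C = -np.sum((n_ck / N)[nz] * np.log((n_ck / n_c[:, None])[nz]))
    H_C, H_K = entropy(n_c / N), entropy(n_k / N)
    h = 1.0 if H_C == 0 else 1 - H_C_given_K / H_C
    c = 1.0 if H_K == 0 else 1 - H_K_given_C / H_K
    if beta * h + c == 0:
        return 0.0
    return (1 + beta) * h * c / (beta * h + c)

y = [0, 0, 1, 1, 2, 2]
labels = [0, 0, 0, 1, 2, 2]
print(v_measure(y, labels), v_measure_score(y, labels))  # the two should agree
```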

The main advantage of this metric is that its score is independent of the number of class labels, the number of clusters, the size of the data set, and the clustering algorithm used, which makes it a very reliable way to compare clusterings.

The following code demonstrates how to compute the V-measure of a clustering algorithm. The data used is the Credit Card Fraud Detection dataset, which can be downloaded from Kaggle. The clustering algorithm is K-Means.

Step 1: Import the required libraries

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score
```

Step 2: Load and clean the data

```python
# Change the working directory to the file location (IPython magic)
cd C:\Users\Dev\Desktop\Kaggle\Credit Card Fraud

# Load the data
df = pd.read_csv('creditcard.csv')

# Separate the dependent and independent variables
y = df['Class']
X = df.drop('Class', axis=1)

X.head()
```

Step 3: Build different clustering models and compare their V-measure scores

At this stage, five different K-Means clustering models will be built, with each model clustering the data into a different number of clusters.

```python
# List of V-measure scores for the different models
v_scores = []

# List of the different numbers of clusters to try
N_Clusters = [2, 3, 4, 5, 6]
```

a) n_clusters = 2

```python
# Build the clustering model
kmeans2 = KMeans(n_clusters=2)

# Train the clustering model
kmeans2.fit(X)

# Store the predicted cluster labels
labels2 = kmeans2.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels2))
```

b) n_clusters = 3

```python
# Build the clustering model
kmeans3 = KMeans(n_clusters=3)

# Train the clustering model
kmeans3.fit(X)

# Store the predicted cluster labels
labels3 = kmeans3.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels3))
```

c) n_clusters = 4

```python
# Build the clustering model
kmeans4 = KMeans(n_clusters=4)

# Train the clustering model
kmeans4.fit(X)

# Store the predicted cluster labels
labels4 = kmeans4.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels4))
```

d) n_clusters = 5

```python
# Build the clustering model
kmeans5 = KMeans(n_clusters=5)

# Train the clustering model
kmeans5.fit(X)

# Store the predicted cluster labels
labels5 = kmeans5.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels5))
```

e) n_clusters = 6

```python
# Build the clustering model
kmeans6 = KMeans(n_clusters=6)

# Train the clustering model
kmeans6.fit(X)

# Store the predicted cluster labels
labels6 = kmeans6.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels6))
```
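The five near-identical blocks above can also be condensed into a single loop. The sketch below substitutes a synthetic `make_blobs` dataset for `X` and `y` (an assumption, so the snippet runs on its own without `creditcard.csv`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Synthetic stand-in for the credit-card data, purely for illustration
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

v_scores = []
N_Clusters = [2, 3, 4, 5, 6]
for n in N_Clusters:
    model = KMeans(n_clusters=n, n_init=10, random_state=0)
    labels = model.fit_predict(X)   # fit and predict in one step
    v_scores.append(v_measure_score(y, labels))

print(dict(zip(N_Clusters, v_scores)))
```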

Step 4: Visualize the results and performance comparison

```python
# Plot a bar chart comparing the models
plt.bar(N_Clusters, v_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('V-Measure Score')
plt.title('Comparison of different Clustering Models')
plt.show()
```
