
ML | V-measure for evaluating the effectiveness of clustering

Computing the V-measure first requires computing two terms:

  1. Homogeneity: a perfectly homogeneous clustering is one in which each cluster contains only data points with the same class label. Homogeneity measures how close a clustering comes to this ideal.
  2. Completeness: a perfectly complete clustering is one in which all data points with the same class label are assigned to the same cluster. Completeness measures how close a clustering comes to this ideal.

Trivial homogeneity: this is the case when the number of clusters equals the number of data points, so each point sits in its own cluster. It is an extreme case in which homogeneity is maximal and completeness is minimal.

Trivial completeness: this is the case when all data points are combined into one cluster. It is an extreme case in which homogeneity is minimal and completeness is maximal.

In the charts for trivial homogeneity and trivial completeness, assume that the data points carry several different class labels.
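The two trivial cases can be checked numerically. The following is a minimal sketch using `homogeneity_score` and `completeness_score` from scikit-learn; the six-point, two-class labelling is a made-up example, not data from this article.

```python
from sklearn.metrics import completeness_score, homogeneity_score

# Hypothetical ground-truth classes for six points (two classes).
y_true = [0, 0, 0, 1, 1, 1]

# Trivial homogeneity: every point sits alone in its own cluster.
own_cluster = [0, 1, 2, 3, 4, 5]
# Trivial completeness: every point is placed in one single cluster.
one_cluster = [0, 0, 0, 0, 0, 0]

h_own = homogeneity_score(y_true, own_cluster)
c_own = completeness_score(y_true, own_cluster)
h_one = homogeneity_score(y_true, one_cluster)
c_one = completeness_score(y_true, one_cluster)

print(f"own cluster: homogeneity={h_own:.3f}, completeness={c_own:.3f}")
print(f"one cluster: homogeneity={h_one:.3f}, completeness={c_one:.3f}")
```

With one cluster per point, homogeneity reaches 1 while completeness stays low; with a single cluster, completeness reaches 1 while homogeneity drops to 0.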

Note: Homogeneity differs from completeness in its point of view. Homogeneity is judged per cluster: we check whether all data points within each cluster have the same class label. Completeness is judged per class label: we check whether all data points of each class label fall in the same cluster.

In the above diagram, the clustering is perfectly homogeneous because in each cluster the data points have the same class label, but it is not complete because not all data points with the same class label belong to the same cluster.

In the above diagram, the clustering is perfectly complete since all data points with the same class label belong to the same cluster, but it is not homogeneous because the 1st cluster contains data points with several class labels.
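The per-cluster and per-class views can be sketched as two small checks. The helper names `is_homogeneous` and `is_complete` and the label lists below are hypothetical, chosen only to mimic the two situations described above; they test for the perfect case only and are not a score.

```python
from collections import defaultdict

def is_homogeneous(classes, clusters):
    """True if every cluster contains points of a single class."""
    classes_in_cluster = defaultdict(set)
    for c, k in zip(classes, clusters):
        classes_in_cluster[k].add(c)
    return all(len(s) == 1 for s in classes_in_cluster.values())

def is_complete(classes, clusters):
    """True if every class is contained in a single cluster."""
    # Completeness is homogeneity with the roles of class and cluster swapped.
    return is_homogeneous(clusters, classes)

# Case 1 (hypothetical): class "a" is split over clusters 0 and 1.
classes_1 = ["a", "a", "a", "a", "b", "b"]
clusters_1 = [0, 0, 1, 1, 2, 2]

# Case 2 (hypothetical): cluster 0 mixes classes "a" and "b".
classes_2 = ["a", "a", "b", "b", "c", "c"]
clusters_2 = [0, 0, 0, 0, 1, 1]

case_1 = (is_homogeneous(classes_1, clusters_1), is_complete(classes_1, clusters_1))
case_2 = (is_homogeneous(classes_2, clusters_2), is_complete(classes_2, clusters_2))
print("case 1 (homogeneous, complete):", case_1)
print("case 2 (homogeneous, complete):", case_2)
```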

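Suppose there are N data samples, C different class labels, and K clusters, and let $a_{ck}$ be the number of data points belonging to class c and cluster k. The homogeneity h is then defined as follows:

$$h = 1 - \frac{H(C \mid K)}{H(C)}$$

where

$$H(C \mid K) = -\sum_{k=1}^{K} \sum_{c=1}^{C} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{c'=1}^{C} a_{c'k}}$$

and

$$H(C) = -\sum_{c=1}^{C} \frac{\sum_{k=1}^{K} a_{ck}}{N} \log \frac{\sum_{k=1}^{K} a_{ck}}{N}$$

The completeness c (not to be confused with the class index c) is defined as follows:

$$c = 1 - \frac{H(K \mid C)}{H(K)}$$

where

$$H(K \mid C) = -\sum_{c=1}^{C} \sum_{k=1}^{K} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{k'=1}^{K} a_{ck'}}$$

and

$$H(K) = -\sum_{k=1}^{K} \frac{\sum_{c=1}^{C} a_{ck}}{N} \log \frac{\sum_{c=1}^{C} a_{ck}}{N}$$

So the weighted V-measure is defined as follows:

$$V_{\beta} = \frac{(1 + \beta) \, h \, c}{\beta h + c}$$

The factor $\beta$ can be adjusted to favor either the homogeneity or the completeness of the clustering: $\beta > 1$ weights completeness more heavily, and $\beta < 1$ weights homogeneity more heavily.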
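As a minimal sketch, homogeneity, completeness, and the weighted V-measure can be computed directly from the class and cluster counts using entropies. The function name `v_measure` and the four-point example are illustrative assumptions; treating a zero entropy in a denominator as a perfect score is a convention adopted here, and in practice `sklearn.metrics.v_measure_score` would normally be used instead.

```python
import math
from collections import Counter

def v_measure(classes, clusters, beta=1.0):
    """Return (homogeneity, completeness, V_beta) for two label sequences."""
    n = len(classes)
    a = Counter(zip(classes, clusters))   # a[(c, k)]: points in class c and cluster k
    class_sizes = Counter(classes)        # points per class
    cluster_sizes = Counter(clusters)     # points per cluster

    # Entropies H(C) and H(K)
    h_c = -sum(m / n * math.log(m / n) for m in class_sizes.values())
    h_k = -sum(m / n * math.log(m / n) for m in cluster_sizes.values())
    # Conditional entropies H(C|K) and H(K|C); only non-empty cells appear in `a`
    h_c_given_k = -sum(m / n * math.log(m / cluster_sizes[k]) for (c, k), m in a.items())
    h_k_given_c = -sum(m / n * math.log(m / class_sizes[c]) for (c, k), m in a.items())

    # Convention: a zero entropy in the denominator counts as a perfect score
    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
    return homogeneity, completeness, v

# Hypothetical example: class 1 is split across clusters 1 and 2.
h, c, v = v_measure([0, 0, 1, 1], [0, 0, 1, 2])
v_beta2 = v_measure([0, 0, 1, 1], [0, 0, 1, 2], beta=2.0)[2]
print(f"h={h:.3f}, c={c:.3f}, V={v:.3f}, V(beta=2)={v_beta2:.3f}")
```

Because this example is perfectly homogeneous but not complete, raising beta above 1 puts more weight on the weaker completeness term and lowers the score.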

The main advantage of this scoring metric is that its definition does not depend on the number of class labels, the number of clusters, the size of the data, or the clustering algorithm used, so scores from different clustering setups can be compared directly.

The following code demonstrates how to compute the V-measure of a clustering algorithm. The data used is the Credit Card Fraud Detection dataset, which can be downloaded from Kaggle. The clustering algorithm used is K-Means.

Step 1: Import the required libraries

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

Step 2: Load and clean the data

# Change the working directory to the file location
import os
os.chdir(r'C:\Users\Dev\Desktop\Kaggle\Credit Card Fraud')

  
# Load the data
df = pd.read_csv('creditcard.csv')

# Separate the dependent and independent variables
y = df['Class']
X = df.drop('Class', axis=1)

X.head()

Step 3: Build different clustering models and compare their V-measure scores

At this stage, 5 different K-Means clustering models will be built, with each model clustering the data into a different number of clusters.

# List of V-measure scores for the different models
v_scores = []

# List of the different numbers of clusters
N_Clusters = [2, 3, 4, 5, 6]

a) n_clusters = 2

# Build the clustering model
kmeans2 = KMeans(n_clusters=2)

# Train the clustering model
kmeans2.fit(X)

# Store the predicted cluster labels
labels2 = kmeans2.predict(X)

# Evaluate the performance
v_scores.append(v_measure_score(y, labels2))

b) n_clusters = 3

# Build the clustering model
kmeans3 = KMeans(n_clusters=3)

# Train the clustering model
kmeans3.fit(X)

# Store the predicted cluster labels
labels3 = kmeans3.predict(X)

# Evaluate the performance
v_scores.append(v_measure_score(y, labels3))

c) n_clusters = 4

# Build the clustering model
kmeans4 = KMeans(n_clusters=4)

# Train the clustering model
kmeans4.fit(X)

# Store the predicted cluster labels
labels4 = kmeans4.predict(X)

# Evaluate the performance
v_scores.append(v_measure_score(y, labels4))

d) n_clusters = 5

# Build the clustering model
kmeans5 = KMeans(n_clusters=5)

# Train the clustering model
kmeans5.fit(X)

# Store the predicted cluster labels
labels5 = kmeans5.predict(X)

# Evaluate the performance
v_scores.append(v_measure_score(y, labels5))

e) n_clusters = 6

# Build the clustering model
kmeans6 = KMeans(n_clusters=6)

# Train the clustering model
kmeans6.fit(X)

# Store the predicted cluster labels
labels6 = kmeans6.predict(X)

# Evaluate the performance
v_scores.append(v_measure_score(y, labels6))
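The five models in this step follow the same pattern, so they could also be built in a loop. The sketch below is one possible way to do that. It runs on synthetic data from `make_blobs` as a stand-in for the credit card dataset (an assumption, so that it runs without the CSV file), and the names `X_demo`, `y_demo`, and `demo_scores` are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Synthetic stand-in data (an assumption): 300 points drawn from 3 blobs.
X_demo, y_demo = make_blobs(n_samples=300, centers=3, random_state=0)

n_clusters_list = [2, 3, 4, 5, 6]
demo_scores = []
for n in n_clusters_list:
    # n_init and random_state are fixed here only to make the sketch repeatable.
    model = KMeans(n_clusters=n, n_init=10, random_state=0)
    labels = model.fit_predict(X_demo)
    demo_scores.append(v_measure_score(y_demo, labels))

for n, score in zip(n_clusters_list, demo_scores):
    print(f"n_clusters={n}: V-measure={score:.3f}")
```

`fit_predict` combines the separate `fit` and `predict` calls used in the step above, and the resulting list can be passed to `plt.bar` in the same way as `v_scores`.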

Step 4: Visualize the results and compare performance

# Plot a bar chart to compare the models
plt.bar(N_Clusters, v_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('V-Measure Score')
plt.title('Comparison of different Clustering Models')
plt.show()
