# ML | V-measure for evaluating the effectiveness of clustering

Computing the V-measure first requires computing two terms:

1. Homogeneity: a perfectly homogeneous clustering is one in which each cluster contains only data points belonging to the same class label. Homogeneity describes how close the clustering algorithm comes to this ideal.
2. Completeness: a perfectly complete clustering is one in which all data points belonging to the same class are assigned to the same cluster. Completeness describes how close the clustering algorithm comes to this ideal.

Trivial homogeneity: the case in which the number of clusters equals the number of data points, so each point sits in its own cluster. This is an extreme case where homogeneity is maximal and completeness is minimal. Trivial completeness: the case in which all data points are grouped into a single cluster. This is an extreme case where homogeneity is minimal and completeness is maximal.

Note: Homogeneity differs from completeness in which direction it examines the labeling. For homogeneity the reference unit is the cluster: we check whether every data point within each cluster has the same class label. For completeness the reference unit is the class label: we check whether all data points of each class label end up in the same cluster. A clustering can be perfectly homogeneous without being complete: if every cluster is pure but the points of one class are split across several clusters, homogeneity is perfect while completeness is not. Conversely, a clustering can be perfectly complete without being homogeneous: if each class sits entirely inside one cluster but some cluster mixes points of several classes, completeness is perfect while homogeneity is not.
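A small check with scikit-learn's `homogeneity_score` and `completeness_score` makes the asymmetry concrete (the toy label lists below are made up for illustration):

```python
from sklearn.metrics import homogeneity_score, completeness_score

# Ground-truth class labels for six data points (hypothetical toy data)
y_true = [0, 0, 1, 1, 2, 2]

# Homogeneous but not complete: every cluster is pure,
# but class 0 is split across clusters 0 and 1
pred_a = [0, 1, 2, 2, 3, 3]
print(homogeneity_score(y_true, pred_a))   # 1.0
print(completeness_score(y_true, pred_a))  # < 1.0

# Complete but not homogeneous: each class sits in one cluster,
# but cluster 0 mixes classes 0 and 1
pred_b = [0, 0, 0, 0, 1, 1]
print(homogeneity_score(y_true, pred_b))   # < 1.0
print(completeness_score(y_true, pred_b))  # 1.0
```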

Suppose there are $N$ data samples, $C$ different class labels, $K$ clusters, and $n_{c,k}$ data points belonging to class $c$ and cluster $k$. Homogeneity $h$ is then defined as

$$h = 1 - \frac{H(C|K)}{H(C)}$$

where

$$H(C|K) = -\sum_{k=1}^{K}\sum_{c=1}^{C}\frac{n_{c,k}}{N}\log\left(\frac{n_{c,k}}{n_k}\right) \quad\text{and}\quad H(C) = -\sum_{c=1}^{C}\frac{n_c}{N}\log\left(\frac{n_c}{N}\right)$$

Completeness $c$ is defined as

$$c = 1 - \frac{H(K|C)}{H(K)}$$

where

$$H(K|C) = -\sum_{c=1}^{C}\sum_{k=1}^{K}\frac{n_{c,k}}{N}\log\left(\frac{n_{c,k}}{n_c}\right) \quad\text{and}\quad H(K) = -\sum_{k=1}^{K}\frac{n_k}{N}\log\left(\frac{n_k}{N}\right)$$

Here $n_k = \sum_c n_{c,k}$ is the size of cluster $k$ and $n_c = \sum_k n_{c,k}$ is the size of class $c$. The weighted V-measure $V_\beta$ is then defined as

$$V_\beta = \frac{(1+\beta)\,h\,c}{\beta\,h + c}$$

The factor $\beta$ can be adjusted to favor either the homogeneity or the completeness of the clustering algorithm.
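The definitions above can be implemented from scratch and checked against scikit-learn's `v_measure_score`. The sketch below is illustrative (the helper name `v_measure` and the toy labels are assumptions, not library API):

```python
import numpy as np
from sklearn.metrics import v_measure_score

def v_measure(y_true, y_pred, beta=1.0):
    # Build the contingency table n_ck[c, k]: points of class c in cluster k
    _, y_true = np.unique(y_true, return_inverse=True)
    _, y_pred = np.unique(y_pred, return_inverse=True)
    N = len(y_true)
    n_ck = np.zeros((y_true.max() + 1, y_pred.max() + 1))
    np.add.at(n_ck, (y_true, y_pred), 1)
    n_c = n_ck.sum(axis=1)          # class sizes
    n_k = n_ck.sum(axis=0)          # cluster sizes

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    nz = n_ck > 0                   # skip empty cells to avoid log(0)
    H_C_given_K = -np.sum((n_ck / N)[nz] * np.log((n_ck / n_k)[nz]))
    H_K_given_C = -np.sum((n_ck / N)[nz] * np.log((n_ck / n_c[:, None])[nz]))
    H_C, H_K = entropy(n_c / N), entropy(n_k / N)
    h = 1.0 if H_C == 0 else 1 - H_C_given_K / H_C
    c = 1.0 if H_K == 0 else 1 - H_K_given_C / H_K
    if beta * h + c == 0:
        return 0.0
    return (1 + beta) * h * c / (beta * h + c)

y = [0, 0, 1, 1, 2, 2]
labels = [0, 0, 0, 1, 2, 2]
print(v_measure(y, labels), v_measure_score(y, labels))  # the two should agree
```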

The main advantage of this metric is that its score is independent of the number of class labels, the number of clusters, the size of the data set, and the clustering algorithm used, which makes it a very reliable way to compare clusterings.

The following code demonstrates how to compute the V-measure of a clustering algorithm. The data used is the Credit Card Fraud Detection dataset, which can be downloaded from Kaggle. The clustering algorithm is K-Means.

Step 1: Import the required libraries

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score
```

Step 2: Load and clean the data

```python
# Change the working directory to the file location (IPython magic)
cd C:\Users\Dev\Desktop\Kaggle\Credit Card Fraud

# Load the data
df = pd.read_csv('creditcard.csv')

# Separate the dependent and independent variables
y = df['Class']
X = df.drop('Class', axis=1)

X.head()
```

Step 3: Build different clustering models and compare their V-measure scores

At this stage, five different K-Means clustering models will be built, with each model clustering the data into a different number of clusters.

```python
# List of V-measure scores for the different models
v_scores = []

# List of the different numbers of clusters to try
N_Clusters = [2, 3, 4, 5, 6]
```

a) n_clusters = 2

```python
# Build the clustering model
kmeans2 = KMeans(n_clusters=2)

# Train the clustering model
kmeans2.fit(X)

# Store the predicted cluster labels
labels2 = kmeans2.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels2))
```

b) n_clusters = 3

```python
# Build the clustering model
kmeans3 = KMeans(n_clusters=3)

# Train the clustering model
kmeans3.fit(X)

# Store the predicted cluster labels
labels3 = kmeans3.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels3))
```

c) n_clusters = 4

```python
# Build the clustering model
kmeans4 = KMeans(n_clusters=4)

# Train the clustering model
kmeans4.fit(X)

# Store the predicted cluster labels
labels4 = kmeans4.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels4))
```

d) n_clusters = 5

```python
# Build the clustering model
kmeans5 = KMeans(n_clusters=5)

# Train the clustering model
kmeans5.fit(X)

# Store the predicted cluster labels
labels5 = kmeans5.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels5))
```

e) n_clusters = 6

```python
# Build the clustering model
kmeans6 = KMeans(n_clusters=6)

# Train the clustering model
kmeans6.fit(X)

# Store the predicted cluster labels
labels6 = kmeans6.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels6))
```
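The five near-identical blocks above can also be condensed into a single loop. The sketch below substitutes a synthetic `make_blobs` dataset for `X` and `y` (an assumption, so the snippet runs on its own without `creditcard.csv`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Synthetic stand-in for the credit-card data, purely for illustration
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

v_scores = []
N_Clusters = [2, 3, 4, 5, 6]
for n in N_Clusters:
    model = KMeans(n_clusters=n, n_init=10, random_state=0)
    labels = model.fit_predict(X)   # fit and predict in one step
    v_scores.append(v_measure_score(y, labels))

print(dict(zip(N_Clusters, v_scores)))
```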

Step 4: Visualize the results and performance comparison

```python
# Plot a bar chart comparing the models
plt.bar(N_Clusters, v_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('V-Measure Score')
plt.title('Comparison of different Clustering Models')
plt.show()
```
