# ML | V-measure for evaluating the effectiveness of clustering

Computing the V-measure first requires computing two terms:

1. Homogeneity: a perfectly homogeneous clustering is one in which each cluster contains only data points belonging to the same class label. Homogeneity describes how close the clustering algorithm comes to this ideal.
2. Completeness: a perfectly complete clustering is one in which all data points belonging to the same class are assigned to the same cluster. Completeness describes how close the clustering algorithm comes to this ideal.

Trivial homogeneity: the case in which the number of clusters equals the number of data points, so each point sits in its own cluster. This is an extreme case where homogeneity is maximal and completeness is minimal. Trivial completeness: the case in which all data points are grouped into a single cluster. This is an extreme case where homogeneity is minimal and completeness is maximal.

Note: Homogeneity differs from completeness in which direction it examines the labeling. For homogeneity the reference unit is the cluster: we check whether every data point within each cluster has the same class label. For completeness the reference unit is the class label: we check whether all data points of each class label end up in the same cluster. A clustering can be perfectly homogeneous without being complete: if every cluster is pure but the points of one class are split across several clusters, homogeneity is perfect while completeness is not. Conversely, a clustering can be perfectly complete without being homogeneous: if each class sits entirely inside one cluster but some cluster mixes points of several classes, completeness is perfect while homogeneity is not.
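A small check with scikit-learn's `homogeneity_score` and `completeness_score` makes the asymmetry concrete (the toy label lists below are made up for illustration):

```python
from sklearn.metrics import homogeneity_score, completeness_score

# Ground-truth class labels for six data points (hypothetical toy data)
y_true = [0, 0, 1, 1, 2, 2]

# Homogeneous but not complete: every cluster is pure,
# but class 0 is split across clusters 0 and 1
pred_a = [0, 1, 2, 2, 3, 3]
print(homogeneity_score(y_true, pred_a))   # 1.0
print(completeness_score(y_true, pred_a))  # < 1.0

# Complete but not homogeneous: each class sits in one cluster,
# but cluster 0 mixes classes 0 and 1
pred_b = [0, 0, 0, 0, 1, 1]
print(homogeneity_score(y_true, pred_b))   # < 1.0
print(completeness_score(y_true, pred_b))  # 1.0
```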

Suppose there are $N$ data samples, $C$ different class labels, $K$ clusters, and $n_{c,k}$ data points belonging to class $c$ and cluster $k$. Homogeneity $h$ is then defined as

$$h = 1 - \frac{H(C|K)}{H(C)}$$

where

$$H(C|K) = -\sum_{k=1}^{K}\sum_{c=1}^{C}\frac{n_{c,k}}{N}\log\left(\frac{n_{c,k}}{n_k}\right) \quad\text{and}\quad H(C) = -\sum_{c=1}^{C}\frac{n_c}{N}\log\left(\frac{n_c}{N}\right)$$

Completeness $c$ is defined as

$$c = 1 - \frac{H(K|C)}{H(K)}$$

where

$$H(K|C) = -\sum_{c=1}^{C}\sum_{k=1}^{K}\frac{n_{c,k}}{N}\log\left(\frac{n_{c,k}}{n_c}\right) \quad\text{and}\quad H(K) = -\sum_{k=1}^{K}\frac{n_k}{N}\log\left(\frac{n_k}{N}\right)$$

Here $n_k = \sum_c n_{c,k}$ is the size of cluster $k$ and $n_c = \sum_k n_{c,k}$ is the size of class $c$. The weighted V-measure $V_\beta$ is then defined as

$$V_\beta = \frac{(1+\beta)\,h\,c}{\beta\,h + c}$$

The factor $\beta$ can be adjusted to favor either the homogeneity or the completeness of the clustering algorithm.
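The definitions above can be implemented from scratch and checked against scikit-learn's `v_measure_score`. The sketch below is illustrative (the helper name `v_measure` and the toy labels are assumptions, not library API):

```python
import numpy as np
from sklearn.metrics import v_measure_score

def v_measure(y_true, y_pred, beta=1.0):
    # Build the contingency table n_ck[c, k]: points of class c in cluster k
    _, y_true = np.unique(y_true, return_inverse=True)
    _, y_pred = np.unique(y_pred, return_inverse=True)
    N = len(y_true)
    n_ck = np.zeros((y_true.max() + 1, y_pred.max() + 1))
    np.add.at(n_ck, (y_true, y_pred), 1)
    n_c = n_ck.sum(axis=1)          # class sizes
    n_k = n_ck.sum(axis=0)          # cluster sizes

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    nz = n_ck > 0                   # skip empty cells to avoid log(0)
    H_C_given_K = -np.sum((n_ck / N)[nz] * np.log((n_ck / n_k)[nz]))
    H_K_given_C = -np.sum((n_ck / N)[nz] * np.log((n_ck / n_c[:, None])[nz]))
    H_C, H_K = entropy(n_c / N), entropy(n_k / N)
    h = 1.0 if H_C == 0 else 1 - H_C_given_K / H_C
    c = 1.0 if H_K == 0 else 1 - H_K_given_C / H_K
    if beta * h + c == 0:
        return 0.0
    return (1 + beta) * h * c / (beta * h + c)

y = [0, 0, 1, 1, 2, 2]
labels = [0, 0, 0, 1, 2, 2]
print(v_measure(y, labels), v_measure_score(y, labels))  # the two should agree
```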

The main advantage of this metric is that its score is independent of the number of class labels, the number of clusters, the size of the data set, and the clustering algorithm used, which makes it a very reliable way to compare clusterings.

The following code demonstrates how to compute the V-measure of a clustering algorithm. The data used is the Credit Card Fraud Detection dataset, which can be downloaded from Kaggle. The clustering algorithm is K-Means.

Step 1: Import the required libraries

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score
```

Step 2: Load and clean the data

```python
# Change the working directory to the file location (IPython magic)
cd C:\Users\Dev\Desktop\Kaggle\Credit Card Fraud

# Load the data
df = pd.read_csv('creditcard.csv')

# Separate the dependent and independent variables
y = df['Class']
X = df.drop('Class', axis=1)

X.head()
```

Step 3: Build different clustering models and compare their V-measure scores

At this stage, five different K-Means clustering models will be built, with each model clustering the data into a different number of clusters.

```python
# List of V-measure scores for the different models
v_scores = []

# List of the different numbers of clusters to try
N_Clusters = [2, 3, 4, 5, 6]
```

a) n_clusters = 2

```python
# Build the clustering model
kmeans2 = KMeans(n_clusters=2)

# Train the clustering model
kmeans2.fit(X)

# Store the predicted cluster labels
labels2 = kmeans2.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels2))
```

b) n_clusters = 3

```python
# Build the clustering model
kmeans3 = KMeans(n_clusters=3)

# Train the clustering model
kmeans3.fit(X)

# Store the predicted cluster labels
labels3 = kmeans3.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels3))
```

c) n_clusters = 4

```python
# Build the clustering model
kmeans4 = KMeans(n_clusters=4)

# Train the clustering model
kmeans4.fit(X)

# Store the predicted cluster labels
labels4 = kmeans4.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels4))
```

d) n_clusters = 5

```python
# Build the clustering model
kmeans5 = KMeans(n_clusters=5)

# Train the clustering model
kmeans5.fit(X)

# Store the predicted cluster labels
labels5 = kmeans5.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels5))
```

e) n_clusters = 6

```python
# Build the clustering model
kmeans6 = KMeans(n_clusters=6)

# Train the clustering model
kmeans6.fit(X)

# Store the predicted cluster labels
labels6 = kmeans6.predict(X)

# Evaluate performance
v_scores.append(v_measure_score(y, labels6))
```
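The five near-identical blocks above can also be condensed into a single loop. The sketch below substitutes a synthetic `make_blobs` dataset for `X` and `y` (an assumption, so the snippet runs on its own without `creditcard.csv`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Synthetic stand-in for the credit-card data, purely for illustration
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

v_scores = []
N_Clusters = [2, 3, 4, 5, 6]
for n in N_Clusters:
    model = KMeans(n_clusters=n, n_init=10, random_state=0)
    labels = model.fit_predict(X)   # fit and predict in one step
    v_scores.append(v_measure_score(y, labels))

print(dict(zip(N_Clusters, v_scores)))
```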

Step 4: Visualize the results and performance comparison

```python
# Plot a bar chart comparing the models
plt.bar(N_Clusters, v_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('V-Measure Score')
plt.title('Comparison of different Clustering Models')
plt.show()
```
