  # Dunn index Python

Python Methods and Functions

## Dunn index and DB index — Clustered Indexes | Set 1

Different performance metrics are used to evaluate different machine learning algorithms. In case of a classification problem, we have a variety of performance measures to assess how good our model is. For cluster analysis, the analogous question is how to assess the “goodness” of the resulting clusters?

## Why do we need cluster validity indices?

• To compare clustering algorithms.
• To compare two sets of clusters.
• To compare two clusters, ie which one is better in terms of compression and connection. In general, cluster validity measures are classified into 3 classes, which are:

• Validation of the internal cluster: The result of clustering is evaluated against the data of the cluster itself (internal information) without reference to external information.
• External cluster validation: The results of clustering are evaluated against some externally known results, such as class labels provided externally.
• Relative validation of clusters: The results of clustering are evaluated by varying different parameters for the same algorithm (for example, by changing the number of clusters).

In addition to the term cluster validity index, we need to know the inter-cluster distance d (a, b) between two clusters a, b and the intra-cluster index D (a) of cluster a.

The distance between the clusters d (a, b) between two clusters a and b can be:

Single link distance: closest distance between two objects belonging respectively to a and b.

Full link distance: distance between two other distant objects belonging respectively to a and b. Average connection distance: Average distance between all the objects belonging respectively to a and b.

Centroid Connection Distance: Distance between the center of gravity of the two clusters a and b respectively. The intra-cluster distance D (a) of a cluster a can be -

Full diameter connection distance: distance between the two most distant objects belonging to cluster a.

Average diameter connection distance: Average distance between all objects belonging to cluster a. Connection distance of the diameter of the center of gravity: twice the average distance between all objects and the center of gravity of the cluster a.

Now let's talk about 2 internal cluster validity indices, namely the Dunn index and the DB index.

## Dunn's index

The Dunn Index (DI) (introduced by J. C. Dunn in 1974), a metric for evaluating clustering algorithms, is an internal scoring scheme, in which the result is based on the aggregated data itself. Like all other indices of this type, the aim of this Dunn index is to identify sets of compact clusters, with low variance between cluster members and well separated, in which the means of the different clusters are sufficiently far from inside the variance of the cluster.

The higher the value of the Dunn index, the better the aggregation. The number of clusters which maximizes the Dunn index is considered to be the optimal number of clusters k. It also has some drawbacks. As the number of clusters and the dimensionality of the data increase, so does the cost of computation.

The Dunn index for the number c of clusters is implemented (Python 3) as:

import pandas as pd
from sklearn import datasets
from jqmcvi import base

df = pd.DataFrame(X.data)

# K-Means
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df) #K-means training
y_pred = k_means.predict(df)

# We store the K-means results in a dataframe
pred = pd.DataFrame(y_pred)
pred.columns = ['Type']

# we merge this dataframe with df
prediction = pd.concat([df, pred], axis = 1)

# We store the clusters
clus0 = prediction.loc[prediction.Species == 0]
clus1 = prediction.loc[prediction.Species == 1]
clus2 = prediction.loc[prediction.Species == 2]
cluster_list = [clus0.values, clus1.values, clus2.values]

print(base.dunn(cluster_list))



## Dunn index GitHub

import numpy as np
from sklearn.preprocessing import LabelEncoder

DIAMETER_METHODS = ['mean_cluster', 'farthest']
CLUSTER_DISTANCE_METHODS = ['nearest', 'farthest']

def inter_cluster_distances(labels, distances, method='nearest'):
"""Calculates the distances between the two nearest points of each cluster.
:param labels: a list containing cluster labels for each of the n elements
:param distances: an n x n numpy.array containing the pairwise distances between elements
:param method: nearest for the distances between the two nearest points in each cluster, or farthest
"""
if method not in CLUSTER_DISTANCE_METHODS:
raise ValueError(
'method must be one of {}'.format(CLUSTER_DISTANCE_METHODS))

if method == 'nearest':
return __cluster_distances_by_points(labels, distances)
elif method == 'farthest':
return __cluster_distances_by_points(labels, distances, farthest=True)

def __cluster_distances_by_points(labels, distances, farthest=False):
n_unique_labels = len(np.unique(labels))
cluster_distances = np.full((n_unique_labels, n_unique_labels),
float('inf') if not farthest else 0)

np.fill_diagonal(cluster_distances, 0)

for i in np.arange(0, len(labels) - 1):
for ii in np.arange(i, len(labels)):
if labels[i] != labels[ii] and (
(not farthest and
distances[i, ii] < cluster_distances[labels[i], labels[ii]])
or
(farthest and
distances[i, ii] > cluster_distances[labels[i], labels[ii]])):
cluster_distances[labels[i], labels[ii]] = cluster_distances[
labels[ii], labels[i]] = distances[i, ii]
return cluster_distances

def diameter(labels, distances, method='farthest'):
"""Calculates cluster diameters
:param labels: a list containing cluster labels for each of the n elements
:param distances: an n x n numpy.array containing the pairwise distances between elements
:param method: either mean_cluster for the mean distance between all elements in each cluster, or farthest for the distance between the two points furthest from each other
"""
if method not in DIAMETER_METHODS:
raise ValueError('method must be one of {}'.format(DIAMETER_METHODS))

n_clusters = len(np.unique(labels))
diameters = np.zeros(n_clusters)

if method == 'mean_cluster':
for i in range(0, len(labels) - 1):
for ii in range(i + 1, len(labels)):
if labels[i] == labels[ii]:
diameters[labels[i]] += distances[i, ii]

for i in range(len(diameters)):
diameters[i] /= sum(labels == i)

elif method == 'farthest':
for i in range(0, len(labels) - 1):
for ii in range(i + 1, len(labels)):
if labels[i] == labels[ii] and distances[i, ii] > diameters[
labels[i]]:
diameters[labels[i]] = distances[i, ii]
return diameters

def dunn(labels, distances, diameter_method='farthest',
cdist_method='nearest'):
"""
Dunn index for cluster validation (larger is better).

.. math:: D = \\min_{i = 1 \\ldots n_c; j = i + 1\ldots n_c} \\left\\lbrace \\frac{d \\left( c_i,c_j \\right)}{\\max_{k = 1 \\ldots n_c} \\left(diam \\left(c_k \\right) \\right)} \\right\\rbrace

where :math:d(c_i,c_j) represents the distance between
clusters :math:c_i and :math:c_j, and :math:diam(c_k) is the diameter of cluster :math:c_k.
Inter-cluster distance can be defined in many ways, such as the distance between cluster centroids or between their closest elements. Cluster diameter can be defined as the mean distance between all elements in the cluster, between all elements to the cluster centroid, or as the distance between the two furthest elements.
The higher the value of the resulting Dunn index, the better the clustering
result is considered, since higher values indicate that clusters are
compact (small :math:diam(c_k)) and far apart (large :math:d \\left( c_i,c_j \\right)).
:param labels: a list containing cluster labels for each of the n elements
:param distances: an n x n numpy.array containing the pairwise distances between elements
:param diameter_method: see :py:function:diameter method parameter
:param cdist_method: see :py:function:diameter method parameter

labels = LabelEncoder().fit(labels).transform(labels)

ic_distances = inter_cluster_distances(labels, distances, cdist_method)
min_distance = min(ic_distances[ic_distances.nonzero()])
max_diameter = max(diameter(labels, distances, diameter_method))

return min_distance / max_diameter

if __name__ == '__main__':
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
c = data['target']
x = data['data']
k = kmeans.fit_predict(x)
d = euclidean_distances(x)

for diameter_method in DIAMETER_METHODS:
for cdist_method in CLUSTER_DISTANCE_METHODS:
dund = dunn(c, d, diameter_method, cdist_method)
dunk = dunn(k, d, diameter_method, cdist_method)
print(diameter_method, cdist_method, dund, dunk)
@oktavianidewi



This is a Python implementation of the Dunn index, which is used to evaluate the results of clustering. While, officially, I have unwittingly followed the reference below, Wikipedia has been much more helpful in this endeavor. I also want to thank the people who used the code and commented in the GitHub gist where the code below is hosted. I have received many bug reports and helpful suggestions.

Kovács, F .; Legány, C. & Babos, A. Techniques for measuring the validity of clusters. 6th International Symposium of Hungarian Researchers on Computational Intelligence, 2005.

Closest measures the distance between two clusters as the distance between the two closest points in each cluster. the farthest one does something similar, but using the two points in each group that are farthest from each other.

DIAMETER_METHODS = ['mean_cluster', 'farthest'] are the two methods of calculating the diameter of a cluster that I implemented in the code. Mean_cluster uses the average distance between all the elements in each cluster as the diameter and furthest uses the distance between the two points furthest from each other inside the same cluster.

## Optimizing Dunn Index calculation

### StackOverflow question

The Dunn Index is a method of evaluating clustering. A higher value is better. It is calculated as the lowest intercluster distance (ie. the smallest distance between any two cluster centroids) divided by the highest intracluster distance (ie. the largest distance between any two points in any cluster).

I have a code snippet for calculating the Dunn Index:

def dunn_index(pf, cf):
"""
pf -- all data points
cf -- cluster centroids
"""
numerator = inf
for c in cf: # for each cluster
for t in cf: # for each cluster
if t is c: continue # if same cluster, ignore
numerator = min(numerator, distance(t, c)) # find distance between centroids
denominator = 0
for c in cf: # for each cluster
for p in pf: # for each point
if p.get_cluster() is not c: continue # if point not in cluster, ignore
for t in pf: # for each point
if t.get_cluster() is not c: continue # if point not in cluster, ignore
if t is p: continue # if same point, ignore
denominator = max(denominator, distance(t, p))
return numerator/denominator


The problem is this is exceptionally slow: for an example data set consisting of 5000 instances and 15 clusters, the function above needs to perform just over 375 million distance calculations at worst. Realistically it's much lower, but even a best case, where the data is ordered by cluster already, is around 25 million distance calculations. I want to shave time off of it, and I've already tried rectilinear distance vs. euclidean and it's not good.

How can I improve this algorithm?

TLDR: Importantly, the problem is set up in two-dimensions. For large dimensions, these techniques can be ineffective.

In 2D, we can compute the diameter (intracluster distance) of each cluster in O(n log n) time where n is the cluster size using convex hulls. Vectorization is used to speed up remaining operations. There are two possible asymptotic improvements mentioned at the end of the post, contributions welcome ;)

Setup and fake data:

import numpy as np
from scipy import spatial
from matplotlib import pyplot as plt

# set up fake data
np.random.seed(0)
n_centroids = 1000
centroids = np.random.rand(n_centroids, 2)
cluster_sizes = np.random.randint(1, 1000, size=n_centroids)
# labels from 1 to n_centroids inclusive
labels = np.repeat(np.arange(n_centroids), cluster_sizes) + 1
points = np.zeros((cluster_sizes.sum(), 2))
points[:,0] = np.repeat(centroids[:,0], cluster_sizes)
points[:,1] = np.repeat(centroids[:,1], cluster_sizes)
points += 0.05 * np.random.randn(cluster_sizes.sum(), 2)


Looks somewhat like this: Next, we define a diameter function for computing the largest intracluster distance, based on this approach using the convex hull.

# compute the diameter based on convex hull
def diameter(pts):
# need at least 3 points to construct the convex hull
if pts.shape <= 1:
return 0
if pts.shape == 2:
return ((pts - pts)**2).sum()
# two points which are fruthest apart will occur as vertices of the convex hull
hull = spatial.ConvexHull(pts)
candidates = pts[spatial.ConvexHull(pts).vertices]
return spatial.distance_matrix(candidates, candidates).max()


For the Dunn index calculation, I assume that we have already computed the points, the cluster labels and the cluster centroids.

If the number of clusters is large, the following solution based on Pandas may perform well:

import pandas as pd
def dunn_index_pandas(pts, labels, centroids):
# O(k n log(n)) with k clusters and n points; better performance with more even clusters
max_intracluster_dist = pd.DataFrame(pts).groupby(labels).agg(diameter_pandas).max()
# O(k^2) with k clusters; can be reduced to O(k log(k))
# get pairwise distances between centroids
cluster_dmat = spatial.distance_matrix(centroids, centroids)
# fill diagonal with +inf: ignore zero distance to self in "min" computation
np.fill_diagonal(cluster_dmat, np.inf)
min_intercluster_dist = cluster_sizes.min()
return min_intercluster_dist / max_intracluster_dist


Otherwise, we can continue with a pure numpy solution.

def dunn_index(pts, labels, centroids):
# O(k n log(n)) with k clusters and n points; better performance with more even clusters
max_intracluster_dist = max(diameter(pts[labels==i]) for i in np.unique(labels))
# O(k^2) with k clusters; can be reduced to O(k log(k))
# get pairwise distances between centroids
cluster_dmat = spatial.distance_matrix(centroids, centroids)
# fill diagonal with +inf: ignore zero distance to self in "min" computation
np.fill_diagonal(cluster_dmat, np.inf)
min_intercluster_dist = cluster_sizes.min()
return min_intercluster_dist / max_intracluster_dist

%time dunn_index(points, labels, centroids)
# returned value 2.15
# in 2.2 seconds
%time dunn_index_pandas(points, labels, centroids)
# returned 2.15
# in 885 ms


For 1000 clusters with i.i.d. ~U[1,1000] cluster sizes this takes 2.2. seconds on my machine. This number drops to .8 seconds with the Pandas approach for this example (many small clusters).

There are two further optimization opportunities that are relevant when the number of clusters is large:

• First, I am computing the minimal intercluster distance with a brute force O(k^2) approach where k is the number of clusters. This can be reduced to O(k log(k)), as discussed here.

• Second, max(diameter(pts[labels==i]) for i in np.unique(labels)) requires k passes over an array of size n. With many clusters this can become the bottleneck (as in this example). This is somewhat mitigated with the pandas approach, but I expect that this can be optimized a lot further. For current parameters, roughly one third of compute time is spent outside of computing intercluser of intracluster distances.