Dunn index Python

Python Methods and Functions




Dunn index and DB index — Clustered Indexes | Set 1

Different performance metrics are used to evaluate different machine learning algorithms. In case of a classification problem, we have a variety of performance measures to assess how good our model is. For cluster analysis, the analogous question is how to assess the “goodness” of the resulting clusters?




Why do we need cluster validity indices?

  • To compare clustering algorithms.
  • To compare two sets of clusters.
  • To compare two clusters, ie which one is better in terms of compression and connection.

In general, cluster validity measures are classified into 3 classes, which are:

  • Validation of the internal cluster: The result of clustering is evaluated against the data of the cluster itself (internal information) without reference to external information.
  • External cluster validation: The results of clustering are evaluated against some externally known results, such as class labels provided externally.
  • Relative validation of clusters: The results of clustering are evaluated by varying different parameters for the same algorithm (for example, by changing the number of clusters).

In addition to the term cluster validity index, we need to know the inter-cluster distance d (a, b) between two clusters a, b and the intra-cluster index D (a) of cluster a.

The distance between the clusters d (a, b) between two clusters a and b can be:

Single link distance: closest distance between two objects belonging respectively to a and b.

Full link distance: distance between two other distant objects belonging respectively to a and b. Average connection distance: Average distance between all the objects belonging respectively to a and b.

Centroid Connection Distance: Distance between the center of gravity of the two clusters a and b respectively. The intra-cluster distance D (a) of a cluster a can be -

Full diameter connection distance: distance between the two most distant objects belonging to cluster a.

Average diameter connection distance: Average distance between all objects belonging to cluster a. Connection distance of the diameter of the center of gravity: twice the average distance between all objects and the center of gravity of the cluster a.

Now let's talk about 2 internal cluster validity indices, namely the Dunn index and the DB index.




Dunn's index

The Dunn Index (DI) (introduced by J. C. Dunn in 1974), a metric for evaluating clustering algorithms, is an internal scoring scheme, in which the result is based on the aggregated data itself. Like all other indices of this type, the aim of this Dunn index is to identify sets of compact clusters, with low variance between cluster members and well separated, in which the means of the different clusters are sufficiently far from inside the variance of the cluster.

The higher the value of the Dunn index, the better the aggregation. The number of clusters which maximizes the Dunn index is considered to be the optimal number of clusters k. It also has some drawbacks. As the number of clusters and the dimensionality of the data increase, so does the cost of computation.

The Dunn index for the number c of clusters is implemented (Python 3) as:

import pandas as pd
from sklearn import datasets
from jqmcvi import base

# loading the dataset
X = datasets.load_iris()
df = pd.DataFrame(X.data)

# K-Means
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df) #K-means training
y_pred = k_means.predict(df)

# We store the K-means results in a dataframe
pred = pd.DataFrame(y_pred)
pred.columns = ['Type']

# we merge this dataframe with df
prediction = pd.concat([df, pred], axis = 1)

# We store the clusters
clus0 = prediction.loc[prediction.Species == 0]
clus1 = prediction.loc[prediction.Species == 1]
clus2 = prediction.loc[prediction.Species == 2]
cluster_list = [clus0.values, clus1.values, clus2.values]

print(base.dunn(cluster_list))





Dunn index GitHub

import numpy as np
from sklearn.preprocessing import LabelEncoder

DIAMETER_METHODS = ['mean_cluster', 'farthest']
CLUSTER_DISTANCE_METHODS = ['nearest', 'farthest']


def inter_cluster_distances(labels, distances, method='nearest'):
    """Calculates the distances between the two nearest points of each cluster.
    :param labels: a list containing cluster labels for each of the n elements
    :param distances: an n x n numpy.array containing the pairwise distances between elements
    :param method: `nearest` for the distances between the two nearest points in each cluster, or `farthest`
    """
    if method not in CLUSTER_DISTANCE_METHODS:
        raise ValueError(
            'method must be one of {}'.format(CLUSTER_DISTANCE_METHODS))

    if method == 'nearest':
        return __cluster_distances_by_points(labels, distances)
    elif method == 'farthest':
        return __cluster_distances_by_points(labels, distances, farthest=True)


def __cluster_distances_by_points(labels, distances, farthest=False):
    n_unique_labels = len(np.unique(labels))
    cluster_distances = np.full((n_unique_labels, n_unique_labels),
                                float('inf') if not farthest else 0)

    np.fill_diagonal(cluster_distances, 0)

    for i in np.arange(0, len(labels) - 1):
        for ii in np.arange(i, len(labels)):
            if labels[i] != labels[ii] and (
                (not farthest and
                 distances[i, ii] < cluster_distances[labels[i], labels[ii]])
                    or
                (farthest and
                 distances[i, ii] > cluster_distances[labels[i], labels[ii]])):
                cluster_distances[labels[i], labels[ii]] = cluster_distances[
                    labels[ii], labels[i]] = distances[i, ii]
    return cluster_distances


def diameter(labels, distances, method='farthest'):
    """Calculates cluster diameters
    :param labels: a list containing cluster labels for each of the n elements
    :param distances: an n x n numpy.array containing the pairwise distances between elements
    :param method: either `mean_cluster` for the mean distance between all elements in each cluster, or `farthest` for the distance between the two points furthest from each other
    """
    if method not in DIAMETER_METHODS:
        raise ValueError('method must be one of {}'.format(DIAMETER_METHODS))

    n_clusters = len(np.unique(labels))
    diameters = np.zeros(n_clusters)

    if method == 'mean_cluster':
        for i in range(0, len(labels) - 1):
            for ii in range(i + 1, len(labels)):
                if labels[i] == labels[ii]:
                    diameters[labels[i]] += distances[i, ii]

        for i in range(len(diameters)):
            diameters[i] /= sum(labels == i)

    elif method == 'farthest':
        for i in range(0, len(labels) - 1):
            for ii in range(i + 1, len(labels)):
                if labels[i] == labels[ii] and distances[i, ii] > diameters[
                        labels[i]]:
                    diameters[labels[i]] = distances[i, ii]
    return diameters


def dunn(labels, distances, diameter_method='farthest',
         cdist_method='nearest'):
    """
    Dunn index for cluster validation (larger is better).
    
    .. math:: D = \\min_{i = 1 \\ldots n_c; j = i + 1\ldots n_c} \\left\\lbrace \\frac{d \\left( c_i,c_j \\right)}{\\max_{k = 1 \\ldots n_c} \\left(diam \\left(c_k \\right) \\right)} \\right\\rbrace
    
    where :math:`d(c_i,c_j)` represents the distance between
    clusters :math:`c_i` and :math:`c_j`, and :math:`diam(c_k)` is the diameter of cluster :math:`c_k`.
    Inter-cluster distance can be defined in many ways, such as the distance between cluster centroids or between their closest elements. Cluster diameter can be defined as the mean distance between all elements in the cluster, between all elements to the cluster centroid, or as the distance between the two furthest elements.
    The higher the value of the resulting Dunn index, the better the clustering
    result is considered, since higher values indicate that clusters are
    compact (small :math:`diam(c_k)`) and far apart (large :math:`d \\left( c_i,c_j \\right)`).
    :param labels: a list containing cluster labels for each of the n elements
    :param distances: an n x n numpy.array containing the pairwise distances between elements
    :param diameter_method: see :py:function:`diameter` `method` parameter
    :param cdist_method: see :py:function:`diameter` `method` parameter
    

    labels = LabelEncoder().fit(labels).transform(labels)

    ic_distances = inter_cluster_distances(labels, distances, cdist_method)
    min_distance = min(ic_distances[ic_distances.nonzero()])
    max_diameter = max(diameter(labels, distances, diameter_method))

    return min_distance / max_diameter


if __name__ == '__main__':
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans

    data = load_iris()
    kmeans = KMeans(n_clusters=3)
    c = data['target']
    x = data['data']
    k = kmeans.fit_predict(x)
    d = euclidean_distances(x)

    for diameter_method in DIAMETER_METHODS:
        for cdist_method in CLUSTER_DISTANCE_METHODS:
            dund = dunn(c, d, diameter_method, cdist_method)
            dunk = dunn(k, d, diameter_method, cdist_method)
            print(diameter_method, cdist_method, dund, dunk)
@oktavianidewi

This is a Python implementation of the Dunn index, which is used to evaluate the results of clustering. While, officially, I have unwittingly followed the reference below, Wikipedia has been much more helpful in this endeavor. I also want to thank the people who used the code and commented in the GitHub gist where the code below is hosted. I have received many bug reports and helpful suggestions.

Kovács, F .; Legány, C. & Babos, A. Techniques for measuring the validity of clusters. 6th International Symposium of Hungarian Researchers on Computational Intelligence, 2005.

Closest measures the distance between two clusters as the distance between the two closest points in each cluster. the farthest one does something similar, but using the two points in each group that are farthest from each other.

DIAMETER_METHODS = ['mean_cluster', 'farthest'] are the two methods of calculating the diameter of a cluster that I implemented in the code. Mean_cluster uses the average distance between all the elements in each cluster as the diameter and furthest uses the distance between the two points furthest from each other inside the same cluster.




Optimizing Dunn Index calculation

StackOverflow question

The Dunn Index is a method of evaluating clustering. A higher value is better. It is calculated as the lowest intercluster distance (ie. the smallest distance between any two cluster centroids) divided by the highest intracluster distance (ie. the largest distance between any two points in any cluster).

I have a code snippet for calculating the Dunn Index:

def dunn_index(pf, cf):
    """
    pf -- all data points
    cf -- cluster centroids
    """
    numerator = inf
    for c in cf: # for each cluster
        for t in cf: # for each cluster
            if t is c: continue # if same cluster, ignore
            numerator = min(numerator, distance(t, c)) # find distance between centroids
    denominator = 0
    for c in cf: # for each cluster
        for p in pf: # for each point
            if p.get_cluster() is not c: continue # if point not in cluster, ignore
            for t in pf: # for each point
                if t.get_cluster() is not c: continue # if point not in cluster, ignore
                if t is p: continue # if same point, ignore
                denominator = max(denominator, distance(t, p))
    return numerator/denominator

The problem is this is exceptionally slow: for an example data set consisting of 5000 instances and 15 clusters, the function above needs to perform just over 375 million distance calculations at worst. Realistically it's much lower, but even a best case, where the data is ordered by cluster already, is around 25 million distance calculations. I want to shave time off of it, and I've already tried rectilinear distance vs. euclidean and it's not good.

How can I improve this algorithm?

Answer:

TLDR: Importantly, the problem is set up in two-dimensions. For large dimensions, these techniques can be ineffective.

In 2D, we can compute the diameter (intracluster distance) of each cluster in O(n log n) time where n is the cluster size using convex hulls. Vectorization is used to speed up remaining operations. There are two possible asymptotic improvements mentioned at the end of the post, contributions welcome ;)


Setup and fake data:

import numpy as np
from scipy import spatial
from matplotlib import pyplot as plt

# set up fake data
np.random.seed(0)
n_centroids = 1000
centroids = np.random.rand(n_centroids, 2)
cluster_sizes = np.random.randint(1, 1000, size=n_centroids)
# labels from 1 to n_centroids inclusive
labels = np.repeat(np.arange(n_centroids), cluster_sizes) + 1
points = np.zeros((cluster_sizes.sum(), 2))
points[:,0] = np.repeat(centroids[:,0], cluster_sizes)
points[:,1] = np.repeat(centroids[:,1], cluster_sizes)
points += 0.05 * np.random.randn(cluster_sizes.sum(), 2)

Looks somewhat like this:

enter image description here

Next, we define a diameter function for computing the largest intracluster distance, based on this approach using the convex hull.

# compute the diameter based on convex hull 
def diameter(pts):
  # need at least 3 points to construct the convex hull
  if pts.shape[0] <= 1:
    return 0
  if pts.shape[0] == 2:
    return ((pts[0] - pts[1])**2).sum()
  # two points which are fruthest apart will occur as vertices of the convex hull
  hull = spatial.ConvexHull(pts)
  candidates = pts[spatial.ConvexHull(pts).vertices]
  return spatial.distance_matrix(candidates, candidates).max()

For the Dunn index calculation, I assume that we have already computed the points, the cluster labels and the cluster centroids.

If the number of clusters is large, the following solution based on Pandas may perform well:

import pandas as pd
def dunn_index_pandas(pts, labels, centroids):
  # O(k n log(n)) with k clusters and n points; better performance with more even clusters
  max_intracluster_dist = pd.DataFrame(pts).groupby(labels).agg(diameter_pandas)[0].max()
  # O(k^2) with k clusters; can be reduced to O(k log(k))
  # get pairwise distances between centroids
  cluster_dmat = spatial.distance_matrix(centroids, centroids)
  # fill diagonal with +inf: ignore zero distance to self in "min" computation
  np.fill_diagonal(cluster_dmat, np.inf)
  min_intercluster_dist = cluster_sizes.min()
  return min_intercluster_dist / max_intracluster_dist

Otherwise, we can continue with a pure numpy solution.

def dunn_index(pts, labels, centroids):
  # O(k n log(n)) with k clusters and n points; better performance with more even clusters
  max_intracluster_dist = max(diameter(pts[labels==i]) for i in np.unique(labels))
  # O(k^2) with k clusters; can be reduced to O(k log(k))
  # get pairwise distances between centroids
  cluster_dmat = spatial.distance_matrix(centroids, centroids)
  # fill diagonal with +inf: ignore zero distance to self in "min" computation
  np.fill_diagonal(cluster_dmat, np.inf)
  min_intercluster_dist = cluster_sizes.min()
  return min_intercluster_dist / max_intracluster_dist

%time dunn_index(points, labels, centroids)
# returned value 2.15
# in 2.2 seconds
%time dunn_index_pandas(points, labels, centroids)
# returned 2.15
# in 885 ms

For 1000 clusters with i.i.d. ~U[1,1000] cluster sizes this takes 2.2. seconds on my machine. This number drops to .8 seconds with the Pandas approach for this example (many small clusters).

There are two further optimization opportunities that are relevant when the number of clusters is large:

  • First, I am computing the minimal intercluster distance with a brute force O(k^2) approach where k is the number of clusters. This can be reduced to O(k log(k)), as discussed here.

  • Second, max(diameter(pts[labels==i]) for i in np.unique(labels)) requires k passes over an array of size n. With many clusters this can become the bottleneck (as in this example). This is somewhat mitigated with the pandas approach, but I expect that this can be optimized a lot further. For current parameters, roughly one third of compute time is spent outside of computing intercluser of intracluster distances.





Get Solution for free from DataCamp guru