
Dunn Index in Python

The Dunn Index (DI) is one of the metrics used to evaluate clustering results. It is most commonly used to assess the quality of the partition produced by a k-means clustering algorithm for a given number of clusters.

What is the Dunn Index?

The Dunn Index is calculated as the ratio of the smallest inter-cluster distance to the largest intra-cluster distance.

Clustering validity index: the problem

Description of the Dunn Index

This part describes each step of the calculation and provides practical examples to help you better understand the formula.

How to calculate the Dunn Index

1. Calculate the distance between clusters

The first step is to compute the distance between clusters. There are several ways to calculate this.

Between-cluster distance measures the distance between observations that belong to two different clusters.
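For illustration, here is a minimal sketch (separate from the full implementation further below) that computes the inter-cluster distance between two toy clusters under both the nearest-point and farthest-point definitions; the toy arrays and the use of SciPy's cdist are assumptions made for this example:

import numpy as np
from scipy.spatial.distance import cdist

# Two toy clusters as (n_points, n_features) arrays (made-up data for the example)
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])

# All pairwise Euclidean distances between points of the two clusters
pairwise = cdist(cluster_a, cluster_b)

# "nearest": distance between the two closest points of the two clusters
# "farthest": distance between their two most distant points
print(pairwise.min(), pairwise.max())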

2. Calculate intra-cluster distance

The second step is to compute the within-cluster distances. There are several ways to calculate this. Intra-cluster distance measures the distance between observations that belong to the same cluster.
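Similarly, a minimal sketch of two common intra-cluster distance (diameter) definitions for a single toy cluster (again, the data and the use of SciPy are assumptions made for this example):

import numpy as np
from scipy.spatial.distance import pdist

# One toy cluster as an (n_points, n_features) array
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

# All pairwise Euclidean distances within the cluster
within = pdist(cluster_a)

# "farthest" diameter: the two points furthest apart in the cluster
# "mean" diameter: the mean pairwise distance within the cluster
print(within.max(), within.mean())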

3. Calculate the Dunn Index

The Dunn Index is defined as the ratio of the smallest inter-cluster distance to the largest intra-cluster distance.

For a partition into n_c clusters c_1, …, c_{n_c}, the Dunn index is calculated as follows:

D = \min_{1 \le i < j \le n_c} \left\{ \frac{d(c_i, c_j)}{\max_{1 \le k \le n_c} \mathrm{diam}(c_k)} \right\}

where d(c_i, c_j) is the inter-cluster distance between clusters c_i and c_j, and diam(c_k) is the diameter (largest intra-cluster distance) of cluster c_k.

First, the inter-cluster distance function is minimized: this finds the distance between the two closest clusters.

Then the intra-cluster distance function is maximized: this finds the cluster with the largest diameter.

Finally, the ratio of these two values is taken. This ratio is the Dunn index; it ranges from 0 to infinity, and from the above we can conclude that the higher the value of the Dunn index, the better the clustering.
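Putting the three steps together, here is a minimal end-to-end sketch on made-up data, using the nearest-point inter-cluster distance and the farthest-point diameter (the full, more general implementation follows below):

import numpy as np
from scipy.spatial.distance import cdist, pdist

# A toy clustering: cluster label -> (n_points, n_features) array of its points
clusters = {
    0: np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
    1: np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]]),
    2: np.array([[0.0, 8.0], [1.0, 8.0]]),
}

labels = list(clusters)

# Step 1: smallest inter-cluster distance (nearest-point definition)
min_inter = min(cdist(clusters[a], clusters[b]).min()
                for i, a in enumerate(labels) for b in labels[i + 1:])

# Step 2: largest intra-cluster distance (farthest-point diameter)
max_intra = max(pdist(points).max() for points in clusters.values())

# Step 3: the Dunn index is their ratio
print(min_inter / max_intra)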

Dunn Index scikit-learn implementation


import numpy as np
from sklearn.preprocessing import LabelEncoder

DIAMETER_METHODS = ['mean_cluster', 'farthest']
CLUSTER_DISTANCE_METHODS = ['nearest', 'farthest']


def inter_cluster_distances(labels, distances, method='nearest'):
    """Calculates the distances between the two nearest points of each cluster.
    :param labels: a list containing cluster labels for each of the n elements
    :param distances: an n x n numpy.array containing the pairwise distances between elements
    :param method: `nearest` for the distances between the two nearest points in each cluster, or `farthest`
    """
    if method not in CLUSTER_DISTANCE_METHODS:
        raise ValueError(
            'method must be one of {}'.format(CLUSTER_DISTANCE_METHODS))

    if method == 'nearest':
        return __cluster_distances_by_points(labels, distances)
    elif method == 'farthest':
        return __cluster_distances_by_points(labels, distances, farthest=True)


def __cluster_distances_by_points(labels, distances, farthest=False):
    n_unique_labels = len(np.unique(labels))
    cluster_distances = np.full((n_unique_labels, n_unique_labels),
                                float('inf') if not farthest else 0)

    np.fill_diagonal(cluster_distances, 0)

    for i in range(0, len(labels) - 1):
        for ii in range(i + 1, len(labels)):
            if labels[i] != labels[ii] and (
                (not farthest and
                 distances[i, ii] < cluster_distances[labels[i], labels[ii]])
                    or
                (farthest and
                 distances[i, ii] > cluster_distances[labels[i], labels[ii]])):
                cluster_distances[labels[i], labels[ii]] = cluster_distances[
                    labels[ii], labels[i]] = distances[i, ii]
    return cluster_distances


def diameter(labels, distances, method='farthest'):
    """Calculates cluster diameters
    :param labels: a list containing cluster labels for each of the n elements
    :param distances: an n x n numpy.array containing the pairwise distances between elements
    :param method: either `mean_cluster` for the mean distance between all elements in each cluster, or `farthest` for the distance between the two points furthest from each other
    """
    if method not in DIAMETER_METHODS:
        raise ValueError('method must be one of {}'.format(DIAMETER_METHODS))

    n_clusters = len(np.unique(labels))
    diameters = np.zeros(n_clusters)

    if method == 'mean_cluster':
        for i in range(0, len(labels) - 1):
            for ii in range(i + 1, len(labels)):
                if labels[i] == labels[ii]:
                    diameters[labels[i]] += distances[i, ii]

        for i in range(len(diameters)):
            diameters[i] /= sum(labels == i)

    elif method == 'farthest':
        for i in range(0, len(labels) - 1):
            for ii in range(i + 1, len(labels)):
                if labels[i] == labels[ii] and distances[i, ii] > diameters[
                        labels[i]]:
                    diameters[labels[i]] = distances[i, ii]
    return diameters


def dunn(labels, distances, diameter_method='farthest',
         cdist_method='nearest'):
    """
    Dunn index for cluster validation (larger is better).
    
    .. math:: D = \\min_{i = 1 \\ldots n_c; j = i + 1 \\ldots n_c} \\left\\lbrace \\frac{d \\left( c_i,c_j \\right)}{\\max_{k = 1 \\ldots n_c} \\left(diam \\left(c_k \\right) \\right)} \\right\\rbrace
    
    where :math:`d(c_i,c_j)` represents the distance between
    clusters :math:`c_i` and :math:`c_j`, and :math:`diam(c_k)` is the diameter of cluster :math:`c_k`.
    Inter-cluster distance can be defined in many ways, such as the distance between cluster centroids or between their closest elements. Cluster diameter can be defined as the mean distance between all elements in the cluster, between all elements to the cluster centroid, or as the distance between the two furthest elements.
    The higher the value of the resulting Dunn index, the better the clustering
    result is considered, since higher values indicate that clusters are
    compact (small :math:`diam(c_k)`) and far apart (large :math:`d \\left( c_i,c_j \\right)`).
    :param labels: a list containing cluster labels for each of the n elements
    :param distances: an n x n numpy.array containing the pairwise distances between elements
    :param diameter_method: see :py:function:`diameter` `method` parameter
    :param cdist_method: see :py:function:`inter_cluster_distances` `method` parameter
    
    .. [Kovacs2005] Kovács, F., Legány, C., & Babos, A. (2005). Cluster validity measurement techniques. 6th International Symposium of Hungarian Researchers on Computational Intelligence.
    """

    labels = LabelEncoder().fit(labels).transform(labels)

    ic_distances = inter_cluster_distances(labels, distances, cdist_method)
    min_distance = min(ic_distances[ic_distances.nonzero()])
    max_diameter = max(diameter(labels, distances, diameter_method))

    return min_distance / max_diameter


if __name__ == '__main__':
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans

    data = load_iris()
    kmeans = KMeans(n_clusters=3)
    c = data['target']
    x = data['data']
    k = kmeans.fit_predict(x)
    d = euclidean_distances(x)

    for diameter_method in DIAMETER_METHODS:
        for cdist_method in CLUSTER_DISTANCE_METHODS:
            dund = dunn(c, d, diameter_method, cdist_method)
            dunk = dunn(k, d, diameter_method, cdist_method)
            print(diameter_method, cdist_method, dund, dunk)

Dunn Index R implementation


## the mouse dataset and the dunn() function come from the clValid package
library(clValid)

data(mouse)
express <- mouse[1:25, c("M1","M2","M3","NC1","NC2","NC3")]
rownames(express) <- mouse$ID[1:25]
## hierarchical clustering
Dist <- dist(express, method="euclidean")
clusterObj <- hclust(Dist, method="average")
nc <- 2  ## number of clusters
cluster <- cutree(clusterObj, nc)
dunn(Dist, cluster)

Dunn Index Spark implementation


import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

val sc = spark.sparkContext

val data = sc.textFile("/FileStore/tables/sample.csv", 16)

val dataRDD = data
      .map(s => s.split(",")
        .map(_.toDouble))
      .keyBy(_.apply(0))
      .cache()

val parsedData = dataRDD.map(s => Vectors.dense(s._2)).cache()

val clusters = KMeans.train(parsedData,2,100)

// Cluster centers and all pairs of distinct centers
val centroides = sc.parallelize(clusters.clusterCenters)

val centroidesCartesian = centroides.cartesian(centroides).filter(x => x._1 != x._2).cache()

// Dunn index (centroid-based variant): smallest squared distance between
// cluster centers divided by the largest squared distance from a point to
// its assigned cluster center

val minA = centroidesCartesian.map(x => Vectors.sqdist(x._1, x._2)).min()

val maxB = parsedData.map(r => Vectors.sqdist(r, clusters.clusterCenters(clusters.predict(r)))).max

val dunn = minA / maxB

Dunn Index Python implementation


import pandas as pd
from sklearn import datasets
from jqmcvi import base  # jqmcvi provides the Dunn index implementation used below

# loading the dataset
X = datasets.load_iris()
df = pd.DataFrame(X.data)

# K-Means
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df) #K-means training
y_pred = k_means.predict(df)

# We store the K-means results in a dataframe
pred = pd.DataFrame(y_pred)
pred.columns = ['Type']

# we merge this dataframe with df
prediction = pd.concat([df, pred], axis = 1)

# We store the clusters, keeping only the feature columns for the distance computations
clus0 = prediction.loc[prediction.Type == 0].drop(columns='Type')
clus1 = prediction.loc[prediction.Type == 1].drop(columns='Type')
clus2 = prediction.loc[prediction.Type == 2].drop(columns='Type')
cluster_list = [clus0.values, clus1.values, clus2.values]

print(base.dunn(cluster_list))

Cluster analysis results on the "Mouse" dataset

Cluster Analysis

The multi-class classification problem is the problem of classifying data into more than two classes. There are many methods for solving it; an important class of these is binary hierarchical partitioning, i.e. arranging the classes in the form of a binary tree. The tree is built so that the classes in each child node are divided into two clusters, and the process continues until the leaf nodes contain only one class each. Each internal node of the tree contains a binary classifier. Most methods for constructing such nested dichotomies use a random strategy, which does not guarantee finding a good partition.

The term cluster analysis (first introduced by Tryon, 1939) actually encompasses a set of different classification algorithms. A common question asked by researchers in many fields is how to organize observed data into illustrative structures, i.e., to develop taxonomies. For example, biologists aim to categorize animals into different species in order to meaningfully describe the differences between them. According to the modern system accepted in biology, humans belong to primates, mammals, amniotes, vertebrates, and animals. Notice that in this classification, the higher the level of aggregation, the less similar the members in the corresponding class are. Humans have more similarities with other primates (i.e., apes) than with "distant" members of the mammals (e.g., dogs), etc. In the following sections, general methods of cluster analysis will be discussed, see Merging (tree clustering), Two-Way Merging, and the K-means Method.

Checking Statistical Significance

Note that the previous reasoning refers to clustering algorithms, but does not mention anything about statistical significance testing. In fact, cluster analysis is not so much a conventional statistical method as it is a "set" of different algorithms for "distributing objects into clusters". There is a point of view that, unlike many other statistical procedures, cluster analysis methods are used in most cases when you do not have any a priori hypotheses about classes, but are still in the descriptive stage of research. It should be understood that cluster analysis determines the "most possibly significant" solution. Therefore, statistical significance testing is not really applicable here, even in cases where p-levels are known (as in the K-means method).

Clustering techniques are used in a wide variety of fields. Hartigan (1975) gave an excellent review of many published studies containing results obtained by cluster analysis techniques. For example, in the field of medicine, the clustering of diseases, disease treatments, or disease symptoms leads to widely used taxonomies. In the field of psychiatry, the correct diagnosis of symptom clusters such as paranoia, schizophrenia, etc., is crucial for successful therapy. In archaeology, with the help of cluster analysis, researchers try to establish taxonomies of stone tools, funerary objects, etc. Extensive applications of cluster analysis in marketing research are known. In general, whenever it is necessary to classify "mountains" of information into manageable groups, cluster analysis proves to be very useful and effective.

The example in the Main Purpose section explains the purpose of the aggregation algorithm (tree clustering). The purpose of this algorithm is to combine objects (e.g., animals) into sufficiently large clusters, using some measure of similarity or distance between objects. A typical result of such clustering is a hierarchical tree.

Consider a horizontal tree diagram. The diagram starts with each object in the class (on the left side of the diagram). Now imagine that gradually (in very small steps) you "weaken" your criterion about which objects are unique and which are not. In other words, you lower the threshold pertaining to the decision to combine two or more objects into one cluster.

As a result, you link more and more objects together and aggregate (merge) more and more clusters of increasingly divergent elements. Finally, in the final step, all objects are combined together. In these diagrams, the horizontal axes represent aggregation distance (in vertical tree diagrams, the vertical axes represent aggregation distance). So, for each node in the graph (where a new cluster is formed) you can see the amount of distance for which the corresponding items are linked into a new single cluster. When the data has a clear "structure" in terms of clusters of objects similar to each other, then this structure is likely to be reflected in the hierarchical tree by different branches. As a result of a successful association method analysis, it is possible to detect clusters (branches) and interpret them.

The joining or tree clustering method uses dissimilarities or distances between objects to form clusters. These distances can be defined in one-dimensional or multidimensional space. For example, if you have to cluster the types of food in a cafe, you can take into account the number of calories they contain, their price, a subjective evaluation of taste, etc. The most direct way to calculate distances between objects in multidimensional space is to calculate Euclidean distances. If you have a two- or three-dimensional space, this measure is the actual geometric distance between objects in space (as if the distances between objects were measured with a tape measure). However, the joining algorithm does not "care" whether the distances supplied to it are real geometric distances or some other derived distance measure that is more meaningful to the researcher; the challenge for researchers is to pick the right measure for their specific application.

Euclidean distance. This is probably the most commonly used type of distance. It is simply the geometric distance in multidimensional space and is calculated as follows:

distance(x, y) = \sqrt{\sum_i (x_i - y_i)^2}

Note that the Euclidean distance (and its square) is calculated from original data, not standardized data. This is the usual way of calculating it, which has certain advantages (for example, the distance between two objects does not change when a new object, which may turn out to be an outlier, is introduced into the analysis). However, distances can be strongly influenced by differences between the axes from whose coordinates these distances are calculated. For example, if one of the axes is measured in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the final Euclidean distance (or Euclidean distance square) calculated from the coordinates will change greatly, and as a result, the cluster analysis results may be very different from previous ones.

Squared Euclidean distance. Sometimes you may want to square the standard Euclidean distance in order to give greater weight to objects that are more distant from each other. This distance is calculated as follows (see also the remarks in the previous paragraph):

distance(x, y) = \sum_i (x_i - y_i)^2

City-block distance (Manhattan distance). This distance is simply the sum of the absolute differences of the coordinates. In most cases, this distance measure leads to the same results as the ordinary Euclidean distance. Note, however, that for this measure the influence of individual large differences (outliers) is reduced (since they are not squared). The Manhattan distance is calculated using the formula:

distance(x, y) = \sum_i |x_i - y_i|

Chebyshev distance. This distance can be useful when one wants to define two objects as "different" if they differ in any one coordinate (in any one dimension). The Chebyshev distance is calculated using the formula:

distance(x, y) = \max_i |x_i - y_i|

Power distance. Sometimes one wishes to progressively increase or decrease the weight assigned to a dimension for which the corresponding objects differ greatly. This can be achieved using the power distance, which is calculated using the formula:

distance(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/r}

where r and p are user-defined parameters. A few examples of calculations can show how this measure "works". The parameter p is responsible for gradual weighting of differences over individual coordinates, the parameter r is responsible for progressive weighting of large distances between objects. If both parameters r and p are equal to two, the distance coincides with the Euclidean distance.

Percentage of disagreement. This measure is used when the data is categorical. This distance is calculated using the formula:

distance(x, y) = (number of coordinates where x_i ≠ y_i) / i,  where i is the total number of coordinates.
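The distance measures listed above are all available in SciPy. Here is a minimal sketch (the sample vectors are made up, and note that SciPy's Minkowski distance corresponds to the power distance with p = r):

import numpy as np
from scipy.spatial.distance import (euclidean, sqeuclidean, cityblock,
                                    chebyshev, minkowski, hamming)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(euclidean(x, y))       # Euclidean distance
print(sqeuclidean(x, y))     # squared Euclidean distance
print(cityblock(x, y))       # city-block (Manhattan) distance
print(chebyshev(x, y))       # Chebyshev distance
print(minkowski(x, y, p=3))  # power distance with p = r = 3

# percentage of disagreement for categorical data (encoded as category codes)
print(hamming([0, 1, 2], [0, 3, 2]))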

Association or linkage rules

In the first step, when each object is a separate cluster, the distances between these objects are determined by the chosen measure. However, when several objects have been linked together, the question arises: how should the distances between clusters be determined? In other words, a merging or linkage rule is needed for two clusters. There are various possibilities here: for example, you can link two clusters together when any two objects in the two clusters are closer to each other than the corresponding linkage distance. In other words, you use the "nearest neighbor rule" to determine the distance between clusters; this method is called the single linkage method. This rule produces "stringy" clusters, that is, clusters that are "linked together" only by individual elements that happen to be closer to each other than the others. Alternatively, you can define the distance between clusters as the distance between the pair of objects, one from each cluster, that are farthest from each other. This method is called the complete linkage method. There are also many other methods for combining clusters, similar to the ones considered.

Single link (nearest neighbor method). As described above, in this method the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in different clusters. This rule should, in a sense, string objects together to form clusters, and the resulting clusters tend to be represented by long "chains".

Complete linkage (farthest neighbor method). In this method, the distances between clusters are determined by the greatest distance between any two objects in different clusters (i.e., the "farthest neighbors"). This method usually works very well when the objects actually come from distinct "clumps". If, however, the clusters are elongated or their natural shape is "chain"-like, then this method is unsuitable.

Unweighted pair-group average. In this method, the distance between two different clusters is calculated as the average distance between all pairs of objects in them. The method is effective when objects actually form distinct "clumps", but it works equally well for elongated ("chain"-type) clusters. Note that in their book Sneath and Sokal (1973) introduce the acronym UPGMA to refer to this method as the unweighted pair-group method using arithmetic averages.

Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the calculations the size of the corresponding clusters (i.e., the number of objects they contain) is used as a weighting factor. Therefore, this method (rather than the previous one) should be used when unequal cluster sizes are suspected. Sneath and Sokal (1973) introduce the acronym WPGMA to refer to this method as the weighted pair-group method using arithmetic averages.

Unweighted centroid method. In this method, the distance between two clusters is defined as the distance between their centers of gravity. Sneath and Sokal (1973) use the acronym UPGMC to refer to this method as the unweighted pair-group method using the centroid average.

The weighted centroid method (median). That method is identical to the previous one, except that weights are used in the calculations to account for the difference between cluster sizes (i.e. the numbers of objects in them). Therefore, if there are (or are suspected to be) significant differences in cluster sizes, this method is preferable to the previous one. Sneath and Sokal (1973) used the acronym WPGMC to refer to it as the weighted pair-group method using the centroid average.

Ward's method. This method differs from all other methods because it uses methods of variance analysis to estimate distances between clusters. The method minimizes the sum of squares (SS) for any two (hypothetical) clusters that can be formed at each step. Details can be found in Ward (1963). In general, the method appears to be very efficient, but it tends to generate clusters of small size.
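As an illustration, the linkage rules above map onto SciPy's hierarchical clustering methods roughly as shown in the following sketch (the toy data is made up for the example):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated toy clumps of 2-D points
points = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
                    rng.normal(5.0, 0.5, (10, 2))])

# single   - nearest neighbor          complete - farthest neighbor
# average  - unweighted pair-group     weighted - weighted pair-group (WPGMA)
# centroid - unweighted centroid       median   - weighted centroid (median)
# ward     - Ward's minimum-variance method
for method in ('single', 'complete', 'average', 'weighted',
               'centroid', 'median', 'ward'):
    Z = linkage(points, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
    print(method, labels)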

For an overview of other clustering methods, see Two-Way Merging and the K-Means Method.

This method was previously discussed in terms of "objects" to be clustered (see Merging (tree clustering)). In all other types of analysis, the question of interest is usually expressed in terms of observations or variables. It turns out that clustering in terms of both observations and variables can lead to quite interesting results. For example, imagine that a medical researcher collects data on various characteristics (variables) of the conditions of patients (observations) suffering from heart disease. The researcher may want to cluster the observations (patients) to identify clusters of patients with similar symptoms. At the same time, the researcher may want to cluster variables to identify clusters of variables that are associated with a similar physical condition. 

After this discussion relating to whether to cluster observations or variables, one might ask, why not cluster in both directions? The Cluster Analysis module contains an effective two-way merging procedure that allows you to do just that. However, two-way merging is used relatively rarely, in circumstances where both observations and variables are expected to simultaneously contribute to the detection of meaningful clusters.

Thus, returning to the previous example, one might assume that a medical researcher needs to identify clusters of patients who are similar with respect to certain clusters of physical condition characteristics. The difficulty in interpreting the results arises because similarities between different clusters may originate from (or be the cause of) some difference in subsets of variables. The resulting clusters are therefore inherently heterogeneous. This may seem a bit vague at first; indeed, compared to the other cluster analysis methods described (see Merge (tree clustering) and the K-means method), two-way merging is probably the least frequently used method. However, some researchers believe that it offers a powerful means of exploratory data analysis (for more information, see Hartigan's description of this method (Hartigan, 1975)).

This clustering method differs significantly from agglomerative methods such as Merge (tree clustering) and Two-Way Merge. Suppose you already have hypotheses about the number of clusters (by observations or by variables). You can tell the system to form exactly three clusters so that they are as different as possible. This is the type of problem that the K-means method algorithm solves. In general, the K-means method builds exactly K different clusters as far apart as possible.

In the physical state example (see Two-Way Association), a medical researcher may have a "suspicion" from his clinical experience that his patients basically fall into three different categories. He may then want to know whether his intuition can be confirmed numerically, that is, will a K-means cluster analysis actually yield three clusters of patients as expected? If so, then the averages of the different physical parameter measures for each cluster will give a quantitative way of representing the researcher's hypotheses (e.g., patients in cluster 1 have high parameter 1, lower parameter 2, etc.).

From a computational point of view, you can think of this method as analysis of variance (see Analysis of Variance) "in reverse". The program starts with K randomly selected clusters and then changes the membership of objects so as to (1) minimize variability within clusters and (2) maximize variability between clusters. The method is analogous to "ANOVA in reverse" in the sense that the significance test in analysis of variance compares between-group variability to within-group variability when testing the hypothesis that the group means differ from each other. In K-means clustering, the program moves objects (i.e., observations) between groups (clusters) in order to obtain the most significant result in the analysis of variance (ANOVA).

Usually when the results of a K-means cluster analysis are obtained, you can calculate the mean for each cluster on each dimension to assess how different the clusters are from each other. Ideally, you should get very different averages for most, if not all, of the dimensions used in the analysis. The F-statistics values obtained for each dimension are another indicator of how well the corresponding dimension discriminates between clusters. 
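A minimal sketch of that kind of check, assuming K-means on the Iris data and SciPy's one-way ANOVA for the per-dimension F statistics:

import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean of each dimension within each cluster
for k in range(3):
    print('cluster', k, 'means:', X[labels == k].mean(axis=0).round(2))

# F statistic per dimension: how strongly that dimension separates the clusters
for j in range(X.shape[1]):
    groups = [X[labels == k, j] for k in range(3)]
    f_stat, _ = f_oneway(*groups)
    print('dimension', j, 'F =', round(f_stat, 1))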

 

Quality assessment in a clustering problem

The problem of assessing quality in clustering is difficult for at least two reasons:

  • Kleinberg's impossibility theorem: no clustering algorithm can simultaneously satisfy a small set of natural requirements (scale invariance, richness, and consistency), so there is no single optimal clustering algorithm.
  • Many clustering algorithms are unable to determine the true number of clusters in the data. Most often, the number of clusters is supplied as an input to the algorithm and selected by running the algorithm several times, as sketched below.
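A minimal sketch of that practice, assuming K-means on the Iris data and the silhouette score as the internal criterion for picking the number of clusters:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# Run the algorithm for several candidate numbers of clusters and
# keep the value that scores best on an internal quality measure.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, 'best k =', best_k)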

Clustering quality assessment methods

A clustering quality assessment method is a tool for quantifying clustering results.

Two groups of clustering quality assessment methods are commonly distinguished:

  • External measures are based on comparing the clustering result with an a priori known division into classes.
  • Internal measures reflect the quality of the clustering using only the information in the data themselves.

Clustering quality assessment methods comparison

There is no single best method for evaluating the quality of clustering. However, one study[1] compared existing measures on different data. The results showed that on artificial datasets the Silhouette, DB*, and Calinski-Harabasz indices performed best, while on real datasets the Score function performed best of all.
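Of the measures named in that comparison, scikit-learn provides the Silhouette, Calinski-Harabasz, and (standard) Davies-Bouldin indices out of the box; DB* and the Score function would need a separate implementation. A minimal sketch on the Iris clustering used earlier:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print('Silhouette:       ', silhouette_score(X, labels))         # higher is better
print('Calinski-Harabasz:', calinski_harabasz_score(X, labels))  # higher is better
print('Davies-Bouldin:   ', davies_bouldin_score(X, labels))     # lower is better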


 
