ML | Hierarchical clustering (Agglomerative and Divisive clustering)


There are two main strategies for hierarchical cluster analysis:

  1. Agglomerative clustering: also known as bottom-up clustering or hierarchical agglomerative clustering (HAC). It produces a structure that is more informative than the unstructured set of clusters returned by flat clustering, and it does not require the number of clusters to be specified in advance. Bottom-up algorithms treat each data point as a singleton cluster at the start and then merge pairs of clusters successively until all clusters have been combined into a single cluster that contains all the data.

    Algorithm:

    given a dataset (d1, d2, d3, ....dN) of size N
    # compute the distance matrix
    for i = 1 to N:
        # the distance matrix is symmetric about the primary
        # diagonal, so we compute only its lower part
        for j = 1 to i:
            dis_mat[i][j] = distance[di, dj]
    each data point is a singleton cluster
    repeat
        merge the two clusters having minimum distance
        update the distance matrix
    until only a single cluster remains
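    For concreteness, here is a minimal NumPy sketch of the naive algorithm above, assuming Euclidean distance and single linkage (the distance between two clusters is the minimum pairwise distance between their members); the function name naive_agglomerative is ours, not part of any library:

    import numpy as np

    def naive_agglomerative(data, n_clusters=1):
        # every data point starts as a singleton cluster
        clusters = [[i] for i in range(len(data))]
        # pairwise Euclidean distance matrix
        dist_mat = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
        while len(clusters) > n_clusters:
            # find the pair of clusters with minimum single-linkage distance
            best, best_d = (0, 1), np.inf
            for i in range(len(clusters)):
                for j in range(i):
                    d = min(dist_mat[a, b] for a in clusters[i] for b in clusters[j])
                    if d < best_d:
                        best_d, best = d, (j, i)
            # merge the two closest clusters
            j, i = best
            clusters[j].extend(clusters.pop(i))
        return clusters

    X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
    print(naive_agglomerative(X, n_clusters=2))   # [[0, 1, 2], [3, 4, 5]]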

    Implementation of the above algorithm in Python using the scikit-learn library:

    from sklearn.cluster import AgglomerativeClustering
    import numpy as np

    # a small toy dataset
    X = np.array([[1, 2], [1, 4], [1, 0],
                  [4, 2], [4, 4], [4, 0]])

    # the number of clusters must be specified here;
    # otherwise the result is a single cluster
    # containing all the data
    clustering = AgglomerativeClustering(n_clusters=2).fit(X)

    # print the cluster labels
    print(clustering.labels_)

    Output:

     [1, 1, 1, 0, 0, 0] 
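
    To inspect the full merge hierarchy rather than a flat two-cluster cut, one option is SciPy's scipy.cluster.hierarchy module, sketched below on the same dataset (the choice of ward linkage here is our assumption, not something fixed by the example above):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    X = np.array([[1, 2], [1, 4], [1, 0],
                  [4, 2], [4, 4], [4, 0]])

    # linkage() encodes the full sequence of merges as an (N-1) x 4 array
    Z = linkage(X, method='ward')

    # visualize the merge hierarchy as a dendrogram
    dendrogram(Z)
    plt.xlabel('data point index')
    plt.ylabel('merge distance')
    plt.show()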
  2. Divisive clustering: also known as the top-down approach. This algorithm likewise does not require the number of clusters to be specified in advance. Top-down clustering starts with a single cluster containing all the data and recursively splits clusters until every data point ends up in its own singleton cluster.

    Algorithm:

    given a dataset (d1, d2, d3, ....dN) of size N
    at the top we have all data in one cluster
    the cluster is split using a flat clustering method, e.g. K-Means
    repeat
        choose the best cluster among all the clusters to split
        split that cluster with the flat clustering algorithm
    until each data point is in its own singleton cluster
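    Since the example above used scikit-learn's AgglomerativeClustering, here is a complementary, hedged sketch of top-down (bisecting) clustering built on KMeans as the flat "subroutine"; the splitting rule (always split the largest cluster) and the function name divisive_clustering are our assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    def divisive_clustering(data, n_clusters=2):
        # start with all data in one cluster (arrays of point indices)
        clusters = [np.arange(len(data))]
        while len(clusters) < n_clusters:
            # choose the "best" cluster to split; here simply the largest
            idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
            members = clusters.pop(idx)
            # split it with a flat clustering method (2-means)
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(data[members])
            clusters.append(members[labels == 0])
            clusters.append(members[labels == 1])
        return clusters

    X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
    print(divisive_clustering(X, n_clusters=2))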

Hierarchical Agglomerative vs. Divisive clustering:

  • Divisive clustering is more complex than agglomerative clustering, because it needs a flat clustering method as a "subroutine" to split each cluster until every data point sits in its own singleton cluster.
  • Divisive clustering is more efficient if we do not generate the complete hierarchy all the way down to individual data points. The time complexity of naive agglomerative clustering is O(n^3), because we exhaustively scan the N x N distance matrix dist_mat for the smallest distance in each of the N-1 iterations. Using a priority queue data structure, we can reduce this complexity to O(n^2 log n); with a few more optimizations it can be brought down to O(n^2). In contrast, for divisive clustering with a fixed number of top levels, using an efficient flat algorithm such as K-Means, the running time is linear in the number of data points and clusters (a priority-queue sketch follows this list).
  • A divisive algorithm is also more accurate. Agglomerative clustering makes decisions based on local patterns or neighboring points without first considering the global distribution of the data, and these early decisions cannot be undone, whereas divisive clustering takes the global distribution of the data into account when making top-level partitioning decisions.
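
The priority-queue idea mentioned above can be sketched as follows: push all pairwise distances into a binary heap once, then repeatedly pop the minimum, skipping entries whose clusters have already been merged (lazy deletion). This is an illustrative single-linkage sketch of the data structure, not an asymptotically optimal implementation; the function name heap_hac is our own:

import heapq
import numpy as np

def heap_hac(data, n_clusters=1):
    n = len(data)
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    members = {i: [i] for i in range(n)}    # live cluster id -> point indices
    heap = [(d[i, j], i, j) for i in range(n) for j in range(i)]
    heapq.heapify(heap)                     # O(n^2) one-time build
    next_id = n
    while len(members) > n_clusters:
        dist, a, b = heapq.heappop(heap)    # O(log n) per pop
        if a not in members or b not in members:
            continue                        # stale pair: lazy deletion
        merged = members.pop(a) + members.pop(b)
        # push single-linkage distances from the new cluster to the others
        for c, pts in members.items():
            dc = min(d[p, q] for p in merged for q in pts)
            heapq.heappush(heap, (dc, next_id, c))
        members[next_id] = merged
        next_id += 1
    return list(members.values())

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
print(heap_hac(X, n_clusters=2))            # [[0, 1, 2], [3, 4, 5]]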



