ML | Hierarchical clustering (agglomerative and divisive clustering)

There are essentially two types of hierarchical cluster analysis strategies:

  1. Agglomerative clustering: also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). It produces a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering. This clustering algorithm does not require us to specify the number of clusters in advance. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively merge (agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all the data.

    Algorithm:

    given a dataset (d1, d2, d3, ..., dN) of size N
    # compute the distance matrix
    for i = 1 to N:
        # as the distance matrix is symmetric about
        # the primary diagonal, we compute only the lower
        # part of the primary diagonal
        for j = 1 to i:
            dis_mat[i][j] = distance(di, dj)
    each data point is a singleton cluster
    repeat
        merge the two clusters having minimum distance
        update the distance matrix
    until only a single cluster remains

    Implementation of the above algorithm in Python using the scikit-learn library:

    from sklearn.cluster import AgglomerativeClustering
    import numpy as np

    # small toy dataset: two groups of points, at x = 1 and x = 4
    X = np.array([[1, 2], [1, 4], [1, 0],
                  [4, 2], [4, 4], [4, 0]])

    # n_clusters sets how many clusters the hierarchy is cut into
    # (scikit-learn's default is 2)
    clustering = AgglomerativeClustering(n_clusters=2).fit(X)

    # print the cluster label assigned to each point
    print(clustering.labels_)

    Output:

     [1 1 1 0 0 0]
  2. Divisive clustering: also known as the top-down approach. This algorithm also does not require the number of clusters to be specified in advance. Top-down clustering requires a method for splitting a cluster. It starts with a single cluster that contains all the data and splits clusters recursively until each data point is in its own singleton cluster.

    Algorithm:

    given a dataset (d1, d2, d3, ..., dN) of size N
    at the top we have all data in one cluster
    the cluster is split using a flat clustering method, e.g. K-Means
    repeat
        choose the best cluster among all the clusters to split
        split that cluster by the flat clustering algorithm
    until each data point is in its own singleton cluster
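
The top-down procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name `divisive_clustering` is hypothetical, K-Means with k = 2 is used as the flat clustering subroutine, and the "best" cluster to split is assumed to be the one with the largest within-cluster sum of squared distances (the text above does not fix a particular criterion). The sketch stops once `n_clusters` clusters exist instead of always splitting down to singletons.

```python
import numpy as np
from sklearn.cluster import KMeans


def divisive_clustering(X, n_clusters):
    """Top-down clustering by repeatedly bisecting one cluster with 2-means.

    Assumption: the cluster chosen for splitting is the one with the
    largest within-cluster sum of squared distances to its centroid.
    """
    X = np.asarray(X, dtype=float)
    clusters = [np.arange(len(X))]  # at the top, all data is in one cluster

    def sse(idx):
        pts = X[idx]
        return float(((pts - pts.mean(axis=0)) ** 2).sum())

    while len(clusters) < n_clusters:
        # only clusters with some spread can be split further
        candidates = [i for i, c in enumerate(clusters) if sse(c) > 0]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: sse(clusters[i]))
        idx = clusters.pop(worst)
        # flat clustering subroutine: K-Means with k = 2
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[halves == 0])
        clusters.append(idx[halves == 1])

    # turn the list of index arrays into one label per data point
    labels = np.empty(len(X), dtype=int)
    for label, idx in enumerate(clusters):
        labels[idx] = label
    return labels


X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
labels = divisive_clustering(X, n_clusters=2)
print(labels)
```

Calling the function with `n_clusters=len(X)` continues splitting until every point is a singleton, which corresponds to building the complete hierarchy.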

Hierarchical agglomerative versus divisive clustering:

  • Divisive clustering is more complex than agglomerative clustering, because it needs a flat clustering method as a "subroutine" to split each cluster, applied repeatedly until every data point is in its own singleton cluster.
  • Divisive clustering is more efficient if we do not generate the complete hierarchy all the way down to individual data points. The time complexity of naive agglomerative clustering is O(n^3), because we exhaustively scan the N x N matrix dis_mat for the smallest distance in each of the N-1 iterations. Using a priority queue data structure, we can reduce this complexity to O(n^2 log n). With some further optimizations, it can be brought down to O(n^2). By contrast, for divisive clustering with a fixed number of top levels, using an efficient flat algorithm such as K-Means, divisive algorithms are linear in the number of data points and clusters.
  • Divisive clustering can also be more accurate. Agglomerative clustering makes decisions by considering local patterns or neighboring points, without initially taking the global distribution of the data into account. These early decisions cannot be undone, whereas divisive clustering takes the global distribution of the data into account when making top-level partitioning decisions.
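
To make the O(n^3) cost of the naive agglomerative algorithm concrete, here is a minimal from-scratch sketch. It rests on assumptions not stated in the text: the function name `naive_agglomerative` is hypothetical, distances are Euclidean, and single linkage is used (the distance between two clusters is the smallest distance between any pair of their points). Each of the up to N-1 iterations rescans the distance matrix for the minimum, which is where the cubic cost comes from.

```python
import numpy as np


def naive_agglomerative(X, n_clusters=1):
    """Naive single-linkage agglomerative clustering (illustrative only).

    Returns (labels, merges); merges holds one
    (cluster_a, cluster_b, distance) tuple per merge, in order.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    # full pairwise Euclidean distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a cluster is never merged with itself

    members = {i: [i] for i in range(n)}  # active cluster id -> point indices
    merges = []

    while len(members) > n_clusters:
        active = sorted(members)
        # exhaustive scan over all active pairs for the smallest distance
        sub = dist[np.ix_(active, active)]
        a_pos, b_pos = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = active[a_pos], active[b_pos]
        merges.append((a, b, float(dist[a, b])))

        # single linkage update: the merged cluster keeps the smaller
        # of the two distances to every other cluster
        dist[a, :] = np.minimum(dist[a, :], dist[b, :])
        dist[:, a] = dist[a, :]
        dist[a, a] = np.inf
        members[a].extend(members.pop(b))

    labels = np.empty(n, dtype=int)
    for label, idx in enumerate(members.values()):
        labels[idx] = label
    return labels, merges


X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
labels, merges = naive_agglomerative(X, n_clusters=2)
print(labels)       # [0 0 0 1 1 1]
print(len(merges))  # 4
```

The label numbering is arbitrary, so a library implementation may report the same grouping with the labels swapped. Library implementations also default to other linkage criteria in some cases, so merge order can differ on less regular data.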
