DBSCAN Clustering in ML | Density-Based Clustering

Broadly, most clustering methods follow the same approach: we first compute similarities between data points and then use those similarities to group the points into clusters. Here we will focus on the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering method.

Clusters are dense regions in the data space, separated by regions of lower point density. DBSCAN is built on this intuitive notion of "clusters" and "noise". The basic idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering are designed to find spherical or convex clusters. In other words, they are suitable only for compact and well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data.

Real-life data may contain irregularities, for example:
i) Clusters can be of arbitrary shape, such as those shown in the figure below.
ii) Data may contain noise.

The figure below shows a dataset containing non-convex clusters and outliers/noise. Given such data, the k-means algorithm has difficulty identifying these clusters of arbitrary shape.

DBSCAN requires two parameters —

  1. eps : defines the neighborhood around a data point; that is, if the distance between two points is less than or equal to eps, they are considered neighbors. If eps is chosen too small, a large part of the data will be considered outliers. If it is chosen too large, clusters will merge and most of the data points will end up in the same cluster. One way to find a good eps value is based on the k-distance plot (see the sketch after this list).
  2. MinPts : the minimum number of neighbors (data points) within radius eps. The larger the dataset, the larger MinPts should be chosen. As a rule of thumb, MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1. The minimum value of MinPts must be at least 3.
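For instance, a k-distance plot can be produced as follows. This is a minimal sketch on synthetic data; the choice k = MinPts is a common heuristic rather than a fixed rule, and the blob parameters are placeholders:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic data so the sketch is self-contained
X, _ = make_blobs(n_samples=750, centers=3, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

k = 10                                               # e.g. k = MinPts
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1: each point is its own nearest neighbor
distances, _ = nbrs.kneighbors(X)
k_dist = np.sort(distances[:, -1])                   # distance to each point's k-th nearest neighbor

plt.plot(k_dist)
plt.xlabel('points sorted by k-NN distance')
plt.ylabel('distance to %d-th nearest neighbor' % k)
plt.show()

The "elbow" of this curve, where the distance starts to rise sharply, suggests a reasonable eps value.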

In this algorithm, we have 3 types of data points.

    Core Point : a point is a core point if it has at least MinPts points within distance eps.
    Border Point : a point which has fewer than MinPts points within eps but lies in the neighborhood of a core point.
    Noise or outlier : a point which is neither a core point nor a border point.
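These three definitions translate directly into code. Below is a minimal NumPy sketch; the function name classify_points and the brute-force distance matrix are our own illustrative choices, not part of any library:

import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point in X as 'core', 'border', or 'noise'."""
    # Brute-force pairwise distances; each point counts itself as a neighbor
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within_eps = dists <= eps
    core = within_eps.sum(axis=1) >= min_pts
    # Border: not core, but within eps of at least one core point
    border = ~core & (within_eps & core[None, :]).any(axis=1)
    return np.where(core, 'core', np.where(border, 'border', 'noise'))

# Example usage: point_types = classify_points(X, eps=0.3, min_pts=10)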

The DBSCAN algorithm can be abstracted in the following steps —

  1. Find all the points within distance eps of each point, and identify the core points, i.e., points with at least MinPts neighbors.
  2. For each core point, if it is not already assigned to a cluster, create a new cluster.
  3. Recursively find all density-connected points and assign them to the same cluster as the core point. 
    Points a and b are said to be density-connected if there is a point c that has at least MinPts points within its eps-neighborhood and both a and b are reachable from c through a chain of such core points, each within eps distance of the next. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density-connected to a.
  4. Iterate over the remaining unvisited points in the dataset. Points that do not belong to any cluster are noise.

Below is the DBSCAN clustering algorithm in pseudocode:

 DBSCAN(dataset, eps, MinPts) {
    # cluster index
    C = 1
    for each unvisited point p in dataset {
        mark p as visited
        # find neighbors
        N = find the neighboring points of p
        if |N| >= MinPts:
            # expand the cluster through every point p' reachable from p
            for each point p' in N:
                N = N U (neighbors of p')
                if p' is not a member of any cluster:
                    add p' to cluster C
    }
 }
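Before moving to the library version below, here is a minimal from-scratch sketch of the same procedure. The function name dbscan and the stack-based cluster expansion are our own choices for illustration; it is not an optimized implementation (it builds a full O(n^2) distance matrix):

import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal sketch of the steps above. Returns labels: -1 = noise, 0, 1, ... = clusters."""
    n = len(X)
    labels = np.full(n, -1)                        # everything starts as noise
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    visited = np.zeros(n, dtype=bool)
    c = 0                                          # cluster index
    for p in range(n):
        if visited[p] or not core[p]:
            continue
        stack = [p]                                # start a new cluster at core point p
        while stack:
            q = stack.pop()
            if visited[q]:
                continue
            visited[q] = True
            labels[q] = c
            if core[q]:                            # only core points expand the cluster
                stack.extend(neighbors[q])
        c += 1
    return labels

# Example usage: labels = dbscan(X, eps=0.3, min_pts=10)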

Implementation of the above algorithm in Python using sklearn:

Here we will use the Python library sklearn to compute DBSCAN. We will also use the matplotlib.pyplot library to visualize the clusters.

The dataset used can be found here.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Load data into X (as a stand-in for the dataset linked above,
# we generate synthetic blobs and standardize them)
X, _ = make_blobs(n_samples=750, centers=3, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print(labels)

# Plot result
# Black is removed from the palette and used for noise instead.
unique_labels = set(labels)
colors = ['y', 'b', 'g', 'r']
print(colors)

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black is used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    # Core points belonging to cluster k
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

    # Border points belonging to cluster k
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('number of clusters: %d' % n_clusters_)
plt.show()

Output:

Black dots represent outliers. By changing eps and MinPts, we can change the cluster configuration.
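A quick way to see this sensitivity is to sweep eps and count the resulting clusters and noise points. This sketch assumes X is the standardized data from the listing above, and the eps values are arbitrary examples:

from sklearn.cluster import DBSCAN

for eps in (0.15, 0.3, 0.6):
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print('eps=%.2f -> %d clusters, %d noise points' % (eps, n_clusters, n_noise))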

A natural question to raise now is: why not simply use K-Means?

Shortcomings of K-Means:

  1. K-Means forms only spherical clusters. The algorithm fails when the data is not spherical (i.e., when the variance is not the same in all directions). 
  2. The K-Means algorithm is sensitive to outliers. Outliers can distort the clusters found by K-Means to a very large extent. 
  3. The K-Means algorithm requires the number of clusters to be specified a priori.

In essence, DBSCAN overcomes all of the above-mentioned shortcomings of the K-Means algorithm. The DBSCAN algorithm identifies dense regions by grouping data points that are close to each other based on a distance measure.
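As an illustration of this point, the sketch below contrasts K-Means and DBSCAN on the classic two-moons dataset, where the clusters are non-convex. The eps and min_samples values are plausible choices for this data, not prescribed ones:

import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-convex clusters where K-Means struggles
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=km_labels)
axes[0].set_title('K-Means (splits each moon)')
axes[1].scatter(X[:, 0], X[:, 1], c=db_labels)
axes[1].set_title('DBSCAN (recovers both moons)')
plt.show()

K-Means bisects the two moons with a straight boundary, while DBSCAN follows the dense curved regions and recovers each moon as one cluster.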

A Python implementation of the above algorithm without using the sklearn library can be found here: dbscan_in_python.

Links:
https://en.wikipedia.org/wiki/DBSCAN
https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html




