
# DBSCAN Clustering in ML | Density-Based Clustering


Basically, all clustering methods follow the same approach: we first compute similarities between data points and then use them to group the points into clusters. Here we will focus on the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering method.

Clusters are dense regions in the data space, separated by regions of lower point density. DBSCAN is built on this intuitive notion of "clusters" and "noise". The basic idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

## Why DBSCAN?

Partitioning methods (K-means, PAM clustering) and hierarchical clustering are designed to find spherical or convex clusters. In other words, they are suitable only for compact and well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data.

Real-world data may contain irregularities, for example:
i) Clusters can be of arbitrary shape, as shown in the figure below.
ii) Data may contain noise.

The figure below shows a dataset containing non-convex clusters and outliers/noise. Given such data, the k-means algorithm has difficulty identifying clusters with arbitrary shapes.

DBSCAN requires two parameters:

1. eps: defines the neighborhood around a data point; if the distance between two points is less than or equal to eps, they are considered neighbors. If eps is chosen too small, most of the data will be considered outliers. If it is chosen very large, clusters will merge and most of the data points will end up in the same cluster. One way to find a suitable eps value is a k-distance plot.
2. MinPts: the minimum number of neighbors (data points) within radius eps. The larger the dataset, the larger the MinPts value should be chosen. As a rule of thumb, a minimum MinPts can be derived from the number of dimensions D in the dataset as `MinPts >= D + 1`. The minimum MinPts value must be at least 3.
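The k-distance idea can be sketched as follows: sort every point's distance to its k-th nearest neighbor and look for the "elbow" of the resulting curve. This is an illustrative sketch on a synthetic `make_blobs` dataset (a stand-in, not the article's dataset):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Illustrative 2-D dataset (a stand-in, not the article's dataset)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

k = 4  # a MinPts choice for 2-D data (D + 1 = 3 is the minimum)
# Ask for k + 1 neighbors because each point is returned as its own
# nearest neighbor with distance 0
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nbrs.kneighbors(X)

# Ascending curve of every point's k-th-neighbor distance; the sharp
# bend ("elbow") of this curve is a reasonable value for eps
k_distances = np.sort(distances[:, -1])
print(k_distances[0], k_distances[-1])
# plt.plot(k_distances) would visualize the curve
```

Points inside dense clusters have small k-th-neighbor distances; outliers sit on the steep tail of the curve, so the bend between the two regimes separates "dense" from "sparse".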

In this algorithm, we have three types of data points.

Core point: a point that has at least MinPts points (including itself) within distance eps.
Border point: a point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Noise or outlier: a point that is neither a core point nor a border point.
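These three point types can be illustrated with a small self-contained sketch. It uses a brute-force pairwise-distance check on a hypothetical toy dataset (the counting convention here includes the point itself, matching the definition above):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (illustrative sketch)."""
    n = len(X)
    # Pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = d <= eps                   # includes the point itself
    is_core = neighbors.sum(axis=1) >= min_pts
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif neighbors[i][is_core].any():  # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Five tightly packed points, one fringe point, one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [0.2, 0.0], [0.45, 0.0], [5.0, 5.0]])
print(classify_points(X, eps=0.3, min_pts=4))
```

The dense group comes out as core points, the fringe point is a border point (it has too few neighbors of its own but lies within eps of a core point), and the isolated point is noise.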

The DBSCAN algorithm can be abstracted into the following steps:

1. Find all points within eps of each point, and identify the core points, i.e., points with at least MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all points density-connected to the core point and assign them to the same cluster.
Two points a and b are said to be density-connected if there is a point c that has a sufficient number of points in its neighborhood and both a and b lie within eps of c. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Points that do not belong to any cluster are noise.

Below is the DBSCAN clustering algorithm in pseudocode:

```
DBSCAN(dataset, eps, MinPts) {
    # cluster index
    C = 1
    for each unvisited point p in dataset {
        mark p as visited
        # find neighbors
        N = find the neighboring points of p
        if |N| >= MinPts {
            for each point p' in N {
                if p' is not visited {
                    mark p' as visited
                    N' = find the neighboring points of p'
                    if |N'| >= MinPts {
                        N = N U N'
                    }
                }
                if p' is not a member of any cluster {
                    add p' to cluster C
                }
            }
            C = C + 1
        } else {
            mark p as noise
        }
    }
}
```

Implementation of the above algorithm in Python:

Here we will use the Python library `sklearn` to compute DBSCAN, and the `matplotlib.pyplot` library to visualize the clusters.

The dataset used can be found here.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Load data into X; since the original dataset link is unavailable,
# synthetic blobs serve as a stand-in here
X, _ = make_blobs(n_samples=750, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print(labels)

# Plot result; black is reserved for noise rather than used as a cluster color
unique_labels = set(labels)
colors = ['y', 'b', 'g', 'r']
print(colors)
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise
        col = 'k'

    class_member_mask = (labels == k)

    # Core points of the cluster
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

    # Border points of the cluster
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('number of clusters: %d' % n_clusters_)
plt.show()
```

Output:

Black dots represent outliers. By changing eps and MinPts, we can change the cluster configuration.
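A quick sweep over eps on synthetic blob data (an illustrative setup, not the article's dataset) shows how strongly the configuration depends on this parameter:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three well-separated blobs (illustrative stand-in dataset)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [3, 3], [-3, 3]],
                  cluster_std=0.5, random_state=0)

results = {}
for eps in (0.1, 0.5, 5.0):
    labels = DBSCAN(eps=eps, min_samples=10).fit(X).labels_
    # Noise points are labeled -1 and excluded from the cluster count
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With eps far too small, most points have too few neighbors and are labeled noise; with eps far too large, all three blobs merge into a single cluster. Only the intermediate value recovers the three blobs.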

This raises a natural question: which shortcomings of K-Means does DBSCAN address?

Drawbacks of K-Means:

1. K-Means forms only spherical clusters. The algorithm fails when the data is not spherical (i.e., does not have the same variance in all directions).
2. The K-Means algorithm is sensitive to outliers. Outliers can distort the clusters found by K-Means to a very large extent.
3. The K-Means algorithm requires the number of clusters to be specified in advance.

In essence, DBSCAN overcomes all of the above-mentioned drawbacks of the K-Means algorithm. The DBSCAN algorithm identifies dense regions by grouping together data points that are close to each other according to a distance measure.
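This contrast can be sketched with scikit-learn's `make_moons` generator, which produces two interleaving non-convex clusters (the eps and min_samples values below are chosen for this synthetic data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means must cut the plane with a straight boundary between two centroids
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows the dense curved regions instead
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print("DBSCAN found", n_clusters, "clusters")
```

Because K-Means assigns each point to the nearest centroid, it splits the moons with a straight line, while DBSCAN can trace each moon as one dense, connected region.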

A Python implementation of the above algorithm without using the sklearn library can be found here: dbscan_in_python.
