DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that identifies clusters of arbitrary shape in a data set containing noise and outliers.
The DBSCAN algorithm is based on an intuitive notion of “clusters” as dense regions and “noise” as points in sparse regions. It groups together points that are close to each other, based on two parameters: eps (the maximum distance at which two points still count as neighbors) and minPoints (the minimum number of points needed to form a dense region).
Parameters
eps: the maximum distance between two points for them to be considered neighbors. If the distance between two points is less than or equal to eps, the points are neighbors.
minPoints: the minimum number of points required within a point's eps-neighborhood to form a dense region, that is, the core of a cluster.
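To make the roles of the two parameters concrete, here is a minimal base-R sketch of the clustering procedure. It is an illustration only, not the implementation used by the dbscan package: simple_dbscan() and the toy points are hypothetical, Euclidean distance is assumed, and a point is counted as part of its own neighborhood.

```r
# Minimal DBSCAN sketch in base R (illustrative only, O(n^2) time and memory).
# simple_dbscan() is a hypothetical helper, not a function from any package.
simple_dbscan <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))                  # pairwise Euclidean distances
  n <- nrow(d)
  # eps-neighborhood of every point (assumption: the point itself is included)
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbors) >= min_pts # core points have dense neighborhoods
  cluster <- integer(n)                    # 0 = noise or not yet assigned
  current <- 0L
  for (i in seq_len(n)) {
    # Only an unassigned core point can start a new cluster
    if (cluster[i] != 0L || !is_core[i]) next
    current <- current + 1L
    cluster[i] <- current
    queue <- neighbors[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (cluster[j] != 0L) next           # already assigned to a cluster
      cluster[j] <- current                # core or border point joins the cluster
      if (is_core[j]) queue <- c(queue, neighbors[[j]])  # only core points expand
    }
  }
  cluster
}

# Toy data: two tight groups of four points and one isolated point
pts <- rbind(
  c(0, 0), c(0, 0.1), c(0.1, 0), c(0.1, 0.1),
  c(5, 5), c(5, 5.1), c(5.1, 5), c(5.1, 5.1),
  c(10, 0)
)
labels <- simple_dbscan(pts, eps = 0.5, min_pts = 3)
print(labels)  # 1 1 1 1 2 2 2 2 0
```

The isolated point keeps the label 0 because its neighborhood contains only itself, so it is neither a core point nor reachable from one.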
Implementation
Parameter Estimation
1. Determine minPoints: As a rule of thumb, minPoints can be derived from the number of dimensions (D) in the dataset, as minPoints >= D + 1. The value should be at least 3, and larger values may be appropriate depending on the dataset chosen.
2. Determine the optimum eps value: To choose eps we use the k-distance plot method, which plots each point's distance to its k-th nearest neighbor in sorted order. The knee of the curve, where the distance starts to rise sharply, marks a suitable threshold for eps.
Function used: kNNdistplot()
Package used
install.packages("dbscan")
For K-distance Plot
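library(dbscan)
# Keep the four numeric columns of iris; column 5 is the species label
iris_mat = as.matrix(iris[, -5])
# Plot the sorted distance from every point to its 4th nearest neighbor
kNNdistplot(iris_mat, k = 4)
# Mark the knee of the curve, which is the candidate eps
abline(h = 0.4, col = 'red')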
K-Distance Plot
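The curve in a k-distance plot can also be computed by hand, which helps when checking a candidate eps numerically instead of by eye. The following is a base-R sketch under the assumption of Euclidean distance. The variable names are illustrative, and the values 4 and 0.4 are taken from the example in this post.

```r
# Base-R sketch of the k-distance curve (assumes Euclidean distance)
iris_mat <- as.matrix(iris[, -5])

# D + 1 rule of thumb: iris has 4 numeric features, so this gives 5
# (the example in this post uses 4)
min_pts_rule <- ncol(iris_mat) + 1

k <- 4
d <- as.matrix(dist(iris_mat))
# Sort each point's distances: position 1 is the point itself (distance 0),
# so position k + 1 is the distance to its k-th nearest neighbor
k_dist <- apply(d, 1, function(row) sort(row)[k + 1])
k_dist_sorted <- sort(k_dist)   # these are the values the plot shows

# Fraction of points whose k-th neighbor lies within the candidate eps
share_within_eps <- mean(k_dist <= 0.4)
print(summary(k_dist))
print(share_within_eps)
```

If most points fall below the candidate eps and only a small tail lies above it, the threshold sits near the knee; a very low share would suggest that eps is too small.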
Apply DBSCAN and plot clusters
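# eps = 0.4 is read from the k-distance plot; minPts = 4
db = dbscan(iris_mat, eps = 0.4, minPts = 4)
# Draw a convex hull around each cluster; noise points (cluster 0) get no hull
hullplot(iris_mat, db$cluster)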
HULL PLOT
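Besides the plot, the clustering can be inspected numerically. The sketch below assumes the dbscan package is installed and reuses the eps and minPts values from this post. In the object returned by dbscan(), the cluster label 0 marks noise points.

```r
library(dbscan)

iris_mat <- as.matrix(iris[, -5])
db <- dbscan(iris_mat, eps = 0.4, minPts = 4)

# Size of each cluster; label 0 holds the points treated as noise
print(table(db$cluster))

# Number of noise points
n_noise <- sum(db$cluster == 0)
print(n_noise)

# Cross-tabulate clusters against the species labels. The labels are used
# only for this comparison and play no part in the clustering itself.
print(table(cluster = db$cluster, species = iris$Species))
```

The cross-tabulation shows how well the density-based clusters line up with the known species, which is a quick sanity check on the chosen parameters.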
Advantages of DBSCAN algorithm
1. It can discover any number of clusters, without the number having to be specified in advance.
2. Clusters of varying shapes and sizes can be obtained using the DBSCAN algorithm.
3. It can detect and ignore outliers.