DBSCAN

1. What does DBSCAN stands for?
Density-based spatial clustering of applications with noise.
2. How does DBSCAN work intuitively?
Given a dataset of points, the objective is to partition the points into the dense region (clusters) and sparse region (noise).
3. Explain DBSCAN hyperparameters eps and minPoints.
  • eps: specifies how close points should be to each other to be considered a part of a cluster. If distance between two points is lower or equal to this value (eps), these points are considered to be neighbours.
  • minPoints: The minimum number of points to form a dense region.
4. Explain Core Point, Border Point and Noise Point.
dbscan point visuals

Image Source: https://shritam.medium.com/how-dbscan-algorithm-works-2b5bef80fb3

  • Core Point:
    • A core point is a point that has ≥ minPoints within the eps radius around it.
    • It is always in a dense region
  • Border Point:
    • A border point is a point that has < minPoints in eps radius and it is in the neighbourhood of a core point (at least one core point in the neighbourhood)
  • Noise Point:
    • Any point that is not a core point or a border point.
5. Explain Density Edge and Density Connected Points.
6. Is it harder to build a cluster if both minPoint and eps decrease?
Depends, decrease in minPoint makes it easier to form cluster but decrease in eps makes it harder to form cluster.
7. Is it harder to build a cluster if both minPoint and eps increase?
Depends, increase in minPoint makes it harder to form cluster but increase in eps makes it easier to form cluster.
8. Is it harder to build a cluster if minPoint increases but eps remains the same?
It is harder to form cluster because it requires more points to form cluster.
9. Is it harder to build a cluster if eps increases but minPoint remains the same?
It is easier to form cluster because the required distance between points to form cluster increases.
10. Provide a general guideline to select the appropriate minPoint value.
  • The minPoint should be more than or equals to the dimension.
  • For larger datasets, we should increase the minPoint.
11. Explain the implications if eps is set too high.

Setting a high epsvalue means that the algorithm considers data points as neighbours even when they are relatively far apart, making it easier to group data points into clusters.

When eps is set too high, it will likely to lead to underfitting because DBSCAN may merge distinct clusters into a single large cluster or include outliers. This leads to a loss of discriminative power and the clusters may not accurately represent the underlying patterns in the data.

12. Explain how knee method is used to determine the appropriate eps.
dbscan knee method

Where y = eps, x = points (sample) sorted by distance

Image Source: https://shritam.medium.com/how-dbscan-algorithm-works-2b5bef80fb3

eps is the maximum distance between two points to be considered as neighbours.

  • How it works:
    1. Select the appropriate minPoint=k.
    2. Compute the average distance between every point to its k nearest neighbours.
    3. Plot the k-distance in ascending order. Note that every point will have different distances.
    4. Select the k-distance where the rate of increase changes drastically. The intuition is that the k-distance of core points and border points falls within a specific range, whereas noise points may exhibit significantly greater k-distances.
13. State and explain the steps to perform DBSCAN.
  1. Initialisation:
    • Select the appropriate values for eps and minPoints
  2. Core Point Identification:
    • For each data point, calculate the distance di.
    • If di ≤ eps, the data point categorised as the neighbour of the point x.
    • If a data point has more neighbours than minPoints, it will be marked as core point.
  3. Cluster Formation:
    • Select a random core point to create a cluster.
    • Connect the core points together that are within the datapoint’s eps to form the first cluster.
  4. Border Point Assignments:
    • Assign each border points to the cluster of the closest core point. Note that border points cannot extend the cluster further, they can only join a cluster.
  5. Noise Point Identification:
    • Any unvisited data points that are not core nor border points will be treated as outliers relative to the cluster.
  6. Repeat for unvisited Core Points:
    • Repeat the cluster formation process for all unvisited core points, iteratively creating additional clusters as long as there are unvisited core points to start from.
14. Explain why DBSCAN is able to noise effectively.
It can automatically detect and separate noise from the actual clusters based on density differences.
15. List the advantages of DBSCAN.
  • Handles Noise Data effectively.
  • Robust to different shaped clusters.
  • No need to specify the number of clusters (k).
  • Adaptive to data density.
  • Natural Handling of Outliers.
16. List and explain the disadvantages of DBSCAN.
  • Difficulty with varying density: When clusters in the data have significantly different densities, it can be challenging to find a single set of parameter values that work well for all clusters.
  • Computationally expensive: In datasets with high dimensionality or when using certain distance metrics, DBSCAN’s memory and computational requirements can become substantial.
Last updated on 19 Nov 2023