Anomaly detection is often approached as a supervised machine learning problem. A typical case is credit card fraud detection. These problems are then reduced to classification problems with imbalanced data. Some of the models used include Decision Tree, Logistic Regression, Random Forest, AdaBoost, XGBoost, Support Vector Machine (SVM), and LightGBM (see: https://www.sciencedirect.com/science/article/pii/S2772662223000036).
Problems arise, however, when we do not have labelled data and need to find anomalies using an unsupervised approach.
DBSCAN (an acronym for Density-Based Spatial Clustering of Applications with Noise) is a popular algorithm for clustering data. Unlike K-Means, there is no need to specify how many clusters you expect, because clusters are formed by grouping points that are close together. One only needs to specify the minimum number of points required to form a cluster and a definition of “close”. The advantage of DBSCAN is that, unlike K-Means, it leaves out points that do not belong to any cluster, which is useful when one needs an unsupervised approach to anomaly detection.
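As a minimal sketch of this idea, assuming scikit-learn and a small made-up 2-D dataset (the eps and min_samples values are purely illustrative), the two things one specifies map directly to DBSCAN’s eps and min_samples parameters, and points that fit no cluster come back labelled -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated point (illustrative data).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
outlier = np.array([[2.5, 2.5]])
X = np.vstack([cluster_a, cluster_b, outlier])

# eps defines "close"; min_samples is the minimum number of points per cluster.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points that belong to no cluster are labelled -1 (noise, i.e. potential anomalies).
print("anomalous points:", X[labels == -1])
```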
A typical (statistical) approach to finding anomalies is to use the interquartile range (IQR). Given a set of values, we can assume that any value which is unusually high or unusually low is an anomaly. The quantification of “unusually high” or “unusually low” is often based on the distance from the median (or from the average). A common approach is to calculate the interquartile range, i.e. the distance between the 25th percentile (Q1) and the 75th percentile (Q3), multiply it by 1.5, and define as anomalous any value above Q3 + 1.5*IQR or below Q1 - 1.5*IQR (see: https://online.stat.psu.edu/stat200/lesson/3/3.2).
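A minimal sketch of the IQR rule with NumPy (the values are made up for illustration):

```python
import numpy as np

values = np.array([12.1, 13.0, 12.8, 13.4, 12.5, 13.1, 25.0, 12.9, 1.5])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# The usual fences: 1.5 * IQR below Q1 and above Q3.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
anomalies = values[(values < lower) | (values > upper)]
print(anomalies)  # flags the extreme values 25.0 and 1.5
```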
Another approach, especially applicable to normal distributions, involves calculating the mean and standard deviation of the distribution. The well-known 68%-95%-99.7% rule quantifies the percentage of values that lie within 1, 2, or 3 standard deviations of the average. Values that are more than 2 or 3 standard deviations from the average are therefore classified as anomalous.
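A minimal sketch of this rule on the same illustrative values (the 2σ threshold can be replaced with 3σ for a stricter cut):

```python
import numpy as np

values = np.array([12.1, 13.0, 12.8, 13.4, 12.5, 13.1, 25.0, 12.9, 1.5])

mu, sigma = values.mean(), values.std()

# Flag everything more than 2 standard deviations from the mean.
anomalies = values[np.abs(values - mu) > 2 * sigma]
print(anomalies)
```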
However, one should note that anomalies are not just extreme points but any points that occur infrequently. While extreme points necessarily occur infrequently, there may also be points that occur infrequently without being extreme.
In “Anomaly detection in temperature data using DBSCAN algorithm” the authors show that DBSCAN finds all the extreme points but can also identify anomalous values that are not extreme.
DBSCAN clusters the majority of the data points around the main locations and labels the outliers as noise (i.e. anomalous), indicating that they are unusual compared to the majority of the data. In this paper, the authors focus on the discovery of anomalies in monthly temperature data using the DBSCAN algorithm.
First, they identify as anomalies the data points greater than μ+2σ (or μ+3σ) or smaller than μ-2σ (or μ-3σ), where μ represents the average and σ the standard deviation.
Then they use the DBSCAN algorithm to find anomalies. The aim of the DBSCAN algorithm is to discover abnormal points that do not fit any of the clusters. In their experiments, the authors evaluated the DBSCAN algorithm and, besides the extreme points, discovered anomalies occurring between normal points, showing that, contrary to what the statistical method implies, there need not be any linear separation between normal points and abnormal points.
This demonstrates how the DBSCAN algorithm can find the same anomalies that statistical methods find, but also additional anomalous values that occur between normal points and are not extreme.
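To make the contrast concrete, here is a minimal sketch on made-up temperature-like data (the values, eps, and min_samples are illustrative and not taken from the paper): the reading of 15.0 is nowhere near extreme, so the μ ± 2σ rule misses it, yet DBSCAN labels it as noise because it belongs to neither dense group.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy "monthly temperature" series: readings cluster around a winter level (~5)
# and a summer level (~25); the single reading of 15.0 fits neither group.
temps = np.array([4.5, 5.0, 5.2, 4.8, 5.5, 5.1, 4.9, 5.3,
                  24.8, 25.1, 25.4, 24.9, 25.2, 25.0, 24.7, 25.3,
                  15.0])

# Statistical rule: 15.0 is close to the overall mean, so mu +/- 2*sigma misses it.
mu, sigma = temps.mean(), temps.std()
stat_anomalies = temps[np.abs(temps - mu) > 2 * sigma]
print("mean +/- 2*std flags:", stat_anomalies)   # empty: no point is flagged

# DBSCAN on the same values: 15.0 is far from both dense groups, so it is noise (-1).
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(temps.reshape(-1, 1))
print("DBSCAN flags:", temps[labels == -1])      # [15.]
```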
Essentially, this means that when searching for anomalies, the main thing is to decide what exactly is considered an anomaly.