The Data Science Lab
DBSCAN Clustering and Anomaly Detection Using C#
Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of data clustering and anomaly detection using the DBSCAN (Density Based Spatial Clustering of Applications with Noise) algorithm. Compared to other anomaly detection systems based on data clustering, DBSCAN can find significantly different types of anomalies.
The DBSCAN (Density Based Spatial Clustering of Applications with Noise) algorithm simultaneously clusters data and finds anomalous data items. This article presents an end-to-end demonstration of DBSCAN clustering and anomaly detection using the C# language.
A good way to see where this article is headed is to take a look at the screenshot in Figure 1 and the graph in Figure 2. The demo program begins by loading a tiny 10-item dataset into memory. The dummy data is:
1.0, 1.0
1.5, 1.0
2.0, 2.0
3.0, 9.0
4.0, 2.0
5.0, 2.0
6.0, 5.0
6.5, 6.0
7.0, 5.0
8.0, 6.0
For simplicity, and so that the data can be easily visualized, each data item has just two elements. The DBSCAN technique uses Euclidean distance so all data must be numeric. The source data should be normalized so that all columns have roughly the same range so that columns with large magnitudes don't overwhelm columns with small magnitudes. Data is most often normalized to the range (0.0, 1.0), but the demo data is normalized to range (0.0, 10.0).
The demo program clusters the data into groups, and the result is:
Setting epsilon = 1.5000
Setting minPoints = 2
Clustering with DBSCAN
Done
Clustering results:
0 0 0 -1 -1 -1 1 1 1 1
Number clusters found = 2
The DBSCAN algorithm requires you to specify two parameters, an epsilon value that determines if two data items are close or not, and a minimum points value that sets the fewest number of close data items needed to start a new cluster. These values must be determined by trial and error.
The result clustering indicates that data items [0], [1], [2] are assigned to cluster ID = 0. Data items [3], [4], [5] are assigned to a special noise cluster ID = -1. Data items [6], [7], [8], [9] are assigned to cluster ID = 1.
The demo concludes by analyzing the clustering results:
Cluster counts:
cluster id 0 count = 3
cluster id 1 count = 4
Number noise items = 3
[ 3] : 3.0, 9.0 : number close neighbors = 0
[ 4] : 4.0, 2.0 : number close neighbors = 1
[ 5] : 5.0, 2.0 : number close neighbors = 1
The noise items are anomalies. Because data item [3] has no close neighbors, it is the most anomalous data item. Data items [4] and [5] are slightly less anomalous. If you look at the graph in Figure 2 you can see that these results make sense.
This article assumes you have intermediate or better programming skill but doesn't assume you know anything about the DBSCAN algorithm. The demo is implemented using C# but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.
Understanding DBSCAN Clustering
The DBSCAN clustering algorithm is probably best understood by walking through a concrete example. Assume that epsilon = 1.5 and minPoints = 2, as in the demo. The DBSCAN algorithm iterates through each data item/point. The first item, [0], is at (1.0, 1.0). The algorithm scans all other data items to find the items that are close (within epsilon = 1.5) of item [0], where close is measured by Euclidean distance. Item [1] at (1.5, 1.0) and item [2] at (2.0, 2.0) are close.
If the number of close items is minPoints or greater, then the current data item starts a new cluster. In this case the number of close points meets the minPoints requirement so item [0] is the core point of a new cluster. Internally, during clustering, the cluster IDs are 1-based, so data item [1] is assigned to new cluster ID = 1. After all clustering is complete, the cluster IDs are converted to 0-based form by subtracting 1.
All data items have a label, where 0 means unprocessed, -1 means noise, and 1, 2, 3, . . are cluster IDS. Initially, all data items are assigned label = 0 = unprocessed. After data item [0] has been processed, the algorithm moves to item [1] at (1.5, 1.0). Its current label is 0 = unprocessed. Item [1] is close to the newly created cluster ID = 1, so data item [1] is assigned to cluster ID = 1. Now, all of the close neighbors of item [1] are found. Item [2] is close to item [1] and so it is assigned to cluster ID = 1 too.
At this point, data items [0] and [1] have been examined. Next, the algorithm moves to data item [2]. But item [2] has label = 1 because it was just assigned to cluster ID = 1. Put another way, data item [2] does not have label = 0 = unprocessed, so the algorithm skips over item [2] and moves to data item [3].
Data item [3] at (3.0, 9.0) has label = 0 = unprocessed. All of the close neighbors of [3] are found. There are no items within epsilon = 1.5 so item [3] is assigned label = -1 = noise.
Data item [4] at (4.0, 2.0) is unprocessed. It has one close neighbor, item [5] at (5.0, 2.0). But because there is only one close neighbor, the minPoints requirement is not met, and so item [4] is assigned to cluster ID = -1 = noise. In the same way, item [5] has only one close neighbor and so it is assigned to ID = -1 = noise.
Data item [6] is unprocessed. It has two close neighbors, item [7] at (6.5, 6.0) and item [8] at (7.0, 5.0). Data item [9] at (8.0, 6.0) is not quite within epsilon = 1.5 so it's not close. Because item [6] has two close neighbors, the minPoints requirement is met and item [6] is assigned to new cluster ID = 2.
Item [7] is close to item [6] so it is assigned to cluster ID = 2. The close items to [7] are [6] and [8]. Item [6] is already assigned to cluster ID = 2 but item [8] is unprocessed so it is assigned to cluster ID = 2. The close items to [8] are just item [9] so it is assigned to cluster ID = 2.
Now, all data items have been processed. The 1-based cluster IDs are converted to 0-based IDs. Cluster ID = -1 = noise is left as is.
DBSCAN Clustering vs. k-Means Clustering
The essence of the DBSCAN algorithm is to initialize a cluster that has a minimum number of points close to each other, and then expand outwards, looking for other close items. Data items that are close to each other form a chain of data items in a cluster. Unlike k-means clustering, where clusters have a circular shape, DBSCAN clusters can look like long snake-like structures where the end items in a cluster can be quite far apart. This means that DBSCAN clustering can find significantly different clustering sets than the k-means algorithm.
When using DBSCAN clustering, you don't explicitly specify the number of clusters to create, as you do with k-means clustering. The DBSCAN epsilon and minPoints parameters implicitly determine the number of clusters. In my opinion, the biggest weakness of DBSCAN clustering is that the algorithm is extremely sensitive to the values of the epsilon and minPoints parameters.
The DBSCAN algorithm automatically finds anomalous items -- those assigned to cluster ID = -1. To find anomalous data items using the k-means algorithm, you must perform post-clustering processing, finding data items that are far from their cluster centroid.
The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project AnomalyDBSCAN. I checked the "Do not use top-level statements" option to avoid the program entry point shortcut syntax.
The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.
After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive AnomalyDBSCANProgram.cs. I allowed Visual Studio to automatically rename class Program.
The overall program structure is presented in Listing 1. All the control logic is in the Main() method. All of the DBSCAN clustering and anomaly detection functionality is in an AnomalyDBSCAN class. That class also holds miscellaneous helper functions to load data into memory, and display data.
Listing 1: Overall Program Structure
using System;
using System.IO;
using System.Collections.Generic;
namespace AnomalyDBSCAN
{
internal class AnomalyDBSCANProgram
{
static void Main(string[] args)
{
Console.WriteLine("Begin anomaly detection" +
" using DBSCAN clustering ");
// 1. load data
// 2. cluster data using DBSCAN
// 3. slow clustering
// 4. analyze clustering for anomalies
Console.WriteLine("End demo ");
Console.ReadLine();
}
} // Program
public class AnomalyDBSCAN
{
public double eps;
public int minPts;
public double[][] data; // supplied in cluster()
public int[] labels; // supplied in cluster()
public AnomalyDBSCAN(double eps, int minPts)
{
this.eps = eps;
this.minPts = minPts;
}
public int[] Cluster(double[][] data) { . . }
private List<int> RangeQuery(int p) { . . }
private void Expand(int p, List<int> neighbors,
int cid) { . . }
public void Analyze(string[] rawFileArray) { . . }
private static double EucDistance(double[] x1,
double[] x2) { . . }
// ------------------------------------------------------
// misc. helper functions
// ------------------------------------------------------
public static double[][] MatLoad(string fn,
int[] usecols, char sep, string comment) { . . }
public static string[] FileLoad(string fn,
string comment) { . . }
public static int[] VecLoad(string fn, int usecol,
string comment)
public static void MatShow(double[][] M, int dec,
int wid, int numRows, bool showIndices) { . . }
public static void VecShow(int[] vec, int wid) { . . }
public static void VecShow(double[] vec, int decimals,
int wid) { . . }
public static void ListShow(List<int> lst) { . . }
} // class AnomalyDBSCAN
} // ns
The demo starts by loading the 10-item source data into memory as an array of type string:
string rf =
"..\\..\\..\\Data\\toy_10.txt";
string[] rawFileArray =
AnomalyDBSCAN.FileLoad(rf, "#");
Console.WriteLine("First three rows" +
" of raw data: ");
for (int i = 0; i < 3; ++i)
Console.WriteLine("[" + i.ToString().
PadLeft(3) + "] " + rawFileArray[i]);
The idea here is that in some situations you'll have a source file of raw data that hasn't been normalized in addition to normalized data. The raw data can be used to display anomalous items in raw form. The demo only has normalized data. Next, the demo loads the normalized data into memory:
string fn = "..\\..\\..\\Data\\toy_10.txt";
double[][] X = AnomalyDBSCAN.MatLoad(fn,
new int[] { 0, 1 }, ',', "#");
Console.WriteLine("\nFirst three rows" +
" of normalized and encoded data: ");
AnomalyDBSCAN.MatShow(X, 4, 8, 3, true);
The normalized data is stored as an array-of-arrays style matrix of type double using the MatLoad() helper function. The arguments to MatLoad() mean load columns 0 and 1 of the file, where data is separated by commas, and lines that begin with the '#' character are comments to be ignored. The arguments to MatShow() mean, display matrix X, using 4 decimals, each value padded to 8 spaces wide, just the first 3 rows, with index values.
The data is clustered using these statements:
double epsilon = 1.5;
int minPoints = 2;
AnomalyDBSCAN dbscan =
new AnomalyDBSCAN(epsilon, minPoints);
int[] clustering = dbscan.Cluster(X);
Console.WriteLine("Clustering results: ");
AnomalyDBSCAN.VecShow(clustering, 4);
The Cluster() method returns an integer array of 0-based cluster IDs, where a value of -1 indicates the special noise cluster. Behind the scenes, the Cluster() method calls helper methods RangeQuery() to find close neighbors and Expand() to grow clusters after they have been created.
The demo concludes by calling the Analyze() method:
Console.WriteLine("Analyzing");
dbscan.Analyze(rawFileArray);
Console.WriteLine("End demo ");
Console.ReadLine();
The information in the rawFileArray array is used to display the anomalous items in their raw form, rather than in their normalized form. (Both forms are the same for the dummy demo data).
Wrapping Up
The DBSCAN clustering and anomaly detection technique is most often used with purely numeric data. However, the technique can be used for mixed categorical and numeric data. Plain categorical data, such as a variable color with values "red", "blue", green", can be one-over-n-hot encoded: red = (0.3333, 0, 0), blue = (0, 0.3333, 0), green = (0, 0, 0.3333). Categorical data with inherent ordering, such as a variable height with values "short", "medium", "tall", can be equal-interval encoded: short = 0.25, medium = 0.50, tall = 0.75. Binary data can be encoded using either technique for example, "male" = 0.0, "female" = 0.5 or "male" = 0.25, "female = 0.75.
When using these encoding schemes, all encoded values are between 0.0 and 1.0 and so numeric data should be normalized to this range too. Two common techniques are divide-by-constant if all raw values are positive (for example, dividing people's ages by 100), or min-max normalization. Note that z-score normalization can be used but extreme raw values might be normalized to values less than -1 or greater than +1.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].