### Data Anomaly Detection Using LightGBM

Dr. James McCaffrey from Microsoft Research presents a complete program that uses the Python language LightGBM system to create a custom autoencoder for data anomaly detection. You can easily adapt the demo program for your own anomaly detection scenarios.

If you have a set of data items, the goal of anomaly detection is to find items that are different in some way from most of the items. Anomaly detection is sometimes called outlier detection. There are several techniques that can be used for anomaly detection. This article explains how to perform anomaly detection using LightGBM.

LightGBM (light gradient boosting machine) is a sophisticated, open-source, tree-based system introduced in 2017. LightGBM can perform multi-class classification (predict one of three or more possible values), binary classification (predict one of two possible values), regression and ranking.

The best way to see where this article is headed is to take a look at the screenshot of a demo program in **Figure 1**. LightGBM has three programming language interfaces -- C, Python and R. The demo program uses the Python language API. The demo begins by loading a tiny 10-item dataset into memory. The first three data items look like:

    [ 1.0000 0.2400 0.0000 0.2950 1.0000]
    [ 0.0000 0.3900 1.0000 0.5120 0.5000]
    [ 1.0000 0.6300 0.5000 0.7580 0.0000]
    . . .

Each line represents a person. The predictor variables are sex, age, state of residence, annual income and political leaning.

The demo creates and trains a LightGBM autoencoder model. An autoencoder predicts its input. The first three predicted items are:

    [ 0.8438 0.2557 0.3172 0.2971 0.9238]
    [-0.0632 0.3839 0.7091 0.4732 0.3222]
    [ 0.9026 0.5504 0.7404 0.6254 0.2831]
    . . .

The predicted items are reasonably close to the original source items. When using an autoencoder, the predicted items are sometimes called the reconstructed items.

The demo program compares each of the 10 reconstructed items with its associated original source item. The reconstructed item that is most different from its original version is item [9] and so it is assumed to be anomalous in some way:

    Analyzing all data for reconstruction error
    Most anomalous idx = [9]
    Item: [ 0.0000 0.3900 1.0000 0.4710 1.0000]
    Reconstruction: [ 1.0000 0.2400 0.0000 0.2950 1.0000]
    Error = 0.7887

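Note that the Reconstruction vector displayed here is identical to source item [0], so it should not be read as the model's prediction for item [9]; the apparent mismatches in the sex value in column [0] and the state of residence value in column [2] come from comparing two different source items. The reported error of 0.7887 is the Euclidean distance between item [9] and its actual reconstruction. The demo concludes by displaying the top 5 items with the poorest reconstruction (highest reconstruction error):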

    Top 5 most anomalous items and error:
    9 : 0.7887
    8 : 0.7121
    3 : 0.7115
    7 : 0.4974
    2 : 0.4139

The top three items, [9], [8], [3], appear to be significantly more anomalous than the next two items [7], [2]. In a non-demo scenario, the most anomalous items would be examined to try to determine why they are so anomalous.

This article assumes you have intermediate or better programming skill with a C-family language, preferably Python, and a basic knowledge of decision tree terminology, but does not assume you know anything about LightGBM. The entire source code for the demo program is presented in this article, and is also available in the accompanying file download. You can also find the source code and data here.

**The Data**

The demo program uses a tiny 10-item set of synthetic data. The raw data is:

    F 24 michigan 29500.00 liberal
    M 39 oklahoma 51200.00 moderate
    F 63 nebraska 75800.00 conservative
    M 36 michigan 44500.00 moderate
    F 27 nebraska 28600.00 liberal
    F 50 nebraska 56500.00 moderate
    F 50 oklahoma 55000.00 moderate
    M 19 oklahoma 32700.00 conservative
    F 22 nebraska 27700.00 moderate
    M 39 oklahoma 47100.00 liberal

The fields are sex (M, F), age, state (Michigan, Nebraska, Oklahoma), income, and political leaning (conservative, moderate, liberal). When using LightGBM to act as an autoencoder, you should normalize and encode all data so that the values are in the same range, usually between 0 and 1, or between -1 and +1. This prevents values with large magnitude (such as income) from dominating values with small magnitude (such as age), when computing reconstruction error. The encoded and normalized data is:

    1.0000, 0.2400, 0.0000, 0.2950, 1.0000
    0.0000, 0.3900, 1.0000, 0.5120, 0.5000
    1.0000, 0.6300, 0.5000, 0.7580, 0.0000
    0.0000, 0.3600, 0.0000, 0.4450, 0.5000
    1.0000, 0.2700, 0.5000, 0.2860, 1.0000
    1.0000, 0.5000, 0.5000, 0.5650, 0.5000
    1.0000, 0.5000, 1.0000, 0.5500, 0.5000
    0.0000, 0.1900, 1.0000, 0.3270, 0.0000
    1.0000, 0.2200, 0.5000, 0.2770, 0.5000
    0.0000, 0.3900, 1.0000, 0.4710, 1.0000

Sex is encoded as male = 0, female = 1. Age is normalized by dividing by 100. State of residence is encoded as Michigan = 0.0, Nebraska = 0.5, Oklahoma = 1.0. Income is normalized by dividing by 100,000. Political leaning is encoded as conservative = 0.0, moderate = 0.5, liberal = 1.0.
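The encoding and normalization rules just described can be sketched in a few lines of Python. This is only an illustration: the function name encode_person and the three lookup tables are hypothetical and are not part of the demo program, which reads data that has already been encoded.

```python
# Hypothetical lookup tables that mirror the encoding described in the text.
SEX = {"M": 0.0, "F": 1.0}
STATE = {"michigan": 0.0, "nebraska": 0.5, "oklahoma": 1.0}
POLITICS = {"conservative": 0.0, "moderate": 0.5, "liberal": 1.0}

def encode_person(line):
    """Convert one whitespace-delimited raw line into five values in [0, 1]."""
    sex, age, state, income, politics = line.split()
    return [
        SEX[sex],
        float(age) / 100.0,          # divide-by-constant normalization
        STATE[state],                # equal-interval encoding
        float(income) / 100_000.0,   # divide-by-constant normalization
        POLITICS[politics],          # equal-interval encoding
    ]

raw = "F 24 michigan 29500.00 liberal"   # first raw data line
encoded = encode_person(raw)
print(", ".join("%0.4f" % v for v in encoded))
# 1.0000, 0.2400, 0.0000, 0.2950, 1.0000
```

The printed line matches the first row of the encoded and normalized data.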

Note that when using LightGBM for regular classification and regression, you don't need to normalize the source data.

To the best of my knowledge, there are no research results related to exactly how data should best be encoded and normalized for tree-based autoencoder systems. In practice, normalizing numeric data using min-max normalization or divide-by-constant, and encoding categorical data using equal-interval encoding, works well in most cases.

**Installing Python and LightGBM**

To use the Python language API for LightGBM, you must have Python installed on your machine. I strongly recommend using the Anaconda distribution of Python. The Anaconda distribution contains a Python interpreter and roughly 500 Python packages that are (mostly) compatible with one another. The demo uses version Anaconda3-2023.09-0, which contains Python version 3.11.5. To install Anaconda on a Windows platform, go here and find installer file Anaconda3-2023.09-0-Windows-x86_64.exe (or newer). Note: it is very easy to accidentally download a version that's not compatible with your machine.

Click on the .exe file link to download it to your machine. After the file is on your machine, double-click on the file to start the GUI-based installation process. In most scenarios, you can accept all the default installation values except the one which does not add Anaconda3 to your machine's PATH environment variable -- I recommend checking the option that adds Anaconda3 to your PATH variable so that you don't have to manually edit your system environment variables, or enter long paths on the command line.

I've published detailed step-by-step instructions for installing Anaconda Python.

You can verify your Anaconda Python installation by opening a command shell and typing the command "python" (without quotes). You should see a reply message that indicates the version of Python, followed by the Python triple greater-than prompt. You can type "exit()" to quit the interpreter.

If you ever need to uninstall Anaconda on a Windows machine, you can do so by going to the Add or Remove Programs setting, and clicking on the Uninstall option.

At the time this article was written, the Anaconda distribution does not contain the LightGBM system, and so it must be installed separately. I strongly recommend using the pip installer program (which is included with Anaconda). To install the most recent version of LightGBM over the internet, open a command shell and type the command "pip install lightgbm" (without the quotes). After a few seconds, you should see a message indicating success. To verify, open a command shell and type "python" to start the interpreter. At the Python prompt, type the command "import lightgbm as L" followed by the command "L.__version__" (with two underscores on each side of the word version). You should see the version of LightGBM that is installed.

Instead of installing LightGBM over the internet, you can first download the LightGBM package to your machine and then install it. Go to https://pypi.org/ and search for "lightgbm" (without the quotes). The search results will give you a link to a LightGBM package page. Click on the Download Files link. You will go to a page that has a .whl file named like lightgbm-4.3.0-py3-none-win_amd64.whl that you can click on to download the file to your machine. After the download completes, open a command shell, navigate to the directory containing the .whl file and install LightGBM by typing the command "pip install" followed by the .whl file name.

If you ever need to uninstall LightGBM, you can do so by typing the command "pip uninstall lightgbm" (without the quotes). I often use the local-install technique so that I can have a copy of LightGBM on my machine available even when I'm not connected to the internet.

**The LightGBM Demo Program**

The complete demo program is presented in **Listing 1**. The demo begins by loading the normalized and encoded data into memory:

    import numpy as np
    import lightgbm as L

    def main():
        print("Anomaly detection using LightGBM autoencoder ")
        np.random.seed(1)
        np.set_printoptions(precision=4, suppress=True,
            floatmode='fixed', sign=" ")

        print("Loading source data ")
        src = ".\\Data\\people_10.txt"  # tiny subset
        data_XY = np.loadtxt(src, usecols=[0,1,2,3,4],
            delimiter=',', comments="#", dtype=np.float64)
    . . .

The train() method of the program-defined Autoencoder class calls np.random.choice(), which uses the NumPy global random number generator to select the training rows for each sub-model. Setting the generator seed value makes those selections, and therefore the demo results, reproducible.

The demo assumes that the data file is located in a subdirectory named Data. The comma-delimited data is loaded into a NumPy array using the loadtxt() function. The values in columns 0, 1, 2, 3, 4 are loaded as type float64. Lines that begin with "#" are comments and are not loaded.
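The loading behavior can be tried without a data file. The sketch below is an illustration only: it uses an in-memory io.StringIO object as a stand-in for people_10.txt and holds just the first two encoded rows plus one comment line.

```python
import io
import numpy as np

# StringIO stands in for the data file; the first line is a comment.
text = """# sex, age, state, income, politics
1.0000, 0.2400, 0.0000, 0.2950, 1.0000
0.0000, 0.3900, 1.0000, 0.5120, 0.5000
"""

data = np.loadtxt(io.StringIO(text), usecols=[0, 1, 2, 3, 4],
                  delimiter=",", comments="#", dtype=np.float64)

print(data.shape)   # (2, 5): the comment line is skipped
print(data.dtype)   # float64
```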

**Listing 1:** LightGBM Anomaly Detection Demo Program

    # people_anomaly_lgbm.py
    # custom LightGBM autoencoder reconstruction error
    # Anaconda3 2023.09-0  Python 3.11.5  LightGBM 4.3.0

    import numpy as np
    import lightgbm as L

    # -----------------------------------------------------------

    class Autoencoder():
        def __init__(self, dim, n_estimators, min_leaf, lrn_rate):
            self.dim = dim
            self.n_estimators = n_estimators
            self.min_leaf = min_leaf
            self.lrn_rate = lrn_rate
            self.sub_models = []  # list of LGBMRegressor models

        def train(self, data_all):
            for j in range(self.dim):  # each column
                n = len(data_all)  # use 80% of rows
                all_rows = np.arange(n)
                selected_rows = np.random.choice(all_rows,
                    size=int(n * 0.80), replace=False)
                data_partial = data_all[selected_rows,:]

                train_x = np.delete(data_partial, j, axis=1)
                train_y = data_partial[:, j]

                params = {
                    'objective': 'regression',  # not needed
                    'boosting_type': 'gbdt',  # default
                    'n_estimators': self.n_estimators,  # default = 100
                    'num_leaves': 31,  # default
                    'learning_rate': self.lrn_rate,  # default = 0.10
                    'feature_fraction': 1.0,  # default
                    'min_data_in_leaf': self.min_leaf,  # default = 20
                    'random_state': 0,
                    'verbosity': -1
                }
                sub_model = L.LGBMRegressor(**params)
                sub_model.fit(train_x, train_y)
                self.sub_models.append(sub_model)

        def predict(self, x):  # x is 1D
            x = x.reshape(1, -1)  # 2D for LGBMRegressor.predict()
            result = np.zeros(self.dim, dtype=np.float64)
            for i in range(self.dim):
                xx = np.delete(x, i, axis=1)  # peel away target col
                pred = self.sub_models[i].predict(xx)
                result[i] = pred[0]  # predict() returns a 1-element array
            return result

    # -----------------------------------------------------------

    def analyze(model, data_XY):
        n = len(data_XY)
        most_anom_idx = 0
        most_anom_item = data_XY[0]
        most_anom_recon = data_XY[0]
        largest_err = 0.0
        for i in range(n):
            x = data_XY[i]
            y = model.predict(x)
            err = np.linalg.norm(x-y)
            if err > largest_err:
                largest_err = err
                most_anom_idx = i
                most_anom_item = x
                most_anom_recon = y  # keep the reconstruction for display

        print("\nMost anomalous idx = [" + str(most_anom_idx) + "]")
        print("Item: ", end="")
        print(most_anom_item)
        print("Reconstruction: ", end="")
        print(most_anom_recon)
        print("Error = %0.4f " % largest_err)

    # -----------------------------------------------------------

    def analyze2(model, data_XY):
        n = len(data_XY)
        ids = np.arange(n, dtype=np.int64)  # 0, 1, 2, . .
        errors = np.zeros(n, dtype=np.float64)
        for i in range(n):
            x = data_XY[i]
            y = model.predict(x)
            err = np.linalg.norm(x-y)
            errors[i] = err
        sorted_error_idxs = np.flip(np.argsort(errors))
        sorted_errors = errors[sorted_error_idxs]
        sorted_ids = ids[sorted_error_idxs]
        print("\nTop 5 most anomalous items and error: ")
        for i in range(5):
            print(str(sorted_ids[i]) + " : ", end="")
            print("%8.4f" % sorted_errors[i])

    # -----------------------------------------------------------

    def main():
        print("\nAnomaly detection using LightGBM autoencoder ")
        np.random.seed(1)
        np.set_printoptions(precision=4, suppress=True,
            floatmode='fixed', sign=" ")

        print("\nLoading source data ")
        src = ".\\Data\\people_10.txt"  # tiny subset
        # 1.0000, 0.2400, 0.0000, 0.2950, 1.0000
        # 0.0000, 0.3900, 1.0000, 0.5120, 0.5000
        # . . .
        # sex     age     State   income  politics
        data_XY = np.loadtxt(src, usecols=[0,1,2,3,4],
            delimiter=',', comments="#", dtype=np.float64)

        print("\nFirst 3 rows source data: ")
        for i in range(3):
            print(data_XY[i])

        print("\nCreating autoencoder model ")
        print("dim = 5 ")
        print("n_estimators = 50 ")
        print("min_leaf = 2 ")
        print("learn_rate = 0.05 ")
        ae_model = Autoencoder(5, 50, 2, 0.05)
        print("Done ")

        print("\nTraining model ")
        ae_model.train(data_XY)
        print("Done ")

        print("\nFirst 3 predicted data items: ")
        for i in range(3):
            x = data_XY[i]  # x is 1D
            y = ae_model.predict(x)
            print(y)

        print("\nAnalyzing all data for reconstruction error ")
        analyze(ae_model, data_XY)
        analyze2(ae_model, data_XY)

        print("\nEnd demo ")

    if __name__ == "__main__":
        main()

Next, the demo displays the first three lines of the data as a sanity check:

    print("First 3 rows source data: ")
    for i in range(3):
        print(data_XY[i])

In a non-demo scenario, you might want to display all the data.

**Creating and Training the LightGBM Autoencoder Model**

The LightGBM system does not have a built-in autoencoder class so one must be created using multiple regression models. The goal of the autoencoder is to predict its input. Each demo data input vector has five values: sex, age, state, income, politics. Therefore, the autoencoder has five regression sub-models. The first sub-model predicts the sex value in column [0], using the values in columns [1], [2], [3], [4] as predictors. The second sub-model predicts the age value in column [1], using the values in columns [0], [2], [3], [4]. And so on. Together the five sub-models can predict an input vector.
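The one-sub-model-per-column idea can be illustrated with NumPy alone. The sketch below only shows how the predictor and target arrays are carved out for each column; it does not train anything, and the splits list is a hypothetical name used for illustration.

```python
import numpy as np

# First three encoded rows from the demo data.
data = np.array([
    [1.0, 0.24, 0.0, 0.295, 1.0],
    [0.0, 0.39, 1.0, 0.512, 0.5],
    [1.0, 0.63, 0.5, 0.758, 0.0]])

splits = []
for j in range(data.shape[1]):
    x_j = np.delete(data, j, axis=1)   # predictors: every column except j
    y_j = data[:, j]                   # target: column j
    splits.append((x_j, y_j))

# Five (predictors, target) pairs, one per column.
print(len(splits), splits[0][0].shape, splits[0][1].shape)
# 5 (3, 4) (3,)
```

Each pair would be handed to its own regression sub-model, so the sub-model for column j never sees column j as an input.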

The demo program creates an autoencoder using these statements:

    print("Creating autoencoder model ")
    print("dim = 5 ")
    print("n_estimators = 50 ")
    print("min_leaf = 2 ")
    print("learn_rate = 0.05 ")
    ae_model = Autoencoder(5, 50, 2, 0.05)
    print("Done ")

The autoencoder is trained like so:

    print("Training model ")
    ae_model.train(data_XY)
    print("Done ")

The Autoencoder is a program-defined class. The dim parameter is the number of columns in the source data. The demo program train() method creates and trains each of the five LightGBM regression sub-models using these statements:

    params = {
        'objective': 'regression',  # not needed
        'boosting_type': 'gbdt',  # default
        'n_estimators': self.n_estimators,  # default = 100
        'num_leaves': 31,  # default
        'learning_rate': self.lrn_rate,  # default = 0.10
        'feature_fraction': 1.0,  # default
        'min_data_in_leaf': self.min_leaf,  # default = 20
        'random_state': 0,
        'verbosity': -1
    }
    sub_model = L.LGBMRegressor(**params)
    sub_model.fit(train_x, train_y)
    self.sub_models.append(sub_model)

The train_x array is a randomly selected 80 percent of the rows of the source data with the target column removed. The train_y array is the target column, which contains the values to predict, for the same rows. A new random 80 percent of the rows is drawn for each sub-model, instead of using all of the rows. The idea here is that LightGBM is powerful and if all rows are used, the predictions will likely be perfect, and so no anomalous items will be found.
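The row selection can be examined in isolation. The sketch below repeats the np.random.choice() call from train() for a 10-row dataset; the held_out variable is a hypothetical addition that shows which rows a sub-model would not see.

```python
import numpy as np

np.random.seed(1)   # same global seed that the demo sets
n = 10              # number of rows in the demo data

# Pick 80 percent of the row indices, with no index chosen twice.
selected_rows = np.random.choice(np.arange(n), size=int(n * 0.80),
                                 replace=False)

# Rows left out of training for this sub-model (illustration only).
held_out = np.setdiff1d(np.arange(n), selected_rows)

print(sorted(selected_rows.tolist()))
print(held_out.tolist())
```

Because the call sits inside the per-column loop in train(), each of the five sub-models is trained on its own random subset of eight rows.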

The regression object is named sub_model and is instantiated by setting up its parameters as a Python Dictionary collection named params. The main challenge when using LightGBM is wading through the dozens of parameters. The LGBMRegressor class/object has 19 parameters (num_leaves, max_depth, etc.) and behind the scenes there are 57 Learning Control Parameters (min_data_in_leaf, bagging_fraction, etc.), for a total of 76 parameters to deal with.

Documentation for the parameters can be found here and here.

Because the number of parameters is not manageable, you must rely on the default values and then try to find the handful of parameters that will create a good model. Based on my experience, the three most important parameters to explore and modify are n_estimators, min_data_in_leaf and learning_rate.

A LightGBM regression model is an ensemble of relatively small decision trees that are called weak learners, or sometimes base learners. The n_estimators parameter (default value is 100) sets the number of trees. The weak trees are constructed sequentially where each tree uses gradients of the error from the previous tree. If the value of n_estimators is too small, then there aren't enough weak learners to create a model that predicts well (underfit). If the value of n_estimators is too large, then the model will overfit the training data and predict poorly on new, previously unseen data items. The demo uses 50 weak learners.

The num_leaves parameter sets the maximum number of leaf nodes in each weak learner tree, and so controls the overall size of the trees. The default value of 31 allows roughly as many leaves as a balanced binary tree that is five levels deep (which has 2^5 = 32 leaves). LightGBM grows trees leaf-wise, so the trees are often unbalanced and can have more levels. Weak learners that are too small might underfit; too large might overfit.

The max_depth parameter controls the number of levels that each weak learner has. The default value is -1, which means that there is no explicit limit. In most cases, the num_leaves parameter will override the max_depth parameter and prevent the depth of the weak learners from becoming too large.

The min_data_in_leaf parameter controls the size of the leaf nodes in the weak learners. The default value of 20 means that each leaf node must have at least 20 associated data items. For a relatively small set of training data, the default greatly reduces the number of leaf nodes. For the demo with only 10 data items, a requirement of at least 20 values in each node means all 10 items would be clumped together. The demo modifies the value of min_data_in_leaf from 20 to 2, which gives much better results.

To recap, the n_estimators parameter controls the overall number of weak tree learners. The key parameters to control the size and shape of the weak learners are num_leaves, max_depth and min_data_in_leaf. Based on my experience, I typically experiment with n_estimators (the default value of 100 is often too large for small datasets) and min_data_in_leaf (the default of 20 is often too large for small datasets). I usually leave the num_leaves and max_depth parameter values at their default values of 31 and -1 (unlimited) respectively unless the model just doesn't predict well.

The demo modifies the learning_rate parameter from the default value of 0.10 to 0.05. The learning rate scales how much each new weak learner tree contributes to the model's running prediction. The effect of changing the learning_rate can vary quite a bit depending on the size and shape of the weak learners, but as a rule of thumb, smaller values work better for smaller datasets.

The demo modifies the value of the random_state parameter from its default value of None (Python's version of null) to 0. The None value means that results are not reproducible due to the random initialization component of the training process. Any value other than None will give (mostly) reproducible results, subject to multi-threading issues.

The demo modifies the value of the verbosity parameter from its default value of 1 to -1. The default value of 1 prints warning messages, regular error messages and fatal error messages. The demo value of -1 prints only fatal error messages. I did this only to keep the output small so I could take a screenshot. In a non-demo scenario you should leave the verbosity value at 1 in most situations.

After setting up the parameter values in a Dictionary collection, they are passed to the LGBMRegressor using the Python ** syntax, which means unpack the values to parameters. Parameter values can be passed directly, for example model = L.LGBMRegressor(n_estimators=50, learning_rate=0.05, and so on), but because there are so many parameters, this approach is rarely used.
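The ** unpacking syntax is plain Python and can be demonstrated without LightGBM. In the sketch below, make_regressor is a hypothetical stand-in for a constructor with keyword parameters; it simply records the settings it receives.

```python
def make_regressor(n_estimators=100, learning_rate=0.10, min_data_in_leaf=20):
    """Hypothetical constructor stand-in: returns the settings it was given."""
    return {
        "n_estimators": n_estimators,
        "learning_rate": learning_rate,
        "min_data_in_leaf": min_data_in_leaf,
    }

params = {"n_estimators": 50, "learning_rate": 0.05, "min_data_in_leaf": 2}

from_dict = make_regressor(**params)   # dictionary keys become keyword names
direct = make_regressor(n_estimators=50, learning_rate=0.05,
                        min_data_in_leaf=2)

print(from_dict == direct)   # True
```

Keeping the settings in one dictionary makes it easy to log them or to reuse the same values for all five sub-models.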

**Analyzing the Reconstructions**

The demo program computes the Euclidean distance between each data item and its reconstructed version. The Euclidean distance is the reconstruction error. For example, if a data item is (0.5, 1.0, 0.3, 0.4, 0.9) and its predicted/reconstructed version is (0.4, 1.0, 0.1, 0.7, 0.9), the Euclidean distance is sqrt( (0.5 - 0.4)^2 + (1.0 - 1.0)^2 + (0.3 - 0.1)^2 + (0.4 - 0.7)^2 + (0.9 - 0.9)^2 ) = sqrt(0.01 + 0 + 0.04 + 0.09 + 0) = sqrt(0.14) = 0.37.
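The worked example can be checked in a few lines. The sketch computes the distance two ways: from the definition, and with np.linalg.norm(), which is the call the demo uses.

```python
import numpy as np

x = np.array([0.5, 1.0, 0.3, 0.4, 0.9])   # original item
y = np.array([0.4, 1.0, 0.1, 0.7, 0.9])   # reconstructed item

err_manual = np.sqrt(np.sum((x - y) ** 2))   # square root of sum of squares
err_norm = np.linalg.norm(x - y)             # same value via the vector norm

print("%0.4f" % err_manual)   # 0.3742
print("%0.4f" % err_norm)     # 0.3742
```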

Notice that if a reconstructed item is identical to its original item, the Euclidean distance is sqrt(0 + 0 + . . + 0) = 0. In principle, there is no upper limit on the reconstruction error value. However, if every original data item has n elements that are normalized and encoded to lie between 0 and 1, and every reconstructed element also lies roughly between 0 and 1 (reconstructed elements can be slightly less than 0 or slightly greater than 1), then the largest reconstruction error is approximately sqrt(1 + 1 + . . + 1) = sqrt(n). For the demo data where each data item has five elements, the largest possible reconstruction error is about sqrt(5) = 2.24. If you are analyzing more than one dataset and want to compare their reconstruction errors, you can normalize reconstruction error by dividing by the number of elements in each data item, or by the approximate largest possible reconstruction error.
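Both normalization options can be sketched as follows. The function names are hypothetical, and the input value 0.7887 is the largest error reported in the demo output.

```python
import math

def error_per_element(err, n):
    """Divide the reconstruction error by the number of elements."""
    return err / n

def error_fraction_of_max(err, n):
    """Divide by sqrt(n), the approximate largest error for data in [0, 1]."""
    return err / math.sqrt(n)

err = 0.7887   # largest reconstruction error reported by the demo
n = 5          # elements per data item

print("%0.4f" % error_per_element(err, n))       # 0.1577
print("%0.4f" % error_fraction_of_max(err, n))   # 0.3527
```

The second form is convenient for comparisons because a value near 1 means the reconstruction is about as wrong as it can be, regardless of how many columns the dataset has.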

The analyze() function computes the single most anomalous data item. The analyze2() function computes the Euclidean distance/error between all 10 data items and the 10 reconstructions, and sorts them from largest reconstruction error to smallest.
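The sorting step in analyze2() relies on np.argsort() followed by np.flip(). The sketch below isolates that idea. It uses only the five item indices and error values reported in the demo output, listed in index order, because the errors for the other five items are not shown in this article.

```python
import numpy as np

# The five (index, error) pairs reported by the demo, in index order.
ids = np.array([2, 3, 7, 8, 9])
errors = np.array([0.4139, 0.7115, 0.4974, 0.7121, 0.7887])

# argsort gives positions from smallest to largest error; flip reverses them.
order = np.flip(np.argsort(errors))

top_k = 3
for pos in order[:top_k]:
    print("%d : %0.4f" % (ids[pos], errors[pos]))
# 9 : 0.7887
# 8 : 0.7121
# 3 : 0.7115
```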

**Wrapping Up**

The demo code can be used mostly as-is for many anomaly detection scenarios. Most of your time and effort for anomaly detection will be spent on normalizing and encoding the raw source data.

The fact that there are many different techniques for data anomaly detection is an indication that no one technique works best in all scenarios. Different techniques tend to find different kinds of anomalous data items. Compared to other techniques, using a LightGBM autoencoder is relatively easy to use and easy to modify. The major disadvantage of using a LightGBM autoencoder is the large number of hyperparameters used by the underlying LGBMRegressor object.