The Data Science Lab

AdaBoost Binary Classification Using C#

Dr. James McCaffrey from Microsoft Research presents a C# program that illustrates using the AdaBoost algorithm to perform binary classification for spam detection. Compared to other classification algorithms, AdaBoost is powerful and works well with small datasets, but is sometimes susceptible to model overfitting.

AdaBoost ("adaptive boosting") is a powerful machine learning technique for binary classification -- predicting a variable that has two possible values. This article presents a demo program that illustrates using AdaBoost for email spam detection.

Compared to other binary classification techniques, AdaBoost has very strong predictive power, typically rivaled only by other boosting algorithms and neural network classifiers. Two other advantages of AdaBoost are that it doesn't require data normalization and that it usually works well with small datasets. Two disadvantages of AdaBoost are that it is sometimes susceptible to model overfitting, and that it's not as interpretable as techniques such as logistic regression and k-nearest neighbors classifiers.

The best way to see where this article is headed is to take a look at the screenshot of the demo program in Figure 1. The demo program begins by loading 200 training items and 50 test items from the UCI Email Spam dataset. Each data item has 57 predictor values. The goal of the program is to predict whether an email message is not-spam (-1) or spam (+1).

Figure 1: AdaBoost Binary Classification for Spam Detection in Action

The AdaBoost model requires only one parameter, the number of weak learners to use. The demo uses 100 weak learners. The demo trains the AdaBoost model using the 200 training items. The trained model scores 100% accuracy on the training data (200 out of 200 correct) and 96.00% accuracy on the test data (48 out of 50 correct).

The demo concludes by making a prediction for a new, previously unseen email message that is represented as 57 numeric values: (0.00, 0.64, 0.64, . . . 61.00, 278.00). The model predicts that the email message is class 1 = spam.

This article assumes you have intermediate or better programming skill with a C-family language, preferably C#, but doesn't assume you know anything about AdaBoost.

The code for the demo program is a bit too long to be presented in its entirety in this article. The complete code and the data are available in the accompanying file download, and are also available online.

Understanding the Data
The UCI Email Spam Dataset can be found at archive.ics.uci.edu/dataset/94/spambase. The source data has 4,601 rows. Each row represents the contents of an email message. Each comma-delimited row has 57 predictor values followed by a class label 0 = not-spam or 1 = spam. The data looks like:

0,0.64,0.64, . . . 0,3.756,61,278,1
0.21,0.28,0.5, . . . 0.048,5.114,101,1028,1
0.06,0,0.71, . . . 0.01,9.821,485,2259,1
. . .
0,0,0.65, . . . 0,1.25,5,40,0

The first 1,813 rows are all class 1 = spam. The next 2,788 rows are all class 0 = not-spam. The first 54 of the 57 predictor variables on each line are percentages of words (48 variables) and characters (6 variables) such as "free" and "$" in the message. Most of the percentage values on a line are 0. The last 3 of the 57 predictor variables on each line are related to the run length of consecutive capital letters like "SPECIAL."

The demo program uses 200 randomly selected rows of data for the training set and 50 different randomly selected rows for a test set. The demo program loads the training data into memory using these statements:

string trainFile = "..\\..\\..\\Data\\spam_train_200.txt";
int[] colsX = new int[57];
for (int i = 0; i < colsX.Length; ++i)
{ colsX[i] = i; }
double[][] trainX = Utils.MatLoad(trainFile, colsX, ',', "#");
int[] trainY = Utils.MatToIntVec(Utils.MatLoad(trainFile,
  [57], ',', "#"));

The MatLoad() function is a program-defined helper inside a Utils class. The result is a C# array-of-arrays style matrix of type double. The arguments to MatLoad() mean load columns [0] to [56] inclusive, where the data is comma-delimited, and lines that begin with "#" are comments.

The target y values are loaded as a type double matrix and then converted to a type int vector using the program-defined MatToIntVec() helper function.
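
To make the data loading concrete, here is a minimal sketch of what a MatLoad()-style loader could look like. This is an illustrative simplification under my own assumptions -- the actual version in the download includes additional error checking:

static double[][] MatLoad(string fn, int[] usecols,
  char sep, string comment)
{
  // illustrative sketch; not the download version
  List<double[]> rows = new List<double[]>();
  foreach (string line in File.ReadLines(fn))
  {
    if (line.StartsWith(comment)) continue;  // skip comments
    string[] tokens = line.Split(sep);
    double[] row = new double[usecols.Length];
    for (int j = 0; j < usecols.Length; ++j)
      row[j] = double.Parse(tokens[usecols[j]]);
    rows.Add(row);
  }
  return rows.ToArray();  // array-of-arrays matrix
}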

A very nice characteristic of the AdaBoost binary classification system is that it doesn't require you to normalize numeric predictor values. This can be a huge time saver. A minor quirk of AdaBoost is that target class label values must be encoded as -1 or +1 rather than the far more common 0 or 1. The demo converts the training target labels like so:

for (int i = 0; i < trainY.Length; ++i)
  if (trainY[i] == 0) trainY[i] = -1;

In a non-demo scenario, preparing training and test data usually requires the most time and effort when using AdaBoost. But this is true for virtually all machine learning techniques.

Understanding AdaBoost
A machine learning "boosting" algorithm is one that is made up of many weak learners. In theory, almost any primitive type of software component can be used as the weak learner. But in practice, weak learners are almost always the simplest possible decision tree, called a stump. For the demo data, where the 57 predictor columns are indexed [0] through [56], one stump might be similar to "if the value at X[col=52] is less than 0.35 then predict -1."
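
In code, a stump's decision rule is just a single comparison. Here is a minimal sketch based on the fields of the Stump class shown later in Listing 1. The StumpPredict() name and the direction of the comparison are illustrative assumptions -- the demo's actual rule may differ in minor details:

static int StumpPredict(Stump s, double[] x)
{
  // sketch: polarity determines which side predicts which class
  if (x[s.colIndex] < s.splitVal)
    return s.polarity;   // e.g., predict -1
  else
    return -s.polarity;  // predict the other class
}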

The "adaptive" part of AdaBoost means that each consecutive weak learner stump is constructed using error information from the previous stump. The result is a list of stumps where each stump is slightly better at predicting difficult items than the previous stump.

Each stump has an alpha value assigned to it. The alpha value is a weight that indicates the quality of the stump. To make a prediction using the list of weak learner stumps, the equation looks like: prediction = sign(sum(alpha(i) * stump(i))). In words, you use each stump to make a prediction of -1 or +1 and multiply that prediction by the associated alpha value. You add all the weighted predictions; if the sum is positive, the final prediction is class +1, and if the sum is negative, the final prediction is class -1. Because the demo program uses 100 weak learners, there are 100 alpha weights.
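
Expressed in code, the inference equation is a short loop. A minimal sketch, reusing the hypothetical StumpPredict() helper sketched above (how a weighted sum of exactly zero is broken is an implementation choice):

static int Predict(List<Stump> learners, double[] x)
{
  double sum = 0.0;
  foreach (Stump s in learners)
    sum += s.alpha * StumpPredict(s, x);  // weighted vote
  return (sum >= 0.0) ? 1 : -1;  // sign; ties map to +1 here
}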

Figure 2: The AdaBoost Algorithm in Pseudo-Code

The equations in Figure 2 are a representation of the AdaBoost algorithm. If you're not used to working with math all day long, the equations might look like gibberish. But the algorithm is much simpler than it first appears. From a developer's point of view, the key thing to note is that there are two sets of weights used by the algorithm. The alpha weights are inference weights -- they're used to make the final prediction. The w weights in Figure 2 are training weights. They are used to construct the list of weak learners by giving greater or lesser weight to each training item. Because the demo uses 200 training items, there are 200 w weights.
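
Here is a minimal sketch of one round of the training weight update, following the standard AdaBoost equations shown in Figure 2. The UpdateWeights() name is hypothetical, and the small epsilon that guards against division by zero is an assumption -- the demo's Train() method may handle that edge case differently:

static void UpdateWeights(Stump stump, double[][] trainX,
  int[] trainY, double[] w)
{
  int n = trainX.Length;
  double err = 0.0;  // weighted error of the new stump
  for (int i = 0; i < n; ++i)
    if (StumpPredict(stump, trainX[i]) != trainY[i])
      err += w[i];

  // alpha is large when the stump's weighted error is small
  stump.alpha = 0.5 * Math.Log((1.0 - err) / (err + 1.0e-10));

  double sum = 0.0;
  for (int i = 0; i < n; ++i)
  {
    int pred = StumpPredict(stump, trainX[i]);
    // misclassified items get heavier; correct ones get lighter
    w[i] *= Math.Exp(-stump.alpha * trainY[i] * pred);
    sum += w[i];
  }
  for (int i = 0; i < n; ++i)
    w[i] /= sum;  // normalize so the weights sum to 1.0
}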

Overall Program Structure
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project AdaBoost. I checked the "Do not use top-level statements" option to avoid the program entry point shortcut syntax.

The demo has no significant .NET dependencies, and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.

After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive AdaBoostProgram.cs. I allowed Visual Studio to automatically rename class Program.

The overall program structure is presented in Listing 1. All the control logic of the demo is in the Main() method. The AdaBoost class has all the AdaBoost logic. The Utils class has helper functions to load the training and test data into memory.

Listing 1: AdaBoost Demo Program Structure

using System;
using System.IO;
using System.Collections.Generic;

namespace AdaBoost
{
  internal class AdaBoostProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin AdaBoost" +
        " classification demo ");

      // 1. load data
      //    replace 0 labels with -1s
      // 2. create and train model
      // 3. evaluate model
      // 4. use model
 
      Console.WriteLine("End demo ");
      Console.ReadLine();
    } // Main()

  } // Program

  // ========================================================

  public class AdaBoost
  {
    public int nLearners;
    public List<Stump> learners;

    // ------------------------------------------------------

    public class Stump
    {
      public int polarity;  // -1 or +1
      public int colIndex;  // index of the pivot feature
      public double splitVal;  // pivot value
      public double alpha;  // quality; used to update wts

      public Stump()
      {
        this.polarity = 0;
        this.colIndex = 0;
        this.splitVal = 0.0;
        this.alpha = 0.0;
      }
    }

    // ------------------------------------------------------

    public AdaBoost(int nLearners) { . . }
    public void Train(double[][] trainX, int[] trainY) { . . }
    private static double[] UniqueValues(double[][] mat,
      int colIdx) { . . }
    public int[] Predict(double[][] X) { . . }
    public int Predict(double[] x) { . . }
    public double Accuracy(double[][] dataX, int[] dataY) { . . }

  } // class AdaBoost

  // ========================================================

  public class Utils
  {
    // -------------------------------------------------------
    // helpers used by Main: MatLoad(), MatToIntVec(),
    // MatShow(), VecShow(), ShuffleRows(), ExtractRows(),
    // ExtractCols(), ExtractColAsIntVector(), MatMake()
    // -------------------------------------------------------

  } // class Utils

} // ns

The AdaBoost class contains a nested Stump class that implements a weak learner. The AdaBoost class exposes two Predict() methods. One overload accepts a matrix of predictor values and generates a vector of predicted values. The other overload accepts a single vector of input values and generates a single -1 or +1 prediction.
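
The matrix overload of Predict() is presumably just a loop over the vector overload. A minimal sketch of one possible implementation:

public int[] Predict(double[][] X)
{
  int[] result = new int[X.Length];
  for (int i = 0; i < X.Length; ++i)
    result[i] = this.Predict(X[i]);  // reuse the vector overload
  return result;
}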

After loading the training data as described earlier, the demo loads the test data using the same pattern:

string testFile = "..\\..\\..\\Data\\spam_test_50.txt";
double[][] testX = Utils.MatLoad(testFile, colsX, ',', "#");
int[] testY = Utils.MatToIntVec(Utils.MatLoad(testFile,
  [57], ',', "#"));

// replace 0 labels with -1s
for (int i = 0; i < trainY.Length; ++i)
  if (trainY[i] == 0) trainY[i] = -1;
for (int i = 0; i < testY.Length; ++i)
  if (testY[i] == 0) testY[i] = -1;
Console.WriteLine("Done ");

Instead of programmatically converting target 0 values to -1 values, you could preprocess the data file explicitly. However, the vast majority of machine learning binary classification algorithms use 0-1 encoding for the target variable, so it makes more sense to convert programmatically so that the 0-1 encoded data file can still be used by other algorithms such as logistic regression, neural binary classification, and k-nearest neighbors classification.

Creating and Training the AdaBoost Classifier
The model is created and trained using these statements:

int nLearners = 100;
Console.WriteLine("Creating AdaBoost model nLearners = " +
  nLearners);
AdaBoost model = new AdaBoost(nLearners);
Console.WriteLine("Done ");

Console.WriteLine("Starting training ");
model.Train(trainX, trainY);
Console.WriteLine("Done ");

Finding a good number of weak learners is a matter of trial and error. Good values to start with are 100, 200, and 500. If the number of weak learners is too small, the model will underfit and predict poorly. If the number of weak learners is too large, the model might overfit and predict poorly on new, previously unseen data.

One minor weakness of AdaBoost compared to other binary classification algorithms is that training is relatively slow. This happens because when each weak learner is created, the algorithm must scan through all values in the training data.
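
To see why, here is a minimal sketch of an exhaustive best-stump search. The BestStump() name is hypothetical, but the nested-loop structure -- every column, every candidate split value, every training item -- is what makes training relatively slow. The sketch assumes the program-defined UniqueValues() helper shown in Listing 1:

static Stump BestStump(double[][] X, int[] y, double[] w)
{
  int n = X.Length; int dim = X[0].Length;
  Stump best = new Stump();
  double minErr = double.MaxValue;
  for (int col = 0; col < dim; ++col)  // every predictor column
  {
    foreach (double split in UniqueValues(X, col))
    {
      for (int polarity = -1; polarity <= 1; polarity += 2)
      {
        double err = 0.0;  // weighted error of this candidate
        for (int i = 0; i < n; ++i)
        {
          int pred = (X[i][col] < split) ? polarity : -polarity;
          if (pred != y[i]) err += w[i];
        }
        if (err < minErr)  // best candidate found so far
        {
          minErr = err;
          best.colIndex = col;
          best.splitVal = split;
          best.polarity = polarity;
        }
      }
    }
  }
  return best;
}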

Evaluating and Using the Trained AdaBoost Model
The trained model is evaluated using these statements:

Console.WriteLine("Computing model accuracy ");
double accTrain = model.Accuracy(trainX, trainY);
Console.WriteLine("Accuracy on training data = " +
  accTrain.ToString("F4"));
double accTest = model.Accuracy(testX, testY);
Console.WriteLine("Accuracy on test data = " +
  accTest.ToString("F4"));

The Accuracy() method computes a basic measure of accuracy: the number of correct predictions divided by the total number of predictions made. You might want to add methods that compute and display accuracy by class in the form of a confusion matrix. For example, the output could look like:

Confusion matrix test data:
 actual = -1 |    26     0  |  1.0000
 actual = +1 |     2    22  |  0.9167
----------
predicted:        -1    +1
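
Here is one possible implementation sketch of such a helper. The ShowConfusion() name and the formatting details are hypothetical -- they're not part of the demo download:

static void ShowConfusion(AdaBoost model,
  double[][] dataX, int[] dataY)
{
  int[,] counts = new int[2, 2];  // [actual, predicted]
  for (int i = 0; i < dataX.Length; ++i)
  {
    int actual = (dataY[i] == 1) ? 1 : 0;  // map -1 to 0
    int pred = (model.Predict(dataX[i]) == 1) ? 1 : 0;
    ++counts[actual, pred];
  }
  string[] labels = { "-1", "+1" };
  for (int r = 0; r < 2; ++r)
  {
    int total = counts[r, 0] + counts[r, 1];
    double acc = (total == 0) ? 0.0 :
      (double)counts[r, r] / total;  // per-class accuracy
    Console.WriteLine(" actual = " + labels[r] + " | " +
      counts[r, 0].ToString().PadLeft(5) +
      counts[r, 1].ToString().PadLeft(6) + "  |  " +
      acc.ToString("F4"));
  }
  Console.WriteLine("predicted:        -1    +1");
}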

The demo concludes by using the trained model to make a prediction. A set of inputs is prepared:

double[] x = [0, 0.64, 0.64, 0, 0.32, 0, 0, 0, 0, 
  0, 0, 0.64, 0, 0, 0, 0.32, 0, 1.29, 1.93, 0, 0.96, 
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.778, 0,
  0, 3.756, 61, 278];
Console.WriteLine("Predicting for x = ");
Utils.VecShow(x, 3, 8);  // 3 decimals, 8 wide

And the prediction is made:

int predY = model.Predict(x);
Console.WriteLine("Prediction (1 = spam, -1 = not-spam): ");
Console.WriteLine(predY);
Console.WriteLine("End demo ");

In some scenarios, you might want to generate predictions for an entire dataset. In those situations you can create an array-of-arrays matrix of input values and use the overloaded Predict() method that accepts a matrix.
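
For example, assuming the testX and testY data loaded earlier:

int[] preds = model.Predict(testX);  // one prediction per row
for (int i = 0; i < preds.Length; ++i)
  if (preds[i] != testY[i])
    Console.WriteLine("item [" + i + "] misclassified");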

Wrapping Up
The AdaBoost technique was invented in the mid-1990s and was a major breakthrough from both a theoretical perspective (using a collection of weak learners) and a practical perspective (significantly more powerful than most other binary classification techniques of the time).

Other boosting techniques based on the ideas in AdaBoost have been created, notably XGBoost ("extreme gradient boost") in 2014 and LightGBM ("lightweight gradient boosting machine") in 2017. XGBoost and LightGBM sometimes create better predictive models than AdaBoost, but they are fantastically complicated and took many months of dedicated development effort. For many binary prediction scenarios, I prefer the elegance and simplicity of AdaBoost.
