The Data Science Lab
Gradient Boosting Regression Using C#
Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the gradient boosting regression technique, where the goal is to predict a single numeric value. Compared to existing library implementations of gradient boosting regression, a from-scratch implementation allows much easier customization and integration with other .NET systems.
A machine learning gradient boosting regression system, also called a gradient boosting machine (GBM), predicts a single numeric value. A GBM is an ensemble (collection) of simple decision tree regressors that are constructed sequentially to predict the differences (residuals) between predicted y values and actual y values. To make a prediction for an input vector x, an initial prediction is estimated and then predictions are computed by accumulating the predicted residuals from each tree in the ensemble. The running predicted y value will slowly get closer and closer to the true target y value.
This article presents a complete demo of gradient boosting regression using the C# language. Although there are several code libraries that contain implementations of gradient boosting regression, such as XGBoost, LightGBM, and CatBoost, implementing a system from scratch allows you to easily modify the system, and easily integrate with other systems implemented with .NET while generating reasonably interpretable results.
A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading synthetic training and test data into memory. The data looks like:
-0.1660, 0.4406, -0.9998, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, 0.3704, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -0.1654, 0.1174, -0.7192, 0.8054
0.9365, -0.3732, 0.3846, 0.7528, 0.7892, 0.1345
. . .
The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. The demo creates a gradient boosting regression model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065].
The first part of the demo output shows how a gradient boosting regression model is created:
Setting numTrees = 200
Setting maxDepth = 2
Setting minSamples = 2
Setting lrnRate = 0.0500
Creating and training GradientBoostRegression model
Done
The second part of the demo output shows the model evaluation, and using the model to make a prediction:
Evaluating model
Accuracy train (within 0.15) = 0.9500
Accuracy test (within 0.15) = 0.8250
Predicting for x =
-0.1660 0.4406 -0.9998 -0.3953 -0.7065
Predicted y = 0.4746
The four parameters that control the behavior of the demo gradient boosting regression model are numTrees, maxDepth, minSamples, and lrnRate. The values for all four parameters must be determined by trial and error.
The numTrees parameter specifies the number of decision tree regressors for the ensemble. The maxDepth parameter specifies the number of levels of each decision tree. If maxDepth = 1, a decision tree has three nodes: a root node, a left child and a right child. If maxDepth = 2, a tree has seven nodes. In general, if maxDepth = n, the resulting tree has 2^(n+1) - 1 nodes. The decision trees used for the demo run shown in Figure 1 have maxDepth = 2 so each of the 200 trees has 2^3 - 1 = 8 - 1 = 7 nodes.
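The node-count arithmetic can be checked with a few lines of C#. This is a small sketch and is not part of the demo program; the TreeMath class and NodeCount() method names are hypothetical, and the calculation assumes a full binary tree where every node down to maxDepth has been split.

```csharp
using System;

public static class TreeMath
{
  // Number of nodes in a full binary tree whose deepest
  // level is maxDepth (the root is at depth 0).
  public static int NodeCount(int maxDepth)
  {
    return (1 << (maxDepth + 1)) - 1;  // 2^(maxDepth+1) - 1
  }
}
```

Calling TreeMath.NodeCount(1) returns 3 and TreeMath.NodeCount(2) returns 7, which matches the counts given above.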
The minSamples parameter specifies the minimum number of data items that must be associated with a tree node for the node to be split into a left and right child. The demo trees use minSamples = 2, which is the smallest possible value because a node with only one associated item can't be split any further. The lrnRate (learning rate) parameter controls how much a predicted y value is updated based on the predicted residual for the current tree.
This article assumes you have intermediate or better programming skill but doesn't assume you know anything about gradient boosting regression. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.
The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, structure which can be predicted.
All of the predictor values are between -1 and +1. There are 200 training data items and 40 test items. When using decision trees in a gradient boosting system for regression, it's not necessary to normalize the training data predictor values, because no distance between data items is computed. However, it's not a bad idea to normalize the predictors just in case you want to send the data to other regression algorithms that require normalization (for example, k-nearest neighbors regression).
Gradient boosting regression is most often used with data that has strictly numeric predictor variables. It is possible to use gradient boosting regression with mixed categorical and numeric data, by using ordinal encoding on the categorical data. In theory, ordinal encoding shouldn't work well. For example, if you have a predictor variable color with possible encoded values red = 0, blue = 1, green = 2, red will always be less-than-or-equal to any other color value in the decision tree construction process. However, in practice, ordinal encoding for gradient boosting regression often works well.
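As a rough illustration of ordinal encoding, the sketch below maps each distinct category string to an integer code in order of first appearance. The OrdinalEncoder class and its Encode() method are hypothetical helpers and are not part of the demo program; the codes are returned as type double only on the assumption that they will be stored alongside numeric predictors in a double[][] matrix.

```csharp
using System;
using System.Collections.Generic;

public static class OrdinalEncoder
{
  // Assign each distinct category an integer code in order of
  // first appearance. The codes dictionary is filled in as new
  // categories are seen, so it can be reused for test data.
  public static double[] Encode(string[] values,
    Dictionary<string, int> codes)
  {
    double[] result = new double[values.Length];
    for (int i = 0; i < values.Length; ++i) {
      if (!codes.ContainsKey(values[i])) {
        int next = codes.Count;  // 0, 1, 2, . .
        codes.Add(values[i], next);
      }
      result[i] = codes[values[i]];
    }
    return result;
  }
}
```

For the input { "red", "blue", "green", "blue" } and an empty dictionary, the encoded values are 0, 1, 2, 1, which corresponds to the red = 0, blue = 1, green = 2 example.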
Understanding Gradient Boosting Regression
I added some WriteLine() statements to the demo program to show how the demo gradient boosting regression system makes a prediction:
Predicting for x =
-0.1660 0.4406 -0.9998 -0.3953 -0.7065
Initial prediction: 0.3493
t = 0 pred_res = 0.0462 delta = -0.0023 pred = 0.3470
t = 1 pred_res = 0.0427 delta = -0.0021 pred = 0.3449
t = 2 pred_res = 0.0021 delta = -0.0001 pred = 0.3448
t = 3 pred_res = 0.0390 delta = -0.0020 pred = 0.3428
t = 4 pred_res = -0.0124 delta = 0.0006 pred = 0.3434
. . .
t = 198 pred_res = 0.0018 delta = -0.0001 pred = 0.4746
t = 199 pred_res = 0.0000 delta = -0.0000 pred = 0.4746
Predicted y = 0.4746
The input vector x is the first training item, which has a true target y value of 0.4840. The initial prediction guess is the average of all the y values in the training data, in this case 0.3493.
The first tree [0] was trained to predict the residuals (differences) between the predicted y values and the actual y values. For the input x, the residual predicted by tree [0] is 0.0462. The predicted residual is moderated by multiplying by the learning rate of 0.05, and by -1 to control the direction of change for the updated prediction, giving a delta of -1 * 0.0462 * 0.05 = -0.0023. This delta is added to the current prediction, giving 0.3493 + (-0.0023) = 0.3470. Notice that the new prediction (0.3470) is actually worse than the initial prediction (0.3493) because the new prediction is farther away from the true target y value (0.4840).
The process continues for each decision tree regressor in the ensemble. The running prediction for the first 20 or so trees jumps around quite a bit and doesn't improve the prediction much, but eventually the new running predicted value gets closer and closer to the correct target y value. If you look at the next-to-last and last trees, you can see that the predicted residuals are very small and that the predicted values (0.4746) stop changing much and are quite close to the actual target y value (0.4840). A clever and subtle algorithm!
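The running-prediction arithmetic can be reproduced in isolation. The sketch below is not part of the demo program; the RunningPrediction class and Accumulate() method are hypothetical names, and the predicted residuals are supplied directly as an array instead of coming from trained trees.

```csharp
using System;

public static class RunningPrediction
{
  // Start from the initial prediction and, for each tree's
  // predicted residual, add delta = -lrnRate * residual.
  // Returns the running prediction after each tree.
  public static double[] Accumulate(double pred0,
    double[] predResiduals, double lrnRate)
  {
    double[] running = new double[predResiduals.Length];
    double pred = pred0;
    for (int t = 0; t < predResiduals.Length; ++t) {
      double delta = -lrnRate * predResiduals[t];
      pred += delta;
      running[t] = pred;
    }
    return running;
  }
}
```

Calling Accumulate() with pred0 = 0.3493, lrnRate = 0.05 and the first five displayed residuals (0.0462, 0.0427, 0.0021, 0.0390, -0.0124) gives running values that agree with the displayed pred column to within rounding in the fourth decimal place. Exact agreement isn't expected because the displayed residuals are themselves rounded.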
The demo program defines a GradientBoostRegressor class with a Predict() method. The code for Predict(), with WriteLine() statements removed, is surprisingly short:
public double Predict(double[] x)
{
  double result = this.pred0;
  for (int t = 0; t < this.nTrees; ++t) {
    double predResidual = this.trees[t].Predict(x);
    double delta = -this.lrnRate * predResidual;
    result += delta;
  }
  return result;
}
The this.pred0 is the initial prediction, which is the average of the target y values in the training data. The this.trees is the List collection of simple decision tree regressors. The Train() method is more complicated and is shown in Listing 1.
Listing 1: Gradient Boosting Regression Train Method
public void Train(double[][] trainX, double[] trainY)
{
  int n = trainX.Length;
  this.pred0 = Mean(trainY);
  double[] preds = new double[n]; // each data item
  for (int i = 0; i < n; ++i)
    preds[i] = this.pred0;
  for (int t = 0; t < this.nTrees; ++t) { // each tree
    double[] residuals = new double[n]; // for curr tree
    for (int i = 0; i < n; ++i)
      residuals[i] = preds[i] - trainY[i];
    DecisionTreeRegressor dtr =
      new DecisionTreeRegressor(this.maxDepth,
      this.minSamples);
    dtr.Train(trainX, residuals); // predict residuals
    for (int i = 0; i < n; ++i) {
      double predResidual = dtr.Predict(trainX[i]);
      preds[i] -= this.lrnRate * predResidual;
    }
    this.trees.Add(dtr);
  }
}
The key idea in Train() is that each of the decision tree regressors in the ensemble is trained to predict residuals rather than to predict a target y value. The demo computes residuals as (predicted y - actual y), and so if a residual is positive, that means the predicted y is too large, and so a fraction of the predicted residual must be subtracted from the current running predicted y value. Some gradient boosting regression implementations compute residuals as (actual y - predicted y), in which case a fraction of the residual must be added.
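The two residual conventions lead to the same update, which can be seen in a small sketch. The ResidualConvention class and its method names are hypothetical and are not part of the demo program. For illustration, the code uses the true residual for a single item; the demo uses a tree's estimate of the residual.

```csharp
using System;

public static class ResidualConvention
{
  // Convention used by the demo: residual = predicted - actual,
  // so a fraction of the residual is subtracted.
  public static double UpdatePredMinusActual(double pred,
    double actual, double lrnRate)
  {
    double residual = pred - actual;
    return pred - lrnRate * residual;
  }

  // Alternative convention: residual = actual - predicted,
  // so a fraction of the residual is added.
  public static double UpdateActualMinusPred(double pred,
    double actual, double lrnRate)
  {
    double residual = actual - pred;
    return pred + lrnRate * residual;
  }
}
```

With pred = 0.3493, actual = 0.4840 and lrnRate = 0.05, both methods return about 0.3560, moving the prediction toward the target. The choice of convention only affects the sign used in the update statement.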
The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project GradientBoostRegression. I checked the "Do not use top-level statements" option to avoid the program entry point shortcut syntax.
The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.
After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive GradientBoostRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program.
The overall program structure is presented in Listing 2. All the control logic is in the Main() method in the Program class. The Program class also holds helper functions to load data from file into memory and display data. All of the gradient boosting regression functionality is in a GradientBoostRegressor class. All of the decision tree regression functionality is in a separate DecisionTreeRegressor class. The GradientBoostRegressor class exposes a constructor and three methods: Train(), Predict(), and Accuracy().
Listing 2: Overall Program Structure
using System;
using System.IO;
using System.Collections.Generic;
namespace GradientBoostRegression
{
  internal class GradientBoostProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin C# Gradient Boost" +
        " regression demo ");
      // 1. load data from file into memory
      // 2. create and train gradient boosting regression model
      // 3. evaluate model
      // 4. use model to make a prediction
      Console.WriteLine("End demo ");
      Console.ReadLine();
    }
    // helper functions for Main
    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment) { . . }
    static double[] MatToVec(double[][] mat) { . . }
    public static void VecShow(double[] vec, int dec,
      int wid) { . . }
  } // class Program
  // ========================================================
  public class GradientBoostRegressor
  {
    public double lrnRate;
    public int nTrees;
    public int maxDepth;
    public int minSamples;
    public List<DecisionTreeRegressor> trees;
    public double pred0; // initial prediction
    public GradientBoostRegressor(int nTrees, int maxDepth,
      int minSamples, double lrnRate) { . . }
    public void Train(double[][] trainX,
      double[] trainY) { . . }
    public double Predict(double[] x) { . . }
    public double Accuracy(double[][] dataX,
      double[] dataY, double pctClose) { . . }
    private static double Mean(double[] data) { . . }
  }
  // ========================================================
  public class DecisionTreeRegressor
  {
    public DecisionTreeRegressor(int maxDepth,
      int minSamples) { . . }
    public void Train(double[][] dataX,
      double[] dataY) { . . }
    public double Predict(double[] x) { . . }
    public double Accuracy(double[][] dataX,
      double[] dataY, double pctClose) { . . }
    public void Show() { . . }
    public void ShowNode(int nodeID) { . . }
    // private helper methods here
  }
  // ========================================================
} // ns
The demo starts by loading the 200-item training data into memory:
string trainFile =
  "..\\..\\..\\Data\\synthetic_train_200.txt";
int[] colsX = new int[] { 0, 1, 2, 3, 4 };
double[][] trainX =
  MatLoad(trainFile, colsX, ',', "#");
double[] trainY =
  MatToVec(MatLoad(trainFile,
  new int[] { 5 }, ',', "#"));
The training X data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
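The bodies of MatLoad() and MatToVec() are elided in Listing 2. The sketch below shows one plausible way to write them, based only on the behavior described above; it is an assumption and is not the code from the download. The DataLoader class name is hypothetical, and the use of File.ReadLines() and invariant-culture parsing are design choices made for this sketch.

```csharp
using System;
using System.IO;
using System.Collections.Generic;
using System.Globalization;

public static class DataLoader
{
  // Read the specified columns of a delimited text file into
  // an array-of-arrays matrix, skipping blank lines and lines
  // that begin with the comment string.
  public static double[][] MatLoad(string fn, int[] usecols,
    char sep, string comment)
  {
    List<double[]> rows = new List<double[]>();
    foreach (string line in File.ReadLines(fn)) {
      string s = line.Trim();
      if (s.Length == 0) continue;
      if (comment.Length > 0 &&
        s.StartsWith(comment, StringComparison.Ordinal))
        continue;
      string[] tokens = s.Split(sep);
      double[] row = new double[usecols.Length];
      for (int j = 0; j < usecols.Length; ++j)
        row[j] = double.Parse(tokens[usecols[j]].Trim(),
          CultureInfo.InvariantCulture);
      rows.Add(row);
    }
    return rows.ToArray();
  }

  // Flatten a matrix into a vector, row by row. For a
  // single-column matrix this gives one value per row.
  public static double[] MatToVec(double[][] mat)
  {
    int n = 0;
    foreach (double[] row in mat) n += row.Length;
    double[] result = new double[n];
    int k = 0;
    foreach (double[] row in mat)
      foreach (double v in row)
        result[k++] = v;
    return result;
  }
}
```

Parsing with CultureInfo.InvariantCulture means a value such as 0.4406 is read the same way regardless of the regional settings on the machine. Like the demo code, the sketch omits error checking, so a short or malformed line will throw an exception.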
The first three training items are displayed like so:
Console.WriteLine("First three train X: ");
for (int i = 0; i < 3; ++i)
  VecShow(trainX[i], 4, 8);
Console.WriteLine("First three train y: ");
for (int i = 0; i < 3; ++i)
  Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));
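The body of the VecShow() helper is elided in Listing 2. A minimal sketch, assuming that dec is the number of decimals to display and wid is the width to which each value is padded, might look like the following. The Display class and the VecToString() method are hypothetical names introduced for this sketch.

```csharp
using System;
using System.Globalization;

public static class Display
{
  // Format each value with dec decimals, right-aligned in a
  // field of width wid, and join the fields into one line.
  public static string VecToString(double[] vec, int dec,
    int wid)
  {
    string s = "";
    for (int i = 0; i < vec.Length; ++i)
      s += vec[i].ToString("F" + dec,
        CultureInfo.InvariantCulture).PadLeft(wid);
    return s;
  }

  // Write the formatted vector to the console.
  public static void VecShow(double[] vec, int dec, int wid)
  {
    Console.WriteLine(VecToString(vec, dec, wid));
  }
}
```

Separating the formatting from the console output makes the formatting logic easy to check on its own.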
In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory. The 40-item test data is loaded into memory using the same pattern that was used to load the training data:
string testFile =
  "..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX =
  MatLoad(testFile, colsX, ',', "#");
double[] testY =
  MatToVec(MatLoad(testFile,
  new int[] { 5 }, ',', "#"));
The gradient boosting regression model is prepared for training using these four statements:
int numTrees = 200;
int maxDepth = 2;
int minSamples = 2;
double lrnRate = 0.05;
Notice that unlike some machine learning regression techniques, the demo version of gradient boosting regression doesn't have a seed value for a random number generator because the algorithm is deterministic. This is a nice advantage of the demo version of gradient boosting regression.
The gradient boosting regression model is created and trained like so:
Console.WriteLine("Creating and training" +
  " GradientBoostRegression model ");
GradientBoostRegressor gbr =
  new GradientBoostRegressor(numTrees, maxDepth,
  minSamples, lrnRate);
gbr.Train(trainX, trainY);
Console.WriteLine("Done ");
Next, the demo evaluates model accuracy:
double accTrain = gbr.Accuracy(trainX, trainY, 0.15);
Console.WriteLine("Accuracy train (within 0.15) = " +
  accTrain.ToString("F4"));
double accTest = gbr.Accuracy(testX, testY, 0.15);
Console.WriteLine("Accuracy test (within 0.15) = " +
  accTest.ToString("F4"));
The Accuracy() method scores a prediction as correct if the predicted y value is within 15% of the true target y value. There are several other ways to evaluate a trained regression model, including root mean squared error, coefficient of determination, and so on. Using the Accuracy() method as a template, other evaluation metrics are easy to implement.
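The sketch below shows one way the accuracy metric and a root mean squared error metric could be computed. It is an assumption based on the description above and is not the code from the download. The RegressionMetrics class is a hypothetical name, and unlike the demo Accuracy() method, which accepts the data and calls Predict() internally, these functions accept an array of already-computed predictions so that the block is self-contained.

```csharp
using System;

public static class RegressionMetrics
{
  // Fraction of predictions that are within pctClose
  // (for example, 0.15 for 15 percent) of the actual value.
  public static double Accuracy(double[] preds,
    double[] actuals, double pctClose)
  {
    int numCorrect = 0;
    for (int i = 0; i < preds.Length; ++i) {
      double diff = Math.Abs(preds[i] - actuals[i]);
      if (diff < Math.Abs(pctClose * actuals[i]))
        ++numCorrect;
    }
    return (numCorrect * 1.0) / preds.Length;
  }

  // Square root of the average squared difference between
  // predicted and actual values.
  public static double RootMeanSquaredError(double[] preds,
    double[] actuals)
  {
    double sum = 0.0;
    for (int i = 0; i < preds.Length; ++i) {
      double diff = preds[i] - actuals[i];
      sum += diff * diff;
    }
    return Math.Sqrt(sum / preds.Length);
  }
}
```

For example, the demo prediction of 0.4746 against a target of 0.4840 differs by 0.0094, which is less than 0.15 * 0.4840 = 0.0726, so it is scored as correct. Note that a percentage-based closeness test becomes very strict when a target value is near zero.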
The demo concludes by using the trained gradient boosting regression model to make a prediction:
double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
VecShow(x, 4, 9);
double predY = gbr.Predict(x, verbose:true);
Console.WriteLine("Predicted y = " + predY.ToString("F4"));
The x input is the first training data item. The predicted y value is 0.4746 which is reasonably close to the actual y value of 0.4840. The verbose argument passed to Predict() instructs the method to display its calculations.
Wrapping Up
There are many different variations of gradient boosting regression. The version presented in this article is the simplest possible version and doesn't have a specific name. An open-source version called XGBoost ("extreme gradient boosting") is implemented in C++ but has interfaces to most programming languages and to the scikit-learn machine learning library. XGBoost is extremely complicated and has over 50 parameters, which makes it difficult to tune, and its predictions are not easily interpretable.
Another open-source version of gradient boosting regression is called LightGBM. LightGBM is loosely based on XGBoost. In spite of the name, LightGBM is no less complex than XGBoost.
A third open-source version is called CatBoost. Compared to XGBoost, LightGBM, and the from-scratch version of gradient boosting regression presented in this article, CatBoost has built-in support for categorical predictor variables.