The Data Science Lab
Random Forest Regression and Bagging Regression Using C#
Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the random forest regression technique (and a variant called bagging regression), where the goal is to predict a single numeric value. The demo program uses C#, but it can be easily refactored to other C-family languages.
A machine learning random forest regression system predicts a single numeric value. A random forest is an ensemble (collection) of simple decision tree regressors that have been trained on different random subsets of the source training data. To make a prediction for an input vector x, each tree makes a prediction and the final predicted y value is the average of the predicted values computed by the individual trees. A bagging ("bootstrap aggregation") regression system is a specific type of random forest system where all columns/predictors of the source training data are used to construct the training data subsets.
This article presents a complete demo of random forest regression using the C# language. A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading synthetic training and test data into memory. The data looks like:
-0.1660, 0.4406, -0.9998, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, 0.3704, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -0.1654, 0.1174, -0.7192, 0.8054
0.9365, -0.3732, 0.3846, 0.7528, 0.7892, 0.1345
. . .
The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. The demo creates a random forest regression model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065]. The first part of the demo output shows how a random forest regression model is created:
Setting numTrees = 100
Setting maxDepth = 6
Setting minSamples = 2
Setting nRows = 200
Setting nCols = 5
Creating and training RandomForestRegression model
Done
The second part of the demo output shows the model evaluation, and using the model to make a prediction:
Evaluating model
Accuracy train (within 0.15) = 0.9250
Accuracy test (within 0.15) = 0.7250
Predicting for x =
-0.1660 0.4406 -0.9998 -0.3953 -0.7065
Predicted y = 0.4828
The five parameters that control the behavior of the demo random forest regression model are numTrees, maxDepth, minSamples, nRows and nCols. The numTrees parameter specifies how many decision trees should be created for the forest. In practice, good values for all five parameters must be determined by trial and error.
The maxDepth parameter specifies the number of levels below the root node in each decision tree. If maxDepth = 1, a decision tree has three nodes: a root node, a left child and a right child. If maxDepth = 2, a tree has seven nodes. In general, if maxDepth = n, the resulting tree has 2^(n+1) - 1 nodes. The decision trees used for the demo run shown in Figure 1 have maxDepth = 6 so each of the 100 trees has 2^7 - 1 = 128 - 1 = 127 nodes.
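The node count arithmetic can be checked with a few lines of code. This is a minimal sketch; the class and method names are illustrative and are not part of the demo program.

```csharp
using System;

public static class TreeSizeSketch
{
  // Number of nodes in a full binary tree with maxDepth levels
  // below the root: 2^(maxDepth + 1) - 1.
  // Hypothetical helper, not taken from the demo code.
  public static int FullTreeNodeCount(int maxDepth)
  {
    return (1 << (maxDepth + 1)) - 1;
  }
}
```

Calling TreeSizeSketch.FullTreeNodeCount(6) returns 127, which matches the count given for the demo trees.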
The minSamples parameter specifies the fewest number of associated data items in a tree node necessary to allow the node to be split into a left and right child. The demo trees use minSamples = 2, which is the fewest possible because if a node has only one associated item, it can't be split any further.
The nRows and nCols parameters specify the number of rows and columns to use when creating the random subset training datasets. The source data has 200 rows and nRows is set to 200, so the data subsets have the same number of rows as the original source training data. The nCols parameter is set to 5 which is the same as the number of columns/predictors in the training data.
This article assumes you have intermediate or better programming skill but doesn't assume you know anything about random forest regression. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.
The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, structure which can be predicted.
All of the predictor values are between -1 and +1. There are 200 training data items and 40 test items. When using decision trees for regression, it's not necessary to normalize the training data predictor values because no distance between data items is computed. However, it's not a bad idea to normalize the predictors just in case you want to send the data to other regression algorithms that require normalization (for example, k-nearest neighbor regression).
Random forest regression is most often used with data that has strictly numeric predictor variables. It is possible to use random forest regression with mixed categorical and numeric data, by using ordinal encoding on the categorical data. In theory, ordinal encoding shouldn't work well. For example, if you have a predictor variable color with possible encoded values red = 0, blue = 1, green = 2, red will always be less-than-or-equal to any other color value in the decision tree construction process. However, in practice, ordinal encoding for random forest regression often works well.
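Ordinal encoding of a categorical predictor might look like the following minimal sketch. The class and method names are hypothetical, and assigning codes in order of first appearance is an assumption made for illustration; any fixed mapping from category to integer would serve.

```csharp
using System;
using System.Collections.Generic;

public static class OrdinalEncodingSketch
{
  // Give each distinct category the next integer code, in order
  // of first appearance. Hypothetical helper, not from the demo.
  public static Dictionary<string, int> BuildCodes(string[] values)
  {
    var codes = new Dictionary<string, int>();
    foreach (string v in values)
      if (!codes.ContainsKey(v))
        codes[v] = codes.Count;
    return codes;
  }

  // Replace each category with its code, stored as a double so the
  // column can sit in a double[][] predictor matrix.
  public static double[] Encode(string[] values,
    Dictionary<string, int> codes)
  {
    double[] result = new double[values.Length];
    for (int i = 0; i < values.Length; ++i)
      result[i] = codes[values[i]];
    return result;
  }
}
```

For the column (red, blue, green, blue, red), BuildCodes() produces red = 0, blue = 1, green = 2, and Encode() produces (0, 1, 2, 1, 0).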
Understanding Random Forest Regression
The motivation for combining many simple decision tree regressors into a forest is the fact that a simple decision tree will always overfit training data if the tree is deep enough. A deep enough decision tree will predict its training data perfectly (except for very unusual data scenarios), but is likely to predict poorly on new, previously unseen data. By using a collection of trees that have been trained on different subsets of the source data, the averaged prediction of the collection is much less likely to overfit.
The demo source training data has 200 rows and nRows is set to 200, so the data subsets have the same number of rows as the original source training data. This isn't required and nRows can be smaller or larger than the number of source rows. As a very general rule of thumb, nRows is often a value between one-half and twice the number of source rows.
When the training data subsets are constructed probabilistically, rows are selected from the source data with replacement. This means that some rows will likely be duplicated and some source rows will likely not be selected. Duplicate rows will have no effect when deterministic decision trees are used. Unused rows will tend to cause the tree to underfit, which is a good thing unless the underfit is too extreme. The net implication is that setting nRows to the number of training data source rows often works well.
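To get a feel for how many source rows are left out, the expected fraction of distinct source rows in a subset can be computed directly. The following is a minimal sketch that assumes each draw is uniform and independent; the class and method names are illustrative only.

```csharp
using System;

public static class BootstrapSketch
{
  // Expected fraction of the N source rows that appear at least
  // once when n rows are drawn uniformly with replacement.
  // A given row is missed on one draw with probability 1 - 1/N,
  // so it is missed on all n draws with probability (1 - 1/N)^n.
  public static double ExpectedUniqueFraction(int N, int n)
  {
    double probNeverPicked = Math.Pow(1.0 - 1.0 / N, n);
    return 1.0 - probNeverPicked;
  }
}
```

For the demo settings (N = 200, n = 200) the result is roughly 0.63, so on average a bit more than one-third of the source rows are not used by any given tree.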
The demo program selects row indices from the source data with replacement using this function:
private static int[] GetRowIdxs(int N, int n, Random rnd)
{
  // pick n rows from N with replacement
  int[] result = new int[n];
  for (int i = 0; i < n; ++i)
    result[i] = rnd.Next(0, N);
  Array.Sort(result);
  return result;
}
Each result row is selected independently, with equal probability (1/200 = 0.005 for the demo data) from the source data.
The demo program selects column indices without replacement using a GetColIdxs(int N, int n, Random rnd) function. For the demo data, the original five column indices (0, 1, 2, 3, 4) are stored into an array, and then the indices are shuffled using the Fisher-Yates mini-algorithm, giving something like (3, 0, 2, 4, 1), and then the first n indices are selected.
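The column selection just described could be sketched as follows. The signature matches the one shown in the article, but the body is an assumption based on the description and is not the demo's actual code. The selected indices are returned in shuffled order; whether the demo sorts them afterwards is not stated.

```csharp
using System;

public static class ColumnSamplingSketch
{
  // Pick n of the N column indices without replacement.
  public static int[] GetColIdxs(int N, int n, Random rnd)
  {
    // start with 0, 1, . . N-1
    int[] all = new int[N];
    for (int i = 0; i < N; ++i)
      all[i] = i;

    // Fisher-Yates shuffle: swap position i with a random
    // position j where 0 <= j <= i
    for (int i = N - 1; i > 0; --i)
    {
      int j = rnd.Next(0, i + 1);
      int tmp = all[i];
      all[i] = all[j];
      all[j] = tmp;
    }

    // keep the first n shuffled indices
    int[] result = new int[n];
    Array.Copy(all, result, n);
    return result;
  }
}
```

Because the indices come from a shuffled permutation, no column can be selected twice, and calling the function with n equal to N returns every column.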
The resulting row indices and column indices are passed to a MakeSubsetX(double[][] trainX, int[] rows, int[] cols) function to create a subset of training predictor data, and just the row indices are passed to a MakeSubsetY(double[] trainY, int[] rows) function to create a corresponding subset of training target y values.
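One plausible way to build the subsets is shown in the sketch below. The signatures follow the article, but the bodies are assumptions. In particular, this version copies only the selected columns into a narrower matrix, which may differ from how the demo lays out its subset data.

```csharp
using System;

public static class SubsetSketch
{
  // Copy the selected rows and columns of trainX into a new matrix.
  public static double[][] MakeSubsetX(double[][] trainX,
    int[] rows, int[] cols)
  {
    double[][] result = new double[rows.Length][];
    for (int i = 0; i < rows.Length; ++i)
    {
      result[i] = new double[cols.Length];
      for (int j = 0; j < cols.Length; ++j)
        result[i][j] = trainX[rows[i]][cols[j]];
    }
    return result;
  }

  // Copy the target values for the same selected rows, so each
  // subset row keeps its matching y value.
  public static double[] MakeSubsetY(double[] trainY, int[] rows)
  {
    double[] result = new double[rows.Length];
    for (int i = 0; i < rows.Length; ++i)
      result[i] = trainY[rows[i]];
    return result;
  }
}
```

With this compact layout, a tree trained on fewer than all columns sees column positions 0 through cols.Length - 1, so a complete implementation would have to map those positions back to the source columns when predicting. When nCols equals the number of predictors and the column indices are in their original order, no mapping is needed.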
When training data subsets are constructed, columns are selected without replacement. For the demo program, the nCols parameter is set to 5, which is the same as the number of columns/predictors in the training data. The nCols parameter can be less than the number of predictors but not greater. When nCols is set to the number of source predictors, the resulting model is called bagging regression ("bootstrap aggregation"), a variant of a random forest regression system.
For source datasets with a small number of columns/predictors, it's usually best to use all columns. But for source datasets with many columns/predictors, setting nCols to a smaller number sometimes improves the resulting random forest model. Because of the way simple decision trees, such as those used in the demo program, are constructed, the order of the predictor columns can have a small effect, but this effect is usually not significant in practice.
After the collection of simple decision trees has been trained, making a prediction is simple:
public double Predict(double[] x)
{
  // average of predicted values
  double sum = 0.0;
  for (int t = 0; t < this.nTrees; ++t)
    sum += this.trees[t].Predict(x);
  return sum / this.nTrees;
}
For an input vector x, the predicted y value is just the average of the predicted values of the component trees. Simple.
The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project RandomForestRegression. I checked the "Do not use top-level statements" option to avoid the horrible program entry point shortcut syntax.
The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.
After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive RandomForestRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program.
The overall program structure is presented in Listing 1. All the control logic is in the Main() method of the program class. The program class also holds helper functions to load data from a file into memory and to display data. All of the random forest regression functionality is in a RandomForestRegressor class. All of the decision tree regression functionality is in a separate DecisionTreeRegressor class. The RandomForestRegressor class exposes a constructor and three methods: Train(), Predict() and Accuracy().
Listing 1: Overall Program Structure
using System;
using System.IO;
using System.Collections.Generic;

namespace RandomForestRegression
{
  internal class RandomForestRegressionProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin C# Random Forest" +
        " regression demo ");

      // 1. load data from file into memory
      // 2. create and train random forest regression model
      // 3. evaluate model
      // 4. use model to make a prediction

      Console.WriteLine("End demo ");
      Console.ReadLine();
    }

    // helper functions for Main

    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment) { . . }

    static double[] MatToVec(double[][] mat) { . . }

    public static void VecShow(double[] vec, int dec,
      int wid) { . . }
  } // class Program

  // ========================================================

  public class RandomForestRegressor
  {
    public int nTrees;
    public int maxDepth;
    public int minSamples;
    public int nRows;
    public int nCols;
    public List<DecisionTreeRegressor> trees;
    public Random rnd;

    public RandomForestRegressor(int nTrees, int maxDepth,
      int minSamples, int nRows, int nCols, int seed) { . . }

    public void Train(double[][] trainX,
      double[] trainY) { . . }

    public double Predict(double[] x) { . . }

    public double Accuracy(double[][] dataX,
      double[] dataY, double pctClose) { . . }

    private static int[] GetRowIdxs(int N, int n,
      Random rnd) { . . }

    private static int[] GetColIdxs(int N, int n,
      Random rnd) { . . }

    private static double[][] MakeSubsetX(double[][] trainX,
      int[] rows, int[] cols) { . . }

    private static double[] MakeSubsetY(double[] trainY,
      int[] rows) { . . }
  }

  // ========================================================

  public class DecisionTreeRegressor
  {
    public DecisionTreeRegressor(int maxDepth,
      int minSamples) { . . }

    public void Train(double[][] dataX,
      double[] dataY) { . . }

    public double Predict(double[] x) { . . }

    public double Accuracy(double[][] dataX,
      double[] dataY, double pctClose) { . . }

    public void Show() { . . }

    public void ShowNode(int nodeID) { . . }

    // private helper methods here
  }

  // ========================================================
} // ns
The demo starts by loading the 200-item training data into memory:
string trainFile =
  "..\\..\\..\\Data\\synthetic_train_200.txt";
int[] colsX = new int[] { 0, 1, 2, 3, 4 };
double[][] trainX =
  MatLoad(trainFile, colsX, ',', "#");
double[] trainY =
  MatToVec(MatLoad(trainFile,
  new int[] { 5 }, ',', "#"));
The training X data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
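The bodies of MatLoad() and MatToVec() are not shown in the article. The sketch below is one way they could be written so that they behave as described above. The signatures come from Listing 1, but everything inside the methods is an assumption, including the use of invariant-culture parsing and the skipping of blank lines.

```csharp
using System;
using System.IO;
using System.Globalization;
using System.Collections.Generic;

public static class LoadSketch
{
  // Read the selected columns of a delimited text file into an
  // array-of-arrays matrix, skipping blank and comment lines.
  public static double[][] MatLoad(string fn, int[] usecols,
    char sep, string comment)
  {
    List<double[]> rows = new List<double[]>();
    foreach (string line in File.ReadLines(fn))
    {
      string trimmed = line.Trim();
      if (trimmed.Length == 0 ||
        trimmed.StartsWith(comment, StringComparison.Ordinal))
        continue;
      string[] tokens = trimmed.Split(sep);
      double[] row = new double[usecols.Length];
      for (int j = 0; j < usecols.Length; ++j)
        row[j] = double.Parse(tokens[usecols[j]],
          CultureInfo.InvariantCulture);
      rows.Add(row);
    }
    return rows.ToArray();
  }

  // Flatten a matrix row by row into a vector. For a one-column
  // matrix this yields one value per row.
  public static double[] MatToVec(double[][] mat)
  {
    int nRows = mat.Length;
    int nCols = mat[0].Length;
    double[] result = new double[nRows * nCols];
    int k = 0;
    for (int i = 0; i < nRows; ++i)
      for (int j = 0; j < nCols; ++j)
        result[k++] = mat[i][j];
    return result;
  }
}
```

Loading the file twice, once for the predictors and once for the target column, is slightly wasteful but keeps the calling code short, which suits a demo-sized dataset.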
The first three training items are displayed like so:
Console.WriteLine("First three train X: ");
for (int i = 0; i < 3; ++i)
  VecShow(trainX[i], 4, 8);
Console.WriteLine("First three train y: ");
for (int i = 0; i < 3; ++i)
  Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));
In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory. The 40-item test data is loaded into memory using the same pattern that was used to load the training data:
string testFile =
  "..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX =
  MatLoad(testFile, colsX, ',', "#");
double[] testY =
  MatToVec(MatLoad(testFile,
  new int[] { 5 }, ',', "#"));
The random forest regression model is prepared for training using these six statements:
int numTrees = 100;
int maxDepth = 6;
int minSamples = 2;
int nRows = 200;
int nCols = 5;
int seed = 0;
Instead of explicitly setting the number of rows for the training data subsets, you might want to specify a percentage of the number of source rows. The random forest model is created and trained like so:
Console.WriteLine("Creating and training " +
  "RandomForestRegression model ");
RandomForestRegressor rfr =
  new RandomForestRegressor(numTrees, maxDepth,
  minSamples, nRows, nCols, seed);
rfr.Train(trainX, trainY);
Console.WriteLine("Done ");
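The idea of specifying the subset size as a percentage of the source rows, rather than hard-coding nRows, could be sketched like this. The helper name and the rounding behavior are assumptions made for illustration; the demo program does not include such a function.

```csharp
using System;

public static class RowCountSketch
{
  // Convert a fraction of the source row count into an nRows
  // value, rounding to the nearest integer and never returning
  // fewer than one row. Hypothetical helper, not from the demo.
  public static int RowsFromFraction(int nSourceRows,
    double fraction)
  {
    int n = (int)Math.Round(nSourceRows * fraction);
    return Math.Max(1, n);
  }
}
```

For the 200-row demo data, a fraction of 1.0 gives nRows = 200, and fractions of 0.5 and 2.0 give 100 and 400, which spans the rule-of-thumb range mentioned earlier.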
Next, the demo evaluates model accuracy:
double accTrain = rfr.Accuracy(trainX, trainY, 0.15);
Console.WriteLine("Accuracy train (within 0.15) = " +
  accTrain.ToString("F4"));
double accTest = rfr.Accuracy(testX, testY, 0.15);
Console.WriteLine("Accuracy test (within 0.15) = " +
  accTest.ToString("F4"));
The Accuracy() method scores a prediction as correct if the predicted y value is within 15% of the true target y value. There are several other ways to evaluate a trained regression model, including root mean squared error, coefficient of determination, and so on. Using the Accuracy() method as a template, other evaluation metrics are easy to implement.
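The body of Accuracy() is not listed in the article, so the following is a minimal sketch of how such a metric, and a root mean squared error metric built on the same pattern, might look. To keep the example self-contained, the model is passed in as a Func delegate, which is an assumption; in the demo these would be instance methods that call this.Predict(). The closeness test shown is one reading of "within 15% of the true target y value".

```csharp
using System;

public static class MetricsSketch
{
  // Fraction of items whose prediction is within pctClose of the
  // true target, measured relative to the true target's magnitude.
  public static double Accuracy(Func<double[], double> predict,
    double[][] dataX, double[] dataY, double pctClose)
  {
    int nCorrect = 0;
    for (int i = 0; i < dataX.Length; ++i)
    {
      double predY = predict(dataX[i]);
      if (Math.Abs(predY - dataY[i]) <
        Math.Abs(pctClose * dataY[i]))
        ++nCorrect;
    }
    return (nCorrect * 1.0) / dataX.Length;
  }

  // Square root of the average squared prediction error.
  public static double RootMeanSquaredError(
    Func<double[], double> predict,
    double[][] dataX, double[] dataY)
  {
    double sumSquaredErr = 0.0;
    for (int i = 0; i < dataX.Length; ++i)
    {
      double err = predict(dataX[i]) - dataY[i];
      sumSquaredErr += err * err;
    }
    return Math.Sqrt(sumSquaredErr / dataX.Length);
  }
}
```

With a trained model named rfr, a call such as MetricsSketch.RootMeanSquaredError(rfr.Predict, testX, testY) would work because the Predict() method matches the delegate signature.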
The demo concludes by using the trained random forest model to make a prediction:
double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
VecShow(x, 4, 8);
double yPred = rfr.Predict(x);
Console.WriteLine("Predicted y = " + yPred.ToString("F4"));
The x input is the first training data item. The predicted y value is 0.4828 which is quite close to the actual y value of 0.4840.
Wrapping Up
Random forest regression, and its variant bagging tree regression, suffer from a bit of disrespect in the research community. Random forest models are so simple, they can't generate many ideas for exploration in research papers. And believe me, in research, the publish-or-perish imperative is still alive and well. On the other hand, random forest and bagging tree regression models seem to have a good reputation among machine learning practitioners (most of my colleagues at least) because the models often work well and are relatively interpretable, especially compared to regression techniques such as neural networks and Gaussian process regression systems.
Random forest and bagging regression systems are closely related to techniques called adaptive boosting regression and gradient boosting regression. Examples include AdaBoost (adaptive boosting) regression, gradient boosting machine (GBM) regression, LightGBM regression, and XGBoost (extreme gradient boosting) regression.