The Data Science Lab
Decision Tree Regression from Scratch Without Pointers or Recursion Using C#
Dr. James McCaffrey presents a complete end-to-end demonstration of decision tree regression from scratch using the C# language. The goal of decision tree regression is to predict a single numeric value. The demo implementation doesn't use pointers (references) for simplicity and does not use recursion for better maintainability and customization.
Decision tree regression is a machine learning technique that incorporates a set of if-then rules in a tree data structure to predict a single numeric value. For example, a decision tree regression model prediction might be, "If employee age is greater than 39.0 and age is less than or equal to 42.5 and years-experience is less than or equal to 8.0 and height is greater than 64.0 then bank account balance is $754.14."
There are many ways to implement a decision tree regression system. Four significant design decisions are 1.) use pointers/references for tree nodes, or use list storage, 2.) use recursion to construct the tree, or use a non-recursive stack, or use list iteration, 3.) use weighted variance minimization for the node split function, or use variance reduction maximization, 4.) explicitly store row indices associated with each node in the nodes, or discard the associated rows information after training. There are pros and cons to each option.
This article presents decision tree regression, implemented from scratch with C#, using list storage for tree nodes (no pointers/references), a FIFO stack to build the tree (no recursion), weighted variance minimization for the node split function, and saves associated row indices in each node. This design has maximum flexibility.
A single decision tree can be used by itself for regression problems. But a more common approach is to use a collection of decision trees, called an ensemble. Examples include bagging tree regression, random forest regression, adaptive boosting regression, and gradient boosting regression. The decision tree design presented in this article can be easily adapted for any ensemble technique. But regardless of which tree technique is used, everything starts with a decision tree.
A good way to see where this article is headed is to take a look at the screenshot in Figure 1 and the diagram in Figure 2. The demo program begins by loading synthetic training and test data into memory. The data looks like:
-0.1660, 0.4406, -0.9998, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, 0.3704, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -0.1654, 0.1174, -0.7192, 0.8054
0.9365, -0.3732, 0.3846, 0.7528, 0.7892, 0.1345
. . .
The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. The demo creates a decision tree model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065].
[Click on image for larger view.] Figure 1: Decision Tree Regression in Action on Synthetic Data
The first part of the demo output shows how a decision tree is created:
Loading synthetic train (200) and test (40) data
Done
Setting maxDepth = 3
Setting minSamples = 2
Setting minLeaf = 18
Using default numSplitCols = -1
Creating and training tree
Done
A decision tree regression system has four parameters that control the size and shape of the tree: maxDepth, minSamples, minLeaf, numSplitCols. Briefly, maxDepth sets a maximum number of levels, and minSamples and minLeaf set node splitting granularity. The numSplitCols parameter is a regularization technique that is used when a decision tree is part of a random forest regression model.
For the demo, maxDepth is set to an artificially small value (3) and minLeaf is set to an artificially large value (18) to create a small tree that can be easily visualized. The demo uses a special -1 value for numSplitCols that means "use all columns when computing splits".
The next part of the demo output displays the trained decision tree. Each line is a tree node:
ID 0 | idx 0 | v -0.2102 | L 1 | R 2 | y 0.3493 | leaf F
ID 1 | idx 4 | v 0.1431 | L 3 | R 4 | y 0.5345 | leaf F
ID 2 | idx 0 | v 0.3915 | L 5 | R 6 | y 0.2382 | leaf F
ID 3 | idx 0 | v -0.6553 | L 7 | R 8 | y 0.6358 | leaf F
ID 4 | idx -1 | v 0.0000 | L 9 | R 10 | y 0.4123 | leaf T
ID 5 | idx 4 | v -0.2987 | L 11 | R 12 | y 0.3032 | leaf F
ID 6 | idx 2 | v 0.3777 | L 13 | R 14 | y 0.1701 | leaf F
ID 7 | idx -1 | v 0.0000 | L 15 | R 16 | y 0.6952 | leaf T
ID 8 | idx -1 | v 0.0000 | L 17 | R 18 | y 0.5598 | leaf T
ID 11 | idx -1 | v 0.0000 | L 23 | R 24 | y 0.4101 | leaf T
ID 12 | idx -1 | v 0.0000 | L 25 | R 26 | y 0.2613 | leaf T
ID 13 | idx -1 | v 0.0000 | L 27 | R 28 | y 0.1882 | leaf T
ID 14 | idx -1 | v 0.0000 | L 29 | R 30 | y 0.1381 | leaf T
The fields are node ID, index of split column (or -1 if the node cannot be split), split threshold (or 0.0000 if the node cannot be split), (L)eft child index, (R)ight child index, predicted y value, and a Boolean indicating if the node is a leaf node (T)rue or not (F)alse for split node. The tree is shown as a diagram in Figure 2.
[Click on image for larger view.] Figure 2: Visual Representation of the Demo Decision Tree Regression Model
Notice that tree nodes [9] and [10] are empty nodes, so they are not displayed in Figure 1 or Figure 2.
The next part of the demo output shows the tree model evaluation metrics:
Evaluating model
Accuracy train (within 0.10) = 0.3750
Accuracy test (within 0.10) = 0.4750
MSE train = 0.0048
MSE test = 0.0054
The prediction accuracy of the tree model is poor -- just 37.50% (75 correct out of 200) on the training data and 47.50% (19 correct out of 40) on the test data. A prediction is scored as correct if it's within 10% of the true target y value. The accuracy and MSE (mean squared error) are poor because maxDepth is too small and minLeaf is too large (to keep the tree small enough for easy visualization). By setting maxDepth = 7, minSamples = 2, minLeaf = 1, numSplitCols = -1, the decision tree scores 91.00% accuracy on the training data.
The last part of the demo output shows an example prediction:
Predicting for trainX[0] =
-0.1660 0.4406 -0.9998 -0.3953 -0.7065
Predicted y = 0.4101
IF
column 0 > -0.2102 AND
column 0 <= 0.3915 AND
column 4 <= -0.2987 AND
THEN
predicted = 0.4101
One advantage of decision tree regression compared to other techniques, such as kernel ridge regression and neural network regression, is that decision trees are significantly more interpretable. If you examine the diagram in Figure 2, and follow the red dashed arrows, you can see exactly how the predicted y value is computed.
This article assumes you have intermediate or better programming skill but doesn't assume you know anything about decision tree regression. The demo is implemented using C# but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online at this post.
The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, structure which can be predicted.
All of the predictor values are between -1 and +1. There are 200 training data items and 40 test items. When using decision tree regression, it's not necessary to normalize the training data predictor values because no distance between data items is computed. However, normalizing the predictors doesn't hurt and it's useful just in case you want to send the data to other regression algorithms that require normalization, for example, nearest neighbors regression and kernel ridge regression.
In practice, decision tree regression is most often used with data that has strictly numeric predictor variables. But it is possible to use decision tree regression with categorical data. For ordinal categorical data that has inherent ordering, you can use standard zero-based or one-based ordinal encoding. For example, for a predictor variable employee-height with possible values (short, medium, tall), you could set short = 0, medium = 1, tall = 2.
For binary data, you can use zero-one encoding. For example, for a predictor variable employee-sex, you could set male = 0, female = 1, or vice versa.
For nominal categorical data without inherent order, in theory, decision tree regression isn't applicable. But in practice, standard zero-based or one-based ordinal encoding often works well. For example, suppose you have a predictor variable residence-state with possible values (Alaska, Florida, Indiana, Montana, Oregon). You could set Alaska = 0, Florida = 1, Indiana = 2, Montana = 3, Oregon = 4. One internal node of the decision tree might be, "If column [7] <= 2 Then go left." This would group Alaska, Florida, Indiana together, and Montana, Oregon together.
The downside of ordinal encoding nominal categorical data is that the resulting tree has a significant dependency on the order in which you encode the data. The upside is that if the tree is deep enough, the technique often works well.
Understanding Tree Storage Using a List
The demo decision tree has maxDepth = 3. This creates a tree with 15 nodes (some of which could be empty). In general, if a tree has maxDepth = n, then it has 2^(n+1) - 1 nodes. These tree nodes could be stored as separate program-defined Node objects that are connected via pointers/references. But the architecture used by the demo decision tree is to create a List collection with size/count = 15, where each item in the List is a Node object.
With this design, if the root node is at index [0], then the left child is at [1] and the right child is at [2]. In general, if a node is at index [i], then the left child is at index [2*i+1] and the right child is at index [2*i+2]. This design also allows you to easily save a decision tree to file storage.
If a node is empty, there is very little memory space wasted because node objects are C# references that can be set to null.
Understanding Decision Tree Regression -- Splitting
The tree splitting part of constructing a decision tree regression system is perhaps best understood by looking at a concrete example. The root node of the demo decision tree is associated with all 200 rows of the training data. The splitting process selects some of the 200 rows to be assigned to the left child node, and the remaining rows to be assigned to the right child node.
For simplicity, suppose that instead of the demo training data with 200 rows and 5 columns of predictors, a tree node is associated with just 5 rows of data, and the data has just 3 columns of predictors:
X y
[0] 0.99 0.22 0.77 3.0
[1] 0.55 0.44 0.00 9.0
[2] 0.88 0.66 0.88 7.0
[3] 0.11 0.33 0.22 1.0
[4] 0.55 0.88 0.33 5.0
[0] [1] [2]
The split algorithm will select some of the five rows to go to the left child and the remaining rows to go to the right child. The idea is to select the two sets of rows in a way so that their associated y values are close together. The demo implementation does this by finding the split that minimizes the weighted variance of the target y values.
The splitting algorithm examines every possible x value in every row and every column. For each possible candidate split value (also called the threshold), the weighted variance of the y values of the resulting split is computed. For example, for x = 0.33 in column [1], the rows where the values in column [1] are less than or equal to 0.33 are rows [3] and [0] with associated y values (1.0, 3.0). The other rows are [1], [2], [4] with associated y values (9.0, 7.0, 5.0).
The variance of a set of y values is the average of the sum of the squared differences between the values and their mean (average).
For this proposed split, the variance of y values (1.0, 3.0) is computed as 1.00. The mean of (1.0, 3.0) = (1.0 + 3.0) / 2 = 2.0 and so the variance is ((3.0 - 2.0)^2 + (1.0 - 2.0)^2) / 2 = 1.00.
The mean of y values (9.0, 7.0, 5.0) = (9.0 + 7.0 + 5.0) / 3 = 7.0 and the variance is ((9.0 - 7.0)^2 + (7.0 - 7.0)^2 + (5.0 - 7.0)^2) / 3 = 2.67.
Because the proposed split has different numbers of rows (two rows for the left child and three rows for the right child), the weighted variance of the proposed split is used: ((2 * 1.00) + (3 * 2.67)) / 5 = 2.00.
This process is repeated for every x value in the training data. The x threshold value and its column that generate the smallest weighted variance define the split. In practice it's common to keep track of which values in a given column have already been checked, to avoid unnecessary work computing weighted variance.
After a best split is found, the predicted y value for a leaf node (no further splits) is the average of the y values associated with the node. The demo program computes predicted y values for internal, non-leaf nodes, but those predicted values are not used.
A source of minor vocabulary confusion is that the term "mean squared error" can have two different meanings in machine learning. In most scenarios, mean squared error is used to evaluate how well a model predicted values match the known correct target values in the training data. This MSE is the average of the squared differences between predicted y values and correct target y values.
But in decision tree regression splitting contexts, sometimes mean squared error is used as a synonym for variance. The idea is that in a decision tree, the predicted values are the y values of a split and the virtual correct target y values are all identical, the mean of the y values. In my opinion, this use of "mean squared error" in a decision tree context is just wrong, but sadly I have not yet been elected King of Machine Learning.
Implementing the Split Function
When implementing a decision tree split function, the minSamples parameter specifies the minimum number of rows associated with a node necessary before a split is to be considered. In practice this is usually set to 2 which leads to many splits and a more granular tree.
The minLeaf parameter specifies the minimum number of rows required after a split is performed. In practice this is usually set to 1 which allows leaf nodes with a single associated y value, but prohibits any leaf nodes with no associated y values, which is logically inconsistent.
A common decision tree procedure is to randomly shuffle the order in which columns are examined when looking for the best split. The idea here is that without shuffling, the first predictor columns have priority over later columns in cases where there are multiple split values that produce the same minimum mean squared error. Shuffling the order of columns prevents a bias towards early columns, but introduces a small amount of randomness into a decision tree regression system.
The numSplitCols parameter specifies how many of the columns of the X data should be examined when looking for a split value and split column. In almost all scenarios, you want to examine all columns. An exception is when a collection of decision trees is used for a random forest regression system where there are many predictor variables. The DecisionTreeRegressor has a default value of -1 for numSplitCols, which is a special value that means to "use all columns".
Understanding Decision Tree Regression -- Building
There are three main approaches for building a regression decision tree: using recursion, or using a stack data structure, or using list iteration. The most common technique presented in online examples and university courses is to use recursion. I suspect this is mostly because building a decision tree using recursion is elegant and fancy. However, many of the production teams that I have worked with have a formal ban on the use of any kind of recursion in production code. Recursive code is tricky to implement, tricky to debug, tricky to profile, and tricky to modify.
The demo does not use recursion to build the tree. Instead, the demo uses a stack data structure. In high-level pseudo-code:
push (root, depth)
while stack not empty
pop (currNode, currDepth)
if at maxDepth or not enough data then
node is a leaf, predicted = avg of associated Y values
compute split value and split column rows in currNode
if unable to split then
node is a leaf, predicted = avg of associated Y values
compute left-rows, right-rows
create and push (leftNode, currDepth+1)
create and push (rightNode, currDepth+1)
end-while
The build-tree algorithm is short but very tricky. The stack holds a tree node, which holds the associated rows of the node, and the current depth of the tree. The current depth of the tree is maintained so that the algorithm knows when maxDepth has been reached. Storing the rows that are associated with a node is a design that allows you to diagnose how the tree was constructed. If for some reason you want to minimize memory usage for huge decision trees, you can add code to the end of the Train() method to remove the stored rows in all nodes, because they're not needed after training.
There are many alternative ways to build a decision tree. Instead of using a stack, with list storage a tree can be constructed sequentially. And, instead of explicitly storing indices of the left and right children in a tree node, indices can be computed on the fly. And, instead of programmatically tracking the current tree depth as the tree is constructed, the current depth can be inferred from the current node ID. Each possible alternative has pros and cons. The technique used by the demo has moderate complexity and high flexibility.
The Demo Program
I used Visual Studio 2026 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 10.0 but I could have used the 8.0 or 9.0 versions that were already installed on my machine from Visual Studio 2022. I named the project DecisionTreeRegression. I checked the "Do not use top-level statements" option to avoid the program entry point shortcut syntax.
The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you wish.
After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive DecisionTreeRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program.
The overall program structure is presented in Listing 1. All the control logic is in the Main() method in the Program class. The Program class also holds helper functions to load data from file into memory and display data.
All of the decision tree regression functionality is in a DecisionTreeRegressor class, which contains a nested Node container class, and a nested StackInfo container class. The DecisionTreeRegressor class exposes a constructor and six methods: Train(), Predict(), Explain(), Accuracy(), MSE(), and Display().
Listing 1: Overall Program Structure
using System;
using System.Collections.Generic;
using System.IO;
namespace DecisionTreeRegression
{
internal class DecisionTreeProgram
{
static void Main(string[] args)
{
Console.WriteLine("Begin decision tree regression ");
// 1. load data
// 2. create and train/build tree
// 3. evaluate model
// 4. use model
Console.WriteLine("End demo ");
Console.ReadLine();
} // Main()
// helpers for Main()
static double[][] MatLoad(string fn, int[] usecols,
char sep, string comment) { . . }
static double[] MatToVec(double[][] mat) { . . }
static void VecShow(double[] vec, int dec, int wid) { . . }
}
public class DecisionTreeRegressor
{
public int maxDepth;
public int minSamples; // aka min_samples_split
public int minLeaf; // min number of values in a leaf
public int numSplitCols; // mostly for random forest
public List<Node> tree = new List<Node>();
public Random rnd; // order in which cols are searched
public double[][] trainX; // store data by ref
public double[] trainY;
// ------------------------------------------------------
public class Node // nested class
{
public int id;
public int colIdx; // aka featureIdx
public double thresh;
public int left; // index into List
public int right;
public double value;
public bool isLeaf;
public List<int> rows; // assoc rows in train data
public Node(int id, int colIdx, double thresh,
int left, int right, double value, bool isLeaf,
List<int> rows) { . . }
}
// --------------------------------------------
public class StackInfo // used to build tree
{
public Node node; // node holds associated rows
public int depth;
public StackInfo(Node n, int d)
{
this.node = n;
this.depth = d;
}
}
// --------------------------------------------
public DecisionTreeRegressor(int maxDepth = 2,
int minSamples = 2, int minLeaf = 1,
int numSplitCols = -1, int seed = 0) { . . } // ctor
public void Train(double[][] trainX, double[] trainY) { . . }
public double Predict(double[] x) { . . }
public void Explain(double[] x) { . . }
public double Accuracy(double[][] dataX, double[] dataY,
double pctClose) { . . }
public double MSE(double[][] dataX, double[] dataY) { . . }
public void Display(bool showRows) { . . }
private void MakeTree() { . . }
private double[] BestSplit(List<int> rows) { . . }
private double TreeTargetMean(List<int> rows) { . . }
private double TreeTargetVariance(List<int> rows) { . . }
} // class DecisionTreeRegressor
} // ns
The Explain() method is essentially a verbose version of the Predict() method. The MakeTree() helper does most of the work for the Train() method. MakeTree() calls BestSplit(), and BestSplit() calls the TreeTargetMean() and TreeTargetVariance() helpers.
Loading the Training and Test Data
The demo starts by loading the 200-item training data into memory:
string trainFile = "..\\..\\..\\Data\\synthetic_train_200.txt";
int[] colsX = new int[] { 0, 1, 2, 3, 4 };
double[][] trainX = MatLoad(trainFile, colsX, ',', "#");
double[] trainY = MatToVec(MatLoad(trainFile,
new int[] { 5 }, ',', "#"));
The training X data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
The 40-item test data is loaded into memory using the same pattern that was used to load the training data:
string testFile = "..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX = MatLoad(testFile, colsX, ',', "#");
double[] testY = MatToVec(MatLoad(testFile,
new int[] { 5 }, ',', "#"));
Next, the first three training items are displayed using these statements:
Console.WriteLine("First three train X: ");
for (int i = 0; i < 3; ++i)
VecShow(trainX[i], 4, 8);
Console.WriteLine("First three train y: ");
for (int i = 0; i < 3; ++i)
Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));
In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory.
Training the Decision Tree
The decision tree regression model is trained/fit/constructed like so:
int maxDepth = 3;
int minSamples = 2;
int minLeaf = 18;
int numSplitCols = -1; // all columns
DecisionTreeRegressor dtr =
new DecisionTreeRegressor(maxDepth, minSamples,
minLeaf, numSplitCols, seed:0);
dtr.Train(trainX, trainY);
If maxDepth is set to 0, the tree has just a single (root) node and the predicted y value for any input is the average of the training y data. If maxDepth is set to 1, the tree has at most three nodes: root node, left child, right child. If maxDepth is set to 2, the tree has at most seven nodes. In general, if maxDepth is set to n, the tree has at most 2^(n+1) - 1 nodes.
Decision trees are often sensitive to pathological data, and it's usually a good idea to display a trained tree to look for various weirdnesses:
Console.WriteLine("Tree: ");
dtr.Display(showRows: false);
Each tree node holds the indices of the rows of training data. When the showRows parameter is set to false, the associated rows are not displayed.
Evaluating and Using the Decision Tree
Next, the demo evaluates model accuracy:
double accTrain = dtr.Accuracy(trainX, trainY, 0.10);
Console.WriteLine("Accuracy train (within 0.10) = " +
accTrain.ToString("F4"));
double accTest = dtr.Accuracy(testX, testY, 0.10);
Console.WriteLine("Accuracy test (within 0.10) = " +
accTest.ToString("F4"));
The Accuracy() method scores a prediction as correct if the predicted y value is within 10% of the true target y value. The 10% is arbitrary and a reasonable closeness percentage will vary from problem to problem.
The demo computes and displays model mean squared error using these statements:
double mseTrain = dtr.MSE(trainX, trainY);
Console.WriteLine("MSE train = " + mseTrain.ToString("F4"));
double mseTest = dtr.MSE(testX, testY);
Console.WriteLine("MSE test = " + mseTest.ToString("F4"));
This code invokes standard MSE which is a measure of how well the model predicts. Another common regression model evaluation metric is the coefficient of determination, R2. See this R2 example for details.
The demo concludes by using the trained decision tree to make a prediction in two ways:
double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
VecShow(x, 4, 9);
double predY = dtr.Predict(x);
Console.WriteLine("y = " + predY.ToString("F4"));
dtr.Explain(x);
The Predict() method spits out the predicted y value without any explanation. The Explain() method is essentially a verbose predict function.
Wrapping Up
Implementing decision tree regression from scratch requires a significant amount of effort, but it allows you to easily integrate a prediction model with other .NET systems. A from-scratch implementation also allows you to easily modify the system. For example, you can comment-out the code statements that shuffle the order in which columns are processed in the split function, so that your decision tree is completely deterministic. Or you can add only error-checking that's relevant for your scenario.
The biggest disadvantage of a simple decision tree regression system is that a single tree usually overfits the training data. In overfitting, the trained model predicts well on the training data, but predicts poorly on new, previously unseen data. Because of the overfitting problem, decision trees are rarely used by themselves. Instead, multiple trees are combined into an ensemble model such as bagging tree regression, random forest regression, adaptive boosting regression, or gradient boosting regression. The decision tree design presented in this article is well-suited for use in ensemble models.