The Data Science Lab

Decision Tree Regression from Scratch Using C#

Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of decision tree regression using the C# language. Unlike most implementations, this one does not use recursion or pointers, which makes the code easy to understand and modify.

Decision tree regression is a machine learning technique that predicts a single numeric value from a vector of numeric predictor values. To predict the output y for an input vector X, the tree structure encodes a set of if-then rules such as, "If the value of X at index [2] is less than or equal to 0.44 and the value at index [0] is greater than 0.63 then the predicted y value is 0.58."

Decision tree regression is a fundamental technique that can be used by itself, and it is also the basis for powerful ensemble techniques (collections of many decision trees), notably AdaBoost (adaptive boosting) regression, random forest regression (which includes bagging regression), and gradient boosting regression.

This article presents a demo of decision tree regression, implemented from scratch, using the C# language. The concepts underlying decision tree regression are relatively simple, but implementation is tricky. This article uses a tree design that does not use recursion or pointers/references. Recursion is intellectually beautiful, but code that uses recursion is difficult to debug and modify -- so much so that the majority of the product teams I've worked with do not allow recursive code.

A good way to see where this article is headed is to take a look at the screenshot in Figure 1 and the diagram in Figure 2. The demo program begins by loading synthetic training and test data into memory. The data looks like:

-0.1660,  0.4406, -0.9998, -0.3953, -0.7065,  0.4840
 0.0776, -0.1616,  0.3704, -0.5911,  0.7562,  0.1568
-0.9452,  0.3409, -0.1654,  0.1174, -0.7192,  0.8054
 0.9365, -0.3732,  0.3846,  0.7528,  0.7892,  0.1345
. . .
Figure 1: Decision Tree Regression in Action on Synthetic Data

The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. The demo creates a decision tree model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065]. The key parts of the demo output are:

Loading synthetic train (200) and test (40) data
Set maxDepth = 6
Set minSamples = 2
Creating regression tree
Done
Accuracy train (within 0.15) = 0.8850
Accuracy test (within 0.15) = 0.5500

Predicting for x =
  -0.1660   0.4406  -0.9998  -0.3953  -0.7065
y = 0.4909

The two parameters that control the behavior of the demo decision tree are maxDepth and minSamples. The maxDepth parameter specifies the maximum depth of the decision tree, that is, the number of levels below the root node. If maxDepth = 1, the decision tree has three nodes: a root node, a left child, and a right child. If maxDepth = 2, the tree has seven nodes. In general, if maxDepth = n, the resulting tree has 2^(n+1) - 1 nodes. The demo tree shown in Figure 1 has maxDepth = 6, so the tree has 2^7 - 1 = 128 - 1 = 127 nodes.

The minSamples parameter specifies the minimum number of associated data items a node must have in order to be split into a left child and a right child. The demo tree uses minSamples = 2, which is the smallest possible value because a node with only one associated item can't be split any further.

This article assumes you have intermediate or better programming skill but doesn't assume you know anything about decision tree regression. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.

The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.

The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, structure which can be predicted.

All of the predictor values are between -1 and +1. There are 200 training/reference data items and 40 test items. When using decision tree regression, it's not necessary to normalize the training data predictor values because no distance between data items is computed. However, it's not a bad idea to normalize the predictors just in case you want to send the data to other regression algorithms that require normalization (for example, k-nearest neighbor regression).
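If you do want to normalize the predictors, a simple approach is min-max scaling. The following is a minimal sketch, not part of the demo program, of a hypothetical helper that scales one column of an array-of-arrays matrix to the range [-1, +1]:

// minimal sketch (hypothetical helper, not in the demo): min-max scale
// one column of an array-of-arrays matrix to the range [-1, +1]
static void MinMaxNormalizeColumn(double[][] data, int col)
{
  double min = data[0][col];
  double max = data[0][col];
  for (int i = 0; i < data.Length; ++i)
  {
    if (data[i][col] < min) min = data[i][col];
    if (data[i][col] > max) max = data[i][col];
  }
  for (int i = 0; i < data.Length; ++i)
    data[i][col] = -1.0 + 2.0 * (data[i][col] - min) / (max - min);
}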

In practice, decision tree regression is most often used with data that has strictly numeric predictor variables. It is possible to use decision tree regression with mixed categorical and numeric data by applying ordinal encoding to the categorical data. This is a subtle topic that I'll explain at the end of this article.

Understanding Decision Tree Regression - Prediction
The two most important aspects of decision tree regression are understanding how to use the tree to make a prediction, and how to create the tree from the training data. Figure 2 shows a small decision tree with maxDepth = 2 and minSamples = 2, which was created using a tiny set of 10 training items. Each training item is a vector with three elements, at indices [0], [1], and [2].

Figure 2: Decision Tree Architecture (No Recursion, No Pointers/References)

To predict the y value for an input of X = (0.17, 0.96, 0.44), the algorithm starts at the root node. If the X element at the split column ([1]) is less than or equal to the split value (0.64), a virtual indicator moves to the left sub-tree. Otherwise, the virtual indicator moves to the right sub-tree. For X = (0.17, 0.96, 0.44), because 0.96 is greater than 0.64, the indicator moves to the right. In the same way, at the next node, the indicator moves to the left sub-tree because the X element 0.17 is less than or equal to the split value 0.38.

At this point, the virtual indicator is at a terminal/leaf node. The predicted value is y = 0.24, which is the average of the y values associated with data rows [1], [5], and [8]. In short, after a decision tree has been created, a prediction is made by walking down the tree, advancing left or right according to each node's split column and split value, until a leaf node is reached; the predicted y value is the prediction value stored in the leaf node. A code sketch of this prediction walk is shown below.
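Because the demo stores its nodes in a list rather than using pointers, the prediction walk can be written as a short loop. The following is a minimal sketch of what such a Predict() method might look like. It assumes the TreeNode fields shown later in Listing 1 (splitCol, splitVal, isLeaf, predictedY) and the child-index arithmetic described later in this article; the demo's actual method may include additional checks (for example, for empty nodes).

// minimal sketch of non-recursive prediction; assumes nodes are stored
// in a List<TreeNode> where node i has children at 2*i+1 and 2*i+2
public double Predict(double[] x)
{
  int i = 0;  // start at the root node
  while (this.tree[i].isLeaf == false)
  {
    if (x[this.tree[i].splitCol] <= this.tree[i].splitVal)
      i = (2 * i) + 1;  // move to the left child
    else
      i = (2 * i) + 2;  // move to the right child
  }
  return this.tree[i].predictedY;  // average y of the rows at the leaf
}

OK, but how is the tree constructed from the training data?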

Understanding Decision Tree Regression - Construction
Tree construction for a non-recursive, no-pointer design starts by creating an array or list of empty nodes. If maxDepth = n, then the number of nodes in the tree is 2^(n+1) - 1. For the tree in Figure 2, maxDepth is set to 2 so the number of nodes is 2^3 - 1 = 7. The nodes are 0-based indexed from 0 to 6.

During tree construction, also known as training, or fitting, each node is associated with certain rows of the training data. The root node is associated with all 10 rows of the tiny dataset. If a prediction was made based just on the root node, the predicted y value would be the average of all 10 target y values in the training data.

At each node, all of the x values in the associated rows of the training data are examined to find the one x value (the split value) and its column (the split column) that would separate the training data into two subsets with the smallest amount of variability of y values. For example, if a node had only four associated rows, [1], [5], [8], [9], and the corresponding y values were 0.25, 0.76, 0.73, 0.28, the split that would create the smallest variability would be ([1], [9]) and ([5], [8]). There are several ways to measure variability. The demo implementation uses weighted statistical variance, which is the most common technique.
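To make the weighted variance idea concrete, here is a minimal sketch of how a candidate split might be scored. It is a simplified illustration, not necessarily identical to the demo's GetSplitInfo() method: the variance of the y values sent to the left child and the variance of the y values sent to the right child are each weighted by the number of rows they receive.

// minimal sketch: score a candidate split by weighted variance
// (smaller is better)
static double SplitScore(List<double> leftY, List<double> rightY)
{
  int n = leftY.Count + rightY.Count;
  return (leftY.Count * Variance(leftY) +
    rightY.Count * Variance(rightY)) / n;
}

static double Variance(List<double> y)  // population variance
{
  if (y.Count == 0) return 0.0;
  double mean = 0.0;
  for (int i = 0; i < y.Count; ++i) mean += y[i];
  mean /= y.Count;
  double sum = 0.0;
  for (int i = 0; i < y.Count; ++i)
    sum += (y[i] - mean) * (y[i] - mean);
  return sum / y.Count;
}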

The splitting process continues until a node is too deep to have children (nodes 3 through 6 in Figure 2), or until a node has fewer than minSamples associated rows / y values. Some nodes may have no associated rows, in which case they are tagged with isEmpty = true. Nodes that are not empty but have no children are tagged with isLeaf = true. The predicted y value for a leaf node is the average of the y values of the training data rows associated with the node.

The demo program Train() method calls helper method GetSplitInfo(). Notice that at each level of a tree, all x-values in the training data are examined. For large datasets, this could be a large number of examinations of candidate split values. In the early days of machine learning, with limited computer memory and slow CPU processing speeds, various sampling techniques were used to reduce the number of candidate split value examinations. But those techniques introduce non-reproducibility and so with modern computing systems, sampling techniques to find the split column and split value are rarely used.

If you search the internet for examples of decision tree regression implementations, the vast majority of examples you'll find use recursion to construct and make predictions, and use pointers/references to store tree nodes. When I was a university professor, I taught my students using this design in sophomore-level data structures classes (with the Pascal language in those days). But recursion can be a nightmare in a production environment, so I prefer storing tree nodes in a fixed list or array. With a list/array storage design, for a node i, its left child is located at index [(2*i) + 1] and its right child is located at index [(2*i) + 2]. The minor downside to using list/array storage is that empty nodes are explicitly stored, but in my opinion that's a small price to pay for greatly increased ease of debugging, interpretability, and customization.
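Under this storage scheme, constructing the (initially empty) tree is just a matter of filling a list with placeholder nodes; the parent/child relationships are implicit in the indexes. A minimal sketch, assuming the TreeNode container class shown later in Listing 1:

// minimal sketch: allocate all 2^(maxDepth+1) - 1 nodes up front;
// for node i, left child = 2*i+1, right child = 2*i+2, parent = (i-1)/2
int numNodes = (int)Math.Pow(2, maxDepth + 1) - 1;
List<TreeNode> tree = new List<TreeNode>();
for (int i = 0; i < numNodes; ++i)
  tree.Add(new TreeNode(i));  // every node starts as isEmpty = true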

The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project the rather wordy DecisionTreeRegressionNoRecursion. I checked the "Do not use top-level statements" option to avoid the program entry point shortcut syntax.

The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.

After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive TreeRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program.

The overall program structure is presented in Listing 1. All the control logic is in the Main() method in the Program class. The Program class also holds helper functions to load data from file into memory and display data. All of the decision tree regression functionality is in a DecisionTreeRegressor class, which contains a nested TreeNode container class. The DecisionTreeRegressor class exposes a constructor and five methods: Train(), Predict(), Accuracy(), Show(), and ShowNode().

Listing 1: Overall Program Structure

using System;
using System.IO;
using System.Collections.Generic;

namespace DecisionTreeRegressionNoRecursion
{
  internal class TreeRegressionProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin regression tree" +
        " (no recursion) demo ");

      // 1. load data
      // 2. create and train decision tree
      // 3. show (part of) decision tree
      // 4. evaluate decision tree
      // 5. use tree to make prediction

      Console.WriteLine("End demo ");
      Console.ReadLine();
    } // Main

    // helpers for Main()

    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment) { . . }
 
    static double[] MatToVec(double[][] mat) { . . }

    static void VecShow(double[] vec, int dec,
      int wid) { . . }

  } // class TreeRegressionProgram

  // --------------------------------------------------------

  public class DecisionTreeRegressor
  {
    public int maxDepth;  // depth = 0 is just a root node
    public int minSamples; // must be at least 2
    public List<TreeNode> tree;  // not recursive

    public DecisionTreeRegressor(int maxDepth,
      int minSamples) { . . }

    // ------------------------------------------------------

    public class TreeNode  // nested container class
    {
      public int nodeID;
      public int splitCol;
      public double splitVal;
      public bool isEmpty;
      public bool isLeaf;
      public double predictedY;

      public TreeNode(int id)
      {
        this.nodeID = id;
        this.splitCol = -1;      // invalid
        this.splitVal = -99.99;  // invalid
        this.isEmpty = true;
        this.isLeaf = false;
        this.predictedY = 0.0;
      }
    }

    // ------------------------------------------------------

    public void Train(double[][] dataX,
      double[] dataY) { . . }
 
    private static void GetSplitInfo(double[][] dataX,
      double[] dataY, List<int> nodeRows,
      out int splitCol, out double splitVal) { . . }

    public double Predict(double[] x) { . . }

    private void Validate() { . . }
 
    public double Accuracy(double[][] dataX, double[] dataY,
      double pctClose) { . . }

    public void Show() { . . }

    public void ShowNode(int nodeID) { . . }

    private static double Mean(List<double> yValues) { . . }

    private static double Variance(List<double> yValues) { . . }
 
  } // class DecisionTreeRegressor
} // ns

The demo starts by loading the 200-item training data into memory:

string trainFile =
  "..\\..\\..\\Data\\synthetic_train_200.txt";
int[] colsX = new int[] { 0, 1, 2, 3, 4 };
double[][] trainX =
  MatLoad(trainFile, colsX, ',', "#");
double[] trainY =
  MatToVec(MatLoad(trainFile,
  new int[] { 5 }, ',', "#"));

The training X data is stored as an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load columns 0, 1, 2, 3, 4, where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
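The body of the MatLoad() helper isn't shown in this article. As a rough idea of what such a loader involves (this is a minimal sketch, and the actual demo code may differ), a comma-delimited reader can be written as:

// minimal sketch of a delimited matrix loader; fn is the file path,
// usecols are the 0-based columns to extract, sep is the delimiter,
// comment marks lines to skip
static double[][] MatLoad(string fn, int[] usecols, char sep,
  string comment)
{
  List<double[]> rows = new List<double[]>();
  string[] lines = File.ReadAllLines(fn);
  foreach (string line in lines)
  {
    if (line.StartsWith(comment) || line.Trim() == "") continue;
    string[] tokens = line.Split(sep);
    double[] row = new double[usecols.Length];
    for (int j = 0; j < usecols.Length; ++j)
      row[j] = double.Parse(tokens[usecols[j]]);
    rows.Add(row);
  }
  return rows.ToArray();
}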

The first three training items are displayed like so:

Console.WriteLine("First three train X: ");
for (int i = 0; i < 3; ++i)
  VecShow(trainX[i], 4, 8);
Console.WriteLine("First three train y: ");
for (int i = 0; i < 3; ++i)
  Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));

In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory. The 40-item test data is loaded into memory using the same pattern that was used to load the training data:

string testFile =
  "..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX =
  MatLoad(testFile, colsX, ',', "#");
double[] testY =
  MatToVec(MatLoad(testFile,
  new int[] { 5 }, ',', "#"));

The decision tree regression model is trained/fit/constructed using these statements:

int maxDepth = 6;
int minSamples = 2;
DecisionTreeRegressor dtr = 
  new DecisionTreeRegressor(maxDepth, minSamples);
dtr.Train(trainX, trainY);

The values of maxDepth and minSamples must be determined by trial and error. If you increase the value of maxDepth, you'll eventually get a tree large enough so that every training y value is in its own leaf node, and the tree will score 100% accuracy on the training data. However, as maxDepth increases, a tree becomes overfitted and the accuracy on new, previously unseen data will decrease.
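One simple way to explore the effect of maxDepth is to train several trees and compare train and test accuracy. A hypothetical sketch, not part of the demo program, looks like:

// minimal sketch: trial-and-error exploration of maxDepth
for (int depth = 2; depth <= 10; ++depth)
{
  DecisionTreeRegressor t =
    new DecisionTreeRegressor(depth, 2);  // minSamples = 2
  t.Train(trainX, trainY);
  double trainAcc = t.Accuracy(trainX, trainY, 0.15);
  double testAcc = t.Accuracy(testX, testY, 0.15);
  Console.WriteLine("depth = " + depth +
    "  acc train = " + trainAcc.ToString("F4") +
    "  acc test = " + testAcc.ToString("F4"));
}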

Behind the scenes, the Train() method calls a private Validate() method to make sure that the tree nodes are consistent (an empty node cannot have a non-empty child, and so on). The call to Validate() is optional. You can easily enhance Validate() for custom scenarios.

After the decision tree is created, the demo displays node [62] using method ShowNode():

// dtr.Show();  // show all nodes
Console.WriteLine("Showing node 62: ");
dtr.ShowNode(62);

The output is:

Showing node 62:
========================
nodeID: 62
splitCol = 4
splitVal = 0.5972
isLeaf: False
isEmpty: False
predictedY = 0.0000
========================

Node [62] is an interior node, which is neither empty, nor a leaf node. Next, the demo evaluates model accuracy:

double accTrain = dtr.Accuracy(trainX, trainY, 0.15);
Console.WriteLine("Accuracy train (within 0.15) = " +
  accTrain.ToString("F4"));
double accTest = dtr.Accuracy(testX, testY, 0.15);
Console.WriteLine("Accuracy test (within 0.15) = " +
  accTest.ToString("F4"));

The Accuracy() method scores a prediction as correct if the predicted y value is within 15% of the true target y value. There are several other ways to evaluate a trained regression model, including root mean squared error, coefficient of determination, and so on. Using the Accuracy() method as a template, other evaluation metrics are easy to implement.
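For example, using the public Predict() method, a hypothetical root mean squared error helper (not part of the demo program) could be written as:

// minimal sketch: root mean squared error built on the Predict() method
static double RootMSE(DecisionTreeRegressor model,
  double[][] dataX, double[] dataY)
{
  double sum = 0.0;
  for (int i = 0; i < dataX.Length; ++i)
  {
    double y = model.Predict(dataX[i]);
    sum += (y - dataY[i]) * (y - dataY[i]);
  }
  return Math.Sqrt(sum / dataX.Length);
}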

The demo concludes by using the trained decision tree to make a prediction:

double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
VecShow(x, 4, 9);
double predY = dtr.Predict(x);
Console.WriteLine("y = " + predY.ToString("F4"));

The x input is the first training data item. The predicted y value is 0.4909, which is quite close to the actual y value of 0.4840. Because of the way decision trees for regression are created/trained, they should always do well when predicting training data items. However, decision trees for regression are highly prone to model overfitting. Because of this, decision trees are rarely used by themselves.

Wrapping Up
Even though decision trees for regression do not predict well on new, previously unseen data, when many regression trees are combined into an ensemble technique, they can produce extremely accurate models. The most common tree ensemble models for regression are AdaBoost ("adaptive boosting"), random forest (includes bagging models), and gradient boosting models. These ensemble techniques will be the topics of future Visual Studio Magazine articles.

The demo program presented in this article uses purely numeric predictor data. It is possible to use categorical predictors (or mixed numeric and categorical predictors) by ordinal encoding the data. For example, if a predictor variable color has possible values red, blue, green, yellow, you can encode red = 0, blue = 1, green = 2, yellow = 3. In theory, this shouldn't work very well because the split operation compares on less-than-or-equal. With this comparison, red will always be less than any other color value. Put another way, with ordinal encoding, predictions will depend on the order used in the encoding. In spite of this, in practice, ordinal encoding of categorical predictor variables often works quite well.
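A minimal sketch of ordinal encoding, where the color-to-number mapping is an arbitrary design choice, looks like:

// minimal sketch: ordinal encoding of a categorical predictor value;
// the mapping order (red = 0, blue = 1, and so on) is arbitrary
Dictionary<string, double> colorMap = new Dictionary<string, double>()
{
  { "red", 0.0 }, { "blue", 1.0 }, { "green", 2.0 }, { "yellow", 3.0 }
};
double encodedColor = colorMap["green"];  // 2.0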
