The Data Science Lab

Random Neighborhoods Regression Using C#

Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the random neighborhoods regression technique, where the goal is to predict a single numeric value. Compared to other ML regression techniques, its advantages are that it can handle both large and small datasets and that its results are highly interpretable.

The goal of a machine learning regression problem is to predict a single numeric value. There are roughly a dozen different regression techniques such as linear regression, k-nearest neighbors regression, random forest regression, Gaussian process regression, and neural network regression. Random neighborhoods regression is essentially a variation of k-nearest neighbors regression.

In regular k-nearest neighbors regression, to predict the target value y for an input vector x, you find the k-nearest training items to x and then calculate and return the average of the y values associated with the closest items. In random neighborhoods regression, you create an ensemble (collection) of several k-nearest neighbor regressor systems, each using a different subset of the source training data, and each with a different value of k. The predicted y value is the average of the predictions made by the collection of k-nearest neighbor systems.

This article presents a complete demo of random neighborhoods regression using the C# language. Although there are several code libraries that contain implementations of standard k-nearest neighbors regression, to the best of my knowledge, there are no libraries that directly implement random neighborhoods regression. Even if a library implementation of random neighborhoods regression exists, implementing a system from scratch allows you to easily integrate with other systems implemented with .NET and easily modify the system.

A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading synthetic training and test data into memory. The data looks like:

-0.1660,  0.4406, -0.9998, -0.3953, -0.7065,  0.4840
 0.0776, -0.1616,  0.3704, -0.5911,  0.7562,  0.1568
-0.9452,  0.3409, -0.1654,  0.1174, -0.7192,  0.8054
 0.9365, -0.3732,  0.3846,  0.7528,  0.7892,  0.1345
. . .

There are 200 training items and 40 test items. The first five values on each line are the x predictors. The last value on each line is the target y value to predict. The demo creates a random neighborhoods regression model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065].

The first part of the demo output shows how a random neighborhoods regression model is created:

Creating and training random neighborhoods regression model
Setting numNeighborhoods = 6
Setting pctData = 0.90
Setting minK = 2
Setting maxK = 7
Done

This sets up a system with 6 neighborhoods, where each uses a randomly selected 90% of the 200 training items (180 rows), and each uses a randomly selected value of k neighbors between 2 and 7. These four parameter values must be determined by trial and error. The second part of the demo output shows the model evaluation:

Evaluating model
Accuracy train (within 0.15) = 0.8100
Accuracy test (within 0.15) = 0.7500

A prediction is scored correct if it's within 15% of the true target value. The last part of the output shows how the system makes a prediction:

Predicting for x =
  -0.1660   0.4406  -0.9998  -0.3953  -0.7065
neighborhood [ 0]  k = 3 :  pred y = 0.5287
neighborhood [ 1]  k = 5 :  pred y = 0.5972
neighborhood [ 2]  k = 2 :  pred y = 0.5193
neighborhood [ 3]  k = 5 :  pred y = 0.5972	
neighborhood [ 4]  k = 5 :  pred y = 0.5823
neighborhood [ 5]  k = 5 :  pred y = 0.5972
Predicted y = 0.5703

The x input is the first training item. The predicted y value is the average of the six k-nearest neighbors systems: (0.5287 + 0.5972 + . . . + 0.5972) / 6 = 0.5703.

Figure 1: Random Neighborhoods Regression Using C# in Action

This article assumes you have intermediate or better programming skill but doesn't assume you know anything about random neighborhoods regression. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.

The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.

The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data has an underlying but complex structure that can be predicted.

All of the predictor values are between -1 and +1. There are 200 training data items and 40 test items. Random neighborhoods regression usually works well with both large and small datasets. When using random neighborhoods regression, you should normalize the training data predictor values because the "nearest" part of k-nearest neighbors is calculated using Euclidean distance. If you don't normalize, then columns/predictors with large magnitudes (such as employee annual income) can overwhelm predictors with small magnitudes (such as employee age).
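As a concrete illustration of the normalization point, here is a sketch of min-max normalization of one predictor column to the [-1, +1] range. This is one common normalization choice; the class and method names are mine, not part of the demo program:

```csharp
using System;
using System.Linq;

// Sketch: min-max normalize a single predictor column to [-1, +1]
// so that no column dominates the Euclidean distance calculation.
public static class NormDemo
{
  public static double[] MinMaxNormalize(double[] col)
  {
    double min = col.Min();
    double max = col.Max();
    double[] result = new double[col.Length];
    for (int i = 0; i < col.Length; ++i)  // map [min, max] to [-1, +1]
      result[i] = -1.0 + 2.0 * (col[i] - min) / (max - min);
    return result;
  }
}
```

For example, a raw income column with values (10, 20, 30) maps to (-1.0, 0.0, +1.0), putting it on the same scale as an age column normalized the same way.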

Random neighborhoods regression is most often used with data that has strictly numeric predictor variables. It is possible to use random neighborhoods regression with categorical data (such as a variable color with possible values red, blue, green) or data that has inherent ordering (such as a variable size with possible values small, medium, large).

Understanding k-Nearest Neighbors Regression
Because random neighborhoods regression is based on k-nearest neighbors (k-NN) regression, in order to understand random neighborhoods regression, you must understand regular k-NN regression. The basic k-NN regression technique is perhaps best understood by looking at a concrete example. Suppose that you have a set of training data with only five items:

      X                y
--------------------------
[0] (0.2, 0.9, 0.3)   0.22
[1] (0.4, 0.4, 0.1)   0.66
[2] (0.5, 0.7, 0.2)   0.33
[3] (0.6, 0.8, 0.6)   0.11
[4] (0.9, 0.3, 0.4)   0.44

And suppose you want to predict the y value for an input x = (0.5, 0.3, 0.4). First, you compute the Euclidean distance from the input x to all five training items. For example, the distance from x to training item [0] is computed like so:

dist = sqrt((0.5 - 0.2)^2 + (0.3 - 0.9)^2 + (0.4 - 0.3)^2)
     = sqrt(0.09 + 0.36 + 0.01)
     = sqrt(0.46)
     = 0.678

And the distances from the input x to the five training items are:

    distance   y
---------------------
[0]  0.678    0.22
[1]  0.332    0.66
[2]  0.447    0.33
[3]  0.548    0.11
[4]  0.400    0.44

Now, if you set k = 2, then the two closest training items (i.e. smallest distance) are [1] and [4]. The predicted y value is the average of the associated target y values: (0.66 + 0.44) / 2 = 1.10 / 2 = 0.55.
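The worked example above can be sketched in C#. This is a minimal, illustrative implementation of plain k-NN regression; the class name KnnDemo and its method organization are mine, not taken from the demo program:

```csharp
using System;
using System.Linq;

// Minimal k-NN regression sketch: find the k training items
// closest to x (Euclidean distance) and average their y values.
public static class KnnDemo
{
  public static double EucDistance(double[] a, double[] b)
  {
    double sum = 0.0;
    for (int i = 0; i < a.Length; ++i)
      sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.Sqrt(sum);
  }

  public static double Predict(double[][] trainX,
    double[] trainY, double[] x, int k)
  {
    // order training row indices by distance to x, closest first
    int[] ordered = Enumerable.Range(0, trainX.Length)
      .OrderBy(i => EucDistance(trainX[i], x)).ToArray();
    double sum = 0.0;
    for (int i = 0; i < k; ++i)  // average the k nearest y values
      sum += trainY[ordered[i]];
    return sum / k;
  }
}
```

With the five-item training data above and k = 2, Predict() returns (0.66 + 0.44) / 2 = 0.55 for x = (0.5, 0.3, 0.4).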

Two of the weaknesses of regular k-NN regression are 1.) the value of k must be determined by trial and error and that value can have a big influence on the predicted y value, and 2.) k-NN regression tends to overfit the training data and so even when prediction accuracy is good on the training data, accuracy on new, previously unseen data can be poor. Random neighborhoods regression deals with both weaknesses of standard k-NN regression.

The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project RandomNeighborhoodsRegression. I checked the "Do not use top-level statements" option to avoid the strange program entry point shortcut syntax.

The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.

After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the more descriptive RandomNeighborhoodsRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program.

The overall program structure is presented in Listing 1. All the control logic is in the Main() method in the Program class. The Program class also holds helper functions to load data from file into memory and display data. All of the random neighborhoods regression functionality is in a RandomNeighborhoodsRegressor class. The RandomNeighborhoodsRegressor class exposes a constructor and three methods: Train(), Predict(), Accuracy().

Listing 1: Overall Program Structure

using System;
using System.IO;
using System.Collections.Generic;

namespace RandomNeighborhoodsRegression
{
  internal class RandomNeighborhoodsRegressionProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin random neighborhoods" +
        " regression demo using C# ");

      // 1. load data
      // 2. create and train model
      // 3. evaluate model
      // 4. use model
 
      Console.WriteLine("End demo ");
      Console.ReadLine();
    } 

    // helpers for Main()

    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment) { . . }

    static double[] MatToVec(double[][] mat) { . . }

    static void VecShow(double[] vec, int dec,
      int wid) { . . }

  } // class Program

  // ========================================================

  public class RandomNeighborhoodsRegressor
  {
    public int numHoods;
    public double[][] trainX;
    public double[] trainY;
    public List<int[]> rows;  // each neighborhood
    public int[] kValues;  // k each neighborhood
    private Random rnd;

    public RandomNeighborhoodsRegressor(int numHoods,
      int seed) { . . }

    public void Train(double[][] trainX, double[] trainY,
      double pctData, int minK, int maxK) { . . }

    public double Predict(double[] x, 
      bool verbose = false) { . . }

    public double Accuracy(double[][] dataX,
      double[] dataY, double pctClose) { . . }

    // helpers for Train() and Predict()

    private void Shuffle(int[] indices) { . . }

    private static double EucDistance(double[] x1,
      double[] x2) { . . }

    private static int[] ArgSort(double[] values) { . . }

  } // class RandomNeighborhoodsRegressor

  // ========================================================

} // ns

The demo program starts by loading the 200-item training data into memory:

string trainFile =
  "..\\..\\..\\Data\\synthetic_train_200.txt";
int[] colsX = new int[] { 0, 1, 2, 3, 4 };
double[][] trainX =
  MatLoad(trainFile, colsX, ',', "#");
double[] trainY =
  MatToVec(MatLoad(trainFile,
  new int[] { 5 }, ',', "#"));

The training X data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
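The full MatLoad() and MatToVec() implementations are in the accompanying download. A simplified version, written as an assumption about their behavior based on the description above, might look like:

```csharp
using System;
using System.IO;
using System.Collections.Generic;
using System.Globalization;

// Simplified sketch of MatLoad()-style helpers: read the specified
// columns of a delimited text file into an array-of-arrays matrix,
// skipping blank lines and lines that begin with the comment token.
public static class LoaderDemo
{
  public static double[][] MatLoad(string fn, int[] usecols,
    char sep, string comment)
  {
    List<double[]> result = new List<double[]>();
    foreach (string line in File.ReadLines(fn))
    {
      if (line.StartsWith(comment) ||
          line.Trim().Length == 0) continue;  // comment or blank
      string[] tokens = line.Split(sep);
      double[] row = new double[usecols.Length];
      for (int j = 0; j < usecols.Length; ++j)
        row[j] = double.Parse(tokens[usecols[j]],
          CultureInfo.InvariantCulture);
      result.Add(row);
    }
    return result.ToArray();
  }

  public static double[] MatToVec(double[][] mat)
  {
    // assumes a single-column matrix, as when loading target y
    double[] result = new double[mat.Length];
    for (int i = 0; i < mat.Length; ++i)
      result[i] = mat[i][0];
    return result;
  }
}
```

Using CultureInfo.InvariantCulture guards against locales where the decimal separator is a comma.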

The 40-item test data is loaded into memory using the same pattern that was used to load the training data:

string testFile =
  "..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX =
  MatLoad(testFile, colsX, ',', "#");
double[] testY =
  MatToVec(MatLoad(testFile,
  new int[] { 5 }, ',', "#"));

The first three training items are displayed like so:

Console.WriteLine("First three train X: ");
for (int i = 0; i < 3; ++i)
  VecShow(trainX[i], 4, 8);
Console.WriteLine("First three train y: ");
for (int i = 0; i < 3; ++i)
  Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));

In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory.

The parameters for the random neighborhoods regression model are set using these four statements:

int numNeighborhoods = 6;
double pctData = 0.90;
int minK = 2;
int maxK = 7;

The values for all four parameters must be determined by trial and error. In a non-demo scenario, you'll probably want to use a greater number of k-NN regressors than the six used by the demo.

The pctData parameter value of 0.90 means each k-NN regressor will use a randomly selected 90% of the 200 training items. The training items are selected "with replacement," which means that you can, and will almost certainly, get duplicate training items in each k-NN regressor. Duplicate training items help prevent model overfitting.

The values of the minK = 2 and maxK = 7 parameters mean each k-NN regressor will use between 2 and 7 nearest neighbors when computing its prediction. The values of 2 and 7 have worked well for me across a large number of different regression problem scenarios.
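Based on the description above, the setup work inside Train() reduces to two pieces of bookkeeping: picking each neighborhood's training rows with replacement, and picking its k value. A sketch follows; the helper names MakeNeighborhoods() and MakeKValues() are my own organization, not the demo's actual code:

```csharp
using System;

// Sketch of the neighborhood setup implied by Train(): each of the
// numHoods neighborhoods gets pctData of the training rows sampled
// with replacement, plus a k value drawn from [minK, maxK].
public static class TrainDemo
{
  public static int[][] MakeNeighborhoods(int numHoods,
    int numTrain, double pctData, Random rnd)
  {
    int numRows = (int)(pctData * numTrain);  // e.g. 180 of 200
    int[][] rows = new int[numHoods][];
    for (int h = 0; h < numHoods; ++h)
    {
      rows[h] = new int[numRows];
      for (int i = 0; i < numRows; ++i)
        rows[h][i] = rnd.Next(0, numTrain);  // with replacement
    }
    return rows;
  }

  public static int[] MakeKValues(int numHoods, int minK,
    int maxK, Random rnd)
  {
    int[] result = new int[numHoods];
    for (int h = 0; h < numHoods; ++h)
      result[h] = rnd.Next(minK, maxK + 1);  // k in [minK, maxK]
    return result;
  }
}
```

Note that Random.Next(lo, hi) excludes the upper bound, hence the maxK + 1.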

The random neighborhoods regression model is created and trained like so:

RandomNeighborhoodsRegressor model = 
  new RandomNeighborhoodsRegressor(numNeighborhoods,
  seed:0);
model.Train(trainX, trainY, pctData, minK, maxK);
Console.WriteLine("Done ");

The seed parameter controls the behavior of a random number generator, which is used to select the random rows of training data and the k value for each k-NN regressor. Because there is an element of randomness, the random neighborhoods regression technique is not fully reproducible across environments (for example, if you refactor to another programming language with a different random number algorithm), which is a minor disadvantage of the technique.

Next, the demo evaluates model accuracy:

double accTrain = model.Accuracy(trainX, trainY, 0.15);
Console.WriteLine("Accuracy train (within 0.15) = " +
  accTrain.ToString("F4"));
double accTest = model.Accuracy(testX, testY, 0.15);
Console.WriteLine("Accuracy test (within 0.15) = " +
  accTest.ToString("F4"));

The Accuracy() method scores a prediction as correct if the predicted y value is within 15% of the true target y value. There are several other ways to evaluate a trained regression model, including root mean squared error, coefficient of determination, and so on. Using the Accuracy() method as a template, other evaluation metrics are easy to implement.
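A sketch of such an accuracy computation follows. It works from parallel arrays of predicted and actual values; the closeness criterion, within pctClose (15%) of the true target value, is my reading of the description above:

```csharp
using System;

// Sketch of an Accuracy()-style evaluation helper: the fraction of
// predictions that land within pctClose of the true target value.
public static class AccuracyDemo
{
  public static double Accuracy(double[] predY, double[] actualY,
    double pctClose)
  {
    int numCorrect = 0;
    for (int i = 0; i < actualY.Length; ++i)
      if (Math.Abs(predY[i] - actualY[i]) <
          Math.Abs(pctClose * actualY[i]))
        ++numCorrect;  // close enough to count as correct
    return (numCorrect * 1.0) / actualY.Length;
  }
}
```

Swapping the if-condition for a squared-error accumulator gives root mean squared error instead.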

The demo concludes by using the trained random neighborhoods regression model to make a prediction:

double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
VecShow(x, 4, 9);
double predY = model.Predict(x, verbose:true);
Console.WriteLine("Predicted y = " + predY.ToString("F4"));

The x input is the first training data item. The predicted y value is 0.5703, which is not very close to the actual y value of 0.4840. The verbose argument passed to Predict() instructs the method to display information for each of the k-NN regressors. You can modify the demo program code to make the prediction output even more detailed along the lines of:

. . .
neighborhood [5]  k = 3 : pred y = 0.5004
  nearest neighbor 0 = [167]  distance = 0.0213  y = 0.5023
  nearest neighbor 1 = [ 42]  distance = 0.0349  y = 0.4867
  nearest neighbor 2 = [112]  distance = 0.0467  y = 0.5122
. . .
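Abstracting away from the demo code, the ensemble prediction itself can be sketched as follows. The rows and kValues structures follow Listing 1, but the per-neighborhood k-NN logic here is my simplified assumption, not the demo's actual Predict():

```csharp
using System;
using System.Linq;

// Sketch of an ensemble Predict(): each neighborhood runs plain
// k-NN on its own row subset with its own k, and the overall
// prediction is the average of the per-neighborhood predictions.
public static class EnsembleDemo
{
  public static double Predict(double[][] trainX, double[] trainY,
    int[][] rows, int[] kValues, double[] x)
  {
    double sum = 0.0;
    for (int h = 0; h < rows.Length; ++h)
    {
      // order this neighborhood's rows by Euclidean distance to x
      int[] ordered = rows[h].OrderBy(i =>
        Math.Sqrt(trainX[i].Zip(x, (a, b) => (a - b) * (a - b)).Sum()))
        .ToArray();
      double hoodSum = 0.0;
      for (int j = 0; j < kValues[h]; ++j)
        hoodSum += trainY[ordered[j]];
      sum += hoodSum / kValues[h];  // this neighborhood's prediction
    }
    return sum / rows.Length;  // average over all neighborhoods
  }
}
```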

Compared to some other regression techniques such as kernel ridge regression, Gaussian process regression, and neural network regression, random neighborhoods regression is highly interpretable. In some problem scenarios, high interpretability might be required for legal or other reasons.

Wrapping Up
The version of random neighborhoods regression presented in this article selects data subsets using only rows of the training data. In scenarios with many predictors, say more than 20, you can consider also selecting random columns. If you do so, you should select columns without replacement so that there are no duplicate columns. Selecting random rows with replacement and random columns without replacement, when used with a collection of simple decision tree regressors, is called bagging ("bootstrap aggregation") regression. There is some old research evidence that suggests bagging doesn't work well with k-NN regressors, but that evidence is not convincing in my opinion.

Random neighborhoods regression is most often used with data that has strictly numeric predictor variables. It is possible to use random neighborhoods regression with mixed categorical and numeric data. Plain categorical data, such as a variable color with values "red," "blue," "green," can be one-over-n-hot encoded: red = (0.3333, 0, 0), blue = (0, 0.3333, 0), green = (0, 0, 0.3333). Categorical data with inherent ordering, such as a variable height with values "short," "medium," "tall," can be equal-interval encoded: short = 0.25, medium = 0.50, tall = 0.75. Binary data can be encoded using either technique, for example, "male" = 0.0, "female" = 0.5, or "male" = 0.25, "female" = 0.75.
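The two encoding schemes just described can be sketched as small helper functions (illustrative names and organization, mine rather than the demo's):

```csharp
using System;

// Sketch of the encoding schemes for categorical predictors:
// one-over-n-hot for plain categorical data, equal-interval for
// categorical data that has inherent ordering.
public static class EncodeDemo
{
  // e.g. n = 3 colors: red = (0.3333, 0, 0), blue = (0, 0.3333, 0)
  public static double[] OneOverNHot(int index, int n)
  {
    double[] result = new double[n];  // all cells start at 0.0
    result[index] = 1.0 / n;
    return result;
  }

  // e.g. n = 3 heights: short = 0.25, medium = 0.50, tall = 0.75
  public static double EqualInterval(int rank, int n)
  {
    return (rank + 1.0) / (n + 1.0);  // rank is 0-based
  }
}
```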

In the early days of machine learning, random neighborhoods regression was used quite often, at least by me and many of my colleagues. But random neighborhoods regression is not used nearly as often as it used to be. I'm not sure exactly why the use of random neighborhoods regression has declined, but the technique often works well in a wide range of problem scenarios and can make a nice addition to your personal machine learning toolkit.
