The Data Science Lab
Simple k-NN Regression Using C#
Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of k-nearest neighbors regression to predict a single numeric value. Compared to other machine learning regression techniques, k-NN regression is often slightly less accurate, but is very simple to implement and customize, and the results are highly interpretable.
The goal of a machine learning regression problem is to predict a single numeric value. For example, you might want to predict the price of a house based on its square footage, number of bedrooms, property tax rate, and so on. Common regression techniques include multiple linear regression, tree-based regression (decision tree, AdaBoost, random forest, bagging), neural network regression, and k-nearest neighbors (k-NN) regression. Note that logistic regression, in spite of its name, is a binary classification algorithm rather than a regression algorithm.
This article presents a demo of k-NN regression using the C# language. Compared to other regression techniques, k-NN regression is often slightly less accurate (although it is often more accurate than multiple linear regression), but it is very simple to implement and customize, and its results are highly interpretable.
A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading synthetic training and test data into memory. The data looks like:
-0.1660, 0.4406, -0.9998, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, 0.3704, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -0.1654, 0.1174, -0.7192, 0.8054
0.9365, -0.3732, 0.3846, 0.7528, 0.7892, 0.1345
. . .
The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. The demo creates a k-NN regression model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for x = [0.7462, 0.4006, -0.0590, 0.6543, -0.0083]. The demo output is:
Creating k-NN model, k=4
Accuracy train (within 0.15): 0.7950
Accuracy test (within 0.15): 0.7750
Predicting for: [0.7462, 0.4006, -0.0590, 0.6543, -0.0083]
y = 0.2137
The k-NN regression technique is very simple. To make a prediction for some input vector x, you examine the training/reference data and find the k closest items to x. The value of k is often set to 3 or 5 and must be determined by trial and error. The predicted y value is the average of the known y values of the k closest items in the training data.
This article assumes you have intermediate or better programming skill but doesn't assume you know anything about the k-NN regression algorithm. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is a bit too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.
The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, structure which can be predicted.
All of the predictor values are between -1 and +1. When using k-NN regression, because nearest neighbors are computed using Euclidean distance, it's important to normalize the source data so that variables with large magnitudes (for example, employee annual income) don't overwhelm variables with small magnitudes (for example, employee age). All the target values are between 0 and 1. There are 200 training/reference data items and 40 test items.
In practice, the k-NN regression technique is most often used with data that has strictly numeric predictor variables. It is possible to use k-NN regression with mixed categorical and numeric data.
Plain categorical data, such as a variable color with values "red", "blue", "green", can be one-over-n-hot encoded: red = (0.3333, 0, 0), blue = (0, 0.3333, 0), green = (0, 0, 0.3333). Categorical data with inherent ordering, such as a variable height with values "short", "medium", "tall", can be equal-interval encoded: short = 0.25, medium = 0.50, tall = 0.75. Binary data can be encoded using either technique, for example, "male" = 0.0, "female" = 0.5, or "male" = 0.25, "female" = 0.75.
When using these encoding schemes, all encoded values are between 0.0 and 1.0 and so numeric data should be normalized to this range too. Two common techniques are divide-by-constant if all raw values are positive (for example, dividing people's ages by 100), or min-max normalization. Note that z-score normalization can be used but extreme raw values might be normalized to values less than -1 or greater than +1.
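To make the idea concrete, here is a minimal sketch of min-max normalization for one column of an array-of-arrays style matrix. The MinMaxNormalize() name and the in-place design are illustrative assumptions, not part of the demo program:

static void MinMaxNormalize(double[][] data, int col)
{
  // find the smallest and largest raw values in the column
  double min = data[0][col];
  double max = data[0][col];
  for (int i = 0; i < data.Length; ++i)
  {
    if (data[i][col] < min) min = data[i][col];
    if (data[i][col] > max) max = data[i][col];
  }
  // map each raw value to [0.0, 1.0]; assumes max > min
  for (int i = 0; i < data.Length; ++i)
    data[i][col] = (data[i][col] - min) / (max - min);
}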
Understanding k-NN Regression
The k-NN regression technique is perhaps best understood by looking at a concrete example. Suppose, as in the demo, you want to predict the y value for x = [0.7462, 0.4006, -0.0590, 0.6543, -0.0083]. The algorithm scans through the 200 training items and computes and saves the distance between x and each training item.
Next, the distances and their associated index values are sorted from smallest (nearest) distance to largest. For the demo data, the k = 4 nearest training data items to x are items [120], [180], [46], [146]:
[120] = (0.8557, 0.7919, -0.0169, 0.7134, -0.1628) y = 0.2002
[180] = (0.9950, 0.1641, -0.4132, 0.8579, 0.0142) y = 0.2003
[ 46] = (0.1450, 0.4663, 0.0380, 0.5418, 0.1377) y = 0.2931
[146] = (0.7498, -0.0963, 0.4169, 0.5549, -0.0103) y = 0.1614
The predicted y value is the average of the nearest neighbors: y = (0.2002 + 0.2003 + 0.2931 + 0.1614) / 4 = 0.8550 / 4 = 0.2137.
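The following sketch shows how a Predict() method along these lines might be implemented. The demo's actual implementation (in the download) may differ in its details; this version assumes Euclidean distance and uses the built-in Array.Sort() overload that sorts a keys array and reorders a parallel items array:

public double Predict(double[] x)
{
  int n = this.trainX.Length;
  double[] dists = new double[n];
  int[] indices = new int[n];
  for (int i = 0; i < n; ++i)
  {
    dists[i] = Distance(x, this.trainX[i]);  // Euclidean
    indices[i] = i;
  }
  Array.Sort(dists, indices);  // nearest items first
  double sum = 0.0;  // accumulate y values of k nearest
  for (int k = 0; k < this.nNeighbors; ++k)
    sum += this.trainY[indices[k]];
  return sum / this.nNeighbors;
}

private double Distance(double[] x1, double[] x2)
{
  double sum = 0.0;
  for (int j = 0; j < x1.Length; ++j)
    sum += (x1[j] - x2[j]) * (x1[j] - x2[j]);
  return Math.Sqrt(sum);
}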
Because of the simple way in which k-NN regression works, it is highly interpretable. Any prediction can be clearly and unambiguously explained. In some scenarios, the high interpretability of k-NN regression trumps the better accuracy of difficult-to-interpret techniques such as AdaBoost regression and neural network regression.
The k-NN regression technique is somewhat different from most other regression techniques. The k-NN technique doesn't explicitly use the training data to create a mathematical model (as the linear regression and neural network regression techniques do), after which the training data is no longer needed. Instead, in k-NN regression, the training data is, in effect, the model. To make a prediction for an input vector x, the vector must be compared against all the training data to find the nearest neighbors. Because of this, training data for k-NN regression is sometimes called reference data, and the method to train a k-NN regression model is sometimes named "fit" instead of "train."
The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project KNNRegressionSimple. I checked the "Do not use top-level statements" option to avoid the program entry point shortcut syntax.
The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.
After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive KNNRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program.
The overall program structure is presented in Listing 1. All the control logic is in the Main() method of the KNNRegressionProgram class. That class also holds helper functions to load data from file into memory and to display data. All of the k-NN regression functionality is in a KNNR class.
Listing 1: Overall Program Structure
using System;
using System.IO;
using System.Collections.Generic;

namespace KNNRegressionSimple
{
  internal class KNNRegressionProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin k-NN regression" +
        " simple demo ");

      // 1. load data
      // 2. create and train/fit model
      // 3. evaluate model
      // 4. use model

      Console.WriteLine("End demo ");
      Console.ReadLine();
    } // Main

    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment) { . . }

    static double[] MatToVec(double[][] mat) { . . }

    static void VecShow(double[] vec, int dec, int wid) { . . }

  } // class KNNRegressionProgram

  public class KNNR
  {
    public int nNeighbors;
    public double[][] trainX;
    public double[] trainY;

    public KNNR(int nNeighbors)
    {
      this.nNeighbors = nNeighbors;
    }

    public void Train(double[][] trainX,
      double[] trainY) { . . }

    public double Predict(double[] x) { . . }

    public double Accuracy(double[][] dataX,
      double[] dataY, double pctClose) { . . }

    private double Distance(double[] x1,
      double[] x2) { . . }

    private int[] ArgSort(double[] values) { . . }

  } // class KNNR
} // ns
The demo starts by loading the 200-item training data into memory:
string trainFile =
  "..\\..\\..\\Data\\synthetic_train_200.txt";
int[] colsX = new int[] { 0, 1, 2, 3, 4 };
double[][] trainX =
  MatLoad(trainFile, colsX, ',', "#");
double[] trainY =
  MatToVec(MatLoad(trainFile,
  [5], ',', "#"));
The training x data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
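For reference, a simplified sketch of what a MatLoad()-style loader might look like is shown below. The full version in the download includes error checking that is omitted here:

static double[][] MatLoad(string fn, int[] usecols,
  char sep, string comment)
{
  // read delimited lines, skip comment lines,
  // and extract the requested columns
  List<double[]> rows = new List<double[]>();
  foreach (string line in File.ReadLines(fn))
  {
    if (line.StartsWith(comment)) continue;
    string[] tokens = line.Split(sep);
    double[] row = new double[usecols.Length];
    for (int j = 0; j < usecols.Length; ++j)
      row[j] = double.Parse(tokens[usecols[j]]);
    rows.Add(row);
  }
  return rows.ToArray();
}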
The first three training items are displayed like so:
Console.WriteLine("First three train X: ");
for (int i = 0; i < 3; ++i)
VecShow(trainX[i], 4, 8);
Console.WriteLine("First three train y: ");
for (int i = 0; i < 3; ++i)
Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));
In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory. Next, the 40-item test data is loaded into memory using the same pattern that was used to load the training data:
string testFile =
  "..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX =
  MatLoad(testFile, colsX, ',', "#");
double[] testY =
  MatToVec(MatLoad(testFile,
  [5], ',', "#"));
The k-NN regression model is trained/fitted using these statements:
Console.WriteLine("Creating k-NN model, k=4 ");
KNNR knnr = new KNNR(4);
knnr.Train(trainX, trainY);
As explained above, the Train() method doesn't do anything except store a reference copy of the training X and y data. If you customize the code presented in this demo so that the training data is modified, you might want to edit the Train() method so that it saves a value copy of the data instead of a reference copy.
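A Train() method that stores reference copies needs only two assignment statements. The commented-out lines show one way the value-copy variant might look; this is a hedged illustration rather than the download code:

public void Train(double[][] trainX, double[] trainY)
{
  this.trainX = trainX;  // reference copy
  this.trainY = trainY;
  // value-copy alternative:
  // this.trainY = (double[])trainY.Clone();
  // this.trainX = new double[trainX.Length][];
  // for (int i = 0; i < trainX.Length; ++i)
  //   this.trainX[i] = (double[])trainX[i].Clone();
}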
The model is evaluated using these statements:
double accTrain = knnr.Accuracy(trainX, trainY, 0.15);
Console.WriteLine("Accuracy train (within 0.15): " +
  accTrain.ToString("F4"));

double accTest = knnr.Accuracy(testX, testY, 0.15);
Console.WriteLine("Accuracy test (within 0.15): " +
  accTest.ToString("F4"));
The Accuracy() method scores a prediction as correct if it is within a specified percentage of the true target value. The demo uses 0.15 (15 percent), but a reasonable percentage will vary from problem to problem. An alternative, more granular measure of model accuracy is root mean squared error (the square root of the average of the squared differences between predicted and target y values).
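An Accuracy() method consistent with this scoring scheme might look like the following sketch; the exact details of the download version may differ:

public double Accuracy(double[][] dataX, double[] dataY,
  double pctClose)
{
  int numCorrect = 0;
  for (int i = 0; i < dataX.Length; ++i)
  {
    double predY = this.Predict(dataX[i]);
    // correct if within pctClose of the true target
    if (Math.Abs(predY - dataY[i]) <
      Math.Abs(pctClose * dataY[i]))
      ++numCorrect;
  }
  return (numCorrect * 1.0) / dataX.Length;
}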
The demo concludes by using the k-NN regression model to make a prediction:
Console.WriteLine("Predicting for:" +
" [0.7462, 0.4006, -0.0590, 0.6543, -0.0083]");
double[] x = new double[] { 0.7462, 0.4006,
-0.0590, 0.6543, -0.0083 };
double y = knnr.Predict(x);
Console.WriteLine("y = " + y.ToString("F4").PadLeft(8));
Because k-NN results are highly interpretable, you might want to supplement the Predict() method with an Explain() method that displays the k nearest data items and, optionally, their distances to the input x vector.
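One possible form of such an Explain() method, assuming the same Distance() helper used by Predict(), is:

public void Explain(double[] x)
{
  int n = this.trainX.Length;
  double[] dists = new double[n];
  int[] indices = new int[n];
  for (int i = 0; i < n; ++i)
  {
    dists[i] = Distance(x, this.trainX[i]);
    indices[i] = i;
  }
  Array.Sort(dists, indices);  // nearest items first
  // show index, distance, and y of each nearest item
  for (int k = 0; k < this.nNeighbors; ++k)
  {
    int idx = indices[k];
    Console.WriteLine("[" + idx + "]  dist = " +
      dists[k].ToString("F4") + "  y = " +
      this.trainY[idx].ToString("F4"));
  }
}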
Wrapping Up
The k-NN regression system presented in this article is the simplest possible version. In the early days of machine learning, several variations of k-NN regression were explored, for example, weighting the predicted value by the distances to the nearest neighbors, or using the median of the y values of the closest neighbors. In most cases, the added complexity of these variations does not significantly improve prediction accuracy, and so the variations are not widely used today.
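That said, if you want to experiment, an inverse-distance-weighted prediction (the first variation mentioned above) might look like the following sketch. The PredictWeighted() name and the small epsilon value that guards against division by zero are assumptions, not part of the demo:

public double PredictWeighted(double[] x)
{
  int n = this.trainX.Length;
  double[] dists = new double[n];
  int[] indices = new int[n];
  for (int i = 0; i < n; ++i)
  {
    dists[i] = Distance(x, this.trainX[i]);
    indices[i] = i;
  }
  Array.Sort(dists, indices);  // nearest items first
  double num = 0.0;
  double denom = 0.0;
  for (int k = 0; k < this.nNeighbors; ++k)
  {
    double wt = 1.0 / (dists[k] + 1.0e-6);  // closer = heavier
    num += wt * this.trainY[indices[k]];
    denom += wt;
  }
  return num / denom;
}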
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].