The Data Science Lab
Implementing k-NN Classification Using C#
Dr. James McCaffrey of Microsoft Research presents a full demo of k-nearest neighbors classification on mixed numeric and categorical data. Compared to other classification techniques, k-NN is easy to implement, supports numeric and categorical predictor variables, and is highly interpretable.
Multi-class classification is a machine learning technique to predict the value of a variable that can be one of three or more possible values. For example, you might want to predict the political leaning of a person (conservative, moderate, liberal) based on their sex (male, female), height (short, medium, tall), age, state of residence, and annual income. Binary classification is a special case when there are just two possible values, for example predicting the sex of a person.
The k-nearest neighbors (k-NN or kNN) classification technique can perform both multi-class and binary classification. This article presents a complete demo program using k-nearest neighbors classification. Other common classification techniques include standard logistic regression (binary classification), multi-class logistic regression, several decision tree-based algorithms such as AdaBoost, neural network classification, and naive Bayes classification (categorical predictor variables).
Compared to other classification techniques, k-NN is easy to implement, works with both numeric and categorical predictor variables, and is highly interpretable. The main disadvantage of k-NN classification is that it is very sensitive to the training data used, and is therefore prone to model overfitting where the model predicts well on the training data but poorly on new, previously unseen data.
The k-nearest neighbors classification technique is very simple. To classify a new input vector of predictor values, you determine the k-nearest training data items, where k is typically 3, 5, or 7, and nearest is usually determined by Euclidean distance. The prediction is the class that is most frequent among the nearest items. For example, if k = 5 and there are three classes (0, 1, 2), and the class labels of the five nearest neighbors are (1, 0, 1, 2, 1), then the predicted class is 1.
A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading a 200-item set of training data and a 40-item set of test data into memory. The first four training predictor values and the corresponding first four target political leaning values are:
First four training data:
0.5000 0.2500 0.2400 0.2500 0.0000 0.0000 0.0000 0.2950
0.0000 0.7500 0.3900 0.0000 0.0000 0.2500 0.0000 0.5120
0.5000 0.2500 0.6300 0.0000 0.2500 0.0000 0.0000 0.7580
0.0000 0.5000 0.3600 0.0000 0.0000 0.0000 0.2500 0.4450
First four target labels:
2 1 0 1
The predictor values represent sex, height, age, state of residence, and income, as you'll see shortly. The demo examines five different values of k to determine the best one:
k = 2
Accuracy train = 0.8250
Accuracy test = 0.7000
k = 3
Accuracy train = 0.8250
Accuracy test = 0.6750
k = 4
Accuracy train = 0.7600
Accuracy test = 0.6750
k = 5
Accuracy train = 0.7750
Accuracy test = 0.7000
k = 6
Accuracy train = 0.7450
Accuracy test = 0.6250
The k = 5 parameter appears to give a good balance of accuracy on the training and test data (although k = 2 also looks like a strong candidate). The demo sets k = 5 and computes a confusion matrix for the 40-item test data to give more granular accuracy metrics:
Using k = 5
Confusion matrix test data:
actual 0 | 6 4 1 | 0.5455
actual 1 | 2 12 0 | 0.8571
actual 2 | 2 3 10 | 0.6667
----------
predicted: 0 1 2
Class 1 (political moderate) people are predicted nicely at 85.71% accuracy (12 out of 14 correct), but class 0 (conservative) people are not predicted as well (54.55% accuracy, 6 out of 11 correct).
The demo concludes by predicting the political leaning of a new, previously unseen person who is male, tall, age 33, from Colorado, and makes $65,000 annually:
Predicting for M tall 33 Colorado $65,000
Predicted pseudo-probs:
0.0000 0.8000 0.2000
Predicted class = 1
(0 = conservative, 1 = moderate, 2 = liberal)
This article assumes you have intermediate or better programming skill but doesn't assume you know anything about k-nearest neighbors classification. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is a bit too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and are also available online.
The Demo Data
The 240-item synthetic raw demo data looks like:
F, short, 24, arkansas, 29500, liberal
M, tall, 39, delaware, 51200, moderate
F, short, 63, colorado, 75800, conservative
M, medium, 36, illinois, 44500, moderate
F, short, 27, colorado, 28600, liberal
. . .
The fields are sex, height (short, medium, tall), age, state of residence (Arkansas, Colorado, Delaware, Illinois), annual income, and political leaning (conservative, moderate, liberal).
Because k-NN classification computes the distance between data items using Euclidean distance, the predictor data must be encoded and normalized. When using k-NN, the target variable to predict, political leaning in this example, should be ordinal encoded. The resulting encoded and normalized data is:
0.5, 0.25, 0.24, 0.25, 0.00, 0.00, 0.00, 0.2950, 2
0.0, 0.75, 0.39, 0.00, 0.00, 0.25, 0.00, 0.5120, 1
0.5, 0.25, 0.63, 0.00, 0.25, 0.00, 0.00, 0.7580, 0
0.0, 0.50, 0.36, 0.00, 0.00, 0.00, 0.25, 0.4450, 1
0.5, 0.25, 0.27, 0.00, 0.25, 0.00, 0.00, 0.2860, 2
. . .
Sex is encoded as male = 0.0 and female = 0.5. Height is equal-interval encoded as short = 0.25, medium = 0.50, tall = 0.75. Age is normalized by dividing by 100. State is one-over-n-hot encoded as Arkansas = (0.25, 0, 0, 0), Colorado = (0, 0.25, 0, 0), Delaware = (0, 0, 0.25, 0), Illinois = (0, 0, 0, 0.25). Income is normalized by dividing by 100,000. The target political leaning variable is ordinal encoded as conservative = 0, moderate = 1, liberal = 2.
The idea is to encode and normalize the predictor variables so that all values have roughly the same magnitude. This prevents variables with large values, such as income, from overwhelming variables with small values when computing distance. My experience with k-NN classification suggests that binary predictor variables, like sex, are usually best encoded as 0.0 and 0.5 rather than the seemingly more logical 0.0 and 1.0. A third option for binary variables is equal-interval encoding, with values of 0.3333 and 0.6667. In most cases, 0.0 and 0.5 encoding is simple and works well.
For categorical variables that have some natural order, such as height, equal-interval encoding works well. If short = 0.25, medium = 0.50, tall = 0.75, the notion that tall is-greater-than short makes sense. If there are n possible values, equal-interval encoding assigns values 1/(n+1), 2/(n+1), and so on.
For numeric variables, such as age and income, the simplest normalization is divide-by-constant. The constant should be chosen so that, as much as possible, the normalized values are all between 0.0 and 1.0, or if the raw values contain negative values, between -1.0 and +1.0. Alternatives to divide-by-constant normalization are min-max normalization and z-score normalization, but these alternatives have some drawbacks when using k-NN classification.
For categorical variables that don't have a natural order, such as state of residence, I recommend using one-over-n-hot encoding. Standard one-hot encoding for the demo state of residence variable that has four possible values would be Arkansas = (1, 0, 0, 0), Colorado = (0, 1, 0, 0), Delaware = (0, 0, 1, 0), Illinois = (0, 0, 0, 1). But one-hot encoding doesn't take into account the number of possible values. Using one-hot encoding, the Euclidean distance component between any two different state values, for example, Arkansas and Colorado, is sqrt((1-0)^2 + (0-1)^2 + (0-0)^2 + (0-0)^2) = sqrt(2) = 1.4. Even if there were 50 possible state values, the Euclidean distance component between two different states is still sqrt(2) = 1.4.
To take the number of possible values into account, you can use one-over-n-hot encoding where the 1 in a one-hot encoding is replaced by 1/n, where n is the number of possible values. For the demo state of residence with four possible values, the Euclidean distance component between two different one-over-n-hot encoded states is sqrt((0.25-0)^2 + (0-0.25)^2 + 0 + 0) = sqrt(0.0625 + 0.0625) = 0.35. If there were five possible states, the distance component would be sqrt((0.20-0)^2 + (0-0.20)^2 + 0 + 0 + 0) = sqrt(0.08) = 0.28. The intuition is that for a variable that has few possible values, a mismatch provides more information than a mismatch in a variable that has many possible values.
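To make the encoding scheme concrete, here is a minimal sketch of a hypothetical EncodePerson() helper (it is not part of the demo program) that converts one raw data item into the eight normalized predictor values described above, assuming the state value is one of the four states in the demo data:

static double[] EncodePerson(string sex, string height,
  int age, string state, double income)
{
  double[] result = new double[8];
  result[0] = (sex == "M") ? 0.0 : 0.5;    // sex: 0.0 or 0.5
  result[1] = (height == "short") ? 0.25 :
    (height == "medium") ? 0.50 : 0.75;    // equal-interval
  result[2] = age / 100.0;                 // divide-by-constant
  string[] states = { "arkansas", "colorado",
    "delaware", "illinois" };
  int idx = Array.IndexOf(states, state.ToLower());
  result[3 + idx] = 1.0 / states.Length;   // one-over-n-hot = 0.25
  result[7] = income / 100000.0;           // divide-by-constant
  return result;
}

For example, EncodePerson("M", "tall", 33, "colorado", 65000.0) produces (0.0, 0.75, 0.33, 0.00, 0.25, 0.00, 0.00, 0.65), which is the encoded form of the item that the demo predicts at the end of its run.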
The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project KNNpolitics. I checked the "Do not use top-level statements" option to avoid the program entry point shortcut syntax.
The demo has no significant .NET dependencies, and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.
After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive KNNProgram.cs. I allowed Visual Studio to automatically rename class Program.
The overall program structure is presented in Listing 1. All the control logic is in the Main() method. All of the k-nearest neighbors classification functionality is in a KNN class. A Utils class holds helper functions for Main() to load data from file to memory, and functions to display vectors and matrices.
Listing 1: Overall Program Structure
using System;
using System.IO;
using System.Collections.Generic;

namespace KNNpolitics
{
  internal class KNNProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin multi-class" +
        " classification predict politics using kNN ");

      // 1. load data into memory
      // 2. create model
      // 3. try different k values
      // 4. set k and use model

      Console.WriteLine("End demo ");
      Console.ReadLine();
    } // Main()
  } // class KNNProgram

  // --------------------------------------------------------

  public class KNN
  {
    public int C; // number classes
    public int k; // number neighbors
    public double[][] trainX;
    public int[] trainY;

    public KNN(int C, int k) { . . }
    public void Train(double[][] trainX, int[] trainY) { . . }
    public double[] PredictProbs(double[] x) { . . }
    public int Predict(double[] x) { . . }
    private static double EucDist(double[] v1, double[] v2) { . . }
    public double Accuracy(double[][] dataX, int[] dataY) { . . }
    public int[][] ConfusionMatrix(double[][] dataX, int[] dataY) { . . }
    public void ShowConfusion(int[][] cm) { . . }
  }

  // --------------------------------------------------------

  public class Utils
  {
    public static double[][] MatLoad(string fn,
      int[] usecols, char sep, string comment) { . . }
    public static int[] MatToIntVec(double[][] m) { . . }
    public static void MatShow(double[][] m,
      int dec, int wid) { . . }
    public static void VecShow(double[] vec, int dec,
      int wid) { . . }
    public static void VecShow(int[] vec, int wid) { . . }
  }
} // ns
The demo starts by loading the 200-item training data into memory:
string trainFile = "..\\..\\..\\Data\\people_train.txt";
double[][] trainX = Utils.MatLoad(trainFile,
  new int[] { 0, 1, 2, 3, 4, 5, 6, 7 }, ',', "#");
int[] trainY =
  Utils.MatToIntVec(Utils.MatLoad(trainFile,
  new int[] { 8 }, ',', "#"));
The demo assumes that the data is stored in separate training and test files in a directory named Data, which is located in the project root directory. The training data predictor values in columns [0] through [7] are read into memory as an array-of-arrays style matrix using the program-defined MatLoad() helper function. The arguments to MatLoad() are the file to read, the 0-based indices of the columns to load, the delimiter character, and a string that indicates a comment line. The training data target class labels in column [8] are read into memory as a matrix and then converted to an integer vector using the MatToIntVec() helper function.
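The MatToIntVec() implementation isn't shown in this article. A minimal sketch of what it might look like (the actual version is in the accompanying download) is:

public static int[] MatToIntVec(double[][] m)
{
  // convert a single-column matrix of 0.0, 1.0, 2.0 values to int labels
  int[] result = new int[m.Length];
  for (int i = 0; i < m.Length; ++i)
    result[i] = (int)m[i][0];
  return result;
}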
The 40-item test data is loaded in the same way:
string testFile = "..\\..\\..\\Data\\people_test.txt";
double[][] testX = Utils.MatLoad(testFile,
  new int[] { 0, 1, 2, 3, 4, 5, 6, 7 }, ',', "#");
int[] testY =
  Utils.MatToIntVec(Utils.MatLoad(testFile,
  new int[] { 8 }, ',', "#"));
The demo displays the first four input x vectors and the associated target y values as a sanity check:
Console.WriteLine("First four training data: ");
for (int i = 0; i < 4; ++i)
Utils.VecShow(trainX[i], 4, 8);
Console.WriteLine("First four target labels: ");
for (int i = 0; i < 4; ++i)
Console.Write(trainY[i] + " ");
Console.WriteLine("");
In a non-demo scenario, you might want to display all the data to make sure it has been correctly loaded into memory.
Creating and Training the k-NN Model
The k-nearest neighbors classification model is created like so:
int C = 3; // number classes
KNN knn;
for (int k = 2; k < 7; ++k) {
  Console.WriteLine("k = " + k);
  knn = new KNN(C, k);
  knn.Train(trainX, trainY);
  double accTrain = knn.Accuracy(trainX, trainY);
  Console.WriteLine("Accuracy train = " +
    accTrain.ToString("F4"));
  double accTest = knn.Accuracy(testX, testY);
  Console.WriteLine("Accuracy test = " +
    accTest.ToString("F4"));
}
The KNN object needs to know the number of classes and the number of nearest neighbors to consider. The number of classes is passed in as parameter C. You could infer the number of classes by scanning through the trainY vector to find the largest label value (the number of classes is that value plus 1), but in my opinion that's not as good a design because it obscures the value of C.
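If you did want to infer the number of classes from the training labels, a short scan (not part of the demo code) would do it:

int inferredC = 0;
for (int i = 0; i < trainY.Length; ++i)
  if (trainY[i] > inferredC)
    inferredC = trainY[i];
++inferredC; // labels are 0-based, so number of classes = largest label + 1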
The loop instantiates a new KNN model for values of k = 2 through 6. In a non-demo scenario, you might want to try more values of k, say, up to 11. The value of k is the one true parameter for k-NN classification and it must be determined by trial and error. In practice, it's sometimes best to use an odd number for k, often 3 or 5 or 7, to reduce the likelihood of a tie for the number of most frequent class label values among the nearest neighbor data items. Many machine learning code libraries, such as the scikit-learn KNN module, use a default value of 5 for k.
The KNN.Train() method is different from the train methods of most machine learning classification algorithms. Train() simply stores the training data in memory; it doesn't perform any computation -- all of the work is done in the Predict() method. As a comparison, when training a logistic regression classification model, constants called the model weights and bias are computed from the training data during training, and then the training data is no longer needed to make a prediction. The k-NN classifier needs all of the training data in memory in order to find the k-nearest neighbors of an input vector.
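The bodies of Train() and the private EucDist() helper aren't shown in Listing 1. A plausible minimal implementation (the exact code is in the accompanying download) looks like:

public void Train(double[][] trainX, int[] trainY)
{
  // no computation -- just store references to the training data
  this.trainX = trainX;
  this.trainY = trainY;
}

private static double EucDist(double[] v1, double[] v2)
{
  // Euclidean distance between two encoded and normalized vectors
  double sum = 0.0;
  for (int i = 0; i < v1.Length; ++i)
    sum += (v1[i] - v2[i]) * (v1[i] - v2[i]);
  return Math.Sqrt(sum);
}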
The accuracy values are just the number of correct predictions divided by the total number of predictions. Evaluating the best value of k to use is subjective. The k-NN algorithm is sensitive to the local geometry of the training data and so k-NN can overfit. A good value of k is not necessarily the value that gives the best prediction accuracy on either the training or the test data.
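The Accuracy() method body isn't shown in Listing 1 either. A simple sketch that counts correct predictions is:

public double Accuracy(double[][] dataX, int[] dataY)
{
  int numCorrect = 0;
  for (int i = 0; i < dataX.Length; ++i)
    if (this.Predict(dataX[i]) == dataY[i])
      ++numCorrect;
  return (numCorrect * 1.0) / dataX.Length;
}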
Evaluating and Using the Trained Model
For most multi-class and binary classification models, it's standard practice to compute separate accuracy metrics for each class in the form of a confusion matrix:
knn = new KNN(C, 5);
knn.Train(trainX, trainY);
Console.WriteLine("Confusion matrix test data: ");
int[][] cm = knn.ConfusionMatrix(testX, testY);
knn.ShowConfusion(cm);
The demo uses two different functions, one to compute the confusion matrix and one to display it, because exactly how you want to display confusion data can vary from problem to problem.
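A minimal sketch of what the ConfusionMatrix() method could look like, where row indices are actual classes and column indices are predicted classes, is:

public int[][] ConfusionMatrix(double[][] dataX, int[] dataY)
{
  int[][] result = new int[this.C][];
  for (int i = 0; i < this.C; ++i)
    result[i] = new int[this.C]; // rows = actual, cols = predicted
  for (int i = 0; i < dataX.Length; ++i)
  {
    int predY = this.Predict(dataX[i]);
    ++result[dataY[i]][predY]; // tally the (actual, predicted) pair
  }
  return result;
}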
The demo concludes by using the k-NN model to make a prediction for a new, previously unseen data item:
Console.WriteLine("Predicting for M tall 33 Colorado $65,000 ");
double[] x = new double[] { 0.0, 0.75, 0.33, 0.00, 0.25, 0.00, 0.00, 0.6500 };
double[] pseudoProbs = knn.PredictProbs(x);
Console.WriteLine("Predicted pseudo-probs: ");
Utils.VecShow(pseudoProbs, 4, 8);
int predY = knn.Predict(x);
Console.WriteLine("Predicted class = " + predY);
Console.WriteLine("(0 = conservative, 1 = moderate, 2 = liberal) ");
The PredictProbs() method returns the frequencies of the class labels of the nearest neighbors. Because the pseudo-probabilities are (0.00, 0.80, 0.20) and k is 5, this means that none of the five nearest neighbors has class 0 (conservative), four of the five nearest neighbors have class 1 (moderate), and one of the five nearest neighbors has class 2 (liberal).
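The PredictProbs() and Predict() method bodies aren't shown in Listing 1. One plausible way to implement them -- compute all distances, sort the training items by distance, tally the labels of the k closest, and pick the most frequent label -- is:

public double[] PredictProbs(double[] x)
{
  int n = this.trainX.Length;
  double[] dists = new double[n];
  int[] indices = new int[n];
  for (int i = 0; i < n; ++i)
  {
    dists[i] = EucDist(x, this.trainX[i]);
    indices[i] = i;
  }
  Array.Sort(dists, indices); // parallel sort: nearest items first
  int[] counts = new int[this.C];
  for (int i = 0; i < this.k; ++i)
    ++counts[this.trainY[indices[i]]]; // tally labels of the k nearest
  double[] result = new double[this.C];
  for (int c = 0; c < this.C; ++c)
    result[c] = (counts[c] * 1.0) / this.k;
  return result;
}

public int Predict(double[] x)
{
  double[] probs = this.PredictProbs(x);
  int maxIdx = 0;
  for (int c = 1; c < this.C; ++c)
    if (probs[c] > probs[maxIdx])
      maxIdx = c; // ties resolve to the lower-numbered class
  return maxIdx;
}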
The Predict() method is simple in the sense that in case of a tie among the counts of the class labels associated with the k-nearest training data items, the method returns the lowest-numbered class. For example, if k is set to 7 and there are C = 4 classes, and the labels of the closest items to the input to predict are (3, 1, 0, 1, 2, 0, 3), there is a tie between classes 0, 1, and 3, and Predict() returns class 0. In practice, this usually isn't a problem, but for special scenarios you can easily modify the Predict() method source code to print an informational/warning message along the lines of:
int numMaxCounts = 0;
for (int i = 0; i < this.C; ++i)
  if (counts[i] == counts[maxIdx]) // how many classes share the max count?
    ++numMaxCounts;
if (numMaxCounts > 1)
  Console.WriteLine("Note: tie vote in Predict()");
Unlike many classification techniques, it usually doesn't make sense to implement a method to save the trained k-NN model. In essence, the training data is the trained model -- there are no model weight or other internal parameter values to save.
Wrapping Up
The k-nearest neighbors classification technique is one of the simplest possible machine learning classification algorithms. In general, k-NN classification is less accurate than other algorithms because it only looks at nearest neighbors and therefore doesn't directly use all of the available information in the training data. Because of this, k-NN classification is often best used as part of an ensemble approach to validate other classification algorithms. For example, you could predict using both a logistic regression model and a k-NN model. If the k-NN prediction is different from the logistic regression prediction, you might want to investigate to try to determine why the predictions differ.
The k-NN classification technique is highly interpretable. When making a prediction, you can display the nearest neighbors to see exactly why a prediction was made. In some prediction scenarios, for example those that require explicit interpretability for legal reasons, the interpretability of k-NN classification trumps the better accuracy of other algorithms.