### How to Do Machine Learning Perceptron Classification Using C#

Dr. James McCaffrey of Microsoft Research uses code samples and screen shots to explain perceptron classification, a machine learning technique that can be used for predicting if a person is male or female based on numeric predictors such as age, height, weight, and so on. It's mostly useful to provide a baseline result for comparison with more powerful ML techniques such as logistic regression and k-nearest neighbors.

Perceptron classification is arguably the most rudimentary machine learning (ML) technique. The perceptron technique can be used for binary classification, for example predicting if a person is male or female based on numeric predictors such as age, height, weight, and so on. From a practical point of view, perceptron classification is useful mostly to provide a baseline result for comparison with more powerful ML techniques such as logistic regression and k-nearest neighbors.

From a conceptual point of view, understanding how perceptron classification works is often considered fundamental knowledge for ML engineers, is interesting historically, and contains important techniques used by logistic regression and neural network classification. In fact, the simplest type of neural network is often called a multi-layer perceptron.

Additionally, understanding exactly how perceptron classification works by coding a system from scratch allows you to understand the system's strengths and weaknesses in case you encounter the technique in an ML code library. For example, the ML.NET library has a perceptron classifier, but the library documentation doesn't fully explain how the technique works or when to use it.

A good way to get a feel for what perceptron classification is and to see where this article is headed is to
take a look at the screenshot of a demo program in **Figure 1**. The goal of the demo is to create a model
that predicts if a banknote (think dollar bill or euro) is authentic or a forgery.

The demo program sets up a tiny set of 10 items to train the model. Each data item has four predictor variables (often called features in ML terminology) that are characteristics of a digital image of each banknote: variance, skewness, kurtosis, and entropy. Each data item is labeled as -1 (authentic) or +1 (forgery).

Behind the scenes, the demo program uses the 10-item training dataset to create a perceptron prediction model. The final model scores 0.6000 accuracy on the training data (6 correct predictions, 4 wrong). The demo concludes by using the perceptron model to predict the authenticity of a new, previously unseen banknote with predictor values (0.00, 2.00, -1.00, 1.00). The computed output is -1 (authentic).

This article assumes you have intermediate or better skill with C# but doesn’t assume you know anything about perceptron classification. The complete code for the demo program shown is presented in this article. The code is also available in the file download that accompanies this article.

**Understanding the Data**

The demo program uses a tiny 10-item subset of a well-known benchmark collection of data called the Banknote
Authentication Dataset. The full dataset has 1,372 items, with 762 authentic and 610 forgery items. You can find
the complete dataset in many places on the Internet.

Most versions of the dataset encode authentic as 0 and forgery as 1. For perceptron classification, it's much more convenient to encode the two possible class labels to predict as -1 and +1 instead of 0 and 1. Which class is encoded as -1 and which class is encoded as +1 is arbitrary but it's up to you to keep track of what each value means.
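As a small illustration of the re-encoding described above, the following sketch maps 0/1 labels to -1/+1. The class and method names (LabelEncoding, ToPlusMinusOne) are made up for this example and are not part of the demo program.

```csharp
using System;

public static class LabelEncoding
{
    // Map 0/1 labels to -1/+1: 0 (authentic) -> -1, 1 (forgery) -> +1.
    // Which class gets -1 is arbitrary; this follows the demo's convention.
    public static int[] ToPlusMinusOne(int[] zeroOne)
    {
        int[] result = new int[zeroOne.Length];
        for (int i = 0; i < zeroOne.Length; ++i)
            result[i] = (zeroOne[i] == 0) ? -1 : +1;
        return result;
    }

    public static void Main()
    {
        int[] raw = new int[] { 0, 0, 1, 1, 0 };
        int[] y = ToPlusMinusOne(raw);
        Console.WriteLine(string.Join(" ", y));  // -1 -1 1 1 -1
    }
}
```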

Because the data has four dimensions, it's not possible to display the data in a two-dimensional graph. However,
you can get an idea of what the data is like by taking a look at a graph of partial data shown in **Figure
2**.

The graph plots just the skewness and entropy of the 10 items. The key point is that perceptron classifiers only
work well with data that is linearly separable. For data that is linearly separable, it's possible to draw a line
(or hyperplane for three or more dimensions) that separates the data so that all of one class is on one side of
the line and all of the other class is on the other side. You can see in **Figure 2** that no line will perfectly
separate the two classes. In general, you won't know in advance if your data is linearly separable or not.

**Understanding How Perceptron Classification Works**

Perceptron classification is very simple. For a dataset with n predictor variables, there will be n weights plus one
special weight called a bias. The weights and bias are just numeric constants with values like -1.2345 and
0.9876. To make a prediction, you sum the products of each predictor value and its associated weight and then
add the bias. If the sum is negative the prediction is class -1 and if the sum is positive the prediction is
class +1.

For example, suppose you have a dataset with three predictor variables and suppose that the three associated weight values are (0.20, -0.50, 0.40) and the bias value is 1.10. If the item to predict has values (-7.0, 3.0, 9.0) then the computed output is (0.20 * -7.0) + (-0.50 * 3.0) + (0.40 * 9.0) + 1.10 = -1.4 + (-1.5) + 3.6 + 1.1 = +1.8 and therefore the predicted class is +1.
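The arithmetic in this example can be checked with a few lines of code. This is only a sketch: the WorkedExample class and WeightedSum method are illustrative names, and the bias is passed as a separate argument here rather than stored with the weights.

```csharp
using System;
using System.Globalization;

public static class WorkedExample
{
    // Sum of (input * weight) products, plus the bias.
    public static double WeightedSum(double[] x, double[] w, double bias)
    {
        double z = 0.0;
        for (int i = 0; i < x.Length; ++i)
            z += x[i] * w[i];
        return z + bias;
    }

    public static void Main()
    {
        double[] w = { 0.20, -0.50, 0.40 };
        double[] x = { -7.0, 3.0, 9.0 };
        double z = WeightedSum(x, w, 1.10);
        int predicted = (z < 0.0) ? -1 : +1;
        Console.WriteLine("sum = " +
            z.ToString("F2", CultureInfo.InvariantCulture));  // sum = 1.80
        Console.WriteLine("class = " + predicted);            // class = 1
    }
}
```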

Of course the tricky part is determining the weights and bias values of a perceptron classifier. This is called training the model. Briefly, training is an iterative process that tries different values for the model's weights and the bias until the computed outputs closely match the known correct class values in the training data.

Because of the way perceptron classification output is computed, it's usually a good idea to normalize the training data so that small predictor values (such as a GPA of 3.15) aren't overwhelmed by large predictor values (such as an annual income of 65,000.00). The demo program doesn't use normalized data because all the predictor values are roughly in the same range (about -15.0 to +15.0). The three most common normalization techniques are min-max normalization, z-score normalization, and order of magnitude normalization.
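For readers who haven't seen the first two techniques, here is a minimal sketch of min-max and z-score normalization applied to a single column of values. The names are hypothetical, the z-score version assumes the population standard deviation (dividing by n), and neither function guards against a column whose values are all identical.

```csharp
using System;
using System.Globalization;

public static class NormalizeSketch
{
    // Min-max: rescale so the smallest value maps to 0 and the largest to 1.
    public static double[] MinMax(double[] col)
    {
        double min = col[0], max = col[0];
        foreach (double v in col)
        {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double[] result = new double[col.Length];
        for (int i = 0; i < col.Length; ++i)
            result[i] = (col[i] - min) / (max - min);
        return result;
    }

    // Z-score: subtract the mean, divide by the (population) std deviation.
    public static double[] ZScore(double[] col)
    {
        double mean = 0.0;
        foreach (double v in col) mean += v;
        mean /= col.Length;

        double ss = 0.0;
        foreach (double v in col) ss += (v - mean) * (v - mean);
        double sd = Math.Sqrt(ss / col.Length);

        double[] result = new double[col.Length];
        for (int i = 0; i < col.Length; ++i)
            result[i] = (col[i] - mean) / sd;
        return result;
    }

    public static void Main()
    {
        double[] income = { 30000.0, 65000.0, 100000.0 };
        double[] mm = MinMax(income);
        double[] zs = ZScore(income);
        for (int i = 0; i < income.Length; ++i)
            Console.WriteLine(
                mm[i].ToString("F4", CultureInfo.InvariantCulture) + "  " +
                zs[i].ToString("F4", CultureInfo.InvariantCulture));
    }
}
```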

**The Demo Program**

To create the demo program, I launched Visual Studio 2019. I used the Community (free) edition but any
relatively recent version of Visual Studio will work fine. From the main Visual Studio start window I selected
the "Create a new project" option. Next, I selected C# from the Language dropdown control and Console from the
Project Type dropdown, and then picked the "Console App (.NET Core)" item.

The code presented in this article will run as a .NET Core console application or as a .NET Framework application. Many of the newer Microsoft technologies, such as the ML.NET code library, specifically target .NET Core so it makes sense to develop most new C# machine learning code in that environment.

I entered "Perceptron" as the Project Name, specified C:\VSM on my local machine as the Location (you can use any convenient directory), and checked the "Place solution and project in the same directory" box.

After the template code loaded into Visual Studio, at the top of the editor window I removed all using statements to unneeded namespaces, leaving just the reference to the top-level System namespace. The demo needs no other assemblies and uses no external code libraries.

In the Solution Explorer window, I renamed file Program.cs to the more descriptive PerceptronProgram.cs and then
in the editor window I renamed class Program to class PerceptronProgram to match the file name. The structure of
the demo program, with a few minor edits to save space, is shown in **Listing 1**.

**Listing 1. Perceptron Classification Demo Program Structure**

    using System;
    namespace Perceptron
    {
      class PerceptronProgram
      {
        static void Main(string[] args)
        {
          Console.WriteLine("Begin perceptron demo");
          Console.WriteLine("Authentic (-1) fake (+1)");
          Console.WriteLine("Data looks like: ");
          Console.WriteLine(" 3.6216, 8.6661," +
            " -2.8073, -0.44699, -1");
          Console.WriteLine("-2.0631, -1.5147," +
            " 1.219, 0.44524, +1");

          Console.WriteLine("Loading data");
          double[][] xTrain = new double[10][];
          xTrain[0] = new double[] { 3.6216, 8.6661,
            -2.8073, -0.44699 }; // auth
          xTrain[1] = new double[] { 4.5459, 8.1674,
            -2.4586, -1.4621 };
          xTrain[2] = new double[] { 3.866, -2.6383,
            1.9242, 0.10645 };
          xTrain[3] = new double[] { 2.0922, -6.81,
            8.4636, -0.60216 };
          xTrain[4] = new double[] { 4.3684, 9.6718,
            -3.9606, -3.1625 };

          xTrain[5] = new double[] { -2.0631, -1.5147,
            1.219, 0.44524 }; // forgeries
          xTrain[6] = new double[] { -4.4779, 7.3708,
            -0.31218, -6.7754 };
          xTrain[7] = new double[] { -3.8483, -12.8047,
            15.6824, -1.281 };
          xTrain[8] = new double[] { -2.2804, -0.30626,
            1.3347, 1.3763 };
          xTrain[9] = new double[] { -1.7582, 2.7397,
            -2.5323, -2.234 };

          int[] yTrain = new int[] { -1, -1, -1, -1, -1,
            1, 1, 1, 1, 1 }; // -1 = auth, 1 = forgery

          int maxIter = 100;
          double lr = 0.01;
          Console.WriteLine("Starting training");
          double[] wts =
            Train(xTrain, yTrain, lr, maxIter, 0);
          Console.WriteLine("Training complete");

          double acc = Accuracy(xTrain, yTrain, wts);
          Console.WriteLine("Accuracy = ");
          Console.WriteLine(acc.ToString("F4"));

          Console.WriteLine("Weights and bias: ");
          for (int i = 0; i < wts.Length; ++i)
            Console.Write(wts[i].ToString("F4").PadLeft(8));
          Console.WriteLine("");

          Console.WriteLine("Note (0.00 2.00 -1.00 1.00)");
          double[] unknown = new double[] { 0.00, 2.00,
            -1.00, 1.00 };
          double z = ComputeOutput(unknown, wts);
          Console.WriteLine("Computed output = ");
          Console.WriteLine(z); // -1 or +1

          Console.WriteLine("End perceptron demo ");
          Console.ReadLine();
        } // Main

        static int ComputeOutput(double[] x,
          double[] wts) { . . }

        static double[] Train(double[][] xData,
          int[] yData, double lr, int maxEpochs,
          int seed) { . . }

        static void Shuffle(int[] indices,
          Random rnd) { . . }

        static double Accuracy(double[][] xData,
          int[] yData, double[] wts) { . . }

      } // Program class
    } // ns

All of the program control logic is contained in the Main method, which calls the helper functions Train(), Accuracy(), and ComputeOutput(). The demo uses a static method approach rather than an OOP approach for simplicity. All normal error checking has been removed to keep the main ideas as clear as possible.

The demo begins by setting up the training data:

    double[][] xTrain = new double[10][];
    xTrain[0] = new double[] { 3.6216, 8.6661,
      -2.8073, -0.44699 };
    . . .
    int[] yTrain = new int[] { -1, -1, -1, -1, -1,
      1, 1, 1, 1, 1 };

The predictor values are hard-coded and stored into an array-of-arrays style matrix. The class labels are stored in a single integer array. In a non-demo scenario you'd likely want to store your training data as a text file:

    3.6216, 8.6661, -2.8073, -0.44699, -1
    4.5459, 8.1674, -2.4586, -1.4621, -1
    . . .
    -1.7582, 2.7397, -2.5323, -2.234, 1

And then you'd read the training data into memory using helper functions along the lines of:

    double[][] xTrain = MatLoad("..\\data.txt",
      new int[] { 0, 1, 2, 3 }, ",");
    int[] yTrain = VecLoad("..\\data.txt", 4, ",");
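The article doesn't show the bodies of these helpers, so the following is only one plausible sketch of how MatLoad() and VecLoad() might be written. It assumes one item per line, no header row, and the parameter order used in the calls above; the LoadSketch class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class LoadSketch
{
    // Read the specified columns of a delimited text file into a matrix.
    public static double[][] MatLoad(string fn, int[] cols, string sep)
    {
        List<double[]> rows = new List<double[]>();
        foreach (string line in File.ReadLines(fn))
        {
            if (line.Trim().Length == 0) continue;  // skip blank lines
            string[] tokens = line.Split(new string[] { sep },
                StringSplitOptions.None);
            double[] row = new double[cols.Length];
            for (int j = 0; j < cols.Length; ++j)
                row[j] = double.Parse(tokens[cols[j]].Trim(),
                    CultureInfo.InvariantCulture);
            rows.Add(row);
        }
        return rows.ToArray();
    }

    // Read a single integer column (the class labels) into a vector.
    public static int[] VecLoad(string fn, int col, string sep)
    {
        List<int> vals = new List<int>();
        foreach (string line in File.ReadLines(fn))
        {
            if (line.Trim().Length == 0) continue;
            string[] tokens = line.Split(new string[] { sep },
                StringSplitOptions.None);
            vals.Add(int.Parse(tokens[col].Trim(),
                CultureInfo.InvariantCulture));
        }
        return vals.ToArray();
    }

    public static void Main()
    {
        // Write a two-line temporary file so the sketch runs on its own.
        string fn = Path.GetTempFileName();
        File.WriteAllLines(fn, new string[] {
            "3.6216, 8.6661, -2.8073, -0.44699, -1",
            "-1.7582, 2.7397, -2.5323, -2.234, 1" });

        double[][] x = MatLoad(fn, new int[] { 0, 1, 2, 3 }, ",");
        int[] y = VecLoad(fn, 4, ",");
        Console.WriteLine(x.Length + " items, " +
            x[0].Length + " predictors, first label " + y[0]);
        File.Delete(fn);
    }
}
```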

In many scenarios you'd want to set aside some of your source data as a test dataset. After training you'd compute the prediction accuracy of the model on the held-out dataset. This accuracy metric would be a rough estimate of the accuracy you could expect on new, previously unseen data.

After setting up the training data, the demo program trains the model using these statements:

    int maxIter = 100;
    double lr = 0.01;
    Console.WriteLine("Starting training");
    double[] wts =
      Train(xTrain, yTrain, lr, maxIter, 0);

The maxIter variable holds the number of training iterations to perform and the lr variable holds the learning rate. Both of these values are hyperparameters that must be determined using trial and error. The learning rate influences how much the weights and bias change on each training iteration.

The 0 argument passed to the Train() function is the seed value for a Random object that is used to scramble the order in which training items are processed. The Train() function returns an array that holds the weights and the bias, which essentially defines the perceptron classification model.

After training, the demo program computes the model's accuracy on the training data, and then displays the values of the weights and bias:

    double acc = Accuracy(xTrain, yTrain, wts);
    Console.WriteLine(acc.ToString("F4"));
    Console.WriteLine("\nModel weights and bias: ");
    for (int i = 0; i < wts.Length; ++i)
      Console.Write(wts[i].ToString("F4").PadLeft(8));
    Console.WriteLine("");

The demo concludes by making a prediction for a new banknote item:

    double[] unknown = new double[] { 0.00, 2.00,
      -1.00, 1.00 };
    double z = ComputeOutput(unknown, wts);
    Console.WriteLine("Computed output = ");
    Console.WriteLine(z);

The Accuracy() function computes the number of correct and incorrect predictions on the training data. Because the training data has five authentic and five forgery items, just by guessing either class you would get 50 percent accuracy. Therefore the 60 percent accuracy of the demo model isn't very strong and in a non-demo scenario you'd likely next try a more powerful approach such as logistic regression, k-nearest neighbors, numeric naive Bayes, or a neural network.

**Computing Output**

The ComputeOutput() function is very simple:

    static int ComputeOutput(double[] x, double[] wts)
    {
      double z = 0.0;
      for (int i = 0; i < x.Length; ++i)
        z += x[i] * wts[i];
      z += wts[wts.Length - 1];  // add the bias
      if (z < 0.0)
        return -1;
      else
        return +1;
    }

Exactly how to store the bias value is a design choice. The demo program stores the bias in the last cell of the weights array. Two alternatives are to pass the bias as a separate parameter, or to store the bias in the first cell of the weights array. All three designs are common so if you're using a library implementation you'll have to look at the source code to see which approach is being used.

The ComputeOutput() function returns -1 or +1 depending on the sign of the sum of products term. Rather inexplicably, some references, including Wikipedia, apply the Signum() function on the sum of products term. This returns -1, 0 (if the sum is exactly 0), or +1. This approach is incorrect. For perceptron classification, a sum of products of 0.0 must be arbitrarily associated to either class -1 or class +1. The demo associates a sum of exactly 0.0 to class +1.

**Training a Perceptron Model**

The Train() function is presented in **Listing 2**. The demo program uses a variation of perceptron training
called average perceptron. The key statements for both basic perceptron training and average perceptron training
are:

    int output = ComputeOutput(xData[i], wts);
    int target = yData[i];  // -1 or +1
    if (output != target)
    {
      double delta = target - output;
      for (int j = 0; j < n; ++j)
        wts[j] = wts[j] + (lr * delta * xData[i][j]);
      wts[n] = wts[n] + (lr * delta * 1);
    }

For each training item, the computed output using the current weights and bias values will be either -1 or +1. The correct target value will be -1 or +1. If the computed output and the target values are the same, the predicted class is correct and the weights and bias are not adjusted.

If the computed and target values are different, the difference between the two values will be either -2 or +2. Each weight and the bias is adjusted to make the computed output value closer to the target output value by adding or subtracting the learning rate times the delta, times the associated predictor input value. The sign of the input value controls the direction of the change in the associated weight, and larger input values produce a larger change in weight. Very clever!
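To make the update rule concrete, here is a sketch that applies it once to the first training item, starting from all-zero weights. UpdateSketch and UpdateOnce are illustrative names rather than part of the demo; as in the demo, the bias sits in the last cell of the weights array.

```csharp
using System;
using System.Globalization;

public static class UpdateSketch
{
    // Apply one perceptron update for a single item.
    // Returns true if the weights and bias were changed.
    public static bool UpdateOnce(double[] x, int target,
        double[] wts, double lr)
    {
        int n = x.Length;
        double z = wts[n];  // start from the bias
        for (int j = 0; j < n; ++j)
            z += x[j] * wts[j];
        int output = (z < 0.0) ? -1 : +1;

        if (output == target) return false;  // correct, leave weights alone

        double delta = target - output;  // -2 or +2
        for (int j = 0; j < n; ++j)
            wts[j] += lr * delta * x[j];
        wts[n] += lr * delta;  // bias acts like a weight on a constant 1
        return true;
    }

    public static void Main()
    {
        double[] x = { 3.6216, 8.6661, -2.8073, -0.44699 };  // authentic
        double[] wts = new double[5];  // four weights plus bias, all 0.0

        // A sum of exactly 0.0 maps to +1, so the first pass is wrong
        // and every weight moves by -0.02 times its input value.
        bool changed = UpdateOnce(x, -1, wts, 0.01);
        Console.WriteLine("First pass changed weights: " + changed);
        foreach (double w in wts)
            Console.Write(
                w.ToString("F4", CultureInfo.InvariantCulture).PadLeft(9));
        Console.WriteLine("");

        // The same item is now classified correctly, so nothing changes.
        bool changedAgain = UpdateOnce(x, -1, wts, 0.01);
        Console.WriteLine("Second pass changed weights: " + changedAgain);
    }
}
```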

**Listing 2. Perceptron Train Function**

    static double[] Train(double[][] xData, int[] yData,
      double lr, int maxIter, int seed)
    {
      int N = xData.Length;     // num items
      int n = xData[0].Length;  // num predictors
      double[] wts = new double[n + 1];  // for bias
      double[] accWts = new double[n + 1];
      double[] avgWts = new double[n + 1];
      int[] indices = new int[N];
      for (int i = 0; i < N; ++i)  // fill with 0..N-1 so that
        indices[i] = i;            // every item gets visited
      Random rnd = new Random(seed);
      int iter = 0;
      int numAccums = 0;

      while (iter < maxIter)
      {
        Shuffle(indices, rnd);
        foreach (int i in indices)
        {
          int output = ComputeOutput(xData[i], wts);
          int target = yData[i];  // -1 or +1

          if (output != target)
          {
            double delta = target - output;
            for (int j = 0; j < n; ++j)
              wts[j] = wts[j] +
                (lr * delta * xData[i][j]);
            wts[n] = wts[n] + (lr * delta * 1);
          }

          for (int j = 0; j < wts.Length; ++j)
            accWts[j] += wts[j];
          ++numAccums;

        } // for each item
        ++iter;
      } // while epoch

      for (int j = 0; j < wts.Length; ++j)
        avgWts[j] = accWts[j] / numAccums;
      return avgWts;
    } // Train

The average perceptron training technique accumulates each weight value on each training iteration. Then, when the training iterations complete, the function returns the average weight and bias values. The idea is that weights that produce a correct output result are retained in a sense, rather than being discarded.

The averaging technique usually produces a slightly better model than the basic non-averaging technique. However, if you have a very large dataset the accumulated sum could overflow. In such situations you can compute a rolling average, m, using the equation m(k) = m(k-1) + [ (x(k) - m(k-1)) / k ] where m(k) is the mean of the first k values and x(k) is the kth value.
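The rolling average equation translates almost directly into code. The sketch below uses a hypothetical helper named UpdateMean and a handful of made-up values; it isn't wired into the demo's Train() function.

```csharp
using System;

public static class RollingMeanSketch
{
    // m(k) = m(k-1) + (x(k) - m(k-1)) / k, where k counts from 1.
    public static double UpdateMean(double prevMean, double x, int k)
    {
        return prevMean + (x - prevMean) / k;
    }

    public static void Main()
    {
        double[] values = { 2.0, 4.0, 6.0, 8.0 };
        double mean = 0.0;
        for (int i = 0; i < values.Length; ++i)
            mean = UpdateMean(mean, values[i], i + 1);
        Console.WriteLine(mean);  // 5
    }
}
```

Because only the current mean is stored, no running sum ever grows large.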

The Train() function processes the data items in a random order on each pass through the training dataset. This helps prevent an oscillation where the weight updates due to one item are immediately undone by the next item. The demo program scrambles the order of the training items using a program-defined function Shuffle():

    static void Shuffle(int[] indices, Random rnd)
    {
      int n = indices.Length;
      for (int i = 0; i < n; ++i)
      {
        int ri = rnd.Next(i, n);
        int tmp = indices[ri];
        indices[ri] = indices[i];
        indices[i] = tmp;
      }
    }

The Shuffle() function uses the Fisher-Yates algorithm, a standard way to rearrange an array into a random order in place, and one that appears very often in machine learning code. The function accepts a Random object. An alternative design is to define and use a static class-scope Random object.
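Note that Shuffle() only rearranges whatever values the array already holds, so the caller is expected to fill it with 0 through N-1 first. The following usage sketch shows that setup; the ShuffleSketch wrapper class and the seed value are assumptions made for illustration.

```csharp
using System;

public static class ShuffleSketch
{
    // Fisher-Yates shuffle, same logic as the demo's Shuffle().
    public static void Shuffle(int[] indices, Random rnd)
    {
        int n = indices.Length;
        for (int i = 0; i < n; ++i)
        {
            int ri = rnd.Next(i, n);  // random index in [i, n)
            int tmp = indices[ri];
            indices[ri] = indices[i];
            indices[i] = tmp;
        }
    }

    public static void Main()
    {
        int[] indices = new int[10];
        for (int i = 0; i < indices.Length; ++i)
            indices[i] = i;  // 0, 1, 2, . . 9

        Random rnd = new Random(0);
        Shuffle(indices, rnd);

        // Each of 0..9 still appears exactly once, in scrambled order.
        Console.WriteLine(string.Join(" ", indices));
    }
}
```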

**Computing Model Accuracy**

Function Accuracy() computes the percentage of correct predictions made by a model with specified weights and
bias values. The function definition is presented in **Listing 3**. The function walks through each training
item's predictor values, uses the predictors to compute a -1 or +1 output value, and fetches the corresponding
target -1 or +1 value. If the computed value and target value are the same then the prediction is correct,
otherwise the prediction is wrong.

**Listing 3. Perceptron Accuracy Function**

    static double Accuracy(double[][] xData,
      int[] yData, double[] wts)
    {
      int numCorrect = 0; int numWrong = 0;
      int N = xData.Length;
      for (int i = 0; i < N; ++i)  // each item
      {
        double[] x = xData[i];
        int target = yData[i];
        int computed = ComputeOutput(x, wts);

        if ((target == 1 && computed == 1) ||
          (target == -1 && computed == -1))
        {
          ++numCorrect;
        }
        else
        {
          ++numWrong;
        }
      }
      return (1.0 * numCorrect) /
        (numCorrect + numWrong);
    }

The demo code checks if both target and computed values are 1, or if both are -1. This approach is used to make it explicit that -1 and +1 are the only two possible output values. A more efficient approach is simply to check if the computed value equals the target value.
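The simpler check mentioned above might look something like the following sketch. The AccuracySketch class name and the tiny one-predictor dataset in Main are made up for illustration; ComputeOutput() follows the demo's logic, with the bias in the last cell of the weights array.

```csharp
using System;
using System.Globalization;

public static class AccuracySketch
{
    // Same logic as the demo: weighted sum plus bias, then threshold at 0.
    public static int ComputeOutput(double[] x, double[] wts)
    {
        double z = 0.0;
        for (int i = 0; i < x.Length; ++i)
            z += x[i] * wts[i];
        z += wts[wts.Length - 1];  // bias in last cell
        return (z < 0.0) ? -1 : +1;
    }

    // Count a prediction as correct whenever computed equals target.
    public static double Accuracy(double[][] xData, int[] yData,
        double[] wts)
    {
        int numCorrect = 0;
        for (int i = 0; i < xData.Length; ++i)
            if (ComputeOutput(xData[i], wts) == yData[i])
                ++numCorrect;
        return (1.0 * numCorrect) / xData.Length;
    }

    public static void Main()
    {
        // One predictor, weight 1.0, bias 0.0: output is the sign of x.
        double[][] xData = {
            new double[] { 2.0 },
            new double[] { -3.0 },
            new double[] { 1.0 } };
        int[] yData = { 1, -1, -1 };  // third item will be misclassified
        double[] wts = { 1.0, 0.0 };

        double acc = Accuracy(xData, yData, wts);
        Console.WriteLine(
            acc.ToString("F4", CultureInfo.InvariantCulture));  // 0.6667
    }
}
```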

If you're familiar with other machine learning classification techniques, such as logistic regression, you might wonder about computing an error metric. Computing error for perceptron classification isn't feasible because the outputs are discrete. A computed output value prediction is either correct or incorrect. Put another way, for perceptron classification, accuracy and error are essentially the same metric.

**Wrapping Up**

In practice, perceptron classification should be used rarely. Although perceptron classification is simple and
elegant, logistic regression is only slightly more complex and usually gives better results. Some of my
colleagues have asked me why averaged perceptron classification is part of the new ML.NET library. As it turns
out, averaged perceptron was the first classifier algorithm implemented in the predecessor to the ML.NET library, an
internal Microsoft library from Microsoft Research named TMSN, which was later renamed to TLC. The averaged
perceptron classifier was implemented first because it is so simple. The averaged perceptron classifier was
retained from version to version, not because of its practical value, but because removing it would require
quite a bit of effort.