The Data Science Lab
Naive Bayes Regression Using C#
Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the naive Bayes regression technique, where the goal is to predict a single numeric value. Compared to other machine learning regression techniques, naive Bayes regression is usually less accurate, but is simple, easy to implement and customize, works on both large and small datasets, is highly interpretable, and doesn't require tuning any hyperparameters.
The goal of a machine learning regression problem is to predict a single numeric value. There are roughly a dozen different regression techniques such as basic linear regression, k-nearest neighbors regression, kernel ridge regression, random forest regression, and neural network regression. Naive Bayes regression is essentially a variation of basic linear regression.
In simple linear regression, there is a single predictor variable x, and a single target variable y to predict. For example, you might want to predict a person's income (y) from their age (x). Using a set of training data, you might get a prediction equation like y = (10.2 * x) + 3.57 where the 10.2 is the regression coefficient and the 3.57 is the regression constant.
Now suppose you have a regression problem with two or more predictor variables. For example, you might want to predict a person's income (y) from their age (x0), height (x1), and years of work experience (x2). In naive Bayes regression, you'd predict income from age, and predict income from height, and predict income from years of work experience. To generate the final predicted income, you'd compute the average of the three predicted income values.
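For example, a minimal sketch of the averaging idea in C#, using made-up coefficient and constant values for the three per-predictor models (the numbers here are purely for illustration):

double age = 30.0, height = 68.5, experience = 8.0;
// hypothetical per-predictor linear models
double fromAge = (10.2 * age) + 3.57;
double fromHeight = (0.35 * height) + 24.0;
double fromExperience = (5.1 * experience) + 12.4;
// the final prediction is the average
double predIncome =
  (fromAge + fromHeight + fromExperience) / 3.0;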
The naive Bayes regression technique is called naive because each predictor variable is treated independently of the others, without taking into account interactions between predictors. The "Bayes" part of the name suggests a probabilistic technique, which isn't entirely accurate because the technique does not directly rely on Bayesian principles. Compared to other regression techniques, naive Bayes regression is simple, easy to implement, and its results are easy to interpret. But in most problem scenarios, naive Bayes regression is less accurate than other techniques, and so it's best used as a baseline for comparison with other techniques. However, in some problem scenarios, naive Bayes regression is surprisingly effective.
Figure 1: Naive Bayes Regression Using C# in Action
This article presents a complete demo of naive Bayes regression using the C# language. Although there are many code libraries that contain implementations of simple linear regression, to the best of my knowledge there are no libraries that directly implement naive Bayes regression. Even if a library implementation did exist, implementing the system from scratch allows you to integrate it easily with other .NET systems and to modify the code as needed.
A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading a subset of the well-known Wine Dataset into memory. The data looks like:
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
. . .
There are 200 training items and 40 test items. Each line represents a glass of red wine. There are 12 semicolon-separated values on each line. The goal of the demo program is to predict the value in the first column from the values in the remaining 11 columns.
The demo loads training and test data into memory, creates a naive Bayes regression model, then evaluates the model accuracy on the training and test data:
Creating and training naive Bayes regression model
Done
Evaluating model
Accuracy train (within 0.15) = 0.8550
Accuracy test (within 0.15) = 0.8000
A prediction is scored correct if it's within 15% of the true y value. The model predicts the training data with 85.50% accuracy (171 out of 200 correct), and the test data with 80.00% accuracy (32 out of 40 correct).
The demo then uses the model to predict the target y value for the first training item x = [0.7000, 0.0000, 1.9000, 0.0760, 11.0000, 34.0000, 0.9978, 3.5100, 0.5600, 9.4000, 5.0000]. The predictions for each of the 11 predictor variables, and the final overall prediction, are displayed as:
predictor[ 0] : pred y = 7.4869
predictor[ 1] : pred y = 7.0785
predictor[ 2] : pred y = 7.5916
predictor[ 3] : pred y = 7.5403
predictor[ 4] : pred y = 7.5417
predictor[ 5] : pred y = 7.5447
predictor[ 6] : pred y = 8.2576
predictor[ 7] : pred y = 6.8554
predictor[ 8] : pred y = 7.4776
predictor[ 9] : pred y = 7.7879
predictor[10] : pred y = 7.5722
Predicted y = 7.5213
The predicted y value of 7.5213 is reasonably close to the true y value of 7.4 in the training data.
This article assumes you have intermediate or better programming skill but doesn't assume you know anything about naive Bayes regression. The demo is implemented using C# but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.
The Demo Data
The demo data is a subset of the well-known Wine Quality dataset. Each line represents a glass of red wine from Portugal. The 12 values on each line are: 1.) fixed acidity, 2.) volatile acidity, 3.) citric acid, 4.) residual sugar, 5.) chlorides, 6.) free sulfur dioxide, 7.) total sulfur dioxide, 8.) density, 9.) pH, 10.) sulphates, 11.) alcohol, and 12.) human-generated quality score from 1 to 9.
The full red wine dataset has 1,599 items. There's also a white wine dataset with 4,898 items that's not used by the demo program. The complete Wine Quality data can be found in the UCI Machine Learning Repository. The demo program uses the first 200 red wine items as training data, and the next 40 items as test data.
Most online examples that use the Wine Quality dataset predict the quality value in column 12 from the other values in columns 1 through 11. However, that problem is much too easy because 1,319 of the 1,599 quality values in the red wine set are either 5 or 6. So, for a better challenge, the demo program predicts the value of fixed acidity in column 1 from the values in columns 2 through 12.
The demo data is not normalized. When using many regression techniques, such as k-nearest neighbors regression, it's important to normalize predictor columns to the same scale (typically between 0 and 1, or between -1 and +1) so that a column with large magnitudes (such as employee income) doesn't overwhelm columns with small magnitudes (such as employee age). But because naive Bayes regression treats each predictor column independently, there's no need to normalize the data.
Understanding Simple Linear Regression
Because naive Bayes regression is based on simple linear regression, in order to understand naive Bayes regression, you must understand simple linear regression. The "simple" means one predictor variable. The simple linear regression technique is perhaps best understood by looking at a concrete example. Suppose that you have a set of data with only five items:
x y
-------
3.5 8
1.5 4
4.5 10
1.5 4
2.5 6
You want to find the value of the coefficient a and the constant b so that the sum of squared differences between the predicted y' = (a * x) + b values and the actual y values is minimized. There are several closed-form solutions. One version, used by the demo program, first computes the sums of x, y, (x * x), (y * y), and (x * y):
x y (x * x) (y * y) (x * y)
-----------------------------------
3.5 8 12.25 64 28
1.5 4 2.25 16 6
4.5 10 20.25 100 45
1.5 4 2.25 16 6
2.5 6 6.25 36 15
-----------------------------------
13.5 32 43.25 232 100
Then, if n = 5 is the number of items, and if Sx, Sy, Sxx, Syy, and Sxy represent the sums:
denom = (n * Sxx) - (Sx * Sx)
      = (5 * 43.25) - (13.5 * 13.5)
      = 216.25 - 182.25
      = 34.0
coeff = a = ((n * Sxy) - (Sx * Sy)) / denom
= ((5 * 100) - (13.5 * 32)) / 34.0
= 68.0 / 34.0
= 2.0
const = b = ((Sy * Sxx) - (Sx * Sxy)) / denom
= ((32 * 43.25) - (13.5 * 100)) / 34.0
= 34.0 / 34.0
= 1.0
Note that this tiny example is artificial in the sense that each data item can be predicted perfectly using the equation y' = (2.0 * x) + 1.0. When working with real-life data, the training data can almost never be predicted perfectly.
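Expressed in code, here's a minimal C# sketch of the closed-form computation. The Fit() method name and out-parameter signature are my choices for illustration and aren't part of the demo download:

static void Fit(double[] x, double[] y,
  out double a, out double b)
{
  int n = x.Length;
  double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
  for (int i = 0; i < n; ++i)
  {
    sx += x[i]; sy += y[i];
    sxx += x[i] * x[i]; sxy += x[i] * y[i];
  }
  // the (y * y) sums are only needed for related
  // statistics, not for a and b
  double denom = (n * sxx) - (sx * sx);
  a = ((n * sxy) - (sx * sy)) / denom;   // coefficient
  b = ((sy * sxx) - (sx * sxy)) / denom; // constant
}

Called with the five-item example data, Fit() recovers the a = 2.0 and b = 1.0 values computed by hand above:

double[] x = new double[] { 3.5, 1.5, 4.5, 1.5, 2.5 };
double[] y = new double[] { 8, 4, 10, 4, 6 };
Fit(x, y, out double a, out double b);  // a = 2.0, b = 1.0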
The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project NaiveBayesRegression. I checked the "Do not use top-level statements" option to avoid the weird program entry point shortcut syntax.
The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.
After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the more descriptive NaiveBayesRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program.
The overall program structure is presented in Listing 1. All the control logic is in the Main() method in the Program class. The Program class also holds helper functions to load data from file into memory and display data. All of the naive Bayes regression functionality is in a NaiveBayesRegressor class. The NaiveBayesRegressor class exposes a constructor and three methods: Train(), Predict(), and Accuracy().
Listing 1: Overall Program Structure
using System;
using System.IO;
using System.Collections.Generic;

namespace NaiveBayesRegression
{
  internal class NaiveBayesRegressionProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin naive Bayes regression ");
      Console.WriteLine("Predict red wine fixed acidity" +
        " from volatile acidity, citric acid, etc. ");

      // 1. load data
      // 2. create and train model
      // 3. evaluate model
      // 4. use model

      Console.WriteLine("End demo ");
      Console.ReadLine();
    }

    // helpers for Main()
    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment) { . . }
    static double[] MatToVec(double[][] mat) { . . }
    static void VecShow(double[] vec, int dec,
      int wid) { . . }
  } // class NaiveBayesRegressionProgram

  // ========================================================

  public class NaiveBayesRegressor
  {
    public double[] coefs;  // one per predictor
    public double[] constants;

    public NaiveBayesRegressor() { . . }
    public void Train(double[][] trainX,
      double[] trainY) { . . }
    public double Predict(double[] x,
      bool verbose = false) { . . }
    public double Accuracy(double[][] dataX,
      double[] dataY, double pctClose) { . . }
  }

  // ========================================================
} // ns
The demo program starts by loading the 200-item training data into memory:
string trainFile =
"..\\..\\..\\Data\\wine_train_200.txt";
int[] colsX = new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 };
double[][] trainX = MatLoad(trainFile, colsX, ';', "#");
double[] trainY = MatToVec(MatLoad(trainFile,
new int[] { 0 }, ';', "#"));
The training X data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load zero-based columns 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 where the data is semicolon-delimited, and lines beginning with "#" are comments to be ignored. The training target y data in column [0] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
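The helper function implementations aren't shown in Listing 1. As a rough guide, here are minimal sketches of MatToVec(), which converts an n-by-1 matrix to a vector, and a simplified MatLoad(). These assume the using statements in Listing 1, and the versions in the demo download may differ in details such as error checking:

static double[] MatToVec(double[][] mat)
{
  // convert an n-by-1 matrix to a vector
  double[] result = new double[mat.Length];
  for (int i = 0; i < mat.Length; ++i)
    result[i] = mat[i][0];
  return result;
}

static double[][] MatLoad(string fn, int[] usecols,
  char sep, string comment)
{
  // load specified columns of a delimited text file
  List<double[]> rows = new List<double[]>();
  foreach (string line in File.ReadLines(fn))
  {
    if (line.StartsWith(comment)) continue;  // skip comments
    string[] tokens = line.Split(sep);
    double[] row = new double[usecols.Length];
    for (int j = 0; j < usecols.Length; ++j)
      row[j] = double.Parse(tokens[usecols[j]]);
    rows.Add(row);
  }
  return rows.ToArray();
}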
The 40-item test data is loaded into memory using the same pattern that was used to load the training data:
string testFile =
"..\\..\\..\\Data\\wine_test_40.txt";
double[][] testX = MatLoad(testFile, colsX, ';', "#");
double[] testY = MatToVec(MatLoad(testFile,
new int[] { 0 }, ';', "#"));
Console.WriteLine("Done ");
The first five training items are displayed using 4 decimals and a field width of 8, like so:
Console.WriteLine("First five train X: ");
for (int i = 0; i < 5; ++i)
VecShow(trainX[i], 4, 8);
Console.WriteLine("First five train y: ");
for (int i = 0; i < 5; ++i)
Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));
In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory. The naive Bayes regression model is created and trained like so:
Console.WriteLine("Creating and training " +
"naive Bayes regression model ");
NaiveBayesRegressor model = new NaiveBayesRegressor();
model.Train(trainX, trainY);
Console.WriteLine("Done ");
Unlike many regression techniques, the demo NaiveBayesRegressor model doesn't require any hyperparameters, such as a learning rate, that must be tuned. Next, the demo evaluates model accuracy:
Console.WriteLine("Evaluating model ");
double accTrain = model.Accuracy(trainX, trainY, 0.15);
Console.WriteLine("Accuracy train (within 0.15) = " +
accTrain.ToString("F4"));
double accTest = model.Accuracy(testX, testY, 0.15);
Console.WriteLine("Accuracy test (within 0.15) = " +
accTest.ToString("F4"));
The Accuracy() method scores a prediction as correct if the predicted y value is within 15% of the true target y value. There are several other ways to evaluate a trained regression model, including root mean squared error, coefficient of determination, and so on. Using the Accuracy() method as a template, other evaluation metrics are easy to implement.
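For reference, a minimal sketch of one possible Accuracy() implementation (the version in the demo download may differ slightly):

public double Accuracy(double[][] dataX, double[] dataY,
  double pctClose)
{
  int numCorrect = 0; int numWrong = 0;
  for (int i = 0; i < dataX.Length; ++i)
  {
    double predY = this.Predict(dataX[i]);
    // correct if within pctClose of the true target value
    if (Math.Abs(predY - dataY[i]) <
      Math.Abs(pctClose * dataY[i]))
      ++numCorrect;
    else
      ++numWrong;
  }
  return (numCorrect * 1.0) / (numCorrect + numWrong);
}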
The demo concludes by using the trained naive Bayes regression model to make a prediction:
double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
VecShow(x, 4, 8);
double predY = model.Predict(x, verbose:true);
Console.WriteLine("\nPredicted y = " +
predY.ToString("F4"));
The x input is the first training data item. The verbose argument passed to Predict() instructs the method to display information for each of the 11 simple linear regression models. For additional interpretability, you could modify the demo program code to display the coefficient and constant values for each of the 11 models.
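A minimal sketch of one possible Predict() implementation, which computes one prediction per predictor column and returns the average (the version in the demo download may differ slightly):

public double Predict(double[] x, bool verbose = false)
{
  double sum = 0.0;
  for (int j = 0; j < x.Length; ++j)
  {
    // prediction from the j-th simple linear model
    double predY = (this.coefs[j] * x[j]) + this.constants[j];
    if (verbose == true)
      Console.WriteLine("predictor[" +
        j.ToString().PadLeft(2) + "] : pred y = " +
        predY.ToString("F4"));
    sum += predY;
  }
  return sum / x.Length;  // average the per-predictor predictions
}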
Wrapping Up
Because naive Bayes regression doesn't take into account interactions between predictor variables, the technique can only handle relatively simple data. One possible example of when naive Bayes regression can fail is when two of the predictor variables interact in such a way that they cancel each other out.
Naive Bayes regression is closely related to naive Bayes classification. In naive Bayes classification, the goal is to predict the value of a categorical target variable from categorical predictor values. For example, you could use naive Bayes classification to predict the political leaning of a person (conservative, moderate, liberal) from sex (male, female), income (low, medium, high), and height (short, tall). Both naive Bayes regression and naive Bayes classification sometimes (but certainly not always) work well in spite of their simplicity.