The Data Science Lab
Linear Regression with Two-Way Interactions Using C#
The goal of a machine learning regression problem is to predict a single numeric value. For example, you might want to predict an employee's salary based on age, height, high school grade point average, and so on. There are approximately a dozen common regression techniques. The most basic technique is called linear regression, or sometimes multiple linear regression, where the "multiple" indicates two or more predictor variables.
The form of a basic linear regression prediction model is y' = (w0 * x0) + (w1 * x1) + . . . + (wn * xn) + b, where y' is the predicted value, the xi are predictor values, the wi are weights (also called coefficients), and b is the bias (also called the intercept). The form of a linear regression with two-way interactions model is y' = (w0 * x0) + . . . + (wn * xn) + (w01 * x0 * x1) + (w02 * x0 * x2) + . . . + b. The interaction terms are the products of all combinations of pairs of the predictor variables.
Compared to basic linear regression, linear regression with interactions can handle more complex data. Compared to other regression techniques that are designed to handle complex data, such as kernel ridge regression and Gaussian process regression, linear regression with interactions often has slightly worse prediction accuracy, but has better model interpretability.
This article presents a demo of linear regression with two-way interactions, implemented from scratch, using the C# language. A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading synthetic training and test data into memory. The data looks like:
-0.1660, 0.4406, -0.9998, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, 0.3704, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -0.1654, 0.1174, -0.7192, 0.8054
0.9365, -0.3732, 0.3846, 0.7528, 0.7892, 0.1345
. . .
The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. There are 200 training items and 40 test items.
The demo creates and trains a linear regression with two-way interactions model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for the first training item x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065].
The first part of the demo output shows how a linear regression with interactions model is created and trained:
Creating Linear Regression with two-way interactions model
Setting lrnRate = 0.0010
Setting maxEpochs = 100
Starting training
epoch = 0 MSE = 0.3430
epoch = 20 MSE = 0.0570
epoch = 40 MSE = 0.0344
epoch = 60 MSE = 0.0298
epoch = 80 MSE = 0.0281
Done
In theory, a linear regression with interactions model can be trained using a closed-form solution that involves computing a matrix inverse. But in practice, a model is usually trained using iterative stochastic gradient descent, which requires a learning rate and a maximum number of training epochs. The value of the MSE (mean squared error) slowly decreases, which indicates training is working.
The next part of the demo output shows the trained model weights and the bias:
Model base weights:
-0.2624 0.0340 -0.0462 0.0322 -0.1152
Model bias/intercept: 0.3620
Model interaction weights:
0.0000 0.0000 0.0000 0.0000 0.0000
-0.0014 0.0000 0.0000 0.0000 0.0000
0.0321 0.0101 0.0000 0.0000 0.0000
0.0183 -0.0110 0.0008 0.0000 0.0000
0.0947 0.0328 -0.0452 0.0024 0.0000
Because the demo data has five predictor variables, there are five basic weights. With two-way interactions, there are an additional 10 interaction weights: w01, w02, w03, w04, w12, w13, w14, w23, w24, w34. In general, if there are n predictor variables, there are (n * (n-1)) / 2 interaction weights. The demo stores the interaction weights in the lower left part of an n-by-n matrix where the row index is the first predictor and the column index is the second predictor. For instance, the -0.0110 value at [3][1] is the weight for the x3 and x1 interaction.
[Click on image for larger view.] Figure 1: Linear Regression with Two-Way Interactions in Action
The demo concludes by evaluating the trained model and making a prediction:
Evaluating model
Accuracy train (within 0.15) = 0.8300
Accuracy test (within 0.15) = 0.8000
Predicting for x =
-0.1660 0.4406 -0.9998 -0.3953 -0.7065
Predicted y = 0.5094
The model accuracy is reasonably good compared to many other regression techniques applied to the synthetic data -- 83.00% accuracy on the training data (166 out of 200 correct) and 80.00% accuracy on the test data (32 out of 40 correct). A prediction is scored as correct if it's within 15% of the true target value.
This article assumes you have intermediate or better programming skill but doesn't assume you know anything about linear regression with two-way interactions. The demo is implemented using C# but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.
The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, non-linear structure which can be predicted.
All of the predictor values are between -1 and +1. When using linear regression with interactions, technically, it's not necessary to normalize/scale your data. But normalizing usually leads to a better prediction model, especially if some raw predictor values are very large (such as employee income) and some values are small (such as employee age). Also, using normalized data allows you to interpret the model weights more easily (larger magnitudes mean more effect on the predicted y value).
Linear regression with interactions is most often used with data that has strictly numeric predictor variables. It is possible to use the technique with categorical data that has an inherent order using equal-interval encoding. For example, a predictor variable height with possible values (short, medium, tall) could be encoded as short = 0.25, medium = 0.50, tall = 0.75.
However, for categorical data without inherent order, such as color with possible values (red, blue, green), there's no obvious way to encode the values so that when multiplied together you get a meaningful number. That said, I have seen examples where non-ordered categorical predictor values were equal-interval encoded and the resulting prediction model worked quite well. However, there are no solid research results, that I'm aware of, that support this technique.
Understanding Linear Regression with Two-Way Interactions
Linear regression with two-way interactions is probably best understood by looking at a concrete example. Suppose you have three predictor variables (instead of five as in the demo data). If the predictor values are x0, x1, x2 and the base weights are w0, w1, w2, and the interaction weights are w01, w02, w12, and the bias is b, the prediction equation is y' = (w0 * x0) + (w1 * x1) + (w2 * x2) + (w01 * x0 * x1) + (w02 * x0 * x2) + (w12 * x1 * x2) + b.
If x = (x0, x1, x2) = (0.3, 0.7, 0.5), and w = (w0, w1, w2) = (1.5, 2.5, 0.4), and w01 = 3.2, w02 = 1.8, w12 = 1.0, and b = -2.8, then:
y' = (w0 * x0) + (w1 * x1) + (w2 * x2) +
(w01 * x0 * x1) +
(w02 * x0 * x2) +
(w12 * x1 * x2) + b
= (1.5 * 0.3) + (2.5 * 0.7) + (0.4 * 0.5) +
(3.2 * 0.3 * 0.7) +
(1.8 * 0.3 * 0.5) +
(1.0 * 0.7 * 0.5) + (-2.8)
= 0.45 + 1.75 + 0.20 + 0.672 + 0.270 + 0.350 - 2.80
= 1.692
The interaction effects are captured by multiplying predictor values. There are other ways to generate interaction effects but multiplication is simple and effective. An obvious extension of two-way interactions between all possible pairs of predictor variables is to use three-way or four-way or greater interactions. However, anything greater than two-way interactions quickly becomes a complicated mess, and in my opinion, you're better off using a more sophisticated regression technique that is designed for all-way interactions, specifically kernel ridge regression or neural network regression.
Training the Linear Regression with Interactions Model
The demo program uses standard stochastic gradient descent (SGD) optimization to train the regression model. In high-level pseudo-code:
initialize weights and bias to small random values
loop max_epochs times
shuffle order of training data
loop each training item
get training inputs X
get training target value actual y
compute predicted y using curr weights
loop each weight and the bias
new_wt = old_wt * -1 * lrn_rate * (pred_y - actual_y) * x
end-loop
end-loop
end-loop
The idea is to adjust each weight so that the predicted y gets closer to the actual y. The (pred_y - actual_y) is a simple measure of how far off the predicted value is. If pred_y is larger than actual_y, you want to decrease the associated weight to make the pred_y smaller.
The learning rate, lrn_rate, is typically a value like 0.01 that lessens the amount of change in the weights so that there are no wild fluctuations. The learning rate must be determined by trial and error. If the learning rate is too large, SGD might jump past a good weight value. If the learning rate is too small, training might be too slow.
The -1 term controls the direction of the change in the weight being adjusted. If the difference term was (actual_y - pred_y), the -1 term could be dropped, but it's traditional to use (pred_y - actual_y). The x term in the update controls the direction of the weight change, depending on the sign of x, and the amount of change in the weight, depending on the magnitude of x.
The order in which the training data is processed is randomized in each epoch (one complete pass through all items). This prevents training from oscillating back and forth when one item's weight updates undoes the previous item's weight updates.
The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project LinearRegressionInteractions. I checked the "Do not use top-level statements" option to avoid the strange program entry point shortcut syntax.
The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.
After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the more descriptive LinearRegressionInteractionsProgram.cs. I allowed Visual Studio to automatically rename class Program.
The overall program structure is presented in Listing 1. All the control logic is in the Main() method in the Program class. The Program class also holds helper functions to load data from file into memory and display data. All of the linear regression with two-way interactions functionality is in a LinearRegressor class. The LinearRegressor class exposes a constructor and four methods: Train(), Predict(), Accuracy(), and MeanSqError().
Listing 1: Overall Program Structure
using System;
using System.IO;
using System.Collections.Generic;
namespace LinearRegressionInteractions
{
internal class LinearRegressionInteractionsProgram
{
static void Main(string[] args)
{
Console.WriteLine("Begin C# linear regression" +
" with two-way interactions demo ");
// 1. load data
// 2. create model
// 3. train model
// 4. show model weights
// 5. evaluate model
// 6. use model
Console.WriteLine("End demo ");
Console.ReadLine();
}
static double[][] MatLoad(string fn, int[] usecols,
char sep, string comment) { . . }
static double[] MatToVec(double[][] mat) { . . }
static void VecShow(double[] vec, int dec,
int wid) { . . }
}
// --------------------------------------------------------
public class LinearRegressor
{
public double[] weights;
public double[][] interactionWts;
public double bias;
private Random rnd;
public LinearRegressor(int seed) { . . }
public void Train(double[][] trainX, double[] trainY,
double lrnRate, int maxEpochs) { . . }
public double Predict(double[] x) { . . }
public double Accuracy(double[][] dataX, double[] dataY,
double pctClose) { . . }
public double MeanSqError(double[][] dataX,
double[] dataY) { . . }
private static void Shuffle(int[] indices,
Random rnd) { . . }
}
// --------------------------------------------------------
} // ns
The LinearRegressor class declares the fields interactionWts , weights, and bias with public scope so that they can be accessed directly to view them. The private Random object rnd is used by the Shuffle() method, which is called by the Train() method, to shuffle the order in which training items are processed.
Loading the Data into Memory
The demo program starts by loading the 200-item training data into memory:
string trainFile =
"..\\..\\..\\Data\\synthetic_train_200.txt";
int[] colsX = new int[] { 0, 1, 2, 3, 4 };
double[][] trainX =
MatLoad(trainFile, colsX, ',', "#");
double[] trainY =
MatToVec(MatLoad(trainFile,
new int[] { 5 }, ',', "#"));
The training X data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
The 40-item test data is loaded into memory using the same pattern that was used to load the training data:
string testFile =
"..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX =
MatLoad(testFile, colsX, ',', "#");
double[] testY =
MatToVec(MatLoad(testFile,
new int[] { 5 }, ',', "#"));
The first three training items are displayed with four decimals like so:
Console.WriteLine("First three train X: ");
for (int i = 0; i < 3; ++i)
VecShow(trainX[i], 4, 8);
Console.WriteLine("First three train y: ");
for (int i = 0; i < 3; ++i)
Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));
In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory.
Creating and Training the Model
The linear regression with two-way interactions is created like so:
Console.WriteLine("Creating Linear Regression" +
" with two-way interactions model ");
int seed = 0;
LinearRegressor model = new LinearRegressor(seed);
The constructor accepts only a seed value for an internal random number generator that's used by the Train() method. The value of the seed parameter can have a significant effect on the accuracy of the resulting model, but you shouldn't vary the seed value to train your model. Next, the demo prepares the training parameters and calls the Train() method:
double lrnRate = 0.001;
int maxEpochs = 100;
Console.WriteLine("Setting lrnRate = " +
lrnRate.ToString("F4"));
Console.WriteLine("Setting maxEpochs = " +
maxEpochs);
Console.WriteLine("Starting training ");
model.Train(trainX, trainY, lrnRate, maxEpochs);
Console.WriteLine("Done ");
The values of lrnRate and maxEpochs must be determined by trial and error. In a non-demo scenario, you can programmatically try different values using a grid search along the lines of:
double[] rates = new double[] { 0.001, 0.005, 0.010, 0.050 };
int[] epochs = new int[] { 50, 100, 500, 1000 };
foreach (double lrnRate in rates) {
foreach (int maxEpoch in epochs) {
LinearRegressor model = new LinearRegressor(0);
model.Train(trainX, trainY, lrnRate, maxEpoch);
// compute and display accuracy and error
}
}
After training has completed, the demo program displays the model weights:
Console.WriteLine("Model base weights: ");
VecShow(model.weights, 4, 8);
Console.WriteLine("Model bias/intercept: " +
model.bias.ToString("F4").PadLeft(8));
Console.WriteLine("Model interaction weights: ");
for (int i = 0; i < model.interactionWts.Length; ++i)
VecShow(model.interactionWts[i], 4, 8);
The demo program does not use regularization, which is one of several techniques that restrict the magnitude of the resulting model weights. Large model weights often lead to model overfitting when the model predicts well on the training data but poorly on new, previously unseen data. A simple form of regularization is weight decay. After every training epoch, all weights are reduced by multiplying each by a decay constant value like 0.995, where the decay constant must be determined by trial and error.
Evaluating and Using the Model
The demo program evaluates the trained model using these statements:
Console.WriteLine("Evaluating model ");
double accTrain = model.Accuracy(trainX, trainY, 0.15);
Console.WriteLine("Accuracy train (within 0.15) = " +
accTrain.ToString("F4"));
double accTest = model.Accuracy(testX, testY, 0.15);
Console.WriteLine("Accuracy test (within 0.15) = " +
accTest.ToString("F4"));
The Accuracy() method scores a prediction value as correct if it's within a specified percentage of the true target value. A reasonable closeness percentage to use will vary from problem to problem.
Accuracy is not a very granular metric. You can also use the public MeanSqError() method to display the more granular mean squared error of the model. Many of the regression modules in the Python language scikit-learn library use the coefficient of determination, R-squared, as the primary evaluation metric, but in my opinion mean squared error is easier to interpret. If you wish, you can easily implement a coefficient of determination method by using the MeanSqError() method as a template.
The demo concludes by using the trained model to predict the y value for the first training item, (-0.1660, 0.4406, -0.9998, -0.3953, -0.7065):
double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
VecShow(x, 4, 9);
double predY = model.Predict(x);
Console.WriteLine("Predicted y = " + predY.ToString("F4"));
The predicted y value, 0.5094, is reasonably close (within 6%) to the true target value of 0.4840. This is a good result because, as mentioned previously, the synthetic demo data was generated by a neural network, which has complex interactions between predictor variables.
Wrapping Up
Linear regression with two-way interactions has a nice balance of prediction power and interpretability. The model weights/coefficients are easy to interpret. If the predictor values have been normalized to the same scale, larger magnitudes mean larger effect, and the sign of the weights indicate the direction of the effect. You should be a bit careful interpreting the interaction weights. If an interaction weight is positive, it can indicate an increase in y when the corresponding pair of predictor values are both positive or both negative.
Linear regression with two-way interactions is not always effective -- if it were, it would have replaced basic linear regression. Put another way, linear regression with two-way interactions can sometimes provide a big improvement in model quality for a relatively small investment in effort, and so it's usually worth exploring.
About the Author
Dr. James McCaffrey directs the data science and research efforts at Quaetrix, a data analytics company located near Redmond, Washington. Before joining Quaetrix, James was a senior research engineer at Microsoft. James can be reached at [email protected].