The Data Science Lab

Quadratic Regression with SGD Training Using C#

Dr. James McCaffrey presents a complete end-to-end demonstration of quadratic regression, implemented from scratch, with SGD training, using C#. Compared to standard linear regression, quadratic regression is better able to handle data with a non-linear structure, and data with interactions between predictor variables.

The goal of a machine learning regression problem is to predict a single numeric value. For example, you might want to predict an employee's salary based on age, height, high school grade point average, and so on. There are approximately a dozen common regression techniques. The most basic technique is called linear regression, or sometimes multiple linear regression, where the "multiple" indicates two or more predictor variables.

The form of a basic linear regression prediction model is y' = (w0 * x0) + (w1 * x1) + . . + (wn * xn) + b, where y' is the predicted value, the xi are predictor values, the wi are weights, and b is the bias. Quadratic regression extends linear regression. The form of a quadratic regression model is y' = (w0 * x0) + . . + (wn * xn) + (wj * x0 * x0) + . . + (wk * x0 * x1) + . . . + b. There are derived predictors that are the square of each original predictor, and interaction terms that are the multiplication product of all possible pairs of original predictors.

Compared to basic linear regression, quadratic regression can handle more complex data. Compared to the most powerful regression techniques such as neural network regression, quadratic regression often has slightly worse prediction accuracy, but has much better model interpretability.

This article presents a demo of quadratic regression, implemented from scratch, trained with stochastic gradient descent (SGD), using the C# language. A good way to see where this article is headed is to take a look at the screenshot in Figure 1 . The demo program begins by loading synthetic training and test data into memory. The data looks like:

-0.1660,  0.4406, -0.9998, -0.3953, -0.7065,  0.4840
 0.0776, -0.1616,  0.3704, -0.5911,  0.7562,  0.1568
-0.9452,  0.3409, -0.1654,  0.1174, -0.7192,  0.8054
 0.9365, -0.3732,  0.3846,  0.7528,  0.7892,  0.1345
. . .

The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. There are 200 training items and 40 test items.

The demo creates and trains a quadratic regression model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for the first training item x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065].

After loading the data from file and displaying the first three training X predictors and training y targets, the demo shows how a quadratic regression model is created and trained:

Creating quadratic regression model
Setting lrnRate = 0.001
Setting maxEpochs = 1000
Starting SGD training
epoch =     0  MSE =   0.0957
epoch =   200  MSE =   0.0003
epoch =   400  MSE =   0.0003
epoch =   600  MSE =   0.0003
epoch =   800  MSE =   0.0003
Done

There are several ways to train a quadratic regression model, including stochastic gradient descent (SGD), pseudo-inverse training, closed form inverse training, L-BFGS optimization training and so on. The demo program uses SGD training, which is iterative and requires a learning rate and a maximum number of epochs. These two parameter values must be determined by trial and error.

The next part of the demo output shows the trained model weights and the bias:

Model base weights:
 -0.2630  0.0354 -0.0420  0.0341 -0.1124

Model quadratic weights:
  0.0655  0.0194  0.0051  0.0047  0.0243

Model interaction weights:
  0.0043  0.0249  0.0071  0.1081 -0.0012 -0.0093
  0.0362  0.0085 -0.0568  0.0016

Model bias/intercept:  0.3220

Because the demo data has 5 predictor variables, there are 5 base weights, 5 quadratic (squared term) weights, and (5 * 4) / 2 = 10 interaction weights.

Figure 1: Quadratic Regression in Action
[Click on image for larger view.] Figure 1: Quadratic Regression in Action

The demo concludes by evaluating the trained model and making a prediction:

Evaluating model

Accuracy train (within 0.10) = 0.8850
Accuracy test (within 0.10) = 0.9250

MSE train = 0.0003
MSE test = 0.0005

Predicting for x =
  -0.1660   0.4406  -0.9998  -0.3953  -0.7065
Predicted y = 0.4843

The model accuracy is quite good compared to many other regression techniques applied to the synthetic data -- 88.50% accuracy on the training data (177 out of 200 correct) and 92.50% accuracy on the test data (37 out of 40 correct). A prediction is scored as correct if it's within 10% of the true target value.

This article assumes you have intermediate or better programming skill but doesn't assume you know anything about quadratic regression. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.

The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.

The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, non-linear structure which can be predicted.

All of the predictor values are between -1 and +1. When using quadratic regression, technically, it's not necessary to normalize/scale your data. But normalized data is easier to train and can lead to a better prediction model, especially if some raw predictor values are very large (such as employee income) and some values are small (such as employee age). Also, using normalized data allows you to interpret the model weights more easily (larger magnitudes mean more effect on the predicted y value).

Quadratic regression is most often used with data that has strictly numeric predictor variables. It is possible to use the technique with categorical data that has an inherent order using equal-interval encoding. For example, a predictor variable weight with possible values (light, medium, heavy) could be encoded as light = 0.25, medium = 0.50, heavy = 0.75.

However, for categorical data without inherent order, such as color with possible values (red, blue, green), there's no obvious way to encode the values so that when multiplied together you get a meaningful number. That said, I have seen examples where non-ordered categorical predictor values were equal-interval encoded and the resulting prediction model worked quite well. But there are no solid research results, that I'm aware of, that support this encoding technique for quadratic regression.

Understanding Quadratic Regression
Suppose, as in the demo data, there are five predictors, aka features, (x0, x1, x2, x3, x4). The prediction equation for basic linear regression is:

y' = (w0 * x0) + (w1 * x1) + (w2 * x2) + (w3 * x3) + (w4 * x4) + b

The wi are model weights (aka coefficients), and b is the model bias (aka intercept). The values of the weights and the bias must be determined by training, so that predicted y' values are close to the known, correct y values in a set of training data.

Basic linear regression is simple but it can't predict well for data that has an underlying non-linear structure, and basic linear regression can't deal with data that has hidden interactions between the xi predictors.

The prediction equation for quadratic regression with five predictors is:

y' = (w0 * x0) + (w1 * x1) + (w2 * x2) + (w3 * x3) + (w4 * x4) + 

     (w5 * x0*x0) + (w6 * x1*x1) + (w7 * x2*x2) +
     (w8 * x3*x3) + (w9 * x4*x4) + 

     (w10 * x0*x1) + (w11 * x0*x2) + (w12 * x0*x3) + (w13 * x0*x4) + 
     (w14 * x1*x2) + (w15 * x1*x3) + (w16 * x1*x4) + 
     (w17 * x2*x3) + (w18 * x2*x4) + 
     (w19 * x3*x4)

     + b

The squared (aka "quadratic") xi^2 terms handle non-linear structure. If there are n predictors, there are also n squared terms. The xi * xj terms between all possible pairs pf original predictors handle interactions between predictors. If there are n predictors, there (n * (n-1)) / 2 interaction terms.

Therefore, in general, if the are n original predictor variables, there are a total of n + n + (n * (n-1))/2 model weights and one bias. Behind the scenes, the derived xi^2 squared terms and the derived xi*xj interactions terms are computed programmatically on-the-fly, as opposed to explicitly creating an augmented dataset.

A quadratic regression model is trained on the implicit derived data. In spite of the multiplication of the base predictors, the model is still a linear model.

Quadratic regression is a subset of polynomial regression. For example, cubic regression includes xi^3 terms, quartic regression includes xi^4, and so on. In practice, it's rare to go beyond quadratic regression, in part because the number of derived predictors increases quickly. For example, for the demo data with just 5 predictors, cubic regression has 55 derived predictor variables, and quartic regression has 125 derived variables.

Quadratic regression is a superset of linear regression with two-way interactions. Linear regression with two-way interactions includes only the derived xi * xj terms, and omits the xi^2 terms.

Quadratic regression can seem strange if you're new to the technique. Suppose you have a predictor x0 = employee-age and a predictor x1 = employee-income. The x0 * x1 term has no physical meaning -- it's just a new math predictor variable that can deal with hidden interactions between age and income.

Training the Quadratic Regression Model
The demo program uses standard stochastic gradient descent (SGD) optimization to train the model. In high-level pseudo-code:

initialize weights and bias to small random values
loop max_epochs times
  shuffle the order of training data 
  loop each training item
    get training inputs X
    get training target value actual y
    compute predicted y using curr weights and bias
    for-each weight and the bias
      new_wt = old_wt * -1 * lrn_rate * (pred_y - actual_y) * x
    end-for
  end-loop
end-loop

The idea is to adjust each weight so that the predicted y gets closer to the actual y. The (pred_y - actual_y) is a simple measure of how far off the predicted value is and loosely represents a math gradient. If pred_y is larger than actual_y, you want to decrease the associated weight to make the pred_y smaller.

The learning rate is typically a value like 0.01 that lessens the amount of change in the weights so that there are no wild fluctuations. The learning rate must be determined by trial and error. If the learning rate is too large, SGD might jump past a good weight value. If the learning rate is too small, training might be too slow.

The -1 term controls the direction of the change in the weight being adjusted. If the difference term was (actual_y - pred_y), the -1 term could be dropped, but it's traditional to use (pred_y - actual_y). The x term in the update controls the direction of the weight change, depending on the sign of x, and the amount of change in the weight, depending on the magnitude of x.

The order in which the training data is processed is randomized in each epoch (one complete pass through all items). This prevents training from oscillating back and forth when one item's weight updates undoes the previous item's weight updates.

The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. The demo program also works with Visual Studio 2026, which was in Preview Mode at the time this article was written. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project QuadraticRegressionSGD. I checked the "Do not use top-level statements" option to avoid the strange program entry point shortcut syntax.

The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.

After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the more descriptive QuadraticRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program.

The overall program structure is presented in Listing 1. All the control logic is in the Main() method in the Program class. The Program class also holds helper functions to load data from file into memory and display data. All of the quadratic regression functionality is in a QuadraticRegressor class. The QuadraticRegressor class exposes a constructor and four methods: Train(), Predict(), Accuracy(), and MSE().

Listing 1: Overall Program Structure

using System;
using System.IO;
using System.Collections.Generic;

namespace QuadraticRegressionSGD
{
  internal class QuadraticRegressionProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin C# quadratic regression" +
        " with SGD training ");

      // 1. load data
      // 2. create and train model
      // 3. show model weights and bias
      // 4. evaluate model accuracy and MSE
      // 5. use model to make a prediction

      Console.WriteLine("End demo ");
      Console.ReadLine();
    } 

    // helpers for Main()

    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment) { . . }

    static double[] MatToVec(double[][] mat) { . . }

    static void VecShow(double[] vec, int dec, int wid) { . . }
  }

  // ===========================================================

  public class QuadraticRegressor
  {
    public double[] weights;  // regular, quad, interactions
    public double bias;
    private Random rnd;

    public QuadraticRegressor(int seed = 0)
    {
      this.weights = new double[0];  // empty, but not null
      this.bias = 0; // dummy value
      this.rnd = new Random(seed);  // shuffle order
    }

    public double Predict(double[] x) { . . }

    public void Train(double[][] trainX, double[] trainY,
      double lrnRate, int maxEpochs) { . . }

    public double Accuracy(double[][] dataX, double[] dataY,
      double pctClose) { . . }

    public double MSE(double[][] dataX, double[] dataY) { . . }
  } // class QuadraticRegressor

  // ===========================================================

} // ns

The QuadraticRegressor class declares the weights and bias class members with public scope so that they can be accessed directly to view them. The weights vector holds all the weights -- base weights, quadratic weights, and interaction weights. The 5 base weights are stored in cells [0] to [4]. The 5 quadratic weights are stored in cells [5] to [9]. The 10 interaction weights are stored in cells [10] to [19]. An alternative design is to use three separate vectors. The private Random object rnd is used by the Train() method, to shuffle the order in which training items are processed.

Loading the Data into Memory
The demo program starts by loading the 200-item training data into memory:

string trainFile =
  "..\\..\\..\\Data\\synthetic_train_200.txt";
int[] colsX = new int[] { 0, 1, 2, 3, 4 };
double[][] trainX =
  MatLoad(trainFile, colsX, ',', "#");
double[] trainY =
  MatToVec(MatLoad(trainFile,
  new int[] { 5 }, ',', "#"));

The training X data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.

The 40-item test data is loaded into memory using the same pattern that was used to load the training data:

string testFile =
  "..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX =
  MatLoad(testFile, colsX, ',', "#");
double[] testY =
  MatToVec(MatLoad(testFile,
  new int[] { 5 }, ',', "#"));

The first three training items are displayed with four decimals like so:

Console.WriteLine("First three train X: ");
for (int i = 0; i < 3; ++i)
  VecShow(trainX[i], 4, 8);
Console.WriteLine("First three train y: ");
for (int i = 0; i < 3; ++i)
  Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));

In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory.

Creating and Training the Model
The quadratic regression model is created like so:

Console.WriteLine("Creating quadratic regression model ");
QuadraticRegressor model = new QuadraticRegressor();

The constructor accepts no parameters but has hidden a default seed value of 0 for the internal Random object that's used by the Train() method. Because quadratic regression is a linear technique, the seed value has little to no effect on the prediction accuracy of the trained model.

Next, the demo prepares the training parameters and calls the Train() method:

double lrnRate = 0.001;
int maxEpochs = 1000;
Console.WriteLine("Setting lrnRate = " +
  lrnRate.ToString("F4"));
Console.WriteLine("Setting maxEpochs = " +
  maxEpochs);
Console.WriteLine("Starting SGD training ");
model.Train(trainX, trainY, lrnRate, maxEpochs);
Console.WriteLine("Done ");

The values of lrnRate and maxEpochs must be determined by trial and error. You can do this manually or programmatically ("grid search"). Good values for the learning rate and max epochs parameters are usually easy to find for quadratic regression.

After training has completed, the demo program displays the model base weights:

Console.WriteLine("Model base weights: ");
int dim = trainX[0].Length;
for (int i = 0; i < dim; ++i)
  Console.Write(model.weights[i].ToString("F4").PadLeft(8));
Console.WriteLine("");

And then the quadratic weights are displayed:

Console.WriteLine("Model quadratic weights: ");
for (int i = dim; i < dim + dim; ++i)
  Console.Write(model.weights[i].ToString("F4").PadLeft(8));
Console.WriteLine("");

And last, the interaction weights and the bias are displayed:

Console.WriteLine("Model interaction weights: ");
for (int i = dim + dim; i < model.weights.Length; ++i) {
  Console.Write(model.weights[i].ToString("F4").PadLeft(8));
  if (i > dim+dim && i % dim == 0) Console.WriteLine("");
}
Console.WriteLine("");
Console.WriteLine("Model bias/intercept: " +
  model.bias.ToString("F4").PadLeft(8));

The demo program does not use any kind of regularization, which is a technique that restricts the magnitude of the resulting model weights. Large model weights often lead to model overfitting when the model predicts well on the training data but poorly on new, previously unseen data. A simple form of regularization is weight decay. After every training epoch, the magnitudes of all weights are reduced by multiplying each by a decay constant value like 0.998, where the decay constant must be determined by trial and error.

Evaluating and Using the Model
The demo program evaluates the prediction accuracy of the trained model using these statements:

Console.WriteLine("Evaluating model ");
double accTrain = model.Accuracy(trainX, trainY, 0.10);
Console.WriteLine("Accuracy train (within 0.10) = " +
  accTrain.ToString("F4"));
double accTest = model.Accuracy(testX, testY, 0.10);
Console.WriteLine("Accuracy test (within 0.10) = " +
  accTest.ToString("F4"));

The Accuracy() method scores a prediction value as correct if it's within a specified percentage of the true target value. A reasonable closeness percentage to use will vary from problem to problem.

The model mean squared error (MSE) values are computed and displayed like so:

double mseTrain = model.MSE(trainX, trainY);
Console.WriteLine("MSE train = " + mseTrain.ToString("F4"));
double mseTest = model.MSE(testX, testY);
Console.WriteLine("MSE test = " + mseTest.ToString("F4"));

Mean squared error is a more granular metric than accuracy. MSE is useful when you want to compare two different models, for example, quadratic regression and nearest neighbors regression.

Another common evaluation metric is the coefficient of determination, R2. You can easily implement a coefficient of determination method by using the MSE() method as a template. See "Implementing a Coefficient of Determination Function Using C#."

The demo concludes by using the trained model to predict the y value for the first training item, (-0.1660, 0.4406, -0.9998, -0.3953, -0.7065):

double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
VecShow(x, 4, 9);
double predY = model.Predict(x);
Console.WriteLine("Predicted y = " + predY.ToString("F4"));

The predicted y value, 0.4843, is very close to the true target value of 0.4840. This is an especially good result because the synthetic demo data was generated by a neural network, which has complex interactions between predictor variables.

Wrapping Up
Quadratic regression has a nice balance of prediction power and interpretability. The model weights/coefficients are easy to interpret. If the predictor values have been normalized to the same scale, larger magnitudes mean larger effect, and the sign of the weights indicate the direction of the effect. But you should be a bit careful interpreting the interaction weights. If an interaction weight is positive, it can indicate an increase in y when the corresponding pair of predictor values are both positive or both negative.

Quadratic regression is not always effective -- if it were, it would be used far more often than it is. Compared to basic linear regression, quadratic regression can sometimes provide a big improvement in model prediction accuracy for a relatively small investment in effort, and so quadratic regression is usually worth exploring.

comments powered by Disqus

Featured

  • Developing Agentic Systems in .NET: From Concept to Code

    ZioNet founder Alon Fliess previews his Visual Studio Live! San Diego session on building true agentic systems in .NET -- covering the cognitive loop, MCP tool integration, multi-agent orchestration and enterprise hosting and governance with the Microsoft Agent Framework.

  • Mastering AI Development and Building AI Apps with GitHub Copilot

    Two Microsoft experts explain how GitHub Copilot is evolving from a coding assistant into a broader platform for building, customizing and testing AI-powered developer workflows.

  • VS Code 1.123 Adds Agent Session Sync, 1M Context Windows

    Microsoft released Visual Studio Code 1.123 on June 3, adding agent-focused features, larger model context support, integrated browser updates and a new delay for some automatic extension updates.

  • Copilot Billing Shock Hits Developers

    Developer complaints about GitHub Copilot's new usage-based billing model have centered on unexpectedly rapid AI credit consumption, and neither GitHub nor Microsoft has responded directly to the backlash, though they have previously published guidance to lessen model usage costs.

Subscribe on YouTube