The Data Science Lab

Linear Support Vector Regression from Scratch Using C# with Evolutionary Training

Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the linear support vector regression (linear SVR) technique, where the goal is to predict a single numeric value. A linear SVR model uses an unusual error/loss function and cannot be trained using standard simple techniques, and so evolutionary optimization training is used.

The goal of a machine learning regression problem is to predict a single numeric value, for example, predicting a person's income based on their age, height, years of education, and so on. There are approximately a dozen common regression techniques. The most basic regression techniques are called linear because they assume the data falls along a straight line when graphed.

Linear techniques include ordinary linear regression, L1 (lasso) and L2 (ridge) regression, and linear support vector regression (linear SVR). This article presents a demo of linear SVR, implemented from scratch, using the C# language, with an evolutionary training algorithm.

A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading synthetic training and test data into memory. The data looks like:

-0.1660,  0.4406, -0.9998, -0.3953, -0.7065,  0.4840
 0.0776, -0.1616,  0.3704, -0.5911,  0.7562,  0.1568
-0.9452,  0.3409, -0.1654,  0.1174, -0.7192,  0.8054
 0.9365, -0.3732,  0.3846,  0.7528,  0.7892,  0.1345
. . .

The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. The demo creates a linear support vector regression model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065].

The first part of the demo output shows how a linear support vector regression model is created:

Creating SVR linear model
Done
Setting SVR parameters:
C = 1.00
epsilon = 0.10

Linear support vector regression requires two parameters. The C value controls how much outlier data items are penalized in the loss function. The epsilon value controls which training data items are outliers. The C and epsilon values are free parameters and must be determined by trial and error.

The next part of the demo output shows how evolutionary training is prepared:

Setting evolutionary training parameters:
popSize = 100
maxGen = 50000
alpha (parent selection pct) = 0.50
sigma (child replacement pct) = 0.50
mRate (mutation rate) = 0.25

Evolutionary training is inspired by biological processes such as genetic recombination, mutation, and natural selection. The popSize (population size) is the number of candidate solutions. The maxGen is the maximum number of generations for solutions to evolve. The alpha, sigma, and mRate parameters will be explained shortly.

Figure 1: Linear Support Vector Regression Using Evolutionary Training in Action

The next part of the demo output shows how evolutionary training progresses and the resulting model definition:

Starting evolutionary training
generation =      0  loss = 651.8370  acc (0.15) = 0.0150
generation =   5000  loss =   0.1964  acc (0.15) = 0.6200
. . .
generation =  40000  loss =   0.1810  acc (0.15) = 0.6200
generation =  45000  loss =   0.1802  acc (0.15) = 0.6200
Done

Coefficients/weights:
-0.2585  0.0325  -0.0509  0.0269  -0.1030
Bias/constant/intercept: 0.3747

Error slowly decreases and prediction accuracy slowly increases. A prediction is scored as correct if it's within 15% of the true target value. After training, the demo displays the linear SVR model weights (one per predictor variable) and the single bias value, which define the model. The demo also displays the weights and bias values that were obtained by using the Python language scikit-learn library. The scikit linear SVR model values are very close to the ones generated by the demo program.

The demo concludes by evaluating the final linear SVR model and making a prediction:

Evaluating model
Accuracy train (within 0.15) = 0.6200
Accuracy test (within 0.15) = 0.6750

Predicting for x =
 -0.1660  0.4406 -0.9998 -0.3953 -0.7065
0.5449

The model accuracy isn't very good -- only 62.00% accuracy on the training data (124 out of 200 correct) and 67.50% accuracy on the test data (27 out of 40 correct). This is because the synthetic demo data was generated by a neural network with random weights and biases, and so the data isn't linear (which you wouldn't know in advance in a non-demo scenario).

This article assumes you have intermediate or better programming skill but doesn't assume you know anything about linear support vector regression. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.

The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.

The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, non-linear structure which can be predicted.

All of the predictor values are between -1 and +1. There are 200 training data items and 40 test items. When using linear support vector regression, technically, it's not necessary to normalize/scale your data. But normalizing usually leads to a better prediction model, especially if some raw predictor values are very large (such as employee income) and some values are small (such as employee age).

Linear support vector regression is most often used with data that has strictly numeric predictor variables. It is possible to use linear SVR with mixed categorical and numeric data, by using one-over-n-hot encoding on categorical data (such as a variable color with possible values red, blue, green), and equal interval encoding on data that has inherent ordering (such as a variable size with possible values small, medium, large).

Understanding Linear Support Vector Regression
For regular (not support vector) linear regression, with five predictor variables, the prediction equation is y' = (w0 * x0) + (w1 * x1) + (w2 * x2) + (w3 * x3) + (w4 * x4) + b. The x0 through x4 are the predictor values. The w0 through w4 are the model weights, also called the coefficients. The b is the model bias, also called the constant or the intercept. Linear SVR uses the same prediction equation, except that the values of the weights and bias are determined differently than in standard linear regression.

For example, the demo trained SVR model weights are (-0.2585, 0.0325, -0.0509, 0.0269, -0.1030), and the bias is 0.3747. For an input x = (-0.1660, 0.4406, -0.9998, -0.3953, -0.7065), the predicted y is:

y' = (w0 * x0) + (w1 * x1) + (w2 * x2) + (w3 * x3) + (w4 * x4) + b
   = (-0.2585 * -0.1660) +
     ( 0.0325 *  0.4406) +
     (-0.0509 * -0.9998) +
     ( 0.0269 * -0.3953) +
     (-0.1030 * -0.7065) + 0.3747
   = 0.0429 + 0.0143 + 0.0509 + -0.0106 + 0.0728 + 0.3747
   = 0.5449
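
In C# terms, the Predict() method is just this weighted sum plus the bias. Here is a minimal sketch, consistent with the class design shown later in Listing 1 (the actual demo implementation may differ in minor details):

public double Predict(double[] x)
{
  // y' = (w0 * x0) + (w1 * x1) + . . + (w4 * x4) + b
  double sum = 0.0;
  for (int i = 0; i < x.Length; ++i)
    sum += this.weights[i] * x[i];
  return sum + this.bias;
}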

For regular linear regression, during training, all data items contribute equally to the error/loss via mean squared error. But for linear support vector regression, items with predicted y values that are close (within a small distance epsilon) to the actual target y values do not directly contribute to the loss. Only items where the predicted y value differs from the true target y value by more than epsilon contribute to the loss. This is called epsilon-insensitive loss. The idea is shown in Figure 2.

Figure 2: Linear Support Vector Regression Epsilon-Insensitive Loss

The diagram in Figure 2 gives you a rough idea of support vector regression for a scenario where there is just one predictor variable x. Each dot is a training data item. The red line is the linear prediction equation created using epsilon-insensitive loss. The epsilon value (Greek ε, often about 0.10) creates a tube around the prediction equation. Data items that fall within the tube do not contribute to the loss value. Each data item that falls outside the tube generates a loss value usually denoted by the Greek letter xi (ξ).

The loss function is (1/2 * ||w||^2) + (C * sum(xi values)). The ||w||^2 is the squared vector norm of the weights (including bias). For example, if there are just three predictors, the linear prediction equation is y' = (w0 * x0) + (w1 * x1) + (w2 * x2) + b and ||w||^2 = w0^2 + w1^2 + w2^2 + b^2. The C is a free parameter, usually about 1.0 or 2.0. A larger value of C penalizes outlier data items more than a smaller value. The idea of minimizing the magnitudes of the weights is very subtle: it creates a "flatter" prediction equation, meaning the weight values are relatively in the same magnitude-range rather than wildly different. In machine learning terminology, this is called regularization.
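
Expressed in code, the epsilon-insensitive loss for a candidate solution (the weights followed by the bias) might look like the following sketch. It matches the LossUsing() signature shown later in Listing 1, but it's an illustration rather than the demo's exact implementation:

public double LossUsing(double[] soln,
  double[][] dataX, double[] dataY,
  double C, double eps)
{
  // regularization term: 1/2 * ||w||^2 (includes bias)
  double normSq = 0.0;
  for (int j = 0; j < soln.Length; ++j)
    normSq += soln[j] * soln[j];

  // epsilon-insensitive terms (the xi values)
  int n = soln.Length - 1;  // number of weights
  double sumXi = 0.0;
  for (int i = 0; i < dataX.Length; ++i)
  {
    double yPred = soln[n];  // bias
    for (int j = 0; j < n; ++j)
      yPred += soln[j] * dataX[i][j];
    double diff = Math.Abs(yPred - dataY[i]);
    if (diff > eps)  // outside the epsilon tube
      sumXi += diff - eps;
  }
  return (0.5 * normSq) + (C * sumXi);
}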

Now here's where the difficulty of linear support vector regression arises. The loss function is not calculus-differentiable, which means that normal techniques for finding the values of the weights and bias don't work, in particular the common stochastic gradient descent (SGD) optimization algorithm. This means you must either use extremely complex techniques like quadratic programming algorithms, or use an algorithm like evolutionary optimization.

Understanding Evolutionary Optimization
In high-level pseudo-code, evolutionary optimization is:

create a population of several possible random solutions
loop max_generations times
  pick 2 good parent solutions from population
  use parents to create a child solution
  mutate child solution slightly
  replace a bad solution in population with child
  keep track of best solution found so far
end-loop
return best solution found

The process is illustrated in Figure 3. For the demo problem with five predictor variables, a possible solution is a vector with six values: the five weights followed by the bias.

Figure 3: Evolutionary Optimization

Evolutionary optimization is a meta-heuristic rather than a specific algorithm, meaning there are many possible implementations.
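
Putting the pseudo-code into C# terms, the TrainEvo() method has roughly the following structure. This is a simplified sketch that uses the helper method signatures shown later in Listing 1; the demo's actual code adds progress logging and other details:

public void TrainEvo(double[][] trainX, double[] trainY,
  double C, double epsilon, int popSize, int maxGen,
  double alpha = 0.50, double sigma = 0.50,
  double mRate = 0.10)
{
  int solnLen = trainX[0].Length + 1;  // wts + bias
  Cell[] pop = this.MakePopulation(popSize, solnLen,
    trainX, trainY, C, epsilon);

  // scan initial population for best solution
  double[] bestSoln = new double[solnLen];
  double bestLoss = double.MaxValue;
  for (int i = 0; i < popSize; ++i)
  {
    if (pop[i].loss < bestLoss)
    {
      bestLoss = pop[i].loss;
      Array.Copy(pop[i].chromo, bestSoln, solnLen);
    }
  }

  for (int gen = 0; gen < maxGen; ++gen)
  {
    int[] parents = this.SelectTwoGood(pop, alpha);
    Cell child = this.MakeChild(pop, parents,
      trainX, trainY, C, epsilon);
    this.Mutate(child, mRate, trainX, trainY,
      C, epsilon);
    int badIdx = this.SelectBadIdx(pop, sigma);
    pop[badIdx] = child;  // replace a weak solution
    if (child.loss < bestLoss)
    {
      bestLoss = child.loss;
      Array.Copy(child.chromo, bestSoln, solnLen);
    }
  }

  // transfer best solution found to model
  this.weights = new double[solnLen - 1];
  for (int j = 0; j < solnLen - 1; ++j)
    this.weights[j] = bestSoln[j];
  this.bias = bestSoln[solnLen - 1];
}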

To select two good parents, the demo system first picks a random subset consisting of alpha = 0.50 (50 percent) of the solutions in the population, and then finds the two best solutions in that subset. The larger alpha is, the more likely the very best parents will be selected. But having some variability prevents the search from getting stuck on a particular solution.
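
A sketch of the parent selection mechanism, again using the signatures from Listing 1. The Shuffle() helper (assumed to be a standard Fisher-Yates shuffle) scrambles an array of indices so that the first numToExamine entries identify a random subset of the population:

private int[] SelectTwoGood(Cell[] pop, double pct)
{
  // examine a random pct of the population,
  // return the indices of the two smallest-loss cells
  int[] indices = new int[pop.Length];
  for (int i = 0; i < pop.Length; ++i)
    indices[i] = i;
  this.Shuffle(indices);
  int numToExamine = (int)(pop.Length * pct);

  int bestIdx = -1; double bestLoss = double.MaxValue;
  int secondIdx = -1; double secondLoss = double.MaxValue;
  for (int i = 0; i < numToExamine; ++i)
  {
    int idx = indices[i];
    if (pop[idx].loss < bestLoss)
    {
      secondIdx = bestIdx; secondLoss = bestLoss;
      bestIdx = idx; bestLoss = pop[idx].loss;
    }
    else if (pop[idx].loss < secondLoss)
    {
      secondIdx = idx; secondLoss = pop[idx].loss;
    }
  }
  return new int[] { bestIdx, secondIdx };
}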

To create a child solution, the demo picks a random crossover point near the middle of a parent solution, and then combines the left half of parent 1 and the right half of parent 2.
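
A possible implementation of crossover, consistent with the MakeChild() signature in Listing 1 (the exact demo code may differ):

private Cell MakeChild(Cell[] pop, int[] parents,
  double[][] dataX, double[] dataY,
  double C, double eps)
{
  int solnLen = pop[0].chromo.Length;  // 6 for demo
  Cell child = new Cell(solnLen);
  // random crossover point near the middle
  int split = this.rnd.Next((solnLen / 2) - 1,
    (solnLen / 2) + 2);  // 2, 3, or 4 for demo
  for (int j = 0; j < split; ++j)  // left of parent 1
    child.chromo[j] = pop[parents[0]].chromo[j];
  for (int j = split; j < solnLen; ++j)  // right of parent 2
    child.chromo[j] = pop[parents[1]].chromo[j];
  child.loss = this.LossUsing(child.chromo,
    dataX, dataY, C, eps);
  return child;
}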

To mutate a child solution, the mutation method walks through the child values. At each value, a random number between 0.0 and 1.0 is generated. If the random number is less than the mutation rate (relatively rarely), the current child value is replaced by a random value between -1.0 and +1.0.
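
A sketch of the mutation mechanism. The child's loss must be recomputed after mutation, which explains why the Mutate() signature in Listing 1 accepts the training data and the C and epsilon values:

private void Mutate(Cell child, double mRate,
  double[][] dataX, double[] dataY,
  double C, double eps)
{
  for (int j = 0; j < child.chromo.Length; ++j)
  {
    double p = this.rnd.NextDouble();  // [0.0, 1.0)
    if (p < mRate)  // mutate this value (rarely)
      child.chromo[j] =
        (this.rnd.NextDouble() * 2.0) - 1.0;  // [-1, +1)
  }
  child.loss = this.LossUsing(child.chromo,
    dataX, dataY, C, eps);
}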

To select a weak solution for replacement in the population, the demo system uses a technique similar to the one used to pick a good parent. The system picks a random subset consisting of sigma = 0.50 (50 percent) of the solutions in the population, and then finds the worst (largest loss/error) solution in that subset. The larger sigma is, the more likely the system will pick the absolute worst solution in the current population. Having some variability prevents the population from stagnating.
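
The SelectBadIdx() method mirrors the parent selection sketch, except that the largest loss in the random subset wins:

private int SelectBadIdx(Cell[] pop, double pct)
{
  // examine a random pct of the population,
  // return the index of the largest-loss cell
  int[] indices = new int[pop.Length];
  for (int i = 0; i < pop.Length; ++i)
    indices[i] = i;
  this.Shuffle(indices);
  int numToExamine = (int)(pop.Length * pct);
  int worstIdx = indices[0];
  for (int i = 0; i < numToExamine; ++i)
  {
    int idx = indices[i];
    if (pop[idx].loss > pop[worstIdx].loss)
      worstIdx = idx;
  }
  return worstIdx;
}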

The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project SupportVectorRegressionLinearEvo. I checked the "Do not use top-level statements" option to avoid the weird program entry point shortcut syntax.

The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.

After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the more descriptive SupportVectorRegressionEvoProgram.cs. I allowed Visual Studio to automatically rename class Program.

The overall program structure is presented in Listing 1. All the control logic is in the Main() method in the Program class. The Program class also holds helper functions to load data from file into memory and display data. All of the linear support vector regression functionality is in a SupportVectorRegressorLinear class. The SupportVectorRegressorLinear class exposes a constructor and four primary methods: TrainEvo(), Predict(), Accuracy(), and LossUsing().

Listing 1: Overall Program Structure

using System;
using System.IO;
using System.Collections.Generic;

namespace SupportVectorRegressionLinearEvo
{
  internal class SupportVectorRegressionEvoProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin linear support vector " +
        "regression with evolutionary training demo ");

      // 1. load data
      // 2. create and train linear SVR model
      // 2b. show trained model weights and bias
      // 3. evaluate model accuracy
      // 4. use model to predict

      Console.WriteLine("End demo ");
      Console.ReadLine();
    } // Main()

    // ------------------------------------------------------
    // helpers for Main()
    // ------------------------------------------------------

    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment) { . . }

    static double[] MatToVec(double[][] mat) { . . }

    static void VecShow(double[] vec, int dec, int wid) { . . }
  }

  public class SupportVectorRegressorLinear
  {
    // --------------------------------------------
    public class Cell
    {
      public double[] chromo; // wts and bias
      public double loss;  // e-insensitive loss

      public Cell(int solnLen)
      {
        this.chromo = new double[solnLen];
        this.loss = double.MaxValue;
      }
    }
    // --------------------------------------------

    public double[] weights; // aka coefficients
    public double bias;  // aka constant, intercept
    private Random rnd;

    public SupportVectorRegressorLinear(int seed) { . . }

    public double Predict(double[] x) { . . }

    public double Accuracy(double[][] dataX, double[] dataY,
      double pctClose) { . . }

    public void TrainEvo(double[][] trainX, double[] trainY,
      double C, double epsilon,
      int popSize, int maxGen, double alpha = 0.50,
      double sigma = 0.50, double mRate = 0.10) { . . }

    public double LossUsing(double[] soln,
      double[][] dataX, double[] dataY,
      double C, double eps) { . . }

    private double AccuracyUsing(double[] soln,
      double[][] dataX, double[] dataY,
      double pctClose) { . . }

    private int SelectBadIdx(Cell[] pop, double pct) { . . }

    private void Mutate(Cell child, double mRate,
      double[][] dataX, double[] dataY,
      double C, double eps) { . . }

    private Cell MakeChild(Cell[] pop,
      int[] parents, double[][] dataX, double[] dataY,
      double C, double eps) { . . }

    private int[] SelectTwoGood(Cell[] pop, double pct) { . . }

    private int SelectGoodIdx(Cell[] pop, double pct) { . . }

    private void Shuffle(int[] arr) { . . }

    private Cell[] MakePopulation(int popSize,
      int solnLen, double[][] trainX, double[] trainY,
      double C, double eps) { . . }
  } 
} // ns

The demo system encapsulates a possible solution and its associated epsilon-insensitive error/loss in a local Cell class. The possible solution, five weights followed by a bias, is stored in a vector field named chromo (short for chromosome).

Loading the Data into Memory
The demo program starts by loading the 200-item training data into memory:

string trainFile =
  "..\\..\\..\\Data\\synthetic_train_200.txt";
int[] colsX = new int[] { 0, 1, 2, 3, 4 };
double[][] trainX =
  MatLoad(trainFile, colsX, ',', "#");
double[] trainY =
  MatToVec(MatLoad(trainFile,
  new int[] { 5 }, ',', "#"));

The training X data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
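
For reference, a minimal sketch of what the MatLoad() and MatToVec() helpers might look like is shown below. This is an illustrative version, not necessarily the demo's exact code. The using System.IO and using System.Collections.Generic statements at the top of Listing 1 support the File and List<T> types:

static double[][] MatLoad(string fn, int[] usecols,
  char sep, string comment)
{
  List<double[]> rows = new List<double[]>();
  string[] lines = File.ReadAllLines(fn);
  foreach (string line in lines)
  {
    if (line.Trim() == "") continue;  // skip blank lines
    if (line.StartsWith(comment)) continue;  // skip comments
    string[] tokens = line.Split(sep);
    double[] row = new double[usecols.Length];
    for (int j = 0; j < usecols.Length; ++j)
      row[j] = double.Parse(tokens[usecols[j]]);
    rows.Add(row);
  }
  return rows.ToArray();
}

static double[] MatToVec(double[][] mat)
{
  // assumes a single-column matrix
  double[] result = new double[mat.Length];
  for (int i = 0; i < mat.Length; ++i)
    result[i] = mat[i][0];
  return result;
}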

The 40-item test data is loaded into memory using the same pattern that was used to load the training data:

string testFile =
  "..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX =
  MatLoad(testFile, colsX, ',', "#");
double[] testY =
  MatToVec(MatLoad(testFile,
  new int[] { 5 }, ',', "#"));

The first three training items are displayed like so:

Console.WriteLine("First three train X: ");
for (int i = 0; i < 3; ++i)
  VecShow(trainX[i], 4, 8);
Console.WriteLine("First three train y: ");
for (int i = 0; i < 3; ++i)
  Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));

In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory.

Creating the Model
The linear SVR model is created like so:

Console.WriteLine("Creating SVR linear model ");
SupportVectorRegressorLinear model = 
  new SupportVectorRegressorLinear(seed: 0);
Console.WriteLine("Done ");
double C = 1.0;
double epsilon = 0.10;
Console.WriteLine("Setting SVR parameters: ");
Console.WriteLine("C = " + C.ToString("F2"));
Console.WriteLine("epsilon = " +  epsilon.ToString("F2"));

The linear SVR constructor accepts only a seed value for an internal random number generator that's used by most of the evolutionary optimization routines. The C and epsilon values that are specific to the epsilon-insensitive error/loss calculation are declared separately and passed to the TrainEvo() method later. The idea here is to design the SupportVectorRegressorLinear class so that a different training method can be swapped in without changing the constructor signature.
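
Under that design, the constructor is minimal. A sketch, assuming the weights array is allocated later during training:

public SupportVectorRegressorLinear(int seed)
{
  this.rnd = new Random(seed);
  this.weights = null;  // allocated by TrainEvo()
  this.bias = 0.0;
}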

Training the Model
The demo prepares and performs evolutionary training using these statements:

int popSize = 100;
int maxGen = 50000;
double alpha = 0.50;
double sigma = 0.50;
double mRate = 0.25;
Console.WriteLine("Starting evolutionary training ");
model.TrainEvo(trainX, trainY, C, epsilon,
  popSize, maxGen, alpha, sigma, mRate);
Console.WriteLine("Done");

The parent selection pressure, alpha, and the child replacement selection pressure, sigma, are set to 0.50, which is a value that has worked well for me across a wide range of problems.

Evolutionary optimization is highly sensitive to the mutation rate. In practice, most of the time spent on finding good model parameters is dedicated to finding a good combination of population size and mutation rate.

The demo program displays the five model weights and the bias:

Console.WriteLine("Coefficients/weights: ");
for (int i = 0; i < model.weights.Length; ++i)
  Console.Write(model.weights[i].ToString("F4") + "  ");
Console.WriteLine("Bias/constant/intercept: " +
  model.bias.ToString("F4"));

Because weights and bias are public members of the SupportVectorRegressorLinear class, they can be accessed directly, without the need for accessor properties. Very large weight values are not good and indicate you should experiment with a different mutation rate.

Evaluating and Using the Model
The demo program evaluates the trained model accuracy using these statements:

Console.WriteLine("Evaluating model ");
double accTrain = model.Accuracy(trainX, trainY, 0.15);
Console.WriteLine("Accuracy train (within 0.15) = " +
  accTrain.ToString("F4"));
double accTest = model.Accuracy(testX, testY, 0.15);
Console.WriteLine("Accuracy test (within 0.15) = " +
  accTest.ToString("F4"));

Accuracy is not a very granular metric. You can also use the public LossUsing() method to display the epsilon-insensitive loss, but this value isn't easy to interpret, other than smaller values are better. You could implement an error method that computes mean squared error, but linear SVR is not designed to minimize MSE.
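
For example, a hypothetical MseError() method (not part of the demo program) could be added to the class along these lines:

public double MseError(double[][] dataX, double[] dataY)
{
  // mean squared error, for diagnostics only
  double sum = 0.0;
  for (int i = 0; i < dataX.Length; ++i)
  {
    double yPred = this.Predict(dataX[i]);
    sum += (yPred - dataY[i]) * (yPred - dataY[i]);
  }
  return sum / dataX.Length;
}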

The demo concludes by using the trained linear SVR model to predict the y value for the first training item, (-0.1660, 0.4406, -0.9998, -0.3953, -0.7065):

double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
VecShow(x, 4, 8);
double y = model.Predict(x);
Console.WriteLine(y.ToString("F4"));

The predicted y value, 0.5449, is not very close to the true target value of 0.4840. But, as mentioned previously, linear SVR is not well-suited for complex, non-linear data.

Wrapping Up
Linear support vector regression is now rarely used in practice. The idea was proposed in a 1998 research paper, but even that paper was lukewarm about how useful linear SVR could be. However, this didn't stop inexperienced researchers from investigating and promoting linear SVR in the late 1990s.

During training, linear support vector regression penalizes outlier data items more than non-outlier items via the C and epsilon constants, and keeps the magnitudes of the model weight and bias values small via the ||w||^2 term in the epsilon-insensitive error/loss function. However, L1 regression (also called lasso regression) and L2 regression (also called ridge regression) do the same things in a much simpler way. This was widely understood by about the year 2002, and linear SVR quickly lost favor with researchers and practitioners. However, linear SVR is still used in some legacy systems.

In the late 1990s, linear support vector regression was extended to a technique called just support vector regression (without the "linear"). This extension technique uses something called the kernel trick to deal with non-linear data. However, kernelized SVR is now rarely used. A technique called kernel ridge regression (KRR) is simpler than kernelized SVR, easier to train than kernelized SVR, and based on my experience at least, KRR usually creates a more accurate prediction model than kernelized SVR.

