The Data Science Lab

Logistic Regression with Batch SGD Training and Weight Decay Using C#

Dr. James McCaffrey from Microsoft Research presents a complete end-to-end program that explains how to perform binary classification (predicting a variable with two possible discrete values) using logistic regression, where the prediction model is trained using batch stochastic gradient descent with weight decay.

Logistic regression is a technique for binary classification -- predicting one of two discrete values. For example, you might want to predict the sex of a person (male = 0, female = 1) from their age, state of residence, annual income, and political leaning.

There are many other techniques for binary classification. Logistic regression is often considered the most fundamental. Other techniques include neural network binary classification, AdaBoost classification, Naive Bayes binary classification (for categorical data), k-nearest neighbors classification, and decision tree binary classification.

This article presents a complete demo program for logistic regression, using batch stochastic gradient descent training with weight decay. Compared to other binary classification techniques, logistic regression is easy to implement, works well with both small and large datasets, and the prediction results are highly interpretable.

There are several ways to train a logistic regression model. Compared to other training algorithms, batch stochastic gradient descent with weight decay is especially effective for large datasets.

Figure 1: Logistic Regression Using Batch Stochastic Gradient Descent Training in Action

A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading a 200-item set of training data and a 40-item set of test data into memory. The first three training predictor values and the corresponding first three target sex values are:

First 3 train X data:
0.24  1.0  0.0  0.0  0.2950  0.0  0.0  1.0
0.39  0.0  0.0  1.0  0.5120  0.0  1.0  0.0
0.63  0.0  1.0  0.0  0.7580  1.0  0.0  0.0

First 3 train Y data:
1 0 1

The demo creates a logistic regression model and trains it using these parameters:

Setting:
lrnRate = 0.0100
maxEpochs = 1000
batSize = 10
decay = 0.0001

The demo uses a technique called stochastic gradient descent (SGD) with batch processing and weight decay to train the model. The demo displays progress messages every 200 training epochs (an epoch is one pass through the training dataset):

Starting training
epoch =    0 |  loss = 0.2490
epoch =  200 |  loss = 0.2331
epoch =  400 |  loss = 0.2206
epoch =  600 |  loss = 0.2100
epoch =  800 |  loss = 0.2009
Done

The loss value is a measure of prediction error. The loss values steadily decrease, which indicates that training is working. After training, the demo computes the overall accuracy of the model on the training data and on the test data:

Evaluating trained model
Accuracy on train data: 0.8150
Accuracy on test data: 0.7500

The demo computes more granular measures of accuracy for the test data in the form of a confusion matrix:

Confusion matrix for test data:
           -------------
actual 0   |  16   10  |  0.6154
actual 1   |   0   14  |  1.0000
           -------------
predicted      0    1

The model predicts well on class 1 data items (female), but doesn't do so well on class 0 items (male), scoring just 16 out of 26 correct = 61.54% accuracy.

The demo program examines the trained model by displaying the model weights and the bias:

Model wts: 10.22  0.13  0.38  0.12 -10.23  0.61  0.11  -0.10
Model bias: 0.66

Because none of the weight and bias values are extremely large, the model looks reasonable. The largest weight values are associated with the age and income variables, which indicates that they impact the prediction of a person's sex more than the state of residence and political leaning variables do. The demo concludes by using the trained model to predict the sex of a person who is age 33, lives in Nebraska, makes $50,000, and is a political conservative:

Predicting sex for [33, Nebraska, $50,000, conservative]:
p-val = 0.4752
class 0 (male)

The computed output is p = 0.4752 and because that value is less than 0.50, the prediction is that the person is class 0 = male.

This article assumes you have intermediate or better programming skill but doesn't assume you know anything about logistic regression. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.

The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and are also available online.

The Demo Data
The 240-item synthetic raw demo data looks like:

F   24   michigan   29500.00   liberal
M   39   oklahoma   51200.00   moderate
F   63   nebraska   75800.00   conservative
M   36   michigan   44500.00   moderate
F   27   nebraska   28600.00   liberal
. . .

As is the case with most machine learning techniques, preparing the raw data is almost always the most time-consuming part of the analysis. When using logistic regression, you must encode the target variable using 0-1 encoding, you must encode categorical predictor data, and you should normalize numeric predictor data. The encoded and normalized data looks like:

1, 0.24, 1, 0, 0, 0.2950, 0, 0, 1
0, 0.39, 0, 0, 1, 0.5120, 0, 1, 0
1, 0.63, 0, 1, 0, 0.7580, 1, 0, 0
0, 0.36, 1, 0, 0, 0.4450, 0, 1, 0
1, 0.27, 0, 1, 0, 0.2860, 0, 0, 1
. . .

The target sex variable is encoded as M = 0, F = 1. The choice of which value is 0 and which is 1 is arbitrary. The two possible target values are sometimes called the "positive" class and the "negative" class. The value that is encoded as 0 is usually, but not always, called the negative value, and the 1-encoded value is usually called the positive value.

The age values are normalized by dividing by 100 so that all normalized values are between 0 and 1. This is called divide-by-constant normalization. Alternatives are min-max normalization and z-score normalization.

The state of residence variable is encoded using one-hot encoding where Michigan = 100, Nebraska = 010, Oklahoma = 001. The order is arbitrary. Except in very unusual scenarios, you should not encode categorical variables using ordinal encoding such as Michigan = 1, Nebraska = 2, Oklahoma = 3 because that implies Oklahoma is greater than Nebraska in a mathematical sense.

The income values are normalized by dividing by 100,000. The political leaning variable is encoded using one-hot encoding where conservative = 100, moderate = 010, liberal = 001. The demo data was split into a 200-item set for training and a 40-item set for testing.
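
To make the data preparation concrete, here is a minimal sketch of how one raw data line might be encoded and normalized. The EncodeRow() helper function and its exact signature are illustrative assumptions; they are not part of the demo program, which works with data that has already been prepared.

// hypothetical helper that encodes and normalizes one raw record
static double[] EncodeRow(string sex, int age, string state,
  double income, string politics)
{
  double[] result = new double[9];
  result[0] = (sex == "F") ? 1.0 : 0.0;             // M = 0, F = 1
  result[1] = age / 100.0;                          // divide-by-constant
  if (state == "michigan") result[2] = 1.0;         // one-hot 1 0 0
  else if (state == "nebraska") result[3] = 1.0;    // one-hot 0 1 0
  else if (state == "oklahoma") result[4] = 1.0;    // one-hot 0 0 1
  result[5] = income / 100000.0;                    // divide-by-constant
  if (politics == "conservative") result[6] = 1.0;  // one-hot 1 0 0
  else if (politics == "moderate") result[7] = 1.0; // one-hot 0 1 0
  else if (politics == "liberal") result[8] = 1.0;  // one-hot 0 0 1
  return result;
}

// EncodeRow("F", 24, "michigan", 29500.00, "liberal") gives
// [1, 0.24, 1, 0, 0, 0.295, 0, 0, 1]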

Understanding Logistic Regression
Suppose, as in the screenshot, a raw input data item to predict is x = (33, Nebraska, $50,000, conservative). The encoded and normalized version has eight values and is (0.33, 0,1,0, 0.5000, 1,0,0). And suppose the eight weights of the trained model are (10.22, 0.13, 0.38, 0.12, -10.23, 0.61, 0.11, -0.10) and the bias value is 0.66. Expressed mathematically, the computed output is:

z = (w0 * x0) + (w1 * x1) + (w2 * x2) + (w3 * x3) +
    (w4 * x4) + (w5 * x5) + (w6 * x6) + (w7 * x7) + b

p = 1.0 / (1.0 + exp(-z))

The w0 through w7 are the weights, and b is the bias. The x0 through x7 are the input x values. In words, the intermediate z value is the sum of the products of the weights times the corresponding input values, plus the bias. The final p-value (pseudo-probability) is 1 over 1 plus the exp() function applied to the negative of the z value. Although it's not immediately obvious, the result p-value will always be between 0 and 1. If the computed p-value is less than 0.5, the prediction is class 0 (male), and if the p-value is greater than 0.5 the prediction is class 1 (female).

For the demo data, using the rounded weight and bias values shown above, the calculations are:

z = (10.22 * 0.33) + (0.13 * 0) + (0.38 * 1) + (0.12 * 0) +
    (-10.23 * 0.5000) + (0.61 * 1) + (0.11 * 0) + (-0.10 * 0) + 0.66
  = -0.0924

p = 1.0 / (1.0 + exp(+0.0924))
  = 1.0 / (1.0 + 1.0968)
  = 1.0 / 2.0968
  = 0.4769

The demo program, which works with the unrounded weight and bias values, computes p = 0.4752.
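
Expressed in code, the output computation is short. The following sketch of the ComputeOutput() method follows the math above and uses the wts and bias fields of the LogisticRegression class shown later in Listing 1; the implementation in the demo download may differ in minor details.

public double ComputeOutput(double[] x)
{
  double z = 0.0;
  for (int j = 0; j < this.wts.Length; ++j)
    z += this.wts[j] * x[j];          // sum of weights times inputs
  z += this.bias;                     // add the bias
  return 1.0 / (1.0 + Math.Exp(-z));  // logistic sigmoid, result in (0, 1)
}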

OK, but where do the weights and bias values come from?

Training a Logistic Regression Model
Training a model is the process of finding the values of the weights and biases. The training data has known inputs and known, correct outputs (0 or 1). Training examines different combinations of values of the weights and the bias to find the values so that the computed outputs most closely match the known correct output values. Put another way, training is a numerical optimization problem -- finding the values of the weights and the bias that minimize the error between computed outputs and known correct outputs.

It's possible to randomly guess values of weights and the bias and track which set of values gives the best predictions. But this isn't feasible except with artificially tiny datasets. As it turns out, there is no "closed form" solution for finding optimal weights and bias values and so they must be estimated.

There are many different optimization algorithms that can be used to find good values for logistic regression weights and bias. Four of the most commonly used techniques are iterated Newton-Raphson, L-BFGS, dual coordinate descent, and stochastic gradient descent. Each of these techniques has many variations. If there was one clearly best technique for training, there would be only one technique. Each training technique has pros and cons.

The demo program uses stochastic gradient descent with batch training and weight decay. Compared to other training algorithms, SGD works well with both large and small datasets, is relatively robust to errors, and is relatively fast.

Understanding Batch Stochastic Gradient Descent
The classic SGD algorithm for the demo data looks like:

loop max_epochs = 1000 times
    loop each 200 train items in random order
      x = input item
      y = actual class (0 or 1)
      p = computed pseudo-probability
      loop each weight j
        wt[j] += -learn_rate * x[j] * (p - y)
      end-loop
    end-loop
  end-loop

The bias is considered a special kind of weight, one that's associated with a dummy input value of 1. Suppose that for some input x, the actual target class from the training data is 1, and the computed pseudo-probability is 0.60. The prediction (0.60) is on the correct side of the 0.5 threshold, but not as close to the target value of 1 as it could be, so the model weights should be adjusted so that the computed prediction value increases. The learning rate is a small value, typically about 0.01 or 0.05, that moderates the increase or decrease in weight values. The learning rate must be determined by trial and error.

In the pseudo-code above, the x[j] * (p - y) term is the gradient of the loss with respect to weight j. It is a value that indicates in what direction, and by how much, each model weight should change so that predicted outputs get closer to the known correct target outputs. The algorithm above, where the weights are updated after every input item x is processed, is sometimes called online SGD for historical reasons. This approach is somewhat chaotic because an update from one x input can be immediately undone by the update from the next x input.

Instead of updating the weights after every data item, a version of SGD accumulates gradients for, say, 10 items, and then updates the weights based on the accumulated gradients. The 10 is the batch size. This approach is called batch training or, sometimes, mini-batch training (the terminology is not consistent in machine learning resources). In pseudo-code:

loop max_epochs = 1000 times
    loop 200 / 10 = 20 times using random order data
      loop batch_size = 10 items
        x = input item
        y = actual class (0 or 1)
        p = computed pseudo-probability
        compute and accumulate gradient x[j] * (p - y)
      end-loop
      loop each weight j
        new_wt[j] += -learn_rate * accumulated gradient[j]
      end-loop
      zero-out accumulated gradients
    end-loop
    shrink all weights slightly (weight decay)
  end-loop

Batch training smooths out the weight updates and often leads to a logistic regression model that predicts more accurately than a model trained using the online approach.
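
The following is a minimal sketch of how the batch SGD pseudo-code, including weight decay, might be implemented as the Train() method of the LogisticRegression class in Listing 1. It assumes the ComputeOutput() and Shuffle() methods shown elsewhere, and a simple mean squared error implementation of MSELoss() for progress reporting; the code in the demo download may differ in details such as progress-message formatting.

public void Train(double[][] trainX, int[] trainY,
  double lrnRate, int maxEpochs, int batSize, double decay = 0.0)
{
  int n = trainX.Length;        // number of training items
  int dim = trainX[0].Length;   // number of predictor values
  int[] indices = new int[n];
  for (int i = 0; i < n; ++i) indices[i] = i;

  double[] accGrads = new double[dim];  // accumulated weight gradients
  double accBiasGrad = 0.0;             // accumulated bias gradient

  for (int epoch = 0; epoch < maxEpochs; ++epoch)
  {
    this.Shuffle(indices);              // visit items in random order
    int numBatches = n / batSize;       // 200 / 10 = 20 for the demo
    for (int b = 0; b < numBatches; ++b)
    {
      Array.Clear(accGrads, 0, dim);    // zero-out accumulated gradients
      accBiasGrad = 0.0;
      for (int k = 0; k < batSize; ++k)
      {
        int i = indices[b * batSize + k];
        double p = this.ComputeOutput(trainX[i]);         // pseudo-probability
        for (int j = 0; j < dim; ++j)
          accGrads[j] += trainX[i][j] * (p - trainY[i]);  // weight gradient
        accBiasGrad += (p - trainY[i]);                   // bias gradient
      }
      for (int j = 0; j < dim; ++j)     // update after each batch
        this.wts[j] -= lrnRate * accGrads[j];
      this.bias -= lrnRate * accBiasGrad;
    }

    for (int j = 0; j < dim; ++j)       // weight decay at end of epoch
      this.wts[j] *= (1.0 - decay);

    if (epoch % 200 == 0)               // show progress every 200 epochs
      Console.WriteLine("epoch = " + epoch.ToString().PadLeft(4) +
        " |  loss = " + this.MSELoss(trainX, trainY).ToString("F4"));
  }
}

public double MSELoss(double[][] dataX, int[] dataY)
{
  // mean squared difference between computed p and target y
  double sum = 0.0;
  for (int i = 0; i < dataX.Length; ++i)
  {
    double p = this.ComputeOutput(dataX[i]);
    sum += (p - dataY[i]) * (p - dataY[i]);
  }
  return sum / dataX.Length;
}

Notice that in this sketch the weight decay is applied to the weights but not to the bias, which is a common convention.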

The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project PeopleLogisticRegressionBatch. I checked the "Do not use top-level statements" option to avoid the program entry point shortcut syntax.

The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.

After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive LogisticRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program.

The overall program structure is presented in Listing 1. All the control logic is in the Main() method. All of the logistic regression functionality is in a LogisticRegression class. A Utils class holds helper functions for Main() to load data from file to memory, and functions to display vectors and matrices.

Listing 1: Overall Program Structure

using System;
using System.IO;
using System.Collections.Generic;

namespace PeopleLogisticRegressionBatch
{
  internal class LogisticRegressionProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Logistic regression with batch" +
        " training using raw C# demo ");

      // 1. load training and test data
      // 2. create model
      // 3. train model
      // 4. evaluate model 
      // 5. examine model
      // 6. use model

      Console.WriteLine("End logistic regression" +
        " raw C# demo ");
      Console.ReadLine();
    }

  } // class Program

  public class LogisticRegression
  {
    public double[] wts;
    public double bias;
    public Random rnd;

    public LogisticRegression(int seed) { . . }

    public double ComputeOutput(double[] x) { . . }

    private void Shuffle(int[] arr) { . . }

    public void Train(double[][] trainX, int[] trainY,
      double lrnRate, int maxEpochs, int batSize,
      double decay=0.0) { . . }

    public double MSELoss(double[][] dataX, int[] dataY) { . . }
 
    public double Accuracy(double[][] dataX, int[] dataY) { . . }

    public int[][] ConfusionMatrix(double[][] dataX,
      int[] dataY) { . . }

    public void ShowConfusion(int[][] cm) { . . }
  } // class LogisticRegression

  public class Utils // helpers for Main()
  {
    public static double[][] MatLoad(string fn,
      int[] usecols, char sep, string comment) { . . }

    private static double[][] MatCreate(int rows, int cols) { . . }

    public static void MatShow(double[][] m, int dec,
      int wid) { . . }

    public static void VecShow(int[] vec, int wid) { . . }

    public static void VecShow(double[] vec,
      int dec, int wid) { . . }

    public static double[][] MatExtractCols(double[][] mat,
      int[] cols) { . . }

    public static int[] MatExtractColAsIntVec(double[][] mat,
      int col) { . . }
  } // class Utils

} // ns

The demo starts by loading the 200-item training data into memory:

string fn = "..\\..\\..\\Data\\people_train.txt";
double[][] trainXY = Utils.MatLoad(fn,
  new int[] {0,1,2,3,4,5,6,7,8}, ',', "#");
double[][] trainX = Utils.MatExtractCols(trainXY,
  new int[] {1,2,3,4,5,6,7,8});
int[] trainY = Utils.MatExtractColAsIntVec(trainXY, 0);

The demo assumes that the data is stored in separate training and test files in a directory named Data, which is located in the project root directory. The comma-delimited training file, which holds both the predictor and target values, is read into memory as an array-of-arrays style matrix using the program-defined MatLoad() helper function. The predictor columns are extracted as an array-of-arrays style matrix using the MatExtractCols() function. The call to MatLoad() uses the traditional array initialization syntax to specify the columns to extract. The newer list style (available from C# version 12 onwards) would look like [0,1,2,3,4,5,6,7,8].

The target y (sex) values are extracted and converted to an integer array. In situations where the training and test data are combined into a single file, you'll need to either manually separate them into a training and test file, or programmatically separate the data.
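
The MatExtractCols() and MatExtractColAsIntVec() helper functions are conceptually simple. Plausible sketches, which may differ in minor details from the code in the demo download, are:

public static double[][] MatExtractCols(double[][] mat, int[] cols)
{
  // copy the specified columns into a new matrix
  double[][] result = new double[mat.Length][];
  for (int i = 0; i < mat.Length; ++i)
  {
    result[i] = new double[cols.Length];
    for (int j = 0; j < cols.Length; ++j)
      result[i][j] = mat[i][cols[j]];
  }
  return result;
}

public static int[] MatExtractColAsIntVec(double[][] mat, int col)
{
  // copy one column of 0-1 values, cast to int
  int[] result = new int[mat.Length];
  for (int i = 0; i < mat.Length; ++i)
    result[i] = (int)mat[i][col];
  return result;
}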

The 40-item test data is loaded in the same way:

fn = "..\\..\\..\\Data\\people_test.txt";
double[][] testXY = Utils.MatLoad(fn,
  new int[] {0,1,2,3,4,5,6,7,8}, ',', "#");
double[][] testX = Utils.MatExtractCols(testXY,
  new int[] {1,2,3,4,5,6,7,8});
int[] testY = Utils.MatExtractColAsIntVec(testXY, 0);
Console.WriteLine("Done ");

The demo displays the first three input x vectors and the associated target y values as a sanity check:

Console.WriteLine("First 3 train X data: ");
for (int i = 0; i < 3; ++i)
  Utils.VecShow(trainX[i], 4, 8); // decimals 4, width 8

Console.WriteLine("First 3 train Y data: ");
for (int i = 0; i < 3; ++i)
  Console.Write(trainY[i] + " ");
Console.WriteLine("");

In a non-demo scenario, you might want to display all the data to make sure it has been correctly loaded into memory.

Creating and Training the Logistic Regression Model
The model is instantiated like so:

Console.WriteLine("Creating logistic regression model ");
LogisticRegression lr = new LogisticRegression(seed:0);
Console.WriteLine("Done ");

The seed argument controls how the weights and bias values are randomly initialized, and how the order of training data items is scrambled during training. Different seed values can give significantly different results. However, you shouldn't spend time fiddling with the seed value because it's not conceptually part of the model.
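
A plausible sketch of the constructor and the private Shuffle() helper, assuming small random weight initialization and a standard Fisher-Yates shuffle, looks like this. These are reasonable guesses based on the class design in Listing 1 rather than the exact demo code.

public LogisticRegression(int seed)
{
  this.rnd = new Random(seed);
  this.wts = new double[8];   // 8 predictor values for the People data
  this.bias = 0.0;
  for (int j = 0; j < this.wts.Length; ++j)
    this.wts[j] = 0.02 * this.rnd.NextDouble() - 0.01;  // small random values
}

private void Shuffle(int[] arr)  // Fisher-Yates order scramble
{
  for (int i = arr.Length - 1; i > 0; --i)
  {
    int r = this.rnd.Next(0, i + 1);
    int tmp = arr[r];
    arr[r] = arr[i];
    arr[i] = tmp;
  }
}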

The statements that train the model are:

double lrnRate = 0.01;
int maxEpochs = 1000;
int batSize = 10;
double decay = 0.0001;

Console.WriteLine("Starting training ");
lr.Train(trainX, trainY, lrnRate, maxEpochs, batSize, decay);
Console.WriteLine("Done");

The lrnRate parameter controls how much the weight values change in each update. It can be difficult to tune the lrnRate and maxEpochs values. Good learning rate values often range from 0.001 to 0.10. A good maximum epochs value can be determined by examining the training progress messages.

The batch size parameter is easier to tune. Good batch sizes are often 10, 20, 32. If possible, you should pick a batch size that evenly divides the number of training items so that all processed batches have the same number of items. When a batch size of 1 is used, training is sometimes called "online training" (for historical reasons). When a batch size of all training items is used (200 in the demo), training is sometimes called "full-batch" training. When a batch size between 1 and the number of training items is used, training is sometimes called "mini-batch training" or just "batch training."

Weight Decay
The decay parameter is a bit tricky to explain. The curse of many machine learning classification techniques is called model overfitting. This happens when the model is trained too much. The resulting model predicts the training data very well, but predicts new, previously unseen data poorly.

Model overfitting is often characterized by one or more weights that are much larger in magnitude than the others. Therefore, it's common practice to try to keep the model weight values from becoming too large.

There are several ways to limit the magnitudes of the weight values. The most common technique is called L2 regularization. L2 regularization is conceptually simple (penalize the sum of the squared weight values) but is surprisingly tricky to implement correctly. A closely related technique is called weight decay: at the end of each training epoch, all weights are shrunk by multiplying them by a value that's very close to 1, such as 0.9999. It's common to specify the complement, so if the decay parameter is set to 0.0001, the code multiplies each weight by 1 - 0.0001 = 0.9999.

When using SGD, in a somewhat surprising mathematical result, L2 regularization and weight decay turn out to be equivalent. Because of this, the terms L2 regularization and weight decay are sometimes used interchangeably. However, L2 regularization and weight decay are not equivalent for most training algorithms other than SGD.

A previous Data Science Lab column explains implementing L2 regularization directly, instead of using the simpler but equivalent weight decay technique. See "Neural Network L2 Regularization Using Python."

If the decay parameter is set to 0, then the weights are not shrunk at the end of each epoch. Training is extremely sensitive to the weight decay parameter value. I recommend initially setting decay to 0 and only trying values like 0.0001 if overfitting seems to be occurring.

Evaluating the Trained Model
The demo computes overall measures of model accuracy like so:

double accTrain = lr.Accuracy(trainX, trainY);
Console.WriteLine("Accuracy on train data: " +
  accTrain.ToString("F4"));
double accTest = lr.Accuracy(testX, testY);
Console.WriteLine("Accuracy on test data: " +
  accTest.ToString("F4"));

Accuracy is just the number of correct predictions divided by the total number of predictions made. For binary classification models, it's standard practice to compute separate accuracy metrics for each class in the form of a confusion matrix:

Console.WriteLine("Confusion matrix for test data: ");
int[][] cm = lr.ConfusionMatrix(testX, testY);
lr.ShowConfusion(cm);

The four cells of the confusion matrix hold the number of "true positives" (correctly predicted as class positive), "true negatives" (correctly predicted as class negative), "false positives" (incorrectly predicted as positive, i.e., actual class is negative, predicted class is positive), and "false negatives" (incorrectly predicted as negative). However, because the labeling of target values as 0 or 1 is arbitrary, and the identification of which class is positive or negative is arbitrary, and because the definition of "false positive" and "false negative" can vary, you must be careful when interpreting any confusion matrix. The demo assumes class 0 = male = negative, and class 1 = female = positive.

The precision and recall metrics are best interpreted as just two more measures of accuracy where larger values are better. Their exact meaning is ambiguous because they depend on which class is designated as positive, which is arbitrary. The F1 score is the harmonic mean of precision and recall, so it too is just another measure of accuracy.
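
For reference, here is how precision, recall, and the F1 score can be computed from the four cells of a confusion matrix, assuming the cm[actual][predicted] layout produced by ConfusionMatrix() and the convention that class 1 is the positive class. These statements are not part of the demo program.

int tp = cm[1][1];  // true positives  (actual 1, predicted 1)
int tn = cm[0][0];  // true negatives  (actual 0, predicted 0)
int fp = cm[0][1];  // false positives (actual 0, predicted 1)
int fn = cm[1][0];  // false negatives (actual 1, predicted 0)

double precision = (1.0 * tp) / (tp + fp);
double recall = (1.0 * tp) / (tp + fn);
double f1 = (2.0 * precision * recall) / (precision + recall);

For the test data confusion matrix shown above, precision = 14 / 24 = 0.5833, recall = 14 / 14 = 1.0000, and F1 = 0.7368.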

Examining the Model
The demo examines the trained model by displaying the model weights and bias:

Console.WriteLine("Model wts: ");
Utils.VecShow(lr.wts, 2, 8);
Console.WriteLine("Model bias: " + 
  lr.bias.ToString(("F2")));

The idea here is to make sure that none of the weights has a much larger magnitude than the others, which is a sign of model overfitting. Additionally, one of the advantages of logistic regression compared to other binary classification techniques is that the trained model is highly interpretable. Larger weight values indicate the associated variable contributes more to the prediction than variables with smaller weights.

The demo concludes by predicting the sex for a new, previously unseen data item:

Console.WriteLine("Predicting sex for [33, Nebraska, " +
  "$50,000, conservative]: ");
double[] x = new double[] { 0.33, 0,1,0, 0.50000, 1,0,0 };
double pVal = lr.ComputeOutput(x);
Console.WriteLine("p-val = " + pVal.ToString("F4"));
if (pVal < 0.5)
  Console.WriteLine("class 0 (male) ");
else
  Console.WriteLine("class 1 (female) ");

A decision threshold of 0.5 is by far the most common, but for unusual scenarios you can set the decision threshold to anything between 0.0 and 1.0.
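
For example, if false negatives (failing to identify class 1 items) are especially costly, you could lower the threshold so that more items are predicted as class 1. The 0.35 value below is purely illustrative:

double threshold = 0.35;  // hypothetical custom decision threshold
if (pVal < threshold)
  Console.WriteLine("class 0 (male) ");
else
  Console.WriteLine("class 1 (female) ");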

Wrapping Up
Logistic regression was developed starting in the late 1930s as an effort to improve a binary classification technique called probit regression. Even though several more recently developed techniques, such as AdaBoost and neural network classification, can often produce better prediction models, logistic regression is still considered one of the main workhorses of machine learning.

In machine learning, the term "regression" usually means predicting a single numeric value, and the term "classification" usually means predicting a single discrete value. Because logistic regression is a binary classification technique, it might better be called logistic classification, but because logistic regression produces a p-value that is used to determine a predicted discrete value, it was named logistic regression.

Logistic regression can be adapted for multi-class classification where there are three or more possible prediction values. This can be done either by modifying the underlying code directly, or indirectly by using a technique called one-versus-rest (OvR). However, OvR can produce misleading predictions when the number of data items in one or more of the classes is greatly different from the others.
