The Data Science Lab
Linear Regression with Pseudo-Inverse Training Using JavaScript
Dr. James McCaffrey presents a complete end-to-end demonstration of linear regression with pseudo-inverse training implemented using JavaScript. Compared to other training techniques, such as stochastic gradient descent, pseudo-inverse training does not require any parameters and so it is especially simple to use.
The goal of a machine learning regression problem is to predict a single numeric value. For example, you might want to predict the bank account balance of an employee based on annual income, age, and years of work experience. There are roughly a dozen main regression techniques, such as nearest neighbors regression, kernel ridge regression, neural network regression, and gradient boosting regression. Linear regression is the most fundamental technique.
The form of a linear regression prediction equation is y' = (w0 * x0) + (w1 * x1) + . . + (wn * xn) + b where y' is the predicted value, the xi are predictor values, the wi are constants called model weights, and b is a constant called the bias. For example, y' = predicted balance = (-0.54 * income) + (-0.38 * ago) + (0.17 * experience) + 0.62. Training the model is the process of finding the values of the weights and bias so that predicted y values are close to known correct target y values in a set of training data.
[Click on image for larger view.] Figure 1: Linear Regression with Pseudo-Inverse Training in Action
There are many different techniques to train a linear regression model. Three of the most common are 1.) stochastic gradient descent (SGD), 2.) relaxed Moore-Penrose pseudo-inverse, and 3.) pseudo-inverse via normal equations closed form training. This article explains how to implement the relaxed MP pseudo-inverse training technique.
A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo begins by loading a set of synthetic training and test data into memory. The data looks like:
-0.1660, 0.4406, -0.9998, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, 0.3704, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -0.1654, 0.1174, -0.7192, 0.8054
. . .
The first five values on each line are the predictors (sometimes called features). The last value on each line is the target y value to predict. There are 200 training items and 40 test items. Next, the demo trains the linear regression model and then displays the weights and the bias:
Creating and training model
Done
Model weights/coefficients:
-0.2656 0.0333 -0.0454 0.0358 -0.1146
Model bias/intercept: 0.3619
Model weights are sometimes called model coefficients. Because there are five predictor variables, there are five weights. The bias is sometimes called the intercept. Next, the demo evaluates the trained model:
Evaluating model
Train acc (within 0.10) = 0.4600
Test acc (within 0.10) = 0.6500
Train MSE = 0.0026
Test MSE = 0.0020
For accuracy, a prediction is scored correct if it's within 10% of the true target value. The 10% closeness is arbitrary and a good percentage will vary from problem to problem. The accuracy of the model on the training data is poor -- just 46% (92 out of 200 correct) and 65% (25 out of 40 correct) on the test data. The poor accuracy of the linear regression model is due to the fact that the data has complex, non-linear relations between variables.
The MSE values are mean squared error. They aren't too easy to interpret, other than smaller MSE values are better. They are useful mostly to compare different prediction models. Behind the scenes, linear regression training with relaxed MP pseudo-inverse minimizes MSE.
The demo program concludes by using the trained model to make a prediction:
Predicting for x =
-0.1660 0.4406 -0.9998 -0.3953 -0.7065
Predicted y = 0.5329
The input is the first training item. The predicted y value of 0.5329 is about 11% away from the correct target of 0.4840 and so the prediction would be scored as incorrect using the 10% closeness criterion.
This article assumes you have intermediate or better programming skill but doesn't assume you know anything about linear regression. The demo is implemented using JavaScript for a node.js runtime environment but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is a too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.
The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, structure which, in theory, can be predicted.
All of the predictor values are between -1 and +1. There are 200 training data items and 40 test items. With linear regression, it's not necessary to normalize the training data predictor values to the same range because no distance between data items is computed. However, normalizing the predictors doesn't hurt and it's useful just in case you want to send the data to other regression algorithms that require normalization, for example, nearest neighbors regression and kernel ridge regression. Normalization also makes the model weights easier to interpret.
Three common normalization techniques are min-max, z-score, and divide-by-constant. When possible, I recommend using divide-by-constant normalization because it's simplest. For example, for a variable that holds employee age values, you could divide all age values by 100 so that the normalized values are all between 0.0 and 1.0.
In practice, linear regression is most often used with data that has strictly numeric predictor variables. But it is possible to use linear regression with categorical data. For categorical data (both ordinal with order, and nominal without order), standard one-hot encoding works for SGD training but does not work for pseudo-inverse training. For pseudo-inverse training, you should use drop-first encoding, also called dummy encoding.
For example, for a predictor variable height with possible values (short, medium, tall), one-hot encoding for SGD training would set short = (1 0 0), medium = (0 1 0), tall = (0 0 1). But drop-first encoding for pseudo-inverse training would set short = (0 0), medium = (1 0), tall = (0 1).
For binary data, you can use zero-one encoding for any kind of training. For example, for a predictor variable sex, you could set male = 0, female = 1, or vice versa.
Understanding Training Using Pseudo-Inverse
The predict() method implementation is short and simple:
predict(x)
{
let sum = 0.0;
for (let i = 0; i < x.length; ++i) {
sum += x[i] * this.weights[i];
}
sum += this.bias;
return sum;
}
Training is the process of finding the values for the this.weights array/list/vector and the this.bias constant.
In theory, the weight values of a linear regression model can be solved using the equation w = inv(X) * y, where w is a vector of weights and the bias, X is a "design matrix." which is a matrix of training predictor data that has a leading column of 1.0 values prepended, inv() is the matrix inverse operation, * means matrix-to-vector multiplication, and y is a vector of the target y values. However, this equation won't work because matrix inverse applies only to square matrices that have the same number of rows and columns, which is almost never the case in machine learning scenarios.
One work-around is to use what's called the relaxed Moore-Penrose pseudo-inverse. The math equation is: w = pinv(X) * y. This technique works because the pseudo-inverse applies to any shape matrix. The technique is called relaxed because in a machine learning scenario, the pseudo-inverse implementation doesn't have to fulfill all the mathematical requirements of true MP pseudo-inverse.
The code for the pseudo-inverse demo train() method is short but a bit deceptive because most of the work is done by non-trivial helper functions:
train(trainX, trainY)
{
// w = pinv(DX) * y
let dim = trainX[0].length; // number predictors
this.weights = this.vecMake(dim, 0.0); // allocate wts
let DX = this.matToDesign(trainX); // design matrix
let Xpinv = this.matPinv(DX); // pinv of design X
let biasAndWts = this.matVecProd(Xpinv, trainY); // Xpinv * y
this.bias = biasAndWts[0]; // bias is in cell [0]
for (let i = 1; i < biasAndWts.length; ++i)
this.weights[i-1] = biasAndWts[i];
return;
}
A design matrix takes the bias into account. For example, if a set of training predictors stored in a matrix X is:
0.60 0.87 0.42
0.82 0.51 0.94
0.74 0.83 0.53
The associated design matrix is:
1.0 0.60 0.87 0.42
1.0 0.82 0.51 0.94
1.0 0.74 0.83 0.53
OK, so all that's needed is a function to compute the pseudo-inverse of a matrix. Unfortunately, implementing a pseudo-inverse function is very difficult. Many mathematicians worked for many years on the problem, and came up with dozens of solutions. The technique used by the demo program is to compute the pseudo-inverse of a matrix using QR decomposition via the Householder algorithm.
Most library implementations of MP pseudo-inverse use a more sophisticated technique involving singular value decomposition (SVD), a blisteringly complex algorithm. The standard SVD implementation in the LAPACK numerical programming library has over 10,000 lines of Fortran code.
The good news is that the demo matPinv() pseudo-inverse method and its associated matDecomposeQR() worker method can be considered black boxes. You won't have to modify the code except in very rare situations. The bad news is that because pseudo-inverse is complicated, it can fail due to numerical instability related to arithmetic overflow or underflow. But in practice, pseudo-inverse failure is relatively rare. The chance of failure increases as the size of the set of training data increases. As a crude rule of thumb, pseudo-inverse training for linear regression usually works well when the number of training item is less than 2,000. For larger datasets, stochastic gradient descent training can be used.
For any source matrix A, the QR decomposition is A = Q * R, where * is matrix multiplication. The Q and R stand for "orthonormal." with "O" changed to "Q" so it doesn't look like the digit zero, and "right." Taking the inverse of both side of the equation gives inv(A) = inv(Q * R), which simplifies to Ainv = inv(R) * inv(Q). The R matrix is square upper triangular (zeros below the main diagonal) and it's possible to compute the inverse for this special case easily. The inverse of the Q matrix is the transpose of Q (rows and columns switched), which is also easy to compute.
To recap, to train a linear regression model, you can find the pseudo-inverse of the design matrix. To find the pseudo inverse of the design matrix, you can compute the QR decomposition and then easily compute the inverses of the resulting Q and R matrices, and then matrix-multiply those inverses together. The weights and bias can be found by multiplying the inverse matrix by the vector of target y values.
An alternative to pseudo-inverse training for linear regression is stochastic gradient descent (SGD) training. Pseudo-inverse training directly finds the values of the weights and the bias that minimize mean squared error. SGD training estimates the values of the weights and bias using an iterative technique that starts with random values, and then slowly adjusts the weights and bias to get closer and closer to the optimal values. SGD requires a learning rate parameter that controls how quickly the values of the weights change, and a maximum iterations parameter that specifies how many iterations to perform. These two parameters must be determined by trial and error.
Unlike SGD training, linear regression pseudo-inverse training does not require any parameters -- it just works (usually). But pseudo-inverse training is tricky and can sometimes fail, especially with a very large set of training data.
A second alternative to using relaxed MP pseudo-inverse training is pseudo-inverse via normal equations closed form training. The equation for this training is w = inv(Xt * X) * Xt * y, where X is the design matrix, Xt is its transpose, inv() is any regular matrix inverse function (but usually Cholesky decomposition inverse) , * is matrix multiplication, and y is a vector of target values from the training data. Closed form training can fail for large matrices because the Xt * X operation uses many multiplications, and if the matrix values are very small or very large, arithmetic underflow or overflow can occur. I noticed that on many Internet resource, both relaxed MP pseudo-inverse training and pseudo-inverse via normal equations closed form training are both called just "pseudo-inverse" training, so that term is ambiguous.
To summarize, there are three common ways to train a linear regression model: 1.) stochastic gradient descent, 2.) relaxed MP pseudo-inverse, 3.) pseudo-inverse via normal equations closed form. For very large datasets, using SGD is usually the best approach. For medium-size datasets, MP pseudo-inverse often works well. For small datasets, closed form training often works well. Note: There are many other training techniques, such as L-BFGS optimization and simplex optimization, but SGD, pseudo-inverse, and closed form are the most commonly used, at least based on my experience.
The Demo Program
I used old-fashioned Notepad to edit the demo program. You will mostly likely want to use a more sophisticated editor. Visual Studio Code is a popular choice among my work colleagues.
The demo program was written for a node.js environment. The demo program was tested using version v22.16.0 but any relatively recent version of node.js will work.
The overall program structure is presented in Listing 1. All the control logic is in a main() function. All of the linear regression functionality is in a LinearRegressor class. The LinearRegressor class exposes a constructor and four methods: train(), predict(), accuracy(), and mse(). The demo program has 20 helper methods and functions, but most of them are short and simple.
Listing 1: Overall Program Structure
// linear_regression_pinverse.js
// Moore-Penrose pseudo-inverse (QR-Householder) training
// node.js
let FS = require("fs") // for matLoad()
// ----------------------------------------------------------
class LinearRegressor
{
constructor(seed)
{
this.weights; // allocated in train()
this.bias = 0.0; // supplied in train()
}
// core functionality
predict(x) { . . }
train(trainX, trainY) { . . }
accuracy(dataX, dataY, pctClose) { . . }
mse(dataX, dataY) { . . }
// helpers for train()
matVecProd(M, v) { . . }
matToDesign(M) { . . }
matPinv(M) { . . }
// helpers for matPinv()
matDecomposeQR(M, reduced) { . . }
matInvUpperTri(U) { . . }
matTranspose(M) { . . }
matProduct(matA, matB) { . . }
// misc helpers
matMake(nRows, nCols, val) { . . }
matCopy(M) { . . }
matIdentity(n) { . . }
vecMake(n, val) { . . }
vecToMat(vec, nRows, nCols) { . . }
vecNorm(vec) { . . }
vecDot(v1, v2)
// for debugging: matShow, vecShow
matShow(M, dec, wid) { . . }
vecShow(vec, dec, wid) { . . }
} // end class LinearRegressor
// helper functions for main()
function vecShow(vec, dec, wid, nl) { . . }
function matShow(A, dec, wid) { . . }
function matToVec(M) { . . }
function matLoad(fn, delimit, usecols, comment) { . . }
function main()
{
console.log("Linear regression with pseudo-inverse" +
" (QR-Householder) training using JavaScript ");
// 1. load data
// 2. create and train linear regression model
// 3. evaluate trained model
// 4. use model to make a prediction
console.log("End demo");
}
main();
The demo program starts by loading the 200-item training data into memory:
let trainFile = ".\\Data\\synthetic_train_200.txt";
let trainX = matLoad(trainFile, ",", [0,1,2,3,4], "#");
let trainY = matLoad(trainFile, ",", [5], "#");
trainY = matToVec(trainY);
The training X data is stored into an array-of-arrays style matrix. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the matLoad() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the matToVec() helper function.
The 40-item test data is loaded into memory using the same pattern that was used to load the training data:
let testFile = ".\\Data\\synthetic_test_40.txt";
let testX = matLoad(testFile, ",", [0,1,2,3,4], "#");
let testY = matLoad(testFile, ",", [5], "#");
testY = matToVec(testY);
The first three training items are displayed with four decimals like so:
console.log("First three train X: ");
for (let i = 0; i < 3; ++i)
vecShow(trainX[i], 4, 8, true);
console.log("First three train y: ");
for (let i = 0; i < 3; ++i)
console.log(trainY[i].toFixed(4).toString().padStart(9, ' '));
In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory. The demo uses data stored in a local text file. If your data is stored somewhere else, such as a SQL database, you'll have to modify the code that reads the data, but the linear regression code can be used as-is.
Creating and Training the Model
The demo creates and trains the model using these statements:
console.log("Creating and training model ");
let model = new LinearRegressor();
model.train(trainX, trainY);
console.log("Done ");
Unlike many other regression techniques, the LinearRegressor with pseudo-inverse training constructor does not require any arguments.
The train() method needs only the training data. Because the training algorithm computes the weights and bias values directly, without using an iterative approximation, pseudo-inverse training is sometimes called closed form training.
After training, the demo displays the model weights and the bias:
console.log("Model weights/coefficients: ");
vecShow(model.weights, 4, 8, true); // 4 decimals, 8 wide, add nl
console.log("Model bias/intercept: " +
model.bias.toFixed(4).toString());
An advantage of linear regression compared to many other regression techniques is that the model is very interpretable. Assuming predictor values have roughly the same range, larger magnitude weights indicate larger importance of the associated predictor variable, and the sign of a weight indicates the direction of a prediction that's created by an increase in the value of the associated predictor variable.
Evaluating the Model
The demo program evaluates the prediction accuracy of the trained model using these statements:
let trainAcc = model.accuracy(trainX, trainY, 0.10);
let testAcc = model.accuracy(testX, testY, 0.10);
console.log("Train acc (within 0.10) = " +
trainAcc.toFixed(4).toString());
console.log("Test acc (within 0.10) = " +
testAcc.toFixed(4).toString());
The accuracy() method accepts a percentage-close argument that is used to determine if a prediction is scored as correct or not. This is a minor downside of regression accuracy, but ultimately model accuracy is usually what you're most interested in.
The demo computes the mean squared error for the trained model:
let trainMSE = model.mse(trainX, trainY);
let testMSE = model.mse(testX, testY);
console.log("Train MSE = " + trainMSE.toFixed(4).toString());
console.log("Test MSE = " + testMSE.toFixed(4).toString());
Mean squared error is the average of the squared differences between predicted y values and the correct target y values in the training data. Smaller MSE values are better. For example, if three predicted y values are 2.5, 5.5, and 3.5 and the associated correct target y values are 2, 8, and 5 then the MSE is ((2.5 - 2)^2 + (5.5 - 8)^2 + (3.5 - 5)^2) / 3 = (0.25 + 6.25 + 2.25) / 3 = 8.75 / 3 = 2.92. If a model make perfect predictions, the MSE would be 0.
In addition to accuracy and MSE, another common model evaluation metric is the coefficient of determination, R2. See my blog post for details.
Using the Model to Make Predictions
The demo concludes by using the trained model to predict the y value for the first training data item:
let x = trainX[0];
console.log("Predicting for x = ");
vecShow(x, 4, 9, true); // 4 decimals, add newline
let predY = model.predict(x);
console.log("Predicted y = " + predY.toFixed(4).toString());
The predict() method just emits the result. An option to consider is adding a verbose parameter to predict() that, when set to true, displays how the prediction was computed using the model weights and bias.
Wrapping Up
Implementing linear regression using relaxed MP pseudo-inverse training from scratch requires a bit of effort, but compared to using a library function such as those in the Python language scikit-learn library, it allows you to easily integrate a prediction model with other Web-oriented systems. And a from-scratch implementation also allows you to easily modify the system. For example, you can add error checking that is relevant to your particular problem scenario, or you can add diagnostic console.log() statements to monitor training.
It's usually not possible to know in advance if a dataset can be modeled well by a linear regression model. However, it's often a good idea to start with linear regression, because even if it's not effective, linear regression establishes baseline MSE and accuracy results for comparison with more advanced regression techniques such as kernel ridge regression, quadratic regression, and neural network regression.