The Data Science Lab
Kernel Ridge Regression with Cholesky Inverse Training Using JavaScript
The goal of a machine learning regression problem is to predict a single numeric value. For example, you might want to predict a person's bank savings account balance based on age, annual income, and so on.
There are approximately a dozen common regression techniques. Examples include linear regression, nearest neighbors regression, quadratic regression, decision tree regression (several types, such as random forest and gradient boost), and neural network regression. Each technique has pros and cons. A technique that often produces accurate predictions for complex data is called kernel ridge regression. Note that "kernel ridge regression" is a completely different technique than the similarly named "ridge regression."
Kernel ridge regression uses a kernel function that computes a measure of similarity between two data items, and a ridge regularization technique to discourage model overfitting. Model overfitting occurs when a model predicts well on the training data, but predicts poorly on new, previously unseen data. Ridge regularization is also called L2 regularization.
This article presents a demo of kernel ridge regression, implemented from scratch, using the JavaScript language. A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading synthetic training and test data into memory. The data looks like:
-0.1660, 0.4406, -0.9998, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, 0.3704, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -0.1654, 0.1174, -0.7192, 0.8054
0.9365, -0.3732, 0.3846, 0.7528, 0.7892, 0.1345
. . .
The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. There are 200 training items and 40 test items.
[Click on image for larger view.] Figure 1:
Kernel Ridge Regression with Cholesky Inverse Training in Action
The demo creates and trains a kernel ridge regression model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for the first training item x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065].
The first part of the demo output shows how a kernel ridge regression (KRR) model is created and trained:
Setting RBF gamma = 0.3
Setting alpha noise = 0.005
Creating and training KRR model using Cholesky inverse
Done
Kernel ridge regression requires you to specify which kernel function to use. The demo uses the radial basis function (RBF) as the kernel function. The RBF kernel requires a value for a parameter called gamma. The alpha parameter controls the ridge regularization. The values of gamma and alpha must be determined by trial and error.
Behind the scenes, the demo program trains the KRR model by constructing a kernel matrix and then computing the inverse of the matrix. Unlike iterative machine learning techniques such as stochastic gradient descent, training a KRR model using the kernel matrix inverse approach does not require learning rate and maximum epochs parameters.
The next part of the demo displays the trained model weights so they can be examined:
Model weights:
-2.0218 -1.1406 0.0758 -0.6265 . . .
0.3933 0.2223 0.0564 0.4282 . . .
. . .
-0.2014 -1.6270 -0.5825 -0.0487 . . .
If there are n training data items, a kernel ridge regression model has n weights. Because the demo has 200 training items, the model has 200 weights.
The demo concludes by evaluating the trained model and making a prediction:
Computing model accuracy
Train acc (within 0.10) = 0.9950
Test acc (within 0.10) = 0.9500
Train MSE = 0.0000
Test MSE = 0.0002
Predicting for x =
-0.1660 0.4406 -0.9998 -0.3953 -0.7065
Predicted y = 0.4941
The model accuracy is very good -- 99.5% accuracy on the training data (199 out of 200 correct) and 95.0% accuracy on the test data (38 out of 40 correct). A prediction is scored as correct if it's within 10% of the true target value.
The mean squared error (MSE) values on the training and test data are very small, which is good. MSE is a more granular measure of model goodness than accuracy. MSE values are a good way to compare models when searching for model parameter values (RBF gamma and ridge alpha).
The model prediction for the first training item (-0.1660, 0.4406, -0.9998, -0.3953, -0.7065) is 0.4941 which is within 2% of the true target value of 0.4840.
This article assumes you have intermediate or better programming skills but doesn't assume you know anything about kernel ridge regression. All normal error checking has been removed to keep the main ideas as clear as possible.
The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.
The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with small random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, non-linear structure which can be predicted.
All of the predictor values are between -1 and +1. When using kernel ridge regression, technically, it's not necessary to normalize/scale your data. But normalizing can sometimes lead to a better prediction model, especially if some raw predictor values are very large (such as employee income) and some values are small (such as employee age).
The three most common techniques to normalize numeric data are min-max normalization, z-score normalization, and divide-by-constant normalization. When possible, I recommend divide-by-constant normalization because it's simple and it retains the sign of each data item. For example, if you have a predictor variable employee age, you could divide all age values by 100.
Kernel ridge regression is most often used with data that has strictly numeric predictor variables. It is possible to use the technique with categorical predictors. If a categorical predictor variable has an inherent order, you can use equal-interval encoding. For example, a predictor variable rating with possible values (poor, average, good) could be encoded as poor = 0.25, average = 0.50, good = 0.75.
For categorical predictor variables without inherent order, such as color with three possible values (red, white, blue), you can use one-over-n-hot encoding. For instance, red = (0.3333, 0, 0), white = (0, 0.3333, 0), blue = (0, 0, 0.3333). Unlike basic one-hot encoding, one-over-n-hot encoding takes into account the number of possible values of the predictor variable.
Understanding the RBF Kernel Function
In order to understand kernel ridge regression, you must understand the radial basis function (RBF) kernel function. The RBF kernel computes a measure of similarity between two vectors. There are two RBF variants. The RBF version used by the demo program is called the gamma version. The function is defined as:
rbf(v1, v2) = exp( -1 * gamma * ||v1 - v2||^2 )
Here, v1 and v2 are two vectors, exp() is the math e constant (2.71828...) raised to a power, ||v1 – v2||^2 is squared Euclidean distance, and gamma is an arbitrary constant. Gamma is sometimes called the inverse bandwidth, or just plain bandwidth, or width, or scale, or length. Sheesh! Dealing with machine learning terminology can be somewhat annoying.
Suppose v1 = (2.40, -3.50, 1.30) and v2 = (2.0, -3.0, 1.0) and gamma = 0.60. The trailing squared Euclidean distance term is the sum of the squared differences between vector elements:
||v1 - v2||^2 = (2.40 - 2)^2 + (-3.50 - (-3))^2 + (1.30 - 1)^2
= (0.40)^2 + (-0.50)^2 + (0.30)^2
= 0.1600 + 0.2500 + 0.0900
= 0.5000
And therefore RBF(v1, v2) is:
rbf(v1, v2) = exp( -1 * gamma * ||v1 - v2||^2 )
= exp( -1 * 0.60 * 0.5000 )
= exp( -0.3000 )
= 0.7408
If two vectors are the same, then rbf(v1, v2) = 1.0 (maximum similarity). The more different v1 and v2 are, the smaller the value of RBF is. In short, the RBF of two vectors is a value between 0 and 1 where larger values mean more similar.
There is a second definition for the RBF kernel function. It is:
rbf(v1, v2) = exp( -1 * ||v1 - v2||^2 / (2 * sigma^2) )
Here, instead of multiplying by an arbitrary constant gamma, you divide by 2 times the square of an arbitrary constant sigma. This version of RBF is sometimes called the sigma version.
There are many different kernel functions. In machine learning scenarios, the RBF kernel function is the most common, at least based on my experience, and is the one used by the demo program. Other, less commonly used kernel functions include the polynomial kernel, the sigmoid kernel, and the cosine kernel.
To recap, kernel ridge regression uses a kernel function that computes a measure of similarity between two vectors. There are many different kernel functions. The most common is the radial basis function (RBF) kernel. The RBF kernel has two variations, the gamma version and the sigma version.
Kernel Ridge Regression Prediction
The kernel ridge regression prediction mechanism is best understood by looking at a concrete example. In words, to predict the y value for an input vector x, you compute the sum of the products of the model weights times the kernel function applied to x and each training item.
Suppose you have just three training data items, x0, x1, x2. A trained kernel ridge regression model will have three weights, w0, w1, w2. If the kernel function is the rbf() function, the predicted y for an input x is:
y' = (w0 * rbf(x, x0)) + (w1 * rbf(x, x1)) + (w2 * rbf(x, x2))
For example, suppose the input x to predict is (2.0, 4.0, 1.0). And suppose that for some value of gamma, the RBF kernel function values are:
rbf(x, x0) = 0.80
rbf(x, x1) = 0.50
rbf(x, x2) = 0.90
And suppose the trained model weights are w0 = 0.60, w1 = 0.70, w2 = -0.20. The predicted y is:
y' = (w0 * rbf(x, x0)) + (w1 * rbf(x, x1)) + (w2 * rbf(x, x2))
= (0.60 * 0.80) + (0.70 * 0.50) + (-0.20 * 0.90)
= 0.48 + 0.35 + -0.18
= 0.65
Many simple machine learning linear algorithms, such as linear regression, L1 (lasso) regression, and L2 (ridge) regression, require an additional weight, called the bias/intercept. For kernel ridge regression, the bias term is implicitly incorporated into the kernel function.
The demo predict() method implementation, without error-checking, is short and simple:
predict(x)
{
let n = this.trainX.length;
let sum = 0.0;
for (let i = 0; i < n; ++i) {
let xx = this.trainX[i];
let k = this.rbf(x, xx, this.gamma);
sum += this.wts[i] * k;
}
return sum;
}
Simple. But where do the model weights come from?
Training a Kernel Ridge Regression Model
Training a kernel ridge regression model is the process of finding values for the model weights (one per training item) so that predicted y values closely match the known correct target values in the training data.
There are two main ways to train a kernel ridge regression model. The first technique, and the one used by the demo program presented in this article, involves creating an n-by-n kernel matrix that compares all the training data items with each other. Then ridge regularization is applied by adding a small constant, usually named alpha, to the diagonal elements of the kernel matrix. Then the matrix inverse of the kernel matrix is computed. The inverse of the kernel matrix is multiplied by the vector of training y values, which gives the model weights.
The second technique to train a kernel ridge regression model uses stochastic gradient descent (SGD). SGD is an iterative process that loops through the training data multiple times, adjusting the model weights slowly so that the model reduces its error between predicted y values and target y values.
The matrix inverse training technique usually works well for small and medium size datasets. The matrix inverse technique has fewer training hyperparameters to deal with (no learning rate or maximum epochs) than the SGD technique. But the matrix inverse technique can't handle huge datasets, and matrix inverse computation is very complex and can fail, typically through arithmetic underflow or overflow.
Here's a concrete example of the kernel matrix inverse training technique.
Suppose there are just four training data X items:
-0.166 0.441 -1.000 -0.395 -0.707
0.078 -0.162 0.370 -0.591 0.756
-0.945 0.341 -0.165 0.117 -0.719
0.936 -0.373 0.385 0.753 0.789
And suppose the associated training target y values are: (0.48, 0.16, 0.81, 0.13).
The kernel matrix, K, holds values of the RBF kernel function applied to each pair of training items. Suppose for some value of gamma, the kernel matrix is:
1.000 0.639 0.854 0.480
0.639 1.000 0.653 0.772
0.854 0.653 1.000 0.495
0.480 0.772 0.495 1.000
The 0.653 value at K[2][1] is RBF applied to training item [2] and item [1]. Notice that the kernel values at cells [0][0], [1][1], [2][2], [3][3] are all 1.0 because they compare a data item with itself. Additionally, the kernel matrix is symmetric because rbf(v1, v2) = rbf(v2, v1).
Next, the alpha regularization constant is added to the diagonal elements. This is the "ridge" part of KRR. If alpha is set to 0.015, then the modified kernel matrix is:
1.015 0.639 0.854 0.480
0.639 1.015 0.653 0.772
0.854 0.653 1.015 0.495
0.480 0.772 0.495 1.015
Adding a small alpha value to the diagonal of the kernel matrix has two beneficial effects. First, it adds noise which prevents model overfitting. Second, it conditions the kernel matrix so that the matrix inverse computation is much less likely to fail.
Next, the inverse of the conditioned kernel matrix is computed. It is:
3.539 -0.567 -2.635 0.044
-0.567 3.128 -0.663 -1.787
-2.635 -0.663 3.642 -0.027
0.044 -1.787 -0.027 2.337
The last step to find the kernel ridge regression model weights is to matrix-multiply the vector of target y values by the inverse of the kernel matrix. The vector of y values has shape 1-by-4 and the inverse kernel matrix has shape 4-by-4, so the result is a vector with shape 1-by-4:
-0.492 -0.558 1.551 0.033
These are the model weights. Whew!
Understanding Matrix Inverse Using Cholesky Decomposition
Computing a matrix inverse is one of the most challenging problems in numerical programming. There are over a dozen algorithms to compute a matrix inverse, and each algorithm has several variations, and each variation has multiple implementation designs.
As it turns out, because the kernel matrix for kernel ridge regression has all positive values and is symmetric, it's possible to use a specialized matrix inverse algorithm called Cholesky decomposition. Cholesky decomposition inverse is simpler than general-purpose inverse algorithms.
In matrix algebra, you can decompose a matrix A into two matrices B and C so that A = B * C, where * is matrix multiplication. Cholesky decomposition decomposes a symmetric positive-valued matrix M into two matrices L and Lt, where Lt is the transpose of L. Matrix L will be a lower triangular matrix, and matrix Lt will be an upper triangular matrix.
If you can find L and Lt, then the inverse of a kernel matrix K can be computed by:
(1) K = L * Lt
(2) inv(K) = inv(L * Lt)
(3) inv(K) = inv(Lt) * inv(L)
Equation (1) is the definition of Cholesky decomposition. Equation (2) applies matrix inverse to both sides of equation (1). Equation (3) is a consequence of the matrix property that inv(A * B) = inv(B) * inv(A).
This approach seems strange but it's useful because the inverses of both L and Lt are relatively easy to compute. If you're new to computing a matrix inverse, using decomposition and then computing two matrix inverses seems overly complicated, but it turns out to be vastly simpler than computing a matrix inverse directly.
There are several algorithms that can be used to perform a Cholesky decomposition. The demo program uses a version called the Banachiewicz algorithm.
The Demo Program
I used Notepad to edit the demo program. You will probably want to use a more powerful editor. Visual Studio and Visual Studio Code are popular with my colleagues.
The demo program was written for a node.js environment. The demo program was run using version v22.16.0 but any relatively recent version will work.
The overall program structure is presented in Listing 1. All the control logic is in a main() function. All of the kernel ridge regression functionality is in a KRR class. The KRR class exposes a constructor and four methods: train(), predict(), accuracy(), and mse(). The rbf() method is used by the train() and predict() methods. The demo program has 10 helper functions.
Listing 1: Overall Program Structure
// krr_cholesky.js
// kernel ridge regression using Cholesky inverse training
// node.js environment
let FS = require("fs") // for loadTxt()
// ----------------------------------------------------------
class KRR
{
constructor(gamma, alpha)
{
this.gamma = gamma;
this.alpha = alpha;
this.trainX; // supplied in train()
this.trainY; // supplied in train()
this.wts; // supplied in train()
}
predict(x) { . . }
train(trainX, trainY) { . . }
rbf(v1, v2, gamma) { . . }
accuracy(dataX, dataY, pctClose) { . . }
mse(dataX, dataY) { . . }
}
// ----------------------------------------------------------
function main()
{
console.log("Begin Kernel Ridge " +
"Regression with Cholesky matrix inverse ");
// 1. load data
// 2. create and train KRR model
// 3. evaluate model
// 4. use model
console.log("End demo");
}
main()
// ----------------------------------------------------------
function matInverseCholesky(A) { . . }
function matDecompCholesky(M) { . . }
function vecMake(n, val) { . . }
function matMake(rows, cols, val) { . . }
function vecShow(vec, dec, wid, nl) { . . }
function matShow(A, dec, wid) { . . }
function matToVec(M) { . . }
function vecMatProd(v, A) { . . }
function matIdentity(n) { . . }
function loadTxt(fn, delimit, usecols, comment) { . . }
// ----------------------------------------------------------
The KRR class has five fields (gamma, alpha, trainX, trainY, wts) with public scope and so they can be accessed directly. Unlike some regression techniques, kernel ridge regression needs access to the X training predictor data because the predict() method uses that data. The training target y values are not needed by the demo implementation, but they are stored anyway for use in debugging.
The KRR constructor accepts gamma, which is used by the internal rbf() method, and alpha, which is used by the train() method. Because the training data is needed by both train() and predict(), a reasonable design alternative is to pass the training data to the constructor.
accepts gamma, which is used by the internal rbf() method, and alpha, which is used by the train() method. Because the training data is needed by both train() and predict(), a reasonable design alternative is to pass the training data to the constructor.
Loading the Data into Memory
The demo program starts by loading the 200-item training data into memory:
let trainFile = ".\\Data\\synthetic_train_200.txt";
let trainX = loadTxt(trainFile, ",", [0,1,2,3,4], "#");
let trainY = loadTxt(trainFile, ",", [5], "#");
trainY = matToVec(trainY);
The training X data is stored into an array-of-arrays style matrix. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the loadTxt() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the matToVec() helper function.
The 40-item test data is loaded into memory using the same pattern that was used to load the training data:
let testFile = ".\\Data\\synthetic_test_40.txt";
let testX = loadTxt(testFile, ",", [0,1,2,3,4], "#");
let testY = loadTxt(testFile, ",", [5], "#");
testY = matToVec(testY);
The first three training items are displayed with four decimals like so:
console.log("First three train X: ");
for (let i = 0; i < 3; ++i)
vecShow(trainX[i], 4, 8, true);
console.log("First three train y: ");
for (let i = 0; i < 3; ++i)
console.log(trainY[i].toFixed(4).toString().padStart(9, ' '));
In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory. The demo uses data stored in a local text file. If your data is stored somewhere else, such as a SQL database, you'll have to modify the code that reads the data, but the kernel ridge regression code can be used as-is.
Creating and Training the Model
The kernel ridge regression model is created and trained using these statements:
let gamma = 0.3; // RBF param
let alpha = 0.005; // regularization
let krr = new KRR(gamma, alpha);
krr.train(trainX, trainY);
console.log("Done ");
Good values for gamma and alpha must be determined by trial and error. You can do this manually, or programmatically ("grid search").
After training has completed, the demo program displays and analyzes the model weights:
console.log("Model weights: ");
vecShow(krr.wts, 4, 9, true);
Very large weight values, or many zero values, indicate possible model overfitting. If you have many hundreds of weights, instead of visually examining the weights, you can write a helper function to programmatically count the number of zero values, flag very large values, and so on.
Evaluating and Using the Model
The demo program evaluates the trained model prediction accuracy using these statements:
let trainAcc = krr.accuracy(trainX, trainY, 0.10);
let testAcc = krr.accuracy(testX, testY, 0.10);
console.log("Train acc (within 0.10) = " +
trainAcc.toFixed(4).toString());
console.log("Test acc (within 0.10) = " +
testAcc.toFixed(4).toString());
The accuracy() method scores a prediction value as correct if it's within a specified percentage of the true target value. A reasonable closeness percentage to use will vary from problem to problem. Next, the MSE metrics are computed and displayed.
let trainMSE = krr.mse(trainX, trainY);
let testMSE = krr.mse(testX, testY);
console.log("Train MSE = " +
trainMSE.toFixed(4).toString());
console.log("Test MSE = " +
testMSE.toFixed(4).toString());
The mse() method computes mean squared error, which is a more granular measure of model goodness than accuracy(). However, accuracy is ultimately what you are usually most interested in.
An additional model evaluation metric is the coefficient of determination, R2. You can easily implement a coefficient of determination method by using the mse() method as a template. See my blog
post for details.
The demo concludes by using the trained model to predict the y value for the first training item, (-0.1660, 0.4406, -0.9998, -0.3953, -0.7065):
let x = trainX[0];
console.log("Predicting for x = ");
vecShow(x, 4, 9, true); // add newline
let predY = krr.predict(x);
console.log("Predicted y = " + predY.toFixed(4).toString());
The predicted y value, 0.4941, is close to the true target value of 0.4840. This is a good result considering that the synthetic demo data was generated by a neural network, which has complex interactions between predictor variables.
Wrapping Up
To recap,
- Kernel ridge regression (KRR) is a machine learning technique to predict a numeric value.
- Kernel ridge regression requires a kernel function that computes a measure of similarity between two training items.
- The most common kernel function is the radial basis function (RBF).
- There are two ways to train a KRR model, kernel matrix inverse and stochastic gradient descent (SGD).
- Both training techniques require an alpha constant for ridge (aka L2) regularization to discourage model overfitting.
- For KRR matrix inverse training, you must compute the inverse of a kernel matrix of RBF applied to all pairs of training items.
- There are many techniques to compute a matrix inverse. Cholesky decomposition is a specialized, relatively simple technique that can be used for kernel matrices.
There is no single best machine learning regression technique. But when kernel ridge regression prediction works, it is often highly effective.