The Data Science Lab

Kernel Ridge Regression with Cholesky Inverse Training Using C#

Dr. James McCaffrey presents a complete end-to-end demonstration of the kernel ridge regression technique to predict a single numeric value. The demo uses the kernel matrix inverse (Cholesky decomposition) technique for model training. There is no single best machine learning regression technique, but when kernel ridge regression prediction works, it is often highly accurate.

The goal of a machine learning regression problem is to predict a single numeric value. For example, you might want to predict a person's bank savings account balance based on their age, years of work experience, and so on.

There are approximately a dozen common regression techniques. Examples include linear regression, k-nearest neighbors regression, decision tree regression (several types, such as random forest), and neural network regression. Each technique has pros and cons. A technique that often produces accurate predictions for complex data is called kernel ridge regression. Note: "kernel ridge regression" is very different from the similarly named "ridge regression."

Kernel ridge regression uses a kernel function that computes a measure of similarity between two data items, and a ridge regularization technique to limit model overfitting. Model overfitting occurs when a model predicts well on the training data, but predicts poorly on new, previously unseen data. Ridge regularization is also known as L2 regularization.

This article presents a demo of kernel ridge regression, implemented from scratch, using the C# language. A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading synthetic training and test data into memory. The data looks like:

-0.1660,  0.4406, -0.9998, -0.3953, -0.7065,  0.4840
 0.0776, -0.1616,  0.3704, -0.5911,  0.7562,  0.1568
-0.9452,  0.3409, -0.1654,  0.1174, -0.7192,  0.8054
 0.9365, -0.3732,  0.3846,  0.7528,  0.7892,  0.1345
. . .

The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. There are 200 training items and 40 test items.

Figure 1: Kernel Ridge Regression with Cholesky Inverse Training in Action

The demo creates and trains a kernel ridge regression model, evaluates the model accuracy on the training and test data, and then uses the model to predict the target y value for the first training item x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065].

The first part of the demo output shows how a kernel ridge regression (KRR) model is created and trained:

Setting RBF gamma = 0.3
Setting alpha noise = 0.005
Creating and training KRR model using Cholesky
Done

Kernel ridge regression requires you to specify which kernel function to use. The demo uses the radial basis function (RBF) as the kernel function. The RBF kernel requires a value for a parameter called gamma. The alpha parameter controls the ridge regularization. The values of gamma and alpha must be determined by trial and error.

Behind the scenes, the demo program trains the KRR model by constructing a kernel matrix and then computing the inverse of the matrix. Unlike many machine learning techniques, training a KRR model using the kernel matrix inverse approach does not require learning rate and maximum epochs parameters.

The next part of the demo displays the trained model weights so they can be examined:

Model weights:
 -2.0218  -1.1406   0.0758  -0.6265  . . .
  0.3933   0.2223   0.0564   0.4282  . . .
. . .
 -0.2014  -1.6270  -0.5825  -0.0487  . . .

If there are n training data items, a kernel ridge regression model has n weights. Because the demo has 200 training items, the model has 200 weights.

The demo concludes by evaluating the trained model and making a prediction:

Computing model accuracy
Train acc (within 0.10) = 0.9950
Test acc (within 0.10) = 0.9500

Train MSE = 0.0000
Test MSE = 0.0002

Predicting for x =
  -0.1660   0.4406  -0.9998  -0.3953  -0.7065
Predicted y = 0.4941

The model accuracy is very good -- 99.5% accuracy on the training data (199 out of 200 correct) and 95.0% accuracy on the test data (38 out of 40 correct). A prediction is scored as correct if it's within 10% of the true target value.

The model mean squared error (MSE) values on the training and test data are very small, which is good. MSE is a more granular measure of model goodness than accuracy. MSE values are a good way to compare models when searching for model parameter values (RBF gamma and ridge alpha). The model's prediction for the first training item (-0.1660, 0.4406, -0.9998, -0.3953, -0.7065) is 0.4941, which is reasonably close to the true target value of 0.4840.

This article assumes you have intermediate or better programming skill but doesn't assume you know anything about kernel ridge regression. The demo is implemented using C# but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.

The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and they're also available online.

The Demo Data
The demo data is synthetic. It was generated by a 5-10-1 neural network with random weights and bias values. The idea here is that the synthetic data does have an underlying, but complex, non-linear structure which can be predicted.

All of the predictor values are between -1 and +1. When using kernel ridge regression, technically, it's not necessary to normalize/scale your data. But normalizing usually leads to a better prediction model, especially if some raw predictor values are very large (such as employee income) and some values are small (such as employee age).

The three most common techniques to normalize numeric data are min-max normalization, z-score normalization, and divide-by-constant normalization. When possible, I recommend divide-by-constant normalization. For example, if you have a predictor variable employee age, you could divide all age values by 100.

Kernel ridge regression is most often used with data that has strictly numeric predictor variables. It is possible to use the technique with categorical predictors. If a categorical predictor variable has an inherent order, you can use equal-interval encoding. For example, a predictor variable height with possible values (short, medium, tall) could be encoded as short = 0.25, medium = 0.50, tall = 0.75.

For categorical predictor variables without inherent order, such as fruit with three possible values (apple, banana, cherry), you can use one-over-n-hot encoding. For instance, apple = (0.3333, 0, 0), banana = (0, 0.3333, 0), cherry = (0, 0, 0.3333). Unlike basic one-hot encoding, one-over-n-hot encoding takes into account the number of possible values of the predictor variable.
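The two categorical encodings described above can be sketched in C#. This is a minimal illustration, not part of the demo program: the class name EncodingSketch and the helper names EqualInterval() and OneOverNHot() are hypothetical, and the (index + 1) / (n + 1) formula is an assumed generalization of the three-value height example.

```csharp
using System;

public static class EncodingSketch
{
  // Equal-interval encoding for an ordered categorical value.
  // index is the 0-based position of the value and n is the number
  // of possible values. Assumed formula: (index + 1) / (n + 1),
  // which gives 0.25, 0.50, 0.75 when n = 3.
  public static double EqualInterval(int index, int n)
  {
    return (index + 1.0) / (n + 1.0);
  }

  // One-over-n-hot encoding for an unordered categorical value.
  // The slot for the value holds 1/n and all other slots hold 0.
  public static double[] OneOverNHot(int index, int n)
  {
    double[] result = new double[n];  // elements start at 0.0
    result[index] = 1.0 / n;
    return result;
  }
}
```

For example, EncodingSketch.OneOverNHot(1, 3) corresponds to banana = (0, 0.3333, 0) when apple, banana, cherry are indexed 0, 1, 2.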

Understanding the RBF Kernel Function
In order to understand kernel ridge regression, you must understand the radial basis function (RBF) kernel function. The RBF kernel computes a measure of similarity between two vectors. There are two RBF versions. The RBF version used by the demo program is called the gamma version. The function is defined as:

rbf(v1, v2) = exp( -1 * gamma * ||v1 - v2||^2 )

Here, v1 and v2 are two vectors, exp() is the math e constant (2.71828...) raised to a power, ||v1 - v2||^2 is squared Euclidean distance, and gamma is an arbitrary constant. Gamma is sometimes called the inverse bandwidth, or just plain bandwidth, or width, or scale, or length. Dealing with many different names for the same variable is a significant challenge in machine learning.

Suppose v1 = (2.50, -3.25, 1.20) and v2 = (2.0, -3.0, 1.0) and gamma = 0.40. The trailing squared Euclidean distance term is the sum of the squared differences between vector elements:

||v1 - v2||^2 = (2.50 - 2)^2 + (-3.25 - (-3))^2 + (1.20 - 1)^2
              = (0.50)^2 + (-0.25)^2 + (0.20)^2
              = 0.2500 + 0.0625 + 0.0400
              = 0.3525

And then RBF(v1, v2) is:

rbf(v1, v2) = exp( -1 * gamma * ||v1 - v2||^2 )
            = exp( -1 * 0.40 * 0.3525 )
            = exp( -0.1410 )
            = 0.8685

If two vectors are the same, then rbf(v1, v2) = 1.0 (maximum similarity). The more different v1 and v2 are, the smaller the value of RBF is. Put another way, the RBF of two vectors is a value greater than 0 and at most 1, where larger values mean more similar.
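The gamma version of the RBF calculation can be sketched in a few lines of C#. The signature matches the Rbf() declaration in the demo's KRR class, but the body and the class name RbfGammaSketch are my own minimal illustration and assume the two vectors have the same length.

```csharp
using System;

public static class RbfGammaSketch
{
  // Gamma version of the RBF kernel: exp(-gamma * ||v1 - v2||^2).
  // Assumes v1 and v2 have the same length (no error checking).
  public static double Rbf(double[] v1, double[] v2, double gamma)
  {
    double sum = 0.0;
    for (int i = 0; i < v1.Length; ++i) {
      double diff = v1[i] - v2[i];
      sum += diff * diff;  // accumulate squared Euclidean distance
    }
    return Math.Exp(-gamma * sum);
  }
}
```

Calling Rbf() with v1 = (2.50, -3.25, 1.20), v2 = (2.0, -3.0, 1.0) and gamma = 0.40 gives approximately 0.8685, matching the worked example. Passing the same vector for both arguments gives exactly 1.0.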

Somewhat confusingly, there is a second definition for the RBF kernel function. It is:

rbf(v1, v2) = exp( -1 * ||v1 - v2||^2 / (2 * sigma^2) )

Here, instead of multiplying by an arbitrary constant gamma, you divide by 2 times the square of an arbitrary constant sigma. This version of RBF is sometimes called the sigma version.

Notice that gamma = 1 / (2 * sigma^2), and sigma = sqrt(1 / (2 * gamma)). For the example vectors above, with gamma = 0.40, sigma = sqrt(1 / (2 * 0.40)) = sqrt(1.25) = 1.1180. If you use this value in the sigma version of the RBF definition, you will get the same 0.8685 result.
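The relationship between the two RBF versions can be checked with a short sketch. The class name RbfSigmaSketch and the helper names GammaToSigma(), SigmaToGamma() and RbfSigma() are hypothetical and are not part of the demo program.

```csharp
using System;

public static class RbfSigmaSketch
{
  // sigma = sqrt(1 / (2 * gamma))
  public static double GammaToSigma(double gamma)
  {
    return Math.Sqrt(1.0 / (2.0 * gamma));
  }

  // gamma = 1 / (2 * sigma^2)
  public static double SigmaToGamma(double sigma)
  {
    return 1.0 / (2.0 * sigma * sigma);
  }

  // Sigma version of the RBF kernel:
  // exp(-||v1 - v2||^2 / (2 * sigma^2)).
  public static double RbfSigma(double[] v1, double[] v2, double sigma)
  {
    double sum = 0.0;
    for (int i = 0; i < v1.Length; ++i) {
      double diff = v1[i] - v2[i];
      sum += diff * diff;  // squared Euclidean distance
    }
    return Math.Exp(-sum / (2.0 * sigma * sigma));
  }
}
```

With gamma = 0.40, GammaToSigma() returns approximately 1.1180, and feeding that sigma to RbfSigma() with the example vectors gives approximately 0.8685, the same as the gamma version.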

There are many different kernel functions. In machine learning scenarios, the RBF kernel function is the most common (at least, based on my experience), and is the one used by the demo program. Other, less commonly used kernel functions include the polynomial kernel, the sigmoid kernel, and the cosine kernel.

In math and research literature, a general kernel function is often written as K(x, x') to indicate any kernel function can be used. Sometimes kernel functions are defined so that they accept just a single vector, and you see K(x - x') to indicate you perform vector subtraction and then feed the result to the single-parameter kernel function.

To recap, kernel ridge regression uses a kernel function that compares two vectors. There are many different kernel functions. The most common is the radial basis function (RBF) kernel. The RBF kernel has two variations, the gamma version and the sigma version.

Kernel Ridge Regression Prediction
The kernel ridge regression prediction mechanism is best understood by looking at a concrete example. In words, to predict the y value for an input vector x, you compute the sum of the products of the model weights times the kernel function applied to x and each training item.

Suppose you have just three training data items, x0, x1, x2. A trained kernel ridge regression model will have three weights, w0, w1, w2. If the kernel function is the rbf() function, the predicted y for an input x is:

y' = (w0 * rbf(x, x0)) + (w1 * rbf(x, x1)) + (w2 * rbf(x, x2))

For example, suppose the input x to predict is (2.0, 4.0, 1.0). And suppose that for some value of gamma, the RBF kernel function values are:

rbf(x, x0) = 0.80
rbf(x, x1) = 0.50
rbf(x, x2) = 0.90

And suppose the trained model weights are w0 = 0.60, w1 = 0.70, w2 = -0.20. The predicted y is:

y' = (w0 * rbf(x, x0)) + (w1 * rbf(x, x1)) + (w2 * rbf(x, x2))
   = (0.60 * 0.80) + (0.70 * 0.50) + (-0.20 * 0.90)
   = 0.48 + 0.35 + -0.18
   = 0.65

Many simple machine learning linear algorithms, such as linear regression, L1 (lasso) regression, and L2 (ridge) regression, require an additional weight, called the bias/intercept. For kernel ridge regression, the bias term is implicitly incorporated into the kernel function.

The demo Predict() method implementation, without error-checking, is short and simple:

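public double Predict(double[] x)
{
  // weighted sum of kernel similarities between x and each training item
  int N = this.trainX.Length;
  double sum = 0.0;
  for (int i = 0; i < N; ++i) {
    double[] xx = this.trainX[i];
    double k = Rbf(x, xx, this.gamma);  // static Rbf() takes gamma explicitly
    sum += this.wts[i] * k;
  }
  return sum;
}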

OK, fine. But where do the model weights come from?

Training a Kernel Ridge Regression Model
Training a kernel ridge regression model is the process of finding values for the n model weights (one per training item) so that predicted y values closely match the known correct target values in the training data.

There are two main ways to train a kernel ridge regression model. The first technique, and the one used by the demo program presented in this article, involves creating an n-by-n kernel matrix that compares all the training data items with each other. Then ridge regularization is applied by adding a small alpha constant to the diagonal elements of the kernel matrix. Then the matrix inverse of the kernel matrix is computed. The inverse of the kernel matrix is multiplied by the vector of training y values, which gives the model weights.

The second technique to train a kernel ridge regression model uses stochastic gradient descent (SGD). SGD is an iterative process that loops through the training data multiple times, adjusting the model weights slowly so that the model reduces its error between predicted y values and target y values.

The matrix inverse training technique usually works well for small and medium size datasets. The matrix inverse technique has fewer training hyperparameters to deal with (no learning rate or maximum epochs) than the SGD technique. But the matrix inverse technique can't handle huge datasets, and matrix inverse computation is very complex and can fail.

Here's a concrete example of the kernel matrix inverse training technique.

Suppose there are just four training data X items:

 -0.166  0.441 -1.000 -0.395 -0.707
  0.078 -0.162  0.370 -0.591  0.756
 -0.945  0.341 -0.165  0.117 -0.719
  0.936 -0.373  0.385  0.753  0.789

And suppose the associated training y values are: (0.48, 0.16, 0.81, 0.13)

The kernel matrix, K, holds values of the RBF kernel function applied to each pair of training items. Suppose for some value of gamma, the kernel matrix is:

  1.000  0.639  0.854  0.480
  0.639  1.000  0.653  0.772
  0.854  0.653  1.000  0.495
  0.480  0.772  0.495  1.000

The 0.653 value at K[2][1] is RBF applied to training item [2] and item [1]. Notice that the kernel values at cells [0][0], [1][1], [2][2], [3][3] are all 1.0 because they compare a data item with itself. Additionally, the kernel matrix is symmetric because rbf(v1, v2) = rbf(v2, v1).
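Building the kernel matrix can be sketched as a double loop over the training items. This is a minimal illustration rather than the demo's Train() code: the class name KernelMatrixSketch and the function name MakeKernelMatrix() are hypothetical. Because the matrix is symmetric, only the upper triangle is computed and each value is mirrored.

```csharp
using System;

public static class KernelMatrixSketch
{
  // Gamma version of the RBF kernel.
  public static double Rbf(double[] v1, double[] v2, double gamma)
  {
    double sum = 0.0;
    for (int i = 0; i < v1.Length; ++i) {
      double diff = v1[i] - v2[i];
      sum += diff * diff;
    }
    return Math.Exp(-gamma * sum);
  }

  // n-by-n matrix where K[i][j] = rbf(X[i], X[j]).
  public static double[][] MakeKernelMatrix(double[][] X, double gamma)
  {
    int n = X.Length;
    double[][] K = new double[n][];
    for (int i = 0; i < n; ++i)
      K[i] = new double[n];

    for (int i = 0; i < n; ++i) {
      for (int j = i; j < n; ++j) {
        double k = Rbf(X[i], X[j], gamma);
        K[i][j] = k;
        K[j][i] = k;  // symmetric: rbf(v1, v2) = rbf(v2, v1)
      }
    }
    return K;
  }
}
```

The text does not state which gamma produced the example matrix. As an assumption, calling MakeKernelMatrix() on the four items above with gamma = 0.1 gives values that agree with the displayed matrix to about three decimals.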

Next, the alpha regularization constant is added to the diagonal elements. This is the "ridge" part of KRR. If alpha is set to 0.015, then the modified kernel matrix is:

  1.015  0.639  0.854  0.480
  0.639  1.015  0.653  0.772
  0.854  0.653  1.015  0.495
  0.480  0.772  0.495  1.015

Adding a small alpha value to the diagonal of the kernel matrix has two beneficial effects. First, it adds noise which prevents model overfitting. Second, it conditions the kernel matrix so that the matrix inverse computation is less likely to fail.

Next, the inverse of the conditioned kernel matrix is computed. It is:

  3.539 -0.567 -2.635  0.044
 -0.567  3.128 -0.663 -1.787
 -2.635 -0.663  3.642 -0.027
  0.044 -1.787 -0.027  2.337

The last step to find the kernel ridge regression model weights is to matrix-multiply the vector of target y values by the inverse of the kernel matrix. The vector of y values has shape 1-by-4 and the inverse kernel matrix has shape 4-by-4, so the result is a vector with shape 1-by-4. Using the unrounded target values from the data shown earlier (0.4840, 0.1568, 0.8054, 0.1345), the result is:

 -0.492 -0.558  1.551  0.033

These are the model weights. Whew! Expressed mathematically:

(1) Y = w * K(X,X)
(2) Y * inv(K(X,X)) = w * K(X,X) * inv(K(X,X))
(3) Y * inv(K(X,X)) = w

The w is the vector of the weights. X is a matrix of training items. Y is a vector of training target values. K(X,X) is the kernel matrix that consists of RBF values applied to all possible pairs of training items, with alpha added to the diagonal elements. The * indicates matrix multiplication and inv() indicates matrix inversion.

Equation (1) is the starting assumption. In equation (2) both sides have been matrix-multiplied by inv(K(X,X)). If A is any invertible square matrix, A * inv(A) = I (identity), and so the result simplifies to equation (3). In words, to calculate the model weights, compute the kernel matrix of all pairs of training X data, add alpha to the diagonal, then find the inverse of that matrix, then multiply the target Y values by the inverse.
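Equation (3) is just a row-vector-times-matrix product. The sketch below is a minimal illustration: the name VecMatProd() appears in the demo's MatUtils class, but the body and the class name WeightsSketch are my own assumptions.

```csharp
using System;

public static class WeightsSketch
{
  // Row vector times matrix: result[j] = sum over i of v[i] * A[i][j].
  // Assumes v.Length equals the number of rows of A.
  public static double[] VecMatProd(double[] v, double[][] A)
  {
    int nRows = A.Length;
    int nCols = A[0].Length;
    double[] result = new double[nCols];  // elements start at 0.0
    for (int j = 0; j < nCols; ++j)
      for (int i = 0; i < nRows; ++i)
        result[j] += v[i] * A[i][j];
    return result;
  }
}
```

Applying VecMatProd() to the y values (0.4840, 0.1568, 0.8054, 0.1345) and the 4-by-4 inverse matrix shown above gives weights that agree with (-0.492, -0.558, 1.551, 0.033) to within about 0.001. The small differences arise because the displayed inverse is rounded to three decimals.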

Understanding Matrix Inverse Using Cholesky Decomposition
Computing a matrix inverse is one of the most challenging problems in numerical programming. There are over a dozen algorithms to compute a matrix inverse, and each algorithm has several variations, and each variation has multiple implementation designs.

As it turns out, because the RBF kernel matrix for kernel ridge regression is symmetric and, after alpha has been added to the diagonal, positive definite, it's possible to use a specialized matrix inverse technique based on Cholesky decomposition. Note that having all positive values is not enough by itself; the matrix must be positive definite.

In matrix algebra, you can decompose a matrix A into two matrices B and C so that A = B * C, where * is matrix multiplication. Cholesky decomposition decomposes a symmetric positive definite matrix M into two matrices L and Lt where Lt is the transpose of L. Matrix L will be a lower triangular matrix, and matrix Lt will be an upper triangular matrix.

If you can find L (and Lt), then the inverse of a kernel matrix K can be computed by:

(1) K = L * Lt
(2) inv(K) = inv(L * Lt)
(3) inv(K) = inv(Lt) * inv(L)

Equation (1) is the definition of Cholesky decomposition. Equation (2) applies matrix inverse to both sides of equation (1). Equation (3) is a consequence of the matrix property that inv(A * B) = inv(B) * inv(A).

This process is useful because the inverses of both L and Lt are relatively easy to compute. If you're new to computing a matrix inverse, using decomposition and then computing two matrix inverses seems overly complicated, but it turns out to be vastly simpler than computing a matrix inverse directly.
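To show why the triangular inverses are easy, here is a minimal sketch of equation (3). It is not the demo's MatInverseFromCholesky() code; the class name CholeskyInverseSketch and the helper name InverseLower() are hypothetical. The sketch inverts L column by column using forward substitution, and then uses the fact that inv(Lt) is the transpose of inv(L).

```csharp
using System;

public static class CholeskyInverseSketch
{
  // Inverse of a lower triangular matrix L. Column j of the inverse
  // is found by forward substitution. Assumes non-zero diagonal.
  public static double[][] InverseLower(double[][] L)
  {
    int n = L.Length;
    double[][] inv = new double[n][];
    for (int i = 0; i < n; ++i)
      inv[i] = new double[n];

    for (int j = 0; j < n; ++j) {
      inv[j][j] = 1.0 / L[j][j];
      for (int i = j + 1; i < n; ++i) {
        double sum = 0.0;
        for (int k = j; k < i; ++k)
          sum += L[i][k] * inv[k][j];
        inv[i][j] = -sum / L[i][i];
      }
    }
    return inv;
  }

  // inv(K) = inv(Lt) * inv(L) = transpose(inv(L)) * inv(L)
  public static double[][] InverseFromCholesky(double[][] L)
  {
    int n = L.Length;
    double[][] Li = InverseLower(L);
    double[][] result = new double[n][];
    for (int i = 0; i < n; ++i)
      result[i] = new double[n];

    for (int i = 0; i < n; ++i) {
      for (int j = 0; j < n; ++j) {
        double sum = 0.0;
        // Li is lower triangular, so terms with k < max(i, j) are zero
        for (int k = Math.Max(i, j); k < n; ++k)
          sum += Li[k][i] * Li[k][j];
        result[i][j] = sum;
      }
    }
    return result;
  }
}
```

As a small check, if L = [[2, 0], [1, 3]] then K = L * Lt = [[4, 2], [2, 10]], and InverseFromCholesky(L) returns approximately [[0.2778, -0.0556], [-0.0556, 0.1111]], which is inv(K).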

There are several algorithms that can be used to perform a Cholesky decomposition. The demo program uses a version called the Banachiewicz algorithm.
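A minimal sketch of the row-by-row (Banachiewicz) form of Cholesky decomposition is shown below. It is my own illustration, not the demo's MatDecompCholesky() code; the class name CholeskyDecompSketch and the function name Decompose() are hypothetical. Unlike the demo, which omits error checking, the sketch throws an exception if the matrix turns out not to be positive definite.

```csharp
using System;

public static class CholeskyDecompSketch
{
  // Returns lower triangular L such that A = L * Lt.
  // A must be symmetric and positive definite.
  public static double[][] Decompose(double[][] A)
  {
    int n = A.Length;
    double[][] L = new double[n][];
    for (int i = 0; i < n; ++i)
      L[i] = new double[n];

    for (int i = 0; i < n; ++i) {       // process one row at a time
      for (int j = 0; j <= i; ++j) {    // columns up to the diagonal
        double sum = 0.0;
        for (int k = 0; k < j; ++k)
          sum += L[i][k] * L[j][k];

        if (i == j) {
          double d = A[i][i] - sum;
          if (d <= 0.0)
            throw new ArgumentException(
              "Matrix is not positive definite");
          L[i][j] = Math.Sqrt(d);       // diagonal element
        }
        else {
          L[i][j] = (A[i][j] - sum) / L[j][j];  // below diagonal
        }
      }
    }
    return L;
  }
}
```

For the matrix A = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]], Decompose() returns L = [[2, 0, 0], [6, 1, 0], [-8, 5, 3]].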

The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project KernelRidgeRegressionCholesky. I checked the "Do not use top-level statements" option to avoid the strange program entry point shortcut syntax.

The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.

After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the more descriptive KRRCholeskyProgram.cs. I allowed Visual Studio to automatically rename class Program.

The overall program structure is presented in Listing 1. All the control logic is in the Main() method in the KRRCholeskyProgram class. All of the kernel ridge regression functionality is in a KRR class. The KRR class exposes a constructor and four methods: Train(), Predict(), Accuracy(), and MSE(). The private Rbf() method is used by the Train() and Predict() methods. The demo program has a utility class named MatUtils that contains static methods to load data from a text file into a matrix, compute a matrix inverse, and display data.

Listing 1: Overall Program Structure

using System;
using System.IO;
using System.Collections.Generic;

namespace KernelRidgeRegressionCholesky
{
  internal class KRRCholeskyProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin Kernel Ridge " +
        "Regression with Cholesky matrix inverse training ");

      // 1. load data from file into memory
      // 2. create and train KRR model
      // 3. display and analyze model weights
      // 4. evaluate trained model
      // 5. use trained model to make a prediction

      Console.WriteLine("End demo ");
      Console.ReadLine();
    } // Main()

  } // class KRRCholeskyProgram

  // ========================================================

  public class KRR
  {
    public double gamma;  // for RBF kernel
    public double alpha;  // regularization noise
    public double[][] trainX;  // need for any prediction
    public double[] trainY;  // not necessary
    public double[] wts;  // one per trainX item

    public KRR(double gamma, double alpha) { . . }

    public void Train(double[][] trainX, double[] trainY) { . . }

    public double Predict(double[] x) { . . }
 
    public double Accuracy(double[][] dataX, double[] dataY,
      double pctClose) { . . }

    private static double Rbf(double[] v1, double[] v2,
      double gamma) { . . }

    public double MSE(double[][] dataX, double[] dataY) { . . }

  } // class KRR

  // ========================================================

  public class MatUtils
  {
    public static double[][] MatLoad(string fn,
      int[] usecols, char sep, string comment) { . . }

    public static double[] MatToVec(double[][] A) { . . }

    public static void VecAnalyze(double[] vec) { . . }

    public static double[][] MatMake(int nRows, int nCols) { . . }

    public static double[][] MatMult(double[][] matA,
      double[][] matB) { . . }

    public static double[] VecMatProd(double[] v,
      double[][] A) { . . }
 
    public static double[][] MatIdentity(int n) { . . }

    public static double[][] MatInverseCholesky(double[][] A)
    {
      double[][] L = MatDecompCholesky(A);
      double[][] result = MatInverseFromCholesky(L);
      return result;

      // ----------------------------------------------------
      static double[][] MatDecompCholesky(double[][] M) { . . }
      static double[][] MatInverseFromCholesky(double[][] L) { . . }
      // ----------------------------------------------------
    }

    public static void MatShow(double[][] m, int dec, int wid) { . . }

    public static void VecShow(double[] vec, int dec, int wid) { . . }

  } // class MatUtils

} // ns

The KRR class declares its five fields with public scope so that they can be accessed directly. Unlike some regression techniques, kernel ridge regression needs access to the X training predictor data because the Predict() method uses that data. The training target y values are not needed by the demo implementation, but they are stored anyway.

The KRR constructor accepts gamma, which is used by the internal Rbf() method, and alpha, which is used by the Train() method. Because the training data is needed by both Train() and Predict(), a reasonable design alternative is to pass the training data to the constructor.

Loading the Data into Memory
The demo program starts by loading the 200-item training data into memory:

string trainFile = "..\\..\\..\\Data\\synthetic_train_200.txt";
double[][] trainX = MatUtils.MatLoad(trainFile,
  new int[] { 0, 1, 2, 3, 4 }, ',', "#");
double[] trainY = MatUtils.MatToVec(MatUtils.MatLoad(trainFile,
  new int[] { 5 }, ',', "#"));

The training X data is stored into an array-of-arrays style matrix of type double. The data is assumed to be in a directory named Data, which is located in the project root directory. The arguments to the MatLoad() function mean load columns 0, 1, 2, 3, 4 where the data is comma-delimited, and lines beginning with "#" are comments to be ignored. The training y data in column [5] is loaded into a matrix and then converted to a one-dimensional vector using the MatToVec() helper function.
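The demo's MatLoad() implementation is not shown in this article. As a rough idea of what a loader with that signature might look like, here is a minimal sketch. The body and the class name LoadSketch are my own assumptions, with no error checking.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class LoadSketch
{
  // Loads the selected columns of a delimited text file into an
  // array-of-arrays matrix. Blank lines and lines that start with
  // the comment string are skipped.
  public static double[][] MatLoad(string fn, int[] usecols,
    char sep, string comment)
  {
    List<double[]> rows = new List<double[]>();
    foreach (string line in File.ReadLines(fn)) {
      string s = line.Trim();
      if (s.Length == 0 ||
        s.StartsWith(comment, StringComparison.Ordinal))
        continue;

      string[] tokens = s.Split(sep);
      double[] row = new double[usecols.Length];
      for (int j = 0; j < usecols.Length; ++j) {
        // invariant culture so "0.4406" parses the same everywhere
        row[j] = double.Parse(tokens[usecols[j]],
          CultureInfo.InvariantCulture);
      }
      rows.Add(row);
    }
    return rows.ToArray();
  }
}
```

The sketch reads the file one line at a time, so it handles files of any length, but a malformed line will throw an exception.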

The 40-item test data is loaded into memory using the same pattern that was used to load the training data:

string testFile = "..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX = MatUtils.MatLoad(testFile,
  new int[] { 0, 1, 2, 3, 4 }, ',', "#");
double[] testY = MatUtils.MatToVec(MatUtils.MatLoad(testFile,
  new int[] { 5 }, ',', "#"));

The first three training items are displayed with four decimals like so:

Console.WriteLine("First three X predictors: ");
for (int i = 0; i < 3; ++i)
  MatUtils.VecShow(trainX[i], 4, 8);
Console.WriteLine("First three target y: ");
for (int i = 0; i < 3; ++i)
  Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));

In a non-demo scenario, you might want to display all the training data to make sure it was correctly loaded into memory.

Creating and Training the Model
The kernel ridge regression model is created and trained like so:

double gamma = 0.3;    // RBF param
double alpha = 0.005;  // regularization
KRR krr = new KRR(gamma, alpha);
krr.Train(trainX, trainY);
Console.WriteLine("Done ");

Good values for gamma and alpha must be determined by trial and error. You can do this manually, or programmatically along the lines of:

double[] gammas = new double[] { 0.01, 0.10, 0.50, 1.00 };
double[] alphas = new double[] { 0.0001, 0.001, 0.01, 0.015 };
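foreach (double g in gammas) {    // candidate RBF gamma values
  foreach (double a in alphas) {  // candidate ridge alpha values
    // g and a avoid clashing with the gamma and alpha declared earlier
    KRR model = new KRR(g, a);
    model.Train(trainX, trainY);
    // compute and display accuracy and error
  }
}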

After training has completed, the demo program displays and analyzes the model weights:

Console.WriteLine("Model weights: ");
MatUtils.VecShow(krr.wts, 4, 9);

Very large weight values, or many zero values, indicate possible model overfitting. If you have many hundreds of weights, instead of visually examining the weights, you can write a helper function to programmatically analyze for number of zero values, flag large values, and so on.
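A helper along those lines might look like the following sketch. The class name WeightAnalysisSketch and the function names CountNearZero() and MaxAbs() are hypothetical, and the threshold for "near zero" is an assumption that you would tune for your problem.

```csharp
using System;

public static class WeightAnalysisSketch
{
  // Number of weights whose magnitude is less than eps.
  public static int CountNearZero(double[] wts, double eps)
  {
    int count = 0;
    for (int i = 0; i < wts.Length; ++i)
      if (Math.Abs(wts[i]) < eps)
        ++count;
    return count;
  }

  // Largest weight magnitude, useful for flagging extreme values.
  public static double MaxAbs(double[] wts)
  {
    double max = 0.0;
    for (int i = 0; i < wts.Length; ++i)
      if (Math.Abs(wts[i]) > max)
        max = Math.Abs(wts[i]);
    return max;
  }
}
```

For example, after training you could call WeightAnalysisSketch.CountNearZero(krr.wts, 1.0e-5) and WeightAnalysisSketch.MaxAbs(krr.wts) and display the results.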

Evaluating and Using the Model
The demo program evaluates the trained model prediction accuracy using these statements:

double trainAcc = krr.Accuracy(trainX, trainY, 0.10);
double testAcc = krr.Accuracy(testX, testY, 0.10);
Console.WriteLine("Train acc (within 0.10) = " + trainAcc.ToString("F4"));
Console.WriteLine("Test acc (within 0.10) = " + testAcc.ToString("F4"));

The Accuracy() method scores a prediction value as correct if it's within a specified percentage of the true target value. A reasonable closeness percentage to use will vary from problem to problem. Next:

double trainMSE = krr.MSE(trainX, trainY);
double testMSE = krr.MSE(testX, testY);
Console.WriteLine("Train MSE = " + trainMSE.ToString("F4"));
Console.WriteLine("Test MSE = " + testMSE.ToString("F4"));

The MSE() method computes mean squared error, a more granular measure of model goodness than Accuracy(). However, accuracy is easier to interpret and ultimately what you are usually most interested in. Most of the regression modules in the Python language scikit-learn library use the coefficient of determination, R^2, as the primary evaluation metric. But in my opinion, mean squared error is preferable. You can easily implement a coefficient of determination method by using the MSE() method as a template. See my recent post.
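The three evaluation metrics discussed in this section can be sketched as follows. To keep the sketch self-contained, these versions accept arrays of already-computed predictions instead of calling Predict() on each data item the way the demo's Accuracy() and MSE() methods do, so the signatures differ from the demo's. The class name MetricsSketch and the function name R2() are hypothetical.

```csharp
using System;

public static class MetricsSketch
{
  // Fraction of predictions within pctClose of the true target.
  public static double Accuracy(double[] predicted, double[] actual,
    double pctClose)
  {
    int numCorrect = 0;
    for (int i = 0; i < actual.Length; ++i) {
      double allowed = Math.Abs(pctClose * actual[i]);
      if (Math.Abs(predicted[i] - actual[i]) < allowed)
        ++numCorrect;
    }
    return (numCorrect * 1.0) / actual.Length;
  }

  // Mean squared error: average of squared differences.
  public static double MSE(double[] predicted, double[] actual)
  {
    double sum = 0.0;
    for (int i = 0; i < actual.Length; ++i) {
      double diff = predicted[i] - actual[i];
      sum += diff * diff;
    }
    return sum / actual.Length;
  }

  // Coefficient of determination: 1 - (SS_res / SS_tot).
  // Assumes the actual values are not all identical.
  public static double R2(double[] predicted, double[] actual)
  {
    double mean = 0.0;
    for (int i = 0; i < actual.Length; ++i)
      mean += actual[i];
    mean /= actual.Length;

    double ssRes = 0.0;  // sum of squared residuals
    double ssTot = 0.0;  // total sum of squares about the mean
    for (int i = 0; i < actual.Length; ++i) {
      ssRes += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
      ssTot += (actual[i] - mean) * (actual[i] - mean);
    }
    return 1.0 - (ssRes / ssTot);
  }
}
```

Notice that R2() has the same loop structure as MSE(), which is why MSE() makes a good template for it.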

The demo concludes by using the trained model to predict the y value for the first training item, (-0.1660, 0.4406, -0.9998, -0.3953, -0.7065):

double[] x = trainX[0];
Console.WriteLine("Predicting for x = ");
MatUtils.VecShow(x, 4, 9);
double predY = krr.Predict(x);
Console.WriteLine("Predicted y = " + predY.ToString("F4"));

The predicted y value, 0.4941, is close to the true target value of 0.4840. This is a good result considering that the synthetic demo data was generated by a neural network, which has complex interactions between predictor variables.

In some scenarios, you might want to save the trained model weights so that the model can be used by other systems. The easiest way to do this is to implement a SaveWeights() method that writes a single line of comma-delimited model weight values to a specified text file. The weights can be loaded by implementing a LoadWeights() method. Because kernel ridge regression prediction requires the training data items, you'd need to save them too.
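The demo program does not include SaveWeights() or LoadWeights(). A minimal sketch of the single-line, comma-delimited approach is shown below. The class name WeightsIoSketch is hypothetical, the methods are written as static helpers rather than KRR class members, and there is no error checking.

```csharp
using System;
using System.Globalization;
using System.IO;

public static class WeightsIoSketch
{
  // Writes the weights as one line of comma-delimited values.
  public static void SaveWeights(string fn, double[] wts)
  {
    string[] parts = new string[wts.Length];
    for (int i = 0; i < wts.Length; ++i) {
      // "R" asks for a round-trip string so no precision is lost
      parts[i] = wts[i].ToString("R", CultureInfo.InvariantCulture);
    }
    File.WriteAllText(fn, string.Join(",", parts));
  }

  // Reads weights written by SaveWeights().
  public static double[] LoadWeights(string fn)
  {
    string[] parts = File.ReadAllText(fn).Trim().Split(',');
    double[] wts = new double[parts.Length];
    for (int i = 0; i < parts.Length; ++i)
      wts[i] = double.Parse(parts[i], CultureInfo.InvariantCulture);
    return wts;
  }
}
```

The invariant culture is used for both writing and reading so that a file saved on one machine can be loaded on a machine with different regional settings. The training X data could be saved in a similar way, one item per line.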

Wrapping Up

This article presents a lot of information. To recap:

  • Kernel ridge regression (KRR) is a machine learning technique to predict a numeric value.
  • Kernel ridge regression requires a kernel function that computes a measure of similarity between two training items.
  • The most common kernel function is the radial basis function (RBF).
  • There are two forms of the RBF function, the gamma and the sigma.
  • There are two ways to train a KRR model, kernel matrix inverse and stochastic gradient descent (SGD).
  • Both training techniques require an alpha constant for ridge (aka L2) regularization to deter model overfitting.
  • For KRR matrix inverse training, you must compute the inverse of a kernel matrix of RBF applied to all pairs of training items.
  • For KRR matrix inverse training, alpha is added to the diagonal elements of the kernel matrix, which prevents model overfitting and also conditions the matrix so that computing the inverse is less likely to fail.
  • There are many techniques to compute a matrix inverse. Cholesky decomposition is a specialized, relatively simple technique that can be used for kernel matrices.
  • The matrix inverse training technique often works well for small and medium size datasets, but it is complex and can fail.
  • The SGD training technique can be used with any size dataset, but it requires a learning rate and a maximum number of epochs, which must be determined by trial and error.

There is no single best machine learning regression technique. But when kernel ridge regression prediction works, it is often highly effective.
