The Data Science Lab

Data Dimensionality Reduction Using a Neural Autoencoder with C#

Dr. James McCaffrey of Microsoft Research presents a full-code, step-by-step tutorial on creating an approximation of a dataset that has fewer columns.

Imagine that you have a dataset that has many columns (dimensions). In some scenarios it's useful to create an approximation of the dataset that has fewer columns. This is called dimensionality reduction. The two most common techniques for dimensionality reduction are using PCA (principal component analysis) and using a neural autoencoder. This article explains how to perform dimensionality reduction using a neural autoencoder implemented with the C# language.

Compared to using PCA for dimensionality reduction, using a neural autoencoder has the big advantage that it works with source data that contains both numeric and categorical data, while PCA works only with strictly numeric data. An autoencoder is a specific type of neural network. The main disadvantage of using a neural autoencoder is that you must fine-tune the training parameters (max epochs, learning rate, batch size) and the number of nodes in the hidden layer.

Figure 1: Neural Autoencoder Dimensionality Reduction in Action

A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo uses a synthetic dataset that has 240 items. The raw data looks like:

F  24  michigan  29500.00  lib
M  39  oklahoma  51200.00  mod
F  63  nebraska  75800.00  con
M  36  michigan  44500.00  mod
F  27  nebraska  28600.00  lib
. . .

Each line of data represents a person. The fields are sex (male, female), age, state of residence (Michigan, Nebraska, Oklahoma), income, and political leaning (conservative, moderate, liberal). Notice that autoencoder dimensionality reduction can deal with any type of data: Boolean, integer, categorical/text, and floating point. The dataset is split into a 200-item set to be reduced and a 40-item set to act as training validation data.

Neural networks accept only numeric data and so the source data has been normalized and encoded, and looks like:

 1.0000  0.2400  1.0000  0.0000  0.0000  0.2950  0.0000  0.0000  1.0000
-1.0000  0.3900  0.0000  0.0000  1.0000  0.5120  0.0000  1.0000  0.0000
 1.0000  0.6300  0.0000  1.0000  0.0000  0.7580  1.0000  0.0000  0.0000
-1.0000  0.3600  1.0000  0.0000  0.0000  0.4450  0.0000  1.0000  0.0000
 1.0000  0.2700  0.0000  1.0000  0.0000  0.2860  0.0000  0.0000  1.0000
. . .

Sex is encoded as M = -1 and F = 1. Age is normalized by dividing by 100. State is one-hot encoded as Michigan = 100, Nebraska = 010, Oklahoma = 001. Income is normalized by dividing by 100,000. Political leaning is one-hot encoded as conservative = 100, moderate = 010, liberal = 001.

The demo instantiates a 9-6-9 neural autoencoder that has tanh() hidden layer activation and tanh() output layer activation. Then, training parameters are set to maxEpochs = 1000, lrnRate = 0.010, and batSize = 10. The autoencoder Train() method is called, and progress is monitored every 100 epochs:

Starting training
epoch:    0  MSE = 2.3826
epoch:  100  MSE = 0.0273
. . .
epoch:  800  MSE = 0.0019
epoch:  900  MSE = 0.0018
Done

The MSE (mean squared error) values decrease, which indicates that training is working properly -- something that doesn't always happen.

The trained neural autoencoder is subjected to a sanity check by computing the MSE for the 40-item validation dataset. The validation MSE is 0.0017, which is very close to the MSE of the data being reduced and indicates that the autoencoder is not overfitted.

The trained neural autoencoder model is used to reduce the 200 data items. The reduced data has six columns:

 0.0102   0.2991  -0.0517   0.0154  -0.8028   0.9672
-0.2268   0.8857   0.0029  -0.2421   0.7477  -0.9319
 0.0697  -0.9168   0.2438   0.9212   0.4091   0.2533
-0.0505   0.2831   0.5931  -0.9208   0.6399  -0.2666
 0.5075   0.1818   0.0889   0.9078  -0.8808   0.3985
. . .

This reduced data can be used as a surrogate for the original data. Common use-cases include data visualization in a 2D graph (if the data is reduced to just two columns instead of the six columns in the demo), use in machine learning algorithms (such as k-means clustering) that only work with numeric data, use in machine learning algorithms that can only handle a relatively small number of columns (such as those that compute a matrix inverse), and use in data cleaning (because the reduced data removes statistical noise).

This article assumes you have intermediate or better programming skill but doesn't assume you know anything about neural autoencoders and dimensionality reduction. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.

The source code for the demo program is too long to be presented in its entirety in this article. The complete code and data are available in the accompanying file download, and are also available online.

Understanding Neural Autoencoders
The diagram in Figure 2 illustrates a neural autoencoder. The autoencoder has the same number of inputs and outputs (9) as the demo program, but for simplicity the illustrated autoencoder has architecture 9-2-9 (just 2 hidden nodes) instead of the 9-6-9 architecture of the demo autoencoder.

A neural autoencoder is essentially a complex mathematical function that predicts its input. All input must be numeric so categorical data must be encoded. Although not theoretically necessary, for practical reasons, numeric input should be normalized so that all values have roughly the same range, typically between -1 and +1.

Figure 2: Neural Autoencoder System

Each thin blue arrow represents a neural weight, which is just a number, typically between about -2 and +2. Weights are sometimes called trainable parameters. The small red arrows are special weights called biases. The 9-2-9 autoencoder in the diagram has (9 * 2) + (2 * 9) = 36 weights, and 2 + 9 = 11 biases.

The values of the weights and biases, together with an input vector, determine the values of the output nodes. Finding the values of the weights and biases is called training the model. Put another way, training a neural autoencoder finds the values of the weights and biases so that the output values closely match the input values.

After training, when a data item is fed to the autoencoder, the values in the hidden nodes are a reduced version of the input data. The reduced form is sometimes called an embedding, or a latent vector.
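The body of the ReduceVector() method isn't shown in this article, but the key idea is to feed an input vector through the input-to-hidden layer and return the resulting hidden node values. A minimal sketch that uses the fields declared in Listing 1 (the demo implementation in the download may differ slightly) is:

// sketch: compute hidden node values = the reduced form
public double[] ReduceVector(double[] x)
{
  double[] result = new double[this.nh];  // one value per hidden node
  for (int j = 0; j < this.nh; ++j)
  {
    double sum = 0.0;
    for (int i = 0; i < this.ni; ++i)
      sum += x[i] * this.ihWeights[i][j];  // weighted input
    sum += this.hBiases[j];                // add bias
    result[j] = Math.Tanh(sum);            // tanh activation
  }
  return result;  // the embedding / latent vector
}

A ReduceMatrix() method can simply apply ReduceVector() to each row of its input matrix.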

Normalizing and Encoding Source Data for an Autoencoder
In practice, preparing the source data for an autoencoder is the most time-consuming part of the dimensionality reduction process. To normalize numeric variables, I recommend using the divide-by-constant technique so that all normalized values are between -1 and +1. For the demo data, the age values are divided by 100. If you had a column of temperature values that range from -40 degrees to +130 degrees, you could divide each value by 200. The only significant alternative to divide-by-constant normalization is min-max normalization, but divide-by-constant is simpler and retains the sign of the original value.
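In code, a minimal sketch of divide-by-constant normalization applied to the demo age values (the constant 100 is the one used for the demo data) is:

// divide-by-constant normalization -- illustration only
double[] ages = new double[] { 24.0, 39.0, 63.0, 36.0, 27.0 };
double[] normAges = new double[ages.Length];
for (int i = 0; i < ages.Length; ++i)
  normAges[i] = ages[i] / 100.0;  // 24 becomes 0.2400, and so on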

To encode a categorical variable, you should use one-hot encoding. The only time you'll run into trouble is when the variable can take on many possible values. For example, if the state of residence variable could be any one of the 50 U.S. states, each one-hot encoded vector would have forty-nine 0s and one 1. The standard workaround is to use ordinal encoding (1 through 50) and then programmatically convert to one-hot encoding when the data is read into memory.
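A minimal sketch of a helper that converts an ordinal value from 1 to n into one-hot form (this helper is hypothetical, not part of the demo program) is:

// convert ordinal value v in [1, n] to one-hot -- hypothetical helper
static double[] OrdinalToOneHot(int v, int n)
{
  double[] result = new double[n];  // all cells are 0.0
  result[v - 1] = 1.0;              // v = 3, n = 5 gives 0 0 1 0 0
  return result;
}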

To encode a binary variable, such as sex in the demo data, you can use either zero-one encoding or minus-one-plus-one encoding. In theory, minus-one-plus-one is better, but in practice there is rarely any significant difference between the two encoding techniques. I personally prefer minus-one-plus-one encoding but most of my colleagues use zero-one encoding.

The demo program assumes that the raw data has been normalized and encoded in a preprocessing step. It's possible to programmatically normalize and encode raw data on the fly, but this is a bit more difficult than you might expect.

The Demo Program
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project NeuralNetworkDimReduction. I checked the "Do not use top-level statements" option to avoid the program entry point shortcut syntax.

The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.

After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive NeuralDimReductionProgram.cs. I allowed Visual Studio to automatically rename class Program.

The overall program structure is presented in Listing 1. All the control logic is in the Main() method. All of the neural autoencoder functionality is in a NeuralNet class. A Utils class holds helper functions that Main() uses to load data from file into memory and to display vectors and matrices.

Listing 1: Overall Program Structure

using System;
using System.IO;
using System.Collections.Generic;

namespace NeuralNetworkDimReduction
{
  internal class NeuralDimReductionProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin neural dim reduction demo ");

      // load data into memory
      // create autoencoder
      // train autoencoder
      // validate trained autoencoder
      // reduce the data

      Console.WriteLine("End demo ");
      Console.ReadLine();
    } // Main

  } // Program

  // --------------------------------------------------------

  public class NeuralNet
  {
    private int ni; // number input nodes
    private int nh; // hidden
    private int no; // output

    private double[] iNodes;
    private double[][] ihWeights; // input-hidden
    private double[] hBiases;
    private double[] hNodes;

    private double[][] hoWeights; // hidden-output
    private double[] oBiases;
    private double[] oNodes;

    // gradients
    private double[][] ihGrads;
    private double[] hbGrads;
    private double[][] hoGrads;
    private double[] obGrads;

    private Random rnd; // init wts, scramble train order

    public NeuralNet(int numIn, int numHid,
      int numOut, int seed) { . . }
 
    private void InitWeights() { . . }
    public void SetWeights(double[] wts) { . . }
    public double[] GetWeights() { . . }

    public double[] ComputeOutput(double[] x) { . . }
    private static double HyperTan(double x) { . . }
    private void ZeroOutGrads() { . . }
    private void AccumGrads(double[] y) { . . }
    private void UpdateWeights(double lrnRate) { . . }

    public void Train(double[][] dataX, double[][] dataY,
      double lrnRate, int batSize, int maxEpochs) { . . }

    public double[] ReduceVector(double[] x) { . . }
    public double[][] ReduceMatrix(double[][] X) { . . }
 
    private void Shuffle(int[] sequence) { . . }
    public double Error(double[][] dataX, 
      double[][] dataY) { . . }
    public void SaveWeights(string fn) { . . }
    public void LoadWeights(string fn) { . . }
    private static double[][] MatCreate(int nRows,
      int nCols) { . . }
  }

  // --------------------------------------------------------

  public class Utils
  {
    public static string[] FileLoad(string fn,
      string comment) { . . }
    public static double[][] MatLoad(string fn,
      int[] usecols, char sep, string comment) { . . }
    public static void MatShow(double[][] m,
      int dec, int wid, int numRows) { . . }
  }
} // ns

The demo program is complex. However, the only code you'll need to modify is the calling code in the Main() method. The demo starts by loading the raw data into memory:

string rawFile = "..\\..\\..\\Data\\people_raw.txt";
string[] rawData = Utils.FileLoad(rawFile, "#");
Console.WriteLine("First 5 raw data: ");
for (int i = 0; i < 5; ++i)
  Console.WriteLine(rawData[i]);

The people_raw.txt file is read into memory as an array of type string. Reading and displaying the raw data isn't necessary because the dimensionality reduction is performed on the normalized and encoded data. But displaying the raw data helps to make the ideas of dimensionality reduction a bit easier to understand.
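The body of the FileLoad() helper isn't presented in this article. A minimal sketch that reads every line of a text file, skipping blank lines and lines that begin with the comment token (the demo implementation in the download may differ), is:

// sketch of a FileLoad() style helper
public static string[] FileLoad(string fn, string comment)
{
  List<string> result = new List<string>();
  foreach (string line in File.ReadAllLines(fn))
  {
    if (line.StartsWith(comment)) continue;  // skip comment lines
    if (line.Trim() == "") continue;         // skip blank lines
    result.Add(line);
  }
  return result.ToArray();
}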

The 200-item dataset to reduce is loaded into memory:

Console.WriteLine("Loading encoded and normalized data ");
string dataFile = "..\\..\\..\\Data\\people_data.txt";
double[][] dataX = Utils.MatLoad(dataFile,
  new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8 }, ',', "#");

The arguments to the MatLoad() function mean load columns 0 through 8 inclusive from the comma-delimited file, where lines that begin with the "#" character are comments. The return value is an array-of-arrays style matrix.
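The body of MatLoad() isn't presented either. A minimal sketch that's consistent with the behavior just described (the demo implementation may differ) is:

// sketch of a MatLoad() style helper
public static double[][] MatLoad(string fn, int[] usecols,
  char sep, string comment)
{
  List<double[]> rows = new List<double[]>();
  foreach (string line in File.ReadAllLines(fn))
  {
    if (line.StartsWith(comment) || line.Trim() == "") continue;
    string[] tokens = line.Split(sep);
    double[] row = new double[usecols.Length];
    for (int j = 0; j < usecols.Length; ++j)
      row[j] = double.Parse(tokens[usecols[j]]);
    rows.Add(row);
  }
  return rows.ToArray();  // array-of-arrays style matrix
}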

The 40-item validation dataset is loaded in the same way:

string validationFile = 
  "..\\..\\..\\Data\\people_validation.txt";
double[][] validationX = Utils.MatLoad(validationFile,
  new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8 }, ',', "#");
Console.WriteLine("Done "); 

Part of the normalized and encoded data is displayed as a sanity check:

Console.WriteLine("First 5 data items: ");
Utils.MatShow(dataX, 4, 9, 5);

The arguments to MatShow() mean display just the first 5 rows of the matrix, using 4 decimals and a field width of 9 for each value.
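A minimal sketch of a MatShow() display helper that's consistent with those arguments (again, the demo implementation may differ) is:

// sketch of a MatShow() style helper
public static void MatShow(double[][] m, int dec, int wid, int numRows)
{
  for (int i = 0; i < numRows && i < m.Length; ++i)
  {
    for (int j = 0; j < m[i].Length; ++j)
      Console.Write(m[i][j].ToString("F" + dec).PadLeft(wid));
    Console.WriteLine("");
  }
}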

Creating and Training the Autoencoder
The demo program creates a 9-6-9 autoencoder using these statements:

Console.WriteLine("Creating 9-6-9 tanh-tanh autoencoder ");
NeuralNet nn = new NeuralNet(9, 6, 9, seed: 0);
Console.WriteLine("Done ");

The number of input nodes and output nodes, 9 in the case of the demo, is entirely determined by the normalized and encoded source data. The number of hidden nodes, 6 in the demo, is a hyperparameter that must be determined by trial and error. If too few hidden nodes are used, the autoencoder doesn't have enough power to model the source data well. If too many hidden nodes are used, the autoencoder will essentially memorize the source data -- overfitting the data, and the model won't generalize when previously unseen data is encountered.

As a general rule of thumb, a good value for the number of hidden nodes is often (but not always) between 60 percent and 80 percent of the number of input nodes.

The seed value is used to initialize the autoencoder weights and biases to small random values. The seed is also used to scramble the order in which data items are processed during training. Different seed values can give significantly different results, but you shouldn't try to fine-tune the model by adjusting the seed parameter.

Behind the scenes, the autoencoder uses tanh() activation on the hidden nodes and tanh() activation on the output nodes. The result of the tanh() function is always between -1 and +1. Therefore, the reduced form of a data item stored in the hidden nodes will contain only values between -1 and +1.
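The HyperTan() helper declared in Listing 1 isn't shown. A common way to implement it is to guard against extreme argument values before calling Math.Tanh() -- the guard thresholds here are an assumption on my part:

// sketch of a tanh activation helper
private static double HyperTan(double x)
{
  if (x < -10.0) return -1.0;     // tanh is essentially -1 here
  else if (x > 10.0) return 1.0;  // tanh is essentially +1 here
  else return Math.Tanh(x);
}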

The autoencoder is trained using these statements:

int maxEpochs = 1000;
double lrnRate = 0.01;
int batSize = 10;
Console.WriteLine("Starting training ");
nn.Train(dataX, dataX, lrnRate, batSize, maxEpochs);
Console.WriteLine("Done ");

Behind the scenes, the demo system trains the autoencoder using a clever algorithm called back-propagation. The maxEpochs, lrnRate, and batSize parameters are hyperparameters that must be determined by trial and error. If too few training epochs are used, the model will underfit the source data. Too many training epochs will overfit the data.

The learning rate controls how much the weights and biases change on each update during training. A very small learning rate will slowly but surely approach optimal weight and bias values, but training could be too slow. A large learning rate will quickly converge to a solution but could skip over optimal weight and bias values.

The batch size specifies how many data items to group together during training. A batch size of 1 is sometimes called "online training," but is rarely used. A batch size equal to the number of data items (200 in the demo) is sometimes called "full batch training," but is rarely used. In practice, it's a good idea to specify a batch size that evenly divides the number of data items so that all batches are the same size.
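The body of the Train() method isn't presented in this article, but batch training with back-propagation follows a standard pattern. The following is a simplified sketch that calls the helper methods declared in Listing 1; it is not the demo's exact implementation:

// simplified sketch of a batch training loop
public void Train(double[][] dataX, double[][] dataY,
  double lrnRate, int batSize, int maxEpochs)
{
  int n = dataX.Length;
  int[] indices = new int[n];
  for (int i = 0; i < n; ++i) indices[i] = i;

  for (int epoch = 0; epoch < maxEpochs; ++epoch)
  {
    this.Shuffle(indices);  // visit items in scrambled order
    this.ZeroOutGrads();
    int count = 0;
    for (int ii = 0; ii < n; ++ii)
    {
      int idx = indices[ii];
      this.ComputeOutput(dataX[idx]);  // forward pass
      this.AccumGrads(dataY[idx]);     // accumulate gradients
      ++count;
      if (count == batSize)  // end of a batch
      {
        this.UpdateWeights(lrnRate);
        this.ZeroOutGrads();
        count = 0;
      }
    }
    if (epoch % 100 == 0)  // monitor progress
      Console.WriteLine("epoch: " + epoch.ToString().PadLeft(4) +
        "  MSE = " + this.Error(dataX, dataY).ToString("F4"));
  }
}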

To recap, to use the demo program as a template, after you normalize and encode your source data, the number of input and output nodes are determined by the data. You must experiment with the number of hidden nodes, the maxEpochs, lrnRate, and batSize parameters. You don't have to modify the underlying methods.

Reducing the Data
After the neural autoencoder has been trained, the trained model is applied to the 40-item validation dataset as a sanity check:

double validationErr = nn.Error(validationX, validationX);
Console.WriteLine("MSE on validation data = " +
  validationErr.ToString("F4"));

Notice the call to the Error() method accepts the validation data as both inputs and outputs because an autoencoder predicts its input. The smaller the value of mean squared error, the closer the output values match the input values. If the MSE of the validation data is significantly different from the MSE of the data being reduced, the model is most likely underfitted or overfitted.
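A minimal sketch of how an Error() method can compute mean squared error, averaging the squared differences between computed output values and target values over all data items (the demo implementation may differ), is:

// sketch of a mean squared error helper
public double Error(double[][] dataX, double[][] dataY)
{
  double sum = 0.0;
  for (int i = 0; i < dataX.Length; ++i)
  {
    double[] oupt = this.ComputeOutput(dataX[i]);  // predicted values
    for (int k = 0; k < oupt.Length; ++k)
    {
      double diff = oupt[k] - dataY[i][k];
      sum += diff * diff;  // squared error
    }
  }
  return sum / dataX.Length;  // average per data item
}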

The demo program concludes by using the trained autoencoder to reduce the source data:

Console.WriteLine("Reducing data ");
double[][] reduced = nn.ReduceMatrix(dataX);
Console.WriteLine("First 5 reduced: ");
Utils.MatShow(reduced, 4, 9, 5);

In a non-demo scenario you might want to programmatically write the reduced values to a comma-delimited text file.
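For example, a minimal sketch that writes the reduced matrix to a comma-delimited text file (the output file name is hypothetical) is:

// write reduced data to a comma-delimited file -- file name is hypothetical
string outFile = "..\\..\\..\\Data\\people_reduced.txt";
using (StreamWriter sw = new StreamWriter(outFile))
{
  for (int i = 0; i < reduced.Length; ++i)
    sw.WriteLine(string.Join(",",
      Array.ConvertAll(reduced[i], v => v.ToString("F4"))));
}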

Wrapping Up
A minor limitation of the autoencoder architecture presented in this article is that it only has a single hidden layer. Neural autoencoders with multiple hidden layers are called deep autoencoders. Implementing a deep autoencoder is possible but requires a lot of effort. A result from the Universal Approximation Theorem (sometimes called the Cybenko Theorem) states, loosely speaking, that a neural network with a single hidden layer and enough hidden nodes can approximate any function that can be approximated by a deep autoencoder.
