The Data Science Lab

How to Work with C# Vectors and Matrices for Machine Learning

Here's a hands-on tutorial from bona-fide data scientist Dr. James McCaffrey of Microsoft Research to get you up to speed with machine learning development using C#, complete with code listings and graphics.

A working knowledge of machine learning (ML) is becoming an increasingly important part of many C# developers' skill sets. And virtually every significant ML technique uses vectors and matrices. In this article I get you up to speed with the fundamental knowledge you need to create and modify ML code written using the C# language.

A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. Informally, a vector is an array of numeric values. A matrix is conceptually a two-dimensional data structure of numeric values. The demo program begins by creating and displaying a vector with four cells, each initialized to 3.5. Next, the demo creates a 3x4 (3 rows, 4 columns) matrix. The demo program concludes by reading 12 values from a text file and storing them into a 4x3 matrix.

Figure 1. Demonstration of C# Vectors and Matrices

This article assumes you have intermediate or better skill with C# but doesn't assume you know anything about vectors and matrices or about ML. The complete code for the demo program shown running in Figure 1 is presented in this article. The code is also available in the file download that accompanies this article.

The Demo Program
To create the demo program, I launched Visual Studio 2019. I used the Community (free) edition but any relatively recent version of Visual Studio will work fine. From the main Visual Studio start window I selected the "Create a new project" option. Next, I selected C# from the Language dropdown control and Console from the Project Type dropdown, and then picked the "Console App (.NET Core)" item.

The code presented in this article will run as a .NET Core console application or as a .NET Framework application. Many of the newer Microsoft technologies, such as the ML.NET code library, specifically target .NET Core so it makes sense to develop most C# ML code in that environment.

I entered "Matrices" as the Project Name, specified C:\VSM on my local machine as the Location (you can use any convenient directory), and checked the "Place solution and project in the same directory" entry.

After the template code loaded into the Visual Studio editor, I removed all using statements for unneeded namespaces at the top of the editor window, leaving just the reference to the top-level System namespace. Next, I added a using statement that references the System.IO namespace so the program can read data from a text file.

In the Solution Explorer window, I renamed file Program.cs to the more descriptive MatricesProgram.cs and then in the editor window I renamed class Program to class MatricesProgram to match the file name. Next, in the Solution Explorer window, I right-clicked on the bold-font Matrices project name and selected Add | New Item | Text File and entered "dummy_data.tsv" in the Name field. After clicking on the Add button, I entered values 1.0 through 12.0, three per line, separated by tab characters, as shown in Figure 1. I also added a header line that begins with "//" characters, which specifies the file name.

Listing 1. Matrices Demo Program Structure
using System;
using System.IO;
namespace Matrices
{
  class MatricesProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin vectors and matrices demo");

      Console.WriteLine("Creating a vector with 4 cells");
      double[] v = Utils.VecCreate(4, 3.5);
      Utils.VecShow(v, 3, 6);

      Console.WriteLine("Creating a 3x4 matrix");
      double[][] m = Utils.MatCreate(3, 4);
      Utils.MatShow(m, 2, 6);

      Console.WriteLine("Loading 4x3 matrix from file");
      string fn = "..\\..\\..\\dummy_data.tsv";
      double[][] d =
        Utils.MatLoad(fn, 4, new int[] { 0,1,2 }, '\t');
      Utils.MatShow(d, 1, 6);

      Console.WriteLine("End demo ");
      Console.ReadLine();
    } // Main
  } // MatricesProgram class

  public class Utils
  {
    public static double[] VecCreate(int n,
      double val = 0.0) { . . }
    public static void VecShow(double[] vec,
      int dec, int wid) { . . }
    public static double[][] MatCreate(int rows,
      int cols) { . . }
    public static void MatShow(double[][] mat,
      int dec, int wid) { . . }
    public static double[][] MatLoad(string fn,
      int nRows, int[] cols, char sep) { . . }
  } // Utils class
} // ns

The structure of the demo program, with a few minor edits to save space, is shown in Listing 1. All of the vector and matrix functionality is implemented by static functions in a class named Utils. Alternative design possibilities are to place the functions in a C# class library or to place the functions inside the Program class.

Vectors
The simplest way to create a vector is with code like:

double[] v = new double[4];

which creates an array with four cells, each holding a type double value initialized to 0.0 by default. Type int and type float vectors can be created using the same pattern. Most ML techniques use type int and 64-bit type double, except for neural networks, which often use type int and 32-bit type float.

It's possible to create vectors that are initialized to specific values with code like:

double[] v = new double[] { 1.0, 2.0, 3.0 };

With this pattern the explicit length is optional. You can write new double[3] { 1.0, 2.0, 3.0 } if you wish, but the compiler will infer the length of the array from the number of values supplied.

It's possible to create a vector programmatically. The demo program defines the following method to create a vector with n cells all initialized to a specific value:

public static double[] VecCreate(int n, double val = 0.0)
{
  double[] result = new double[n];
  for (int i = 0; i < n; ++i)
    result[i] = val;
  return result;
}

The method can be called like:

double[] v = Utils.VecCreate(4, 3.50); // 4 cells all 3.5
double[] w = Utils.VecCreate(5, 2.78); // 5 cells all 2.78
double[] t = Utils.VecCreate(6);  // 6 cells all 0.0

The structure of the C# vector v is shown in the top part of Figure 2. Technically, the name of a vector is a reference to the array object in memory. Conceptually, the name refers to the entire array.

Figure 2. Anatomy of C# Vectors and Matrices

The demo program shows how to traverse a vector in method VecShow():

public static void VecShow(double[] vec, int dec, int wid)
{
  for (int i = 0; i < vec.Length; ++i)
    Console.Write(vec[i].ToString("F" + dec).PadLeft(wid));
  Console.WriteLine("");
}

When using a for-loop to traverse a vector, the choice of using pre-increment (++i) or post-increment (i++) is purely a matter of style because the increment is a standalone statement.

Notice that VecShow() assumes it is being called from a console application. The method can be called like so:

Utils.VecShow(v, 4, 8);

This call displays vector v with 4 decimals for each value and a total width of 8 characters per value (with blank space padding on the left, if necessary).
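Because VecShow() writes directly to the console, a variant that returns the formatted text can be handy for logging or for non-console applications. The following is a minimal sketch under stated assumptions: the class name VecFormat and the method name VecToString are hypothetical and not part of the demo program, and the invariant culture is used so the decimal separator is always a period.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class VecFormat
{
  // Hypothetical helper (not in the demo): same layout as VecShow()
  // but returns the text instead of writing it to the console.
  public static string VecToString(double[] vec, int dec, int wid)
  {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < vec.Length; ++i)
    {
      // "F" + dec is a fixed-point format string, e.g. "F4" = 4 decimals
      string s = vec[i].ToString("F" + dec, CultureInfo.InvariantCulture);
      sb.Append(s.PadLeft(wid));  // right-align in a field wid characters wide
    }
    return sb.ToString();
  }
}
```

For example, VecFormat.VecToString(new double[] { 3.5, 3.5 }, 4, 8) returns "  3.5000  3.5000", where each value occupies eight characters.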

A common source of errors is forgetting that vectors are reference types rather than value types. For example, consider these three statements:

double[] v1 = new double[3] { 4.0, 5.0, 6.0 };
double[] v2 = v1;
v2[0] = 9.0;

The second statement creates a reference named v2 that points to the same location as v1. The third statement changes cell [0] of v2 to 9.0 but because both v1 and v2 point to the same object, the value of v1[0] has also changed.
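One way to confirm that two names refer to the same array is the built-in object.ReferenceEquals() method. The sketch below wraps the three statements above in a small class; the class name AliasDemo and its method names are hypothetical, chosen only for illustration.

```csharp
using System;

public static class AliasDemo
{
  // Returns true if a and b refer to the same array object in memory.
  public static bool SameObject(double[] a, double[] b)
  {
    return object.ReferenceEquals(a, b);
  }

  // Runs the three statements from the text and returns v1[0].
  public static double Run()
  {
    double[] v1 = new double[3] { 4.0, 5.0, 6.0 };
    double[] v2 = v1;  // copies the reference, not the cells
    v2[0] = 9.0;       // writes through to the single shared array
    return v1[0];      // 9.0, because v1 and v2 are the same object
  }
}
```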

If you want to make an independent copy of a vector, you can do so with a function like this:

static double[] Duplicate(double[] v)
{
  int n = v.Length;
  double[] result = new double[n];
  for (int i = 0; i < n; ++i)
    result[i] = v[i];
  return result;
}

double[] v1 = new double[3] { 4.0, 5.0, 6.0 };
double[] v2 = Duplicate(v1);
v2[0] = 9.0;  // no effect on v1

A useful consequence of vectors being references is that a function can modify its array parameter. For example:

static void TripleIt(double[] v)
{
  for (int i = 0; i < v.Length; ++i)
    v[i] = 3 * v[i];
}

double[] v = new double[] { 7.0, 9.0 };
TripleIt(v);  // v is now (21.0, 27.0)

Developers who prefer a functional style of programming can avoid using the reference mechanism to modify a vector and could write code like this:

static double[] Triple(double[] v)
{
  int n = v.Length;
  double[] result = new double[n];
  for (int i = 0; i < n; ++i)
    result[i] = 3 * v[i];
  return result;
}

double[] v = new double[] { 7.0, 9.0 };
v = Triple(v);  // v is now (21.0, 27.0)

As an alternative to a program-defined function such as Duplicate(), the .NET System.Array class has an Array.Copy() method that you can use to make an independent copy of a vector.
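A minimal sketch of copying with the built-in methods is shown below. The class name VecCopy and both method names are hypothetical. The second method uses Clone(), which is also defined by System.Array and returns type object, so a cast is needed.

```csharp
using System;

public static class VecCopy
{
  // Independent copy using Array.Copy(source, destination, length).
  public static double[] DuplicateWithCopy(double[] v)
  {
    double[] result = new double[v.Length];
    Array.Copy(v, result, v.Length);
    return result;
  }

  // Independent copy using Clone(), which returns object.
  public static double[] DuplicateWithClone(double[] v)
  {
    return (double[])v.Clone();
  }
}
```

With either method, changing a cell of the returned vector has no effect on the original vector.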

Matrices
The most common way to create a C# matrix is to use an array of arrays. For example:

double[][] m = new double[4][];  // 4 rows
for (int i = 0; i < 4; ++i)
  m[i] = new double[3];  // 3 columns per row

Conceptually, the code creates a matrix named m that has four rows and three columns. Technically, the matrix is an array named m that has four cells, and each cell is a reference to an array with three cells. The anatomy of a 3x4 array-of-arrays style matrix named d is shown in the bottom part of Figure 2.

The demo program defines a helper method in class Utils to create a matrix:

public static double[][] MatCreate(int rows, int cols)
{
  double[][] result = new double[rows][];
  for (int i = 0; i < rows; ++i)
    result[i] = new double[cols];
  return result;
}

The method can be called like so:

double[][] m = Utils.MatCreate(4, 3);  // 4x3

All cells of a matrix created in this way will be initialized by default to 0.0 values. Individual cells are referenced like so:

m[0][1] = -1.2345;  // row 0, col 1
m[3][2] = 9.99;  // row 3, col 2
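Because an array-of-arrays matrix is just an array whose cells are row vectors, the dimensions can be read from the Length properties, and any single row can be treated as an ordinary vector. The sketch below illustrates this; the class name MatInfo and its method names are hypothetical, and the code assumes the matrix has at least one row and that all rows have the same length.

```csharp
using System;

public static class MatInfo
{
  // The number of rows is the length of the outer array.
  public static int NumRows(double[][] m) { return m.Length; }

  // The number of columns is the length of the first row
  // (assumes at least one row and equal-length rows).
  public static int NumCols(double[][] m) { return m[0].Length; }

  // Sum of one row: m[row] is itself a double[] vector.
  public static double RowSum(double[][] m, int row)
  {
    double[] r = m[row];  // a reference to the row, not a copy
    double sum = 0.0;
    for (int j = 0; j < r.Length; ++j)
      sum += r[j];
    return sum;
  }
}
```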

Note that, unlike many languages, C# supports a true built-in two-dimensional array type. For example:

double[,] m = new double[4,3];  // 4x3
m[0, 1] = -1.2345;  // row 0, col 1

However, this technique is rarely used in ML scenarios. Another approach is to implement a matrix using a program-defined class along the lines of:

class MyMatrix
{
  private int rows;
  private int cols;
  private double[][] values;
. . .

This approach is quite common but in my opinion it's an example of OOP run amok and is unnecessarily complicated for most ML systems.
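If you encounter code that uses the built-in double[,] type, converting it to the array-of-arrays form is straightforward. The following is a minimal sketch; the class name MatConvert and the method name ToJagged are hypothetical. The GetLength() method of System.Array returns the size of the specified dimension.

```csharp
using System;

public static class MatConvert
{
  // Converts a rectangular double[,] into array-of-arrays form.
  public static double[][] ToJagged(double[,] src)
  {
    int rows = src.GetLength(0);  // size of dimension 0 (rows)
    int cols = src.GetLength(1);  // size of dimension 1 (columns)
    double[][] result = new double[rows][];
    for (int i = 0; i < rows; ++i)
    {
      result[i] = new double[cols];
      for (int j = 0; j < cols; ++j)
        result[i][j] = src[i, j];  // comma syntax in, double-bracket syntax out
    }
    return result;
  }
}
```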

The demo program defines a method MatShow() to display a matrix to a console shell. The code is shown in Listing 2. Notice that because a matrix is an array-of-arrays, the inner for-loop across column values could have been replaced by a call to VecShow() like so:

for (int i = 0; i < nRows; ++i)
  Utils.VecShow(mat[i], dec, wid);

In general, I prefer to avoid method dependencies like this because I can then change VecShow() without causing an unintended side effect in MatShow().

Listing 2. Methods to Display and Load a Matrix

public static void MatShow(double[][] mat,
  int dec, int wid)
{
  int nRows = mat.Length;
  int nCols = mat[0].Length;
  for (int i = 0; i < nRows; ++i)
  {
    for (int j = 0; j < nCols; ++j)
    {
      double x = mat[i][j];
      Console.Write(x.ToString("F" + dec).PadLeft(wid));
    }
    Console.WriteLine("");
  }
}

public static double[][] MatLoad(string fn, int nRows,
  int[] cols, char sep)
{
  int nCols = cols.Length;
  double[][] result = MatCreate(nRows, nCols);
  string line = "";
  string[] tokens = null;
  FileStream ifs = new FileStream(fn, FileMode.Open);
  StreamReader sr = new StreamReader(ifs);

  int i = 0;
  while ((line = sr.ReadLine()) != null)
  {
    if (line.StartsWith("//") == true)
      continue;
    tokens = line.Split(sep);
    for (int j = 0; j < nCols; ++j)
    {
      int k = cols[j];  // into tokens
      result[i][j] = double.Parse(tokens[k]);
    }
    ++i;
  }
  sr.Close(); ifs.Close();
  return result;
}

Many ML techniques require you to read data from a text file into a matrix. The demo program defines a method MatLoad() to perform this operation. A call to MatLoad() looks like:

string fn = "C:\\Somewhere\\some_file.tsv";
int numRows = 12;
int[] cols = new int[] {0, 2, 4 };
char sep = '\t';
double[][] m = Utils.MatLoad(fn, numRows, cols, sep);

The MatLoad() method requires you to specify the number of rows to read. It'd be possible for MatLoad() to perform a preliminary scan of the target text file to programmatically determine the number of rows. Such a helper function could be defined as:

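public static int NumLines(string fn)
{
  // count data lines only; lines that start with "//" are
  // skipped because MatLoad() treats them as comments
  int ct = 0;
  string line = "";
  FileStream ifs = new FileStream(fn, FileMode.Open);
  StreamReader sr = new StreamReader(ifs);
  while ((line = sr.ReadLine()) != null)
    if (line.StartsWith("//") == false) ++ct;
  sr.Close(); ifs.Close();
  return ct;
}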

Method MatLoad() allows you to specify which columns to use, which is useful in situations where you only want a few of the columns. Similarly, method MatLoad() parameterizes the character that separates values on each line in the source text file. The most common separators are a single blank space, a tab character, and a comma.

Method MatLoad() is hard-coded to interpret any line in the source text file that starts with two forward slash characters as a comment line. You might want to parameterize this string value to make MatLoad() a bit more general.
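One possible way to parameterize the comment string is sketched below. This is a hypothetical variant rather than the demo's code: the class name LoadUtils and the comment parameter are assumptions. Unlike the demo's version, it parses numbers with the invariant culture, closes the file with a using block, and stops reading after nRows data lines.

```csharp
using System;
using System.Globalization;
using System.IO;

public static class LoadUtils
{
  // Hypothetical variant of MatLoad(): the comment prefix is a parameter.
  public static double[][] MatLoad(string fn, int nRows,
    int[] cols, char sep, string comment)
  {
    int nCols = cols.Length;
    double[][] result = new double[nRows][];
    for (int r = 0; r < nRows; ++r)
      result[r] = new double[nCols];

    int i = 0;
    // the using block closes the reader even if parsing throws
    using (StreamReader sr = new StreamReader(fn))
    {
      string line;
      while (i < nRows && (line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment, StringComparison.Ordinal))
          continue;  // skip comment lines
        string[] tokens = line.Split(sep);
        for (int j = 0; j < nCols; ++j)
        {
          int k = cols[j];  // index into tokens
          result[i][j] = double.Parse(tokens[k],
            CultureInfo.InvariantCulture);
        }
        ++i;
      }
    }
    return result;
  }
}
```

A call such as LoadUtils.MatLoad(fn, 4, new int[] { 0, 1, 2 }, '\t', "//") would then behave like the demo's version for a file that uses "//" comments, while a file that uses "#" comments only needs a different last argument. The comment argument should be a non-empty string because every line starts with the empty string.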

Wrapping Up
If you work with .NET technologies, you have several options for creating an ML prediction system such as a naive Bayes classifier or a logistic regression classifier. At a low level of abstraction, you can create your ML system using raw C# with the .NET Core framework. At a high level of abstraction, you can use the new (it's still in preview mode as I write this) ML.NET library.

Advantages of using the raw C# approach include having complete control over your system, avoiding any license issues, ease of customization, and simplified integration into existing .NET systems. The primary disadvantage of using the raw C# approach is that creating a working ML prediction system sometimes (but not always) takes a bit longer than using the ML.NET library. In many scenarios the decision to use raw C# or ML.NET is not Boolean -- you can use both approaches together.

About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].
