The Data Science Lab

Clustering Mixed Categorical and Numeric Data Using k-Means with C#

Dr. James McCaffrey of Microsoft Research presents a full-code, step-by-step tutorial on a "very tricky" machine learning technique.

Data clustering is the process of grouping data items together so that similar items are in the same group/cluster. For strictly numeric data, the k-means clustering technique is relatively simple, effective, and the most commonly used approach. For strictly non-numeric (categorical) data, there are somewhat more complicated techniques that use entropy, Bayesian probability, or category utility. But clustering mixed categorical and numeric data is very tricky.

This article presents a technique for clustering mixed categorical and numeric data using standard k-means clustering implemented using the C# language. Briefly, the source mixed data is preprocessed so that all the numeric and categorical values are between 0.0 and 1.0 and therefore k-means clustering can be applied without any modification.

Figure 1: Clustering Mixed Data Using k-Means in Action

The best way to see where this article is headed is to take a look at a screenshot of a demo program in Figure 1. The demo clusters a synthetic dataset of 240 items. The source data looks like:

F  short   24  arkansas  29500  liberal
M  tall    39  delaware  51200  moderate
F  short   63  colorado  75800  conservative
M  medium  36  illinois  44500  moderate
F  short   27  colorado  28600  liberal
. . .

Each line represents a person. The column variables are sex, height, age, State, income and political leaning. The demo data was created so that there are numeric variables (age and income), a binary categorical variable (sex), an ordinal categorical variable (height) and nominal categorical variables (State and political leaning).

The raw data is normalized and encoded and looks like:

0.5, 0.25, 0.12, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.42, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.90, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.36, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.18, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333
. . .

The normalized and encoded data is fed to a standard k-means clustering system where k is set to 3. The resulting clustering is:

Result clustering:
  0  1  2  0  0  2  2  0  0  1  0  0  0  1  2  2 . . .
Result WCSS = 49.3195 

This indicates that data item [0] is assigned to cluster 0, item [1] is assigned to cluster 1, item [2] is assigned to cluster 2, item [3] is assigned to cluster 0 and so on. The WCSS (within-cluster sum of squares) is the value that k-means attempts to minimize, so smaller values indicate a better clustering.
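To make the WCSS metric concrete, here is a minimal sketch of how the within-cluster sum of squares could be computed from the encoded data, a clustering, and the cluster means. The helper name and signature are illustrative and are not part of the demo program.

// illustrative sketch, not part of the demo program: WCSS is the sum,
// over all items, of the squared Euclidean distance from each item to
// the mean/centroid of the cluster the item is assigned to
static double ComputeWCSS(double[][] X, int[] clustering,
  double[][] means)
{
  double sum = 0.0;
  for (int i = 0; i < X.Length; ++i)
  {
    double[] mean = means[clustering[i]];  // mean of item i's cluster
    for (int j = 0; j < X[i].Length; ++j)
      sum += (X[i][j] - mean[j]) * (X[i][j] - mean[j]);
  }
  return sum;
}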

The demo displays the clustering results by cluster ID:

cluster 0 | count = 89 :
   0    3    4    7    8   10   11   12   17   18   19   24  . . .

cluster 1 | count = 77 :
   1    9   13   16   21   30   31   36   37   38   42   44  . . .

cluster 2 | count = 74 :
   2    5    6   14   15   20   22   23   25   26   27   29  . . .

This means data items [0], [3], [4] and so on are assigned to cluster 0, and that 89 of the 240 data items belong to cluster 0. The demo concludes by displaying the source data by cluster ID and the cluster means/centroids:

cluster 0:
[  0]  F short  24  arkansas  29500  liberal
[  3]  M medium 36  illinois  44500  moderate
[  4]  F short  27  colorado  28600  liberal
 . . .

Means:
[ 0]  0.3 0.40 0.19 0.09 0.06 0.09 0.02 0.2568 0.1049 0.1123 0.1161
[ 1]  0.0 0.63 0.66 0.08 0.06 0.08 0.03 0.6758 0.0390 0.1645 0.1299
[ 2]  0.5 0.32 0.69 0.07 0.08 0.05 0.05 0.6542 0.1576 0.1531 0.0225

This article assumes you have intermediate or better programming skill with a C-family language and a basic familiarity with k-means clustering, but does not assume you know anything about normalizing and encoding data. The source code is too long to be presented in its entirety in this article; however, the complete source code and data are available in the accompanying file download, and are also available online.

Normalizing and Encoding Mixed Data for k-Means Clustering
Because k-means clustering computes Euclidean distance between data items, all data values must be numeric and normalized so that the values have roughly the same magnitude. Normalization prevents variables with large magnitudes (such as income) from overwhelming variables with smaller magnitudes (such as age).

Naive approaches for encoding categorical data don't work well for clustering. For example, the State variable in the demo data has four possible values: Arkansas, Colorado, Delaware, Illinois. If you use ordinal encoding where Arkansas = 0, Colorado = 1, Delaware = 2, Illinois = 3, then the mathematical distance between Arkansas and Delaware is larger than the distance between Arkansas and Colorado, which incorrectly implies an ordering among the States that doesn't exist.

One-hot encoding is used on categorical variables for neural network systems, but one-hot encoding is not a good approach for k-means clustering. For example, if Arkansas = (1,0,0,0), Colorado = (0,1,0,0), Delaware = (0,0,1,0), Illinois = (0,0,0,1), then the Euclidean distance between Arkansas and Colorado is sqrt((1 - 0)^2 + (0 - 1)^2 + (0 - 0)^2 + (0 - 0)^2) = sqrt(2) = 1.41. Furthermore, the distance between any two one-hot encoded States is 1.41. Suppose that instead of just four possible values, a variable had 10 possible values. With one-hot encoding, the Euclidean distance between any two values is still 1.41, which doesn't make sense. The intuition is that for a variable with many possible values, a mismatch doesn't give as much information as a mismatch for a variable with only a few possible values.

The demo program uses one-over-n-hot encoding for nominal categorical variables. Instead of using a 1 in the identifying location, you use 1/n where n is the number of possible values for the variable. For the demo data, State has four possible values and so the one-over-n-hot encoding is Arkansas = (0.25, 0, 0, 0), Colorado = (0, 0.25, 0, 0) and so on. The political leaning variable has three possible values so the one-over-n-hot encoding is conservative = (0.3333, 0, 0), moderate = (0, 0.3333, 0), liberal = (0, 0, 0.3333). The Euclidean distance between any two State values is sqrt((0.25)^2 + (0.25)^2) = sqrt(0.125) = 0.35. The distance between any two political leaning values is sqrt((0.3333)^2 + (0.3333)^2) = sqrt(0.222) = 0.47, which is greater than the State distance. This seems reasonable because a mismatch in political leaning provides more information than a mismatch in State.
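A minimal sketch of one-over-n-hot encoding for a nominal variable looks like this. The helper name and signature are illustrative and are not part of the demo program:

// illustrative sketch, not part of the demo program
static double[] OneOverNHotEncode(string value,
  string[] possibleValues)
{
  int n = possibleValues.Length;
  double[] result = new double[n];  // all cells start at 0.0
  int idx = Array.IndexOf(possibleValues, value);
  result[idx] = 1.0 / n;  // 0.25 for State, 0.3333 for politics
  return result;
}

For example, OneOverNHotEncode("colorado", new string[] { "arkansas", "colorado", "delaware", "illinois" }) returns (0.00, 0.25, 0.00, 0.00).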

Binary categorical data is a special case of one-over-n-hot encoding. For the demo sex variable, you could encode using plain one-over-n-hot encoding as M = (0.50, 0) and F = (0, 0.50). The distance between M and F is sqrt((0.50)^2 + (0.50)^2) = sqrt(0.25 + 0.25) = 0.71. However, the demo program uses an attenuated encoding of M = 0.0 and F = 0.5. The distance between M and F is sqrt((0.5)^2) = 0.5. In some experiments, the plain one-over-n-hot encoding for binary categorical data did not work as well as the attenuated encoding. But results were not conclusive and how best to encode binary categorical data is an open research question.

Ordinal categorical data has an inherent ordering. The demo uses equal-interval encoding. For the height variable with three possible values, the encoding is short = 0.25, medium = 0.50, tall = 0.75. In general, for an ordinal categorical variable with n possible values, the encoded values are 1/(n+1), 2/(n+1), and so on. For example, for an ordinal categorical variable with nine possible values, the encoding would be 0.10, 0.20, . . . 0.90.
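A minimal sketch of equal-interval encoding for an ordinal variable, again with an illustrative helper name that is not part of the demo program:

// illustrative sketch, not part of the demo program
static double EqualIntervalEncode(string value,
  string[] orderedValues)
{
  int n = orderedValues.Length;
  int idx = Array.IndexOf(orderedValues, value);  // 0-based position
  return (idx + 1) / (double)(n + 1);  // short = 0.25, medium = 0.50, tall = 0.75
}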

Because the encoding for categorical variables results in all encoded values being between 0.0 and 1.0, it makes sense that normalized numeric values should be in that same range. The easiest way to do this is to apply min-max normalization. The age variable data has min_age = 18 and max_age = 68. To normalize an age, the calculation is age' = (age - min_age) / (max_age - min_age). For example, the first person's age is 24 so the normalized age is (24 - 18) / (68 - 18) = 6 / 50 = 0.12.

The income variable is normalized in the same way, where min_income = $20,300 and the max_income is $81,800. So the first person's income of $29,500 is normalized to (29,500 - 20,300) / (81,800 - 20,300) = 0.1496.
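In code, min-max normalization is a one-liner. The helper name is illustrative, not part of the demo program:

// illustrative sketch, not part of the demo program
static double MinMaxNormalize(double x, double min, double max)
{
  return (x - min) / (max - min);
}

// for example, MinMaxNormalize(24.0, 18.0, 68.0) = 0.12
// and MinMaxNormalize(29500.0, 20300.0, 81800.0) = 0.1496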

The Overall Program Structure
The overall structure of the demo program is presented in Listing 1. All the control logic is in the Main() method which calls several helper functions. The KMeans class contains all the clustering functionality.

Listing 1: Demo Program Structure

using System;
using System.IO;
using System.Collections.Generic;

namespace ClusterMixedKMeans
{
  internal class ClusterMixedProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin mixed data k-means" +
        " using C# ");

      // load raw data
      // load encoded data
      // apply k-means to encoded data
      // display clustering results

      Console.WriteLine("End demo ");
      Console.ReadLine();
    } // Main

    // helper functions:
    static double[][] NormAndEncode(string fn, char delim,
      string comment) { . . }

    static List<int>[] ItemsByCluster(int[] clustering,
      int k) { . . }

    static void ShowItemIndicesByCluster(List<int>[] arr,
      int nItemsPerCluster) { . . }

    static void ShowItemsByCluster(List<int>[] arr,
      string[] rawData, int nItemsPerCluster) { . . }

    static void MatShow(double[][] m, int[] decs,
      int nRows, bool showIndices) { . . }

    static void VecShow(int[] vec, int wid,
      int nItems) { . . }

    static string[] FileLoad(string fn,
      string comment) { . . }

    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment) { . . }
  } // Program

  public class KMeans { . . }

} // ns

The demo program begins by loading the raw source data into memory as an array of string values:

string rf =
  "..\\..\\..\\Data\\people_raw_space.txt";
string[] rawFileArray = FileLoad(rf, "#");

Console.WriteLine("Raw source data: ");
for (int i = 0; i < 4; ++i) {
  Console.Write("[" + i.ToString().PadLeft(3) + "] ");
  Console.WriteLine(rawFileArray[i]);
}
Console.WriteLine(" . . . ");

Having the source data in memory allows you to display the data by cluster so that you can examine it visually for any interesting patterns. Next, the demo loads the normalized and encoded data into memory:

// preprocessed data version
string fn =
  "..\\..\\..\\Data\\people_encoded.txt";
double[][] X = MatLoad(fn,
  new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 },
  ',', "#");
Console.WriteLine("Encoded data: ");
// decimals to display
int[] decs = new int[] { 1, 2,2,2,2,2,2, 4,4,4,4 };
MatShow(X, decs, 4, true);

The arguments passed to the program-defined MatLoad() helper function mean: load columns 0 through 10 of the encoded data file, the file is comma-delimited, and lines that begin with "#" are comments.

This assumes that the data has been normalized and encoded in a manual preprocessing step, for example by dropping the source data into Excel and then manipulating it. An alternative is to programmatically normalize and encode the data:

// programmatic version
string rf =
  "..\\..\\..\\Data\\people_raw_space.txt";
double[][] X = NormAndEncode(rf, ' ', "#");

The program-defined NormAndEncode() function is hard-coded with values specific to the demo data. For example, a snippet of the code is:

string heightStr = tokens[1].Trim();
if (heightStr == "short") result[i][1] = 0.25;
else if (heightStr == "medium") result[i][1] = 0.50;
else if (heightStr == "tall") result[i][1] = 0.75;

Writing a general-purpose library function to normalize and encode any dataset is possible but is surprisingly difficult and not worth the effort unless you are clustering several different datasets.

Clustering the Data
After the normalized and encoded data is read into memory as an array-of-arrays style matrix, it is clustered like so:

Console.WriteLine("Clustering with k=3 seed=0");
KMeans km = new KMeans(X, k:3, seed:0);
int[] clustering = km.Cluster();
Console.WriteLine("Done ");
      
Console.WriteLine("Result clustering: ");
VecShow(clustering, 3, 16);
Console.WriteLine("Result WCSS = " + 
  km.bestWCSS.ToString("F4"));

The seed value passed to the KMeans constructor controls the initial random cluster assignments. If you use different seed values, you should get very similar, but not necessarily identical, clustering results. If changing the random seed value produces significantly different clustering results, it usually means that your data is not well-suited for k-means clustering.

The demo program concludes by displaying the clustering results in two additional formats, and displaying the cluster means/centroids:

List<int>[] clusterLists = ItemsByCluster(clustering, k:3);
Console.WriteLine("Item indices by cluster ID: ");
ShowItemIndicesByCluster(clusterLists, 12);

Console.WriteLine("Source data by cluster ID: ");
ShowItemsByCluster(clusterLists, rawFileArray, 3);

Console.WriteLine("Means: ");
MatShow(km.bestMeans, decs, nRows:3, showIndices:true);

Modifying the Demo Code
The demo code can be used as-is in most situations. The KMeans class uses random partition initialization where all data items are initially assigned to a randomly selected cluster ID. An alternative is to use Forgy initialization where the k initial means/centroids are selected as random data items. In practice there is rarely any difference between using random partition initialization and Forgy initialization. Another initialization algorithm is called k-means++, but this is a bit more difficult to implement.
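To illustrate the difference between the two initialization schemes, here is a minimal sketch of each. These helpers are illustrative and are not the internals of the demo KMeans class; they assume the demo's using System and using System.Collections.Generic directives and a Random object for reproducibility.

// illustrative sketches, not the demo KMeans class internals
static int[] InitRandomPartition(int N, int k, Random rnd)
{
  int[] clustering = new int[N];
  for (int i = 0; i < N; ++i)
    clustering[i] = i % k;  // every cluster gets at least one item
  for (int i = N - 1; i > 0; --i)  // Fisher-Yates shuffle
  {
    int j = rnd.Next(0, i + 1);
    int tmp = clustering[i];
    clustering[i] = clustering[j]; clustering[j] = tmp;
  }
  return clustering;
}

static double[][] InitForgy(double[][] X, int k, Random rnd)
{
  double[][] means = new double[k][];
  HashSet<int> used = new HashSet<int>();
  int count = 0;
  while (count < k)  // pick k distinct data items as the initial means
  {
    int idx = rnd.Next(0, X.Length);
    if (used.Add(idx))
      means[count++] = (double[])X[idx].Clone();
  }
  return means;
}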

The KMeans class uses Euclidean distance between data items. An alternative is to use Manhattan distance, which is the sum of the absolute values of the differences between data item elements, rather than the square root of the sum of squared differences. Because Manhattan distance doesn't square the differences, it is sometimes said that, in theory, it may be preferable for data that has very high dimensionality. I'm mildly skeptical of this claim, and Euclidean distance seems to work fine in practice.
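For reference, here is a minimal sketch of the two distance functions; the demo KMeans class uses the Euclidean version. The helper names are illustrative:

// illustrative sketches of the two distance functions
static double EuclideanDistance(double[] a, double[] b)
{
  double sum = 0.0;
  for (int i = 0; i < a.Length; ++i)
    sum += (a[i] - b[i]) * (a[i] - b[i]);
  return Math.Sqrt(sum);
}

static double ManhattanDistance(double[] a, double[] b)
{
  double sum = 0.0;
  for (int i = 0; i < a.Length; ++i)
    sum += Math.Abs(a[i] - b[i]);
  return sum;
}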

The KMeans class clustering algorithm uses two loops. In pseudo-code:

best_wcss = a very large value
loop exactly trials times
  wcss, clustering, means = cluster using at most maxIter iterations
  if wcss < best_wcss then
    best_wcss = wcss
    best_clustering = clustering
    best_means = means
  end_if
end_loop
return best_clustering

The KMeans constructor sets the outer loop number of trials to N * 5, where N is the number of data items being clustered. For the demo data this is 240 * 5 = 1200 trials. This N * 5 value has worked well in practice, but I can imagine scenarios where you need a larger or smaller value, or possibly a fixed value for performance comparisons. For example, the scikit-learn Python KMeans module uses a small fixed default for the number of restarts (its n_init parameter) rather than a value that scales with N. Because the trials variable is a public class member, you can set it directly before calling the Cluster() method.
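For example, a minimal snippet that overrides the default number of trials, using the demo KMeans class shown earlier:

KMeans km = new KMeans(X, k: 3, seed: 0);
km.trials = 250;  // override the default of N * 5
int[] clustering = km.Cluster();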

The value of the max iterations used in one clustering operation is set to N * 2. The k-means algorithm usually converges quickly to a result where the clustering no longer changes -- about 15 iterations for the demo data -- so it's unlikely that you'll need to modify the maxIter value.

Wrapping Up
As is often the case with machine learning, when using the k-means clustering technique presented in this article, the most time-consuming part of the process is preparing the raw data by normalizing and encoding it. For relatively small datasets (typically less than N = 1000 and fewer than 30 columns), I usually just drop the raw data into an Excel spreadsheet and process the data there. For larger datasets it's usually worth the effort to implement a hard-coded specialized function to do the work, like the NormAndEncode() helper function in the demo program.

The technique presented in this article is all about encoding and normalizing any dataset so that it can be clustered using standard k-means. For example, I passed the encoded and normalized data to the scikit-learn Python KMeans module and got results identical to those of the C# KMeans implementation presented in this article.

The technique presented in this article to cluster mixed categorical and numeric data can also be used to cluster strictly categorical data using k-means.

The demo uses k = 3 clusters, which is more or less arbitrary. To find a good value for k, you can cluster using values of k from 1 to 10 and track the resulting WCSS values. The WCSS values typically decrease quickly and then begin to level out at some point. The k value at that point is often a good choice. This is called the "elbow" technique.
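A minimal sketch of the elbow technique, using the demo KMeans class (the seed value is arbitrary):

// cluster with k = 1 to 10 and examine how WCSS decreases
for (int k = 1; k <= 10; ++k)
{
  KMeans km = new KMeans(X, k, seed: 0);
  km.Cluster();
  Console.WriteLine("k = " + k +
    "  WCSS = " + km.bestWCSS.ToString("F4"));
}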

An alternative for clustering mixed categorical and numeric data is an older technique called k-prototypes clustering. One disadvantage of k-prototypes clustering is that it's not as widely implemented as k-means clustering routines are. Another disadvantage is that it's relatively crude because it uses a dissimilarity function that simply counts the number of mismatched ordinal-encoded categorical values between two items. In a set of limited experiments, the clustering technique presented in this article beat or matched k-prototypes clustering results on all datasets.
