The Data Science Lab

Multi-Class Classification Using PyTorch, Part 1: New Best Practices

Dr. James McCaffrey of Microsoft Research updates previous tutorials with new, cutting-edge deep neural machine learning techniques.

A multi-class classification problem is one where the goal is to predict a discrete value where there are three or more possibilities. For example, you might want to predict the political leaning (conservative, moderate, liberal) of a person based on their sex, age, state where they live and annual income.

Previous articles in Visual Studio Magazine have explained multi-class classification using PyTorch. But machine learning with deep neural techniques has advanced quickly. This article updates multi-class classification techniques and best practices based on experience over the past two years.

A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo begins by loading a 200-item file of training data and a 40-item set of test data. Each tab-delimited line represents a person. The fields are sex, age, state of residence, annual income and politics type (0 = conservative, 1 = moderate and 2 = liberal). The goal is to predict politics type from sex, age, state and income.

Figure 1: Multi-Class Classification Using PyTorch Demo Run

After the training data is loaded into memory, the demo creates a 6-(10-10)-3 neural network. This means there are six input nodes, two hidden neural layers with 10 nodes each and three output nodes.
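
As a quick check on the 6-(10-10)-3 notation, the sketch below counts the weights and biases in a fully connected network of that shape. The helper name count_params is hypothetical and is not part of the demo program.

```python
# Count weights and biases in a fully connected network.
# count_params is an illustrative helper, not part of the demo program.
def count_params(layer_sizes):
  total = 0
  for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    total += n_in * n_out  # weights between adjacent layers
    total += n_out         # one bias per receiving node
  return total

n = count_params([6, 10, 10, 3])
print(n)  # (6*10 + 10) + (10*10 + 10) + (10*3 + 3) = 213
```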

The demo prepares to train the network by setting a batch size of 10, stochastic gradient descent (SGD) optimization with a learning rate of 0.01 and maximum training epochs of 1,000 passes through the training data. The meaning of these values and how they are determined will be explained shortly.

The demo program monitors training by computing and displaying the loss value for one epoch. The loss value slowly decreases, which indicates that training is probably succeeding. The magnitude of the loss values isn't directly interpretable; the important thing is that the loss decreases.

After 1,000 training epochs, the demo program computes the accuracy of the trained model on the training data as 81.50 percent (163 out of 200 correct). The model accuracy on the test data is 75.00 percent (30 out of 40 correct).

After evaluating the trained network, the demo predicts the politics type for a person who is male, 30 years old, from Oklahoma, who makes $50,000 annually. The prediction is [0.6905, 0.3049, 0.0047]. These values are pseudo-probabilities. The largest value (0.6905) is at index [0] so the prediction is class 0 = conservative.
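
The step from pseudo-probabilities to a predicted class can be sketched in a few lines. The values are the ones reported for the demo run; the class_names list is an assumption added for illustration.

```python
# Map a vector of pseudo-probabilities to a predicted class label.
# The values come from the demo run; class_names is illustrative.
probs = [0.6905, 0.3049, 0.0047]
class_names = ["conservative", "moderate", "liberal"]

# index of the largest pseudo-probability
pred_idx = max(range(len(probs)), key=lambda i: probs[i])
print(pred_idx, class_names[pred_idx])  # 0 conservative
```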

The demo concludes by saving the trained model to file so that it can be used without having to retrain the network from scratch. There are two different ways to save a PyTorch model. The demo uses the save-state approach.

This article assumes you have a basic familiarity with Python and intermediate or better experience with a C-family language but does not assume you know much about PyTorch or neural networks. The complete demo program source code and data can be found in my Sept. 1 post, "Multi-Class Classification Using PyTorch 1.12.1 on Windows 10/11."

Installing PyTorch
The demo program was developed on a Windows 10/11 machine using the Anaconda 2020.02 64-bit distribution (which contains Python 3.7.6) and PyTorch version 1.12.1 for CPU. Installing PyTorch is like riding a bicycle -- easy once you know how but difficult if you haven't done it before.

I work at a large tech company and one of my job responsibilities is to deliver training classes to software engineers and data scientists. By far the biggest hurdle for people who are new to PyTorch is installation.

There are dozens of different ways to install PyTorch on Windows. The configuration I strongly recommend for beginners is to use the Anaconda distribution of Python and install PyTorch using the pip package manager. The Anaconda distribution of Python contains a base Python engine plus over 500 add-in packages that have been tested to be compatible with each other.

After you have a Python distribution installed, you can install PyTorch in several different ways. I recommend using the pip utility (which is installed as part of Anaconda). Briefly, you download a .whl ("wheel") file to your local machine, open a command shell, and issue the command "pip install (whl-file-name)".

You can find detailed step-by-step instructions for installing Anaconda Python for Windows 10/11 in my post, "Installing Anaconda3 2020.02 with Python 3.7.6 on Windows 10/11." You can find detailed instructions for downloading and installing PyTorch 1.12.1 for Python 3.7.6 on a Windows CPU machine in my post, "Installing PyTorch 1.10.0 on Windows 10/11."

Preparing the Data
The raw demo data looks like:

F   24   michigan   29500.00   lib
M   39   oklahoma   51200.00   mod
F   63   nebraska   75800.00   con
M   36   michigan   44500.00   mod
. . .

There are 240 lines of data. Each line represents a person. The five fields are sex (M, F), age, state of residence (Michigan, Nebraska, Oklahoma), annual income and politics type (conservative, moderate, liberal). The data is artificial. The raw data was split into a 200-item set for training and a 40-item set for testing.

The raw data must be encoded and normalized. The result is:

 1   0.24   1   0   0   0.2950   2
-1   0.39   0   0   1   0.5120   1
 1   0.63   0   1   0   0.7580   0
-1   0.36   1   0   0   0.4450   1
. . .

Because neural networks only understand numbers, the sex and state predictor values (often called features in neural network terminology) must be encoded. The sex values are encoded as male = -1 and female = 1. It is possible to encode variables that have only two values as 0 and 1, but using minus-one-plus-one encoding often gives better results.

The state values are one-hot encoded as Michigan = (1 0 0), Nebraska = (0 1 0) and Oklahoma = (0 0 1). The order of the encoding is arbitrary. If the state variable had four possible values, then the encodings would be (1 0 0 0), (0 1 0 0) and so on.

The demo data normalizes the numeric age and annual income values. The age values are divided by 100, for example age = 24 is normalized to age = 0.24. The income values are divided by 100,000, for example income = $55,000.00 is normalized to 0.5500. The resulting normalized age and income values are all between 0.0 and 1.0.

The technique of normalizing numeric data by dividing by a constant does not have a standard name. Two other normalization techniques are called min-max normalization and z-score normalization. I recommend using the divide-by-constant technique whenever possible. There is convincing (but currently unpublished) research that indicates divide-by-constant normalization usually gives better results than min-max normalization or z-score normalization. The topic is quite complex. For details see my post, "Why I Don't Use Min-Max or Z-Score Normalization For Neural Networks."
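
To make the three techniques concrete, here is a minimal sketch that applies each one to the same four age values. The data is illustrative, and the sketch only shows the mechanics; it says nothing about which technique gives better model results.

```python
# Three ways to normalize the same raw age values (illustrative data).
ages = [24.0, 39.0, 63.0, 36.0]

# divide-by-constant: the constant (100) is chosen by hand
by_const = [a / 100.0 for a in ages]

# min-max: depends on the smallest and largest values in the data
lo, hi = min(ages), max(ages)
min_max = [(a - lo) / (hi - lo) for a in ages]

# z-score: depends on the mean and (population) standard deviation
mean = sum(ages) / len(ages)
sd = (sum((a - mean) ** 2 for a in ages) / len(ages)) ** 0.5
z_score = [(a - mean) / sd for a in ages]

print(by_const)
print(min_max)
print(z_score)
```

Notice that min-max and z-score results change whenever the data changes, while divide-by-constant results for a given value do not.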

The variable to predict (often called the class or the label) is politics type, which has possible values of conservative, moderate or liberal. For PyTorch multi-class classification you must encode the variable to predict using ordinal encoding. The demo sets conservative = 0, moderate = 1 and liberal = 2. The order of the encoding is arbitrary.

The demo preprocesses the raw data by normalizing numeric values and encoding categorical values. It is possible to normalize and encode training and test data on the fly, but preprocessing is usually a simpler approach.
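
The encoding and normalization rules described above can be sketched as a small preprocessing function. The function name encode_line and the three lookup tables are hypothetical helpers, not code from the demo, and the sketch assumes each raw line is tab-delimited.

```python
# Encode and normalize one raw tab-delimited line.
# encode_line and the lookup tables are hypothetical helpers.
SEX = {"M": -1.0, "F": 1.0}
STATE = {"michigan": [1.0, 0.0, 0.0],
         "nebraska": [0.0, 1.0, 0.0],
         "oklahoma": [0.0, 0.0, 1.0]}
POLITICS = {"con": 0, "mod": 1, "lib": 2}

def encode_line(line):
  sex, age, state, income, politics = line.strip().split("\t")
  x = [SEX[sex], float(age) / 100.0]   # sex, normalized age
  x += STATE[state]                    # one-hot state
  x.append(float(income) / 100000.0)   # normalized income
  return x, POLITICS[politics]         # predictors, ordinal label

x, y = encode_line("F\t24\tmichigan\t29500.00\tlib")
print(x, y)
```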

Overall Program Structure
The overall structure of the demo program is presented in Listing 1. The program imports the NumPy (numerical Python) library and assigns it an alias of np. The program imports PyTorch and assigns it an alias of T. Most PyTorch programs do not use the T alias, but my work colleagues and I often do so to save space. The demo program indents using two spaces rather than the more common four spaces, again to save space.

Listing 1: Overall Program Structure

# predict politics type from sex, age, state, income
# PyTorch 1.12.1-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import torch as T
device = T.device('cpu')

class PeopleDataset(T.utils.data.Dataset): . . .

class Net(T.nn.Module): . . .

def accuracy(model, ds): . . .

def main():
  # 0. get started
  print("Begin People predict politics type ")
  # 1. create DataLoader objects
  # 2. create network
  # 3. train model
  # 4. evaluate model accuracy
  # 5. make a prediction
  # 6. save model (state_dict approach)
  print("End People predict politics demo")

if __name__ == "__main__":
  main()

The global device is set to "cpu." If you are working with a machine that has a GPU processor, the device string is "cuda." Most of my colleagues and I develop neural networks on a local CPU machine, then if necessary (huge amount of training data or huge neural network), push the program to a GPU machine and train it there.
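
If you want the same program to run on either kind of machine, one common pattern is to pick the device at run time. This is a minimal sketch and not what the demo does; the demo hard-codes "cpu."

```python
import torch as T

# Use a GPU if PyTorch can see one, otherwise fall back to the CPU.
device = T.device("cuda" if T.cuda.is_available() else "cpu")

# Tensors created afterward can be placed on the chosen device.
x = T.tensor([[1.0, 0.24]], dtype=T.float32).to(device)
print(device, x.device)
```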

The demo has a program-defined PeopleDataset class, which stores training and test data. Data in a Dataset object can be served up in batches for training by using the built-in DataLoader object. It is possible to use training and test data directly instead of using a Dataset, but such problem scenarios are rare and you should use a Dataset for most problems.

The multi-class neural network classifier is implemented in a program-defined Net class. The Net class inherits from the built-in torch.nn.Module class which supplies most of the network functionality. Instead of using a class to define a PyTorch neural network, it is possible to create a neural network directly using the torch.nn.Sequential class. Using Sequential is simpler but less flexible than using a program-defined class. The fact that there are two completely different ways to define a PyTorch neural network can be confusing for beginners.
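
For comparison, a 6-(10-10)-3 network built with Sequential might look like the sketch below. It uses default weight and bias initialization, so it is not an exact equivalent of the demo's Net class. The input values are an assumed encoding of the person used in the demo prediction.

```python
import torch as T

# A 6-(10-10)-3 network built with Sequential (a sketch, not the demo's Net class).
net = T.nn.Sequential(
  T.nn.Linear(6, 10), T.nn.Tanh(),
  T.nn.Linear(10, 10), T.nn.Tanh(),
  T.nn.Linear(10, 3), T.nn.LogSoftmax(dim=1))

# male, age 30, Oklahoma, $50,000 income (encoded and normalized)
x = T.tensor([[-1, 0.30, 0, 0, 1, 0.5000]], dtype=T.float32)
with T.no_grad():
  log_probs = net(x)        # shape [1, 3], log-probabilities
probs = T.exp(log_probs)    # pseudo-probabilities that sum to 1
print(probs)
```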

In a multi-class neural network classification problem, you must implement a program-defined function to compute classification accuracy of the trained model. The demo program defines an accuracy() function, which accepts a network and a Dataset object.

All of the demo program control logic is contained in a program-defined main() function. The demo program begins by setting the seed values for the NumPy random number generator and the PyTorch generator. Setting seed values is helpful so that demo runs are mostly reproducible. However, when working with complex neural networks such as Transformer networks, exact reproducibility cannot always be guaranteed because of separate threads of execution.
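
Setting the seeds takes one statement per generator. In this sketch the seed value of 1 is an arbitrary choice, and the check at the end only shows that re-seeding reproduces the same PyTorch random values on a single thread.

```python
import numpy as np
import torch as T

# Seed both generators; the seed value 1 is an arbitrary choice.
np.random.seed(1)
T.manual_seed(1)
a = T.rand(3)

# Re-seeding reproduces the same sequence of values.
T.manual_seed(1)
b = T.rand(3)
print(T.equal(a, b))  # True
```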

The Dataset Definition
The demo Dataset definition is presented in Listing 2. A Dataset inherits from the torch.utils.data.Dataset class, and you must implement three methods: __init__(), __len__() and __getitem__(). The __init__() method loads the data from file into memory as PyTorch tensors. The __len__() method tells the DataLoader object that uses the Dataset how many items there are so the DataLoader knows when all items have been processed during training. The __getitem__() method returns a single data item, rather than a batch of items as you might have expected.

Listing 2: Dataset Definition

class PeopleDataset(T.utils.data.Dataset):
  # -1   0.27   0  1  0   0.7610   2

  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(0,7),
      delimiter="\t", comments="#", dtype=np.float32)
    tmp_x = all_xy[:,0:6]   # cols [0,6) = [0,5]
    tmp_y = all_xy[:,6]     # 1-D

    self.x_data = T.tensor(tmp_x,
      dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y,
      dtype=T.int64).to(device)  # 1-D

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgts = self.y_data[idx] 
    return preds, trgts  # as a Tuple

Defining a PyTorch Dataset is not trivial. You must define a custom Dataset for each problem/data scenario. The __init__() method accepts a src_file parameter, which tells the Dataset where the file of training data is located. The entire file is read into memory as a NumPy 2-dimensional array using the NumPy loadtxt() function. Commonly used alternatives include the NumPy genfromtxt() function and the Pandas read_csv() function.

The call to loadtxt() specifies argument comments="#" to indicate that lines beginning with "#" are comments and should be ignored. The "#" character is the default for comments and so the argument could have been omitted.

The syntax all_xy[:,0:6] means all rows in matrix all_xy, columns [0] to [5] inclusive. The syntax all_xy[:,6] means all rows, just column [6]. The data is read in as type float32, which is the default data type for PyTorch predictor values.
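
The loadtxt() call and the two slicing expressions can be tried without a data file. This sketch feeds two encoded lines to loadtxt() from an in-memory string buffer, which is an assumption made only to keep the example self-contained.

```python
import io
import numpy as np

# Two encoded lines held in memory instead of a file (illustrative data).
src = io.StringIO(
  "# sex age state(3) income politics\n"
  "1\t0.24\t1\t0\t0\t0.2950\t2\n"
  "-1\t0.39\t0\t0\t1\t0.5120\t1\n")

all_xy = np.loadtxt(src, usecols=range(0,7),
  delimiter="\t", comments="#", dtype=np.float32)
tmp_x = all_xy[:,0:6]  # all rows, columns [0] to [5]
tmp_y = all_xy[:,6]    # all rows, column [6] only

print(tmp_x.shape, tmp_y.shape)  # (2, 6) (2,)
```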

The data is converted from NumPy arrays to PyTorch tensors. Notice that the class labels-to-predict in self.y_data are cast from type float32 to type int64. This is required for multi-class classification.

The self.x_data and self.y_data tensors are placed on the target device using the .to(device) method, which in this case is "cpu." Because the device argument for a newly created PyTorch Tensor object defaults to None, which means the current default device is used, it's good practice to explicitly use .to(device) whenever a new Tensor is instantiated.

The __len__() method returns the number of rows in the self.x_data Tensor matrix. You could also use len(self.y_data) because the self.y_data vector has the same number of values.

The __getitem__() method accepts an index parameter, idx. The predictor values for item [idx] are pulled from self.x_data using normal array indexing. The class labels are also pulled using normal indexing syntax. The return value is a Python Tuple object where a set of predictor values is at tuple position [0] and the single associated class label is at tuple position [1].
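
To see how a DataLoader turns single items into batches, consider this sketch. The TinyDataset class is a hypothetical in-memory stand-in for PeopleDataset, and the batch size of 2 is chosen only to keep the output short.

```python
import torch as T

# A tiny in-memory Dataset (hypothetical stand-in for PeopleDataset).
class TinyDataset(T.utils.data.Dataset):
  def __init__(self):
    self.x_data = T.tensor([[ 1, 0.24, 1, 0, 0, 0.2950],
                            [-1, 0.39, 0, 0, 1, 0.5120],
                            [ 1, 0.63, 0, 1, 0, 0.7580],
                            [-1, 0.36, 1, 0, 0, 0.4450]], dtype=T.float32)
    self.y_data = T.tensor([2, 1, 0, 1], dtype=T.int64)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    return self.x_data[idx], self.y_data[idx]  # one item, as a Tuple

ds = TinyDataset()
ldr = T.utils.data.DataLoader(ds, batch_size=2, shuffle=False)
for (X, Y) in ldr:
  print(X.shape, Y.shape)  # torch.Size([2, 6]) torch.Size([2])
```

The DataLoader calls __getitem__() once per item and stacks the results, which is why __getitem__() itself returns only a single item.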

An alternative approach is to return the predictors and label as a Python Dictionary object. That code would look like:

def __getitem__(self, idx):
  preds = self.x_data[idx]
  trgts = self.y_data[idx]
  sample = { 'predictors' : preds, 'targets' : trgts }
  return sample  # as Dictionary

The Dictionary approach allows you to access predictors and labels by name rather than by indexing. However, the Dictionary approach creates "magic strings" and so the Tuple approach is more common.

The demo Dataset definition assumes that the predictor values and the class labels are in the same source file. In some situations, the predictors and label are defined in separate files. In such situations you must pass two file paths instead of just one to the __init__() method.

A Dataset must be able to store all data in memory. This usually isn't a problem. But for huge sets of data, you must create a streaming data loader. This is difficult. See the article, "How To: Create a Streaming Data Loader for PyTorch" for an example.

Defining the Network
The neural network definition is presented in Listing 3. The network architecture is 6-(10-10)-3 with tanh() hidden node activation. The number of input nodes and number of output nodes for a multi-class classifier are determined by the training data. The number of hidden layers and the number of nodes in each layer are hyperparameters that must be determined by trial and error.

The __init__() method sets up the layers and optionally specifies how to initialize the layer weights and biases. The demo uses explicit initialization, but it's more common to use default weight and bias initialization. Weight and bias initialization is a surprisingly complex topic, and the documentation on the topic is a weak point of PyTorch. The choice of initialization algorithm often has a large effect on the behavior of a neural network.

The advantage of using default weight and bias initialization is simplicity. The disadvantage is that the default initialization algorithm can change, and has changed several times. My recommendation is to use explicit weight and bias initialization for simple multi-class classifiers with just one or two hidden layers, but use default initialization for classifiers with three or more hidden layers.
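
As an illustration of what explicit initialization can look like, the sketch below initializes a single layer with Glorot (Xavier) uniform weights and zero biases. The choice of these two algorithms is an assumption for the example and is not necessarily the scheme used by the demo.

```python
import torch as T

# One possible explicit initialization scheme (an assumption for illustration):
# Glorot/Xavier uniform weights and zero biases.
T.manual_seed(1)
layer = T.nn.Linear(6, 10)
T.nn.init.xavier_uniform_(layer.weight)
T.nn.init.zeros_(layer.bias)

# Xavier uniform limit for fan_in=6, fan_out=10
bound = (6.0 / (6 + 10)) ** 0.5
print(bound)
print(layer.weight.abs().max().item())
print(layer.bias)
```

In a program-defined network class, statements like these would typically go in the __init__() method, one pair per layer.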

Listing 3: Neural Network Class Definition

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(6, 10)  # 6-(10-10)-3
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 3)


  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss() 
    return z

The demo network uses tanh() activation on the hidden nodes. In the early days of neural networks, sigmoid() activation was common, but it's now rarely used. For deep neural networks, relu() activation is often used. There are essentially no good rules of thumb for deciding which hidden layer activation to use. In all common scenarios you should try both tanh() and relu() and see which seems to work better in combination with all your other hyperparameters.

A common source of confusion for those who are new to PyTorch multi-class classification is the output layer activation function. There is a strong coupling between output activation and the loss function used during training. The demo program uses log_softmax() activation. This approach assumes that you use NLLLoss() ("negative log-likelihood loss") during training. The dim=1 argument is a bit mysterious and means "collapse on dim 1, which is the columns, and therefore process the output by rows."
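
The effect of dim=1 is easier to see on a small example. In this sketch each row of raw holds the three raw output values for one item; the numbers are illustrative.

```python
import torch as T

# Two rows of raw output values, one row per item (illustrative numbers).
raw = T.tensor([[1.0, 2.0, 3.0],
                [0.5, 0.5, 0.5]], dtype=T.float32)

log_probs = T.log_softmax(raw, dim=1)  # normalize within each row
probs = T.exp(log_probs)               # back to pseudo-probabilities
print(probs.sum(dim=1))  # each row sums to 1
```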

The underlying theory and mechanics are complex, but you don't need to understand them to create a multi-class classifier -- just use log_softmax() activation on the output nodes and use NLLLoss() during training.

However, sometime around PyTorch version 0.3 a shortcut syntax was introduced. For a multi-class classifier, if you use the CrossEntropyLoss() function during training, then log_softmax() activation will be automatically applied to the output layer and so you don't explicitly apply any output activation in the forward() method. For reasons which aren't clear to me, the new shortcut no-activation with CrossEntropyLoss() style is a bit more common than the older original log_softmax() activation with NLLLoss() style. I prefer the original log_softmax with NLLLoss() style.

To add to the mild confusion, a third approach for multi-class classification is to apply sigmoid() activation to the output nodes and use MSELoss() during training. The old log_softmax() activation with NLLLoss() is functionally equivalent to the shortcut no-activation with CrossEntropyLoss() but the sigmoid() activation with MSELoss() gives different results -- sometimes better, sometimes worse.
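
The equivalence of the first two styles can be checked directly. This sketch computes the loss both ways on the same illustrative raw output values and class labels.

```python
import torch as T

raw = T.tensor([[1.0, 2.0, 3.0],
                [0.5, 0.5, 0.5]], dtype=T.float32)  # raw output values
targets = T.tensor([2, 0], dtype=T.int64)            # class labels

# style 1: explicit log_softmax() followed by NLLLoss()
loss1 = T.nn.NLLLoss()(T.log_softmax(raw, dim=1), targets)

# style 2: no activation, CrossEntropyLoss() applied to raw values
loss2 = T.nn.CrossEntropyLoss()(raw, targets)

print(loss1.item(), loss2.item())
```

The two printed values should agree to within floating point rounding.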

The network dimensions of 6, 10, 10 and 3 are hard-coded. It is possible to pass these values as parameters to the __init__() function, but the hard-coded approach is simpler and easier to understand, which trumps the minor loss of flexibility in my opinion.

Wrapping Up
The demo code presented in this article can be used as a guide to prepare training data and a template to define a neural network for most multi-class classification problems. Part 2 will explain how to train the network, compute the trained network's classification accuracy, use the network to make predictions and save the network for use by other programs.
