Neural Network Lab

Neural Network Back-Propagation using Python

Python is James's preferred language for hybrid environments. Here's how to implement neural network back-propagation training using it.

When I'm working in a pure Microsoft technology environment, C# is my go-to programming language. But when I'm working in a hybrid environment, Python is my preferred language. I've seen a big increase in the use of Python for data-related programming. In this article I'll explain how to implement neural network back-propagation training using Python. If you don't currently use Python, examining the code in this article can be an excellent introduction to the language. And if you are a Python user, the code here can be a useful addition to your personal software tool kit.

Take a look at the screenshot of a demo run in Figure 1. The goal of the demo program is to create a neural network model that predicts the species of an iris flower based on the flower's color, petal length, and petal width. The raw demo data consists of 30 items. The first three raw items are:

blue, 1.4, 0.3, setosa
pink, 4.9, 1.5, versicolor
teal, 5.6, 1.8, virginica

Figure 1. Neural Network Back-Propagation using Python

The dependent, y-variable to predict, species, is in the last column. Species can take one of three values: setosa, versicolor, or virginica. There are three independent, x-variable features: color in the first column, petal length in the second column, and petal width in the third column. The 30-item raw data set is artificial but is based on a well-known benchmark data set called Fisher's Iris Data. The real data set has 150 items and does not include a color variable, but has sepal (a green leaf-like structure) length and width values for each data item.

The demo program splits the 30-item source data into a 24-item training set used to generate the neural network model, and a six-item test set used to evaluate the accuracy of the model. Because neural networks can only work directly with numeric data, the string values for color and species must be encoded. The independent predictor values are encoded using 1-of-(N-1) encoding so that blue is (1, 0), pink is (0, 1), and teal is (-1, -1). The dependent y-values are encoded using 1-of-N encoding so that setosa is (1, 0, 0), versicolor is (0, 1, 0), and virginica is (0, 0, 1).

In most situations, numeric input values are normalized so that all the values have relatively similar magnitudes to prevent very large numeric values (for example, an employee's annual salary in dollars) from dominating smaller values (for example, the employee's number of years with the company). For simplicity, the demo does not normalize the petal length and width values, which is feasible because the magnitudes of their values are roughly similar.
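
As a rough illustration of the normalization idea, the sketch below applies min-max scaling to one column of a list-of-lists data set. The function name normalize_column and the sample values are hypothetical and are not part of the demo program, which leaves the petal values unnormalized.

```python
def normalize_column(data, col):
    # min-max scale column `col` of a list-of-lists, in place, to [0.0, 1.0]
    values = [row[col] for row in data]
    lo, hi = min(values), max(values)
    span = float(hi - lo)
    for row in data:
        # guard against a constant column, which would divide by zero
        row[col] = (row[col] - lo) / span if span != 0 else 0.0

# illustrative values only: annual salary and years with the company
employees = [[30000.0, 2.0], [50000.0, 10.0], [90000.0, 4.0]]
normalize_column(employees, 0)
print(employees)
```

After scaling, the salary column lies between 0.0 and 1.0, so it no longer dwarfs the years column.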

After splitting and encoding, the first three training items are:

 1,  0, 1.4, 0.3, 1, 0, 0
 0,  1, 4.9, 1.5, 0, 1, 0
-1, -1, 5.6, 1.8, 0, 0, 1
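
The encoding just described can be sketched with two lookup tables. The names COLOR_CODES, SPECIES_CODES, and encode_item are hypothetical; the demo program itself works with data that has already been encoded.

```python
# hypothetical lookup tables; the demo stores pre-encoded data instead
COLOR_CODES = {"blue": [1, 0], "pink": [0, 1], "teal": [-1, -1]}  # 1-of-(N-1)
SPECIES_CODES = {"setosa": [1, 0, 0],
                 "versicolor": [0, 1, 0],
                 "virginica": [0, 0, 1]}                          # 1-of-N

def encode_item(color, length, width, species):
    # encoded color, then the two numeric values, then encoded species
    return COLOR_CODES[color] + [length, width] + SPECIES_CODES[species]

print(encode_item("teal", 5.6, 1.8, "virginica"))  # [-1, -1, 5.6, 1.8, 0, 0, 1]
```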

The demo program instantiates a neural network with four input nodes (one for each numeric input), five hidden processing nodes (the number of hidden nodes must be determined using trial and error), and three output nodes (one for each numeric output). A 4-5-3 fully connected neural network will have (4 * 5) + 5 + (5 * 3) + 3 = 43 weights and bias values.
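
The count of weights and bias values can be checked with a small helper. The function name count_weights is hypothetical and is used here only to illustrate the arithmetic.

```python
def count_weights(num_input, num_hidden, num_output):
    # input-to-hidden weights + hidden biases +
    # hidden-to-output weights + output biases
    return ((num_input * num_hidden) + num_hidden +
            (num_hidden * num_output) + num_output)

print(count_weights(4, 5, 3))  # 43
```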

Training a neural network is the process of finding a set of weight and bias values so that, for a given set of training inputs, the computed outputs produced by the neural network are very close to the known output values. Once you have these weight and bias values, you can use them to make predictions for new data that has unknown output values.

By far the most common algorithm used to train feed-forward neural networks is called back-propagation. Back-propagation compares neural network computed outputs with the target output values, determines the magnitude and direction of the difference between actual and target values, then adjusts a neural network's weights and bias values so that the new outputs will be closer to the target values. This process is repeated until the actual output values are close enough to the target values, or some maximum number of iterations has been reached.

Back-propagation requires a learning rate parameter. The learning rate controls how quickly weights and bias values change. Back-propagation often uses an optional parameter called momentum. Momentum acts to increase the speed of training. In the demo, the learning rate is set to 0.08 which is artificially large. Values between 0.01 and 0.05 are more common. The momentum value is set to 0.01. The maximum training iterations is set to 70, which is artificially small.

In the demo, after training completes, the values of the 43 weights and bias values are displayed:

-0.1435, 0.0411, . . . -0.0892 

Using these values, the demo applies the model to the training data and computes an accuracy of 21 correctly predicted iris species out of 24 = 87.50%. When the model is applied to the test set, the computed accuracy is 4 correct out of 6 = 66.67%. The accuracy of the model on the test data is a very rough estimate of how accurate the model will be when presented with new data with unknown output values.
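
One plausible way to compute classification accuracy is sketched below. The demo's actual accuracy function is a class method in the code download and may differ in detail; here the model is passed in as a function argument, and always_first is a stand-in model used only for illustration.

```python
def accuracy(data, num_input, compute_outputs):
    # data: rows of x-values followed by 1-of-N encoded target values
    # compute_outputs: any function mapping x-values to a list of outputs
    num_correct = 0
    for row in data:
        x_values = row[:num_input]
        t_values = row[num_input:]
        y_values = compute_outputs(x_values)
        # the predicted class is the index of the largest output
        if y_values.index(max(y_values)) == t_values.index(max(t_values)):
            num_correct += 1
    return float(num_correct) / len(data)

# stand-in model for illustration: always favors the first class
def always_first(x_values):
    return [0.8, 0.1, 0.1]

rows = [[1, 0, 1.4, 0.3, 1, 0, 0], [0, 1, 4.9, 1.5, 0, 1, 0]]
print(accuracy(rows, 4, always_first))  # 0.5
```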

This article assumes you understand basic neural network architecture and the feed-forward mechanism, and that you have at least intermediate-level programming skills with a C-family language. I have removed all normal error checking to keep the main ideas of back-propagation as clear as possible. The complete code for the demo program is too long to present in this article, so I focus on the back-propagation algorithm. The complete Python source code is available in the code download that accompanies this article.

Demo Program Structure
The structure of the demo program shown in Figure 1, with some minor edits and print statements removed to save space, is presented in Listing 1. I used my favorite text editor, Notepad, but there are many excellent Python editing tools available, including the Python Tools for Visual Studio plugin. I used Python version 2.7.8 rather than the newer, but not backwards-compatible Python 3.

At the top of the demo program's source code I added import statements to bring the random and math libraries into scope.

Listing 1: The Demo Program Structure
# uses Python version 2.7.8

import random
import math

# ------------------------------------

def show_data(matrix, num_first_rows):
def show_vector(vector):

# ------------------------------------

class NeuralNetwork:
  def __init__(self, num_input, num_hidden, num_output):
  def make_matrix(self, rows, cols):
  def set_weights(self, weights):
  def get_weights(self):
  def initialize_weights(self):
  def compute_outputs(self, x_values):
  def hypertan(self, x):
  def softmax(self, o_sums):
  def train(self, train_data, max_epochs, learn_rate, momentum):
  def accuracy(self, data):

# ------------------------------------

print "Begin neural network using Python demo"
print "Goal is to predict species from color, length, width"
print "The 30-item raw data looks like:"
print "[0]  blue, 1.4, 0.3, setosa"
print "[1]  pink, 4.9, 1.5, versicolor"

train_data  = ([[0 for j in range(7)]
 for i in range(24)]) # 24 rows, 7 cols
train_data[0] = [ 1, 0, 1.4, 0.3, 1, 0, 0 ]
. . .
train_data[23] = [ -1, -1, 5.8, 1.8, 0, 0, 1 ]

test_data  = ([[0 for j in range(7)]
 for i in range(6)]) # 6 rows, 7 cols
test_data[0] = [ 1, 0, 1.5, 0.2, 1, 0, 0 ]
. . . 
test_data[5] = [ 1, 0, 6.3, 1.8, 0, 0, 1 ]

print "First few lines of training data are:"
show_data(train_data, 4)

print "The encoded test data is:"
show_data(test_data, 5)

print "Creating a 4-5-3 neural network"
print "Using tanh and softmax activations"
num_input = 4
num_hidden = 5
num_output = 3
nn = NeuralNetwork(num_input, num_hidden, num_output)

max_epochs = 70    # artificially small
learn_rate = 0.08  # artificially large
momentum = 0.01
print "Setting max_epochs = " + str(max_epochs)
print "Setting learn_rate = " + str(learn_rate)
print "Setting momentum = " + str(momentum)

print "Beginning training using back-propagation"
weights = nn.train(train_data, max_epochs, learn_rate, momentum)
print "Training complete"
print "Final neural network weights and bias values:"

print "Model accuracy on training data =",
acc_train = nn.accuracy(train_data)
print "%.4f" % acc_train

print "Model accuracy on test data     =",
acc_test = nn.accuracy(test_data)
print "%.4f" % acc_test

print "End back-prop demo"

Because Python does not have a predefined main entry point, what C# programmers would normally consider code that belongs in a Main method typically comes at the end of Python source code rather than at the beginning. Functions show_data and show_vector are simple display helpers. All of the neural network logic is contained in a single NeuralNetwork class.

The training and test data has been pre-encoded and placed directly into lists train_data and test_data. In non-demo scenarios the source data would likely be stored in a text file and processed using helper functions with names like load_data, encode_data, normalize_data, and split_data.

The Neural Network Class
Because Python class members do not have to be explicitly declared, the __init__ (two leading and trailing underscore characters) function, which is somewhat similar to a C# constructor, typically defines class member variables and objects. The neural network class __init__ function is presented in Listing 2.

Listing 2: The __init__ Function
def __init__(self, num_input, num_hidden, num_output):
  self.num_input = num_input
  self.num_hidden = num_hidden
  self.num_output = num_output
  self.inputs = [0 for i in range(num_input)]
  self.ih_weights = self.make_matrix(num_input, num_hidden)
  self.h_biases = [0 for i in range(num_hidden)]
  self.h_outputs = [0 for i in range(num_hidden)]
  self.ho_weights = self.make_matrix(num_hidden, num_output)
  self.o_biases = [0 for i in range(num_output)]
  self.outputs = [0 for i in range(num_output)]
  # random.seed(0) # hidden function is 'normal' approach
  self.rnd = random.Random(0)
  self.initialize_weights()

The input values are stored in a list named self.inputs. The keyword "self" is loosely analogous to the "this" keyword in C#. The list is instantiated using a curious Python syntax mechanism called a comprehension. The input-to-hidden weights are stored in a list-of-lists style matrix named ih_weights. The matrix is instantiated using helper function make_matrix.
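
The article does not show helper make_matrix, so the following is only a plausible sketch, written as a standalone function rather than a class method. It uses a nested list comprehension so that each row is a separate list object.

```python
def make_matrix(rows, cols):
    # build each row separately so rows do not share the same list object
    return [[0.0 for j in range(cols)] for i in range(rows)]

m = make_matrix(4, 5)
m[0][0] = 1.0
print(m[1][0])  # 0.0, other rows are unaffected
```

A shortcut such as [[0.0] * cols] * rows would look similar but would make every row refer to the same list, which is why the comprehension form is used here.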

Unlike many neural network implementations, which treat hidden node and output node bias variables as special weights that have a dummy input node with constant value 1.0, I prefer to explicitly store bias values, here in lists named h_biases and o_biases.

After allocating lists to hold inputs, weights, biases, and outputs, the __init__ function initializes all weights and biases to small, random values by calling helper function initialize_weights.
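
Function initialize_weights is not listed in the article. A minimal sketch of the idea follows, assuming a range of -0.01 to +0.01 for the random values. The range, the function name make_initial_weights, and the flat-list return value are all assumptions made for illustration.

```python
import random

def make_initial_weights(num_input, num_hidden, num_output, rnd):
    # one value per weight and bias; the range below is an assumption
    lo, hi = -0.01, 0.01
    count = ((num_input * num_hidden) + num_hidden +
             (num_hidden * num_output) + num_output)
    return [rnd.uniform(lo, hi) for i in range(count)]

rnd = random.Random(0)  # a seeded generator gives reproducible results
weights = make_initial_weights(4, 5, 3, rnd)
print(len(weights))  # 43
```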

Understanding Back-Propagation
The back-propagation algorithm presented in this article involves six steps. First, the so-called gradients for each output layer node are calculated. Gradients are values that measure how far off, and in what direction (positive or negative), the current actual neural network output values are compared to the target (sometimes called "desired") values. Second, the output gradient values are used to calculate gradients for each hidden layer node. Hidden node gradients are calculated differently from the output node gradients. Third, the hidden node gradient values are used to calculate a delta value for each input-to-hidden weight. Fourth, the hidden node gradient values are used to calculate a delta value for each hidden node bias value. Fifth, the output layer node gradients are used to calculate a delta value for each hidden-to-output weight. And sixth, the output layer node gradients are used to calculate a delta value for each output node bias value.

The image in Figure 2 will help clarify how back-propagation works. The image shows a dummy 3-4-2 neural network that is not the same as the demo neural network shown in Figure 1. In this example the desired output values are not shown but are (0.1234, 0.8766). Currently, the three input values are (1.0, -2.0, 3.0) and, using the weight and bias values shown, the two computed outputs are (0.5070, 0.5073).

Figure 2. Computing Back-Propagation Gradients and Deltas

The gradient of an output layer node with a softmax activation function is equal to (1 - y)(y) * (t - y) where y is the computed output value of the node and t is the desired target value from the training data. In Figure 2, the gradient for the top-most output node is (1 - 0.5070)(0.5070) * (0.1234 - 0.5070) = -0.0954. A negative value of the gradient means the output value is larger than the target value and weights and bias values must be adjusted to make the output smaller. The gradient of the other output layer node is (1 - 0.5073)(0.5073) * (0.8766 - 0.5073) = 0.0923. The (1 - y)(y) term is the calculus derivative of the softmax function. If you use an activation function that is different from the softmax, you must use the derivative of that function.
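
The output node gradient formula can be checked with a few lines of Python. The function name output_gradient is hypothetical. The example uses the second output node from Figure 2.

```python
def output_gradient(y, t):
    # y is the computed output, t is the target value
    derivative = (1 - y) * y
    return derivative * (t - y)

# second output node in Figure 2: y = 0.5073, t = 0.8766
print("%.4f" % output_gradient(0.5073, 0.8766))  # 0.0923
```

The sign of the result follows the sign of (t - y), so an output that is larger than its target always produces a negative gradient.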

Computing the gradient of a hidden node is more complicated than computing the gradient of an output layer node. The gradient of a hidden node depends on the just-computed gradients of all output layer nodes. The gradient of a hidden node that uses the hyperbolic tangent activation function is equal to (1 - y)(1 + y) * Sum(each output gradient * weight from the hidden node to the output node). Here y is the output of the hidden node. For example, for the bottom-most hidden node in Figure 2 the gradient is (1 - 0.0400)(1 + 0.0400) * [(-0.0954 * 0.023) + (0.0923 * 0.024)] = about 0.00003 (rounded).

Because computing the back-propagation hidden layer gradients requires the values of the output layer gradients, the algorithm computes "backwards" (right-to-left in the diagram) which is why the back-propagation algorithm is named as it is. The (1 - y)(1 + y) term is the derivative of the hyperbolic tangent.
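
The hidden node gradient calculation can be sketched as a standalone function. The name hidden_gradient is hypothetical. Because the inputs used below are the rounded values quoted in the text, the result should be read as an order-of-magnitude check rather than an exact match.

```python
def hidden_gradient(y, o_grads, weights_to_outputs):
    # y is the hidden node's output; the tanh derivative is (1 - y)(1 + y)
    derivative = (1 - y) * (1 + y)
    total = 0.0
    for j in range(len(o_grads)):
        total += o_grads[j] * weights_to_outputs[j]
    return derivative * total

# rounded values quoted in the text for the bottom-most hidden node
g = hidden_gradient(0.0400, [-0.0954, 0.0923], [0.023, 0.024])
print(g > 0)  # True: the gradient is a small positive value
```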

After all the gradient values have been computed, those values are used to update the weights and bias values. Unlike the gradients, which must be computed from right to left in Figure 2, weights and bias values can be updated in any order. And all weights and bias values are updated in the same way, with the minor exception that bias values are computed slightly differently than weight values.

Observe that any neural network weight has an associated from-node and to-node. For example, in Figure 2, weight 0.004 is associated with input layer node [0] and hidden layer node [3]. For any input-to-hidden or hidden-to-output weight, a delta value (that is, a value that will be added to the weight to give the new weight) is computed as (gradient of to-node * output of from-node * learning rate). So, the delta for weight 0.004 would be computed as 0.00003 * 1.0 * 0.5 = 0.000015. Here the 0.00003 is the gradient of hidden[3], the 1.0 is the output of the from-node, which is the same as input[0], and the 0.5 is the learning rate. The new value for the weight from input[0] to hidden[3] would be 0.004 + 0.000015 = 0.004015.

Notice that the increase in the weight is very small. A small value of the learning rate makes neural network training slow because weights change very little each time through the back-propagation algorithm. A larger value for the learning rate would create a larger delta, which would create a larger change in the weight. But a too-large value for the learning rate runs the risk of the algorithm shooting past the optimal value for a weight or bias.

Most back-propagation algorithms use an optional technique called momentum. The idea is to add an additional increment to the weight in order to speed up training. The momentum term is the previously used delta multiplied by a fixed constant, typically a small value such as 0.05. If you use momentum in back-propagation, you must store each computed delta value for use in the next training iteration. In Figure 2, suppose the previous weight delta for the weight from input[0] to hidden[3] was 0.000008 and the momentum constant is 0.10. Then after adding the computed delta of 0.000015 to the current weight value of 0.004, an additional (0.000008 * 0.10) = 0.0000008 would be added.

Because the neural network class in this article treats and stores bias values as actual biases rather than as special weights with dummy constant 1.0 input values, the bias values must be updated separately. The only difference between updating a bias value and updating a weight value is that there is no output-of-from-node term in the delta.
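
The weight and bias update rules, including momentum, can be combined into one small sketch. The function name update and the numbers used are illustrative only and are not taken from Figure 2 or from the demo program.

```python
def update(value, gradient, learn_rate, momentum, prev_delta, from_output=1.0):
    # for a weight, from_output is the output of the from-node;
    # for a bias there is no such term, so it is left at 1.0
    delta = learn_rate * gradient * from_output
    new_value = value + delta + (momentum * prev_delta)
    return new_value, delta  # the delta is saved for the next iteration

# illustrative numbers only
w, w_delta = update(0.5, 0.2, 0.1, 0.5, 0.04, from_output=2.0)
b, b_delta = update(0.5, 0.2, 0.1, 0.5, 0.04)
print("%.2f %.2f" % (w, b))  # 0.56 0.54
```

Returning the delta alongside the new value makes it easy for the caller to store it as the "previous delta" for the next iteration.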

Implementing Back-Propagation
The back-propagation algorithm is used in class function train. The definition of function train begins:

def train(self, train_data, max_epochs, learn_rate, momentum):
  o_grads = [0 for i in range(self.num_output)] # gradients
  h_grads = [0 for i in range(self.num_hidden)]
  ih_prev_weights_delta = self.make_matrix(self.num_input, self.num_hidden) # momentum
  h_prev_biases_delta = [0 for i in range(self.num_hidden)]
  ho_prev_weights_delta = self.make_matrix(self.num_hidden, self.num_output)
  o_prev_biases_delta = [0 for i in range(self.num_output)]
. . .

The o_grads and h_grads lists hold the output node and hidden node gradients. To use momentum, the deltas calculated in each iteration must be saved. List-of-lists ih_prev_weights_delta holds the previous deltas for the input-to-hidden weights, and so on.

Next, the main loop is prepared:

epoch = 0
x_values = [0 for i in range(self.num_input)]
t_values = [0 for i in range(self.num_output)]
sequence = [i for i in range(len(train_data))]

Variable epoch is the training loop counter. Lists x_values and t_values hold the input values and the target output values from the training data. List sequence holds index values. The values in the list will be randomly shuffled and then used to access the training data in a different, random order in each iteration.
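
The shuffled-index technique can be tried in isolation. This sketch assumes a seeded random.Random object like the one created in the __init__ function. The list(range(...)) form is used so the code also runs on Python 3, where range does not return a list.

```python
import random

rnd = random.Random(0)      # seeded so runs are repeatable
sequence = list(range(6))   # index values 0 through 5
for epoch in range(2):
    rnd.shuffle(sequence)   # reorders the list in place
    print(sequence)         # the visiting order for this epoch
```

Shuffling a list of indices rather than the training data itself leaves the original data untouched.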

The main loop starts as:

while epoch < max_epochs:
  self.rnd.shuffle(sequence) # visit training items in random order
  for ii in range(len(train_data)):
    idx = sequence[ii]
    for j in range(self.num_input): # peel off x_values 
      x_values[j] = train_data[idx][j]
    for j in range(self.num_output): # peel off t_values
      t_values[j] = train_data[idx][j + self.num_input]
    self.compute_outputs(x_values) # outputs stored internally

The shuffle function in Python's random module is used to scramble the array indices stored in list sequence at the start of each epoch. Each index in turn is pulled and stored into variable idx. The first num_input values in the current training item are placed into list x_values and the remaining num_output values are stored into list t_values. Then class function compute_outputs does just that, storing the computed output values internally.
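
The body of compute_outputs is not shown in this article. For readers who want a reminder of the feed-forward step, here is a minimal standalone sketch using tanh hidden activation and softmax output activation. It passes the weights and biases as arguments instead of storing them in a class, which is a simplification made for illustration.

```python
import math

def hypertan(x):
    return math.tanh(x)

def softmax(o_sums):
    # subtract the max before exponentiating for numerical stability
    m = max(o_sums)
    exps = [math.exp(s - m) for s in o_sums]
    total = sum(exps)
    return [e / total for e in exps]

def compute_outputs(x_values, ih_weights, h_biases, ho_weights, o_biases):
    num_hidden, num_output = len(h_biases), len(o_biases)
    h_outputs = []
    for j in range(num_hidden):
        s = h_biases[j]
        for i in range(len(x_values)):
            s += x_values[i] * ih_weights[i][j]
        h_outputs.append(hypertan(s))
    o_sums = []
    for k in range(num_output):
        s = o_biases[k]
        for j in range(num_hidden):
            s += h_outputs[j] * ho_weights[j][k]
        o_sums.append(s)
    return softmax(o_sums)

# with all weights and biases zero, every output is equally likely
outputs = compute_outputs([1.0, 2.0], [[0.0] * 3, [0.0] * 3], [0.0] * 3,
                          [[0.0] * 2 for j in range(3)], [0.0] * 2)
print(outputs)  # [0.5, 0.5]
```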

The next part of function train implements the back-propagation algorithm. First the output node gradients are calculated as described earlier:

for i in range(self.num_output): 
  derivative = (1 - self.outputs[i]) * self.outputs[i]
  o_grads[i] = derivative * (t_values[i] - self.outputs[i])

Next, the hidden node gradients are calculated:

for i in range(self.num_hidden):
  derivative = (1 - self.h_outputs[i]) * (1 + self.h_outputs[i])
  sum = 0
  for j in range(self.num_output):
    x = o_grads[j] * self.ho_weights[i][j]
    sum += x
  h_grads[i] = derivative * sum

After the output and hidden gradients have been calculated, the weights and biases can be updated in any order. The demo updates the input-to-hidden weights first:

for i in range(self.num_input):
  for j in range(self.num_hidden):
    delta = learn_rate * h_grads[j] * self.inputs[i]
    self.ih_weights[i][j] += delta
    self.ih_weights[i][j] += momentum * ih_prev_weights_delta[i][j]
    ih_prev_weights_delta[i][j] = delta # save the delta

Next the hidden node biases are updated:

for i in range(self.num_hidden):
  delta = learn_rate * h_grads[i]
  self.h_biases[i] += delta
  self.h_biases[i] += momentum * h_prev_biases_delta[i]
  h_prev_biases_delta[i] = delta # save the delta

Then the new hidden-to-output weights are calculated:

for i in range(self.num_hidden):
  for j in range(self.num_output):
    delta = learn_rate * o_grads[j] * self.h_outputs[i]
    self.ho_weights[i][j] += delta
    self.ho_weights[i][j] += momentum * ho_prev_weights_delta[i][j]
    ho_prev_weights_delta[i][j] = delta # save

And last, the output node bias values are updated:

for i in range(self.num_output):
  delta = learn_rate * o_grads[i]
  self.o_biases[i] += delta
  self.o_biases[i] += momentum * o_prev_biases_delta[i]
  o_prev_biases_delta[i] = delta # save

Function train concludes by incrementing the loop counter, and after the main loop terminates, returning the final values of the weights and biases using function get_weights:

. . . 
      epoch += 1
    # end while
    result = self.get_weights()
    return result
  # end train

Wrapping Up
The information presented in this article should give you a good basis for experimenting with neural networks using Python. Back-propagation is by far the most common neural network training algorithm, but there are several alternative techniques, including particle swarm optimization and simplex (amoeba method) optimization.

Compared to alternative training approaches, back-propagation tends to be the fastest, even though back-propagation can be very slow for large data sets. One weakness of back-propagation is that the algorithm is often extremely sensitive to the values used for the learning rate and momentum, meaning that for some combinations of learning rate and momentum values, back-propagation can converge quite quickly to good weights and bias values. But for slightly different values of learning rate and momentum, back-propagation training may not converge at all. This means that training a neural network using back-propagation often requires some trial and error.
