### Differential Evolution Optimization

Dr. James McCaffrey of Microsoft Research explains how to implement differential evolution optimization (DEO), a bio-inspired technique that can serve as an alternative to stochastic gradient descent (SGD) for neural network training.

Training a neural network is the process of finding good values for the network's weights and biases. Put another way, training a neural network is the process of using an optimization algorithm of some sort to find values for the weights and biases that minimize the error between the network's computed output values and the known correct output values from the training data.

The most common type of optimization for neural network training is some form of stochastic gradient descent (SGD). SGD has many variations including Adam (adaptive moment estimation), Adagrad (adaptive gradient) and so on. All SGD-based optimization algorithms use the Calculus derivative (gradient) of an error function. But there are alternative optimization techniques that don't use gradients. Examples include bio-inspired optimization techniques such as genetic algorithms and particle swarm optimization, and geometry-inspired techniques such as Nelder-Mead and spiral dynamics.

This article explains how to implement a bio-inspired optimization technique called differential evolution optimization (DEO). A good way to see where this article is headed is to take a look at the screenshot of a demo program in **Figure 1**. The demo program uses DEO to solve the Rastrigin function in three dimensions.

*Figure 1:* Using Differential Evolution Optimization to Solve the Rastrigin Function.

The Rastrigin function with dim = 3 has a known min-value solution of 0.0 at (0, 0, 0). The demo program sets up pop_size = 10 random points where each point is a possible solution. DEO has three key parameters. Differential weight, F, is set to 0.5. Crossover rate, CR, is set to 0.70. Maximum number of generations, max_gen, is set to 100. These parameters will be explained shortly.

The demo program iterates 100 generations. On each generation, each of the 10 possible solution points produces a new candidate solution. If the new candidate solution is better than the current solution that generated the candidate, the new candidate solution replaces the old solution. After 100 generations, the best solution found is at (-0.000001, 0.000000, 0.000001), which is very close to the true solution at (0, 0, 0).

This article assumes you have an intermediate or better familiarity with a C-family programming language. The demo program is implemented using Python, but you should have no trouble refactoring to another language such as C# or JavaScript if you wish.

The complete source code for the demo program is presented in this article and is also available in the accompanying file download. All normal error checking has been removed to keep the main ideas as clear as possible.

To run the demo program, you must have Python installed on your machine. The demo program was developed on Windows 10 using the Anaconda 2020.02 64-bit distribution (which contains Python 3.7.6). The demo program has no significant dependencies so any relatively recent version of Python 3 will work fine.

**The Rastrigin Function**

The Rastrigin function is a standard benchmark problem for testing optimization algorithms. The Rastrigin function can be defined for dimension = n = 2 or higher. The equation is f(x) = 10n + Sum [ xi^2 - (10 * cos(2*pi*xi)) ]. The graph in **Figure 2** shows the Rastrigin function for dim = n = 2 where the minimum value is 0.0 at (0, 0). The Rastrigin function is challenging because it resembles an egg carton with many depressions, each of which can trap an optimization algorithm in a local minimum that is not the global minimum.
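As a quick check on the definition, the sketch below evaluates the Rastrigin function with NumPy. The function name `rastrigin` and the vectorized form are chosen here for illustration; the demo program later in the article uses its own loop-based version.

```python
import numpy as np

def rastrigin(x):
    # f(x) = 10n + Sum[ xi^2 - 10*cos(2*pi*xi) ]
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    return 10.0 * n + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))

print(rastrigin([0.0, 0.0]))  # the global minimum
print(rastrigin([1.0, 1.0]))  # inside one of the neighboring depressions
```

The point (1, 1) sits in one of the egg-carton depressions, where the function value is about 2.0 rather than 0.0. An optimizer that settles there has found a local minimum, not the global one.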

**Understanding Differential Evolution**

An evolutionary algorithm is any algorithm that loosely mimics biological evolutionary mechanisms such as mating, chromosome crossover, mutation and natural selection.

A generic form of a standard evolutionary algorithm is:

    create a population of possible solutions
    loop
      pick two good solutions
      combine them using crossover to create child
      mutate child slightly
      if child is good then
        replace a bad solution with child
      end-if
    end-loop
    return best solution found

Standard evolutionary algorithms can be implemented using dozens of specific techniques. Differential evolution is a special type of evolutionary algorithm that has a relatively well-defined structure:

    create a population of possible solutions
    loop
      for-each possible solution
        pick three other random solutions
        combine the three to create a mutation
        combine curr solution with mutation = candidate
        if candidate is better than curr solution then
          replace current solution with candidate
        end-if
      end-for
    end-loop
    return best solution found

The "differential" term in "differential evolution" is somewhat misleading. Differential evolution does not use Calculus derivatives. The "differential" refers to a specific part of the algorithm where three possible solutions are combined to create a mutation, based on the difference between two of the possible solutions.

The mechanisms of differential evolution are best explained by example. In **Figure 3** the goal is to minimize the simple sphere function (rather than the complex Rastrigin function used by the demo program) in dim = 5 which is f(X) = x0^2 + x1^2 + x2^2 + x3^2 + x4^2.

The algorithm creates a population of eight possible solutions. Each initial possible solution is randomly generated. Larger population sizes increase the chance of getting a good result at the expense of computation time. Each possible solution has an associated error. The first possible solution, x[0] = (-3.0, 4.0, 2.0, -5.0, 3.0), has an absolute error of 63.00 because (-3.0)^2 + 4.0^2 + 2.0^2 + (-5.0)^2 + 3.0^2 = 9 + 16 + 4 + 25 + 9 = 63.00.
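The error calculation for the first possible solution can be reproduced in a few lines. This is a minimal sketch; the helper name `sphere_error` is an assumption made for illustration and does not appear in the demo program.

```python
import numpy as np

def sphere_error(x):
    # f(X) = x0^2 + x1^2 + ... + x4^2, with minimum 0.0 at the origin
    return float(np.sum(np.asarray(x, dtype=np.float64) ** 2))

x0 = np.array([-3.0, 4.0, 2.0, -5.0, 3.0])
print(sphere_error(x0))  # 9 + 16 + 4 + 25 + 9
```

Squaring the whole array avoids a common pitfall: in Python, the literal expression `-3.0**2` evaluates to -9.0 because exponentiation binds more tightly than the unary minus.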

Each of the eight possible solutions is processed one at a time and produces a new candidate solution. First, three of the other possible solutions are randomly selected and labeled a, b, c. In the example, suppose that population items [2], [4] and [6] were randomly selected:

    a = (2.0, -1.0,  3.0,  1.0, -2.0)
    b = (3.0,  3.0, -4.0,  1.0, -2.0)
    c = (2.0,  0.0,  5.0,  3.0, -1.0)

Next, items a, b and c are combined into a mutation y using the equation y = a + F * (b - c). The F is the "differential weight" and is a value between 0 and 2 that must be specified. In the example, F = 0.80.

The calculations for the mutation y are:

    (b - c) = (3.0, 3.0, -4.0, 1.0, -2.0) - (2.0, 0.0, 5.0, 3.0, -1.0)
            = (1.0, 3.0, -9.0, -2.0, -1.0)

    F * (b - c) = 0.8 * (1.0, 3.0, -9.0, -2.0, -1.0)
                = (0.8, 2.4, -7.2, -1.6, -0.8)

    a + F * (b - c) = (2.0, -1.0, 3.0, 1.0, -2.0) + (0.8, 2.4, -7.2, -1.6, -0.8)
                    = (2.8, 1.4, -4.2, -0.6, -2.8)
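With NumPy arrays, the same mutation arithmetic is a single expression. The sketch below uses the example values for a, b, c and F from the text.

```python
import numpy as np

F = 0.8  # differential weight from the example
a = np.array([2.0, -1.0, 3.0, 1.0, -2.0])
b = np.array([3.0, 3.0, -4.0, 1.0, -2.0])
c = np.array([2.0, 0.0, 5.0, 3.0, -1.0])

# y = a + F * (b - c), computed element by element
mutation = a + F * (b - c)
print(mutation)
```

The result matches the hand calculation, (2.8, 1.4, -4.2, -0.6, -2.8), up to floating point rounding.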

Next, the mutation y is combined with the current possible solution x[0] using crossover to give a candidate solution. Each cell of the candidate solution takes the corresponding cell value of the mutation with probability CR (usually) or the value of the current possible solution with probability 1-CR (rarely). In the example, CR is set to 0.9 and by chance, the candidate solution took values from the mutation at cells [1], [3], [4] and took values from the current possible solution at cells [0], [2]:

    current:   (-3.0,  4.0,  2.0, -5.0,  3.0)
    mutation:  ( 2.8,  1.4, -4.2, -0.6, -2.8)
    candidate: (-3.0,  1.4,  2.0, -0.6, -2.8)
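The crossover step amounts to choosing, cell by cell, between two arrays. In the sketch below the choices are fixed to match the example (cells [1], [3] and [4] come from the mutation); in a real run each choice would be made at random with probability CR. The variable name `take_mutation` is an assumption for illustration.

```python
import numpy as np

current = np.array([-3.0, 4.0, 2.0, -5.0, 3.0])
mutation = np.array([2.8, 1.4, -4.2, -0.6, -2.8])

# True means "take the mutation's value for this cell".
# A real run would use something like: np.random.random(5) < CR
take_mutation = np.array([False, True, False, True, True])

candidate = np.where(take_mutation, mutation, current)
print(candidate)
```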

The last step is to evaluate the newly generated candidate solution. In this example the candidate solution has absolute error = (-3.0)^2 + 1.4^2 + 2.0^2 + (-0.6)^2 + (-2.8)^2 = 9.00 + 1.96 + 4.00 + 0.36 + 7.84 = 23.16. Because the error associated with the candidate solution is less than the error of the current possible solution (63.00), the current possible solution is replaced by the candidate. If the candidate solution had greater error, no replacement would take place.
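The replacement rule can be sketched as a small function that returns whichever solution has the smaller error. The names `sphere_error` and `keep_better` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def sphere_error(x):
    return float(np.sum(x ** 2))

def keep_better(current, candidate, error_fn):
    # return the solution with the smaller error, plus that error
    curr_err = error_fn(current)
    cand_err = error_fn(candidate)
    if cand_err < curr_err:
        return candidate, cand_err
    return current, curr_err

current = np.array([-3.0, 4.0, 2.0, -5.0, 3.0])
candidate = np.array([-3.0, 1.4, 2.0, -0.6, -2.8])
survivor, err = keep_better(current, candidate, sphere_error)
print(survivor, round(err, 2))
```

This version recomputes the current solution's error on every call to keep the example short. Storing each solution's error alongside the population avoids that repeated work.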

To summarize, each possible solution in a population creates a new candidate solution. The candidate is generated by combining three randomly selected possible solutions (a mutation) and then combining the mutation with the current possible solution (crossover). The candidate replaces the current solution if the candidate is better (smaller error).

Note that the terminology of differential evolution optimization varies quite a bit from one research paper to another. For example, the possible solutions are sometimes called agents, and the term mutation is used in different ways. And there are many variations of basic differential evolution. For example, in some versions of differential evolution, one cell from the mutation is guaranteed to be used in the candidate.
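One way to sketch the "guaranteed cell" variation is to pick one cell at random and force it to come from the mutation, regardless of the crossover rate. The function name `crossover_with_guarantee` is an assumption, and the code uses NumPy's `default_rng` generator purely for convenience.

```python
import numpy as np

def crossover_with_guarantee(current, mutation, cr, rng):
    dim = len(current)
    # usual per-cell choice: take the mutation with probability cr
    take_mutation = rng.random(dim) < cr
    # force one randomly chosen cell to come from the mutation
    forced = rng.integers(dim)
    take_mutation[forced] = True
    return np.where(take_mutation, mutation, current)

rng = np.random.default_rng(0)
current = np.array([-3.0, 4.0, 2.0, -5.0, 3.0])
mutation = np.array([2.8, 1.4, -4.2, -0.6, -2.8])

# with cr = 0.0 only the forced cell is taken from the mutation
candidate = crossover_with_guarantee(current, mutation, cr=0.0, rng=rng)
print(candidate)
```

The guarantee ensures that a candidate is never an exact copy of the current solution, which would waste an error evaluation.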

**The Demo Program**

The complete demo program, with a few minor edits to save space, is presented in **Listing 1**.

**Listing 1:** Differential Evolution Demo Program

    # diff_evo_demo.py
    # use differential evolution to solve
    # Rastrigin function for dim = 3
    # Python 3.7.6

    import numpy as np

    def rastrigin_error(x, dim):
        # f(X) = Sum[xj^2 - 10*cos(2*pi*xj)] + 10n
        z = 0.0
        for j in range(dim):
            z += x[j]**2 - (10.0 * np.cos(2*np.pi*x[j]))
        z += dim * 10.0
        # return squared difference from true min at 0.0
        err = (z - 0.0)**2
        return err

    def main():
        print("\nBegin Differential Evolution demo ")
        print("Goal is to minimize Rastrigin dim = 3 ")
        print("Function has known min = 0.0 at (0, 0, 0) ")
        np.random.seed(1)
        np.set_printoptions(precision=6, suppress=True, sign=" ")
        np.set_printoptions(formatter={'float': '{: 0.6f}'.format})

        dim = 3
        pop_size = 10
        F = 0.5   # mutation
        cr = 0.7  # crossover
        max_gen = 100
        print("\nSetting population size = %d " % pop_size)
        print("Setting differential weight F = %0.1f " % F)
        print("Setting crossover rate r = %0.2f " % cr)
        print("Setting max generations max_gen = %d " % max_gen)

        # create array-of-arrays population of random solutions
        print("\nCreating random solutions and their error ")
        population = \
            np.random.uniform(low=-5.0, high=5.0, size=(pop_size,dim))
        popln_errors = np.zeros(pop_size)
        for i in range(pop_size):
            popln_errors[i] = rastrigin_error(population[i], dim)

        # main processing loop
        for g in range(max_gen):
            for i in range(pop_size):  # each possible soln in pop
                # pick 3 other possible solns
                indices = np.arange(pop_size)  # [0, 1, 2, . . ]
                np.random.shuffle(indices)
                for j in range(3):
                    if indices[j] == i:
                        indices[j] = indices[pop_size-1]

                # use the 3 others to create a mutation
                a = indices[0]; b = indices[1]; c = indices[2]
                mutation = population[a] + F * \
                    (population[b] - population[c])
                for k in range(dim):
                    if mutation[k] < -5.0: mutation[k] = -5.0
                    if mutation[k] > 5.0: mutation[k] = 5.0

                # use mutation and curr item to create candidate
                new_soln = np.zeros(dim)
                for k in range(dim):
                    p = np.random.random()  # between 0.0 and 1.0
                    if p < cr:  # usually
                        new_soln[k] = mutation[k]
                    else:
                        new_soln[k] = population[i][k]  # use current

                # replace curr soln if new soln is better
                new_soln_err = rastrigin_error(new_soln, dim)
                if new_soln_err < popln_errors[i]:
                    population[i] = new_soln
                    popln_errors[i] = new_soln_err

            # find curr best soln
            best_idx = np.argmin(popln_errors)
            best_error = popln_errors[best_idx]
            if g % 10 == 0:
                print("Generation = %4d | best error = "
                      "%10.4f | best_soln = " % (g, best_error), end="")
                print(population[best_idx])

        # show final result
        best_idx = np.argmin(popln_errors)
        best_error = popln_errors[best_idx]
        print("\nFinal best error = %0.4f best_soln = "
              % best_error, end="")
        print(population[best_idx])
        print("\nEnd demo ")

    if __name__ == "__main__":
        main()

The demo begins by setting the global NumPy random seed (so results are reproducible) and the key parameters:

    np.random.seed(1)
    dim = 3
    pop_size = 10
    F = 0.5   # mutation
    cr = 0.7  # crossover
    max_gen = 100

Differential evolution optimization is quite sensitive to its parameter values, which means you usually must do quite a bit of experimentation to get good results.
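One simple way to experiment is to wrap the algorithm in a function and try a small grid of F and CR values. The sketch below is a compact rewrite made for this purpose, not the demo program itself. The function `run_de`, the grid values and the seed are all assumptions. It reports the raw Rastrigin function value rather than the squared error used by the demo, and it uses NumPy's `default_rng` generator instead of the global seed.

```python
import numpy as np

def rastrigin(x):
    return 10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))

def run_de(F, cr, dim=3, pop_size=10, max_gen=100, seed=0):
    # compact DE loop; returns the lowest function value found
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, size=(pop_size, dim))
    errs = np.array([rastrigin(p) for p in pop])
    for _ in range(max_gen):
        for i in range(pop_size):
            # three distinct population items other than item i
            others = np.delete(np.arange(pop_size), i)
            a, b, c = rng.choice(others, size=3, replace=False)
            mutation = np.clip(pop[a] + F * (pop[b] - pop[c]), -5.0, 5.0)
            # per-cell crossover, then keep the candidate if it is better
            mask = rng.random(dim) < cr
            cand = np.where(mask, mutation, pop[i])
            cand_err = rastrigin(cand)
            if cand_err < errs[i]:
                pop[i], errs[i] = cand, cand_err
    return float(errs.min())

results = {}
for F in (0.3, 0.5, 0.9):
    for cr in (0.3, 0.7, 0.9):
        results[(F, cr)] = run_de(F, cr)
        print("F = %0.1f  cr = %0.1f  best = %0.6f" % (F, cr, results[(F, cr)]))
```

Because the outcome of any single run depends on the random seed, a fairer comparison would repeat each setting over several seeds and compare averages.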

The demo creates an initial population of 10 possible solutions and computes the error of each:

    population = \
        np.random.uniform(low=-5.0, high=5.0, size=(pop_size,dim))
    popln_errors = np.zeros(pop_size)
    for i in range(pop_size):
        popln_errors[i] = rastrigin_error(population[i], dim)

In most problem scenarios, you must set limits on the possible values for each element of a solution vector. In this example the range of possible values is set to [-5.0, +5.0].

The main processing loop iterates max_gen times and begins by selecting three random solutions:

    for g in range(max_gen):
        for i in range(pop_size):  # each possible soln
            # pick 3 other possible solns
            indices = np.arange(pop_size)  # [0, 1, 2, . . ]
            np.random.shuffle(indices)     # like [6, 0, 5, . .]
            for j in range(3):
                if indices[j] == i:
                    indices[j] = indices[pop_size-1]
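As an aside, three distinct indices that exclude the current item can also be drawn directly, without shuffling the whole index array. The sketch below is one possible alternative, not the demo's approach. The helper name `pick_three_others` is an assumption, and the code relies on NumPy's `default_rng` generator.

```python
import numpy as np

def pick_three_others(pop_size, i, rng):
    # all population indices except the current one
    others = np.delete(np.arange(pop_size), i)
    # three distinct indices, drawn without replacement
    return rng.choice(others, size=3, replace=False)

rng = np.random.default_rng(1)
a, b, c = pick_three_others(pop_size=10, i=4, rng=rng)
print(a, b, c)
```

Both approaches produce three different indices, none of which equals i. Which one to use is largely a matter of taste.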

After picking three random indices, the code checks to see if any of the three are the same as the current population item index i. If so, the duplicate index is arbitrarily replaced by the last of the scrambled index values. Next, items a, b and c are used to create a mutation y using the equation y = a + F * (b - c):

    a = indices[0]; b = indices[1]; c = indices[2]
    mutation = population[a] + F * \
        (population[b] - population[c])
    for k in range(dim):
        if mutation[k] < -5.0: mutation[k] = -5.0
        if mutation[k] > 5.0: mutation[k] = 5.0
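The element-by-element range checks in the loop can be collapsed into a single call to `np.clip`, which limits every element of an array to a given interval. The mutation values below are hypothetical and chosen only to show both limits being applied.

```python
import numpy as np

# hypothetical mutation with two out-of-range elements
mutation = np.array([6.2, -0.4, -7.9])

# clamp every element to the interval [-5.0, +5.0]
mutation = np.clip(mutation, -5.0, 5.0)
print(mutation)
```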

If any element of the mutation is outside the range [-5.0, +5.0], it's brought back into range. Next, the current population item and the mutation are combined using crossover to create a new candidate solution:

    new_soln = np.zeros(dim)
    for k in range(dim):  # each element
        p = np.random.random()  # between 0.0 and 1.0
        if p < cr:  # usually
            new_soln[k] = mutation[k]
        else:
            new_soln[k] = population[i][k]  # use current

There are several alternative strategies for the differential evolution crossover mechanism. For example, the original 1997 research paper ("Differential Evolution - A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces" by R. Storn and K. Price) used a version of contiguous crossover from genetic algorithms. The crossover approach used by the demo is the simplest.
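For comparison, here is a sketch of contiguous crossover as it is commonly described in the differential evolution literature, where it is often called exponential crossover. It starts at a random cell and copies a run of consecutive cells from the mutation, wrapping around the end of the vector. This is an illustrative formulation and may differ in detail from the scheme in the 1997 paper. The function name `exponential_crossover` is an assumption.

```python
import numpy as np

def exponential_crossover(current, mutation, cr, rng):
    dim = len(current)
    candidate = current.copy()
    k = rng.integers(dim)  # random starting cell
    copied = 0
    while True:
        candidate[k] = mutation[k]  # copy one cell from the mutation
        copied += 1
        k = (k + 1) % dim           # move to the next cell, wrapping around
        # stop after a full pass, or when a random draw fails the cr test
        if copied >= dim or rng.random() >= cr:
            break
    return candidate

rng = np.random.default_rng(0)
current = np.array([-3.0, 4.0, 2.0, -5.0, 3.0])
mutation = np.array([2.8, 1.4, -4.2, -0.6, -2.8])
print(exponential_crossover(current, mutation, 0.9, rng))
```

In this formulation at least one cell always comes from the mutation, and the cells taken from the mutation are always adjacent.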

After the new candidate solution has been created, it is compared to the current population item:

    # replace curr soln if new soln is better
    new_soln_err = rastrigin_error(new_soln, dim)
    if new_soln_err < popln_errors[i]:
        population[i] = new_soln
        popln_errors[i] = new_soln_err

Unlike some optimization algorithms, differential evolution doesn't need to explicitly track the best solution found because the best solution will always be in the population. If you modify the basic differential evolution algorithm by periodically replacing a population item with a new randomly generated solution, you need to make sure you don't overwrite the current best solution in the population.
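A minimal sketch of that safeguard is shown below: the item to overwrite is chosen at random from every index except the one holding the current best solution. The function `inject_random_solution`, the sphere error function and the small population are assumptions made for illustration.

```python
import numpy as np

def sphere_error(x):
    return float(np.sum(x ** 2))

def inject_random_solution(population, popln_errors, error_fn, rng,
                           lo=-5.0, hi=5.0):
    # overwrite one randomly chosen item, but never the current best
    best_idx = int(np.argmin(popln_errors))
    replaceable = np.delete(np.arange(len(population)), best_idx)
    idx = int(rng.choice(replaceable))
    population[idx] = rng.uniform(lo, hi, size=population.shape[1])
    popln_errors[idx] = error_fn(population[idx])
    return idx

rng = np.random.default_rng(2)
population = rng.uniform(-5.0, 5.0, size=(4, 3))
popln_errors = np.array([sphere_error(p) for p in population])
best_idx_before = int(np.argmin(popln_errors))
best_before = population[best_idx_before].copy()

replaced = inject_random_solution(population, popln_errors, sphere_error, rng)
print("replaced item", replaced, "| best item was", best_idx_before)
```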

**Wrapping Up**

Differential evolution optimization was originally designed for use in electrical engineering problems. But DEO has received increased interest as a possible technique for training deep neural networks. The biggest disadvantage of DEO is performance. DEO typically takes much longer to train a deep neural network than standard stochastic gradient descent (SGD) optimization techniques. However, DEO is not subject to the SGD vanishing gradient problem. At some point in the future, it's quite likely that advances in computing power (possibly through quantum computing) will make differential evolution optimization and similar bio-inspired techniques viable alternatives to SGD training techniques.