In-Depth
Generating Distinct, Random Array Indices
James McCaffrey demonstrates the brute-force, Fisher-shuffle and reservoir-sampling techniques.
A surprisingly common, but tricky, mini-task in the area of machine-learning software is generating distinct, random array indices. For example, if you have an array with 10 cells, you may want to select the indices of four of those cells at random; for example, indices 2, 9, 0 and 5. In this article I'll demonstrate three ways to generate distinct, random array indices. The first technique uses brute force to repeatedly generate indices. The second technique uses a shuffle approach, while the third one uses a very clever but not well-known idea called reservoir sampling.
To see where I'm headed, take a look at the demo program in Figure 1. It uses each of the three different techniques to generate six sets of random indices between 0 and 9. Notice that the indices generated by the brute-force method are sorted, but the indices generated by the Fisher-shuffle and reservoir-sampling methods aren't sorted.
I'll walk you through the demo program so you'll understand the pros and cons of each technique and know when each technique is appropriate. I coded the demo program using C#, but you should have no trouble refactoring the demo code to another language such as Visual Basic or JavaScript. This article assumes you have intermediate-level programming skills. I've removed all error checking from the demo program to keep the main ideas clear.
The Demo Program Structure
To create the demo program, I launched Visual Studio and created a new C# console application named DistinctIndices. The demo has no significant .NET dependencies, so any version of Visual Studio will work. After the template code loaded in the Visual Studio editor, I deleted all unnecessary using statements at the top of file Program.cs, leaving just the reference to the System namespace.
The overall structure of the demo program, with some minor edits to the WriteLine statements, is presented in Listing 1.
Listing 1. Distinct indices program structure.
using System;
namespace DistinctIndices
{
class Program
{
static Random rnd = new Random(0);
static void Main(string[] args)
{
Console.WriteLine("Begin demo generating random, distinct, " +
"array indices\n");
Console.WriteLine("Generating six sets of 4 random ints " +
"between 0 and 9 using brute-force\n");
for (int i = 0; i < 6; ++i)
{
int[] indices = Brute(4, 10);
Console.Write("set " + i + ": ");
ShowVector(indices);
}
Console.WriteLine("Generating six sets of 4 random ints " +
"between 0 and 9 using Fisher-shuffle\n");
for (int i = 0; i < 6; ++i)
{
int[] indices = Fisher(4, 10);
Console.Write("set " + i + ": ");
ShowVector(indices);
}
Console.WriteLine("Generating six sets of 4 random ints " +
"between 0 and 9 using reservoir-sampling\n");
for (int i = 0; i < 6; ++i)
{
int[] indices = Reservoir(4, 10);
Console.Write("set " + i + ": ");
ShowVector(indices);
}
Console.WriteLine("End demo\n");
Console.ReadLine();
}
static int[] Brute(int n, int range) { . . }
private static bool HasDups(int[] array) { . . }
static int[] Fisher(int n, int range) { . . }
static int[] Reservoir(int n, int range) { . . }
private static void ShowVector(int[] vector) { . . }
} // Program
} // ns
Method Brute implements the brute-force technique. Method HasDups is a helper method used by method Brute. Method Fisher implements the Fisher-shuffle technique. Method Reservoir implements the reservoir-sampling technique. Method ShowVector is a convenience helper to display results in the Main method.
Each of the three methods that generates distinct, random indices accepts two input parameters. The first parameter, n, is the number of indices to generate. The second parameter, range, is the number of candidate indices. Because the goal is to generate random indices, the demo program uses a class-scope Random object named rnd. The rnd object is instantiated with an arbitrary seed value of 3 so that results can be replicated.
The Brute Method
One way to generate distinct, random array indices is to use a brute-force approach: generate random candidate result indices, then check to see if any duplicate indices exist. If there are no duplicates, the method returns the current set of indices; otherwise, it tries again. Method Brute is defined in Listing 2.
Listing 2. The brute-force method.
static int[] Brute(int n, int range) // n distinct values between [0,range-1]
{
int[] result = new int[n];
int sanity = 0;
while (sanity < 100000) {
for (int i = 0; i < result.Length; ++i)
result[i] = rnd.Next(0, range);
Array.Sort(result);
if (HasDups(result) == false)
return result;
++sanity;
}
return null;
}
Local variable sanity is used to prevent an infinite loop. Candidate indices are created by the statement result[i] = rnd.Next(0, range). The statement result[i] = rnd.Next(range) would have the same effect, but I like to use the explicit start-value overload version of method Next. After candidates are generated into the result array, those candidates are sorted to make detecting duplicates easier. If no set of candidates without duplicates is ever generated, method Brute returns null. You might want to throw an Exception instead.
Helper method HasDups is defined like so:
private static bool HasDups(int[] array)
{
for (int i = 0; i < array.Length - 1; ++i)
if (array[i] == array[i + 1])
return true;
return false;
}
Method HasDups assumes its array argument is sorted, and uses the idea that if there are any duplicate values, they must be in adjacent cells.
Using the brute-force method to generate distinct, random array indices is sometimes a reasonable choice when the number of indices to generate (n) is very small compared to number of candidates values (range), because the chances of generating duplicate values is small. But if the value of n is close to the value of range, using the brute-force method is a poor option. Also, you run the risk of a failed result.
The Fisher Method
The idea of the Fisher-shuffle technique to generate distinct, random indices is to first create a big array of all possible indices, shuffle those indices into a random ordering and then select the first n indices. For example, suppose n = 4 and range = 10, as in the demo. An array holding values [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] is generated. Then that array is randomly shuffled giving, for example, [3, 8, 0, 9, 5, 7, 1, 2 6, 4]. Then the first n values are pulled into a result, giving [3, 8, 0, 9].
Method Fisher is defined in Listing 3.
Listing 3. The Fisher method.
static int[] Fisher(int n, int range)
{
int[] result = new int[n];
int[] all = new int[range];
for (int i = 0; i < all.Length; ++i)
all[i] = i;
for (int i = 0; i < all.Length; ++i)
{
int r = rnd.Next(i, range);
int tmp = all[r];
all[r] = all[i];
all[i] = tmp;
}
for (int i = 0; i < n; ++i)
result[i] = all[i];
return result;
}
The middle part of the code in method Fisher shuffles the array named "all" using an algorithm called the Fisher-Yates shuffle (also called the Knuth shuffle). Notice the call to Random.Next is rnd.Next(i, range), rather than the common mistake of rnd.Next(0, range), which would generate a random-appearing, but definitely not random, result.
The Fisher method is a reasonable choice to generate distinct, random array indices when the range value is not too large (say, under 1,000). But when the value of range is very large, and you're working on a device with limited memory, the memory storage required for the all array may not be available or may lead to poor performance.
The Reservoir Method
Generating distinct, random array indices using the reservoir-sampling technique is quick and efficient, but trickier than using the brute-force or Fisher-shuffle techniques. The idea is to seed a result array with candidate indices 0..(n-1). Then each remaining candidate index from n through range-1 is examined as a possible replacement for one of the seed indices. Method Reservoir is defined as:
static int[] Reservoir(int n, int range)
{
int[] result = new int[n];
for (int i = 0; i < n; ++i)
result[i] = i;
for (int t = n; t < range; ++t)
{
int m = rnd.Next(0, t + 1);
if (m < n) result[m] = t;
}
return result;
}
The technique is best explained by a concrete example. Suppose n = 4 and range = 10, as in the demo. Array result is seeded with [0, 1, 2, 3]. The for-loop variable t iterates starting at t = 4. A random integer m, between 0 and 4, is generated. Suppose m is 2. Because m = 2 < n = 4, result[m] is replaced by t (result[2] = 4), so the result is now [0, 1, 4, 3]. The next iteration has t = 5. Random m is between 0 and 5. Suppose m = 5; because m = 5 is not less than n = 4, the result array is not changed.
This process continues until t = 9. Random m is between 0 and 9. Notice that at each succeeding value of t, the chance of a replacement occurring becomes smaller and smaller. Although it's not entirely obvious, it can be shown mathematically that this algorithm does in fact generate distinct, random values.
The reservoir-sampling technique is my method of choice in most situations when I need to generate distinct, random array indices. The technique is fast, it doesn't run the risk of a failed result and doesn't use extra memory.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].