The Data Science Lab

R Language Searching and Sorting

A language that's data-intensive naturally should have a way to dig into the data effectively. Here's a look at some of the R functions for searching and sorting through it all.

Searching and sorting vectors are two of the most fundamental operations in R (vectors correspond to arrays in other programming languages). In this article I'll explain how to search and sort vectors using built-in R functions and using program-defined functions.

In a nutshell, R has a built-in sort function that can sort the values in a vector using a shell sort, a radix sort and a quicksort. You can also write program-defined sorting functions using R language primitives. R has a large number of built-in sequential search functions. The most commonly used are the match, is.element and which.max functions, pluls the %in% operator, but none of these allow you to control the epsilon for floating point value equality. You can write a program-defined sequential search function with an epsilon tolerance when needed. R doesn't have a built-in binary search function, but writing such a function isn't too difficult.

To give you an idea of where this article is headed, here are five commands from a hypothetical interactive R session:

> v <- c(3, 4, 0, 2, 1)
> target <- 2 
> found <- 2 %in% v  # TRUE
> sv <- sort(v)
> idx <- which.max(v)  # idx = 2

The first statement creates an integer vector with five values. The second statement sets up a target value for which to search. The third statement uses the built-in %in% operator to search for the target. The return value is TRUE because the target value is in the vector. The fourth statement returns a sorted copy of the vector using the built-in sort function with the default radix algorithm. The last statement uses the built-in which.max function to return the 1-based index of the largest value in vector v (index first occurrence in situations where there are multiple instances of the largest value).

Take a look at the screenshot of a demo R session in Figure 1. The outer container window is the Rgui.exe shell. Inside the shell, the window on the left is the R Console where you can issue interactive commands. I use the setwd function to set the working directory to the location of my demo R file, and then I use the special R incantation rm(list=ls()) to remove all existing objects in the current workspace. Then I use the source function to execute the searchSort.R demo script/program.

[Click on image for larger view.] Figure 1. R Console with Search/Sort Code

Built-in Sequential Search Functions
If you want to copy, paste and run the demo program, and you don't have R installed on your machine, installation (and just as important, uninstallation) is quick and easy. Do an Internet search for "install R" and you'll find a URL to a page that has a link to "Download R for Windows." If you click on the link, you'll launch a self-extracting executable installer program. You can accept all the installation option defaults, and R will install in about 30 seconds.

To create the demo program, I navigated to directory C:\Program Files\R\R-3.x.y\bin\x64 and double-clicked on the Rgui.exe file. After the shell launched, from the menu bar I selected the File | New script option which launched an untitled R Editor. I added two comments to the script, indicating the file name and R version I'm using:

# searchSort.R
# R 3.3.0

and then I did a File | Save as, to save the script. I put my demo script at C:\VectorsLists, but you can use any convenient directory.

The R language was designed in part to free data scientists from having to write program-defined functions, so there are a large number of built-in search functions in R. The most basic search technique is to use the %in% operator. The demo program illustrates %in% like so:

v <- c(4, 0, 3, 6, 5, 1, 2)
cat("v: \n")
print(v)

target <- 0
cat("\ntarget = ", target, "\n")
result <- target %in% v
cat("Using built-in %in%, result = ", result, "\n\n")

The sequential search function in most general-purpose programming languages usually has a name similar to Index and returns the 0-based index of the first occurrence of the target, or -1 if the target value isn't found. The R %in% operator is a bit less sophisticated and just returns TRUE or FALSE. In this example the output is:

Using built-in %in%, result =  TRUE

In addition to searching a vector for a single target value, the %in% operator can also search for multiple targets, as I'll explain shortly, but in my opinion the %in% operator is most suitable when you have just a single target value to search for.

The built-in is.element and match functions can accept multiple target values. The demo program illustrates the is.element function with these statements:

# v <- c(4, 0, 3, 6, 5, 1, 2)
target <- c(4, 3)
cat("target = ", target, "\n")
result <- is.element(v, target)
cat("Using built-in is.element(), result = \n")
cat(result, "\n\n")

The resulting output is:

TRUE  FALSE  TRUE  FALSE  FALSE  FALSE  FALSE

The output of the is.element function is a logical vector with the same number of cells as the source search-vector. The TRUE/FALSE value indicates if the vector value is one of the target values.

The built-in match function is very similar to the is.element function except the return result is an integer vector rather than a logical vector. For example:

# v <- c(4, 0, 3, 6, 5, 1, 2)
target <- c(2, 3, 4)
cat("target = ", target, "\n")
result <- match(v, target)
cat("Using built-in match(), result = \n")
cat(result, "\n\n")

The resulting output is:

3  NA  2  NA  NA  NA  1

The 3 value at output[1] means the value at v[1], which is 4, is at position [3] of the target vector. The NA values tell you which of the source-vector values aren't being searched for. I always find the return results of the is.element and match functions a bit unintuitive to interpret.

Program-Defined Sequential Search
The built-in sequential search functions and the %in% operator work just fine in most situations. However, if you're searching a floating point vector, none of the built-in approaches allow you to control the comparison of floating point values for equality.

Many floating point numbers are only approximations -- to a fixed number of decimal points -- to their true values. For example:

> x <- 0.15 + 0.15  # so x = 0.30 > y <- 0.20 + 0.10  # so y = 0.30 > x == y            # does x == y ?
> > [1] FALSE         # what??

The demo script has a program-defined sequential search function that allows you to control how close two floating point values must be in order for them to be considered "equal enough":

seq_search = function(v, t, eps) {
  n <- length(v)
  for (i in 1:n) {
    if (abs(v[i] - t) <= eps) {
      return(i)
    }
  }
  return(0)
}

The seq_search function iterates through floating point vector v, searching for floating point target value t. If the current vector value is within plus or minus eps ("epsilon," the standard Greek letter to represent a small value), the function returns the index location of the target value. Because R vectors are 1-based, a return value of 0 indicates the target was not found.

The demo program calls the seq_search function like so:

v <- c(4.5, 0.5, 3.5, 1.5, 5.5, 6.5, 2.5)
cat("v: \n")
print(v)

target <- 0.5
cat("\ntarget = ", target, "\n")
idx <- seq_search(v, target, 1.0e-5)
cat("Using seq_search(), idx = ", idx, "\n\n")

The resulting output is:

v: 
[1] 4.5 0.5 3.5 1.5 5.5 6.5 2.5

target =  0.5 
Using seq_search(), idx =  2

Different scenarios call for different values of epsilon. The demo value of 1.0e-5 = 0.00001 is common in many machine learning scenarios.

The Built-in Sort Function
The built-in sort function can sort a vector using one of three algorithms: shell sort, radix sort or quicksort. The demo program illustrates sorting with these statements:

# v <- c(4, 0, 3, 6, 5, 1, 2)
cat("Sorting v using built-in sort() \n")
sv <- sort(v, method="shell")
cat("sv: \n")
print(sv)

The optional "method" parameter can take one of four string values: "auto," radix," "shell" or "quick." The default value is "auto," which will invoke the radix algorithm for vectors that contain integer values, logical values or factor values (a factor is a variable that can take on only one of a limited set of categorical values), and the shell algorithm for all other cell types.

The sort function, like almost all regular R functions, passes parameters by value rather than by reference. In other words, a sorted copy of the original vector is returned, and the original vector isn't changed. If you really want to sort a vector in-place rather than get a new vector, you can use the calling pattern:

v <- sort(v)

The sort function can also accept an optional parameter named "decreasing," which, if set to TRUE, will sort in descending order rather than the default ascending order. There are several other less-commonly used optional parameters, too.

Program-Defined Sort Functions
The built-in sort function can handle the vast majority of common programming scenarios, but you can write your own program-defined sort function. These situations usually arise when you need custom sorting behavior such as guaranteeing that equal values in the source vector maintain their position relative to each other (such sorting algorithms are sometimes called "stable").

The demo program defines a hideously inefficient implementation of the bubble sort algorithm:

worst_sort = function(v) {
  n <- length(v)
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      if (v[i] > v[j]) {
        tmp = v[i]; v[i] = v[j]; v[j] = tmp
      }
    }
  }
  return(v)
}

Let me emphasize that program-defined worst_sort is for demo purposes only. The only practical use of worst_sort I can think of would be to act as a timing baseline for other sorting functions.

Notice that the worst_sort function operates directly on its input parameter v. Because R functions pass by value, a copy of v is made behind the scenes so the vector argument that corresponds to v when the function is called doesn't change.

The demo program calls worst_sort like so:

# v <- c(4.5, 0.5, 3.5, 1.5, 5.5, 6.5, 2.5)
cat("Sorting v using worst_sort() \n")
sv <- worst_sort(v)
cat("sv: \n")
print(sv)

Inside the body of worst_sort, two values in the vector are swapped using a tmp variable. Interestingly, the R language supports nested function definitions so you could define an internal swap function if you wanted. However, in my opinion the use of nested function definitions creates unnecessary complexity.

Vector Binary Search
As installed, the R language doesn't have a binary search function for use when searching a sorted vector. The gtools add-on package has many utility functions, including a binsearch function. In situations where you don't want to introduce an external package dependency, you can write your own program-defined function. The demo defines a bin_search() function as:

bin_search = function(v, t, eps) {
  lo <- 1; hi <- length(v)
  while (lo <= hi) {
    mid <- as.integer(round((lo + hi) / 2)) # always even!
    if (abs(v[mid] - t) <= eps) {
      return(mid)
    } else if (v[mid] < t) { # C style would be OK
      lo <- mid + 1
    } else {
      hi <- mid - 1
    }
  }
  return(0)
}

Although there aren't many lines of code in a binary search function, implementations are a bit subtle. When calculating the middle index, I use the as.integer and round functions. You must be cautious when using the R round function because, unlike the rounding function in every other language I'm aware of, the R round function uses IEEE 754-2008 "round ties to even," For example, both round(1.5) and round(2.5) return 2.0 rather than rounding up or down.

The bin_search function uses an if, else-if, else control structure. Inside a function you can use C language-style formatting where curly braces and "else" keywords are on different lines. But outside of block statements, you must place curly braces and "else" keywords on the same line.

The demo calls the program-defined function like this:

# v <- c(4.5, 0.5, 3.5, 1.5, 5.5, 6.5, 2.5)
# sv <- worst_sort(v)
# sv : (0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5)
target <- 2.5
cat("\ntarget = ", target, "\n")
idx <- bin_search(sv, target, 1.0e-5)
cat("Using bin_search(), idx = ", idx, "\n")

The resulting output is:

target =  2.5 
Using bin_search(), idx =  3

Binary search algorithms are surprisingly tricky and there are several subtle variations you can use. The Wikipedia entry on binary search has a good explanation of these variations.

Wrapping Up
To summarize, to search an unsorted vector for a single target value, a good choice is to use the %in% operator. To search for multiple target values, you can use the is.element function, which returns results as a logical vector, or use the similar match function, which returns an integer vector. The which.max function can be used to find the index of the largest value in a vector.

You can write a program-defined sequential search function if you want to get functionality similar to the Index function found in most other programming languages, or if you need custom behavior or if you want to control the epsilon tolerance for floating point value equality.

The built-in sort function supports shell sort, radix sort and quicksort. The sort function is all you need in the vast majority of programming scenarios, but you can write a program-defined sort function if you need specialized behavior.

As installed, the R language doesn't have a binary search function for use in situations where you're searching a sorted vector. You can install the gtools add-on package and use its binsearch function, or you can write a program-defined binary search function.

About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].

comments powered by Disqus

Featured

Subscribe on YouTube