The Data Science Lab

Program-Defined Functions in R

The three most common open source technologies for writing data science programs are Python, SciLab, and R. Here's how to write program-defined functions in R.

There's no clear definition of the term data science. I think of data science as the process of programmatically analyzing data using classical statistics techniques or making predictions using machine learning techniques. Among my developer colleagues, the three most common ways to perform data science tasks with open source tools are using the R language, using the Python language, and using the SciLab (or roughly equivalent Octave) integrated system. In this article I present a short tutorial on writing program-defined functions in the R language.

Whenever I'm learning about program-defined functions in a new language, I want to know seven things: What's the basic syntax and return mechanism? Are parameters passed by value, by reference or both? Does the language support default parameter values? Does the language support function overloading? Does the language support a variable number of arguments? Does the language support recursion? Does the language support nested definitions?

The demo program shown running in Figure 1 illustrates each of these seven topics and gives you an idea of where this article is headed. As you'll see:

  1. Basic R function syntax resembles C# and uses the "function" and "return" keywords.
  2. R function parameters are passed by value, not by reference.
  3. R supports default parameter values using the "=" assignment operator.
  4. R does not support C#-style function name overloading.
  5. R supports a variable number of function arguments using the "..." token.
  6. R supports recursive function definitions.
  7. R supports nested function definitions with the "<<-" assignment operator.

Figure 1. Program-Defined Functions in R Demo

Installing R
If you're new to R and want to try out the language, the good news is that installing (and uninstalling) R is simple. You have several alternatives, including the recently released Microsoft R Server, but for simplicity I recommend using the base R system. Search the Web for "install R" and you'll find a link to https://cran.r-project.org/bin/windows/base/. Navigate to that page and click on the Download link at the top of the page to launch a self-extracting executable installer (see Figure 2).

Figure 2. Installing R

You can accept all the installation defaults. After installation finishes, go to C:\Program Files\R\R-3.x.x\bin\x64 and then double-click on the Rgui.exe file to launch an R Console shell like the one shown on the left side of Figure 1.

On the top menu bar, click File | New Script. That action will launch an R Editor window like the one on the right side of Figure 1. You write your R program (technically a script because R is interpreted) in the Editor window. You run the program by issuing a "source" command in the Console window. Program output is displayed in the Console window.

The Demo Program
The entire R demo program is presented in Listing 1. To run the program, copy and paste the code into the Editor window. With focus set to the Editor window, on the Console window menu, click File | Save As, then navigate to any convenient directory (I used the rather wordy C:\ProgramDefinedFunctionsWithR) and save the script there as functions.R.

Listing 1: Program-Defined Functions in R Demo Code
# functions.R
# program-defined functions examples

# basic syntax
my.sum = function(x, y) {
  result <- x + y
  return(result)
}

# return value is last expression
my.sumterse = function(x, y) {
  x + y
}

# 'void' function
my.printvec = function(vec, dec) {
  cat("[ ")
  for (i in 1:length(vec)) {
    x <- formatC(vec[i], digits = dec, format="f")
    cat(x, " ")
  }
  cat("]\n")
}
  
# C#-style overloading not allowed
# my.sum = function(x, y, z) { . . }
# would silently replace an existing my.sum() function

# arguments are val not ref --
# my.inc = function(arr) {
#   for (i in 1:length(arr)) {
#     arr[i] <- arr[i] + 1
#   }
# }
# does not work

# default parameter value
my.prod = function(x, y, z=10) {
  result <- x * y * z
  return(result)
}

# missing parameter value
my.prod2 = function(x, y, z) {
  if (missing(z))
    return(x * y * 10)
  else
    return(x * y * z)
}

# return two values as an array
my.sumdiff = function(x, y) {
  res1 <- x + y
  res2 <- x - y
  # result = c(res1, res2) # vector
  result <- array(0.0, 2)
  result[1] <- res1; result[2] <- res2
  return(result)
}

# return two values as a list
my.divide = function(x, y) {
  if (y == 0) {
    res = list("result" = NULL, "msg" = "error")
  }
  else {
    res = list("result" = x/y, "msg" = "success")
  }
  return(res)
}

# variable number parameters
my.multiprod = function(...) {
  vals <- list(...)
  result <- 1
  for (key in names(vals)) {
    result <- result * vals[[key]]
  }
  return(result)
}

# recursion
my.qsort = function(arr) {
  n <- length(arr)
  if (n > 1) {
    pv <- arr[n %/% 2]
    left <- my.qsort(arr[arr < pv])
    mid <- arr[arr == pv]
    right <-  my.qsort(arr[arr > pv])
    return(c(left, mid, right))
  }
  else return(arr)
}

# nested definition
my.bsort = function(arr) {
  # -----
  my.swap = function(ii, jj) {
    tmp <- arr[ii]  # tmp is local to my.swap
    arr[ii] <<- arr[jj]
    arr[jj] <<- tmp
  }
  # -----
  n <- length(arr)
  repeat {
    swapped <- FALSE
    for (i in 1:(n-1)) {
      if (arr[i] > arr[i+1]) {
        my.swap(i, i+1)
        swapped <- TRUE
      }
    }
    if (swapped == FALSE) break
  }
  return(arr)
}

# ========

cat("\nBegin program-defined functions demo \n\n")

x <- 5.1
y <- 3
z <- 2.0

cat("x, y, z = ", x, ",", y, ",", z, "\n\n")

sum <- my.sum(x, y)
cat("Result of my.sum(x,y) = ", sum, "\n\n")

vec <- c(3.14, 2/3, 1.2345)
cat("Vector vec = ", vec, "\n")
cat("Result of my.printvec(vec, 3) : ", "\n")
my.printvec(vec, 3)
# my.printvec(vec, dec=3)
cat("\n")

prod <- my.prod(x, y) # missing z
cat("Result of my.prod(x,y) = ", prod, "\n\n")

sumdiff <- my.sumdiff(x, y)
cat("Result of my.sumdiff(x,y)= ",
  sumdiff, "\n\n")

myd <- my.divide(x, y)
cat("Result of my.divide(x,y) = ", myd[[1]],
  myd[[2]], "\n\n")
# cat("Result of my.divide(x,y) = ", myd$result,
#    myd$msg, "\n\n")
myd <- my.divide(x, 0)
cat("Result of my.divide(x,0) = ", myd[[1]],
  myd[[2]], "\n\n")

mymp <- my.multiprod(a=3, b=5, c=7)
cat("Result of my.multiprod(a=3, b=5, c=7) = ",
  mymp, "\n\n")
 
vec <- c(4.4, 9.9, 2.2, 3.3, 0.0, 5.5, 8.8,
  1.1, 7.7, 6.6)
cat("Vector vec = \n")
cat(vec, "\n")
svec <- my.qsort(vec)
cat("Result of my.qsort(vec) : \n")
cat(svec, "\n\n")

vec <- c(4.4, 9.9, 2.2, 3.3, 0.0, 5.5, 8.8,
  1.1, 7.7, 6.6)
cat("Vector vec = \n")
cat(vec, "\n")
svec <- my.bsort(vec)
cat("Result of my.bsort(vec) : \n")
cat(svec, "\n")

cat("\nEnd R functions demo \n\n") 

After saving the demo script, give focus to the Console window. Enter the command setwd("C:\\ProgramDefinedFunctionsWithR") to point the working directory to the location of your script. Then enter the command source("functions.R") to execute the program.

Basic Function Syntax
The demo program begins by defining a simple R function that returns the sum of two numeric values in order to illustrate basic syntax:

my.sum = function(x, y) {
  result <- x + y
  return(result)
}

I named the function my.sum rather than just sum because R allows you to overwrite built-in functions. In other words, if I had named my function sum, it would've masked the built-in sum function that adds up the values in an array or vector. Because R has many hundreds of built-in functions, you should try to make your program-defined function names different from built-in function names. Prepending program-defined function names with "my." is my personal preference, but is not a standard convention.
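
Here's a minimal sketch of the name clash, assuming the code is run at the top level of a fresh R session. The variable names masked, builtin and restored are hypothetical and aren't part of the demo program. Strictly speaking, the built-in function is hidden rather than destroyed: it's still reachable as base::sum, and it becomes visible again once the program-defined version is removed.

```r
# Hypothetical illustration: a program-defined function named sum
# hides the built-in sum function
sum = function(x, y) {
  return(x + y)
}

masked <- sum(2, 3)                  # calls the program-defined version
builtin <- base::sum(c(1, 2, 3, 4))  # built-in is still reachable via base::

rm(sum)                              # remove the program-defined version
restored <- sum(c(1, 2, 3, 4))       # the built-in sum is visible again

cat("masked = ", masked, "\n")
cat("builtin = ", builtin, "\n")
cat("restored = ", restored, "\n")
```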

In R it's common to use the dot character in function and variable names to make them more readable (most languages, including C#, use the underscore character for better readability). In R the "=" and "<-" assignment operators are usually interchangeable; I prefer to use "=" in the function signature and "<-" in the function body.

In this example I use the "return" keyword. Interestingly, by default R functions return the value of the last expression evaluated in the function body. Therefore, the function could've been written as:

my.sumterse = function(x, y) {
  x + y
}

If I'm defining a function interactively on-the-fly I'll sometimes omit the return keyword, but I think code is more readable and less error-prone with the return keyword. The code that calls the function is:

x <- 5.1
y <- 3
sum <- my.sum(x, y)
cat("Result of my.sum(x,y) = ", sum, "\n\n")

The R calling mechanism is straightforward and closely resembles that of other C-family languages.
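
The two return mechanisms can be mixed in one function. The following sketch uses a hypothetical function my.abs (not part of the demo program) in which an explicit return exits early and the last expression supplies the value otherwise:

```r
# Hypothetical example: explicit return plus last-expression return
my.abs = function(x) {
  if (x < 0) {
    return(-x)  # explicit return exits the function immediately
  }
  x             # otherwise the value of the last expression is returned
}

neg <- my.abs(-2.5)
pos <- my.abs(4)
cat("my.abs(-2.5) = ", neg, "\n")
cat("my.abs(4) = ", pos, "\n")
```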

Every R function call technically produces a value, but you can write functions that are called only for their side effects. The demo program has a "void" function to print the values in a vector using a specified number of decimals:

my.printvec = function(vec, dec) {
  cat("[ ")
  for (i in 1:length(vec)) {
    x <- formatC(vec[i], digits = dec, format="f")
    cat(x, " ")
  }
  cat("]\n")
}

The demo code that calls function my.printvec is:

vec <- c(3.14, 2/3, 1.2345)
cat("Vector vec = ", vec, "\n")
cat("Result of my.printvec(vec, 3) : ", "\n")
my.printvec(vec, 3)

R supports named parameter calls, so the function could have been called as:

my.printvec(vec, dec=3)

Although using a named parameter call in this example doesn't improve readability much, many built-in R functions have a large number of parameters and using named parameters can greatly improve code readability.
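
As a small sketch of why named calls help, consider a hypothetical function my.power (not part of the demo program) where the order of the arguments matters. With named parameters, the arguments can be supplied in any order:

```r
# Hypothetical example: named arguments can be given in any order
my.power = function(base, exponent) {
  return(base ^ exponent)
}

p1 <- my.power(2, 10)                    # positional: base=2, exponent=10
p2 <- my.power(exponent = 10, base = 2)  # named: same call, order reversed
p3 <- my.power(10, 2)                    # positional: base=10, exponent=2

cat("p1 = ", p1, " p2 = ", p2, " p3 = ", p3, "\n")
```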

Function Overloading and Argument Pass by Value
R doesn't support C#-style function name overloading. For example, because the demo program defines a function my.sum(x, y), defining a second function my.sum(x, y, z) doesn't create an overload. The new definition silently replaces the old one, and a later two-argument call such as my.sum(x, y) then fails at run time as soon as the function body tries to use z. Although R doesn't support explicit function overloading, you can get similar behavior by using the default parameters and the variable number of parameters mechanisms described in this article.
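
One way to get overload-like behavior is sketched below. The function name my.sum3 is hypothetical and isn't part of the demo program. A single definition with a default value for z serves both the two-argument and the three-argument call shapes:

```r
# Hypothetical helper: one definition handles both call shapes
my.sum3 = function(x, y, z = 0) {
  result <- x + y + z
  return(result)
}

two <- my.sum3(5.1, 3)         # z takes its default value of 0
three <- my.sum3(5.1, 3, 2.0)  # z is supplied explicitly

cat("two-argument call = ", two, "\n")
cat("three-argument call = ", three, "\n")
```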

In R, parameters are passed by value, not by reference. For example, consider this (incorrect) function definition that attempts to add 1.0 to each value of an R array:

my.inc = function(arr) {
  for (i in 1:length(arr)) {
    arr[i] <- arr[i] + 1
  }
}

Then a call like this:

a <- array(0.0, 3)
a[1] <- 1.1; a[2] <- 5.5; a[3] <- 7.7
my.inc(a)

would leave array a unchanged. One way to simulate the desired behavior is to make the function return a value like so:

my.inc = function(arr) {
  for (i in 1:length(arr)) {
    arr[i] <- arr[i] + 1
  }
  return(arr)
} 

And then assign the return value by calling the function like so:

a <- my.inc(a)

Another consequence of pass by value is that R does not have C#-style out or ref parameters. You can simulate out and ref parameters by returning multiple values in an array, vector or list, as I'll demonstrate shortly.
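
The pass-by-value behavior is easy to verify. In this sketch the variable names b and a.before are hypothetical, and I use a plain vector rather than an array for brevity. The function changes only its own copy of the argument, so the caller's data is unchanged until the return value is assigned back:

```r
# my.inc modifies its local copy of arr and returns that copy
my.inc = function(arr) {
  for (i in seq_along(arr)) {
    arr[i] <- arr[i] + 1
  }
  return(arr)
}

a <- c(1.1, 5.5, 7.7)
b <- my.inc(a)   # b holds the incremented values
a.before <- a    # a itself has not been changed by the call

a <- my.inc(a)   # assign the return value to update a

cat("a before reassignment = ", a.before, "\n")
cat("a after reassignment = ", a, "\n")
```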

Default Parameter Values and Missing Parameters
The demo program illustrates R default function parameter values by defining this function:

my.prod = function(x, y, z=10) {
  result <- x * y * z
  return(result)
}

Function my.prod returns the product of three numeric values. If only the first two arguments are passed to the function, parameter z takes its default value of 10. For example, the call:

val <- my.prod(3.0, 4.0, 2.0)

returns 3.0 * 4.0 * 2.0 = 24.0. But the call:

val <- my.prod(3.0, 4.0)

returns 3.0 * 4.0 * 10 = 120.0.

Many of the built-in R functions have a large number of parameters, and the parameters often have default values. This design approach allows you to make simplified calls to the functions. For example, the built-in formatC function has more than a dozen parameters. Only the first parameter, the value to format, is required; the remaining parameters have default values. This allows you to write code like:

x <- formatC(3.14, width=6)
cat("x = ", x, "\n")

As a very general rule of thumb, if you're writing an R function for one-time use in a program, there's little advantage to generalizing the heck out of the function by adding lots of unnecessary parameters with default values. Default parameter values are most useful when you're writing library functions and you're not sure how the functions might be called.

The R language allows you to deal with missing parameter values. For example, the demo program defines a function my.prod2 like so:

my.prod2 = function(x, y, z) {
  if (missing(z))
    return(x * y * 10)
  else
    return(x * y * z)
}

Here, the built-in missing function returns TRUE if there's no argument corresponding to parameter z, FALSE otherwise. Using the built-in missing function gives you more flexibility than using a default parameter value, at the expense of a slight increase in complexity. In the programming scenarios I work with, I don't use the R missing parameter mechanism very often.
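
The extra flexibility is that missing can distinguish an omitted argument from one that was explicitly passed with the same value as the fallback, which a plain default value can't do. Here's a minimal sketch using a hypothetical function my.scale that isn't part of the demo program:

```r
# Hypothetical example: missing() tells omitted from explicitly supplied
my.scale = function(x, factor) {
  if (missing(factor)) {
    return(list(value = x * 10, used.fallback = TRUE))
  }
  return(list(value = x * factor, used.fallback = FALSE))
}

r1 <- my.scale(3)      # factor omitted, fallback of 10 is used
r2 <- my.scale(3, 10)  # same numeric result, but factor was supplied

cat("r1: ", r1$value, r1$used.fallback, "\n")
cat("r2: ", r2$value, r2$used.fallback, "\n")
```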

Returning Multiple Values
Because R parameters are passed by value, in situations where you want a function to return multiple values, you can't write an R function that has C# style out-parameters. When you want to return multiple values, you can return the values in an array, a vector, or a list. For example, the demo program defines a function my.sumdiff as:

my.sumdiff = function(x, y) {
  res1 <- x + y
  res2 <- x - y
  result <- array(0.0, 2)
  result[1] <- res1; result[2] <- res2
  return(result)
}

The function computes the sum and the difference of two numeric values and returns those values in an array. The function could be called like this:

sd <- my.sumdiff(3.0, 5.0) 
cat(sd[1], "\n")  # 8.0
cat(sd[2], "\n")  # -2.0

Note that using sd as a variable name isn't a very good idea because it clashes with the built-in sd (standard deviation) R function. Instead of placing the return results into an array, you can use the built-in c function to create a vector, for example:

my.sumdiff = function(x, y) {
  res1 <- x + y
  res2 <- x - y
  result <- c(res1, res2)
  return(result)
}

To return multiple results, you can also use an R list. The demo program defines a function my.divide like this:

my.divide = function(x, y) {
  if (y == 0) {
    res = list("result" = NULL, "msg" = "error")
  }
  else {
    res = list("result" = x/y, "msg" = "success")
  }
  return(res)
}

An R list is really more like a C# Dictionary collection, or an associative array, than a List data structure in most languages. An R list has key-value pairs. In this example the keys are "result" and "msg." The demo program calls function my.divide like so:

myd <- my.divide(5.1, 3)
cat("Result of my.divide(x,y) = ", myd[[1]], myd[[2]], "\n\n")

In R, the values in a list can be accessed by index or by key. The previous code accesses by index. Notice the paired square bracket syntax. The values in the return list could've been accessed using the keys, like this:

myd <- my.divide(5.1, 3)
cat("Result of my.divide(x,y) = ", myd$result, myd$msg, "\n\n")

Notice the $ accessor syntax. When retrieving multiple return values that are in a list, the decision to access by index or by key is mostly a matter of personal preference, but accessing by key is more principled.
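
One practical advantage of access by key is that the calling code can check the outcome without remembering the order of the items. This sketch repeats the my.divide definition so it can run on its own; the variable names ok and bad are hypothetical:

```r
my.divide = function(x, y) {
  if (y == 0) {
    res = list("result" = NULL, "msg" = "error")
  }
  else {
    res = list("result" = x/y, "msg" = "success")
  }
  return(res)
}

ok <- my.divide(5.1, 3)
bad <- my.divide(5.1, 0)

# The keys are available through the built-in names function
cat("keys = ", names(ok), "\n")

# Test for failure by key before using the result
if (is.null(bad$result)) {
  cat("division failed, msg = ", bad$msg, "\n")
}
cat("division OK, result = ", ok$result, "\n")
```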

Variable Number of Parameters
The R language supports functions with a variable number of parameters using a special "..." (three consecutive dot characters) token. The mechanism is best explained by example. The demo program defines a function my.multiprod like this:

my.multiprod = function(...) {
  vals <- list(...)
  result <- 1
  for (key in names(vals)) {
    result <- result * vals[[key]]
  }
  return(result)
}

The function multiplies together a set of numeric values. The function begins by transferring the parameters into a list named vals. Then the list is walked through using the built-in names function and each value is multiplied into the running product. Instead of accessing the list by keys, the list could've been accessed by index, like this:

for (i in seq_along(vals)) {
  result <- result * vals[[i]]
}

The demo calls the function like this:

mymp <- my.multiprod(a=3, b=5, c=7)

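Notice that because function my.multiprod walks the list using names(vals), the call to the function must use named parameters. With unnamed arguments, names(vals) is NULL, the loop body never executes, and the function returns 1.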
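
If you want a version that accepts both named and unnamed arguments, one possible sketch is to walk the list by index. The function name my.multiprod2 is hypothetical and isn't part of the demo program:

```r
# Hypothetical variant: index-based loop works with or without names
my.multiprod2 = function(...) {
  vals <- list(...)
  result <- 1
  for (i in seq_along(vals)) {
    result <- result * vals[[i]]
  }
  return(result)
}

p1 <- my.multiprod2(3, 5, 7)              # unnamed arguments
p2 <- my.multiprod2(a = 3, b = 5, c = 7)  # named arguments
p3 <- my.multiprod2()                     # no arguments, empty product

cat("p1 = ", p1, " p2 = ", p2, " p3 = ", p3, "\n")
```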

Recursion and Nested Definitions
Two mechanisms that are supported by R, but that I rarely use, are recursion and nested definitions. The demo program defines function my.qsort, which implements the quicksort algorithm to sort an array. If you refer to the code in Listing 1, you'll see that function my.qsort calls itself. Let me emphasize that recursion is rarely needed and rarely a good idea. I can count on one hand the number of times I've had to use recursion in production code.
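
For convenience, here is the recursive function from Listing 1 together with a quick sanity check against the built-in sort function. The comparison with sort is my addition and isn't part of the demo program:

```r
# Recursive quicksort from Listing 1
my.qsort = function(arr) {
  n <- length(arr)
  if (n > 1) {
    pv <- arr[n %/% 2]                  # pivot value
    left <- my.qsort(arr[arr < pv])     # recursive call on smaller values
    mid <- arr[arr == pv]               # values equal to the pivot
    right <- my.qsort(arr[arr > pv])    # recursive call on larger values
    return(c(left, mid, right))
  }
  else return(arr)                      # base case: 0 or 1 items
}

vec <- c(4.4, 9.9, 2.2, 3.3, 0.0, 5.5, 8.8, 1.1, 7.7, 6.6)
svec <- my.qsort(vec)
cat(svec, "\n")
cat("Matches built-in sort: ", identical(svec, sort(vec)), "\n")
```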

If you're a C# developer like me, every now and then you find yourself writing a function (well, OK, a method because it's C#) and end up writing a small helper method that's called by your primary method. And you may have wondered if it's possible to define the short helper method inside the primary method to make your code more modular. Versions of C# before 7.0 don't let you define a method inside a method (C# 7.0 added local functions), but you can do so in R.

If you refer to the code in Listing 1, you'll see the definition of function my.bsort, which implements the bubble sort algorithm to order the values in an array. Defined inside function my.bsort is a nested function my.swap that exchanges the values of an array at two specified indices:

my.bsort = function(arr) {
  # -----
  my.swap = function(ii, jj) {
    tmp <- arr[ii]  # tmp is local to my.swap
    arr[ii] <<- arr[jj]
    arr[jj] <<- tmp
  }
  # -----
  n <- length(arr)
  # bubble sort code here
  return(arr)
}

Notice that nested function my.swap can access parameter arr of the outer my.bsort function. However, to modify arr from inside my.swap you must use the special "<<-" assignment operator. The normal "=" or "<-" assignment operators would modify only a copy of arr that's local to my.swap.
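
The difference between the two kinds of assignment inside a nested function can be seen in this minimal sketch. The names my.outer, bump.local and bump.outer are hypothetical and aren't part of the demo program:

```r
# Hypothetical example: "<-" versus "<<-" inside a nested function
my.outer = function() {
  count <- 0

  bump.local = function() {
    count <- count + 1   # creates a new variable local to bump.local
  }
  bump.outer = function() {
    count <<- count + 1  # modifies count in the enclosing my.outer
  }

  bump.local()
  after.local <- count   # still 0
  bump.outer()
  after.outer <- count   # now 1

  return(c(after.local, after.outer))
}

res <- my.outer()
cat("count after bump.local = ", res[1], "\n")
cat("count after bump.outer = ", res[2], "\n")
```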

The ability to define nested functions in R can sometimes lead to slightly cleaner code at the expense of a slight increase in complexity. Some of my colleagues don't like using nested function definitions, but others, like me, think the increase in modularity is sometimes worth the effort.

Wrapping Up
The information presented in this article will allow you to understand and deal with the majority of situations where you write R program-defined functions. Although R gives you a lot of flexibility, including the ability to use default parameter values, detect missing parameter values, specify a variable number of parameters, and define nested functions, in most coding situations all you need is basic function syntax, plus occasionally returning multiple values in an array or list.
