The Data Science Lab
R Language Basic Data Structures
Vectors, lists, arrays, matrices and data frames -- a look at five of the most fundamental data structures built into R.
Among my colleagues, R is one of the fastest-growing programming languages. Most of my colleagues who are learning R already have some programming experience in other languages such as C#, Java, Visual Basic and Python. I've observed that experienced programmers who are learning R have more trouble than you might expect when dealing with R basic data structures.
In my opinion, the five most fundamental built-in R data structures are vectors, lists, arrays, matrices and data frames. In this article I'll show you how to work with these data structures.
Two disclaimers: First, there are several other built-in R data structures, such as a table type, that aren't quite as commonly used as the ones I describe in this article. Second, R basic data structures have more nuances and quirks than their counterparts in other programming languages, so I'll concentrate on the main points and leave out details to keep the main ideas as clear as possible.
To give you an idea of where this article is headed, here are five hypothetical commands from an interactive R session:
> v <- vector(mode="integer", 3)
> lst <- list(fn="Joe", ln="Doe", age=29)
> m <- matrix(1.0, nrow=2, ncol=3)
> arr <- array(0.0, c(2,4,5))
> df <- data.frame(v, lst)
The first statement creates a vector with three cells, all initialized to 0. The second statement creates a list that has three key-value items. The third statement creates a 2x3 matrix with all six cells initialized to 1.0. The fourth statement creates a 2x4x5 array with all 40 cells initialized to 0.0. The fifth statement creates a data frame by combining the vector and the list: the vector supplies one column and each of the three list items becomes a column of its own, with the single list values repeated to fill all three rows.
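To check what objects like these look like, the built-in length, dim and class functions are handy. The following is a minimal sketch that inspects the first four objects; it isn't part of the demo script.

```r
v <- vector(mode="integer", 3)
lst <- list(fn="Joe", ln="Doe", age=29)
m <- matrix(1.0, nrow=2, ncol=3)
arr <- array(0.0, c(2,4,5))

cat(class(v), length(v), "\n")      # integer 3
cat(class(lst), length(lst), "\n")  # list 3
cat(dim(m), "\n")                   # 2 3
cat(dim(arr), length(arr), "\n")    # 2 4 5 40
```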
Take a look at the demo R session in Figure 1. The outer container window is the Rgui.exe shell. Inside the shell, the window on the left is the R Console where you can issue interactive commands. I use the setwd function to set the working directory to the location of my demo R file, and then I use the special R incantation rm(list=ls()) to remove all existing objects in the current workspace. Then I use the source function to execute the vectorsLists.R demo script/program.
Vectors
The most basic R data structure is the vector. One source of confusion for people who have experience with other programming languages is that an R vector object corresponds most closely to what other languages call an array, and R has an array type that doesn't have a close counterpart in other languages. An R vector has a fixed number of cells and each cell holds a value with the same data type: integer, numeric (double), character, logical, complex, raw.
If you want to copy-paste the demo program, and you don't have R installed on your machine, installation (and just as important, uninstallation) is quick and easy. Do an Internet search for "install R" and you'll find a URL to a page that has a link to "Download R for Windows." If you click on the link, you'll launch a self-extracting executable installer program. You can accept all the installation option defaults, and R will install in about 30 seconds.
To create the demo program, I navigated to directory C:\Program Files\R\R-3.x.y\bin\x64 and double-clicked on the Rgui.exe file. After the shell launched, from the menu bar I selected the File | New script option, which launched an untitled R Editor. I added two comments to the script, indicating the file name and R version I'm using:
# vectorsLists.R
# R 3.3.0
And then I did a File | Save as, to save the script. I put my demo script at C:\VectorsLists, but you can use any convenient directory.
When working with R vectors in an interactive mode in the R Console window, you can display a vector using the built-in print function. When working with R programs, I prefer more control over how a vector is displayed. The demo program defines a custom function to print a vector:
my_print = function(v, dec=2) {
  n <- length(v) # built-in length()
  for (i in 1:n) { # for-loop
    x <- v[i] # 1-based indexing
    xf <- formatC(x, # built-in formatC()
      digits=dec, format="f") # you can break long lines
    cat(xf, " ") # basic display function
  }
  cat("\n")
}
If you're new to R, the function syntax might look a bit odd, but for the most part it should be understandable. One quirk of R is that there are two assignment operators, = and <-. At the top level of a script the two are interchangeable, but they're not always equivalent: inside a function call, = names an argument while <- performs an assignment, and mixing them up doesn't always produce an error message.
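One situation where the two operators behave differently is inside a function call. The following is a small sketch, using a hypothetical function f that isn't part of the demo script:

```r
f <- function(x) { x * 2 }   # hypothetical helper

r1 <- f(x = 5)    # "=" names the argument x
r2 <- f(y <- 5)   # "<-" assigns 5 to y in the calling environment, then passes it
cat(r1, r2, y, "\n")   # 10 10 5
```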
Unlike most other programming languages, R collections use 1-based indexing rather than 0-based indexing. Function my_print determines the length of its input parameter, vector v, using the built-in length function, and then iterates through the vector, displaying one cell at a time using the cat function.
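Because formatC accepts a whole vector, the explicit loop isn't strictly necessary. The following is a minimal sketch of a loop-free variant; the name my_print_vec is hypothetical and isn't part of the demo script, and the spacing of its output differs slightly from my_print.

```r
# my_print_vec is a hypothetical name, not part of the demo script
my_print_vec <- function(v, dec=2) {
  xf <- formatC(v, digits=dec, format="f")  # formats every cell at once
  cat(xf, "\n")                             # cat separates items with a space
}

my_print_vec(c(1, 2.5, 3.14159))         # 1.00 2.50 3.14
my_print_vec(c(1, 2.5, 3.14159), dec=3)  # 1.000 2.500 3.142
```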
After the definition of a custom vector-print function, there's a custom list-print function definition that I'll present shortly. After the two custom function definitions, program execution begins with:
cat("\nBegin vectors and lists demo \n\n")
cat("Creating five demo vectors \n\n")
v <- c(1:3)
cat(v, "\n")
The tersely named built-in c function combines its arguments into a vector; here the single argument is the range 1:3. The data type of the vector is determined implicitly, so in this case the result is an integer vector with three cells holding (1, 2, 3). A vector can be displayed using the built-in cat function as well as the print function; cat writes just the cell values, which is convenient for compact one-line output.
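The implicit typing is easy to see by asking R for the type of a few vectors. A short sketch (the variable names are arbitrary):

```r
a <- c(1:3)        # integer
b <- c(1, 2, 3)    # double: plain numeric literals are doubles
d <- c(1L, 2.5)    # integer and double mixed -> double
e <- c(1, "a")     # number and string mixed -> character
cat(typeof(a), typeof(b), typeof(d), typeof(e), "\n")
# integer double double character
```

When the arguments have different types, c converts them all to a single type that can represent every value.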
The demo program shows two more ways to create a vector:
v <- vector(mode="numeric", 4)
my_print(v, dec=3)
v <- c("a", "b", "x")
cat(v, "\n")
The built-in vector function accepts a mode and a length. As called here, the function returns a vector object that has four cells of type double, each initialized to 0.0, much like C# but unlike C (which doesn't perform automatic initialization). Calling program-defined function my_print(v, dec=3) displays 0.000 0.000 0.000 0.000 to the Console window.
The c function can accept a comma-delimited sequence of constants. The demo creates a vector with three character values. Notice that object v is being reused for vectors of different types, which is OK in R, but not in statically typed languages such as C# and Java.
The demo program shows two other common ways to create vectors, using built-in functions seq and sample:
v <- seq(from=1, to=3, by=0.5)
my_print(v)
set.seed(0)
v <- sample(1:4)
cat(v, "\n\n")
The seq function returns a vector where the number of cells is inferred from argument values. The example here can be interpreted as, "Make a vector of (double) values starting at 1.0 and ending at 3.0, incrementing by 0.5." So the result is (1.0, 1.5, 2.0, 2.5, 3.0). Notice the call to program-defined function my_print(v) omits the dec parameter, so the default number of decimals, 2, is used.
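Instead of a step size, seq can also be given the number of cells you want. A brief sketch of both forms, plus the related seq_len function:

```r
v1 <- seq(from=1, to=3, by=0.5)         # 1.0 1.5 2.0 2.5 3.0
v2 <- seq(from=0, to=1, length.out=5)   # 0.00 0.25 0.50 0.75 1.00
v3 <- seq_len(4)                        # 1 2 3 4 (integer)
cat(length(v1), length(v2), length(v3), "\n")   # 5 5 4
```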
The built-in sample function in this example returns a vector with the integer values 1 through 4 in random order. Unlike most languages that allow you to instantiate a random number-generating object, R uses one system-wide generator. In order to get reproducible results, the generator is initialized using the set.seed function. Note that in R the "." character is valid in any identifier name, and built-in R functions often use "." rather than "_" to make variable and function names more readable. If you're an experienced programmer, it's surprisingly difficult not to mentally interpret an R identifier with a "." character as representing object.method rather than just a function name.
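The effect of set.seed can be checked by seeding twice and comparing the results. This is a minimal sketch; the exact order that sample returns can depend on the R version, so the code compares the two results rather than assuming specific values.

```r
set.seed(0)
a <- sample(1:4)
set.seed(0)          # reset the generator to the same state
b <- sample(1:4)
cat(identical(a, b), "\n")   # TRUE
cat(sort(a), "\n")           # 1 2 3 4
```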
Lists
An R list object can have items of different data types (including program-defined objects), and the size of a list can change during runtime. The demo program creates and displays two lists. The first is:
cat("Creating two demo lists \n\n")
ls <- list("a", 2.2)
ls[3] <- as.integer(3)
my_print_list(ls)
# print(ls)
cat("Cell [2] is: ", ls[[2]], "\n\n")
The list function accepts a comma-delimited sequence of values. In this example the first item is the character "a" and the second item is the numeric value 2.2. The next statement appends integer value 3 to the list so it now has three items. Notice that you can increase the size of a list on the fly merely by assigning a value to an index greater than the current length of the list.
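If you assign to an index that skips past the end of the list, R fills the gap with NULL items. A short sketch, using the name lst rather than the demo's ls:

```r
lst <- list("a", 2.2)
lst[3] <- as.integer(3)      # append at index length + 1
lst[[5]] <- "e"              # skip index 4
cat(length(lst), "\n")             # 5
cat(is.null(lst[[4]]), "\n")       # TRUE: the skipped cell holds NULL
```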
A quirk of R language lists is that to access a list item by index, you use double square brackets rather than single square brackets. The short explanation is that single square brackets return a sub-list that contains the selected item, while double square brackets return the item itself. Using single square brackets where double square brackets are needed is a very common mistake with R lists, and because it's valid syntax, the problem often shows up only later when the result is used.
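The difference between the two bracket styles shows up when you look at what each one returns. A minimal sketch:

```r
lst <- list("a", 2.2, 3L)
sub <- lst[2]        # single brackets: a list of length 1
item <- lst[[2]]     # double brackets: the numeric value itself
cat(class(sub), class(item), "\n")   # list numeric
cat(item * 2, "\n")                  # 4.4
# sub * 2 would fail with "non-numeric argument to binary operator"
```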
You can print an R list object using the built-in print function. The demo program defines a custom function:
my_print_list = function(lst) {
  n <- length(lst)
  for (i in 1:n) {
    cat(lst[[i]])
    if (i < n) { cat(" -> ") }
  }
  cat("\n")
}
Function my_print_list uses the built-in length function and double square bracket indexing syntax. Note that you shouldn't name this function my_print because that name is already being used for the function that displays a vector. R doesn't support C#-style function overloading, so a second definition would silently replace the first.
The demo program illustrates that you can create a list object that holds name-value pairs and can be accessed using name indexing:
ls <- list(lname="Smith", age=22)
my_print_list(ls)
cat("Cells accessed by cell names are: ", ls$lname, "and", ls$age)
cat("\n\n")
When calling the list function, instead of supplying anonymous values, you can supply name-value pairs as shown. Then the list can be accessed either by index (using double square bracket syntax) or by item name using the syntax listName$cellName. R language lists are extremely versatile.
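A few more operations on a named list, sketched under the assumption of a hypothetical extra item named city that isn't in the demo script:

```r
lst <- list(lname="Smith", age=22)
cat(names(lst), "\n")            # lname age
cat(lst[["age"]], "\n")          # 22
lst$city <- "Redmond"            # hypothetical new item, added by name
cat(length(lst), "\n")           # 3
cat(is.null(lst$zip), "\n")      # TRUE: a missing name gives NULL, not an error
```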
Matrices
Unlike many programming languages, R has a built-in matrix type. The terminology used by R for arrays and matrices is a bit different from the usage of other languages, so I'll take a few liberties with vocabulary in this section and the next in the interest of clarity. In R a matrix is a two-dimensional object that has a fixed number of rows and columns, and each cell holds the same data type.
The demo program creates and displays a 2x3 matrix with all six cells initialized to 0.0 using the statements:
cat("Creating a 2x3 matrix \n\n")
m <- matrix(0.0, nrow=2, ncol=3)
print(m)
You can initialize a matrix to a set of fixed values like so:
m <- matrix(c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0), nrow=2, ncol=3)
However, by default, the matrix function populates by columns rather than by rows, so the resulting matrix is:
1.0 3.0 5.0
2.0 4.0 6.0
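If you'd rather fill the matrix row by row, the matrix function has a byrow parameter. A minimal sketch:

```r
m <- matrix(c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0), nrow=2, ncol=3, byrow=TRUE)
print(m)
cat(m[1,], "\n")   # 1 2 3
cat(m[2,], "\n")   # 4 5 6
```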
Matrices can be traversed and accessed using nested loops and square bracket syntax in much the same way as most other programming languages:
my_print_matrix = function(m) {
  nr <- nrow(m) # get number of rows
  nc <- ncol(m) # number of columns
  for (i in 1:nr) {
    for (j in 1:nc) {
      cat(m[i,j], " ")
    }
    cat("\n")
  }
}
Unlike matrices in most other languages, in R you can supply optional row and column names using the dimnames parameter.
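The dimnames parameter takes a list holding the row names and the column names. The following is a short sketch; the names r1, r2, a, b and c are made up for illustration.

```r
m <- matrix(0.0, nrow=2, ncol=3,
  dimnames=list(c("r1", "r2"), c("a", "b", "c")))  # hypothetical names
m["r2", "b"] <- 5.0      # access by name
print(m)
cat(m[2,2], "\n")        # 5: numeric indexing still works
cat(colnames(m), "\n")   # a b c
```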
Arrays
In R, an array can have one, two, or three or more dimensions. The demo program creates a one-dimensional array with three cells like so:
cat("Creating 1 and 2-dim arrays \n\n")
arr <- array(0.0, 3)
print(arr)
Here, the 3 argument means three values, not three dimensions. R arrays with one dimension are similar to, but not quite the same as, vectors. My general rule of thumb, in situations where I can use either a vector or a one-dimensional array, is to use a vector except when I'm also using matrices, in which case I'll use an array.
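One visible difference is that a one-dimensional array carries a dim attribute and a plain vector doesn't. A minimal sketch:

```r
v <- vector(mode="numeric", 3)
arr <- array(0.0, 3)
cat(is.null(dim(v)), "\n")              # TRUE: a plain vector has no dim attribute
cat(dim(arr), "\n")                     # 3
cat(is.array(v), is.array(arr), "\n")   # FALSE TRUE
cat(all(v == arr), "\n")                # TRUE: the cell values are the same
```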
The demo creates and displays a two-dimensional array with six cells initialized to 0.0 like this:
arr <- array(0.0, c(2,3)) # 2x3
print(arr)
The second argument is a vector that holds the number of cells in each dimension. For example, you can create a three-dimensional array like this:
arr = array(0.0, c(2,5,4)) # 2x5x4 n-array
print(arr) # 40 values displayed
To summarize, a one-dimensional array is similar to a vector, a two-dimensional array is similar to a matrix, and you can also create arrays with three or more dimensions.
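Cells in a multi-dimensional array are accessed with one index per dimension, and leaving an index blank selects everything along that dimension. A brief sketch:

```r
arr <- array(0.0, c(2,5,4))
arr[1,2,3] <- 7.0
slice <- arr[, , 3]          # fix the third index: a 2x5 matrix
cat(dim(slice), "\n")        # 2 5
cat(slice[1,2], "\n")        # 7
cat(is.matrix(array(0.0, c(2,3))), "\n")   # TRUE: a 2-dim array is a matrix
```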
Data Frames
An R data frame is roughly similar to a table object in many other languages. Data frame objects are often used when working with built-in R statistics functions such as performing a linear regression analysis using the lm function.
The demo program creates and displays a data frame object using these statements:
cat("Creating a data frame \n\n")
people <- c("Adam", "Bill", "Cris")
ages <- c(18, 28, 38)
df <- data.frame(people, ages)
names(df) <- c("NAME", "AGE")
print(df)
The data frame object has two columns, where the first column has the names of three people and the second column has numeric (double) values representing their ages. To store true integer values you'd write them as 18L, 28L and 38L.
The data.frame function has many optional parameters, and data frame objects are a bit more complex than vectors, lists, arrays and matrices. Two common programming scenarios are transferring data from a vector into a data frame object column, and copying values from a matrix into a data frame.
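Both scenarios can be sketched in a few lines. The heights vector and the HEIGHT column name are hypothetical, and stringsAsFactors=FALSE is set explicitly because its default differs between R versions.

```r
people <- c("Adam", "Bill", "Cris")
ages <- c(18, 28, 38)
df <- data.frame(NAME=people, AGE=ages, stringsAsFactors=FALSE)

heights <- c(70.5, 68.0, 72.25)   # hypothetical values
df$HEIGHT <- heights              # vector -> new column
print(df)

m <- matrix(1:6, nrow=2, ncol=3)
mdf <- as.data.frame(m)           # matrix -> data frame
cat(names(mdf), "\n")             # V1 V2 V3
cat(dim(df), dim(mdf), "\n")      # 3 3 2 3
```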
You can traverse a data frame object in much the same way you traverse a matrix, using nrow, ncol and [i,j] indexing. The column headers are stored separately in the names property and aren't counted as a row, so the row indices run from 1 to nrow(df).
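A minimal sketch of a cell-by-cell traversal of the demo data frame follows. It sets stringsAsFactors=FALSE explicitly so that the names are stored as plain character values regardless of the R version; nrow reports only the three data rows.

```r
people <- c("Adam", "Bill", "Cris")
ages <- c(18, 28, 38)
df <- data.frame(people, ages, stringsAsFactors=FALSE)
names(df) <- c("NAME", "AGE")

cat(nrow(df), ncol(df), "\n")   # 3 2
for (i in 1:nrow(df)) {
  for (j in 1:ncol(df)) {
    cat(df[i,j], " ")           # one cell at a time
  }
  cat("\n")
}
```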
Wrapping Up
To summarize, a vector object has a fixed number of cells where each cell has the same data type and cells are accessed using 1-based square bracket indexing. R vectors correspond to arrays in most other programming languages.
A list object can hold items with different types and can change size/length at runtime, and list items can be accessed using 1-based double square bracket indexing or by name if the list items consist of name-value pairs.
A matrix object is a two-dimensional object with fixed size and where all cells hold the same type, and cell values are accessed using [i,j] syntax.
An array object can have one, two, or more dimensions where the number of cells in each dimension is fixed and all are the same type. Arrays can supply much of the same functionality as vectors and matrices, but in general using R arrays makes most sense when working with three or more dimensions.
A data frame object corresponds to a table in many other programming languages. Data frame objects are column-based, but can be traversed and accessed much like matrices.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].