The Data Science Lab
This R/S4 Demo Might Take You Out of Your Comfort Zone
Let's explore R again, this time using the language's ability to tap into OOP with the S4 model rather than the RC model.
The R language was created primarily to perform statistical analyses in an interactive environment using hundreds of built-in functions such as factanal for factor analysis and lm for linear model analysis. R has always had a basic scripting language with loop control structures, if-then decision control and so on, but somewhat unusually, R has several completely different object-oriented programming (OOP) models.
In this article I'll show you how to write OOP code using the S4 model. Although the latest built-in OOP model, RC ("reference classes"), is superior in many ways to S4 and its immediate predecessor model S3, both S3 and S4 are still widely used. Quite a few common R language functions were created using the S4 model. Even if you never write S4 code, understanding S4 can help you use the R language more effectively and give you some interesting insights into OOP in other languages such as C# and Python.
Although the official documentation for S4 is quite good, it's more of a technical reference than a guide for developers and users. I'll explain programming with S4 (technically scripting because R is interpreted rather than compiled) from the point of view of a .NET developer who is relatively new to R.
To get an idea of where this article is headed, take a look at the screenshot of a demo R session in Figure 1. The outer container window is the Rgui.exe shell. Inside the shell, the window on the left is the R Console where you can issue interactive commands. Here, I use the setwd function to set the working directory to the location of my demo R file. Then I call the rm function to delete all existing objects in my workspace. And then I use the source function to execute the OopDemo.R script.
The right window inside the shell is the source R code for the OopDemo.R script. That script uses an S4 class that defines a Person object. In most realistic scenarios an S4 class would contain numeric arrays and matrices, but a Person class is easy to understand and is a common "Hello World" example for OOP.
Installing R and S4
If you don't have R on your system, installing R (and uninstalling) is very easy. Do an Internet search for "Install R" and you'll find a page on the cran.r-project.org Web site with a link labeled something like Download R 3.3.2 for Windows. Click that link and you'll get an option to run a self-extracting installer file named something like R-3.3.2-win.exe. Click on the Run button presented to you by your browser to launch the installer. You can accept all the configuration defaults and installation is very quick.
The demo code has no significant R version dependencies so you can use R version 3.0 or later. The libraries needed to write S4 (as well as the older S3 and newer RC OOP models) are included with a default R installation.
To launch the Rgui program, open a file explorer and navigate to the C:\Program Files\R\R-3.3.2\bin\x64 directory. Then double-click on file Rgui.exe and the Rgui shell with an R Console window will launch. For this demo session I don't need to install any non-default packages, so I don't need to run Rgui with administrator privileges. After Rgui launches, you can clear away the wordy start-up messages by pressing Ctrl+L.
Understanding S4 Classes
An S4 class definition is quite a bit different from a class definition in other programming languages such as C# and Python. I think a good way for .NET developers to get a grasp of S4 is to take a C# class definition and then see what a roughly equivalent definition looks like in R.
Consider this C# Person class definition skeleton:
public class Person
{
  public int empID;
  public string lastName;
  public DateTime hireDate;
  public double payRate;
  public Person() { . . } // default ctor
  public void Display() { . . }
  public int YearsService() { . . }
}
The C# class encapsulates data fields empID, lastName, hireDate, and payRate with a constructor method, a Display method, and a YearsService method. A roughly equivalent R language S4 class definition skeleton is:
Person = setClass("Person", . . )
setMethod(f="initialize", . . )
setGeneric(name="display", . . )
setMethod(f="display", . . )
setGeneric(name="yearsService", . . )
setMethod(f="yearsService", . . )
An S4 class encapsulates data fields inside a setClass function, but doesn't encapsulate class methods. Instead, S4 class methods are defined by pairs of special R functions named setMethod and setGeneric. Notice that both the Person display and the yearsService methods are defined by two functions each.
As if this isn't wacky enough (if you're new to R language OOP, that is), there’s a special initialize function that’s defined only by setMethod but not with setGeneric. The special initialize function corresponds to a C# constructor, as you'll see shortly.
The calling code for using an S4 object is also quite different from the calling code for a C# class. In C# you could write code like:
Person p1 = new Person(); // Default values for fields
p1.empID = 65565;
p1.lastName = "Adams"; // Change name
p1.hireDate = DateTime.Parse("2010/09/15");
p1.payRate = 43.21;
p1.Display();
int tenure = p1.YearsService();
Roughly equivalent calling code for the S4 Person class looks like:
p1 <- new("Person") # default values for fields
p1@empID <- as.integer(65565)
p1@lastName <- "Adams" # change name
p1@hireDate <- "2010-09-15"
p1@payRate <- 43.21
display(p1)
tenure <- yearsService(p1)
Notice that S4 objects are instantiated using the built-in new function. All S4 class fields have public scope and are accessed using the @ operator (unlike S3 and RC objects, which use the $ operator). And in R, class methods are called using a pattern of methodName(objectName), instead of the C# pattern of objectName.methodName(); for example, display(p1) rather than p1.Display().
The S4 Person Class Definition Code
The complete R code for the demo program is presented in Listing 1.
Listing 1: S4 OOP Demo Program
# OopDemo.R
# R 3.3.2
# S4 OOP
Person = setClass(
  "Person",
  slots = c(
    empID = "integer",
    lastName = "character",
    hireDate = "character",
    payRate = "numeric"
  )
)
setMethod(f="initialize", signature="Person",
  definition=function(.Object) {
    .Object@empID <- as.integer(-1)
    .Object@lastName <- "NONAME"
    .Object@hireDate <- "1990-01-01"
    .Object@payRate <- 0.01
    return(.Object)
  }
)
setGeneric(name="display", def=function(obj) {
    standardGeneric("display")
  }
)
setMethod(f="display", signature="Person",
  definition=function(obj) {
    cat("Employee ID :", obj@empID, "\n")
    cat("Last Name :", obj@lastName, "\n")
    cat("Hire Date :", obj@hireDate, "\n")
    cat("Pay Rate : $", obj@payRate, "\n\n")
  }
)
setGeneric(name="yearsService", def=function(obj) {
    standardGeneric("yearsService")
  }
)
setMethod(f="yearsService", signature="Person",
  definition=function(obj) {
    hd <- as.POSIXlt(obj@hireDate)
    today <- as.POSIXlt(Sys.Date())
    yrs <- today$year - hd$year
    if (today$mon < hd$mon || (today$mon == hd$mon &&
      today$mday < hd$mday)) {
      yrs <- yrs - 1
    }
    return(yrs)
  }
)
# ==
cat("\nBegin OOP with S4 demo \n\n")
cat("Creating Person p1 with initialize values \n\n")
p1 <- new("Person") # could use p1 <- Person()
display(p1) # could use print(p1)
cat("Setting p1 fields directly \n\n")
p1@empID <- as.integer(65565)
p1@lastName <- "Adams"
p1@hireDate <- "2010-09-15"
p1@payRate <- 43.21
display(p1)
cat("Calling yearsService \n\n")
tenure <- yearsService(p1)
cat("Person p1 tenure = ", tenure, " years \n")
cat("Making a value-copy of p1 using '<-' \n\n")
p2 <- p1
cat("\nEnd OOP with S4 demo \n\n")
The first part of the S4 Person class definition is:
Person = setClass(
  "Person",
  slots = c(
    empID = "integer",
    lastName = "character",
    hireDate = "character",
    payRate = "numeric"
  )
)
The built-in setClass function holds the class name and defines the fields and their types. The "slots" keyword replaces the older "representation" keyword, which has been deprecated (but still works). The tersely named c function ("combine") creates a vector with named indices that are the field names.
One of the significant improvements in the S4 OOP model compared to the S3 model is that you can specify types for the fields. In addition to the atomic types "integer," "character," and "numeric" used in the demo, you can specify composite types such as "vector," "matrix," "array," and "data.frame."
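As a rough illustration of slot typing, the sketch below defines a hypothetical DataSet class (not part of the demo) with one atomic slot and one matrix slot, then tries to assign a value of the wrong type. The class and slot names are assumptions made only for this example.

```r
library(methods)  # already attached in an interactive R session

# Hypothetical class: one atomic slot and one composite (matrix) slot
setClass("DataSet",
  slots = c(
    name   = "character",
    values = "matrix"
  )
)

ds <- new("DataSet")
ds@name <- "trial"
ds@values <- matrix(c(1.0, 2.0, 3.0, 4.0), nrow = 2)

# Assigning a numeric value to a slot declared "character" raises an error
result <- tryCatch({
    ds@name <- 42
    "accepted"
  },
  error = function(e) "rejected"
)
cat("Assignment of 42 to the name slot was", result, "\n")
```

The type check happens at assignment time, which is the kind of protection the S3 model doesn't offer.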
It is possible to supply setClass with an optional function that checks the validity of initial field values. I don't find this mechanism very useful, and if I want to check initial field values, I do so in the initialize function, which is explained below. Also, you can use the "contains" keyword for a rudimentary form of OOP inheritance, which is outside the scope of this article.
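For readers who want to see what the validity mechanism looks like, here is a minimal sketch. The PayRecord class is hypothetical and the rule (no negative pay rate) is an assumption chosen for illustration; the demo's Person class has no validity function.

```r
library(methods)

# Hypothetical class with a validity function passed to setClass
setClass("PayRecord",
  slots = c(payRate = "numeric"),
  validity = function(object) {
    if (length(object@payRate) == 1 && object@payRate < 0) {
      return("payRate must not be negative")
    }
    TRUE
  }
)

ok <- new("PayRecord", payRate = 43.21)   # passes the check

# The check runs when slot values are passed to new()
msg <- tryCatch({
    new("PayRecord", payRate = -1.0)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")

# Direct assignment with @ does not re-run the check; validObject() does
ok@payRate <- -5.0
recheck <- tryCatch(validObject(ok), error = function(e) "invalid")
cat("After direct assignment the object is", recheck, "\n")
```

Note that the check is skipped on direct slot assignment, which is one reason the mechanism can feel less useful than it first appears.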
The next part of the demo code essentially defines a default constructor:
setMethod(f="initialize", signature="Person",
  definition=function(.Object) {
    .Object@empID <- as.integer(-1)
    .Object@lastName <- "NONAME"
    .Object@hireDate <- "1990-01-01"
    .Object@payRate <- 0.01
    return(.Object)
  }
)
The built-in setMethod function is used to define a special function named "initialize" for a "Person" object. An alternative approach supplies default field values through the "prototype" argument of setClass. Note that the deprecated "representation" argument (replaced by "slots") and the two ways of setting default values ("prototype" and "initialize") make the examples of S4 OOP you find on the Internet rather confusing because there are so many combinations of their use.
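For comparison, the prototype approach mentioned above might look like the following sketch. The class name PersonP is hypothetical, chosen so it doesn't collide with the demo's Person class; the slot names and default values mirror the demo.

```r
library(methods)

# Hypothetical variant of the demo class: defaults come from prototype,
# so no initialize method is defined
setClass("PersonP",
  slots = c(
    empID    = "integer",
    lastName = "character",
    hireDate = "character",
    payRate  = "numeric"
  ),
  prototype = list(
    empID    = -1L,          # the L suffix makes an integer literal
    lastName = "NONAME",
    hireDate = "1990-01-01",
    payRate  = 0.01
  )
)

p <- new("PersonP")
cat("Default name:", p@lastName, " default ID:", p@empID, "\n")
```

An initialize method is the more flexible of the two because it can run arbitrary code, while a prototype can only supply fixed values.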
In the initialize definition, ".Object" is a required parameter that references the object. It's somewhat similar to the "this" keyword in C# or the "self" keyword in Python. Notice that you must call return(.Object) as the last statement.
When assigning a default value to the empID field, I have to use the as.integer function because the default R numeric type is double rather than integer. Also notice that the demo stores the hire date as a character string. R has no atomic date type, although the built-in Date and POSIXct classes can hold date-time values; a string keeps the demo simple.
The definition of an initialize function is optional. If you don't define an initialize, when you instantiate an object, fields will be given default values that are zero-length vectors, indicated by integer(0), character(0), and so on. Zero-length vectors are not the same as NULL values. The bottom line is that in most situations you should supply an initialize function to set initial field values.
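A quick way to see the zero-length defaults is to define a class with no initialize method. The Bare class below is hypothetical and exists only for this sketch.

```r
library(methods)

# Hypothetical class with no initialize method and no prototype
setClass("Bare", slots = c(id = "integer", label = "character"))

b <- new("Bare")
cat("length of id        :", length(b@id), "\n")
cat("length of label     :", length(b@label), "\n")
cat("id is NULL          :", is.null(b@id), "\n")
cat("id equals integer(0):", identical(b@id, integer(0)), "\n")
```

Code that later assumes a single value in each field, such as a comparison inside an if statement, will fail on these zero-length vectors, which is why explicit initial values are the safer choice.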
Most S4 methods require a pair of setMethod and setGeneric functions. The initialize function is an exception because its generic is already defined by R. By default, you get an automatically defined print function (implemented through the built-in show generic) when defining an S4 class. The demo program defines an alternative way to display an S4 object:
setGeneric(name="display", def=function(obj) {
    standardGeneric("display")
  }
)
setMethod(f="display", signature="Person",
  definition=function(obj) {
    cat("Employee ID :", obj@empID, "\n")
    cat("Last Name :", obj@lastName, "\n")
    cat("Hire Date :", obj@hireDate, "\n")
    cat("Pay Rate : $", obj@payRate, "\n\n")
  }
)
Loosely stated, the setGeneric function tells the R runtime that there’s a global-scope function named "display." The setMethod function associates the "display" function with the "Person" class, and also defines the behavior of the display method.
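The division of labor between setGeneric and setMethod is easier to see when one generic serves two classes. In this sketch the Employee and Contractor classes and the describe generic are hypothetical names, not part of the demo.

```r
library(methods)

# Two hypothetical classes that share a field name
setClass("Employee",   slots = c(lastName = "character"))
setClass("Contractor", slots = c(lastName = "character"))

# One setGeneric call registers the function name
setGeneric("describe", function(obj) standardGeneric("describe"))

# One setMethod call per class supplies the class-specific behavior
setMethod("describe", "Employee",
  function(obj) paste(obj@lastName, "is an employee"))
setMethod("describe", "Contractor",
  function(obj) paste(obj@lastName, "is a contractor"))

r1 <- describe(new("Employee", lastName = "Adams"))
r2 <- describe(new("Contractor", lastName = "Baker"))
cat(r1, "\n")
cat(r2, "\n")
cat("Method registered for Contractor:",
  existsMethod("describe", "Contractor"), "\n")
```

R picks the implementation based on the class of the argument, which plays the role that the object-dot-method syntax plays in C#.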
Recall that the initialize function required a parameter named ".Object"; however, for other class methods you can use whatever parameter name you like to reference the object. In the demo I use "obj" but I could have used "this" (as in C#), "Me" (as in VB), "self" (as in Python) or even "foo" (if I wanted to be really annoying).
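The free choice of parameter name applies to generics you register yourself. When a generic already exists, its parameter names are fixed. One example is the built-in show generic, which handles default printing of S4 objects and names its parameter "object". The sketch below, using a hypothetical Badge class, customizes printing with setMethod alone.

```r
library(methods)

# Hypothetical two-field class, used only to keep the sketch short
setClass("Badge", slots = c(empID = "integer", lastName = "character"))

# show is already a generic, so no setGeneric call is needed;
# the parameter name "object" comes from the existing generic
setMethod("show", "Badge", function(object) {
  cat("<Badge", object@empID, object@lastName, ">\n")
})

b <- new("Badge", empID = 7L, lastName = "Baker")
print(b)   # print dispatches to show for S4 objects
```

Typing just b at the console gives the same output because auto-printing also goes through show.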
The demo defines a yearsService method, using the same setGeneric-setMethod pattern. The function returns the number of years between a Person date-of-hire and the current date. The setGeneric registration code is:
setGeneric(name="yearsService", def=function(obj) {
    standardGeneric("yearsService")
  }
)
And the implementation is:
setMethod(f="yearsService", signature="Person",
  definition=function(obj) {
    hd <- as.POSIXlt(obj@hireDate)
    today <- as.POSIXlt(Sys.Date())
    yrs <- today$year - hd$year
    if (today$mon < hd$mon || (today$mon == hd$mon &&
      today$mday < hd$mday)) {
      yrs <- yrs - 1
    }
    return(yrs)
  }
)
The yearsService method uses the built-in as.POSIXlt function to convert the hireDate string to a structure that has year, mon and mday fields. The calculation subtracts the year values, and then subtracts one more year if the current date falls before the anniversary of the hire date in the current year.
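Because yearsService depends on the current date, its result changes over time. To check the arithmetic with fixed dates, the same logic can be placed in an ordinary function that takes the reference date as a parameter. The wholeYears helper and its asOf parameter are hypothetical, introduced only for this sketch.

```r
# Hypothetical helper: same logic as yearsService, but the reference
# date is passed in so the result is reproducible
wholeYears <- function(hireDate, asOf) {
  hd <- as.POSIXlt(hireDate)
  td <- as.POSIXlt(asOf)
  yrs <- td$year - hd$year
  # subtract a year if the anniversary has not yet arrived in asOf's year
  if (td$mon < hd$mon || (td$mon == hd$mon && td$mday < hd$mday)) {
    yrs <- yrs - 1
  }
  yrs
}

before <- wholeYears("2010-09-15", "2017-03-01")  # 6: anniversary not reached
onDay  <- wholeYears("2010-09-15", "2017-09-15")  # 7: anniversary reached
cat("As of 2017-03-01:", before, "years\n")
cat("As of 2017-09-15:", onDay, "years\n")
```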
To summarize, you define an S4 class using the setClass function. The "slots" keyword (replaces the deprecated "representation") defines field names and their types. You can optionally define a validity-check function, or implement a primitive form of inheritance using the "contains" keyword. All fields have public scope and are accessed using the "@" operator. You use setMethod to define an "initialize" function (an alternative to the "prototype" keyword) with an ".Object" parameter that sets initial field values. You use a pair of setGeneric and setMethod to define a class method.
Using an S4 Object
The demo program creates a Person object, like so:
cat("Creating Person p1 with initialize values \n\n")
p1 <- new("Person")
display(p1)
When an S4 object is instantiated using the built-in new function, if an initialize function for the class has been defined (as in the demo), it will be invoked behind the scenes and set initial field values. An alternative way to instantiate an S4 object is to call the generator that setClass returns (saved as Person in the demo) as a function, for example:
p2 <- Person() # alternate S4 instantiation
print(p2)
The official R documentation isn’t clear about why there are two instantiation mechanisms for S4, and doesn't provide much advice. I prefer to use the new function for S4 instantiation.
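In practice the two mechanisms appear to produce the same result. The sketch below uses a hypothetical Emp class that relies on the default initialize behavior, so both forms also accept named field values; the demo's Person class, whose initialize takes only .Object, would not accept them.

```r
library(methods)

# Hypothetical class; setClass returns a generator, captured here as Emp
Emp <- setClass("Emp",
  slots = c(empID = "integer", lastName = "character"),
  prototype = list(empID = -1L, lastName = "NONAME")
)

a <- new("Emp")   # instantiate by class name
b <- Emp()        # instantiate through the generator
sameDefault <- identical(a, b)

# With the default initialize, both forms accept named field values
c1 <- new("Emp", empID = 7L, lastName = "Baker")
c2 <- Emp(empID = 7L, lastName = "Baker")
sameValues <- identical(c1, c2)

cat("Default objects identical     :", sameDefault, "\n")
cat("Objects with values identical :", sameValues, "\n")
```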
The demo illustrates field access with this code:
cat("Setting p1 fields directly \n\n")
p1@empID <- as.integer(65565)
p1@lastName <- "Adams"
p1@hireDate <- "2010-09-15"
p1@payRate <- 43.21
display(p1)
It’s possible to define get and set methods for an S4 class, but because all fields have public scope, there's no advantage in doing so.
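For completeness, here is a sketch of what a get and set pair might look like. The Wage class and the payRate accessor names are hypothetical. The set method is an R replacement function, so its name ends in "<-" and its last parameter must be named value.

```r
library(methods)

# Hypothetical one-field class
setClass("Wage", slots = c(payRate = "numeric"))

# get method
setGeneric("payRate", function(obj) standardGeneric("payRate"))
setMethod("payRate", "Wage", function(obj) obj@payRate)

# set method, written as a replacement function
setGeneric("payRate<-", function(obj, value) standardGeneric("payRate<-"))
setMethod("payRate<-", "Wage", function(obj, value) {
  obj@payRate <- value
  obj   # a replacement function must return the modified object
})

w <- new("Wage", payRate = 10)
payRate(w) <- 43.21          # calls the set method
cat("Pay rate is now", payRate(w), "\n")
```

The result is the same as assigning to w@payRate directly, just with more code.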
Next, the demo shows how to call a class method:
cat("Calling yearsService \n\n")
tenure <- yearsService(p1)
cat("Person p1 tenure = ", tenure, " years \n")
Notice that because the yearsService method was registered as a function that operates on a Person object, the method is called just like an ordinary built-in R function.
The demo concludes by showing S4 object assignment:
cat("Making a value-copy of p1 using '<-' \n\n")
p2 <- p1
cat("\nEnd OOP with S4 demo \n\n")
Here, object p2 is a value copy of p1 because S4 copies by value rather than reference. In other words, p2 is an independent duplicate of p1 and any changes made to p2 will have no effect on p1.
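The copy behavior is easy to verify. This sketch uses a hypothetical one-field Member class and also shows a consequence that can surprise C# developers: a function that changes a field works on its own copy, so the caller must capture the returned object.

```r
library(methods)

# Hypothetical one-field class
setClass("Member", slots = c(lastName = "character"))

p1 <- new("Member", lastName = "Adams")
p2 <- p1                      # value copy
p2@lastName <- "Baker"        # changes p2 only
cat("p1:", p1@lastName, " p2:", p2@lastName, "\n")

# A function receives a copy, so it must return the modified object
rename <- function(p, newName) {
  p@lastName <- newName
  p
}

invisible(rename(p1, "Chen")) # result discarded, so p1 is unchanged
p3 <- rename(p1, "Chen")      # result captured in a new object
cat("p1:", p1@lastName, " p3:", p3@lastName, "\n")
```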
Wrapping Up
The demo code presented in this article should give you all the information you need to get up and running with S4 classes. When I need to write OOP code in R, I often have a difficult time deciding whether to use the S3, S4 or RC model. The RC model is much more like the C# OOP model I'm used to, but based on my experiences, most R programmers come from a strictly R programming background and feel more comfortable with S3 and S4. So, if I'm writing code intended for my own use only, then I'll usually use the RC model, but if I'm writing code for R programmers, I'll usually use S3 or S4.
In spite of the technical superiority of the S4 model over the S3 model, most of my colleagues prefer S3 to S4. I suspect that this is due mostly to the rather confusing documentation for S4, which in turn is due in large part to the many changes made to S4 since its introduction, such as deprecating "representation" in favor of "slots," and to overlapping mechanisms such as "prototype" and "initialize."