Geek Talk: Are Strings Value Types or Reference Types?
Some things only matter to the true nerd. So, if you're looking for a genuinely useful article, this isn't it.
While gurus would like to claim that all issues are black and white, that's not even remotely true. Many developers read about a "best practice," for example, and then wield it like a blunt weapon against any code that doesn't follow it. This reveals a fundamental misunderstanding about what a "best" practice is: a "best" practice is only best (a) when compared against some other practice and (b) when its costs and benefits are better than the costs and benefits of that other practice in some environment. This means that best practices aren't universally the "best": At the very least, in some other environment, the costs and benefits will change. Alternatively, different programmers will value some benefits and costs differently than other programmers. Either way, a best practice won't always be the best implementation and pretending otherwise will create more problems than it solves.
Knowing the best practice isn't as useful as understanding why it's the best practice, to whom it's the best practice and when it's the best practice. To put it another way: You need to think about what you're doing.
As another example, I recently ran across a Microsoft .NET Framework programming course that identified Strings as reference types. That isn't a lie, but it isn't useful, either, so I'm going to talk about why here. And, I apologize for this: This column is called Practical .NET because the intent is to make you, after reading it, want to write code differently. Nothing in what follows here will cause you to change anything about the way you write code. Promise.
Actually, the argument over whether the String data type is a value or a reference type isn't really an argument about Strings at all: It's about where, in the developer's conceptual hierarchy, the term "value" and the term "reference" should be applied.
There's no argument, for example, about how Strings are currently stored by the .NET Framework. Under the hood, the .NET Framework divides memory into two sections: the heap and the stack. The stack is the smaller of the two spaces and is used to hold what were once called the scalar data types: Integers, Longs, DateTimes. In general, you can think of these data types as the ones whose storage never changes -- an Integer is always 32 bits (4 bytes), regardless of your OS, for example.
The heap is used to hold those data types that have more "flexible" storage requirements. Arrays, for example, have a variable length depending on how many items are in the array (an array of 12 items requires more space than an array with 10 items). Strings, whose length can vary, are also stored on the heap.
Variables with data stored on the stack effectively hold the data in the variable; variables with data on the heap effectively hold a reference that points to the current location of your data on the heap (I'm simplifying here, but this is a "good enough" description). This distinction is where "value" and "reference" first appear: Data types on the stack are accessed directly (the variable holds the value) while data types on the heap are accessed indirectly (the variable holds a reference to the value). This is the lowest level on the conceptual ladder for using "value" and "reference."
You Don't Care
But, quite frankly, who cares? While there are differences between the heap and the stack, you can't do much about it. To begin with, you can't access the heap or the stack directly. It's true that the heap is subject to fragmentation and garbage collection, which can affect your application's performance, but trying to code around that or take control of the heap is (with very few exceptions) almost always a waste of your time. With more recent versions of the .NET Framework, garbage collection (which used to bring applications to a screeching halt) is having less impact on application performance.
It's not clear to me why you should even begin to worry about the storage mechanism, especially because the claim that "value data types are stored on the stack" isn't even true -- at least, not all the time. Everyone's favorite example of a value data type is the Integer data type, but Integer fields in a class are stored on the heap, along with the rest of the object, and not on the stack.
Furthermore, the stack isn't even part of the .NET Framework specification (it's part of the Windows implementation of the .NET Framework). Besides: If, tomorrow, the .NET Framework team completely changed the way memory was handled, you wouldn't care as long as your code ran unchanged.
The Real Differences
In fact, from a developer's point of view, the most obvious difference between two sets of data types isn't the storage mechanism or how the data is accessed. The difference that matters is whether you need to use the New keyword when using the data type. You don't have to use New with Integers, Longs, Strings or Structs, for example, but do need to use New with other data types. This matters in a very practical sense: If you don't use the New keyword the right way, your code won't compile.
But, before you make a big deal about how the New keyword relates to how items are stored, let me point out that while the New keyword isn't required for data types stored on the stack, nothing stops you from using it. Both of these lines of code will compile just fine, for example:
Dim i As Integer = New Integer
Dim x As String = New String("x")
Of course, I can't imagine why anyone would write the first line of code and I'd only write the second line to take advantage of a feature I discussed in a tip (and which several readers felt was a dumb way to do things, anyway).
Here's the other thing that matters: When value types are passed between variables, then each variable gets its own, independent copy of the data. Look at this code as an example:
Dim a1 As Integer
Dim b1 As Integer
a1 = 6
b1 = a1
a1 = 5
You don't expect the value in b1 to change when the value in a1 is changed in the last line of that code. That's because in the fourth line, where b1 is assigned the value in a1, you expect b1 to get its own independent copy of 6. When a1 is changed in the fifth line, you expect b1's copy of the data to be unchanged.
If you replace Integer with String you get exactly the same behavior:
Dim a1 As String
Dim b1 As String
a1 = "Pat"
b1 = a1
a1 = "Tracy"
Following the fifth line, you'd be very surprised to discover that b1 is now set to "Tracy." And before someone brings it up: Yes, after b1 is assigned the value in a1 and before the fifth line, the variables a1 and b1 are referencing the same section of the heap (the assignment copies the reference, not the characters; a related mechanism called "string interning" lets identical string literals share one copy, too). But, again, why do you care? I assume that the .NET Framework team adopted this kind of sharing to save memory (roughly like having only one copy of a DLL on the disk in the bad old days of Windows COM). However, in ordinary code you can't tell that's what's happening, so you don't care.
Therefore: For all intents and purposes, at the conceptual level that actually matters, Strings are value types. You don't have to use New and assignments give you your own personal copy of the data.
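For contrast, here's a minimal sketch of what sharing looks like when you can actually see it. The Customer class is hypothetical (it's not from any library or from the course I mentioned); the point is only that an object that can be changed in place shows the change through every variable that refers to it, while the String example above never does.

```vb
Imports System

' Hypothetical class, used only for illustration.
Class Customer
    Public Name As String
End Class

Module AssignmentDemo
    ' Same steps as the String example above: b1 keeps "Pat".
    Function StringAfterChange() As String
        Dim a1 As String = "Pat"
        Dim b1 As String = a1
        a1 = "Tracy"
        Return b1
    End Function

    ' c1 and c2 refer to one Customer object, so a change made
    ' through c1 is visible through c2.
    Function CustomerAfterChange() As String
        Dim c1 As New Customer()
        c1.Name = "Pat"
        Dim c2 As Customer = c1
        c1.Name = "Tracy"
        Return c2.Name
    End Function

    Sub Main()
        Console.WriteLine(StringAfterChange())     ' prints Pat
        Console.WriteLine(CustomerAfterChange())   ' prints Tracy
    End Sub
End Module
```

Notice that the second function changes the object (c1.Name = ...) rather than assigning a new object to c1. A String offers no way to do the first kind of change, which is why the sharing stays invisible.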
I've seen developers refer to Strings as "reference types with value semantics." This is only possible if you value the storage mechanism over behavior. However, even that only makes sense if you're also willing to refer to relational databases as "Btrieve storage structures with relational semantics" because your relational data isn't actually stored on your hard disk as relational tables, it just behaves that way in your code. If you value behavior over storage (as I do) then, I suppose, you might say that Strings are "value types with, currently, reference storage mechanisms." I'm not clear what advantage that would give you because there's nothing stopping the .NET Framework team from replacing the current storage model with a better one … as long as the behavior of your code didn't change.
I'll say it again: You don't care. You don't care if your data is stored in ASCII, EBCDIC or Unicode. You don't care if it's on the heap or the stack.
If you objected to my earlier statement that Strings are value types, how about this: If all you want to talk about are value types and reference types, then Strings are value types.
Special Handling for Strings
Which isn't to say that there aren't situations when working with Strings requires some special attention. Manipulating Strings can result in fragmenting the heap and forcing garbage collection. But that issue isn't because Strings are value or reference types: it's because Strings are immutable (as are lots of other data types -- Integers, for example, are also immutable). When you change an existing String in your code, you don't change the existing String in memory. Instead, new memory is allocated on the heap and a new String is created based on copying the old String and manipulating it to match your change. The old String is then abandoned to garbage collection.
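A small sketch of what immutability looks like from your code, assuming nothing beyond the String class's own ToUpper method (the module and function names are made up for the example): the "changed" String is a new String, and the original is left exactly as it was.

```vb
Imports System

Module ImmutabilityDemo
    ' Returns the original and the result side by side to show that
    ' ToUpper hands back a new String and leaves the original alone.
    Function OriginalAndUpper() As String
        Dim original As String = "Pat"
        Dim shouted As String = original.ToUpper()
        Return original & "/" & shouted
    End Function

    Sub Main()
        Console.WriteLine(OriginalAndUpper())   ' prints Pat/PAT
    End Sub
End Module
```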
If you want, you can think of Strings as "immutable reference types" (but if you do then I think you're obliged to think of the other types as "mutable value types" or "mutable reference types"… which you don't). But, again, if you just think of Strings as value types and stop worrying about how they're stored (you don't care) then you won't be surprised.
What You Shouldn't Do
That's not to say you should ignore the garbage collection issues, any more than you should ignore the result of dividing an integer by zero or using the property of a variable set to Nothing/null (though those issues will cause your application to crash rather than just pause randomly, which is all garbage collection will do).
If you're going to repeatedly change an existing String, then you should use the StringBuilder object, which allocates itself a hunk of memory to hold the String it manipulates. That advice follows from immutability, not from the reference or value type distinction. Unlike identifying Strings as reference types, identifying Strings as immutable is actually useful.
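Here's a rough sketch of the two approaches side by side. The function names and the comma-separated output are just for illustration; both functions produce the same text, but the first creates a new String on every pass through the loop while the second appends into the StringBuilder's own buffer and creates one String at the end.

```vb
Imports System
Imports System.Text

Module BuilderDemo
    ' Repeated concatenation: every &= creates a new String and
    ' abandons the previous one to garbage collection.
    Function JoinWithConcat(count As Integer) As String
        Dim result As String = ""
        For i As Integer = 0 To count - 1
            If i > 0 Then result &= ","
            result &= i.ToString()
        Next
        Return result
    End Function

    ' StringBuilder: appends go into the builder's buffer; a String
    ' is created only when ToString is called.
    Function JoinWithBuilder(count As Integer) As String
        Dim sb As New StringBuilder()
        For i As Integer = 0 To count - 1
            If i > 0 Then sb.Append(",")
            sb.Append(i)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(JoinWithConcat(5))    ' prints 0,1,2,3,4
        Console.WriteLine(JoinWithBuilder(5))   ' prints 0,1,2,3,4
    End Sub
End Module
```

For a handful of changes the difference won't matter; it's the loop that runs thousands of times where the StringBuilder version is likely to pay off.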
Hey! That last paragraph was almost useful. And, at the start of this column, I promised not to be useful. I obviously can't be trusted.
About the Author
Peter Vogel is a system architect and principal in PH&V Information Services. PH&V provides full-stack consulting from UX design through object modeling to database design. Peter tweets about his VSM columns with the hashtag #vogelarticles. His blog posts on user experience design can be found at http://blog.learningtree.com/tag/ui/.