Code Focused
Add Distinction to Your Code
Learn how LINQ, extension methods, and lambda functions can help you add a bit of distinction, simplicity, and robustness to your code.
- By Bill McCarthy
- 11/01/2008
TECHNOLOGY TOOLBOX: VB.NET
One of my favorite things about Language Integrated Query (LINQ) is how it allows you to focus on what you want, rather than on how to achieve it. Nothing illustrates this principle better than LINQ's Distinct clause, which you can use to create distinct views of your object collections. Doing so can help you sort, massage, and filter data to fit your application's needs, whether you want to use query syntax to transform your data, or you want to create your own types that implement IEquatable(Of T). Another possibility: You might call the Distinct extension method directly and supply an implementation of IEqualityComparer(Of T).
I'll walk you through all of these scenarios in this article, explaining how the combination of LINQ, extension methods, and lambda functions can help you add a bit of distinction, simplicity, and robustness to your code. Let's begin with a common (and simple) type of problem. Assume you have a list of customers and addresses, and you want to create a query to determine all the ZIP codes or postcodes:
Dim postcodes = From c In customers _
Select c.PostCode Distinct
The Distinct clause ensures you don't get any duplicate postcodes. Without the Distinct clause, you'd have to write your own routine that loops through the list, gets each postcode, and checks to see whether it's unique by using a Dictionary(Of TKey, TValue) or the newer HashSet(Of T) class. The Distinct clause abstracts the how for you, allowing you to specify the what.
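To make the comparison concrete, here is a rough sketch of the kind of hand-written routine the Distinct clause saves you from. The UniquePostcodes helper and the sample data are assumptions made for illustration; they aren't part of any library.

```vbnet
Option Strict On
Imports System
Imports System.Collections.Generic

Module ManualDistinctSketch
    ' Hypothetical helper: returns each postcode once, in first-seen order.
    Function UniquePostcodes( _
        ByVal postcodes As IEnumerable(Of String)) As List(Of String)
        Dim seen As New HashSet(Of String)()
        Dim result As New List(Of String)()
        For Each code As String In postcodes
            ' HashSet.Add returns False when the value is already present.
            If seen.Add(code) Then
                result.Add(code)
            End If
        Next
        Return result
    End Function

    Sub Main()
        Dim sample() As String = New String() {"3000", "3121", "3000", "2000"}
        For Each code As String In UniquePostcodes(sample)
            Console.WriteLine(code)
        Next
    End Sub
End Module
```

Every line of that loop is "how"; the query version states only the "what."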
Sometimes that abstraction doesn't shield you from the how; instead, you have to consider carefully how you want the query to work. Assume you want a list of suburbs from your customer collection. You might be tempted to write something like this:
Dim suburbs = From c In customers _
Select c.Suburb Distinct
If your data is formatted properly, this query should work fine. However, it falls down if the data might include different or mixed case entries for the Suburb field: "Springfield" and "springfield" are considered distinct from each other, so you end up with duplicates if the casing is different. A robust and well-abstracted query would say "Ignore Case" as part of the Distinct. Unfortunately, there's no such statement in LINQ (at present), so you have no choice but to delve into the how.
If you want to do something simple like get a list of suburbs, you can force the Suburb to be either all lowercase or all uppercase:
Dim suburbs = From c In customers _
Select c.Suburb.ToUpper Distinct
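If you'd rather keep the original casing of the values, one alternative for in-memory collections is to call the Distinct extension method with one of the framework's StringComparer instances, which implement IEqualityComparer(Of String). This is a minimal sketch; the DistinctIgnoringCase helper and the sample data are assumptions for illustration.

```vbnet
Option Strict On
Option Infer On
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module IgnoreCaseDistinctSketch
    ' Hypothetical helper: de-duplicates names without regard to case.
    Function DistinctIgnoringCase( _
        ByVal names As IEnumerable(Of String)) As List(Of String)
        ' StringComparer.OrdinalIgnoreCase supplies both Equals and
        ' GetHashCode in a case-insensitive form.
        Return names.Distinct(StringComparer.OrdinalIgnoreCase).ToList()
    End Function

    Sub Main()
        Dim suburbs = New String() {"Springfield", "springfield", "Shelbyville"}
        For Each s In DistinctIgnoringCase(suburbs)
            Console.WriteLine(s)
        Next
    End Sub
End Module
```

Note that this runs against objects in memory; it isn't something a query provider would translate for you.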
A list of the suburbs without any other information is of limited usefulness. Give yourself a pat on the back and put a gold star on your monitor if you were thinking that different locations can have the same suburb name. There are literally dozens of Springfields in the United States. To help ensure you've got the right locations, you can include the state and postcode in the return set:
Dim places = From c In customers _
Select c.Suburb, c.State, c.PostCode Distinct
For this query to work with case-insensitive data, you need to make both Suburb and State uppercase. However, you'll get a design-time error ("Range variable 'ToUpper' is already declared.") if you change the query to make those fields uppercase without specifying new field names:
Dim places = From c In customers _
Select c.Suburb.ToUpper, c.State.ToUpper, _
c.PostCode Distinct
The cause of the error is that the query creates an anonymous type, and the compiler tries to infer the field names for the anonymous type based on the last member call, two of which are calls to ToUpper. If your query selected only c.Suburb.ToUpper and c.PostCode, the anonymous type created would have two properties: one named ToUpper and one named PostCode. The trick here is to define your own names for the properties:
Dim places = From c In customers _
Select Suburb = c.Suburb.ToUpper, _
State = c.State.ToUpper, _
c.PostCode Distinct
The Distinct clause in this query works with the anonymous type because the VB compiler implements IEquatable(Of T) for that type. IEquatable(Of T) has only one member, an overload of the Equals method:
Interface IEquatable(Of T)
Function Equals(ByVal other As T) As Boolean
End Interface
If you write your own class with Suburb, State, and PostCode properties, you should implement IEquatable(Of T).
Although IEquatable(Of T) defines only the Equals(T) function, for IEquatable(Of T) to work properly you need to also override GetHashCode. This is important with the Distinct clause because it calls GetHashCode first, then calls the IEquatable(Of T).Equals method only if the hashes match. Creating a minimal implementation of IEquatable(Of T) is simple enough (see Listing 1).
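As a rough idea of what such a class can look like, here is a minimal sketch. The Place class, its members, and the choice to ignore case on Suburb and State are assumptions made for illustration; this is not the code from Listing 1.

```vbnet
Option Strict On
Imports System
Imports System.Collections.Generic
Imports System.Linq

' Hypothetical class, assumed for illustration.
Public Class Place
    Implements IEquatable(Of Place)

    Public Suburb As String
    Public State As String
    Public PostCode As String

    Public Sub New(ByVal suburb As String, ByVal state As String, _
                   ByVal postCode As String)
        Me.Suburb = suburb
        Me.State = state
        Me.PostCode = postCode
    End Sub

    ' Strongly typed comparison required by IEquatable(Of Place).
    Public Overloads Function Equals(ByVal other As Place) As Boolean _
        Implements IEquatable(Of Place).Equals
        If other Is Nothing Then Return False
        Return String.Equals(Suburb, other.Suburb, _
                StringComparison.OrdinalIgnoreCase) _
            AndAlso String.Equals(State, other.State, _
                StringComparison.OrdinalIgnoreCase) _
            AndAlso String.Equals(PostCode, other.PostCode)
    End Function

    ' Keep Object.Equals consistent with the typed version.
    Public Overloads Overrides Function Equals(ByVal obj As Object) As Boolean
        Return Equals(TryCast(obj, Place))
    End Function

    ' Equal places must return equal hashes; the postcode alone is enough.
    Public Overrides Function GetHashCode() As Integer
        If PostCode Is Nothing Then Return 0
        Return PostCode.GetHashCode()
    End Function
End Class

Module PlaceDemo
    Sub Main()
        Dim places As New List(Of Place)()
        places.Add(New Place("Springfield", "IL", "62701"))
        places.Add(New Place("SPRINGFIELD", "il", "62701"))
        places.Add(New Place("Springfield", "MO", "65801"))
        ' The default comparer picks up IEquatable(Of Place); prints 2.
        Console.WriteLine(places.Distinct().Count())
    End Sub
End Module
```

The hash here deliberately uses fewer fields than Equals does. That is allowed, because the only requirement is that items considered equal produce the same hash.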
Implementing IEquatable(Of T) works, but it limits you to only one way of comparing items and reporting them as being equal. As your types get more complex, this can be a serious limitation. Consider a de-normalized customer class: At times, you might want to consider two customers the same if the name and Social Security number are the same, even if the addresses are different. Other times you might want all fields to match. The query syntax using the Distinct clause doesn't provide any means of specifying a different way to compare objects, but if you use the extension methods directly, you can use an overload of Distinct that accepts an IEqualityComparer(Of T) argument:
Dim query = From c In customers _
Where c.LastName = "Smith"
query = query.Distinct(New myCustomerComparer())
The argument is an instance of a custom comparer class, here named myCustomerComparer, which you have to define yourself. The class must implement IEqualityComparer(Of T):
Class myCustomerComparer
   Implements IEqualityComparer(Of Customer)

   Public Overloads Function Equals( _
      ByVal x As Customer, ByVal y As Customer) _
      As Boolean Implements IEqualityComparer( _
      Of Customer).Equals
      ' code here to compare customers
   End Function

   Public Overloads Function GetHashCode( _
      ByVal obj As Customer) As Int32 _
      Implements IEqualityComparer( _
      Of Customer).GetHashCode
      ' must return same value for
      ' customers considered equal
   End Function
End Class
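To show how the two placeholder bodies might be filled in, here is a sketch of a comparer that treats customers as the same when the last name and Social Security number match, whatever the address. The Customer fields and the NameAndSsnComparer name are assumptions for illustration.

```vbnet
Option Strict On
Imports System
Imports System.Collections.Generic
Imports System.Linq

' Hypothetical Customer shape, assumed for illustration.
Public Class Customer
    Public LastName As String
    Public SSNumber As String
    Public Suburb As String

    Public Sub New(ByVal lastName As String, ByVal ssNumber As String, _
                   ByVal suburb As String)
        Me.LastName = lastName
        Me.SSNumber = ssNumber
        Me.Suburb = suburb
    End Sub
End Class

Public Class NameAndSsnComparer
    Implements IEqualityComparer(Of Customer)

    ' Same name and same SSNumber means the same customer.
    Public Overloads Function Equals(ByVal x As Customer, _
        ByVal y As Customer) As Boolean _
        Implements IEqualityComparer(Of Customer).Equals
        If x Is Nothing OrElse y Is Nothing Then Return x Is y
        Return String.Equals(x.LastName, y.LastName) _
            AndAlso String.Equals(x.SSNumber, y.SSNumber)
    End Function

    ' Built from a field used in Equals, so equal customers hash alike.
    Public Overloads Function GetHashCode(ByVal obj As Customer) As Integer _
        Implements IEqualityComparer(Of Customer).GetHashCode
        If obj Is Nothing OrElse obj.SSNumber Is Nothing Then Return 0
        Return obj.SSNumber.GetHashCode()
    End Function
End Class

Module ComparerDemo
    Sub Main()
        Dim customers As New List(Of Customer)()
        customers.Add(New Customer("Smith", "111-11-1111", "Springfield"))
        customers.Add(New Customer("Smith", "111-11-1111", "Shelbyville"))
        customers.Add(New Customer("Smith", "222-22-2222", "Springfield"))
        ' The first two differ only by suburb, so two customers remain.
        Console.WriteLine(customers.Distinct(New NameAndSsnComparer()).Count())
    End Sub
End Module
```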
Creating a class that implements IEqualityComparer(Of T) gives you flexibility when making calls to the Distinct extension, but the trade-off is that you must define a type for each different Distinct query you want to write. The real work in the implementation of the IEqualityComparer(Of T) interface is the Equals method and the GetHashCode method. You can use lambda expressions for these and factor the implementation into a generic class:
Class DistinctComparer(Of T)
   Implements IEqualityComparer(Of T)

   Private _equalFunc As Func(Of T, T, Boolean)
   Private _hashFunc As Func(Of T, Int32)

   Sub New( _
      ByVal equalFunc As Func(Of T, T, Boolean), _
      ByVal hashFunc As Func(Of T, Int32))
      _equalFunc = equalFunc
      _hashFunc = hashFunc
   End Sub

   Public Function IEqualityComparer_Equals( _
      ByVal x As T, _
      ByVal y As T _
      ) As Boolean _
      Implements IEqualityComparer(Of T).Equals
      Return _equalFunc(x, y)
   End Function

   Public Function _
      IEqualityComparer_GetHashCode( _
      ByVal obj As T) As Int32 _
      Implements IEqualityComparer( _
      Of T).GetHashCode
      Return _hashFunc(obj)
   End Function
End Class
With the generic DistinctComparer class, you can define custom lambda functions inline in your query. For example, you can use this code to get a list of customers, keeping only a single match per Social Security number:
Dim distinctCustomers = _
myCustomers.Distinct( _
New DistinctComparer(Of Customer)( _
Function(x, y) x.SSNumber = y.SSNumber, _
Function(x) x.SSNumber.GetHashCode))
There are two potential problems with this approach. First, the call to the DistinctComparer(Of T) constructor requires you to specify the generic parameter T. That's fine for normal types, but it won't work with anonymous types. The trick to working with anonymous types is to pass them to a generic method -- allowing the generic parameter to be inferred -- and then instantiate the type in the generic method. This approach works for various tasks, such as creating a List(Of T) for anonymous types. One helpful technique is to create an overload for the Distinct extension method, enabling you to use anonymous types with a DistinctComparer (see Listing 2). This extension simplifies the calling code significantly:
Dim distinctCustomers = myCustomers.Distinct( _
Function(x, y) x.SSNumber = y.SSNumber, _
Function(x) x.SSNumber.GetHashCode)
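For readers who want to see the moving parts together, here is one way such an overload could be written. Treat it as a sketch under stated assumptions rather than the code from Listing 2: the DistinctExtensions module name and the sample data are hypothetical, and the DistinctComparer class is repeated only so the example is self-contained.

```vbnet
Option Strict On
Option Infer On
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Runtime.CompilerServices

Public Class DistinctComparer(Of T)
    Implements IEqualityComparer(Of T)

    Private ReadOnly _equalFunc As Func(Of T, T, Boolean)
    Private ReadOnly _hashFunc As Func(Of T, Integer)

    Public Sub New(ByVal equalFunc As Func(Of T, T, Boolean), _
                   ByVal hashFunc As Func(Of T, Integer))
        _equalFunc = equalFunc
        _hashFunc = hashFunc
    End Sub

    Public Function IEqualityComparer_Equals(ByVal x As T, ByVal y As T) _
        As Boolean Implements IEqualityComparer(Of T).Equals
        Return _equalFunc(x, y)
    End Function

    Public Function IEqualityComparer_GetHashCode(ByVal obj As T) _
        As Integer Implements IEqualityComparer(Of T).GetHashCode
        Return _hashFunc(obj)
    End Function
End Class

' Hypothetical module name; any module in scope would do.
Public Module DistinctExtensions
    ' T is inferred from source, so the caller never has to name it.
    <Extension()> _
    Public Function Distinct(Of T)(ByVal source As IEnumerable(Of T), _
        ByVal equalFunc As Func(Of T, T, Boolean), _
        ByVal hashFunc As Func(Of T, Integer)) As IEnumerable(Of T)
        Return Enumerable.Distinct(source, _
            New DistinctComparer(Of T)(equalFunc, hashFunc))
    End Function
End Module

Module AnonymousTypeDemo
    Sub Main()
        Dim raw = New String() {"111-11-1111", "222-22-2222", "111-11-1111"}
        ' Project to an anonymous type; there is no type name to write.
        Dim rows = raw.Select( _
            Function(s, i) New With {.SSNumber = s, .Row = i})
        Dim unique = rows.Distinct( _
            Function(x, y) x.SSNumber = y.SSNumber, _
            Function(x) x.SSNumber.GetHashCode())
        ' Two distinct Social Security numbers remain.
        Console.WriteLine(unique.Count())
    End Sub
End Module
```

The sketch calls Enumerable.Distinct by its full name inside the extension so there is no doubt about which overload is being invoked.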
A problem still remains with this code. If the SSNumber is a string or a nullable type, and your data doesn't prevent nulls being stored, then all customers with no Social Security number on record will be deemed to be the same. This example provides a more robust expression, which never treats a missing number as a match and guards the hash function so that a null string doesn't raise a NullReferenceException:
Dim distinctCustomers = customers.Distinct( _
   Function(x, y) If(x.SSNumber Is Nothing, _
      False, x.SSNumber = y.SSNumber), _
   Function(x) If(x.SSNumber Is Nothing, _
      0, x.SSNumber.GetHashCode))
This approach works well if you are comparing only a few properties, but the code can get ugly quickly. One thing to note is the second lambda function -- which provides the hash code -- is required only to return the same number for items that might be equal. For example, you might decide to filter based on Social Security number and address fields, but have the hash based purely on the Social Security number. Remember the hash code is the first check: The equality-comparison function is called only if the hash codes are equal.
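Here is a small sketch of that idea: equality checks both the Social Security number and an address field, while the hash uses the Social Security number alone. The Customer shape and the SsnAndStreetComparer name are assumptions made for illustration.

```vbnet
Option Strict On
Imports System
Imports System.Collections.Generic
Imports System.Linq

' Hypothetical Customer shape, assumed for illustration.
Public Class Customer
    Public SSNumber As String
    Public Street As String

    Public Sub New(ByVal ssNumber As String, ByVal street As String)
        Me.SSNumber = ssNumber
        Me.Street = street
    End Sub
End Class

Public Class SsnAndStreetComparer
    Implements IEqualityComparer(Of Customer)

    ' Equality looks at both fields.
    Public Overloads Function Equals(ByVal x As Customer, _
        ByVal y As Customer) As Boolean _
        Implements IEqualityComparer(Of Customer).Equals
        Return String.Equals(x.SSNumber, y.SSNumber) _
            AndAlso String.Equals(x.Street, y.Street)
    End Function

    ' The hash looks only at SSNumber. Items that Equals accepts always
    ' share an SSNumber, so they always share a hash.
    Public Overloads Function GetHashCode(ByVal obj As Customer) As Integer _
        Implements IEqualityComparer(Of Customer).GetHashCode
        If obj.SSNumber Is Nothing Then Return 0
        Return obj.SSNumber.GetHashCode()
    End Function
End Class

Module CoarseHashDemo
    Sub Main()
        Dim customers As New List(Of Customer)()
        customers.Add(New Customer("111-11-1111", "1 Main St"))
        customers.Add(New Customer("111-11-1111", "1 Main St"))
        customers.Add(New Customer("111-11-1111", "9 Elm St"))
        ' All three share a hash, but only the first two are equal,
        ' so two customers remain.
        Console.WriteLine( _
            customers.Distinct(New SsnAndStreetComparer()).Count())
    End Sub
End Module
```

A coarser hash costs some speed, because more items land in the same bucket and need the full comparison, but it never changes the result.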
Using the extension method directly is handy when dealing with in-memory objects because it gives you greater flexibility than the Distinct query clause. But be aware that it executes against the objects in memory. The Distinct clause can be safely translated to SQL statements when you use LINQ to SQL or LINQ to Entities, whereas the in-memory approach fetches the data and then filters it, rather than filtering it first.
The Distinct clause and Distinct extension provide handy tools to add to your toolbox. Combined with the other LINQ clauses, such as Group By and Order By, and the wealth of LINQ extensions, you can move toward focusing your code on the what, not the how.
About the Author
Bill McCarthy is an independent consultant based in Australia and is one of the foremost .NET language experts specializing in Visual Basic. He has been a Microsoft MVP for VB for the last nine years and sat in on internal development reviews with the Visual Basic team for the last five years where he helped to steer the language’s future direction. These days he writes his thoughts about language direction on his blog at http://msmvps.com/bill.