Practical .NET
Why You Really Should Be Using LINQ
Peter Vogel introduces a new column on application development in the real world, and begins by advocating for Language Integrated Query.
Welcome to Practical .NET, a new column offering how-to insight and advice for developers working with the flagship Microsoft programming framework. You may be familiar with my Practical ASP.NET column, which I've been writing weekly (more or less) on the Visual Studio Magazine Web site since May 2008. Now my focus expands from ASP.NET to explore a variety of .NET technologies. Think diversity.
What won't change is my commitment to "practical" programming. I'll focus on the tasks developers do right now, every day, in delivering business applications to their users. There will be times when I cover a really cool technology that's worth an early look. But by and large this column will focus on the tools and techniques that developers use to build functional applications.
One of the technologies that developers would be well advised to adopt is Language Integrated Query (LINQ). Not only does LINQ reduce the amount of code you have to write, it also gives you better performance and positions you for other technologies (like Parallel LINQ, or PLINQ, parallel processing).
Despite the advantages of LINQ, in my experience as a consultant and an instructor I've found that most of the developers I meet aren't using it. I'll look at how to get started with LINQ and the key technology you'll need to fully exploit the technology. I also look into why so many developers aren't using this compelling technology here.
In this column I assume that you've created an Entity Framework model based on the Northwind database and now want to retrieve the entities that correspond to your tables. I then walk through using LINQ to do that and introduce a key technology for exploiting LINQ.
LINQ Basics
LINQ is easy to get started with, assuming you've ever written an SQL statement or a For...Each loop. While SQL lets you select rows from a table, LINQ lets you select objects from a collection. With Entity Framework (and LINQ to Entities), the two technologies overlap because the objects in the collection you're querying represent the rows in the table you want to retrieve. Not only can you leverage your SQL knowledge in LINQ, but you can also take advantage of what you know about For...Each loops. In a For...Each loop, you're used to creating a range variable like the cust variable in this example:
'...a bunch of ADO.NET code to retrieve rows
'into a collection of objects called custs
For Each cust As Customer In custs
Next
If you counted up all the lines involved in converting the rows into objects, it would probably be several dozen lines of code -- impossible for the compiler to optimize.
In LINQ/Entity Framework the equivalent statement also uses a range variable but, other than that, it looks much like a SQL statement:
'...code to instantiate an Entity Framework
'ObjectContext called Northwind
Dim res = From cust In Northwind.Customers
Select cust
Ignoring the cust range variable, the major difference between this LINQ statement and an equivalent SQL statement is that first, I don't need an asterisk (*) in the Select clause to retrieve all the properties (I just retrieve the whole object), and second, the From clause appears first. The From clause comes first because it allows you to specify the collection that your range variable is retrieving objects from. This gives you IntelliSense support for your range variable as you type in the rest of your LINQ statement.
One of the benefits of LINQ is that it's only one statement (plus the Entity Framework code, of course) compared to the For...Each loop's multiple statements. That compression makes it considerably easier for the compiler to optimize your code. Another benefit is that LINQ to Entities will generate all the appropriate SQL for you. It will also take care of generating all the ADO.NET activity, and manage opening and closing the connections. There are a lot of benefits to turning that work over to LINQ.
LINQ queries are also .NET friendly -- I can use the output of a LINQ query to set the DataSource on a grid and let the user update the query's results:
Me.grdCust.DataSource = From cust In Northwind.Customers
Select cust
SQL Operations
In SQL, the input and the output for a query is a table, so I can use the output from one SQL query as the input to another. I can also use the output from one LINQ query as the input to another LINQ query. That way, instead of having to write one big complicated query (that I probably won't understand), I can write several simpler queries (that I will understand). This example, for instance, finds all the customers without a PostalCode and, if none are found, searches that collection based on the Customer's city:
Dim res = From cust In Northwind.Customers
Where cust.PostalCode IsNot Nothing
Select cust
If res.Count > 0 Then
Dim res2 = From cust In res
Where cust.City = "Berlin"
Select cust
End If
Doing other "typical SQL" operations, like sorting and filtering, lets you continue to leverage your SQL experience:
Dim resBC = From cust In Northwind.Customers
Where cust.Region = "BC"
Order By cust.CustomerID
Select cust
As I noted earlier, part of the problem developers have with LINQ is that it has two different syntaxes. One is this "SQL-like" syntax that I prefer. The other is based around methods and lambda expressions. The method-based equivalent to my previous example would look like this:
Dim resMethod = Northwind.Customers.
Where(Function(cust) cust.Region = "BC").
Select(Function(cust) cust).
OrderBy(Function(cust) cust.CustomerID)
My friends who know more about the inner workings of the .NET compilers tell me that the method-based syntax is the "real" syntax -- the SQL-like syntax is just "syntactical sugar." It's because of this underlying method-like syntax that you need to include an imports statement in your code for the System.LINQ namespace that these methods are part of.
Personally, I'm willing to take the hit on the compile time required to convert my pseudo-SQL to methods so that I can stick with a syntax I recognize. For instance, joining two collections looks enough like SQL to make me happy. The Join and On clauses look like SQL, except with range variables where I'd normally have table names:
Dim resjoin = From cust In Northwind.Customers
Join ord In Northwind.Orders
On cust.CustomerID Equals ord.CustomerID
Select cust
The biggest annoyance is that I can't use "=" in the On clause (which is what my fingers want to type). I have to use "Equals."
Not the Way It Seems
I think another problem that developers have in embracing LINQ is that the syntax looks so very, very inefficient. It looks like every row in the table is converted into an object in a collection and then processed in a For...Each loop, with the rows you don't want being discarded.
That's not what's happening. Instead, the compiler and LINQ to Entities looks at your LINQ statement and generates the appropriate SQL statement. I've looked at the SQL statements that are generated: They're good. By which I mean, they look like what I would've written. Except that I didn't have to.
Developers also, I think, get uncomfortable about the use of implicit declarations. I have to admit that I sometimes wonder what data type is being returned by this LINQ statement:
Dim resjoin = From cust In Northwind.Customers
Select cust
The range variable (cust) represents items from the Customers collection, so the result is probably going to be a collection of Customer objects. I'm comfortable with letting the compiler figure it out. I'm just interested in what methods and properties appear in the IntelliSense list when I type "res." That answers the only question I'm really interested in: What can I do with this collection?
If you hover your mouse over the res variable, the tooltip will tell you what data type you're getting. It usually turns out to be something like System.LINQ.IQueryable(of someObject). The "I" prefix in IQueryable indicates that the declaration is an interface. So, apparently, the compiler isn't even sure what's coming back from the LINQ expression: All the compiler seems to know is that the return value will be some class that implements the IQueryable interface. If the compiler doesn't know, why should I?
If not explicitly knowing the declaration bothered me, I could type it in:
Dim res As System.Linq.IQueryable(of northwndModel.Customer) =
But why bother? What value am I adding to the process if I do? Either I type in the same declaration that the compiler has figured out or I type in a different (and wrong) one. If the only "value" that I can add to a process is getting it wrong, I'm just as happy to skip it.
Quite frankly, not having to work out the data type is one less piece of trivia for me to have to worry about. In addition, small changes to my LINQ statement result in significant changes to what's returned. For instance, I might decide to return just the CompanyID instead of returning the whole Customer object, giving this LINQ statement:
Dim resCustId = From cust In Northwind.Customers
Select cust.CustomerID
My variable res is now going to be some collection of strings. If I'd declared the data type on res when the query retrieved the whole object, I'd now have to go back and change it. Life is too short.
The extreme example is when you don't even specify the object that's being returned. If, for instance, I only want a few of the properties on an entity, I'll create an anonymous object in my Select clause. With an anonymous object, I use the New keyword but don't specify a class name -- the compiler will make one up. I just have to list off the properties I want on the object and what values to set them to. You even get to take advantage of your knowledge of the With keyword (though you have to use some annoying braces).
This code causes the compiler to generate a class with two properties called Id and Contact, and sets those properties to values from the object being retrieved:
Dim resPart = From cust In Northwind.Customers
Select New With {
.Id = cust.CustomerID,
.Contact = cust.ContactName}
It's a very Zen approach to programming: just let go and let the universe -- or the compiler -- take care of the trivial details (like what your class is called). All you need is the IntelliSense list so you know what properties you can access.
Don't Execute
Something else that bothers developers about LINQ (and that I personally like) is the deferred execution. While I can put a LINQ statement into my code, it's not going to execute at the place where I put it. At run time, the connection to the database won't be opened, the SQL statement won't be sent to the database, and the rows won't be retrieved until I manipulate the LINQ query's result. In this case, that means that the code won't touch the database until I get to the For...Each loop:
Dim resDefered = From cust In Northwind.Customers
Select cust
'...126 lines of code...
For Each c In res
Next
I used to spend a lot of time ensuring that I never opened a connection that I didn't use -- and that I closed that connection as soon as I possibly could. This deferred execution of LINQ means I'm now guaranteed that I won't open a connection until I need the objects being returned. I count on LINQ and Entity Framework to ensure that the connection is closed as soon as possible.
If I've used the output of one LINQ expression as the input to another LINQ expression, it's only when the output of the second expression is used that the SQL is generated. So, I may have written this:
Dim res = From cust In Northwind.Customers
Where cust.PostalCode IsNot Nothing
Select cust
If res.Count > 0 Then
Dim res2 = From cust In res
Where cust.City = "Berlin"
Select cust
End If
But it won't be until I touch the res2 collection that my SQL will be generated. At that point the compiler will look at both LINQ expressions and, probably, generate a single SQL statement that combines both. So I get the benefit of having only one query issued to the database, but I also get the clarity in my code that comes from breaking down the problem into simpler steps.
Of course, if I'm returning data from a method, returning an "unresolved" LINQ query is probably a bad idea. In addition, if I'm returning some data, I probably want to define the data type that I'm returning. While I'm OK with the compiler specifying my collection's data types and leaving me in ignorance, developers using my methods probably want something more definite. And, if my method is part of a Web service, I also want to return something that will convert well into an XML format.
I can force my LINQ query to retrieve all of its data and nail down the data type by using one of the To* methods on the LINQ result to convert my results to a List or an array. It's not unusual for my methods to look like this:
Public Function GetCustomerByID(ByVal CustId As String)
As List(Of northwndModel.Customer)
Dim nw As New northwndModel.northwndEntities
Dim res = From cust In nw.Customers
Where cust.CustomerID = CustId
Select cust
Return res.ToList
End Function
Thinking Architecturally
This method (and ones like it) creates a problem with architecturally rigid developers. Does this method go in the data layer of your architecture? After all, it's accessing data, and because you're no longer writing methods filled with ADO.NET code, this is as close to your database engine as you'll get. Or does this method belong in your business layer because it's performing a business function: returning the customer object that corresponds to the customer Id? Should I call the method from my presentation layer or should I call the method from an intervening business layer that my presentation layer calls? LINQ and Entity Framework, to a certain extent, blur the boundary between the business and data layers.
But the underlying driver is the "separation of concerns." Each class and each method does one thing; each method is (relatively) simple and easy to understand; and applications are assembled out of simple components that come together to do complex things. The criteria that are set by the separation of concerns are what matter and this method, LINQ and Entity Framework, meets all of those criteria.
Getting Rid of the For...Each Loop
Another issue that developers have with LINQ is that it seems like you do everything twice: You write the LINQ statement to retrieve your rows and then you write a For...Each loop to process each item you've retrieved. With ADO.NET, you can just retrieve each row and process it as you retrieve it.
This reflects how developers are missing the connection between LINQ and another important technology: Extension methods. Extension methods let you skip that For...Each loop after your LINQ statement.
Extension methods are methods that aren't attached to a specific object. Instead, you specify the kind of object the methods should be attached to and .NET takes care of listing the method in those objects' IntelliSense lists. This allows you to define a method that, for instance, will attach itself to any IQueryable collection -- such as the result of any LINQ query.
For example, let's say that you want to convert the results of your LINQ query into a comma-delimited string (I don't think there's an existing .NET extension method that would do this). You want the CustomerId and CompanyName properties in each row, separated by a comma, and with each row terminated by a carriage return.
You could write a For...Each loop to run after your LINQ query:
Imports System.Runtime.CompilerServices
...
Dim res = From cust In nw.Customers
Where cust.CustomerID = CustId
Select cust
Dim csv As StringBuilder = New StringBuilder
For Each cust In res
csv.Append(cust.CustomerID & "," &
cust.CompanyName & vbCr)
Next
Or you could put that code in a method in a Module and decorate that code with the Extension attribute:
Module PHVExtensions
<System.Runtime.CompilerServices.Extension()>
Public Function ToCsv
(ByVal custs As IQueryable(Of northwndModel.Customer)) As String
Dim csv As StringBuilder = New StringBuilder
For Each cust In custs
csv.Append(cust.CustomerID & "," &
cust.CompanyName & vbCr)
Next
Return csv.ToString()
End Function
End Module
The first parameter in the method establishes what kind of object this ToCsv method will attach itself to. In this case, it will attach itself to any collection of Customer objects that implements the IQueryable interface. Or, to put it another way, it will attach itself to the output of any LINQ query that works with Customer objects. The collection this method is attached to will be passed into the method where it can be manipulated.
You can now eliminate the For...Each loop that follows your LINQ query and use your new extension method:
Dim res = From cust In res
Where cust.City = "Berlin"
Select cust
Dim csv As String = res.ToCsv
Or, because there's always at least four different ways to do something with LINQ, just attach your extension method to your LINQ expression (after all, your expression is really just a set of methods anyway):
Dim csv As String
csv = (From cust In res
Where cust.City = "Berlin"
Select cust).ToCsv
Letting Go
I think another reason developers avoid LINQ is because it implies a certain amount of "de-skilling." I spent years developing expertise in SQL (and I have the three-foot shelf of books to prove it), and now I may never need it again. Ditto for ADO.NET: I know all sorts of ways to optimize ADO.NET, and I may never need it again. LINQ seems to take care of it all.
But that's not a bad thing. When I started programming, the course that I took had me learn machine code and then assembler (and this was long enough ago that I was typing my code into a punch-card machine). What can I say: I'm very old. That's another set of skills I just don't need any more.
But there are lots of opportunity to develop skills in LINQ to Entities -- for instance, optimizing LINQ by leveraging or disabling lazy loading (that's the Include function). Understanding all of those built-in extension methods and how you can leverage them to get what you want is another area where you can develop expertise.
It's time to develop some new lore.