Practical .NET

Getting All the Results When Joining with LINQ

When you want to find all the objects that are missing a corresponding object, then you need the LINQ equivalent of an outer join. Here, step by painful step, is how to build that query.

Here's a common problem: finding or presenting two collections of related objects where one collection is missing some items. Examples include listing all Customers with their SalesOrders, including Customers that don't have a SalesOrder, and listing all Customers who don't have a SalesOrder.

To meet these requirements you need to use the LINQ Join clause. By default, the Join keyword joins two collections together to get all the matching objects. The keyword in that sentence is "matching." If, for example, you're joining a collection of Customer and SalesOrder objects, you'll get all the Customers that have a matching SalesOrder, but you won't get Customers that don't have a SalesOrder.

In those cases, you want what a SQL programmer would call an "outer join." You can do it in LINQ but the syntax is a little odd. The code is also sufficiently different between C# and Visual Basic that I'll need to cover both language versions (it would be unreasonable to expect a developer to be able to translate from one language to another in this case).

Starting the Query
You begin your outer join query with the collection that you want to get all of the objects from. In my case, I want all of the Customer objects, so I start with the Customers collection. In Visual Basic, the start of my query looks like this:

Dim db As New CustomerOrdersContext
Dim res = From cust In db.Customers

In C#, the start of the query is similar:

CustomerOrdersContext db = new CustomerOrdersContext();
var res = from cust in db.Customers

Now, you use the Join clause in C# or the Group Join clause in Visual Basic to bring in the other collection. The syntax for the Join/Group Join clause is very similar to the syntax for the From clause (it includes, for example, a range variable to refer to individual items in the collection). Here's the C# code, using a range variable called so to refer to objects in the SalesOrder collection:

var res = from cust in db.Customers
            join so in db.SalesOrders

And here's the Visual Basic code, again using a range variable called so:

Dim res = From cust In db.Customers
            Group Join so In db.SalesOrders

Now, as with a SQL Join, you need to specify the criteria for matching up the objects from each collection. You have only one tool for matching objects: the LINQ Equals keyword. In my case, the SalesOrder object has a CustomerId property that holds a value that matches the Id property on the related Customer object.

Be aware, though: The C# Join syntax is fussier than the Visual Basic Group Join clause. In C#, the range variable from the first From statement must be on the left-hand side of the equals keyword. The C# code, therefore, has to look like this:

var res = from cust in db.Customers
            join so in db.SalesOrders
              on cust.Id equals so.CustomerId

In Visual Basic, I've taken advantage of the language's flexibility on this matter to swap the items on either side of the Equals keyword (not that it makes any difference to the result):

Dim res = From cust In db.Customers
            Group Join so In db.SalesOrders
              On so.CustomerId Equals cust.Id
Retrieving the Matching Objects

The next step is to generate a new collection consisting of the SalesOrder objects that match to an object in the Customers collection. For that, you use the LINQ Into clause.

In C# the into clause defines a collection consisting of the SalesOrders that have a matching Customer object:

var res = from cust in db.Customers
            join so in db.SalesOrders
              on cust.Id equals so.CustomerId
              into MatchedOrders

The Visual Basic version, on the other hand, creates a collection of anonymous objects. In Visual Basic you use the Into clause to define properties on those anonymous objects. This code defines a property on the anonymous object called MatchedOrders and uses the Group keyword to set that property to the collection of SalesOrders matched to each CustomerId:

Dim res = From cust In db.Customers
            Group Join so In db.SalesOrders
              On so.CustomerId Equals cust.Id
              Into MatchedOrders = Group

Now you're ready to process that new, generated collection of matched objects. You need to establish a range variable to use with that collection, so you follow the Into clause with a From clause. Here's the Visual Basic code that establishes a range variable called mo:

Dim res = From cust In db.Customers
            Group Join so In db.SalesOrders
              On so.CustomerId Equals cust.Id
              Into MatchedOrders = Group
          From mo In MatchedOrders

And the C# code:

var res = from cust in db.Customers
          join so in db.SalesOrders
            on cust.Id equals so.CustomerId
            into MatchedOrders
          from mo in MatchedOrders

Now you're almost ready to return some data in your Select clause, but you need to handle Customers without a matching SalesOrder.

Here's a way to think about what happens next: If, in the Select clause, you use both the cust range variable (retrieving Customers) and the mo range variable (retrieving matching SalesOrders) then LINQ will attempt to find a matching SalesOrder from the MatchedOrders collection for each Customer. If there's no matching SalesOrder in MatchedOrders then the Customer object will be discarded.

If you don't want Customer objects to be discarded (and that's the point of this column) you call the MatchedOrders' DefaultIfEmpty method. Now, if there's no matching entry for the Customer object in the MatchedOrders collection, a default value is provided and the Customer object isn't discarded. For an object like the SalesOrder object, the default value is null/Nothing.

Here's the C# code that ensures all the Customer objects will be available:

var res = from cust in db.Customers
          join so in db.SalesOrders
            on cust.Id equals so.CustomerId
            into MatchedOrders
          from mo in MatchedOrders.DefaultIfEmpty()

And the Visual Basic code:

Dim res = From cust In db.Customers
          Group Join so In db.SalesOrders
            On so.CustomerId Equals cust.Id
            Into MatchedOrders = Group
          From mo In MatchedOrders.DefaultIfEmpty()

Producing the Result
Finally, in your Select method you can produce a result. In your Select clause you can process items from the first From clause (the one with cust as its range variable) and from the last From clause (the one with mo as its range variable). You can't, in the Select clause, use the range variable from the Join clause (in my example, that's the so variable).

Listing 1 shows the "close to final" version in C#; Listing 2 shows the Visual Basic version. In both versions of my Select clause, I'm generating an anonymous object with values from the two different collections (see my tip on generating anonymous objects in LINQ with Visual Basic). In this code I'm taking advantage of a feature of both languages and letting names of the properties on my anonymous object default to the names of the properties on my cust and mo range variables. You can see what the anonymous object looks like in the For Each code that processes the queries' result.

Unfortunately, neither version will work.

Listing 1: C# Code for a LINQ Outer Join
var res = from cust in db.Customers
            join so in db.SalesOrders
              on cust.Id equals so.CustomerId
              into MatchedOrders
          from mo in MatchedOrders.DefaultIfEmpty()
          select new { cust.CustomerId, mo.SalesOrderId };

foreach (var r in res)
{
  Debug.WriteLine(r.CustomerId + ", " + r.SalesOrderId);
}
Listing 2: Visual Basic Code for a LINQ Outer Join
Dim res = From cust In db.Customers
            Group Join so In db.SalesOrders
              On so.CustomerId Equals cust.Id
              Into MatchedOrders = Group
          From mo In MatchedOrders.DefaultIfEmpty()
          Select cust.CustomerId, mo.SalesOrderId

For Each r In res
  Debug.WriteLine(r.CustomerId & ", " & r.SalesOrderId)
Next

Working with Unmatched Objects
The reason that neither version will work is because the queries will die as soon as you have a Customer without a matching SalesOrder (that's because, thanks to DefaultIfEmpty, the mo reference in the Select clause will be null when there's no matching SalesOrder). As you know, trying to retrieve the value of a property (like SalesOrderId) from a null reference doesn't end well. You're going to need to deal with the null value in the range variable that works with the MatchedOrders collection.

The Immediate If expression in C# and Visual Basic provides an answer by allowing you to test if a value is null. An Immediate If provides one result when a test is passed and another result when a test is failed. Even better, it does its job as part of a larger statement. In this case, my test will be whether the mo variable is null/Nothing. If the value isn't null, I'll return the SalesOrderId; if the value is null, I'll return "Not Found."

The bad news with using an expression to set a property value on an anonymous object is that no default property name will be generated for an expression. As a result, my Select clause will get more complicated because I have to explicitly assign a property name for the result of the Immediate If.

Putting that all together, here's the revised version in Visual Basic of the Select clause:

Select New With {
                  cust.CustomerId,
                  .SalesOrderId = If(mo IsNot Nothing, mo.SalesOrderId.ToString(), "Not Found")
}

And the C# version:

select new {
             cust.FirstName,
             SalesOrderId = mo != null ? mo.SalesOrderId.ToString() : "Not Found"
           };

On the other hand, if you want to limit your result to only the Customers who don't have a SalesOrder, the code gets simpler. With this query you don't, for example, have to worry about occasionally handling a missing SalesOrder because you know there will never be a SalesOrder.

To get just the Customers without a SalesOrder, all you have to do is test for the null/Nothing value provided by the DefaultIfEmpty method for missing SalesOrders. You can do that by adding a Where clause to the From clause that processes the generated SalesOrder collection (MatchedOrders). The resulting query, in C#, is shown in Listing 3; the Visual Basic query is in Listing 4.

Listing 3: C# Code To Find Object without a Match
var res = from cust in db.Customers
            join so in db.SalesOrders
              on cust.Id equals so.CustomerId
              into MatchedOrders
          from mo in MatchedOrders.DefaultIfEmpty()
            where mo == null                  
          select new { cust.CustomerId };
Listing 4: Visual Basic Code To Find Object without a Match
Dim res = From cust In db.Customers
            Group Join s In db.SalesOrders
              On s.CustomerId Equals cust.Id
              Into MatchedOrders = Group
          From mo In MatchedOrders
            Where mo Is Nothing
          Select cust.CustomerId

This has been a lot of words to analyze a single kind of LINQ query. However, this also sets me up to discuss using groups in LINQ in a later column. You'll probably find that functionality more generally useful.

About the Author

Peter Vogel is a system architect and principal in PH&V Information Services. PH&V provides full-stack consulting from UX design through object modeling to database design. Peter tweets about his VSM columns with the hashtag #vogelarticles. His blog posts on user experience design can be found at http://blog.learningtree.com/tag/ui/.

comments powered by Disqus

Featured

  • Get Started Using .NET Aspire with SQL Server & Azure SQL Database

    Microsoft experts are making the rounds educating developers about the company's new, opinionated, cloud-ready stack for building observable, production ready, distributed, cloud-native applications with .NET.

  • Microsoft Revamps Fledgling AutoGen Framework for Agentic AI

    Only at v0.4, Microsoft's AutoGen framework for agentic AI -- the hottest new trend in AI development -- has already undergone a complete revamp, going to an asynchronous, event-driven architecture.

  • IDE Irony: Coding Errors Cause 'Critical' Vulnerability in Visual Studio

    In a larger-than-normal Patch Tuesday, Microsoft warned of a "critical" vulnerability in Visual Studio that should be fixed immediately if automatic patching isn't enabled, ironically caused by coding errors.

  • Building Blazor Applications

    A trio of Blazor experts will conduct a full-day workshop for devs to learn everything about the tech a a March developer conference in Las Vegas keynoted by Microsoft execs and featuring many Microsoft devs.

  • Gradient Boosting Regression Using C#

    Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the gradient boosting regression technique, where the goal is to predict a single numeric value. Compared to existing library implementations of gradient boosting regression, a from-scratch implementation allows much easier customization and integration with other .NET systems.

Subscribe on YouTube