Ask Kathleen
Convert XPath into LINQ to XML
Learn how to convert XPath statements in XmlNode.SelectNodes expressions to LINQ to XML for better maintainability and performance; also, drill down on the performance implications of using LINQ to XML instead of XPath queries against XmlDocument objects.
Technologies mentioned in this article include VB.NET, C#, and XAML.
Q We had a contractor do some XML work for us that we're having a hard time maintaining, partly because no one understands the complex XPath statements he used in XmlNode.SelectNodes statements. I'm trying to rewrite this in LINQ to XML so that we have an easier time maintaining it. However, I'm getting the error "Namespace Manager or XsltContext needed. This query has a prefix, variable, or user-defined function." How can I make this work?
A The problem is managing XML namespaces correctly in LINQ to XML. You didn't mention whether you're working in VB or C#, so I'll show you how to solve this in C# with the .NET classes, and then show you the same solution with the Visual Basic shortcuts.
I'll use the example XML structure in Listing 1 (the expanded online version of this listing incorporates additional fields, customers, and orders). This XML declares a namespace without a prefix, so all non-prefixed elements are in the namespace http:kadgen/CustomersAndOrders. The real name of the element includes the namespace, and your queries need to reflect this.
You handle namespace resolution with a namespace manager when working with the SelectNodes method. If the namespace manager doesn't define a prefix that appears in your XPath, you receive an error:
string xPath =
"//co:Order[.//co:ShipRegion[. = '"
+ state + "']]";
XmlDocument xmlDoc =
new XmlDocument();
xmlDoc.Load(xmlFileName);
XmlNamespaceManager nsmgr =
new XmlNamespaceManager(
xmlDoc.NameTable);
nsmgr.AddNamespace("co",
"http:CustomersAndOrder");
XmlNodeList nodeList =
xmlDoc.SelectNodes(
xPath, nsmgr);
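As a rough, self-contained sketch of the listing above, the following program runs the same XPath against a small inline document. The XML is a hypothetical stand-in for Listing 1 (the element nesting, the ShipInfo wrapper, and the customer IDs are assumptions), and the second method shows the call without a namespace manager, which should fail with the error quoted in the question.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class SelectNodesDemo
{
    // Hypothetical stand-in for Listing 1, reduced to what the query touches.
    public const string SampleXml =
@"<Root xmlns='http:CustomersAndOrder'>
  <Order><CustomerID>ALPHA</CustomerID><ShipInfo><ShipRegion>OR</ShipRegion></ShipInfo></Order>
  <Order><CustomerID>BETA</CustomerID><ShipInfo><ShipRegion>WA</ShipRegion></ShipInfo></Order>
</Root>";

    public static int CountOrders(string state)
    {
        var xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(SampleXml);

        // Map the "co" prefix used in the XPath to the document's default namespace.
        var nsmgr = new XmlNamespaceManager(xmlDoc.NameTable);
        nsmgr.AddNamespace("co", "http:CustomersAndOrder");

        // Note: concatenating state into the XPath assumes it contains no quote characters.
        string xPath = "//co:Order[.//co:ShipRegion[. = '" + state + "']]";
        return xmlDoc.SelectNodes(xPath, nsmgr).Count;
    }

    public static bool FailsWithoutManager()
    {
        var xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(SampleXml);
        try
        {
            // The prefix cannot be resolved because no namespace manager is supplied.
            int count = xmlDoc.SelectNodes("//co:Order").Count;
            return false;
        }
        catch (XPathException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CountOrders("OR"));
        Console.WriteLine(FailsWithoutManager());
    }
}
```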
You can convert this to work with an XDocument using the XPathSelectElements extension method in the System.Xml.XPath namespace:
IEnumerable<XElement> elements =
xDoc.XPathSelectElements(
xPath, nsmgr);
This is an easy conversion, but you're stuck with the same complex XPath statements.
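A minimal sketch of that conversion follows, assuming there is no XmlDocument left to borrow a NameTable from, so the namespace manager is built on a fresh NameTable. The sample XML is again a hypothetical stand-in for Listing 1.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathOnXDocumentDemo
{
    // Hypothetical stand-in for Listing 1.
    public const string SampleXml =
@"<Root xmlns='http:CustomersAndOrder'>
  <Order><CustomerID>ALPHA</CustomerID><ShipInfo><ShipRegion>OR</ShipRegion></ShipInfo></Order>
  <Order><CustomerID>BETA</CustomerID><ShipInfo><ShipRegion>WA</ShipRegion></ShipInfo></Order>
</Root>";

    public static int CountOrders(string xml, string state)
    {
        XDocument xDoc = XDocument.Parse(xml);

        // XPathSelectElements accepts any IXmlNamespaceResolver;
        // an XmlNamespaceManager over a new NameTable is one option.
        var nsmgr = new XmlNamespaceManager(new NameTable());
        nsmgr.AddNamespace("co", "http:CustomersAndOrder");

        string xPath = "//co:Order[.//co:ShipRegion[. = '" + state + "']]";
        return xDoc.XPathSelectElements(xPath, nsmgr).Count();
    }

    public static void Main()
    {
        Console.WriteLine(CountOrders(SampleXml, "OR"));
    }
}
```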
To rewrite this as a LINQ expression, you need to name elements with instances of the XName class instead of prefixed strings. The XName and XNamespace classes don't have public constructors because .NET optimizes them through atomization. Instead of constructors, implicit conversion operators create XName objects from strings. You can create an XName instance from a string using the expanded notation:
XName xName =
"{http:CustomersAndOrder}Customer";
If you had to write that every time you accessed an element, you'd go crazy! Instead, create a variable, perhaps at the class level, that includes an instance of the XNamespace class and use this wherever you use a namespace:
XNamespace co =
"http:CustomersAndOrder";
var q = from order
in xDoc.Descendants(
co + "Order")
where order.Descendants(
co + "ShipRegion").First().Value
== state
select order;
You can use the resulting query in a foreach loop or convert it using the ToArray or ToList extension methods.
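Put together, the query and both ways of consuming it might look like the sketch below. The inline XML is a hypothetical stand-in for Listing 1, and the class and method names are illustrative only. Note that First() throws if an Order has no ShipRegion descendant, which this sample avoids.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LinqOrdersDemo
{
    // Hypothetical stand-in for Listing 1.
    public const string SampleXml =
@"<Root xmlns='http:CustomersAndOrder'>
  <Order><CustomerID>ALPHA</CustomerID><ShipInfo><ShipRegion>OR</ShipRegion></ShipInfo></Order>
  <Order><CustomerID>BETA</CustomerID><ShipInfo><ShipRegion>WA</ShipRegion></ShipInfo></Order>
  <Order><CustomerID>GAMMA</CustomerID><ShipInfo><ShipRegion>OR</ShipRegion></ShipInfo></Order>
</Root>";

    static readonly XNamespace co = "http:CustomersAndOrder";

    public static List<XElement> OrdersShippedTo(XDocument xDoc, string state)
    {
        var q = from order in xDoc.Descendants(co + "Order")
                where order.Descendants(co + "ShipRegion").First().Value == state
                select order;

        // ToList runs the query once and stores the matching elements.
        return q.ToList();
    }

    public static void Main()
    {
        XDocument xDoc = XDocument.Parse(SampleXml);
        foreach (XElement order in OrdersShippedTo(xDoc, "OR"))
        {
            // Child elements are in the same namespace, so they need co as well.
            Console.WriteLine((string)order.Element(co + "CustomerID"));
        }
    }
}
```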
Visual Basic lets you define the namespace and prefix in an Imports statement:
Imports _
<xmlns:co="http:CustomersAndOrder">
VB also supplies shortcuts on the three main XML axes: child, descendant, and attribute. If a schema is available, VB offers IntelliSense specific to your XML structure along these axes. These shortcuts simplify the VB query:
Dim q = From order _
In xDoc...<co:Order> _
Where order...<co:ShipRegion>.Value _
= state
LINQ to XML makes it easier to work with namespaces than any previous version of .NET because you use them directly in your code while the framework optimizes them in the background. This is true in both C# and Visual Basic, although VB takes the syntax simplification one step further.
Q I'm currently using a lot of XPath against XmlDocument objects. What are the performance implications of converting these into LINQ to XML?
A Many LINQ to XML statements run significantly faster than their XPath counterparts. But this depends on how the query is constructed, so you might plan for performance checks on frequently used queries against the types of XML files you use.
In addition to being faster in many scenarios, LINQ is easier to tune for performance, because it's hard to figure out what complex XPath statements are doing. LINQ to XML queries are easier to understand, and delayed execution means you can break them up into logical chunks without incurring a performance penalty.
I'll use this non-trivial XPath statement to explore performance. It extracts all the customers who have shipped to a specific state from the XML in Listing 1:
Dim xPath As String = _
"//co:Customer[@CustomerID = " & _
"//co:ShipRegion[. = '" & _
state & "']//ancestor::co:Order" & _
"/co:CustomerID]/co:CompanyName"
You can write this query in LINQ using three different approaches: a join, a nested query, or two adjacent queries. Throw in a few variations, including extension methods, and there are tons of possibilities. Performance is not equivalent: the variations run from about three times to about 30 times faster than the XPath version. A nested query looks a bit like a nested query in SQL Server:
Dim elements = _
From customer In xDoc...<co:Customer> _
Where (From order In xDoc...<co:Order> _
Where order...<co:ShipRegion>.First().Value _
= "OR" And _
order.<co:CustomerID>.First().Value = _
customer.@CustomerID _
Select 1).Count() > 0 _
Select customer.<co:CompanyName>.First().Value _
Distinct
You can solve the same problem with a join (ns is an XNamespace variable holding the same namespace as the co prefix):
Dim elements = _
From region In xDoc...<co:ShipRegion> _
Join customer In xDoc...<co:Customer> _
On region.Ancestors( _
ns + "Order").<co:CustomerID>.Value _
Equals customer.@CustomerID _
Where region.Value = "OR" _
Select customer.<co:CompanyName>.Value _
Distinct
Both produce the same results, but performance testing (available in the download) shows the join to be nearly 10 times faster than the nested query, due to optimizations LINQ supplies to the join. The join is also easier to read and shorter, so it wins hands down.
You can simplify this query by splitting it into two parts, which also makes debugging easier. First, determine the customer IDs for the orders you're interested in, and then determine what company names belong to these customer IDs with a join:
Dim selectedCustIDs = _
From order In xDoc...<co:Order> _
Where order...<co:ShipRegion>.Value = "OR" _
Select order...<co:CustomerID>.Value
Dim elements = _
From customer _
In xDoc.Descendants(ns + "Customer") _
Join selectedCustId _
In selectedCustIDs _
On customer.@CustomerID _
Equals selectedCustId _
Select customer.<co:CompanyName> _
Distinct
There are a few variations, including using the Contains method on the sequence of selected customer IDs, but the join is faster.
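For comparison, a Contains-based variation might look like the following C# sketch. It is not taken from the download: the sample XML is a hypothetical stand-in for Listing 1, and materializing the IDs with ToList is my own choice so the order query isn't re-run for every customer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ContainsVariationDemo
{
    // Hypothetical stand-in for Listing 1.
    public const string SampleXml =
@"<Root xmlns='http:CustomersAndOrder'>
  <Customer CustomerID='ALPHA'><CompanyName>Alpha Traders</CompanyName></Customer>
  <Customer CustomerID='BETA'><CompanyName>Beta Foods</CompanyName></Customer>
  <Order><CustomerID>ALPHA</CustomerID><ShipInfo><ShipRegion>OR</ShipRegion></ShipInfo></Order>
  <Order><CustomerID>ALPHA</CustomerID><ShipInfo><ShipRegion>OR</ShipRegion></ShipInfo></Order>
  <Order><CustomerID>BETA</CustomerID><ShipInfo><ShipRegion>WA</ShipRegion></ShipInfo></Order>
</Root>";

    static readonly XNamespace ns = "http:CustomersAndOrder";

    public static List<string> CompaniesShippingTo(XDocument xDoc, string state)
    {
        // Step 1: customer IDs on orders shipped to the given state.
        List<string> selectedCustIDs =
            (from order in xDoc.Descendants(ns + "Order")
             where (string)order.Descendants(ns + "ShipRegion").FirstOrDefault() == state
             select (string)order.Element(ns + "CustomerID")).ToList();

        // Step 2: filter customers with Contains instead of a join.
        return (from customer in xDoc.Descendants(ns + "Customer")
                where selectedCustIDs.Contains((string)customer.Attribute("CustomerID"))
                select (string)customer.Element(ns + "CompanyName")).Distinct().ToList();
    }

    public static void Main()
    {
        XDocument xDoc = XDocument.Parse(SampleXml);
        foreach (string name in CompaniesShippingTo(xDoc, "OR"))
        {
            Console.WriteLine(name);
        }
    }
}
```

Contains on a list is a linear search per customer, which is consistent with the join coming out ahead on larger files.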
Execution is delayed, so defining the first query doesn't run it; it runs only when the second query is iterated and pulls the customer IDs it needs for the join. Assuming you iterate only the second query, the work is done once and performance is the same as the earlier combined join. You get increased readability and easier debugging for free.
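One way to see the delayed execution is to count how often the first query's filter runs. The C# sketch below is an illustration only: the counter, the method names, and the inline XML (a hypothetical stand-in for Listing 1) are assumptions, and the first query uses method syntax so the lambda can increment the counter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DeferredDemo
{
    // Hypothetical stand-in for Listing 1.
    public const string SampleXml =
@"<Root xmlns='http:CustomersAndOrder'>
  <Customer CustomerID='ALPHA'><CompanyName>Alpha Traders</CompanyName></Customer>
  <Customer CustomerID='BETA'><CompanyName>Beta Foods</CompanyName></Customer>
  <Order><CustomerID>ALPHA</CustomerID><ShipInfo><ShipRegion>OR</ShipRegion></ShipInfo></Order>
  <Order><CustomerID>ALPHA</CustomerID><ShipInfo><ShipRegion>OR</ShipRegion></ShipInfo></Order>
  <Order><CustomerID>BETA</CustomerID><ShipInfo><ShipRegion>WA</ShipRegion></ShipInfo></Order>
</Root>";

    static readonly XNamespace ns = "http:CustomersAndOrder";

    public static List<string> Run(XDocument xDoc, out int callsBefore, out int callsAfter)
    {
        int filterCalls = 0;

        // First query: defined here, but not executed yet.
        var selectedCustIDs =
            xDoc.Descendants(ns + "Order")
                .Where(order =>
                {
                    filterCalls++; // counts how many orders the filter has examined
                    return (string)order.Descendants(ns + "ShipRegion").FirstOrDefault() == "OR";
                })
                .Select(order => (string)order.Element(ns + "CustomerID"));

        // Second query: joins customers to the (still unexecuted) first query.
        var elements =
            (from customer in xDoc.Descendants(ns + "Customer")
             join selectedCustId in selectedCustIDs
               on (string)customer.Attribute("CustomerID") equals selectedCustId
             select (string)customer.Element(ns + "CompanyName")).Distinct();

        callsBefore = filterCalls;          // nothing has been iterated so far
        List<string> names = elements.ToList();
        callsAfter = filterCalls;           // the join walked the orders once
        return names;
    }

    public static void Main()
    {
        int before, after;
        List<string> names = Run(XDocument.Parse(SampleXml), out before, out after);
        Console.WriteLine(before + " filter calls before iterating, " + after + " after");
        Console.WriteLine(string.Join(", ", names));
    }
}
```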
You can also write this query in C#, where ns is an XNamespace variable holding the document's namespace:
var selectedCustIDs =
from order in
xDoc.Descendants(ns + "Order")
where order.Descendants(
ns + "ShipRegion").First(
).Value == "OR"
select order.Descendants(
ns + "CustomerID").First().Value;
var elements = (
from customer in
xDoc.Descendants(ns + "Customer")
join selectedCustId in selectedCustIDs
on customer.Attribute(
"CustomerID").Value equals
selectedCustId
select customer.Elements(
ns + "CompanyName").First(
)).Distinct();
That's harder to read, but you can clean it up a good bit by writing a few extension methods:
// Extension methods must be declared in a static class.
public static class XElementExtensions
{
public static XElement
FirstDescendant(
this XElement element,
XNamespace ns, string name)
{
return element.Descendants(
ns + name).FirstOrDefault();
}
public static XElement
FirstElement(this XElement element,
XNamespace ns, string name)
{
return element.Elements(
ns + name).FirstOrDefault();
}
}
These extension methods allow you to describe the intent of your selection along the axis in a single method:
var selectedCustIDs =
from order in
xDoc.Descendants(
ns + "Order")
where order.FirstDescendant(
ns, "ShipRegion").Value == "OR"
select order.FirstDescendant(
ns, "CustomerID").Value;
var elements = (
from customer in
xDoc.Descendants(ns + "Customer")
join selectedCustId
in selectedCustIDs
on customer.Attribute(
"CustomerID").Value
equals selectedCustId
select customer.FirstElement(
ns, "CompanyName")).Distinct();
C# and VB both enable you to extract a bit more performance if you cache your XName objects so they're created only once.
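One possible shape for that cache in C# is a static class of readonly fields, sketched below. The class and field names are hypothetical, and the XML is a stand-in for Listing 1. Because XName instances are atomized, a name built later from a string is the same object as the cached one; what the cache saves is the repeated lookup.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical holder for names that queries use repeatedly.
public static class Names
{
    public static readonly XNamespace Co = "http:CustomersAndOrder";
    public static readonly XName Order = Co + "Order";
    public static readonly XName ShipRegion = Co + "ShipRegion";
}

public static class CachedNamesDemo
{
    // Hypothetical stand-in for Listing 1.
    public const string SampleXml =
@"<Root xmlns='http:CustomersAndOrder'>
  <Order><CustomerID>ALPHA</CustomerID><ShipInfo><ShipRegion>OR</ShipRegion></ShipInfo></Order>
  <Order><CustomerID>BETA</CustomerID><ShipInfo><ShipRegion>WA</ShipRegion></ShipInfo></Order>
</Root>";

    public static int CountOrders(XDocument xDoc, string state)
    {
        // The cached XName fields are passed directly; no string concatenation per call.
        return xDoc.Descendants(Names.Order)
                   .Count(o => (string)o.Descendants(Names.ShipRegion).FirstOrDefault() == state);
    }

    public static bool IsSameInstance()
    {
        // Building the name again from expanded notation yields the atomized instance.
        XName fresh = "{http:CustomersAndOrder}Order";
        return object.ReferenceEquals(fresh, Names.Order);
    }

    public static void Main()
    {
        Console.WriteLine(CountOrders(XDocument.Parse(SampleXml), "OR"));
        Console.WriteLine(IsSameInstance());
    }
}
```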
I'm not ready to guarantee that LINQ to XML statements will always be faster than their XPath equivalents, but in my experience, once I get things tuned, LINQ to XML is at least as fast and usually faster.