Friday, May 4, 2012

VB.NET using xpath

XML is perhaps the most common way to store data these days outside of a database. Often, you’ll have the job of searching XML files and strings for particular data. I’ve seen several programmers struggle with this and come up with bad hacks to do the job. XPath, or XML Path Language, provides a way to search XML documents. Unfortunately, it is a bit daunting at first, and it is easy for a novice to get lost in the XPath specs. In this article, I’ll try to simplify it somewhat and give some common examples of how to use it.
The Basic VB.NET Code
While there are several .NET objects related to XPath, we will concentrate on the two easiest ones to deal with in this article. In many cases, they’re all you need. They are the SelectSingleNode and SelectNodes methods of the XmlNode class, the class that many of the XML classes, including XmlDocument, are derived from. SelectSingleNode returns the first matching node in the target document or fragment, or Nothing if there is no match, while SelectNodes returns a collection of all matching nodes.
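The snippets in this article use a variable named ProductXml without showing where it comes from. Here is a minimal sketch of one way to set it up, assuming it is an XmlDocument loaded from a string. The module name and the sample XML are just placeholders for illustration.

```vbnet
Imports System
Imports System.Xml

Module LoadExample
    Sub Main()
        ' ProductXml is assumed to be an XmlDocument, as in the rest of the article.
        Dim ProductXml As New XmlDocument()

        ' LoadXml parses a string; Load(path) would read from a file instead.
        ProductXml.LoadXml("<products><product><idcode>00872624</idcode></product></products>")

        ' SelectSingleNode returns the first match, or Nothing if there is none.
        Dim FirstProduct As XmlNode = ProductXml.SelectSingleNode("/products/product")

        ' SelectNodes always returns a list, which may be empty.
        Dim AllProducts As XmlNodeList = ProductXml.SelectNodes("/products/product")

        Console.WriteLine(FirstProduct.Name)
        Console.WriteLine(AllProducts.Count)
    End Sub
End Module
```

With this sample document the program should print the element name product followed by a count of 1.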
Just Like a Directory Tree
Anyone who does programming should be familiar with a directory tree, the way directories and sub-directories are organized on a disk drive. As you might guess from the name, XPath works the same way within the framework of the XML Document. The main difference is that we use a forward slash, ‘/’ rather than a back slash, ‘\’, to describe the path. For example, if we look at this snippet of an XML document, how would we describe the path to the product ID?
<products>
    <product>
        <idcode>00872624</idcode>
    </product>
</products>
Here’s what our code would look like:
Dim IdCode As String = ProductXml.SelectSingleNode("/products/product/idcode").InnerText
.
.
.
As you can see, the path is exactly as you would expect “/products/product/idcode”. It walks down the tree to our target value.
Note that we’re using InnerText and not Value to get the value of the element. For an element node, Value returns Nothing (the text lives in a child text node), so watch out how you code it.
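To make the InnerText versus Value difference concrete, here is a small sketch. It also checks for Nothing before using the node, since SelectSingleNode returns Nothing when the path matches no node. The module name is made up for the example.

```vbnet
Imports System
Imports System.Xml

Module InnerTextExample
    Sub Main()
        Dim ProductXml As New XmlDocument()
        ProductXml.LoadXml("<products><product><idcode>00872624</idcode></product></products>")

        Dim IdNode As XmlNode = ProductXml.SelectSingleNode("/products/product/idcode")

        ' Guard against a path that matches nothing before reading the node.
        If IdNode IsNot Nothing Then
            ' InnerText gives the text content of the element.
            Console.WriteLine(IdNode.InnerText)

            ' Value is Nothing for an element node, so this prints True.
            Console.WriteLine(IdNode.Value Is Nothing)

            ' The text itself is held by the child text node, whose Value is set.
            Console.WriteLine(IdNode.FirstChild.Value)
        Else
            Console.WriteLine("No idcode element found")
        End If
    End Sub
End Module
```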
Multiple Values
The example above worked OK if we had a single value, but what if our XML fragment looked like this:
<products>
    <product>
        <idcode>00872624</idcode>
        <category>12</category>
    </product>
    <product>
        <idcode>00872845</idcode>
        <category>17</category>
    </product>
    <product>
        <idcode>01871024</idcode>
        <category>12</category>
    </product>
</products>
What would our results look like? How would we work with it? Here’s our code:
Dim ProductIDList As XmlNodeList = ProductXml.SelectNodes("/products/product/idcode")
For Each ProductNode As XmlNode In ProductIDList
    LoadProduct(ProductNode.InnerText)
Next
In this example we load the nodes that we want into the list using the SelectNodes function and then work with the individual nodes.
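Because SelectSingleNode and SelectNodes belong to XmlNode, you can also call them on a node you have already selected. The sketch below is one way to do that: it selects the product elements first and then uses relative paths (no leading slash) to read each product’s children. The module and variable names are only for illustration.

```vbnet
Imports System
Imports System.Xml

Module PerProductExample
    Sub Main()
        Dim Xml As String = "<products>" & _
            "<product><idcode>00872624</idcode><category>12</category></product>" & _
            "<product><idcode>00872845</idcode><category>17</category></product>" & _
            "<product><idcode>01871024</idcode><category>12</category></product>" & _
            "</products>"

        Dim ProductXml As New XmlDocument()
        ProductXml.LoadXml(Xml)

        ' Select each product element, then query it with paths that are
        ' evaluated relative to that product node.
        For Each ProductNode As XmlNode In ProductXml.SelectNodes("/products/product")
            Dim IdCode As String = ProductNode.SelectSingleNode("idcode").InnerText
            Dim Category As String = ProductNode.SelectSingleNode("category").InnerText
            Console.WriteLine(IdCode & " is in category " & Category)
        Next
    End Sub
End Module
```

This is handy when you need more than one value from each product, since the idcode and category stay paired up.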
Filters
OK, it seems pretty easy if we’re dealing with an absolute path but how can we filter the values? Let’s assume that given the fragment above, we only want to get the idcode nodes from products with a category value of 12. How would we describe this XPath and how would we code it?
Filter patterns in XPath, called predicates, are enclosed in square brackets: [condition]. So to restrict our XPath to the desired category it should look like this: “/products/product[category=12]/idcode”. What this tells XPath is to look at the product nodes and only select those that have a category child with a value of 12.
What if we wanted to exclude products with a category of 12 instead? In that case we would use the inequality operator, !=, like so: “/products/product[category!=12]/idcode”. Note that this is the C style operator, not the VB style <> one.
If you wanted to include both category 12 and 15, what would this look like? It would look like this: “/products/product[category=12 or category=15]/idcode”. Note that there are more complex queries that can be done using XSL and union operations but I won’t be covering them in this article.
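Here is a rough sketch that runs the three kinds of filter against the sample fragment. PrintIds is a hypothetical helper I made up for the example, and since the sample data has no category 15, the last query uses 12 or 17 instead.

```vbnet
Imports System
Imports System.Xml

Module FilterExample
    Sub Main()
        Dim Xml As String = "<products>" & _
            "<product><idcode>00872624</idcode><category>12</category></product>" & _
            "<product><idcode>00872845</idcode><category>17</category></product>" & _
            "<product><idcode>01871024</idcode><category>12</category></product>" & _
            "</products>"

        Dim ProductXml As New XmlDocument()
        ProductXml.LoadXml(Xml)

        ' Equality, inequality, and a combined "or" filter.
        PrintIds(ProductXml, "/products/product[category=12]/idcode")
        PrintIds(ProductXml, "/products/product[category!=12]/idcode")
        PrintIds(ProductXml, "/products/product[category=12 or category=17]/idcode")
    End Sub

    ' Hypothetical helper: prints the XPath and the text of every node it matches.
    Sub PrintIds(ByVal Doc As XmlDocument, ByVal Path As String)
        Console.WriteLine(Path)
        For Each Node As XmlNode In Doc.SelectNodes(Path)
            Console.WriteLine("  " & Node.InnerText)
        Next
    End Sub
End Module
```

With this data the first query should list 00872624 and 01871024, the second only 00872845, and the third all three idcodes.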
Wildcards
XPath also has wildcard and navigation operators: the asterisk ‘*’, the single and double slash ‘/’ ‘//’, and the period and double period ‘.’ ‘..’.
The asterisk ‘*’ tells the XPath operation to select all of the elements regardless of name. For example, if you wanted to find a particular product in our example and return all of the child nodes for that product, the XPath would look like this: “/products/product[idcode='01871024']/*”. The quotes make this a string comparison, so the leading zero in the idcode is significant.
The single slash ‘/’ tells XPath to select immediate children, nodes that are one level below the current node (a path that starts with ‘/’ begins at the document root), while the double slash ‘//’ recursively searches through all of the descendant nodes at any depth. Use the single slash when you just need the next level down and the double slash when you need to extract information in depth.
The single period indicates the current node while the double period indicates the parent of the current node. This can be handy when you’re navigating the tree, for example if you had selected a particular node and wanted to read an attribute from its parent node.
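The sketch below tries each of these operators on the sample fragment. It is only an illustration under the same assumptions as before (an XmlDocument named ProductXml), and the module name is made up.

```vbnet
Imports System
Imports System.Xml

Module WildcardExample
    Sub Main()
        Dim Xml As String = "<products>" & _
            "<product><idcode>00872624</idcode><category>12</category></product>" & _
            "<product><idcode>00872845</idcode><category>17</category></product>" & _
            "<product><idcode>01871024</idcode><category>12</category></product>" & _
            "</products>"

        Dim ProductXml As New XmlDocument()
        ProductXml.LoadXml(Xml)

        ' * selects every child element of the matching product, whatever its name.
        For Each Child As XmlNode In ProductXml.SelectNodes("/products/product[idcode='01871024']/*")
            Console.WriteLine(Child.Name & " = " & Child.InnerText)
        Next

        ' // finds idcode elements at any depth in the document.
        Console.WriteLine(ProductXml.SelectNodes("//idcode").Count)

        ' .. moves from the selected node up to its parent.
        Dim IdNode As XmlNode = ProductXml.SelectSingleNode("//idcode")
        Dim ParentNode As XmlNode = IdNode.SelectSingleNode("..")
        Console.WriteLine(ParentNode.Name)

        ' . refers to the node the query started from.
        Console.WriteLine(IdNode.SelectSingleNode(".").Name)
    End Sub
End Module
```

Note that the last two queries are run from IdNode rather than from the document, which is what gives ‘.’ and ‘..’ something to be relative to.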
Attributes
So far, I’ve only dealt with XML that has elements with no attributes. But how does XPath work with attributes as seen in this example?
<products>
    <product source='ABC Corp' >
        <idcode>00872624</idcode>
        <category>12</category>
    </product>
    <product source='XYZ Inc.' >
        <idcode>00872845</idcode>
        <category>17</category>
    </product>
    <product source='ABC Corp' >
        <idcode>01871024</idcode>
        <category>12</category>
    </product>
</products>
The @ symbol identifies part of the path as an attribute rather than an element. Let’s say that we need to get all of the idcodes for products where the source is ABC Corp, what would the XPath look like? It would look like this: “/products/product[@source='ABC Corp']/idcode”. Make sure that you remember the quotes around the attribute value.
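Here is a short sketch of the attribute filter in VB.NET, along with one way to read an attribute’s value. Unlike an element, an attribute node does return its text from Value. The module name is made up for the example.

```vbnet
Imports System
Imports System.Xml

Module AttributeExample
    Sub Main()
        Dim Xml As String = "<products>" & _
            "<product source='ABC Corp'><idcode>00872624</idcode><category>12</category></product>" & _
            "<product source='XYZ Inc.'><idcode>00872845</idcode><category>17</category></product>" & _
            "<product source='ABC Corp'><idcode>01871024</idcode><category>12</category></product>" & _
            "</products>"

        Dim ProductXml As New XmlDocument()
        ProductXml.LoadXml(Xml)

        ' Filter on the source attribute; the value must be quoted inside the XPath.
        For Each IdNode As XmlNode In ProductXml.SelectNodes("/products/product[@source='ABC Corp']/idcode")
            Console.WriteLine(IdNode.InnerText)
        Next

        ' Select the attribute node itself; for attributes, Value holds the text.
        Dim SourceNode As XmlNode = ProductXml.SelectSingleNode("/products/product/@source")
        Console.WriteLine(SourceNode.Value)
    End Sub
End Module
```

With the sample data the loop should print 00872624 and 01871024, and the last line should print ABC Corp, the source of the first product.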
That’s all for this introduction to using XPath. There is a lot more to learn about it, so I’ll probably do some more articles on it at some point. Leave me a comment to let me know what you would like to see more about, or if you have any questions or observations on XPath.
