Thursday, May 31, 2012

XPath in Java

http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

If you send someone out to purchase a gallon of milk, what would you rather tell that person? "Please go buy a gallon of milk." Or, "Exit the house through the front door. Turn left at the sidewalk. Walk three blocks. Turn right. Walk one half block. Turn right and enter the store. Go to aisle four. Walk five meters down the aisle. Turn left. Pick up a gallon jug of milk. Bring it to the checkout counter. Pay for it. Then retrace your steps home." That's ridiculous. Most adults are intelligent enough to procure the milk on their own with little more instruction than "Please go buy a gallon of milk."
Query languages and computer search are similar. It's easier to say, "Find a copy of Cryptonomicon" than it is to write the detailed logic for searching some database. Because search operations have very similar logic, you can invent general languages that allow you to make statements like "Find all the books by Neal Stephenson," and then write an engine that processes those queries against certain data stores.
XPath
Among the many query languages, Structured Query Language (SQL) is a language designed and optimized for querying certain kinds of relational databases. Other less familiar query languages include Object Query Language (OQL) and XQuery. However, the subject of this article is XPath, a query language designed for querying XML documents. For example, a simple XPath query that finds the titles of all the books in a document whose author is Neal Stephenson might look like this:
//book[author="Neal Stephenson"]/title

By contrast, a pure DOM search for that same information would look something like Listing 1:

Listing 1. DOM code to find all the title elements of books by Neal Stephenson
                
ArrayList result = new ArrayList();
NodeList books = doc.getElementsByTagName("book");
for (int i = 0; i < books.getLength(); i++) {
    Element book = (Element) books.item(i);
    NodeList authors = book.getElementsByTagName("author");
    boolean stephenson = false;
    for (int j = 0; j < authors.getLength(); j++) {
        Element author = (Element) authors.item(j);
        NodeList children = author.getChildNodes();
        StringBuffer sb = new StringBuffer();
        for (int k = 0; k < children.getLength(); k++) {
            Node child = children.item(k);
            // really should do this recursively
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        if (sb.toString().equals("Neal Stephenson")) {
            stephenson = true;
            break;
        }
    }

    if (stephenson) {
        NodeList titles = book.getElementsByTagName("title");
        for (int j = 0; j < titles.getLength(); j++) {
            result.add(titles.item(j));
        }
    }

}
        

Believe it or not, the DOM code in Listing 1 still isn't as generic or robust as the simple XPath expression. Which would you rather write, debug, and maintain? I think the answer is obvious.
However, expressive as it is, XPath is not the Java language -- in fact, XPath is not a complete programming language. There are many things you can't say in XPath, even queries you can't make. For example, XPath can't find all the books whose International Standard Book Number (ISBN) check digit doesn't match or all the authors for whom the external accounts database shows a royalty payment is due. Fortunately, it is possible to integrate XPath into Java programs so that you get the best of both worlds: Java for what Java is good for and XPath for what XPath is good for.
Until recently, the exact application program interface (API) by which Java programs made XPath queries varied with the XPath engine. Xalan had one API, Saxon had another, and other engines had other APIs. This meant your code tended to lock you into one product. Ideally, you'd like to be able to experiment with different engines that have different performance characteristics without undue hassle or rewriting of code.
For this reason, Java 5 introduced the javax.xml.xpath package to provide an engine and object-model independent XPath library. This package is also available in Java 1.3 and later if you install Java API for XML Processing (JAXP) 1.3 separately. Among other products, Xalan 2.7 and Saxon 8 include an implementation of this library.
A simple example
I'll begin with a demonstration of how this actually works in practice. Then I'll delve into some of the details. Suppose you want to query a list of books to find those written by Neal Stephenson. In particular, assume the list is in the form shown in Listing 2:

Listing 2. XML document containing book information
                
<inventory>
    <book year="2000">
        <title>Snow Crash</title>
        <author>Neal Stephenson</author>
        <publisher>Spectra</publisher>
        <isbn>0553380958</isbn>
        <price>14.95</price>
    </book>
 
    <book year="2005">
        <title>Burning Tower</title>
        <author>Larry Niven</author>
        <author>Jerry Pournelle</author>
        <publisher>Pocket</publisher>
        <isbn>0743416910</isbn>
        <price>5.99</price>
    </book>
 
    <book year="1995">
        <title>Zodiac</title>
        <author>Neal Stephenson</author>
        <publisher>Spectra</publisher>
        <isbn>0553573862</isbn>
        <price>7.50</price>
    </book>

    <!-- more books... -->
 
</inventory>

Abstract factories

The XPathFactory is an abstract factory. The abstract factory design pattern enables this one API to support different object models such as DOM, JDOM, and XOM. To choose a different model, you pass a Uniform Resource Identifier (URI) identifying the object model to the XPathFactory.newInstance() method. For example, http://xom.nu/ might select XOM. However, in practice, DOM is the only object model this API supports so far.
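As a minimal sketch of that selection mechanism, the fragment below asks for a factory by object-model URI. The class name ObjectModelDemo and the supports() helper are made up for illustration; XPathFactory.DEFAULT_OBJECT_MODEL_URI is the constant the API defines for DOM. Whether any non-DOM URI is accepted depends on which XPath implementations happen to be installed.

```java
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathFactoryConfigurationException;

// Hypothetical helper class, not part of the original article.
class ObjectModelDemo {

    // Returns true if some installed XPathFactory supports the given object model URI.
    public static boolean supports(String objectModelUri) {
        try {
            XPathFactory.newInstance(objectModelUri);
            return true;
        } catch (XPathFactoryConfigurationException e) {
            // No installed implementation claims this object model.
            return false;
        }
    }

    public static void main(String[] args) {
        // The DOM object model is what the no-argument newInstance() uses.
        System.out.println("DOM: " + supports(XPathFactory.DEFAULT_OBJECT_MODEL_URI));
        // The XOM URI from the text; the answer depends on the installed engines.
        System.out.println("XOM: " + supports("http://xom.nu/"));
    }
}
```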
The XPath query that finds all the books is simple enough: //book[author="Neal Stephenson"]. To find the titles of those books, simply add one more step so the expression becomes //book[author="Neal Stephenson"]/title. Finally, what you really want are the text node children of the title element. This requires one more step so the full expression is //book[author="Neal Stephenson"]/title/text().
Now I'll produce a simple program that executes this search from the Java language and then prints out the titles of all the books it finds. First you need to load the document into a DOM Document object. For simplicity, I'll assume the document is in the books.xml file in the current working directory. Here's a simple code fragment that parses the document and constructs the corresponding Document object:

Listing 3. Parsing a document with JAXP
                
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true); // never forget this!
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse("books.xml");

So far, this is just standard JAXP and DOM, nothing really new.
Next you create an XPathFactory:
XPathFactory factory = XPathFactory.newInstance();

You then use this factory to create an XPath object:
XPath xpath = factory.newXPath();

The XPath object compiles the XPath expression:
XPathExpression expr = xpath.compile("//book[author='Neal Stephenson']/title/text()");

Immediate evaluation: If you only use the XPath expression once, you might want to skip the compilation step and call the evaluate() method on the XPath object instead. However, if you reuse the same expression many times, compilation is likely faster.
Finally, you evaluate the XPath expression to get the result. The expression is evaluated with respect to a certain context node, which in this case is the entire document. It's also necessary to specify the return type. Here I ask for a node-set back:
Object result = expr.evaluate(doc, XPathConstants.NODESET);

You can then cast the result to a DOM NodeList and iterate through that to find all the titles:
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println(nodes.item(i).getNodeValue()); 
}

Listing 4 puts this all together into a single program. Notice also that these methods can throw several checked exceptions that I must declare in a throws clause, though I glossed over them above:

Listing 4. A complete program to query an XML document with a fixed XPath expression
                
import java.io.IOException;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import javax.xml.parsers.*;
import javax.xml.xpath.*;

public class XPathExample {

  public static void main(String[] args) 
   throws ParserConfigurationException, SAXException, 
          IOException, XPathExpressionException {

    DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
    domFactory.setNamespaceAware(true); // never forget this!
    DocumentBuilder builder = domFactory.newDocumentBuilder();
    Document doc = builder.parse("books.xml");

    XPathFactory factory = XPathFactory.newInstance();
    XPath xpath = factory.newXPath();
    XPathExpression expr 
     = xpath.compile("//book[author='Neal Stephenson']/title/text()");

    Object result = expr.evaluate(doc, XPathConstants.NODESET);
    NodeList nodes = (NodeList) result;
    for (int i = 0; i < nodes.getLength(); i++) {
        System.out.println(nodes.item(i).getNodeValue()); 
    }

  }

}
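
For a one-off query, the immediate-evaluation shortcut mentioned earlier collapses the compile and evaluate steps into a single call on the XPath object. The sketch below is one way to write it. The class name ImmediateEvaluation and the inline SAMPLE string (a trimmed copy of Listing 2, used so the fragment needs no books.xml file) are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; the sample data is a trimmed copy of Listing 2.
class ImmediateEvaluation {

    static final String SAMPLE =
        "<inventory>"
      + "<book><title>Snow Crash</title><author>Neal Stephenson</author></book>"
      + "<book><title>Burning Tower</title><author>Larry Niven</author></book>"
      + "<book><title>Zodiac</title><author>Neal Stephenson</author></book>"
      + "</inventory>";

    public static List<String> stephensonTitles(String xml) throws Exception {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        Document doc = domFactory.newDocumentBuilder()
                                 .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // No compile() step: the expression string is parsed and evaluated in one call.
        NodeList nodes = (NodeList) xpath.evaluate(
            "//book[author='Neal Stephenson']/title/text()", doc, XPathConstants.NODESET);

        List<String> titles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            titles.add(nodes.item(i).getNodeValue());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(stephensonTitles(SAMPLE)); // prints [Snow Crash, Zodiac]
    }
}
```

The trade-off is the one described above: this form is shorter, but the expression string is parsed again on every call, so a compiled XPathExpression is the better fit for a query that runs repeatedly.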

The XPath data model
Whenever you mix two different languages such as XPath and Java, expect some noticeable seams where you've glued the two together. Not everything fits just right. XPath and the Java language do not have identical type systems. XPath 1.0 has only four basic data types:
  • node-set
  • number
  • boolean
  • string
The Java language, of course, has many more, including user-defined object types.
Most XPath expressions, especially location paths, return node-sets. However, there are other possibilities. For example, the XPath expression count(//book) returns the number of books in the document. The XPath expression count(//book[author="Neal Stephenson"]) > 10 returns a boolean: true if there are more than ten books by Neal Stephenson in the document, false if there are ten or fewer.
The evaluate() method is declared to return Object. What it actually does return depends on the result of the XPath expression, as well as the type you ask for. Generally speaking, an XPath
  • number maps to a java.lang.Double
  • string maps to a java.lang.String
  • boolean maps to a java.lang.Boolean
  • node-set maps to an org.w3c.dom.NodeList

XPath 2

So far I assumed that you're working with XPath 1.0. XPath 2 significantly expands and revises the type system. The main change needed in the Java XPath API to support XPath 2 is additional constants for returning the new XPath 2 types.
When you evaluate an XPath expression in Java, the second argument specifies the return type you want. There are five possibilities, all named constants in the javax.xml.xpath.XPathConstants class:
  • XPathConstants.NODESET
  • XPathConstants.BOOLEAN
  • XPathConstants.NUMBER
  • XPathConstants.STRING
  • XPathConstants.NODE
The last one, XPathConstants.NODE, doesn't actually match an XPath type. You use it when you know the XPath expression will only return a single node or you don't want more than one node. If the XPath expression does return more than one node and you've specified XPathConstants.NODE, then evaluate() returns the first node in document order. If the XPath expression selects an empty set and you've specified XPathConstants.NODE, then evaluate() returns null.
If the requested conversion can't be made, then evaluate() throws an XPathExpressionException, a subclass of XPathException.
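
The mapping between XPath types and Java types can be seen by requesting each constant in turn. This is a small sketch under a few assumptions: the class name ReturnTypeDemo, the query() helper, and the inline SAMPLE data (trimmed from Listing 2) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class; shows which Java type comes back for each requested constant.
class ReturnTypeDemo {

    static final String SAMPLE =
        "<inventory>"
      + "<book><title>Snow Crash</title><author>Neal Stephenson</author></book>"
      + "<book><title>Burning Tower</title><author>Larry Niven</author></book>"
      + "<book><title>Zodiac</title><author>Neal Stephenson</author></book>"
      + "</inventory>";

    // Parses SAMPLE and evaluates one expression with the requested return type.
    public static Object query(String expression, QName returnType) throws Exception {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        Document doc = domFactory.newDocumentBuilder()
                                 .parse(new InputSource(new StringReader(SAMPLE)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc, returnType);
    }

    public static void main(String[] args) throws Exception {
        // XPath number -> java.lang.Double
        Double count = (Double) query("count(//book)", XPathConstants.NUMBER);
        // XPath boolean -> java.lang.Boolean
        Boolean prolific = (Boolean) query(
            "count(//book[author='Neal Stephenson']) > 10", XPathConstants.BOOLEAN);
        // A node-set requested as STRING yields the string-value of its first node.
        String firstTitle = (String) query("//book/title", XPathConstants.STRING);
        // An empty node-set requested as NODE yields null.
        Object missing = query("//magazine", XPathConstants.NODE);

        System.out.println(count);      // prints 3.0
        System.out.println(prolific);   // prints false
        System.out.println(firstTitle); // prints Snow Crash
        System.out.println(missing);    // prints null
    }
}
```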
Namespace contexts
If the elements in the XML document are in a namespace, then the XPath expression for querying that document must use the same namespace. The XPath expression does not need to use the same prefixes, only the same namespace URIs. Indeed, when the XML document uses the default namespace, the XPath expression must use a prefix even though the target document does not.
However, Java programs are not XML documents, so normal namespace resolution does not apply. Instead you provide an object that maps the prefixes to the namespace URIs. This object is an instance of the javax.xml.namespace.NamespaceContext interface. For example, suppose the books document is placed in the http://www.example.com/books namespace, as in Listing 5:

Listing 5. XML document using the default namespace
                
<inventory xmlns="http://www.example.com/books">
    <book year="2000">
        <title>Snow Crash</title>
        <author>Neal Stephenson</author>
        <publisher>Spectra</publisher>
        <isbn>0553380958</isbn>
        <price>14.95</price>
    </book>

    <!-- more books... -->

</inventory>

The XPath expression that finds the titles of all of Neal Stephenson's books now becomes something like //pre:book[pre:author="Neal Stephenson"]/pre:title/text(). However, you have to map the prefix pre to the URI http://www.example.com/books. It's a little silly that the NamespaceContext interface doesn't have a default implementation in the Java software development kit (JDK) or JAXP, but it doesn't. However, it's not hard to implement yourself. Listing 6 demonstrates a simple implementation just for this one namespace. You should map the xml prefix as well.

Listing 6. A simple context for binding a single namespace plus the default
                
import java.util.Iterator;
import javax.xml.*;
import javax.xml.namespace.NamespaceContext;

public class PersonalNamespaceContext implements NamespaceContext {

    public String getNamespaceURI(String prefix) {
        // The NamespaceContext contract specifies IllegalArgumentException for a null prefix.
        if (prefix == null) throw new IllegalArgumentException("Null prefix");
        else if ("pre".equals(prefix)) return "http://www.example.com/books";
        else if ("xml".equals(prefix)) return XMLConstants.XML_NS_URI;
        return XMLConstants.NULL_NS_URI;
    }

    // This method isn't necessary for XPath processing.
    public String getPrefix(String uri) {
        throw new UnsupportedOperationException();
    }

    // This method isn't necessary for XPath processing either.
    public Iterator getPrefixes(String uri) {
        throw new UnsupportedOperationException();
    }

}

It's not hard to use a map to store the bindings and add setter methods that allow for a more reusable namespace context.
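
One possible shape for such a reusable context is sketched below. The class name MapNamespaceContext and its bind() method are hypothetical; only the three NamespaceContext methods come from the API. Unlike Listing 6, this version also implements the two reverse-lookup methods, since the map makes them cheap to support.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;

// Hypothetical reusable alternative to PersonalNamespaceContext.
class MapNamespaceContext implements NamespaceContext {

    // LinkedHashMap keeps insertion order, so getPrefix() is deterministic.
    private final Map<String, String> bindings = new LinkedHashMap<>();

    public MapNamespaceContext() {
        // The xml prefix is always bound to the same URI.
        bindings.put(XMLConstants.XML_NS_PREFIX, XMLConstants.XML_NS_URI);
    }

    // Setter-style method; returns this so that bindings can be chained.
    public MapNamespaceContext bind(String prefix, String uri) {
        if (prefix == null || uri == null) {
            throw new IllegalArgumentException("Null prefix or URI");
        }
        bindings.put(prefix, uri);
        return this;
    }

    public String getNamespaceURI(String prefix) {
        if (prefix == null) throw new IllegalArgumentException("Null prefix");
        String uri = bindings.get(prefix);
        return uri != null ? uri : XMLConstants.NULL_NS_URI;
    }

    public String getPrefix(String uri) {
        Iterator<String> prefixes = getPrefixes(uri);
        return prefixes.hasNext() ? prefixes.next() : null;
    }

    public Iterator<String> getPrefixes(String uri) {
        if (uri == null) throw new IllegalArgumentException("Null URI");
        List<String> prefixes = new ArrayList<>();
        for (Map.Entry<String, String> entry : bindings.entrySet()) {
            if (entry.getValue().equals(uri)) prefixes.add(entry.getKey());
        }
        return prefixes.iterator();
    }
}
```

With this class, binding the prefix from the running example would look like xpath.setNamespaceContext(new MapNamespaceContext().bind("pre", "http://www.example.com/books")).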
After you create a NamespaceContext object, install it on the XPath object before you compile the expression. From that point forward, you can query using those prefixes as before. For example:

Listing 7. XPath query that uses namespaces

  XPathFactory factory = XPathFactory.newInstance();
  XPath xpath = factory.newXPath();
  xpath.setNamespaceContext(new PersonalNamespaceContext());
  XPathExpression expr 
    = xpath.compile("//pre:book[pre:author='Neal Stephenson']/pre:title/text()");

  Object result = expr.evaluate(doc, XPathConstants.NODESET);
  NodeList nodes = (NodeList) result;
  for (int i = 0; i < nodes.getLength(); i++) {
      System.out.println(nodes.item(i).getNodeValue()); 
  }

Function resolvers
On occasion, it's useful to define extension functions in the Java language for use within XPath expressions. These functions perform tasks that are difficult or impossible to perform with pure XPath. However, they should be true functions, not simply arbitrary methods. That is, they should have no side effects. (XPath functions can be evaluated in any order and any number of times.)
Extension functions accessed through the Java XPath API must implement the javax.xml.xpath.XPathFunction interface. This interface declares a single method, evaluate:
public Object evaluate(List args) throws XPathFunctionException

This method should return one of the five types that the Java language can convert to XPath:
  • String
  • Double
  • Boolean
  • NodeList
  • Node
For example, Listing 8 shows an extension function that verifies the checksum in a 10-digit ISBN and returns a Boolean. The basic rule for this checksum is that each of the first nine digits is multiplied by its position (that is, the first digit times one, the second digit times two, and so on). These values are added, and the remainder after division by eleven is taken. That remainder is the check digit, which appears as the tenth character; a remainder of ten is written as X.

Listing 8. An XPath extension function for checking ISBNs
                
import java.util.List;
import javax.xml.xpath.*;
import org.w3c.dom.*;

public class ISBNValidator implements XPathFunction {

  // This class could easily be implemented as a Singleton.
    
  public Object evaluate(List args) throws XPathFunctionException {

    if (args.size() != 1) {
      throw new XPathFunctionException("Wrong number of arguments to valid-isbn()");
    }

    String isbn;
    Object o = args.get(0);

    // perform conversions
    if (o instanceof String) isbn = (String) args.get(0);
    else if (o instanceof Boolean) isbn = o.toString();
    else if (o instanceof Double) isbn = o.toString();
    else if (o instanceof NodeList) {
        NodeList list = (NodeList) o;
        // An empty node-set has no first node, so it cannot hold a valid ISBN.
        if (list.getLength() == 0) return Boolean.FALSE;
        Node node = list.item(0);
        // getTextContent is available in Java 5 and DOM 3.
        // In Java 1.4 and DOM 2, you'd need to recursively 
        // accumulate the content.
        isbn = node.getTextContent();
    }
    else {
        throw new XPathFunctionException("Could not convert argument type");
    }

    char[] data = isbn.toCharArray();
    if (data.length != 10) return Boolean.FALSE;
    int checksum = 0;
    for (int i = 0; i < 9; i++) {
        checksum += (i+1) * (data[i]-'0');
    }
    int checkdigit = checksum % 11;

    if (checkdigit + '0' == data[9] || (data[9] == 'X' && checkdigit == 10)) {
        return Boolean.TRUE;
    }
    return Boolean.FALSE;

  }

}

The next step is to make the extension function available to the Java program. To do this, you install a javax.xml.xpath.XPathFunctionResolver in the XPath object before compiling the expression. The function resolver maps an XPath name and namespace URI for the function to the Java class that implements the function. Listing 9 is a simple function resolver that maps the extension function valid-isbn with the namespace http://www.example.com/books to the class in Listing 8. For example, the XPath expression //book[not(pre:valid-isbn(isbn))] finds all the books whose ISBN checksum doesn't match.

Listing 9. A function context that recognizes the valid-isbn extension function
                
import javax.xml.namespace.QName;
import javax.xml.xpath.*;

public class ISBNFunctionContext implements XPathFunctionResolver {

  private static final QName name 
   = new QName("http://www.example.com/books", "valid-isbn");

  public XPathFunction resolveFunction(QName name, int arity) {
      if (name.equals(ISBNFunctionContext.name) && arity == 1) {
          return new ISBNValidator();
      }
      return null;
  }

}

Because extension functions must be in namespaces, you must use a NamespaceContext when evaluating an expression containing extension functions, even if the document being queried doesn't use namespaces at all. Because XPathFunctionResolver, XPathFunction, and NamespaceContext are interfaces, you can even put them all in the same class, if that's convenient.
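
A sketch of that single-class arrangement follows, together with the setXPathFunctionResolver() call that installs the resolver. Several things here are assumptions rather than part of the original listings: the class name IsbnExtension, the isValidIsbn10() helper, the inline SAMPLE data (whose second ISBN is deliberately wrong), and the jdk.xml.enableExtensionFunctions system property, which is set only because some recent JDKs are reported to disable extension functions by default.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathFunction;
import javax.xml.xpath.XPathFunctionException;
import javax.xml.xpath.XPathFunctionResolver;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class that plays all three roles at once.
class IsbnExtension implements NamespaceContext, XPathFunctionResolver, XPathFunction {

    static final String NS = "http://www.example.com/books";

    // Made-up sample: the second ISBN has a deliberately wrong check digit.
    static final String SAMPLE =
        "<inventory>"
      + "<book><title>Snow Crash</title><isbn>0553380958</isbn></book>"
      + "<book><title>Broken Example</title><isbn>0553380959</isbn></book>"
      + "</inventory>";

    // NamespaceContext: bind only the prefix used for the extension function.
    public String getNamespaceURI(String prefix) {
        if (prefix == null) throw new IllegalArgumentException("Null prefix");
        return "pre".equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
    }
    public String getPrefix(String uri) { throw new UnsupportedOperationException(); }
    public Iterator<String> getPrefixes(String uri) { throw new UnsupportedOperationException(); }

    // XPathFunctionResolver: this object is also the function implementation.
    public XPathFunction resolveFunction(QName name, int arity) {
        boolean match = NS.equals(name.getNamespaceURI())
            && "valid-isbn".equals(name.getLocalPart()) && arity == 1;
        return match ? this : null;
    }

    // XPathFunction: the raw List parameter mirrors the signature used in Listing 8.
    @SuppressWarnings("rawtypes")
    public Object evaluate(List args) throws XPathFunctionException {
        if (args.size() != 1) {
            throw new XPathFunctionException("Wrong number of arguments to valid-isbn()");
        }
        Object arg = args.get(0);
        String isbn;
        if (arg instanceof NodeList) {
            NodeList list = (NodeList) arg;
            isbn = list.getLength() == 0 ? "" : list.item(0).getTextContent();
        } else {
            isbn = String.valueOf(arg);
        }
        return Boolean.valueOf(isValidIsbn10(isbn));
    }

    // Weighted sum of the first nine digits, modulo 11, compared with the tenth character.
    public static boolean isValidIsbn10(String isbn) {
        if (isbn.length() != 10) return false;
        int sum = 0;
        for (int i = 0; i < 9; i++) {
            char c = isbn.charAt(i);
            if (c < '0' || c > '9') return false;
            sum += (i + 1) * (c - '0');
        }
        int check = sum % 11;
        char last = isbn.charAt(9);
        return check == 10 ? last == 'X' : last == (char) ('0' + check);
    }

    public static List<String> titlesWithBadIsbn(String xml) throws Exception {
        // Assumption: harmless where extension functions are already enabled.
        System.setProperty("jdk.xml.enableExtensionFunctions", "true");
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        Document doc = domFactory.newDocumentBuilder()
                                 .parse(new InputSource(new StringReader(xml)));

        IsbnExtension ext = new IsbnExtension();
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(ext);      // resolves the pre prefix
        xpath.setXPathFunctionResolver(ext); // resolves pre:valid-isbn
        NodeList nodes = (NodeList) xpath
            .compile("//book[not(pre:valid-isbn(isbn))]/title/text()")
            .evaluate(doc, XPathConstants.NODESET);

        List<String> titles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) titles.add(nodes.item(i).getNodeValue());
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titlesWithBadIsbn(SAMPLE));
    }
}
```

Note that the sample document uses no namespace at all; the namespace context exists purely so that the pre prefix on the function name can be resolved.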
In conclusion
It is far, far easier to write queries in declarative languages, like SQL and XPath, than in imperative languages, like Java and C. It is far, far easier to write complex logic in Turing-complete languages, like Java and C, than in declarative languages, like SQL and XPath. Fortunately, it's possible to mix the two using APIs such as Java Database Connectivity (JDBC) and javax.xml.xpath. As more and more of the world's data moves to XML, javax.xml.xpath will become as important as java.sql already is.
