XML Connections

Sunday, November 25, 2007

XQJ Part XI - Processing large inputs

Today's post in the XQJ series explains how to handle and query large XML documents through the XQJ API.

Since XML became a standard in the late 1990s, we have been taught that XML is a tree, and the most intuitive (and popular) representation of such a tree has been (and still is!) the Document Object Model (DOM).

When you think about querying XML documents with XQuery, XSLT or XPath, you usually picture a processor that navigates the DOM tree, extracts and compares the values it needs, and creates another DOM as the result of those operations. That is indeed what happens in typical XML processing implementations. Although today's processors use a more efficient representation than DOM, the problem remains the same: scalability.

What happens if the XML you are dealing with does not fit in the memory available to your application? That is the limit that typical "in-memory" XQuery, XSLT and XPath implementations hit. But what if you could forget about DOMs, forget about materializing the whole XML tree in memory, and do XML processing in a purely streaming fashion?
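To make the contrast with a tree concrete, here is a minimal sketch in plain Java using StAX from the JDK (not XQJ, and not how any particular XQuery processor works internally). The class and method names are made up for illustration; the order element and status attribute are borrowed from the examples later in this post. Only the current parser event is held in memory, so the size of the input does not matter.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamingOrderCount {

    // Counts <order status="closed"> elements one parser event at a time.
    // No DOM (or any other tree) is ever built.
    static int countClosedOrders(Reader xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(xml);
            int closed = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "order".equals(reader.getLocalName())
                        && "closed".equals(reader.getAttributeValue(null, "status"))) {
                    closed++;
                }
            }
            reader.close();
            return closed;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("malformed XML input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<orders>"
            + "<order id='1' status='closed'/>"
            + "<order id='2' status='open'/>"
            + "<order id='3' status='closed'/>"
            + "</orders>";
        // A FileReader over a multi-gigabyte file would work the same way.
        System.out.println(countClosedOrders(new StringReader(xml)));
    }
}
```

A streaming XQuery processor gives you the same memory profile without having to hand-write the event loop.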

Using a streaming XQuery processor, like DataDirect XQuery, is a good start. But a chain is only as strong as its weakest link. Besides the streaming capabilities of your XQuery implementation, the API must also be able to handle those large XML fragments.

From an XQuery API perspective, it is crucial that the input to your query can be handled in a streaming fashion. In XQJ Part VIII - Binding external variables we learned how to bind values to external variables declared in a query. By default, a value bound to an XQExpression or XQPreparedExpression through bindXXX() is consumed during the binding process, and the binding stays active and valid for all subsequent execution cycles. We say that XQJ operates in 'immediate binding mode'.

Let's look closely at one of the pipeline examples from the previous post in this series.

...
XQExpression xqe1;
XQSequence xqs1;

xqe1 = xqc.createExpression();
xqs1 = xqe1.executeQuery("doc('orders.xml')//order");

XQExpression xqe2;
xqe2 = xqc.createExpression();
// bind the first result to $orders (QName is javax.xml.namespace.QName)
xqe2.bindSequence(new QName("orders"), xqs1);
xqe1.close();

XQSequence xqs2;
xqs2 = xqe2.executeQuery(
// the trailing * lets $orders hold a sequence of order elements
"declare variable $orders as element(*,xs:untyped)* external; " +
"for $order in $orders " +
"where $order/@status = 'closed' " +
"return " +
" <closed_order id = '{$order/@id}'>{ " +
" $order/* " +
" }</closed_order>");
xqs2.writeSequence(System.out, null);
xqe2.close();
...

During the bindSequence() call, the complete xqs1 sequence is consumed. Subsequently we can safely close the xqe1 expression, freeing up any runtime resources it held. On the other hand, consuming the complete sequence during bindSequence() implies that the XQJ implementation has to buffer the data one way or another for subsequent query evaluations. All this works perfectly fine as long as we're handling relatively small XML instances. But because the data is buffered, the underlying XQuery processor loses any opportunity to take advantage of its streaming capabilities.

If you know that the data bound to the external variable will be used for only a single XQuery execution, is there a way to tell the XQJ/XQuery implementation about this optimization opportunity, so that it can use its streaming capabilities?

The default binding mode in XQJ is 'immediate', which means the value bound to an external variable is consumed during the bindXXX() call. An application can also set the binding mode to 'deferred'. With deferred binding mode, the application gives the XQJ implementation and the underlying XQuery processor a hint that they may take advantage of their streaming capabilities. In deferred binding mode, bindings are only active for a single execution cycle. The application is required to explicitly re-bind values to every external variable before each execution.
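The difference between the two modes can be modelled in a few lines of plain Java. This is a toy sketch, not the XQJ API: ToyExpression and its methods are hypothetical, and throwing an exception on a second execution without a re-bind is an assumption of this model rather than behaviour taken from the specification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BindingModes {

    /** Toy stand-in for an expression with one external variable (not an XQJ class). */
    static final class ToyExpression {
        private final boolean deferred;
        private List<String> buffered;     // immediate mode: copy made at bind time
        private Iterator<String> pending;  // deferred mode: source kept for later

        ToyExpression(boolean deferred) {
            this.deferred = deferred;
        }

        void bind(Iterator<String> items) {
            if (deferred) {
                pending = items;                        // nothing is read yet
            } else {
                buffered = new ArrayList<>();
                items.forEachRemaining(buffered::add);  // consumed right here
            }
        }

        /** "Executes" the expression by counting the bound items. */
        int execute() {
            Iterator<String> input;
            if (deferred) {
                if (pending == null) {
                    throw new IllegalStateException("deferred mode: re-bind before each execution");
                }
                input = pending;
                pending = null;                         // binding is spent after one run
            } else {
                if (buffered == null) {
                    throw new IllegalStateException("nothing bound");
                }
                input = buffered.iterator();            // the buffer can be replayed
            }
            int count = 0;
            while (input.hasNext()) {
                input.next();
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList("<order/>", "<order/>", "<order/>");

        ToyExpression immediate = new ToyExpression(false);
        immediate.bind(orders.iterator());
        System.out.println(immediate.execute());
        System.out.println(immediate.execute());   // same binding, second run

        ToyExpression deferred = new ToyExpression(true);
        deferred.bind(orders.iterator());
        System.out.println(deferred.execute());
        deferred.bind(orders.iterator());          // required before the second run
        System.out.println(deferred.execute());
    }
}
```

The immediate variant pays for a full copy of the input in exchange for replayable bindings; the deferred variant never holds more than the item it is currently looking at.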

You can change the binding mode through the XQStaticContext interface, as shown in the next example. Refer to Part VI in this series for more information on how to manipulate the static context.

...
XQStaticContext xqsc = xqc.getStaticContext();
// change the binding mode
xqsc.setBindingMode(XQConstants.BINDING_MODE_DEFERRED);
// make the changes effective
xqc.setStaticContext(xqsc);
...

In deferred mode the application cannot assume that the bound value will be consumed during the invocation of the bindXXX() method. The XQJ implementation is free to read the bound value either at bind time or during the subsequent evaluation and processing of the query results. This has consequences for when resources can be cleaned up. If we consider the first example again, it will not work properly in deferred binding mode, because xqe1 was closed right after calling bindSequence(). The example needs to be modified as follows:

...
XQExpression xqe1;
XQSequence xqs1;

xqe1 = xqc.createExpression();
xqs1 = xqe1.executeQuery("doc('orders.xml')//order");
XQExpression xqe2 = xqc.createExpression();
// bind the first result to $orders (QName is javax.xml.namespace.QName)
xqe2.bindSequence(new QName("orders"), xqs1);

XQSequence xqs2 = xqe2.executeQuery(
// the trailing * lets $orders hold a sequence of order elements
"declare variable $orders as element(*,xs:untyped)* external; " +
"for $order in $orders " +
"where $order/@status = 'closed' " +
"return " +
" <closed_order id = '{$order/@id}'>{ " +
" $order/* " +
" }</closed_order>");
xqs2.writeSequence(System.out, null);
xqe2.close();
xqe1.close();
...
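The ordering constraint is not specific to XQJ. It can be reproduced with nothing but java.io, as in the following sketch, where a Reader plays the role of xqs1 and the consume() method stands in for a query that reads its input during execution. All names here are illustrative assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class DeferredLifetime {

    // Drains the reader and returns the number of characters read,
    // standing in for a query that consumes its input while executing.
    static int consume(Reader source) throws IOException {
        int count = 0;
        while (source.read() != -1) {
            count++;
        }
        return count;
    }

    // Ordering of the first example: close the source, then execute.
    // Returns true if the late read fails, which is what we expect.
    static boolean closeThenConsumeFails() {
        Reader source = new StringReader("<order/>");
        try {
            source.close();
            consume(source);
            return false;
        } catch (IOException e) {
            return true;    // the source was gone before it was read
        }
    }

    // Ordering of the modified example: execute first, close afterwards.
    static int consumeThenClose() {
        Reader source = new StringReader("<order/>");
        try {
            try {
                return consume(source);
            } finally {
                source.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(closeThenConsumeFails());
        System.out.println(consumeThenClose());
    }
}
```

In immediate mode the first ordering is harmless because everything was copied at bind time; in deferred mode only the second ordering is safe.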

This example shows how to build a pipeline of queries, but deferred binding mode also applies to the other bindXXX() methods. In the next example we show how to bind a StreamSource to the context item. Because the binding mode is deferred, the implementation can handle the query in streaming mode and so process huge XML documents that don't fit in available memory.

...
XQStaticContext xqsc = xqc.getStaticContext();
// change the binding mode
xqsc.setBindingMode(XQConstants.BINDING_MODE_DEFERRED);
// make the changes effective
xqc.setStaticContext(xqsc);

XQExpression xqe;
XQSequence xqs;

xqe = xqc.createExpression();
xqe.bindDocument(
XQConstants.CONTEXT_ITEM,
new StreamSource("large_orders_document.xml"),
// XQJ 1.0 final takes an XQItemType here; null leaves the type to the implementation
null);
xqs = xqe.executeQuery("/orders/order");
...

To conclude, deferred binding mode requires a little more care than immediate mode, but the potential improvements when querying large XML documents are enormous. Of course, the API needs to provide the necessary functionality, but the heavy lifting is performed by the underlying XQuery processor. This is especially true for DataDirect XQuery, where deferred binding mode lets you take advantage of both XML document projection and XML streaming. This allows you to query XML documents in the hundreds of megabytes, even gigabytes!


Wednesday, November 21, 2007

XQJ Proposed Final Draft

The Proposed Final Draft for XQJ, which is being developed under the Java Community Process as JSR 225, is available for download. XQJ is the standard Java API to access XQuery implementations. Eventually XQJ will be for XQuery what JDBC is for SQL on the Java platform.

The Proposed Final Draft includes the following components:

  • Specification (PDF)
  • JavaDoc of the API (HTML)
  • Java sources of the API
  • JAR file of the API (xqjapi.jar)
  • Reference implementation
  • Technology Compatibility Kit

Comments can be sent to jsr-225-comments@jcp.org.

Want to learn XQJ? The ongoing XQJ series is a good starting point. And there are implementations available, including DataDirect XQuery.


Monday, November 19, 2007

Join us at the XML 2007 Conference!


The annual XML Conference is always a great opportunity to hear about XML technologies from the people who work on XML standards and products.

This year's XML 2007 Conference is happening in Boston, starting on the 3rd of December.

We will be there, talking about XML Converters (Tuesday, 2:45pm) and DataDirect XQuery (Wednesday 1pm, 2pm, 3pm and 4pm!).

In addition to that, DataDirect is also hosting a breakfast session, where we are planning to have some fun playing games and giving away prizes (promised! No slides during breakfast!).

Hope to see you all there!


Wednesday, November 7, 2007

Updating XML with XQuery 1.0

Yesterday I was reading an interesting discussion on xquery-talk about replacing a node in in-memory XML.

How can one modify an XML structure through XQuery? In the future, the answer is definitely the XQuery Update Facility. But the XQuery Update Facility is currently still a work in progress, and not yet widely supported. What do we do today?

Ryan Grimm wrote an XQuery library to update an in-memory XML structure. The in-mem-update library looks functionally complete, offering the following functions:

  • node-insert-child
  • node-insert-before
  • node-insert-after
  • node-replace
  • node-delete

How do you use these functions? Let's have a look at a query from the XQuery Update Facility Use Cases, and show an equivalent implementation based on the in-mem-update library.

Consider Q2: "Enter a bid for user Annabel Lee on February 1st, 1999 for 60 dollars on item 1001." The XQuery Update Facility based solution is as follows:

let $uid := 
doc("users.xml")/users/user_tuple[name="Annabel Lee"]/userid
return do
insert
<bid_tuple>
<userid>{data($uid)}</userid>
<itemno>1001</itemno>
<bid>60</bid>
<bid_date>1999-02-01</bid_date>
</bid_tuple>
into doc("bids.xml")/bids

Using the library we end up with something like this:

import module namespace mem = "http://xqdev.com/in-mem-update" at "in-mem-update.xqy";
let $uid :=
doc("users.xml")/users/user_tuple[name="Annabel Lee"]/userid
return
mem:node-insert-child(
doc("bids.xml")/bids,
<bid_tuple>
<userid>{data($uid)}</userid>
<itemno>1001</itemno>
<bid>60</bid>
<bid_date>1999-02-01</bid_date>
</bid_tuple>)

Looks pretty similar, no? There is actually one fundamental difference. With the XQuery Update Facility, the bids.xml document is actually updated. The in-mem-update variant doesn't update the bids.xml document, but rather returns a copy of the original document reflecting the change.

This shows one of the possible issues with the library. Each modification made to an XML structure results in a copy. Making a lot of changes to a single XML structure, or updating a huge XML structure, might affect performance. Still, I believe the library is useful in a lot of common scenarios.
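The copy-on-update behaviour can be illustrated with a small Java sketch on top of the JDK's DOM API. It only mimics the idea behind a function like node-insert-child and is not a port of the in-mem-update library; the class and helper names are made up for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CopyOnUpdate {

    // Returns a deep copy of parent with one new, empty child appended.
    // The original parent is left untouched.
    static Element insertChildCopy(Element parent, String childName) {
        Element copy = (Element) parent.cloneNode(true);  // cost grows with the tree size
        copy.appendChild(parent.getOwnerDocument().createElement(childName));
        return copy;
    }

    // Builds an empty <bids/> element to experiment with.
    static Element newBids() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element bids = doc.createElement("bids");
            doc.appendChild(bids);
            return bids;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element original = newBids();
        Element updated = insertChildCopy(original, "bid_tuple");
        System.out.println(original.getChildNodes().getLength());  // still no children
        System.out.println(updated.getChildNodes().getLength());   // one bid_tuple
    }
}
```

Every call clones the whole subtree, which is why a long series of small updates, or a single update to a huge structure, can become expensive.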

The library is written to be used with MarkLogic Server, and is unfortunately based on an older version of the XQuery specification. This makes it fail out of the box with XQuery 1.0-compliant processors. I updated the XQuery module to make it XQuery 1.0 compatible, and also added support for document nodes. It's available for download here.

So, you can now update all your data with DataDirect XQuery. Using the ddtek:sql-insert, ddtek:sql-update and ddtek:sql-delete functions you can update your relational database. And using the in-mem-update library you can now also make changes to your XML documents.

I believe this library is complementary to the functions modifying XML elements and attributes available in the FunctX XQuery library. Wouldn't it be cool to have these functions added to FunctX? I leave it to Ryan Grimm and Priscilla Walmsley to discuss this in detail.
