XML Connections

Thursday, March 27, 2008

Joining XML with RDBMS, how and what are the consequences?

A few months ago, my friend Minollo blogged about XQuery against RDBMS: let the engine optimize your SQL. And last week again, a question on some newsgroup attracted our attention.

Imagine you have an XML document with customers, there might be one or a few hundred, and want to retrieve all orders for these customers out of your Oracle database. One could write something as follows, using a general comparison in the where clause.

for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = doc("customers.xml")/customers/customer
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

As explained in the white paper Implementing XQuery efficiently for relational databases, DataDirect XQuery always handles general comparisons efficiently and translates them into SQL - taking advantage of database indexes and only retrieving the relevant data out of the database.

Most other XQuery implementations don't translate such general comparisons into SQL, because the second argument of the general comparison is a sequence of customers. In other words, the where clause is not evaluated by the database engine, with all the negative performance and scalability consequences that implies. Basically, all records in the ORDERS table are retrieved out of the database, and filtering is accomplished in memory.

As advised on the newsgroup, one can rewrite the query as follows.

for $customer in doc("customers.xml")/customers/customer
for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = $customer
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

Both queries yield the same result. Well, to be technically accurate, there are some differences.

  • the order of the returned orders might be different. With the second query, the orders are grouped per customer
  • in order to make both queries equivalent, with the second alternative, you need to get distinct values out of the customers.xml
But ok, let's assume for a moment both queries yield the same result.

One interesting thing about this second query, compared to the first, is that most XQuery implementations are capable of translating this where clause in SQL as both arguments of the comparison are single values. And you end up with SQL as follows, executed through a nested loop algorithm.

select  O_ORDERKEY, O_CUSTKEY, O_ORDERDATE, O_TOTALPRICE
from ORDERS
where O_CUSTKEY = ?

Further, better implementations will try to minimize the number of roundtrips to the database, and combine multiple requests in a single SQL statement. Smells like a topic for another post, anyway...

Looks great, our initial concerns are answered, the where clause is evaluated by the SQL engine. We're perfectly happy, right?

Unfortunately, there is a significant drawback with the second query! Suppose, for example, we want to order our results by ORDERDATE and TOTALPRICE.

for $customer in doc("customers.xml")/customers/customer
for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = $customer
order by $order/O_ORDERDATE, $order/O_TOTALPRICE
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

Although the where clause is translated into SQL, the order by clause is not. It is simply not possible! Remember that multiple SQL statements will be executed in a nested loop, all these results are buffered in memory, after which we need to order the buffered results. This is clearly a significant drawback of the proposed alternative. Handing order by clauses out of your database engine, implies a serious performance penalty and excessive memory consumption, even with medium sized query results.

Let's go back to our initial query and add the order by clause.

for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = doc("customers.xml")/customers/customer
order by $order/O_ORDERDATE, $order/O_TOTALPRICE
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

If you have an XQuery implementation that is capable of translating such where clauses into SQL, the topic this post started with, it can as a consequence go further and also translate the order by clause into SQL! With DataDirect XQuery, the execution plan for this last query is as follows. Note that both the where and order by clause are pushed into SQL and processed by your Oracle database.

As you see, choosing the right XQuery implementation can make a huge difference. If you want to find out more about how DataDirect XQuery generates SQL and leverages the strengths of your current database investment, check out this white paper.

Labels: , , , ,

Tuesday, August 28, 2007

XQuery against RDBMS: let the engine optimize your SQL


We recently noticed some questions on a newsgroup that attracted our attention. The question was something similar to this:

How to create an XQuery which generates SQL like SELECT t1.columnName1 FROM Table t1 WHERE t1.columnName2 IN('as','fa','pr')

My immediate reaction was: why is he worrying about that? You should't need to think about how to write an XQuery to obtain a specific SQL; it is the XQuery processor's goal to digest your XQuery and make the "best" SQL out of it...

The obvious XQuery that comes to my mind would be...

for $ts in fn:collection("Table")/Table
where $t1/columnName2 = ('as','fa','pr')
return $ts/columnName1


Strangely enough, experts on that newsgroup suggested to instead write the query like:

for $t in Table() where $t/columnName2 = 'as' or $t/columnName2 = 'fa' or $t/columnName2 = 'pr'
return $t


...or, assuming a sequence of values on which to filter:

for $v in $values for $t in Table()
where $t/columnName2 = $v
return $t


Why would they suggest such an unnatural way to solve that problem in XQuery? The reality is that I'm thinking in terms of what DataDirect XQuery would do; while they are thinking about different XQuery processors. DataDirect XQuery has been designed to re-write XQuery expressions in SQL without forcing the user to code XQuery in a specific way. In XQuery it's possible to express the same logic in many different ways; but it shouldn't be the XQuery author's responsibility to guess about how the underlying XQuery engine optimizes what he writes; it should be the XQuery engine that is able to take the "right decisions" no matter how the user codes the solution in XQuery (at least in reasonably equivalent scenarios, like the one described above).

For the record, the XQueries described above all end up generating the same execution plan in DataDirect XQuery:
If you are interested in more details about how DataDirect XQuery generates SQL when running XQuery against Relational data sources, a good source of information is: the Generating SQL white paper as Consistent SQL Generation.


Labels: , , , ,

Wednesday, August 15, 2007

Why do I want to use XQuery against relational databases?


A few days ago someone brought to my attention a post by Elliotte Harold about "The State of Native XML Databases", which also mentions DataDirect XQuery:
"DataDirect XQuery is not itself a database. Rather it is an adapter layer that sits on top of your existing payware database such as SQL Server or Oracle and provides an XQuery interface. Why you’d want to use XQuery instead of SQL when talking to a relational database, I’ve never quite been able to fathom. Data Direct XQuery also has adapters for XML files, EDI, and other flat files."

Why would I want to use XQuery when talking to a relational database? I've spent a good portion of my last two years demonstrating how DataDirect XQuery can be used to solve problems that also require access to relational databases; so, let me describe a few usecases I have seen; in all of them, being able to use XQuery rather than Java+SQL makes developer's life much simpler and productive.

I need to query my database to generate XML documents
We have recently published a customer story about this; it's not infrequent that data needs to be transferred/communicated in XML format. So, even when the data is all stored in relational databases, SQL (plus whatever language around it to "massage" the data) is not necessarily the easiest path to generate XML representing portions of the relational data. The customer in question has commented on some quantitative comparison between *how much* easier it is; it's not difficult believe it if you know the technologies in question.

I need to process an incoming XML message and respond including data stored in my database
This is a very frequent usecases; in some cases instead of receiving an XML request the user is dealing even with EDI; but there are tools that make the conversion of EDI to XML quite easy, so the problem ends up really being the same. How would you address that problem without an XQuery processor able to handle relational database access? You would probably write a Java application that parses the incoming XML document, even these days most likely you would materialize it in an XML DOM, fetch the information you needs from it, issue SQL queries over JDBC to get more related information, and finally you would create an XML document to return the requested result(s). We did try implementing this exercise dealing with some ACORD XML requests, and a relational back-end storing the various insurance policy details; we will soon publish more information and code samples on http://www.xquery.com/; in the meanwhile, I'll need to leave it to your imagination how the Java code looks like, and how it compares to the equivalent XQuery+Java code that would achieve the same goal using DataDirect XQuery.

I need to shred the content of my XML document into my database
This is one of the reasons why we have added RDBMS update capabilities in the latest version of DataDirect XQuery; it is another quite popular usecase. Again you are dealing with the necessity to navigate an XML document, find the proper information in there, and then execute operations against a relational database based on that information. Once again the same problem can indeed be solved using Java+SQL+some XML API; but isn't it more productive to use an XML native language featuring the ability to access both XML and relational data in the same context?


To wrap up, I believe the real question I would ask myself is not "Why would I want to use XQuery when talking to a relational database?", but rather "Can I use XQuery to access a relational database achieving the same levels of scalability and performance that I could achieve using a combination of Java and SQL?". That's exactly where the difficulty is, and where XQuery implementations differ dramatically. Of course it would be easy for me to answer "Sure you can! DataDirect XQuery has been designed exactly with that goal in mind!"; and I could point you to a variety of literature on http://www.xquery.com/ (the most interesting one being probably this one by Marc). But instead, I encourage you to try it yourself; get a fully functional trial version from http://www.xquery.com/download, and give it a try, using your own usecase, your own database and your own data set. That's the only way you will actually experience how XQuery can make things easier for you without scalability and performance compromises.


Labels: , , , ,