XML Connections

Thursday, March 27, 2008

Joining XML with RDBMS, how and what are the consequences?

A few months ago, my friend Minollo blogged about XQuery against RDBMS: let the engine optimize your SQL. And last week again, a question on some newsgroup attracted our attention.

Imagine you have an XML document with customers, there might be one or a few hundred, and want to retrieve all orders for these customers out of your Oracle database. One could write something as follows, using a general comparison in the where clause.

for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = doc("customers.xml")/customers/customer
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

As explained in the white paper Implementing XQuery efficiently for relational databases, DataDirect XQuery always handles general comparisons efficiently and translates them into SQL - taking advantage of database indexes and only retrieving the relevant data out of the database.

Most other XQuery implementations don't translate such general comparisons into SQL, because the second argument of the general comparison is a sequence of customers. In other words, the where clause is not evaluated by the database engine, with all the negative performance and scalability consequences that implies. Basically, all records in the ORDERS table are retrieved out of the database, and filtering is accomplished in memory.

As advised on the newsgroup, one can rewrite the query as follows.

for $customer in doc("customers.xml")/customers/customer
for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = $customer
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

Both queries yield the same result. Well, to be technically accurate, there are some differences.

  • the order of the returned orders might be different. With the second query, the orders are grouped per customer
  • in order to make both queries equivalent, with the second alternative, you need to get distinct values out of the customers.xml
But ok, let's assume for a moment both queries yield the same result.

One interesting thing about this second query, compared to the first, is that most XQuery implementations are capable of translating this where clause in SQL as both arguments of the comparison are single values. And you end up with SQL as follows, executed through a nested loop algorithm.

select  O_ORDERKEY, O_CUSTKEY, O_ORDERDATE, O_TOTALPRICE
from ORDERS
where O_CUSTKEY = ?

Further, better implementations will try to minimize the number of roundtrips to the database, and combine multiple requests in a single SQL statement. Smells like a topic for another post, anyway...

Looks great, our initial concerns are answered, the where clause is evaluated by the SQL engine. We're perfectly happy, right?

Unfortunately, there is a significant drawback with the second query! Suppose, for example, we want to order our results by ORDERDATE and TOTALPRICE.

for $customer in doc("customers.xml")/customers/customer
for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = $customer
order by $order/O_ORDERDATE, $order/O_TOTALPRICE
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

Although the where clause is translated into SQL, the order by clause is not. It is simply not possible! Remember that multiple SQL statements will be executed in a nested loop, all these results are buffered in memory, after which we need to order the buffered results. This is clearly a significant drawback of the proposed alternative. Handing order by clauses out of your database engine, implies a serious performance penalty and excessive memory consumption, even with medium sized query results.

Let's go back to our initial query and add the order by clause.

for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = doc("customers.xml")/customers/customer
order by $order/O_ORDERDATE, $order/O_TOTALPRICE
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

If you have an XQuery implementation that is capable of translating such where clauses into SQL, the topic this post started with, it can as a consequence go further and also translate the order by clause into SQL! With DataDirect XQuery, the execution plan for this last query is as follows. Note that both the where and order by clause are pushed into SQL and processed by your Oracle database.

As you see, choosing the right XQuery implementation can make a huge difference. If you want to find out more about how DataDirect XQuery generates SQL and leverages the strengths of your current database investment, check out this white paper.

Labels: , , , ,

Thursday, March 20, 2008

Using XQuery to join two Excel spreadsheets? Why not?


This afternoon I was having some fun (not!) reviewing some third party license agreements, when someone emailed me a couple of Excel spreadsheets saying "Hey Minollo! I have these two spreadsheets; one has a list of all employees since 1990 in Progress Corporation and the cities where they lived when they were hired; the other one has a list of the current employees and in which Progress division they work; I need to create a list of all the current employees with details about the division they work in and the town the live (or lived) in. Is there anything that "XQuery thing" you always talk me about can do for me here?"

Well, the temptation to keep reviewing license agreements rather than playing with XQuery was strong... but I resisted! I had to help this guy! :)

I was happy to notice that the spreadsheets had been saved using the Office Open XML (OOXML) format; OOXML is basically a package of XML files describing the document (text, spreadsheet, slides) that it represents. Being XML-based, it's quite easy to inspect it using an XML query language; the package is a zip format, so individual documents can be retrieved using the standard "jar:" file scheme, for example.

So, coming back to the guy asking for help joining two spreadsheets; what can we do for him? If you look at the structure of the two spreadsheets he sent me (attached to this post), you'll see two XML documents that are relevant to us:

  • xl/worksheets/sheet1.xml
    This file contains the content of each non-empty cell in the spreadsheet. The document is split in elements and each element has a number of cells represented by elements. When the content of a element is a character string, like in this particular case, the element contains a element whose value is the index of the corresponding character string in a list of all the character strings used in the spreadsheet

  • xl/sharedStrings.xml
    This file contains all the character strings used as content of cells in the spreadsheet; the order is relevant, as the character strings are referenced using an index (like described above)

To retrieve the whole content of xl/worksheets/sheet1.xml from c:\doc1.xlsx using a Java-based XQuery processor like DataDirect XQuery, you can just us the standard fn:doc() function:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")

The index corresponding to the character string contained in the first cell of the first row in the spreadsheet can be retrieved doing:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")//
   ooxml:sheetData/ooxml:row[1]/ooxml:c[1]/ooxml:v

...where ooxml is a namespace prefix associated to the URI http://schemas.openxmlformats.org/spreadsheetml/2006/main .

And if you have the index of the character string, the character string itself can be retrieved as:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml")//
   ooxml:si[$theIndex+1]/ooxml:t/text()

Easy, isn't it!?

Add some syntax, a couple of FLWOR expressions, a few HTML tags, and here you go! This XQuery just joins the two spreadsheets and creates an easy to read HTML report:


 declare namespace ooxml=
    "
http://schemas.openxmlformats.org/spreadsheetml/2006/main";
 let $doc1Strings := doc("jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml")//ooxml:si
 let $doc2Strings := doc("jar:file:///c:/doc2.xlsx!/xl/sharedStrings.xml")//ooxml:si
 return
  <table> {
   <tr style="font-weight: bold"><td>Name</td><td>Division</td><td>City</td></tr>,
   for $row1 in
    doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")//
          ooxml:sheetData/ooxml:row,
       $row2 in
    doc("jar:file:///c:/doc2.xlsx!/xl/worksheets/sheet1.xml")//
          ooxml:sheetData/ooxml:row
   let $col1-1 := $doc1Strings[$row1/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text()
   let $col2-1 := $doc2Strings[$row2/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text()
   where $col1-1 = $col2-1
   return
    let $col1-2 := $doc1Strings[$row1/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text()
    let $col2-2 := $doc2Strings[$row2/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text()
    return
     <tr>
      <td>{$col1-1}</td>
      <td>{$col2-2}</td>
      <td>{$col1-2}</td>
     </tr>
  } </table>

It's that simple! Here are the two spreadsheets, if you want to try it yourself: doc1.xlsx, doc2.xlsx

Labels: , , ,

Tuesday, March 11, 2008

A free XQuery workshop close to you!

Many of you have been writing us positive comments about the technical content of this Blog. We are happy to hear you find the information in this pages useful; both Marc and I always try hard posting on topics about which we hear a lot of interest when we talk to XQuery developers and users of our XML products, DataDirect XQuery and XML Converters.

One of the things we have been hearing from you is that you would like us to offer workshops or tutorials on how to use XQuery in real life scenarios; and I'm glad to let you know that we have recently announced a tour of a few cities in the United States and Europe to offer you all a *free* workshop on XQuery! During this hands-on workshop we will be writing, testing and deploying real XML and XQuery-based solutions, live in front of you!

We will start with some typical scenarios where a hypothetical IT organization needs to create a set of data services exposing various functionality. The workshop covers:
  • Transforming XML documents into user-defined formats
  • Aggregating XML documents with relational data
  • Consuming Web services in XQuery
  • Exposing XQuery as data services
  • Integrating with non-XML B2B standards, like EDI
  • Adhering to XML industry standards, such as ACORD for insurance

The workshop is highly interactive, and we really need your help to solve the various problems that we introduce! Again, these events are entirely free and you just need to register in advance to attend; dates, locations and registration details are available here.

We all hope to see you soon in one of the cities we are visiting.

Labels: , ,

Monday, March 10, 2008

XQuery and Web Services: some practical examples


We have been mentioning in several occasions - like here - that using XQuery in the context of Web services is an extremely powerful and flexible approach. The fact that XQuery and Web services share the same underlying data model (at least in the vast majority of practical cases), it makes it extremely appealing to use XQuery to access and merge information returned by one or more Web services, maybe aggregate it with data available in RDBMS and maybe even expose the resulting XQuery itself as a Web service.

As most of us have an engineering background rather than a marketing one, we often find it much easier to show something than to talk about it. That's why we have created a few pages on our website that describe how DataDirect XQuery can be exposed as a Web service without any additional coding, supporting both WSDL/SOAP or REST interfaces; and how you can use DataDirect XQuery to consume other Web services.

We will keep adding Web service-based examples over time; so make sure you visit often!

Tuesday, March 4, 2008

Approval for XQJ 1.0

I blogged last week on submitting XQJ to the JCP for final approval. Yesterday, it successfully passed the Final Approval Ballot in the JCP Executive committee.

XQJ 1.0 will be released in the coming weeks! I'll keep you informed when the download page becomes available, and about DataDirect's plans to officially support XQJ 1.0.

I would like to use this opportunity, and thank everyone in the JSR-225 Expert Group for their efforts on this specification.

Tech Tags:

Labels: