XML Connections

Wednesday, April 30, 2008

Transforming XML using XQuery updates

In one of my previous posts I blogged about the new W3C XML Query publications, including XQuery Update Facility.

Talking to people, I feel a majority relate XQuery Update Facility with updating XML documents in some persistent store. But there is actually more, it is also about transforming and enriching (transient) documents. In fact a use case hard to handle with XQuery 1.0.

Note the XQuery Update Facility is today in W3C Candidate Recommendation status. Although not yet supported by DataDirect XQuery, we are following the progression of the specification closely, and hope to communicate about our plans soon, when the specification moves into Proposed Recommendation status.

Anyway, I want to give a feel on what will be possible in the near future.

Let’s consider a concrete example. Not so long ago I posted a series on XQuery generating multiple XML documents, let’s continue with this theme. Assume you have a directory with XML documents, and these need to be enriched with data coming out of your relational database. For example, orders need to be completed with the shipping address.

Let’s enrich all our orders, each a seperate XML document. Using the customer’s name available in the order, we lookup the shipping address in our relational database and enrich subsequently the order with that shipping address.

In the margin, we use the utility function local:get-file-name, which we introduced before.

for $order in fn:collection("file:///C:/inputorders?select=*.xml")
let $filename := concat("file:///C:/outputorders/",
local:get-file-name(document-uri($order)))
let $newOrder :=
<order id="{$order/order/@id}">{
$order/order/customer,
<address>{
fn:collection("CUSTOMERS")/CUSTOMERS[NAME = $order/order/customer]/ADDRESS/text()
}</address>,
$order/order/totalprice,
$order/order/priority,
$order/order/deliverydate,
$order/order/notes,
$order/order/lineitems
}</order>
return
ddtek:serialize-to-url($newOrder, $filename, "")

Basically we read information out of the input document, and recreate the output using the same structure, but extending it with the shipping address.
There are a number of issues with such approach.

  • The more complex the input documents are, the more complex and longer the query will be.
  • The query is specific to the document structure. When the document structure evolves, the query needs to be updated.
  • If you need to handle different document types, either your query becomes more complex or multiple queries are to be used.
And as you can imagine, the more complex your documents are, the more these issues are relevant. The example above is in fact over simplified. Typical XML documents in the industry are much more complex, think for example on some of the ones we use in our tutorial How DataDirect XQuery Helps the Insurance Industry Deal with ACORD Standards.

Using XQuery Update Facility this becomes much simpler. It helps to resolve the raised concerns, resulting in more maintainable queries, and eventually increased productivity.
In the following query we simply enrich all XML documents by adding a corresponding address element after each customer element. Such approach is much more robust, and is ready to handle future revisions of the orders XML Schema.

for $order in fn:collection("file:///C:/inputorders?select=*.xml")
let $filename := concat("file:///C:/outputorders/",
local:get-file-name(document-uri($order)))
return
copy $newOrder := $order
modify
insert node
<address>{
fn:collection("CUSTOMERS")/CUSTOMERS[NAME = $order/order/customer]/ADDRESS/text()}
</address>
after $newOrder/order/customer
return
ddtek:serialize-to-url($newOrder, $filename, "")

Queries like the above one are already fully functional in our development lab. At DataDirect we are excited about all the new functionality the XQuery Update Facility will bring to the XML development community.

Tech Tags:

Labels: ,

Tuesday, April 22, 2008

Stock quotes, CSV, XQuery and Web services

Not long ago I could rely on a variety of free Web services to fetch the semi-live quotes of my preferred stock tickers. Unfortunately, one by one all those Web services have either disappeared or started requiring a subscription... and for writing a few demos now and then, the subscription model doesn't work that well for me.

I started thinking about alternatives; one option could be to fetch an HTML page from one of the many Web sites that expose stock quotes; but then extracting the information I'm looking for, reformatting it, or mashing it up or reshaping it in a different XML structure would be too painful. HTML layouts change frequently, and navigating HTML even after converting it to well formed XML (XHTML) is usually error prone.

When I mentioned that to one of our lead engineers, Tony Lavinio pointed out that Yahoo! exposes a service which returns a comma separated value (CSV) file given a list of tickers and a few options (that you can find described in a few places on the Internet). It's as simple as invoking a URL; for example, try this:

http://finance.yahoo.com/d/quotes.csv?s=PRGS,AAPL,JAVA,MSFT&f=snl1

If Microsoft Excel is installed on your computer, you will most likely get the results back in an Excel spreadsheet. The options at the end control how many and what kind of "columns" you get back ("s" is the symbol, "n" is the company name, "l1" is the latest value). There are more options, as I mentioned; the following URL will return you more information for each stock ticker:

http://finance.yahoo.com/d/quotes.csv?s=PRGS,AAPL,JAVA,MSFT&f=snl1chgopvd1t1jkxerba

Well, OK; that's interesting; but how does it help me to create my own HTML mash-ups, or my own XML reports? The Yahoo! service is very nice, but it returns a CSV format, not XML - which, as you can guess, is what I would like to manipulate. Once again XML Converters come to the rescue: open your Stylus Studio IDE, create a new XQuery and copy and paste this:

doc("converter:CSV?http://finance.yahoo.com/d/quotes.csv?s=PRGS,AAPL,JAVA,MSFT&amp;f=snl1")

This time you'll get an XML document back:

<table>
  <row>
    <column.0>PRGS</column.0>
    <column.1>PROGRESS SOFTWARE</column.1>
    <column.2>30.07</column.2>
  </row>
  <row>
    <column.0>AAPL</column.0>
    <column.1>APPLE INC</column.1>
    <column.2>168.16</column.2>
  </row>
  <row>
    <column.0>JAVA</column.0>
    <column.1>SUN MICROSYSTEMS </column.1>
    <column.2>15.78</column.2>
  </row>
  <row>
    <column.0>MSFT</column.0>
    <column.1>MICROSOFT CP</column.1>
    <column.2>30.42</column.2>
  </row>
</table>

That's something I can use to create my own HTML content or my XML report! I can just change the XQuery a bit to get that XML formatted into an HTML table, for example:

<table cellspacing= "1" cellpadding="4" border="1">{
  for $row in doc("converter:CSV?http://finance.yahoo.com/d/quotes.csv?s=PRGS,AAPL,JAVA,MSFT&amp;f=snl1hg")/table/row
  return
  <tr>
    <td>{ (:s symbol:) $row/column.0/text() }</td>
    <td>{ (:n name:) $row/column.1/text() }</td>
    <td>{ (:l1 last value:) $row/column.2/text() }</td>
    <td>{ (:h day's high:) $row/column.3/text() }</td>
    <td>{ (:g day's low:) $row/column.4/text() }</td>
  </tr>
}</table>

The result will look something like this:

PRGS PROGRESS SOFTWARE 29.58 29.87 29.58
AAPL APPLE INC 167.527 168.00 167.05
JAVA SUN MICROSYSTEMS 15.64 15.70 15.60
MSFT MICROSOFT CP 30.46 30.64 30.43

Now, what if I want to create my own Web service that returns information about one or more stock tickers? If you have read this blog entry about how to expose and consume Web service with DataDirect XQuery, you will know that creating an XQuery-based stock quote Web service is as easy as creating a simple XQuery with one external variable that accepts a list of stock tickers:

getQuotes.xquery:

declare variable $tickers as xs:string external;
<quotes>{
  for $row in doc(concat("converter:CSV?http://finance.yahoo.com/d/quotes.csv?s=", $tickers, "&amp;f=snl1hg"))/table/row
  return
  <stock>
    <symbol>{ $row/column.0/text() }</symbol>
    <name>{ $row/column.1/text() }</name>
    <last-value>{ $row/column.2/text() }</last-value>
    <day-high>{ $row/column.3/text() }</day-high>
    <day-low>{ $row/column.4/text() }</day-low>
  </stock>
}</quotes>


Deploying that XQuery in the DataDirect XQuery Web service framework will expose a new WSDL/SOAP operation, and a new REST service that we can use from any environment that supports Web service invocation. Do you want to try it? Point your browser at:

http://examples.xquery.com/stock-quotes

From there you can retrieve the WSDL corresponding to the XQuery above (getQuotes.xquery), or try the service itself through the Web interface itself! To achieve that I didn't do any more coding than writing the XQuery listed above! As you can see, I also added there a second XQuery (getQuotesEx.xquery) which allows you to specify different options:

getQuotesEx.xquery:

declare variable $tickers as xs:string external;
declare variable $options as xs:string external;
<quotes>{
  for $row in doc(concat("converter:CSV?http://finance.yahoo.com/d/quotes.csv?s=", $tickers, "&amp;f=", $options))/table/row
  return $row
}</quotes>


The XML returned might be formatted in a more pleasant way adding some translation of each option in a reasonable element name returned as result, but I'll leave that exercise to someone else.

Thanks to the power and flexibility of DataDirect XQuery and XML Converters, and thanks to the Yahoo! Finance CSV service, I now have again Web services I can use to retrieve all the information I need about any stock ticker!

Labels: , , , ,

Saturday, April 5, 2008

W3C XML Query Working Group published new documents

Several new W3C XML Query publications have hit the street the last weeks.

First of all the XML Query Working Group published a Candidate Recommendation of XQuery Update Facility 1.0. Together with this Candidate Recommendation, the Working Group published an update of the XQuery Update Facility 1.0 Requirements and XQuery Update Facility 1.0 Use Cases.

Where XQuery 1.0 only allows to "query", the XQuery Update Facility adds "update" capabilities to XQuery. The obvious use case is to update XML documents stored in a database, but XQuery Update Facility is about more!
It also enables transformations and enrichment of both transient and persisted XML documents. Such operations are rather complex with XQuery 1.0. I will talk about some of these uses cases in a future post, and show what XQuery Update Facility adds to the XQuery world when it comes to transformations and data enrichment.

When will DataDirect XQuery support the XQuery Update Facility? If you were hoping to read here about an a concrete date, I have to disappoint you. All I can say right now is that DataDirect follows closely the XQuery Update Facility development process. We will be able to make this more concrete later this year, when the XQuery Update Facility moves into Proposed Recommendation.
In the meantime, don’t forget that DataDirect XQuery offers today the ability to update your database through XQuery, Minollo blogged about a nice use case not so long ago.

I believe a significant limitations of the XQuery Update Facility is that an expressions is either "updating" or "non updating". In other words, within a single query it is not possible to both perform an update and return a query result. Looks like a common use case to me. The XML Query Working Group is trying to answer this through XQuery Scripting Extensions.

A first Working Draft of XQuery Scripting Extension 1.0 made it to the public recently. As this is a first Working Draft, don’t expect things to be written in stone. Everything is open for change and improvements. As with most of the XQuery specifications, this one also is accompanied by a Use Cases document, it provides the usage scenarios that motivate the changes in the XQuery Scripting Extension. I’m a fan of the use cases approach, it’s a good source for learning.

As said above, the ability to both "update" and "query" is required, and this is definitely the most important functionality I see in the XQuery Scripting Extensions. Some of the procedural extensions (assignments, while-loop, etc) are somehow questionable. It has to be seen how these will be used in a declarative language like XQuery. Will such concepts result in confusion for the user? Will productivity of XQuery programming improve through the procedural extensions? The future will tell...

The XML Query Working Group has been busy lately, and also started with XQuery 1.1. Still preliminary, but the Requirements and a first version of the Uses Cases were published mid March.

Tech Tags:

Labels:

Thursday, March 27, 2008

Joining XML with RDBMS, how and what are the consequences?

A few months ago, my friend Minollo blogged about XQuery against RDBMS: let the engine optimize your SQL. And last week again, a question on some newsgroup attracted our attention.

Imagine you have an XML document with customers, there might be one or a few hundred, and want to retrieve all orders for these customers out of your Oracle database. One could write something as follows, using a general comparison in the where clause.

for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = doc("customers.xml")/customers/customer
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

As explained in the white paper Implementing XQuery efficiently for relational databases, DataDirect XQuery always handles general comparisons efficiently and translates them into SQL - taking advantage of database indexes and only retrieving the relevant data out of the database.

Most other XQuery implementations don't translate such general comparisons into SQL, because the second argument of the general comparison is a sequence of customers. In other words, the where clause is not evaluated by the database engine, with all the negative performance and scalability consequences that implies. Basically, all records in the ORDERS table are retrieved out of the database, and filtering is accomplished in memory.

As advised on the newsgroup, one can rewrite the query as follows.

for $customer in doc("customers.xml")/customers/customer
for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = $customer
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

Both queries yield the same result. Well, to be technically accurate, there are some differences.

  • the order of the returned orders might be different. With the second query, the orders are grouped per customer
  • in order to make both queries equivalent, with the second alternative, you need to get distinct values out of the customers.xml
But ok, let's assume for a moment both queries yield the same result.

One interesting thing about this second query, compared to the first, is that most XQuery implementations are capable of translating this where clause in SQL as both arguments of the comparison are single values. And you end up with SQL as follows, executed through a nested loop algorithm.

select  O_ORDERKEY, O_CUSTKEY, O_ORDERDATE, O_TOTALPRICE
from ORDERS
where O_CUSTKEY = ?

Further, better implementations will try to minimize the number of roundtrips to the database, and combine multiple requests in a single SQL statement. Smells like a topic for another post, anyway...

Looks great, our initial concerns are answered, the where clause is evaluated by the SQL engine. We're perfectly happy, right?

Unfortunately, there is a significant drawback with the second query! Suppose, for example, we want to order our results by ORDERDATE and TOTALPRICE.

for $customer in doc("customers.xml")/customers/customer
for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = $customer
order by $order/O_ORDERDATE, $order/O_TOTALPRICE
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

Although the where clause is translated into SQL, the order by clause is not. It is simply not possible! Remember that multiple SQL statements will be executed in a nested loop, all these results are buffered in memory, after which we need to order the buffered results. This is clearly a significant drawback of the proposed alternative. Handing order by clauses out of your database engine, implies a serious performance penalty and excessive memory consumption, even with medium sized query results.

Let's go back to our initial query and add the order by clause.

for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = doc("customers.xml")/customers/customer
order by $order/O_ORDERDATE, $order/O_TOTALPRICE
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

If you have an XQuery implementation that is capable of translating such where clauses into SQL, the topic this post started with, it can as a consequence go further and also translate the order by clause into SQL! With DataDirect XQuery, the execution plan for this last query is as follows. Note that both the where and order by clause are pushed into SQL and processed by your Oracle database.

As you see, choosing the right XQuery implementation can make a huge difference. If you want to find out more about how DataDirect XQuery generates SQL and leverages the strengths of your current database investment, check out this white paper.

Labels: , , , ,

Thursday, March 20, 2008

Using XQuery to join two Excel spreadsheets? Why not?


This afternoon I was having some fun (not!) reviewing some third party license agreements, when someone emailed me a couple of Excel spreadsheets saying "Hey Minollo! I have these two spreadsheets; one has a list of all employees since 1990 in Progress Corporation and the cities where they lived when they were hired; the other one has a list of the current employees and in which Progress division they work; I need to create a list of all the current employees with details about the division they work in and the town the live (or lived) in. Is there anything that "XQuery thing" you always talk me about can do for me here?"

Well, the temptation to keep reviewing license agreements rather than playing with XQuery was strong... but I resisted! I had to help this guy! :)

I was happy to notice that the spreadsheets had been saved using the Office Open XML (OOXML) format; OOXML is basically a package of XML files describing the document (text, spreadsheet, slides) that it represents. Being XML-based, it's quite easy to inspect it using an XML query language; the package is a zip format, so individual documents can be retrieved using the standard "jar:" file scheme, for example.

So, coming back to the guy asking for help joining two spreadsheets; what can we do for him? If you look at the structure of the two spreadsheets he sent me (attached to this post), you'll see two XML documents that are relevant to us:

  • xl/worksheets/sheet1.xml
    This file contains the content of each non-empty cell in the spreadsheet. The document is split in elements and each element has a number of cells represented by elements. When the content of a element is a character string, like in this particular case, the element contains a element whose value is the index of the corresponding character string in a list of all the character strings used in the spreadsheet

  • xl/sharedStrings.xml
    This file contains all the character strings used as content of cells in the spreadsheet; the order is relevant, as the character strings are referenced using an index (like described above)

To retrieve the whole content of xl/worksheets/sheet1.xml from c:\doc1.xlsx using a Java-based XQuery processor like DataDirect XQuery, you can just us the standard fn:doc() function:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")

The index corresponding to the character string contained in the first cell of the first row in the spreadsheet can be retrieved doing:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")//
   ooxml:sheetData/ooxml:row[1]/ooxml:c[1]/ooxml:v

...where ooxml is a namespace prefix associated to the URI http://schemas.openxmlformats.org/spreadsheetml/2006/main .

And if you have the index of the character string, the character string itself can be retrieved as:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml")//
   ooxml:si[$theIndex+1]/ooxml:t/text()

Easy, isn't it!?

Add some syntax, a couple of FLWOR expressions, a few HTML tags, and here you go! This XQuery just joins the two spreadsheets and creates an easy to read HTML report:


 declare namespace ooxml=
    "
http://schemas.openxmlformats.org/spreadsheetml/2006/main";
 let $doc1Strings := doc("jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml")//ooxml:si
 let $doc2Strings := doc("jar:file:///c:/doc2.xlsx!/xl/sharedStrings.xml")//ooxml:si
 return
  <table> {
   <tr style="font-weight: bold"><td>Name</td><td>Division</td><td>City</td></tr>,
   for $row1 in
    doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")//
          ooxml:sheetData/ooxml:row,
       $row2 in
    doc("jar:file:///c:/doc2.xlsx!/xl/worksheets/sheet1.xml")//
          ooxml:sheetData/ooxml:row
   let $col1-1 := $doc1Strings[$row1/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text()
   let $col2-1 := $doc2Strings[$row2/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text()
   where $col1-1 = $col2-1
   return
    let $col1-2 := $doc1Strings[$row1/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text()
    let $col2-2 := $doc2Strings[$row2/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text()
    return
     <tr>
      <td>{$col1-1}</td>
      <td>{$col2-2}</td>
      <td>{$col1-2}</td>
     </tr>
  } </table>

It's that simple! Here are the two spreadsheets, if you want to try it yourself: doc1.xlsx, doc2.xlsx

Labels: , , ,

Tuesday, March 11, 2008

A free XQuery workshop close to you!

Many of you have been writing us positive comments about the technical content of this Blog. We are happy to hear you find the information in this pages useful; both Marc and I always try hard posting on topics about which we hear a lot of interest when we talk to XQuery developers and users of our XML products, DataDirect XQuery and XML Converters.

One of the things we have been hearing from you is that you would like us to offer workshops or tutorials on how to use XQuery in real life scenarios; and I'm glad to let you know that we have recently announced a tour of a few cities in the United States and Europe to offer you all a *free* workshop on XQuery! During this hands-on workshop we will be writing, testing and deploying real XML and XQuery-based solutions, live in front of you!

We will start with some typical scenarios where a hypothetical IT organization needs to create a set of data services exposing various functionality. The workshop covers:
  • Transforming XML documents into user-defined formats
  • Aggregating XML documents with relational data
  • Consuming Web services in XQuery
  • Exposing XQuery as data services
  • Integrating with non-XML B2B standards, like EDI
  • Adhering to XML industry standards, such as ACORD for insurance

The workshop is highly interactive, and we really need your help to solve the various problems that we introduce! Again, these events are entirely free and you just need to register in advance to attend; dates, locations and registration details are available here.

We all hope to see you soon in one of the cities we are visiting.

Labels: , ,

Monday, March 10, 2008

XQuery and Web Services: some practical examples


We have been mentioning in several occasions - like here - that using XQuery in the context of Web services is an extremely powerful and flexible approach. The fact that XQuery and Web services share the same underlying data model (at least in the vast majority of practical cases), it makes it extremely appealing to use XQuery to access and merge information returned by one or more Web services, maybe aggregate it with data available in RDBMS and maybe even expose the resulting XQuery itself as a Web service.

As most of us have an engineering background rather than a marketing one, we often find it much easier to show something than to talk about it. That's why we have created a few pages on our website that describe how DataDirect XQuery can be exposed as a Web service without any additional coding, supporting both WSDL/SOAP or REST interfaces; and how you can use DataDirect XQuery to consume other Web services.

We will keep adding Web service-based examples over time; so make sure you visit often!