XML Connections

Tuesday, May 13, 2008

XQuery Update Facility versus XSLT?

Last month we discussed Transforming XML using XQuery updates. With XQuery 1.0 update and transform operations are rather challenging to implement. Or you can use some library to get there, like we explained in Updating XML with XQuery 1.0. In any case, we have to admit, compared to XSLT this is a shortcoming.

With the XQuery Update Facility, things will change drastically. Let's have a closer look at a concrete usage scenario from a recent question on SSDN.

If preceding-sibling type="zMADDRESS", type="zZip" value
should be left unchanged. Else if, preceding-sibling
type="zADDRESS", type="zZip" items should be removed.
And the comma and space which always seem to precede
the zZip in ZAddress or zNeighb must also be removed--
if that is all within a zADDRESS.
For examples:
1. Before:
<tps:c type="zStreet">20 West Row</tps:c>
<tps:c type="zAddress">,</tps:c>
<tps:c type="zNeighb">Canberra City,</tps:c>
<tps:c type="zZip">2600</tps:c>
1. After:
<tps:c type="zStreet">20 West Row</tps:c>
<tps:c type="zAddress">,</tps:c>
<tps:c type="zNeighb">Canberra City</tps:c>
2. Before:
<tps:c type="zStreet">82 Northbourne Ave.</tps:c>
<tps:c type="zAddress">,</tps:c>
<tps:c type="zNeighb">Braddon</tps:c>
<tps:c type="zAddress">,</tps:c>
<tps:c type="zZip">2601</tps:c>
2. After:
<tps:c type="zStreet">82 Northbourne Ave.</tps:c>
<tps:c type="zAddress">,</tps:c>
<tps:c type="zNeighb">Braddon</tps:c>
3. Beofre:
<tps:c type="zMaddress">Box 544, Burra Creek,</tps:c>
<tps:c type="zCity">Queanbeyan,</tps:c>
<tps:c type="zZip">2620</tps:c>
3. After, no change because one its preceding-siblings is "zMaddress".

The proposed XSLT solution is as follows,

<?xml version='1.0'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:tps="http://www.typefi.com/ContentXML">

<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="*">
<xsl:element name="{name()}">
<xsl:for-each select="@*">
<xsl:attribute name="{name()}"><xsl:value-of select="."/></xsl:attribute>
</xsl:for-each>
<xsl:apply-templates select="*|text()"/>
</xsl:element>
</xsl:template>

<xsl:template match="tps:c[@type='zZip']">
<xsl:choose>
<xsl:when test="preceding-sibling::tps:c[@type='zAddress']">
<!-- removed -->
</xsl:when>
<xsl:otherwise><xsl:copy-of select="."/></xsl:otherwise>
</xsl:choose>
</xsl:template>

<xsl:template match="tps:c[@type='zAddress']">
<xsl:choose>
<xsl:when test="text() = ',' and following-sibling::tps:c[1][@type='zZip']">
<!-- removed -->
</xsl:when>
<xsl:otherwise><xsl:copy-of select="."/></xsl:otherwise>
</xsl:choose>
</xsl:template>

</xsl:stylesheet>

What would this look like using XQuery Update Facility?

declare namespace tps = "http://www.typefi.com/ContentXML";

copy $doc := .
modify
(delete node $doc//tps:c[@type='zZip']
[preceding-sibling::tps:c[@type='zAddress']],
delete node $doc//tps:c[@type='zAddress']
[.=(","," ")]
[following-sibling::tps:c[1][@type='zZip']])
return $doc

I don't want to end up in one of those endless XQuery versus XSLT discussions. But beside the fact that XQuery Update Facility adds a nice palette of new functionality to XQuery, I believe it offers concise, well readable solutions.

Tech Tags:

Labels:

Wednesday, April 30, 2008

Transforming XML using XQuery updates

In one of my previous posts I blogged about the new W3C XML Query publications, including XQuery Update Facility.

Talking to people, I feel a majority relate XQuery Update Facility with updating XML documents in some persistent store. But there is actually more, it is also about transforming and enriching (transient) documents. In fact a use case hard to handle with XQuery 1.0.

Note the XQuery Update Facility is today in W3C Candidate Recommendation status. Although not yet supported by DataDirect XQuery, we are following the progression of the specification closely, and hope to communicate about our plans soon, when the specification moves into Proposed Recommendation status.

Anyway, I want to give a feel on what will be possible in the near future.

Let’s consider a concrete example. Not so long ago I posted a series on XQuery generating multiple XML documents, let’s continue with this theme. Assume you have a directory with XML documents, and these need to be enriched with data coming out of your relational database. For example, orders need to be completed with the shipping address.

Let’s enrich all our orders, each a seperate XML document. Using the customer’s name available in the order, we lookup the shipping address in our relational database and enrich subsequently the order with that shipping address.

In the margin, we use the utility function local:get-file-name, which we introduced before.

for $order in fn:collection("file:///C:/inputorders?select=*.xml")
let $filename := concat("file:///C:/outputorders/",
local:get-file-name(document-uri($order)))
let $newOrder :=
<order id="{$order/order/@id}">{
$order/order/customer,
<address>{
fn:collection("CUSTOMERS")/CUSTOMERS[NAME = $order/order/customer]/ADDRESS/text()
}</address>,
$order/order/totalprice,
$order/order/priority,
$order/order/deliverydate,
$order/order/notes,
$order/order/lineitems
}</order>
return
ddtek:serialize-to-url($newOrder, $filename, "")

Basically we read information out of the input document, and recreate the output using the same structure, but extending it with the shipping address.
There are a number of issues with such approach.

  • The more complex the input documents are, the more complex and longer the query will be.
  • The query is specific to the document structure. When the document structure evolves, the query needs to be updated.
  • If you need to handle different document types, either your query becomes more complex or multiple queries are to be used.
And as you can imagine, the more complex your documents are, the more these issues are relevant. The example above is in fact over simplified. Typical XML documents in the industry are much more complex, think for example on some of the ones we use in our tutorial How DataDirect XQuery Helps the Insurance Industry Deal with ACORD Standards.

Using XQuery Update Facility this becomes much simpler. It helps to resolve the raised concerns, resulting in more maintainable queries, and eventually increased productivity.
In the following query we simply enrich all XML documents by adding a corresponding address element after each customer element. Such approach is much more robust, and is ready to handle future revisions of the orders XML Schema.

for $order in fn:collection("file:///C:/inputorders?select=*.xml")
let $filename := concat("file:///C:/outputorders/",
local:get-file-name(document-uri($order)))
return
copy $newOrder := $order
modify
insert node
<address>{
fn:collection("CUSTOMERS")/CUSTOMERS[NAME = $order/order/customer]/ADDRESS/text()}
</address>
after $newOrder/order/customer
return
ddtek:serialize-to-url($newOrder, $filename, "")

Queries like the above one are already fully functional in our development lab. At DataDirect we are excited about all the new functionality the XQuery Update Facility will bring to the XML development community.

Tech Tags:

Labels: ,

Tuesday, April 22, 2008

Stock quotes, CSV, XQuery and Web services

Not long ago I could rely on a variety of free Web services to fetch the semi-live quotes of my preferred stock tickers. Unfortunately, one by one all those Web services have either disappeared or started requiring a subscription... and for writing a few demos now and then, the subscription model doesn't work that well for me.

I started thinking about alternatives; one option could be to fetch an HTML page from one of the many Web sites that expose stock quotes; but then extracting the information I'm looking for, reformatting it, or mashing it up or reshaping it in a different XML structure would be too painful. HTML layouts change frequently, and navigating HTML even after converting it to well formed XML (XHTML) is usually error prone.

When I mentioned that to one of our lead engineers, Tony Lavinio pointed out that Yahoo! exposes a service which returns a comma separated value (CSV) file given a list of tickers and a few options (that you can find described in a few places on the Internet). It's as simple as invoking a URL; for example, try this:

http://finance.yahoo.com/d/quotes.csv?s=PRGS,AAPL,JAVA,MSFT&f=snl1

If Microsoft Excel is installed on your computer, you will most likely get the results back in an Excel spreadsheet. The options at the end control how many and what kind of "columns" you get back ("s" is the symbol, "n" is the company name, "l1" is the latest value). There are more options, as I mentioned; the following URL will return you more information for each stock ticker:

http://finance.yahoo.com/d/quotes.csv?s=PRGS,AAPL,JAVA,MSFT&f=snl1chgopvd1t1jkxerba

Well, OK; that's interesting; but how does it help me to create my own HTML mash-ups, or my own XML reports? The Yahoo! service is very nice, but it returns a CSV format, not XML - which, as you can guess, is what I would like to manipulate. Once again XML Converters come to the rescue: open your Stylus Studio IDE, create a new XQuery and copy and paste this:

doc("converter:CSV?http://finance.yahoo.com/d/quotes.csv?s=PRGS,AAPL,JAVA,MSFT&amp;f=snl1")

This time you'll get an XML document back:

<table>
  <row>
    <column.0>PRGS</column.0>
    <column.1>PROGRESS SOFTWARE</column.1>
    <column.2>30.07</column.2>
  </row>
  <row>
    <column.0>AAPL</column.0>
    <column.1>APPLE INC</column.1>
    <column.2>168.16</column.2>
  </row>
  <row>
    <column.0>JAVA</column.0>
    <column.1>SUN MICROSYSTEMS </column.1>
    <column.2>15.78</column.2>
  </row>
  <row>
    <column.0>MSFT</column.0>
    <column.1>MICROSOFT CP</column.1>
    <column.2>30.42</column.2>
  </row>
</table>

That's something I can use to create my own HTML content or my XML report! I can just change the XQuery a bit to get that XML formatted into an HTML table, for example:

<table cellspacing= "1" cellpadding="4" border="1">{
  for $row in doc("converter:CSV?http://finance.yahoo.com/d/quotes.csv?s=PRGS,AAPL,JAVA,MSFT&amp;f=snl1hg")/table/row
  return
  <tr>
    <td>{ (:s symbol:) $row/column.0/text() }</td>
    <td>{ (:n name:) $row/column.1/text() }</td>
    <td>{ (:l1 last value:) $row/column.2/text() }</td>
    <td>{ (:h day's high:) $row/column.3/text() }</td>
    <td>{ (:g day's low:) $row/column.4/text() }</td>
  </tr>
}</table>

The result will look something like this:

PRGS PROGRESS SOFTWARE 29.58 29.87 29.58
AAPL APPLE INC 167.527 168.00 167.05
JAVA SUN MICROSYSTEMS 15.64 15.70 15.60
MSFT MICROSOFT CP 30.46 30.64 30.43

Now, what if I want to create my own Web service that returns information about one or more stock tickers? If you have read this blog entry about how to expose and consume Web service with DataDirect XQuery, you will know that creating an XQuery-based stock quote Web service is as easy as creating a simple XQuery with one external variable that accepts a list of stock tickers:

getQuotes.xquery:

declare variable $tickers as xs:string external;
<quotes>{
  for $row in doc(concat("converter:CSV?http://finance.yahoo.com/d/quotes.csv?s=", $tickers, "&amp;f=snl1hg"))/table/row
  return
  <stock>
    <symbol>{ $row/column.0/text() }</symbol>
    <name>{ $row/column.1/text() }</name>
    <last-value>{ $row/column.2/text() }</last-value>
    <day-high>{ $row/column.3/text() }</day-high>
    <day-low>{ $row/column.4/text() }</day-low>
  </stock>
}</quotes>


Deploying that XQuery in the DataDirect XQuery Web service framework will expose a new WSDL/SOAP operation, and a new REST service that we can use from any environment that supports Web service invocation. Do you want to try it? Point your browser at:

http://examples.xquery.com/stock-quotes

From there you can retrieve the WSDL corresponding to the XQuery above (getQuotes.xquery), or try the service itself through the Web interface itself! To achieve that I didn't do any more coding than writing the XQuery listed above! As you can see, I also added there a second XQuery (getQuotesEx.xquery) which allows you to specify different options:

getQuotesEx.xquery:

declare variable $tickers as xs:string external;
declare variable $options as xs:string external;
<quotes>{
  for $row in doc(concat("converter:CSV?http://finance.yahoo.com/d/quotes.csv?s=", $tickers, "&amp;f=", $options))/table/row
  return $row
}</quotes>


The XML returned might be formatted in a more pleasant way adding some translation of each option in a reasonable element name returned as result, but I'll leave that exercise to someone else.

Thanks to the power and flexibility of DataDirect XQuery and XML Converters, and thanks to the Yahoo! Finance CSV service, I now have again Web services I can use to retrieve all the information I need about any stock ticker!

Labels: , , , ,

Saturday, April 5, 2008

W3C XML Query Working Group published new documents

Several new W3C XML Query publications have hit the street the last weeks.

First of all the XML Query Working Group published a Candidate Recommendation of XQuery Update Facility 1.0. Together with this Candidate Recommendation, the Working Group published an update of the XQuery Update Facility 1.0 Requirements and XQuery Update Facility 1.0 Use Cases.

Where XQuery 1.0 only allows to "query", the XQuery Update Facility adds "update" capabilities to XQuery. The obvious use case is to update XML documents stored in a database, but XQuery Update Facility is about more!
It also enables transformations and enrichment of both transient and persisted XML documents. Such operations are rather complex with XQuery 1.0. I will talk about some of these uses cases in a future post, and show what XQuery Update Facility adds to the XQuery world when it comes to transformations and data enrichment.

When will DataDirect XQuery support the XQuery Update Facility? If you were hoping to read here about an a concrete date, I have to disappoint you. All I can say right now is that DataDirect follows closely the XQuery Update Facility development process. We will be able to make this more concrete later this year, when the XQuery Update Facility moves into Proposed Recommendation.
In the meantime, don’t forget that DataDirect XQuery offers today the ability to update your database through XQuery, Minollo blogged about a nice use case not so long ago.

I believe a significant limitations of the XQuery Update Facility is that an expressions is either "updating" or "non updating". In other words, within a single query it is not possible to both perform an update and return a query result. Looks like a common use case to me. The XML Query Working Group is trying to answer this through XQuery Scripting Extensions.

A first Working Draft of XQuery Scripting Extension 1.0 made it to the public recently. As this is a first Working Draft, don’t expect things to be written in stone. Everything is open for change and improvements. As with most of the XQuery specifications, this one also is accompanied by a Use Cases document, it provides the usage scenarios that motivate the changes in the XQuery Scripting Extension. I’m a fan of the use cases approach, it’s a good source for learning.

As said above, the ability to both "update" and "query" is required, and this is definitely the most important functionality I see in the XQuery Scripting Extensions. Some of the procedural extensions (assignments, while-loop, etc) are somehow questionable. It has to be seen how these will be used in a declarative language like XQuery. Will such concepts result in confusion for the user? Will productivity of XQuery programming improve through the procedural extensions? The future will tell...

The XML Query Working Group has been busy lately, and also started with XQuery 1.1. Still preliminary, but the Requirements and a first version of the Uses Cases were published mid March.

Tech Tags:

Labels:

Thursday, March 27, 2008

Joining XML with RDBMS, how and what are the consequences?

A few months ago, my friend Minollo blogged about XQuery against RDBMS: let the engine optimize your SQL. And last week again, a question on some newsgroup attracted our attention.

Imagine you have an XML document with customers, there might be one or a few hundred, and want to retrieve all orders for these customers out of your Oracle database. One could write something as follows, using a general comparison in the where clause.

for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = doc("customers.xml")/customers/customer
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

As explained in the white paper Implementing XQuery efficiently for relational databases, DataDirect XQuery always handles general comparisons efficiently and translates them into SQL - taking advantage of database indexes and only retrieving the relevant data out of the database.

Most other XQuery implementations don't translate such general comparisons into SQL, because the second argument of the general comparison is a sequence of customers. In other words, the where clause is not evaluated by the database engine, with all the negative performance and scalability consequences that implies. Basically, all records in the ORDERS table are retrieved out of the database, and filtering is accomplished in memory.

As advised on the newsgroup, one can rewrite the query as follows.

for $customer in doc("customers.xml")/customers/customer
for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = $customer
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

Both queries yield the same result. Well, to be technically accurate, there are some differences.

  • the order of the returned orders might be different. With the second query, the orders are grouped per customer
  • in order to make both queries equivalent, with the second alternative, you need to get distinct values out of the customers.xml
But ok, let's assume for a moment both queries yield the same result.

One interesting thing about this second query, compared to the first, is that most XQuery implementations are capable of translating this where clause in SQL as both arguments of the comparison are single values. And you end up with SQL as follows, executed through a nested loop algorithm.

select  O_ORDERKEY, O_CUSTKEY, O_ORDERDATE, O_TOTALPRICE
from ORDERS
where O_CUSTKEY = ?

Further, better implementations will try to minimize the number of roundtrips to the database, and combine multiple requests in a single SQL statement. Smells like a topic for another post, anyway...

Looks great, our initial concerns are answered, the where clause is evaluated by the SQL engine. We're perfectly happy, right?

Unfortunately, there is a significant drawback with the second query! Suppose, for example, we want to order our results by ORDERDATE and TOTALPRICE.

for $customer in doc("customers.xml")/customers/customer
for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = $customer
order by $order/O_ORDERDATE, $order/O_TOTALPRICE
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

Although the where clause is translated into SQL, the order by clause is not. It is simply not possible! Remember that multiple SQL statements will be executed in a nested loop, all these results are buffered in memory, after which we need to order the buffered results. This is clearly a significant drawback of the proposed alternative. Handing order by clauses out of your database engine, implies a serious performance penalty and excessive memory consumption, even with medium sized query results.

Let's go back to our initial query and add the order by clause.

for $order in collection("ORDERS")/ORDERS
where $order/O_CUSTKEY = doc("customers.xml")/customers/customer
order by $order/O_ORDERDATE, $order/O_TOTALPRICE
return
<order id="{$order/O_ORDERKEY}">
<customer>{$order/O_CUSTKEY/data(.)}</customer>
<date>{$order/O_ORDERDATE/data(.)}</date>
<price>{$order/O_TOTALPRICE/data(.)}</price>
</order>

If you have an XQuery implementation that is capable of translating such where clauses into SQL, the topic this post started with, it can as a consequence go further and also translate the order by clause into SQL! With DataDirect XQuery, the execution plan for this last query is as follows. Note that both the where and order by clause are pushed into SQL and processed by your Oracle database.

As you see, choosing the right XQuery implementation can make a huge difference. If you want to find out more about how DataDirect XQuery generates SQL and leverages the strengths of your current database investment, check out this white paper.

Labels: , , , ,

Thursday, March 20, 2008

Using XQuery to join two Excel spreadsheets? Why not?


This afternoon I was having some fun (not!) reviewing some third party license agreements, when someone emailed me a couple of Excel spreadsheets saying "Hey Minollo! I have these two spreadsheets; one has a list of all employees since 1990 in Progress Corporation and the cities where they lived when they were hired; the other one has a list of the current employees and in which Progress division they work; I need to create a list of all the current employees with details about the division they work in and the town the live (or lived) in. Is there anything that "XQuery thing" you always talk me about can do for me here?"

Well, the temptation to keep reviewing license agreements rather than playing with XQuery was strong... but I resisted! I had to help this guy! :)

I was happy to notice that the spreadsheets had been saved using the Office Open XML (OOXML) format; OOXML is basically a package of XML files describing the document (text, spreadsheet, slides) that it represents. Being XML-based, it's quite easy to inspect it using an XML query language; the package is a zip format, so individual documents can be retrieved using the standard "jar:" file scheme, for example.

So, coming back to the guy asking for help joining two spreadsheets; what can we do for him? If you look at the structure of the two spreadsheets he sent me (attached to this post), you'll see two XML documents that are relevant to us:

  • xl/worksheets/sheet1.xml
    This file contains the content of each non-empty cell in the spreadsheet. The document is split in elements and each element has a number of cells represented by elements. When the content of a element is a character string, like in this particular case, the element contains a element whose value is the index of the corresponding character string in a list of all the character strings used in the spreadsheet

  • xl/sharedStrings.xml
    This file contains all the character strings used as content of cells in the spreadsheet; the order is relevant, as the character strings are referenced using an index (like described above)

To retrieve the whole content of xl/worksheets/sheet1.xml from c:\doc1.xlsx using a Java-based XQuery processor like DataDirect XQuery, you can just us the standard fn:doc() function:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")

The index corresponding to the character string contained in the first cell of the first row in the spreadsheet can be retrieved doing:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")//
   ooxml:sheetData/ooxml:row[1]/ooxml:c[1]/ooxml:v

...where ooxml is a namespace prefix associated to the URI http://schemas.openxmlformats.org/spreadsheetml/2006/main .

And if you have the index of the character string, the character string itself can be retrieved as:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml")//
   ooxml:si[$theIndex+1]/ooxml:t/text()

Easy, isn't it!?

Add some syntax, a couple of FLWOR expressions, a few HTML tags, and here you go! This XQuery just joins the two spreadsheets and creates an easy to read HTML report:


 declare namespace ooxml=
    "
http://schemas.openxmlformats.org/spreadsheetml/2006/main";
 let $doc1Strings := doc("jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml")//ooxml:si
 let $doc2Strings := doc("jar:file:///c:/doc2.xlsx!/xl/sharedStrings.xml")//ooxml:si
 return
  <table> {
   <tr style="font-weight: bold"><td>Name</td><td>Division</td><td>City</td></tr>,
   for $row1 in
    doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")//
          ooxml:sheetData/ooxml:row,
       $row2 in
    doc("jar:file:///c:/doc2.xlsx!/xl/worksheets/sheet1.xml")//
          ooxml:sheetData/ooxml:row
   let $col1-1 := $doc1Strings[$row1/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text()
   let $col2-1 := $doc2Strings[$row2/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text()
   where $col1-1 = $col2-1
   return
    let $col1-2 := $doc1Strings[$row1/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text()
    let $col2-2 := $doc2Strings[$row2/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text()
    return
     <tr>
      <td>{$col1-1}</td>
      <td>{$col2-2}</td>
      <td>{$col1-2}</td>
     </tr>
  } </table>

It's that simple! Here are the two spreadsheets, if you want to try it yourself: doc1.xlsx, doc2.xlsx

Labels: , , ,

Tuesday, March 11, 2008

A free XQuery workshop close to you!

Many of you have been writing us positive comments about the technical content of this Blog. We are happy to hear you find the information in this pages useful; both Marc and I always try hard posting on topics about which we hear a lot of interest when we talk to XQuery developers and users of our XML products, DataDirect XQuery and XML Converters.

One of the things we have been hearing from you is that you would like us to offer workshops or tutorials on how to use XQuery in real life scenarios; and I'm glad to let you know that we have recently announced a tour of a few cities in the United States and Europe to offer you all a *free* workshop on XQuery! During this hands-on workshop we will be writing, testing and deploying real XML and XQuery-based solutions, live in front of you!

We will start with some typical scenarios where a hypothetical IT organization needs to create a set of data services exposing various functionality. The workshop covers:
  • Transforming XML documents into user-defined formats
  • Aggregating XML documents with relational data
  • Consuming Web services in XQuery
  • Exposing XQuery as data services
  • Integrating with non-XML B2B standards, like EDI
  • Adhering to XML industry standards, such as ACORD for insurance

The workshop is highly interactive, and we really need your help to solve the various problems that we introduce! Again, these events are entirely free and you just need to register in advance to attend; dates, locations and registration details are available here.

We all hope to see you soon in one of the cities we are visiting.

Labels: , ,

Monday, February 18, 2008

Converting proprietary file formats to specific XML structures


We have talked in the past about how you can use XQuery to deal with EDI message (transform a directory of EDI messages, create EDI messages out of your database, converting tab delimited files into EDI). But sometimes you are not dealing with EDI; it's not infrequent that companies still rely on proprietary file formats to exchange or circulate information.
For example, a few days ago we received a request from a user who needed to convert a file in a proprietary format into a specific XML structure. The sample file he sent us looked something like this:


10       0123456789MR JOHN            SMITH
20       0123456789ACCT1
30       0123456789SALARY       500000
30       0123456789BONUS          1000
20       0123456789ACCT2
30       0123456789OTHER           100
10       1234566790MR DAVID           DOE
20       1234567890ACCT1
30       1234567890SALARY        10000


This isn't EDI, CSV, tab delimited or any of the other standard or semi-standard ways to represent information in text files; it's a proprietary format based on a specific interpretation of the size and position of the fields. This is the description of records and fields that the user shared with us:


one block is made of:
  one record 10 (customer)
  1 to N record 20 per record 10 (accounts)
  1 to N record 30 per record 20 (transactions)

record 10
  record type
on 10 characters
  customer id on 10 characters
  customer lastname on 20 characters
  customer firstname on 30 characters

record 20
  record type
on 10 characters
  customer id on 10 characters
  account id on 10 characters

record 30
  record type
on 10 characters
  customer id on 10 characters
  label on 10 characters
  value on 10 characters

How can we deal with something like this without building our own parser? DataDirect XML Converters and Stylus Studio can definitely help! XML Converters support what we call "custom conversions"; a custom conversion defines how files belonging to the same format family should be converted to XML by XML Converters. As the definition of a custom conversion can be a tedious task, Stylus Studio provides an editor that allows you to create custom conversion through an intuitive graphical interface. Explaining how that would work in this case is probably more complicated than actually doing it... so, here is a short video which describes all the operations you would do in Stylus Studio to convert the file provided by the user into XML.


Thanks to the custom conversion created using Stylus Studio, we are now able to manipulate the proprietary format file as XML; but, as mentioned above, the user needs to convert the file into a specific XML format. Using the example above, the desired XML result should look like this:


<root>
    <record_10>
        <type>10</type>
        <customer_id>0123456789</customer_id>
        <lastname>MR JOHN</lastname>
        <firstname>SMITH</firstname>
    </record_10>
    <record_20>
        <type>20</type>
        <customer_id>0123456789</customer_id>
        <account_id>ACCT1</account_id>
    </record_20>
    <record_30>
        <type>30</type>
        <customer_id>0123456789</customer_id>
        <label>SALARY</label>
        <value>500000</value>
    </record_30>
    <record_30>
        <type>30</type>
        <customer_id>0123456789</customer_id>
        <label>BONUS</label>
        <value>1000</value>
    </record_30>
    <record_20>
        <type>20</type>
        <customer_id>0123456789</customer_id>
        <account_id>ACCT2</account_id>
    </record_20>
    <record_30>
        <type>30</type>
        <customer_id>0123456789</customer_id>
        <label>OTHER</label>
        <value>100</value>
    </record_30>
    <record_10>
        <type>10</type>
        <customer_id>1234566790</customer_id>
        <lastname>MR DAVID</lastname>
        <firstname>DOE</firstname>
    </record_10>
    <record_20>
        <type>20</type>
        <customer_id>1234567890</customer_id>
        <account_id>ACCT1</account_id>
    </record_20>
    <record_30>
        <type>30</type>
        <customer_id>1234567890</customer_id>
        <label>SALARY</label>
        <value>10000</value>
    </record_30></root>

This is where the combination of DataDirect XQuery and XML Converters proves particularly helpful: not only we are able to move from a proprietary file format to well formed XML, but we also have all the power, performance and scalability of XQuery to deal with such XML to transform it (potentially, even augment it!) into the final format that we need. The following XQuery - which implicitly relies on the custom converter described above - does the trick; notice that we created a couple of functions to take care of the positional grouping which is required to generate the desired output; that makes the XQuery much easier to read and more reusable:


declare function local:getRelated20s($item) {
    let $nextItem :=$item/following-sibling::*[local-name()!="record_20"and local-name()!="record_30"][1]
    for $related in $item/following-sibling::*[local-name()="record_20"]
    where if ($nextItem) then $related << $nextItem else true()
    return $related
};
declare function local:getRelated30s($item) {
    let $nextItem := $item/following-sibling::*[local-name()!="record_30"][1]
    for $related in $item/following-sibling::*[local-name()="record_30"]
    where if($nextItem) then $related << $nextItem else true()
    return $related
};
<DataSets>
    {
    for $record_10 in doc('converter:file:///c:/blog1/sample.conv?sampleinput.txt')/root/record_10
    return
    <DataSet>
        <CustomerId>
            {$record_10/customer_id/text()}
        </CustomerId>
        <CustomerName>
            {$record_10/lastname/text()}
        </CustomerName>
        <CustomerFirstName>
            {$record_10/firstname/text()}
        </CustomerFirstName>
        <Accounts>
            {
            for $record_20 in local:getRelated20s($record_10)
            return
            <Account>
                <AccountId>
                    {$record_20/account_id/text()}
                </AccountId>
                <Transactions> {
                    for $record_30 in local:getRelated30s($record_20)
                    return
                        <Transaction>
                            <Label>{$record_30/label/text()}</Label>
                            <Value>{$record_30/value/text()}</Value>
                        </Transaction>
                } </Transactions>
            </Account>
            }
        </Accounts>
    </DataSet>
    }
</DataSets>

Once again, thanks to the fact that DataDirect XQuery and XML Converters make it possible to manipulate a wide variety of data sources (even proprietary ones!), you are able to deal with transformation (and possibly aggregation) problems from an XML point of view, shielding as much as possible those tasks from the low level details of the data sources.

Labels: , , , ,

Wednesday, February 6, 2008

XQuery creates HTML including images

This is the fourth post in the series XQuery generating multiple documents. Today we show how to create an HTML document referencing JPEG files through XQuery. The DataDirect XQuery built-in function ddtek:serialize-to-url is used to save the JPEG images.

The idea is that all the data, including the images, are stored in a relational database. The queries in this post have been tested with DataDirect XQuery against SQL Server, but as we will demonstrate this works perfectly against any other database supported by DataDirect XQuery - Oracle, DB2, Sybase, MySQL, etc.

XQuery is known to query and produce XML, but how can it be used to produce images, JPEG files in this particular example? This is another example where the DataDirect XML Converters come to rescue.

The JPEG images are stored in our database, and are retrieved by DataDirect XQuery as xs:hexbinary data. Subsequently the DataDirect XML Converters are used to convert the xs:hexbinary into a binary stream, which is saved as a JPEG file.

The following query shows how to generate the main HTML page which references the JPEG files. The idea here is to save the HTML document in the same directory as the JPEG files.
The table in our SQL Server database is fairly simple, having 3 columns with the name of the image, a small description and the actual picture.

<html>{
"My favourite pictures.",<p/>,
(: loop over all images in the database:)
for $image in collection("images")/images
(: the file name of the JPEG file :)
let $filename := concat($image/name, ".jpg")
(: create some virtual XML, which will be
processed by the XML Converters :)
let $jpeg := <jpeg>{$image/image/text()}</jpeg>
return
(<a href="./{$filename}">{
$image/description,
(: create the JPEG file :)
ddtek:serialize-to-url($jpeg,
concat("C:/test", $filename),
"method=binary")
}</a>,
<br/>)
}</html>

In this example we've used JPEG files, but you can apply the outlined technique with any other binary format.

Tech Tags:

Labels:

Tuesday, January 29, 2008

Converting tab delimited files into EDI


A few days ago I received a email saying "I am a DataDirect XQuery user, and I'm successfully using your product to create XML-based reports from incoming EDI messages. Now I'm in need to do something somewhat different: it may happen that my company receives tab-delimited messages, and those messages need to be translated to EDI X12 831, rather than converted to XML. Can your product help me handling tab-delimited to EDI conversions?".

Not being a marketing person, I usually like showing how things are done, rather than just claiming that they can be easily done using our products... :)
So, I asked the user for an example of his data, a tab-delimited file. If you have read some of the other posts in this Blog, you will know that DataDirect XQuery can access tab-delimited files thanks to the XML Converters technology. So, consider this tab-delimited file:

id description quantity unitprice
001 Telephone 1 23.00
002 Desk 1 129.00
003 Keyboard 2 21.00

The following XQuery will return, for example, all the description fields in the tab-delimited input message:

 <items> {
  for $item in doc("converter:TAB:first=yes?file:///c:/sample.tab")//row
  return $item/description
 } </items>

The result is:

 <items>
  <description>Telephone</description>
  <description>Desk</description>
  <description>Keyboard</description>
 </items>

That shows that we can handle a tab-delimited file in DataDirect XQuery and generate XML; but it doesn't help much our user: he needs to output EDI.
But we know how to do that; Marc has shown us in this post how you can create EDI messages from XQuery. We just need to apply the same idea to our example; the following XQuery uses the incoming tab-delimited file to apparently generate an XML fragment consistent with a specific vocabulary; that vocabulary is what the underlying XML Converters technology knows how to translate in the corresponding EDI message (an X12 832 message in this case, which is a price/sales catalog). The serialization option at the beginning of the XQuery is instructing DataDirect XQuery to output a raw EDI message rather than XML; that way we are moving data from a tab-delimited file format into an EDI message manipulating it as if it was all XML, but without ever materializing it as XML! Many of the EDI fields typically specified are omitted here; but this example will run if you want to see how it works.

 declare option ddtek:serialize "method=EDI,long=yes";
 <X12>
  <ISA>
   <ISA06-InterchangeSenderId>1515151515</ISA06-InterchangeSenderId>
   <ISA08-InterchangeReceiverId>5151515151</ISA08-InterchangeReceiverId>
   <ISA11-RepetitionSeparator>^</ISA11-RepetitionSeparator>
   <ISA13-InterchangeControlNumber>000032123</ISA13-InterchangeControlNumber>
   <ISA14-AcknowledgmentRequested>0</ISA14-AcknowledgmentRequested>
   <ISA15-UsageIndicator>P</ISA15-UsageIndicator>
   <ISA16-ComponentElementSeparator>*</ISA16-ComponentElementSeparator>
  </ISA>
  <GS>
   <GS01-FunctionalIdentifierCode>CT</GS01-FunctionalIdentifierCode>
   <GS02-ApplicationSendersCode>9988776655</GS02-ApplicationSendersCode>
   <GS03-ApplicationReceiversCode>1122334455</GS03-ApplicationReceiversCode>
   <GS06-GroupControlNumber>128</GS06-GroupControlNumber>
   <GS07-ResponsibleAgencyCode>X</GS07-ResponsibleAgencyCode>
   <GS08-VersionReleaseIndustry>004030</GS08-VersionReleaseIndustry>
  </GS>
  <TS_832>
   <ST>
    <ST01-TransactionSetIdentifierCode>832</ST01-TransactionSetIdentifierCode>
    <ST02-TransactionSetControlNumber>12345</ST02-TransactionSetControlNumber>
   </ST>
   <BCT>
    <BCT01-CatalogPurposeCode>CP</BCT01-CatalogPurposeCode>
    <BCT02-CatalogNumber>GOV56789</BCT02-CatalogNumber>
    <BCT10-TransactionSetPurposeCode>00</BCT10-TransactionSetPurposeCode>
   </BCT>
   {
    for $row in doc('converter:TAB:first=yes?file:///c:/sample.tab')/table/row
    return
     <GROUP_5>
      <LIN>
       <LIN02-ProductServiceIdQualifier>MF</LIN02-ProductServiceIdQualifier>
       <LIN03-ProductServiceId>{$row/id/text()}</LIN03-ProductServiceId>
       <LIN07-ProductServiceId>{$row/description/text()}</LIN07-ProductServiceId>
       <LIN09-ProductServiceId>{$row/quantity/text()}</LIN09-ProductServiceId>
      </LIN>
      <GROUP_6>
       <CTP>
        <CTP03-UnitPrice>{$row/unitprice/text()}</CTP03-UnitPrice>
       </CTP>
      </GROUP_6>
     </GROUP_5>
   }
   <SE/>
  </TS_832>
  <GE/>
 </X12>

Curious about the result? Here you go:
ISA+00+ +00+ +ZZ+1515151515 +ZZ+5151515151 +080129+1710+U+00000+000032123+0+P+*'
GS+CT+9988776655+1122334455+20080129+1710+128+X+004030'
ST+832+12345'
BCT+CP+GOV56789++++++++00'
LIN++MF+001++++Telephone++1'
CTP+++23'
LIN++MF+002++++Desk++1'
CTP+++129'
LIN++MF+003++++Keyboard++2'
CTP+++21'
SE+9+12345'
GE+1+128'
IEA+1+000032123'

The beauty of this approach is that you can use it in conjunction with all the other data aggregation capabilities of DataDirect XQuery, like possibly adding data to the outgoing EDI message which is not available in the incoming tab-delimited file but available inside a relational database (like a encoded field that may need to be looked up in a table); and this will all work taking advantage of the performance and scalability features of DataDirect XQuery and XML Converters: the example above will seamlessly stream from tab-delimited to EDI without ever allocating in memory for than a few items per time.

The same technique applies of course to all the data sources supported by DataDirect XQuery and XML Converters; you can find more details at the following links:

Labels: , , , , ,

Thursday, January 17, 2008

Split up an XML document

In the series XQuery generating multiple XML documents we talk today about splitting a large XML document in several smaller ones. In some scenarios such huge documents are simply unmanageable, requiring the need to split them up.

Consider one of the Shakespeare plays marked up in XML, for example Hamlet. We want to split for this play each speech in a separate XML document.

for $speech at $i in
doc("http://www.andrew.cmu.edu/user/akj/shakespeare/hamlet.xml")
/PLAY//SPEECH
let $url := concat("C:/SHAKESPEAR/",
string-join($speech/SPEAKER,"_"),
$i, ".xml")
return ddtek:serialize-to-url($speech, $url,
"omit-xml-declaration=no,indent=yes")

It outputs 1138 files in the C:\SHAKESPEAR directory.

Or you might need to generate an HTML file for each speech, and have a document referencing them all. In the next example, each of the speech HTML files are created through ddtek:serialize-to-url, the query result is the master HTML referencing the others.

<html>{
for $speech at $i in
doc("http://www.andrew.cmu.edu/user/akj/shakespeare/hamlet.xml")
/PLAY//SPEECH
let $url := concat("C:/SHAKESPEAR/",
string-join($speech/SPEAKER,"_"),
$i, ".html")
let $htmlspeech := <html>
<b>{$speech/SPEAKER/text()}</b>:<br/>{
for $line in $speech/LINE
return
($line/text(),<br/>)
}</html>
return
(
ddtek:serialize-to-url($htmlspeech, $url, "method=html"),
<a href="{$url}">{
$speech/SPEAKER/text()
}</a>,
<br/>
)
}</html>

As you see, using the ddtek:serialize-to-url function it becomes very natural to split XML documents in multiple ones. And of course, all this taking advantage of DataDirect XQuery's XML Streaming capabilities, enabling support for huge XML documents.

Tech Tags:

Labels: ,

Tuesday, January 8, 2008

Standard XQuery support for MySQL(R): it's available now!


As you may have noticed if you paid attention to our recent new releases, DataDirect XQuery now fully supports MySQL(R).
That means that finally MySQL users have access to a standard way (XQuery) to get access to their data as XML. To see how all this works, and how MySQL users can take advantage of this opportunity, we have created a few examples on our web site; see how easy it is to extract data from MySQL as XML, how to aggregate it with XML documents or even EDI documents; and how you can even update your MySQL data directly from XQuery.

Labels: , , , ,

Friday, January 4, 2008

Bulk load of XML data into a relational database


Recently I received a question about the following topic:
We are trying to come up with a solution to bulk insert XML data into a flat
structure table in an Oracle database. Our questions are:
1. Is the performance favorable using DataDirect XQuery?
2. Do you have any sample code we can use for a proof of concept?


Those are two good questions that are becoming more and more popular among our users. And DataDirect XQuery has very good answers for them! What about performance? Well, performance always depends on many factors, of course; how is the data structured in the XML document(s), for example; how does it need to be shred into columns, maybe into multiple tables; what kind of "massage" data needs to go through, and so on. But in general, users can rely on the DataDirect XQuery streaming and projection capabilities to boost scalability and on its ability to work efficiently with relational data sources to leverage performance.

What about some example? Consider the following hypothetical scenario: your warehouse receives daily shipments of books; details about the books are available in XML documents structured like this:
 <shipment>
  <book bookid="1" quantity="15">
   <title>Java Message Service</title>
   <isbn>0596000685</isbn>
   <subject>Java Message Service (JMS)</subject>
  </book>
  <book bookid="2" quantity="25">
   <title>Instant UNIX</title>
   <isbn>1874416656</isbn>
   <subject>GNU/Linux</subject>
  </book>
 </shipment>

Real shipment lists are actually made of hundreds of thousands of entries.
The SQL database used by your warehouse has a "shipments" table which is defined like this:
 CREATE TABLE [dbo].[shipments](
   [DATE] [datetime] NOT NULL,
   [ISBN] [nchar](10) NOT NULL,
   [QUANTITY] [decimal](18, 0) NOT NULL,
   [TITLE] [nvarchar](max) NOT NULL
 ) ON [PRIMARY]

You can write an XQuery loading all the data in the XML document into the "shipments" table quite easily, relying on the DataDirect XQuery RDBMS update capabilities:
 declare variable $shipment
     as document-node(element(*, xs:untyped)) external;
 for $book in $shipment/order/book
 return
  ddtek:sql-insert("shipments",
    "DATE", current-dateTime(),
    "ISBN", $book/isbn,
    "TITLE", $book/title,
    "QUANTITY", $book/@quantity)

Of course there may be cases in which the data been inserted in the database needs to be somewhat modified; for example, support the "TITLE" columns requires all capital case letters:
 for $book in $shipment/order/book
 return
  ddtek:sql-insert("shipments",
    "DATE", current-dateTime(),
    "ISBN", $book/isbn,
    "TITLE", upper-case($book/title),
    "QUANTITY", $book/@quantity)

...or maybe something a little bit more interesting: suppose the @quantity attribute in the input XML document is optional, and if it's missing it means quantity is actually 1:
 for $book in $shipment/order/book
 return
  ddtek:sql-insert("shipments",
    "DATE", current-dateTime(),
    "ISBN", $book/isbn,
    "TITLE", upper-case($book/title),
    "QUANTITY",
      if ($book/@quantity) then $book/@quantity else 1)

Or maybe the single XML needs to be shred in multiple tables, for example doing something like this:
 let $now := current-dateTime()
 for $book in $shipment/order/book
 return (
  ddtek:sql-insert("shipments1",
    "DATE", $now,
    "ISBN", $book/isbn,
   &n