XML Connections

Wednesday, April 30, 2008

Transforming XML using XQuery updates

In one of my previous posts I blogged about the new W3C XML Query publications, including XQuery Update Facility.

Talking to people, I feel a majority relate XQuery Update Facility with updating XML documents in some persistent store. But there is actually more, it is also about transforming and enriching (transient) documents. In fact a use case hard to handle with XQuery 1.0.

Note the XQuery Update Facility is today in W3C Candidate Recommendation status. Although not yet supported by DataDirect XQuery, we are following the progression of the specification closely, and hope to communicate about our plans soon, when the specification moves into Proposed Recommendation status.

Anyway, I want to give a feel on what will be possible in the near future.

Let’s consider a concrete example. Not so long ago I posted a series on XQuery generating multiple XML documents, let’s continue with this theme. Assume you have a directory with XML documents, and these need to be enriched with data coming out of your relational database. For example, orders need to be completed with the shipping address.

Let’s enrich all our orders, each a seperate XML document. Using the customer’s name available in the order, we lookup the shipping address in our relational database and enrich subsequently the order with that shipping address.

In the margin, we use the utility function local:get-file-name, which we introduced before.

for $order in fn:collection("file:///C:/inputorders?select=*.xml")
let $filename := concat("file:///C:/outputorders/",
local:get-file-name(document-uri($order)))
let $newOrder :=
<order id="{$order/order/@id}">{
$order/order/customer,
<address>{
fn:collection("CUSTOMERS")/CUSTOMERS[NAME = $order/order/customer]/ADDRESS/text()
}</address>,
$order/order/totalprice,
$order/order/priority,
$order/order/deliverydate,
$order/order/notes,
$order/order/lineitems
}</order>
return
ddtek:serialize-to-url($newOrder, $filename, "")

Basically we read information out of the input document, and recreate the output using the same structure, but extending it with the shipping address.
There are a number of issues with such approach.

  • The more complex the input documents are, the more complex and longer the query will be.
  • The query is specific to the document structure. When the document structure evolves, the query needs to be updated.
  • If you need to handle different document types, either your query becomes more complex or multiple queries are to be used.
And as you can imagine, the more complex your documents are, the more these issues are relevant. The example above is in fact over simplified. Typical XML documents in the industry are much more complex, think for example on some of the ones we use in our tutorial How DataDirect XQuery Helps the Insurance Industry Deal with ACORD Standards.

Using XQuery Update Facility this becomes much simpler. It helps to resolve the raised concerns, resulting in more maintainable queries, and eventually increased productivity.
In the following query we simply enrich all XML documents by adding a corresponding address element after each customer element. Such approach is much more robust, and is ready to handle future revisions of the orders XML Schema.

for $order in fn:collection("file:///C:/inputorders?select=*.xml")
let $filename := concat("file:///C:/outputorders/",
local:get-file-name(document-uri($order)))
return
copy $newOrder := $order
modify
insert node
<address>{
fn:collection("CUSTOMERS")/CUSTOMERS[NAME = $order/order/customer]/ADDRESS/text()}
</address>
after $newOrder/order/customer
return
ddtek:serialize-to-url($newOrder, $filename, "")

Queries like the above one are already fully functional in our development lab. At DataDirect we are excited about all the new functionality the XQuery Update Facility will bring to the XML development community.

Tech Tags:

Labels: ,

Tuesday, January 29, 2008

Converting tab delimited files into EDI


A few days ago I received a email saying "I am a DataDirect XQuery user, and I'm successfully using your product to create XML-based reports from incoming EDI messages. Now I'm in need to do something somewhat different: it may happen that my company receives tab-delimited messages, and those messages need to be translated to EDI X12 831, rather than converted to XML. Can your product help me handling tab-delimited to EDI conversions?".

Not being a marketing person, I usually like showing how things are done, rather than just claiming that they can be easily done using our products... :)
So, I asked the user for an example of his data, a tab-delimited file. If you have read some of the other posts in this Blog, you will know that DataDirect XQuery can access tab-delimited files thanks to the XML Converters technology. So, consider this tab-delimited file:

id description quantity unitprice
001 Telephone 1 23.00
002 Desk 1 129.00
003 Keyboard 2 21.00

The following XQuery will return, for example, all the description fields in the tab-delimited input message:

 <items> {
  for $item in doc("converter:TAB:first=yes?file:///c:/sample.tab")//row
  return $item/description
 } </items>

The result is:

 <items>
  <description>Telephone</description>
  <description>Desk</description>
  <description>Keyboard</description>
 </items>

That shows that we can handle a tab-delimited file in DataDirect XQuery and generate XML; but it doesn't help much our user: he needs to output EDI.
But we know how to do that; Marc has shown us in this post how you can create EDI messages from XQuery. We just need to apply the same idea to our example; the following XQuery uses the incoming tab-delimited file to apparently generate an XML fragment consistent with a specific vocabulary; that vocabulary is what the underlying XML Converters technology knows how to translate in the corresponding EDI message (an X12 832 message in this case, which is a price/sales catalog). The serialization option at the beginning of the XQuery is instructing DataDirect XQuery to output a raw EDI message rather than XML; that way we are moving data from a tab-delimited file format into an EDI message manipulating it as if it was all XML, but without ever materializing it as XML! Many of the EDI fields typically specified are omitted here; but this example will run if you want to see how it works.

 declare option ddtek:serialize "method=EDI,long=yes";
 <X12>
  <ISA>
   <ISA06-InterchangeSenderId>1515151515</ISA06-InterchangeSenderId>
   <ISA08-InterchangeReceiverId>5151515151</ISA08-InterchangeReceiverId>
   <ISA11-RepetitionSeparator>^</ISA11-RepetitionSeparator>
   <ISA13-InterchangeControlNumber>000032123</ISA13-InterchangeControlNumber>
   <ISA14-AcknowledgmentRequested>0</ISA14-AcknowledgmentRequested>
   <ISA15-UsageIndicator>P</ISA15-UsageIndicator>
   <ISA16-ComponentElementSeparator>*</ISA16-ComponentElementSeparator>
  </ISA>
  <GS>
   <GS01-FunctionalIdentifierCode>CT</GS01-FunctionalIdentifierCode>
   <GS02-ApplicationSendersCode>9988776655</GS02-ApplicationSendersCode>
   <GS03-ApplicationReceiversCode>1122334455</GS03-ApplicationReceiversCode>
   <GS06-GroupControlNumber>128</GS06-GroupControlNumber>
   <GS07-ResponsibleAgencyCode>X</GS07-ResponsibleAgencyCode>
   <GS08-VersionReleaseIndustry>004030</GS08-VersionReleaseIndustry>
  </GS>
  <TS_832>
   <ST>
    <ST01-TransactionSetIdentifierCode>832</ST01-TransactionSetIdentifierCode>
    <ST02-TransactionSetControlNumber>12345</ST02-TransactionSetControlNumber>
   </ST>
   <BCT>
    <BCT01-CatalogPurposeCode>CP</BCT01-CatalogPurposeCode>
    <BCT02-CatalogNumber>GOV56789</BCT02-CatalogNumber>
    <BCT10-TransactionSetPurposeCode>00</BCT10-TransactionSetPurposeCode>
   </BCT>
   {
    for $row in doc('converter:TAB:first=yes?file:///c:/sample.tab')/table/row
    return
     <GROUP_5>
      <LIN>
       <LIN02-ProductServiceIdQualifier>MF</LIN02-ProductServiceIdQualifier>
       <LIN03-ProductServiceId>{$row/id/text()}</LIN03-ProductServiceId>
       <LIN07-ProductServiceId>{$row/description/text()}</LIN07-ProductServiceId>
       <LIN09-ProductServiceId>{$row/quantity/text()}</LIN09-ProductServiceId>
      </LIN>
      <GROUP_6>
       <CTP>
        <CTP03-UnitPrice>{$row/unitprice/text()}</CTP03-UnitPrice>
       </CTP>
      </GROUP_6>
     </GROUP_5>
   }
   <SE/>
  </TS_832>
  <GE/>
 </X12>

Curious about the result? Here you go:
ISA+00+ +00+ +ZZ+1515151515 +ZZ+5151515151 +080129+1710+U+00000+000032123+0+P+*'
GS+CT+9988776655+1122334455+20080129+1710+128+X+004030'
ST+832+12345'
BCT+CP+GOV56789++++++++00'
LIN++MF+001++++Telephone++1'
CTP+++23'
LIN++MF+002++++Desk++1'
CTP+++129'
LIN++MF+003++++Keyboard++2'
CTP+++21'
SE+9+12345'
GE+1+128'
IEA+1+000032123'

The beauty of this approach is that you can use it in conjunction with all the other data aggregation capabilities of DataDirect XQuery, like possibly adding data to the outgoing EDI message which is not available in the incoming tab-delimited file but available inside a relational database (like a encoded field that may need to be looked up in a table); and this will all work taking advantage of the performance and scalability features of DataDirect XQuery and XML Converters: the example above will seamlessly stream from tab-delimited to EDI without ever allocating in memory for than a few items per time.

The same technique applies of course to all the data sources supported by DataDirect XQuery and XML Converters; you can find more details at the following links:

Labels: , , , , ,

Thursday, January 17, 2008

Split up an XML document

In the series XQuery generating multiple XML documents we talk today about splitting a large XML document in several smaller ones. In some scenarios such huge documents are simply unmanageable, requiring the need to split them up.

Consider one of the Shakespeare plays marked up in XML, for example Hamlet. We want to split for this play each speech in a separate XML document.

for $speech at $i in
doc("http://www.andrew.cmu.edu/user/akj/shakespeare/hamlet.xml")
/PLAY//SPEECH
let $url := concat("C:/SHAKESPEAR/",
string-join($speech/SPEAKER,"_"),
$i, ".xml")
return ddtek:serialize-to-url($speech, $url,
"omit-xml-declaration=no,indent=yes")

It outputs 1138 files in the C:\SHAKESPEAR directory.

Or you might need to generate an HTML file for each speech, and have a document referencing them all. In the next example, each of the speech HTML files are created through ddtek:serialize-to-url, the query result is the master HTML referencing the others.

<html>{
for $speech at $i in
doc("http://www.andrew.cmu.edu/user/akj/shakespeare/hamlet.xml")
/PLAY//SPEECH
let $url := concat("C:/SHAKESPEAR/",
string-join($speech/SPEAKER,"_"),
$i, ".html")
let $htmlspeech := <html>
<b>{$speech/SPEAKER/text()}</b>:<br/>{
for $line in $speech/LINE
return
($line/text(),<br/>)
}</html>
return
(
ddtek:serialize-to-url($htmlspeech, $url, "method=html"),
<a href="{$url}">{
$speech/SPEAKER/text()
}</a>,
<br/>
)
}</html>

As you see, using the ddtek:serialize-to-url function it becomes very natural to split XML documents in multiple ones. And of course, all this taking advantage of DataDirect XQuery's XML Streaming capabilities, enabling support for huge XML documents.

Tech Tags:

Labels: ,

Thursday, December 27, 2007

Create (EDI) messages out of your database

This is our third post in a series about XQuery generating multiple XML documents.

It's not uncommon to have the needs to generate XML messages out of data stored in your relational database in some type of batch process. Obviously a first class use case for DataDirect XQuery, and the built-in function ddtek:serialize-to-url comes again to rescue.

Suppose we're a bookstore, running a MySQL database with supplier and order information. As we know, with DataDirect XQuery tables in your relational database are accessed through the fn:collection function. The following query creates an XML document for each of the suppliers, listing all orders of today, and each order includes the book's ISBN number and a quantity. The XML documents are saved in the directory C:\output, as "suppliername.xml" files.

for $supplier in collection("SUPPLIERS")/SUPPLIERS
let $orders := collection("ORDERS")/ORDERS[SUP_ID = $supplier/ID and
ORDER_DATE = current-date()]
where $orders (: ignore the supplier if there are no orders for today :)
return
let $url := concat("file:///C:/output/", $supplier/NAME, ".xml")
let $msg := <supplier name="{$supplier/NAME/}">
<orders>{
for $order in $orders
return
<order>
<isbn>{$order/ISBN/text()}</isbn>
<quantity>{$order/QUANTITY/text()}</quantity>
}</orders>
</supplier>
return
ddtek:serialize-to-url($msg, $url, "method=xml")

But now we need to communicate through EDI with our business partners. DataDirect XQuery not only integrates with DataDirect XML Converters to query and transform EDI messages as shown in a previous post, but also to generate your EDI messages.

In the example above, the third argument of ddtek:serialize-to-url is "method=xml". This instructs DataDirect XQuery to serialize the messages in XML format. By simply specifying "method=EDI", DataDirect XQuery will use the appropriate DataDirect XML Converter to produce EDI messages rather than XML.

The next example is similar to the previous, but generates EANCOM ORDERS messages for each of the suppliers.

for $supplier in collection("SUPPLIERS")/SUPPLIERS
let $orders := collection("ORDERS")/ORDERS[SUP_ID = $supplier/ID and
ORDER_DATE = current-date()]
where $orders
return
let $url := concat("file:///C:/output/", $supplier/NAME, ".edi")
let $edi := (: this message is incomplete :)
<EDIFACT>
<UNB>
<UNB01>
<UNB0101>UNOA</UNB0101>
<UNB0102>4</UNB0102>
</UNB01>
<UNB02>
<UNB0201>MYBOOKSHOP</UNB0201>
<UNB0202>666</UNB0202>
</UNB02>
<UNB03>
<UNB0301>{$supplier/NAME/text()}</UNB0301>
<UNB0302>{$supplier/BUSINESS_ID/text()}</UNB0302>
</UNB03>
<UNB05>6002</UNB05>
</UNB>
<ORDERS>{
for $order at $lineitem in $orders
return
<GROUP_28>
<LIN>
<LIN01>{$lineitem}</LIN01>
<LIN02>1</LIN02>
<LIN03>
<LIN0301>{$order/ISBN/text()}</LIN0301>
<LIN0302>IB</LIN0302>
</LIN03>
</LIN>
<QTY>
<QTY01>
<QTY0101>1</QTY0101>
<QTY0102>{$order/QUANTITY/text()}</QTY0102>
</QTY01>
</QTY>
</GROUP_28>
}</ORDERS>
</EDIFACT>
return
ddtek:serialize-to-url($edi, $url, "method=EDI")

This example creates a fixed EDI message for each of the suppliers, but this can easily be tuned as we're using XQuery. Suppose some of your suppliers require the order information as XML and others as EDI. We can combine parts of the previous two queries.

for $supplier in collection("SUPPLIERS")/SUPPLIERS
let $orders := collection("ORDERS")/ORDERS[SUP_ID = $supplier/ID and
ORDER_DATE = current-date()]
where $orders
return
if ($supplier/MESSAGE_FORMAT = "XML") then
(: continue here with the first query :)
else
(: continue here with the second query :)

I hope this gives you an idea how easy it is to generate XML or EDI messages out of your database. And this is not limited to EANCOM messages, many other standards are supported too, think of X12, EDIFACT, HL7, IATA, etc.
Using DataDirect XQuery together with DataDirect XML Converters and all the expressiveness of the XQuery language there are close to no limits.

Tech Tags:

Labels: , ,

Wednesday, December 19, 2007

Transform a directory of EDI messages

We recently announced DataDirect XQuery 3.1. It integrates now nicely with the DataDirect XML Converters, and as we have the ability to query all files in a directory, this opens a new range of use cases.

Suppose you have a directory with EDI messages and need to transform them all to XML, for example to make the messages consumable for some other business process.

In XQuery generating multiple XML documents? we learned how to query a directory of XML document, and to transform and save each of the documents into a new XML file. To refresh our mind, the following query copies all XML files from one to another directory,

declare function local:get-file-name($document-uri as xs:string){
tokenize($document-uri, "/")[last()]
};

for $doc in fn:collection("file:///C:/input?select=*.xml")
let $filename := concat("file:///C:/output/",
local:get-file-name(document-uri($doc)))
return
ddtek:serialize-to-url($doc, $filename, "")

But ok, we have now a directory of EDI messages, and want to transform them all into XML. Only a few changes to the above query are needed. The argument to fn:collection is different, in order to make DataDirect XQuery invoke the appropriate DataDirect XML Converter. And second, the local:get-file-name() is a bit more complex to generate the matching .xml filename.

declare function local:get-file-name($document-uri as xs:string){
let $file-name := tokenize($document-uri, "/")[last()]
let $file-name := replace($file-name,'^(.*)\..*','$1')
let $file-name := concat($file-name, ".", "edi")
return $file-name
};

for $doc in fn:collection("converter:///EDI?file:///C:/input?select=*.edi")
let $filename := concat("file:///C:/output/",
local:get-file-name(document-uri($doc)))
return
ddtek:serialize-to-url($doc, $filename, "")

The resulting XML document need to conform to a specific XML Schema. No problem, XQuery is designed to perform such transformations. Assume we need to transform EDIFACT messages, we start from our previous query and simply add some transformation logic.

for $edi in fn:collection("converter:///EDI?file:///C:/input?select=*.edi")

let $filename := concat("file:///C:/output/",
local:get-file-name(document-uri($edi)))
let $xml :=
<order submitter="{$edi/EDIFACT/UNB/UNB02/UNB0201}">{
for $group28 in $edi/EDIFACT/ORDERS/GROUP_28
return
<book>
<id>{
$group28/LIN/LIN03/LIN0301/text()
}</id>
<quantity>{
$group28/QTY/QTY01/QTY0102/text()
}</quantity>
<ISBN>{
$group28/LIN/LIN03/LIN0301/text()
}</ISBN>
</book>
}</order>
return
ddtek:serialize-to-url($xml, $filename, "")

We have still fairly simple code with already significant functionality. You’re now ready to use any of the built-in DataDirect XQuery functionality. Assume the data needs to be enriched with data from a web service or your customer data stored in a relational database. Let’s extend the previous example and add a shipping address to the result, which is obtained out of the CUSTOMERS table in our database.

for $edi in fn:collection("converter:///EDI?file:///C:/input?select=*.edi")
let $address := collection("CUSTOMERS")/CUSTOMERS[ID = $edi/EDIFACT/UNB/UNB02/UNB0201]
let $filename := concat("file:///C:/output/",
local:get-file-name(document-uri($edi)))
let $xml :=
<order submitter="{$edi/EDIFACT/UNB/UNB02/UNB0201}">{
<shipping_address>
<street>{
concat($address/STREET, " ", $address/NUMBER)
}</street>
<zip>{
$address/ZIPCODE
}</zip>
<state>{
$address/STATE
}</state>
</shipping_address>,
for $group28 in $edi/EDIFACT/ORDERS/GROUP_28
return
<book>
<id>{
$group28/LIN/LIN03/LIN0301/text()
}</id>
<quantity>{
$group28/QTY/QTY01/QTY0102/text()
}</quantity>
<ISBN>{
$group28/LIN/LIN03/LIN0301/text()
}</ISBN>
</book>
}</order>
return
ddtek:serialize-to-url($xml, $filename, "")


Hopefully I demonstrated how to query, enrich an transform XML messages. Soon we'll talk about "split a huge XML file into smaller documents".

Tech Tags:

Labels: , ,

Tuesday, December 4, 2007

XQuery generating multiple XML documents?

A requirement we often see with DataDirect XQuery customers, is the ability to generate multiple results. Imagine a directory with XML files, each of which needs to be transformed and result in a new XML document. Wouldn’t it be great to query the complete directory and generate the corresponding results, all in a single XQuery? No need to glue multiple queries together, do external scripting, write Java or any other proprietary code. Simply a single XQuery!

The use case is not new. And not surprising, people are doing this in XSLT since a long time through the xsl:result-document instruction. Despite the library of almost 200 built-in functions in XQuery, there is not such a function in XQuery 1.0.

The good news is that DataDirect XQuery 3.1 adds out-of-the-box support to serialize query results in a file.

ddtek:serialize-to-url(
$items as item()*,
$url as xs:string,
$options as xs:string)

ddtek:serialize-to-url() has three arguments, first the data to be serialized, second a URL specifying the file to be saved and the last argument tweaks the serialization process.

The following example creates a output.xml file in the c:\results directory,

let $item := <message>XQuery rocks!</message>
return
ddtek:serialize-to-url($item,
"file:///results/output.xml","")

As you see there is no need at all to write any Java code, this can be executed from within your favorite XQuery editor or using the DataDirect XQuery command line utility.

As a URL specifies the output location, you can for example easily upload the result to an FTP server,

let $item := <message>XQuery rocks!</message>
return
ddtek:serialize-to-url($item,
"ftp://uid:pwd@myftpserver/results/", "")

Finally, the serialization of the results can be tweaked. Here we encode the XML output as UTF-16 and specify to include an XML declaration,

let $item := <message>XQuery rocks!</message>
return
ddtek:serialize-to-url($item,
"file:///results/output.xml",
"omit-xml-declaration=no,encoding=UTF-16")

Let's look at a more concrete example, where all XML document are copied from one directory to another. in this query we use DataDirect XQuery's capability to query a complete directory through fn:collection.

declare function local:get-file-name($document-uri as xs:string){
tokenize($document-uri, "/")[last()]
};

for $doc in fn:collection("file:///C:/input?select=*.xml")
let $filename := concat("file:///C:/output/",
local:get-file-name(document-uri($doc)))
return
ddtek:serialize-to-url($doc, $filename, "")

Assuming this XQuery is saved as C:\xquery\copy-document.xq, it can be executed from the command line as follows

java -jar C:\ddxq\lib\ddxq.jar C:\xquery\copy-document.xq

In the next days, we’ll show through some real use cases how to take advantage of ddtek:serialize-to-url

Tech Tags:

Labels: ,

Wednesday, November 7, 2007

Updating XML with XQuery 1.0

I was reading an interesting discussion yesterday on xquery-talk, replacing a node in in-memory XML.

How can one modify an XML structure through XQuery? In the future, the answer is definitely XQuery Update Facility. But the XQuery Update Facility is currently still work in progress, and not yet widely supported. What do we do today?

Ryan Grimm wrote an XQuery library to update an in-memory XML structure. And it looks like the in-mem-update library is pretty functional complete, having the following functions.

  • node-insert-child
  • node-insert-before
  • node-insert-after
  • node-replace
  • node-delete

How do you use these functions? Let's have a look at a query from the XQuery Update Facility Use Cases, and show an equivalent implementation based on the in-mem-update library.

Consider Q2, Enter a bid for user Annabel Lee on February 1st, 1999 for 60 dollars on item 1001. The XQuery Update Facility based solution is as follows,

let $uid := 
doc("users.xml")/users/user_tuple[name="Annabel Lee"]/userid
return do
insert
<bid_tuple>
<userid>{data($uid)}</userid>
<itemno>1001</itemno>
<bid>60</bid>
<bid_date>1999-02-01</bid_date>
<bid_tuple>
into doc("bids.xml")/bids

Using the library we end up doing something as follows,

import module namespace mem = "http://xqdev.com/in-mem-update" at "in-mem-update.xqy";
let $uid :=
doc("users.xml")/users/user_tuple[name="Annabel Lee"]/userid
return
mem:node-insert-child(
doc("bids.xml")/bids,
<bid_tuple>
<userid>{data($uid)}</userid>
<itemno>1001</itemno>
<bid>60</bid>
<bid_date>1999-02-01</bid_date>
</bid_tuple>)
Looks pretty similar, no? There is actually one fundamental difference. With the XQuery Update Facility, the bids.xml document is actually updated. The in-mem-update variant, doesn't update the bids.xml document, but rather returns a copy of the original document reflecting the change.
This shows one of the possible issues with the library. Each modification made to an XML structure results in a copy. Making a lot of changes to a single XML structure, or updating a huge XML structure might affect performance. Still, I believe the library is useful in a lot of common scenarios.

The library is written to be used with MarkLogic Server, and unfortunately based on an older version of the XQuery specification. This makes it fail out of the box using XQuery 1.0 compliant processors. I updated the XQuery module in order to make it XQuery 1.0 compatible, and in addition added support for document nodes. It's available for download here.

So, you can now update all your data with DataDirect XQuery. Using the ddtek:sql-insert, ddtek:sql-update and ddtek:sql-delete functions you can update your relation database. And using the in-mem-update library you can now also make changes to your XML documents.

I believe this library is complementary to the functions modifying XML elements and attributes available in the FunctX XQuery library. Wouldn't it be cool to have these functions added to FunctX? I leave it to Ryan Grimm and Priscilla Walmsley to discuss this in detail.

Labels: , , , ,

Monday, October 15, 2007

XQuery your Excel spreadsheets

A few weeks ago I blogged about XQuery your office documents. We can query our Office Open XML and OpenDocument Format documents, because they are XML based. But what about older formats? For example, there are a zillion of Excel 2003 spreadsheets, and they will be around for another few years. Wouldn't it be great if we can query those, just as we can query OOXML and ODF documents through DataDirect XQuery?

URI Resolvers

JAXP defines the URIResolver interface. A URIResolver turns a URI into a 'virtual' XML document. The concept of URIResolver is supported by most Java-based XPath, XSLT and XQuery implementations and thus also by DataDirect XQuery.

URIResolvers allow you to the query any proprietary format through XQuery, as long as you go through the effort of some Java coding to transform the legacy format to XML.

There are also products using the URIResolver interface to make non-XML data standards available to the XML eco-system. The XML Converters, for example, allow to query many non-XML data formats like the EDI standards X12, EDIFACT, EANCOM, HL7, etc, as well as dBASE, CSV, JSON and many others.

Reading Excel 2003 files

We're going to use the concept of URIResolvers to query our Excel 2003 spreadsheets. But of course, we need a Java implementation to read these XLS files. Apache POI is such Java API to access Microsoft file formats, including Excel 2003 documents.

The advantages of Apache POI for Excel are,

  • Cross platform, as there are no dependencies on native Windows DLLs
  • The API is powerful enough to translate an XLS into XML
  • Formula support, most of the time you want to query the 'data'. For a cell containing a formula (for example SUM(A1:A5)) you're not interested in the formula itself but its result.

The ExcelURIResolver

So we have written a URIResolver to read Excel 2003 files. The xqexcel.jar file is available here, and needs to be added to your CLASSPATH. You also need two Apache POI 3.0.1 jar files, poi-3.0.1-FINAL-20070705.jar and poi-scratchpad-3.0.1-FINAL-20070705.jar. The Apache POI distribution can be downloaded here.

Enabling the ExcelURIResolver in DataDirect XQuery is trivial. If you are using the XQJ API, you can simply register the ExcelURIResolver through your DDXQDataSource.

...
DDXQDataSource ddds = new DDXQDataSource();
ddds.setDocumentUriResolver(
"com.ddtek.xquery.excel.ExcelURIResolver");
...

Using the DataDirect XQuery command-line utility, all you need to do is adding the -r option specifying the class name of the Excel URIResolver, com.ddtek.xquery.excel.ExcelURIResolver.

And you're all set to query Excel 2003 documents through the fn:doc() function. Use the excel: URI scheme, specifying the file name of the .XLS

fn:doc('excel:C:/my office documents/sales2007.xls')

The virtual Excel XML document

An Excel 2003 document is called a workbook and can contain several sheets, each sheet is a grid of cells. Our ExcelURIResolver makes the following information available through the virtual XML document.

  • All sheets, each with the name of the sheet.
  • Within a sheet, all used row. For each row, the row number as available in Excel is available.
  • Within a row, each cell being used. Note that you can have a different number of cells within each row. So, its not like a relational table where each row has a fixed number of columns. For each of the cells, the name of the column is available, consistent with the scheme used by Excel.

As an example, consider a sample Excel file, ciscoexpo.xls, from Microsoft's web site. The file consists of one sheet called Sheet1 and it looks as follows.
The columns A and B (Year and Sales) contain plain data (numbers) and the columns C and D (Predication and Ratio) contain formula's. Cell C5 is for example =58.552664*EXP(0.569367*A5) and D5 is =C5/C4.

When we query the complete document,

fn:doc('excel:C:/my office documents/ciscoexpo.xls')

We get the following virtual XML document.

<workbook name="excel:C:/my office documents/ciscoexpo.xls">
 <sheet name="Sheet1">
  <row line="1">
   <cell column="2" name="B1">Year 1=1990</cell>
  </row>
  <row line="3">
   <cell column="1" name="A3">Year</cell>
   <cell column="2" name="B3">Sales</cell>
   <cell column="3" name="C3">Prediction</cell>
   <cell column="4" name="D3">Ratio</cell>
  </row>
  <row line="4">
   <cell column="1" name="A4">1</cell>
   <cell column="2" name="B4">70</cell>
   <cell column="3" name="C4">103.4712285029616</cell>
  </row>
  <row line="5">
   <cell column="1" name="A5">2</cell>
   <cell column="2" name="B5">183</cell>
   <cell column="3" name="C5">182.84898408571283</cell>
   <cell column="4" name="D5">1.767148092578018</cell>
  </row>
  <row line="6">
   <cell column="1" name="A6">3</cell>
   <cell column="2" name="B6">340</cell>
   <cell column="3" name="C6">323.1212334568959</cell>
   <cell column="4" name="D6">1.7671480925780185</cell>
  </row>
  ...
  <row line="13">
   <cell column="1" name="A13">10</cell>
   <cell column="2" name="B13">12154</cell>
   <cell column="3" name="C13">17389.060639019517</cell>
   <cell column="4" name="D13">1.767148092578018</cell>
  </row>
  <row line="15">
   <cell column="1" name="A15">16</cell>
   <cell column="3" name="C15">529558.3247555149</cell>
  </row>
 </sheet>
 <sheet name="Sheet2"/>
 <sheet name="Sheet3"/>
</workbook>

Accessing cell B5 to B7 in XQuery world,

let $xls := fn:doc('excel:C:/my office documents/ciscoexpo.xls')
let $sheet := $xls/workbook/sheet[@name="Sheet1"]
return
$sheet/row/cell[@name=("B5","B6","B7")]
Conclusion

We have shown how to open data locked up for years in your Excel spreadsheets. We can now query this virtual XML document like we can with any other XML document, opening a wide range of use cases.

  • Transform Excel 2003 documents into any XML standard format
  • Join your .xls data with your relation database to generate complex XML documents
  • Create EDI messages using the XML Converters with data stored in Excel 2003 spreadsheets
  • Extract information out of Excel 2003 documents and upload it into your database
  • Publish data out of your Excel spreadsheets in PDF format using XSL-FO
  • etc

But also, the concept of URI Resolver is powerful, and allows you basically to query any proprietary data through XQuery.

Labels: ,

Friday, September 28, 2007

Thinking about insurance? XQuery can help!


ACORD is a well-known organization focused on standards in the insurance industry. In the last several years ACORD has developed an impressive number of XML-based standards, and more and more organizations encounter these standards every year. It should be no surprise that a language like XQuery can be extremely helpful in creating, consuming, and processing ACORD messages; and XQuery processors able to access heterogeneous data sources, like DataDirect XQuery, provide an even more powerful way to process ACORD-based requests, to validate requests against a variety of sources, and to perform back-end updates based on changes communicated through ACORD messages.

In order to illustrate the value brought to the table by DataDirect XQuery, we went through the exercise of creating XQuery that deal with specific ACORD requests in hypothetical (but reasonably realistic) scenarios. We recently published the results on our web site, at http://www.xquery.com/ACORD/.

One of the major benefits of using DataDirect XQuery is that users are able to access and process multiple heterogeneous data sources in the context of a single language (XQuery) and data model. To prove that, we spent some time trying to imagine how tasks similar to the ones described in the XQuery examples would be solved in an environment where the developer is limited to the use of Java+SQL. We'll leave it to you to judge the difference in complexity between the XQuery and Java+SQL approaches.

Of course, insurance is just one of the many possible "industry verticals" that we could choose; time permitting, we will tackle other industries (health care and airlines are two that come to mind) focusing on different standards. If you have any specific suggestions, we are interested in hearing about them! Write and let us know.



Labels: , ,

Tuesday, September 18, 2007

XQuery your office documents

With the introduction of the Office Open XML in Microsoft Office 2007 and OpenDocument Format used by OpenOffice applications, your office documents like spreadsheets and word processing documents, become consumable with XQuery.

Technically these formats are basically a bunch of XML files, packaged in a ZIP file. Here is for example an Office Open XML .docx file opened with winzip,

Although the format is different, the OpenDocument Format is conceptually similar, a bunch of XML files packaged in a ZIP.

How can you query the XML files inside the .docx file? After all it’s fairly simple with with DataDirect XQuery using the standard fn:doc XQuery function. fn:doc has a single argument, the URL identifying the XML document to query. Amongst the standard URL schemes like file: and http:, your Java virtual machine also support the jar: URL scheme.

A jar archive is considered "a zip archive with logical extensions". The "logical extensions" being special files like manifest.mf or the META-INF directory located in the archives. But physically these are just zip archives and as such we can use the jar: URL scheme to query Office Open XML documents. For example, to query the main document from our example above, use the following fn:doc call,

doc('jar:file:///C:/example.docx!/word/document.xml')

Let’s take a concrete example, and extract all the comments from John, ordered by date. The Office Open XML format stores the comments in the comments.xml document(what else would you have expected?).

declare namespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";
<all_john_comments>>{
for $comment in
doc('jar:file:///C:/example.docx!/word/comments.xml')/*/w:comment
where $comment/@w:author = 'John'
order by xs:dateTime($comment/@w:date)
return
<comment date="{$comment/@w:date}">{$comment//text()}</comment>
}</all_john_comments>

Well, I guess you get the idea...

It’s not only about simply querying your office documents. Using the out-of–the-box DataDirect XQuery and XML Converters features, a whole new range of capabilities and use cases become available. For example, extract data out of 'document forms' and save it in your database. Or extract and aggregate data out of a spreadsheet to generate EDI messages.

We have shown how to query your the OpenDocument Format and Office Open XML documents. In a future post, we’ll show how the "older" office formats can be queried using a Custom URI Resolver.
Imagine you can query your existing Excel files using DataDirect XQuery. Stay tuned!

Labels: ,

Monday, August 13, 2007

Formatting numbers in XQuery 1.0


I often need to use XQuery to create XHTML or even XSL-FO; which is apparently a fairly unusual usecase for XQuery. It is true that most people dealing with the transformation of XML into HTML or XSL-FO have been doing that with XSLT in the past; and I guess most people working on XQuery specs assumed that this would still be the case even after XQuery became a standard.

Or at least that's the only reason I can find for some obvious things that seem to be missing in XQuery 1.0, one of which is the availability of a format-number() function, which is a quite basic function for creating a string representation of a number.

Why do I need to use XQuery rather than XSLT? Because the XHTML/XSL-FO reports I need to create are aggregating data that is available in part as XML documents, and in part inside a relational database; and at least some XQuery implementations make that task very easy to achieve in a highly scalable way.

Of course in most XQuery implementations like DataDirect XQuery or Saxon you can create Java extension functions to implement a format-number() function; but I wanted to find a way to achieve at least some of the format-number() functionality without having to rely on a Java function; that's why I ended up writing the attached piece of XQuery.

I haven't spent much time cleaning it up, adding documentation or making it part of a nicely structured library module; if anyone is willing to do that (or to make it become part of FunctX), please feel free to. And also, if you have any comments or suggestions about how to improve it, feel free to post them here.

format-number.xquery



Labels: , , , ,