XML Connections

Sunday, May 25, 2008

Shredding XML documents into tables, a database independent approach...

We see more and more DataDirect XQuery users shredding XML documents into a relational database, that is, reading specific data out of XML documents and storing it in their database. Any relational database: Oracle, SQL Server, DB2, Informix, Sybase, MySQL, PostgreSQL,... you name it!

OK, I see some of you already asking... "Why not use a native XML database to store the complete XML documents?" There is no single answer to that question; as always, it depends on many factors. Are you processing data-centric XML? Do you need to store the data in an existing relational database? Should the data in your database be queryable through reporting tools? Are you enhancing an existing application with an XML interface? Etc, etc. If you answer yes to one or more of these questions, it is worth considering an approach where the XML documents are shredded into relational tables.

Fine, but why DataDirect XQuery? Most RDBMS solutions already offer the ability to shred XML documents into a relational database, as explained in this IBM developerWorks paper, Shred XML documents using DB2 pureXML.

There are a variety of reasons... These database solutions are mostly vendor specific. And even if your organization deploys only one database brand, you are likely to run into serious incompatibilities between different versions of the database. These solutions are mostly cumbersome to use. What about scalability when it comes to processing large documents, in the hundreds of megabytes or several gigabytes? What if specific data transformations are required? Etc, etc.

DataDirect XQuery answers most of these concerns. Bulk load of XML data into a relational database includes some simple but illustrative examples, like the next query, in which books are loaded into the shipments table. Note that in addition to the bulk load itself, some of the data needs to be transformed and validated. All of this is fairly simple using XQuery.

declare variable $shipment
  as document-node(element(*, xs:untyped)) external;

for $book in $shipment/order/book
return
  ddtek:sql-insert("shipments",
    "DATE", current-dateTime(),
    "ISBN", $book/isbn,
    "TITLE", upper-case($book/title),
    "QUANTITY",
      if ($book/@quantity) then $book/@quantity else 1)

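Or, to make the example a bit more complex, consider the following database-independent upsert scenario: if a row for the book's ISBN already exists, its quantity is increased; otherwise a new row is inserted.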

declare variable $shipment
  as document-node(element(*, xs:untyped)) external;

for $book in $shipment/order/book
(: default the quantity to 1 when the attribute is absent :)
let $quantity := if ($book/@quantity) then
                   xs:integer($book/@quantity)
                 else
                   1
(: look up the existing row for this ISBN, if any :)
let $existing :=
  collection("shipments")/shipments[ISBN = $book/isbn]
return
  if ($existing) then
    ddtek:sql-update($existing,
      "QUANTITY", $existing/QUANTITY + $quantity)
  else
    ddtek:sql-insert("shipments",
      "DATE", current-dateTime(),
      "ISBN", $book/isbn,
      "TITLE", upper-case($book/title),
      "QUANTITY", $quantity)
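
For readers who want to see the row-level effect of the upsert without a DataDirect XQuery installation, here is a minimal Python sketch of the same logic against an in-memory SQLite table. It is an illustration only, not how DataDirect XQuery works internally: the sample documents, ISBN values and the `upsert_shipments` helper are hypothetical, and the "update first, insert when no row was touched" pattern stands in for the lookup done in the query.

```python
import sqlite3
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical sample shipments; the element names follow the query above.
FIRST = """
<order>
  <book quantity="3"><isbn>111</isbn><title>XQuery Basics</title></book>
  <book><isbn>222</isbn><title>Shredding XML</title></book>
</order>
"""
SECOND = """
<order>
  <book quantity="2"><isbn>111</isbn><title>XQuery Basics</title></book>
</order>
"""


def upsert_shipments(conn, xml_text):
    """Add each book's quantity to its row, inserting the row if missing."""
    order = ET.fromstring(xml_text)
    for book in order.findall("book"):
        isbn = book.findtext("isbn")
        # A missing quantity attribute defaults to 1, as in the query.
        quantity = int(book.get("quantity", "1"))
        cur = conn.execute(
            "UPDATE shipments SET QUANTITY = QUANTITY + ? WHERE ISBN = ?",
            (quantity, isbn),
        )
        if cur.rowcount == 0:  # no existing row for this ISBN: insert one
            conn.execute(
                'INSERT INTO shipments ("DATE", ISBN, TITLE, QUANTITY) '
                "VALUES (?, ?, ?, ?)",
                (
                    datetime.now(timezone.utc).isoformat(),
                    isbn,
                    book.findtext("title").upper(),
                    quantity,
                ),
            )


CONN = sqlite3.connect(":memory:")
CONN.execute(
    'CREATE TABLE shipments ("DATE" TEXT, ISBN TEXT, TITLE TEXT, QUANTITY INTEGER)'
)
upsert_shipments(CONN, FIRST)
upsert_shipments(CONN, SECOND)

TOTALS = dict(CONN.execute("SELECT ISBN, QUANTITY FROM shipments").fetchall())
TITLES = dict(CONN.execute("SELECT ISBN, TITLE FROM shipments").fetchall())
print(TOTALS)
```

The second document only contains a book that is already in the table, so it results in an update rather than a new row.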

But there is of course much more you can do. As another example, inspired by the developerWorks article mentioned above, consider a so-called "bill of materials" XML document.

<items>
  <item desc="computersystem" model="L1234123">
    <part desc="computer" partnum="5423452345">
      <part desc="motherboard" partnum="5423452345">
        <part desc="CPU" partnum="6109486697">
          <part desc="register" partnum="6109486697"/>
        </part>
        <part desc="memory" partnum="545454232">
          <part desc="transistor" partnum="6109486697"/>
        </part>
      </part>
      <part desc="diskdrive" partnum="6345634563456">
        <part desc="spindlemotor" partnum="191986123"/>
      </part>
      <part desc="powersupply" partnum="098765343">
        <part desc="powercord" partnum="191986123"/>
      </part>
    </part>
    <part desc="monitor" partnum="898234234">
      <part desc="cathoderaytube" partnum="191986123"/>
    </part>
    <part desc="keyboard" partnum="191986123">
      <part desc="keycaps" partnum="191986123"/>
    </part>
    <part desc="mouse" partnum="98798734">
      <part desc="mouseball" partnum="98798734"/>
    </part>
  </item>
</items>

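This data, with its recursive part elements, can be represented in a relational table: one row for the item itself, plus one row per part that records the description of the part's parent.

Nothing more than the following simple query will get you there.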

for $item in /items/item
return
  (
    ddtek:sql-insert("itemtest",
      "itemname", $item/@desc,
      "id", $item/@model)
    ,
    for $part in $item//part
    return
      ddtek:sql-insert("itemtest",
        "itemname", $item/@desc,
        "parent", $part/../@desc,
        "description", $part/@desc,
        "id", $part/@partnum)
  )
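
To make the resulting table concrete, the following Python sketch computes the rows that a query of this shape would produce for an abbreviated version of the document. It is only an illustration of the flattening, not a replacement for the query: the `shred_bom` helper is hypothetical, and the columns the query leaves out of the item row (`parent`, `description`) are shown here as `None`.

```python
import xml.etree.ElementTree as ET

# Abbreviated subset of the bill-of-materials document above.
BOM_XML = """
<items>
  <item desc="computersystem" model="L1234123">
    <part desc="computer" partnum="5423452345">
      <part desc="diskdrive" partnum="6345634563456">
        <part desc="spindlemotor" partnum="191986123"/>
      </part>
    </part>
    <part desc="monitor" partnum="898234234">
      <part desc="cathoderaytube" partnum="191986123"/>
    </part>
  </item>
</items>
"""


def shred_bom(xml_text):
    """Flatten item/part nesting into one row per item and per part."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent map
    # to play the role of $part/.. in the query.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    rows = []
    for item in root.findall("item"):
        rows.append({
            "itemname": item.get("desc"),
            "parent": None,
            "description": None,
            "id": item.get("model"),
        })
        # iter("part") walks all descendant parts in document order,
        # like $item//part.
        for part in item.iter("part"):
            rows.append({
                "itemname": item.get("desc"),
                "parent": parent_of[part].get("desc"),
                "description": part.get("desc"),
                "id": part.get("partnum"),
            })
    return rows


ROWS = shred_bom(BOM_XML)
for row in ROWS:
    print(row)
```

Every part row carries the name of the top-level item and the description of its direct parent, which is enough to rebuild the hierarchy from the table.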

We've shown how easily you can shred XML documents and load the data into relational tables. And, importantly, all of this works in a scalable and database-independent way.

Tuesday, May 13, 2008

XQuery Update Facility versus XSLT?

Last month we discussed Transforming XML using XQuery updates. With XQuery 1.0, update and transform operations are rather challenging to implement, unless you use a library to get there, as we explained in Updating XML with XQuery 1.0. In any case, we have to admit that, compared to XSLT, this is a shortcoming.

With the XQuery Update Facility, things will change drastically. Let's have a closer look at a concrete usage scenario from a recent question on SSDN.

If preceding-sibling type="zMADDRESS", type="zZip" value
should be left unchanged. Else if, preceding-sibling
type="zADDRESS", type="zZip" items should be removed.
And the comma and space which always seem to precede
the zZip in ZAddress or zNeighb must also be removed--
if that is all within a zADDRESS.
For examples:
1. Before:
<tps:c type="zStreet">20 West Row</tps:c>
<tps:c type="zAddress">,</tps:c>
<tps:c type="zNeighb">Canberra City,</tps:c>
<tps:c type="zZip">2600</tps:c>
1. After:
<tps:c type="zStreet">20 West Row</tps:c>
<tps:c type="zAddress">,</tps:c>
<tps:c type="zNeighb">Canberra City</tps:c>
2. Before:
<tps:c type="zStreet">82 Northbourne Ave.</tps:c>
<tps:c type="zAddress">,</tps:c>
<tps:c type="zNeighb">Braddon</tps:c>
<tps:c type="zAddress">,</tps:c>
<tps:c type="zZip">2601</tps:c>
2. After:
<tps:c type="zStreet">82 Northbourne Ave.</tps:c>
<tps:c type="zAddress">,</tps:c>
<tps:c type="zNeighb">Braddon</tps:c>
3. Before:
<tps:c type="zMaddress">Box 544, Burra Creek,</tps:c>
<tps:c type="zCity">Queanbeyan,</tps:c>
<tps:c type="zZip">2620</tps:c>
3. After: no change, because one of its preceding siblings is "zMaddress".

The proposed XSLT solution is as follows:

<?xml version='1.0'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tps="http://www.typefi.com/ContentXML">

  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:for-each select="@*">
        <xsl:attribute name="{name()}"><xsl:value-of select="."/></xsl:attribute>
      </xsl:for-each>
      <xsl:apply-templates select="*|text()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="tps:c[@type='zZip']">
    <xsl:choose>
      <xsl:when test="preceding-sibling::tps:c[@type='zAddress']">
        <!-- removed -->
      </xsl:when>
      <xsl:otherwise><xsl:copy-of select="."/></xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="tps:c[@type='zAddress']">
    <xsl:choose>
      <xsl:when test="text() = ',' and following-sibling::tps:c[1][@type='zZip']">
        <!-- removed -->
      </xsl:when>
      <xsl:otherwise><xsl:copy-of select="."/></xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>

What would this look like using the XQuery Update Facility?

declare namespace tps = "http://www.typefi.com/ContentXML";

copy $doc := .
modify
  (delete node $doc//tps:c[@type='zZip']
                 [preceding-sibling::tps:c[@type='zAddress']],
   delete node $doc//tps:c[@type='zAddress']
                 [.=(","," ")]
                 [following-sibling::tps:c[1][@type='zZip']])
return $doc
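
To check the two delete rules against examples 2 and 3 from the question, here is a rough Python sketch of the same selection logic using the standard library. It is not an XQuery Update implementation: the `doc` and `group` wrapper elements and the `strip_zips` helper are assumptions made for this illustration. Like the update expression, it first collects every node to delete and only then removes them, so both rules are evaluated against the unmodified tree.

```python
import xml.etree.ElementTree as ET

TPS = "http://www.typefi.com/ContentXML"
C = f"{{{TPS}}}c"  # qualified name of tps:c in ElementTree notation

# Examples 2 and 3 from the question, inside hypothetical wrapper elements.
SAMPLE = f"""
<doc xmlns:tps="{TPS}">
  <group>
    <tps:c type="zStreet">82 Northbourne Ave.</tps:c>
    <tps:c type="zAddress">,</tps:c>
    <tps:c type="zNeighb">Braddon</tps:c>
    <tps:c type="zAddress">,</tps:c>
    <tps:c type="zZip">2601</tps:c>
  </group>
  <group>
    <tps:c type="zMaddress">Box 544, Burra Creek,</tps:c>
    <tps:c type="zCity">Queanbeyan,</tps:c>
    <tps:c type="zZip">2620</tps:c>
  </group>
</doc>
"""


def strip_zips(root):
    """Apply both delete rules, deciding everything before removing anything."""
    doomed = []
    for parent in root.iter():
        siblings = [child for child in parent if child.tag == C]
        seen_address = False  # has a zAddress appeared among preceding siblings?
        for index, node in enumerate(siblings):
            kind = node.get("type")
            nxt = siblings[index + 1] if index + 1 < len(siblings) else None
            if kind == "zZip" and seen_address:
                # Rule 1: zZip with a preceding zAddress sibling.
                doomed.append((parent, node))
            elif (kind == "zAddress" and node.text in (",", " ")
                  and nxt is not None and nxt.get("type") == "zZip"):
                # Rule 2: separator zAddress directly followed by a zZip.
                doomed.append((parent, node))
            if kind == "zAddress":
                seen_address = True
    for parent, node in doomed:
        parent.remove(node)
    return root


ROOT = strip_zips(ET.fromstring(SAMPLE))
TYPES = [[c.get("type") for c in group] for group in ROOT]
print(TYPES)
```

The sketch needs an explicit loop, a lookahead and a removal pass, where the XQuery Update version states the same two rules as two path expressions.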

I don't want to end up in one of those endless XQuery versus XSLT discussions. But besides the fact that the XQuery Update Facility adds a nice palette of new functionality to XQuery, I believe it offers concise, readable solutions.