XML Connections

Sunday, June 29, 2008

XQuery for the SQL programmer – Introduction

At DataDirect we often get XQuery coding questions from SQL programmers. It can probably be explained by two main reasons. First there are many developers with a SQL background, not surprising they are looking for some analogy with SQL during their first XQuery steps. Second, with DataDirect XQuery you can use XQuery against relational databases, DataDirect XQuery is a natural inroad for SQL programmers into the world of XML and XQuery.

As explained last week, we have good experiences teaching XQuery. I thought it would be a good idea to start a series "XQuery for the SQL programmer" - a light introduction to XQuery from a SQL perspective.
This series is not a general introduction to XQuery. We assume you have already some notions of XQuery (or XPath). You can always refresh your XQuery knowledge reading Learn XQuery in 10 Minutes: An XQuery Tutorial.

The following essays are available in this series. This list is updated as posts become available, bookmark it if you want to have the update-to-date and complete list at first hand.

This series includes a lot of examples, based on the schema of Use Case "R" from the XML Query Use Cases. It’s a simple auction system, consisting of three tables. Here is the description taken from the XML Query Use Cases document.

The auction maintains a USERS table containing information on registered users, each identified by a unique userid, who can either offer items for sale or bid on items. An ITEMS table lists items currently or recently for sale, with the userid of the user who offered each item. A BIDS table contains all bids on record, keyed by the userid of the bidder and the item number of the item to which the bid applies

The three tables look as follows:
USERS

ITEMS

BIDS

As said, this series make extensive use of examples, all the XQuery examples are fully functional and have been tested with DataDirect XQuery.

To get started, let’s look at a first but simple SQL query. Get the item numbers of all motorcycle items.

select ITEMNO
from ITEMS
where DESCRIPTION = 'Motorcycle'

Or in XQuery you could write.

for $item in collection("ITEMS")/ITEMS
where $item/DESCRIPTION = "Motorcycle"
return $item/ITEMNO

This is a FLWOR expression, which stands for "for-let-where-orderby-return". I guess you see the analogy with a SQL select statement. Suppose we want to sort the results by item number. In SQL we add an order by clause.

select ITEMNO
from ITEMS
where DESCRIPTION = 'Motorcycle'
order by ITEMNO

In XQuery it’s similar to add an order by clause.

for $item in collection("ITEMS")/ITEMS
where $item/DESCRIPTION = "Motorcycle"
order by $item/ITEMNO
return $item/ITEMNO

This was of course not much more than a quick introduction. In our next post we’ll start with the real work, and discuss the Relational and XML data models.
Stay tuned!

Tech Tags:

Labels: , ,

Thursday, June 12, 2008

Teaching XQuery to SQL developers

Dan McCreary wrote about his Teaching XSLT vs. Teaching XQuery experience.

At DataDirect we have also good experience teaching XQuery, and a developer with some SQL experience is definitely not in disadvantage. A lot of the query concepts in SQL are easily mapped into XQuery. Priscilla Walmsley’s book, which I highly recommend, devotes a complete chapter on this topic.

As a SQL developer you might be wondering, why using XQuery to talk to my favorite relational database? Minollo blogged about it a while back. Yes there are some good reasons to have XQuery in your toolbox in addition to good old SQL.

XQuery is the natural choice to bridge the relational and XML world, with products like XML Converters you can easily extend this from XML to the complete EDI world. And tools like the XQueryWebService Framework make it trivial to build powerful data services that query, aggregate and update multiple data sources.

Increased developer’s productivity is great, but equally important is performance and scalability. This is exactly one of the focus areas of DataDirect XQuery, for example by taking full advantage of your current SQL investments.

Labels: , , , ,

Sunday, June 8, 2008

Query office documents, a more detailed look

A question about querying office spreadsheets, on the Mulberry XSL list.

...
All I need to do is retrieve the string value of every cell of
column "A" whose string length is greater than 20 characters in
length.
Anyone know how to use XPath to retrieve this from an Excel
2007 file?
...

Querying office documents with XQuery, or another XML query language like XPath or XSLT, is a popular reoccurring question. We believe DataDirect XQuery should be in your toolset when it comes to querying office documents. Hence we have blogged about it a few times before.

Back to the question, looks like a fairly simple question, isn't it? Let's try... Let's start from a concrete OOXML example, a worksheet might look as follows.

<worksheet
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
...
<sheetData>
<row r="1" spans="1:27">
<c r="A1" t="s">
<v>0</v>
</c>
<c r="AA1" t="s">
<v>0</v>
</c>
</row>
<row r="2" spans="1:27">
<c r="A2">
<v>1.23</v>
</c>
</row>
<row r="3" spans="1:27">
<c r="A3" t="s">
<v>1</v>
</c>
</row>
<row r="4" spans="1:27">
<c r="A4" t="str">
<f>CONCATENATE("Hello ", "world", "!")</f>
<v>Hello world!</v>
</c>
<c r="AA4" t="s">
<v>2</v>
</c>
</row>
...
</sheetData>
...
</worksheet>

The sheet's data is structured by row. With each row specifying thenon-empty cells. Cells contain values in the <v> element. here it get's more tricky. String values are not stored in the cell table, unless they are the result of a calculation. In stead you'll find a zero-based index as value, which points into the shared string table where that string is stored uniquely. This is done to optimize load/save performance and to reduce duplication of information. To determine whether the 0 in <v> is a number or an index to a string, the cell's data type must be examined. When the data type indicates string, then it is an index and not a numeric value.

In the example above,

  • For cell A1, the t="s" attribute, indicates it is a string. As such the 0 value in the <v> element is an index in the shared string table.
  • AA1 has the same value as A1
  • A2 has a numeric value 1.23
  • A4 has a string-based formula, as such the value is stored in the cell itself

Let's now have a closer look at the shared strings table, here is an example.

It's a list of <si> elements. These can be plain text, or be rich text strings. As we will see later on this is important when retrieving the string value.

<sst
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
count="5"
uniqueCount="3">
<si>
<t>Hello world!</t>
</si>
<si>
<r>
<t xml:space="preserve">Hello </t>
</r>
<r>
<rPr>
<b/>
<sz val="11"/>
<color rgb="FF00B050"/>
<rFont val="Calibri"/>
<family val="2"/>
<scheme val="minor"/>
</rPr>
<t>wonderful</t>
</r>
<r>
<rPr>
<sz val="11"/>
<color theme="1"/>
<rFont val="Calibri"/>
<family val="2"/>
<scheme val="minor"/>
</rPr>
<t xml:space="preserve"> </t>
</r>
<r>
<rPr>
<i/>
<sz val="11"/>
<color rgb="FFFF0000"/>
<rFont val="Calibri"/>
<family val="2"/>
<scheme val="minor"/>
</rPr>
<t>world</t>
</r>
<r>
<rPr>
<sz val="11"/>
<color theme="1"/>
<rFont val="Calibri"/>
<family val="2"/>
<scheme val="minor"/>
</rPr>
<t>!</t>
</r>
</si>
<si>
<t>Hello wonderful world!</t>
</si>
</sst>

So, back to the question. The next query is the answer.

 1: declare namespace ssml = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";
2: declare variable $sheet := doc("jar:file:///C:/example.xlsx!/xl/worksheets/sheet1.xml");
3: declare variable $sharedStrings := doc("jar:file:///C:/example.xlsx!/xl/sharedStrings.xml");
4: for $c in $sheet/ssml:worksheet/ssml:sheetData/ssml:row/ssml:c[matches(@r,"^A[0-9]+")]
5: let $str :=
6: if ($c/@t = "s") then
7: string-join(($sharedStrings/ssml:sst/ssml:si)[xs:integer($c/ssml:v)+1]//ssml:t,"")
8: else
9: string($c/ssml:v)
10: where string-length($str)>20
11: return
12: <cell id="{$c/@r}">{$str}</cell>
1. Declares the SpreadsheetML namespace
2. Specifies the XML document representing the first sheet in the spreadsheet
3. Specifies the XML document containing the shared strings table
4. Get all cells in the A column, we use a regular expression to filter all cells starting with an A, followed by a number of alpha numeric characters
5.
6. Test if the cell contains a string value or not
7. Lookup the string value in the shared strings table. As the index in spreadsheetML is zero-based we need to increment the index by one before accessing the n-th element. As the string can be rich text, we string-join all descendent <t> elements
8.
9. Get the value contained in the cell itself
10. As specified in the original question, only get the values of more than 20 characters
11.
12. Return the cell row/column number and the actual value

After all this query is longer than one might have thought initially, given the rather simple question. However, all complexity is there due to the Open Office XML standard. Can you imagine to write the equivalent in for example Java?

Out of scope for this post, but the solution can easily be enhanced, for example to query all sheets in the spreadsheet.

Someone wants to try and write the same for spreadsheets saved in OpenDocument Format?

Labels: , , , ,