XML Connections

Sunday, June 8, 2008

Query office documents, a more detailed look

A question about querying office spreadsheets, on the Mulberry XSL list.

...
All I need to do is retrieve the string value of every cell of
column "A" whose string length is greater than 20 characters in
length.
Anyone know how to use XPath to retrieve this from an Excel
2007 file?
...

Querying office documents with XQuery, or another XML query language like XPath or XSLT, is a popular reoccurring question. We believe DataDirect XQuery should be in your toolset when it comes to querying office documents. Hence we have blogged about it a few times before.

Back to the question, looks like a fairly simple question, isn't it? Let's try... Let's start from a concrete OOXML example, a worksheet might look as follows.

<worksheet
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
...
<sheetData>
<row r="1" spans="1:27">
<c r="A1" t="s">
<v>0</v>
</c>
<c r="AA1" t="s">
<v>0</v>
</c>
</row>
<row r="2" spans="1:27">
<c r="A2">
<v>1.23</v>
</c>
</row>
<row r="3" spans="1:27">
<c r="A3" t="s">
<v>1</v>
</c>
</row>
<row r="4" spans="1:27">
<c r="A4" t="str">
<f>CONCATENATE("Hello ", "world", "!")</f>
<v>Hello world!</v>
</c>
<c r="AA4" t="s">
<v>2</v>
</c>
</row>
...
</sheetData>
...
</worksheet>

The sheet's data is structured by row. With each row specifying thenon-empty cells. Cells contain values in the <v> element. here it get's more tricky. String values are not stored in the cell table, unless they are the result of a calculation. In stead you'll find a zero-based index as value, which points into the shared string table where that string is stored uniquely. This is done to optimize load/save performance and to reduce duplication of information. To determine whether the 0 in <v> is a number or an index to a string, the cell's data type must be examined. When the data type indicates string, then it is an index and not a numeric value.

In the example above,

  • For cell A1, the t="s" attribute, indicates it is a string. As such the 0 value in the <v> element is an index in the shared string table.
  • AA1 has the same value as A1
  • A2 has a numeric value 1.23
  • A4 has a string-based formula, as such the value is stored in the cell itself

Let's now have a closer look at the shared strings table, here is an example.

It's a list of <si> elements. These can be plain text, or be rich text strings. As we will see later on this is important when retrieving the string value.

<sst
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
count="5"
uniqueCount="3">
<si>
<t>Hello world!</t>
</si>
<si>
<r>
<t xml:space="preserve">Hello </t>
</r>
<r>
<rPr>
<b/>
<sz val="11"/>
<color rgb="FF00B050"/>
<rFont val="Calibri"/>
<family val="2"/>
<scheme val="minor"/>
</rPr>
<t>wonderful</t>
</r>
<r>
<rPr>
<sz val="11"/>
<color theme="1"/>
<rFont val="Calibri"/>
<family val="2"/>
<scheme val="minor"/>
</rPr>
<t xml:space="preserve"> </t>
</r>
<r>
<rPr>
<i/>
<sz val="11"/>
<color rgb="FFFF0000"/>
<rFont val="Calibri"/>
<family val="2"/>
<scheme val="minor"/>
</rPr>
<t>world</t>
</r>
<r>
<rPr>
<sz val="11"/>
<color theme="1"/>
<rFont val="Calibri"/>
<family val="2"/>
<scheme val="minor"/>
</rPr>
<t>!</t>
</r>
</si>
<si>
<t>Hello wonderful world!</t>
</si>
</sst>

So, back to the question. The next query is the answer.

 1: declare namespace ssml = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";
2: declare variable $sheet := doc("jar:file:///C:/example.xlsx!/xl/worksheets/sheet1.xml");
3: declare variable $sharedStrings := doc("jar:file:///C:/example.xlsx!/xl/sharedStrings.xml");
4: for $c in $sheet/ssml:worksheet/ssml:sheetData/ssml:row/ssml:c[matches(@r,"^A[0-9]+")]
5: let $str :=
6: if ($c/@t = "s") then
7: string-join(($sharedStrings/ssml:sst/ssml:si)[xs:integer($c/ssml:v)+1]//ssml:t,"")
8: else
9: string($c/ssml:v)
10: where string-length($str)>20
11: return
12: <cell id="{$c/@r}">{$str}</cell>
1. Declares the SpreadsheetML namespace
2. Specifies the XML document representing the first sheet in the spreadsheet
3. Specifies the XML document containing the shared strings table
4. Get all cells in the A column, we use a regular expression to filter all cells starting with an A, followed by a number of alpha numeric characters
5.
6. Test if the cell contains a string value or not
7. Lookup the string value in the shared strings table. As the index in spreadsheetML is zero-based we need to increment the index by one before accessing the n-th element. As the string can be rich text, we string-join all descendent <t> elements
8.
9. Get the value contained in the cell itself
10. As specified in the original question, only get the values of more than 20 characters
11.
12. Return the cell row/column number and the actual value

After all this query is longer than one might have thought initially, given the rather simple question. However, all complexity is there due to the Open Office XML standard. Can you imagine to write the equivalent in for example Java?

Out of scope for this post, but the solution can easily be enhanced, for example to query all sheets in the spreadsheet.

Someone wants to try and write the same for spreadsheets saved in OpenDocument Format?

Labels: , , , ,

Monday, October 22, 2007

Yes! You can finally query your office documents!


When I was working on an XML database in the late nineties, I remember hearing a lot of noise about the fact that — finally! — XML would allow for real reuse and collaboration when working with applications like word processors and spreadsheet editors. Copy like "Data is finally disjoint from format; store the data as XML, and use other languages [XSLT, typically] to take care of the format." was easy to find in industry publications and seemed to herald the start of something big.

Well, XML did work that way, but only in limited cases; for the vast majority of applications, XML didn't provide any of the benefits of separate data and formatting. Fast forward a decade, and there's now genuine promise in this area.

Two emerging standards, OpenDocument Format (ODF) and Office Open XML (OOXML), are gaining popularity (and originating several fights in standard bodies), and things are moving again, even if they're not moving in exactly the way people thought they would some time ago: neither ODF nor OOXML really create a clear separation between data and presentation. Instead, they have adopted an XML format that provides for a mix of data and presentation. But both standards have adopted XML, which means that you can finally use standard XML tools, like XQuery, to query your office documents!

Marc recently wrote a nice article about how you can leverage XQuery to query XML-based office document standards; he provides an excellent technical overview, and the included examples are good references about how to get started experimenting with XQuery and these new formats.


Labels: , , , ,