XML Connections

Thursday, March 20, 2008

Using XQuery to join two Excel spreadsheets? Why not?


This afternoon I was having some fun (not!) reviewing some third party license agreements, when someone emailed me a couple of Excel spreadsheets saying "Hey Minollo! I have these two spreadsheets; one has a list of all employees since 1990 in Progress Corporation and the cities where they lived when they were hired; the other one has a list of the current employees and in which Progress division they work; I need to create a list of all the current employees with details about the division they work in and the town the live (or lived) in. Is there anything that "XQuery thing" you always talk me about can do for me here?"

Well, the temptation to keep reviewing license agreements rather than playing with XQuery was strong... but I resisted! I had to help this guy! :)

I was happy to notice that the spreadsheets had been saved using the Office Open XML (OOXML) format; OOXML is basically a package of XML files describing the document (text, spreadsheet, slides) that it represents. Being XML-based, it's quite easy to inspect it using an XML query language; the package is a zip format, so individual documents can be retrieved using the standard "jar:" file scheme, for example.

So, coming back to the guy asking for help joining two spreadsheets; what can we do for him? If you look at the structure of the two spreadsheets he sent me (attached to this post), you'll see two XML documents that are relevant to us:

  • xl/worksheets/sheet1.xml
    This file contains the content of each non-empty cell in the spreadsheet. The document is split in elements and each element has a number of cells represented by elements. When the content of a element is a character string, like in this particular case, the element contains a element whose value is the index of the corresponding character string in a list of all the character strings used in the spreadsheet

  • xl/sharedStrings.xml
    This file contains all the character strings used as content of cells in the spreadsheet; the order is relevant, as the character strings are referenced using an index (like described above)

To retrieve the whole content of xl/worksheets/sheet1.xml from c:\doc1.xlsx using a Java-based XQuery processor like DataDirect XQuery, you can just us the standard fn:doc() function:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")

The index corresponding to the character string contained in the first cell of the first row in the spreadsheet can be retrieved doing:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")//
   ooxml:sheetData/ooxml:row[1]/ooxml:c[1]/ooxml:v

...where ooxml is a namespace prefix associated to the URI http://schemas.openxmlformats.org/spreadsheetml/2006/main .

And if you have the index of the character string, the character string itself can be retrieved as:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml")//
   ooxml:si[$theIndex+1]/ooxml:t/text()

Easy, isn't it!?

Add some syntax, a couple of FLWOR expressions, a few HTML tags, and here you go! This XQuery just joins the two spreadsheets and creates an easy to read HTML report:


 declare namespace ooxml=
    "
http://schemas.openxmlformats.org/spreadsheetml/2006/main";
 let $doc1Strings := doc("jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml")//ooxml:si
 let $doc2Strings := doc("jar:file:///c:/doc2.xlsx!/xl/sharedStrings.xml")//ooxml:si
 return
  <table> {
   <tr style="font-weight: bold"><td>Name</td><td>Division</td><td>City</td></tr>,
   for $row1 in
    doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")//
          ooxml:sheetData/ooxml:row,
       $row2 in
    doc("jar:file:///c:/doc2.xlsx!/xl/worksheets/sheet1.xml")//
          ooxml:sheetData/ooxml:row
   let $col1-1 := $doc1Strings[$row1/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text()
   let $col2-1 := $doc2Strings[$row2/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text()
   where $col1-1 = $col2-1
   return
    let $col1-2 := $doc1Strings[$row1/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text()
    let $col2-2 := $doc2Strings[$row2/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text()
    return
     <tr>
      <td>{$col1-1}</td>
      <td>{$col2-2}</td>
      <td>{$col1-2}</td>
     </tr>
  } </table>

It's that simple! Here are the two spreadsheets, if you want to try it yourself: doc1.xlsx, doc2.xlsx

Labels: , , ,

Monday, October 22, 2007

Yes! You can finally query your office documents!


When I was working on an XML database in the late nineties, I remember hearing a lot of noise about the fact that — finally! — XML would allow for real reuse and collaboration when working with applications like word processors and spreadsheet editors. Copy like "Data is finally disjoint from format; store the data as XML, and use other languages [XSLT, typically] to take care of the format." was easy to find in industry publications and seemed to herald the start of something big.

Well, XML did work that way, but only in limited cases; for the vast majority of applications, XML didn't provide any of the benefits of separate data and formatting. Fast forward a decade, and there's now genuine promise in this area.

Two emerging standards, OpenDocument Format (ODF) and Office Open XML (OOXML), are gaining popularity (and originating several fights in standard bodies), and things are moving again, even if they're not moving in exactly the way people thought they would some time ago: neither ODF nor OOXML really create a clear separation between data and presentation. Instead, they have adopted an XML format that provides for a mix of data and presentation. But both standards have adopted XML, which means that you can finally use standard XML tools, like XQuery, to query your office documents!

Marc recently wrote a nice article about how you can leverage XQuery to query XML-based office document standards; he provides an excellent technical overview, and the included examples are good references about how to get started experimenting with XQuery and these new formats.


Labels: , , , ,