XML Connections

Thursday, March 20, 2008

Using XQuery to join two Excel spreadsheets? Why not?


This afternoon I was having some fun (not!) reviewing some third party license agreements, when someone emailed me a couple of Excel spreadsheets saying "Hey Minollo! I have these two spreadsheets; one has a list of all employees since 1990 in Progress Corporation and the cities where they lived when they were hired; the other one has a list of the current employees and in which Progress division they work; I need to create a list of all the current employees with details about the division they work in and the town the live (or lived) in. Is there anything that "XQuery thing" you always talk me about can do for me here?"

Well, the temptation to keep reviewing license agreements rather than playing with XQuery was strong... but I resisted! I had to help this guy! :)

I was happy to notice that the spreadsheets had been saved using the Office Open XML (OOXML) format; OOXML is basically a package of XML files describing the document (text, spreadsheet, slides) that it represents. Being XML-based, it's quite easy to inspect it using an XML query language; the package is a zip format, so individual documents can be retrieved using the standard "jar:" file scheme, for example.

So, coming back to the guy asking for help joining two spreadsheets; what can we do for him? If you look at the structure of the two spreadsheets he sent me (attached to this post), you'll see two XML documents that are relevant to us:

  • xl/worksheets/sheet1.xml
    This file contains the content of each non-empty cell in the spreadsheet. The document is split in elements and each element has a number of cells represented by elements. When the content of a element is a character string, like in this particular case, the element contains a element whose value is the index of the corresponding character string in a list of all the character strings used in the spreadsheet

  • xl/sharedStrings.xml
    This file contains all the character strings used as content of cells in the spreadsheet; the order is relevant, as the character strings are referenced using an index (like described above)

To retrieve the whole content of xl/worksheets/sheet1.xml from c:\doc1.xlsx using a Java-based XQuery processor like DataDirect XQuery, you can just us the standard fn:doc() function:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")

The index corresponding to the character string contained in the first cell of the first row in the spreadsheet can be retrieved doing:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")//
   ooxml:sheetData/ooxml:row[1]/ooxml:c[1]/ooxml:v

...where ooxml is a namespace prefix associated to the URI http://schemas.openxmlformats.org/spreadsheetml/2006/main .

And if you have the index of the character string, the character string itself can be retrieved as:

fn:doc("jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml")//
   ooxml:si[$theIndex+1]/ooxml:t/text()

Easy, isn't it!?

Add some syntax, a couple of FLWOR expressions, a few HTML tags, and here you go! This XQuery just joins the two spreadsheets and creates an easy to read HTML report:


 declare namespace ooxml=
    "
http://schemas.openxmlformats.org/spreadsheetml/2006/main";
 let $doc1Strings := doc("jar:file:///c:/doc1.xlsx!/xl/sharedStrings.xml")//ooxml:si
 let $doc2Strings := doc("jar:file:///c:/doc2.xlsx!/xl/sharedStrings.xml")//ooxml:si
 return
  <table> {
   <tr style="font-weight: bold"><td>Name</td><td>Division</td><td>City</td></tr>,
   for $row1 in
    doc("jar:file:///c:/doc1.xlsx!/xl/worksheets/sheet1.xml")//
          ooxml:sheetData/ooxml:row,
       $row2 in
    doc("jar:file:///c:/doc2.xlsx!/xl/worksheets/sheet1.xml")//
          ooxml:sheetData/ooxml:row
   let $col1-1 := $doc1Strings[$row1/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text()
   let $col2-1 := $doc2Strings[$row2/ooxml:c[1]/ooxml:v/number()+1]/ooxml:t/text()
   where $col1-1 = $col2-1
   return
    let $col1-2 := $doc1Strings[$row1/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text()
    let $col2-2 := $doc2Strings[$row2/ooxml:c[2]/ooxml:v/number()+1]/ooxml:t/text()
    return
     <tr>
      <td>{$col1-1}</td>
      <td>{$col2-2}</td>
      <td>{$col1-2}</td>
     </tr>
  } </table>

It's that simple! Here are the two spreadsheets, if you want to try it yourself: doc1.xlsx, doc2.xlsx

Labels: , , ,

Monday, February 18, 2008

Converting proprietary file formats to specific XML structures


We have talked in the past about how you can use XQuery to deal with EDI message (transform a directory of EDI messages, create EDI messages out of your database, converting tab delimited files into EDI). But sometimes you are not dealing with EDI; it's not infrequent that companies still rely on proprietary file formats to exchange or circulate information.
For example, a few days ago we received a request from a user who needed to convert a file in a proprietary format into a specific XML structure. The sample file he sent us looked something like this:


10       0123456789MR JOHN            SMITH
20       0123456789ACCT1
30       0123456789SALARY       500000
30       0123456789BONUS          1000
20       0123456789ACCT2
30       0123456789OTHER           100
10       1234566790MR DAVID           DOE
20       1234567890ACCT1
30       1234567890SALARY        10000


This isn't EDI, CSV, tab delimited or any of the other standard or semi-standard ways to represent information in text files; it's a proprietary format based on a specific interpretation of the size and position of the fields. This is the description of records and fields that the user shared with us:


one block is made of:
  one record 10 (customer)
  1 to N record 20 per record 10 (accounts)
  1 to N record 30 per record 20 (transactions)

record 10
  record type
on 10 characters
  customer id on 10 characters
  customer lastname on 20 characters
  customer firstname on 30 characters

record 20
  record type
on 10 characters
  customer id on 10 characters
  account id on 10 characters

record 30
  record type
on 10 characters
  customer id on 10 characters
  label on 10 characters
  value on 10 characters

How can we deal with something like this without building our own parser? DataDirect XML Converters and Stylus Studio can definitely help! XML Converters support what we call "custom conversions"; a custom conversion defines how files belonging to the same format family should be converted to XML by XML Converters. As the definition of a custom conversion can be a tedious task, Stylus Studio provides an editor that allows you to create custom conversion through an intuitive graphical interface. Explaining how that would work in this case is probably more complicated than actually doing it... so, here is a short video which describes all the operations you would do in Stylus Studio to convert the file provided by the user into XML.


Thanks to the custom conversion created using Stylus Studio, we are now able to manipulate the proprietary format file as XML; but, as mentioned above, the user needs to convert the file into a specific XML format. Using the example above, the desired XML result should look like this:


<root>
    <record_10>
        <type>10</type>
        <customer_id>0123456789</customer_id>
        <lastname>MR JOHN</lastname>
        <firstname>SMITH</firstname>
    </record_10>
    <record_20>
        <type>20</type>
        <customer_id>0123456789</customer_id>
        <account_id>ACCT1</account_id>
    </record_20>
    <record_30>
        <type>30</type>
        <customer_id>0123456789</customer_id>
        <label>SALARY</label>
        <value>500000</value>
    </record_30>
    <record_30>
        <type>30</type>
        <customer_id>0123456789</customer_id>
        <label>BONUS</label>
        <value>1000</value>
    </record_30>
    <record_20>
        <type>20</type>
        <customer_id>0123456789</customer_id>
        <account_id>ACCT2</account_id>
    </record_20>
    <record_30>
        <type>30</type>
        <customer_id>0123456789</customer_id>
        <label>OTHER</label>
        <value>100</value>
    </record_30>
    <record_10>
        <type>10</type>
        <customer_id>1234566790</customer_id>
        <lastname>MR DAVID</lastname>
        <firstname>DOE</firstname>
    </record_10>
    <record_20>
        <type>20</type>
        <customer_id>1234567890</customer_id>
        <account_id>ACCT1</account_id>
    </record_20>
    <record_30>
        <type>30</type>
        <customer_id>1234567890</customer_id>
        <label>SALARY</label>
        <value>10000</value>
    </record_30></root>

This is where the combination of DataDirect XQuery and XML Converters proves particularly helpful: not only we are able to move from a proprietary file format to well formed XML, but we also have all the power, performance and scalability of XQuery to deal with such XML to transform it (potentially, even augment it!) into the final format that we need. The following XQuery - which implicitly relies on the custom converter described above - does the trick; notice that we created a couple of functions to take care of the positional grouping which is required to generate the desired output; that makes the XQuery much easier to read and more reusable:


declare function local:getRelated20s($item) {
    let $nextItem :=$item/following-sibling::*[local-name()!="record_20"and local-name()!="record_30"][1]
    for $related in $item/following-sibling::*[local-name()="record_20"]
    where if ($nextItem) then $related << $nextItem else true()
    return $related
};
declare function local:getRelated30s($item) {
    let $nextItem := $item/following-sibling::*[local-name()!="record_30"][1]
    for $related in $item/following-sibling::*[local-name()="record_30"]
    where if($nextItem) then $related << $nextItem else true()
    return $related
};
<DataSets>
    {
    for $record_10 in doc('converter:file:///c:/blog1/sample.conv?sampleinput.txt')/root/record_10
    return
    <DataSet>
        <CustomerId>
            {$record_10/customer_id/text()}
        </CustomerId>
        <CustomerName>
            {$record_10/lastname/text()}
        </CustomerName>
        <CustomerFirstName>
            {$record_10/firstname/text()}
        </CustomerFirstName>
        <Accounts>
            {
            for $record_20 in local:getRelated20s($record_10)
            return
            <Account>
                <AccountId>
                    {$record_20/account_id/text()}
                </AccountId>
                <Transactions> {
                    for $record_30 in local:getRelated30s($record_20)
                    return
                        <Transaction>
                            <Label>{$record_30/label/text()}</Label>
                            <Value>{$record_30/value/text()}</Value>
                        </Transaction>
                } </Transactions>
            </Account>
            }
        </Accounts>
    </DataSet>
    }
</DataSets>

Once again, thanks to the fact that DataDirect XQuery and XML Converters make it possible to manipulate a wide variety of data sources (even proprietary ones!), you are able to deal with transformation (and possibly aggregation) problems from an XML point of view, shielding as much as possible those tasks from the low level details of the data sources.

Labels: , , , ,

Tuesday, January 29, 2008

Converting tab delimited files into EDI


A few days ago I received a email saying "I am a DataDirect XQuery user, and I'm successfully using your product to create XML-based reports from incoming EDI messages. Now I'm in need to do something somewhat different: it may happen that my company receives tab-delimited messages, and those messages need to be translated to EDI X12 831, rather than converted to XML. Can your product help me handling tab-delimited to EDI conversions?".

Not being a marketing person, I usually like showing how things are done, rather than just claiming that they can be easily done using our products... :)
So, I asked the user for an example of his data, a tab-delimited file. If you have read some of the other posts in this Blog, you will know that DataDirect XQuery can access tab-delimited files thanks to the XML Converters technology. So, consider this tab-delimited file:

id description quantity unitprice
001 Telephone 1 23.00
002 Desk 1 129.00
003 Keyboard 2 21.00

The following XQuery will return, for example, all the description fields in the tab-delimited input message:

 <items> {
  for $item in doc("converter:TAB:first=yes?file:///c:/sample.tab")//row
  return $item/description
 } </items>

The result is:

 <items>
  <description>Telephone</description>
  <description>Desk</description>
  <description>Keyboard</description>
 </items>

That shows that we can handle a tab-delimited file in DataDirect XQuery and generate XML; but it doesn't help much our user: he needs to output EDI.
But we know how to do that; Marc has shown us in this post how you can create EDI messages from XQuery. We just need to apply the same idea to our example; the following XQuery uses the incoming tab-delimited file to apparently generate an XML fragment consistent with a specific vocabulary; that vocabulary is what the underlying XML Converters technology knows how to translate in the corresponding EDI message (an X12 832 message in this case, which is a price/sales catalog). The serialization option at the beginning of the XQuery is instructing DataDirect XQuery to output a raw EDI message rather than XML; that way we are moving data from a tab-delimited file format into an EDI message manipulating it as if it was all XML, but without ever materializing it as XML! Many of the EDI fields typically specified are omitted here; but this example will run if you want to see how it works.

 declare option ddtek:serialize "method=EDI,long=yes";
 <X12>
  <ISA>
   <ISA06-InterchangeSenderId>1515151515</ISA06-InterchangeSenderId>
   <ISA08-InterchangeReceiverId>5151515151</ISA08-InterchangeReceiverId>
   <ISA11-RepetitionSeparator>^</ISA11-RepetitionSeparator>
   <ISA13-InterchangeControlNumber>000032123</ISA13-InterchangeControlNumber>
   <ISA14-AcknowledgmentRequested>0</ISA14-AcknowledgmentRequested>
   <ISA15-UsageIndicator>P</ISA15-UsageIndicator>
   <ISA16-ComponentElementSeparator>*</ISA16-ComponentElementSeparator>
  </ISA>
  <GS>
   <GS01-FunctionalIdentifierCode>CT</GS01-FunctionalIdentifierCode>
   <GS02-ApplicationSendersCode>9988776655</GS02-ApplicationSendersCode>
   <GS03-ApplicationReceiversCode>1122334455</GS03-ApplicationReceiversCode>
   <GS06-GroupControlNumber>128</GS06-GroupControlNumber>
   <GS07-ResponsibleAgencyCode>X</GS07-ResponsibleAgencyCode>
   <GS08-VersionReleaseIndustry>004030</GS08-VersionReleaseIndustry>
  </GS>
  <TS_832>
   <ST>
    <ST01-TransactionSetIdentifierCode>832</ST01-TransactionSetIdentifierCode>
    <ST02-TransactionSetControlNumber>12345</ST02-TransactionSetControlNumber>
   </ST>
   <BCT>
    <BCT01-CatalogPurposeCode>CP</BCT01-CatalogPurposeCode>
    <BCT02-CatalogNumber>GOV56789</BCT02-CatalogNumber>
    <BCT10-TransactionSetPurposeCode>00</BCT10-TransactionSetPurposeCode>
   </BCT>
   {
    for $row in doc('converter:TAB:first=yes?file:///c:/sample.tab')/table/row
    return
     <GROUP_5>
      <LIN>
       <LIN02-ProductServiceIdQualifier>MF</LIN02-ProductServiceIdQualifier>
       <LIN03-ProductServiceId>{$row/id/text()}</LIN03-ProductServiceId>
       <LIN07-ProductServiceId>{$row/description/text()}</LIN07-ProductServiceId>
       <LIN09-ProductServiceId>{$row/quantity/text()}</LIN09-ProductServiceId>
      </LIN>
      <GROUP_6>
       <CTP>
        <CTP03-UnitPrice>{$row/unitprice/text()}</CTP03-UnitPrice>
       </CTP>
      </GROUP_6>
     </GROUP_5>
   }
   <SE/>
  </TS_832>
  <GE/>
 </X12>

Curious about the result? Here you go:
ISA+00+ +00+ +ZZ+1515151515 +ZZ+5151515151 +080129+1710+U+00000+000032123+0+P+*'
GS+CT+9988776655+1122334455+20080129+1710+128+X+004030'
ST+832+12345'
BCT+CP+GOV56789++++++++00'
LIN++MF+001++++Telephone++1'
CTP+++23'
LIN++MF+002++++Desk++1'
CTP+++129'
LIN++MF+003++++Keyboard++2'
CTP+++21'
SE+9+12345'
GE+1+128'
IEA+1+000032123'

The beauty of this approach is that you can use it in conjunction with all the other data aggregation capabilities of DataDirect XQuery, like possibly adding data to the outgoing EDI message which is not available in the incoming tab-delimited file but available inside a relational database (like a encoded field that may need to be looked up in a table); and this will all work taking advantage of the performance and scalability features of DataDirect XQuery and XML Converters: the example above will seamlessly stream from tab-delimited to EDI without ever allocating in memory for than a few items per time.

The same technique applies of course to all the data sources supported by DataDirect XQuery and XML Converters; you can find more details at the following links:

Labels: , , , , ,

Tuesday, January 8, 2008

Standard XQuery support for MySQL(R): it's available now!


As you may have noticed if you paid attention to our recent new releases, DataDirect XQuery now fully supports MySQL(R).
That means that finally MySQL users have access to a standard way (XQuery) to get access to their data as XML. To see how all this works, and how MySQL users can take advantage of this opportunity, we have created a few examples on our web site; see how easy it is to extract data from MySQL as XML, how to aggregate it with XML documents or even EDI documents; and how you can even update your MySQL data directly from XQuery.

Labels: , , , ,

Monday, December 17, 2007

New product versions released!


I just wanted to let you know that we (DataDirect) have recently released new version of our XML products; all of them are available for download and free evaluation:

DataDirect XQuery 3.1
Complete release highlights are available here.
  • New database support for DB2 v9.1 for z/OS, MySQL, Oracle 11g, Informix, PostgreSQL
  • Support for XML-based office document standards (OpenDocument Format, Office Open XML, new PDF versions)
  • Richer support for XQuery output formats (EDI and flat files)
  • New Eclipse-based plug-in for developing and testing queries
  • Performance enhancements
DataDirect XML Converters 3.1
Complete release highlights available here.
  • Support for the HL7 EDI standard
  • XML Schema API generates an XML Schema that describes how to convert XML to non-XML
  • Support for Standard Exchange Format (SEF) to manage custom EDI message types
  • New exception and error listener interfaces for flexible error handling
Stylus Studio 2008
Complete release highlights availabe here.
  • Support for DataDirect XQuery 3.1 and XML Converters 3.1 (embedded in the product)
  • New WSDL editor
  • C# code generation for .NET
  • Simplified generated Java code
  • Auto-complete XML Fragments
  • XML Schema Editor Enhancements
In the coming weeks I'm planning to post examples showing how you can use the new (and old) functionality in our XML products. Of course, if there are specific examples you would like to see covered, I would be very interested in hearing about them.

Labels: , , ,

Monday, November 19, 2007

Join us at the XML 2007 Conference!


The annual XML Conference is always a great opportunity to hear about XML technologies from the people that work on XML standards and products.

This year's XML 2007 Conference is happening in Boston, starting on the 3rd of December.

We will be there, talking about XML Converters (Tuesday, 2:45pm) and DataDirect XQuery (Wednesday 1pm, 2pm, 3pm and 4pm!).

In addition to that, DataDirect is also hosting a breakfast session, where we are planning to have some fun playing games and giving away prizes (promised! No slides during breakfast!).

Hope to see you all there!

Tech Tags:

Labels: , ,

Monday, October 22, 2007

Yes! You can finally query your office documents!


When I was working on an XML database in the late nineties, I remember hearing a lot of noise about the fact that — finally! — XML would allow for real reuse and collaboration when working with applications like word processors and spreadsheet editors. Copy like "Data is finally disjoint from format; store the data as XML, and use other languages [XSLT, typically] to take care of the format." was easy to find in industry publications and seemed to herald the start of something big.

Well, XML did work that way, but only in limited cases; for the vast majority of applications, XML didn't provide any of the benefits of separate data and formatting. Fast forward a decade, and there's now genuine promise in this area.

Two emerging standards, OpenDocument Format (ODF) and Office Open XML (OOXML), are gaining popularity (and originating several fights in standard bodies), and things are moving again, even if they're not moving in exactly the way people thought they would some time ago: neither ODF nor OOXML really create a clear separation between data and presentation. Instead, they have adopted an XML format that provides for a mix of data and presentation. But both standards have adopted XML, which means that you can finally use standard XML tools, like XQuery, to query your office documents!

Marc recently wrote a nice article about how you can leverage XQuery to query XML-based office document standards; he provides an excellent technical overview, and the included examples are good references about how to get started experimenting with XQuery and these new formats.


Labels: , , , ,

Tuesday, October 16, 2007

Converting between EDI and XML... Why?


EDI (Electronic Data Interchange) pre-dates XML by several years; people started using the term EDI to describe the transfer of data, typically across multiple companies, often using VANs (Value Added Networks) or the Internet. Many standards bodies have been creating EDI standards in the past several years; some of the most popular EDI standards in active use today include X12, HL7, EDIFACT, IATA.

And how is XML related to all this? Since the start of XML's adoption in the late 90's, one of XML's primary applications was in handling B2B or even B2C data interchange. The benefits compared to EDI were obvious, numerous, and embedded in the very nature of XML: human readable, easily extensible, easy to validate (well, after XML Schema saw the light, at least), easily parsed using off-the-shelf components, and easily accessed through standard interfaces like DOM, Sax, and StAX. None of these features applies to the EDI standards, as anyone dealing with EDI knows very well.

So, there is no point in worrying about EDI standards; we can just use XML and rely on powerful XML tools like DataDirect XQuery to create and manipulate messages used for data interchange across companies, right? Wrong.

It is true that "the world" is moving in the direction of using XML more and more to deal with data interchange. But it's also true that there are so many existing critical applications based on EDI standards, in so many different industry verticals (health, airlines, and insurance, to name but a few), that EDI is not going away anytime soon.

As you can imagine, the contemporaneous existence of two strong data interchange formats creates another problem: you are company A, and you deal with company B, which "speaks" only EDI, and with company C, which has recently upgraded all its systems which now "speak" only XML... (I wish I were making up this scenario for illustrative purposes, but I am not.) You must become bilingual, speaking and understanding both EDI and XML. Many companies try to address that problem creating ad-hoc code that knows how to deal with EDI and XML, and the many variants that various partner companies may be using in their EDI approach; but that's far from an optimal approach, as developing and testing applications becomes a nightmare, and handling changes is a big challenge.

That's exactly where a product like DataDirect XML Converters can help — XML Converters are streaming-based Java and .NET components that translate between EDI and XML. This means that if your company (company A) receives EDI messages from a partner company (company B), you can still treat that EDI as XML data thanks to XML Converters; similarly, if you need to send an EDI message to a partner company, you can create XML and then rely on XML Converters to translate that XML into EDI. You don't need to write EDI parsers; you can handle variations from the EDI standards through XML Converters; and you can just focus on manipulating XML, preferably through a language like XQuery.

Edict Systems understood the benefits that DataDirect XML Converters bring to the table, and they are now part of the growing number of companies that use XML Converters to handle business situations in which EDI and XML need to co-exist, or where developers want to be shielded from the task of parsing and creating raw EDI. An interesting case study of Edict System's DataDirect XML Converters implementation is now available on the DataDirect XML Converters web site: http://www.xmlconverters.com/customers/edict-systems.html

There's much more to write about other usecases where using DataDirect XML Converters and DataDirect XQuery together provide a very powerful combination, and I'll be blogging about that in future posts.



Tech Tags:

Labels: , , ,