XML Connections

Monday, February 18, 2008

Converting proprietary file formats to specific XML structures


We have talked in the past about how you can use XQuery to deal with EDI message (transform a directory of EDI messages, create EDI messages out of your database, converting tab delimited files into EDI). But sometimes you are not dealing with EDI; it's not infrequent that companies still rely on proprietary file formats to exchange or circulate information.
For example, a few days ago we received a request from a user who needed to convert a file in a proprietary format into a specific XML structure. The sample file he sent us looked something like this:


10       0123456789MR JOHN            SMITH
20       0123456789ACCT1
30       0123456789SALARY       500000
30       0123456789BONUS          1000
20       0123456789ACCT2
30       0123456789OTHER           100
10       1234566790MR DAVID           DOE
20       1234567890ACCT1
30       1234567890SALARY        10000


This isn't EDI, CSV, tab delimited or any of the other standard or semi-standard ways to represent information in text files; it's a proprietary format based on a specific interpretation of the size and position of the fields. This is the description of records and fields that the user shared with us:


one block is made of:
  one record 10 (customer)
  1 to N record 20 per record 10 (accounts)
  1 to N record 30 per record 20 (transactions)

record 10
  record type
on 10 characters
  customer id on 10 characters
  customer lastname on 20 characters
  customer firstname on 30 characters

record 20
  record type
on 10 characters
  customer id on 10 characters
  account id on 10 characters

record 30
  record type
on 10 characters
  customer id on 10 characters
  label on 10 characters
  value on 10 characters

How can we deal with something like this without building our own parser? DataDirect XML Converters and Stylus Studio can definitely help! XML Converters support what we call "custom conversions"; a custom conversion defines how files belonging to the same format family should be converted to XML by XML Converters. As the definition of a custom conversion can be a tedious task, Stylus Studio provides an editor that allows you to create custom conversion through an intuitive graphical interface. Explaining how that would work in this case is probably more complicated than actually doing it... so, here is a short video which describes all the operations you would do in Stylus Studio to convert the file provided by the user into XML.


Thanks to the custom conversion created using Stylus Studio, we are now able to manipulate the proprietary format file as XML; but, as mentioned above, the user needs to convert the file into a specific XML format. Using the example above, the desired XML result should look like this:


<root>
    <record_10>
        <type>10</type>
        <customer_id>0123456789</customer_id>
        <lastname>MR JOHN</lastname>
        <firstname>SMITH</firstname>
    </record_10>
    <record_20>
        <type>20</type>
        <customer_id>0123456789</customer_id>
        <account_id>ACCT1</account_id>
    </record_20>
    <record_30>
        <type>30</type>
        <customer_id>0123456789</customer_id>
        <label>SALARY</label>
        <value>500000</value>
    </record_30>
    <record_30>
        <type>30</type>
        <customer_id>0123456789</customer_id>
        <label>BONUS</label>
        <value>1000</value>
    </record_30>
    <record_20>
        <type>20</type>
        <customer_id>0123456789</customer_id>
        <account_id>ACCT2</account_id>
    </record_20>
    <record_30>
        <type>30</type>
        <customer_id>0123456789</customer_id>
        <label>OTHER</label>
        <value>100</value>
    </record_30>
    <record_10>
        <type>10</type>
        <customer_id>1234566790</customer_id>
        <lastname>MR DAVID</lastname>
        <firstname>DOE</firstname>
    </record_10>
    <record_20>
        <type>20</type>
        <customer_id>1234567890</customer_id>
        <account_id>ACCT1</account_id>
    </record_20>
    <record_30>
        <type>30</type>
        <customer_id>1234567890</customer_id>
        <label>SALARY</label>
        <value>10000</value>
    </record_30></root>

This is where the combination of DataDirect XQuery and XML Converters proves particularly helpful: not only we are able to move from a proprietary file format to well formed XML, but we also have all the power, performance and scalability of XQuery to deal with such XML to transform it (potentially, even augment it!) into the final format that we need. The following XQuery - which implicitly relies on the custom converter described above - does the trick; notice that we created a couple of functions to take care of the positional grouping which is required to generate the desired output; that makes the XQuery much easier to read and more reusable:


declare function local:getRelated20s($item) {
    let $nextItem :=$item/following-sibling::*[local-name()!="record_20"and local-name()!="record_30"][1]
    for $related in $item/following-sibling::*[local-name()="record_20"]
    where if ($nextItem) then $related << $nextItem else true()
    return $related
};
declare function local:getRelated30s($item) {
    let $nextItem := $item/following-sibling::*[local-name()!="record_30"][1]
    for $related in $item/following-sibling::*[local-name()="record_30"]
    where if($nextItem) then $related << $nextItem else true()
    return $related
};
<DataSets>
    {
    for $record_10 in doc('converter:file:///c:/blog1/sample.conv?sampleinput.txt')/root/record_10
    return
    <DataSet>
        <CustomerId>
            {$record_10/customer_id/text()}
        </CustomerId>
        <CustomerName>
            {$record_10/lastname/text()}
        </CustomerName>
        <CustomerFirstName>
            {$record_10/firstname/text()}
        </CustomerFirstName>
        <Accounts>
            {
            for $record_20 in local:getRelated20s($record_10)
            return
            <Account>
                <AccountId>
                    {$record_20/account_id/text()}
                </AccountId>
                <Transactions> {
                    for $record_30 in local:getRelated30s($record_20)
                    return
                        <Transaction>
                            <Label>{$record_30/label/text()}</Label>
                            <Value>{$record_30/value/text()}</Value>
                        </Transaction>
                } </Transactions>
            </Account>
            }
        </Accounts>
    </DataSet>
    }
</DataSets>

Once again, thanks to the fact that DataDirect XQuery and XML Converters make it possible to manipulate a wide variety of data sources (even proprietary ones!), you are able to deal with transformation (and possibly aggregation) problems from an XML point of view, shielding as much as possible those tasks from the low level details of the data sources.

Labels: , , , ,