XML Connections

Wednesday, August 27, 2008

XQuery for the SQL programmer – Publishing data as XML

Publishing data as XML... What did you expect, of course that such topic needs to be handled in a series like XQuery for the SQL Programmer. Despite the enormous usage of XML throughout most application development initiatives these days, publishing relation data as XML using SQL dialects or extensions is still cumbersome through. Cumbersome meaning that first this is basically handled through a different language, embedded in your SQL statements. And second, there is still no cross database solution to handle this, there are differences between various database flavors, frequently even between database versions.

Let’s start with a simple use case, and generate an XML structure as follows, based on the USERS table,

<users>
<user id="U01">
<name>Tom Jones</name>
<rating>B</rating>
</user>
<user id="U02">
<name>Mary Doe</name>
<rating>A</rating>
</user>
...
</users>

In SQL Server, using FOR XML PATH,

SELECT
USERID as '@id',
NAME as name,
RATING as rating
FROM USERS FOR XML PATH ('user'), ROOT('users')

Or if your database supports the SQL/XML standard, like Oracle 10gR2 or DB2 v9, you can use the SQL/XML publishing functions. Here is the Oracle 10gR2 variant, note that some of the details will need to be changed to make it work on DB2.

SELECT
xmlelement(name "users",
(SELECT
xmlagg(
xmlelement(name "user",
xmlattributes(USERID as id),
xmlelement(name "name", NAME),
xmlelement(name "rating", RATING)
)
)
FROM USERS)
)
FROM DUAL

In XQuery we could write the following. Note the elegance of the solution as the generated XML is readable within the query,

<users>{
for $u in collection("USERS")/USERS
return
<user id="{$u/USERID}">
<name>{$u/NAME/text()}</name>
<rating>{$u/RATING/text()}</rating>
</user>
}</users>

Let’s now extend this existing XML, and include for every user the bids being placed. Imagine we want the following kind of result,

<users>
<user id="U01">
<name>Tom Jones</name>
<rating>B</rating>
<bids>
<bid id="1002">
<bid>400</bid>
<date>1999-02-14</date>
</bid>
<bid id="1004">
<bid>40</bid>
<date>1999-03-05</date>
</bid>
</bids>
</user>
<user id="U02">
...
</users>

Unfortunately, such XML structure cannot be generated through SQL Server's FOR XML PATH; we need to use FOR XML EXPLICIT. If you’re familiar with it, I’m sure you appreciate some of the details of the FOR XML PATH construct in SQL Server. Let me give it a try,

SELECT
Tag,
Parent,
[users!1!],
[user!2!id],
[user!2!name!Element],
[user!2!rating!Element],
[bids!3!],
[bid!4!id],
[bid!4!bid!Element],
[bid!4!date!Element]
FROM (
SELECT
1 AS Tag,
NULL AS Parent,
0 AS Sort,
NULL AS 'users!1!',
NULL AS 'user!2!id',
NULL AS 'user!2!name!Element',
NULL AS 'user!2!rating!Element',
NULL AS 'bids!3!',
NULL AS 'bid!4!id',
NULL AS 'bid!4!bid!Element',
NULL AS 'bid!4!date!Element'
UNION ALL
SELECT
2 AS Tag,
1 AS Parent,
convert(integer, substring(USERID, 2,2)) * 100 AS Sort,
NULL,
USERID,
NAME,
RATING,
NULL, NULL, NULL, NULL
FROM USERS
UNION ALL
SELECT
3 AS Tag,
2 AS Parent,
convert(integer, substring(USERID, 2,2)) * 100 + 1 AS Sort,
NULL, NULL, NULL, NULL,NULL, NULL, NULL, NULL
FROM USERS
UNION ALL
SELECT
4 AS Tag,
3 AS Parent,
convert(integer, substring(USERID, 2,2)) * 100 + 2 AS Sort,
NULL, NULL, NULL, NULL,NULL,
USERID,
BID,
BID_DATE
FROM BIDS
) A
ORDER BY Sort
FOR XML EXPLICIT

And if you have a SQL/XML background, here is an Oracle 10gR2 compatible query,

SELECT
xmlelement(name "users",
(SELECT
xmlagg(
xmlelement(name "user",
xmlattributes(u.USERID as id),
xmlelement(name "name", u.NAME),
xmlelement(name "rating", u.RATING),
xmlelement(name "bids",
(SELECT
xmlagg(
xmlelement(name "bid",
xmlattributes(b.ITEMNO as id),
xmlelement(name "bid", b.BID),
xmlelement(name "date", b.BID_DATE)
)
)
FROM BIDS b
WHERE b.USERID = u.USERID)
)
)
)
FROM USERS u)
)
FROM DUAL

The XQuery version is more concise, more readable and as a consequence more maintainable,

<users>{
for $u in collection("USERS")/USERS
return
<user id="{$u/USERID}">
<name>{$u/NAME/text()}</name>
<rating>{$u/RATING/text()}</rating>
<bids>{
for $b in collection("BIDS")/BIDS
where $b/USERID = $u/USERID
return
<bid id="{$b/ITEMNO}">
<bid>{$b/BID/text()}</bid>
<date>{$b/BID_DATE/text()}</date>
</bid>
}</bids>
</user>
}</users>

There is a subtle difference between the above SQL and XQuery queries. In SQL, be it the SQL/XML standard or Microsoft SQL Server’s FOR XML approach, element construction is skipped in case of NULL data. This is not the case with the XQuery queries. Consider the following query from above,

<users>{
for $u in collection("USERS")/USERS
return
<user id="{$u/USERID}">
<name>{$u/NAME/text()}</name>
<rating>{$u/RATING/text()}</rating>
</user>
}</users>

If for some record in our database, the NAME or RATING are NULL, the or element would still be included the query result. In order to make the XQuery fully compatible with SQL, the following trick can be used. Up to now we have been adding the path expression to select the data, as enclosed expressions inside the element constructors,

<rating>{$u/RATING/text()}</rating>

If we simply select the data, and add the element constructor as last step in the path expression, we get the desired effect.

$u/RATING/<rating>{./text()}</rating>

If for some user the rating is NULL, then $u/RATING will evaluate to the empty sequence, as a consequence the element will not be created.
Applying this to our complete query we have the following,

<users>{
for $u in collection("USERS")/USERS
return
<user>{
$u/USERID/attribute id {.},
$u/NAME/<name>{./text()}</name>,
$u/RATING/<rating>{./text()}</rating>
}</user>
}</users>

I hope this post gave a sense on the power and simplicity, when it comes to publishing relational data as XQuery.
All posts in the XQuery for the SQL Programmer series have been about querying your relational databases. Next we’ll discuss updates. Can you update your data through XQuery, and how does this compare to what you are used to in SQL?

Tech Tags:

Sunday, August 24, 2008

Bridging the gap between EDI, proprietary formats and XML

Minollo is back on stage next Wednesday. In this week's webinar he explores how to bridge the gap between EDI, proprietary formats, and XML.

  • Extracting data from EDI and proprietary files
    Manipulate an EDI file or proprietary files as simple XML documents, and extract information from them
  • Aggregating EDI messages and different data sources
    Merge incoming EDI messages with a variety of different data sources (relational databases, XML files, Web services), and generate a document consistent with a predefined XML Schema
  • Create EDI messages using data from different data sources
    Extract data from a variety of data sources (relational databases, XML files, Web services), and generate an EDI document consistent with a specific version/sub-version of the EDI specifications you need to use
  • Convert an EDI message in a different flat file format and vice-versa
    Process an incoming EDI message and send a Comma Separated Value format to your business partner. Receive a fixed-width proprietary format and send an EDI message back.

Register now and attend this technical webinar.

Labels:

Friday, August 22, 2008

Accessing LDAP directory services through XQuery

As you know by now, DataDirect XQuery provides an easy and efficient way to aggregate data available in a variety of data sources and formats, like Relational Databases, XML documents, Web service responses, flat files, EDI files and so on.

But, as you also well know, the world of data storage and access is very messy; there are (and always will be) protocols, data stores and formats that are not supported out of the box even by the most sophisticated data integration tool. For example, lately a few DataDirect XQuery users have asked about the possibility to access LDAP directory services to create reports that include information available in part in RDBMS and in part in LDAP directories. DataDirect XQuery doesn't support LDAP directories out of the box; but it does support the possibility to extend the variety of supported data sources through several extension methods, like custom URI resolvers, Java extension functions or custom collection URI resolvers.

So, I started thinking about what would be the best way to expose LDAP directory access from XQuery, and I came up with the following requirements for the example that I wanted to make available to our users:
- Access to the LDAP directories must be highly scalable: the same way we rely on XML streaming processing and sophisticated SQL generation and result sets consumption, we need to make LDAP access work in a streaming fashion (in fact, one of the users interested in this functionality was planning to process hundreds of thousands of LDAP records from within XQuery; better do that in a streaming fashion!)
- Access to the LDAP directories must be available both as a custom URI resolver (it's very natural for the user to think in terms of doc("ldap://localhost:10389?...") URIs when accessing LDAP resources), and as Java extension functions (which can provide more flexibility when parameters are dynamically specified)

The real work was to create a StAX interface able to consume the results returned by LDAP search operations; exposing that interface as either a custom URI resolver or a Java extension function was a very simple job.

The result of this process is attached here. If you want to try it out, you'll just need to make sure your classpath includes the folder where you expand the ZIP file. Then, a simple XQuery like this will start returning you data that is stored in your LDAP directory service (I've tested these examples using Apache Directory Suite):
doc("ldap://localhost:10389?auth=simple
        &amp;principal=uid=admin,ou=system&amp;pwd=secret
        &amp;name=ou=users,ou=system&amp;filter=cn=*"
)

For me, this is the result I get:
<ldap url="...">
    <item>
        <postalCode>01803</postalCode>
        <mobile>781 555-555</mobile>
        <email>ipedruzz@datadirect.com</email>
        <uid>ivanpedruzzi</uid>
        <objectClass>organizationalPerson</objectClass>
        <sn>pedruzzi</sn>
        <cn>Ivan Pedruzzi</cn>
    </item>
    <item>
        <postalCode>01880</postalCode>
        <mobile>781 555-666</mobile>
        <email>cinnocent@datadirect.com</email>
        <uid>carloinnocenti</uid>
        <objectClass>organizationalPerson</objectClass>
        <sn>innocenti</sn>
        <cn>Carlo Innocenti</cn>
    </item>
    ...


As you would imagine, the data looks like XML; and at this point you can handle it as if it was stored in any normal XML document referenced through the doc() function.
So, suppose for example you want to merge personal data stored in LDAP with other information stored in relational database; for example, suppose you are a telcom company and you want to retrieve all details about your subscribers who own cellphones with GPS capabilities in a specific ZIP code:
<subscribers>
    {
    for $subscribers in collection("subscribers")/subscribers ,
        $phones in collection("phones")/phones
    where $phones/id = $subscribers/phone and
        $phones/GPS = "yes" and $subscribers/zipcode = "01880"
    return
    <subscriber>
        <email>
            {
            doc(concat("ldap://localhost:10389?
                      auth=simple&amp;principal=uid=admin,ou=system&amp;
                      pwd=secret&amp;name=ou=users,ou=system&amp;filter=uid="
,
                      $subscribers/id)
                      )//email/text()
            }
        </email>
        <phone>
            {
            concat($phones/brand," ",$phones/model)
            }
        </phone>
    </subscriber>
    }
</subscribers>
 


Pretty nice, isn't it. That would return a result like this:
<subscribers>
    <subscriber>
        <email>ipedruzz@datadirect.com</email>
        <phone>Apple IPhone II</phone>
    </subscriber>
    <subscriber>
        <email>nancy.vodicka@datadirect.com</email>
        <phone>Blackberry 8800</phone>
    </subscriber>
</subscribers>

A similar approach applies if you want to expose an LDAP directory through Java extension functions; the extension function mechanism is more flexible, and it also gives us a chance to cache some factory/connection objects to the LDAP service; so, it may be also more efficient depending on the problem you are trying to solve.
The mechanism I wrote as an example relies on a single Java class with one constructor and one method; the constructor is used to create the factory object used by the call() method to actually perform the LDAP search. From an XQuery point of view, things look like this:
(: declare the namespace implementing the Java extension functions :)
declare namespace ldap= "ddtekjava:com.ddtek.ldap.ldap";
(: declare the constructor and method functions as they are seen from XQuery :)
declare function ldap:ldap($contextFactory as xs:string)
      as
ddtek:javaObject external;
declare function ldap:call($this as ddtek:javaObject, $server as xs:string, $port as xs:string,
        $pwd as xs:string, $auth as xs:string, $principalName as xs:string,
        $nameToFilter as xs:string, $filterExp as xs:string)
            as
document-node(element(*, xs:untyped)) external;

(: create an "ldap" class object using a specific factory class :)
declare variable $ldap:= ldap:ldap("com.sun.jndi.ldap.LdapCtxFactory");
(: execute the call() method specifying the required arguments :)
ldap:call($ldap,
  "localhost","10389","secret","simple","uid=admin,ou=system",
  "ou=users,ou=system"
,"cn=*")
 

The result of this operation is equivalent to the excution of the doc() function above using the custom URI format; if I wanted to use the Java extension function to merge RDBMS and LDAP data, I would use an approach equivalent to what we saw before:
declare variable $ldap := ldap:ldap("com.sun.jndi.ldap.LdapCtxFactory");
<subscribers>
  {
    for $subscribers in collection("ldap.dbo.subscribers")/subscribers,
        $phones in collection("ldap.dbo.phones")/phones
    where $phones/id = $subscribers/phone and
      
$phones/GPS = "yes" and $subscribers/zipcode = "01880"
    return
      <subscriber>
        <email>
          { ldap:call($ldap,
            "localhost","10389", "secret","simple", "uid=admin,ou=system",
            "ou=users,ou=system"
,
            concat("uid=", $subscribers/id))/ldap/item/email/text() }
        </email>
        <phone>
          { concat($phones/brand, " ", $phones/model) }
        </phone>
      </subscriber>
  }
</subscribers>

Both approaches will consume the results returned by the LDAP searches in a streaming fashion, thanks to the StAX interface used to interface the custom URI resolver and Java extension function to DataDirect XQuery. Feel free to dig in the attached sources to learn more about how that works, and/or to improve/change the behavior of this example.

So, an interesting example, but what's the point I'm trying to make here? As I mentioned before, data today is available in such a wide variety of stores and formats, and accessible through such a variety of different protocols and APIs that it's virtually impossible for data integration tools to support all of them out of the box. But as long as the data integration tool allows you to extend its behavior through flexible and scalable mechanisms like the one described here, and as long as the data model internally supported by the tool (XML, in this case) is powerful enough to accommodate virtually any kind of physical data model you need to access, then you can still leverage the power, flexibility, scalability and performance that the aggregation tool offers. And that's certainly true for DataDirect XQuery.
 

Labels: , , , , ,

Wednesday, August 6, 2008

Web Services and Data Integration webinar replay

If you missed Minollo's live performance a few weeks ago, you can get the archived version now.

The recording contains code demos for,

  • Using XML to create logical views of a variety of physical data sources
  • Aggregating XML documents with relational data
  • Consuming Web services in XQuery
  • Exposing XQuery as data services

Labels: , ,

Tuesday, August 5, 2008

XQuery for the SQL Programmer - Grouping and aggregation

Today's topic in the XQuery for the SQL programmer series is grouping.
Where grouping is built-in the SQL language, this is unfortunately a bit less trivial in XQuery. Let's start with a simple example, count the number of bids for every user.

select USERID, count(*)
from BIDS
group by USERID

XQuery has no built-in grouping construct, but you can achieve the same using the distinct-values() function, which we introduced in our last week's post.

for $userid in distinct-values(collection("BIDS")/BIDS/USERID)
return
<user>
<userid>{$userid}</userid>
<count>{count(collection("BIDS")/BIDS[USERID = $userid])}</count>
</user>

Let's now add a having clause, we're interested in the users with more than 1 bid,

select USERID, count(*)
from BIDS
group by USERID
having count(*) > 1

Our previous XQuery is changed as follows. Note that we use a let-clause, which we introduced in Joining Data.

for $userid in distinct-values(collection("BIDS")/BIDS/USERID)
let $count := count(collection("BIDS")/BIDS[USERID = $userid])
where $count > 1
return
<user>
<userid>{$userid}</userid>
<count>{$count }</count>
</user>

And we further detail our query, as we're only interested in significant bids of more than 100$.

select USERID, count(*)
from BIDS
where BID > 100
group by USERID
having count(*) > 1

In XQuery we do,

for $userid in distinct-values(collection("BIDS")/BIDS/USERID)
let $count := count(collection("BIDS")/BIDS[USERID = $userid and BID > 100])
where $count > 1
return
<user>
<userid>{$userid}</userid>
<count>{$count }</count>
</user>

As a final example, let's discuss a grouping scenario based on two values. For all items group on the rating of the user, and count the number of bids, in addition to the average bid,

select ITEMNO, RATING, AVG(BID), COUNT(*) from xvs001.BIDS, xvs001.USERS
where USERS.USERID = BIDS.USERID
GROUP BY RATING, ITEMNO
ORDER BY ITEMNO, RATING

In XQuery one could write the following query. Not the two for-loops over distinct-values().

for $rating in distinct-values(collection("USERS")/USERS/RATING)
for $itemno in distinct-values(collection("BIDS")/BIDS/ITEMNO)
for $userid in collection("USERS")/USERS[RATING=$rating]/USERID
let $bids := collection("BIDS")/BIDS[ITEMNO=$itemno and USERID = $userid]
let $avg := avg($bids/BID)
let $count := count($bids)
where $bids
order by $itemno, $rating
return
<result>
<<itemno>{$itemno}</itemno>
<rating>{$rating }</rating>
<avg>{$avg}</avg>
<count>{$count}</count>
</result>

We can conclude that resolving grouping problems in XQuery 1.0 is less simple than in SQL. Unfortunate, but today's reality.
In the margin, the XML Query Working Group has recognized the issue, and is active working on adding native grouping capabilities to the language. All this in scope of XQuery 1.1. Today still a working draft, but at least we know it will change and improve over time.

Tech Tags:

Labels: , ,