Relational to XML Transformations

Relational to XML Transformations Background & Issues Preliminaries Execution strategies The SilkRoute System rel-xml-i

Background & Issues • XML – a de-facto standard for data exchange (B2B) • Business data is will be stored in relational db’s : reliability, optimized query processing, established applications • Need efficient generation of XML data from relational db’s Subject was investigated in several projects (early 2000’s) : • Xperanto (IBM Almaden) • SilkRoute (AT&T) • PRATA (Bell Labs/Lucent) rel-xml-i

Issues : Relations are flat, unordered, tuples are pure data XML is nested, tagged, and ordered • Need a language/interface to specify needed data and its form • Which parts of the work should be performed Inside/outside the relational engine Relational engines are very good at • Optimizing SQL queries (efficient execution) • Sorting But do not deal with tagging, do not generate XML rel-xml-i

Xperanto : The IBM team had access to DB2 internals • Extended SQL by a few primitives to generate XML, Implemented these in the DB2 relational engine • Analyzed space of execution strategies, using simulations, concluded that 2 strategies, both doing almost all work as one (extended) SQL query is best But: We do not have access to db internals  The interesting part is their analysis/simulations rel-xml-i

SilkRoute : • Relational db is presented as an XML view (standard transformation) • Desired XML specified in XML query language • Initial version: home-brewed XML query language • Last version: QXuery Main idea: • Query composition with the query that define the standard view allow to generate the data by SQL queries + tagging • Found that one big SQL query is not always best rel-xml-i

PRATA : • Use DTD’s as description of desired XML, & a generalization of attribute grammars with query actions to specify the needed data • Can handle recursion in DTD’s (former approaches cannot) • Can still optimize to use a small number of SQL queries for data generation rel-xml-i

Comment : We are now back in the GAV approach • View: standard XML view of relational • Desired XML: query on this view • Main idea: query composition Complications: • XML data is tagged, relational is not • Nested data • Different nesting in view and query target • Need to change structure • May need fusion rel-xml-i

Execution strategies The issue: Data is stored in relational tables, can be retrieved with one/few/many SQL queries • Which approach is more efficient? • How can the approach be implemented, assuming the transformation is in some XML-ish l query language rel-xml-i

Xperanto execution strategies : Base example : rel-xml-i

The source relational schema: rel-xml-i

Space of evaluation strategies : • Early/late tagging • Early/late structuring (to form the nested XML structure) • All work inside the engine, or (at least part)outside the engine Some combinations are meaningless, e.g. early tagging/late structuring rel-xml-i

1st strategy : early tagging, early structuring, outside the engine An application issues a sequence of SQL queries, matching the structure of the result e.g: For each customer do 1. retrieve root – customer info – cust. name & id retrieved, tagged & output 2. retrieve, tag, & output customer account info 3. retrieve, tag, & output customer purchase orders 4. for each PO, retrieve, tag, output items, then payment info Early structuring : queries follow structure of generated doc Early tagging : each element is tagged when retrieved Outside the engine : obvious rel-xml-i

Shortcomings : • Many small granularity queries – several queries per “object” – serious performance problems • Performs a nested loop join – a fixed join order and join strategy – the relational engine might explore others rel-xml-i

2nd strategy: Early structuring, tagging inside the machine : For this, augment the db engine with • New data type : xml • Constructors for the kinds of elements in the document, e.g. rel-xml-i

Now, can express query as: XMLAGG aggregates several XML fragments into one For example, two accounts for the customer rel-xml-i

The XML fragments have variable size  represented as Character large objects (CLOBs) Problems: • CLOBS are stored separately of their tuples, hence may need separate fetches • Each XML constructor copies the inpuyt CLOBs to form its output CLOB – a lot of copying Advantage : One large query, rather than many small ones Still nested loop join, but possibly engine can select another strategy (?) rel-xml-i

3rd strategy: Late structuring & tagging: If both structuring and tagging are done late (possibly outsideengine), we can separate process into two • Content creation – retrieve the data from the db, inside the engine • Structuring and tagging – the 1stpossibly inside, the 2nd outside rel-xml-i

Contents creation – outer join approach: Select cust.*, acct.*, porder.*, pay.*, item.* From Customer cust left join Account acct on cust.id = acc.custId left join PurchaseOrder porder on cust.id = porder.custId left join Item item on porder.id = item.poId left join payment pay on porder.id = pay,poId Left join: a customer should occur in result even if no account, or has an account but no purchase orders, etc. Result for a customer w/o some fields is padded with nulls Join is performed for each path in tree (root to leaf) in some order Disadvantages : ?? rel-xml-i

Contents creation – (unsorted) path outer union approach: Select cust.*, acct.*, type =1 From Customer cust left join Account acct on cust.id = acc.custId Outer union Select cust.*, porder.*, item.*, type = 2 From Customer cust left join PurchaseOrder porder on cust.id = porder.poId left join Item item on porder.id = item.poId Outer union Select cust.*, porder.*, pay.*, type = 3 From Customer cust left join PurchaseOrder porder on cust.id = porder.poId left join payment pay on porder.id = pay,poId Outer union: pads with nulls, like the left/right join, but does not duplicate data (as much) Each sub-query is a join for one leaf-to-root path Note: a sub-query repeated twice rel-xml-i

Contents creation – (unsorted) node outer union approach: The previous strategy still ahs some redundnacy: A parents info is replicated with the descendents Can avoid by using id’s of parents in the descendents rel-xml-i

Structuring & tagging (for unsorted outer union) : Use a hash table, with hash key type and ancestor id’s for an element to in the XML tree hash-based tagger For each tuple in relational result, that defines a node in tree, find out (hash) if parent is present: Yes – just add this new node (tag) No – add nodes for all the missing ancestors along the path to root (hashing repeatedly for shorter paths) Main disadvantage : For large outputs, main memory shortage Also, does not necessarily satisfy required order – may need sorting rel-xml-i

Last strategy : early structuring & late tagging : The idea: order the relational contents in the same order it needs to appear in the (flattened) XML file Then, tagging (& nesting) can be performed in constant space • All info about a node X appears before/with the information of its children • The info of X & its descendents appears together (no mixing with descendents of other nodes) • The children are ordered as required by the XML def rel-xml-i

Contents creation – sorted path/node outer union approach: Same as the outer union approach, but add a final sort step Relational engines are sorting experts, including external sorting • Sort on id fields, with id of higher nodes preceding those of lower nodes In example: CustId, AcctId, PoId, ItemId, PayementId • Nulls should be accounted for in sorting, and null values should precede non-nulls rel-xml-i

Performance comparisons (fig. 13): • Outer join performs badly – too much data redundancy (not shown) • One large query is better than many small ones (stored proc.) • Inside the engine outperforms out the engine (for similar strategies) outside the engine needs to copy data and bind to external variables --- binding out time is a significant component for all approaches Binding out In black rel-xml-i

Relational to XML Transformations