1 / 24

Relational to XML Transformations

Relational to XML Transformations. Background & Issues Preliminaries Execution strategies The SilkRoute System. Background & Issues. XML – a de-facto standard for data exchange (B2B) Business data is will be stored in relational db’s :

gaia
Download Presentation

Relational to XML Transformations

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Relational to XML Transformations Background & Issues Preliminaries Execution strategies The SilkRoute System rel-xml-i

  2. Background & Issues • XML – a de-facto standard for data exchange (B2B) • Business data is will be stored in relational db’s : reliability, optimized query processing, established applications • Need efficient generation of XML data from relational db’s Subject was investigated in several projects (early 2000’s) : • Xperanto (IBM Almaden) • SilkRoute (AT&T) • PRATA (Bell Labs/Lucent) rel-xml-i

  3. Issues : Relations are flat, unordered, tuples are pure data XML is nested, tagged, and ordered • Need a language/interface to specify needed data and its form • Which parts of the work should be performed Inside/outside the relational engine Relational engines are very good at • Optimizing SQL queries (efficient execution) • Sorting But do not deal with tagging, do not generate XML rel-xml-i

  4. Xperanto : The IBM team had access to DB2 internals • Extended SQL by a few primitives to generate XML, Implemented these in the DB2 relational engine • Analyzed space of execution strategies, using simulations, concluded that 2 strategies, both doing almost all work as one (extended) SQL query is best But: We do not have access to db internals  The interesting part is their analysis/simulations rel-xml-i

  5. SilkRoute : • Relational db is presented as an XML view (standard transformation) • Desired XML specified in XML query language • Initial version: home-brewed XML query language • Last version: QXuery Main idea: • Query composition with the query that define the standard view allow to generate the data by SQL queries + tagging • Found that one big SQL query is not always best rel-xml-i

  6. PRATA : • Use DTD’s as description of desired XML, & a generalization of attribute grammars with query actions to specify the needed data • Can handle recursion in DTD’s (former approaches cannot) • Can still optimize to use a small number of SQL queries for data generation rel-xml-i

  7. Comment : We are now back in the GAV approach • View: standard XML view of relational • Desired XML: query on this view • Main idea: query composition Complications: • XML data is tagged, relational is not • Nested data • Different nesting in view and query target • Need to change structure • May need fusion rel-xml-i

  8. Execution strategies The issue: Data is stored in relational tables, can be retrieved with one/few/many SQL queries • Which approach is more efficient? • How can the approach be implemented, assuming the transformation is in some XML-ish l query language rel-xml-i

  9. Xperanto execution strategies : Base example : rel-xml-i

  10. The source relational schema: rel-xml-i

  11. Space of evaluation strategies : • Early/late tagging • Early/late structuring (to form the nested XML structure) • All work inside the engine, or (at least part)outside the engine Some combinations are meaningless, e.g. early tagging/late structuring rel-xml-i

  12. 1st strategy : early tagging, early structuring, outside the engine An application issues a sequence of SQL queries, matching the structure of the result e.g: For each customer do 1. retrieve root – customer info – cust. name & id retrieved, tagged & output 2. retrieve, tag, & output customer account info 3. retrieve, tag, & output customer purchase orders 4. for each PO, retrieve, tag, output items, then payment info Early structuring : queries follow structure of generated doc Early tagging : each element is tagged when retrieved Outside the engine : obvious rel-xml-i

  13. Shortcomings : • Many small granularity queries – several queries per “object” – serious performance problems • Performs a nested loop join – a fixed join order and join strategy – the relational engine might explore others rel-xml-i

  14. 2nd strategy: Early structuring, tagging inside the machine : For this, augment the db engine with • New data type : xml • Constructors for the kinds of elements in the document, e.g. rel-xml-i

  15. Now, can express query as: XMLAGG aggregates several XML fragments into one For example, two accounts for the customer rel-xml-i

  16. The XML fragments have variable size  represented as Character large objects (CLOBs) Problems: • CLOBS are stored separately of their tuples, hence may need separate fetches • Each XML constructor copies the inpuyt CLOBs to form its output CLOB – a lot of copying Advantage : One large query, rather than many small ones Still nested loop join, but possibly engine can select another strategy (?) rel-xml-i

  17. 3rd strategy: Late structuring & tagging: If both structuring and tagging are done late (possibly outsideengine), we can separate process into two • Content creation – retrieve the data from the db, inside the engine • Structuring and tagging – the 1stpossibly inside, the 2nd outside rel-xml-i

  18. Contents creation – outer join approach: Select cust.*, acct.*, porder.*, pay.*, item.* From Customer cust left join Account acct on cust.id = acc.custId left join PurchaseOrder porder on cust.id = porder.custId left join Item item on porder.id = item.poId left join payment pay on porder.id = pay,poId Left join: a customer should occur in result even if no account, or has an account but no purchase orders, etc. Result for a customer w/o some fields is padded with nulls Join is performed for each path in tree (root to leaf) in some order Disadvantages : ?? rel-xml-i

  19. Contents creation – (unsorted) path outer union approach: Select cust.*, acct.*, type =1 From Customer cust left join Account acct on cust.id = acc.custId Outer union Select cust.*, porder.*, item.*, type = 2 From Customer cust left join PurchaseOrder porder on cust.id = porder.poId left join Item item on porder.id = item.poId Outer union Select cust.*, porder.*, pay.*, type = 3 From Customer cust left join PurchaseOrder porder on cust.id = porder.poId left join payment pay on porder.id = pay,poId Outer union: pads with nulls, like the left/right join, but does not duplicate data (as much) Each sub-query is a join for one leaf-to-root path Note: a sub-query repeated twice rel-xml-i

  20. Contents creation – (unsorted) node outer union approach: The previous strategy still ahs some redundnacy: A parents info is replicated with the descendents Can avoid by using id’s of parents in the descendents rel-xml-i

  21. Structuring & tagging (for unsorted outer union) : Use a hash table, with hash key type and ancestor id’s for an element to in the XML tree hash-based tagger For each tuple in relational result, that defines a node in tree, find out (hash) if parent is present: Yes – just add this new node (tag) No – add nodes for all the missing ancestors along the path to root (hashing repeatedly for shorter paths) Main disadvantage : For large outputs, main memory shortage Also, does not necessarily satisfy required order – may need sorting rel-xml-i

  22. Last strategy : early structuring & late tagging : The idea: order the relational contents in the same order it needs to appear in the (flattened) XML file Then, tagging (& nesting) can be performed in constant space • All info about a node X appears before/with the information of its children • The info of X & its descendents appears together (no mixing with descendents of other nodes) • The children are ordered as required by the XML def rel-xml-i

  23. Contents creation – sorted path/node outer union approach: Same as the outer union approach, but add a final sort step Relational engines are sorting experts, including external sorting • Sort on id fields, with id of higher nodes preceding those of lower nodes In example: CustId, AcctId, PoId, ItemId, PayementId • Nulls should be accounted for in sorting, and null values should precede non-nulls rel-xml-i

  24. Performance comparisons (fig. 13): • Outer join performs badly – too much data redundancy (not shown) • One large query is better than many small ones (stored proc.) • Inside the engine outperforms out the engine (for similar strategies) outside the engine needs to copy data and bind to external variables --- binding out time is a significant component for all approaches Binding out In black rel-xml-i

More Related