
Introduction to XML Algebra

Introduction to XML Algebra. Based on talk prepared for CS561 by Wan Liu and Bintou Kane. Data Model. data model ~ core data structures and data types supported by DBMS relational database is a table (set-oriented) data model XML format is a tree-structured hierarchical model.


Presentation Transcript


  1. Introduction to XML Algebra Based on talk prepared for CS561 by Wan Liu and Bintou Kane

  2. Data Model • data model ~ core data structures and data types supported by DBMS • relational database is a table (set-oriented) data model • XML format is a tree-structured hierarchical model

  3. Why XML Algebra? • It is common to translate a query language into an algebra. • First, the algebra is used to give a semantics for the query language. • Second, the algebra is used to support query optimization.

  4. NIAGARA • Title: Following the Paths of XML Data: An Algebraic Framework for XML Query Evaluation • By: Leonidas Galanis, Efstratios Viglas, David J. DeWitt, Jeffrey F. Naughton, and David Maier. Univ. of Wisconsin

  5. Outline • Concepts of Niagara Algebra • Operations • Optimization

  6. Goals of Niagara Algebra • Be independent of schema information • Query on both structure and content • Generate simple, flexible, yet powerful algebraic expressions • Allow re-use of traditional optimization techniques

  7. Example: XML Source Documents
     Invoice.xml
     <Invoice_Document>
       <invoice No = 1>
         <account_number>2</account_number>
         <carrier>AT&T</carrier>
         <total>$0.25</total>
       </invoice>
       <invoice>
         <account_number>1</account_number>
         <carrier>Sprint</carrier>
         <total>$1.20</total>
       </invoice>
       <invoice>
         <account_number>1</account_number>
         <carrier>AT&T</carrier>
         <total>$0.75</total>
       </invoice>
     </Invoice_Document>
     Customer.xml
     <Customer_Document>
       <customer>
         <account>1</account>
         <name>Tom</name>
       </customer>
       <customer>
         <account>2</account>
         <name>George</name>
       </customer>
     </Customer_Document>

  8. XML Data Model and Tree Graph • Example: Invoice_Document
     <Invoice_Document>
       <invoice>
         <number>2</number>
         <carrier>Sprint</carrier>
         <total>$0.25</total>
       </invoice>
       <invoice>
         <number>1</number>
         <carrier>Sprint</carrier>
         <total>$1.20</total>
       </invoice>
     </Invoice_Document>
     [Figure: ordered tree graph of Invoice_Document with two invoice subtrees, each with number, carrier, and total children]
     • Ordered tree graph, semi-structured data

  9. XML Data Model [GVDNM01] • Collection of bags of vertices. • Vertices in a bag have no order. • Example: the bag [Root“invoice.xml”, invoice, invoice.account_number], where invoice holds <invoice> invoice-element-content </invoice> and invoice.account_number holds <account_number> element-content </account_number>

  10. Data Model • Bag elements are reachable by path expressions. • Path expression consists of two parts: • An entry point • A relative forward part • Example: account_number:invoice
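The two-part structure of a path expression can be sketched in Python. This is only an illustration: it assumes the notation reads "relative forward part:entry point" (so account_number:invoice starts at the invoice vertex and moves forward to account_number), and the names PathExpr, parse_path, and annotation are hypothetical helpers, not part of Niagara.

```python
from typing import NamedTuple, Tuple


class PathExpr(NamedTuple):
    entry_point: str          # vertex already in the bag where the path starts
    forward: Tuple[str, ...]  # element names followed from the entry point


def parse_path(expr: str) -> PathExpr:
    """Parse 'forward.part:entry_point' (assumed reading of the slide notation)."""
    forward, sep, entry = expr.partition(":")
    forward, entry = forward.strip(), entry.strip()
    if not sep or not forward or not entry:
        raise ValueError(f"not in entry point notation: {expr!r}")
    return PathExpr(entry, tuple(step.strip() for step in forward.split(".")))


def annotation(path: PathExpr) -> str:
    """Name under which the reached vertex would be stored in the bag."""
    return ".".join((path.entry_point,) + path.forward)


p = parse_path("account_number:invoice")
print(p.entry_point, p.forward, annotation(p))
```

Under this reading, the vertex reached by account_number:invoice is the one labeled invoice.account_number in the bag example on the previous slide.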

  11. Operators • Source (S), Follow (Φ), Select, Join, Rename, Expose, Vertex, Group, Union, Intersection, Difference (−), Cartesian Product.

  12. Source Operator S • Input: a list of documents • Output: a collection of singleton bags • Examples: • S (*): all known XML documents • S (invoice*.xml): all XML documents whose filename matches “invoice*.xml” • S (*, schema.dtd): all known XML documents that conform to schema.dtd
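A minimal sketch of the filename-pattern form of the Source operator, in Python. The catalog KNOWN_DOCUMENTS and the "Root:<file>" labels are assumptions made for illustration, and the DTD-conformance variant S (*, schema.dtd) is not modeled here.

```python
from fnmatch import fnmatchcase

# Hypothetical catalog of documents known to the system.
KNOWN_DOCUMENTS = ["invoice1.xml", "invoice2.xml", "customer.xml"]


def source(pattern, known=KNOWN_DOCUMENTS):
    """Return one singleton bag (holding only a root label) per matching document."""
    return [[f"Root:{name}"] for name in known if fnmatchcase(name, pattern)]


print(source("invoice*.xml"))
print(len(source("*")))
```

Each result is a collection of singleton bags, one per matching document, which later operators extend with further vertices.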

  13. Follow operator Φ • Input: a path expression in entry point notation • Functionality: extracts the vertices reachable by the path expression • Output: a new bag that consists of the extracted vertex plus all contents of the original bag (in the case of an unnesting follow)

  14. Follow operator (Example*) • Input: {[Root“invoice.xml”, invoice]}, where invoice holds <invoice> invoice-element-content </invoice> • Operation: Φμ(carrier:invoice) • Output: {[Root“invoice.xml”, invoice, invoice.carrier]}, where invoice.carrier holds <carrier> carrier-element-content </carrier> • *Unnesting follow
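The unnesting follow can be sketched with Python's standard xml.etree.ElementTree. This is an illustration under stated assumptions, not Niagara's implementation: a bag is modeled as a dict from vertex name to element, follow_unnest is a hypothetical helper, one output bag is produced per reachable vertex, bags with no reachable vertex produce no output, and AT&T is escaped as AT&amp;T so the sample parses.

```python
import xml.etree.ElementTree as ET

INVOICE_XML = """
<Invoice_Document>
  <invoice><account_number>2</account_number><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>
  <invoice><account_number>1</account_number><carrier>Sprint</carrier><total>$1.20</total></invoice>
  <invoice><account_number>1</account_number><total>$0.75</total></invoice>
</Invoice_Document>
"""


def follow_unnest(bags, entry, child, name=None):
    """For each bag, emit one extended bag per child element reachable from bag[entry]."""
    name = name or f"{entry}.{child}"
    out = []
    for bag in bags:
        for vertex in bag[entry].findall(child):
            out.append({**bag, name: vertex})  # original contents plus the new vertex
    return out


root_bags = [{"Root": ET.fromstring(INVOICE_XML)}]
invoices = follow_unnest(root_bags, "Root", "invoice", name="invoice")
with_carrier = follow_unnest(invoices, "invoice", "carrier")
carriers = [bag["invoice.carrier"].text for bag in with_carrier]
print(len(invoices), carriers)
```

In this sample the third invoice has no carrier element, so it contributes no bag to the result of the second follow.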

  15. Select operator • Input: a set of bags • Functionality: filters the bags of a collection using a predicate • Output: the set of bags that satisfy the predicate • Predicate: logical operators (∧, ∨, ¬) or simple comparisons (=, ≠, <, ≤, >, ≥)

  16. Select operator (Example) • Input: {[Root“invoice.xml”, invoice], [Root“invoice.xml”, invoice], …}, where each invoice holds <invoice> invoice-element-content </invoice> • Operation: Select (invoice.carrier = Sprint) • Output: {[Root“invoice.xml”, invoice], …}

  17. Join operator • Input: two collections of bags • Functionality: joins the two collections based on a predicate • Output: the concatenation of the pairs of bags that satisfy the predicate

  18. Join operator (Example) • Inputs: {[Root“invoice.xml”, invoice]} and {[Root“customer.xml”, customer]}, where invoice holds <invoice> invoice-element-content </invoice> and customer holds <customer> customer-element-content </customer> • Predicate: account_number:invoice = number:customer • Output: {[Root“invoice.xml”, invoice, Root“customer.xml”, customer]}

  19. Expose operator • Input: a list of path expressions of vertices to be exposed • Output: a set of bags that contain the vertices in the parameter list, in the same order

  20. Expose operator (Example) • Input: {[Root“invoice.xml”, invoice, invoice.carrier, invoice.bill_period]} • Operation: Expose (bill_period, carrier) • Output: {[Root“invoice.xml”, invoice.bill_period, invoice.carrier]}

  21. Vertex operator  • Creates the actual XML vertex that will encompass everything created by an expose operator • Example :  (Customer_invoice)[((account)[invoice.account_number], (inv_total)[invoice.total])]

  22. Other operators • Group: used for arbitrary grouping of elements based on their values • Aggregate functions can be used with the group operator (e.g., average) • Rename: changes the entry point annotation of elements of a bag. • Example: (invoice.bill_period, date)

  23. Example: XML Source Documents
     Invoice.xml
     <Invoice_Document>
       <invoice>
         <account_number>2</account_number>
         <carrier>AT&T</carrier>
         <total>$0.25</total>
       </invoice>
       <invoice>
         <account_number>1</account_number>
         <carrier>Sprint</carrier>
         <total>$1.20</total>
       </invoice>
       <invoice>
         <account_number>1</account_number>
         <total>$0.75</total>
       </invoice>
       <auditor>maria</auditor>
     </Invoice_Document>
     Customer.xml
     <Customer_Document>
       <customer>
         <account>1</account>
         <name>Tom</name>
       </customer>
       <customer>
         <account>2</account>
         <name>George</name>
       </customer>
     </Customer_Document>

  24. XQuery Example • List account number, customer name, and invoice total for all invoices that have carrier = “Sprint”.
     FOR $i in (invoices.xml)//invoice,
         $c in (customers.xml)//customer
     WHERE $i/carrier = “Sprint” and $i/account_number = $c/account
     RETURN <Sprint_invoices> $i/account_number, $c/name, $i/total </Sprint_invoices>

  25. Example: XQuery output <Sprint_Invoice> <account_number>1</account_number> <name>Tom</name> <total>$1.20</total> </Sprint_Invoice>

  26. Algebra Tree Execution (top of the tree first; intermediate results after each arrow)
     • Expose (*.account_number, *.name, *.total) → account_number, name, total
     • Join (*.invoice.account_number = *.customer.account) → invoice(2), customer(1)
     • Left branch: Select (carrier = “Sprint”) → invoice(2)
     • Left branch: Follow (*.invoice) → invoice(1), invoice(2), invoice(3)
     • Left branch: Source (Invoices.xml)
     • Right branch: Follow (*.customer) → customer(1), customer(2)
     • Right branch: Source (customers.xml)
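The algebra tree can be traced end to end with a small Python sketch over the sample documents of slide 23. It is a sketch under assumptions rather than Niagara's execution engine: bags are dicts from vertex name to element, the helper functions follow, select, join, and text are hypothetical, the join is a plain nested loop, and AT&T is written as AT&amp;T so the sample parses.

```python
import xml.etree.ElementTree as ET

INVOICES = ET.fromstring("""<Invoice_Document>
<invoice><account_number>2</account_number><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>
<invoice><account_number>1</account_number><carrier>Sprint</carrier><total>$1.20</total></invoice>
<invoice><account_number>1</account_number><total>$0.75</total></invoice>
<auditor>maria</auditor>
</Invoice_Document>""")

CUSTOMERS = ET.fromstring("""<Customer_Document>
<customer><account>1</account><name>Tom</name></customer>
<customer><account>2</account><name>George</name></customer>
</Customer_Document>""")


def follow(bags, entry, child, name):
    """Unnesting follow: one extended bag per child reachable from bag[entry]."""
    return [{**b, name: v} for b in bags for v in b[entry].findall(child)]


def select(bags, predicate):
    """Keep only the bags that satisfy the predicate."""
    return [b for b in bags if predicate(b)]


def join(left, right, predicate):
    """Nested-loop join: concatenate every pair of bags that satisfies the predicate."""
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


def text(bag, vertex, tag):
    """Text of a child element of a vertex, or '' if the child is missing."""
    return (bag[vertex].findtext(tag) or "").strip()


# Source + Follow on each branch
invoices = follow([{"Root_inv": INVOICES}], "Root_inv", "invoice", "invoice")
customers = follow([{"Root_cust": CUSTOMERS}], "Root_cust", "customer", "customer")

# Select, Join, then Expose the three requested values
sprint = select(invoices, lambda b: text(b, "invoice", "carrier") == "Sprint")
joined = join(sprint, customers,
              lambda l, r: text(l, "invoice", "account_number") == text(r, "customer", "account"))
result = [(text(b, "invoice", "account_number"),
           text(b, "customer", "name"),
           text(b, "invoice", "total")) for b in joined]
print(result)
```

The single result row corresponds to the XQuery output on slide 25: account number 1, customer Tom, total $1.20.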

  27. Optimization with Niagara • Optimizer based on the Niagara algebra: • Uses operations more efficiently • Produces simpler expressions by combining operations

  28. Language Convention • A and B are path expressions • A < B: path expression A is a prefix of B • AnB: common prefix of paths A and B • AńB: greatest common prefix of paths A and B • ┴: null path expression

  29. Heuristics using Rewrite Rules • Allow optimization based on path selectivity • Apply when using the unnesting follow operation Φμ

  30. Interchangeability of Follow operation • Φμ(A)[Φμ(B)] = Φμ(B)[Φμ(A)] • TRUE when there exists C such that C < A and C < B and C = AńB • or when AnB = ┴
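The condition of the rule can be checked mechanically. The Python sketch below assumes that "<" means strict prefix (so two follows do not commute when one path is a prefix of the other), that an empty common prefix plays the role of the null path, and that "forward:entry" expands to the full path entry.forward. The helper names are hypothetical.

```python
def full_path(expr):
    """Expand 'forward:entry' into a tuple of steps, e.g. ('invoice', 'carrier')."""
    forward, sep, entry = expr.partition(":")
    if not sep:
        return tuple(forward.split("."))
    return tuple(entry.split(".")) + tuple(forward.split("."))


def greatest_common_prefix(a, b):
    """Longest sequence of leading steps shared by both paths (may be empty)."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def interchangeable(expr_a, expr_b):
    """True if the two unnesting follows may be swapped under the assumed reading."""
    a, b = full_path(expr_a), full_path(expr_b)
    c = greatest_common_prefix(a, b)
    if not c:                      # null common prefix: unrelated paths
        return True
    return len(c) < len(a) and len(c) < len(b)  # C is a strict prefix of both


print(interchangeable("acc_Num:invoice", "carrier:invoice"))
print(interchangeable("acc_Num:invoice", "acc_Num:customer"))
print(interchangeable("carrier:invoice", "name:invoice.carrier"))
```

The third call uses a made-up path, name:invoice.carrier, to show the case this reading rejects: the second follow starts from the vertex that the first one produces, so the two cannot be reordered.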

  31. Application of Rule on Invoice Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] * =?= Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] **

  32. Application of Rule on Invoice Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] ?= Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] Equivalent because both share the common prefix “invoice”. Case AńB = invoice

  33. Benefit of Rule Application • NOTE: assume that acc_Num is required for each invoice element, while carrier is not • THEN: Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] ?= Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] • Which algebra tree do we prefer? • Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] makes more sense than Φμ(carrier:invoice)[Φμ(acc_Num:invoice)]. Why?

  34. Discussion Reduction of Input Size on first Sub-operation: Φμ(carrier:invoice)
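The input-size argument can be made concrete with a small Python sketch. It assumes that in the nested notation the inner follow runs first, that an unnesting follow drops bags with no reachable vertex, and it uses a made-up three-invoice document in which only two invoices have a carrier. The helper follow_unnest is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""<Invoice_Document>
<invoice><acc_Num>2</acc_Num><carrier>AT&amp;T</carrier></invoice>
<invoice><acc_Num>1</acc_Num><carrier>Sprint</carrier></invoice>
<invoice><acc_Num>1</acc_Num></invoice>
</Invoice_Document>""")


def follow_unnest(bags, entry, child):
    """One extended bag per child reachable from bag[entry]; bags without a child vanish."""
    return [{**bag, f"{entry}.{child}": v}
            for bag in bags for v in bag[entry].findall(child)]


invoices = [{"invoice": inv} for inv in DOC.findall("invoice")]

# Plan with carrier first: the optional element filters early.
carrier_first = follow_unnest(invoices, "invoice", "carrier")
plan_a = follow_unnest(carrier_first, "invoice", "acc_Num")

# Plan with acc_Num first: every invoice survives the first step.
acc_first = follow_unnest(invoices, "invoice", "acc_Num")
plan_b = follow_unnest(acc_first, "invoice", "carrier")


def rows(bags):
    return sorted((b["invoice.acc_Num"].text, b["invoice.carrier"].text) for b in bags)


print(len(carrier_first), len(acc_first), rows(plan_a) == rows(plan_b))
```

Both plans return the same rows, but the second follow receives two bags when carrier is followed first and three when acc_Num is followed first.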

  35. Should we/can we apply the rule below? Φμ(acc_Num:invoice)[Φμ(acc_Num:Customer)]

  36. “acc_Num:invoice” and “acc_Num:customer” are two totally different paths • Case: AnB = ┴ • So yes, the rule is valid.
