Introduction to XML Algebra. Based on talk prepared for CS561 by Wan Liu and Bintou Kane. Data Model. data model ~ core data structures and data types supported by DBMS relational database is a table (set-oriented) data model XML format is a tree-structured hierarchical model.
Introduction to XML Algebra Based on talk prepared for CS561 by Wan Liu and Bintou Kane
Data Model • data model ~ core data structures and data types supported by DBMS • relational database is a table (set-oriented) data model • XML format is a tree-structured hierarchical model
Why XML Algebra? • It is common to translate a query language into an algebra. • First, the algebra is used to give a semantics for the query language. • Second, the algebra is used to support query optimization.
NIAGARA • Title: Following the Paths of XML Data: An Algebraic Framework for XML Query Evaluation • By: Leonidas Galanis, Efstratios Viglas, David J. DeWitt, Jeffrey F. Naughton, and David Maier • Univ. of Wisconsin
Outline • Concepts of Niagara Algebra • Operations • Optimization
Goals of Niagara Algebra • Be independent of schema information • Query on both structure and content • Generate simple, flexible, yet powerful algebraic expressions • Allow re-use of traditional optimization techniques
Example: XML Source Documents
Invoice.xml
<Invoice_Document>
  <invoice No="1">
    <account_number>2</account_number>
    <carrier>AT&T</carrier>
    <total>$0.25</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>Sprint</carrier>
    <total>$1.20</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>AT&T</carrier>
    <total>$0.75</total>
  </invoice>
</Invoice_Document>
Customer.xml
<Customer_Document>
  <customer>
    <account>1</account>
    <name>Tom</name>
  </customer>
  <customer>
    <account>2</account>
    <name>George</name>
  </customer>
</Customer_Document>
XML Data Model and Tree Graph
Example: Invoice_Document
<Invoice_Document>
  <invoice>
    <number>2</number>
    <carrier>AT&T</carrier>
    <total>$0.25</total>
  </invoice>
  <invoice>
    <number>1</number>
    <carrier>Sprint</carrier>
    <total>$1.20</total>
  </invoice>
  …
</Invoice_Document>
[Figure: ordered tree graph of Invoice_Document with two invoice nodes, each having number, carrier, and total children; leaf values 2, AT&T, $0.25 and 1, Sprint, $1.20]
Ordered tree graph, semi-structured data
XML Data Model [GVDNM01] • Collection of bags of vertices. • Vertices in a bag have no order. • Example: the bag [Root“invoice.xml”, invoice, invoice.account_number], in which vertex invoice holds <invoice> invoice-element-content </invoice> and vertex invoice.account_number holds <account_number> element-content </account_number>
Data Model • Bag elements are reachable by path expressions. • A path expression consists of two parts: • An entry point • A relative forward part • Example: account_number:invoice (entry point invoice, relative forward part account_number)
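The two-part path expression above can be sketched in a few lines of Python. This is only an illustration: the names `PathExpr`, `parse_path`, and `result_name` are hypothetical, and the "forward:entry" reading of `account_number:invoice` is inferred from the examples in these slides rather than taken from the Niagara implementation.

```python
from typing import NamedTuple


class PathExpr(NamedTuple):
    """A path expression in entry point notation (hypothetical helper type)."""
    forward: str      # relative forward part, e.g. "account_number"
    entry_point: str  # entry point, e.g. "invoice"


def parse_path(text: str) -> PathExpr:
    """Split 'forward:entry' into its two parts."""
    forward, sep, entry = text.partition(":")
    forward, entry = forward.strip(), entry.strip()
    if not sep or not forward or not entry:
        raise ValueError(f"not a 'forward:entry' path expression: {text!r}")
    return PathExpr(forward, entry)


def result_name(path: PathExpr) -> str:
    """Name under which the reached vertex is stored in a bag (assumed convention)."""
    return f"{path.entry_point}.{path.forward}"


path = parse_path("account_number:invoice")
name = result_name(path)  # "invoice.account_number"
```

The dotted name mirrors the bag notation used on the slides, for example [Root“invoice.xml”, invoice, invoice.account_number].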
Operators • Source (S), Follow, Select, Join, Rename, Expose, Vertex, Group, Union, Intersection, Difference (-), Cartesian Product
Source Operator S • Input: a list of documents • Output: a collection of singleton bags • Examples: • S (*): all known XML documents • S (invoice*.xml): all XML documents whose filename matches “invoice*.xml” • S (*, schema.dtd): all known XML documents that conform to schema.dtd
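A minimal sketch of the Source operator, assuming the "known documents" are held in an in-memory dictionary. The dictionary contents, the `source` function, and the `Root(...)` key format are all hypothetical; the DTD-filtered variant S (*, schema.dtd) is not covered here.

```python
from fnmatch import fnmatchcase
import xml.etree.ElementTree as ET

# Hypothetical stand-in for "all known XML documents": file name -> XML text.
KNOWN_DOCUMENTS = {
    "invoice_a.xml": "<Invoice_Document><invoice/></Invoice_Document>",
    "invoice_b.xml": "<Invoice_Document><invoice/></Invoice_Document>",
    "customer.xml": "<Customer_Document><customer/></Customer_Document>",
}


def source(pattern="*", documents=KNOWN_DOCUMENTS):
    """Return one singleton bag per document whose file name matches the pattern."""
    collection = []
    for name in sorted(documents):
        if fnmatchcase(name, pattern):
            root = ET.fromstring(documents[name])
            # A singleton bag: it contains only the document root vertex.
            collection.append({f"Root({name})": root})
    return collection


all_docs = source("*")
invoice_docs = source("invoice*.xml")
```

Each bag is modeled as a dictionary because the data model treats the vertices in a bag as unordered and addressable by name.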
Follow operator • Input: a path expression in entry point notation • Functionality: extracts the vertices reachable by the path expression • Output: a new bag that consists of the extracted vertex plus all contents of the original bag (in the case of an unnesting follow)
Follow operator (Example*) • Input: {[Root“invoice.xml”, invoice]}, where invoice holds <invoice> invoice-element-content </invoice> • Operator: Follow (carrier:invoice) • Output: {[Root“invoice.xml”, invoice, invoice.carrier]}, where invoice.carrier holds <carrier> carrier-element-content </carrier> • *Unnesting follow
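The unnesting follow can be sketched as below, using Python's standard `xml.etree.ElementTree`. The `follow` function and the dictionary representation of bags are assumptions made for illustration. In this sketch, a bag whose entry point has no matching child simply produces no output bag; the sample data is the second version of Invoice.xml, in which the third invoice has no carrier, with the ampersand escaped so the text parses.

```python
import xml.etree.ElementTree as ET

INVOICE_XML = """
<Invoice_Document>
  <invoice><account_number>2</account_number><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>
  <invoice><account_number>1</account_number><carrier>Sprint</carrier><total>$1.20</total></invoice>
  <invoice><account_number>1</account_number><total>$0.75</total></invoice>
</Invoice_Document>
"""


def follow(collection, path):
    """Unnesting follow: one output bag per vertex reached from the entry point."""
    forward, entry = path.split(":")
    result = []
    for bag in collection:
        for vertex in bag[entry].findall(forward):
            # New bag = extracted vertex + all contents of the original bag.
            result.append({**bag, f"{entry}.{forward}": vertex})
    return result


root = ET.fromstring(INVOICE_XML)
invoices = [{"root": root, "invoice": inv} for inv in root.findall("invoice")]
with_carrier = follow(invoices, "carrier:invoice")
carriers = [bag["invoice.carrier"].text for bag in with_carrier]
```

Three invoice bags go in and two come out, since only two invoices have a carrier child.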
Select operator • Input: a set of bags • Functionality: filters the bags of a collection using a predicate • Output: the set of bags that satisfy the predicate • Predicate: logical operators (∧, ∨, ¬) or simple qualifications (=, ≠, <, >, ≤, ≥)
Select operator (Example) • Input: {[Root“invoice.xml”, invoice], [Root“invoice.xml”, invoice], …}, where each invoice holds <invoice> invoice-element-content </invoice> • Operator: Select (invoice.carrier = Sprint) • Output: {[Root“invoice.xml”, invoice], …}, the bags whose invoice has carrier Sprint
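A minimal sketch of Select as a filter over a collection of bags. The `select` function and the use of a Python callable as the predicate are assumptions for illustration; the sample data follows the invoice example, with the ampersand escaped.

```python
import xml.etree.ElementTree as ET

INVOICE_XML = """
<Invoice_Document>
  <invoice><account_number>2</account_number><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>
  <invoice><account_number>1</account_number><carrier>Sprint</carrier><total>$1.20</total></invoice>
  <invoice><account_number>1</account_number><carrier>AT&amp;T</carrier><total>$0.75</total></invoice>
</Invoice_Document>
"""


def select(collection, predicate):
    """Keep only the bags for which the predicate holds."""
    return [bag for bag in collection if predicate(bag)]


root = ET.fromstring(INVOICE_XML)
invoices = [{"root": root, "invoice": inv} for inv in root.findall("invoice")]

# Predicate corresponding to invoice.carrier = Sprint.
sprint = select(invoices, lambda bag: bag["invoice"].findtext("carrier") == "Sprint")
sprint_totals = [bag["invoice"].findtext("total") for bag in sprint]
```

Logical connectives come for free in this representation, since predicates can be combined with Python's `and`, `or`, and `not`.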
Join operator • Input: two collections of bags • Functionality: joins the two collections based on a predicate • Output: the concatenation of the pairs of bags that satisfy the predicate
Join operator (Example) • Input: {[Root“invoice.xml”, invoice]} and {[Root“customer.xml”, customer]}, where invoice holds <invoice> invoice-element-content </invoice> and customer holds <customer> customer-element-content </customer> • Operator: Join (account_number:invoice = account:customer) • Output: {[Root“invoice.xml”, invoice, Root“customer.xml”, customer]}
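The join can be sketched as a nested loop that concatenates each qualifying pair of bags. This is an illustrative assumption, not a description of how Niagara evaluates joins; the `join` and `text` helpers are hypothetical names.

```python
import xml.etree.ElementTree as ET

INVOICE_XML = """
<Invoice_Document>
  <invoice><account_number>2 </account_number><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>
  <invoice><account_number>1 </account_number><carrier>Sprint</carrier><total>$1.20</total></invoice>
  <invoice><account_number>1 </account_number><carrier>AT&amp;T</carrier><total>$0.75</total></invoice>
</Invoice_Document>
"""
CUSTOMER_XML = """
<Customer_Document>
  <customer><account>1 </account><name>Tom</name></customer>
  <customer><account>2 </account><name>George</name></customer>
</Customer_Document>
"""


def join(left, right, predicate):
    """Concatenate every pair of bags (one from each side) that satisfies the predicate."""
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


def text(bag, name, child):
    """Whitespace-trimmed text of a child element of the named vertex."""
    return bag[name].findtext(child).strip()


invoices = [{"invoice": v} for v in ET.fromstring(INVOICE_XML).findall("invoice")]
customers = [{"customer": v} for v in ET.fromstring(CUSTOMER_XML).findall("customer")]

joined = join(
    invoices,
    customers,
    lambda l, r: text(l, "invoice", "account_number") == text(r, "customer", "account"),
)
pairs = [(text(b, "customer", "name"), text(b, "invoice", "total")) for b in joined]
```

The values are stripped before comparison because the sample documents contain trailing spaces inside the account elements.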
Expose operator • Input: a list of path expressions of the vertices to be exposed • Output: a set of bags that contain the vertices in the parameter list, in the same order
Expose operator (Example) • Input: {[Root“invoice.xml”, invoice, invoice.carrier, invoice.bill_period]} • Operator: Expose (bill_period, carrier) • Output: {[Root“invoice.xml”, invoice.bill_period, invoice.carrier]}
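A small sketch of Expose as a projection that keeps only the listed vertices, in the order given. The `expose` function is hypothetical, the bag values are placeholder strings, and the bill_period value is invented for the example. Unlike the slide's output, this simplified version does not carry the document root along.

```python
def expose(collection, names):
    """Project each bag onto the named vertices, in parameter-list order."""
    return [{name: bag[name] for name in names} for bag in collection]


bags = [
    {
        "root": "<Invoice_Document>…</Invoice_Document>",
        "invoice": "<invoice>…</invoice>",
        "invoice.carrier": "Sprint",
        "invoice.bill_period": "2001-01",  # made-up value for illustration
    }
]

exposed = expose(bags, ["invoice.bill_period", "invoice.carrier"])
exposed_order = list(exposed[0])
```

Python dictionaries preserve insertion order, so the result lists bill_period before carrier, matching the parameter list.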
Vertex operator • Creates the actual XML vertex that will encompass everything created by an expose operator • Example : (Customer_invoice)[((account)[invoice.account_number], (inv_total)[invoice.total])]
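The vertex construction in the example above can be approximated with `xml.etree.ElementTree`. The `vertex` function and its argument layout are assumptions for illustration; the bag holds plain text values here to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def vertex(tag, children, bag):
    """Build a new element `tag` whose children wrap values taken from the bag."""
    element = ET.Element(tag)
    for child_tag, source_name in children:
        child = ET.SubElement(element, child_tag)
        child.text = bag[source_name]
    return element


bag = {"invoice.account_number": "1", "invoice.total": "$1.20"}

customer_invoice = vertex(
    "Customer_invoice",
    [("account", "invoice.account_number"), ("inv_total", "invoice.total")],
    bag,
)
xml_text = ET.tostring(customer_invoice, encoding="unicode")
```

Here `xml_text` is a Customer_invoice element containing an account child and an inv_total child, mirroring the nesting in the slide's expression.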
Other operators • Group: used for arbitrary grouping of elements based on their values • Aggregate functions (e.g., average) can be used with the group operator • Rename: changes the entry point annotation of elements of a bag • Example: (invoice.bill_period, date)
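Group with an aggregate, and Rename, can be sketched as follows. The function names, the flat bag representation, numeric totals, and the bill_period value are assumptions chosen to keep the example short.

```python
from collections import defaultdict
from statistics import mean


def group(collection, key, value, aggregate):
    """Group bags by the `key` vertex and aggregate the `value` vertex per group."""
    groups = defaultdict(list)
    for bag in collection:
        groups[bag[key]].append(bag[value])
    return {k: aggregate(values) for k, values in groups.items()}


def rename(collection, old, new):
    """Change the name (entry point annotation) of one vertex in every bag."""
    return [
        {(new if name == old else name): vertex for name, vertex in bag.items()}
        for bag in collection
    ]


invoices = [
    {"invoice.account_number": "2", "invoice.total": 0.25},
    {"invoice.account_number": "1", "invoice.total": 1.20},
    {"invoice.account_number": "1", "invoice.total": 0.75},
]

average_total = group(invoices, "invoice.account_number", "invoice.total", mean)
renamed = rename([{"invoice.bill_period": "2001-01"}], "invoice.bill_period", "date")
```

Passing the aggregate as a function keeps `group` independent of any particular aggregate, so `sum`, `max`, or `len` work the same way.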
Example: XML Source Documents
Invoice.xml
<Invoice_Document>
  <invoice>
    <account_number>2</account_number>
    <carrier>AT&T</carrier>
    <total>$0.25</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>Sprint</carrier>
    <total>$1.20</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <total>$0.75</total>
  </invoice>
  <auditor>maria</auditor>
</Invoice_Document>
Customer.xml
<Customer_Document>
  <customer>
    <account>1</account>
    <name>Tom</name>
  </customer>
  <customer>
    <account>2</account>
    <name>George</name>
  </customer>
</Customer_Document>
XQuery Example
List account number, customer name, and invoice total for all invoices that have carrier = “Sprint”.
FOR $i in (invoices.xml)//invoice,
    $c in (customers.xml)//customer
WHERE $i/carrier = “Sprint” and $i/account_number = $c/account
RETURN <Sprint_invoices> $i/account_number, $c/name, $i/total </Sprint_invoices>
Example: XQuery output
<Sprint_Invoice>
  <account_number>1</account_number>
  <name>Tom</name>
  <total>$1.20</total>
</Sprint_Invoice>
Algebra Tree Execution (operators listed bottom-up, with the bags each one emits)
• Source (Invoices.xml), then Follow (*.invoice): invoice (1), invoice (2), invoice (3)
• Select (carrier = “Sprint”): invoice (2)
• Source (customers.xml), then Follow (*.customer): customer (1), customer (2)
• Join (*.invoice.account_number = *.customer.account): invoice (2), customer (1)
• Expose (*.account_number, *.name, *.total): account_number, name, total
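The algebra tree above can be traced end to end with a small Python sketch. All function names and the dictionary-based bags are assumptions; the sketch also inserts explicit follow steps for account_number, name, and total before Expose, which the slide folds into the Expose path expressions.

```python
import xml.etree.ElementTree as ET

INVOICES_XML = """
<Invoice_Document>
  <invoice><account_number>2 </account_number><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>
  <invoice><account_number>1 </account_number><carrier>Sprint</carrier><total>$1.20</total></invoice>
  <invoice><account_number>1 </account_number><total>$0.75</total></invoice>
  <auditor>maria</auditor>
</Invoice_Document>
"""
CUSTOMERS_XML = """
<Customer_Document>
  <customer><account>1 </account><name>Tom</name></customer>
  <customer><account>2 </account><name>George</name></customer>
</Customer_Document>
"""


def source(xml_text, name):
    return [{name: ET.fromstring(xml_text)}]          # one singleton bag


def follow(collection, entry, forward, name):
    return [{**bag, name: v} for bag in collection for v in bag[entry].findall(forward)]


def select(collection, predicate):
    return [bag for bag in collection if predicate(bag)]


def join(left, right, predicate):
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


def expose(collection, names):
    return [{n: bag[n] for n in names} for bag in collection]


invoices = follow(source(INVOICES_XML, "invoices.xml"), "invoices.xml", "invoice", "invoice")
sprint = select(invoices, lambda b: b["invoice"].findtext("carrier") == "Sprint")
customers = follow(source(CUSTOMERS_XML, "customers.xml"), "customers.xml", "customer", "customer")
joined = join(
    sprint,
    customers,
    lambda l, r: l["invoice"].findtext("account_number").strip()
    == r["customer"].findtext("account").strip(),
)
step = follow(joined, "invoice", "account_number", "account_number")
step = follow(step, "customer", "name", "name")
step = follow(step, "invoice", "total", "total")
exposed = expose(step, ["account_number", "name", "total"])
rows = [tuple(v.text.strip() for v in bag.values()) for bag in exposed]
```

The intermediate sizes match the tree: three invoice bags after the first follow, one after the select, and a single joined bag for account 1, Tom, $1.20.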
Optimization with Niagara • Optimizer based on the Niagara algebra: • Use operations more efficiently • Produce simpler expressions by combining operations
Language Convention • A and B are path expressions • A < B: path expression A is a prefix of B • AnB: a common prefix of paths A and B • AńB: the greatest common prefix of paths A and B • ┴: the null path expression
Heuristics using Rewrite Rules • Allow optimization based on path selectivity • Applied when using the unnesting follow operation Φμ
Interchangeability of Follow operation • Φμ(A)[Φμ(B)] = Φμ(B)[Φμ(A)] • TRUE when there exists C such that C < A and C < B and C = AńB • Or when AnB = ┴
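The side condition of this rule can be checked mechanically. The sketch below is one possible reading, with several assumptions: path expressions are written "forward:entry" as elsewhere in these slides, "<" is interpreted as proper prefix, and the function names are hypothetical.

```python
def full_path(expr):
    """Expand 'forward:entry' into a tuple of steps, entry point first."""
    forward, entry = expr.split(":")
    return tuple(entry.split(".") + forward.split("."))


def greatest_common_prefix(a, b):
    """Longest sequence of leading steps shared by both paths (empty = null path)."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def is_proper_prefix(c, path):
    return len(c) < len(path) and path[: len(c)] == c


def follows_commute(a_expr, b_expr):
    """True when the two unnesting follows may be interchanged under the rule."""
    a, b = full_path(a_expr), full_path(b_expr)
    c = greatest_common_prefix(a, b)
    if not c:                      # no common prefix at all
        return True
    return is_proper_prefix(c, a) and is_proper_prefix(c, b)


same_entry = follows_commute("acc_Num:invoice", "carrier:invoice")
disjoint = follows_commute("acc_Num:invoice", "acc_Num:customer")
dependent = follows_commute("carrier:invoice", "name:invoice.carrier")
```

The third call uses a made-up path that starts from the result of the first follow; under this reading it does not commute, because one path is a prefix of the other.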
Application of Rule on Invoice • Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] (*) =?= Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] (**)
Application of Rule on Invoice • Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] = Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] • Equivalent because both paths share the common prefix “invoice” • Case: AńB = invoice
Benefit of Rule Application • NOTE: assume that acc_Num is required for each invoice element, while carrier is optional • Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] (*) = Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] (**) • Which algebra tree do we prefer? • * makes more sense than **. Why?
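Discussion • Reduction of input size by the first sub-operation, Φμ(carrier:invoice): because carrier is optional, only invoices that have a carrier are passed on to the second follow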
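The effect of the ordering can be illustrated by counting intermediate bags on the sample invoices, where the third invoice has no carrier. This is a sketch under assumptions: `follow` and `run` are hypothetical helpers, the element name account_number stands in for acc_Num, and bag counts are only a rough proxy for cost.

```python
import xml.etree.ElementTree as ET

INVOICE_XML = """
<Invoice_Document>
  <invoice><account_number>2</account_number><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>
  <invoice><account_number>1</account_number><carrier>Sprint</carrier><total>$1.20</total></invoice>
  <invoice><account_number>1</account_number><total>$0.75</total></invoice>
</Invoice_Document>
"""


def follow(collection, path):
    """Unnesting follow: bags without a matching vertex produce no output."""
    forward, entry = path.split(":")
    return [
        {**bag, f"{entry}.{forward}": v}
        for bag in collection
        for v in bag[entry].findall(forward)
    ]


root = ET.fromstring(INVOICE_XML)
invoices = [{"invoice": inv} for inv in root.findall("invoice")]


def run(first, second):
    """Apply two follows in order; return (size after first, size after second)."""
    after_first = follow(invoices, first)
    after_second = follow(after_first, second)
    return len(after_first), len(after_second)


star = run("carrier:invoice", "account_number:invoice")         # (*): carrier first
double_star = run("account_number:invoice", "carrier:invoice")  # (**): acc_Num first
```

Both orders end with the same two bags, but (*) hands two bags to the second follow while (**) hands it three.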
Should we/can we apply the rule below? • Φμ(acc_Num:invoice)[Φμ(acc_Num:customer)]
“acc_Num:invoice” and “acc_Num:customer” are two totally different paths • Case: AnB = ┴ • So yes, the rule is valid.