Querying Distributed Data using XML Yannis Papakonstantinou UCSD
Overview • The Virtual XML View Approach towards Data Integration • Query Processing in XML Mediators • Issues Overview • An Algebra-Based Architecture • Navigation-driven Evaluation • Related Topics • Querying XML Views on the Web • Other architectures: a transducer/stream-based model • Beyond Structured Querying
Data Integration Requirements in eBusiness Applications • It starts with… “Provide to customers, partners, and employees Application X”, where X may be in Business Intelligence, Customer Support, … • Then the problem comes up… “The application uses information assets widely distributed across my enterprise.” • If only… “Give the application a single place to go to access all the information required. Requirements are evolving, so make sure the system can be easily maintained and upgraded.”
View-Based Approach: Wrappers Export Basic Source Views • Architecture, top to bottom: Client Application; Integrated (XML) View; Mediator; one (XML) View and one Wrapper per source; Customers Rel. DB and Orders Rel. DB • The Customers wrapper exports customer_table: customer (name John, id 56, city Chicago); customer (name George, id 58, city Chicago); … • The same view as XML: <customer_table> <customer> <name>John</name> <id>56</id> <city>Chicago</city> </customer> <customer> <name>George</name> <id>58</id> <city>Chicago</city> </customer> … </customer_table>
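A minimal sketch of what such a wrapper might do, assuming the relational rows are already available as Python dictionaries. The function name `export_table` and the in-memory rows are illustrative assumptions, not part of the talk.

```python
import xml.etree.ElementTree as ET


def export_table(table_name, row_tag, rows):
    """Wrap relational rows (dicts of column -> value) as an XML tree.

    Hypothetical helper: one element per row, one child element per column.
    """
    root = ET.Element(table_name)
    for row in rows:
        elem = ET.SubElement(root, row_tag)
        for column, value in row.items():
            ET.SubElement(elem, column).text = str(value)
    return root


# Rows taken from the slide's customer_table example.
customers = [
    {"name": "John", "id": 56, "city": "Chicago"},
    {"name": "George", "id": 58, "city": "Chicago"},
]
view = export_table("customer_table", "customer", customers)
xml_text = ET.tostring(view, encoding="unicode")
print(xml_text)
```

A real wrapper would produce this view on demand from the database rather than from a list held in memory.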
Wrappers Export Basic Source Views • The Orders wrapper exports order_table: order (id 1034, cid 56, item chips); order (id 1567, cid 56, item salsa); … • Architecture, top to bottom: Client Application; Integrated (XML) View; Mediator; one (XML) View and one Wrapper per source; Customers Rel. DB and Orders Rel. DB
Mediators Export Integrated Views, Tailored to Application Needs • Source view customer_table: customer (name John, id 56, city Chicago); customer (name George, id 58, city Chicago); … • Source view order_table: order (id 1034, cid 56, item chips); order (id 1567, cid 56, item salsa); … • Integrated view customers: customer (name John, id 56, city Chicago, orders: order (id 1034, item chips); order …); customer … • Architecture, top to bottom: Client Application; Integrated (XML) View; Mediator; one (XML) View and one Wrapper per source; Customers Rel. DB and Orders Rel. DB
Virtual Views: Query-Driven Mediator Operation • Application to Mediator: “Find all Chicago customer names, along with their ordered items” • Mediator to Customers wrapper: “Retrieve Chicago customer names and id’s” • Mediator to Orders wrapper: “Retrieve all cid’s and item names of orders” • Each Wrapper sits on its own source: Customers Database, Orders Database
On-Demand (Query-Driven) Mediator Operation • Customers wrapper returns: customer (name John, id 56); … • Orders wrapper returns: order (cid 56, item chips); order (cid 56, item salsa); … • Mediator returns to the Application: customers: customer (name John, ordered_items: item chips; item salsa); customer …
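The query-driven operation above can be sketched as follows. This is only an illustration under stated assumptions: `customers_wrapper`, `orders_wrapper` and `mediator` are hypothetical names, and the two wrappers are simulated with in-memory tables holding the slide's sample data.

```python
def customers_wrapper(city):
    """Hypothetical source query: names and ids of customers in one city."""
    table = [("John", 56, "Chicago"), ("George", 58, "Chicago")]
    return [{"name": n, "id": i} for n, i, c in table if c == city]


def orders_wrapper():
    """Hypothetical source query: cid and item name of every order."""
    table = [(1034, 56, "chips"), (1567, 56, "salsa")]
    return [{"cid": cid, "item": item} for _, cid, item in table]


def mediator(city):
    """Answer 'customer names in a city with their ordered items'.

    Sends one query to each wrapper, then combines the answers.
    """
    items_by_cid = {}
    for order in orders_wrapper():
        items_by_cid.setdefault(order["cid"], []).append(order["item"])
    return [
        {"name": c["name"], "ordered_items": items_by_cid.get(c["id"], [])}
        for c in customers_wrapper(city)
    ]


result = mediator("Chicago")
print(result)
```

Nothing is materialized ahead of time: the wrappers are contacted only when the application's query arrives.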
Multiple Plans are Possible • Retrieve customers • For each customer find matching orders
A New Kind of Query Processing Problem • Build and Run “Optimal” Plan • Consisting of operators that • Collect source info using supported queries and commands • Combine info into XML result
Challenges in Query Processing & Optimization • Operate within the Limited and Different Capabilities of the Sources • Describe sets of supported queries • Use most efficient supported queries • Optimize plans/queries sent to sources • Estimate Costs of Plans • Adapt Plans Along the Way • Beyond Conjunctive Queries • Compose Queries/Views Efficiently • Schema inference & optimization • Combine navigation & querying
From Limited Wrappers to Efficient Plans for Extended Query Sets • Answering Queries Using Views • But with Infinite Sets of Views • Increasing Relevance due to Web Services • Figure labels: all queries over schema; Queries supported by mediator; Queries supported by wrapper; Source Data & Schema
Challenges in Query Processing & Optimization • Operate within the Limited and Different Capabilities of the Sources • Describe sets of supported queries • Use most efficient supported queries • Optimize plans/queries sent to sources • Estimate Costs of Plans • Adapt Plans Along the Way • Beyond Conjunctive Queries • XQuery processing • Schema inference & optimization • Combine navigation & querying • Build iterator models for low memory footprint
Navigation-Driven Evaluation of Query Result • Figure: the integrated view customers (customer: name John, id 56, city Chicago, orders: order (id 1034, item chips); order …; customer …) together with the source views customer_table (customer: name John, id 56, city Chicago; customer: name George, id 58, city Chicago; …) and order_table (order: id 1034, cid 56, item chips; order: id 1567, cid 56, item salsa; …)
Navigation-Driven Evaluation • Input: client navigations, e.g. down(p) and right(p) from a node p of the result • View definition: ans = q( s1 … sn ) • The Lazy Mediator sits between the Client and the XML sources s1 … sn • Output: source navigations
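A small sketch of the idea, not the system's actual implementation: the view here is a deliberately simple filter that hides some labels, and node ids are paths of child indexes. `Source`, `LazyView` and the `hidden` parameter are assumptions made for illustration. The point is that each client `down`/`right` is translated into only as many source navigations as are needed to answer it.

```python
class Source:
    """An XML-like source navigated by down/right; counts navigations issued."""

    def __init__(self, tree):
        self.tree = tree  # a node is (label, [children])
        self.calls = 0

    def root(self):
        return ()  # a node id is a tuple of child indexes from the root

    def _node(self, path):
        node = self.tree
        for i in path:
            node = node[1][i]
        return node

    def label(self, path):
        return self._node(path)[0]

    def down(self, path):
        self.calls += 1
        return path + (0,) if self._node(path)[1] else None

    def right(self, path):
        self.calls += 1
        if not path:
            return None
        siblings = self._node(path[:-1])[1]
        nxt = path[-1] + 1
        return path[:-1] + (nxt,) if nxt < len(siblings) else None


class LazyView:
    """Virtual view that hides nodes with certain labels, computed on demand."""

    def __init__(self, source, hidden):
        self.source, self.hidden = source, hidden

    def root(self):
        return self.source.root()

    def label(self, p):
        return self.source.label(p)

    def _skip(self, p):
        # Move right in the source until a visible node (or the end) is found.
        while p is not None and self.source.label(p) in self.hidden:
            p = self.source.right(p)
        return p

    def down(self, p):
        return self._skip(self.source.down(p))

    def right(self, p):
        return self._skip(self.source.right(p))


customer = ("customer", [("name", []), ("id", []), ("city", [])])
src = Source(("customers", [customer, customer]))
view = LazyView(src, hidden={"id"})

first = view.down(view.root())   # first customer
name = view.down(first)          # its name
city = view.right(name)          # id is skipped, city is returned
print(view.label(city), src.calls)
```

The second customer is never touched, because the client never navigated to it.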
Mixing Querying & Navigation • Example: “Find details of all salsa orders below visited node” • Figure: the integrated view customers: customer (name John, id 56, city Chicago, orders: order (id 1034, item chips); order …); customer …
Challenges in Mixing Querying & Navigation • Two-dimensional navigation • Reminiscent of cursors, but there are multiple continuation points • Controlling size + shape • Contextualizing queries by navigation
Overview • The Virtual XML View Approach towards Data Integration • Query Processing in XML Mediators • Issues Overview • An Algebra-Based Architecture • Navigation-driven Evaluation • Quick Overview of Related Topics • Querying the XML View on the Web • Other architectures: a transducer-based model • Beyond Structured Querying • Fuzzy/preference queries & Top-N processing • Unstructured Queries
An Algebra-Based Query Processor Architecture • Client: sends XQuery and Navigation Requests, receives Results • Pipeline stages: Translation to Algebra; Algebra Plan; Rewriter/Optimizer; Physical Algebra Plan; Plan Execution Engine; Queries & Fetch Requests to Sources • Inputs shown alongside the pipeline: XQuery Views; Source Schemas & Types; Source Description; Functions; Function Description
Query Processing on Tuple-Oriented Algebra Enables… • Well-known efficient physical implementations of the operators • Join optimization • Nested data by nested plans or group-by • Efficient iterator model
XQuery: Queries & Views for XML
<customers> {
  for $cust in document("db")/customer
  return
    <customer> {
      $cust/id,
      for $order in document("db")/order
      where $order/cid = $cust/id
      return <order> { $order/id } </order>
    } </customer>
} </customers>
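Access and Navigation • Source data: db: customer_table: customer (name John, id 56); customer (name George, id 58) • Plan, bottom-up: source db, [$db1]; getD $db1, customer → $cust; getD $cust, id → $cust_id • After source: ($db1) = (ct) • After the first getD: ($db1, $cust) = (ct, c1), (ct, c2) • After the second getD: ($db1, $cust, $cust_id) = (ct, c1, i1), (ct, c2, i2)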
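Simplification Using Schema Inference • Since $cust_id is reached only through $cust, and $cust is “useless” otherwise, the two getD steps collapse into one • Source data: db: customer_table: customer (name John, id 56); customer (name George, id 58) • Simplified plan, bottom-up: source db, [$db1]; getD $db1, customer/id → $cust_id • After source: ($db1) = (ct) • After getD: ($db1, $cust_id) = (ct, i1), (ct, i2)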
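The access plan and its simplified form can be sketched with Python generators standing in for the tuple-oriented operators. This is an illustrative reading of `source` and `getD`, not the system's code: the dictionary-based node layout and the helper names `el` and `descend` are assumptions.

```python
def el(label, *children, text=None):
    """Hypothetical node constructor for a tiny in-memory XML tree."""
    return {"label": label, "children": list(children), "text": text}


def source(doc, var):
    """source: emit a single tuple binding `var` to the document root."""
    yield {var: doc}


def descend(node, labels):
    """Yield every descendant reached by following `labels` child by child."""
    if not labels:
        yield node
        return
    for child in node["children"]:
        if child["label"] == labels[0]:
            yield from descend(child, labels[1:])


def getD(tuples, in_var, path, out_var):
    """getD: for each input tuple, bind `out_var` to each node under `path`."""
    for t in tuples:
        for node in descend(t[in_var], path.split("/")):
            yield {**t, out_var: node}


ct = el(
    "customer_table",
    el("customer", el("name", text="John"), el("id", text="56")),
    el("customer", el("name", text="George"), el("id", text="58")),
)

# Original plan: two getD steps, with $cust as an intermediate binding.
plan = getD(getD(source(ct, "$db1"), "$db1", "customer", "$cust"),
            "$cust", "id", "$cust_id")
ids = [t["$cust_id"]["text"] for t in plan]

# Simplified plan: one getD over the longer path, no $cust binding.
simple = getD(source(ct, "$db1"), "$db1", "customer/id", "$cust_id")
simple_ids = [t["$cust_id"]["text"] for t in simple]
print(ids, simple_ids)
```

Because the operators are generators, tuples are produced one at a time as the consumer asks for them.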
Nested Plans • Operator labels: apply $part, p → $orders; for $part; nestedSrc $part (inside Plan p) • Input tuples ($db1, $cust_id): (ct, i1), (ct, i2) • Partitions $part: one per input tuple, (ct, i1) and (ct, i2) • Output tuples ($db1, $cust_id, $orders): (ct, i1, [o11…]), (ct, i2, [o21…])
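Joins and Selections • Operator labels, bottom-up: source db, [$db2]; getD $db2, order → $order; getD $order, cid → $cust_id2; getD $order, id → $order_id; nestedSrc $part; condition $cust_id2=? with label $cust_id • Tuple from nestedSrc: ($db1, $cust_id) = (ct, i1) • Tuple schemas shown: ($db2, $order, $cust_id2, $order_id, …) and ($db1, $cust_id, $db2, $order, $cust_id2, $order_id, …)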
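One way to read the nested plan with its selection is sketched below, under stated assumptions: `apply_plan` stands in for the `apply` operator, the nested plan `p` is a plain Python function, and the order data is invented sample data in the style of the slides.

```python
# Hypothetical order source: order ids with the customer id they refer to.
ORDERS = [
    {"id": "o11", "cid": "i1"},
    {"id": "o21", "cid": "i2"},
    {"id": "o12", "cid": "i1"},
]


def p(part):
    """Nested plan: scan orders and keep those whose cid matches the
    $cust_id of the partition handed in by the outer plan."""
    return (o["id"] for o in ORDERS if o["cid"] == part["$cust_id"])


def apply_plan(tuples, plan, out_var):
    """apply: run the nested plan once per input tuple and attach the
    collected results as a list-valued attribute."""
    for t in tuples:
        yield {**t, out_var: list(plan(t))}


outer = [{"$db1": "ct", "$cust_id": "i1"}, {"$db1": "ct", "$cust_id": "i2"}]
nested_result = list(apply_plan(outer, p, "$orders"))
print(nested_result)
```

The nested plan is re-evaluated for every outer tuple, which is what the group-by rewrite later in the talk tries to avoid.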
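Constructors • Operator labels: crList (producing $oidL); crEl order, $oidL → $oidE; listify $oidE → $orders • Input tuples (…, $order_id, …): (…, o1, …), (…, o2, …) • After crList, (…, $order_id, $oidL, …): (…, o1, [o1], …), (…, o2, [o2], …) • After crEl, (…, $oidL, $oidE, …): (…, [o1], e1, …), (…, [o2], e2, …), where e1 and e2 are order elements containing o1 and o2 • After listify: $orders = [e1, e2]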
Plan Decomposition • Within Rewriting Optimizer • Rules replacing “leaf” trees • May move commutable parts • Catch: No projection limitation
Replacing Nested Plans with GroupBy/Outerjoin Combinations • Figure, before and after the rewrite: operator labels apply $part, p → $R; nestedSrc $part; for $part; groupBy S(p1) → $part; subplans p1, p2, p3
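The intent of the rewrite can be illustrated with a sketch that compares a nested evaluation against an outer join followed by a group-by. It assumes customer ids are unique and uses invented sample data; the function names are hypothetical, and the code is not the optimizer's actual rule.

```python
ORDERS = [
    {"id": "o11", "cid": "i1"},
    {"id": "o21", "cid": "i2"},
    {"id": "o12", "cid": "i1"},
]


def nested(customers, orders):
    """Nested plan: rescan all orders once per customer."""
    return [(c, [o["id"] for o in orders if o["cid"] == c]) for c in customers]


def outerjoin_groupby(customers, orders):
    """Set-oriented plan: left outer join, then group by customer.

    Assumes customer ids are unique, so grouping does not merge rows
    that the nested plan would have kept apart.
    """
    # Index orders once (a hash join build phase).
    index = {}
    for o in orders:
        index.setdefault(o["cid"], []).append(o["id"])
    # Left outer join: customers without orders are padded with None.
    joined = [(c, oid) for c in customers for oid in index.get(c, [None])]
    # Group by customer, dropping the None padding.
    groups = {}
    for c, oid in joined:
        groups.setdefault(c, [])
        if oid is not None:
            groups[c].append(oid)
    return list(groups.items())


customers = ["i1", "i2", "i3"]
a = nested(customers, ORDERS)
b = outerjoin_groupby(customers, ORDERS)
print(a == b, b)
```

The outer join matters: with a plain join, customer `i3` would vanish instead of appearing with an empty list.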
Overview • The Need for Data Integration • The Virtual XML View Approach • Query Processing in XML Mediators • Architecture • Algebra • Navigation-driven Evaluation • Quick Overview of Related Topics • Querying the XML View on the Web • Other architectures: a transducer-based model • Beyond Structured Querying • Fuzzy/preference queries & Top-N processing • Unstructured Queries
Building Navigation-Driven Evaluation on the Algebra • Figure: the Client on top of the operator plan, with Source access operators at the leaves, one per Source
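Think of Each Operator as a Lazy Mediator • Operator: getD $cust, id → $cust_id • Input tuples ($db1, $cust): (ct, c1), (ct, c2) • Output tuples ($db1, $cust, $cust_id): (ct, c1, i1), (ct, c2, i2) • Each table is navigated as a tree: root, then one tuple node per tuple, then one child per attribute ($db1, $cust, $cust_id) • Source data: customer_table: customer (name John, id 56); customer (name George, id 58)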
Navigation-Driven Evaluation of Operators • Input: client navigations, augmented with nextTuple(p) and p.attr • The Lazy Operator computes its result on demand • Output: source navigations s1 … sn, each over the result of an operator below
Use of Semantic Id’s in Navigation-Driven Evaluation • A navigation request r/d(<f1, f2, …, fn>) carries the semantic id <f1, f2, …, fn> • Operator State: V1, V2, …, Vn, Other: … • The operator sets its state from f1, f2, …, fn, proceeds down/right, and returns the new semantic id <f’1, f’2, …, f’n>
Example of Semantic Id: getD X, a → Z • Figure: input and output trees, each a root with tuple children • Semantic ids: pv = <value, p’v>; pI = <identity, p’I>; pB = <binding, p’B, p’’B>
Fragments Reduce the “Set State” – “Produce State” Overhead • Figure: a partial result tree with holes: root; customer; name, “John”; order; oid, 123; three lineitem nodes; Hole 1, Hole 2, Hole 3 • Figure, after more fragments are fetched: a second order (ordnum=16) and two more lineitem nodes appear, along with Hole 4 and Hole 5; Hole 1 and Hole 3 remain and Hole 2 is gone
Controlling the Size and Shape of Fragments • Figure: Client; listify; Client-Server Interaction Controller; listify; two Source access operators, each over a Source
Fragment Size affects Memory Footprint, which affects Performance
Fragmentation Strategies • Fixed Fragment Size / FCFS • Ideal for depth-first, left-to-right navigation • Adaptive: Assign larger pieces to those who use them • Share $\propto f(L_i) / \sum_{j} f(L_j)$, with $f(x) = x^2$
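The adaptive rule can be sketched numerically. The slide does not define L_i, so this example assumes it is some per-list usage measure (for instance, how many items the client has already consumed from list i); that reading, the function name `allocate` and the `budget` parameter are all assumptions.

```python
def allocate(budget, usage, f=lambda x: x * x):
    """Split a fragment budget across lists in proportion to f(L_i).

    `usage` holds the assumed per-list measure L_i; f defaults to x^2,
    as on the slide. With no usage signal, the budget is split evenly.
    """
    if not usage:
        return []
    weights = [f(u) for u in usage]
    total = sum(weights)
    if total == 0:
        return [budget / len(usage)] * len(usage)
    return [budget * w / total for w in weights]


shares = allocate(140, [1, 2, 3])
print(shares)
```

Squaring makes the split more aggressive than a linear rule: a list used three times as much receives nine times the share.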
Response Performance for Breadth-First and Depth-First • Plots: Depth First traversal; Breadth First traversal
XSM System [VLDB02] Joint Work w/ Bertram Ludaescher, Pratik Mukhopadhyay, Yu Xu • Assume sequentially-accessed XML data • Transducer-based Compiled-code XQuery processor • Future high-bandwidth streams • XQuery on chip • XSM Compiler • inputs XQuery, DTD • produces XML Stream Machine • XSM2C translates into C or Java code
XML Stream Machine • Figure: input streams (e.g. ... <a><b>…</b><b><c>…</c></b> ...) feed data buffers; an XSM with a finite control and a stack consumes input events (<a>, <b>, <c>); output stream (e.g. ... <u><v>…</v><v><w>…</w></v> ...) • Finite control rules have the form: on Event do action • Rule fragments shown: init_stream; on <a> do …; push; flush; error
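A toy transducer in the spirit of the figure, offered only as a sketch: it reads a stream of open and close tag events, keeps a stack, and emits renamed tags. The renaming task, the function names and the tag-only tokenizer are assumptions chosen to match the sample streams in the figure; this is not the XSM compiler's output.

```python
import re


def events(stream):
    """Tokenize a tag-only XML string into ('open'|'close', tag) events."""
    for closing, tag in re.findall(r"<(/?)(\w+)>", stream):
        yield ("close" if closing else "open", tag)


def run(stream, rename):
    """Single pass over the input: on an open event push and emit,
    on a close event pop, check nesting, and emit."""
    stack, out = [], []
    for kind, tag in events(stream):
        if kind == "open":
            stack.append(tag)
            out.append(f"<{rename.get(tag, tag)}>")
        else:
            if not stack or stack.pop() != tag:
                raise ValueError(f"unbalanced </{tag}>")
            out.append(f"</{rename.get(tag, tag)}>")
    if stack:
        raise ValueError("unclosed elements at end of stream")
    return "".join(out)


output = run("<a><b></b><b><c></c></b></a>", {"a": "u", "b": "v", "c": "w"})
print(output)
```

The machine never builds the input tree: memory use is bounded by the nesting depth held on the stack. Text content is ignored by this tokenizer.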
XKeyword, XSearch: XML DBs for Unstructured Queries Joint Work w/ Andrey Balmin, Vagelis Hristidis, Yu Xu • (XML) query languages too heavyweight and structured • Need to know structure, semantics, roles • XSearch: keyword proximity queries in trees (lowest common ancestor queries) • XKeyword/DISCOVER: keyword proximity queries for searches in labeled graphs
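The lowest-common-ancestor idea behind keyword proximity in trees can be illustrated with a brute-force sketch. This is not the XSearch algorithm: the parent-pointer tree, the node names and the function `lowest_connections` are assumptions, and every pair of keyword hits is compared.

```python
from itertools import product


def ancestors(node, parent):
    """Path from a node up to the root, the node itself first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of a and b (assumes a single rooted tree)."""
    seen = set(ancestors(a, parent))
    for n in ancestors(b, parent):
        if n in seen:
            return n
    return None


def lowest_connections(parent, text, kw1, kw2):
    """Return (lca, hit1, hit2) triples whose connecting node is deepest."""
    hits1 = [n for n, t in text.items() if kw1 in t]
    hits2 = [n for n, t in text.items() if kw2 in t]
    candidates = [(lca(a, b, parent), a, b) for a, b in product(hits1, hits2)]
    if not candidates:
        return []

    def depth(n):
        return len(ancestors(n, parent))

    best = max(depth(c[0]) for c in candidates)
    return [c for c in candidates if depth(c[0]) == best]


# Hypothetical tree: customers > customer > (name, order > item).
parent = {
    "customer1": "customers", "customer2": "customers",
    "name1": "customer1", "order1": "customer1", "item1": "order1",
    "name2": "customer2", "order2": "customer2", "item2": "order2",
}
text = {"name1": "John", "item1": "salsa", "name2": "George", "item2": "salsa"}

answer = lowest_connections(parent, text, "John", "salsa")
print(answer)
```

The user supplies only the keywords; the tree structure decides that John's own salsa order is a closer match than George's.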