PARTIAL RESULTS AND STRUCTURAL AGGREGATION OVER XML DATA STREAMS

Kristin Tufte, PhD Defense, Dec 17, 2004

Presentation Transcript


  1. PARTIAL RESULTS AND STRUCTURAL AGGREGATION OVER XML DATA STREAMS Kristin Tufte PhD Defense Dec 17, 2004

  2. Streams & XML • That was then: flat tuples, e.g. (Jones, Bob, 153 Fir St., Portland) • …this is now: nested, structured data (XML), e.g. person { lname: Jones, fname: Bob, address { street: 153 Fir St., city: Portland } } • Streams: network traffic information, environmental sensor data, telephone call records, click streams

  3. New Challenges • XML • Data is nested • New operators, query language • Streams • Potentially infinite • Produce results without waiting for end of stream/data • Arrival rate not in control of database system • XML Streams • Stock Data • Data Exchange • Intelligent Transportation Systems

  4. Talk Preview • Incremental Query Evaluation (IQE) • Merge Operation • Merge Theory • Merge Performance

  5. Context for IQE • Continuous Queries – Tapestry (early 1990s) • Monotonic queries, append-only databases • Long-running Queries • Online aggregation (Hellerstein et al.) • Nested Aggregates (Tan et al.) • Incremental Query Evaluation (IQE) (Partial Results) • General solution for long-running queries over XML data • Stream Processing • Potentially infinite streams of data • STREAM, Aurora (Borealis), Niagara West • Triggers (Eric Hanson, NiagaraCQ)

  6. Incremental Query Evaluation* • Motivation: Internet queries (long-running, data in XML) • Get results to users before all of the data arrives • Non-monotonic (blocking) operators are problematic • Modify operators and system framework • Example plan (bottom to top): input (Title, Subject, DateTime) → select DateTime ≥ “12/17/04:12AM” → count group by Subject • * Joint work with Jai Shanmugasundaram

  7. (Non-)monotonic Operators • An operator O is monotonic if: A ⊆ B ⇒ O(A) ⊆ O(B) • select, join (but often implemented with a blocking algorithm) • O is non-monotonic if it is not monotonic • aggregates, nest • On new input, monotonic operators add to output; non-monotonic operators change output • Example plan (bottom to top): input (Title, Subject, DateTime) → select DateTime ≥ “12/17/04:12AM” → count group by Subject
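
The monotonicity condition on this slide can be checked on a toy input. The sketch below is illustrative only: the function names, the tuple layout (Title, Subject, hour), and the integer hours are assumptions, not part of the talk. It compares the output on a set A with the output on a superset B for a select and for a grouped count.

```python
from collections import Counter

def select_after(tuples, cutoff):
    """Monotonic: a new input tuple can only add rows to the output."""
    return {t for t in tuples if t[2] >= cutoff}

def count_by_subject(tuples):
    """Non-monotonic: a new input tuple changes an existing output row."""
    counts = Counter(subject for _title, subject, _hour in tuples)
    return set(counts.items())

# Hypothetical inputs: A is a subset of B.
a = {("Title1", "Ukraine", 1), ("Title2", "Ukraine", 3)}
b = a | {("Title3", "Ukraine", 5)}

# O(A) is a subset of O(B) for select, but not for count.
select_grows = select_after(a, 0) <= select_after(b, 0)
count_grows = count_by_subject(a) <= count_by_subject(b)
print(select_grows, count_grows)
```

The count output on A contains the row (Ukraine, 2), which is absent from the output on B, so the earlier result has to be retracted rather than extended.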

  8. Handling Non-monotonic Operators • Users issue partial result requests • Re-evaluation: transmit full result on every partial result request • Differential: avoid retransmitting duplicate data • Operators produce and process tuple inserts, deletes, updates • All tuples contain “old value” and “new value” • Example plan (bottom to top): input (Title, Subject, DateTime) → select DateTime ≥ “12/17/04:12AM” → count group by Subject → top10(count) • Tuples out of select (old Title, Subject, D/T; new Title, Subject, D/T): (null, null, null, Title1, Ukraine, 1AM), (null, null, null, Title2, Ukraine, 3AM), (null, null, null, Title3, Ukraine, 5AM) • Tuples out of count (old Subject, Count; new Subject, Count): (null, null, Ukraine, 2), (Ukraine, 2, Ukraine, 3)
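
A minimal sketch of the differential approach for the grouped count, assuming a hypothetical `DifferentialCount` class (the talk does not show an implementation). It buffers input and, on each partial result request, emits only the rows that changed since the previous request, each as an (old value, new value) tuple with `None` standing in for null.

```python
class DifferentialCount:
    """Grouped count that reports only changes since the last partial result."""

    def __init__(self):
        self.counts = {}     # current count per subject
        self.reported = {}   # counts as of the last partial result request

    def add(self, title, subject, when):
        self.counts[subject] = self.counts.get(subject, 0) + 1

    def partial_result(self):
        out = []
        for subject, new in self.counts.items():
            old = self.reported.get(subject)
            if old is None:
                out.append((None, None, subject, new))    # insert
            elif old != new:
                out.append((subject, old, subject, new))  # update
        self.reported = dict(self.counts)
        return out

op = DifferentialCount()
op.add("Title1", "Ukraine", "1AM")
op.add("Title2", "Ukraine", "3AM")
first = op.partial_result()    # request after two input tuples
op.add("Title3", "Ukraine", "5AM")
second = op.partial_result()   # request after the third tuple
print(first, second)
```

With a request after the second and third input tuples, the two results correspond to the (null, null, Ukraine, 2) and (Ukraine, 2, Ukraine, 3) tuples on the slide.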

  9. Re-evaluation vs. Differential

  10. Skewed Data

  11. Differential Nest • Input (Subject, Title): (Google, Title1), (Microsoft, Title2), (Microsoft, Title3) • produce partial result • Output (old Subject, Title; new Subject, Title): (null, null, Google, {Title1}), (null, null, Microsoft, {Title2, Title3}) • Input: (Google, Title4) • Output: (Google, {Title1}, Google, {Title1, Title4}) • Input: (Google, Title5) • Output: (Google, {Title1, Title4}, Google, {Title1, Title4, Title5}) • but what you’d really like to send is: (Google, {Title5}) and “merge” it with: (Google, {Title1, Title4}) • Merge: Subject: Google { Title1, Title4 } combined with Subject: Google { Title5 } gives Subject: Google { Title1, Title4, Title5 }

  12. Talk Preview • Incremental Query Evaluation • Merge Operation • Merge Theory • Merge Performance

  13. Merge Operation • Flexible method for combining two XML (nested) documents: a “recursive union” over similarly-structured XML documents • Merge Template guides the process • “Keys” are used to indicate when elements should be combined

  14. Merge Example • Auction Document: auction { item { iid: 501, desc: Trek Madone 5.9 Bike, bid { bidder: Dave, amt: $1500 } }, item { iid: 433, desc: 1971 Martin Guitar } } • New Bid: auction { item { iid: 501, bid { bidder: Sue, amt: $1550 } } } • Merged Document: auction { item { iid: 501, desc: Trek Madone 5.9 Bike, bid { bidder: Dave, amt: $1500 }, bid { bidder: Sue, amt: $1550 } }, item { iid: 433, desc: 1971 Martin Guitar } } • Legend: Combined, Inserted, Used in Match

  15. Merge Template (MT) • Merge Template is an XML document consisting of a tree of Element Merge Templates (EMT) • EMT is a triplet containing: (name, local key, content combine function) • Auction template: (auction, [], NoContentNoAttrs) { (item, [iid], NoContentNoAttrs) { (iid, [], ExactMatch), (desc, [], ShallowContent - Replace), (bid, [bidder, amt], NoContentNoAttrs) { (bidder, [], ExactMatch), (amt, [], ExactMatch) } } } • Matching document: auction { item { iid: 501, bid { bidder: Sue, amt: $1550 } } }
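
The template-guided merge can be sketched in a few lines. This is a hedged illustration, not the Niagara implementation: the dict-based element representation, the helper names (`elem`, `emt`, `key_value`, `merge`), and the exact behavior given to each content combine function are assumptions. Only the (name, local key, content combine function) triplets and the auction data come from the slides.

```python
import copy

def elem(name, text=None, *children):
    """Tiny stand-in for an XML element: name, text content, child list."""
    return {"name": name, "text": text, "children": list(children)}

def emt(name, key, combine, *children):
    """Element Merge Template: (name, local key, content combine function)."""
    return {"name": name, "key": key, "combine": combine,
            "children": {c["name"]: c for c in children}}

def key_value(e, t):
    # Local key value: text of the children named in the template's key list.
    return tuple(c["text"] for k in t["key"]
                 for c in e["children"] if c["name"] == k)

def merge(left, right, t):
    """Recursive union of two same-named elements, guided by template t."""
    if t["combine"] == "ExactMatch" and left["text"] != right["text"]:
        raise ValueError(f"content mismatch in <{t['name']}>")
    out = copy.deepcopy(left)
    if t["combine"] == "ShallowContent - Replace":
        out["text"] = right["text"]          # assumed: right side wins
    for rc in right["children"]:
        ct = t["children"][rc["name"]]
        idx = next((i for i, lc in enumerate(out["children"])
                    if lc["name"] == rc["name"]
                    and key_value(lc, ct) == key_value(rc, ct)), None)
        if idx is None:
            out["children"].append(copy.deepcopy(rc))                  # inserted
        else:
            out["children"][idx] = merge(out["children"][idx], rc, ct)  # combined
    return out

T = emt("auction", [], "NoContentNoAttrs",
        emt("item", ["iid"], "NoContentNoAttrs",
            emt("iid", [], "ExactMatch"),
            emt("desc", [], "ShallowContent - Replace"),
            emt("bid", ["bidder", "amt"], "NoContentNoAttrs",
                emt("bidder", [], "ExactMatch"),
                emt("amt", [], "ExactMatch"))))

auction = elem("auction", None,
               elem("item", None, elem("iid", "501"),
                    elem("desc", "Trek Madone 5.9 Bike"),
                    elem("bid", None, elem("bidder", "Dave"), elem("amt", "$1500"))),
               elem("item", None, elem("iid", "433"),
                    elem("desc", "1971 Martin Guitar")))
new_bid = elem("auction", None,
               elem("item", None, elem("iid", "501"),
                    elem("bid", None, elem("bidder", "Sue"), elem("amt", "$1550"))))

merged = merge(auction, new_bid, T)
bidders = [c["children"][0]["text"]
           for c in merged["children"][0]["children"] if c["name"] == "bid"]
print(len(merged["children"]), bidders)
```

Because `iid` is the local key for `item`, the new bid is combined into the existing item 501 instead of creating a third item, and because the `bid` key is (bidder, amt), Sue's bid is inserted next to Dave's rather than replacing it.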

  16. Merge Template Features • Used as the basis for an Accumulate operator • Repeated merge over a stream of XML documents to create an Accumulator • Accumulator is a view of the stream • Performs structural aggregation • Keys used to identify elements to combine • Keys external to document • Content-Combine Functions • aggregate, deep replace • Attributes – handled like elements without children

  17. Outline • Incremental Query Evaluation • Partial results over XML data • Merge Operation • Merge Theory • Merge Performance

  18. Theoretical Foundations • Why a formal definition? • Prove Merge is deterministic (unique result) • Unambiguous definition • Key results: • Formal definition of Merge as the join of an upper semi-lattice • Merge is the least upper bound of two documents (under some constraints) • Path Set Representation • Good for reasoning about XML documents

  19. View Merge as Least Upper Bound • D3 is “smallest” document that “contains” D1 and D2 • Auction Document (D1): auction { item { iid: 501, desc: Trek Madone 5.9 Bike, bid { bidder: Dave, amt: $1500 } }, item { iid: 433, desc: 1971 Martin Guitar } } • New Bid (D2): auction { item { iid: 501, bid { bidder: Sue, amt: $1550 } } } • Merged Document (D3): auction { item { iid: 501, desc: Trek Madone 5.9 Bike, bid { bidder: Dave, amt: $1500 }, bid { bidder: Sue, amt: $1550 } }, item { iid: 433, desc: 1971 Martin Guitar } }

  20. What can go wrong? • No unique result (no Least Upper Bound (LUB)) • Keys in Merge Template eliminate ambiguity • Know D4 is correct result if we know iid is a key for item • D1: auction { item { iid: 501 } } • D2: auction { item { iid: 433 } } • D3: auction { item { iid: 501, iid: 433 } } • D4: auction { item { iid: 501 }, item { iid: 433 } }

  21. What is a lattice? • An upper semi-lattice is a partially ordered set in which least upper bounds (LUBs) exist and are unique • A set of sets closed under union forms an upper semi-lattice • Ex 1 – Not a lattice: {1, 2}, {2, 3}; the LUB of {1, 2} and {2, 3} does not exist • Ex 2 – Lattice: {1, 2}, {2, 3}, {1, 2, 3}; order: S1 ≤ S2 if S1 ⊆ S2 • Ex 3 – Lattice: D1, D2, D3; order: document containment
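
The first two examples on this slide can be checked mechanically. The sketch below assumes a hypothetical helper, `least_upper_bound`, that searches a finite family of sets ordered by subset; it returns `None` when no least upper bound exists inside the family.

```python
def least_upper_bound(family, a, b):
    """Return the LUB of a and b within family under subset order, or None."""
    uppers = [s for s in family if a <= s and b <= s]
    least = [u for u in uppers if all(u <= v for v in uppers)]
    return least[0] if least else None

# Ex 1: the family is not closed under union, so no upper bound exists.
ex1 = [frozenset({1, 2}), frozenset({2, 3})]
# Ex 2: adding the union {1, 2, 3} makes the family an upper semi-lattice.
ex2 = ex1 + [frozenset({1, 2, 3})]

lub1 = least_upper_bound(ex1, ex1[0], ex1[1])
lub2 = least_upper_bound(ex2, ex2[0], ex2[1])
print(lub1, lub2)
```

In the second family the least upper bound is exactly the set union, which is the property the path set representation relies on later in the talk.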

  22. What do I need for a lattice? • Set of documents (LT) (T is a Merge Template) • Order (document containment) • Show LT satisfies the properties of a lattice.

  23. Document Containment Order • D1 is contained in D2 if there is a structure-preserving mapping from D1 into D2 • D1: auction { item { iid: 501, desc: Trek Madone 5.9 Bike } } • D2: auction { item { iid: 501, desc: Trek Madone 5.9 Bike }, item { iid: 433, desc: 1971 Martin Guitar } }

  24. Merge Template (T) Defines LT • A Merge Template, T, is specific to a set of documents • Auction MT specific to “auction” documents • LT is all documents that are “compatible” and “key-respecting” with respect to T • Different lattice for each Merge Template • (Figure: LT shown as a region within the set of all documents, determined by T; documents labeled D1, D2, D3, D4, D5, D8, D10)

  25. Non-Key-Respecting Documents • The order relation means “contained in”: D is contained in D′ if there is a structure-preserving mapping from D into D′ • D3 is not key-respecting with respect to T and should not be in LT • T: (auction, [], NoContentNoAttrs) { (item, [iid], NoContentNoAttrs) { (iid, [], ExactMatch) } } • D1: auction { item { iid: 501 } } • D2: auction { item { iid: 433 } } • D3: auction { item { iid: 501, iid: 433 } } • D4: auction { item { iid: 501 }, item { iid: 433 } }

  26. Merge-Lattice Theorem Overview • Associate each document D with a unique path set ρ(D) • ρ(D1) ∪ ρ(D2) is a Least Upper Bound (LUB) for ρ(D1) and ρ(D2) • ρ(D1) ∪ ρ(D2) is the “smallest” set that contains both ρ(D1) and ρ(D2) • Intuition: Merge of D1 and D2 should be the document associated with ρ(D1) ∪ ρ(D2) • (Figure: D1 and D2 in LT map via ρ1 and ρ2 to the path sets ρ(D1) and ρ(D2); D3 corresponds to ρ(D1) ∪ ρ(D2))

  27. Document and Path Set • Use Merge Template + document to create path set • One element in path set for each element in document • Path comprised of rooted key value and element content • Path set order (subset) identical to document containment order • Document: auction { item { iid: 501, desc: Trek Madone 5.9 Bike, bid { bidder: Dave, amt: $1500 } } } • Path set: auction[]: ; auction[].item[iid:501]: ; auction[].item[iid:501].iid[]:501 ; auction[].item[iid:501].desc[]:Trek Madone 5.9 Bike ; auction[].item[iid:501].bid[bidder:Dave,amt:$1500]: ; auction[].item[iid:501].bid[bidder:Dave,amt:$1500].bidder[]:Dave ; auction[].item[iid:501].bid[bidder:Dave,amt:$1500].amt[]:$1500 • Anatomy of a path: in auction[].item[iid:501].desc[]:Trek Madone 5.9 Bike, the part before the final colon is the rooted key value and the part after it is the element content
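
A sketch of how a path set might be computed from a document and a Merge Template, and how the union of two path sets behaves as their least upper bound. The tuple-based element representation and the helper names (`elem`, `emt`, `path_set`) are assumptions for illustration, and content combine functions are left out; the path syntax follows the slide.

```python
def elem(name, text="", *children):
    """Element as a (name, text, children) tuple."""
    return (name, text, children)

def emt(key=(), **children):
    """Template node: local key (child names) plus child templates by name."""
    return {"key": key, "children": children}

def path_set(e, t, prefix=""):
    """One path per element: rooted key value, then ':' and element content."""
    name, text, children = e
    kv = ",".join(f"{k}:{c[1]}" for k in t["key"]
                  for c in children if c[0] == k)
    step = f"{prefix}{name}[{kv}]"
    paths = {f"{step}:{text}"}
    for c in children:
        paths |= path_set(c, t["children"][c[0]], step + ".")
    return paths

# Keys from the auction Merge Template: item keyed on iid, bid on (bidder, amt).
T = emt(item=emt(("iid",), iid=emt(), desc=emt(),
                 bid=emt(("bidder", "amt"), bidder=emt(), amt=emt())))

d1 = elem("auction", "",
          elem("item", "", elem("iid", "501"),
               elem("desc", "Trek Madone 5.9 Bike"),
               elem("bid", "", elem("bidder", "Dave"), elem("amt", "$1500"))))
d2 = elem("auction", "",
          elem("item", "", elem("iid", "501"),
               elem("bid", "", elem("bidder", "Sue"), elem("amt", "$1550"))))

p1, p2 = path_set(d1, T), path_set(d2, T)
lub = p1 | p2   # path set of the merged document
print(len(p1), len(p2), len(lub))
```

The two documents share the paths for `auction`, `item[iid:501]`, and `iid`, so the union contains each shared element once; that is how keys decide which elements are combined rather than duplicated.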

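  28. Proof that D3 is in LT • Construct D3 from ρ(D1) ∪ ρ(D2), show D3 is compatible and key-respecting with respect to T • (Figure: ρ1 and ρ2 map D1 and D2 to the path sets ρ(D1) and ρ(D2), with inverses ρ1⁻¹ and ρ2⁻¹; σ constructs D3 from ρ(D1) ∪ ρ(D2), and σ⁻¹ (= ρ3) maps D3 back to it)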

  29. Outline • Incremental Query Evaluation • Partial results over XML data • Merge Operation • Merge Theory • Merge Performance

  30. Implementation Highlights • Accumulate operator uses repeated binary Merges to combine a series of XML documents into one result document • Accumulate is implemented as a recursive walk over input docs and the Merge Template • Implemented in Niagara v1.0 (UW-Madison) • Lazy construction of DOM nodes: SAXDOM • General improvements to Niagara 1.0 code base

  31. Performance Environment • 866 MHz Pentium III, 512 MB memory, Red Hat Linux 8.0 • Sun JVM J2SE 1.4.2, maximum memory 412 MB

  32. Input Data - XMark • Persons: site { people { person* { id, name, email, phone?, profile { education } } } } • Items: site { open_auctions { open_auction* { id, reserve?, seller { person }, interval { start, end } } } } • Bids: site { open_auctions { open_auction* { id, bidder { time, bid, personref { person } } } } } • Legend: * 0 or more, ? optional

  33. Structural Aggregation with Restructuring • Q5.1: simple structural aggregation query • For each person produce a list of items they bid on and their bids on those items • Q5.1 input (Bids): site { open_auctions { open_auction* { id, bidder { time, bid, personref { person } } } } } • Q5.1 output: people { person* { id, itemsbid { item* { id, bid* { time, amt } } } } }

  34. Restructuring of Input • Q5.1 Input: site { open_auctions { open_auction { iid: 8, bidder { time: 5:00, bid: $82, personref { person: 53 } } } } } • restructure → Restructured Input: people { person { id: 53, itemsbid { open_auction { iid: 8, bidder { time: 5:00, bid: $82, personref { person: 53 } } } } } } • accumulate → Q5.1 Output: people { person { id: 53, itemsbid { item { id: 8, bid { time: 5:00, amt: $82 } } } } }

  35. Q5.1 query plans • Merge Query Plan (bottom to top): filescan → unnest (site.open_auctions.open_auction) → unnest (bidder.person_ref.person as bidderid) → construct (restructured document) → accumulate • Nest Query Plan (bottom to top): filescan → unnest (site.open_auctions.open_auction) → unnest (open_auction.id as itemid) → unnest (bidder) → unnest (person_ref.person as bidderid) → unnest (time) → unnest (amt) → nest (itemid, bidderid) → nest (bidderid) → nest (bidderid) → nest (“”)

  36. Q5.1 Nest Query Plan • Nest Query Plan (bottom to top): filescan → unnest (open_auction, open_auction.id, bidder, person_ref.person, time, amt) → nest (itemid, bidderid) → nest (bidderid) → nest (bidderid) → nest (“”) • Q5.1 Output: people { person { id: 53, itemsbid { item { id: 8, bid { time: 5:00, amt: $82 } } } } }

  37. Q5.1 Execution Time

  38. Q5.2 Execution Time • Q5.2: for every item list of bidders and their bids • Q5.1: for every person list of items sold and bids on those items • Q5.2 output: items { item* { id, bidder* { id, bid* { time, amt } } } }

  39. Execution time breakdown Q5.2

  40. Simplified Q5.4-A Output • For each person, provide person information, list of items put up for auction (itemssold) and items bid on (itemsbid) • Output: people { person* { id, name, email, phone?, profile { education }, itemssold { open_auction* { id, reserve?, seller { person }, interval { start, end } } }, itemsbid { open_auction* { id, bidder* { time, bid, personref { person } } } } } }

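  41. Simplified Q5.4-B Output • Output: people { person* { id, name, email, phone?, profile { education }, itemssold { item* { id, reserve?, seller { person }, interval { start, end } } }, itemsbid { item* { id, bid { time, amt } } } } } • Key: the figure marks elements as renamed or deleted relative to Q5.4-A (labels shown with the key: person, personref)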

  42. Q5.4-A and Q5.4-B Results • Q5.4-B is faster despite having to unnest the input more deeply • Key factor: Q5.4-B has fewer elements in the result • (Charts: Query 5.4-A, Query 5.4-B)

  43. Merge-Ready Structural Aggregation • No restructuring; input structured similar to output • Best case for Merge • (Charts: Q5.5 (small documents), Q5.6 (big documents))

  44. Sliding Structural Aggregation • Extend accumulate to handle sliding windows • For each element, maintain range of windows • Test vs. sliding nest • (Chart: Q6.1 (group bids by item then person))
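
One way to maintain a range of windows per element is to compute the smallest and largest window id that contains the element's timestamp. The sketch below is an assumption-laden illustration: the function name `window_range`, the numbering scheme (window k covers [k * slide, k * slide + size), ids start at 0), and the parameter values are hypothetical and are not claimed to reproduce the numbering used in the talk.

```python
def window_range(ts, slide, size):
    """Return (window_min, window_max) ids of the windows containing ts.

    Assumes window k covers [k * slide, k * slide + size) and ids start at 0.
    All arguments are integers in the same time unit (minutes here).
    """
    window_max = ts // slide
    window_min = max(0, (ts - size) // slide + 1)
    return window_min, window_max

# Hypothetical setup: 120-minute windows that slide every 15 minutes.
r1 = window_range(121, slide=15, size=120)
r2 = window_range(140, slide=15, size=120)
print(r1, r2)
```

Storing one (min, max) pair per element avoids copying the element into every window it belongs to; an element can be dropped once the oldest open window id passes its maximum.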

  45. Conclusion • Studied processing of XML Streams • IQE • General framework for partial results over initial portion of stream • Merge • Flexible operator for combining XML documents • Formal definition in terms of lattice theory • Outperforms nest-based alternatives

  46. Extras/Deletes

  47. Re-evaluation vs. differential • Query plan for re-evaluation vs. differential • (Figure: plan with operators Join on Author and Nest on Author over inputs (Author, Address) and (Author, Book))

  48. Partially-Ordered Set (POSet) • Let P be a set. A partial order (≤) on P is such that for all x, y, z ∈ P: • x ≤ x • x ≤ y and y ≤ x ⇒ x = y • x ≤ y and y ≤ z ⇒ x ≤ z • Example: set of sets {1}, {1, 2}, {2, 3}, {1, 2, 3}, ordered by S1 ≤ S2 if S1 ⊆ S2

  49. Sliding Accum query plan Q6.1 • Plan (bottom to top): filescan + series of unnests → construct → bucket → sliding accumulate • Into bucket (document, timestamp): t1 = (D1, 12:01 PM), t2 = (D2, 12:20 PM), p1 = (*, 2:00 PM) • Out of bucket (document, timestamp, window-min, window-max): t1′ = (D1, 12:01 PM, 0, 7), t2′ = (D2, 12:20 PM, 1, 8), p1′ = (*, 2:00 PM, 0, 0)

  50. Sliding Nest Query Plan Q6.1 • Plan (bottom to top): filescan + series of unnests → construct → bucket → sliding nest (itemid, bidderid, windowid) → sliding nest (bidderid, windowid) → sliding nest (bidderid, windowid) → sliding nest (windowid) • Into bucket (document, timestamp): t1 = (D1, 12:01 PM), t2 = (D2, 12:20 PM), p1 = (*, 2:00 PM) • Out of bucket (document, timestamp, window-min, window-max): t1′ = (D1, 12:01 PM, 0, 7), t2′ = (D2, 12:20 PM, 1, 8), p1′ = (*, 2:00 PM, 0, 0)
