PARTIAL RESULTS AND STRUCTURAL AGGREGATION OVER XML DATA STREAMS. Kristin Tufte, PhD Defense, Dec 17, 2004
Streams & XML • Nested, structured data (XML) • Streams: network traffic information, environmental sensor data, telephone call records, click streams [Figure, captioned “That was then… …this is now.”: the flat tuple (Jones, Bob, 153 Fir St., Portland) beside the nested element person(lname: Jones, fname: Bob, address(street: 153 Fir St., city: Portland))]
New Challenges • XML: data is nested; new operators, query language • Streams: potentially infinite; produce results without waiting for end of stream/data; arrival rate not in control of database system • XML Streams: stock data, data exchange, intelligent transportation systems
Talk Preview • Incremental Query Evaluation (IQE) • Merge Operation • Merge Theory • Merge Performance
Context for IQE • Continuous Queries – Tapestry (early 1990s) • Monotonic queries, append-only databases • Long-running Queries • Online aggregation (Hellerstein et al.) • Nested Aggregates (Tan et al.) • Incremental Query Evaluation (IQE) (Partial Results) • General solution for long-running queries over XML data • Stream Processing • Potentially infinite streams of data • STREAM, Aurora (Borealis), Niagara West • Triggers (Eric Hanson, NiagaraCQ)
Incremental Query Evaluation* • Motivation: Internet queries (long-running, data in XML) • Get results to users before all of the data arrives • Non-monotonic (blocking) operators are problematic • Modify operators and system framework [Figure: example query plan over input tuples (Title, Subject, DateTime): select DateTime ≥ “12/17/04:12AM”, then count group by Subject] * Joint work with Jai Shanmugasundaram
(Non-)monotonic Operators • An operator O is monotonic if: A ⊆ B ⇒ O(A) ⊆ O(B) • select, join (but often implemented with a blocking algorithm) • O is non-monotonic if it is not monotonic • aggregates, nest • On new input, monotonic operators add to output; non-monotonic operators change output [Figure: example query plan over input tuples (Title, Subject, DateTime): select DateTime ≥ “12/17/04:12AM”, then count group by Subject]
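The monotonicity condition can be checked on a tiny example. The Python sketch below is illustrative only: the `select` and `count_by_subject` helpers and the sample rows are assumptions made for this example, not operators from the system described in the talk. It tests whether A ⊆ B implies O(A) ⊆ O(B) for each operator.

```python
def select(rows, pred):
    """Keep the rows that satisfy pred; the output is a set of rows."""
    return {row for row in rows if pred(row)}


def count_by_subject(rows):
    """Count rows per subject; the output is a set of (subject, count) pairs."""
    counts = {}
    for _title, subject in rows:
        counts[subject] = counts.get(subject, 0) + 1
    return set(counts.items())


def is_ukraine(row):
    return row[1] == "Ukraine"


A = {("Title1", "Ukraine")}
B = A | {("Title2", "Ukraine")}  # A is a subset of B

# select only adds output rows when input grows, so the subset order is kept.
select_preserves_order = select(A, is_ukraine) <= select(B, is_ukraine)

# count replaces ("Ukraine", 1) with ("Ukraine", 2), so the order is broken.
count_preserves_order = count_by_subject(A) <= count_by_subject(B)

print(select_preserves_order, count_preserves_order)
```

Here `select` keeps its earlier output and adds to it, while the count for Ukraine changes from 1 to 2, which is the "change output" behaviour of a non-monotonic operator.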
Handling Non-monotonic Operators • Users issue partial result requests • Re-evaluation – transmit full result on every partial result request • Differential – avoid retransmitting duplicate data • Operators produce and process tuple inserts, deletes, updates • All tuples contain “old value” and “new value” • Tuples over (Title, Subject, D/T), old value then new value: (null, null, null, Title1, Ukraine, 1AM), (null, null, null, Title2, Ukraine, 3AM), (null, null, null, Title3, Ukraine, 5AM) • Tuples over (Subject, Count), old value then new value: (null, null, Ukraine, 2), (Ukraine, 2, Ukraine, 3) [Figure: query plan over input tuples (Title, Subject, DateTime): select DateTime ≥ “12/17/04:12AM”, then count group by Subject, then top10(count)]
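The old-value/new-value tuples above can be produced by a small differential count operator. The following Python sketch is a minimal illustration under assumptions: the class name `DifferentialCount` and its methods are hypothetical, and the operator emits output only when a partial result is requested, which is one way to reproduce the tuples shown on the slide.

```python
class DifferentialCount:
    """Hypothetical count-group-by-Subject operator with differential output."""

    def __init__(self):
        self.current = {}  # subject -> count over all input seen so far
        self.emitted = {}  # subject -> count last sent downstream

    def insert(self, title, subject, when):
        """Process an inserted tuple (old value is null, new value is the row)."""
        self.current[subject] = self.current.get(subject, 0) + 1

    def partial_result(self):
        """Return (old subject, old count, new subject, new count) tuples."""
        out = []
        for subject, count in self.current.items():
            old = self.emitted.get(subject)
            if old is None:
                out.append((None, None, subject, count))  # insert
            elif old != count:
                out.append((subject, old, subject, count))  # update
            self.emitted[subject] = count
        return out


op = DifferentialCount()
op.insert("Title1", "Ukraine", "1AM")
op.insert("Title2", "Ukraine", "3AM")
first = op.partial_result()   # [(None, None, 'Ukraine', 2)]
op.insert("Title3", "Ukraine", "5AM")
second = op.partial_result()  # [('Ukraine', 2, 'Ukraine', 3)]
```

A re-evaluation strategy would instead resend every (Subject, Count) pair on each request; the differential form sends only what changed since the previous request.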
Differential Nest • Input tuples (Subject, Title): (Google, Title1), (Microsoft, Title2), (Microsoft, Title3); then (Google, Title4); then (Google, Title5) • Output tuples over (Subject, Title), old value then new value • On “produce partial result”: (null, null, Google, {Title1}), (null, null, Microsoft, {Title2, Title3}) • After (Google, Title4): (Google, {Title1}, Google, {Title1, Title4}) • After (Google, Title5): (Google, {Title1, Title4}, Google, {Title1, Title4, Title5}) • But what you’d really like to send is (Google, {Title5}) and “merge” it with (Google, {Title1, Title4}) [Figure: Merge combines Subject: Google (Title: Title1, Title: Title4) with Subject: Google (Title: Title5) into Subject: Google (Title: Title1, Title: Title4, Title: Title5)]
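The idea of sending only the new group member and merging it at the receiver can be sketched in a few lines of Python. This is an illustration under assumptions: groups are modelled as a dict from subject to a set of titles, and `merge_groups` is a hypothetical helper, not the Merge operator itself.

```python
def merge_groups(accumulated, delta):
    """Union a delta of nested groups into the result accumulated so far."""
    merged = {subject: set(titles) for subject, titles in accumulated.items()}
    for subject, titles in delta.items():
        merged.setdefault(subject, set()).update(titles)
    return merged


accumulated = {"Google": {"Title1", "Title4"}}
delta = {"Google": {"Title5"}}  # only the new title is transmitted
result = merge_groups(accumulated, delta)
```

With this scheme the size of each transmission depends on the number of new titles, not on the size of the group built so far.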
Talk Preview • Incremental Query Evaluation • Merge Operation • Merge Theory • Merge Performance
Merge Operation • Flexible method for combining two XML (nested) documents: a “recursive union” over similarly structured XML documents • Merge Template guides the process • “Keys” are used to indicate when elements should be combined
Merge Example • Auction Document: auction(item(iid: 501, desc: Trek Madone 5.9 Bike, bid(bidder: Dave, amt: $1500)), item(iid: 433, desc: 1971 Martin Guitar)) • New Bid: auction(item(iid: 501, bid(bidder: Sue, amt: $1550))) • Merged Document: auction(item(iid: 501, desc: Trek Madone 5.9 Bike, bid(bidder: Dave, amt: $1500), bid(bidder: Sue, amt: $1550)), item(iid: 433, desc: 1971 Martin Guitar)) [Figure legend: elements are marked as Combined, Inserted, or Used in Match]
Merge Template (MT) • Merge Template is an XML document consisting of a tree of Element Merge Templates (EMT) • EMT is a triplet containing: (name, local key, content combine function) • Auction Merge Template: (auction, [], NoContentNoAttrs) contains (item, [iid], NoContentNoAttrs); item contains (iid, [], ExactMatch), (desc, [], ShallowContent - Replace), and (bid, [bidder, amt], NoContentNoAttrs); bid contains (bidder, [], ExactMatch) and (amt, [], ExactMatch) [Figure: the template drawn beside the document fragment auction(item(iid: 501, bid(bidder: Sue, amt: $1550)))]
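A template-guided merge of the auction example can be sketched as follows. This Python code is a simplified illustration, not the implementation from the talk: the `Elem` and `EMT` classes, the `merge` function, and the exact behaviour assigned to each content combine function (ExactMatch requires equal content, Replace takes the newer content) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Elem:
    """A simplified XML element: a name, text content, and child elements."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)


@dataclass
class EMT:
    """Element Merge Template: (name, local key, content combine function)."""
    name: str
    key: tuple = ()
    combine: str = "NoContentNoAttrs"
    children: tuple = ()


def key_value(elem, emt):
    """Text of the children named in the local key, in key order."""
    return tuple(c.text for k in emt.key for c in elem.children if c.name == k)


def merge(left, right, emt):
    """Recursively union two elements that the template says match."""
    if emt.combine == "ExactMatch" and left.text != right.text:
        raise ValueError(f"content of <{emt.name}> differs")
    # Assumed semantics: Replace takes the newer content, otherwise keep the old.
    text = right.text if emt.combine == "ShallowContent-Replace" else left.text
    merged = Elem(left.name, text, list(left.children))
    child_emts = {c.name: c for c in emt.children}
    for new in right.children:
        child_emt = child_emts[new.name]
        new_key = key_value(new, child_emt)
        for i, old in enumerate(merged.children):
            if old.name == new.name and key_value(old, child_emt) == new_key:
                merged.children[i] = merge(old, new, child_emt)  # combined
                break
        else:
            merged.children.append(new)  # inserted
    return merged


def bid(bidder, amt):
    return Elem("bid", children=[Elem("bidder", bidder), Elem("amt", amt)])


TEMPLATE = EMT("auction", children=(
    EMT("item", key=("iid",), children=(
        EMT("iid", combine="ExactMatch"),
        EMT("desc", combine="ShallowContent-Replace"),
        EMT("bid", key=("bidder", "amt"), children=(
            EMT("bidder", combine="ExactMatch"),
            EMT("amt", combine="ExactMatch"),
        )),
    )),
))

auction_doc = Elem("auction", children=[
    Elem("item", children=[Elem("iid", "501"),
                           Elem("desc", "Trek Madone 5.9 Bike"),
                           bid("Dave", "$1500")]),
    Elem("item", children=[Elem("iid", "433"),
                           Elem("desc", "1971 Martin Guitar")]),
])
new_bid = Elem("auction", children=[
    Elem("item", children=[Elem("iid", "501"), bid("Sue", "$1550")]),
])
merged = merge(auction_doc, new_bid, TEMPLATE)
```

In the merged document the item with iid 501 is combined (its key matches), while the bid from Sue is inserted because its key (bidder, amt) matches no existing bid.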
Merge Template Features • Used as the basis for an Accumulate operator • Repeated merge over a stream of XML documents to create an Accumulator • Accumulator is a view of the stream • Performs structural aggregation • Keys used to identify elements to combine • Keys external to document • Content-Combine Functions • aggregate, deep replace • Attributes – handled like elements without children
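Accumulate as a repeated merge over a stream can be sketched with a fold. The Python code below is a simplification under stated assumptions: documents are nested dicts whose dict keys stand in for element keys, the `combine` argument stands in for an aggregate content-combine function, and the names `merge` and `accumulate` are hypothetical.

```python
from functools import reduce


def merge(acc, doc, combine):
    """Recursively union doc into acc, combining leaf values that collide."""
    out = dict(acc)
    for key, value in doc.items():
        if key not in out:
            out[key] = value  # new element: insert
        elif isinstance(value, dict) and isinstance(out[key], dict):
            out[key] = merge(out[key], value, combine)  # same key: recurse
        else:
            out[key] = combine(out[key], value)  # leaf: aggregate
    return out


def accumulate(stream, combine):
    """Fold a stream of documents into one accumulator document."""
    return reduce(lambda acc, doc: merge(acc, doc, combine), stream, {})


stream = [
    {"item501": {"bids": 1}},
    {"item433": {"bids": 1}},
    {"item501": {"bids": 1}},
]
accumulator = accumulate(stream, lambda old, new: old + new)
```

The accumulator here acts as a view of the stream: after three documents it holds a bid count of 2 for item501 and 1 for item433.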
Outline • Incremental Query Evaluation • Partial results over XML data • Merge Operation • Merge Theory • Merge Performance
Theoretical Foundations • Why a formal definition? • Prove Merge is deterministic (unique result) • Unambiguous definition • Key results: • Formal definition of Merge as the join of an upper semi-lattice • Merge is the least upper bound of two documents (under some constraints) • Path Set Representation • Good for reasoning about XML documents
View Merge as Least Upper Bound • D3 is the “smallest” document that “contains” D1 and D2 • Auction Document (D1): auction(item(iid: 501, desc: Trek Madone 5.9 Bike, bid(bidder: Dave, amt: $1500)), item(iid: 433, desc: 1971 Martin Guitar)) • New Bid (D2): auction(item(iid: 501, bid(bidder: Sue, amt: $1550))) • Merged Document (D3): auction(item(iid: 501, desc: Trek Madone 5.9 Bike, bid(bidder: Dave, amt: $1500), bid(bidder: Sue, amt: $1550)), item(iid: 433, desc: 1971 Martin Guitar))
What can go wrong? • No unique result (no Least Upper Bound (LUB)) • Keys in Merge Template eliminate ambiguity • Know D4 is correct result if we know iid is a key for item • D1: auction(item(iid: 501)) • D2: auction(item(iid: 433)) • D3: auction(item(iid: 501, iid: 433)) • D4: auction(item(iid: 501), item(iid: 433))
What is a lattice? • An upper semi-lattice is a partially ordered set in which least upper bounds (LUBs) exist and are unique • A set of sets closed under union forms an upper semi-lattice (⊆ implies ≤) • Ex 1 – Not Lattice: {1, 2} and {2, 3} alone; the LUB of {1, 2} and {2, 3} does not exist • Ex 2 – Lattice: {1, 2}, {2, 3}, and {1, 2, 3}; Order: S1 ≤ S2 if S1 ⊆ S2 • Ex 3 – Lattice: D1, D2, and D3; Order: document containment
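Examples 1 and 2 can be verified mechanically. The Python sketch below is an illustration; the helper name `least_upper_bound` is an assumption. It searches a family of sets, ordered by ⊆, for the least element that contains both arguments.

```python
def least_upper_bound(family, a, b):
    """Return the least set in family containing both a and b, or None."""
    uppers = [s for s in family if a <= s and b <= s]
    least = [u for u in uppers if all(u <= v for v in uppers)]
    return least[0] if least else None


s12, s23, s123 = frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 2, 3})

ex1 = [s12, s23]        # not closed under union
ex2 = [s12, s23, s123]  # closed under union

lub_ex1 = least_upper_bound(ex1, s12, s23)  # None: no upper bound in ex1
lub_ex2 = least_upper_bound(ex2, s12, s23)  # frozenset({1, 2, 3})
```

In Example 2 the LUB is exactly the union of the two sets, which is the property the path set construction relies on later in the talk.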
What do I need for a lattice? • Set of documents (LT) (T is a Merge Template) • Order (document containment) • Show LT satisfies the properties of a lattice.
Document Containment Order • D1 is contained in D2 if there is a structure-preserving mapping from D1 into D2 • D1: auction(item(iid: 501, desc: Trek Madone 5.9 Bike)) • D2: auction(item(iid: 501, desc: Trek Madone 5.9 Bike), item(iid: 433, desc: 1971 Martin Guitar))
Merge Template (T) Defines LT • A Merge Template, T, is specific to a set of documents • Auction MT specific to “auction” documents • LT is all documents that are “compatible” and “key-respecting” with respect to T • Different lattice for each Merge Template [Figure: LT, determined by T, drawn as a region inside the set of all documents]
Non-Key-Respecting Documents • The order relation means “contained in”: D is contained in D′ if there is a structure-preserving mapping from D into D′ • D3 is not key-respecting with respect to T and should not be in LT • T: (auction, [], NoContentNoAttrs) contains (item, [iid], NoContentNoAttrs), which contains (iid, [], ExactMatch) • D1: auction(item(iid: 501)) • D2: auction(item(iid: 433)) • D3: auction(item(iid: 501, iid: 433)) • D4: auction(item(iid: 501), item(iid: 433))
Merge-Lattice Theorem Overview • Associate each document D with a unique path set ρ(D) • ρ(D1) ∪ ρ(D2) is a Least Upper Bound (LUB) for ρ(D1) and ρ(D2) • ρ(D1) ∪ ρ(D2) is the “smallest” set that contains both ρ(D1) and ρ(D2) • Intuition: Merge of D1 and D2 should be the document associated with ρ(D1) ∪ ρ(D2) [Figure: D1 and D2 in LT map to the path sets ρ(D1) and ρ(D2); their union ρ(D1) ∪ ρ(D2) corresponds to D3]
Document and Path Set • Use Merge Template + document to create path set • One element in path set for each element in document • Path comprised of rooted key value and element content • Path set order (subset) identical to document containment order • Document: auction(item(iid: 501, desc: Trek Madone 5.9 Bike, bid(bidder: Dave, amt: $1500))) • Path set:
auction[]:
auction[].item[iid:501]:
auction[].item[iid:501].iid[]:501
auction[].item[iid:501].desc[]:Trek Madone 5.9 Bike
auction[].item[iid:501].bid[bidder:Dave,amt:$1500]:
auction[].item[iid:501].bid[bidder:Dave,amt:$1500].bidder[]:Dave
auction[].item[iid:501].bid[bidder:Dave,amt:$1500].amt[]:$1500
• Example: in auction[].item[iid:501].desc[]:Trek Madone 5.9 Bike, the rooted key value is auction[].item[iid:501].desc[] and the element content is Trek Madone 5.9 Bike
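The path set construction can be sketched in Python. This is an illustration under assumptions: elements are (name, text, children) tuples, the template is a nested dict mapping each element name to its local key and child templates, and `path_set` is a hypothetical helper. It produces one path per element and shows that union and subset on path sets behave like merge and containment on documents.

```python
def path_set(elem, template, prefix=""):
    """Return one 'rooted key value:content' path per element in the document."""
    name, text, children = elem
    key_names, child_templates = template[name]
    key = ",".join(f"{c[0]}:{c[1]}"
                   for k in key_names for c in children if c[0] == k)
    path = f"{prefix}{name}[{key}]"
    paths = {f"{path}:{text}"}
    for child in children:
        paths |= path_set(child, child_templates, path + ".")
    return paths


LEAF = ((), {})  # no local key, no children
TEMPLATE = {"auction": ((), {"item": (("iid",), {
    "iid": LEAF,
    "desc": LEAF,
    "bid": (("bidder", "amt"), {"bidder": LEAF, "amt": LEAF}),
})})}


def bid(bidder, amt):
    return ("bid", "", [("bidder", bidder, []), ("amt", amt, [])])


d1 = ("auction", "", [("item", "", [
    ("iid", "501", []),
    ("desc", "Trek Madone 5.9 Bike", []),
    bid("Dave", "$1500"),
])])
d2 = ("auction", "", [("item", "", [("iid", "501", []), bid("Sue", "$1550")])])

p1 = path_set(d1, TEMPLATE)
p2 = path_set(d2, TEMPLATE)
p3 = p1 | p2  # path set of the merged document
```

For these two documents `p1` has seven paths, matching the list above, and `p3` adds only the three paths contributed by the new bid, since the `auction`, `item`, and `iid` paths are shared.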
Proof that D3 is in LT • Construct D3 from ρ(D1) ∪ ρ(D2), show D3 is compatible and key-respecting with respect to T [Figure: proof diagram relating D1, D2, and D3 to their path sets through the mappings ρ1, ρ2, and σ and their inverses ρ1⁻¹, ρ2⁻¹, and σ⁻¹ (= ρ3)]
Outline • Incremental Query Evaluation • Partial results over XML data • Merge Operation • Merge Theory • Merge Performance
Implementation Highlights • Accumulate operator uses repeated binary Merges to combine a series of XML documents into one result document • Accumulate is implemented as a recursive walk over input docs and the Merge Template • Implemented in Niagara v1.0 (UW-Madison) • Lazy construction of DOM nodes: SAXDOM • General improvements to Niagara 1.0 code base
Performance Environment • 866 MHz Pentium III, 512 MB memory, Red Hat Linux 8.0 • Sun JVM J2SE 1.4.2, maximum memory 412 MB
Input Data - XMark (* = 0 or more, ? = optional) • Persons: site(people(person*(id, name, phone?, email, profile(education)))) • Items: site(open_auctions(open_auction*(id, reserve?, seller(person), interval(start, end)))) • Bids: site(open_auctions(open_auction*(id, bidder(time, bid, personref(person)))))
Structural Aggregation with Restructuring • Q5.1 – simple structural aggregation query • For each person produce a list of items they bid on and their bids on those items • Q5.1 input (Bids): site(open_auctions(open_auction*(id, bidder(time, bid, personref(person))))) • Q5.1 output: people(person*(id, itemsbid(item*(id, bid*(time, amt)))))
Restructuring of Input • Q5.1 Input: site(open_auctions(open_auction(iid: 8, bidder(time: 5:00, bid: $82, personref(person: 53))))) • restructure produces Restructured Input: people(person(id: 53, itemsbid(open_auction(iid: 8, bidder(time: 5:00, bid: $82, personref(person: 53)))))) • accumulate produces Q5.1 Output: people(person(id: 53, itemsbid(item(id: 8, bid(time: 5:00, amt: $82)))))
Q5.1 query plans • Merge Query Plan (bottom-up): filescan → unnest (site.open_auctions.open_auction) → unnest (bidder.person_ref.person as bidderid) → construct (restructured document) → accumulate • Nest Query Plan (bottom-up): filescan → unnest (site.open_auctions.open_auction) → unnest (open_auction.id as itemid) → unnest (bidder) → unnest (person_ref.person as bidderid) → unnest (time) → unnest (amt) → nest (itemid, bidderid) → nest (bidderid) → nest (bidderid) → nest (“”)
Q5.1 Nest Query Plan • Nest Query Plan (bottom-up): filescan → unnest (open_auction, open_auction.id, bidder, person_ref.person, time, amt) → nest (itemid, bidderid) → nest (bidderid) → nest (bidderid) → nest (“”) • Q5.1 Output: people(person(id: 53, itemsbid(item(id: 8, bid(time: 5:00, amt: $82)))))
Q5.2 Execution Time • Q5.2: for every item, list of bidders and their bids • Q5.1: for every person, list of items bid on and bids on those items • Q5.2 output: items(item*(id, bidder*(id, bid*(time, amt))))
Simplified Q5.4-A Output • For each person, provide person information, list of items put up for auction (itemssold) and items bid on (itemsbid) • Output: people(person*(id, name, email, phone?, profile(education), itemssold(open_auction*(id, reserve?, seller(person), interval(start, end))), itemsbid(open_auction*(id, bidder*(time, bid, personref(person))))))
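Simplified Q5.4-B Output • Output: people(person*(id, name, email, phone?, profile(education), itemssold(item*(id, reserve?, interval(start, end))), itemsbid(item*(id, bid(time, amt))))) [Figure key: elements are marked as renamed or deleted relative to Q5.4-A; seller, person, and personref also appear in the diagram]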
Q5.4-A and Q5.4-B Results • Q5.4-B is faster despite having to unnest the input more deeply • Key factor: Q5.4-B has fewer elements in the result [Figure: result charts for Query 5.4-A and Query 5.4-B]
Merge-Ready Structural Aggregation • No restructuring; input structured similarly to output • Best case for Merge [Figure: result charts for Q5.5 (small documents) and Q5.6 (big documents)]
Sliding Structural Aggregation • Extend accumulate to handle sliding windows • For each element, maintain range of windows • Test vs. sliding nest [Figure: result chart for Q6.1 (group bids by item, then person)]
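Maintaining a range of windows per element can be sketched as a simple computation on the element's timestamp. The Python code below is an assumption-laden illustration: the function name `window_range`, the 15-minute slide, the 8 slides per window, and the choice of 12:00 PM as time zero are all hypothetical, picked only because they reproduce the (window-min, window-max) pairs shown for D1 and D2 in the Q6.1 query plans.

```python
def window_range(minutes, slide=15, slides_per_window=8):
    """Return (window-min, window-max) for a timestamp in minutes past time zero.

    Assumes windows advance every `slide` minutes and each element belongs
    to `slides_per_window` consecutive windows.
    """
    first = minutes // slide
    return first, first + slides_per_window - 1


d1_range = window_range(1)   # D1 at 12:01 PM, one minute past 12:00 PM
d2_range = window_range(20)  # D2 at 12:20 PM
```

Storing only the two endpoints lets a single accumulated element stand for its membership in every window of the range, instead of being copied once per window.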
Conclusion • Studied processing of XML Streams • IQE • General framework for partial results over initial portion of stream • Merge • Flexible operator for combining XML documents • Formal definition in terms of lattice theory • Outperforms nest-based alternatives
Re-evaluation vs. differential • Query plan for re-evaluation vs. differential [Figure: query plan with the operators Join on Author and Nest on Author over inputs (Author, Address) and (Author, Book)]
Partially-Ordered Set (POSet) • Let P be a set. A partial order (≤) on P is such that for all x, y, z ∈ P: • x ≤ x • x ≤ y and y ≤ x ⇒ x = y • x ≤ y and y ≤ z ⇒ x ≤ z • Example: set of sets {1}, {1, 2}, {2, 3}, {1, 2, 3} (⊆ implies ≤): S1 ≤ S2 if S1 ⊆ S2
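The three conditions can be checked by brute force on the example. The Python sketch below is illustrative; the helper name `is_partial_order` is an assumption. It tests reflexivity, antisymmetry, and transitivity for a relation over a finite set.

```python
from itertools import product


def is_partial_order(elements, leq):
    """Check reflexivity, antisymmetry, and transitivity of leq on elements."""
    reflexive = all(leq(x, x) for x in elements)
    antisymmetric = all(
        x == y
        for x, y in product(elements, repeat=2)
        if leq(x, y) and leq(y, x)
    )
    transitive = all(
        leq(x, z)
        for x, y, z in product(elements, repeat=3)
        if leq(x, y) and leq(y, z)
    )
    return reflexive and antisymmetric and transitive


sets = [frozenset({1}), frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 2, 3})]

subset_is_partial_order = is_partial_order(sets, lambda a, b: a <= b)
proper_subset_is_partial_order = is_partial_order(sets, lambda a, b: a < b)
```

The subset relation passes all three checks, while the proper-subset relation fails because it is not reflexive.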
Sliding Accum query plan Q6.1 • Plan (bottom-up): filescan + series of unnests → construct → bucket → sliding accumulate • bucket input (document, timestamp): t1 = (D1, 12:01 PM), t2 = (D2, 12:20 PM), p1 = (*, 2:00 PM) • bucket output (document, timestamp, window-min, window-max): t1′ = (D1, 12:01 PM, 0, 7), t2′ = (D2, 12:20 PM, 1, 8), p1′ = (*, 2:00 PM, 0, 0)
Sliding Nest Query Plan Q6.1 • Plan (bottom-up): filescan + series of unnests → construct → bucket → sliding nest (itemid, bidderid, windowid) → sliding nest (bidderid, windowid) → sliding nest (bidderid, windowid) → sliding nest (windowid) • bucket input (document, timestamp): t1 = (D1, 12:01 PM), t2 = (D2, 12:20 PM), p1 = (*, 2:00 PM) • bucket output (document, timestamp, window-min, window-max): t1′ = (D1, 12:01 PM, 0, 7), t2′ = (D2, 12:20 PM, 1, 8), p1′ = (*, 2:00 PM, 0, 0)