Aggregation of Probabilistic XML Documents (Agrégation de documents XML probabilistes)
Serge Abiteboul (1), T.-H. Hubert Chan (2), Evgeny Kharlamov (1,3), Werner Nutt (3), Pierre Senellart (4)
(1) INRIA Saclay – Île-de-France
(2) The University of Hong-Kong
(3) Free University of Bozen-Bolzano
(4) Télécom ParisTech

Incomplete Databases
An incomplete database $D$ contains many instances: $D = \{d_1, \ldots, d_n, \ldots\}$
Given a query $q(x)$ and a constant $c$:
• $c$ is a certain answer for $q$ if $c \in q(d_i)$ for all $d_i \in D$
• $c$ is a possible answer for $q$ if $c \in q(d_i)$ for some $d_i \in D$
There are many ways to represent incomplete databases.

Probabilistic Databases
A probabilistic database is an incomplete database $D = \{d_1, \ldots, d_n\}$
• with a probability $\Pr(d_i) > 0$ for each instance
• such that $\Pr(d_1) + \cdots + \Pr(d_n) = 1$
A query $q$ returns the constant $c$ with probability $p$ if $p = \sum_{d_i \,:\, c \in q(d_i)} \Pr(d_i)$
• Mainly studied in the relational setting
• Imprecise data on the Web motivates probabilistic XML

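The definitions of certain answers, possible answers, and answer probabilities can be sketched in a few lines. This is only an illustration: the two instances and their probabilities (3/10 and 7/10) are hypothetical toy data, and the function names are not taken from the slides.

```python
from fractions import Fraction

# Hypothetical toy data: each entry is (answer set q(d_i), probability Pr(d_i)).
instances = [
    ({"John", "Mary"}, Fraction(3, 10)),
    ({"Rick", "Mary"}, Fraction(7, 10)),
]

def certain_answers(instances):
    """Constants that occur in q(d_i) for all instances d_i."""
    return set.intersection(*(answers for answers, _ in instances))

def possible_answers(instances):
    """Constants that occur in q(d_i) for some instance d_i."""
    return set.union(*(answers for answers, _ in instances))

def answer_probability(c, instances):
    """Sum of Pr(d_i) over the instances whose answer set contains c."""
    return sum((p for answers, p in instances if c in answers), Fraction(0))
```

With this toy data, "Mary" is a certain answer and gets probability 1, while "Rick" is only possible and gets the probability of the single instance that contains it.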
Personnel Data, Instance 1
[Figure: XML tree rooted at IT-personnel with two person nodes, each with name and bonus children; names John and Mary; projects Laptop and PDA; bonus values 37, 50, 30, 44]

Personnel Data, Instance 2
[Figure: XML tree rooted at IT-personnel with two person nodes, each with name and bonus children; names Rick and Mary; projects PDA and PDA; bonus values 25, 50, 44]

Example: Personnel Queries
• “What are the names of the IT personnel?”
• “What bonuses were paid for the PDA project?”
• “What is the sum of bonuses paid to all employees?”

Personnel DB: Certain/Possible Answers
• “What are the names of the IT personnel?” Mary: certain; Rick: possible
• “What bonuses were paid for the PDA project?” 44: certain; 15: possible
• “What is the sum of bonuses paid to all employees?” No certain answer; 161, 119: possible
Answers to aggregate queries depend on the presence of many data items.

If We Had Probabilities …
… we could ask:
• What is the probability that the sum of bonuses is 161?
• What are all possible sums of bonuses, and what is the probability of each? (Distribution of sums of bonuses)
• What is the expected value of the sum of bonuses, and what is its variance? (Moments)

The Problem Space
• Probabilistic XML document models: events; distributional nodes (MUX-DET)
• Aggregate functions: COUNT, SUM, MIN, COUNTD, AVG
• Query languages: single path queries; tree pattern queries; tree pattern queries with joins
[Figure: the three dimensions above, with a region marked "Our focus"]

Probabilistic XML: Events [Abiteboul/Senellart]
• J: John hired for Laptop project
• M: Mary worked overtime
• Probabilities of events: Pr(J) = 0.3, Pr(M) = 0.6
• The events are independent
[Figure: XML tree rooted at IT-personnel with two person nodes, each with name and bonus children; nodes are annotated with conditions over the events J and M; labels John, Rick, Mary, Laptop, PDA; bonus values 37, 50, 25, 50, 30, 44, 15]

“John was hired, Mary worked overtime”
• J: John hired for Laptop project
• M: Mary worked overtime
• Pr(J) = 0.3, Pr(M) = 0.6
[Figure: the event-annotated IT-personnel tree, before selecting the nodes whose conditions hold]

“John was hired, Mary worked overtime”
• J: John hired for Laptop project
• M: Mary worked overtime
• Pr(J) = 0.3, Pr(M) = 0.6
• $\Pr(d_1) = 0.3 \times 0.6$
[Figure: the resulting instance, with names John and Mary, projects Laptop and PDA, and bonus values 37, 50, 30, 44]

“John wasn’t hired, Mary worked overtime”
• J: John hired for Laptop project
• M: Mary worked overtime
• Pr(J) = 0.3, Pr(M) = 0.6
• $\Pr(d_2) = 0.7 \times 0.6$
[Figure: the event-annotated IT-personnel tree, before selecting the nodes whose conditions hold]

“John wasn’t hired, Mary worked overtime”
• J: John hired for Laptop project
• M: Mary worked overtime
• Pr(J) = 0.3, Pr(M) = 0.6
• $\Pr(d_2) = 0.7 \times 0.6$
[Figure: the resulting instance, with names Rick and Mary, projects PDA and PDA, and bonus values 25, 50, 44]

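The way independent events induce instances and their probabilities can be sketched by brute-force enumeration of worlds. The event probabilities come from the slides; the condition attached to each bonus value is an assumption read off the flattened figure (in particular that 30 requires both J and M, 44 requires M, and 15 requires not M), and all names in the code are hypothetical.

```python
from itertools import product

PR = {"J": 0.3, "M": 0.6}  # event probabilities from the slides

# (bonus value, condition) pairs; a condition maps an event to its required
# truth value. ASSUMPTION: this assignment is reconstructed from the figure.
BONUSES = [
    (37, {"J": True}), (50, {"J": True}),
    (25, {"J": False}), (50, {"J": False}),
    (30, {"J": True, "M": True}),
    (44, {"M": True}),
    (15, {"M": False}),
]

def worlds():
    """Yield every truth assignment to the events with its probability."""
    events = sorted(PR)
    for values in product([True, False], repeat=len(events)):
        world = dict(zip(events, values))
        prob = 1.0
        for event, holds in world.items():
            prob *= PR[event] if holds else 1 - PR[event]
        yield world, prob

def sum_distribution():
    """Map each possible sum of bonuses to its total probability."""
    dist = {}
    for world, prob in worlds():
        total = sum(value for value, cond in BONUSES
                    if all(world[e] == t for e, t in cond.items()))
        dist[total] = dist.get(total, 0.0) + prob
    return dist
```

Under these assumptions the world "John was hired, Mary worked overtime" yields the sum 161 with probability 0.3 × 0.6, and "John wasn't hired, Mary worked overtime" yields 119 with probability 0.7 × 0.6, matching the two instances shown. Enumeration is exponential in the number of events, so it only serves to illustrate the semantics.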
Probabilistic XML: MUX and DET Nodes [Nierman/Jagadish, Kimelfeld/Sagiv]
• MUX: children represent mutually exclusive choices; choices for different MUX nodes are independent
• DET: deterministic nodes; children are combined
[Figure: IT-personnel tree with MUX nodes whose edges carry the probabilities 0.3/0.7, 0.3/0.7 and 0.6/0.4, and one DET node; labels John, Rick, Mary, Laptop, PDA; bonus values 37, 50, 25, 50, 15, 30, 44]

Probabilistic XML: MUX and DET Nodes
• Choosing one child at each MUX node yields an instance, here with Pr = 0.3 × 0.7 × 0.6
[Figure: the MUX-DET tree next to the selected instance, with labels John, PDA, Mary and bonus values 25, 50, 30, 44]

Probabilistic XML (PXML)
• A PXML document $D$
  • represents (exponentially) many document instances $d$
  • each with a probability $\Pr(d)$
• PXML document models
  • CIE: long-distance dependencies
  • MUX-DET: only hierarchical dependencies
• MUX-DET can be expressed by CIE, but not (concisely) the other way round
Other models can be reduced to the ones above, or behave similarly.

Aggregate Functions
$\alpha$: finite bags of values $\to$ domain
Examples:
• count, countd: finite bags of anything $\to \mathbb{N}$
• sum, avg: finite bags of rational numbers $\to \mathbb{Q}$
Similarly: min, max, parity, top K, ...

Aggregate Queries
$Q = \alpha(q)$
Two layers:
• nonaggregate query $q(x)$
  • returns the set of nodes $q(d)$ over instance $d$
• aggregate function $\alpha$
  • applied to the labels of the nodes in $q(d)$
  • returns a single value $\alpha(q(d))$

Single Path Queries
“Which bonuses have been paid?”
• Simple form of tree pattern queries
• Paths of node labels or *, connected by “child” and “descendant” edges
• Return the set of leaf nodes reachable from the root along such a path
[Figure: the query $q_{bonus}$, a path with the nodes IT-personnel, bonus, *]

Single Path Aggregate Queries: Examples
• $\mathrm{SUM}(q_{bonus})$: “What is the sum of all bonuses?”
• $\mathrm{MAX}(q_{bonus})$: “What is the maximal bonus that was paid?”

Answer Distributions
PXML document $D$, instances $\{d_1, \ldots, d_n\}$
• $\mathrm{SUM}(q_{bonus})$ returns exactly one number for every $d_i$
• $\mathrm{SUM}(q_{bonus})$ is a random variable
• $\mathrm{SUM}(q_{bonus})$ induces a probability distribution over $D$, the answer distribution: $f(s) = \sum_{d_i \,:\, \mathrm{SUM}(q_{bonus})(d_i) = s} \Pr(d_i)$
Notation: $\mathrm{SUM}(q_{bonus})(D)$, or $\alpha(q)(D)$ abstractly

Special Case: Document Aggregation
$D$ with instances $\{d_1, \ldots, d_n\}$
• Applying $\alpha$ to a regular document $d_i$: $\alpha(d_i) := \alpha(\{|\, c \mid c \text{ is a value on a leaf of } d_i \,|\})$
• Applying $\alpha$ to the probabilistic document $D$: $\alpha(D)(c) = \sum_{d_i \,:\, \alpha(d_i) = c} \Pr(d_i)$, which again yields a distribution $\alpha(D)$

Reduction to Document Aggregation
$\alpha(q)(D) = {?}$
• Step 1: Compute a smaller PXML document $D' = q(D)$ containing only the matching paths
• Step 2: Apply $\alpha$ to $D'$
Theorem: $\alpha(q)(D) = \alpha(D')$
The result depends on the document models and on the queries being single path queries.

Applying $q_{bonus}$ = keeping only the paths that match
[Figure: $q_{bonus}$ applied to the event-annotated IT-personnel tree; the name subtrees (John, Rick, Mary) are set aside, while the bonus subtrees (Laptop, PDA, PDA; values 37, 50, 25, 50, 30, 44, 15) keep their event conditions]
… analogous for MUX-DET

Evaluating Single Path Queries/2
[Figure: $q_{bonus}$ applied to the MUX-DET IT-personnel tree; the name subtrees (John, Rick, Mary) are set aside, while the bonus subtrees keep their MUX and DET nodes with the probabilities 0.3/0.7 and 0.6/0.4 and the values 37, 50, 25, 50, 15, 30, 44]

Problems Investigated
PXML document $D$, constant $c$
• Possible Value: decide whether $\Pr(\alpha(D) = c) > 0$
• Probability Computation: compute $\Pr(\alpha(D) = c)$
• Moment Computation: compute $E(\alpha(D)^k)$, where $E$ is the expected value

Aggregation over CIE

Aggregation over CIE/2
• Possible Value: “Too much propositional logic present”
• Probability Computation: cannot be easier …
• Moment Computation:
  • Difficult for MIN, COUNTD, AVG
  • Easy for COUNT and SUM: “Moments are sums, moments of COUNT and SUM are sums of sums, which can be rearranged …”

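One way to read the remark about COUNT and SUM is through linearity of expectation: for the first moment, the expected sum is the sum over leaves of value times the probability that the leaf is present, so no enumeration of worlds is needed. The sketch below covers only this first-moment case. The leaf conditions are an assumption reconstructed from the personnel example, and the function names are hypothetical.

```python
PR = {"J": 0.3, "M": 0.6}  # event probabilities from the slides

# ASSUMPTION: conditions per bonus value, reconstructed from the example figure.
BONUSES = [
    (37, {"J": True}), (50, {"J": True}),
    (25, {"J": False}), (50, {"J": False}),
    (30, {"J": True, "M": True}),
    (44, {"M": True}),
    (15, {"M": False}),
]

def presence_probability(cond):
    """Probability of a conjunction of literals over independent events."""
    prob = 1.0
    for event, holds in cond.items():
        prob *= PR[event] if holds else 1 - PR[event]
    return prob

def expected_sum():
    """E(SUM): sum over leaves of value * Pr(leaf present)."""
    return sum(value * presence_probability(cond) for value, cond in BONUSES)

def expected_count():
    """E(COUNT): sum over leaves of Pr(leaf present)."""
    return sum(presence_probability(cond) for _, cond in BONUSES)
```

Both functions run in time linear in the number of leaves and literals, in contrast to enumerating all truth assignments of the events.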
Aggregation over MUX-DET

COUNT, SUM, MIN are Easy ...
… because they allow for divide-and-conquer evaluation:
$\mathrm{SUM}\,\{|a, b, c, d|\} = \mathrm{SUM}\,\{|a, b|\} + \mathrm{SUM}\,\{|c, d|\}$
$\alpha$ is a monoid aggregate function if
$\alpha(\{|a_1, \ldots, a_n|\}) = \alpha(\{|a_1|\}) \oplus \cdots \oplus \alpha(\{|a_n|\})$
for some commutative monoid $(M, \oplus)$ and all $a_1, \ldots, a_n$ in $M$
Examples:
• Monoid aggregate functions: count, sum, min, parity, top K
• Not monoid aggregate functions: countd, avg

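The monoid condition can be made concrete with a small sketch. Each aggregate is described by three hypothetical ingredients: the value of the function on a singleton bag, the monoid operation, and its identity element. The table and function names are illustrative and not taken from the slides.

```python
from functools import reduce

# name -> (value on a singleton bag, monoid operation, identity element)
MONOIDS = {
    "count": (lambda a: 1, lambda x, y: x + y, 0),
    "sum": (lambda a: a, lambda x, y: x + y, 0),
    "min": (lambda a: a, min, float("inf")),
}

def aggregate(name, bag):
    """Evaluate a monoid aggregate by combining its values on singletons."""
    singleton, op, identity = MONOIDS[name]
    return reduce(op, (singleton(a) for a in bag), identity)

def combine(name, left, right):
    """Merge the results of two sub-bags, as in divide and conquer."""
    _, op, _ = MONOIDS[name]
    return op(left, right)
```

For example, `combine("sum", aggregate("sum", [1, 2]), aggregate("sum", [3, 4]))` equals `aggregate("sum", [1, 2, 3, 4])`. No such table entry exists for countd or avg, since the result on a bag is not determined by combining the results on its singletons.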
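Convex Sums and Convolutions
If $\alpha$ is a monoid function, answer distributions can be computed bottom up, using two operations:
• Convex sum, at a MUX node $D$ whose children $D_1$, $D_2$ are chosen with probabilities $p$, $q$: $\alpha(D) = p\,\alpha(D_1) + q\,\alpha(D_2)$
• Convolution, at any other node $D$ with children $D_1$, $D_2$: $\alpha(D) = \alpha(D_1) * \alpha(D_2)$, which depends on the monoid
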
Convolution of Distributions
$(M, \oplus)$ monoid; $\alpha(D_1)$, $\alpha(D_2)$ distributions of subdocuments
$(\alpha(D_1) * \alpha(D_2))(c) = \sum_{c_1 \oplus c_2 = c} \alpha(D_1)(c_1) \cdot \alpha(D_2)(c_2)$

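The bottom-up evaluation with convex sums and convolutions can be sketched as follows, for monoids such as sum and min where the aggregate of a singleton bag is the value itself. The tuple encoding of nodes is hypothetical, and the example tree is an assumption reconstructed from the bonus part of the MUX-DET figure (0.3/0.7 between the Laptop and PDA bonuses, 0.6/0.4 between the grouped values 30, 44 and the value 15).

```python
from collections import defaultdict

def convex_sum(weighted):
    """MUX node: mix child distributions, sum_i p_i * dist_i."""
    out = defaultdict(float)
    for p, dist in weighted:
        for value, prob in dist.items():
            out[value] += p * prob
    return dict(out)

def convolve(d1, d2, op):
    """Other nodes: (d1 * d2)(c) = sum over c1 op c2 = c of d1(c1) * d2(c2)."""
    out = defaultdict(float)
    for c1, p1 in d1.items():
        for c2, p2 in d2.items():
            out[op(c1, c2)] += p1 * p2
    return dict(out)

def distribution(node, op, identity):
    """Answer distribution of a monoid aggregate, computed bottom up."""
    kind, payload = node
    if kind == "leaf":
        return {payload: 1.0}
    if kind == "mux":
        weighted = [(p, distribution(c, op, identity)) for p, c in payload]
        rest = 1.0 - sum(p for p, _ in weighted)
        if rest > 1e-12:  # no child chosen: contributes the empty bag
            weighted.append((rest, {identity: 1.0}))
        return convex_sum(weighted)
    result = {identity: 1.0}  # ordinary or DET node: combine all children
    for child in payload:
        result = convolve(result, distribution(child, op, identity), op)
    return result

# ASSUMPTION: bonus subtrees as read off the flattened MUX-DET figure.
TREE = ("node", [
    ("mux", [(0.3, ("node", [("leaf", 37), ("leaf", 50)])),
             (0.7, ("node", [("leaf", 25), ("leaf", 50)]))]),
    ("mux", [(0.6, ("node", [("leaf", 30), ("leaf", 44)])),
             (0.4, ("leaf", 15))]),
])

sum_dist = distribution(TREE, lambda x, y: x + y, 0)
min_dist = distribution(TREE, min, float("inf"))
```

The same traversal serves both aggregates; only the monoid operation and its identity change. The cost is governed by the size of the intermediate distributions, which this sketch does not try to bound.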
Approximating Query Answers
Over CIE, probability and moment computation can be hard. How good are Monte-Carlo methods?
Classical results (Hoeffding) imply: to achieve $|E(\alpha(D)^k) - \text{Estimate}| < \varepsilon$ with probability $1 - \delta$, at most $O(R^{2k}\, \varepsilon^{-2} \log(1/\delta))$ samples are needed, where $R = \max_d |\alpha(d)|$.
Consequence: given $\varepsilon$ and $\delta$, at most quadratically many samples are needed for $E(\mathrm{COUNTD}(D))$.

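A Monte-Carlo estimate of $E(\mathrm{COUNTD}(D))$ over an event-annotated document might look like the sketch below. The sample size uses the standard two-sided Hoeffding bound for samples in an interval of a given width, which is one concrete instance of the asymptotic bound above. The leaf conditions are an assumption reconstructed from the personnel example, and the function names are hypothetical.

```python
import math
import random

PR = {"J": 0.3, "M": 0.6}  # event probabilities from the slides

# ASSUMPTION: conditions per bonus value, reconstructed from the example figure.
BONUSES = [
    (37, {"J": True}), (50, {"J": True}),
    (25, {"J": False}), (50, {"J": False}),
    (30, {"J": True, "M": True}),
    (44, {"M": True}),
    (15, {"M": False}),
]

def hoeffding_samples(width, eps, delta):
    """Samples sufficient for |mean - E| < eps with probability >= 1 - delta,
    when every sample lies in an interval of the given width."""
    return math.ceil(width ** 2 * math.log(2 / delta) / (2 * eps ** 2))

def sample_countd(rng):
    """Draw one world and count the distinct bonus values present in it."""
    world = {event: rng.random() < p for event, p in PR.items()}
    present = {value for value, cond in BONUSES
               if all(world[e] == t for e, t in cond.items())}
    return len(present)

def estimate_expected_countd(eps, delta, seed=0):
    """Average COUNTD over enough sampled worlds for the Hoeffding guarantee."""
    rng = random.Random(seed)
    # COUNTD lies between 0 and the number of leaves, so that is the width.
    n = hoeffding_samples(len(BONUSES), eps, delta)
    return sum(sample_countd(rng) for _ in range(n)) / n
```

Because COUNTD is bounded by the number of leaves, the sample size grows quadratically in the document size for fixed `eps` and `delta`, which is the consequence stated on the slide.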
Probabilistic Aggregation: Related Work
• Tree pattern queries over MUX-DET with HAVING constraints [Cohen/Kimelfeld/Sagiv]
• Conjunctive queries with HAVING constraints over relational probabilistic databases [Re/Suciu]
• Work on various special topics in the relational setting
  • probabilistic data streams
  • uncertain schema mappings

Aggregates over PXML: Conclusion
First results of an ongoing project:
• Map of the problem space
• Largely complete investigation for single path queries:
  • Intractability for CIE
  • Hierarchical dependencies in MUX-DET can be exploited for monoid aggregation functions
Some results carry over to other models, e.g.,
• Uncertain schema mappings (Dong/Halevy/Yu)
Current work:
• richer query languages, continuous distributions on leaves