PXML: A Probabilistic Semistructured Data Model and Algebra

Edward Hung, Lise Getoor, V.S. Subrahmanian. University of Maryland, College Park. ICDE, Bangalore, India, Mar 2003.

Presentation Transcript


  1. PXML: A Probabilistic Semistructured Data Model and Algebra Edward Hung, Lise Getoor, V.S. Subrahmanian University of Maryland, College Park ICDE, Bangalore, India, Mar 2003

  2. Outline • Motivating example • Semistructured data model • PXML data model • Semantics • Algebra • Probabilistic point query • Related work

  3. Motivating Example • Surveillance applications monitoring a region of a battlefield • An image processing system identifies vehicles in convoys appearing in the region at different times • Convoys • timestamp • tanks, trucks, etc. • Uncertainty • number of vehicles • category and identity of a vehicle, e.g., is it a tank? a T-72?

  4. Motivating Example • Semistructured data model • General hierarchical structure is known. • The schema is not fixed • Number of vehicles • Properties of vehicles • Our work: store uncertain information in probabilistic environments.

  5. Semistructured Data Model • Example

  6. PIXML Data Model • Uncertainty • Existence of sub-objects • Number of sub-objects • Identity of the sub-objects

  7. PIXML Data Model (Cardinality) • Example of cardinality: card(convoy2, ts) = [1,1], card(convoy2, truck) = [1,2] (figure: convoy2 with timestamp Time = 15) • Weak instance W = semistructured instance + card

  8. PIXML Data Model (Weak Instance) • Example of a weak instance W • card(S1, convoy) = [2,2] • card(convoy1, ts) = [1,1] • card(convoy1, truck) = [1,1] • card(convoy1, tank) = [1,1] • card(convoy2, ts) = [1,1] • card(convoy2, truck) = [1,2]

  9. PIXML Data Model • Example of an instance compatible with W • card(S1, convoy) = [2,2] • card(convoy1, ts) = [1,1] • card(convoy1, truck) = [1,1] • card(convoy1, tank) = [1,1] • card(convoy2, ts) = [1,1] • card(convoy2, truck) = [1,2]

  10. D(W)= the set of all semistructured instances compatible with the weak instance W

  11. Potential child set of convoy2 • card(convoy2, ts) = [1,1], card(convoy2, truck) = [1,2] • PC(convoy2) = {{ts2, truck3, truck4}, {ts2, truck3}, {ts2, truck4}}

  12. Object probability function (OPF) • card(convoy2, ts) = [1,1], card(convoy2, truck) = [1,2] • The OPF for convoy2 w.r.t. W is a mapping w_convoy2: PC(convoy2) → [0,1] whose values sum to 1, e.g. • w_convoy2({ts2, truck3, truck4}) = 0.2 • w_convoy2({ts2, truck3}) = 0.5 • w_convoy2({ts2, truck4}) = 0.3
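The potential child sets and the OPF of convoy2 can be reproduced with a short Python sketch. It assumes that a potential child set picks, for every label, a number of candidate children within that label's cardinality interval; the function names `potential_child_sets` and `is_opf` are illustrative and do not come from the slides.

```python
from itertools import combinations, product
from math import isclose


def potential_child_sets(candidates, card):
    """Enumerate child sets that respect the cardinality interval of every label.

    candidates: label -> list of candidate child ids
    card:       label -> (min, max) number of children with that label
    """
    per_label = []
    for label, objs in candidates.items():
        lo, hi = card[label]
        per_label.append([frozenset(c)
                          for k in range(lo, hi + 1)
                          for c in combinations(objs, k)])
    # One choice per label, merged into a single child set.
    return {frozenset().union(*combo) for combo in product(*per_label)}


def is_opf(w, pc):
    """Check that w is defined exactly on pc, maps into [0,1] and sums to 1."""
    return (set(w) == pc
            and all(0.0 <= p <= 1.0 for p in w.values())
            and isclose(sum(w.values()), 1.0))


pc_convoy2 = potential_child_sets(
    {"ts": ["ts2"], "truck": ["truck3", "truck4"]},
    {"ts": (1, 1), "truck": (1, 2)},
)

w_convoy2 = {
    frozenset({"ts2", "truck3", "truck4"}): 0.2,
    frozenset({"ts2", "truck3"}): 0.5,
    frozenset({"ts2", "truck4"}): 0.3,
}

print(len(pc_convoy2), is_opf(w_convoy2, pc_convoy2))
```

Using frozensets as dictionary keys keeps each child set hashable, so an OPF is simply a dictionary over PC(convoy2).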

  13. Semantics (Local Interpretation) • Interpretation • Local interpretation, p • a mapping from the set of non-leaf objects to OPFs • Example • p(convoy2) = w_convoy2

  14. Semantics (Local Interpretation) • Here the OPF assigns a probability to each possible set of children. • Further independence assumptions can make the representation more compact • e.g. independence between trucks and tanks. • e.g. all trucks are indistinguishable.

  15. Semantics (Global Interpretation) • Previously, probabilities were assigned to the actual children of each non-leaf object in a local manner. • Now we assign a probability to each compatible instance globally.

  16. Semantics (Global Interpretation) • Interpretation • Global interpretation, P • a mapping from D(W) (the set of semistructured instances compatible with W) to [0,1] s.t. the probabilities of all instances in D(W) sum to 1

  17. Example: six compatible instances S1a to S1f (figure) • P(S1a) = 0.12 • P(S1b) = 0.08 • P(S1c) = 0.2 • P(S1d) = 0.18 • P(S1e) = 0.12 • P(S1f) = 0.3

  18. Semantics (Local ↔ Global) • We have defined operators to convert between local and global interpretations. • Theorems (Reversibility) • The conversions from local to global interpretation and from global to local interpretation are correct. • The conversion between local and global interpretations is reversible.
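A minimal Python sketch of the local-to-global direction follows. It assumes a tree-shaped instance in which every non-leaf object picks its child set independently, so the probability of a compatible instance is the product of the chosen OPF entries. The OPF of convoy2 is the one given on the OPF slide; the OPF of convoy1 (0.4 and 0.6) and its child names are hypothetical values chosen so that the products match the six instance probabilities listed in the global interpretation example. The function name `global_interpretation` is illustrative only.

```python
from itertools import product
from math import isclose

# Local interpretation: non-leaf object -> OPF (child set -> probability).
opf = {
    "S1": {frozenset({"convoy1", "convoy2"}): 1.0},
    "convoy1": {  # hypothetical OPF, not taken from the slides
        frozenset({"ts1", "truck1", "tank1"}): 0.4,
        frozenset({"ts1", "truck1", "tank2"}): 0.6,
    },
    "convoy2": {
        frozenset({"ts2", "truck3", "truck4"}): 0.2,
        frozenset({"ts2", "truck3"}): 0.5,
        frozenset({"ts2", "truck4"}): 0.3,
    },
}


def global_interpretation(opf, root):
    """Map each compatible instance to the product of its local choices.

    An instance is encoded as a frozenset of (object, chosen child set) pairs.
    Assumes a tree: every object has at most one parent.
    """
    def expand(obj):
        if obj not in opf:            # leaf: nothing to choose, probability 1
            return [((), 1.0)]
        results = []
        for children, p in opf[obj].items():
            subtrees = [expand(c) for c in sorted(children)]
            for combo in product(*subtrees):
                pairs = ((obj, children),) + tuple(
                    pair for part, _ in combo for pair in part)
                prob = p
                for _, q in combo:
                    prob *= q
                results.append((pairs, prob))
        return results

    return {frozenset(pairs): prob for pairs, prob in expand(root)}


P = global_interpretation(opf, "S1")
print(sorted(round(p, 10) for p in P.values()))
```

The reverse direction (global to local) would marginalize P over the child set chosen by each object; it is omitted here.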

  19. Algebra • Operators • Projection • Selection • Cross-product • Path expression • o.l1.l2…ln, e.g. S1.convoy.truck

  20. Algebra (Projection) • Ancestor projection • Descendant projection • Single projection

  21. Algebra (Projection) Semistructured Instance • Ancestor projection ( )

  22. Globally • Ancestor projection ( )

  23. Probabilistic Instance • Ancestor projection ( ) • Before: card(I2,convoy)=[1,1], card(convoy1,ts)=[1,1], card(convoy1,truck)=[1,1], card(convoy1,tank)=[1,1] • p(I2)({convoy1})=0.8 • OPF over PC(convoy1): p(convoy1)({ts1,truck1,tank1})=0, p(convoy1)({ts1,truck1,tank2})=0.1, p(convoy1)({ts1,truck2,tank1})=0.3, p(convoy1)({ts1,truck2,tank2})=0.6 • After: card(I2,convoy)=[1,1], card(convoy1,truck)=[1,1] • After normalization, p(I2)({convoy1})=1 • Children of convoy1 before = C_I2(convoy1) = {ts1, truck1, truck2, tank1, tank2} • Children of convoy1 after = C_I2'(convoy1) = {truck1, truck2} • PC'(convoy1) = {{truck1},{truck2}}

  24. Probabilistic Instance • Ancestor projection ( ) • Before: card(I2,convoy)=[1,1], card(convoy1,ts)=[1,1], card(convoy1,truck)=[1,1], card(convoy1,tank)=[1,1] • p(I2)({convoy1})=0.8 • OPF over PC(convoy1): p(convoy1)({ts1,truck1,tank1})=0, p(convoy1)({ts1,truck1,tank2})=0.1, p(convoy1)({ts1,truck2,tank1})=0.3, p(convoy1)({ts1,truck2,tank2})=0.6 • After: card(I2,convoy)=[1,1], card(convoy1,truck)=[1,1] • After normalization, p(I2)({convoy1})=1 • For {truck1}, p(convoy1)({truck1}) = 0 + 0.1 = 0.1 • For {truck2}, p(convoy1)({truck2}) = 0.3 + 0.6 = 0.9 • After normalization, p(convoy1)({truck1}) = 0.1, p(convoy1)({truck2}) = 0.9
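The OPF update in this example can be sketched in Python as a marginalization: each old child set is restricted to the children that survive the projection, the probabilities of child sets that collapse to the same restricted set are added, and the result is normalized. This covers only the update at a single object, not the full ancestor projection operator, and the name `project_opf` is illustrative.

```python
from collections import defaultdict


def project_opf(opf, kept):
    """Restrict every child set to `kept`, sum collapsed entries, normalize."""
    merged = defaultdict(float)
    for children, p in opf.items():
        merged[children & kept] += p
    total = sum(merged.values())
    return {children: p / total for children, p in merged.items()}


p_convoy1 = {
    frozenset({"ts1", "truck1", "tank1"}): 0.0,
    frozenset({"ts1", "truck1", "tank2"}): 0.1,
    frozenset({"ts1", "truck2", "tank1"}): 0.3,
    frozenset({"ts1", "truck2", "tank2"}): 0.6,
}

# Only the trucks of convoy1 survive the projection.
projected = project_opf(p_convoy1, frozenset({"truck1", "truck2"}))

# The root keeps its single child; normalization turns 0.8 into 1.
p_I2 = project_opf({frozenset({"convoy1"}): 0.8}, frozenset({"convoy1"}))

print(projected)
print(p_I2)
```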

  25. Ancestor Projection • Experiments • running time is linear in the number of objects (selected objects and their ancestors) • time to update the OPF entries of an object o is sub-quadratic in the number of OPF entries

  26. Algebra (Selection) ( ) • card(I7, convoy)=[1,2], w_I7({convoy1})=0.2, w_I7({convoy2})=0.5, w_I7({convoy1,convoy2})=0.3 • card(convoy1, tank)=[1,1], w_convoy1({tank1})=0.3, w_convoy1({tank2})=0.7 • card(convoy2, tank)=[1,1], w_convoy2({tank2})=0.4, w_convoy2({tank3})=0.6 • Probabilities of the eight instances in D(I7): 0.06, 0.14, 0.2, 0.3, 0.036, 0.054, 0.084, 0.126 • Selected instances: 0.14 + 0.2 + 0.036 + 0.084 + 0.126 = 0.586 • After selection, each selected probability is normalized: 0.14/0.586, 0.2/0.586, 0.036/0.586, 0.084/0.586, 0.126/0.586
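The selection example can be checked by brute force in Python: enumerate every compatible instance of I7, keep those that satisfy the condition, and renormalize. The selection condition itself did not survive in the transcript; the sketch assumes it is "some convoy has tank2 as its tank", which is the condition that reproduces the total of 0.586. The helper names `worlds` and `select` are illustrative.

```python
from itertools import product
from math import isclose

w_I7 = {
    ("convoy1",): 0.2,
    ("convoy2",): 0.5,
    ("convoy1", "convoy2"): 0.3,
}
w_tank = {
    "convoy1": {"tank1": 0.3, "tank2": 0.7},
    "convoy2": {"tank2": 0.4, "tank3": 0.6},
}


def worlds():
    """Yield ((convoys, tanks), probability) for every compatible instance."""
    for convoys, p in w_I7.items():
        options = [w_tank[c].items() for c in convoys]
        for combo in product(*options):
            prob = p
            for _, q in combo:
                prob *= q
            tanks = tuple(t for t, _ in combo)
            yield (convoys, tanks), prob


def select(target):
    """Keep instances containing `target` and renormalize their probabilities."""
    kept = {w: p for w, p in worlds() if target in w[1]}
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}, total


selected, mass = select("tank2")  # assumed selection condition
print(len(list(worlds())), len(selected), round(mass, 3))
```

Enumeration is exponential in general; it is used here only to confirm the numbers in the example.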

  27. Algebra (Cross product (x)) card(I4, truck)=[1,1] p(I4)({truck1})=0.2 p(I4)({truck2})=0.8 card(I5, tank)=[1,1] p(I5)({tank1})=0.1 p(I5)({tank2})=0.9 card(I6, truck)=[1,1] card(I6, tank)=[1,1] I4 x I5 p(I6)({truck1, tank1})=0.2*0.1=0.02

  28. Algebra (Cross product (x)) card(I4, truck)=[1,1] p(I4)({truck1})=0.2 p(I4)({truck2})=0.8 card(I5, tank)=[1,1] p(I5)({tank1})=0.1 p(I5)({tank2})=0.9 card(I6, truck)=[1,1] card(I6, tank)=[1,1] I4 x I5 p(I6)({truck1, tank1})=0.2*0.1=0.02 p(I6)({truck1, tank2})=0.2*0.9=0.18

  29. Algebra (Cross product (x)) card(I4, truck)=[1,1] p(I4)({truck1})=0.2 p(I4)({truck2})=0.8 card(I5, tank)=[1,1] p(I5)({tank1})=0.1 p(I5)({tank2})=0.9 card(I6, truck)=[1,1] card(I6, tank)=[1,1] I4 x I5 p(I6)({truck1, tank1})=0.2*0.1=0.02 p(I6)({truck1, tank2})=0.2*0.9=0.18 p(I6)({truck2, tank1})=0.8*0.1=0.08

  30. Algebra (Cross product (x)) card(I4, truck)=[1,1] p(I4)({truck1})=0.2 p(I4)({truck2})=0.8 card(I5, tank)=[1,1] p(I5)({tank1})=0.1 p(I5)({tank2})=0.9 card(I6, truck)=[1,1] card(I6, tank)=[1,1] I4 x I5 p(I6)({truck1, tank1})=0.2*0.1=0.02 p(I6)({truck1, tank2})=0.2*0.9=0.18 p(I6)({truck2, tank1})=0.8*0.1=0.08 p(I6)({truck2, tank2})=0.8*0.9=0.72
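The four products above follow one pattern, sketched here in Python: the new root's OPF assigns to each union of a child set of I4 and a child set of I5 the product of their probabilities. The sketch assumes the two instances are independent and share no children; `cross_product_opf` is an illustrative name.

```python
from math import isclose


def cross_product_opf(opf_a, opf_b):
    """Combine two root OPFs: union the child sets, multiply the probabilities."""
    return {a | b: pa * pb
            for a, pa in opf_a.items()
            for b, pb in opf_b.items()}


p_I4 = {frozenset({"truck1"}): 0.2, frozenset({"truck2"}): 0.8}
p_I5 = {frozenset({"tank1"}): 0.1, frozenset({"tank2"}): 0.9}

p_I6 = cross_product_opf(p_I4, p_I5)
for children, p in p_I6.items():
    print(sorted(children), round(p, 2))
```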

  31. Probabilistic Point Query • card(I7, convoy)=[1,2], w_I7({convoy1})=0.2, w_I7({convoy2})=0.5, w_I7({convoy1,convoy2})=0.3 • card(convoy1, tank)=[1,1], w_convoy1({tank1})=0.3, w_convoy1({tank2})=0.7 • card(convoy2, tank)=[1,1], w_convoy2({tank2})=0.4, w_convoy2({tank3})=0.6 • Probabilities of the eight instances in D(I7): 0.06, 0.14, 0.2, 0.3, 0.036, 0.054, 0.084, 0.126 • Instances satisfying the query: 0.14 + 0.2 + 0.036 + 0.084 + 0.126 = 0.586

  32. Probabilistic Point Query • card(I7, convoy)=[1,2], w_I7({convoy1})=0.2, w_I7({convoy2})=0.5, w_I7({convoy1,convoy2})=0.3 • card(convoy1, tank)=[1,1], w_convoy1({tank1})=0.3, w_convoy1({tank2})=0.7 • card(convoy2, tank)=[1,1], w_convoy2({tank2})=0.4, w_convoy2({tank3})=0.6 • Without enumerating D(I7): 0.2*0.7 + 0.5*0.4 + 0.3*(1-(1-0.7)*(1-0.4))

  34. Probabilistic Point Query • card(I7, convoy)=[1,2], w_I7({convoy1})=0.2, w_I7({convoy2})=0.5, w_I7({convoy1,convoy2})=0.3 • card(convoy1, tank)=[1,1], w_convoy1({tank1})=0.3, w_convoy1({tank2})=0.7 • card(convoy2, tank)=[1,1], w_convoy2({tank2})=0.4, w_convoy2({tank3})=0.6 • Without enumerating D(I7): 0.2*0.7 + 0.5*0.4 + 0.3*(1-(1-0.7)*(1-0.4)) = 0.14 + 0.2 + 0.246 = 0.586
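The closed-form computation can be written as a small Python function. It assumes the query asks for the probability that tank2 occurs as a tank of some convoy under I7 (the query text is not in the transcript, but this reading matches the numbers) and that the convoys choose their tanks independently. For each possible set of convoys, the object is missed only if every convoy in the set misses it. The name `point_query` is illustrative.

```python
from math import isclose


def point_query(w_root, p_hit):
    """Probability that the target object appears below at least one child.

    w_root: child set of the root (tuple of convoy ids) -> probability
    p_hit:  convoy id -> probability that this convoy has the target as child
    """
    total = 0.0
    for convoys, p in w_root.items():
        miss = 1.0
        for c in convoys:
            miss *= 1.0 - p_hit[c]   # all convoys in the set miss the target
        total += p * (1.0 - miss)
    return total


w_I7 = {
    ("convoy1",): 0.2,
    ("convoy2",): 0.5,
    ("convoy1", "convoy2"): 0.3,
}
p_tank2 = {"convoy1": 0.7, "convoy2": 0.4}

answer = point_query(w_I7, p_tank2)
print(round(answer, 3))
```

Compared with summing over all eight instances, this needs one pass over the OPF entries of the root.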

  35. Related Work • An interval-probability version of this work appears in a separate paper at ICDT 2003: • Semantics • Interpretations • Satisfaction • Consistency • Query and r-answer (objects satisfying the query with minimal probability no less than r)

  36. Related Work • Semistructured Probabilistic Objects (SPOs) (Dekhtyar, Goldsmith, Hawkes, in SSDBM, 2001) • SPO: express contexts (not random variables) in a semistructured manner • PXML data model stores XML data AND probabilistic information.

  37. Related Work • ProTDB (Nierman, Jagadish, in VLDB, 2002) • Independent probabilities assigned to each child vs. arbitrary distributions over sets of children • Tree-structured vs. arbitrary acyclic • Our model theory provides two formal semantics • We propose a set of algebraic operators and a probabilistic point query

  38. Questions and Answers Thank you very much!

  39. Future Work • System implementation • Query optimization

  40. Summary • PIXML data model • Semistructured instance • Weak instance (add cardinality) • Probabilistic instance (add ipf) • Semantics • Local and Global • Interpretation • Satisfaction

  41. Related Work • Semistructured Probabilistic Objects (SPOs) (Dekhtyar, Goldsmith, Hawkes, in SSDBM, 2001) • SPO: express contexts (not random variables) in a semistructured manner • PIXML data model stores XML data AND probabilistic information.

  42. Related Work • ProTDB (Nierman, Jagadish, in VLDB, 2002) • Point probabilities VS interval probabilities • Independent probabilities assigned to each child VS arbitrary distributions over sets of children • Tree-structured VS arbitrary acyclic • Our model theory provides two formal semantics • Differences in their queries and our algebra and query.

  44. Summary • PXML data model • Semistructured instance • Weak instance (add cardinality) • Probabilistic instance (add ipf) • Semantics • Local and Global • Interpretation • Satisfaction • Algebra • Projections, selection, cross product

  45. Algebra (Projection) • Equivalence Equivalent

  46. Algebra (Projection) • Equivalence Equivalent • e1 and e2 are each a sequence of zero or more edges. Thus, I.e1.lm can include I.lm, I.l1.lm, I.l2.l3.lm, etc.

  47. In general non-equivalent

  48. Algebra (Cross product) • Equivalence • (I1 x I2) x I3, I1 x (I2 x I3) and (I1 x I3) x I2 are equivalent

  49. Related Work • Bayesian net (Pearl, 1988) • random variables (probability of events) • ours: existence of children requires existence of parents
