A new class of lineage expressions over probabilistic databases computable in PTIME. SUM 2013 Batya Kenig Avigdor Gal Ofer Strichman. Probabilistic Databases for managing uncertain data.
A new class of lineage expressions over probabilistic databases computable in PTIME. SUM 2013. Batya Kenig, Avigdor Gal, Ofer Strichman
Probabilistic Databases for managing uncertain data • A variety of data sources generate incomplete, noisy and uncertain data (sensor networks, information extraction, data integration, …) • Probabilistic databases enable storing and querying such data • Much research in recent years: MayBMS [Cornell], Trio [Stanford], SPROUT [Oxford], PrDB [U.Md]
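Tuple-Independent Probabilistic Databases • Each possible world $W$ is a standard database instance with probability $P(W) = \prod_{t \in W} p_t \prod_{t \notin W} (1 - p_t)$, where $p_t$ is the probability of tuple $t$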
Query Semantics • Let $q$ be a Boolean query evaluated against a probabilistic DB $D$ • Let $\mathcal{W}_q$ be the possible worlds in which $q$ returns true • Sum the probabilities of the instances in $\mathcal{W}_q$: $P(q) = \sum_{W \in \mathcal{W}_q} P(W)$ • Goal: efficiently evaluate $P(q)$ • In time polynomial in $|D|$ • Not always possible; in general #P-hard
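The possible-worlds semantics above can be sketched as a brute-force computation. This is only an illustration, assuming a Boolean query given as a Python predicate over a world; the function name `query_probability` and the toy relations `R` and `S` are hypothetical, not taken from the slides. It enumerates all $2^n$ worlds, which is exactly the exponential cost the talk aims to avoid.

```python
from itertools import product

def query_probability(tuple_probs, query):
    """Sum the probabilities of all possible worlds in which `query` holds."""
    tuples = list(tuple_probs)
    total = 0.0
    for present in product([False, True], repeat=len(tuples)):
        world = {t for t, keep in zip(tuples, present) if keep}
        # Tuple independence: multiply p for present tuples, (1 - p) for absent ones.
        weight = 1.0
        for t, keep in zip(tuples, present):
            weight *= tuple_probs[t] if keep else 1.0 - tuple_probs[t]
        if query(world):
            total += weight
    return total

# Hypothetical instance with two uncertain tuples, R(a) and S(a, b).
probs = {("R", "a"): 0.5, ("S", "a", "b"): 0.4}
# Boolean query: R(a) and S(a, b) are both present.
q = lambda world: ("R", "a") in world and ("S", "a", "b") in world
prob = query_probability(probs, q)
```

Here the query holds only in the single world containing both tuples, so `prob` is the product of the two tuple probabilities.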
Probabilistic Inference for queries • Dalvi&Suciu04: Conjunctive queries (without self joins) are either: • Safe queries: Have query plans that run in PTIME on all DB instances • Unsafe queries: Data complexity is #P-hard • However, even for unsafe queries there are DB instances that enable efficient computation
Why lineage? • Each tuple is associated with a binary random variable • The lineage of a query answer is a Boolean formula over these variables • Compute the probability of this formula
Why lineage? Efficient computation [Roy2011, Sen2010] • Safe plans for safe queries produce formulas in read-once form • An expression in read-once form admits linear-time probability computation [Olteanu&Huang2008]
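As a hedged illustration of why read-once form helps: when every variable occurs exactly once, the operands of each AND and OR are independent, so the probability can be computed in a single bottom-up pass. The nested-tuple encoding and the name `read_once_probability` below are assumptions made for this sketch, not the representation used in the cited work.

```python
def read_once_probability(expr, probs):
    """expr is a variable name or a tuple ("and" | "or", child, child, ...).

    Assumes each variable occurs at most once in expr (read-once form),
    so sibling subexpressions are independent events.
    """
    if isinstance(expr, str):
        return probs[expr]
    op, *children = expr
    child_probs = [read_once_probability(c, probs) for c in children]
    result = 1.0
    if op == "and":
        # P(A and B) = P(A) * P(B) for independent A, B.
        for p in child_probs:
            result *= p
        return result
    if op == "or":
        # P(A or B) = 1 - (1 - P(A)) * (1 - P(B)) for independent A, B.
        for p in child_probs:
            result *= 1.0 - p
        return 1.0 - result
    raise ValueError(f"unknown operator: {op}")

# x1 AND (y1 OR y2): every variable occurs exactly once.
formula = ("and", "x1", ("or", "y1", "y2"))
p = read_once_probability(formula, {"x1": 0.5, "y1": 0.4, "y2": 0.5})
```

Each node of the expression tree is visited once, which is where the linear running time comes from.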
Unsafe query • Lineage is not read-once • Solutions: • Jha&Suciu11: Compile to decision diagram • Jha&Suciu12: Exponential in pathwidth, double exponential in expression pathwidth • We will show how to compute the probability of disjoint branch lineage expressions in PTIME
Lineage as a hypergraph (figure panels: Primal Graph, Hyperedges) • In general, expanding a formula to DNF can lead to an exponential blowup. • For SPJ queries without self joins, the primal graph can be generated directly from the formula [Roy 2011].
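A minimal sketch of the hypergraph view, assuming the lineage is already given in DNF as a list of clauses (each clause a set of variable names). Each clause acts as a hyperedge, and the primal graph connects every pair of variables that share a clause. This goes through the DNF and is not the direct construction of [Roy 2011]; the name `primal_graph` is hypothetical.

```python
from itertools import combinations

def primal_graph(clauses):
    """Return (vertices, edges) of the primal graph of a DNF given as clauses."""
    vertices = set()
    edges = set()
    for clause in clauses:
        vertices |= set(clause)
        # Every pair of variables in the same clause (hyperedge) is adjacent.
        for u, v in combinations(sorted(clause), 2):
            edges.add((u, v))
    return vertices, edges

# Hypothetical lineage: (x1 AND y1 AND z1) OR (x1 AND y2)
dnf = [{"x1", "y1", "z1"}, {"x1", "y2"}]
vertices, edges = primal_graph(dnf)
```

Edges are stored as sorted pairs so that each undirected edge appears once.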
Junction trees for lineages • Let $H = (V, E)$ be a hypergraph. • $H$ is acyclic iff it has a junction tree $T = (E, A)$ [Beeri et al 1981] • The junction tree property: for every $v \in V$, the set of nodes in the tree that contain $v$ induces a (connected) subtree.
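The junction tree property can be checked directly from its definition. The sketch below assumes the tree is supplied as a map from node to its variable set plus a list of undirected tree edges, and that the edges really do form a tree; the function name is hypothetical.

```python
from collections import defaultdict

def has_junction_tree_property(bags, tree_edges):
    """bags: dict node -> set of variables; tree_edges: list of (node, node)."""
    adjacency = defaultdict(set)
    for a, b in tree_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    variables = set().union(*bags.values())
    for v in variables:
        holders = {n for n, bag in bags.items() if v in bag}
        # Walk from one holder, staying inside the holders; all must be reached.
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in adjacency[node] & holders:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen != holders:
            return False
    return True

# Chain 1 - 2 - 3: each variable's nodes are contiguous.
good = has_junction_tree_property(
    {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}, [(1, 2), (2, 3)]
)
# Variable "a" appears in nodes 1 and 3 but not in node 2 between them.
bad = has_junction_tree_property(
    {1: {"a", "b"}, 2: {"c"}, 3: {"a", "d"}}, [(1, 2), (2, 3)]
)
```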
Background: junction trees for probabilistic inference (figure labels: node, separator) • Each node and separator stores a joint pdf over its variables • Send messages towards a given root node • Messages are passed by multiplication of factor entries • Once the root node has received messages from all of its neighbors, its factor holds the marginal of the joint probability distribution of the entire variable set.
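A small worked instance of one message-passing step, under stated assumptions: a hypothetical chain a → b → c of binary variables, a junction tree with a leaf node over {a, b}, a root node over {b, c}, and separator {b}. The numbers are made up for illustration.

```python
from itertools import product

# Hypothetical chain a -> b -> c over binary variables.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # key: (a, b)
p_c_given_b = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.5}  # key: (b, c)

# Leaf factor over {a, b}; root factor over {b, c}; the separator is {b}.
leaf = {(a, b): p_a[a] * p_b_given_a[(a, b)] for a, b in product((0, 1), repeat=2)}
root = dict(p_c_given_b)

# Message: sum a out of the leaf factor, leaving a table over the separator b.
message = {b: sum(leaf[(a, b)] for a in (0, 1)) for b in (0, 1)}

# The root absorbs the message by multiplying matching entries.
root = {(b, c): root[(b, c)] * message[b] for b, c in product((0, 1), repeat=2)}

# After absorbing all messages, the root factor is the marginal over (b, c).
p_c1 = root[(0, 1)] + root[(1, 1)]
```

The factor tables have one entry per assignment of their variables, which is why the cost of this scheme grows exponentially with the number of variables in a node.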
(Naïve) Junction Tree Algorithm for lineage computation • PROBLEM: The JT algorithm runs in time exponential in the number of variables in the largest factor. • Restricted to lineage expressions that can be efficiently represented using a junction tree (i.e., low treewidth)
Take advantage of Junction Tree structure • Rooted Directed Path Graphs [Gavril1975]: A graph $G$ is a rooted directed path graph (RDPG) iff there exists a rooted directed junction tree $T$ such that for every vertex $v$, the set of nodes that contain $v$ forms a directed path of $T$
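The directed-path condition in this definition can be tested on a given rooted tree. The sketch assumes the tree is supplied as a parent map (the root maps to `None`) together with each node's variable set; it checks one candidate tree only and does not search for a tree that witnesses the RDPG property. The function name is hypothetical.

```python
def forms_directed_paths(bags, parent):
    """bags: node -> set of variables; parent: node -> parent node (root -> None)."""
    variables = set().union(*bags.values())
    for v in variables:
        holders = {n for n, bag in bags.items() if v in bag}
        # A directed path has exactly one topmost holder (its parent is not a holder).
        tops = [n for n in holders if parent[n] not in holders]
        if len(tops) != 1:
            return False
        # A directed path never branches: no holder has two holder children.
        holder_children = {}
        for n in holders:
            p = parent[n]
            if p in holders:
                holder_children[p] = holder_children.get(p, 0) + 1
        if any(count > 1 for count in holder_children.values()):
            return False
    return True

# Root 1 with children 2 and 3.
parent = {1: None, 2: 1, 3: 1}
# "a" lives on the path 1 -> 2 and "b" on the path 1 -> 3.
ok = forms_directed_paths({1: {"a", "b"}, 2: {"a", "c"}, 3: {"b", "d"}}, parent)
# "a" occurs in both children of node 1, so its nodes branch.
branching = forms_directed_paths({1: {"a"}, 2: {"a"}, 3: {"a"}}, parent)
```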
Use compact factors • We would ultimately like to calculate the entry probabilities. • Their sum is exactly the probability of the lineage expression.
The Algorithm • This can be done due to the disjoint branch property.
Projection/Marginalization • Sending a message involves summing out variables in the factor • After a naive projection the entries are no longer mutually exclusive, which disables subsequent projections.
Projection/Marginalization • Solution: Perform marginalization by repeatedly projecting out only the last (rightmost) variable. • Requires ordering the message variables before those to be summed out. • Due to the junction tree property this is always possible.
Complexity Analysis • Let be the size of the largest factor • Each node can have at most children • Therefore, each entry in the factor is updated at most times. • Overall
Conclusions • Define disjoint branch lineage expressions • Provide an algorithm for computing the probability of disjoint branch lineage expressions in PTIME
Future Work • Are there other structural properties of junction trees that can facilitate efficient probabilistic inference? • Real data is correlated: drop the tuple-independence assumption • Characterize queries and DB instances that induce lineage with “efficient” junction trees.