Probabilistic RDF
Octavian Udrea (1), V.S. Subrahmanian (1), Zoran Majkić (2)
(1) University of Maryland, College Park; (2) University “La Sapienza”, Rome, Italy
Motivation
• Not all information on the Web is easily expressible in “classic” models (e.g., relational)
• RDF extraction from text
• STORY is the first, very successful prototype
• Need to extend RDF with temporal and uncertainty components
• Goal: build a logical model of RDF with uncertainty and provide query algorithms
The Probabilistic RDF idea
• An RDF theory is a set of triples (subject, property, value)
  • (USA hasCapital Washington DC)
  • (Washington DC hasPopulation 500,000)
• Probabilistic RDF extends this model with uncertainty over the set of values
  • (USA hasCapital {(Washington DC, 0.95), (State of Washington, 0.05)})
Probabilistic RDF example
• Extracted based on www.wrongdiagnosis.com
Probabilistic RDF syntax
• Schema uncertainty:
  • (c subClassOf (C, δ))
  • Σ_{d ∈ C} δ(d) ≤ 1
• Class-instance uncertainty:
  • (x rdf:type (C, δ))
  • Σ_{d ∈ C} δ(d) ≤ 1
• Instance-based uncertainty:
  • (x p (Y, δ))
  • Σ_{y ∈ Y} δ(y) ≤ 1
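A minimal sketch of the constraint above, assuming a pRDF triple is stored as a subject, a property, and a dictionary from candidate values to δ. The class name PTriple, its fields, and the "Seattle" entry are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PTriple:
    """A pRDF triple (subject, property, (V, delta)); names are illustrative."""
    subject: str
    prop: str
    delta: Dict[str, float]  # maps each candidate value v in V to delta(v)

    def is_well_formed(self, eps: float = 1e-9) -> bool:
        # Each delta(v) must lie in [0, 1] and the total mass must not exceed 1.
        in_range = all(0.0 <= p <= 1.0 for p in self.delta.values())
        return in_range and sum(self.delta.values()) <= 1.0 + eps


# The hasCapital example from the slides.
capital = PTriple("USA", "hasCapital",
                  {"Washington DC": 0.95, "State of Washington": 0.05})

# Hypothetical ill-formed triple: the probabilities add up to more than 1.
bad = PTriple("USA", "hasCapital", {"Washington DC": 0.95, "Seattle": 0.2})
```

The same check would apply unchanged to subClassOf and rdf:type triples, since all three forms share the (set, δ) shape.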
Probabilistic RDF syntax
• Sanity requirements
  • (c subClassOf (C1, δ1)) and (c subClassOf (C2, δ2)) ⇒ (C1 = C2 and δ1 = δ2) or C1 ∩ C2 = ∅
  • The same applies to the other types of uncertainty
• Transitive properties
  • Simple inferential capability
  • Examples: associatedWith, controlledBy
• P-path:
  • A set of triples connected by transitive properties
P-path semantics and t-norms
• We cannot generally assume independence between triples on a transitive path
  • Flu, AcuteBronchitis, Pneumonia
• T-norms are used to express the user’s knowledge of the relationship between triples
  • ⊗ is associative and commutative
  • 0 ⊗ x = 0, 1 ⊗ x = x
  • x ≤ y, z ≤ w ⇒ x ⊗ z ≤ y ⊗ w
• P-path probability: the t-norm applied to the individual probabilities on the path
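As an illustration of the p-path probability, the sketch below folds a t-norm over the probabilities on a path. The three t-norms shown (product, minimum, Łukasiewicz) are standard textbook choices; the slides name only the product t-norm, and the function names here are assumptions.

```python
from functools import reduce
from typing import Callable, Iterable

TNorm = Callable[[float, float], float]


def product(x: float, y: float) -> float:
    # Product t-norm: appropriate when the triples are independent.
    return x * y


def minimum(x: float, y: float) -> float:
    # Minimum (Goedel) t-norm: the largest possible t-norm.
    return min(x, y)


def lukasiewicz(x: float, y: float) -> float:
    # Lukasiewicz t-norm: a pessimistic lower bound on the conjunction.
    return max(0.0, x + y - 1.0)


def path_probability(probs: Iterable[float], tnorm: TNorm) -> float:
    # 1 is the identity of every t-norm, so it is a safe starting value.
    return reduce(tnorm, probs, 1.0)


# Flu -> Acute Bronchitis (0.7) -> Pneumonia (0.65), as in the slides.
p_product = path_probability([0.7, 0.65], product)
p_minimum = path_probability([0.7, 0.65], minimum)
p_lukasiewicz = path_probability([0.7, 0.65], lukasiewicz)
```

Under the product t-norm the path evaluates to 0.7 × 0.65 = 0.455; the minimum t-norm gives 0.65 and the Łukasiewicz t-norm gives 0.35 for the same path.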
Example p-path
• (Flu, associatedWith, (Pneumonia, 0.455)) w.r.t. the product t-norm
pRDF semantics
• A world W is a set of simple triples (with no probabilities)
• An interpretation I associates a probability with each world
• I satisfies a pRDF theory:
  • For each (s, p, (V, δ)) and each v ∈ V, δ(v) ≤ Σ I(W), summed over the worlds W that contain (s, p, v)
  • The same applies to paths w.r.t. a given t-norm
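The satisfaction condition can be checked mechanically for a small, explicitly enumerated interpretation. The sketch below assumes worlds are frozensets of plain triples and an interpretation is a dictionary from worlds to probabilities; the function name satisfies and the two example interpretations are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]
World = FrozenSet[Triple]


def satisfies(interp: Dict[World, float], s: str, p: str,
              delta: Dict[str, float], eps: float = 1e-9) -> bool:
    """Check delta(v) <= sum of I(W) over the worlds W containing (s, p, v)."""
    for v, lower_bound in delta.items():
        mass = sum(pr for world, pr in interp.items() if (s, p, v) in world)
        if mass + eps < lower_bound:
            return False
    return True


w1 = frozenset({("USA", "hasCapital", "Washington DC")})
w2 = frozenset({("USA", "hasCapital", "State of Washington")})
delta = {"Washington DC": 0.95, "State of Washington": 0.05}

good_interp = {w1: 0.95, w2: 0.05}  # meets both lower bounds
poor_interp = {w1: 0.5, w2: 0.5}    # gives Washington DC too little mass

ok = satisfies(good_interp, "USA", "hasCapital", delta)
not_ok = satisfies(poor_interp, "USA", "hasCapital", delta)
```

Note that δ acts as a lower bound here, so an interpretation may give a value more probability than the theory states and still satisfy it.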
pRDF semantics
• A theory is consistent iff it has a satisfying interpretation
  • Every pRDF theory is consistent
• Entailment: T entails T’ iff every satisfying interpretation of T satisfies T’
• Closure of a theory: the entire set of triples entailed by the theory
  • Maximal w.r.t. the probability values
pRDF fixpoint semantics
• The closure operator Δ adds exactly one entailed triple at each step
  • (Flu associatedWith (Acute Bronchitis, 0.7)) and (Acute Bronchitis associatedWith (Pneumonia, 0.65)) yield (Flu associatedWith (Pneumonia, 0.455)) w.r.t. the product t-norm
• Δ has a fixpoint, which is the theory closure
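A naive sketch of a fixpoint computation in this spirit, not the authors' Δ operator: it assumes single-valued triples stored as a dictionary from (subject, property, value) to probability, processes whole rounds rather than one triple per step, and keeps the maximal probability found for each derived triple. The function name closure is hypothetical.

```python
from typing import Callable, Dict, Set, Tuple

Key = Tuple[str, str, str]  # (subject, property, value)


def closure(theory: Dict[Key, float], transitive: Set[str],
            tnorm: Callable[[float, float], float]) -> Dict[Key, float]:
    """Chain triples over transitive properties until nothing improves."""
    closed = dict(theory)
    changed = True
    while changed:
        changed = False
        for (a, p, b), pr1 in list(closed.items()):
            if p not in transitive:
                continue
            for (b2, p2, c), pr2 in list(closed.items()):
                if p2 != p or b2 != b:
                    continue
                derived = tnorm(pr1, pr2)
                # Keep only the maximal probability for each derived triple.
                if derived > closed.get((a, p, c), 0.0) + 1e-12:
                    closed[(a, p, c)] = derived
                    changed = True
    return closed


theory = {
    ("Flu", "associatedWith", "Acute Bronchitis"): 0.7,
    ("Acute Bronchitis", "associatedWith", "Pneumonia"): 0.65,
}
closed = closure(theory, {"associatedWith"}, lambda x, y: x * y)
```

On the slide's example the loop adds the single triple (Flu associatedWith Pneumonia) with probability 0.455 and then stops, because a further round derives nothing new.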
pRDF query processing
• We will consider only simple queries: a triple with a variable term
  • Example: (? associatedWith Pneumonia 0.4)
  • What is associated with Pneumonia with probability above 0.4?
• Simple method:
  • Compute the closure
  • Select any triple in the closure that matches the query
  • VERY expensive computationally
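The simple method amounts to a scan over a precomputed closure. The sketch below assumes the closure is a dictionary from (subject, property, value) to probability and writes out by hand the closure of the Flu example under the product t-norm; match_subjects is a hypothetical name.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (subject, property, value)


def match_subjects(closed: Dict[Key, float], prop: str, value: str,
                   threshold: float) -> List[str]:
    """Answer (? prop value threshold) by scanning a precomputed closure."""
    return sorted(s for (s, p, v), pr in closed.items()
                  if p == prop and v == value and pr > threshold)


# Closure of the Flu example under the product t-norm, written out by hand.
closed = {
    ("Flu", "associatedWith", "Acute Bronchitis"): 0.7,
    ("Acute Bronchitis", "associatedWith", "Pneumonia"): 0.65,
    ("Flu", "associatedWith", "Pneumonia"): 0.455,
}
answers = match_subjects(closed, "associatedWith", "Pneumonia", 0.4)
```

The scan itself is cheap; the cost the slide warns about lies in materializing the full closure before any query is asked.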
pRDF query processing
• Set of algorithms for answering simple queries and conjunctions:
  • pRDF_Subject, pRDF_Property, …, pRDF_conjunction
• Central idea:
  • Apply Δ only in those directions that yield triples relevant to the query
  • Cut off path computations when the threshold can no longer be reached
  • min(current_probability, threshold)
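The central idea can be sketched as a backward search from the queried value that abandons a path once its probability drops to the threshold or below. This is an illustrative sketch, not the pRDF_Subject algorithm itself: it assumes a transitive property, single-valued triples, and relies on the fact that a t-norm never exceeds either argument, so a path that has fallen below the threshold cannot recover. The name subjects_above and the "Cold" triple are hypothetical.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, str]  # (subject, property, value)


def subjects_above(theory: Dict[Key, float], prop: str, target: str,
                   threshold: float,
                   tnorm: Callable[[float, float], float]) -> Dict[str, float]:
    """Answer (? prop target threshold) without computing the full closure."""
    best: Dict[str, float] = {}
    stack = [(target, 1.0)]  # (node, probability of the path from node to target)
    while stack:
        node, prob = stack.pop()
        for (s, p, v), pr in theory.items():
            if p != prop or v != node:
                continue
            new_prob = tnorm(pr, prob)
            # Cut off: extending the path can only lower the probability.
            if new_prob <= threshold or new_prob <= best.get(s, 0.0):
                continue
            best[s] = new_prob
            stack.append((s, new_prob))
    return best


theory = {
    ("Flu", "associatedWith", "Acute Bronchitis"): 0.7,
    ("Acute Bronchitis", "associatedWith", "Pneumonia"): 0.65,
    ("Cold", "associatedWith", "Flu"): 0.5,  # hypothetical extra triple
}
result = subjects_above(theory, "associatedWith", "Pneumonia", 0.4,
                        lambda x, y: x * y)
```

Here the path through Cold evaluates to 0.5 × 0.455 = 0.2275, which is below 0.4, so the search stops there and never explores anything upstream of Cold.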
Experimental results
• Implementation
  • Java, 1700 LOC
  • Disk-based storage for pRDF theories
• Synthetically generated datasets
  • According to varying underlying distributions
• Datasets extracted from Web sources
Experimental questions
• Does the underlying distribution affect query running time?
• From a practical point of view, which are the “fastest” types of queries?
• How does running time vary with the number of atoms in a conjunction?
• What other theory-dependent factors affect running time?
  • Theory width
  • Number of properties
Take-away points
• RDF syntax with uncertainty
• Model-theoretic and fixpoint semantics for pRDF
• Efficient query algorithms for pRDF
The end
• http://om.umiacs.umd.edu/
• Thank you! Questions & comments