Probabilistic RDF

Probabilistic RDF • Octavian Udrea (1), V.S. Subrahmanian (1), Zoran Majkić (2) • (1) University of Maryland, College Park • (2) University “La Sapienza”, Rome, Italy

Motivation • Not all information on the Web is easily expressible in “classic” models (e.g., relational) • RDF extraction from text • STORY is the first, very successful prototype • Need to extend RDF with temporal and uncertainty components • Goal: build a logical model of RDF with uncertainty and provide query algorithms

The Probabilistic RDF idea • An RDF theory is a set of triples (subject, property, value) • (USA hasCapital Washington DC) • (Washington DC hasPopulation 500,000) • Probabilistic RDF extends this model with uncertainty over the set of values • (USA hasCapital {(Washington DC, 0.95), (State of Washington, 0.05)})
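As an illustration of the idea above, a probabilistic triple can be held as a subject, a property, and a map from candidate values to probabilities. This is a minimal sketch, not the authors' data model; the names `ProbTriple` and `most_likely` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ProbTriple:
    """Hypothetical container for (subject, property, {value: probability})."""
    subject: str
    prop: str
    values: Dict[str, float]


def most_likely(triple: ProbTriple) -> str:
    """Return the candidate value with the highest probability."""
    return max(triple.values, key=triple.values.get)


# The example from the slide: the capital of the USA with uncertainty.
capital = ProbTriple(
    "USA",
    "hasCapital",
    {"Washington DC": 0.95, "State of Washington": 0.05},
)
```

Here `most_likely(capital)` picks "Washington DC", the value carrying probability 0.95.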

Probabilistic RDF example • Extracted based on www.wrongdiagnosis.com

Probabilistic RDF example

Probabilistic RDF syntax • Schema uncertainty: • (c subClassOf (C, δ)) • Σ_{d ∈ C} δ(d) ≤ 1 • Class-instance uncertainty: • (x rdf:type (C, δ)) • Σ_{d ∈ C} δ(d) ≤ 1 • Instance-based uncertainty: • (x p (Y, δ)) • Σ_{y ∈ Y} δ(y) ≤ 1
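All three forms of uncertainty share the same constraint: the probabilities that δ assigns over the value set must sum to at most 1. A minimal sketch of that check follows, assuming δ is given as a Python dict; the function name `is_valid_distribution` and the tolerance parameter are assumptions made for illustration.

```python
def is_valid_distribution(delta, tol=1e-9):
    """Check that each probability lies in [0, 1] and the total is at most 1.

    `delta` maps each element of the value set to its probability.
    `tol` absorbs floating-point rounding and is an assumption of this sketch.
    """
    probs = list(delta.values())
    in_range = all(0.0 <= p <= 1.0 for p in probs)
    return in_range and sum(probs) <= 1.0 + tol


ok = is_valid_distribution({"Washington DC": 0.95, "State of Washington": 0.05})
bad = is_valid_distribution({"Pneumonia": 0.7, "AcuteBronchitis": 0.6})
```

The sum is allowed to fall below 1, which leaves room for values that are not listed.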

Probabilistic RDF syntax • Sanity requirements • (c subClassOf (C1, δ1)), (c subClassOf (C2, δ2)) => (C1 = C2 and δ1 = δ2) or C1 ∩ C2 = ∅ • The same applies to the other types of uncertainty • Transitive properties • Simple inferential capability • Examples: associatedWith, controlledBy • P-path: • A set of triples connected by transitive properties
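The sanity requirement above can be read as a pairwise test on two triples that share a subject and property: either they carry the same value set and the same δ, or their value sets do not overlap. A small sketch under that reading, with δ as a dict whose keys form the value set; `compatible` is a hypothetical name.

```python
def compatible(delta1, delta2):
    """Return True if two distributions for the same subject and property
    are either identical or defined over disjoint value sets."""
    same = delta1 == delta2
    disjoint = not (set(delta1) & set(delta2))
    return same or disjoint


d1 = {"Disease": 0.8, "Symptom": 0.1}
d2 = {"Disease": 0.5}   # overlaps d1 on "Disease" but differs: not allowed
d3 = {"Treatment": 0.4}  # disjoint from d1: allowed
```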

Example p-path

P-path semantics and t-norms • We cannot generally assume independence between triples on a transitive path • Flu, AcuteBronchitis, Pneumonia • T-norms are used to express the user’s knowledge of the relationship between triples • ⊗ is associative, commutative • 0 ⊗ x = 0, 1 ⊗ x = x • x <= y, z <= w => x ⊗ z <= y ⊗ w • P-path probability: t-norm applied to individual probabilities on the path

Example p-path (Flu, associatedWith, (Pneumonia, 0.455)) w.r.t. the product t-norm
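The p-path probability can be sketched by folding a t-norm over the probabilities along the path. The three t-norms below (product, minimum, Łukasiewicz) are standard textbook examples chosen for illustration; the slides only name the product t-norm, and the function names are hypothetical.

```python
from functools import reduce


def t_product(x, y):
    """Product t-norm: appropriate when the triples are treated as independent."""
    return x * y


def t_min(x, y):
    """Minimum (Goedel) t-norm."""
    return min(x, y)


def t_lukasiewicz(x, y):
    """Lukasiewicz t-norm."""
    return max(0.0, x + y - 1.0)


def path_probability(probs, tnorm):
    """Fold the t-norm over the path; 1 is the identity element of any t-norm."""
    return reduce(tnorm, probs, 1.0)


# Flu -> AcuteBronchitis (0.7), AcuteBronchitis -> Pneumonia (0.65)
path = [0.7, 0.65]
p_product = path_probability(path, t_product)      # approximately 0.455
p_min = path_probability(path, t_min)              # 0.65
p_luka = path_probability(path, t_lukasiewicz)     # approximately 0.35
```

The choice of t-norm changes the result noticeably on the same path, which is why it is left to the user's knowledge of how the triples relate.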

pRDF semantics • A world W is a set of simple triples (with no probabilities) • An interpretation I associates a probability to each world • I satisfies a pRDF theory: • For each (s, p, (V, δ)), δ(v) <= Σ I(W), where the sum ranges over the worlds W that contain (s, p, v) • The same applies to paths w.r.t. a given t-norm
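The satisfaction condition for plain triples can be checked directly on a small example. In this sketch a world is a `frozenset` of simple triples and an interpretation is a dict from worlds to probabilities; the path condition is left out, and `satisfies` is a hypothetical name.

```python
def satisfies(interp, theory, tol=1e-12):
    """Check delta(v) <= sum of I(W) over the worlds W containing (s, p, v).

    `theory` is a list of (subject, property, {value: probability}) entries.
    `tol` guards against floating-point rounding (an assumption of this sketch).
    """
    for s, p, delta in theory:
        for v, prob in delta.items():
            mass = sum(pw for world, pw in interp.items() if (s, p, v) in world)
            if prob > mass + tol:
                return False
    return True


theory = [("USA", "hasCapital",
           {"Washington DC": 0.95, "State of Washington": 0.05})]

w_dc = frozenset({("USA", "hasCapital", "Washington DC")})
w_state = frozenset({("USA", "hasCapital", "State of Washington")})

good_interp = {w_dc: 0.95, w_state: 0.05}
bad_interp = {w_dc: 0.5, w_state: 0.5}  # only 0.5 of mass supports Washington DC
```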

pRDF semantics • A theory is consistent iff it has a satisfying interpretation • Every pRDF theory is consistent • Entailment: T entails T’ iff every satisfying interpretation of T satisfies T’ • Closure of a theory: The entire set of triples entailed by the theory • Maximal w.r.t. the probability values

pRDF fixpoint semantics • The closure operator Δ adds exactly one entailed triple at each step • (Flu associatedWith (AcuteBronchitis, .7)) and (AcuteBronchitis associatedWith (Pneumonia, .65)) yields (Flu associatedWith (Pneumonia, 0.455)) w.r.t. the product t-norm • Δ has a fixpoint, which is the theory closure
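A rough sketch of the fixpoint computation for transitive properties follows. It is not the authors' operator Δ: it applies every available derivation in each pass instead of exactly one, keeps the maximum probability found for each derived triple, and skips derived self-loops. The flat dict representation and the name `closure` are assumptions.

```python
def closure(triples, transitive, tnorm=lambda x, y: x * y):
    """Iterate until no derived triple can be added or improved.

    `triples` maps (subject, property, value) to a probability.
    `transitive` is the set of property names treated as transitive.
    """
    closed = dict(triples)
    changed = True
    while changed:
        changed = False
        for (a, p, b), x in list(closed.items()):
            if p not in transitive:
                continue
            for (b2, p2, c), y in list(closed.items()):
                # Chain (a p b) with (b p c); skip derived self-loops (a p a).
                if p2 != p or b2 != b or c == a:
                    continue
                prob = tnorm(x, y)
                if prob > closed.get((a, p, c), 0.0):
                    closed[(a, p, c)] = prob  # keep the maximal probability
                    changed = True
    return closed


theory = {
    ("Flu", "associatedWith", "AcuteBronchitis"): 0.7,
    ("AcuteBronchitis", "associatedWith", "Pneumonia"): 0.65,
}
closed = closure(theory, {"associatedWith"})
```

On this input the only new triple is (Flu associatedWith Pneumonia) with probability close to 0.455, matching the slide.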

pRDF query processing • We will consider only simple queries: a triple with a variable term • Example: (? associatedWith Pneumonia .4) • What is associated with Pneumonia with probability above .4? • Simple method: • Compute the closure • Select any triple in the closure that matches the query • VERY expensive computationally
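The selection step of the simple method can be sketched as a filter over an already computed closure. The closure is written out by hand here, `None` stands for the variable term, and `match` is a hypothetical name.

```python
def match(closure, pattern, threshold):
    """Return the triples matching `pattern` with probability above `threshold`.

    `pattern` is a (subject, property, value) tuple where None marks the variable.
    """
    return {
        triple: prob
        for triple, prob in closure.items()
        if prob > threshold
        and all(q is None or q == term for q, term in zip(pattern, triple))
    }


# Closure of the running example, written out by hand for this sketch.
closed = {
    ("Flu", "associatedWith", "AcuteBronchitis"): 0.7,
    ("AcuteBronchitis", "associatedWith", "Pneumonia"): 0.65,
    ("Flu", "associatedWith", "Pneumonia"): 0.455,
}

# (? associatedWith Pneumonia .4)
answers = match(closed, (None, "associatedWith", "Pneumonia"), 0.4)
subjects = sorted(s for (s, _, _) in answers)
```

The filter itself is cheap; the cost of the simple method lies in materializing the whole closure first.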

pRDF query processing • Set of algorithms for answering simple queries and conjunctions: • pRDF_Subject, pRDF_Property, …, pRDF_conjunction • Central idea: • Apply Δ in only those directions that yield tuples relevant to the query • Cut off path computations when the threshold can no longer be reached. • min(current_probability, threshold)
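The central idea can be sketched for a query of the form (? p target threshold): walk backward from the target along the property, and stop extending a path once its probability is no longer above the threshold. Because a t-norm never exceeds either argument, no extension of such a path can recover. This is an illustrative sketch only, not the pRDF_Subject algorithm; `subjects_above` and the dict representation are assumptions.

```python
from collections import defaultdict


def subjects_above(triples, prop, target, threshold, tnorm=lambda x, y: x * y):
    """Find subjects linked to `target` via `prop` with probability above
    `threshold`, without computing the full closure."""
    incoming = defaultdict(list)  # value -> [(subject, probability)]
    for (s, p, v), prob in triples.items():
        if p == prop:
            incoming[v].append((s, prob))

    best = {}
    stack = [(target, 1.0)]  # 1 is the t-norm identity
    while stack:
        node, path_prob = stack.pop()
        for s, prob in incoming[node]:
            cand = tnorm(prob, path_prob)
            if cand <= threshold or s == target:
                continue  # cut off: longer paths can only be less probable
            if cand > best.get(s, 0.0):
                best[s] = cand
                stack.append((s, cand))
    return best


theory = {
    ("Flu", "associatedWith", "AcuteBronchitis"): 0.7,
    ("AcuteBronchitis", "associatedWith", "Pneumonia"): 0.65,
    ("Cold", "associatedWith", "Flu"): 0.5,
}
result = subjects_above(theory, "associatedWith", "Pneumonia", 0.4)
```

With threshold 0.4 the search reaches AcuteBronchitis and Flu, then abandons the path through Cold because its probability (about 0.2275) falls below the threshold.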

Experimental results • Implementation • Java, 1700 LOC • Disk-based storage for pRDF theories • Synthetically generated datasets • According to varying underlying distributions • Datasets extracted from Web sources

Experimental questions • Does the underlying distribution affect query running time? • From a practical point of view, which are the “fastest” types of queries? • How does running time vary with the number of atoms in a conjunction? • What other theory-dependent factors affect running time? • Theory width • Number of properties

Query running time (Poisson)

Query running time (Zipf)

Conjunctive queries running time

Dependence on property width

Number of properties

Take away points • RDF syntax with uncertainty • Model-theory and fixpoint semantics for pRDF • Efficient query algorithms for pRDF

The end http://om.umiacs.umd.edu/ Thank you! Questions & comments
