Towards Benefit-Based RDF Source Selection for SPARQL Queries
Katja Hose, Ralf Schenkel
MPI Informatik, Saarland University
Context: Linked Data Cloud
• General knowledge bases: DBPedia, Freebase, YAGO
• Domain-specific knowledge: Biology, Geo, Government, Publications, Movies, Songs, …
• Linked Open Data as a large integrated knowledge base
• > 31 billion triples in the LOD cloud, 325 sources
• DBPedia: 3.6 million entities, 1.2 billion triples
SPARQL: Querying Semantic Data
SPARQL example (simplified – no prefixes etc.):
SELECT ?a WHERE { ?a dc:authorOf ?p . ?p dc:publishedAt sigmod:2012/SWIM . }
Source selection problem: Which of the 325 sources to query?
• For each triple pattern, select all sources that have answers
• 2 major solution branches:
  • Use ASK queries
  • Use source statistics
Existing Approaches: ASK, VoID
ASK:
• Ask each source whether results exist for the triple pattern
  ASK { ?p dc:publishedAt sigmod:2012/SWIM }
• Send the query to all relevant sources
VoID:
• Summary for each source
  :DBLP_L3S void:propertyPartition [ void:property dc:publishedAt; void:triples 400000 ];
• Less precise than ASK, but cheaper
More on Database Techniques for LOD: Tutorial by A. Harth, K. Hose, R. Schenkel on Thursday 10:30-12:00
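The difference between the two branches can be illustrated with a minimal sketch. It assumes in-memory sets of triples as stand-ins for remote endpoints; the source names (`src_a`, `src_b`, `src_c`), the triples, and all function names are hypothetical and not taken from the talk.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# A pattern position set to None plays the role of a variable.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def matches(triple: Triple, pattern: Pattern) -> bool:
    """True if every constant in the pattern equals the triple's term."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


def ask(source: Set[Triple], pattern: Pattern) -> bool:
    """Stand-in for sending ASK { pattern } to one endpoint."""
    return any(matches(t, pattern) for t in source)


def select_by_ask(sources: Dict[str, Set[Triple]], pattern: Pattern) -> List[str]:
    """One ASK request per source; keeps sources with at least one match."""
    return sorted(name for name, data in sources.items() if ask(data, pattern))


def void_stats(sources: Dict[str, Set[Triple]]) -> Dict[str, Counter]:
    """Per-source triple counts per property, like void:propertyPartition."""
    return {name: Counter(p for _, p, _ in data) for name, data in sources.items()}


def select_by_void(stats: Dict[str, Counter], pattern: Pattern) -> List[str]:
    """No requests at query time; only the predicate is checked."""
    predicate = pattern[1]
    return sorted(
        name for name, counts in stats.items()
        if predicate is None or counts[predicate] > 0
    )


# Hypothetical data for illustration only.
SOURCES: Dict[str, Set[Triple]] = {
    "src_a": {("ex:p1", "dc:publishedAt", "sigmod:2012/SWIM")},
    "src_b": {("ex:p2", "dc:publishedAt", "vldb:2011")},
    "src_c": {("ex:alice", "dc:authorOf", "ex:p1")},
}
PATTERN: Pattern = (None, "dc:publishedAt", "sigmod:2012/SWIM")

print(select_by_ask(SOURCES, PATTERN))
print(select_by_void(void_stats(SOURCES), PATTERN))
```

In this toy setup the statistics-based selection also keeps `src_b`, because it only knows that the property occurs there, which is the sense in which VoID is less precise than ASK.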
Focus of this Talk: Source Overlap
• Many sources contain the same facts
• Many duplicate results
• Many unnecessary requests
Obvious problem: overlapping sources
Example for Overlapping Sources
[Figure: result sets of Source 1, Source 2, and Source 3]
• 6 results overall
• 2 sources enough to retrieve all results
• Source 1 alone is "optimal" if
  • only one access is possible
  • or 5 results are enough
Our contribution: Determine the "optimal" set of sources without seeing the results
Problem Definition
Given a SPARQL query with triple patterns P and possible sources S, compute a query plan qp ⊆ P × S (which pattern is executed at which source) such that
• all results are retrieved with a minimal number of requests to sources (minimal exact plan)
• as many results as possible are retrieved with |qp| ≤ max (maximize recall)
• as few requests as possible are performed to retrieve at least r results (minimal approximate plan)
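The three objectives can be made concrete with a brute-force reference for a single triple pattern, where a plan is just a set of sources. This is only a sketch for checking small examples: unlike the approach of the talk, it looks at the actual results, and the result sets below are hypothetical data chosen to resemble the overlap example (6 results overall, Source 1 holding 5 of them).

```python
from itertools import combinations
from typing import Dict, Iterator, Optional, Set, Tuple

Plan = Tuple[str, ...]  # sources at which the triple pattern is evaluated


def covered(plan: Plan, sources: Dict[str, Set[str]]) -> Set[str]:
    """Distinct results obtained by querying every source in the plan."""
    result: Set[str] = set()
    for name in plan:
        result |= sources[name]
    return result


def plans_by_size(sources: Dict[str, Set[str]]) -> Iterator[Plan]:
    """All plans, smallest first, in a deterministic order."""
    names = sorted(sources)
    for size in range(len(names) + 1):
        yield from combinations(names, size)


def minimal_approximate_plan(sources: Dict[str, Set[str]], r: int) -> Optional[Plan]:
    """Fewest requests that retrieve at least r results (None if impossible)."""
    for plan in plans_by_size(sources):
        if len(covered(plan, sources)) >= r:
            return plan
    return None


def minimal_exact_plan(sources: Dict[str, Set[str]]) -> Optional[Plan]:
    """Fewest requests that retrieve all results."""
    total = len(covered(tuple(sources), sources))
    return minimal_approximate_plan(sources, total)


def max_recall_plan(sources: Dict[str, Set[str]], max_requests: int) -> Plan:
    """Most results with at most max_requests requests."""
    best: Plan = ()
    for plan in plans_by_size(sources):
        if len(plan) > max_requests:
            break  # plans are enumerated by increasing size
        if len(covered(plan, sources)) > len(covered(best, sources)):
            best = plan
    return best


# Hypothetical result sets for illustration only.
RESULTS = {
    "source1": {"r1", "r2", "r3", "r4", "r5"},
    "source2": {"r4", "r6"},
    "source3": {"r3", "r4", "r5"},
}

print(minimal_exact_plan(RESULTS))
print(max_recall_plan(RESULTS, max_requests=1))
print(minimal_approximate_plan(RESULTS, r=5))
```

The enumeration is exponential in the number of sources, so it is useful only as a yardstick for small inputs.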
High-Level Solution Overview
• Extend the ASK operation to return a concise yet expressive summary of the result bindings of each variable (instead of a boolean yes/no)
• Estimate source overlap with summaries
• Select sources based on benefit
Functional properties of summaries for sets:
• Size of set (number of distinct elements)
• Size of union of two sets
• Size of intersection of two sets
• Summary smaller than the data
• Data must not be reproducible from the summary
Examples: Bloom filters, KMV synopses, …
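As one illustration of a summary with these properties, here is a minimal sketch of a KMV (k minimum values) synopsis using the textbook estimator (k - 1) / (k-th smallest hash). The class name, the SHA-1 based hash, and the inclusion-exclusion estimate for intersections are assumptions made for this sketch; the talk does not specify which variant was used.

```python
import hashlib
from typing import Iterable, List


def unit_hash(item: str) -> float:
    """Map an item to a pseudo-random value in [0, 1)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMV:
    """Keeps only the k smallest distinct hash values of a set."""

    def __init__(self, k: int, items: Iterable[str] = ()) -> None:
        self.k = k
        self.values: List[float] = sorted({unit_hash(i) for i in items})[:k]

    def estimate(self) -> float:
        """Estimated number of distinct elements in the underlying set."""
        if len(self.values) < self.k:
            # Fewer than k distinct hashes: the synopsis has seen them all.
            return float(len(self.values))
        return (self.k - 1) / self.values[-1]

    def union(self, other: "KMV") -> "KMV":
        """Synopsis of the union, computed from the two synopses alone."""
        merged = KMV(min(self.k, other.k))
        merged.values = sorted(set(self.values) | set(other.values))[: merged.k]
        return merged


def intersection_estimate(a: KMV, b: KMV) -> float:
    """Inclusion-exclusion estimate; simple, but noisy for small overlaps."""
    return a.estimate() + b.estimate() - a.union(b).estimate()


small_a = KMV(64, ["a", "b", "c"])
small_b = KMV(64, ["b", "c", "d"])
print(small_a.union(small_b).estimate(), intersection_estimate(small_a, small_b))
```

The synopsis stores at most k hash values regardless of the set size, and the original elements cannot be read back from the hashes.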
Bloom Filters
• Represent each element of the set by k bits in a bit vector, determined by hash functions
• Summary of union/intersection by union/intersection of bit vectors
• Estimation for the number of elements in the underlying set of a vector with t 1-bits
Example (k=2): {dblp:swim12/p1, dblp:swim12/p2}
[Figure: hash1 and hash2 map each element to positions in the bit vector 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0]
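The three bullets can be sketched in a few lines. The slide text does not give the estimation formula, so this sketch assumes the standard estimator n ≈ -(m / k) · ln(1 - t / m) for a filter with m bits, k hash functions, and t bits set; the class name, the SHA-1 based hashing, and the filter size in the usage lines are likewise assumptions.

```python
import hashlib
import math
from typing import Iterable, Iterator


class BloomFilter:
    """Bit vector of m bits; each element sets k hash-determined bits."""

    def __init__(self, m: int, k: int, items: Iterable[str] = ()) -> None:
        self.m = m
        self.k = k
        self.bits = 0  # a Python int serves as the bit vector
        for item in items:
            self.add(item)

    def _positions(self, item: str) -> Iterator[int]:
        """k bit positions, one per (seeded) hash function."""
        for seed in range(self.k):
            digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        """Bitwise OR: exactly the filter of the union of both sets."""
        out = BloomFilter(self.m, self.k)
        out.bits = self.bits | other.bits
        return out

    def intersection(self, other: "BloomFilter") -> "BloomFilter":
        """Bitwise AND: approximates the filter of the intersection."""
        out = BloomFilter(self.m, self.k)
        out.bits = self.bits & other.bits
        return out

    def estimate(self) -> float:
        """Estimated set size from the number t of 1-bits."""
        t = bin(self.bits).count("1")
        if t >= self.m:
            return math.inf  # saturated filter carries no size information
        return -(self.m / self.k) * math.log(1 - t / self.m)


# Elements from the slide's example; the size of 19 bits is an assumption.
example = BloomFilter(19, 2, ["dblp:swim12/p1", "dblp:swim12/p2"])
print(format(example.bits, "019b"), example.estimate())
```

Both filters in a union or intersection must share m, k, and the hash functions. The AND of two filters can contain bits that neither common element set, so intersection sizes derived from it tend to be overestimates.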
Source Selection for Single Triple Pattern
• Benefit of a source: number of new results it can contribute
• Incremental selection algorithm:
  • Maintain summary for union of results from sources already selected
  • Estimate source benefit from summary
  • Select source with highest benefit
  • Stop when target (# results or # requests) reached
• Finally: Evaluate triple pattern at all selected sources; select more sources if too few results
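The incremental loop can be sketched as follows. To keep the example short and exact, plain Python sets stand in for the summaries (a Bloom filter or KMV synopsis would supply estimated sizes instead); the function name, parameters, and result sets are hypothetical. The final step of the slide, re-selecting after the actual evaluation returns too few results, is not covered.

```python
from typing import Dict, List, Optional, Set


def select_sources(
    summaries: Dict[str, Set[str]],
    target_results: Optional[int] = None,
    max_requests: Optional[int] = None,
) -> List[str]:
    """Greedily pick the source that adds the most new results."""
    selected: List[str] = []
    union: Set[str] = set()        # summary of results from selected sources
    remaining = dict(summaries)

    while remaining:
        if max_requests is not None and len(selected) >= max_requests:
            break
        if target_results is not None and len(union) >= target_results:
            break
        # Benefit of a source = number of results not yet in the union.
        # Sorting first makes ties resolve to the alphabetically first name.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - union))
        if not remaining[best] - union:
            break  # no remaining source contributes anything new
        selected.append(best)
        union |= remaining.pop(best)
    return selected


# Hypothetical summaries for illustration only.
SUMMARIES = {
    "source1": {"r1", "r2", "r3", "r4", "r5"},
    "source2": {"r4", "r6"},
    "source3": {"r3", "r4", "r5"},
}

print(select_sources(SUMMARIES))
print(select_sources(SUMMARIES, max_requests=1))
print(select_sources(SUMMARIES, target_results=5))
```

With this data, `source3` is never selected because everything it holds is already covered after `source1`, which is exactly the kind of unnecessary request the benefit estimate is meant to avoid.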
Example (Single Triple Pattern)
[Figure: bit-vector summaries of Source 1, Source 2, and Source 3, and the current result summary]
1. ASK each source
2. Select source with highest number of results
3. Stop if stopping condition is met (recall or number of results)
4. Compute benefit for each remaining source
   Source 2: 2 - … = 1
   Source 3: 3 - … = 1
5. Select source with highest benefit
6. Continue with step 3
Star-Shaped Queries
Multiple triple patterns with a single identical variable
• Not enough to consider each triple pattern separately
• Need to focus on the intersection of the result sets
• Extended incremental algorithm:
  • Init: For each triple pattern, pick the source with the most results
  • Benefit of evaluating a triple pattern at a source: number of new results in the intersection
  • Estimated by intersection of per-pattern summaries (union of summaries from each selected source)
?x imdb:gender "female" .
?x imdb:bornIn dbpedia:Germany .
?x imdb:actedIn imdb:Titanic .
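A minimal sketch of the extended algorithm, again with exact sets standing in for the summaries of the shared variable ?x. The function names, the sources `A` and `B`, and their bindings are hypothetical; the stopping rule (a request budget, or no step with positive benefit) is one possible choice, not necessarily the one used in the talk.

```python
from typing import Dict, List, Optional, Set, Tuple

# pattern -> source -> bindings of the shared variable at that source
Summaries = Dict[str, Dict[str, Set[str]]]
Step = Tuple[str, str]  # (pattern, source)


def joined(unions: Dict[str, Set[str]]) -> Set[str]:
    """Bindings that satisfy every pattern: intersection of per-pattern unions."""
    return set.intersection(*unions.values())


def select_star(summaries: Summaries, max_requests: int) -> List[Step]:
    plan: List[Step] = []
    unions: Dict[str, Set[str]] = {}

    # Init: for each pattern, pick the source with the most results.
    for pattern in sorted(summaries):
        per_source = summaries[pattern]
        best = max(sorted(per_source), key=lambda s: len(per_source[s]))
        plan.append((pattern, best))
        unions[pattern] = set(per_source[best])

    while len(plan) < max_requests:
        current = len(joined(unions))
        best_gain = 0
        best_step: Optional[Step] = None
        for pattern in sorted(summaries):
            for source in sorted(summaries[pattern]):
                if (pattern, source) in plan:
                    continue
                trial = dict(unions)
                trial[pattern] = unions[pattern] | summaries[pattern][source]
                # Benefit: new results in the intersection.
                gain = len(joined(trial)) - current
                if gain > best_gain:
                    best_gain, best_step = gain, (pattern, source)
        if best_step is None:
            break  # no single extra request adds a result
        plan.append(best_step)
        unions[best_step[0]] |= summaries[best_step[0]][best_step[1]]
    return plan


# Hypothetical bindings of ?x for illustration only.
STAR: Summaries = {
    "imdb:gender": {"A": {"x1", "x2", "x3"}, "B": {"x4"}},
    "imdb:bornIn": {"A": {"x1", "x2", "x5"}, "B": {"x4"}},
    "imdb:actedIn": {"A": {"x1"}, "B": {"x2", "x4"}},
}

print(select_star(STAR, max_requests=6))
```

In this data, `x4` becomes a result only if two further requests are added together, so a loop that scores one request at a time stops without finding it. This is a limitation of the sketch as written.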
Complex Queries
Queries with >1 variable and >1 triple pattern
• Summaries not applicable for the whole query:
  • no connection between the summaries for variables ?m and ?p
  • Do new bindings for ?p join with existing bindings for ?m?
• But: separate source selection for each pattern is possible
• Plus: excluding join candidates at execution time reduces effort for nested-loop joins
• Run the full query at sources if no cross-joins are possible
imdb:Tom_Cruise imdb:actedIn ?m .
?m imdb:producedBy ?p .
[Figure: best: 3 local joins; naive: 3x3 joins; improved: 6 joins]
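One way to read "excluding join candidates" is to compare the summaries of the join variable ?m across sources and skip source pairs that cannot produce a join result. The sketch below follows that reading with exact sets in place of summaries; the function name and the bindings are hypothetical, and the numbers are not meant to reproduce the figure.

```python
from typing import Dict, List, Set, Tuple


def join_pairs(
    left: Dict[str, Set[str]], right: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Source pairs whose bindings of the join variable overlap."""
    return [
        (l_name, r_name)
        for l_name in sorted(left)
        for r_name in sorted(right)
        if left[l_name] & right[r_name]
    ]


# Hypothetical ?m bindings per source for illustration only.
ACTED_IN = {  # imdb:Tom_Cruise imdb:actedIn ?m
    "s1": {"m1", "m2"},
    "s2": {"m2"},
    "s3": {"m3"},
}
PRODUCED_BY = {  # ?m imdb:producedBy ?p
    "s1": {"m1", "m2", "m9"},
    "s2": {"m2", "m3"},
    "s3": {"m7"},
}

naive = len(ACTED_IN) * len(PRODUCED_BY)
pruned = join_pairs(ACTED_IN, PRODUCED_BY)
print(naive, len(pruned), pruned)
```

With Bloom filters in place of the sets, a non-empty bitwise AND may be a false positive, so some useless pairs could remain, but no pair that actually joins would be dropped.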
Experimental Evaluation: Setup
• RDF dataset from the first 100,000 IMDB movies and their actors and directors
• Generate overlapping partitions
  • For movies based on genre (28 partitions)
  • For persons based on birthplace and birthdate (22 partitions)
• Queries:
  • 20 single triple patterns
  • 20 star-shaped queries
• Consider minimal exact plan
• Bloom filters of different sizes, KMV synopsis
Triple Pattern Queries
Far fewer requests while retrieving (almost) all results
Star-Shaped Queries
Good efficiency, but effectiveness sometimes suboptimal
Conclusions and Future Work
• Benefit-aware query routing can improve query performance for Linked Data
• Additional benefit for join processing
Future Work:
• Integration of sameAs links
• More general notions of benefit:
  • Transfer time
  • Access cost
  • Data quality