Querying Distributed RDF Data Sources with SPARQL

Presented by Bastian Quilitz and Ulf Leser, Humboldt-Universität zu Berlin, ESWC 2008
2009-07-23
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

Introduction
• SPARQL has to deal with thousands of RDF data sets
  • on a local machine
  • on multiple, distributed machines
• Integrated access to multiple RDF data sources is a key challenge for many semantic web applications
• Current implementations of SPARQL load all RDF graphs onto the local machine
  • This usually incurs a large overhead in network traffic
Center for E-Business Technology

Introduction
• DARQ is an engine for federated SPARQL queries
  • Provides transparent query access to multiple SPARQL services
  • Distributed ARQ, an extension to ARQ (Jena)
  • Available under the GPL license at http://darq.sf.net/
[Figure: "In this presentation, ..". Labels: "Building Sub-queries", "Metadata for each DS", "Data Source", "Do not care"]

Preliminaries
• A SPARQL query Q is defined as Q = (E, DS, R)
  • E: an algebra expression of the SPARQL query
  • DS: an RDF data source
  • R: the query type (SELECT, CONSTRUCT, DESCRIBE, ASK)
• The algebra expression E consists of
  • Graph patterns
    • Triple pattern (TP): (s, p, o)
    • Basic graph pattern (BGP): a set of triple patterns
    • Filtered BGP (FBGP): a BGP with constraints
  • Solution modifiers, such as PROJECTION, DISTINCT, LIMIT or ORDER BY

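An Example SPARQL Query
    SELECT ?name ?mbox
    WHERE {
      ?x foaf:name ?name .
      ?x foaf:mbox ?mbox .
      FILTER (regex(?name, "^Tim") && regex(?mbox, "w3c"))
    }
    ORDER BY ?name
    LIMIT 5
[Figure annotations on the query: Query Type (SELECT), Projection (?name ?mbox), TP (each triple pattern), BGP (both triple patterns), FBGP (the BGP plus the FILTER), Solution Modifiers (ORDER BY, LIMIT)]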

Query Processing
• A query is processed in 4 stages:
  • Parsing: converts the query string into a tree model of the SPARQL query. The DARQ query engine reuses the parser shipped with ARQ
  • Query Planning: the query engine decomposes the original query into multiple sub-queries according to the information in the service descriptions; each sub-query can be answered by one known data source
  • Query Optimization: the query optimizer takes the sub-queries and rewrites them into a cheaper execution plan
  • Query Execution: the query execution plan is executed. The sub-queries are sent to the data sources and the results are integrated

Service Descriptions
• Information about each data source is needed
  • To find the relevant data sources for the different triple patterns
  • To decompose the query into sub-queries
• Service descriptions
  • Describe what data is available from a data source
  • Allow limitations on access patterns to be stated
  • Include statistical information used for query optimization
  • Are represented in RDF

Service Descriptions
• Data Description
  • A service description defines capabilities, which indicate what data is available from the data source
    • Ex) sd:capability [ sd:predicate rdf:type ] ;
  • The definition of capabilities is based on predicates
    • DARQ currently only supports queries with bound predicates
• Limitations on Access Patterns
  • DARQ supports limitations on access patterns
    • Ex) sd:requiredBindings [ sd:subjectBinding foaf:name ] ;
    • Ex) sd:requiredBindings [ sd:objectBinding foaf:name ] ;

Service Descriptions
• Statistical Information
  • Helps the query optimizer find a cost-effective query plan
  • Includes
    • Ns: the total number of triples
    • Optional information for each predicate
      • nD(p): the number of triples with predicate p in data source D
      • sselD(p): the selectivity of a triple pattern with predicate p when the subject is bound (default = 1 / nD(p))
      • oselD(p): the selectivity of a triple pattern with predicate p when the object is bound (default = 1)
  • The statistics are kept simple so that every data source can provide them
  • More precise statistics would be preferable, but are generally not available
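The per-predicate statistics and their defaults can be pictured as a small record. The following is a minimal Python sketch, not DARQ's actual code: the class name `PredicateStats` and its field names are hypothetical, and only the two default values (1 / nD(p) and 1) come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredicateStats:
    """Optional statistics for one predicate p in one data source D.

    Hypothetical layout for illustration; field names are assumptions.
    """
    triples: int                  # nD(p): number of triples with predicate p
    ssel: Optional[float] = None  # sselD(p), if the source provides it
    osel: Optional[float] = None  # oselD(p), if the source provides it

    def __post_init__(self) -> None:
        if self.triples <= 0:
            raise ValueError("nD(p) must be positive")

    def subject_selectivity(self) -> float:
        # Default from the slide: 1 / nD(p) when the subject is bound.
        return self.ssel if self.ssel is not None else 1.0 / self.triples

    def object_selectivity(self) -> float:
        # Default from the slide: 1 when the object is bound.
        return self.osel if self.osel is not None else 1.0
```

A source that publishes only the triple count, e.g. `PredicateStats(triples=50)`, then falls back to both defaults.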

Service Descriptions
• The data source defined in the example can answer queries for foaf:name, foaf:mbox and foaf:weblog
• Objects of a triple with predicate foaf:name must always start with a letter from A to R
• In total it stores 112 triples
• The data source has limitations on access patterns, i.e. a query must contain a triple pattern with predicate foaf:name or foaf:mbox and a bound object
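The access-pattern limitation described above can be checked mechanically. Below is a minimal Python sketch under stated assumptions: triple patterns are plain `(s, p, o)` tuples of strings, variables start with `?`, and `required_bindings` is a hypothetical list of `(position, predicate)` pairs of which at least one must be satisfied. It treats only constants as bound and ignores variables that might be bound by earlier join results.

```python
def is_variable(term: str) -> bool:
    """SPARQL variables are written with a leading '?' in this sketch."""
    return term.startswith("?")


def satisfies_access_pattern(patterns, required_bindings) -> bool:
    """Return True if at least one required binding is met by the patterns.

    patterns:          list of (subject, predicate, object) string tuples
    required_bindings: list of ("subject" | "object", predicate) pairs
                       (hypothetical encoding of sd:requiredBindings)
    """
    if not required_bindings:
        return True  # the source imposes no limitation
    for position, predicate in required_bindings:
        for s, p, o in patterns:
            if p != predicate:
                continue
            term = s if position == "subject" else o
            if not is_variable(term):
                return True
    return False


# The example source: foaf:name or foaf:mbox must appear with a bound object.
REQUIRED = [("object", "foaf:name"), ("object", "foaf:mbox")]
```

With `REQUIRED` as above, a query containing `("?x", "foaf:name", "Alice")` would be accepted, while one containing only `("?x", "foaf:name", "?name")` would not.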

Query Planning
• Query planning is based on the information provided by service descriptions
• It has two stages
  • Source Selection: determines which data sources are relevant to a given query
    • The algorithm simply matches the given triple patterns against the capabilities of the data sources
      • Ex) sd:capability [ sd:predicate rdf:type ] ;
      • SELECT ?x WHERE { ?x rdf:type foaf:Person }
    • As a result, every triple pattern in a BGP has a set of corresponding data sources
    • The results of source selection are used to build sub-queries that can be answered by the data sources
  • Building Sub-Queries
    • Each data source has a sub-query
    • Each sub-query has a filtered BGP
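The two planning stages can be sketched in a few lines of Python. This is an illustration under assumptions, not DARQ's implementation: capabilities are reduced to a hypothetical mapping from source name to its set of predicates, triple patterns are `(s, p, o)` string tuples, and the grouping rule (patterns with exactly one matching source are merged into that source's sub-query, patterns with several matching sources are sent to each source separately) is one plausible choice assumed here.

```python
from collections import defaultdict


def select_sources(patterns, capabilities):
    """Match each triple pattern's predicate against the source capabilities.

    capabilities: dict mapping a source name to the set of predicates it
                  lists (hypothetical simplification of sd:capability).
    Returns a dict: triple pattern -> list of matching source names.
    """
    selection = {}
    for tp in patterns:
        predicate = tp[1]
        if predicate.startswith("?"):
            raise ValueError("unbound predicates are not supported")
        selection[tp] = [name for name, preds in capabilities.items()
                         if predicate in preds]
    return selection


def build_subqueries(selection):
    """Group triple patterns into (source, [patterns]) sub-queries."""
    grouped = defaultdict(list)  # patterns that only one source can answer
    separate = []                # one sub-query per source for the rest
    for tp, sources in selection.items():
        if len(sources) == 1:
            grouped[sources[0]].append(tp)
        else:
            separate.extend((src, [tp]) for src in sources)
    return list(grouped.items()) + separate
```

A pattern with no matching source produces no sub-query at all, which a real planner would treat as "the query has no answer".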

Query Planning
    SELECT ?name ?mbox
    WHERE {
      ?x foaf:name ?name .
      ?x foaf:mbox ?mbox .
      FILTER (regex(?name, "^Tim") && regex(?mbox, "w3c"))
    }
    ORDER BY ?name
    LIMIT 5
[Figure: DARQ sends the triple patterns (?x foaf:name ?name) and (?x foaf:mbox ?mbox) to two data sources. Both service descriptions list the capabilities sd:predicate foaf:name and sd:predicate foaf:mbox. Sample data shown for the sources: (Person, name, "TBL"), (Person, mbox, "T@x.y") and (Person, name, "ABC"), (Person, mbox, "A@b.c")]

Query Optimization - Logical
• Rule-based query rewriting
  • Based on [Perez, J. et al., ISWC 2006]
  • Reduces the number of BGPs and variables
• Moves value constraints (filters) into the sub-queries
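Moving value constraints into sub-queries amounts to attaching each filter to every sub-query that binds all of the filter's variables. The Python sketch below illustrates that idea only; it is not the rewriting rule set of the cited paper. It assumes the FILTER has already been split into conjuncts, that each conjunct comes with its variable set, and that sub-queries are a hypothetical mapping from source name to `(s, p, o)` string tuples.

```python
def variables_of(patterns):
    """Collect the variables ('?'-prefixed terms) used by a list of patterns."""
    return {term for tp in patterns for term in tp if term.startswith("?")}


def push_filters(subqueries, filters):
    """Attach each filter to the sub-queries that bind all of its variables.

    subqueries: dict source name -> list of (s, p, o) triple patterns
    filters:    list of (expression string, set of variables) conjuncts
    Returns (pushed, remaining): filters per source, and the filters that
    must still be evaluated by the federating engine after the join.
    """
    pushed = {src: [] for src in subqueries}
    remaining = []
    for expr, fvars in filters:
        targets = [src for src, pats in subqueries.items()
                   if fvars <= variables_of(pats)]
        if targets:
            for src in targets:
                pushed[src].append(expr)
        else:
            remaining.append(expr)
    return pushed, remaining
```

For the example query, `regex(?name, "^Tim")` can travel with the sub-query that binds `?name`, so fewer intermediate results cross the network; a constraint spanning variables from different sub-queries stays with the engine.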

Query Optimization - Physical
• Physical optimization relies on estimating intermediate result sizes (cost-based optimization)
• The result size estimation is based on the statistics provided in the service descriptions
  • Estimates are needed for joins, single triple patterns and multiple triple patterns (BGPs)
  • An example of a single triple pattern
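For a single triple pattern, a size estimate can be derived from the triple count and the two selectivities of its predicate. The Python function below is a sketch under assumptions rather than the exact cost function of the paper: patterns are `(s, p, o)` string tuples with `?`-prefixed variables, the function name is hypothetical, and the constant used when both subject and object are bound is an illustrative choice.

```python
def estimate_result_size(pattern, n, ssel, osel):
    """Estimate how many results a single triple pattern returns.

    pattern: (subject, predicate, object) tuple of strings
    n:       nD(p), the number of triples with this predicate in the source
    ssel:    sselD(p), selectivity when the subject is bound
    osel:    oselD(p), selectivity when the object is bound
    """
    s, _predicate, o = pattern
    subject_bound = not s.startswith("?")
    object_bound = not o.startswith("?")

    if subject_bound and object_bound:
        # Assumption: the pattern is an existence check, so a small
        # constant below one result is used.
        return 0.5
    if subject_bound:
        return n * ssel
    if object_bound:
        return n * osel
    # Neither end is bound: every triple with this predicate matches.
    return float(n)
```

With the default selectivities from the service description (ssel = 1 / n, osel = 1), a pattern with a bound subject is estimated at about one result, and a pattern with a bound object at n results.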

Evaluation
• Dataset: a subset of DBpedia, 31.5 million triples in total
  • Contains RDF data extracted from Wikipedia
  • http://dbpedia.org

Evaluation
• 2 physical machines, 5 logical SPARQL endpoints

Evaluation
• Optimization yields significant improvements
• My opinion
  • The experiment doesn't count the loading time
  • DARQ needs to be compared with other systems
    • http://esw.w3.org/topic/LargeTripleStores

Conclusion
• DARQ offers a single interface for querying multiple, distributed SPARQL endpoints
  • Using the SPARQL standard => flexible
• Using service descriptions
  • Data sources can be added and/or removed dynamically
  • A query can be federated and optimized with statistical information
• Limitations
  • Predicates must be bound (Sub. ?p Obj. is not allowed)
  • CONSTRUCT, DESCRIBE, ASK are not supported
  • GRAPH, UNION, OPTIONAL are not supported

Paper Evaluation
• Pros
  • Good idea
    • Distributed SPARQL processing is a relatively new research field
    • Defining service descriptions
  • Deals with all aspects of a query engine
  • Implementation
• My Comments
  • Too simple, and still slow
  • Many limitations
