GRIN: A Graph Based RDF Index • Octavian Udrea¹, Andrea Pugliese², V. S. Subrahmanian¹ • ¹University of Maryland, College Park • ²Università di Calabria
Motivation • Plenty of large RDF datasets: • TAP, GovTrack, ChefMoz, CIA World Factbook • Many more (see rdfdata.org) • Query languages: RDQL, RQL, SPARQL • DB systems: Jena, Sesame, RDFBroker • Indexing? • Currently based on relational database indexes • Has to be rooted in the characteristics of the query language
Contributions • Lightweight mechanism for indexing large RDF datasets • GRIN: Graph-based RDF INdex • Query answering algorithms for SPARQL-like queries • Evaluation on two real-world datasets: TAP (Stanford) and ChefMoz (chefmoz.org)
Outline • RDF data and queries • The GRIN Index structure • Answering queries • Experimental evaluation
Query example in SPARQL • SELECT ?v1 ?v2 ?v3 WHERE { {(?v1 attire ?v3) . (?v1 cuisine Italian)} {(?v2 attire ?v3) . (?v2 cuisine Italian) . (?v2 location Norfolk)} {(Norfolk locatedIn NE/USA)} } FROM ChefMoz
Native RDF systems: Jena2 • Stores RDF as (subject, property, value) in a relational table • Indexes on each of the three attributes • Translates SPARQL/RDQL into SQL • 6 self-joins
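To make the join cost concrete, the following is a minimal sketch of the triple-table approach using Python's built-in `sqlite3`. The table name `triples`, the sample restaurants, and the hand-written SQL are illustrative assumptions, not Jena2's actual schema or translator output. The example query has six triple patterns, so the SQL below uses six aliases of the same table.

```python
import sqlite3

# Hypothetical triple table and sample data (not Jena2's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Grivanti", "cuisine", "Italian"),
        ("Grivanti", "attire", "casual"),
        ("Luigis", "cuisine", "Italian"),
        ("Luigis", "attire", "casual"),
        ("Luigis", "location", "Norfolk"),
        ("Norfolk", "locatedIn", "NE/USA"),
    ],
)

# One alias of the triple table per triple pattern in the example query.
sql = """
SELECT t1.subject AS v1, t3.subject AS v2, t1.value AS v3
FROM triples t1, triples t2, triples t3, triples t4, triples t5, triples t6
WHERE t1.property = 'attire'
  AND t2.subject = t1.subject AND t2.property = 'cuisine' AND t2.value = 'Italian'
  AND t3.property = 'attire' AND t3.value = t1.value
  AND t4.subject = t3.subject AND t4.property = 'cuisine' AND t4.value = 'Italian'
  AND t5.subject = t3.subject AND t5.property = 'location' AND t5.value = 'Norfolk'
  AND t6.subject = 'Norfolk' AND t6.property = 'locatedIn' AND t6.value = 'NE/USA'
"""
rows = sorted(conn.execute(sql).fetchall())
```

Every additional triple pattern adds another alias of the same table, which is why join cost dominates on large datasets.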
Native RDF systems: Sesame • Broekstra et al., ISWC 2002 • The Sesame SAIL API improves on Jena: • Supports RDF Schema inference • Separates RDFS from the triple table • Supports database schema generation based on the underlying RDF schema of a dataset • The problem of too many joins remains
Native RDF systems: RDFBroker • Sintek et al., ESWC 2006 • The database schema is built based on signatures – the set of properties used on a resource • Reduces the number of joins between tables
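The signature idea can be sketched in a few lines. This is a simplified illustration that only groups resources by the set of properties they use; the function name `signatures` and the sample triples are assumptions, and RDFBroker's actual schema generation is likely more elaborate.

```python
from collections import defaultdict

# Hypothetical sample triples: (subject, property, value).
triples = [
    ("Grivanti", "cuisine", "Italian"),
    ("Grivanti", "attire", "casual"),
    ("Luigis", "cuisine", "Italian"),
    ("Luigis", "attire", "casual"),
    ("Luigis", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

def signatures(triples):
    """Group subjects by signature, i.e. the set of properties used on them."""
    props = defaultdict(set)
    for subject, prop, _ in triples:
        props[subject].add(prop)
    tables = defaultdict(list)
    for subject, used in props.items():
        tables[frozenset(used)].append(subject)
    return dict(tables)

tables = signatures(triples)
```

Each key of `tables` would correspond to one database table whose columns are the properties in the signature, so properties of one resource can be read from a single row instead of being joined.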
GRIN intuition • Resources “closer” in the RDF graph are more likely to be part of the same answer • Hence they should appear on the same page • GRIN will group resources in circles around selected center resources • Query evaluation: • Find the smallest circle that contains the answer • Evaluate query only on resources in that circle
The GRIN Index structure • GRIN is a binary tree in which: • Leaf nodes are sets of resources (and the associated triples) • Inner nodes are circles consisting of a center resource and a radius • Each node is fully contained in its parent • Distance metric: shortest path distance in the undirected graph
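A minimal sketch of the pieces named on this slide: the undirected shortest-path distance (computed here by breadth-first search), a circle as a center plus radius, and a binary tree node. The class and function names (`Node`, `distances_from`, `circle`) and the sample triples are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Hypothetical sample triples: (subject, property, value).
triples = [
    ("Grivanti", "cuisine", "Italian"),
    ("Grivanti", "attire", "casual"),
    ("Luigis", "cuisine", "Italian"),
    ("Luigis", "attire", "casual"),
    ("Luigis", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

def build_undirected(triples):
    """Adjacency sets over resources, ignoring edge direction and labels."""
    adj = {}
    for s, _, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    return adj

def distances_from(adj, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

@dataclass
class Node:
    center: Optional[str] = None         # set on inner nodes
    radius: Optional[int] = None         # set on inner nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    resources: frozenset = frozenset()   # set on leaf nodes

def circle(adj, center, radius):
    """All resources within `radius` of `center`."""
    return {r for r, d in distances_from(adj, center).items() if d <= radius}

adj = build_undirected(triples)
inner = Node(center="Grivanti", radius=1)
members = circle(adj, inner.center, inner.radius)
```

Here `members` holds the resources at distance at most 1 from Grivanti; a larger radius yields a superset, which matches the requirement that each node be contained in its parent when the parent circle has the same center and a larger radius.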
Building the index: clustering • Standard k-medoids clustering (Kaufman & Rousseeuw, 1987) • How many clusters? • R is the set of resources • M is the maximum number of resources per page • Average link gives the best performance for the inter-cluster distance
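As a rough sketch of the clustering step, the code below runs a simple alternating k-medoids (assign each resource to its nearest medoid, then re-pick each cluster's medoid) over shortest-path distances. This is a simplified variant, not the full swap-based procedure of Kaufman and Rousseeuw, and it is not the authors' implementation. The number of clusters `k` is passed in by the caller, and the six-node path graph is a made-up input.

```python
from collections import deque

def bfs(adj, source):
    """Shortest-path distances (in edges) from source in an undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def k_medoids(adj, k, max_iter=20):
    """Alternating k-medoids; returns {medoid: sorted list of members}."""
    nodes = sorted(adj)
    dist = {u: bfs(adj, u) for u in nodes}
    inf = float("inf")

    def d(u, v):
        return dist[u].get(v, inf)   # unreachable pairs count as infinitely far

    medoids = nodes[:k]              # deterministic (arbitrary) initialization
    for _ in range(max_iter):
        # Assignment step: each node joins its nearest medoid (ties by name).
        clusters = {m: [] for m in medoids}
        for u in nodes:
            nearest = min(medoids, key=lambda m: (d(m, u), m))
            clusters[nearest].append(u)
        # Update step: the new medoid minimizes total distance to its cluster.
        new_medoids = sorted(
            min(members, key=lambda c: (sum(d(c, u) for u in members), c))
            for members in clusters.values()
        )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters

# Made-up path graph a - b - c - d - e - f.
path = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d", "f"}, "f": {"e"},
}
clusters = k_medoids(path, 2)
```

Medoids are actual resources rather than averaged points, which suits a graph setting where only pairwise distances are defined.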
Queries to constraints • Extract constraints from the query: • d(?v1, Italian) ≤ 1 • d(?v2, Norfolk) ≤ 1 • d(?v3, Italian) ≤ 2 • …and so on
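One plausible way to derive such constraints, sketched below, is to treat the query itself as an undirected graph and take the shortest path between each variable and each constant: any answer must map that path onto a walk of the same length in the data. The function name `constraints` is an assumption; the patterns are those of the example query.

```python
from collections import deque

# Triple patterns of the example query; terms starting with "?" are variables.
patterns = [
    ("?v1", "attire", "?v3"),
    ("?v1", "cuisine", "Italian"),
    ("?v2", "attire", "?v3"),
    ("?v2", "cuisine", "Italian"),
    ("?v2", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

def bfs(adj, source):
    """Shortest-path distances (in edges) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def constraints(patterns):
    """Map (variable, constant) to an upper bound on their distance."""
    adj = {}
    for s, _, o in patterns:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    result = {}
    for term in sorted(adj):
        if not term.startswith("?"):
            continue
        dist = bfs(adj, term)
        for other, steps in dist.items():
            if not other.startswith("?"):
                result[(term, other)] = steps
    return result

bounds = constraints(patterns)
```

On the example query this reproduces the three constraints listed on the slide, plus further ones such as a bound between `?v2` and `Italian`.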
Query evaluation • Goal: identify the smallest circle that is guaranteed to contain an answer to the query • Perform a depth-first traversal • For each index node, evaluate the constraints • If the constraints guarantee an answer, perform subgraph matching
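The traversal can be sketched as a recursive depth-first search that descends only while the constraints still guarantee an answer. The `Node` class, the `guaranteed` predicate, and the toy tree are hypothetical stand-ins; in GRIN the predicate would be the constraint check against each node's center and radius.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    name: str                        # hypothetical label, for illustration only
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def smallest_nodes(node: Optional[Node],
                   guaranteed: Callable[[Node], bool]) -> List[Node]:
    """Deepest nodes for which `guaranteed` holds, found depth-first.

    A subtree is pruned as soon as its root fails the check, which is safe
    because every node is fully contained in its parent.
    """
    if node is None or not guaranteed(node):
        return []
    deeper = (smallest_nodes(node.left, guaranteed)
              + smallest_nodes(node.right, guaranteed))
    return deeper or [node]

# Toy tree; `passing` stands in for nodes whose circle satisfies the constraints.
tree = Node("root", Node("A", Node("A1"), Node("A2")), Node("B"))
passing = {"root", "A", "A2"}
result = [n.name for n in smallest_nodes(tree, lambda n: n.name in passing)]
```

Subgraph matching would then run only on the resources of the returned nodes rather than on the whole dataset.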
Evaluating constraints • Constraints: • d(?v1, Italian) ≤ 1, d(?v2, Norfolk) ≤ 1, d(?v3, Italian) ≤ 2 • Question: is ?v1 in the circle (Grivanti, 3)? • d(Grivanti, ?v1) ≤ d(Grivanti, Italian) + d(?v1, Italian) ≤ 1 + 1 = 2 • Since 2 ≤ 3, ?v1 must be in the circle (Grivanti, 3)
Evaluating constraints • Question: is ?v3 in (Grivanti, 3)? • d(Grivanti, ?v3) ≤ d(Grivanti, Italian) + d(Italian, ?v3) ≤ 1 + 2 = 3 • ?v3 must be in (Grivanti, 3) • Similarly, ?v2 is in the same circle
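The triangle-inequality test worked through on these two slides can be written as a small function. The name `var_in_circle` and the dictionary layout are assumptions; the value d(Grivanti, Italian) = 1 is the one used in the slides, and the extra bound d(?v2, Italian) ≤ 1 is taken from the `(?v2 cuisine Italian)` pattern of the example query.

```python
def var_in_circle(center_to_const, constraints, var, radius):
    """True if the constraints guarantee that `var` lies within `radius`.

    center_to_const: known distances from the circle's center to constants.
    constraints:     {(variable, constant): upper bound on their distance}.
    Uses d(center, var) <= d(center, const) + d(const, var).
    """
    bounds = [
        center_to_const[const] + bound
        for (v, const), bound in constraints.items()
        if v == var and const in center_to_const
    ]
    # Without any usable bound, membership cannot be guaranteed.
    return bool(bounds) and min(bounds) <= radius

# Distances from the center Grivanti, as assumed in the slides.
center_to_const = {"Italian": 1}
constraints = {
    ("?v1", "Italian"): 1,
    ("?v2", "Norfolk"): 1,
    ("?v2", "Italian"): 1,   # from the pattern (?v2 cuisine Italian)
    ("?v3", "Italian"): 2,
}
in_circle = {
    v: var_in_circle(center_to_const, constraints, v, 3)
    for v in ("?v1", "?v2", "?v3")
}
```

With radius 3 all three variables are guaranteed to fall inside the circle; with radius 2 the bound for `?v3` (1 + 2 = 3) no longer suffices.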
Subgraph matching • Perform subgraph matching on the resources in the circles guaranteed to contain an answer • Algorithm by Cordella et al., IEEE PAMI 26(10), 2004 • Worst-case time complexity of O(N!) • Where N is the maximum number of nodes in either graph • In practice, GRIN makes N very small
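For illustration only, the sketch below matches the example query against a handful of triples with a naive backtracking search. It is not the algorithm by Cordella et al. cited above, just a stand-in that shows what the matching step computes; the function name `match` and the sample data are assumptions.

```python
# Hypothetical triples restricted to one small circle.
triples = [
    ("Grivanti", "cuisine", "Italian"),
    ("Grivanti", "attire", "casual"),
    ("Luigis", "cuisine", "Italian"),
    ("Luigis", "attire", "casual"),
    ("Luigis", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

# Triple patterns of the example query; terms starting with "?" are variables.
patterns = [
    ("?v1", "attire", "?v3"),
    ("?v1", "cuisine", "Italian"),
    ("?v2", "attire", "?v3"),
    ("?v2", "cuisine", "Italian"),
    ("?v2", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

def match(patterns, triples):
    """Yield variable bindings under which every pattern is a known triple."""
    triple_set = set(triples)

    def extend(i, binding):
        if i == len(patterns):
            yield dict(binding)
            return
        for triple in triple_set:
            new = dict(binding)
            ok = True
            for q, t in zip(patterns[i], triple):
                if q.startswith("?"):
                    # Bind the variable, or check an existing binding.
                    if new.setdefault(q, t) != t:
                        ok = False
                        break
                elif q != t:
                    ok = False
                    break
            if ok:
                yield from extend(i + 1, new)

    yield from extend(0, {})

answers = sorted(match(patterns, triples), key=lambda b: b["?v1"])
```

The search cost grows quickly with the number of candidate triples, which is why shrinking the candidate set to one small circle matters.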
Experimental framework • Comparison between GRIN, Sesame, Jena2 and RDFBroker (in-memory) • Index build time • Memory consumption at query time • Query time • Two real-world datasets: • TAP (Stanford): datasets between 1.5 MB and 300 MB • ChefMoz (chefmoz.org): 220 MB
Conclusions • Method for indexing large RDF graphs adapted to the characteristics of RDF queries • Avoids expensive join operations • Gives better query times than Jena2, Sesame and RDFBroker • Current and future work: • Disk-based index • Analysis of overlap and coverage