1 / 40

GRIN: A Graph Based RDF Index

GRIN: A Graph Based RDF Index. Octavian Udrea 1 Andrea Pugliese 2 V. S. Subrahmanian 1 1 University of Maryland College Park 2 Università di Calabria. Motivation. Plenty of large RDF datasets: TAP, GovTrack, ChefMoz, CIA World Factbook Many many more (see rdfdata.org)

livvy
Download Presentation

GRIN: A Graph Based RDF Index

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. GRIN: A Graph Based RDF Index Octavian Udrea1 Andrea Pugliese2 V. S. Subrahmanian1 1University of Maryland College Park 2Università di Calabria

  2. Motivation • Plenty of large RDF datasets: • TAP, GovTrack, ChefMoz, CIA World Factbook • Many many more (see rdfdata.org) • Query languages: RDQL, RQL, SPARQL • DB systems: Jena, Sesame, RDFBroker • Indexing? • Based on relational database indexes • Has to be rooted in the characteristics of the query language

  3. Contributions • Lightweight mechanism for indexing large RDF datasets • GRIN: Graph-based RDF INdex • Query answer algorithms for SPARQL-like queries • Evaluation on two real-world datasets: TAP (Stanford) and ChefMoz (chefmoz.org)

  4. Outline • RDF data and queries • The GRIN Index structure • Answering queries • Experimental evaluation

  5. RDF graph example (ChefMoz)

  6. RDF query example

  7. Query example in SPARQL X SELECT ?v1 ?v2 ?v3 WHERE { {(?v1 attire ?v3) . (?v1 cuisine Italian)} {(?v2 attire ?v3) . (?v2 cuisine Italian) . (?v2 location Norfolk)} {(Norfolk locatedIn NE/USA)} } FROM ChefMoz

  8. Native RDF systems: Jena2 • Stores RDF as (subject, property, value) in a relational table • Indexes on each of the three attributes • Translates SPARQL/RDQL into SQL X 6self-joins

  9. Native RDF systems: Sesame • Broekstra et al., ISWC 2002 • The Sesame SAIL API improves on Jena: • Supports RDF Schema inference • Separates RDFS from the triple table • Supports database schema generation based on the underlying RDF schema of a dataset • The problem of too many joins remains

  10. Native RDF systems: RDFBroker • Sintek et al., ESWC 2006 • The database schema is built based on signatures – the set of properties used on a resource • Reduces the number of joins between tables

  11. The human perspective

  12. The human perspective

  13. The human perspective

  14. The human perspective

  15. The human perspective

  16. Outline • RDF data and queries • The GRIN Index structure • Answering queries • Experimental evaluation

  17. GRIN intuition • Resources “closer” in the RDF graph are more likely to be part of the same answer • Hence they should appear on the same page • GRIN will group resources in circles around selected center resources • Query evaluation: • Find the smallest circle that contains the answer • Evaluate query only on resources in that circle

  18. The GRIN Index structure • GRIN is a binary tree in which: • Leaf nodes are sets of resources (and the associated triples) • Inner nodes are circles consisting of a center resource and a radius • Each node is fully contained in its parent • Distance metric: shortest path distance in the undirected graph

  19. Building the index: clustering

  20. Building the index: clustering

  21. Building the index: clustering

  22. Building the index: clustering

  23. Building the index: clustering • Standard k-medoids clustering (Kaufman & Rousseeuw, 1987) • How many clusters? • R is the set of resources • M is the maximum number of resources per page • Average link gives the best performance for the inter-cluster distance

  24. Building the index: the tree

  25. Building the index: the tree

  26. Building the index: the tree

  27. Outline • RDF data and queries • The GRIN Index structure • Answering queries • Experimental evaluation

  28. Queries to constraints • Extract constraints from the query: • d(?v1, Italian) ≤ 1 • d(?v2, Norfolk) ≤ 1 • d(?v3, Italian) ≤ 2 • …and so on

  29. Query evaluation • Goal: identify the smallest circle that is guaranteed to contain an answer to the query • Perform a depth-first traversal • For each index node, evaluate the constraints • If the constraints guarantee an answer, perform subgraph matching

  30. Query evaluation

  31. Evaluating constraints • Constraints: • d(?v1, Italian) ≤ 1, d(?v2, Norfolk) ≤ 1, d(?v3, Italian) ≤ 2 • Question: is ?v1 in the circle (Grivanti, 3)? • d(Grivanti,?v1) ≤ d(Grivanti, Italian) + d(?v1, Italian) ≤ 1 + 1 = 2 • ?v1 must be in the circle (Grivanti, 3)

  32. Evaluating constraints • Question: is ?v3 in (Grivanti, 3)? • d(Grivanti, ?v3) ≤ d(Grivanti, Italian) + d(Italian, ?v3) ≤ 1 + 2 = 3 • ?v3 must be in (Grivanti, 3) • Similarly, ?v2 is in the same circle

  33. Subgraph matching • Perform subgraph matching on the resources in the circles guaranteed to contain an answer • Algorithm by Cordella et. al, IEEE PAMI 26(10), 2006 • Worst-time complexity of O(N!) • Where N is the maximum number of nodes in either graph • In practice, GRIN makes N very small

  34. Outline • RDF data and queries • The GRIN Index structure • Answering queries • Experimental evaluation

  35. Experimental framework • Comparison between GRIN, Sesame, Jena2 and RDFBroker (in-memory) • Index build time • Memory consumption at query time • Query time • Two real-world datasets: • TAP (Stanford): datasets between 1.5MB and 300MB • ChefMoz (chefmoz.org): 220 MB

  36. Index build time

  37. Memory consumption

  38. Query time

  39. Average degree of a query node

  40. Conclusions • Method for indexing large RDF graphs adapted to the characteristics of RDF queries • Avoids expensive join operations • Gives better query times than Jena2, Sesame and RDFBroker • Current and future work: • Disk-based index • Analysis of overlap and coverage

More Related