GRIN: A Graph Based RDF Index

Octavian Udrea¹, Andrea Pugliese², V. S. Subrahmanian¹ • ¹University of Maryland College Park • ²Università di Calabria

Motivation • Plenty of large RDF datasets: • TAP, GovTrack, ChefMoz, CIA World Factbook • Many more (see rdfdata.org) • Query languages: RDQL, RQL, SPARQL • DB systems: Jena, Sesame, RDFBroker • Indexing? • Currently based on relational database indexes • Should be rooted in the characteristics of the query language

Contributions • Lightweight mechanism for indexing large RDF datasets • GRIN: Graph-based RDF INdex • Query answering algorithms for SPARQL-like queries • Evaluation on two real-world datasets: TAP (Stanford) and ChefMoz (chefmoz.org)

Outline • RDF data and queries • The GRIN Index structure • Answering queries • Experimental evaluation

RDF graph example (ChefMoz)

RDF query example

Query example in SPARQL • SELECT ?v1 ?v2 ?v3 WHERE { {(?v1 attire ?v3) . (?v1 cuisine Italian)} {(?v2 attire ?v3) . (?v2 cuisine Italian) . (?v2 location Norfolk)} {(Norfolk locatedIn NE/USA)} } FROM ChefMoz

Native RDF systems: Jena2 • Stores RDF as (subject, property, value) in a relational table • Indexes on each of the three attributes • Translates SPARQL/RDQL into SQL • The example query translates into 6 self-joins
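To make the cost of the triple-table layout concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not Jena2's actual schema or generated SQL: the table name `triples`, the index names, and the sample rows (including the restaurant `Vivace`) are assumptions made for illustration. The example query needs the triple table six times, once per triple pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, value TEXT)")
# One index per attribute, mirroring the layout described above.
for col in ("subject", "property", "value"):
    conn.execute(f"CREATE INDEX idx_{col} ON triples ({col})")

# Hypothetical sample data, not taken from the real ChefMoz dataset.
rows = [
    ("Grivanti", "cuisine", "Italian"),
    ("Grivanti", "attire", "Casual"),
    ("Vivace", "cuisine", "Italian"),
    ("Vivace", "attire", "Casual"),
    ("Vivace", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)

# One alias of the triple table per triple pattern in the example query.
sql = """
SELECT DISTINCT t1.subject, t3.subject, t1.value
FROM triples t1, triples t2, triples t3, triples t4, triples t5, triples t6
WHERE t1.property = 'attire'
  AND t2.subject = t1.subject AND t2.property = 'cuisine' AND t2.value = 'Italian'
  AND t3.property = 'attire' AND t3.value = t1.value
  AND t4.subject = t3.subject AND t4.property = 'cuisine' AND t4.value = 'Italian'
  AND t5.subject = t3.subject AND t5.property = 'location' AND t5.value = 'Norfolk'
  AND t6.subject = 'Norfolk' AND t6.property = 'locatedIn' AND t6.value = 'NE/USA'
ORDER BY t1.subject, t3.subject
"""
results = conn.execute(sql).fetchall()  # rows of (?v1, ?v2, ?v3)
```

Every additional triple pattern in a query adds another alias of the same table, which is the join blow-up that the rest of the talk tries to avoid.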

Native RDF systems: Sesame • Broekstra et al., ISWC 2002 • The Sesame SAIL API improves on Jena: • Supports RDF Schema inference • Separates RDFS from the triple table • Supports database schema generation based on the underlying RDF schema of a dataset • The problem of too many joins remains

Native RDF systems: RDFBroker • Sintek et al., ESWC 2006 • The database schema is built based on signatures – the set of properties used on a resource • Reduces the number of joins between tables

The human perspective

GRIN intuition • Resources “closer” in the RDF graph are more likely to be part of the same answer • Hence they should appear on the same page • GRIN will group resources in circles around selected center resources • Query evaluation: • Find the smallest circle that contains the answer • Evaluate query only on resources in that circle

The GRIN Index structure • GRIN is a binary tree in which: • Leaf nodes are sets of resources (and the associated triples) • Inner nodes are circles consisting of a center resource and a radius • Each node is fully contained in its parent • Distance metric: shortest path distance in the undirected graph
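The structure above can be sketched in a few lines of Python. This is an illustrative reading of the slide, not the authors' implementation: the names `GrinNode`, `build_undirected`, `distances_from` and `circle` are hypothetical, and the sample triples are made up. Distances are computed by breadth-first search over the undirected version of the RDF graph.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Hypothetical sample data, not taken from the real ChefMoz dataset.
TRIPLES = [
    ("Grivanti", "cuisine", "Italian"),
    ("Grivanti", "attire", "Casual"),
    ("Vivace", "cuisine", "Italian"),
    ("Vivace", "attire", "Casual"),
    ("Vivace", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

def build_undirected(triples):
    """Adjacency sets that ignore edge direction and property labels."""
    adj = {}
    for s, _p, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    return adj

def distances_from(adj, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def circle(adj, center, radius):
    """All resources within the given distance of the center."""
    return {r for r, d in distances_from(adj, center).items() if d <= radius}

@dataclass
class GrinNode:
    center: Optional[str] = None        # inner nodes only
    radius: Optional[int] = None        # inner nodes only
    left: Optional["GrinNode"] = None
    right: Optional["GrinNode"] = None
    resources: frozenset = frozenset()  # leaf nodes only

adj = build_undirected(TRIPLES)
dist = distances_from(adj, "Grivanti")
leaf = GrinNode(resources=frozenset(circle(adj, "Grivanti", 1)))
inner = GrinNode(center="Grivanti", radius=3, left=leaf)
```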

Building the index: clustering

Building the index: clustering • Standard k-medoids clustering (Kaufman & Rousseeuw, 1987) • How many clusters? • R is the set of resources • M is the maximum number of resources per page • Average link gives the best performance for the inter-cluster distance

Building the index: the tree

Queries to constraints • Extract constraints from the query: • d(?v1, Italian) ≤ 1 • d(?v2, Norfolk) ≤ 1 • d(?v3, Italian) ≤ 2 • …and so on
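One way to read the constraints listed above: if a variable and a constant are k edges apart in the query pattern, then any answer must bind the variable to a resource at most k edges from that constant in the data graph. The sketch below follows that reading; the slide only lists the resulting constraints, so the function name `extract_constraints` and the exact procedure are assumptions.

```python
from collections import deque

# The triple patterns of the example query.
QUERY = [
    ("?v1", "attire", "?v3"),
    ("?v1", "cuisine", "Italian"),
    ("?v2", "attire", "?v3"),
    ("?v2", "cuisine", "Italian"),
    ("?v2", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

def extract_constraints(patterns):
    """Map (variable, constant) to an upper bound on their distance."""
    # Treat the query pattern as an undirected graph.
    adj = {}
    for s, _p, o in patterns:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    constraints = {}
    for var in [n for n in adj if n.startswith("?")]:
        # Breadth-first search from the variable over the query graph.
        dist = {var: 0}
        queue = deque([var])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        for node, d in dist.items():
            if not node.startswith("?"):
                constraints[(var, node)] = d
    return constraints

constraints = extract_constraints(QUERY)
```

On the example query this reproduces the three constraints shown on the slide, plus further ones such as d(?v2, Italian) ≤ 1.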

Query evaluation • Goal: identify the smallest circle that is guaranteed to contain an answer to the query • Perform a depth-first traversal • For each index node, evaluate the constraints • If the constraints guarantee an answer, perform subgraph matching
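The traversal can be sketched as follows. This is a simplified illustration rather than the published algorithm: the `Node` class, the precomputed `dist_to_const` table, and the tree itself (centers, radii, distances) are hypothetical. The search descends depth-first and returns the deepest circle it finds that is guaranteed to contain a binding for every variable; subgraph matching would then run only on that circle. If even the root fails the test, this sketch returns `None`, and the query would have to be evaluated on the whole graph.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    center: str
    radius: int
    dist_to_const: dict            # assumed precomputed d(center, constant)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def contains_all(node, constraints, variables):
    """True if every variable is guaranteed to lie inside the node's circle."""
    for var in variables:
        # Triangle inequality: d(center, var) <= d(center, c) + d(var, c).
        bounds = [node.dist_to_const[c] + k
                  for (v, c), k in constraints.items()
                  if v == var and c in node.dist_to_const]
        if not bounds or min(bounds) > node.radius:
            return False
    return True

def smallest_circle(node, constraints, variables):
    """Depth-first search for the deepest node that passes the test."""
    if node is None or not contains_all(node, constraints, variables):
        return None
    for child in (node.left, node.right):
        found = smallest_circle(child, constraints, variables)
        if found is not None:
            return found
    return node

# Hypothetical three-node index.
root = Node(
    "Norfolk", 4, {"Italian": 2, "Norfolk": 0},
    left=Node("Grivanti", 3, {"Italian": 1}),
    right=Node("NE/USA", 1, {"Norfolk": 1, "Italian": 3}),
)
constraints = {
    ("?v1", "Italian"): 1,
    ("?v2", "Norfolk"): 1,
    ("?v2", "Italian"): 1,
    ("?v3", "Italian"): 2,
}
found = smallest_circle(root, constraints, ["?v1", "?v2", "?v3"])
```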

Query evaluation

Evaluating constraints • Constraints: • d(?v1, Italian) ≤ 1, d(?v2, Norfolk) ≤ 1, d(?v3, Italian) ≤ 2 • Question: is ?v1 in the circle (Grivanti, 3)? • By the triangle inequality, d(Grivanti, ?v1) ≤ d(Grivanti, Italian) + d(?v1, Italian) ≤ 1 + 1 = 2 • Since 2 ≤ 3, ?v1 must be in the circle (Grivanti, 3)

Evaluating constraints • Question: is ?v3 in (Grivanti, 3)? • d(Grivanti, ?v3) ≤ d(Grivanti, Italian) + d(Italian, ?v3) ≤ 1 + 2 = 3 • ?v3 must be in (Grivanti, 3) • Similarly, ?v2 is in the same circle
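The triangle-inequality argument used for ?v1 and ?v3 can be written as a small helper. The function name `upper_bound` is hypothetical, and the constraint d(?v2, Italian) ≤ 1 is not listed on the slides; it is assumed here because the query contains the pattern (?v2 cuisine Italian). Only d(Grivanti, Italian) = 1 is taken from the slides.

```python
def upper_bound(center_dist, constraints, var):
    """Best available bound on d(center, var) via any known constant.

    center_dist maps constants to d(center, constant); constraints maps
    (variable, constant) to an upper bound on d(variable, constant).
    """
    bounds = [center_dist[c] + k
              for (v, c), k in constraints.items()
              if v == var and c in center_dist]
    return min(bounds) if bounds else None

center_dist = {"Italian": 1}          # d(Grivanti, Italian) = 1
constraints = {
    ("?v1", "Italian"): 1,
    ("?v2", "Norfolk"): 1,
    ("?v2", "Italian"): 1,            # assumed, see the note above
    ("?v3", "Italian"): 2,
}
radius = 3
bounds = {v: upper_bound(center_dist, constraints, v)
          for v in ("?v1", "?v2", "?v3")}
all_inside = all(b is not None and b <= radius for b in bounds.values())
```

With a radius of 2 instead of 3, the bound for ?v3 would exceed the radius, so that smaller circle would not be guaranteed to contain the answer.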

Subgraph matching • Perform subgraph matching on the resources in the circles guaranteed to contain an answer • Algorithm by Cordella et al., IEEE PAMI 26(10), 2004 • Worst-case time complexity of O(N!) • Where N is the maximum number of nodes in either graph • In practice, GRIN makes N very small
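For illustration only, the matching step can be approximated with a plain backtracking matcher over triple patterns. This is not the Cordella et al. algorithm cited above, and it allows two variables to bind to the same resource; the function name `match` and the sample triples are assumptions. It does show why a small N matters: the search tries every candidate triple for every pattern.

```python
# Hypothetical sample data standing in for the resources of one circle.
TRIPLES = [
    ("Grivanti", "cuisine", "Italian"),
    ("Grivanti", "attire", "Casual"),
    ("Vivace", "cuisine", "Italian"),
    ("Vivace", "attire", "Casual"),
    ("Vivace", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

QUERY = [
    ("?v1", "attire", "?v3"),
    ("?v1", "cuisine", "Italian"),
    ("?v2", "attire", "?v3"),
    ("?v2", "cuisine", "Italian"),
    ("?v2", "location", "Norfolk"),
    ("Norfolk", "locatedIn", "NE/USA"),
]

def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    (s, p, o), rest = patterns[0], patterns[1:]
    for ts, tp, to in triples:
        if tp != p:
            continue
        new = dict(binding)
        ok = True
        for term, value in ((s, ts), (o, to)):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)

answers = sorted((b["?v1"], b["?v2"], b["?v3"]) for b in match(QUERY, TRIPLES))
```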

Experimental framework • Comparison between GRIN, Sesame, Jena2 and RDFBroker (in-memory) • Index build time • Memory consumption at query time • Query time • Two real-world datasets: • TAP (Stanford): datasets between 1.5 MB and 300 MB • ChefMoz (chefmoz.org): 220 MB

Index build time

Memory consumption

Query time

Average degree of a query node

Conclusions • Method for indexing large RDF graphs adapted to the characteristics of RDF queries • Avoids expensive join operations • Gives better query times than Jena2, Sesame and RDFBroker • Current and future work: • Disk-based index • Analysis of overlap and coverage
