SPARQ2L : Towards Supporting Subgraph Extraction Queries in RDF Databases

Kemafor Anyanwu¹, Angela Maduko¹, Amit Sheth² • ¹LSDIS Lab, University of Georgia • ²Kno.e.sis Center, Wright State University

Subgraph Extraction Queries • What are they? • Given a set X of data graph elements (nodes/edges), extract a subgraph connecting X • Why do we need them? • Useful for “connecting the dots” applications • “… Everything's connected, all along the line. Cause and effect. That's the beauty of it. Our job is to trace the connections and reveal them.” (from Terry Gilliam’s 1985 film “Brazil”)

Example Application Scenarios • “Retrieve the interaction network for all genes known to be differentially regulated in advanced stage epithelial ovarian cancer” • Other example scenarios in • E-science • Homeland Security • Legal domain

Issue 1 – Query Expression • Queries need to be able to describe subgraphs of arbitrary or unknown structure • But current query languages primarily support a “pattern matching” paradigm • Query graph patterns have a “known structure” • A few exceptions (PSPARQL, Versa, RxPath, SPARQLer) support matching of graph traversals, but they: • Have no subgraph or path variables • Have no subgraph or path constraint expressions

Issue 2 – Query Evaluation • We need to support efficient query evaluation in persistent databases • Performing traversal algorithms on arbitrarily stored graphs is unlikely to be efficient • For triple store architectures, queries will often require multiple multi-way join queries • Schema-agnostic  #joins • Schema-aware  #joins

Our Approach • SPARQ2L: Extends SPARQL with path variables and path constraint expressions • Good foundation for describing subgraphs • Propose a storage and query processing model for efficient evaluation of disk resident graphs

Syntax and Semantics of SPARQ2L

Syntax for SPARQ2L • Extends Pérez et al., ISWC’06 • Let T = RDF terms, i.e. IRIs, literals and blank nodes • A term variable ?x ranges over T • A path variable ??p ranges over 2^T • A triple pattern is a triple with a term variable, e.g. (?x, email, ?y) • A tp-pattern is a triple with a path variable, e.g. (x, ??p, y)

Expressing Constraints on Paths • Paths must contain a specific node type • e.g. from an MTB surface molecule via a Phosphoinositide 3-Kinase enzyme to a cellular response event • Path lengths are bounded • e.g. close connections (fewer than 4 hops) between SalesPersonA and CIO-Y • Supported by a few systems • Paths must contain a specific pattern • e.g. paths from authorX to reviewerY that involve a “knows  coauthor” pattern

PATHFILTER in SPARQ2L • PATHFILTER uses built-in path functions • containsAny(listofresources), containsAll(listofresources), containsPattern(expression) • Example • (X, ??p, Y) PATHFILTER (containsAny(??p, P3K)) • SPARQ2L supports extended regular expressions over triples/triple patterns • Variables on nodes and edges • (X, ??p, Y) PATHFILTER (containsPattern(??p, ([a, ·] foaf:knows [·, ?b])+))

Semantics of SPARQ2L Triple Patterns • Let V_T and V_P be the sets of term and path variables, respectively • A mapping ω : V_T ∪ V_P → 2^T is such that for x ∈ V_T, ω(x) = t ∈ 2^T and |t| = 1 • Triple pattern TP: (?X, course_title, ?Y) • Data graph: (C1, course_title, “Semantic Web”), (U1, offers, C1), (C3, course_title, “Databases”), (U2, offers, C2), (C3, taught_by, P1), (S1A1, enrolled_in, C3) • The evaluation of TP is the set of mappings that cause TP to match the graph

Semantics of SPARQ2L TP-patterns • Otherwise, for x ∈ V_P, ω(x) = t ∈ 2^T and |t| > 1 • Triple path pattern TPP: (S1A1, ??P, P1) • Data graph: (C1, course_title, “Semantic Web”), (U1, offers, C1), (C3, course_title, “Databases”), (U2, offers, C2), (S1A1, enrolled_in, C3), (C3, taught_by, P1) • The evaluation of TPP is the set of mappings that cause TPP to “match” a path in the graph
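
The TP-pattern example above can be sketched in Python as a plain depth-first enumeration of simple paths. This is only an illustration of the semantics, not the storage-based evaluation proposed later in the talk. The function name `eval_path_pattern` and the choice to represent a path as a list of triples (rather than as a set of terms) are assumptions made for the example.

```python
from collections import defaultdict

# Triples of the data graph shown on the slide.
TRIPLES = [
    ("C1", "course_title", "Semantic Web"),
    ("U1", "offers", "C1"),
    ("C3", "course_title", "Databases"),
    ("U2", "offers", "C2"),
    ("S1A1", "enrolled_in", "C3"),
    ("C3", "taught_by", "P1"),
]

def eval_path_pattern(triples, source, target):
    """Return every simple directed path from source to target.

    Hypothetical helper: each path is a list of (s, p, o) triples.
    """
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((s, p, o))

    results = []

    def dfs(node, path, visited):
        if node == target and path:
            results.append(list(path))
            return
        for triple in out_edges[node]:
            nxt = triple[2]
            if nxt in visited:      # keep paths simple (no repeated node)
                continue
            path.append(triple)
            visited.add(nxt)
            dfs(nxt, path, visited)
            visited.remove(nxt)
            path.pop()

    dfs(source, [], {source})
    return results

# Evaluate the pattern (S1A1, ??P, P1) against the example graph.
paths = eval_path_pattern(TRIPLES, "S1A1", "P1")
```

Here `paths` holds one binding for ??P: the two-triple path through C3.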

Semantics of PATHFILTER • A mapping ω satisfies the condition F (written ω ⊨ F) as follows: • if F is containsAny(??P, L’), then L’ ∩ ω(??P) ≠ ∅ • if F is containsAll(??P, L’), then L’ ⊆ ω(??P) • if F is containsPattern(??P, tr), then ground(tr) is a subpath of ω(??P) • if F is (¬F1), then ω(??P) ⊭ F1 • if F is (F1 ∧ F2), then ω(??P) ⊨ F1 and ω(??P) ⊨ F2 • The evaluation of (TPP PATHFILTER F) is the set of mappings that “satisfy” F
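
The filter conditions above can be read as predicates over a bound path. The following Python sketch assumes a path is a list of (s, p, o) triples and that the resources of a path include both its nodes and its edge labels; all function names are hypothetical, and `contains_pattern` only handles patterns that are already ground (no variable binding is attempted).

```python
def resources(path):
    """All nodes and edge labels occurring on a path (assumed reading of ω(??P))."""
    found = set()
    for s, p, o in path:
        found.update((s, p, o))
    return found

def contains_any(path, wanted):
    # L' ∩ ω(??P) ≠ ∅
    return bool(set(wanted) & resources(path))

def contains_all(path, wanted):
    # L' ⊆ ω(??P)
    return set(wanted) <= resources(path)

def contains_pattern(path, ground_pattern):
    # ground(tr) occurs as a contiguous subpath of ω(??P)
    n, m = len(path), len(ground_pattern)
    return any(path[i:i + m] == ground_pattern for i in range(n - m + 1))

def negate(f):
    # ¬F1
    return lambda path: not f(path)

def conj(f1, f2):
    # F1 ∧ F2
    return lambda path: f1(path) and f2(path)

example_path = [("S1A1", "enrolled_in", "C3"), ("C3", "taught_by", "P1")]
via_c3_not_c1 = conj(
    lambda p: contains_any(p, ["C3"]),
    negate(lambda p: contains_any(p, ["C1"])),
)
```

Evaluating (TPP PATHFILTER F) then amounts to keeping only the candidate paths for which the predicate built from F returns true.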

Evaluating Path Extraction Queries on Persistent RDF Databases

Storage Model Requirements • All queries should be answerable in a single scan • Good performance for different classes of queries • Allow precomputed partial path information • Indexing • Based on mainstream indexing structures • Clustering techniques for “related” path information • Compact representation of path information

Foundation for Approach • Given a directed graph G = (V, E) • A P-expression of type (u, v), written (P, u, v), is a regular expression P over E such that every string s ∈ L(P) represents a path from u to v • Example: assume E = {(u, p1, w), (u, p2, w), (w, p3, v)}; then ((u, p1, w) ∪ (u, p2, w)) · (w, p3, v) is a p-expression of type (u, v) • The Path Sequence for a graph G is a sequence (P1, s1, d1), (P2, s2, d2), (P3, s3, d3), …, (Pf, sf, df), …, (Pg, sg, dg), …, (Pl, sl, dl) such that any non-empty path p in G can be split as p = p1, p2, …, pk, with the subpaths matched by p-expressions at increasing positions in the sequence, e.g. 2 < f < g < l • Source: Tarjan, “Fast Algorithms for Solving Path Problems”, JACM 1981

Path Sequence PS for G • [Figure: example graph G with vertices 1-6 and edges labeled a-h] • Corresponds to LU decomposition of a graph’s matrix • u < v: paths from u to v with all intermediate vertices w < u • u ≥ v: paths from u to v with no intermediate vertex w > v • PS = p-expressions with u ≤ v in increasing order of u, followed by p-expressions with u > v in decreasing order of u

Solving (2, 6) • [Figure: example graph G with vertices 1-6 and edges labeled a-h] • (2, 2) = … (2, 2) … (1, 2) = (2, 2) = … • (2, 2) … (1, 3) = (2, 3) = … • (2, 2) = (2, 2) · (2, 2) = (a c)* • (2, 3) = (2, 3) ∪ (2, 2) · (2, 3) = ((a c)* a d) • … • (2, 6) = (2, 6) ∪ (2, 3) · (3, 6) = (a c)* a d h • … • Cost: O(path sequence length)
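
The single scan over the path sequence can be sketched as follows, in the style of the SOLVE procedure from Tarjan's paper. This is a minimal illustration, not the authors' implementation: expressions are kept as plain strings, and the path sequence `PS` below is an assumed fragment built from a guess at the figure's edges (a: 2→1, c: 1→2, d: 1→3, h: 3→6), since the figure itself is not available.

```python
EMPTY = None   # no path known yet (the empty set)
LAMBDA = ""    # the empty path

def concat(x, y):
    """Concatenate two expressions; concatenating with the empty set yields the empty set."""
    if x is EMPTY or y is EMPTY:
        return EMPTY
    return " ".join(part for part in (x, y) if part)

def union(x, y):
    """Union of two expressions, simplified when one side is the empty set."""
    if x is EMPTY:
        return y
    if y is EMPTY or x == y:
        return x
    return f"({x} | {y})"

def solve(path_sequence, source):
    """One pass over the path sequence; reach[v] describes all paths source -> v."""
    reach = {source: LAMBDA}
    for expr, u, v in path_sequence:
        from_source = reach.get(u, EMPTY)
        if u == v:
            reach[u] = concat(from_source, expr)
        else:
            reach[v] = union(reach.get(v, EMPTY), concat(from_source, expr))
    return reach

# Assumed path sequence: entries with u <= v in increasing u, then u > v.
PS = [
    ("c", 1, 2),
    ("d", 1, 3),
    ("(a c)*", 2, 2),
    ("a d", 2, 3),
    ("h", 3, 6),
    ("a", 2, 1),
]

result = solve(PS, 2)
```

Each entry is touched exactly once, which is where the O(path sequence length) bound on the slide comes from; `result[6]` holds the expression for the query (2, 6).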

Indexing a Path Sequence • Use a B+tree: query answering becomes an extended range query • Cluster based on the notions of prunability and prunability equivalence • Let Q = (s, d) be a query and PS be the path sequence for a data graph G • A p-expression pe is said to be prunable from PS if Q can be solved using PS - pe • Two p-expressions pe1, pe2 are prunable equivalent with respect to Q if determining the prunability of pe1 leads us to conclude the prunability of pe2

Labeling a Path Sequence • [Figure: example RDF data graph with nodes 1-20 and edge labels including course_in, advises, enrolled_in, has_subject_area, author_of, editor_of, required_text, related_to_project, taught_by, current_project and project_in] • P-expressions for SCCs are prunable equivalent • The same holds for edges connecting SCCs • Such p-expressions may be assigned the same key values

[Figure: the example RDF data graph with its dangling trees and disconnected subgraphs marked] • Dangling Trees • Disconnected • The interval of tree subgraph identifiers (sids) is disjoint from that of non-tree sids

Tree-induced Prunability Equivalence • Refine the partitioning of nodes and edges in non-tree subgraphs using an optimal spanning tree (OST) • An OST selects edges that lie on the longest path from the root to a node • [Figure: example subgraph with edges advises, teaches, enrolled_in, author_of, course_in and has_subject_area, shown with its OST levels] • Only nodes and edges at level j < i can reach a node at level i
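
One way to obtain such levels is a longest-path computation over an acyclic graph in topological order. The sketch below assumes the strongly connected components have already been collapsed, so the input is a DAG with a single root; the function name and the small edge list are hypothetical and are not taken from the figure.

```python
from collections import defaultdict, deque

def longest_path_levels(edges, root):
    """Return (level, parent): level[v] is 1 + the number of edges on the
    longest path from root to v; parent[v] is v's predecessor on that path,
    i.e. the edge (parent[v], v) is the one kept in the spanning tree."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = {root}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    level = {root: 1}
    parent = {}
    # Kahn-style topological traversal: a node is processed only after
    # all of its predecessors, so its level is final when it is dequeued.
    queue = deque(n for n in nodes if indeg[n] == 0)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            if u in level and level[u] + 1 > level.get(v, 0):
                level[v] = level[u] + 1
                parent[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return level, parent

# Hypothetical DAG: node 3 is reachable directly from 1 and also via 2.
levels, tree_parent = longest_path_levels([(1, 2), (1, 3), (2, 3), (3, 4)], root=1)
```

Because node 3 takes its level from the longer route through node 2, every edge in the graph goes from a lower level to a strictly higher one, which is the property the pruning rule relies on.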

2-Color Code • Label each SCC with three identifiers: a subgraph identifier s, an OST-level identifier l and a preorder identifier t • The 2-Color code is a sequence of key-value pairs: • for scc_i: [((s, l, t)_i, (s, l, t)_i), PS_i] • for scc_x, scc_y connected by edges e1, e2, …, ek: [((s, l, t)_x, (s, l, t)_y), {pe_e1, pe_e2, …, pe_ek}] • The 2-Color code preserves path sequence ordering

2-Color Code Properties • Order Property, for G’N / G’T: • for u in G’N and v in G’T, label(u) precedes label(v) • e = (u, v) ⇒ label(u) precedes label(e) • NonReachability Property, for labels (s_u, l_u, t_u) and (s_v, l_v, t_v): • s_u ≠ s_v ⇒ result is empty • l_u ≥ l_v ⇒ result is empty • for a query (u, v) with levels i, j, any node w with level k such that k < i or k > j is prunable
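
The level-based pruning rule can be sketched as a simple filter over labeled entries. This is a simplified illustration under stated assumptions: each entry carries a single (s, l, t) label rather than the key pair used in the 2-Color code, the entries and function names are hypothetical, and only the subgraph-identifier rule and the "k < i or k > j" rule are applied.

```python
def is_prunable(entry_level, source_level, target_level):
    """A node at level k is prunable for a query with levels i, j if k < i or k > j."""
    return entry_level < source_level or entry_level > target_level

def candidate_entries(entries, source_label, target_label):
    """Keep only entries that may contribute to a query between two labeled nodes.

    entries: list of ((s, l, t), payload) pairs (simplified, hypothetical layout).
    """
    s_src, l_src, _ = source_label
    s_dst, l_dst, _ = target_label
    if s_src != s_dst:
        # Different subgraph identifiers: the result is empty.
        return []
    return [
        (label, payload)
        for label, payload in entries
        if not is_prunable(label[1], l_src, l_dst)
    ]

# Hypothetical entries within one subgraph (s = 2), at OST levels 1..4.
ENTRIES = [
    ((2, 1, 1), "A"),
    ((2, 2, 2), "B"),
    ((2, 3, 3), "C"),
    ((2, 4, 4), "D"),
]

kept = candidate_entries(ENTRIES, (2, 2, 2), (2, 3, 3))
```

Since the labels are ordered, the surviving entries form a contiguous key range in this toy example, which is what makes a B+tree range scan a natural fit.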

1: [ (1,1,1), (1,1,1), { } ],
2: [ (1,1,1), (1,2,2), { (advises, 1, 2) } ],
……………………….
11: [ (2,1,1), (2,1,1), { } ],
12: [ (2,1,1), (2,2,2), { (advises, 6, 12) } ],
……………………….
16: [ (2,1,6), (2,1,6), { (enrolled_in, 8, 10), (advises · enrolled_in, 9, 10), ((taught_by · advises · enrolled_in)*, 10, 10), (taught_by, 10, 9), (advises, 9, 8) } ],
20: ……………………….
31: [ (3,1,1), (3,1,1), { } ],
32: [ (3,2,1), (3,1,1), { (project_in, 17, 18) } ],
33: [ (3,2,2), (3,2,2), { } ],
34: [ (4,1,1), (2,9,1), { (current_project, 11, 19) } ],
35: [ (4,1,1), (4,1,1), { } ],
36: [ (4,2,1), (4,1,1), { (project_in, 19, 20) } ],
37: [ (4,2,2), (4,2,2), { } ],
38: [ (5,1,1), (2,9,1), { (advises, 11, 21) } ],
39: [ (5,1,1), (5,1,1), { } ],

Cost of 2-Color Code Construction • Find strong components of G: O(n + m) • Find roots of dangling trees: O(n + m) • Find optimal spanning tree: O(n + m) • Find PS for each strong component i in increasing order of level in the OST: O(n_i^3) • Sample execution time in our current (Java, multiphase) implementation • 120K/360K   10 mins
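
The first construction step, finding strong components in O(n + m), can be sketched with Tarjan's strongly connected components algorithm. The slides do not say which SCC algorithm the Java implementation uses, so this is an assumed stand-in; the small graph reuses the cycle among nodes 8, 9 and 10 and the edge from 11 to 19 that appear in the 2-Color code listing.

```python
def strongly_connected_components(graph):
    """graph: dict mapping node -> list of successors. Returns a list of SCCs (sets).

    Recursive for brevity; a large graph would need an iterative version.
    """
    index = {}
    lowlink = {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of a component: pop the stack down to v.
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in list(graph):
        if v not in index:
            visit(v)
    return components

# (enrolled_in, 8, 10), (taught_by, 10, 9), (advises, 9, 8), (current_project, 11, 19)
GRAPH = {8: [10], 10: [9], 9: [8], 11: [19], 19: []}
sccs = strongly_connected_components(GRAPH)
```

Nodes 8, 9 and 10 come out as one component, matching the single key that entry 16 of the listing assigns to their p-expressions.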

Evaluation • Java 1.5, 1.8GHz Dual AMD Opteron processor with 10GB available • Colt sparse matrix distribution • Berkeley DB Java Edition • Datasets • Queries • 6 query classes • 40 × 6 queries (+, -)

Positive Queries

Negative Queries

Ongoing and Future Work • Support for the containsPattern predicate • Supporting Subgraph Extraction Queries • Undirected path variables • More advanced indexing techniques • Enable skipping of path sequence regions • Optimization of pattern matching queries

Thank you!! More at SemDis project pages: SemDis@Kno.e.sis, SemDis@LSDIS, SemDis@ebiquity
