A(k)-index: Exploiting Local Similarity to Index Paths in Graph Data • Raghav Kaushik (UW) • Pradeep Shenoy (UWash) • Philip Bohannon (Bell Labs) • Ehud Gudes (BGU)
Outline • Problem statement • Prior work and limitations • Background • A(k)-index • Query Evaluation • Preliminary experiments • Update • Conclusions
Data Model • Rooted, node-labeled graph with unique root; root has unique label • Nodes - objects • Arcs - object-subobject relationship • In XML context • Index tag structure • No distinction between elements and attributes • No distinction between tree and idref arcs • Order ignored
Problem Statement • Practical indexing schemes for large graph data (like XML data) (100K - 1M nodes) • Size ~10% of database size • Efficient construction and update • Tunable to a workload • Queries of the form R x, where R is a regular path expression • Schemaless data
Flavor of Approach • Different from traditional value indices • Structural summaries for indexing paths • Both data and index are rooted graphs • Example: Dataguide
Index Graph • Structural summary • Associate a set of data nodes with each index node, called its extent • Preserve data paths in index graph
Example index graph • [Figure: a data graph with nodes 0 to 6 next to its index graph, whose nodes have extents {0}, {1}, {2}, {3,4}, {5,6}]
Index Graph (cont’d) • Can be constructed from any partition of the data nodes • One index node for every equivalence class C • Edge from C to C’ if there exists a data edge v → v’ with v in C and v’ in C’ • Preserves data paths, so no false drops • Our structures are all index graphs
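The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_index_graph`, the edge list, and the block names are all hypothetical.

```python
from collections import defaultdict

def build_index_graph(edges, block_of):
    """Build an index graph from a partition of the data nodes.

    edges    -- iterable of (v, w) data edges (hypothetical input format)
    block_of -- dict mapping each data node to its equivalence class id
    """
    # The extent of an index node is the set of data nodes in its class.
    extents = defaultdict(set)
    for node, block in block_of.items():
        extents[block].add(node)
    # Add an index edge C -> C' whenever some data edge v -> w has
    # v in C and w in C'.
    index_edges = {(block_of[v], block_of[w]) for v, w in edges}
    return dict(extents), index_edges

# Hypothetical data graph and partition, chosen only for illustration.
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
block_of = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
extents, index_edges = build_index_graph(edges, block_of)
```

Because every data edge is mapped to an index edge, every data path has a corresponding index path, which is the path-preservation property the slide relies on.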
Prior Schemes • Dataguide [Goldman, Widom 1997] • Deterministic automaton corresponding to data graph • Each set of data nodes that can be distinguished by a path query is summarized by a single node in the index • Can be exponential in size!
Prior Schemes (cont’d) • 1-index [Milo, Suciu 1999] • NFA rather than DFA (smaller) • Split graph nodes into equivalence classes based on incoming paths from the root • Computing the best split is PSPACE-complete • Go for refinements (approximations) • similarity • bisimilarity
Limitations of Prior Work • Size • Dataguide sizes subject to exponential blow-up • 1-index size can be big too! • Update • No known update algorithm for 1-index • Designed to answer queries involving arbitrarily complex paths, but... • such paths may never show up in queries
Local Similarity • [Figure: example data graph rooted at ROOT; node labels include metro, cultural, neighborhoods, business, museum, hotel, nhd., nearby, attr., cult.]
Main Contributions • New family of approximate index structures • Applicable to • Approximate Schema • Statistics • Query evaluation using approximate indexes • Preliminary performance study • Update algorithms
Approximate Indexes • Motivation: • Smaller • More efficient query processing • Limited update cost - maintain local information • Approximate dataguide [Goldman et al.] • path merging, object matching, etc. • no formal basis (but different goal) • no study of effect on query processing
Outline • Problem statement • Prior work and limitations • Background • A(k)-index • Query Evaluation • Preliminary experiments • Update • Conclusions
Graph Bisimulation • A bisimulation is a symmetric relation R between nodes • If A1 R A2 then • A1 and A2 have the same labels • and ...
Graph Bisimulation (cont’d) • [Figure: for every edge B1 → A1 with A1 R A2, there is an edge B2 → A2 with B1 R B2] • and vice-versa!
Bisimilarity • Two nodes a and b are bisimilar if they are related in some bisimulation • 1-index is index graph constructed from bisimulation partition • Simulation partition: similar
Bisimulation on example • [Figure: bisimulation partition of the example data graph rooted at ROOT; node labels include metro, cultural, neighborhoods, business, museum, hotel, nhd., nearby, attr., cult.]
k-bisimulation • Nodes A1 and A2 are 0-bisimilar iff they have the same label • A1 and A2 are k-bisimilar iff • they are (k-1)-bisimilar, and • for every edge (B1, A1) there exists an edge (B2, A2) such that B1 and B2 are (k-1)-bisimilar, and vice versa
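The recursive definition above translates into k rounds of partition refinement. The following is a minimal sketch under stated assumptions: the function name `k_bisimulation`, the input format, and the example graph are hypothetical, and the code favors clarity over the efficiency a real index builder would need.

```python
def k_bisimulation(labels, edges, k):
    """Return the k-bisimulation classes as a sorted list of sorted lists.

    labels -- dict node -> label
    edges  -- iterable of (parent, child) pairs
    """
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)
    # Round 0: two nodes share a block iff they have the same label.
    ids = {}
    block = {v: ids.setdefault(labels[v], len(ids)) for v in labels}
    for _ in range(k):
        # Nodes stay together iff they were together in the previous
        # round and their parents cover the same previous-round blocks.
        sig = {v: (block[v], frozenset(block[p] for p in parents[v]))
               for v in labels}
        ids = {}
        block = {v: ids.setdefault(sig[v], len(ids)) for v in labels}
    classes = {}
    for v, b in block.items():
        classes.setdefault(b, []).append(v)
    return sorted(sorted(c) for c in classes.values())

# Hypothetical data graph: two branches that differ only near the root.
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
p0 = k_bisimulation(labels, edges, 0)
p1 = k_bisimulation(labels, edges, 1)
p2 = k_bisimulation(labels, edges, 2)
```

On this graph, nodes 3 and 4 are separated at k = 1 because their parents carry different labels, while nodes 5 and 6 are separated only at k = 2.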
Example for k-bisimulation • [Figure: three panels labeled Data graph, 0-bisimulation, and 1-bisimulation for a data graph with nodes 0 to 6; the partitions shown group nodes 3,4 and 5,6]
A(2) for example • [Figure: A(2)-index of the example data graph rooted at ROOT; node labels include metro, cultural, neighborhoods, business, museum, hotel, nhd., nearby, attr., cult.]
Properties • If a and b are bisimilar • the sets of incoming paths into them are the same • If a and b are k-similar or k-bisimilar • the sets of incoming paths of length <= k are the same • If k-bisim = (k+1)-bisim then k-bisim = bisim • Size: never larger than the bisimulation (1-index) graph
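The fixpoint property (if one more round of refinement changes nothing, the partition is the full bisimulation) suggests a simple way to compute bisimilarity: refine until stable. The sketch below assumes hypothetical names (`refine`, `stabilize`) and a hypothetical example graph; it illustrates the property and is not a tuned algorithm.

```python
def refine(block, parents):
    """One refinement round: split blocks by the blocks of their parents."""
    sig = {v: (block[v], frozenset(block[p] for p in parents[v]))
           for v in block}
    ids = {}
    return {v: ids.setdefault(sig[v], len(ids)) for v in block}

def stabilize(labels, edges):
    """Refine from the label partition until nothing splits.

    Returns (k, block), where k is the first round at which the
    k-bisimulation equals the (k+1)-bisimulation.
    """
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)
    ids = {}
    block = {v: ids.setdefault(labels[v], len(ids)) for v in labels}
    k = 0
    while True:
        nxt = refine(block, parents)
        # Refinement only ever splits blocks, so an unchanged block
        # count means the partition itself is unchanged.
        if len(set(nxt.values())) == len(set(block.values())):
            return k, block
        block, k = nxt, k + 1

# Hypothetical data graph, for illustration only.
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
k_stable, block = stabilize(labels, edges)
```

For this graph the partition stops changing at k = 2, so an A(2)-index here would already coincide with the 1-index.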
Query Evaluation • Only queries studied are regular path queries of the form R x • Query Evaluation Approach: • Create automaton for regexp query • Run automaton on the index graph • Result is union of extents belonging to index nodes accepted by automaton
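The three steps above can be sketched as a traversal of the product of the query automaton and the index graph. Everything in this block is an assumption made for illustration: the function name `evaluate`, the NFA encoding as a dict from (state, label) to a set of states, the "_" wildcard convention, and the example index.

```python
def evaluate(index_nodes, index_edges, root, nfa, start, accepting):
    """Run an NFA over the index graph and union the accepted extents.

    index_nodes -- dict id -> (label, extent)
    nfa         -- dict (state, label) -> set of next states;
                   the label "_" is treated as a wildcard (assumption)
    """
    children = {}
    for a, b in index_edges:
        children.setdefault(a, set()).add(b)

    def step(state, label):
        return nfa.get((state, label), set()) | nfa.get((state, "_"), set())

    result, seen = set(), set()
    # The automaton consumes the label of each node on the path,
    # starting with the root's own label.
    stack = [(root, s) for s in step(start, index_nodes[root][0])]
    while stack:
        node, state = stack.pop()
        if (node, state) in seen:
            continue
        seen.add((node, state))
        if state in accepting:
            result |= index_nodes[node][1]
        for child in children.get(node, ()):
            for nxt in step(state, index_nodes[child][0]):
                stack.append((child, nxt))
    return result

# Hypothetical index graph in which data nodes 3 and 4 share an index node.
index_nodes = {"r": ("root", {0}), "a": ("a", {1}), "b": ("b", {2}),
               "c": ("c", {3, 4}), "d": ("d", {5, 6})}
index_edges = {("r", "a"), ("r", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
# NFA for the path expression root.a.c
nfa = {(0, "root"): {1}, (1, "a"): {2}, (2, "c"): {3}}
answer = evaluate(index_nodes, index_edges, "r", nfa, 0, {3})
```

The answer is the whole extent of the accepted index node, so the cost is driven by the size of the index graph rather than the data graph.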
Example Query Evaluation • [Figure: panels labeled Automaton, Graph, and Index Graph; the index graph has nodes 0, 1, 2, {3,4}, {5,6}]
Approximate Indexes • Caveat: False positives possible • Approach: verify each node on data graph by running reverse automaton • Prohibitive cost? • Then why use approx. indices? • In fact, frequently more efficient than data graph or precise index
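One way to picture the verification step is a backward search from a candidate node toward the root, following the automaton's transitions in reverse. This is a sketch under assumptions: the name `verify`, the NFA encoding, and the example data are hypothetical, and wildcards are left out to keep the block short.

```python
def verify(node, labels, parents, root, nfa, start, accepting):
    """Check on the data graph whether some root-to-node path is accepted.

    nfa -- dict (state, label) -> set of next states (no wildcards here)
    """
    # Reverse the transitions: (target state, label) -> source states.
    rev = {}
    for (s, label), targets in nfa.items():
        for t in targets:
            rev.setdefault((t, label), set()).add(s)
    # A pair (v, t) means: the automaton is in state t right after
    # consuming the label of v.
    stack = [(node, t) for t in accepting]
    seen = set()
    while stack:
        v, t = stack.pop()
        if (v, t) in seen:
            continue
        seen.add((v, t))
        for s in rev.get((t, labels[v]), ()):
            if v == root and s == start:
                return True
            stack.extend((p, s) for p in parents.get(v, ()))
    return False

# Hypothetical data graph; the query is the path expression root.a.c
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c"}
parents = {1: [0], 2: [0], 3: [1], 4: [2]}
nfa = {(0, "root"): {1}, (1, "a"): {2}, (2, "c"): {3}}
ok_3 = verify(3, labels, parents, 0, nfa, 0, {3})
ok_4 = verify(4, labels, parents, 0, nfa, 0, {3})
```

If a coarse index had returned both nodes 3 and 4 as candidates, this check would keep node 3 and discard node 4 as a false positive.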
Improving Validation • First cut: Keep track of accepting-path-length • for accepted nodes with path length <= k, verification not required • Second step: Share traversals among verification calls • mark node-state pairs on a successful verification path as accept • similar marking for failed path
Improving Validation (cont’d) • Third Step: Avoid needless verification • Example: For _*.R queries, no need to verify all the way up to the root • Generalize the above!
Outline • Problem statement • Prior work and limitations • Background • A(k)-index • Query Evaluation • Preliminary experiments • Update • Conclusions
Preliminary Experiments • Data used: Internet Movie Database (http://www.imdb.com) • 250,000 movies & TV shows • 460,000 actors, etc. • XML version = ~1GB • We used subsets of this database ranging from 200 - 2000 movies • Whole database --> future work!
Preliminary Experiments • Second source: Open Directory Project (http://www.dmoz.org) • Entire source available in RDF format • Subsets: (entire subtree under a topic, say shopping)
Storage Model • Results independent of any particular storage model • In-memory rooted graph • Performance metrics are abstract • Cost = total number of nodes visited (graph + index)
Bisimulation Sizes • IMDB #Nodes: 190,000 • ODP #Nodes: 143,000
Query Evaluation Plans • 1. Forward eval • 2. Backward eval (assume a label index)
A(k)-index Update • Edge added from u to v • A(0)-index -> no change except possible addition of edge • A(1)-index -> index node containing v may change • determined by set of labels in v’s parents
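For the A(1) case, the class of a node is fixed by its own label together with the set of labels of its parents, so an edge insertion only requires recomputing that key for v. The sketch below is illustrative only: `a1_key`, `add_edge`, and the example data are hypothetical, and maintenance of the index edges is omitted.

```python
def a1_key(v, labels, parents):
    """A(1) class of v: its label plus the set of its parents' labels."""
    return (labels[v], frozenset(labels[p] for p in parents[v]))

def add_edge(u, v, labels, parents, extents):
    """Insert data edge u -> v and move v to its new A(1) class if needed.

    Returns True when v changed class. Index edges are not maintained
    in this sketch.
    """
    old = a1_key(v, labels, parents)
    parents[v].add(u)
    new = a1_key(v, labels, parents)
    if new == old:
        return False
    extents[old].discard(v)
    if not extents[old]:
        del extents[old]
    extents.setdefault(new, set()).add(v)
    return True

# Hypothetical data graph and its initial A(1) extents.
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c"}
parents = {0: set(), 1: {0}, 2: {0}, 3: {1}, 4: {2}}
extents = {}
for node in labels:
    extents.setdefault(a1_key(node, labels, parents), set()).add(node)

moved = add_edge(1, 4, labels, parents, extents)      # 4 gains an "a" parent
unchanged = add_edge(0, 1, labels, parents, extents)  # edge already present
```

Only v itself is re-examined here, which matches the slide's point that the A(1) update depends on local information.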
A(k)-index Update (cont’d) • A(k)-index • only nodes to be considered are those at distance < k from v • Maintain tree of splits • Work iteratively: • find new A(1) position of v • find new A(2) positions of v and its children • …
Updating the 1-index • One way is generalization of A(k) update • R - any binary relation on the nodes that is • reflexive • transitively closed. • A refinement of R is any subset that is • reflexive • transitively closed
Refinement • B - bisimulation relation • B’ - any refinement of B • B(G) - index graph built using B • B’(G) - index graph built using B’
Theorem • Theorem: B(B’(G)) = B(G) • Intuition: • Similar nodes behave similarly • So, fuse them together!
Lazy Update • Basic Idea: • G → G’, and meanwhile B(G) → B(G’) • Instead, “relax” the graph B(G) to B’(G’) • How? • A “stable” partitioning of G is either B(G) or its refinement. • Propagate graph update on B(G) by splitting nodes until stable.
Conclusions • Novel approximate index structures and validation techniques • Experiments demonstrate k-bisimulation index is • Efficiently constructed • Effective for query answering
Future Work • Handle more query types • Branching queries • Queries with selection • Annotating A(k) with statistics for query optimization • Storage • Application of update algorithms to triggers