A(k)-index : Exploiting Local Similarity to Index Paths in Graph Data

Raghav Kaushik (UW), Pradeep Shenoy (UWash), Philip Bohannon (Bell Labs), Ehud Gudes (BGU)

Outline • Problem statement • Prior work and limitations • Background • A(k)-index • Query Evaluation • Preliminary experiments • Update • Conclusions

Data Model • Rooted, node-labeled graph with unique root; root has unique label • Nodes - objects • Arcs - object-subobject relationship • In XML context • Index tag structure • No distinction between elements and attributes • No distinction between tree and idref arcs • Order ignored

Problem Statement • Practical indexing schemes for large graph data (like XML data) (100K - 1M nodes) • Size ~10% of database size • Efficient construction and update • Tunable to a workload • Queries of the form R x, where R is a regular path expression • Schemaless data

Flavor of Approach • Different from traditional value indices • Structural summaries for indexing paths • Both data and index are rooted graphs • Example: Dataguide

Index Graph • Structural summary • Associate a set of data nodes with each index node, called its extent • Preserve data paths in index graph

Example index graph [Figure: a data graph over nodes 0 to 6 shown beside its index graph, whose nodes have extents 0, 1, 2, {3,4}, {5,6}]

Index Graph (cont’d) • Can be constructed from any partition • Node for every equivalence class C • Edge between C and C’ if there exists an edge v → v’ with v in C and v’ in C’ • Preserves data paths, no false drops • Our structures are all index graphs
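The construction above can be sketched in a few lines of Python. This is only an illustration: the function name `build_index_graph`, the class names, and the small data graph are assumptions made for the example, not taken from the slides.

```python
from collections import defaultdict

def build_index_graph(edges, partition):
    """Build an index graph from a partition of the data nodes.

    edges:     iterable of (u, v) data-graph edges
    partition: dict mapping each data node to its equivalence class
    Returns (extents, index_edges).
    """
    # One index node per equivalence class; its extent is the set of members.
    extents = defaultdict(set)
    for node, cls in partition.items():
        extents[cls].add(node)
    # Edge C -> C' whenever some data edge v -> v' has v in C and v' in C'.
    index_edges = {(partition[u], partition[v]) for u, v in edges}
    return dict(extents), index_edges

# Hypothetical data graph and partition (assumed for illustration only).
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
partition = {0: "C0", 1: "C1", 2: "C2", 3: "C34", 4: "C34", 5: "C56", 6: "C56"}

extents, index_edges = build_index_graph(edges, partition)
print(extents["C34"])
print(sorted(index_edges))
```

Because every data edge is mapped to an index edge, any label path present in the data is also present in the index graph.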

Prior Schemes • Dataguide [Goldman, Widom 1997] • Deterministic automaton corresponding to data graph • Each set of data nodes that can be distinguished by a path query is summarized by a single node in the index • Can be exponential in size!

Prior Schemes (cont’d) • 1-index [Milo, Suciu 1999] • NFA rather than DFA (smaller) • split graph nodes into equivalence classes based on incoming paths from the root • Computing best split is PSPACE complete • Go for refinements (approximations) • similarity • bisimilarity

Limitations of Prior Work • Size • Dataguide sizes subject to exponential blow-up • 1-index size can be big too! • Update • No known update algorithm for 1-index • Designed to answer queries involving arbitrarily complex paths, but... • such paths may never show up in queries

Local Similarity [Figure: example data graph rooted at ROOT, with node labels metro, cultural, neighborhoods, business, museum, hotel, nhd., nearby, attr., cult.]

Main Contributions • New family of approximate index structures • Applicable to • Approximate Schema • Statistics • Query evaluation using approximate indexes • Preliminary performance study • Update algorithms

Approximate Indexes • Motivation: • Smaller • More efficient query processing • Limited update cost - maintain local information • Approximate dataguide [Goldman et al.] • path merging, object matching, etc. • no formal basis (but different goal) • no study of effect on query processing

Graph Bisimulation • A bisimulation is a symmetric relation R between nodes • If A1 R A2 then • A1 and A2 have the same labels • and ...

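Graph Bisimulation (cont’d) • for every edge (B1, A1) there is an edge (B2, A2) with B1 R B2 • and vice-versa! [Figure: parents B1, B2 of nodes A1, A2, with R relating A1 to A2 and B1 to B2]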

Bisimilarity • Two nodes a and b are bisimilar if they are related in some bisimulation • 1-index is index graph constructed from bisimulation partition • Simulation partition: similar

Bisimulation on example [Figure: bisimulation partition of the ROOT / metro / cultural / neighborhoods / business / museum / hotel example graph]

k-bisimulation • Nodes A1 and A2 are 0-bisimilar iff they have the same label • A1 and A2 are k-bisimilar iff • they are (k-1)-bisimilar and • for every edge (B1, A1) there exists an edge (B2, A2) such that B1 and B2 are (k-1)-bisimilar, and vice versa
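The inductive definition translates directly into k rounds of partition refinement. The following is a minimal sketch, assuming a hypothetical 7-node graph and labels; the function name `k_bisimulation` is illustrative and this is not presented as the authors' construction algorithm.

```python
def k_bisimulation(labels, edges, k):
    """Partition nodes into k-bisimilarity classes.

    labels: dict node -> label
    edges:  iterable of (parent, child) pairs
    """
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)

    # Round 0: two nodes are 0-bisimilar iff they carry the same label.
    block = dict(labels)
    for _ in range(k):
        # Two nodes stay together iff they were together in the previous
        # round and their parents fall into the same set of previous classes.
        signature = {
            v: (block[v], frozenset(block[p] for p in parents[v]))
            for v in labels
        }
        ids = {}
        block = {v: ids.setdefault(signature[v], len(ids)) for v in labels}

    classes = {}
    for v, b in block.items():
        classes.setdefault(b, []).append(v)
    return sorted(sorted(c) for c in classes.values())

# Hypothetical data graph (assumed for illustration only).
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]

for k in range(3):
    print(k, k_bisimulation(labels, edges, k))
```

On this assumed graph, nodes 3 and 4 share a label but have differently labeled parents, so they are separated at k = 1, while nodes 5 and 6 are only separated at k = 2.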

Example for k-bisimulation [Figure: three panels over nodes 0 to 6: Data graph, 0-bisimulation (classes 0, 1, 2, {3,4}, {5,6}), 1-bisimulation (classes 0, 1, 2, 3, 4, {5,6})]

A(2) for example [Figure: A(2)-index of the ROOT / metro / cultural / neighborhoods / business / museum / hotel example graph]

Properties • If a and b are bisimilar • the sets of incoming paths into them are the same • If a and b are k-similar or k-bisimilar • the sets of incoming paths of length <= k are the same • If k-bisim = (k+1)-bisim then k-bisim = bisim • Size: certainly smaller than bisimulation
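The fixpoint property (if k-bisim = (k+1)-bisim then k-bisim = bisim) suggests one way to obtain the full bisimulation partition: keep refining until a round splits nothing. The sketch below assumes a hypothetical graph and the illustrative name `bisimulation_by_refinement`; it is a naive loop, not an optimized partition-refinement algorithm.

```python
def bisimulation_by_refinement(labels, edges):
    """Refine the label partition until it stops changing.

    Returns (k, block) where k is the first level at which the
    k-bisimulation and (k+1)-bisimulation partitions coincide.
    """
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)

    block = dict(labels)                   # 0-bisimulation: group by label
    num_blocks = len(set(block.values()))
    k = 0
    while True:
        signature = {
            v: (block[v], frozenset(block[p] for p in parents[v]))
            for v in labels
        }
        ids = {}
        refined = {v: ids.setdefault(signature[v], len(ids)) for v in labels}
        # Each round only splits classes, so an unchanged class count
        # means the partition itself is unchanged: a fixpoint.
        if len(ids) == num_blocks:
            return k, block
        block, num_blocks, k = refined, len(ids), k + 1

# Hypothetical data graph (assumed for illustration only).
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]

k, block = bisimulation_by_refinement(labels, edges)
print(k, len(set(block.values())))
```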

Query Evaluation • Only queries studied are regular path queries of the form R x • Query Evaluation Approach: • Create automaton for regexp query • Run automaton on the index graph • Result is union of extents belonging to index nodes accepted by automaton
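The three steps above can be sketched as a traversal of the product of the automaton and the index graph. Everything named here is an assumption for illustration: the NFA is written out by hand for the path query root.a.c.d rather than compiled from a regular expression, the automaton is assumed to consume the label of every node on the path including the root, and the index graph is a hypothetical A(0)-style index.

```python
from collections import deque

def evaluate_on_index(index_labels, index_edges, extents, root,
                      delta, start, accepting):
    """Run an NFA over the index graph and union the accepted extents.

    delta: dict (state, label) -> set of next states
    """
    children = {}
    for u, v in index_edges:
        children.setdefault(u, []).append(v)

    # Consume the root's label first, then explore (index node, state) pairs.
    frontier = deque((root, s) for s in delta.get((start, index_labels[root]), ()))
    seen = set(frontier)
    result = set()
    while frontier:
        node, state = frontier.popleft()
        if state in accepting:
            result |= extents[node]
        for child in children.get(node, ()):
            for nxt in delta.get((state, index_labels[child]), ()):
                if (child, nxt) not in seen:
                    seen.add((child, nxt))
                    frontier.append((child, nxt))
    return result

# Hypothetical index graph: one index node per label.
index_labels = {"R": "root", "A": "a", "B": "b", "C": "c", "D": "d"}
index_edges = [("R", "A"), ("R", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
extents = {"R": {0}, "A": {1}, "B": {2}, "C": {3, 4}, "D": {5, 6}}

# Hand-written NFA for the path root.a.c.d (states 0..4, accepting state 4).
delta = {(0, "root"): {1}, (1, "a"): {2}, (2, "c"): {3}, (3, "d"): {4}}

answer = evaluate_on_index(index_labels, index_edges, extents, "R", delta, 0, {4})
print(sorted(answer))
```

The `seen` set bounds the work by the number of (index node, state) pairs, so the traversal terminates even if the index graph has cycles.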

Example Query Evaluation [Figure: automaton graph shown beside an index graph with extents 0, 1, 2, {3,4}, {5,6}]

Approximate Indexes • Caveat: False positives possible • Approach: verify each node on data graph by running reverse automaton • Prohibitive cost? • Then why use approx. indices? • In fact, frequently more efficient than data graph or precise index
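A minimal sketch of the verification step, assuming a hand-written NFA for the path root.a.c.d and a hypothetical data graph in which node 5 matches the query and node 6 does not. The function name `verify` and the convention that the automaton consumes every node label including the root's are assumptions for the example; the optimizations discussed next are not included.

```python
def verify(x, labels, parents, root, delta, start, accepting):
    """Check on the data graph that some root-to-x path is accepted.

    Runs the automaton in reverse: a pair (node, state) means the forward
    automaton would be in `state` right after consuming node's label.
    """
    # Reverse the transitions: (target state, label) -> set of source states.
    reverse = {}
    for (source, label), targets in delta.items():
        for target in targets:
            reverse.setdefault((target, label), set()).add(source)

    stack = [(x, f) for f in accepting]
    seen = set(stack)
    while stack:
        node, state = stack.pop()
        prior = reverse.get((state, labels[node]), set())
        if node == root and start in prior:
            return True              # reached the root in the start state
        for p in parents.get(node, ()):
            for s in prior:
                if (p, s) not in seen:
                    seen.add((p, s))
                    stack.append((p, s))
    return False

# Hypothetical data graph (assumed for illustration only).
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
parents = {1: [0], 2: [0], 3: [1], 4: [2], 5: [3], 6: [4]}

# Hand-written NFA for root.a.c.d (start state 0, accepting state 4).
delta = {(0, "root"): {1}, (1, "a"): {2}, (2, "c"): {3}, (3, "d"): {4}}

candidates = {5, 6}  # e.g. an extent returned by a coarse index
verified = {x for x in candidates if verify(x, labels, parents, 0, delta, 0, {4})}
print(sorted(verified))
```

Here node 6 is reachable only via root.b.c.d, so the reverse run dies at its grandparent and the false positive is filtered out.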

Improving Validation • First cut: Keep track of accepting-path-length • for accepted nodes with path length <= k, verification not required • Second step: Share traversals among verification calls • mark node-state pairs on a successful verification path as accept • similar marking for failed path

Improving Validation (cont’d) • Third Step: Avoid needless verification • Example: For _*.R queries, no need to verify all the way up to the root • Generalize the above!

Preliminary Experiments • Data used: Internet Movie Database (http://www.imdb.com) • 250,000 movies & TV shows • 460,000 actors, etc. • XML version = ~1GB • We used subsets of this database ranging from 200 - 2000 movies • Whole database --> future work!

Preliminary Experiments • Second source: Open Directory Project (http://www.dmoz.org) • Entire source available in RDF format • Subsets: (entire subtree under a topic, say shopping)

Storage Model • Results independent of any particular storage model • In-memory rooted graph • Performance metrics are abstract • Cost = total number of nodes visited (graph + index)

Bisimulation Sizes • IMDB #Nodes: 190,000 • ODP #Nodes: 143,000

Query Evaluation Plans • 1. Forward eval • 2. Backward eval (assume a label index)

Short Queries - IMDB

Long Queries - IMDB

Queries beginning with _*

Queries containing _*

Approximate Answers

A(k)-index Update • Edge added from u to v • A(0)-index -> no change except possible addition of edge • A(1)-index -> index node containing v may change • determined by set of labels in v’s parents
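For the A(1) case, the index node of v is determined by v's own label together with the set of its parents' labels, so the effect of a new edge can be checked by recomputing that key for v alone. The sketch below uses a hypothetical graph and the illustrative name `a1_key`; it shows only the classification step, not the maintenance of extents or index edges.

```python
def a1_key(v, labels, parents):
    """Key identifying v's A(1) index node: own label plus parent labels."""
    return (labels[v], frozenset(labels[p] for p in parents[v]))

# Hypothetical data graph (assumed for illustration only).
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c"}
parents = {0: set(), 1: {0}, 2: {0}, 3: {1}, 4: {2}}

before = a1_key(3, labels, parents)

# Add an edge from u = 2 to v = 3; only v's parent set changes.
parents[3].add(2)
after = a1_key(3, labels, parents)

print(before != after)                      # v's A(1) position changed
print(after == a1_key(4, labels, parents))  # does it join node 4's class?
```

In this example node 3 now has parents labeled both a and b, so it leaves its old index node and does not join node 4's; a new A(1) index node would be needed for it.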

A(k)-index Update (cont’d) • A(k)-index • only nodes to be considered are those at distance < k from v • Maintain tree of splits • Work iteratively: • find new A(1) position of v • find new A(2) positions of v and its children • …

Updating the 1-index • One way is generalization of A(k) update • R - any binary relation on the nodes that is • reflexive • transitively closed. • A refinement of R is any subset that is • reflexive • transitively closed

Refinement • B - bisimulation relation • B’ - any refinement of B • B(G) - index graph built using B • B’(G) - index graph built using B’

Theorem • Theorem: B(B’(G)) = B(G) • Intuition: • Similar nodes behave similarly • So, fuse them together!

Lazy Update • Basic Idea: • G → G’, and meanwhile B(G) → B(G’) • Instead, “relax” the graph B(G) to B’(G’) • How? • A “stable” partitioning of G is either B(G) or its refinement. • Propagate graph update on B(G) by splitting nodes until stable.

Lazy Update Performance

Conclusions • Novel approximate index structures and validation techniques • Experiments demonstrate k-bisimulation index is • Efficiently constructed • Effective for query answering

Future Work • Handle more query types • Branching queries • Queries with selection • Annotating A(k) with statistics for query optimization • Storage • Application of update algorithms to triggers
