D(k)-Index: An Adaptive Structural Summary for Graph-Structured Data

Qun Chen, Andrew Lim and Kian Win Ong

Outline • Introduction: XML Query and Path Expression • Previous Structural Summaries for XML • 1-Index • A(k)-Index • D(k)-Index • Construction • Update • Experimental Results • Conclusion and Future Work

An XML Document
<?xml version="1.0"?>
<!DOCTYPE MovieDB SYSTEM "moviedb.dtd">
<MovieDB>
  <director name="Steven Pat">
    <movie>
      <title> Titanic </title>
      …
    </movie>
    …
  </director>
</MovieDB>

XML Data Model

Regular Path Expression • Example: • director.movie.title • MovieDB.(_)?.movie.actor.name • Definition: • A sequence of labels (or the wildcard _, which matches any label) • Alternation (|), repetition (*) and optional expressions (?) are allowed

Path Matching • P: director.movie.title • Matching data nodes: {15, 16, 18}
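As a minimal sketch of path matching, the Python below evaluates a plain label path directly on a small data graph. It handles labels and the `_` wildcard only; alternation, repetition and optional steps are left out. The graph, its node ids and the function name `match_path` are hypothetical and do not reproduce the node numbers of the slide's figure.

```python
def match_path(labels, children, path):
    """Return the data nodes reached by a node path whose labels spell `path`."""
    steps = path.split(".")
    # Nodes whose label matches the first step ('_' matches any label).
    current = {n for n, lab in labels.items() if steps[0] in ("_", lab)}
    for step in steps[1:]:
        # Follow one edge and keep the children whose label matches the next step.
        current = {c for n in current for c in children.get(n, ())
                   if step in ("_", labels[c])}
    return current


# Hypothetical movie database graph: node id -> label, node id -> child ids.
labels = {1: "MovieDB", 2: "director", 3: "movie", 4: "title",
          5: "movie", 6: "title", 7: "actor", 8: "name"}
children = {1: [2], 2: [3, 5], 3: [4, 7], 5: [6], 7: [8]}

print(sorted(match_path(labels, children, "director.movie.title")))
print(sorted(match_path(labels, children, "MovieDB._.movie.actor.name")))
```

On this toy graph the first query returns the two title nodes (4 and 6) and the second returns the single name node (8). Every step scans the children of all current nodes, which is the search space a structural summary is meant to prune.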

Purpose of Structural Summary • Example query P: A.C.D • To improve evaluation performance by pruning the search space.

Bisimilarity • Existing summary structures, the 1-index and the A(k)-index, are based on bisimilarity. • Definition: two data nodes u and v are bisimilar ($u \approx v$) if • u and v have the same label; • for every parent u′ of u, there is a parent v′ of v such that $u' \approx v'$, and vice versa. • Intuitively, if two nodes are bisimilar, the set of paths coming into them is the same.

1-Index • Each index node represents an equivalence class of mutually bisimilar data nodes. • Evaluation on the 1-index is • safe: its result always contains the result of evaluating on the data graph; • sound: its result contains no false data nodes, i.e. nothing that evaluation on the data graph would not return.

1-Index (cont'd) [Figure: the source data graph (nodes 1 to 9, labels A to F) and the corresponding 1-index graph.]

Local Bisimilarity • k-bisimilarity ($\approx^k$) is defined inductively: • For any two nodes u and v, $u \approx^0 v$ iff u and v have the same label; • $u \approx^k v$ iff $u \approx^{k-1} v$ and, for every parent u′ of u, there is a parent v′ of v such that $u' \approx^{k-1} v'$, and vice versa. • Intuitively, if two data nodes are k-bisimilar, the sets of paths of length $\le k$ coming into them are the same.
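The inductive definition can be turned into a short refinement loop. The sketch below is one straightforward way to compute k-bisimilarity classes and is not necessarily how the authors implement it: round 0 groups nodes by label, and each later round splits a class when its members see different sets of parent classes. The data graph and the name `k_bisim_classes` are hypothetical.

```python
def k_bisim_classes(labels, parents, k):
    """Partition the data nodes into k-bisimilarity classes."""
    cls = dict(labels)  # round 0: u and v are 0-bisimilar iff their labels agree
    for _ in range(k):
        # Signature: own class from the previous round plus the set of parent classes.
        sig = {n: (cls[n], frozenset(cls[p] for p in parents.get(n, ())))
               for n in labels}
        ids = {}
        cls = {n: ids.setdefault(sig[n], len(ids)) for n in labels}
    groups = {}
    for n, c in cls.items():
        groups.setdefault(c, set()).add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data graph: node id -> label, node id -> parent ids.
labels = {1: "A", 2: "B", 3: "B", 4: "C", 5: "D", 6: "E", 7: "E", 8: "F", 9: "F"}
parents = {2: [1], 3: [1], 4: [1], 5: [2], 6: [3], 7: [4], 8: [6], 9: [7]}

for k in range(3):
    print(k, k_bisim_classes(labels, parents, k))
```

On this graph, nodes 6 and 7 are separated at k = 1 because their parents carry different labels (B and C), while 8 and 9 stay together until k = 2. Repeating the loop until the partition stops changing would give full bisimilarity.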

A(k)-Index • In the A(k)-index, the data nodes in each index node are mutually k-bisimilar. • Evaluation on the A(k)-index is • safe; • sound if the length of the query path is $\le k$; otherwise the result on the index graph must be validated against the data graph.
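To illustrate "safe but not always sound", the sketch below evaluates a label path on an A(0)-index of a small hypothetical data graph and validates the candidates against the data graph when the path is longer than k. Two assumptions are made here: path length is counted in edges, and validation is done by walking parent edges backwards. All names are illustrative.

```python
labels = {1: "A", 2: "B", 3: "B", 4: "C", 5: "D", 6: "E", 7: "E", 8: "F", 9: "F"}
children = {1: [2, 3, 4], 2: [5], 3: [6], 4: [7], 6: [8], 7: [9]}
parents = {}
for u, vs in children.items():
    for v in vs:
        parents.setdefault(v, []).append(u)

# A(0)-index: data nodes with the same label share one index node.
extent = {}
for n, lab in labels.items():
    extent.setdefault(lab, set()).add(n)
index_edges = {(labels[u], labels[v]) for u, vs in children.items() for v in vs}


def eval_on_index(steps):
    """Match the label path on the A(0) index graph and return the last extent."""
    # In A(0) an index node is identified by its label, so the path matches
    # iff every consecutive pair of labels is an index edge.
    if steps[0] not in extent:
        return set()
    if any(pair not in index_edges for pair in zip(steps, steps[1:])):
        return set()
    return set(extent[steps[-1]])


def has_incoming_path(node, steps):
    """Validation: does some path ending at `node` in the data graph spell `steps`?"""
    if labels[node] != steps[-1]:
        return False
    if len(steps) == 1:
        return True
    return any(has_incoming_path(p, steps[:-1]) for p in parents.get(node, ()))


def evaluate(path, k=0):
    steps = path.split(".")
    candidates = eval_on_index(steps)
    if len(steps) - 1 <= k:   # path length in edges (assumed convention)
        return candidates     # sound for this path, no validation needed
    return {n for n in candidates if has_incoming_path(n, steps)}


print(sorted(eval_on_index(["B", "E", "F"])))  # candidates from the index alone
print(sorted(evaluate("B.E.F")))               # after validation on the data graph
```

The index alone returns nodes 8 and 9 for B.E.F, a superset of the true answer. Validation removes node 9, which is reachable only through C.E.F.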

A(k)-Index (cont'd) [Figure: the source data graph with its A(0)-index graph and A(1)-index graph.]

D(k)-Index • Each index node in the D(k)-index has its own local bisimilarity. • A clear generalization of the 1-Index and the A(k)-Index. • Advantages over the 1-Index and the A(k)-Index: • workload-sensitive; • can be updated more efficiently.

D(k)-Index (cont'd) • The D(k)-index is the index graph based on local bisimilarity. It satisfies the condition that, for any two index nodes $n_i$ and $n_j$, $k(n_i) \ge k(n_j) - 1$ if there is an edge from $n_i$ to $n_j$, where $k(n_i)$ and $k(n_j)$ are the local bisimilarities of $n_i$ and $n_j$, respectively. • Example: for an edge from A to B, $k(A) \ge k(B) - 1$.
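The edge condition can be checked mechanically. The sketch below lists the edges that violate $k(n_i) \ge k(n_j) - 1$ and, as one assumed way to restore the condition, raises the parents' values until no violation remains. The index graph, the k values and both function names are hypothetical. The propagation step is an illustration of the constraint, not a transcription of the authors' algorithm.

```python
def violations(edges, k):
    """Edges (ni, nj) that break the condition k(ni) >= k(nj) - 1."""
    return [(a, b) for a, b in edges if k[a] < k[b] - 1]


def propagate(edges, k):
    """Raise parent values until every edge satisfies k(ni) >= k(nj) - 1."""
    k = dict(k)
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            if k[a] < k[b] - 1:
                k[a] = k[b] - 1  # smallest value that satisfies this edge
                changed = True
    return k


# Hypothetical index graph with one index node per label.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"), ("C", "E"), ("E", "F")]
k = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0, "F": 3}

print(violations(edges, k))
print(propagate(edges, k))
```

Asking for k = 3 on F forces E up to 2 and B and C up to 1, while A and D can stay at 0. Values are only ever raised and never exceed the largest requested k, so the loop terminates even on cyclic graphs.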

Properties of D(k)-Index • Evaluation on the D(k)-Index is safe. • The D(k)-Index is sound for a path expression P if, for each matching index node $n_i$ of P, $k(n_i) \ge m$, where m is the length of P.
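The soundness condition gives a simple test for when validation on the data graph can be skipped. The sketch below assumes that m is the number of edges in the label path; the index node names and their k values are made up for illustration.

```python
def path_length(path):
    """Number of edges in a label path such as 'B.E.F' (assumed convention)."""
    return len(path.split(".")) - 1


def is_sound_for(path, matching_nodes, k):
    """True if every matching index node n has k[n] >= m, so no validation is needed."""
    m = path_length(path)
    return all(k[n] >= m for n in matching_nodes)


# Hypothetical local bisimilarities of three index nodes.
k = {"E1": 2, "E2": 2, "F1": 1}

print(is_sound_for("B.E", ["E1", "E2"], k))  # m = 1
print(is_sound_for("B.E.F", ["F1"], k))      # m = 2
```

The first query can be answered from the index alone. The second cannot, because its matching index node only guarantees 1-bisimilarity, so its result would still need validation.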

Construction of D(k)-Index

A Construction Example • Label E has a local bisimilarity requirement of 2; every other label has a requirement of 1.
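A rough sketch of how per-label requirements could drive the partition: start from the label split and, in round i, refine only the nodes whose requirement is at least i. This simplification assumes the requirements already satisfy the edge condition k(parent) ≥ k(child) − 1. The data graph and the name `build_dk_partition` are hypothetical, and the slides do not spell out the authors' actual construction procedure.

```python
def build_dk_partition(labels, parents, k_of_label):
    """Group data nodes so that nodes with label L are k_of_label[L]-bisimilar."""
    need = {n: k_of_label[lab] for n, lab in labels.items()}
    # The sketch relies on the edge condition k(parent) >= k(child) - 1.
    assert all(need[p] >= need[n] - 1
               for n in labels for p in parents.get(n, ()))
    block = dict(labels)  # round 0: split by label only
    for i in range(1, max(need.values()) + 1):
        prev = block
        # Refine only the nodes that still need more precision in round i.
        block = {n: ((prev[n], frozenset(prev[p] for p in parents.get(n, ())))
                     if need[n] >= i else prev[n])
                 for n in labels}
    groups = {}
    for n, b in block.items():
        groups.setdefault(b, set()).add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data graph: node id -> label, node id -> parent ids.
labels = {1: "A", 2: "B", 3: "B", 4: "C", 5: "D", 6: "E", 7: "E", 8: "F", 9: "F"}
parents = {2: [1], 3: [1], 4: [1], 5: [2], 6: [3], 7: [4], 8: [6], 9: [7]}

# Requirements from the slide: E needs 2, every other label needs 1.
reqs = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2, "F": 1}
print(build_dk_partition(labels, parents, reqs))
```

With these requirements on this toy graph, the two F nodes stay in one index node because F only asks for 1-bisimilarity. Raising F to 2 (with E at 1 or more) would split them.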

Update on D(k)-Index • Two types of updates: • the addition of a subgraph; • the addition of a new edge, which represents a small incremental change to the source data. • For the addition of a subgraph, there is no major difference between the D(k)-Index and previous static summary structures. • For the addition of a new edge, the D(k)-Index is significantly more efficient.

Subgraph Addition

Edge Addition

Update Comparison Splitting up index nodes is computationally expensive!

Experiments (Data Sets) • The Xmark benchmark data, which simulates information about the activities of an auction site. • The Nasa data, generated by the IBM data generator from a real DTD file that defines a markup language for the data and metadata at NASA/GSFC.

D(k) vs. A(k) • We compare our D(k)-index with the previous structural index, the A(k)-index, since the A(k)-index has been shown to outperform the 1-index. • We randomly generate 100 test paths with lengths between 2 and 5 for the Xmark and Nasa data. • We therefore compare the D(k)-index's performance with A(0), A(1), up to A(4); we stop at A(4) because evaluating the test paths on the A(4)-index is already sound.

Cost Model • Because no standard storage scheme or query cost model exists, the simple in-memory cost model used in evaluating the A(k)-index is adopted. • The cost of a query P is defined as the total number of nodes visited in the index graph (IG) and the data graph (G) during evaluation: • TotalCost(P) = NumNodesVisited(G) + NumNodesVisited(IG)

Evaluation before Updating(Xmark)

Evaluation before Updating(Nasa)

Updating Performance
Running Time (msec)
Index    Xmark     Nasa
A(1)     1,022     3,863
A(2)     3,322     11,126
A(3)     5,196     31,992
A(4)     23,262    53,090
D(k)     2         1,377
Notes: 1: 100 new references are added to the XML documents at random. 2: Our machine runs Linux on a 1.8 GHz Pentium 4 processor with 512 MB of RAM.

Evaluation after Updating(Xmark)

Evaluation after Updating(Nasa)

Conclusion and Future Work • The D(k)-Index, as a clean generalization of the 1-index and the A(k)-Index, has clear advantages over them: • it adapts to the query workload; • its update operations are more efficient. • Future work: • query pattern mining; • extending the D(k)-Index to handle more complicated, branching path queries.
