D(k)-Index: An Adaptive Structural Summary for Graph-Structured Data

D(k)-Index: An Adaptive Structural Summary for Graph-Structured Data Qun Chen, Andrew Lim and Kian Win Ong SIGMOD 2003

Outline • Introduction: XML Query and Path Expression • Previous Structural Summaries for XML • 1-Index • A(k)-Index • D(k)-Index • Construction • Update • Experimental Results • Conclusion and Future Work

An XML Document <?xml version="1.0"?> <!DOCTYPE MovieDB SYSTEM "moviedb.dtd"> <MovieDB> <director name="Steven Pat"> <movie> <title> Titanic </title> … </movie> … </director> </MovieDB>

XML Data Model

Regular Path Expression • Example: • director.movie.title • MovieDB.(_)?.movie.actor.name • Definition: • A sequence of labels (or the wildcard _) separated by dots (.) • Alternation (|), repetition (*), and optional expressions (?) are allowed

Path Matching • P: director.movie.title • Matching data nodes: {15, 16, 18}
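A minimal sketch of how a simple label path such as director.movie.title could be matched directly on a data graph. The toy graph, its node ids, and the function name `match_label_path` are illustrative assumptions, not the graph from the slide; the sketch handles plain label sequences only (no wildcard, alternation, or repetition) and lets a match start at any node.

```python
from collections import defaultdict

# Hypothetical toy data graph: node id -> label, plus parent -> child edges.
LABELS = {1: "MovieDB", 2: "director", 3: "movie", 4: "title",
          5: "movie", 6: "title", 7: "actor", 8: "name"}
EDGES = [(1, 2), (2, 3), (3, 4), (2, 5), (5, 6), (5, 7), (7, 8)]


def match_label_path(labels, edges, path):
    """Return the data nodes that end a node path whose labels spell `path`."""
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
    steps = path.split(".")
    # Assumption: a match may start at any node carrying the first label.
    frontier = {n for n, lab in labels.items() if lab == steps[0]}
    for step in steps[1:]:
        # Follow one edge and keep only children with the next label.
        frontier = {c for n in frontier for c in children[n] if labels[c] == step}
    return frontier


print(match_label_path(LABELS, EDGES, "director.movie.title"))
```

Every step scans the children of all current candidates, which is the search space a structural summary is meant to shrink.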

Purpose of Structural Summary • Example query P: A.C.D • To improve evaluation performance by pruning the search space!

Bisimilarity • Existing summary structures, the 1-index and the A(k)-index, are based on bisimilarity. • Definition: two data nodes u and v are bisimilar (u ≈ v) if • u and v have the same label; • if u' is a parent of u, then there is a parent v' of v such that u' ≈ v', and vice versa. • Intuitively, if two nodes are bisimilar, the set of paths coming into them is the same.

1-Index • Each index node represents an equivalence class whose data nodes are mutually bisimilar. • Evaluation on the 1-index is • safe: its result always contains the result of evaluating on the data graph; • sound: its result contains no false data nodes.
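A minimal sketch of building a 1-index by partition refinement: start from the partition by label and keep splitting blocks by the set of parent blocks until nothing changes. This is a naive fixpoint loop, not necessarily the algorithm used by the authors; the example graph (in particular its edges) and the name `build_one_index` are assumptions made for illustration.

```python
# Hypothetical data graph: node id -> label, plus parent -> child edges.
LABELS = {1: "A", 2: "B", 3: "B", 4: "C", 5: "D", 6: "E", 7: "E", 8: "F", 9: "F"}
EDGES = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 6), (4, 7), (6, 8), (7, 9)]


def build_one_index(labels, edges):
    """Return (extents, index_edges) of the bisimulation-based index graph."""
    parents = {n: set() for n in labels}
    for p, c in edges:
        parents[c].add(p)
    block = dict(labels)  # initial partition: one block per label
    while True:
        ids = {}
        # Two nodes stay together only if they share a block and parent blocks.
        refined = {
            n: ids.setdefault(
                (block[n], frozenset(block[p] for p in parents[n])), len(ids))
            for n in labels
        }
        # Refinement only splits blocks, so an unchanged count means stable.
        stable = len(set(refined.values())) == len(set(block.values()))
        block = refined
        if stable:
            break
    extents = {}
    for n, b in block.items():
        extents.setdefault(b, set()).add(n)
    index_edges = {(block[p], block[c]) for p, c in edges}
    return extents, index_edges


extents, index_edges = build_one_index(LABELS, EDGES)
print(sorted(sorted(e) for e in extents.values()))
```

On this assumed graph only nodes 2 and 3 end up sharing an index node, since every other pair with equal labels is reached by different incoming label paths.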

1-Index (cont'd) [Figure: a source data graph with nodes 1 to 9 (labels A; B, B, C; D, E, E; F, F) beside its 1-index graph, in which data nodes 2 and 3 share one index node.]

1-Index (cont’d)

Local Bisimilarity • k-bisimilarity (≈_k) is defined inductively: • For any two nodes u and v, u ≈_0 v iff u and v have the same label; • u ≈_k v iff u ≈_(k-1) v and, for every parent u' of u, there is a parent v' of v such that u' ≈_(k-1) v', and vice versa. • Intuitively, if two data nodes are k-bisimilar, the set of paths of length ≤ k coming into them is the same.
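The inductive definition translates almost directly into k rounds of refinement: round 0 groups nodes by label, and each later round splits a group by the groups of its members' parents. The sketch below assumes a small hypothetical graph (its edges are invented for illustration) and a hypothetical helper name `k_bisimilar_classes`.

```python
# Hypothetical data graph: node id -> label, plus parent -> child edges.
LABELS = {1: "A", 2: "B", 3: "B", 4: "C", 5: "D", 6: "E", 7: "E", 8: "F", 9: "F"}
EDGES = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 6), (4, 7), (6, 8), (7, 9)]


def k_bisimilar_classes(labels, edges, k):
    """Return the k-bisimilarity classes as a sorted list of sorted node lists."""
    parents = {n: set() for n in labels}
    for p, c in edges:
        parents[c].add(p)
    block = dict(labels)  # round 0: same label
    for _ in range(k):
        ids = {}
        # Round i: same block after round i-1 and same set of parent blocks.
        block = {
            n: ids.setdefault(
                (block[n], frozenset(block[p] for p in parents[n])), len(ids))
            for n in labels
        }
    groups = {}
    for n, b in block.items():
        groups.setdefault(b, set()).add(n)
    return sorted(sorted(g) for g in groups.values())


for k in range(3):
    print(k, k_bisimilar_classes(LABELS, EDGES, k))
```

In this assumed graph the two E nodes separate at k = 1 (their parents carry different labels) and the two F nodes only at k = 2, which shows how a larger k distinguishes longer incoming paths.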

A(k)-Index • In the A(k)-index, the data nodes in each index node are mutually k-bisimilar. • Properties: • 1. If nodes u and v are k-bisimilar, then the set of label paths of length ≤ k into them is the same. • 2. The set of label paths of length m (m ≤ k) into an A(k)-index node is the set of label paths of length m into any data node in its extent. • Evaluation on the A(k)-index is • safe: its result on a path expression always contains the data graph result for that query; • sound if the length of the query path is ≤ k; otherwise the result on the index graph must be validated on the data graph.
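A hedged sketch of the evaluate-then-validate pattern on an A(k)-index: the query runs on the index graph, and candidates are checked against the data graph only when the path is longer than k. The graph, the function names, and the `validate` switch are assumptions for illustration; path length is counted in edges, so a path with three labels has length 2.

```python
# Hypothetical data graph: node id -> label, plus parent -> child edges.
LABELS = {1: "A", 2: "B", 3: "B", 4: "C", 5: "D", 6: "E", 7: "E", 8: "F", 9: "F"}
EDGES = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 6), (4, 7), (6, 8), (7, 9)]


def parents_of(labels, edges):
    parents = {n: set() for n in labels}
    for p, c in edges:
        parents[c].add(p)
    return parents


def ak_blocks(labels, edges, k):
    """Map each data node to its A(k) index node (k refinement rounds)."""
    parents = parents_of(labels, edges)
    block = dict(labels)
    for _ in range(k):
        ids = {}
        block = {
            n: ids.setdefault(
                (block[n], frozenset(block[p] for p in parents[n])), len(ids))
            for n in labels
        }
    return block


def has_incoming_path(node, steps, labels, parents):
    """True if some node path ending at `node` spells the labels in `steps`."""
    if labels[node] != steps[-1]:
        return False
    if len(steps) == 1:
        return True
    return any(has_incoming_path(p, steps[:-1], labels, parents)
               for p in parents[node])


def evaluate_on_ak_index(labels, edges, k, path, validate=True):
    steps = path.split(".")
    block = ak_blocks(labels, edges, k)
    index_label = {block[n]: labels[n] for n in labels}
    index_edges = {(block[p], block[c]) for p, c in edges}
    # Match the label path on the (smaller) index graph.
    frontier = {b for b, lab in index_label.items() if lab == steps[0]}
    for step in steps[1:]:
        frontier = {c for p, c in index_edges
                    if p in frontier and index_label[c] == step}
    candidates = {n for n in labels if block[n] in frontier}
    if len(steps) - 1 <= k or not validate:
        return candidates  # sound without validation when length <= k
    parents = parents_of(labels, edges)
    return {n for n in candidates
            if has_incoming_path(n, steps, labels, parents)}


print(evaluate_on_ak_index(LABELS, EDGES, 1, "C.E.F", validate=False))
print(evaluate_on_ak_index(LABELS, EDGES, 1, "C.E.F"))
```

With k = 1 and the length-2 path C.E.F, the unvalidated index result contains a false positive (node 8), which the validation step removes; with k = 2 no validation is needed.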

A(k)-Index (cont'd) [Figure: three panels showing the source data graph, the A(0)-index graph, and the A(1)-index graph.]

D(k)-Index • Each index node in the D(k)-index has its own local bisimilarity • A clear generalization of the 1-index and the A(k)-index • Advantages over the 1-index and the A(k)-index: • workload-sensitive; • can be updated more efficiently

D(k)-Index (Cont.) • The D(k)-index is the index graph based on local bisimilarity. For any two index nodes ni and nj, it satisfies k(ni) ≥ k(nj) − 1 if there is an edge from ni to nj, where k(ni) and k(nj) are the local bisimilarities of ni and nj, respectively. • Example: k(A) ≥ k(B) − 1
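The edge condition can be checked mechanically. The sketch below lists the index edges that break k(ni) ≥ k(nj) − 1; the function name, the node names, and the k values are hypothetical.

```python
def dk_constraint_violations(local_k, index_edges):
    """Return the edges (ni, nj) for which k(ni) >= k(nj) - 1 does not hold."""
    return [(ni, nj) for ni, nj in index_edges
            if local_k[ni] < local_k[nj] - 1]


# Hypothetical index graph: A -> B -> E.
index_edges = [("A", "B"), ("B", "E")]

# Satisfies the condition: each parent is at most one level below its child.
print(dk_constraint_violations({"A": 0, "B": 1, "E": 2}, index_edges))

# Violates it on B -> E: k(B) = 0 is less than k(E) - 1 = 1.
print(dk_constraint_violations({"A": 0, "B": 0, "E": 2}, index_edges))
```

The intuition is that a child can only be k-bisimilar within its extent if its parents are already at least (k−1)-bisimilar within theirs.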

Properties of D(k)-Index • The set of label paths of length s (s ≤ k(ni)) into a node ni in the D(k)-index is the set of label paths of length s into any data node in its extent; • The D(k)-index is safe, i.e., its result on a path expression always contains the data graph result for that query; • The D(k)-index is sound for a path expression P of length m, l1 l2 ··· lm+1, if k(ni) ≥ m for each matching index node ni of P.
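Because soundness is decided per matching index node, a query processor could validate only the extents of nodes whose local bisimilarity is too small. A minimal sketch under that assumption; `split_by_soundness` and the node names are hypothetical, and the path length m is the number of labels minus one.

```python
def split_by_soundness(path, matching_index_nodes, local_k):
    """Split matching index nodes into (sound, needs_validation) sets."""
    m = len(path.split(".")) - 1  # a path with m + 1 labels has length m
    sound = {n for n in matching_index_nodes if local_k[n] >= m}
    return sound, set(matching_index_nodes) - sound


# Hypothetical example: two index nodes match A.B.E (length 2).
sound, to_validate = split_by_soundness("A.B.E", ["n1", "n2"],
                                        {"n1": 2, "n2": 1})
print(sound, to_validate)
```

Here the extent of n1 can be returned as is, while the data nodes in n2 would still have to be checked against the data graph.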

Construction of D(k)-Index

A Construction Example • Label E has a local bisimilarity requirement of 2; every other label has a requirement of 1
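A rough sketch of a construction in this spirit, not the authors' exact algorithm: per-label requirements are first propagated so that every parent label has at least the child's requirement minus one, then the partition is refined round by round, and in round i only labels that require at least i are split. The graph below is a hypothetical stand-in for the slide's figure, and `build_dk_partition` is an assumed name.

```python
# Hypothetical data graph: node id -> label, plus parent -> child edges.
LABELS = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E", 8: "F", 9: "F"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7), (6, 8), (7, 9)]


def build_dk_partition(labels, edges, required_k):
    """Return (local k per label, sorted list of extents)."""
    # Step 1: propagate requirements so that k(parent) >= k(child) - 1.
    k = dict(required_k)
    label_edges = {(labels[p], labels[c]) for p, c in edges}
    changed = True
    while changed:
        changed = False
        for pl, cl in label_edges:
            if k[pl] < k[cl] - 1:
                k[pl] = k[cl] - 1
                changed = True
    # Step 2: refine; round i splits only the labels that need k >= i.
    parents = {n: set() for n in labels}
    for p, c in edges:
        parents[c].add(p)
    block = dict(labels)
    for i in range(1, max(k.values()) + 1):
        ids, refined = {}, {}
        for n in labels:
            if k[labels[n]] >= i:
                sig = (block[n], frozenset(block[p] for p in parents[n]))
            else:
                sig = (block[n],)  # requirement already met: keep the block
            refined[n] = ids.setdefault(sig, len(ids))
        block = refined
    groups = {}
    for n, b in block.items():
        groups.setdefault(b, set()).add(n)
    return k, sorted(sorted(g) for g in groups.values())


requirements = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2, "F": 1}
local_k, extents = build_dk_partition(LABELS, EDGES, requirements)
print(local_k)
print(extents)
```

On this assumed graph the two E nodes are separated (they differ only in paths of length 2) while the two F nodes stay in one extent, a result that lies between what A(1) and A(2) would produce.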

Update on D(k)-Index • Two types of updates: • The addition of a subgraph; • The addition of a new edge; this represents a small incremental change to the source data; • For the addition of a subgraph, no major difference between D(k)-Index and previous static summary structures; • For the addition of a new edge, D(k)-Index is significantly more efficient!

Subgraph Addition

Edge Addition

Update Comparison • Splitting up index nodes is computationally expensive!

Experiments (Data Sets) • The Xmark benchmark data, which simulates information about the activities of an auction site. • The Nasa data, generated by the IBM data generator from a real DTD file that defines a markup language for the data and metadata at NASA/GSFC.

D(k) vs. A(k) • We compare our D(k)-index with the previous structural index, the A(k)-index, since the A(k)-index has been shown to outperform the 1-index. • We randomly generate 100 test paths with lengths between 2 and 5 for the Xmark and Nasa data. • We therefore compare the D(k)-index's performance with A(0), A(1), up to A(4), because evaluating the test paths on the A(4)-index is already sound.

Evaluation before Updating(Xmark)

Evaluation before Updating(Nasa)

Updating Performance • Running time (msec):
Index   Xmark    Nasa
A(1)    1,022    3,863
A(2)    3,322    11,126
A(3)    5,196    31,992
A(4)    23,262   53,090
D(k)    2        1,377
• Notes: 1: 100 new references are added to the XML documents randomly. 2: Our machine runs Linux on a 1.8 GHz Pentium 4 processor with 512 RAM.

Evaluation after Updating(Xmark)

Evaluation after Updating(Nasa)

Conclusion and Future Work • The D(k)-index, a clean generalization of the 1-index and the A(k)-index, has clear advantages over them: • Adaptive to the workload • More efficient update operations • Future work: • Query pattern mining • Extending the D(k)-index to handle more complicated, branching path queries
