Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections
Ralf Schenkel, joint work with Anja Theobald and Gerhard Weikum
Outline • The Problem: Connections in XML Collections • HOPI Basics [EDBT 2004] • Efficiently Building HOPI • Why Distances are Difficult • Incremental Index Maintenance
XML Basics
[Figure: an XML document and its element-level graph, a tree with element nodes article, title, sec, references, entry]
<article>
  <title>XML</title>
  <sec>…</sec>
  <references>
    <entry>…</entry>
  </references>
</article>
XML Basics
Three linked documents:
<article> <title>XML</title> <sec>…</sec> <references> <entry>…</entry> </references> </article>
<researcher> <name>Schenkel</name> <topics>…</topics> <pubs> <book>…</book> </pubs> </researcher>
<book> <title>UML</title> <author>…</author> <content> <chap>…</chap> </content> </book>
XML collection = docs + links
XML Basics
[Figure: element-level graph of the collection, joining the element trees of article, researcher, and book via link edges]
XML Basics
[Figure: document-level graph of the collection]
Connections in XML
XPath(++)/NEXI(++) query: //article[about("XML")]//researcher[about("DBS")]
Questions:
• Is there a path from article to researcher?
• How long is the shortest path from article to researcher?
(Naive) answers:
• Use the transitive closure!
• Use any APSP algorithm (and store the information)!
Why naive is not enough
Small example from the real world: a subset of DBLP
• 6,210 documents (publications), 168,991 elements, 25,368 links (citations), 14 MB of uncompressed XML
• Element-level graph: 168,991 nodes and 188,149 edges
• Its transitive closure: 344,992,370 connections, 2,632.1 MB
Complete DBLP has about 600,000 documents. The Web has …?
Goal
Find a compact representation for the transitive closure
• whose size is comparable to the data's size
• that supports connection tests (almost) as fast as the transitive closure
• that can be built efficiently for large data sets
HOPI: Use a Two-Hop Cover
• For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)
• For each connection (a,b), choose a node c on the path from a to b (the center node) and add c to Lout(a) and to Lin(b)
• Then (a,b) ∈ transitive closure T ⇔ Lout(a) ∩ Lin(b) ≠ ∅
• Two-hop cover of T (Edith Cohen et al., SODA 2002)
• Minimize the sum of the label sizes (NP-hard, so an approximation is required)
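To make the lookup concrete, here is a minimal sketch of the two-hop test in Python, assuming labels are kept as plain in-memory sets. The names Lin/Lout follow the slides; everything else is illustrative and not HOPI's actual storage layout.

```python
from collections import defaultdict

Lin = defaultdict(set)   # Lin[b]: center nodes that can reach b
Lout = defaultdict(set)  # Lout[a]: center nodes reachable from a

def add_connection(a, b, center):
    """Cover the connection (a, b) by the chosen center node."""
    Lout[a].add(center)
    Lin[b].add(center)

def connected(a, b):
    """(a, b) is in the transitive closure iff Lout(a) and Lin(b) intersect."""
    return a == b or bool(Lout[a] & Lin[b])

# Tiny example: a -> c -> b, with c chosen as center node.
add_connection('a', 'b', 'c')
add_connection('a', 'c', 'c')
add_connection('c', 'b', 'c')
print(connected('a', 'b'))  # True
print(connected('b', 'a'))  # False
```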
Approximation Algorithm
What are good center nodes? Nodes that can cover many uncovered connections.
Initial step: all connections are uncovered.
Consider the center graph of a candidate node; its quality is the density of its densest subgraph (here the same as the initial density).
[Figure: example graph with nodes 1–6 and the center graph of a candidate; 8 connections can be covered with 6 cover entries]
Approximation Algorithm
What are good center nodes? Nodes that can cover many uncovered connections.
Initial step: all connections are uncovered.
Consider the center graph of a candidate: here the density of the densest subgraph equals the initial density (the center graph is complete).
Cover the connections in the subgraph with the greatest density, using the corresponding center node.
[Figure: example center graph of a candidate node]
Approximation Algorithm
What are good center nodes? Nodes that can cover many uncovered connections.
Next step: some connections are already covered.
Consider the center graphs of the remaining candidates and repeat until all connections are covered.
Theorem: the generated cover is optimal up to a logarithmic factor.
[Figure: example center graph after the first round]
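As an illustration of the greedy loop, the following Python sketch scores each candidate center by the number of still-uncovered connections it covers, instead of computing densest subgraphs of center graphs. It mirrors the structure of the algorithm, but not the exact selection rule or its approximation guarantee.

```python
from collections import defaultdict

def ancestors(tc, w):
    return {a for (a, b) in tc if b == w} | {w}

def descendants(tc, w):
    return {b for (a, b) in tc if a == w} | {w}

def greedy_two_hop_cover(tc):
    """tc: set of (ancestor, descendant) pairs, i.e. the transitive closure."""
    uncovered = set(tc)
    Lin, Lout = defaultdict(set), defaultdict(set)
    nodes = {n for pair in tc for n in pair}
    while uncovered:
        # Pick the candidate center that covers the most uncovered connections.
        best, best_gain = None, 0
        for w in nodes:
            anc, desc = ancestors(tc, w), descendants(tc, w)
            gain = sum(1 for (a, b) in uncovered if a in anc and b in desc)
            if gain > best_gain:
                best, best_gain = w, gain
        # Cover those connections with the chosen center node.
        anc, desc = ancestors(tc, best), descendants(tc, best)
        newly = {(a, b) for (a, b) in uncovered if a in anc and b in desc}
        for (a, b) in newly:
            Lout[a].add(best)
            Lin[b].add(best)
        uncovered -= newly
    return Lin, Lout
```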
Optimizing Performance [EDBT04]
• The density of the densest subgraph of a node's center graph never increases as connections are covered
• Precompute estimates and recompute on demand (using a priority queue) → ~2 computations per node
• Initial center graphs are always their own densest subgraphs
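A sketch of the lazy-recomputation idea, using Python's heapq as a max-priority queue; estimate and exact_density are placeholder functions. Because densities never increase, a stale estimate is an upper bound, so a candidate only needs to be recomputed when it reaches the top of the queue.

```python
import heapq

def pick_best_center(candidates, estimate, exact_density):
    """Pop candidates by (negated) estimated density; recompute only when popped."""
    heap = [(-estimate(w), w) for w in candidates]
    heapq.heapify(heap)
    while heap:
        _, w = heapq.heappop(heap)
        actual = exact_density(w)
        # If the recomputed density still beats the best remaining estimate
        # (an upper bound on every other candidate), w is the best choice.
        if not heap or actual >= -heap[0][0]:
            return w
        heapq.heappush(heap, (-actual, w))
    return None
```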
Is that enough?
For our example:
• Transitive closure: 344,992,370 connections
• Two-hop cover: 1,289,930 entries → compression factor of ~267
• Queries are still fast (~7.6 entries/node)
But: computation took 45 hours and 80 GB of RAM!
HOPI: Divide and Conquer
Framework of an algorithm:
• Partition the graph such that the transitive closures of the partitions fit into memory and the weight of crossing edges is minimized
• Compute the two-hop cover for each partition
• Combine the two-hop covers of the partitions into the final cover
Step 3: Cover Joining
Naive algorithm (from EDBT '04), using the current Lin and Lout:
For each cross-partition link s→t:
• choose t as the center node for all connections over s→t
• add t to Lin(d) of all descendants d of t, and of t itself
• add t to Lout(a) of all ancestors a of s, and of s itself
The join has to be done sequentially for all links.
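A sketch of this sequential join, with ancestors_in_cover/descendants_in_cover as assumed placeholder helpers that resolve reachability using the current covers:

```python
def naive_join(cross_links, Lin, Lout, ancestors_in_cover, descendants_in_cover):
    """Process each cross-partition link s -> t one after another."""
    for (s, t) in cross_links:
        # t becomes the center node for every connection running over s -> t.
        for d in descendants_in_cover(t) | {t}:
            Lin[d].add(t)
        for a in ancestors_in_cover(s) | {s}:
            Lout[a].add(t)
        # Links must be handled sequentially: lookups for later links already
        # depend on the label entries added for earlier links.
```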
Results with Naive Join
Best combination of algorithms:
• Transitive closure: 344,992,370 connections
• Two-hop cover: 15,976,677 entries → compression factor of ~21.6
• Queries are still ok (~94.5 entries/node)
• Build time is feasible (~3 hours with 1 CPU and 1 GB RAM)
Can we do better?
Structurally Recursive Join Algorithm
Basic idea:
• Compute a (small) skeleton graph from the partitioning
• Compute its two-hop cover Hin, Hout
• Combine this cover with the partition covers
Example
Build the partition-level skeleton graph (PSG)
[Figure: partitions containing nodes 1–8 and the cross-partition links between them]
Example (ctd.)
Join algorithm:
• For each link source s, add Hout(s) to Lout(a) for each ancestor a of s in s's partition
• For each link target t, add Hin(t) to Lin(d) for each descendant d of t in t's partition
The join can be done concurrently for all links.
[Figure: skeleton-graph cover Hin/Hout for the example nodes]
Example (ctd.)
Lemma: it is enough to cover connections from link sources to link targets.
[Figure: resulting labels in the example, e.g. Lout = {…,2,7} and Lin = {…,2}]
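A sketch of the concurrent join step from the previous slides, again with placeholder helpers for per-partition ancestors and descendants; Hin/Hout denote the skeleton-graph cover:

```python
def recursive_join(link_sources, link_targets, Hin, Hout, Lin, Lout,
                   ancestors_in_partition, descendants_in_partition):
    """All links can be processed independently (e.g. in parallel threads)."""
    for s in link_sources:
        # Propagate the skeleton-graph out-labels of s to its ancestors.
        for a in ancestors_in_partition(s) | {s}:
            Lout[a] |= Hout[s]
    for t in link_targets:
        # Propagate the skeleton-graph in-labels of t to its descendants.
        for d in descendants_in_partition(t) | {t}:
            Lin[d] |= Hin[t]
```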
Final Results for Index Creation
• Transitive closure: 344,992,370 connections
• Two-hop cover: 9,999,052 entries → compression factor of ~34.5
• Queries are still ok (~59.2 entries/node)
• Build time is good (~23 minutes with 1 CPU and 1 GB RAM)
The cover is 8 times larger than the best one, but built ~118 times faster with ~1% of the memory.
Outline • The Problem: Connections in XML Collections • HOPI Basics [EDBT 2004] • Efficiently Building HOPI • Why Distances are Difficult • Incremental Index Maintenance
Why Distances are Difficult
• Distances should be simple to add: instead of Lout(v) = {u, …} and Lin(w) = {u, …}, store (center, distance) pairs, e.g. Lout(v) = {(u,2), …} and Lin(w) = {(u,4), …}; then dist(v,w) = dist(v,u) + dist(u,w) = 2 + 4 = 6
• But the devil is in the details…
Why Distances are Difficult
If the actual shortest path from v to w has length 1 (dist(v,w) = 1), the center node u does not reflect the correct distance of v and w.
Solution: Distance-Aware Center Graph
• Add edges to the center graph only if the corresponding connection is a shortest path
• Correct, but two problems:
  • Expensive to build the center graph (2 additional lookups per connection)
  • Initial graphs are no longer complete → the bound is no longer tight
[Figure: distance-aware center graph for the example]
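With a distance-aware cover, a lookup takes the minimum over all common center nodes. A minimal sketch, assuming Lout(v) and Lin(w) are dictionaries mapping center nodes to distances (illustrative representation, not HOPI's actual layout):

```python
import math

def distance(v, w, Lin, Lout):
    """Shortest distance from v to w via a common center node."""
    if v == w:
        return 0
    common = Lout[v].keys() & Lin[w].keys()
    if not common:
        return math.inf  # not connected
    return min(Lout[v][u] + Lin[w][u] for u in common)
```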
New Bound for Distance-Aware CGs
Estimating the initial density: assuming we know the center graph (E = number of edges), a bound on the initial density can be derived from E.
But: this precomputation takes 4 hours, while it reduces the time to build the two-hop cover by only 2 hours.
Solution: random sampling of large center graphs.
Outline • The Problem: Connections in XML Collections • HOPI Basics [EDBT 2004] • Efficiently Building HOPI • Why Distances are Difficult • Incremental Index Maintenance
Incremental Maintenance
How do we update the two-hop cover when documents (nodes, elements) are
• inserted into the collection (join),
• deleted from the collection, or
• updated (delete + insert)?
Rebuilding the complete cover should be the last resort!
Deleting "good" documents
"Good" documents separate the document-level graph: the ancestors of d and the descendants of d are connected only through d.
Example: delete document 6 → deletions in the covers of elements in documents 3, 4, 8, 9 (and doc 6).
[Figure: document-level graph with documents 1–9]
Deleting "bad" documents
"Bad" documents don't separate the document-level graph: the ancestors of d and the descendants of d are connected through d and also via other documents.
Example: delete document 5 →
• deletions in the covers of elements in documents 1, 2, 3, 7 (and doc 5)
• add a two-hop cover for connections starting in docs 1, 2, 3 (but not 4) and ending in 7
[Figure: document-level graph with documents 1–9]
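A sketch of how the "good"/"bad" distinction could be tested on the document-level graph, using plain adjacency dictionaries and BFS; the graph representation and helper names are illustrative assumptions, not part of HOPI:

```python
from collections import deque

def reachable(graph, start, skip=None):
    """All nodes reachable from start, optionally ignoring the node `skip`."""
    seen, queue = set(), deque([start])
    while queue:
        n = queue.popleft()
        for m in graph.get(n, ()):
            if m != skip and m not in seen:
                seen.add(m)
                queue.append(m)
    return seen

def is_good_document(graph, reverse_graph, d):
    """d is 'good' if its ancestors reach its descendants only through d."""
    ancestors = reachable(reverse_graph, d)
    descendants = reachable(graph, d)
    return all(descendants.isdisjoint(reachable(graph, a, skip=d))
               for a in ancestors)
```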
Future Work
• Applications with non-XML data
• Length-bound connections: n-hop cover
• Distance-aware solution for large graphs with many cycles (partitioning breaks cycles)
• Large-scale experiments with huge data sets, using many concurrent threads/processes on a 64-CPU Sun server or on 16 or 32 cluster nodes:
  • Complete DBLP (~600,000 docs)
  • IMDB (>1 million docs, with cycles)
Conclusion • HOPI as connection and distance index for linked XML documents • Efficient Divide-and-Conquer Build Algorithm • Efficient Insertion and (sometimes) Deletion of Documents, Elements, Edges