Compressing and Searching XML Data Via Two Zips

Compressing and Searching XML Data Via Two Zips Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan] Paolo Ferragina, Università di Pisa

Opportunistic Data Structures with Applications P. Ferragina, G. Manzini Survey byNavarro-Makinencites more than50papers on the subject !! Six years ago... [now, J. ACM 05] Paolo Ferragina, Università di Pisa

An XML excerpt <dblp> <book> <author> Donald E. Knuth</author> <title> The TeXbook </title> <publisher> Addison-Wesley </publisher> <year> 1986 </year> </book> <article> <author> Donald E. Knuth </author> <author> Ronald W. Moore </author> <title> An Analysis of Alpha-Beta Pruning </title> <pages> 293-326 </pages> <year> 1975 </year> <volume> 6 </volume> <journal> Artificial Intelligence </journal> </article> ... </dblp> It is verbose ! Paolo Ferragina, Università di Pisa

A tree interpretation... • XML document exploration  Tree navigation • XML document search  Labeled subpath searches Subset of XPath [W3C] Paolo Ferragina, Università di Pisa

The Problem We wish to devise acompressed representationfor a labeled tree T that efficiently supports some operations: • Navigational operations: parent(u), child(u, i), child(u, i, c) • Subpath searches:given a sequence P of k labels • Content searches: subpath + substring search • Visualization operation: given a node, visualize its descending subtree • XML-aware compressors (like XMill, XmlPpm, ScmPpm,...) • need the wholedecompression XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage • XML-queriable compressors (like XPress, XGrind, XQzip,...) • poorcompression and scan of the whole (compressed) file • Summary indexes (like Dataguide, 1-index or 2-index) • largespace and do not support “content” searches • Theoreticallydo exist many solutions, starting from [Jacobson, IEEE Focs ’89] • nosubpath/content searches, and poorperformance on labeled trees Paolo Ferragina, Università di Pisa

A transform for “labeled trees”[Ferragina et al, IEEE Focs ’05] • We proposed the XBW-transform that mimics on trees the nice structural properties of the Burrows-and-Wheeler Trasform on strings (do you know bzip !?). • The XBWlinearizes the tree T in 2 arrays s.t.: • the compression of T reduces to use any k-th order entropy compressor (gzip, bzip,...) over these two arrays • the indexing of T reduces to implement simple rank/select query operations over these two arrays Paolo Ferragina, Università di Pisa

a c C B B D b A a D c D a b c Permutation of tree nodes upward labeled paths The XBW-Transform Sa Sp C B D c a c A b a D c B D b a e C B C D B C D B C B C C A C A C A C D A C C B C D B C B C Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path Paolo Ferragina, Università di Pisa

a c C B B D b A a D c D a b c upward labeled paths The XBW-Transform Sa Sp C b a D D c D a B A B c c a b e A C A C A C B C B C B C B C C C C D A C D B C D B C D B C Step 2. Stably sort according to Sp Paolo Ferragina, Università di Pisa

a D B A B a c C b D c D a c b XBW The XBW-Transform Sp Slast Sa 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 C b a D D c D a B A B c c a b e A C A C A C B C B C B C B C C C C D A C D B C D B C D B C XBW can be built and inverted in optimal O(t) time Key fact Nodes correspond to items in <Slast,Sa> Step 3. Add a binary array Slast marking the rows corresponding to last children XBW takes optimal t log |S| + 2t bits Paolo Ferragina, Università di Pisa

Tags, Attributes and symbol = Pcdata XBzip – a simple XML compressor • XBW is compressible: • Sa and Spcdata are locally homogeneous • Slast has some structure Paolo Ferragina, Università di Pisa

XBzip = XBW + PPMd String compressors are not so bad: within 5% Paolo Ferragina, Università di Pisa

c c a b a c B D B A D B D a b C D b C a A B D c B c b a D c a XBW Some structural properties Sp Slast Sa 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 C b a D D c D a B A B c c a b e A C A C A C B C B C B C B C C C C D A C D B C D B C D B C • Two useful properties: • Children are contiguous and delimited by 1s • Children reflect the order of their parents Paolo Ferragina, Università di Pisa

a a c B b a D c B A B D c a b C a D c D D b b C A B D c c a Get_children XBW XBW is navigational C A 2 B 5 C 9 D 12 Sp Slast Sa 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 C b a D D c D a B A B c c a b e A C A C A C B C B C B C B C C C C D A C D B C D B C D B C Select in Slast the 2° item 1 from here... Rank(B,Sa)=2 • XBW is navigational: • Rank-Select data structures on Slast and Sa • The array C of |S| integers Paolo Ferragina, Università di Pisa

P[i+1] C B A B D D c c b a D c a b a lr fr fr lr Rows whose Sp starts with ‘B’ XBW-index XBW is searchable (count subpaths) C A 2 B 5 C 9 D 12 Sp Slast Sa P = B D e A C A C A C B C B C B C B C C C C D A C D B C D B C D B C 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 C b a D D c D a B A B c c a b Their children have upward path = ‘D B’ • Inductive step: • Pick the next char in P[i+1], i.e. ‘D’ • Search for the first and last ‘D’ in Sa[fr,lr]  Jump to their children • XBW is searchable: • Rank-Select data structures on Slast and Sa • Array C of |S| integers 2 occurrences of P because of two 1s Paolo Ferragina, Università di Pisa

XBzipIndex:XBW + FM-index DBLP: 1.75 bytes/node,Pathways: 0.31 bytes/node,News: 3.91bytes/node Upto 36% improvement in compression ratio Query (counting) time  8 ms, Navigation time  3 ms Paolo Ferragina, Università di Pisa

This is a powerfulparadigm to design compressed indexes: Transform the input in few arrays (via BWT or XBW) Index (+ Compress) the arrays to support rank/select ops http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1) Experimental: Wea ’06(2) The overall picture on Compressed Indexing... Data type Indexing [Kosaraju, Focs ‘89] Strong connection Compressed Indexing [IEEE Focs ’00] [J. ACM ’05] [IEEE Focs ’05] [WWW ’06] Paolo Ferragina, Università di Pisa

Compressing and Searching XML Data Via Two Zips