1 / 16

Compressing and Searching XML Data Via Two Zips

Compressing and Searching XML Data Via Two Zips. Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan]. Opportunistic Data Structures with Applications P. Ferragina, G. Manzini. Survey by Navarro-Makinen cites more

huey
Download Presentation

Compressing and Searching XML Data Via Two Zips

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Compressing and Searching XML Data Via Two Zips Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan] Paolo Ferragina, Università di Pisa

  2. Opportunistic Data Structures with Applications P. Ferragina, G. Manzini Survey byNavarro-Makinencites more than50papers on the subject !! Six years ago... [now, J. ACM 05] Paolo Ferragina, Università di Pisa

  3. An XML excerpt <dblp> <book> <author> Donald E. Knuth</author> <title> The TeXbook </title> <publisher> Addison-Wesley </publisher> <year> 1986 </year> </book> <article> <author> Donald E. Knuth </author> <author> Ronald W. Moore </author> <title> An Analysis of Alpha-Beta Pruning </title> <pages> 293-326 </pages> <year> 1975 </year> <volume> 6 </volume> <journal> Artificial Intelligence </journal> </article> ... </dblp> It is verbose ! Paolo Ferragina, Università di Pisa

  4. A tree interpretation... • XML document exploration  Tree navigation • XML document search  Labeled subpath searches Subset of XPath [W3C] Paolo Ferragina, Università di Pisa

  5. The Problem We wish to devise acompressed representationfor a labeled tree T that efficiently supports some operations: • Navigational operations: parent(u), child(u, i), child(u, i, c) • Subpath searches:given a sequence P of k labels • Content searches: subpath + substring search • Visualization operation: given a node, visualize its descending subtree • XML-aware compressors (like XMill, XmlPpm, ScmPpm,...) • need the wholedecompression XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage • XML-queriable compressors (like XPress, XGrind, XQzip,...) • poorcompression and scan of the whole (compressed) file • Summary indexes (like Dataguide, 1-index or 2-index) • largespace and do not support “content” searches • Theoreticallydo exist many solutions, starting from [Jacobson, IEEE Focs ’89] • nosubpath/content searches, and poorperformance on labeled trees Paolo Ferragina, Università di Pisa

  6. A transform for “labeled trees”[Ferragina et al, IEEE Focs ’05] • We proposed the XBW-transform that mimics on trees the nice structural properties of the Burrows-and-Wheeler Trasform on strings (do you know bzip !?). • The XBWlinearizes the tree T in 2 arrays s.t.: • the compression of T reduces to use any k-th order entropy compressor (gzip, bzip,...) over these two arrays • the indexing of T reduces to implement simple rank/select query operations over these two arrays Paolo Ferragina, Università di Pisa

  7. a c C B B D b A a D c D a b c Permutation of tree nodes upward labeled paths The XBW-Transform Sa Sp C B D c a c A b a D c B D b a e C B C D B C D B C B C C A C A C A C D A C C B C D B C B C Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path Paolo Ferragina, Università di Pisa

  8. a c C B B D b A a D c D a b c upward labeled paths The XBW-Transform Sa Sp C b a D D c D a B A B c c a b e A C A C A C B C B C B C B C C C C D A C D B C D B C D B C Step 2. Stably sort according to Sp Paolo Ferragina, Università di Pisa

  9. a D B A B a c C b D c D a c b XBW The XBW-Transform Sp Slast Sa 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 C b a D D c D a B A B c c a b e A C A C A C B C B C B C B C C C C D A C D B C D B C D B C XBW can be built and inverted in optimal O(t) time Key fact Nodes correspond to items in <Slast,Sa> Step 3. Add a binary array Slast marking the rows corresponding to last children XBW takes optimal t log |S| + 2t bits Paolo Ferragina, Università di Pisa

  10. Tags, Attributes and symbol = Pcdata XBzip – a simple XML compressor • XBW is compressible: • Sa and Spcdata are locally homogeneous • Slast has some structure Paolo Ferragina, Università di Pisa

  11. XBzip = XBW + PPMd String compressors are not so bad: within 5% Paolo Ferragina, Università di Pisa

  12. c c a b a c B D B A D B D a b C D b C a A B D c B c b a D c a XBW Some structural properties Sp Slast Sa 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 C b a D D c D a B A B c c a b e A C A C A C B C B C B C B C C C C D A C D B C D B C D B C • Two useful properties: • Children are contiguous and delimited by 1s • Children reflect the order of their parents Paolo Ferragina, Università di Pisa

  13. a a c B b a D c B A B D c a b C a D c D D b b C A B D c c a Get_children XBW XBW is navigational C A 2 B 5 C 9 D 12 Sp Slast Sa 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 C b a D D c D a B A B c c a b e A C A C A C B C B C B C B C C C C D A C D B C D B C D B C Select in Slast the 2° item 1 from here... Rank(B,Sa)=2 • XBW is navigational: • Rank-Select data structures on Slast and Sa • The array C of |S| integers Paolo Ferragina, Università di Pisa

  14. P[i+1] C B A B D D c c b a D c a b a lr fr fr lr Rows whose Sp starts with ‘B’ XBW-index XBW is searchable (count subpaths) C A 2 B 5 C 9 D 12 Sp Slast Sa P = B D e A C A C A C B C B C B C B C C C C D A C D B C D B C D B C 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 C b a D D c D a B A B c c a b Their children have upward path = ‘D B’ • Inductive step: • Pick the next char in P[i+1], i.e. ‘D’ • Search for the first and last ‘D’ in Sa[fr,lr]  Jump to their children • XBW is searchable: • Rank-Select data structures on Slast and Sa • Array C of |S| integers 2 occurrences of P because of two 1s Paolo Ferragina, Università di Pisa

  15. XBzipIndex:XBW + FM-index DBLP: 1.75 bytes/node,Pathways: 0.31 bytes/node,News: 3.91bytes/node Upto 36% improvement in compression ratio Query (counting) time  8 ms, Navigation time  3 ms Paolo Ferragina, Università di Pisa

  16. This is a powerfulparadigm to design compressed indexes: Transform the input in few arrays (via BWT or XBW) Index (+ Compress) the arrays to support rank/select ops http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1) Experimental: Wea ’06(2) The overall picture on Compressed Indexing... Data type Indexing [Kosaraju, Focs ‘89] Strong connection Compressed Indexing [IEEE Focs ’00] [J. ACM ’05] [IEEE Focs ’05] [WWW ’06] Paolo Ferragina, Università di Pisa

More Related