XML Compression and Indexing

The Future of Web Search, Barcelona, May 2006
XML Compression and Indexing
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan]
Under patenting by Pisa-Rutgers Univ.
Paolo Ferragina, Università di Pisa

Compressed Permuterm Index
Paolo Ferragina, Rossano Venturini, Dipartimento di Informatica, Università di Pisa
Under Y!-patenting

A basic problem
Given a dictionary D of strings, having variable length, design a compressed data structure that supports:
• string ↔ id
• Prefix(a): find all strings in D that are prefixed by a
• Suffix(b): find all strings in D that are suffixed by b
• Substring(g): find all strings in D that contain g
• PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
IR book of Manning-Raghavan-Schutze: Tolerant Retrieval Problem (wildcards)
• Prefix(a) = a*
• Suffix(b) = *b
• Substring(g) = *g*
• PrefixSuffix(a,b) = a*b

A basic problem (same problem statement), candidate solution:
• Hashing
• No support for non-exact searches

A basic problem (same problem statement), candidate solution:
• (Compacted) Trie
• Two versions: one for D and one for D^R (the reversed strings) + Intersect answers
• No substring search (unless using a Suffix Trie)
• Need to store D for resolving edge-labels

A basic problem (same problem statement), candidate solution:
• Front coding...

Front coding (example)
Front-coded list (length of the prefix shared with the previous string, then the remaining suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...
Original (sorted) URLs:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...
• Front-coding on the uk-2002 crawl (≈250Mb): 30-35%; bzip ≈ 10%. Be back on this, later on!
• Two versions: one for D and one for D^R + Intersect answers
• Need some extra data structures for bucket identification
• No substring search
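The front-coding scheme shown on this slide can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the function names `front_encode` and `front_decode` are hypothetical, and bucketing (restarting with a full string every so often, as the `0 ...` entries do) is left out.

```python
def front_encode(sorted_strings):
    """Front-code a sorted list as (shared-prefix length, remaining suffix) pairs."""
    encoded, prev = [], ""
    for s in sorted_strings:
        lcp = 0  # length of the prefix shared with the previous string
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the strings by copying lcp chars from the previously decoded string."""
    decoded, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        decoded.append(prev)
    return decoded


urls = [
    "http://checkmate.com/All_Natural/",
    "http://checkmate.com/All_Natural/Applied.html",
    "http://checkmate.com/All_Natural/Aroma.html",
    "http://checkmate.com/All_Natural/Aroma1.html",
]
codes = front_encode(urls)
# codes[1:] reproduces the slide's entries: 33 Applied.html, 34 roma.html, 38 1.html
```

Decoding is sequential within a bucket, which is why the slide lists extra data structures for bucket identification as a requirement.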

A basic problem (same problem statement: compress the strings in a way that efficiently supports the queries above), candidate solution:
• Permuterm Index (Garfield, 76)
• Reduce any query to a “prefix query” over a larger dictionary

Permuterm Index [Garfield, 1976]
• Take a dictionary D = {yahoo, google}
• Append a special char $ to the end of each string
• Generate all rotations of these strings
Permuterm Dictionary P[D]:
• yahoo$
• ahoo$y
• hoo$ya
• oo$yah
• o$yaho
• $yahoo
• google$
• oogle$g
• ogle$go
• gle$goo
• le$goog
• e$googl
• $google
Any query on D reduces to a prefix-query on P[D]:
• Prefix(ya) = Prefix($ya)
• Suffix(oo) = Prefix(oo$)
• Substring(oo) = Prefix(oo)
• PrefixSuffix(y,o) = Prefix(o$y)
Drawback: space problems
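The reduction on this slide can be tried out with an uncompressed permuterm dictionary. The sketch below is only an illustration under simple assumptions: rotations are stored explicitly in a sorted Python list (exactly the space problem the slide points out), and the names `build_permuterm` and `prefix_query` are hypothetical.

```python
from bisect import bisect_left


def build_permuterm(dictionary):
    """Return the sorted list of (rotation of word+'$', word) pairs."""
    rotations = []
    for word in dictionary:
        t = word + "$"
        for i in range(len(t)):
            rotations.append((t[i:] + t[:i], word))
    rotations.sort()
    return rotations


def prefix_query(permuterm, pattern):
    """All dictionary words having a rotation that starts with pattern."""
    keys = [rot for rot, _ in permuterm]  # rebuilt per query; fine for a sketch
    pos = bisect_left(keys, pattern)      # first rotation >= pattern
    hits = set()
    while pos < len(keys) and keys[pos].startswith(pattern):
        hits.add(permuterm[pos][1])
        pos += 1
    return sorted(hits)


P = build_permuterm(["yahoo", "google"])

def prefix(a):           return prefix_query(P, "$" + a)
def suffix(b):           return prefix_query(P, b + "$")
def substring(g):        return prefix_query(P, g)
def prefix_suffix(a, b): return prefix_query(P, b + "$" + a)
```

For example, `prefix_suffix("y", "o")` looks up the rotations starting with `o$y`, as in the slide's last reduction.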

Compressed Permuterm Index (SIGIR ‘07)
It deploys two ingredients:
• Permuterm index
• Compressed full-text index
Theoretically:
• Query ops take optimal time: proportional to pattern length
• Space occupancy is |D| H_k(D) + o(|D| log |Σ|) bits
Technically: a simple reduction step, Permuterm → Compressed index
• Re-use known machinery on compressed indexes
• Achieve bzip-compression at Front-coding speed

The Burrows-Wheeler Transform (1994)
Take the text T = mississippi#
All rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (F = first column, L = last column):
F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
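The construction on this slide (list all rotations, sort them, read the first and last columns) can be sketched directly. This naive version assumes that `#` sorts before every letter, which holds for Python's default string order, and it is only meant for small examples; the function name `bwt` is a hypothetical choice.

```python
def bwt(text):
    """Return (F, L): first and last columns of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


F, L = bwt("mississippi#")
# F is just the sorted characters of T; L is the transform that gets compressed.
```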

L is highly compressible
Compressing L is effective. Key observation:
• L is locally homogeneous
• Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression!

The FM-index [Ferragina-Manzini, JACM ‘05]
The main idea is to reduce substring search to some basic operations over arrays of symbols. (The survey of Navarro-Makinen contains many other indexes.)
The result:
• Count(P): O(p) time
• Locate(P): O(occ * polylog(|T|)) time
• Display(T[i,i+L]): O(L + polylog(|T|)) time
• Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits
New concept: the FM-index is an opportunistic data structure.
The Compressed Permuterm index builds upon the best two features of the FM-index.

First ingredient: L → F mapping
[Figure: the sorted rotation matrix of mississippi#, with F = # i i i i m p p s s s s, L = i p s s m # p i s s i i, and the middle columns unknown]
How do we map L’s chars onto F’s chars? We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward: they keep the same relative order!

First ingredient: L → F mapping
[Figure: the same F/L matrix with numbered rows, the middle columns unknown]
The oracle: Rank(s, 9) = 3
The FM-index is actually a Rank data structure over the BWT: O(1) time and H_k space

Second ingredient: Backward step
[Figure: the same F/L matrix, the middle columns unknown, with LF arrows stepping through ...s s i...]
Backward step(i): return LF[i], in O(1) time
T is scanned backward by using the LF-mapping
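The first two ingredients can be sketched together on the slide's example. Row indices are 1-based as on the slides; `rank` is computed here by scanning `L`, whereas the FM-index answers it in O(1) time over a compressed representation. The helper names (`rank`, `LF`, `invert_bwt`) and the table `C` (for each char, how many chars of `L` are smaller) are illustrative assumptions.

```python
from collections import Counter

L = "ipssm#pissii"  # last column of the sorted rotations of mississippi#


def rank(c, i):
    """Number of occurrences of c in L[1..i] (1-based, inclusive)."""
    return L[:i].count(c)


# C[c] = number of chars in L that are smaller than c
C, total = {}, 0
for ch, freq in sorted(Counter(L).items()):
    C[ch] = total
    total += freq


def LF(i):
    """Row whose first char (in F) is the same text char as L[i]."""
    c = L[i - 1]
    return C[c] + rank(c, i)


def invert_bwt():
    """Scan T backward with LF, starting from the row whose last char is '#'."""
    i = L.index("#") + 1  # this row is T itself, so L[i] is T's last char
    out = []
    for _ in range(len(L)):
        out.append(L[i - 1])
        i = LF(i)
    return "".join(reversed(out))
```

Here `rank("s", 9)` is the oracle call shown on the slide, and `LF(9)` maps the third `s` of `L` onto the third `s` of `F`.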

Third ingredient: substring search
[Figure: the sorted rotations of mississippi# with L = i p s s m # p i s s i i; for P = si the rows fr..lr are highlighted]
Count(P[1,p]): finds <fr,lr> in O(p) time
Example: P = si gives occ = 2 [lr-fr+1]
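Count(P) can be sketched as a backward search that narrows the row range <fr,lr> one pattern char at a time, from the last char to the first. As before this is a naive illustration: ranks are computed by scanning `L`, rows are 1-based, and the function names are hypothetical.

```python
def bwt_last_column(text):
    """L column of the sorted rotation matrix ('#' sorts before letters)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def count(pattern, L):
    """Return (fr, lr), the rows prefixed by pattern, or None if it does not occur."""
    C = {c: sum(1 for x in L if x < c) for c in set(L)}  # chars smaller than c
    fr, lr = 1, len(L)
    for c in reversed(pattern):
        if c not in C:
            return None
        fr = C[c] + L[:fr - 1].count(c) + 1  # first row starting with c + current match
        lr = C[c] + L[:lr].count(c)          # last such row
        if fr > lr:
            return None
    return fr, lr


L = bwt_last_column("mississippi#")
fr, lr = count("si", L)
occ = lr - fr + 1
```

The loop runs once per pattern char, which is where the O(p) bound on the slide comes from once rank takes O(1) time.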

The Compressed Permuterm
Z = $hat$hip$hop$hot$# (the dictionary strings are lexicographically sorted)
Build an FM-index on Z to support substring searches. Some queries are trivial...
• Prefix(a) = Substring search($a) within Z
• Suffix(b) = Substring search(b$) within Z
• Substr(g) = Substring search(g) within Z

PrefixSuffix search
Key property: the last char of s_i is at L[i+1]
Cyclic-LF[i]: if (i > #D) return LF[i], else return LF[i+1]
[Figure: the sorted rotations of Z, the middle columns unknown, comparing LF[3] and CLF[3] for i=3]

PrefixSuffix(P): search the FM-index of Z using Cyclic-LF instead of LF
[Figure: PrefixSuffix(ho,p) on the sorted rotations of Z, the middle columns unknown; from the rows prefixed by $ho, the LF and CLF steps are compared]
No change in the time/space bounds of compressed indexes
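A naive sketch of PrefixSuffix with Cyclic-LF on the slide's example follows. It rests on an assumption the slides do not state explicitly: the alphabet order is taken to be `$` < letters < `#`, which is the order under which rows 1..#D start with `$s_i` and L[i+1] holds the last char of s_i. The names `order`, `build_index` and `prefix_suffix` are hypothetical, ranks are computed by scanning, and matches are decoded from stored rotation offsets where a real FM-index would use Locate/Display.

```python
def order(c):
    """Assumed alphabet order: '$' smallest, '#' largest, letters in between."""
    return 0 if c == "$" else 10**6 if c == "#" else ord(c)


def build_index(dictionary):
    words = sorted(dictionary)
    Z = "$" + "$".join(words) + "$#"
    # starts[r-1] = offset in Z of the rotation in row r (rows are 1-based)
    starts = sorted(range(len(Z)), key=lambda i: [order(c) for c in Z[i:] + Z[:i]])
    L = "".join(Z[i - 1] for i in starts)  # char cyclically preceding each row
    C = {c: sum(1 for x in L if order(x) < order(c)) for c in set(L)}
    return words, Z, starts, L, C


def prefix_suffix(alpha, beta, index):
    """Words that start with alpha and end with beta (searches beta$alpha)."""
    words, Z, starts, L, C = index
    m = len(words)
    fr, lr = 1, len(L)
    for c in reversed(beta + "$" + alpha):
        if c not in C:
            return []
        if lr <= m:
            # Cyclic-LF: rows fr..lr start with $s_i, so shift by one to make
            # L[fr..lr] hold the last chars of those same strings.
            fr, lr = fr + 1, lr + 1
        fr, lr = C[c] + L[:fr - 1].count(c) + 1, C[c] + L[:lr].count(c)
        if fr > lr:
            return []
    # Decode each matching row: count the '$' up to its offset to find the word.
    return sorted({words[Z[:starts[r - 1] + 1].count("$") - 1]
                   for r in range(fr, lr + 1)})


index = build_index(["hat", "hip", "hop", "hot"])
```

On this example, dropping the shift (plain LF) would return the rows for `p$hop` and `p$hot`, that is hip and hop, instead of hop alone.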

Rank and Select of strings
Z = $hat$hip$hop$hot$#
[Figure: the sorted rotations of Z, the middle columns unknown]
Other queries...
• Rank(s) = row of $s$
• Select(i) = read backward from L[i+1]
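The two string operations can be sketched on the same example. The alphabet order `$` < letters < `#` is again an assumption chosen so that row i starts with `$s_i` and L[i+1] holds the last char of s_i; `rank_string` and `select_string` are hypothetical names, and ranks over `L` are computed by scanning.

```python
def order(c):
    """Assumed alphabet order: '$' smallest, '#' largest, letters in between."""
    return 0 if c == "$" else 10**6 if c == "#" else ord(c)


def build_L(dictionary):
    Z = "$" + "$".join(sorted(dictionary)) + "$#"
    starts = sorted(range(len(Z)), key=lambda i: [order(c) for c in Z[i:] + Z[:i]])
    return "".join(Z[i - 1] for i in starts)


L = build_L(["hat", "hip", "hop", "hot"])
C = {c: sum(1 for x in L if order(x) < order(c)) for c in set(L)}


def LF(i):
    c = L[i - 1]
    return C[c] + L[:i].count(c)


def select_string(i):
    """Select(i): read s_i backward, starting from L[i+1], until '$' is met."""
    chars, row = [], i + 1
    while L[row - 1] != "$":
        chars.append(L[row - 1])
        row = LF(row)
    return "".join(reversed(chars))


def rank_string(s):
    """Rank(s): row of $s$, found by a plain backward search; None if s is not in D."""
    fr, lr = 1, len(L)
    for c in reversed("$" + s + "$"):
        if c not in C:
            return None
        fr, lr = C[c] + L[:fr - 1].count(c) + 1, C[c] + L[:lr].count(c)
        if fr > lr:
            return None
    return fr
```

Together the two functions give the string ↔ id mapping required by the problem statement.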

Experiments
Three dictionaries:
• Term dictionary: Trec WT10G
• Host dictionary (reversed): UK-2005
• Url dictionary (host reversed): first 190Mb of UK-2005
PrefixSuffix search needs *2


Choose your trade-off: a test on URLs
[Figure: plot with axis “% dict-size”]
The MRS book says: “one disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term”. Now, they mention the CPI.
Trade-off:
• Time of 20-60 msec/char, and space close to bzip
• Time close to Front-Coding (4 msec/char), but <50% of its space

We proposed an approach for dictionary storage:
+ Theory: optimal time and entropy-bounds for space
+ Practice: trades time vs space, thus fitting user needs
