The Future of Web Search, Barcelona, May 2006
XML Compression and Indexing
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]
Under patenting by Pisa-Rutgers Univ.
Compressed Permuterm Index
Paolo Ferragina, Rossano Venturini, Dipartimento di Informatica, Università di Pisa
Under Y!-patenting
A basic problem
Given a dictionary D of variable-length strings, design a compressed data structure that supports:
• string ↔ id
• Prefix(a): find all strings in D that are prefixed by a
• Suffix(b): find all strings in D that are suffixed by b
• Substring(g): find all strings in D that contain g
• PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
This is the Tolerant Retrieval Problem (wildcards) of the IR book by Manning-Raghavan-Schütze:
Prefix(a) = a*, Suffix(b) = *b, Substring(g) = *g*, PrefixSuffix(a,b) = a*b
A basic problem: Hashing
• Supports exact searches only, not the prefix, suffix, or substring queries
A basic problem: (Compacted) Trie
• Two versions, one for D and one for the reversed dictionary D^R, then intersect the answers
• No substring search (unless using a Suffix Trie)
• Need to store D for resolving edge-labels
A basic problem: Front coding...
Front-coding
Sorted URLs (uncompressed):
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...
Front-coded (length of the prefix shared with the previous string, then the remaining suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...
On the uk-2002 crawl (≈250Mb): Front-coding ≈ 30-35%, bzip ≈ 10%. We will be back on this later on!
• Two versions: for D and for the reversed dictionary D^R, then intersect the answers
• Need some extra data structures for bucket identification
• No substring search
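The front-coding scheme in the URL listing can be illustrated with a minimal sketch. It assumes the input is already sorted and ignores the bucketing that a real implementation would need (every string here is coded against its predecessor). The function names `lcp`, `front_encode`, and `front_decode` are illustrative, not taken from the slides.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings):
    """Code each string as (shared-prefix length with predecessor, remaining suffix)."""
    pairs, prev = [], ""
    for s in sorted_strings:
        k = lcp(prev, s)
        pairs.append((k, s[k:]))
        prev = s
    return pairs


def front_decode(pairs):
    """Rebuild the strings by copying k chars from the previously decoded string."""
    out, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out


urls = [
    "http://checkmate.com/All_Natural/",
    "http://checkmate.com/All_Natural/Applied.html",
    "http://checkmate.com/All_Natural/Aroma.html",
]
encoded = front_encode(urls)
```

Decoding is inherently sequential within a run of strings, which is why the slides mention extra data structures for bucket identification: restarting the code with a 0-length prefix at regular intervals bounds how far back a lookup has to scan.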
A basic problem: Permuterm Index (Garfield, 76)
Given a dictionary D of variable-length strings, compress them in a way that efficiently supports the queries above.
• Reduce any query to a "prefix query" over a larger dictionary
Permuterm Index [Garfield, 1976]
• Take a dictionary D = {yahoo, google}
• Append a special char $ to the end of each string
• Generate all rotations of these strings
Permuterm Dictionary P[D]:
yahoo$, ahoo$y, hoo$ya, oo$yah, o$yaho, $yahoo
google$, oogle$g, ogle$go, gle$goo, le$goog, e$googl, $google
Any query on D reduces to a prefix-query on P[D]:
Prefix(ya) = Prefix($ya)
Suffix(oo) = Prefix(oo$)
Substring(oo) = Prefix(oo)
PrefixSuffix(y,o) = Prefix(o$y)
Drawback: space problems
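As a rough illustration of the reduction, the sketch below builds the uncompressed permuterm dictionary as a sorted Python list and answers every query with one prefix lookup by binary search. The names (`permuterm`, `prefix_query`, and the four wrappers) are hypothetical helpers, and the sorted list stands in for whatever prefix-search structure one would really use.

```python
from bisect import bisect_left


def permuterm(dictionary):
    """All rotations of term+'$', each paired with its originating term, sorted."""
    rows = []
    for term in dictionary:
        t = term + "$"
        for i in range(len(t)):
            rows.append((t[i:] + t[:i], term))
    rows.sort()
    return rows


def prefix_query(rows, q):
    """Terms owning at least one rotation that starts with q."""
    keys = [rotation for rotation, _ in rows]
    i = bisect_left(keys, q)          # first rotation >= q
    found = set()
    while i < len(rows) and rows[i][0].startswith(q):
        found.add(rows[i][1])
        i += 1
    return found


# The four reductions listed on the slide.
def prefix(rows, a):
    return prefix_query(rows, "$" + a)


def suffix(rows, b):
    return prefix_query(rows, b + "$")


def substring(rows, g):
    return prefix_query(rows, g)


def prefix_suffix(rows, a, b):
    return prefix_query(rows, b + "$" + a)


rows = permuterm(["yahoo", "google"])
```

The list holds one entry per character of every term (plus one for each `$`), which is exactly the space blow-up the slide points out.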
SIGIR '07
Compressed Permuterm Index
It deploys two ingredients:
• Permuterm index
• Compressed full-text index
Theoretically:
• Query ops take optimal time: proportional to pattern length
• Space occupancy is |D| Hk(D) + o(|D| log |Σ|) bits
Technically:
• A simple reduction step: from the Permuterm index to a compressed index
• Re-use known machinery on compressed indexes
• Achieve bzip-compression at Front-coding speed
The Burrows-Wheeler Transform (1994)
Take the text T = mississippi# and write all its rotations:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (F = first column, L = last column):
F            L
#mississipp  i
i#mississip  p
ippi#missis  s
issippi#mis  s
ississippi#  m
mississippi  #
pi#mississi  p
ppi#mississ  i
sippi#missi  s
sissippi#mi  s
ssippi#miss  i
ssissippi#m  i
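The construction above can be reproduced with a naive sketch that materialises and sorts all rotations. This takes quadratic space and is only meant to illustrate the definition; practical implementations build the transform from a suffix array instead. It relies on `#` sorting before the letters, which holds for Python's default string order.

```python
def bwt(text):
    """Return the (F, L) columns of the sorted rotation matrix of text.

    Assumes text ends with a unique terminator that sorts before every
    other symbol ('#' does so in ASCII order).
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


F, L = bwt("mississippi#")
```

F is simply the sorted text, while L groups together the symbols that precede similar contexts, which is what makes it compressible.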
Compressing L is effective: L is highly compressible
Key observation:
• L is locally homogeneous
• Bzip vs. Gzip: 20% vs. 33%, but Bzip is slower in (de)compression!
The FM-index [Ferragina-Manzini, JACM '05]
The main idea is to reduce substring search to some basic operations over arrays of symbols. The survey of Navarro-Mäkinen contains many other indexes.
The result:
• Count(P): O(p) time
• Locate(P): O(occ * polylog(|T|)) time
• Display(T[i,i+L]): O(L + polylog(|T|)) time
• Space occupancy: |T| Hk(T) + o(|T| log |Σ|) bits
New concept: the FM-index is an opportunistic data structure
The Compressed Permuterm index builds upon the best two features of the FM-index
First ingredient: L → F mapping
[Figure: sorted rotation matrix of mississippi#, with columns F and L]
How do we map L's chars onto F's chars? We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward: they keep the same relative order!!
First ingredient: L → F mapping
[Figure: sorted rotation matrix of mississippi#, with columns F and L]
The oracle: Rank(s, 9) = 3
The FM-index is actually a Rank data structure over the BWT: O(1) time and Hk-space
Second ingredient: Backward step
[Figure: sorted rotation matrix of mississippi#, with columns F and L]
Backward step(i): return LF[i], in O(1) time
T is scanned backward by using the LF-mapping
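The two ingredients can be tried out on the BWT of mississippi#. In this sketch `rank` is a naive linear scan standing in for the O(1)-time Rank oracle, rows are 0-based in the code (so the slide's row 9 is index 8), and the helper names `rank`, `build_lf`, and `invert_bwt` are illustrative. The LF-mapping sends the k-th occurrence of a symbol in L to the k-th occurrence of the same symbol in F.

```python
from collections import Counter


def rank(last, ch, i):
    """Number of occurrences of ch in the first i symbols of L (naive scan)."""
    return last[:i].count(ch)


def build_lf(last):
    """LF[i] = row whose first symbol is the very occurrence stored at L[i]."""
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):            # '#' sorts first in ASCII order
        c_table[ch] = total              # symbols strictly smaller than ch
        total += counts[ch]
    return [c_table[ch] + rank(last, ch, i) for i, ch in enumerate(last)]


def invert_bwt(last, terminator="#"):
    """Scan T backward: start from the row equal to T and follow LF."""
    lf = build_lf(last)
    row = last.index(terminator)         # the row of T itself ends with '#'
    out = []
    for _ in range(len(last)):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))


L = "ipssm#pissii"
lf = build_lf(L)
text = invert_bwt(L)
```

Each loop iteration is one backward step: it emits the symbol preceding the current position in T and jumps to the row that starts with that symbol.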
Third ingredient: substring search
[Figure: sorted rotation matrix of mississippi#, with L = i p s s m # p i s s i i]
P = si
Count(P[1,p]): finds <fr,lr> in O(p) time
occ = lr - fr + 1 = 2
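A minimal sketch of Count by backward search follows. It uses a naive scan in place of the constant-time Rank oracle and a half-open row range [fr, lr) instead of the slide's inclusive <fr,lr>, so the number of occurrences is lr - fr here. The function name `count` and the variable names are assumptions made for the example.

```python
def count(last, pattern):
    """Number of occurrences of pattern in the text whose BWT is last."""
    c_table, total = {}, 0
    for ch in sorted(set(last)):         # '#' sorts first in ASCII order
        c_table[ch] = total              # symbols strictly smaller than ch
        total += last.count(ch)

    fr, lr = 0, len(last)                # all rows match the empty pattern
    for ch in reversed(pattern):         # read P right to left
        if ch not in c_table:
            return 0
        fr = c_table[ch] + last[:fr].count(ch)   # naive Rank(ch, fr)
        lr = c_table[ch] + last[:lr].count(ch)   # naive Rank(ch, lr)
        if fr >= lr:
            return 0
    return lr - fr


L = "ipssm#pissii"                       # BWT of mississippi#
occ = count(L, "si")
```

The loop performs one pair of Rank calls per pattern symbol, which is where the O(p) bound on the slide comes from once Rank takes constant time.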
The Compressed Permuterm
Take the dictionary strings lexicographically sorted and concatenate them: Z = $hat$hip$hop$hot$#
Build the FM-index of Z to support substring searches
Some queries are trivial...
Prefix(a) = Substring search($a) within Z
Suffix(b) = Substring search(b$) within Z
Substr(g) = Substring search(g) within Z
PrefixSuffix search
Key property: the last char of s_i is at L[i+1]
Cyclic-LF[i]: if (i > #D) return LF[i], else return LF[i+1], where #D is the number of strings in D
[Figure: for i=3, LF[3] compared with CLF[3]]
PrefixSuffix(P): search the FM-index of Z using Cyclic-LF instead of LF
[Figure: PrefixSuffix(ho,p), where the rows prefixed by $ho are extended with CLF rather than LF]
No change in the time/space bounds of compressed indexes
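The jump behind Cyclic-LF can be sketched on Z = $hat$hip$hop$hot$#. The code below is a simplified illustration, not the authors' implementation: it sorts rotations naively, uses a linear-scan Rank, and applies the jump once, as a shift of the whole row range by one position right after "$a" has been matched. It assumes that `$` sorts before every letter and `#` after every letter, which is the ordering under which the last char of s_i lands at L[i+1]; rows are 0-based and ranges half-open in the code.

```python
def char_key(ch):
    """Assumed symbol order: '$' smallest, '#' largest."""
    if ch == "$":
        return -1
    if ch == "#":
        return 0x110000                  # above every Unicode code point
    return ord(ch)


def build_index(dictionary):
    """Return (L, C) for Z = $s1$s2$...$sm$# over the sorted dictionary."""
    z = "$" + "$".join(sorted(dictionary)) + "$#"
    rotations = sorted((z[i:] + z[:i] for i in range(len(z))),
                       key=lambda row: [char_key(c) for c in row])
    last = "".join(row[-1] for row in rotations)
    c_table, total = {}, 0
    for ch in sorted(set(z), key=char_key):
        c_table[ch] = total              # symbols strictly smaller than ch
        total += z.count(ch)
    return last, c_table


def backward_search(last, c_table, pattern, fr, lr):
    """Refine the row range [fr, lr) by reading pattern right to left."""
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0, 0
        fr = c_table[ch] + last[:fr].count(ch)   # naive Rank(ch, fr)
        lr = c_table[ch] + last[:lr].count(ch)   # naive Rank(ch, lr)
    return fr, lr


def prefix_suffix_count(last, c_table, a, b):
    """Number of dictionary strings that start with a and end with b."""
    fr, lr = backward_search(last, c_table, "$" + a, 0, len(last))
    if a:
        # Rows [fr, lr) start with $s_i for the strings prefixed by a;
        # the last chars of those strings sit one row below in L.
        fr, lr = fr + 1, lr + 1
    fr, lr = backward_search(last, c_table, b, fr, lr)
    return lr - fr


last, c_table = build_index(["hat", "hip", "hop", "hot"])
matches = prefix_suffix_count(last, c_table, "ho", "p")
```

With an empty prefix the range covers all rows starting with `$`, whose L entries already include the last char of every string, so no shift is applied in that case.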
Rank and Select of strings
Z = $hat$hip$hop$hot$#
Other queries...
Rank(s) = row of $s$
Select(i) = scan backward from L[i+1]
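A small sketch of the two operations, under the same assumptions as before: naive rotation sorting, linear-scan Rank, `$` smallest and `#` largest. Identifiers are 0-based positions in the sorted dictionary, and `build`, `rank_string`, and `select_string` are hypothetical names introduced for the example.

```python
def char_key(ch):
    """Assumed symbol order: '$' smallest, '#' largest."""
    if ch == "$":
        return -1
    if ch == "#":
        return 0x110000
    return ord(ch)


def build(dictionary):
    """Return (L, C, LF) for Z = $s1$s2$...$sm$# over the sorted dictionary."""
    z = "$" + "$".join(sorted(dictionary)) + "$#"
    rotations = sorted((z[i:] + z[:i] for i in range(len(z))),
                       key=lambda row: [char_key(c) for c in row])
    last = "".join(row[-1] for row in rotations)
    c_table, total = {}, 0
    for ch in sorted(set(z), key=char_key):
        c_table[ch] = total
        total += z.count(ch)
    lf = [c_table[ch] + last[:i].count(ch) for i, ch in enumerate(last)]
    return last, c_table, lf


def rank_string(last, c_table, s):
    """Identifier of s: the row prefixed by $s$, or None if s is not in D."""
    fr, lr = 0, len(last)
    for ch in reversed("$" + s + "$"):
        if ch not in c_table:
            return None
        fr = c_table[ch] + last[:fr].count(ch)
        lr = c_table[ch] + last[:lr].count(ch)
        if fr >= lr:
            return None
    return fr


def select_string(last, lf, i):
    """Spell out the i-th string by scanning backward from L[i+1] up to '$'."""
    row = i + 1
    chars = []
    while last[row] != "$":
        chars.append(last[row])
        row = lf[row]
    return "".join(reversed(chars))


last, c_table, lf = build(["hat", "hip", "hop", "hot"])
```

Both operations work on the index alone: the dictionary itself is never consulted after construction.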
Experiments
Three dictionaries:
• Term dictionary: Trec WT10G
• Host dictionary (reversed): UK-2005
• Url dictionary (host reversed): first 190Mb of UK-2005
PrefixSuffix search needs *2
Choose your trade-off: a test on URLs
The MRS book says: "one disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term". Now, they mention the CPI.
[Plot: space (% dict-size) against query time]
Trade-off:
• Time of 20-60 msec/char, and space close to bzip
• Time close to Front-Coding (4 msec/char), but <50% of its space
We proposed an approach for dictionary storage:
+ Theory: optimal time and entropy-bounds for space
+ Practice: trades time vs. space, thus fitting user needs