On Compression and Indexing: two sides of the same coin

Paolo Ferragina
Dipartimento di Informatica, Università di Pisa

What do we mean by “Indexing”?

Types of data: DNA sequences; audio-video files; executables; linguistic or tokenizable text; raw sequences of characters or bytes.
Types of query: word-based query; character-based query; arbitrary substring; complex match.

Two indexing approaches:
• Word-based indexes: here a notion of “word” must be devised!
  • Inverted files, Signature files, Bitmaps.
• Full-text indexes: no constraint on text and queries!
  • Suffix Array, Suffix Tree, String B-tree [Ferragina-Grossi, JACM 99].

What do we mean by “Compression”?

• Compression has two positive effects:
  • Space saving (or, double memory at the same cost)
  • Performance improvement
    • Better use of the memory levels close to the processor
    • Increased disk and memory bandwidth
    • Reduced (mechanical) seek time
• CPU speed nowadays makes (de)compression “costless”!
• Since March 2001 the Memory eXpansion Technology (MXT) has been available on IBM eServers x330MXT
  • Same performance as a PC with double the memory, but at half the cost

Moral: it is more economical to store data in compressed form than uncompressed.

Compression and Indexing: two sides of the same coin!

• Do we witness a paradoxical situation?
  • An index injects redundant data, in order to speed up pattern searches
  • Compression removes redundancy, in order to squeeze the space occupancy
• NO, new results proved a mutually reinforcing behaviour!
  • Better indexes can be designed by exploiting compression techniques (in terms of space occupancy)
  • Better compressors can be designed by exploiting indexing techniques (also in terms of compression ratio)

Moral: CPM researchers must have a multidisciplinary background, ranging from data structure design to data compression, from architectural knowledge to database principles, up to algorithmic engineering and more...

Our journey, today...

Starting points:
• Index design (Weiner ’73)
• Compressor design (Shannon ’48)
• Suffix Array (1990)
• Burrows-Wheeler Transform (1994)

Where they lead:
• Compressed Index
  • Space close to gzip, bzip
  • Query time close to O(|P|)
• Compression Booster: a tool to transform a poor compressor into a better compression algorithm
• Wavelet Tree
• Improved Indexes and Compressors

The Suffix Array [Baeza-Yates-Gonnet, 87 and Manber-Myers, 90]

T = mississippi#        P = si

SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

• Each entry of SA is a suffix pointer; SUF(T) written out explicitly would take Θ(N^2) space
• SA + T occupy Θ(N log_2 N) bits

Prop 1. All suffixes in SUF(T) having prefix P are contiguous.
Prop 2. These suffixes follow P’s lexicographic position.
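The table on this slide can be reproduced in a few lines of Python. This is only a minimal sketch: it sorts the suffixes naively, which is far slower than the constructions in the cited papers, and the function name `suffix_array` is an illustrative choice. Positions are 1-based as on the slide, and `#` sorts before every letter simply because of its ASCII code.

```python
def suffix_array(text):
    """Return the 1-based start positions of the suffixes of text, in lexicographic order."""
    # Naive construction: sort the start positions by the suffix they point to.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


T = "mississippi#"
SA = suffix_array(T)
for pos in SA:
    # Print each suffix pointer next to the suffix it denotes.
    print(pos, T[pos - 1:])
```

The first column printed is SA; the second is SUF(T), which the index never needs to store because T and SA together determine it.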

Searching in Suffix Array [Manber-Myers, 90]

T = mississippi#        P = si
SA = 12 11 8 5 2 1 10 9 7 4 6 3
Final step of the binary search: P = si lands right after ppi#, on the contiguous suffixes sippi# and sissippi#.

• Suffix Array search
  • O(log_2 N) binary-search steps
  • Each step takes O(|P|) char comparisons
  • Overall, O(|P| log_2 N) time
• Improvements
  • O(|P| + log_2 N) time
  • O(|P|/B + log_B N) I/Os [JACM 99]
  • Self-adjusting version on disk [FOCS 02]

The suffix permutation cannot be any permutation of {1,...,N}:
• # binary texts = 2^N << N! = # permutations on {1, 2, ..., N}
• Hence N log_2 N is not a lower bound on the bit space occupancy
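The O(|P| log_2 N) search can be sketched as two binary searches that delimit the contiguous block of suffixes prefixed by P. The code below is an illustration under simple assumptions (naive suffix sorting, 1-based positions, the hypothetical name `sa_search`); it does not implement the O(|P| + log_2 N) refinement.

```python
def sa_search(text, sa, pattern):
    """Return (lo, hi) such that sa[lo:hi] holds exactly the suffixes prefixed by pattern."""
    p = len(pattern)

    def prefix(k):
        # The first |P| characters of the k-th smallest suffix (sa is 1-based).
        start = sa[k] - 1
        return text[start:start + p]

    # First suffix whose |P|-long prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose |P|-long prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


T = "mississippi#"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
lo, hi = sa_search(T, SA, "si")
print(SA[lo:hi])  # text positions where "si" occurs
```

Each binary-search step compares at most |P| characters, which is where the |P| log_2 N factor comes from.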

The Burrows-Wheeler Transform (1994)

Let us be given a text T = mississippi#

All the rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:

F                L
#   mississipp   i
i   #mississip   p
i   ppi#missis   s
i   ssippi#mis   s
i   ssissippi#   m
m   ississippi   #
p   i#mississi   p
p   pi#mississ   i
s   ippi#missi   s
s   issippi#mi   s
s   sippi#miss   i
s   sissippi#m   i
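The sorted-rotation construction on this slide translates directly into code. A minimal sketch, assuming the text ends with a unique terminator `#` that is smaller than every other character; the name `bwt` is illustrative, and materialising all rotations is quadratic in space, so this is for illustration only.

```python
def bwt(text):
    """Return the columns (F, L) of the sorted rotation matrix of text."""
    # Build every rotation of the text, then sort the rows lexicographically.
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rows)   # column F
    last = "".join(row[-1] for row in rows)   # column L, the transform itself
    return first, last


F, L = bwt("mississippi#")
print(F)
print(L)
```

Column F is just the sorted characters of T; all the information is in L.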

Why is L so interesting for compression?

T = mississippi#
(Figure: the sorted rotation matrix of T with columns F and L; the middle columns are unknown.)

A key observation:
• L is locally homogeneous, hence L is highly compressible

Algorithm Bzip:
• Move-to-Front coding of L
• Run-Length coding
• Statistical coder: Arithmetic, Huffman

• Building the BWT: SA construction
• Inverting the BWT: array visit
• ...overall O(N) time...
• Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression!
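To see why local homogeneity helps, here is a small sketch of the Move-to-Front step alone. It is not the bzip implementation: the function names and the choice of the sorted alphabet as the initial table are assumptions made for illustration. Runs of equal characters in L become runs of zeros, which the later run-length and statistical stages exploit.

```python
def mtf_encode(s, alphabet):
    """Replace each char by its current position in the table, then move it to the front."""
    table = list(alphabet)
    codes = []
    for c in s:
        i = table.index(c)
        codes.append(i)
        table.insert(0, table.pop(i))
    return codes


def mtf_decode(codes, alphabet):
    """Invert mtf_encode by replaying the same table updates."""
    table = list(alphabet)
    out = []
    for i in codes:
        c = table.pop(i)
        out.append(c)
        table.insert(0, c)
    return "".join(out)


L = "ipssm#pissii"
alphabet = sorted(set(L))
codes = mtf_encode(L, alphabet)
print(codes)
print(mtf_decode(codes, alphabet) == L)
```

On a text this short the effect is modest; on long texts the BWT groups equal characters into long runs and the MTF output is dominated by small values.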

Suffix Array vs. BW-transform

T = mississippi#

SA    Rotated text    L
12    #mississipp     i
11    i#mississip     p
8     ippi#missis     s
5     issippi#mis     s
2     ississippi#     m
1     mississippi     #
10    pi#mississi     p
9     ppi#mississ     i
7     sippi#missi     s
4     sissippi#mi     s
6     ssippi#miss     i
3     ssissippi#m     i

L includes SA and T. Can we search within L?

A compressed index [Ferragina-Manzini, IEEE FOCS 2000]

Bridging data-structure design and compression techniques:
• Suffix array data structure
• Burrows-Wheeler Transform

The theoretical result:
• Query complexity: O(p + occ log^ε N) time
• Space occupancy: O(N Hk(T)) + o(N) bits, where Hk is the k-th order empirical entropy; this is o(N) if T is compressible
• The index does not depend on k; the bound holds for all k, simultaneously

Related work:
• O(n) space: a plethora of papers
• Hk: Grossi-Gupta-Vitter (03), Sadakane (02), ...
• Now, more than 20 papers with more than 20 authors on related subjects

The corollary is that:
• The Suffix Array is compressible
• It is a self-index

In practice, the index is very appealing:
• Space close to the best known compressors, i.e. bzip
• Query time of a few milliseconds on hundreds of MBs

A useful tool: L-to-F mapping

F = #iiiimppssss        L = ipssm#pissii
(Figure: the sorted rotation matrix of T = mississippi#; only the columns F and L are known, the middle columns are unknown.)

• How do we map L’s chars onto F’s chars?
• ... We need to distinguish equal chars ...

Substring search in T (Count occurrences)

P = si        T = mississippi#
Available info: the column L = ipssm#pissii (the rotated text is unknown) and the array C:

c       #   i   m   p   s
C[c]    0   1   6   7   9

First step: fr and lr delimit the rows prefixed by the last char of P, here “i”.

Inductive step: given fr, lr for P[j+1,p]
• Take c = P[j]
• Find the first occurrence of c in L[fr, lr]
• Find the last occurrence of c in L[fr, lr]
• The L-to-F mapping of these chars gives the new fr, lr

At the end fr and lr delimit the rows prefixed by “si”: occ = lr - fr + 1 = 2.
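The counting procedure can be sketched as follows. This is an illustration under stated assumptions, not the compressed data structure of the paper: L is kept as a plain string, the rank of a character in a prefix of L is computed by a naive scan, rows are 0-based with a half-open range [fr, lr), and `C[c]` is taken to be the number of characters in T strictly smaller than c. The name `fm_count` is hypothetical.

```python
def fm_count(L, pattern):
    """Count the occurrences of pattern in T, given only L = bwt(T)."""
    # C[c] = number of characters of T that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)

    def occ(c, i):
        # Number of occurrences of c in L[:i]; a real index answers this in O(1).
        return L[:i].count(c)

    fr, lr = 0, len(L)                 # all rows, as a half-open range
    for c in reversed(pattern):        # scan P backward
        if c not in C:
            return 0
        fr = C[c] + occ(c, fr)         # L-to-F mapping of the first c in the range
        lr = C[c] + occ(c, lr)         # L-to-F mapping just past the last c
        if fr >= lr:
            return 0
    return lr - fr


print(fm_count("ipssm#pissii", "si"))
```

The loop performs one step per pattern character, which matches the O(p) counting bound once `occ` is supported in constant time.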

Many details are missing...

• The column L is actually kept compressed:
  • Still guarantees O(p) time to count P’s occurrences
• The Locate operation takes O(log^ε N) time
  • Some additional data structure, in o(n) bits
• Efficient and succinct index construction [Hon et al., FOCS 03]
• Bio-application: fit the Human-genome index in a PC [Sadakane et al., 02]

Interesting issues:
• What about arbitrary alphabets? [Grossi et al., 03; Ferragina et al., 04]
• What about disk-aware, or cache-oblivious, or self-adjusting versions?
• What about challenging applications: bio, db, data mining, handheld PC, ...

Where we are...

We investigated the reinforcement relation: Compression ideas → Index design
Let’s now turn to the other direction: Indexing ideas → Compressor design (the Booster)

What do we mean by “boosting”?

It is a technique that takes a poor compressor A and turns it into a compressor with a better performance guarantee.

A memoryless compressor is poor in that it assigns codewords to symbols according only to their frequencies (e.g. Huffman).

It incurs some obvious limitations: it spends the same number of bits on both of these strings, although the first is far more regular:
• T = a^n b^n
• T’ = a random string of length 2n with the same number of ‘a’ and ‘b’

What do we mean by “boosting”?

Scheme: A compresses T into c; the Booster, built around A, compresses T into c’.

Qualitatively, we would like to achieve:
• c’ is shorter than c, if T is compressible
• Time(A_boost) = O(Time(A)), i.e. no slowdown
• A is used as a black-box

The more compressible T is, the shorter c’ is.

Two Key Components: Burrows-Wheeler Transform and Suffix Tree

The empirical entropy H0

$H_0(T) = -\sum_i \frac{n_i}{n} \log_2 \frac{n_i}{n}$

where n_i is the frequency in T of the i-th symbol and n = |T|.

• |T| H0(T) is the best you can hope for from a memoryless compressor
• E.g. Huffman or Arithmetic coders come close to this bound

We get better compression using a codeword that depends on the k symbols preceding the one to be compressed (context).
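The definition of H0 is easy to check numerically. The sketch below (the name `h0` is illustrative) also shows the limitation of memoryless compressors: H0 depends only on symbol frequencies, so a highly regular string and an interleaved one with the same counts get the same value.

```python
from collections import Counter
from math import log2


def h0(s):
    """0-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())


n = 8
print(h0("a" * n + "b" * n))   # a^n b^n
print(h0("ab" * n))            # same symbol counts, different order
print(h0("mississippi"))
```

Both of the first two strings come out at one bit per symbol, so a frequency-based coder cannot exploit the structure of a^n b^n.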

The empirical entropy Hk

$H_k(T) = \frac{1}{|T|} \sum_{|w|=k} |T[w]| \, H_0(T[w])$

• T[w] = string of symbols that precede w in T
• Example: given T = “mississippi”, we have T[i] = mssp, T[is] = ms

For any k-long context w, use Huffman or Arithmetic coding on T[w]:
• Compressing T up to Hk(T) amounts to compressing all T[w] up to their H0

Problems with this approach:
• How to go from all T[w] back to the string T? (BWT)
• How do we choose efficiently the best k? (Suffix Tree)
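The formula for Hk can be evaluated directly by collecting the strings T[w]. A minimal sketch, following the slide's convention that T[w] gathers the symbols that precede each occurrence of w; the names `context_strings` and `hk` are illustrative, and symbols with no full k-long context after them (or no preceding symbol) are simply skipped.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """0-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())


def context_strings(text, k):
    """Map each k-long context w to T[w], the symbols that precede w in text."""
    pieces = defaultdict(list)
    for i in range(1, len(text) - k + 1):
        pieces[text[i:i + k]].append(text[i - 1])
    return {w: "".join(chars) for w, chars in pieces.items()}


def hk(text, k):
    """k-th order empirical entropy: (1/|T|) * sum over w of |T[w]| * H0(T[w])."""
    pieces = context_strings(text, k)
    return sum(len(s) * h0(s) for s in pieces.values()) / len(text)


T = "mississippi"
print(context_strings(T, 1)["i"], context_strings(T, 2)["is"])
for k in range(3):
    print(k, round(hk(T, k), 3))
```

With k = 0 the only context is the empty string and T[w] is T itself, so the function falls back to H0.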

Use BWT to approximate Hk

$H_k(T) = \frac{1}{|T|} \sum_{|w|=k} |T[w]| \, H_0(T[w])$

Remember that compressing T up to Hk(T) amounts to compressing all T[w] up to their H0.

T = mississippi#

Sorted rotations    bwt(T)
#mississipp         i
i#mississip         p
ippi#missis         s
issippi#mis         s
ississippi#         m
mississippi         #
pi#mississi         p
ppi#mississ         i
sippi#missi         s
sissippi#mi         s
ssippi#miss         i
ssissippi#m         i

• T[w] is a permutation of a piece of bwt(T): the rows prefixed by w = is end with s and m, and T[is] = ms
• Hence: compress pieces of bwt(T) up to H0

We have a workable way to approximate Hk

Recall that T[w] is a permutation of the piece of bwt(T) given by the rows prefixed by w. Hence:
• Compressing T up to Hk amounts to compressing pieces of bwt(T) up to H0

What are the pieces of BWT to compress?
• Pieces given by the rows sharing their first char: compressing those pieces up to their H0, we achieve H1(T)
• Pieces given by the rows sharing their first two chars: compressing those pieces up to their H0, we achieve H2(T)
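The correspondence between contexts and pieces of bwt(T) can be checked on the running example. A small sketch, assuming the `#`-terminated text and a naive rotation sort; `bwt_pieces` is a hypothetical helper that cuts L at the boundaries between rows with different k-long prefixes.

```python
def bwt_pieces(text, k):
    """Split bwt(text) into pieces, one per k-long prefix w of the sorted rotations."""
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    pieces = {}
    for row in rows:
        # Rows with the same k-long prefix are contiguous, so each piece is a substring of L.
        pieces.setdefault(row[:k], []).append(row[-1])
    return {w: "".join(chars) for w, chars in pieces.items()}


T = "mississippi#"
for k in (1, 2):
    print(k, bwt_pieces(T, k))
```

For w = is the piece is `sm`, a permutation of T[is] = ms; for w = i it is `pssm`, a permutation of T[i] = mssp. Since H0 ignores the order of symbols, compressing the piece costs the same as compressing T[w].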

Finding the “best pieces” to compress...

(Figure: the suffix tree of T = mississippi#, with its leaves in row order 12 11 8 5 2 1 10 9 7 4 6 3 aligned with the sorted rotations and with bwt(T); the middle columns are unknown. Two leaf covers, L1 and L2, are marked and correspond to H1(T) and H2(T).)

• Leaf cover? A set of suffix-tree nodes such that every leaf has exactly one ancestor in the set; it splits the rows, and hence bwt(T), into contiguous pieces
• Some leaf covers are “related” to Hk!
• Goal: find the best BWT-partition induced by a Leaf Cover!

A compression booster [Ferragina-Manzini, SODA 04]

• Let A be the compressor we wish to boost
• Let bwt(T) = t_1, …, t_r be the partition induced by the leaf cover L, and let us define cost(L, A) = ∑_j |A(t_j)|

Goal: find the leaf cover L* of minimum cost
• A post-order visit of the suffix tree suffices, hence linear time
• We have: cost(L*, A) ≤ cost(L_k, A) ≈ Hk(T), for all k

Technically, we show that

$|c'| \le \lambda |s| H_k(s) + f(|s|) + \log_2 |s| + g_k$ for all $k \ge 0$

Researchers may now concentrate on the “apparently” simpler task of designing 0-th order compressors [further results joint with Giancarlo-Sciortino].
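The post-order optimisation can be illustrated with a deliberately naive sketch. Several things here are assumptions and not the paper's algorithm: the suffix tree is replaced by an uncompacted trie walked implicitly over the sorted rotations (so the running time is not linear), the compressor A is replaced by the hypothetical cost function `entropy_cost` (|t| H0(t) plus a fixed per-piece overhead), and the names are illustrative. At each node the visit keeps the cheaper of two options: compress the whole piece, or take the best covers of the children.

```python
from collections import Counter
from math import log2


def entropy_cost(piece, overhead=8.0):
    """Hypothetical stand-in for |A(piece)|: |piece| * H0(piece) + overhead, in bits."""
    n = len(piece)
    return -sum(c * log2(c / n) for c in Counter(piece).values()) + overhead


def best_partition(text, cost):
    """Return (total cost, pieces of bwt(text)) for the cheapest cover found by the visit."""
    rows = sorted(text[i:] + text[:i] for i in range(len(text)))
    L = "".join(row[-1] for row in rows)

    def visit(lo, hi, depth):
        whole = cost(L[lo:hi])                 # option 1: one piece for this node
        if hi - lo == 1 or depth == len(text):
            return whole, [L[lo:hi]]
        split_cost, split_parts = 0.0, []      # option 2: recurse on the children
        start = lo
        while start < hi:
            end = start
            while end < hi and rows[end][depth] == rows[start][depth]:
                end += 1
            c, parts = visit(start, end, depth + 1)
            split_cost += c
            split_parts += parts
            start = end
        if split_cost < whole:
            return split_cost, split_parts
        return whole, [L[lo:hi]]

    return visit(0, len(rows), 0)


total, parts = best_partition("mississippi#", entropy_cost)
print(round(total, 2), parts)
```

The per-piece overhead matters: with no overhead the cheapest cover degenerates into single-character pieces, while a large overhead favours few long pieces.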

May we close “the mutual reinforcement cycle”?

• The Wavelet Tree [Grossi-Gupta-Vitter, SODA 03]
• Using the WT within each piece of the optimal BWT-partition, we get:
  • A compressed index that scales well with the alphabet size [joint with Manzini, Makinen, Navarro]
  • A reduction of the compression problem to achieving H0 on binary strings [joint with Giancarlo, Manzini, Sciortino]

Interesting issues:
• What about the space needed to construct the BWT?
• What about these tools for “XML” or “Images”?
• Other application contexts: bio, db, data mining, network, ...
• From Theory to Technology! Libraries, Libraries, ... [e.g. LEDA]


A historical perspective

Shannon showed a “narrower” result for a stationary ergodic source S
• Idea: compress groups of k chars in the string T
• Result: compression ratio → the entropy of S, for k → ∞

Various limitations (and what the booster offers instead):
• It works for a source S (booster: any string s)
• It must modify A’s structure, because of the alphabet change (booster: A is a black-box)
• For a given string T, the best k is found by trying k = 0, 1, …, |T|
  • Ω(|T|^2) time slowdown (booster: O(|s|) time)
• k is eventually fixed and this is not an optimal choice! (booster: variable-length contexts)

Two Key Components: Burrows-Wheeler Transform and Suffix Tree

How do we find the “best” partition (i.e. k)?

• “Approximate” via MTF [Burrows-Wheeler, ’94]
  • MTF is efficient in practice [bzip2]
  • Theory and practice showed that we can aim for more!
• Use Dynamic Programming [Giancarlo-Sciortino, CPM ’03]
  • It finds the optimal partition
  • Very slow, the time complexity is cubic in |T|

Surprisingly, full-text indexes help in finding the optimal partition in optimal linear time!

Example: not one k

s = (bad)^n (cad)^n (xy)^n (xz)^n

1-long contexts                  2-long contexts
a_s = d^{2n}              <      ba_s = d^n , ca_s = d^n
x_s = y^n z^n             >      yx_s = y^{n-1} , zx_s = z^{n-1}

Word-based compressed index

What about word-based occurrences of P?

P = bzip
T = …bzip…bzip2unbzip2unbzip…

A substring search finds P as a word, as a prefix, as a substring and as a suffix of other words
...the post-processing phase can be time consuming!

The FM-index can be adapted to support word-based searches:
• Preprocess T and transform it into a “digested” text DT, so that word-search in T becomes substring-search in DT
• Use the FM-index over the “digested” DT

The WFM-index

• Variant of the Huffman algorithm:
  • Symbols of the Huffman tree are the words of T
  • The Huffman tree has fan-out 128
  • Codewords are byte-aligned and tagged: each byte holds 7 bits of Huffman code plus a tagging bit

The WFM-index consists of:
1. Dictionary of words
2. Huffman tree
3. FM-index built on DT

(Figure: for T = “bzip or not bzip”, the words [bzip], [ ], [or], [not] are the leaves of the Huffman tree, and DT is the sequence of their tagged, byte-aligned codewords. A query for any word, e.g. P = bzip, is looked up in the dictionary, encoded, and searched in DT; the tag bits distinguish matches that start at a codeword boundary (yes) from those that do not (no).)

Performance: space ~ 22 %, word search ~ 4 ms

The BW-Transform is invertible

F = #iiiimppssss        L = ipssm#pissii
(Figure: the sorted rotation matrix of T = mississippi#; only the columns F and L are known, the middle columns are unknown.)

Two key properties:
1. We can map L’s chars to F’s chars
2. L[i] precedes F[i] in T: T = .... L[i] F[i] ...

Reconstruct T backward: T = .... i p p i #

• Building the BWT: SA construction
• Inverting the BWT: array visit
• ...overall O(N) time...
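The two properties give a short inversion procedure. A minimal sketch, assuming T ends with a unique terminator that is smaller than every other character, so that row 0 of the sorted matrix starts with it; the name `inverse_bwt` and the array name `LF` are illustrative.

```python
def inverse_bwt(L, end="#"):
    """Rebuild T from L = bwt(T), assuming T ends with the unique, smallest char `end`."""
    # first[c] = row of F where the run of c's starts.
    first, total = {}, 0
    for c in sorted(set(L)):
        first[c] = total
        total += L.count(c)

    # Property 1: the k-th occurrence of c in L is the k-th occurrence of c in F.
    seen, LF = {}, []
    for c in L:
        LF.append(first[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1

    # Property 2: L[r] precedes F[r] in T, so walking LF spells T backward.
    out, r = [], 0                      # row 0 starts with the terminator
    for _ in range(len(L) - 1):
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out)) + end


print(inverse_bwt("ipssm#pissii"))
```

After the LF array is built the walk is a single pass over it, which is the "array visit" mentioned on the slide.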

The Wavelet Tree [Grossi-Gupta-Vitter, SODA 03]

Example on the string mississippi:
• Root: mississippi, bitmap 00110110110 (0 for i and m, 1 for p and s)
• Left child: miiii, bitmap 10000 (0 for i, 1 for m)
• Right child: sssspp, bitmap 111100 (0 for p, 1 for s)

• Use WT within each BWT piece
• Alphabet independence
• Binary string compression/indexing

[collaboration: Giancarlo, Manzini, Makinen, Navarro, Sciortino]
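The example tree on this slide can be rebuilt with a short class. This is a sketch under simple assumptions: bitmaps are plain Python lists and ranks on them are computed by scanning, whereas a real wavelet tree stores compressed bitmaps with constant-time rank; the class and method names are illustrative. The alphabet is split in half in sorted order, which reproduces the bitmaps shown above.

```python
class WaveletTree:
    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else list(alphabet)
        self.bits, self.left, self.right = [], None, None
        if len(self.alphabet) > 1:
            half = len(self.alphabet) // 2
            self.low = set(self.alphabet[:half])
            # 0 sends a symbol to the left child, 1 to the right child.
            self.bits = [0 if c in self.low else 1 for c in text]
            self.left = WaveletTree([c for c in text if c in self.low],
                                    self.alphabet[:half])
            self.right = WaveletTree([c for c in text if c not in self.low],
                                     self.alphabet[half:])

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols of the text."""
        if c not in self.alphabet:
            return 0
        if len(self.alphabet) == 1:
            return i                    # leaf: every symbol that got here is c
        bit = 0 if c in self.low else 1
        j = sum(1 for b in self.bits[:i] if b == bit)   # naive binary rank
        child = self.left if bit == 0 else self.right
        return child.rank(c, j)


wt = WaveletTree("mississippi")
print("".join(map(str, wt.bits)))
print(wt.rank("s", 7))
```

A rank query on the text becomes one binary rank per level, which is how the structure reduces indexing and compression over a large alphabet to operations on binary strings.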
