FM-KZ: An even simpler alphabet-independent FM-index
Rafał Przywarski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland, rafal.przywarski@svensson.com.pl
Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland, sgrabow@kis.p.lodz.pl
Gonzalo Navarro, Dept. of Computer Science, Univ. of Chile, Chile, gnavarro@dcc.uchile.cl
Alejandro Salinger, David R. Cheriton School of Computer Science, Univ. of Waterloo, Canada, asalinge@dcc.uchile.cl
Prague Stringology Club, Praha, Aug. 2006

Full text indexing – now
• compressed suffix array (CSA) (Grossi & Vitter, 2000; ...)
• FM-index, based on the BWT (Ferragina & Manzini, 2000)
• LZ-index, based on the suffix tree with LZ78 (Navarro, 2003)
• alphabet-friendly FM (Ferragina et al., 2004)
• ......

Full text indexing – past
• suffix tree (aka lord of the strings): powerful, flexible, but needs at least 10n space (avg. case, assuming indices 4x larger than characters);
• suffix array: 4n space, otherwise quite practical.

R.Przywarski et al., FM-KZ: An even simpler alphabet-independent FM-index, PSC’06

Compressed indexes
Common feature: the original text may be discarded; its compressed representation alone suffices for handling queries.
Most compressed indexes are based on the Burrows–Wheeler transform (BWT).
Theory is developing rapidly (see the survey by Navarro & Mäkinen, 2006); implementations lag somewhat behind...
This work is practice-oriented. It is a step on from our earlier work (SPIRE'04, PSC'05).

Burrows–Wheeler transform (BWT), an example
[Figure: the rotations of the text as they go, and the sorted rotations with first column F and last column L.]

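The figure for this slide is not preserved, so here is a minimal sketch of the construction it illustrates: sort all rotations of the text and read off the first (F) and last (L) columns. The function name `bwt_columns` and the `$` end marker are assumptions made for this illustration, and sorting full rotations is only practical for tiny inputs.

```python
def bwt_columns(text, end="$"):
    """Return the F and L columns of the sorted rotation matrix of text + end.

    Assumes `end` does not occur in `text` and sorts before every other
    character (true for '$' versus ASCII letters).
    """
    s = text + end
    # All cyclic rotations, "as they go", then sorted lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    first = "".join(r[0] for r in rotations)   # column F
    last = "".join(r[-1] for r in rotations)   # column L, i.e. the BWT
    return first, last


if __name__ == "__main__":
    F, L = bwt_columns("banana")
    print(F, L)
```

Column L is the transform itself; column F is just the sorted characters of the text, which is why an index only needs per-character cumulative counts to represent it.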
Pattern searching in BWT sequence: LF-mapping mechanism
Starting point in Ferragina & Manzini’s index (2000):
• search time: O(m log n),
• space occupancy: O(n log n) bits.
Note that in this form the complexities are the same as with the plain suffix array, but the text T itself may be eliminated!
Better? Ferragina & Manzini (2000) reach O(m) time with (roughly) O(Hk n) space, assuming a small alphabet.

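A counting query via LF-mapping (backward search) can be sketched as follows. This is an illustrative character-level version, not the authors' implementation: the names `build_bwt` and `count_occurrences` are hypothetical, a `"\0"` sentinel is assumed not to occur in the text, and the occurrence counts are computed by scanning a prefix of L, where a real index would use a rank structure.

```python
def build_bwt(text):
    """Build the L column and the C table (number of characters smaller than c)."""
    s = text + "\0"                      # assumed sentinel, smaller than any text char
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    last = "".join(s[i - 1] for i in sa)
    C = {c: sum(1 for x in s if x < c) for c in set(s)}
    return last, C


def count_occurrences(pattern, last, C):
    """Backward search: process the pattern right to left, narrowing [sp, ep]."""
    sp, ep = 0, len(last) - 1            # inclusive, 0-based row range
    for c in reversed(pattern):
        if c not in C:
            return 0
        # Naive Occ(c, i): count c in a prefix of L (a rank query in a real index).
        sp = C[c] + last[:sp].count(c)
        ep = C[c] + last[:ep + 1].count(c) - 1
        if ep < sp:
            return 0
    return ep - sp + 1


if __name__ == "__main__":
    L, C = build_bwt("abracadabra")
    print(count_occurrences("abra", L, C))
```

Each pattern character costs two Occ queries, so the query time is governed by how fast Occ (rank) can be answered; that is the part the compressed indexes discussed here compete on.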
Searching in BWT sequence, an example
[Figure: the BWT matrix with columns F and L, and a feasible form of the L column.]

FM-Huffman (Grabowski et al., 2004)
Idea: search in the BWT sequence, but use a binary (or, generally, constant-size) alphabet.
Use the rank() operation on a binary sequence (Jacobson, 1989; Munro, 1996; Clark, 1996). Rank(k) tells the number of 1’s in T[1...k], k ≤ n, in O(1) time and needs o(n) extra space.
Binary representation? Yes, you guessed: Huffman coding (an approximation of the order-0 entropy).
Soon we’ll see this is not as good as it might first seem.

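To make the rank() operation concrete, here is a simplified one-level sketch: store the number of 1's before each block, then finish a query by scanning inside one block. The class name `RankBits` and the `block` parameter are assumptions for illustration. The cited constructions use two directory levels plus table lookups to reach true O(1) time with o(n) extra bits; this sketch only shows the sampling idea.

```python
class RankBits:
    """One-level rank directory over a list of 0/1 integers (illustrative only)."""

    def __init__(self, bits, block=8):
        self.bits = bits
        self.block = block
        # prefix[q] = number of 1's in bits[0 : q * block]
        self.prefix = [0]
        for start in range(0, len(bits), block):
            self.prefix.append(self.prefix[-1] + sum(bits[start:start + block]))

    def rank1(self, k):
        """Number of 1's among the first k bits (positions 1..k in slide notation)."""
        q, r = divmod(k, self.block)
        start = q * self.block
        return self.prefix[q] + sum(self.bits[start:start + r])


if __name__ == "__main__":
    r = RankBits([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0], block=4)
    print(r.rank1(7))
```

Smaller blocks make queries faster but the directory larger, which is the space/time trade-off behind the "5% vs. 10% overhead rank" remark later in the talk.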
Searching (counting query) in FM-Huffman
[Algorithm listing: searching for pattern P’ in bit-vector B.]

FM-Huffman index
Index construction:
• Huffman encode the text T, obtaining T’ (n’ bits).
• Calculate the BWT of T’, call it B.
• Create another bit array, Bh, that marks the bits in B which start Huffman codewords.
Query handling:
• Huffman encode the pattern P, obtaining P’.
• Search in a similar manner as shown at slide 7, BUT the BWT sequence is kept naturally (array B) and the additional space overhead is sublinear in n’.
• Verify a match with additional bits (Bh array + again extra structures sublinear in n’).
Main drawback: Bh is as large as B.

What instead of the binary Huffman? (Consider both space and search time.)
k-ary Huffman (Grabowski et al., PSC05), k typically 4 or 16:
- B array needs more space (usually slightly more).
+++ Bh array is almost halved.
-- rank structures for each of the 4 symbols are needed (but for a halved sequence).
In total: some 10% space gain for English and proteins (almost no gain for DNA).
Significant speedup in most cases (fewer codeword chars → fewer rank operations).

Now more radical: remove Bh completely
Removing Bh is possible if our encoding has some self-synchronizing property: every codeword beginning must be recognized instantaneously.
Very naïve solution: unary coding. Anything better? Yes. Kautz–Zeckendorf coding.
The search is exactly like in slide 8 (for binary FM-Huffman), only line 9 will now be:
if ep < sp then occ = 0 else occ = ep – (sp – 1)

Kautz–Zeckendorf coding
Basic variant (we denote it as KZ2): all the codewords start with 110; 110 appears nowhere else.
Let B be encoded with KZ2. If during the LF-mapping we read a 0 followed by two 1’s, we know we are at a codeword boundary.
Note that we allow a 1 at a codeword end! So even three 1’s can be “in a row”. But 110 occurs only at a codeword beginning.

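The self-synchronizing property can be checked mechanically. The sketch below assumes one reading of the slides: a KZ2 codeword is the prefix 110 followed by any bit string with no two adjacent 1's (possibly empty, possibly ending in 1). The names `kz2_codewords` and `codeword_starts` are hypothetical, and the enumeration order is arbitrary; a real code would assign shorter codewords to more frequent symbols.

```python
from itertools import product


def kz2_codewords(max_len):
    """All assumed KZ2 codewords of length <= max_len: '110' + body without '11'."""
    words = []
    for body_len in range(max_len - 3 + 1):
        for body in product("01", repeat=body_len):
            b = "".join(body)
            if "11" not in b:
                words.append("110" + b)
    return words


def codeword_starts(bitstream):
    """Positions where '110' occurs; under the assumption these are the codeword starts."""
    return [i for i in range(len(bitstream) - 2) if bitstream[i:i + 3] == "110"]


if __name__ == "__main__":
    words = kz2_codewords(6)
    stream = "".join(words)
    print(len(words), codeword_starts(stream))
```

Under this reading the number of codewords with a body of length 0, 1, 2, 3, ... is 1, 2, 3, 5, ..., which is where the Fibonacci numbers enter.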
Kautz–Zeckendorf coding, cont’d
KZ2 encoding (in an alternative variant: each codeword has 1 at the beginning and at a start and no two adjacent 1’s elsewhere) presents an integer as a sum of Fibonacci numbers in a unique form.
Fibonacci sequence (note a single 1 at the start): 1, 2, 3, 5, 8, 13, 21, 34, 55...
So, for example, 27 will be represented as (least significant digit first): 1001001, since 27 = 1 + 5 + 21.

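The representation of 27 above can be reproduced with a greedy decomposition over 1, 2, 3, 5, 8, ...: always subtract the largest Fibonacci number that still fits. The function name `zeckendorf_lsd_first` is an assumption for this sketch, which covers only the number representation and not the full codeword format.

```python
def zeckendorf_lsd_first(n):
    """Fibonacci-base digits of n >= 1, least significant digit first."""
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    digits = []
    for f in reversed(fibs):             # greedy, most significant first
        if f <= n:
            digits.append("1")
            n -= f
        else:
            digits.append("0")
    msd_first = "".join(digits).lstrip("0")
    return msd_first[::-1]


if __name__ == "__main__":
    print(zeckendorf_lsd_first(27))
```

The greedy choice never selects two consecutive Fibonacci numbers, so the digit string never contains two adjacent 1's; that is what makes the representation unique.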
What is the avg codeword length for KZ?
We don’t know. But asymptotically (large alphabet) it can be upper-bounded by 1.618... · (H0 + r), where r < 1 is the Huffman redundancy for a given distribution.
1.618... = (1 + sqrt(5)) / 2 (golden ratio)

Drawbacks of KZ
• B (and its rank) is longer, as KZ code is longer than Huffman.
• Longer encoded patterns mean more rank operations (as opposed to FM-Huff4).
• Harder analysis...

Benefits of KZ
• No Bh array (and its rank structure).
• ...So we don’t perform the final pair of rank operations either.
• With FM-Huffman, selectnext (telling the pos. of the next 1) is needed at the start of report / display query handling. Now all the matches are in a contiguous range of rows.

On a Fibonacci numbers application...
The number 1.618... Does it ring a bell? 1 mile = 1.609... km
How does a mathematician convert miles into kilometers? (According to Graham, Knuth, and Patashnik, Concrete Mathematics.)
Represent the distance in the Fibonacci base (e.g. KZ2), shift left by 1, and sum what you’ve obtained.
Example: 80 miles. 80 = 1 + 3 + 21 + 55 = 101000101(fib)
After the <<: 0101000101(fib) = 2 + 5 + 34 + 89 = 130 km
Ratio 1.625, not bad...

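The conversion trick is short enough to run. The sketch below decomposes the distance greedily over 1, 2, 3, 5, 8, ... and replaces every Fibonacci term by its successor, which is what the one-position shift does. The function name `miles_to_km_fib` is an assumption, and the input is taken to be a non-negative integer.

```python
def miles_to_km_fib(miles):
    """Approximate kilometers by shifting the Fibonacci-base representation by one."""
    fibs = [1, 2]
    while fibs[-1] <= miles:             # ends with fibs[-1] > miles
        fibs.append(fibs[-1] + fibs[-2])
    km = 0
    # The last entry is too large to be a term, so fibs[i + 1] always exists.
    for i in reversed(range(len(fibs) - 1)):
        if fibs[i] <= miles:
            miles -= fibs[i]
            km += fibs[i + 1]            # each term moves up to the next Fibonacci number
    return km


if __name__ == "__main__":
    print(miles_to_km_fib(80))
```

The approximation works because the ratio of consecutive Fibonacci numbers tends to the golden ratio 1.618..., which is close to 1.609.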
Generalized Kautz–Zeckendorf
KZ1: 10 prefix (unary coding!)
KZ2: 110 prefix
KZ3: 1110 prefix, etc.
Is KZ2 best? Not always. For example, for DNA (4 symbols) the seemingly very naive unary coding has 2.5 bit avg codeword length (assuming non-compressible symbols, i.e. H0 = 2 bits / symbol).
...Oops, this is for a slightly twisted variant: the codewords are simply 1, 10, 100 and 1000.

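The 2.5-bit figure is a plain average over four equally likely symbols. The sketch below recomputes it and, for comparison, applies the same formula to the four shortest KZ2 codewords under the assumed reading "110 followed by a body without adjacent 1's"; that KZ2 codeword list is an illustration, not taken from the slides.

```python
def average_length(codewords, probabilities):
    """Expected codeword length in bits for the given symbol probabilities."""
    return sum(p * len(c) for c, p in zip(codewords, probabilities))


uniform_dna = [0.25, 0.25, 0.25, 0.25]

# "Twisted" unary variant from the slide.
unary = ["1", "10", "100", "1000"]

# Assumed four shortest KZ2 codewords (hypothetical assignment).
kz2 = ["110", "1100", "1101", "11000"]

if __name__ == "__main__":
    print(average_length(unary, uniform_dna))
    print(average_length(kz2, uniform_dna))
```

On a four-symbol uniform source the fixed three-bit prefix dominates, which is consistent with the slide's point that KZ2 is not always the best choice.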
Reporting queries (basic idea) – same for all the FM-*
One extra bit per original symbol is needed (plus some sublinear data), and one position index per h symbols (h is a user-selected parameter, e.g., 32).
We sample positions of T’ at regular intervals, but only at codeword beginnings.
Handling a query: for each found occurrence of P, keep moving backward in T’ digit by digit until a sampled position (signalled with a flag) is met. Read its index (original position) and it’s done.
The backward moving in T’ is limited.

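The sampling idea can be sketched on a plain character-level BWT, which is a simplification: the talk samples bit positions of the encoded text T’ at codeword beginnings and marks them with a flag bit, whereas here every h-th text position is sampled and a dictionary stands in for the flag array. The names `build_index` and `locate` are hypothetical.

```python
def build_index(text, h):
    """Return (suffix array, LF table, sampled rows) for text plus a sentinel."""
    s = text + "\0"                       # assumed sentinel, not present in text
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = [s[i - 1] for i in sa]
    # C[c] = number of characters in s smaller than c
    C, total = {}, 0
    for c in sorted(set(s)):
        C[c] = total
        total += s.count(c)
    # LF[row] = row of the suffix that starts one position earlier in the text
    seen, lf = {}, []
    for c in bwt:
        lf.append(C[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    # Keep the text position only for rows whose position is a multiple of h.
    sampled = {row: pos for row, pos in enumerate(sa) if pos % h == 0}
    return sa, lf, sampled


def locate(row, lf, sampled):
    """Walk backward with LF until a sampled row is met; at most h - 1 steps."""
    steps = 0
    while row not in sampled:
        row = lf[row]
        steps += 1
    return sampled[row] + steps


if __name__ == "__main__":
    sa, lf, sampled = build_index("abracadabra", 4)
    print([locate(r, lf, sampled) for r in range(len(sa))])
```

The full suffix array is returned here only so the result can be checked; an actual index would keep just the samples, trading a larger h (less space) for longer backward walks.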
Experimental results
Datasets:
• 80 MB of English text (from TREC-3)
• 60 MB of DNA (from BLAST database), 5 characters!
• 55 MB of proteins (from BLAST database)
Test platform:
• Intel Xeon 3.06 GHz
• 2 GB of RAM
• 512 KB cache
• Gentoo Linux 2.6.10
• gcc 3.3.5 -O9

Experimental results, cont’d
Counting queries: pattern length from 10 to 100; for each length, 1000 patterns taken from random positions of each text.
Reporting queries: pattern length 10; 1000 random patterns taken.
Display queries: 1000 random patterns, 100 chars to display around each of the found occurrences.
Competitors:
• FM-index (very simple and fast byte-oriented variant by Navarro, 2004),
• Compressed Suffix Array (CSA) (Sadakane, 2000),
• Run-Length FM (RLFM) (Mäkinen & Navarro, 2005),
• Succinct Suffix Array (SSA) (Mäkinen & Navarro, 2005),
• LZ-index (Navarro, 2004),
• FM-Huffman2 and FM-Huffman4 (Grabowski et al., 2005),
• FM-KZ1 and FM-KZ2 (this work).

English text, search time
[Figure: time in sec for varying pattern lengths.]

English text, space vs. search time
[Figure: time in sec per character.]

DNA, space vs. search time
[Figure: time in sec per character.]

Proteins, space vs. search time
[Figure: time in sec per character.]

Observations
• CSA and RLFM: hardly ever competitive.
• FM-Huff16: fastest for counting queries on English and proteins.
• FM-KZ1: most succinct and among the fastest on DNA.
• Reporting time: FM-Huff variants lose to FM-index on English and proteins. They (k=2 and k=4) win on DNA instead (but there SSA is even better, and more flexible for low space use).
• Display time: FM-KZ1 best for DNA. Best for proteins: FM and then FM-KZ2 (but the fastest is FM-Huff16). English text: similar to proteins, but LZ-index is as fast as FM-Huff16 and needs about 25% less space.
• Original binary Huffman: never competitive.

Presented algorithm – properties
• Search time: O((H0+1)m + occ) avg search time; O(m log n + occ) worst-case search time.
• Space occupancy: less than 1.618... · (H0+1)n + o(H0 n) bits.
• Pros and cons (summary):
  • very simple and practical succinct index;
  • no dependence on the alphabet size;
  • among the fastest (but not the most succinct) compressed indexes;
  • worse “in theory” than some recent indexes (but simpler);
  • quite flexible.

To do:
• Better analysis?
• Some more little tricks (and tweaks), e.g., the B array may be truncated somewhat. This is good for space and also for speed (elimination of some rank operations).
• More experiments with a more succinct rank (e.g. the 5% overhead rank is only moderately slower than the 10% one, definitely not twice as slow; quite an option for Huff4 and Huff16).
• Higher arity KZ?