Linear-time construction of CSA using o( n log n )-bit working space for large alphabets. Joong Chae Na School of Computer Sci. & Eng. Seoul National University, Korea. Overview. Background Suffix arrays(SA) Compressed suffix arrays (CSA) Problem definition Previous works
Linear-time construction of CSA using o(n log n)-bit working space for large alphabets. Joong Chae Na, School of Computer Sci. & Eng., Seoul National University, Korea
Overview • Background • Suffix arrays (SA) • Compressed suffix arrays (CSA) • Problem definition • Previous works • Our contributions • Description of our algorithm • Conclusions
Background (1) Given a string T of length n over an alphabet Σ, • Suffix array (SA) of T [Manber & Myers ’93] • Lexicographically sorted list of the suffixes of T • Space requirement of O(n log n) bits • Example: T = b a b a a b b a $
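As an illustration of the definition above, here is a minimal Python sketch that builds the suffix array of the example string by directly sorting its suffixes. The function name `suffix_array` is hypothetical, and the naive sort is used only for clarity; it is not one of the construction algorithms discussed in these slides. Positions are 1-based, and `$` is assumed to compare smaller than every letter (which holds for ASCII).

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes in lexicographic order.

    Naive construction for illustration only: it sorts whole suffixes,
    so it is far slower than the linear-time algorithms cited in the slides.
    """
    return sorted(range(1, len(text) + 1), key=lambda k: text[k - 1:])


T = "babaabba$"
SA = suffix_array(T)
for i, k in enumerate(SA, start=1):
    # i = position in SA, k = SA[i], followed by the suffix T[k..n]
    print(i, k, T[k - 1:])
```

For the example string this gives SA = 9, 8, 4, 2, 5, 7, 3, 1, 6.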
Background (2) • Compressed suffix array (CSA) [Grossi & Vitter ’00] • Compressed version of SA • Space requirement of O(n log |Σ|) bits • FM-index [Ferragina & Manzini 2000] • Space requirement of O(n log |Σ|) bits • Example: T = b a b a a b b a $
Problem definition • Constructing SA, CSA and FM-index using • o(n log n) time and • o(n log n)-bit working space • Working space • Temporary space required for executing an algorithm • Not including the space for the input and output
Related works • Constructing SA and CSA ※ O(n log n)-bit working space • Manber & Myers [1993]: O(n log n) time • Kim et al. [2003]: O(n) time • Kärkkäinen & Sanders [2003]: O(n) time • Ko & Aluru [2003]: O(n) time ※ O(n log |Σ|)-bit working space • Lam et al. [COCOON 2002]: O(|Σ| n log n) time • Hon et al. [ISAAC 2003]: O(n log n) time • None of these algorithms satisfies both the time and the space requirement of our problem.
Previous results • Hon et al. [FOCS 2003] • An algorithm using O(n log log |Σ|) time and O(n log |Σ|)-bit working space • The first algorithm using o(n log n) time and o(n log n)-bit working space • Following ½-recursion (the odd-even scheme)
Our contributions • Another algorithm using o(n log n) time and o(n log n)-bit working space • O(n) time and O(n log |Σ| · log^α_{|Σ|} n)-bit working space • α = log_3 2 ≈ 0.63 • The first alphabet-independent linear-time algorithm for constructing SA, CSA, and FM-index using o(n log n)-bit working space • Following ⅔-recursion (the skew scheme)
Hon et al. vs. our results * The encoding step is the most complex and time-consuming step in ⅔-recursion. However, neither algorithm needs the encoding step.
Overview • Preliminaries • Basic definitions and notations • Main technique • Outline of our algorithm
Preliminaries - Ψ function • Let T[k..n] be the lexicographically ith smallest suffix of T, i.e., SA[i] = k • Ψ[i] is the position in SA where the next suffix T[k+1..n] is stored • Example: T = b a b a a b b a $ (positions 1 2 3 4 5 6 7 8 9)
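The definition of Ψ can be made concrete with a small Python sketch on the example string. The names `suffix_array` and `psi_function` are hypothetical. The slide does not say what Ψ should be for the last suffix T[n..n], which has no successor; the sketch assumes the common convention that it wraps around to T[1..n], and that is an assumption rather than something stated in the source.

```python
def suffix_array(text):
    """Naive 1-based suffix array (illustration only, not linear time)."""
    return sorted(range(1, len(text) + 1), key=lambda k: text[k - 1:])


def psi_function(text):
    """Return Psi as a list: entry i-1 holds Psi[i], a 1-based position in SA.

    Psi[i] is the position in SA of T[k+1..n], where k = SA[i].
    Assumption: for k = n the successor wraps around to T[1..n].
    """
    n = len(text)
    sa = suffix_array(text)
    # rank[k] = position in SA where the suffix starting at k is stored
    rank = {k: i for i, k in enumerate(sa, start=1)}
    return [rank[k % n + 1] for k in sa]


print(psi_function("babaabba$"))
```

With the wrap-around convention, the example string yields Ψ = 8, 1, 5, 7, 9, 2, 3, 4, 6.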
Preliminaries - Lemmas, Hon et al. [FOCS 2003] • Text, Ψ → SA, CSA • O(n) time, O(n log |Σ|)-bit working space • Text, Ψ → C array (BWT) → FM-index • O(n) time, O(n log |Σ|)-bit working space • Note: given these lemmas, our goal reduces to Text → Ψ
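To see why Ψ carries enough information to recover SA, the following Python sketch walks the Ψ links starting from the position of T[1..n]. This is only an illustration of the idea: it writes a plain SA with O(n log n) bits and does not reproduce the space-efficient procedure of Hon et al. The function name `sa_from_psi` is hypothetical, and the sketch assumes that the `$` suffix sits at position 1 of SA and that Ψ[1] wraps around to the position of T[1..n].

```python
def sa_from_psi(psi):
    """Rebuild a 1-based suffix array from Psi (given as a Python list).

    Assumptions (not from the slides): psi[0] is Psi[1], the entry of the
    '$' suffix, and it points to the SA position of the whole text T[1..n].
    """
    n = len(psi)
    sa = [0] * n
    i = psi[0]               # SA position of T[1..n]
    for k in range(1, n + 1):
        sa[i - 1] = k        # the suffix stored at position i starts at k
        i = psi[i - 1]       # move to the position of T[k+1..n]
    return sa


psi = [8, 1, 5, 7, 9, 2, 3, 4, 6]   # Psi of "babaabba$" under the convention above
print(sa_from_psi(psi))
```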
Basic def. and not. (1) • Residue-1 suffixes of T • T[3i-2..n] for 1 ≤ i ≤ n/3 • T[1..n], T[4..n], T[7..n],… • Residue-2 suffixes of T • T[3i-1..n] for 1 ≤ i ≤ n/3 • T[2..n], T[5..n], T[8..n],… • Residue-3 suffixes of T • T[3i..n] for 1 ≤ i ≤ n/3 • T[3..n], T[6..n], T[9..n],…
Basic def. and not. (2) • T: string over alphabet Σ • T12[1..⅔n] = T[1..n] T[2..n]T[1] • length: ⅔n, alphabet: Σ^3 • SA12: suffix array of T12 • T3[1..⅓n] = T[3..n]T[1]T[2] • length: ⅓n, alphabet: Σ^3 • SA3: suffix array of T3
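A small Python sketch may help with the notation for T12 and T3. It reads each string in blocks of three characters, so that every block is one character over Σ^3; this reading is inferred from the stated lengths (⅔n and ⅓n) and alphabet (Σ^3). The helper names `triples` and `build_t12_t3` are hypothetical, and the sketch assumes n is a multiple of 3.

```python
def triples(s):
    """Split s into consecutive blocks of three characters."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]


def build_t12_t3(text):
    """Return (T12, T3) as lists of triples, each triple one symbol over Sigma^3.

    Assumption: len(text) is a multiple of 3, so the cyclic shifts
    T[2..n]T[1] and T[3..n]T[1]T[2] split evenly into triples.
    """
    assert len(text) % 3 == 0, "sketch assumes n is a multiple of 3"
    t12 = triples(text) + triples(text[1:] + text[:1])   # T[1..n], then T[2..n]T[1]
    t3 = triples(text[2:] + text[:2])                    # T[3..n]T[1]T[2]
    return t12, t3


t12, t3 = build_t12_t3("babaabba$")
print(t12)   # six symbols over Sigma^3, i.e. (2/3) * 9
print(t3)    # three symbols over Sigma^3, i.e. (1/3) * 9
```

Each suffix of T12 starts at a residue-1 or residue-2 position of T, and each suffix of T3 starts at a residue-3 position.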
Main technique - Ψ’ function • Ψ’ is just like Ψ, but Ψ’ is defined on SA12 and SA3 • Ψ’ points to the position in SA12 or SA3 where T[k+1..n] (the suffix following the current suffix T[k..n]) is stored. ※ Note that Ψ’ is not the Ψ function of T12 and T3. • The Ψ’ function consists of Ψ’T12 and Ψ’T3
Ψ’ function (by residue) • Ψ’T12 (residue-1 suffixes of T) • Let T[3k-2..n] be the suffix stored in SA12[i]. • Then Ψ’T12[i] is the position in SA12 where the next suffix T[3k-1..n] is stored. • Ψ’T12 (residue-2 suffixes of T) • Let T[3k-1..n] be the suffix stored in SA12[i]. • Then Ψ’T12[i] is the position in SA3 where the next suffix T[3k..n] is stored. • Ψ’T3 (residue-3 suffixes of T) • Let T[3k..n] be the suffix stored in SA3[i]. • Then Ψ’T3[i] is the position in SA12 where the next suffix T[3k+1..n] is stored.
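The three cases of Ψ’ can be checked on the example string with the Python sketch below. It is a simplification under stated assumptions: SA12 and SA3 are obtained by filtering a naively built suffix array of T by residue (instead of sorting T12 and T3), n is a multiple of 3, and the successor of the last suffix T[n..n] wraps around to T[1..n]. The function name `psi_prime` is hypothetical.

```python
def psi_prime(text):
    """Return (SA12, SA3, Psi'_T12, Psi'_T3) with 1-based positions.

    Simplifying assumptions (not from the slides): n is a multiple of 3,
    SA12/SA3 are taken from a naive suffix array of T, and the suffix after
    T[n..n] is taken to be T[1..n].
    """
    n = len(text)
    assert n % 3 == 0, "sketch assumes n is a multiple of 3"
    sa = sorted(range(1, n + 1), key=lambda k: text[k - 1:])
    sa12 = [k for k in sa if k % 3 != 0]       # residue-1 and residue-2 suffixes
    sa3 = [k for k in sa if k % 3 == 0]        # residue-3 suffixes
    rank12 = {k: i for i, k in enumerate(sa12, start=1)}
    rank3 = {k: i for i, k in enumerate(sa3, start=1)}
    nxt = lambda k: k % n + 1                  # start of the next suffix (cyclic)
    # residue-1 entries point into SA12, residue-2 entries point into SA3
    psi12 = [rank12[nxt(k)] if k % 3 == 1 else rank3[nxt(k)] for k in sa12]
    # residue-3 entries point into SA12
    psi3 = [rank12[nxt(k)] for k in sa3]
    return sa12, sa3, psi12, psi3


sa12, sa3, psi12, psi3 = psi_prime("babaabba$")
print(sa12, psi12)
print(sa3, psi3)
```

For T = babaabba$ this gives SA12 = 8, 4, 2, 5, 7, 1 with Ψ’T12 = 1, 4, 2, 3, 1, 3, and SA3 = 9, 3, 6 with Ψ’T3 = 6, 2, 5.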
Framework - outline • How to construct the Ψ function of T • Bottom-up approach • step 0: T → ΨT • step 1: T12 → ΨT12 • … • step i • … • step h: Ψ is built directly, using any linear-time construction algorithm • h = log_3 log_{|Σ|} n
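The choice of h can be related to the space bound claimed earlier by a short calculation. It is a sketch under the assumption that step i works on a string of length (2/3)^i n (as for T12 at step 1) and that the last step stores a plain suffix array using at most log n bits per entry; rounding of h is ignored.

```latex
% Assumptions: n_i = (2/3)^i n is the string length at step i,
% and 3^h = \log_{|\Sigma|} n.
\begin{align*}
2^h &= \left(3^h\right)^{\log_3 2} = \log_{|\Sigma|}^{\alpha} n,
      \qquad \alpha = \log_3 2, \\
n_h &= \left(\tfrac{2}{3}\right)^{h} n
     = \frac{n \,\log_{|\Sigma|}^{\alpha} n}{\log_{|\Sigma|} n}, \\
n_h \log n &= n \cdot \frac{\log n}{\log_{|\Sigma|} n} \cdot \log_{|\Sigma|}^{\alpha} n
            = n \log|\Sigma| \cdot \log_{|\Sigma|}^{\alpha} n .
\end{align*}
```

So a plain suffix array at step h fits in O(n log |Σ| · log^α_{|Σ|} n) bits, which matches the working-space bound stated in the contributions.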
Step i - outline • Input: string S, with S12 and S3 derived from S • ΨS12 is given (from step i+1) • From ΨS12, compute Ψ’S12 and Ψ’S3 • Merge Ψ’S12 and Ψ’S3 → ΨS
Merging step * Compare the entries of SA12 with the entries of SA3 in order - two suffixes are compared by following the Ψ’ function at most twice
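The comparison rule of the merging step can be sketched in Python as follows. This is an illustrative reconstruction based on the standard skew-scheme comparison, not the authors' space-efficient implementation: it keeps SA12, SA3 and the first character of every suffix in plain lists, and it builds its demo input with a naive sort. All function names are hypothetical. It assumes n is a multiple of 3 and that T ends with a unique smallest character `$`.

```python
def build_inputs(text):
    """Demo input only: SA12, SA3, Psi'_T12, Psi'_T3 via a naive sort (1-based, index 0 unused)."""
    n = len(text)
    assert n % 3 == 0 and text.count("$") == 1 and text[-1] == "$"
    sa = sorted(range(1, n + 1), key=lambda k: text[k - 1:])
    sa12 = [None] + [k for k in sa if k % 3 != 0]
    sa3 = [None] + [k for k in sa if k % 3 == 0]
    r12 = {k: i for i, k in enumerate(sa12) if k}
    r3 = {k: i for i, k in enumerate(sa3) if k}
    nxt = lambda k: k % n + 1
    psi12 = [None] + [r12[nxt(k)] if k % 3 == 1 else r3[nxt(k)] for k in sa12[1:]]
    psi3 = [None] + [r12[nxt(k)] for k in sa3[1:]]
    return sa12, sa3, psi12, psi3


def merge(text, sa12, sa3, psi12, psi3):
    """Merge SA12 and SA3 into SA, following Psi' at most twice per comparison."""
    c12 = [None] + [text[k - 1] for k in sa12[1:]]   # first character of each SA12 suffix
    c3 = [None] + [text[k - 1] for k in sa3[1:]]     # first character of each SA3 suffix
    res1 = [None] + [k % 3 == 1 for k in sa12[1:]]   # True for residue-1 entries

    def less(i, j):
        """Is the suffix at SA12[i] smaller than the suffix at SA3[j]?"""
        if c12[i] != c3[j]:
            return c12[i] < c3[j]
        if res1[i]:
            # one step: both successors are stored in SA12
            return psi12[i] < psi3[j]
        i2, j2 = psi12[i], psi3[j]                   # i2 is in SA3, j2 is in SA12
        if c3[i2] != c12[j2]:
            return c3[i2] < c12[j2]
        # two steps: both second successors are stored in SA12
        return psi3[i2] < psi12[j2]

    out, i, j = [], 1, 1
    while i < len(sa12) and j < len(sa3):
        if less(i, j):
            out.append(sa12[i]); i += 1
        else:
            out.append(sa3[j]); j += 1
    return out + sa12[i:] + sa3[j:]


T = "babaabba$"
print(merge(T, *build_inputs(T)))
```

A residue-1 entry needs one Ψ’ step because its successor and the successor of a residue-3 suffix both lie in SA12; a residue-2 entry may need two steps, since after one step the two successors lie in different arrays.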
Conclusions & future works • We presented an alphabet-independent linear-time algorithm to construct SA, CSA, and FM-index using o(n log n)-bit working space • Future works • To construct SA, CSA, and FM-index optimally, i.e., using O(n) time and O(n log |Σ|)-bit working space