Fine Tuning the Enhanced Suffix Arrays
Ayat A. Dawood, CIS, Nile University
Joint work with Mohamed AbouelHoda
Table of Contents
• Suffix array
• The enhanced suffix array
• Our accomplishment:
  • Minimal Perfect Hashing Function
  • The exact pattern matching problem
  • Improving the bucket table representation
Suffix array
• An array of integers in the range 0 to n that lists the (n+1) suffixes of S$ in lexicographic order.
• e.g., S$ = acaaacatat$
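
The definition above can be sketched with a naive construction that simply sorts all suffixes. This is an illustration only, not the construction used in the talk. It assumes that the sentinel $ sorts after every other character, which is what the interval boundaries on the later slides imply; the function name `suffix_array` is hypothetical.

```python
def suffix_array(s):
    """Return the suffix array of s, where s already ends with '$'.

    Assumption: '$' is larger than every other character.
    Naive O(n^2 log n) construction, for illustration only.
    """
    def key(i):
        # Map '$' above the largest Unicode code point so it sorts last.
        return [0x110000 if c == "$" else ord(c) for c in s[i:]]

    return sorted(range(len(s)), key=key)


suftab = suffix_array("acaaacatat$")
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
```

Entry `suftab[i]` is the start position of the i-th smallest suffix, so the values are a permutation of 0..n.
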
Enhanced suffix array
• The suffix array enhanced with a set of additional tables.
• With those tables, suffix-tree algorithms can run on the suffix array with the same time complexity.
• lcptab[i] stores the length of the longest common prefix of the suffixes starting at suftab[i-1] and suftab[i].
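
A minimal sketch of the lcp-table definition, computed by direct character comparison of neighbouring suffixes. It assumes lcptab[0] = 0 and that $ sorts last; the helper names are hypothetical, and linear-time constructions exist but are not shown here.

```python
def suffix_array(s):
    # Naive construction; '$' is treated as the largest character (assumption).
    return sorted(range(len(s)),
                  key=lambda i: [0x110000 if c == "$" else ord(c) for c in s[i:]])


def lcp_table(s, suftab):
    """lcptab[i] = length of the longest common prefix of the suffixes
    starting at suftab[i-1] and suftab[i]; lcptab[0] is set to 0."""
    lcptab = [0] * len(suftab)
    for i in range(1, len(suftab)):
        a, b = suftab[i - 1], suftab[i]
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        lcptab[i] = k
    return lcptab


s = "acaaacatat$"
lcptab = lcp_table(s, suffix_array(s))
print(lcptab)  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```
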
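Enhanced suffix array: l-interval
• l-interval: an interval [i..j] of the suffix array whose suffixes have a longest common prefix of length l, written l-[i..j]
• l-interval tree for S$ = acaaacatat$ (edges labelled by the next character):
  0-[0..10]
    a: 1-[0..5]
      a: 2-[0..1]
      c: 3-[2..3]
      t: 2-[4..5]
    c: 2-[6..7]
    t: 1-[8..9]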
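
The l-intervals can be enumerated from the lcp-table alone. The sketch below splits each interval at the positions where the lcp-table reaches the interval's minimum. It is a simple top-down illustration (quadratic in the worst case), not necessarily the procedure used by the authors. The hard-coded `lcptab` is the lcp-table of acaaacatat$ under the assumption that $ sorts last, and the function names are hypothetical.

```python
def child_intervals(lcptab, lo, hi):
    """Split [lo..hi] (lo < hi) into its children, singletons included.

    Returns the lcp-value l of the interval and the list of child ranges.
    """
    l = min(lcptab[lo + 1 : hi + 1])
    cuts = [k for k in range(lo + 1, hi + 1) if lcptab[k] == l]
    starts = [lo] + cuts
    ends = [k - 1 for k in cuts] + [hi]
    return l, list(zip(starts, ends))


def lcp_intervals(lcptab):
    """Return all l-[i..j] intervals with i < j as (l, i, j) triples."""
    result = []
    stack = [(0, len(lcptab) - 1)]
    while stack:
        lo, hi = stack.pop()
        l, children = child_intervals(lcptab, lo, hi)
        result.append((l, lo, hi))
        stack.extend((a, b) for a, b in children if a < b)
    return sorted(result, key=lambda t: (t[1], -t[2]))


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # for acaaacatat$, '$' sorting last
for l, i, j in lcp_intervals(lcptab):
    print(f"{l}-[{i}..{j}]")
```

For the running example this lists the same seven intervals that appear in the tree on the slide.
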
Our accomplishment
• Improvement (fine tuning):
  • Alphabet-independent exact pattern matching
  • Improving the bucket table representation
  • Improving access to the lcp-table
• Improvements are achieved using minimal perfect hashing techniques.
Minimal perfect hashing (MPHF)
• Stores n static keys from a universe U in O(n) space with O(1) access time [Botelho et al.]
• A plain look-up table needs O(|U|) space to achieve constant access time.
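
To make the idea concrete, here is a toy minimal perfect hash: it brute-forces a seed for which a seeded hash maps n distinct keys onto the slots 0..n-1 without collisions. This is only a sketch for very small key sets and is not the construction of Botelho et al.; the names `mph_index` and `build_mphf` are hypothetical.

```python
import hashlib


def mph_index(key, seed, n):
    """Seeded hash of a string key, reduced to a slot in 0..n-1."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % n


def build_mphf(keys, max_seed=100_000):
    """Find a seed that maps the distinct keys one-to-one onto 0..n-1.

    Brute force: only practical for a handful of keys.
    """
    n = len(keys)
    for seed in range(max_seed):
        if len({mph_index(k, seed, n) for k in keys}) == n:
            return seed
    raise ValueError("no seed found; key set too large for brute force")


keys = ["a", "c", "t", "$"]
seed = build_mphf(keys)

# Store the keys in exactly n slots; each lookup is one hash evaluation.
slots = [None] * len(keys)
for k in keys:
    slots[mph_index(k, seed, len(keys))] = k
print(seed, slots)
```

A minimal perfect hash alone cannot reject keys outside the static set; comparing the queried key with the key stored in its slot, as `slots` allows here, covers that case.
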
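Exact pattern matching problem
• e.g., pattern = aca
• Matching walks the l-interval tree from the root: edge a leads to 1-[0..5], then edge c leads to 3-[2..3], whose suffixes all start with aca.
  0-[0..10]
    a: 1-[0..5]
      a: 2-[0..1]
      c: 3-[2..3]
      t: 2-[4..5]
    c: 2-[6..7]
    t: 1-[8..9]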
Exact pattern matching problem
• Naive method: O(nm) time, for a text of length n and a pattern of length m.
• Using the enhanced suffix array: O(|Σ|m) [AbouElHoda et al.]
• Other modifications to the enhanced suffix array bring this down to O(m log |Σ|) [Kim et al.], [Fischer et al.]
Exact pattern matching problem
• Our work:
  • Using minimal perfect hashing, it can be achieved in O(m), removing the alphabet factor.
  0-[0..10]  (MPHF table)
    a: 1-[0..5]  (MPHF table)
      a: 2-[0..1]
      c: 3-[2..3]
      t: 2-[4..5]
    c: 2-[6..7]
    t: 1-[8..9]
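
The matching procedure can be sketched as follows, under stated assumptions: every l-interval carries a table from the next character to the child interval, and a Python dict stands in for the per-node MPHF table. The helper names are hypothetical, $ is assumed to sort last, and the construction is naive rather than the authors' implementation.

```python
def suffix_array(s):
    # Naive construction; '$' is treated as the largest character (assumption).
    return sorted(range(len(s)),
                  key=lambda i: [0x110000 if c == "$" else ord(c) for c in s[i:]])


def lcp_table(s, suftab):
    lcptab = [0] * len(suftab)
    for i in range(1, len(suftab)):
        a, b = suftab[i - 1], suftab[i]
        k = 0
        while s[a + k] == s[b + k]:  # the unique '$' guarantees termination
            k += 1
        lcptab[i] = k
    return lcptab


def build_child_tables(s, suftab, lcptab):
    """Map each interval (lo, hi) to (l, {next char: child interval})."""
    tables = {}
    stack = [(0, len(s) - 1)]
    while stack:
        lo, hi = stack.pop()
        l = min(lcptab[lo + 1 : hi + 1])
        cuts = [k for k in range(lo + 1, hi + 1) if lcptab[k] == l]
        children = list(zip([lo] + cuts, [k - 1 for k in cuts] + [hi]))
        # The character right after the shared prefix identifies the child.
        tables[(lo, hi)] = (l, {s[suftab[a] + l]: (a, b) for a, b in children})
        stack.extend(c for c in children if c[0] < c[1])
    return tables


def find(pattern, s, suftab, tables):
    """Return the suffix-array interval of pattern, or None if absent."""
    lo, hi = 0, len(s) - 1
    d, m = 0, len(pattern)  # d = pattern characters matched so far
    while d < m:
        if lo == hi:  # single suffix left: compare the rest directly
            ok = s.startswith(pattern[d:], suftab[lo] + d)
            return (lo, hi) if ok else None
        l, table = tables[(lo, hi)]
        if d < l:  # check the characters shared by the whole interval
            end = min(l, m)
            if s[suftab[lo] + d : suftab[lo] + end] != pattern[d:end]:
                return None
            d = end
        else:  # one table lookup per branching step, independent of |Σ|
            child = table.get(pattern[d])
            if child is None:
                return None
            lo, hi = child
    return (lo, hi)


s = "acaaacatat$"
suftab = suffix_array(s)
tables = build_child_tables(s, suftab, lcp_table(s, suftab))
lo, hi = find("aca", s, suftab, tables)
print((lo, hi), sorted(suftab[lo : hi + 1]))  # (2, 3) [0, 4]
```

Each pattern character is compared a constant number of times and each branching step costs one table lookup, which is where the O(m) bound comes from once the lookup is O(1).
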
Improving the bucket table representation
• Bucket table: a table used to jump directly to a certain lcp-interval in the suffix array
Improving the bucket table representation (cont.)
• Problem:
  • The look-up table has |Σ|^d entries for prefix length d, so its space consumption is prohibitive for large d and Σ.
• Solution:
  • Use minimal perfect hashing techniques to store the look-up table.
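
A small sketch of the space argument: only the prefixes of length d that actually occur in the text need an entry, whereas a direct look-up table reserves all |Σ|^d slots. A Python dict stands in for the MPHF-backed table, the function names are hypothetical, and $ is assumed to sort last.

```python
def suffix_array(s):
    # Naive construction; '$' is treated as the largest character (assumption).
    return sorted(range(len(s)),
                  key=lambda i: [0x110000 if c == "$" else ord(c) for c in s[i:]])


def bucket_table(s, suftab, d):
    """Map every length-d prefix occurring in s to its interval (lo, hi)
    in the suffix array. Prefixes that run into the sentinel are skipped."""
    buckets = {}
    for idx, pos in enumerate(suftab):
        prefix = s[pos : pos + d]
        if len(prefix) < d or "$" in prefix:
            continue
        # Suffixes are sorted, so equal prefixes are contiguous.
        lo, _ = buckets.get(prefix, (idx, idx))
        buckets[prefix] = (lo, idx)
    return buckets


s = "acaaacatat$"
d = 2
buckets = bucket_table(s, suffix_array(s), d)
alphabet = set(s) - {"$"}
print(buckets)
print(len(buckets), "stored entries vs", len(alphabet) ** d, "look-up table slots")
```

On this tiny example 5 of the 9 possible prefixes occur; the gap widens quickly as d grows, since the number of stored entries can never exceed the text length.
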
Improving the bucket table representation (cont.)
• Results:
  • For the bacterial E. coli genome (size = 5400 bp) and d = 12
  • *N stands for an undefined nucleotide or dummy character
Conclusion
• Exact pattern matching problem
• Improving the bucket table representation
• Improving access to the lcp-table
Questions?
Improving access to the lcp-table
• To reduce space, each lcp-table entry is stored in 1 byte.
• If a common prefix is longer than 255, its length is stored in an extra table.
• The extra table is searched sequentially or by binary search.
• Our enhancement:
  • Use an MPHF to store the extra table so that it can be accessed in constant time.
• Figure: lcp-table and extra lcp-table; values shown: 0, 2, 257, 279, 3, 300, 2, 260, 0
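
The two-table layout can be sketched as below. The assumptions are stated loudly: the byte value 255 is used as an overflow marker (so values of 255 or more go to the extra table), a Python dict keyed by position stands in for the MPHF, and the example values are the numbers visible in the slide's figure, whose exact arrangement is not recoverable. The function names are hypothetical.

```python
MARKER = 255  # assumption: this byte value flags "look in the extra table"


def pack_lcp(lcp_values):
    """Store small lcp values in one byte each; large ones go to an extra
    table keyed by their position."""
    small = bytearray()
    extra = {}  # dict stands in for the MPHF-indexed extra table
    for i, value in enumerate(lcp_values):
        if value >= MARKER:
            small.append(MARKER)
            extra[i] = value
        else:
            small.append(value)
    return small, extra


def get_lcp(small, extra, i):
    """Constant-time access: one byte read, plus one hash lookup on overflow."""
    return small[i] if small[i] < MARKER else extra[i]


values = [0, 2, 257, 279, 3, 300, 2, 260, 0]
small, extra = pack_lcp(values)
print(list(small))  # [0, 2, 255, 255, 3, 255, 2, 255, 0]
print(extra)        # {2: 257, 3: 279, 5: 300, 7: 260}
```

With a sorted extra table the overflow lookup would need a binary search; keying it by position through a hash removes that logarithmic factor.
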