Memory-aware BWT by Segmenting Sequences
Presented by Jiaying Wang, April 12, 2012
The 14th Asia-Pacific Web Conference (APWeb)
Northeastern University, China
Motivation
• Most interesting massive data sets contain string data (web data, record data, genome data, etc.).
• As a full-text index, the BWT provides fast substring search over large text collections.
• Building the BWT has an enormous memory cost: n log n + n log σ bits.
Preliminaries
• Text: T[0..n − 1], with T[i] ∈ Σ and |Σ| = σ.
• We append a $ to the end of the text; $ does not belong to Σ.
• T[i..j] is the sequence starting at position i and ending at position j.
• T[i..j] is the empty string iff i > j, a prefix of T iff i = 0, and a suffix of T iff j = n − 1.
Problem definition
• Let T[0..n − 1] be a text and P[0..m − 1] be a query.
• The subsequence matching problem is to find all start positions of occurrences of P in T, i.e. {i | 0 ≤ i ≤ n − m; T[i..i + m − 1] = P[0..m − 1]}.
• We also take the memory cost into account: the process should keep both the query time and the memory cost low.
BWT transformation
text: mississippi$
bwt:  ipssm$pissii

Rotations of the text:
mississippi$
ississippi$m
ssissippi$mi
sissippi$mis
issippi$miss
ssippi$missi
sippi$missis
ippi$mississ
ppi$mississi
pi$mississip
i$mississipp
$mississippi

Sorted rotations (SA = suffix array, F = first column, L = last column):
SA  F  sorted rotation  L
11  $  $mississippi     i
10  i  i$mississipp     p
 7  i  ippi$mississ     s
 4  i  issippi$miss     s
 1  i  ississippi$m     m
 0  m  mississippi$     $
 9  p  pi$mississip     p
 8  p  ppi$mississi     i
 6  s  sippi$missis     s
 3  s  sissippi$mis     s
 5  s  ssippi$missi     i
 2  s  ssissippi$mi     i
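The transformation above can be reproduced in a few lines of Python. This is a minimal sketch, not the construction used in the talk: it sorts whole suffixes naively, which is fine for a toy string but far too slow and memory-hungry for large texts. The function names `suffix_array` and `bwt_from_sa` are illustrative, and the code assumes the text already ends with a unique `$` that sorts before every other character (true for lowercase ASCII text).

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order."""
    # Naive construction: compare whole suffixes. Because text ends with a
    # unique, smallest '$', sorting suffixes also sorts the rotations.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """Return the last column L: the character preceding each sorted suffix."""
    # For i == 0, text[i - 1] wraps around to the final '$'.
    return "".join(text[i - 1] for i in sa)


text = "mississippi$"
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
print(sa)   # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(bwt)  # ipssm$pissii
```

The suffix array holds one integer per character, which is where the n log n term of the memory cost comes from.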
Backward search on BWT
l = 0, h = bwt.length
for i from pat.length − 1 down to 0:
    k = pat[i]
    l = C[k] + occ(k, l)
    h = C[k] + occ(k, h)
return h − l

Here C[k] is the number of characters in the text that are smaller than k, and occ(k, x) is the number of occurrences of k in the first x characters of bwt.
(Figure: the sorted rotations of mississippi$ with their F and L columns, illustrating the search for "ssi".)
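The pseudocode above can be made runnable as follows. This is a sketch under stated assumptions: `occ` is computed by scanning the BWT string, whereas a real index would use a rank structure, and the helper names `build_c`, `occ`, and `backward_search` are illustrative. The range is half-open, so rows l..h − 1 of the sorted rotations start with the pattern.

```python
from collections import Counter


def build_c(bwt):
    """C[k] = number of characters in the text that are smaller than k."""
    counts = Counter(bwt)
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch] = total
        total += counts[ch]
    return c


def occ(bwt, ch, pos):
    """Number of occurrences of ch in bwt[0:pos] (naive scan)."""
    return bwt.count(ch, 0, pos)


def backward_search(bwt, pat):
    """Return the half-open row range [l, h) of rotations starting with pat."""
    c = build_c(bwt)
    l, h = 0, len(bwt)
    for ch in reversed(pat):        # process the pattern right to left
        if ch not in c:
            return 0, 0             # character never occurs in the text
        l = c[ch] + occ(bwt, ch, l)
        h = c[ch] + occ(bwt, ch, h)
        if l >= h:
            return 0, 0             # range became empty: no occurrence
    return l, h


BWT = "ipssm$pissii"
SA = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
l, h = backward_search(BWT, "ssi")
print(h - l)            # 2
print(sorted(SA[l:h]))  # [2, 5]
```

For "ssi" the range shrinks from [0, 12) to [1, 5) after "i", to [8, 10) after "si", and to [10, 12) after "ssi"; the suffix array maps those two rows to text positions 5 and 2.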
Memory cost analysis
• Building the BWT has an enormous memory cost: n log n + n log σ bits.
• This is about 5n bytes, so a 1 GB text needs about 5 GB.
• Example: for the text mississippi$ (n = 12), with bwt ipssm$pissii and SA: 11 10 7 4 1 0 9 8 6 3 5 2, the cost is 12 + 12 × 4 = 12 × 5 bytes.
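The 5n figure can be checked with a short calculation. The sketch below assumes 1 byte per character and 4 bytes per suffix array entry, which is what the 12 + 12 × 4 example implies; the function name `bwt_build_bytes` is illustrative.

```python
def bwt_build_bytes(n, char_bytes=1, sa_entry_bytes=4):
    """Bytes needed to hold the text plus its suffix array during construction."""
    return n * char_bytes + n * sa_entry_bytes


print(bwt_build_bytes(12))               # 60
print(bwt_build_bytes(2**30) / 2**30)    # 5.0
```

Note that 4-byte entries can only address texts shorter than 2^32 characters; longer texts would need wider entries and therefore more than 5n bytes.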
Our idea (1/2)
• Split the text into segments: mississippi becomes missis and sippi.
• Loading one segment at a time saves memory.
• Search "ssi": how do we find an occurrence that is split across two segments?
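Our idea (2/2)
• Let the segments overlap: mississippi becomes mississi and issippi.
• Search "ssi": oops, we find another one. The occurrence at position 5 lies in both segments, so it is reported twice.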
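Both situations can be replayed on the toy string. The sketch below is only an illustration of the segmentation idea: plain `str.find` stands in for a per-segment BWT search, and the helper names `find_all`, `split`, and `search_segments` as well as the segment lengths are assumptions chosen to reproduce the segments shown on the slides.

```python
def find_all(seg, pat):
    """All start positions of pat in seg (overlapping matches included)."""
    hits, p = [], seg.find(pat)
    while p != -1:
        hits.append(p)
        p = seg.find(pat, p + 1)
    return hits


def split(text, seg_len, overlap=0):
    """Cut text into (start, segment) pairs; neighbours share `overlap` chars."""
    if not 0 <= overlap < seg_len:
        raise ValueError("need 0 <= overlap < seg_len")
    segments, start = [], 0
    while True:
        segments.append((start, text[start:start + seg_len]))
        if start + seg_len >= len(text):
            return segments
        start += seg_len - overlap


def search_segments(text, pat, seg_len, overlap=0):
    """Search every segment on its own and report global start positions."""
    return [start + p
            for start, seg in split(text, seg_len, overlap)
            for p in find_all(seg, pat)]


text = "mississippi"
print(split(text, 6))                      # [(0, 'missis'), (6, 'sippi')]
print(search_segments(text, "ssi", 6))     # [2]
print(split(text, 8, 4))                   # [(0, 'mississi'), (4, 'issippi')]
print(search_segments(text, "ssi", 8, 4))  # [2, 5, 5]
```

Disjoint segments miss the occurrence at position 5, while overlapping segments find it twice.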
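BWT on Overlapped Segments
(Figure: T is cut into overlapping segments T1, T2, …, Tk, with the lengths L and l marked on the segments; each segment is transformed separately by bwt, giving BWT1, BWT2, …, BWTk.)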
Searching cases
Prerequisite: query length ≤ l
• For the second case, we have to remove duplicates from the results.
Filtering method
• Filter interval: f = l − m
• All occurrences starting at positions in a filter interval should be filtered out.
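A possible reading of the filter is sketched below. The slide does not state where the filter interval sits or how its boundary is counted, so the code makes an explicit assumption: in every segment except the first, a match at local offset p ≤ l − m lies entirely inside the overlap, was therefore already reported by the previous segment, and is dropped. As before, `str.find` stands in for the per-segment BWT search and all function names are illustrative.

```python
def find_all(seg, pat):
    """All start positions of pat in seg (overlapping matches included)."""
    hits, p = [], seg.find(pat)
    while p != -1:
        hits.append(p)
        p = seg.find(pat, p + 1)
    return hits


def split_overlapped(text, seg_len, overlap):
    """Cut text into (start, segment) pairs whose neighbours share `overlap` chars."""
    if not 0 <= overlap < seg_len:
        raise ValueError("need 0 <= overlap < seg_len")
    segments, start = [], 0
    while True:
        segments.append((start, text[start:start + seg_len]))
        if start + seg_len >= len(text):
            return segments
        start += seg_len - overlap


def search_overlapped(text, pat, seg_len, overlap):
    """Report each occurrence once, using the filter interval f = l - m."""
    m = len(pat)
    if m > overlap:
        raise ValueError("prerequisite violated: query length must be <= l")
    f = overlap - m
    hits = []
    for idx, (start, seg) in enumerate(split_overlapped(text, seg_len, overlap)):
        for p in find_all(seg, pat):
            # Assumed convention: offsets 0..f of a non-first segment are
            # fully covered by the previous segment, so skip them here.
            if idx > 0 and p <= f:
                continue
            hits.append(start + p)
    return hits


print(search_overlapped("mississippi", "ssi", 8, 4))  # [2, 5]
```

With l = 4 and m = 3 the filter interval is f = 1, so the match at local offset 1 of the second segment (global position 5) is dropped and each occurrence is reported exactly once.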
BWT on Disjoint Segments
(Figure: T is cut into disjoint segments T1, T2, …, Tk; each segment is transformed separately by bwt, giving BWT1, BWT2, …, BWTk.)
Searching cases
• For the second case, we need to
• 1 Find a suffix of the query that is a prefix of a segment.
• 2 Verify that the remaining prefix of the query matches the end of the segment to its left.
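The two steps can be sketched on plain strings as follows. This is an illustration of the boundary case only, not the index-based procedure of the talk: `startswith` and `endswith` stand in for the suffix check and the prefix verification, the function names are hypothetical, and the code assumes the query is no longer than a segment so that an occurrence touches at most two segments.

```python
def find_all(seg, pat):
    """All start positions of pat in seg (overlapping matches included)."""
    hits, p = [], seg.find(pat)
    while p != -1:
        hits.append(p)
        p = seg.find(pat, p + 1)
    return hits


def search_disjoint(text, pat, seg_len):
    """Find pat in a text stored as disjoint segments of length seg_len."""
    m = len(pat)
    if m > seg_len:
        raise ValueError("sketch assumes query length <= segment length")
    segs = [(s, text[s:s + seg_len]) for s in range(0, len(text), seg_len)]
    hits = []
    for idx, (start, seg) in enumerate(segs):
        if idx > 0:
            left = segs[idx - 1][1]
            # Second case: the occurrence crosses the boundary before `seg`.
            # Longer left parts start earlier, so try the largest cut first.
            for cut in range(m - 1, 0, -1):
                # Step 1: a suffix of the query is a prefix of this segment.
                # Step 2: the remaining prefix ends the segment to the left.
                if seg.startswith(pat[cut:]) and left.endswith(pat[:cut]):
                    hits.append(start - cut)
        # First case: the occurrence lies inside one segment.
        hits.extend(start + p for p in find_all(seg, pat))
    return hits


print(search_disjoint("mississippi", "ssi", 6))  # [2, 5]
```

Here the suffix "si" is a prefix of sippi and the remaining prefix "s" ends missis, which recovers the occurrence at position 5. Trying every cut with string comparisons costs more than linear time in m; this sketch makes no attempt to match the bound stated in the talk.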
Suffix checking
Time complexity: Θ(m)
Prefix verification
• To verify the prefix, we can
• 1 keep the text (wastes space), or
• 2 revert the text on the fly (wastes a little time).
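Option 2 can be illustrated with the standard backward walk over a BWT, which recovers the end of a text without storing the text itself. This is a sketch of that general technique, offered as one plausible reading of "revert text on the fly" rather than the talk's exact procedure. It assumes the segment was terminated by a unique `$` that sorts before all other characters, and the names `build_c` and `last_chars_from_bwt` are illustrative.

```python
from collections import Counter


def build_c(bwt):
    """C[k] = number of characters in the text that are smaller than k."""
    counts = Counter(bwt)
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch] = total
        total += counts[ch]
    return c


def last_chars_from_bwt(bwt, r):
    """Recover the last r characters of the text (before '$') from its BWT."""
    c = build_c(bwt)
    row, out = 0, []        # row 0 is the rotation that starts with '$'
    for _ in range(r):
        ch = bwt[row]       # character that precedes the current suffix
        if ch == "$":
            break           # walked back past the start of the text
        out.append(ch)
        # LF-mapping: jump to the row whose suffix starts with this character.
        row = c[ch] + bwt.count(ch, 0, row)
    return "".join(reversed(out))


print(last_chars_from_bwt("ipssm$pissii", 4))  # ippi
```

To verify a remaining query prefix of length r against the left segment, only r backward steps are needed, which is the time this option trades for not keeping the text.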
Analysis
Overlap method
• Memory cost: (n + l + k) × (log σ + log(n + l + k) − log k)/k
• Time complexity: Θ(occ + δ + mk)
Backwalk method
• Memory cost: n(log σ + log n − log k)/k bits
• Time complexity: Θ(occ + (η + k)m)
Experiment
Environment
• C++ language
• PC with 2.93 GHz Intel Core CPU and 4 GB main memory
• Ubuntu operating system (Linux distribution)
Data sets
• English text from the Pizza&Chili Corpus
• Genome sequence from UCSC goldenPath
Performance on English
(Figure panels: memory cost, build time, and two query time plots.)
Performance on genome
(Figure panels: memory cost, build time, and two query time plots.)
Conclusion
• We propose a novel variation of BWT called S-BWT.
• Our index uses less memory than BWT.
• We give two query methods based on S-BWT.
• Our method is faster than the BWT method on large texts.