Memory-aware BWT by Segmenting Sequences

Memory-aware BWT by Segmenting Sequences, presented by Jiaying Wang, April 12, 2012. The 14th Asia-Pacific Web Conference (APWeb). Northeastern University, China.

Motivation: Most interesting massive data sets contain string data (web data, record data, genome data, etc.). As a full-text index, the BWT provides fast substring search over large text collections. However, building the BWT has an enormous memory cost: n log n + n log σ bits.

Preliminaries: The text is T[0..n − 1], with T[i] ∈ Σ and |Σ| = σ. We append a terminator $ to the end of the text; $ does not belong to Σ. T[i..j] is the sequence starting at position i and ending at position j. It is the empty string iff i > j, a prefix iff i = 0, and a suffix iff j = n − 1.

Problem definition: Let T[0..n − 1] be a text and P[0..m − 1] be a query. The subsequence matching problem is to find all start positions of occurrences of P in T, i.e. {i | 0 ≤ i ≤ n − m; T[i..i + m − 1] = P[0..m − 1]}. We also take the memory cost into account: the process should guarantee both query efficiency and low memory cost at the same time.
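
As a reference point for the definition above, here is a minimal sketch of the matching problem solved by direct comparison, with no index at all. The function name `find_occurrences` is hypothetical and not from the slides; it only pins down which positions count as answers.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return every start position i with text[i:i+m] == pattern."""
    n, m = len(text), len(pattern)
    # Valid start positions run from 0 to n - m inclusive.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


print(find_occurrences("mississippi", "ssi"))  # [2, 5]
```

This scan costs O(nm) time per query, which is what a full-text index such as the BWT is meant to avoid.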

BWT transformation
text: mississippi$
bwt: ipssm$pissii
Rotations of the text, in text order: mississippi$ ississippi$m ssissippi$mi sissippi$mis issippi$miss ssippi$missi sippi$missis ippi$mississ ppi$mississi pi$mississip i$mississipp $mississippi
Sorted rotations, with the suffix array (SA), the first column (F) and the last column (L):
SA   F   rotation       L
11   $   $mississippi   i
10   i   i$mississipp   p
 7   i   ippi$mississ   s
 4   i   issippi$miss   s
 1   i   ississippi$m   m
 0   m   mississippi$   $
 9   p   pi$mississip   p
 8   p   ppi$mississi   i
 6   s   sippi$missis   s
 3   s   sissippi$mis   s
 5   s   ssippi$missi   i
 2   s   ssissippi$mi   i
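
The transformation above can be reproduced with a short sketch. It builds the suffix array by plainly sorting suffixes, which is simple but not the memory-efficient construction the talk is concerned with. The function names are hypothetical, and the sketch assumes the text ends with a `$` that sorts before every other character.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of `text`, in sorted order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text: str, sa: list[int]) -> str:
    """L column: the character that precedes each sorted suffix.

    For the suffix starting at 0, text[-1] wraps around to the final '$'.
    """
    return "".join(text[i - 1] for i in sa)


text = "mississippi$"
sa = suffix_array(text)
print(sa)                     # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(bwt_from_sa(text, sa))  # ipssm$pissii
```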

Backward search on BWT
l ← 0, h ← bwt.length
for i from pat.length − 1 down to 0:
    k = pat[i]
    l = C[k] + occ(k, l)
    h = C[k] + occ(k, h)
return h − l
Here C[k] is the number of characters in the text that are smaller than k, and occ(k, x) is the number of occurrences of k in bwt[0..x − 1].
Example: searching "ssi" in mississippi$ (figure: the sorted rotations with their F and L columns, with the matching range narrowing at each step).
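
The backward search loop can be sketched in runnable form as follows. The helper `build_index` and its table layout are assumptions made for illustration: it stores a full occ table, which is easy to read but far from the compact rank structures a real index would use.

```python
def build_index(bwt: str):
    """Build C (count of smaller characters) and occ (prefix counts) for a BWT."""
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    # occ[c][x] = number of occurrences of c in bwt[0:x]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for x, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][x + 1] = occ[c][x] + (1 if c == ch else 0)
    return C, occ


def backward_search(bwt: str, pat: str) -> int:
    """Count occurrences of `pat` by narrowing the row range [l, h)."""
    C, occ = build_index(bwt)
    l, h = 0, len(bwt)
    for k in reversed(pat):
        if k not in C:
            return 0  # character never appears in the text
        l = C[k] + occ[k][l]
        h = C[k] + occ[k][h]
        if l >= h:
            return 0  # range became empty: no match
    return h - l


print(backward_search("ipssm$pissii", "ssi"))  # 2
```

For "ssi" the range narrows from [0, 12) to [1, 5), then [8, 10), then [10, 12), whose suffix array entries are 5 and 2.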

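Memory cost analysis: Building the BWT has an enormous memory cost: n log n + n log σ bits, which is about 5n bytes (a 1 GB text needs about 5 GB). For example, for mississippi$ (bwt: ipssm$pissii, SA: 11 10 7 4 1 0 9 8 6 3 5 2), the text takes 12 bytes and the suffix array takes 12 × 4 bytes, so 12 + 12 × 4 = 12 × 5 = 60 bytes.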
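
The arithmetic on this slide can be checked with a tiny sketch. It assumes 1-byte characters and 4-byte suffix array entries, which is what the 12 + 12 × 4 example implies; the function name is hypothetical.

```python
def build_memory_bytes(n: int, sa_entry_bytes: int = 4, char_bytes: int = 1) -> int:
    """Bytes held while building the BWT: the text plus one SA entry per character."""
    return n * char_bytes + n * sa_entry_bytes


print(build_memory_bytes(12))                # 60 bytes for mississippi$
print(build_memory_bytes(2**30) // 2**30)    # 5, i.e. about 5 GiB for a 1 GiB text
```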

Our idea (1/2): Split mississippi into the segments missis and sippi. Loading one segment at a time helps us save memory. But when searching for ssi, how do we find an occurrence that is split across two segments?

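Our idea (2/2): Split mississippi into the overlapping segments mississi and issippi. Searching for ssi: oops, the occurrence inside the overlap is found again in the second segment, so we get a duplicate.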

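BWT on Overlapped Segments: The text T is cut into overlapping segments T1, T2, …, Tk, and a BWT is built on each segment, giving BWT1, BWT2, …, BWTk. (Figure labels: L on a segment, l on the overlap between adjacent segments.)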

Searching cases. Prerequisite: query length ≤ l. For the second case, we have to remove duplicates from the results.

Filtering method: The filter interval is f = l − m. All occurrences starting at positions in a filter interval should be filtered.
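
The overlap-and-filter idea can be sketched as below. Several things here are assumptions rather than the authors' method: each segment is scanned directly instead of through its own BWT, the parameter names `seg_len` and `overlap` are hypothetical (with `overlap` playing the role of l), and the filter is expressed as "drop a hit that lies entirely inside the overlap with the previous segment", since the exact interval boundaries in the slides depend on a figure that is not reproduced here.

```python
def split_overlapping(text: str, seg_len: int, overlap: int):
    """Cut `text` into (offset, segment) pairs; neighbours share `overlap` chars."""
    step = seg_len - overlap
    segments, start = [], 0
    while True:
        segments.append((start, text[start:start + seg_len]))
        if start + seg_len >= len(text):
            return segments
        start += step


def search_overlapped(text: str, pat: str, seg_len: int, overlap: int) -> list[int]:
    """Search segment by segment, filtering hits already seen in the previous one."""
    m = len(pat)
    assert 0 < m <= overlap < seg_len  # prerequisite: query length <= overlap
    hits = []
    for idx, (offset, seg) in enumerate(split_overlapping(text, seg_len, overlap)):
        for local in range(len(seg) - m + 1):
            if seg[local:local + m] != pat:
                continue
            # A hit wholly inside the leading overlap also lies in the
            # previous segment, so it is a duplicate and is filtered out.
            if idx > 0 and local + m <= overlap:
                continue
            hits.append(offset + local)
    return hits


# Segments are "mississi" and "issippi", as in the slides.
print(search_overlapped("mississippi", "ssi", seg_len=8, overlap=4))  # [2, 5]
```

Without the filter, the occurrence at position 5 would be reported once per segment.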

Searching algorithm

BWT on Disjoint Segments: The text T is cut into disjoint segments T1, T2, …, Tk, and a BWT is built on each segment, giving BWT1, BWT2, …, BWTk.

Searching cases
• For the second case, we need to:
• 1. Find a suffix of the query that is a prefix of a segment.
• 2. Verify that the remaining prefix of the query matches the end of the segment to its left.
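
The two steps for the boundary-crossing case can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: plain string comparisons stand in for the per-segment BWT operations, the query is assumed to be no longer than a segment (so an occurrence crosses at most one boundary), and the names `split_disjoint` and `search_disjoint` are hypothetical.

```python
def split_disjoint(text: str, seg_len: int) -> list[str]:
    """Cut `text` into consecutive, non-overlapping segments."""
    return [text[i:i + seg_len] for i in range(0, len(text), seg_len)]


def search_disjoint(text: str, pat: str, seg_len: int) -> list[int]:
    m = len(pat)
    assert 0 < m <= seg_len  # assumption: a hit spans at most two segments
    segs = split_disjoint(text, seg_len)
    hits = []
    for j, seg in enumerate(segs):
        base = j * seg_len
        # First case: the occurrence lies entirely inside one segment.
        hits.extend(base + i for i in range(len(seg) - m + 1)
                    if seg[i:i + m] == pat)
        # Second case: the occurrence crosses the boundary before segment j.
        if j > 0:
            left = segs[j - 1]
            for s in range(1, m):  # s = number of query characters inside seg
                # Step 1: a suffix of the query is a prefix of this segment.
                # Step 2: the rest of the query ends the left segment.
                if seg[:s] == pat[m - s:] and left.endswith(pat[:m - s]):
                    hits.append(base - (m - s))
    return sorted(hits)


# Segments are "missis" and "sippi", as in the slides.
print(search_disjoint("mississippi", "ssi", seg_len=6))  # [2, 5]
```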

Suffix checking Time complexity: Θ(m)

Prefix verification
• To verify the prefix, we can either:
• 1. Keep the text (wastes space).
• 2. Recover the text from the BWT on the fly (wastes a little time).
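
Recovering text on the fly can be sketched with the standard LF-mapping walk, which reads a segment backwards from its end using only the BWT. Treating this as what the slide means by reverting the text is an assumption, and the function names are hypothetical. The sketch assumes each segment was indexed with a trailing `$`.

```python
def bwt_of(segment: str) -> str:
    """Naive BWT of `segment + '$'` (helper for the usage example only)."""
    text = segment + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)


def last_chars_from_bwt(bwt: str, k: int) -> str:
    """Recover the last k characters of the indexed segment from its BWT."""
    # C[c] = number of characters in the BWT that are smaller than c.
    C, total = {}, 0
    for c in sorted(set(bwt)):
        C[c] = total
        total += bwt.count(c)
    # rank[i] = number of occurrences of bwt[i] in bwt[0:i].
    seen, rank = {}, []
    for ch in bwt:
        rank.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    out, row = [], 0  # row 0 is the rotation that starts with '$'
    for _ in range(k):
        ch = bwt[row]
        if ch == "$":
            break  # walked past the start of the segment
        out.append(ch)
        row = C[ch] + rank[row]  # LF mapping: jump to the row starting with ch
    return "".join(reversed(out))


print(last_chars_from_bwt(bwt_of("missis"), 3))  # sis
```

Verifying a query prefix of length p against the left segment then costs p LF steps instead of keeping the segment text in memory.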

Searching algorithm

Analysis
Overlap method: memory cost (n + l + k) × (log σ + log(n + l + k) − log k)/k bits; time complexity Θ(occ + δ + mk).
Backwalk method: memory cost n(log σ + log n − log k)/k bits; time complexity Θ(occ + (η + k)m).

Experiment
Environment: C++; PC with a 2.93 GHz Intel Core CPU and 4 GB main memory; Ubuntu operating system (Linux distribution).
Data sets: English text from the Pizza&Chili Corpus; genome sequence from UCSC goldenPath.

Performance on English (figure panels: memory cost, build time, query time, query time)

Performance on genome (figure panels: memory cost, build time, query time, query time)

More performance

Conclusion: We propose a novel variation of BWT called S-BWT. Our index uses less memory than BWT. We give two query methods based on S-BWT. Our method is faster than the BWT method on large texts.

Thank you! Q&A
