Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006
Outline • The Text Searching Problem • What is Full-Text Indexing? • Burrows-Wheeler Transform (BWT) • BWT as a Full-Text Index • Related work
Text Searching • Text: acacaaccagtcacactagac…… • Pattern: acac • Where does the pattern occur in the text?
How fast can we search? • Let n be the length of the text and m be the length of the pattern • We can find all positions where the pattern appears in O(n + m) time • Knuth-Morris-Pratt, Boyer-Moore • Is O(n + m) time good? • Yes: if the text cannot be preprocessed, it is optimal in the worst case, because every character may have to be read
Text Searching (take 2) • Suppose we know the text in advance and can preprocess it • Text: acacaaccagtcacactagac…… • Pattern: acac • Where does the pattern occur in the text?
Can we do better? • Yes, we can build a data structure for the text so that a pattern search takes only O(m + occ) time, where occ = number of times the pattern appears in the text • Such a data structure is called an index • Is O(m + occ) time useful? • Yes, if the text is very long and it is searched many times for different patterns
Full-Text Index • Full-Text Index • An index built over the whole text • Every position in the text can be the start of a pattern appearance (hence "full") • Word-Level Index • Text is treated as a sequence of words • Positions inside a word do not correspond to the appearance of any pattern • E.g., Text: Was it a cat I saw? (Pattern: “at” has no appearance)
Suffix Tree: An Optimal Full-Text Index • As mentioned, we can create an index for the text such that pattern searching can be done in O(m + occ) time, where occ is the number of appearances of the pattern • This time is optimal • One such index is the Suffix Tree • Introduced by P. Weiner in 1973; E. McCreight gave a simpler, more space-efficient construction in 1976
Suffix and Suffix Tree • Given a string S, a substring of S that ends at the last position is called a suffix of S • If S consists of n chars, S has exactly n suffixes • Theorem: If a pattern P appears at position j in S, P appears at the beginning of the suffix of S that starts at position j
E.g., S: acacaac# • Suffixes of S: • acacaac# (start at pos 1) • cacaac# (start at pos 2) • acaac# (start at pos 3) • caac# (start at pos 4) • aac# (start at pos 5) • ac# (start at pos 6) • c# (start at pos 7) • # (start at pos 8) • Suppose P = ac is a pattern. Then, P appears at pos 1, pos 3 and pos 6 in S.
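The theorem can be checked directly on this example. The following is a minimal Python sketch, not part of the original slides; the function names `suffixes` and `occurrences` are chosen here for illustration. It enumerates every suffix and keeps those that begin with the pattern, which takes far more than O(m + occ) time but makes the correspondence between appearances and suffixes explicit.

```python
def suffixes(s):
    """Return (start_position, suffix) pairs, with 1-based positions as in the slides."""
    return [(i + 1, s[i:]) for i in range(len(s))]


def occurrences(s, p):
    """Positions j such that p is a prefix of the suffix of s starting at j."""
    return [j for j, suffix in suffixes(s) if suffix.startswith(p)]


S = "acacaac#"
for pos, suffix in suffixes(S):
    print(pos, suffix)
print(occurrences(S, "ac"))  # [1, 3, 6]
```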
Suffix and Suffix Tree (2) • The suffix tree is an edge-labeled compact tree (no degree-1 nodes) with n leaves such that • each leaf corresponds to a suffix • concatenating the edge labels along the path from the root to a leaf gives the corresponding suffix • the edge labels to the children of a node start with different characters • Example (next slide)
[Figure: The Suffix Tree of acacaac#, with leaves numbered 1 to 8 by the starting position of the corresponding suffix]
Searching with Suffix Tree • To search for P, we match P starting from the root • If we can match P successfully in the tree, the leaves under the stop point are exactly the suffixes that correspond to an appearance of P in the text • Then, we traverse the tree under the stop point to report where P appears • So, searching is done in O(m + occ) time, where occ is the number of appearances of P
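The two search phases (match from the root, then collect the leaves below the stop point) can be sketched in Python. As a simplifying assumption, this sketch builds an uncompacted suffix trie with one node per character instead of a compact suffix tree, so it uses quadratic space and the traversal is not bounded by occ; the names `build_suffix_trie`, `search` and `LEAF` are illustrative only.

```python
LEAF = "leaf"  # multi-character key, so it cannot clash with a single-character edge


def build_suffix_trie(s):
    """Insert every suffix of s; each leaf records its 1-based start position."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1
    return root


def search(root, p):
    """Match p from the root, then report all leaves under the stop point."""
    node = root
    for ch in p:
        if ch not in node:
            return []  # p does not appear in the text
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == LEAF:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("acacaac#")
print(search(trie, "ac"))   # [1, 3, 6]
print(search(trie, "aca"))  # [1, 3]
```

The unique end marker # guarantees that no suffix is a prefix of another, so every suffix ends at its own leaf.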
Is Suffix Tree good? • Yes, because its search time is optimal • No, because of its space requirement… • The space can be much larger than the text • E.g., Text = DNA of Human • To store the text, we need 0.8 Gbyte • To store the suffix tree, we need 64 Gbyte!
Something Wrong?? • Both the suffix tree and the text have n things, so they both need O(n) space… • How come there is a big difference?? • Let us do a finer analysis • Let A be the alphabet (i.e., the set of distinct characters) of a text T • E.g., in DNA, A = {a,c,g,t}
Something Wrong?? (2) • To store T, we need only n log |A| bits • But to store the suffix tree, we need on the order of n log n bits, since each pointer into the text takes log n bits • When n is very large compared to |A|, there is a huge difference • Question: Is there an index that supports fast searching, but occupies O( n log |A| ) bits only??
Burrows-Wheeler Transform • Arrange the suffixes of the text in sorted order; the Burrows-Wheeler Transform is the array storing the character that precedes each suffix • Example (next slide)
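To make the gap concrete, here is a small Python calculation. It assumes a text of about 3 billion characters (roughly the length of the human genome, a figure not stated in the slides) and counts a single array of n pointers; a real suffix tree stores several such fields per node, so its footprint is a multiple of the second number.

```python
import math


def text_bits(n, alphabet_size):
    """Bits needed to store n characters: n * ceil(log2 |A|)."""
    return n * math.ceil(math.log2(alphabet_size))


def pointer_array_bits(n):
    """Bits needed for n pointers into a text of length n: n * ceil(log2 n)."""
    return n * math.ceil(math.log2(n))


n = 3_000_000_000  # assumed text length
print(text_bits(n, 4) / 8 / 1e9)        # 0.75 (Gbyte)
print(pointer_array_bits(n) / 8 / 1e9)  # 12.0 (Gbyte)
```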
Text = acacaac# • [Table: the suffixes in sorted order, each alongside its preceding character (the BWT)]
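The construction can be sketched in a few lines of Python. Two assumptions are made here that the slides do not state: the end marker # sorts before every letter (which is the case under Python's default string ordering), and the character "preceding" the whole text is taken cyclically to be #. The suffix sort is the naive one, for illustration only.

```python
def suffix_array(text):
    """0-based start positions of the suffixes of text, in sorted order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """Character preceding each sorted suffix; text[-1] wraps around for position 0."""
    return "".join(text[i - 1] for i in suffix_array(text))


T = "acacaac#"
for i in suffix_array(T):
    print(T[i - 1], T[i:])  # BWT character, then the suffix
print(bwt(T))  # ccac#aaa
```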
BWT is useful • The BWT tends to compress more easily than the original text • Also, given the position in the BWT array where the last character of the text appears, we can recover the original text • How?
Text = acacaac# • [Table: the Sorted BWT column and the BWT column, alongside the suffixes in sorted order]
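One way to answer the "How?" is the mapping between the BWT column and the sorted BWT column: the k-th occurrence of a character in the BWT is the same text character as its k-th occurrence in the sorted column. The Python sketch below follows that mapping starting from the row whose BWT character is the end marker. It assumes the marker # occurs exactly once and that the BWT was built with the same character ordering as Python's `sorted`; the name `inverse_bwt` is illustrative.

```python
from collections import Counter


def inverse_bwt(b, end="#"):
    """Recover the text from its BWT b, where `end` is the unique last character."""
    counts = Counter(b)
    first, total = {}, 0  # first[c] = first row of the sorted column holding c
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    seen, lf = Counter(), []
    for c in b:  # lf[r] = row in the sorted column holding the same character as b[r]
        lf.append(first[c] + seen[c])
        seen[c] += 1
    row, out = b.index(end), []
    for _ in range(len(b)):  # emits the text from its last character back to its first
        out.append(b[row])
        row = lf[row]
    return "".join(reversed(out))


print(inverse_bwt("ccac#aaa"))  # acacaac#
```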
BWT Index • Ferragina and Manzini (2000) observed that we can use the BWT to support pattern searching by storing some additional O(n)-bit arrays • Precisely, let B[1..n] be the BWT. With the additional arrays, for any x, we can count the number of occurrences of any character in B[1..x] in constant time • Then, we can count the number of times that a pattern appears in the text in O(m) time (How?)
Text = acacaac#, Pattern = aca • [Table: the Sorted BWT column and the BWT column, alongside the suffixes in sorted order]
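The counting step is usually done by backward search: process the pattern from its last character to its first, maintaining the range of sorted suffixes that start with the part matched so far. The Python sketch below illustrates this under stated assumptions: it takes the BWT string "ccac#aaa" of the example (with # sorting first), and it stores plain prefix-count tables, which give the constant-time counts described above but use far more than O(n) bits. The names `build_tables` and `count` are illustrative.

```python
def build_tables(b):
    """c_table[ch] = number of characters in b smaller than ch;
    occ[ch][x] = number of occurrences of ch in the first x characters of b."""
    alphabet = sorted(set(b))
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += b.count(ch)
    occ = {ch: [0] for ch in alphabet}
    for current in b:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (current == ch))
    return c_table, occ


def count(b, pattern):
    """Number of appearances of pattern in the text whose BWT is b."""
    c_table, occ = build_tables(b)
    lo, hi = 0, len(b)  # half-open range of rows in the sorted suffix order
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


print(count("ccac#aaa", "aca"))  # 2
```

Each pattern character costs two table lookups, so counting takes O(m) time once the tables exist.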
BWT Index • They also show that, by storing another O(n)-bit array, we can report each position where the pattern appears in O(log n) time • So, searching is done in O(m + occ log n) time, where occ is the number of appearances of the pattern • What is the space? O( n log |A| ) bits
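One common way to report positions, sketched below in Python, is to keep the text position only for a sample of the sorted suffixes and to walk backwards through the text from an unsampled row until a sampled one is reached. This is an illustrative sketch and not necessarily the exact construction in the paper: the sampling interval `step`, the dictionary of samples, and the plain count tables are assumptions made here for readability. With `step` close to log n, each reported position needs about log n backward steps.

```python
def build_index(text, step=4):
    """BWT, count tables, and text positions sampled at multiples of `step`."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    b = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    occ = {ch: [0] for ch in alphabet}
    for current in b:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (current == ch))
    samples = {row: pos for row, pos in enumerate(sa) if pos % step == 0}
    return b, c_table, occ, samples


def locate(index, pattern):
    """1-based positions where pattern appears in the indexed text."""
    b, c_table, occ, samples = index
    lo, hi = 0, len(b)
    for ch in reversed(pattern):  # backward search for the row range
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    positions = []
    for row in range(lo, hi):
        steps = 0
        while row not in samples:  # each step moves one position left in the text
            ch = b[row]
            row = c_table[ch] + occ[ch][row]
            steps += 1
        positions.append(samples[row] + steps + 1)
    return sorted(positions)


index = build_index("acacaac#")
print(locate(index, "ac"))  # [1, 3, 6]
```

Text position 0 is always sampled, so the backward walk stops before it could wrap around the end of the text.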
Related Work • Further compress the index • Space is now measured in terms of the entropy (or the randomness) of a text • Support text with large alphabet • Efficient Construction • Challenge is in minimizing working space • More complex queries and operations • Library problem, Dictionary problem
Pointers for Further Study • The Pizza & Chili website http://pizzachili.di.unipi.it • The FM-index paper by P. Ferragina and G. Manzini, FOCS 2000 • The CSA paper by R. Grossi and J.S. Vitter, STOC 2000 • Discuss with me ^_^ (email: wkhon@)