Full-Text Indexing via Burrows-Wheeler Transform

Wing-Kai Hon, Oct 18, 2006


Presentation Transcript


  1. Full-Text Indexing via Burrows-Wheeler Transform • Wing-Kai Hon • Oct 18, 2006

  2. Outline • The Text Searching Problem • What is Full-Text Indexing? • Burrows-Wheeler Transform (BWT) • BWT as a Full-Text Index • Related work

  3. Text Searching • Text: acacaaccagtcacactagac…… • Pattern: acac • Where does the pattern occur in the text?

  4. How fast can we search? • Let n be the length of the text and m be the length of the pattern • We can find all positions where the pattern appears in O(n + m) time • Examples: Knuth-Morris-Pratt, Boyer-Moore • Is O(n + m) time good? • Yes, because it is optimal when the text is not known in advance!
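As an illustration of the O(n + m) bound, here is a minimal Knuth-Morris-Pratt sketch. The function name `kmp_search` and the 0-based result positions are choices made for this example, not part of the slides.

```python
def kmp_search(text, pattern):
    """Return all 0-based start positions of pattern in text."""
    if not pattern:
        return []
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it; built in O(m) time.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once; k is the number of pattern chars matched so far.
    hits = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to allow overlapping matches
    return hits


print(kmp_search("acacaaccagtcacactagac", "acac"))
```

The text is never re-read from an earlier position; only the match length `k` falls back, which is what keeps the scan linear.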

  5. Text Searching (take 2) • Now suppose we know the text in advance and can preprocess it • Text: acacaaccagtcacactagac…… • Pattern: acac • Where does the pattern occur in the text?

  6. Can we do better? • Yes, we can build a data structure for the text so that a pattern search takes only O(m + occ) time, where occ = number of times the pattern appears in the text • Such a data structure is called an index • Is O(m + occ) time useful? • Yes, if the text is very long and it is searched many times for different patterns

  7. Full-Text Index • Full-Text Index • An index built over the whole text • Every position in the text can be the start of a pattern occurrence (hence "full") • Word-Level Index • Text is treated as a sequence of words • Positions inside a word do not count as the start of any pattern occurrence • E.g., Text: Was it a cat I saw? (the pattern "at" has no occurrence)

  8. Suffix Tree: An Optimal Full-Text Index • As mentioned, we can create an index for the text such that pattern searching can be done in O(m + occ) time • This time is optimal • One such index is the Suffix Tree • Introduced by P. Weiner in 1973 and, independently, by E. McCreight in 1976

  9. Suffix and Suffix Tree • Given a string S, a substring of S that ends at the last position is called a suffix of S • If S consists of n chars, S has exactly n suffixes • Theorem: If a pattern P appears at position j in S, P appears at the beginning of the suffix of S that starts at position j

  10. E.g., S: acacaac# • Suffixes of S: acacaac# (starts at pos 1), cacaac# (pos 2), acaac# (pos 3), caac# (pos 4), aac# (pos 5), ac# (pos 6), c# (pos 7), # (pos 8) • Suppose P = ac is a pattern. Then P appears at pos 1, pos 3 and pos 6 in S, and P is a prefix of the suffixes that start at those positions.
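The theorem and the example can be checked directly. The sketch below is only a brute-force restatement; the helper names `suffixes` and `occurrences` are hypothetical, and positions are 1-based to match the slides.

```python
def suffixes(s):
    """Return (1-based start position, suffix) for every suffix of s."""
    return [(j + 1, s[j:]) for j in range(len(s))]


def occurrences(s, p):
    """Positions where p occurs = positions whose suffix starts with p."""
    return [pos for pos, suf in suffixes(s) if suf.startswith(p)]


for pos, suf in suffixes("acacaac#"):
    print(pos, suf)
print(occurrences("acacaac#", "ac"))  # [1, 3, 6]
```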

  11. Suffix and Suffix Tree (2) • The suffix tree is an edge-labeled compact tree (no internal node has only one child) with n leaves such that • Each leaf corresponds to a suffix • Concatenating the edge labels along the path from the root to a leaf gives the corresponding suffix • The edge labels leading to the children of a node start with different characters • Example (next slide)

  12. [Figure: The Suffix Tree of acacaac#, with leaves numbered 1 to 8 by the starting position of their suffix]

  13. Searching with Suffix Tree • To search for P, we match the characters of P against edge labels, starting from the root • If P is matched completely, the leaves below the stop point are exactly the suffixes that correspond to an appearance of P in the text • Then, we traverse the subtree below the stop point to report where P appears • So, searching is done in O(m + occ) time
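The search procedure can be sketched with an uncompressed suffix trie, which is simpler to write than the compact suffix tree of the slides. This is an assumption-laden simplification: the trie uses O(n²) space and does not guarantee the O(m + occ) bound, but the "match from the root, then collect the leaves below" logic is the same. The names `build_suffix_trie` and `search` are hypothetical.

```python
def build_suffix_trie(s):
    """Nested dicts, one edge per character; the key "leaf" stores the
    1-based start position of the suffix that ends at that node."""
    root = {}
    for j in range(len(s)):
        node = root
        for ch in s[j:]:
            node = node.setdefault(ch, {})
        node["leaf"] = j + 1
    return root


def search(root, p):
    """Match p from the root, then report every leaf below the stop point."""
    node = root
    for ch in p:
        if ch not in node:
            return []  # p does not occur in the text
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("acacaac#")
print(search(trie, "ac"))  # [1, 3, 6]
```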

  14. Is Suffix Tree good? • Yes, because its search time is optimal • No, because of its space requirement… • The space can be much larger than the text • E.g., Text = DNA of Human • To store the text, we need 0.8 Gbyte • To store the suffix tree, we need 64 Gbyte!

  15. Something Wrong?? • Both the suffix tree and the text have n things, so they both need O(n) space… • How come there is a big difference?? • Let us have a better analysis • Let A be the alphabet (i.e., the set of distinct characters) of a text T • E.g., in DNA, A = {a,c,g,t}

  16. Something Wrong?? (2) • To store T, we need only n log |A| bits • But to store the suffix tree, we will need n log n bits • When n is very large compared to |A|, there is a huge difference • Question: Is there an index that supports fast searching, but occupies O( n log |A| ) bits only??
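A rough calculation makes the gap concrete. The numbers below are assumptions derived from the slides' figures: 0.8 Gbyte at 2 bits per DNA character gives n = 3.2 × 10⁹, and 1 Gbyte is taken as 10⁹ bytes. The function names are hypothetical.

```python
def text_bits(n, alphabet_size):
    """Bits to store n characters: n * ceil(log2 |A|)."""
    return n * (alphabet_size - 1).bit_length()


def pointer_array_bits(n):
    """Bits for one array of n text positions: n * ceil(log2 n)."""
    return n * (n - 1).bit_length()


n = 3_200_000_000  # assumed text length, see lead-in
print(text_bits(n, 4) / 8 / 1e9)        # 0.8  (Gbyte for the text)
print(pointer_array_bits(n) / 8 / 1e9)  # 12.8 (Gbyte for one pointer array)
```

A single array of n positions is already 16 times larger than the text under these assumptions, and a suffix tree keeps several such words per suffix.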

  17. Burrows-Wheeler Transform • Arrange the suffixes of the text in sorted order; the Burrows-Wheeler Transform is the array that stores, for each suffix in that order, the character that precedes it in the text • Example (next slide)

  18. [Table: Text = acacaac#; columns: BWT, Suffix in sorted order]
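The transform for the example text can be computed with a naive sort of the suffixes. Two assumptions are made here that the transcript does not state: the end marker # sorts before every letter (as it does in ASCII), and the character "preceding" the first suffix is taken cyclically to be the last character. The function name `bwt` is hypothetical.

```python
def bwt(text):
    """Return (BWT string, suffix array of 0-based start positions)."""
    # Naive construction: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] is the preceding character; for i == 0 it wraps around
    # to the last character (the end marker).
    return "".join(text[i - 1] for i in sa), sa


b, sa = bwt("acacaac#")
for ch, i in zip(b, sa):
    print(ch, "acacaac#"[i:])
print(b)  # ccac#aaa
```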

  19. BWT is useful • The BWT has been shown to be easier to compress than the original text • Also, given the position in the BWT array where the last character of the text appears, we can get back the original text • How?

  20. [Table: Text = acacaac#; columns: Sorted BWT, BWT, Suffix in sorted order]
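One way to answer "How?" is to pair the BWT with its sorted copy: the k-th occurrence of a character in the BWT corresponds to the k-th occurrence of that character in the sorted BWT. The sketch below follows this mapping backwards from the end marker. It assumes a unique end marker # and the same character ordering that was used to build the BWT; the names `inverse_bwt` and `lf` are hypothetical.

```python
def inverse_bwt(b, end_marker="#"):
    """Rebuild the text from its BWT b, which contains one end_marker."""
    # first[ch] = row of the first occurrence of ch in the sorted BWT.
    first, total = {}, 0
    for ch in sorted(set(b)):
        first[ch] = total
        total += b.count(ch)
    # lf[i] = row in the sorted BWT matching the character b[i].
    seen, lf = {}, []
    for ch in b:
        lf.append(first[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Start where the last character of the text sits in the BWT and
    # walk backwards through the text, one character per step.
    row = b.index(end_marker)
    out = []
    for _ in range(len(b)):
        out.append(b[row])
        row = lf[row]
    return "".join(reversed(out))


print(inverse_bwt("ccac#aaa"))  # acacaac#
```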

  21. BWT → Index • Ferragina and Manzini (2000) observed that we can use the BWT to support pattern searching by storing some additional O(n)-bit arrays • Precisely, let B[1..n] be the BWT. With the additional arrays, for any x and any char c, we can count the number of occurrences of c in B[1..x] in constant time • Then, we can count the number of times that a pattern appears in the text in O(m) time (How?)

  22. [Table: Text = acacaac#, Pattern = aca; columns: Sorted BWT, BWT, Suffix in sorted order]
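The counting step can be sketched as a backward search: process the pattern from its last character to its first, maintaining the range of sorted suffixes that start with the part matched so far. This sketch stores plain prefix-count tables, which take far more than O(n) bits; it only illustrates the logic, not the compact arrays of the FM-index paper. The names `count_occurrences`, `first` and `rank` are hypothetical, and the BWT string assumes # sorts before the letters.

```python
def count_occurrences(b, pattern):
    """Count occurrences of pattern in the text whose BWT is b."""
    alphabet = sorted(set(b))
    # first[ch] = number of characters in b that are smaller than ch.
    first, total = {}, 0
    for ch in alphabet:
        first[ch] = total
        total += b.count(ch)
    # rank[ch][x] = number of occurrences of ch in b[:x].
    rank = {ch: [0] * (len(b) + 1) for ch in alphabet}
    for x, seen in enumerate(b):
        for ch in alphabet:
            rank[ch][x + 1] = rank[ch][x] + (1 if ch == seen else 0)
    # [lo, hi) is the range of sorted suffixes prefixed by the current
    # pattern suffix; each step costs two table lookups.
    lo, hi = 0, len(b)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + rank[ch][lo]
        hi = first[ch] + rank[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


print(count_occurrences("ccac#aaa", "aca"))  # 2
```

For the pattern aca the range shrinks from all 8 rows to the 4 suffixes starting with a, then to the 2 starting with ca, then to the 2 starting with aca.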

  23. BWT → Index • They also show that, by storing another O(n)-bit array, we can report each position where the pattern appears in O(log n) time • So, searching is done in O(m + occ log n) time • What is the space? O(n log |A|) bits
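One common way to realise the reporting step, shown here as an assumption rather than as the exact construction of the paper, is to keep the text position only for sampled rows and to walk backwards through the text from an unsampled row until a sampled one is reached. With one sample every `step` text positions, each report takes fewer than `step` moves. The names `build_index`, `locate`, `lf` and `samples` are hypothetical; positions are 0-based.

```python
def build_index(text, step):
    """Return (lf, samples) for text, sampling every step-th position."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive sort
    b = "".join(text[i - 1] for i in sa)                   # the BWT
    first, total = {}, 0
    for ch in sorted(set(b)):
        first[ch] = total
        total += b.count(ch)
    # lf[row] = row of the suffix that starts one position earlier.
    seen, lf = {}, []
    for ch in b:
        lf.append(first[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Keep the text position only for rows whose position is a multiple
    # of step; position 0 is therefore always sampled.
    samples = {row: pos for row, pos in enumerate(sa) if pos % step == 0}
    return lf, samples


def locate(row, lf, samples):
    """Text position of the suffix in the given sorted row."""
    moves = 0
    while row not in samples:
        row = lf[row]  # move one character to the left in the text
        moves += 1
    return samples[row] + moves


lf, samples = build_index("acacaac#", step=3)
# Rows 3 and 4 are the sorted-suffix range of the pattern aca.
print(sorted(locate(r, lf, samples) + 1 for r in (3, 4)))  # [1, 3]
```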

  24. Related Work • Further compressing the index • Space is now measured in terms of the entropy (or the randomness) of a text • Supporting text with a large alphabet • Efficient construction • The challenge is in minimizing working space • More complex queries and operations • Library problem, Dictionary problem

  25. Pointers for Further Study • The Pizza & Chili website http://pizzachili.di.unipi.it • The FM-index paper by P. Ferragina and G. Manzini, FOCS 2000 • The CSA paper by R. Grossi and J.S. Vitter, STOC 2000 • Discuss with me ^_^ (email: wkhon@)
