
Information Retrieval (Recuperació de la informació)

Explore exact and approximate string matching techniques, suffix trees, and generalized suffix trees. Understand algorithms for pattern matching and sequence assembly using modern information retrieval methods.


Presentation Transcript


  1. Information Retrieval (Recuperació de la informació)
     • Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
     • Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot
     • Algorithms on Strings (2001), M. Crochemore, C. Hancart and T. Lecroq
     • http://www-igm.univ-mlv.fr/~lecroq/string/index.html

  2. String Matching
     String matching: the definition of the problem (text, pattern) depends on what is known in advance, the text or the patterns.
     • Exact matching:
       • The patterns ---> data structures for the patterns
         • 1 pattern ---> the algorithm depends on |p| and |Σ|
         • k patterns ---> the algorithm depends on k, |p| and |Σ|
         • Extensions: regular expressions
       • The text ---> data structure for the text (suffix tree, ...)
     • Approximate matching:
       • Dynamic programming
       • Sequence alignment (pairwise and multiple)
       • Sequence assembly: hash algorithm
       • Probabilistic search: Hidden Markov Models

  3. Index
     • Part 1: Suffix trees. Algorithms on Strings, Trees and Sequences, Dan Gusfield, Cambridge University Press
     • Part 2: Suffix arrays. Suffix arrays: a new method for on-line string searches, U. Manber and G. Myers

  4. Suffix trees
     Given the string ababaas, its suffixes are:
     1: ababaas
     2: babaas
     3: abaas
     4: baas
     5: aas
     6: as
     7: s
     [Figure: suffix tree of ababaas; tree labels: a ba baas,1 ba as,3 baas,2 as,4 s,7 s,6 as,5]
     What kind of queries?
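The suffix list on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `suffixes` is an assumption, and positions are 1-based to match the labels used in the slides.

```python
def suffixes(text):
    """Return (start, suffix) pairs, with 1-based start positions."""
    return [(i + 1, text[i:]) for i in range(len(text))]


if __name__ == "__main__":
    for start, suffix in suffixes("ababaas"):
        print(f"{start}: {suffix}")
```

Each of these suffixes ends at one leaf of the suffix tree, and the number printed is the leaf label.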

  5. Applications of Suffix trees
     [Figure: suffix tree of ababaas; tree labels: a ba baas,1 ba as,3 baas,2 as,4 s,7 s,6 as,5]
     1. Exact string matching
     • Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
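The query on this slide can be tried out with a simplified structure. The sketch below uses an uncompressed suffix trie (one node per character) rather than the compressed suffix tree of the figure, so it needs quadratic space, but the search walk from the root is the same idea. The names `Node`, `build_suffix_trie` and `occurrences` are assumptions made for illustration.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> child Node
        self.starts = []    # 1-based starts of the suffixes passing through this node


def build_suffix_trie(text):
    """Insert every suffix of text, character by character."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start + 1)
    return root


def occurrences(root, pattern):
    """Spell pattern from the root; return the start positions of its occurrences."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.starts


if __name__ == "__main__":
    trie = build_suffix_trie("ababaas")
    for pattern in ("abab", "aab", "ab"):
        print(pattern, occurrences(trie, pattern))
```

A pattern occurs in the text exactly when it can be spelled along a path from the root, and the suffixes below that point give the positions of the occurrences.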

  6. Quadratic insertion algorithm
     Invariant properties: given the string … and the suffix tree,
     P1: the leaves of the suffixes from … have been inserted.

  7. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: ababaabbs,1]

  8. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: babaabbs,2 ababaabbs,1]

  9. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: aba baabbs,1 ababaabbs,1 babaabbs,2]

  10. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,3 aba baabbs,1 babaabbs,2]

  11. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,3 aba baabbs,1 ba baabbs,2 babaabbs,2]

  12. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,3 aba baabbs,1 ba abbs,4 baabbs,2]

  13. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,3 a aba baabbs,1 abbs,3 ba baabbs,1 abbs,4 abbs,4 ba baabbs,2]

  14. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,5 a abbs,3 ba baabbs,1 abbs,4 abbs,4 ba baabbs,2]

  15. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,5 a abbs,3 ba baabbs,1 abbs,4 abbs,4 ba baabbs,2]

  16. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,5 a abbs,3 b a baabbs,1 abbs,4 abbs,4 ba ba baabbs,2]

  17. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,5 a bs,6 abbs,3 b a baabbs,1 abbs,4 abbs,4 ba baabbs,2]

  18. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,5 a bs,6 abbs,3 b a baabbs,1 abbs,4 abbs,4 ba baabbs,2]

  19. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,5 a bs,6 bs,7 abbs,3 b a baabbs,1 b abbs,4 a baabbs,2]

  20. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,5 a bs,6 bs,7 abbs,3 b a baabbs,1 b abbs,4 a s,7 baabbs,2]

  21. Quadratic insertion algorithm. Given the string ababaabbs. [Figure, tree labels: abbs,5 a bs,6 bs,7 abbs,3 b a baabbs,1 b abbs,4 a s,7 s,7 baabbs,2]
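The step-by-step insertion shown in these slides can be sketched as follows. This is a minimal illustration, not the lecturer's code: suffixes are inserted from longest to shortest, each one is walked down from the root, and an edge is split where the first mismatch occurs. It assumes the last character of the text (here s) occurs nowhere else, so no suffix is a prefix of another. The names `Node`, `insert_suffix`, `build_suffix_tree` and `leaves` are hypothetical.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character of edge label -> (label, child Node)
        self.leaf = leaf    # 1-based suffix start for a leaf, None for an internal node


def insert_suffix(root, text, start):
    """Insert text[start:] by walking from the root, splitting an edge if needed."""
    node, rest = root, text[start:]
    while True:
        first = rest[0]
        if first not in node.children:
            # No edge starts with this character: hang a new leaf here.
            node.children[first] = (rest, Node(leaf=start + 1))
            return
        label, child = node.children[first]
        k = 0  # length of the common prefix of the edge label and the remaining suffix
        while k < len(label) and k < len(rest) and label[k] == rest[k]:
            k += 1
        if k == len(label):
            # Whole edge matched: continue below the child node.
            node, rest = child, rest[k:]
            continue
        # Mismatch inside the edge: split it and add the new leaf.
        mid = Node()
        mid.children[label[k]] = (label[k:], child)
        mid.children[rest[k]] = (rest[k:], Node(leaf=start + 1))
        node.children[first] = (label[:k], mid)
        return


def build_suffix_tree(text):
    """Quadratic construction: one root-to-leaf walk per suffix."""
    root = Node()
    for start in range(len(text)):
        insert_suffix(root, text, start)
    return root


def leaves(node, prefix=""):
    """Yield (suffix spelled from the root, leaf label) for every leaf."""
    if node.leaf is not None:
        yield prefix, node.leaf
    for label, child in node.children.values():
        yield from leaves(child, prefix + label)


if __name__ == "__main__":
    tree = build_suffix_tree("ababaabbs")
    for suffix, position in sorted(leaves(tree), key=lambda item: item[1]):
        print(position, suffix)
```

Each insertion may scan a number of characters proportional to the length of the suffix, which is where the quadratic bound in the slide title comes from.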

  22. Generalized suffix tree. The suffix tree of many strings is called the generalized suffix tree, and it is the suffix tree of the concatenation of the strings, each followed by a distinct terminator (here α and β). For instance, the generalized suffix tree of ababaabb and aabaat is the suffix tree of ababaabbαaabaatβ.

  23. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ, given the suffix tree of ababaabbα. [Figure, tree labels: abbα,5 a bα,6 bα,7 abbα,3 b a baabbα,1 b abbα,4 a α,7 α,7 baabbα,2]

  24. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: abbα,5 a bα,6 bα,7 abbα,3 b a baabbα,1 b abbα,4 a α,7 α,7 baabbα,2]

  25. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: aaβ,1 bα,7 bα,6 abbα,3 b a baabbα,1 b abbα,4 a α,7 α,7 baabbα,2 ab a bα,5]

  26. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: aaβ,1 bα,7 bα,6 abbα,3 b a baabbα,1 b abbα,4 a α,7 α,7 baabbα,2 ab a bα,5]

  27. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: β,2 bα,7 bα,6 b abbα,4 a α,7 α,7 baabbα,2 aaβ,1 ab a bα,5 b a bbα,3 a baabbα,1]

  28. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: β,2 bα,7 bα,6 b abbα,4 a α,7 α,7 baabbα,2 aaβ,1 ab a bα,5 b a bbα,3 a baabbα,1]

  29. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: bα,7 bα,6 β,3 α,7 α,7 aaβ,1 ab a bα,5 β,2 b a bbα,3 a b baabbα,1 a a bbα,4 baabbα,2]

  30. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: bα,7 bα,6 β,3 α,7 α,7 aaβ,1 ab a bα,5 β,2 b a bbα,3 a b baabbα,1 a a bbα,4 baabbα,2]

  31. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: β,4 bα,7 bα,6 α,7 α,7 aaβ,1 a b a bα,5 β,2 b a bbα,3 a b baabbα,1 β,3 a a bbα,4 baabbα,2]

  32. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: β,4 bα,7 bα,6 α,7 α,7 aaβ,1 a b a bα,5 β,2 b a bbα,3 a b baabbα,1 β,3 a a bbα,4 baabbα,2]

  33. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: bα,6 bα,7 α,7 α,7 β,4 β,4 aaβ,1 a b a bα,5 β,2 b a bbα,3 a b baabbα,1 β,3 a a bbα,4 baabbα,2]

  34. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: bα,6 bα,7 α,7 α,7 β,4 β,4 aaβ,1 a b a bα,5 β,2 b a bbα,3 a b baabbα,1 β,3 a a bbα,4 baabbα,2]

  35. Generalized suffix tree. Construction of the suffix tree of ababaabbαaabaaβ. [Figure, tree labels: bα,6 bα,7 α,7 α,7 β,4 β,4 β,4 aaβ,1 a b a bα,5 β,2 b a bbα,3 a b baabbα,1 β,3 a a bbα,4 baabbα,2]

  36. Generalized suffix tree. Generalized suffix tree of ababaabbαaabaaβ. [Figure, tree labels: β,4 β,4 β,4 aaβ,1 a b a bα,5 β,2 b bα,6 bα,7 a bbα,3 a b baabbα,1 β,3 a a bbα,4 baabbα,2 α,7 α,7]

  37. Applications of Generalized Suffix trees
     [Figure: generalized suffix tree of ababaabbαaabaaβ; tree labels: β,4 β,4 β,4 aaβ,1 a b a bα,5 β,2 b bα,7 bα,6 a bbα,3 a b baabbα,1 β,3 a a bbα,4 baabbα,2 α,7 α,7]
     1. The substring problem for a database of strings DB
     • Does the DB contain any occurrence of the patterns abab, aab, and ab?
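The substring problem for a database can be illustrated with a simplified structure. The sketch below is an assumption-laden stand-in for the generalized suffix tree: it uses an uncompressed trie, and instead of concatenating the strings with terminators it inserts the suffixes of each string separately and tags every node with (string index, start position). The names `Node`, `build_generalized_trie` and `find` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> child Node
        self.hits = []      # (string index, 1-based start) of suffixes passing through


def build_generalized_trie(strings):
    """Insert every suffix of every string into one shared trie."""
    root = Node()
    for index, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.hits.append((index, start + 1))
    return root


def find(root, pattern):
    """Return (string index, start) for every occurrence of pattern in the database."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.hits


if __name__ == "__main__":
    db = ["ababaabb", "aabaa"]
    trie = build_generalized_trie(db)
    for pattern in ("abab", "aab", "ab"):
        print(pattern, find(trie, pattern))
```

Tagging nodes with the owning string plays the role that the distinct terminators α and β play in the slides: a match can never run across the boundary between two strings.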

  38. Applications of Generalized Suffix trees
     [Figure: generalized suffix tree of ababaabbαaabaaβ; tree labels: β,4 β,4 β,4 aaβ,1 a b a bα,5 β,2 b bα,7 bα,6 a bbα,3 a b baabbα,1 β,3 a a bbα,4 baabbα,2 α,7 α,7]
     2. The longest common substring of two strings
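One way to read this application: the longest common substring is spelled by the deepest point of the generalized tree that has suffixes of both strings below it. The sketch below illustrates that idea on an uncompressed trie (an assumption made to keep the code short; a real suffix tree would give the same answer in less space). The names `Node` and `longest_common_substring` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.owners = set()  # which input strings (0 or 1) have a suffix through here


def longest_common_substring(s1, s2):
    """Deepest trie path that is reached by suffixes of both strings."""
    root = Node()
    for owner, s in enumerate((s1, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.owners.add(owner)

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node.children.items():
            if child.owners == {0, 1}:  # only descend while both strings agree
                stack.append((child, path + ch))
    return best


if __name__ == "__main__":
    print(longest_common_substring("ababaabb", "aabaa"))  # prints abaa
```

If several common substrings share the maximum length, this sketch returns one of them without promising which.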

  39. Definition of MUM (Maximal Unique Matching)
     [Figure: two sequences, "… a a t g….c t g..." and "… c g t g….c c c ...", with the label MUM]

  40. Applications of Generalized Suffix trees
     [Figure: generalized suffix tree of ababaabbαaabaaβ; tree labels: β,4 β,4 β,4 aaβ,1 a b a bα,5 β,2 b bα,7 bα,6 a bbα,3 a b baabbα,1 β,3 a a bbα,4 baabbα,2 α,7 α,7]
     3. Finding MUMs.
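To make the notion of a MUM concrete, here is a brute-force check that does not use a suffix tree at all. It assumes the usual reading of the term: a substring that occurs exactly once in each of the two sequences and cannot be extended to the left or to the right while still matching. The function names and the `min_len` parameter are assumptions for illustration, and the running time is far worse than the suffix-tree approach the slide refers to.

```python
def count_occurrences(text, sub):
    """Number of (possibly overlapping) occurrences of sub in text."""
    return sum(text.startswith(sub, i) for i in range(len(text) - len(sub) + 1))


def find_mums(s1, s2, min_len=1):
    """Return (start in s1, start in s2, match) triples, with 1-based starts."""
    mums = []
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] != s2[j]:
                continue
            # Skip pairs that could be extended to the left.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            k = 0
            while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
                k += 1
            sub = s1[i:i + k]
            unique = count_occurrences(s1, sub) == 1 and count_occurrences(s2, sub) == 1
            if k >= min_len and unique:
                mums.append((i + 1, j + 1, sub))
    return mums


if __name__ == "__main__":
    for start1, start2, match in find_mums("ababaabb", "aabaa"):
        print(start1, start2, match)
```

In a generalized suffix tree, the same candidates correspond to internal nodes with exactly two leaves below them, one from each string, followed by a check that the match cannot be extended to the left.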

  41. Quadratic insertion algorithm
     Invariant properties: given the string … and the suffix tree,
     P1: the leaves of the suffixes from … have been inserted.

  42. Linear insertion algorithm
     Invariant properties: given the string … and the suffix tree,
     P1: the leaves of the suffixes from … have been inserted.
     P2: the string … is the longest string that can be spelt through the tree.

  43. Linear insertion algorithm: example. Given the string ababaababb... [Figure, tree labels: ababab...,5 a ababab...,3 ba baababab...,1 ba ababab...,4 baababab...,2]

  44. Linear insertion algorithm: example. Given the string ababaababb... (positions 6 7 8 marked) [Figure, tree labels: ababab...,5 a ababab...,3 ba baababab...,1 ba ababab...,4 baababab...,2]

  45. Linear insertion algorithm: example. Given the string ababaababb... (positions 6 7 8 marked) [Figure, tree labels: ababb...,5 a ababb...,3 ba baababb...,1 ba ababb...,4 baababb...,2]

  46. Linear insertion algorithm: example. Given the string ababaababb... (positions 6 7 8 9 marked) [Figure, tree labels: ababb...,5 a ababb...,3 ba baababb...,1 ba ababb...,4 baababb...,2]

  47. Linear insertion algorithm: example. Given the string ababaababb... (positions 6 7 8 9 marked) [Figure, tree labels: ababb...,5 a ababb...,3 ba ababb...,1 b baababb...,1 baababb...,1 ababb...,4 ba b...,6 baababb...,2]

  48. Linear insertion algorithm: example. Given the string ababaababb... (positions 7 8 9 marked) [Figure, tree labels: ababb...,5 a ababb...,3 ba ababb...,1 b ababb...,4 ba b...,6 baababb...,2]

  49. Linear insertion algorithm: example. Given the string ababaababb... (positions 7 8 9 marked) [Figure, tree labels: a ababb...,5 a ababb...,3 ba ababb...,1 b ababb...,4 ba b...,6 baababb...,2]

  50. Linear insertion algorithm: example. Given the string ababaababb... (positions 7 8 9 marked) [Figure, tree labels: a ababb...,1 b b...,6 ababb...,5 ababb...,3 ba ababb...,4 ba baababb...,2]
