
Master Course

Master Course, MSc Bioinformatics for Health Sciences. H15: Algorithms on strings and sequences. Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen), Dep. de Llenguatges i Sistemes Informàtics, CEPBA-IBM Research Institute, Universitat Politècnica de Catalunya.


Presentation Transcript


  1. Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Dep. de Llenguatges i Sistemes Informàtics CEPBA-IBM Research Institute Universitat Politècnica de Catalunya

  2. Contents
     1. (Exact) String matching of one pattern
     2. (Exact) String matching of many patterns
     3. Extended string matching and regular expressions
     4. Approximate string matching (Dynamic programming)
     5. Pairwise and multiple alignment
     6. Suffix trees

  3. Master Course Second lecture: First part: Extended string matching

  4. Extended string matching. There are classes of characters represented by one symbol. For instance, the IUPAC code for the DNA alphabet is: R = {G,A}, Y = {T,C}, K = {G,T}, M = {A,C}, S = {G,C}, W = {A,T}, B = {G,T,C}, D = {G,A,T}, H = {A,C,T}, V = {G,C,A}, N = {A,G,C,T} (any). Two cases arise: 1. Classes of characters in the text: there are characters in the text that represent sets of symbols. 2. Classes of characters in the pattern: there are characters in the pattern that represent sets of symbols.
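The IUPAC classes above can be written down as a lookup table. The following is a minimal sketch in Python; the names `IUPAC`, `class_matches` and `sequence_matches` are illustrative and do not come from the slides.

```python
# IUPAC nucleotide codes as listed on the slide.
# Each symbol maps to the set of plain bases it stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}
IUPAC = {symbol: set(bases) for symbol, bases in IUPAC.items()}


def class_matches(class_symbol, base):
    """True if the plain base belongs to the class denoted by class_symbol."""
    return base in IUPAC[class_symbol]


def sequence_matches(with_classes, plain):
    """Position-by-position comparison of a string that may contain class
    symbols against a plain ACGT string of the same length."""
    return len(with_classes) == len(plain) and all(
        class_matches(c, b) for c, b in zip(with_classes, plain)
    )


print(class_matches("R", "G"))             # True: R = {G, A}
print(sequence_matches("ARTGN", "AGTGC"))  # True: R covers G, N covers C
```

The same table serves both cases named on the slide: it does not matter whether the class symbols sit in the text or in the pattern, only which side is looked up in `IUPAC`.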

  5. Classes in the text. [Figure: the most efficient algorithms (Navarro & Raffinot). Regions for Horspool, BOM and BNDM, plotted by alphabet size |Σ| (2 to 64) against pattern length (2 to 256, with the word size w marked).]

  6. Classes in the text: Horspool example. Given the pattern ATGTA, the shift table is: A 4, C 5, G 2, T 1, R ?, …, N ?

  7. Classes in the text: Horspool example. Suppose the pattern is ATGTA. The shift table would be: A 4, C 5, G 2, T 1, R 2, …, N ? (R = {G,A} takes the smaller of the shifts for G and A, so that no occurrence is skipped.)
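The shift table with class symbols can be computed mechanically. The sketch below assumes the pattern contains only plain bases and gives each class symbol the minimum shift of its members, which reproduces the values shown on the slides (R 2, N 1). The function name `horspool_shifts` is illustrative.

```python
# Class symbols and the bases they stand for (IUPAC codes from the slides).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def horspool_shifts(pattern):
    """Horspool shift table extended to class symbols in the text.
    Assumes the pattern itself contains only A, C, G, T."""
    m = len(pattern)
    # Standard Horspool: distance from the last occurrence of each base
    # (excluding the final pattern position) to the end of the pattern.
    plain = {base: m for base in "ACGT"}
    for i, base in enumerate(pattern[:-1]):
        plain[base] = m - 1 - i
    # A class symbol may stand for any of its members, so the only safe
    # shift is the smallest one among them.
    return {symbol: min(plain[b] for b in bases) for symbol, bases in IUPAC.items()}


shifts = horspool_shifts("ATGTA")
print(shifts["A"], shifts["C"], shifts["G"], shifts["T"])  # 4 5 2 1
print(shifts["R"], shifts["N"])                            # 2 1
```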

  8. Classes in the text: Horspool example. Given the pattern ATGTA and the shift table A 4, C 5, G 2, T 1, R 2, …, N 1, and given the text G T A R T R N A A G G A …, the slide shows the window ATGTA at three successive alignments under the text.

  9. Classes in the text: Horspool example. Given the pattern ATGTA and the shift table A 4, C 5, G 2, T 1, R 2, …, N 1, and given the text G T A R T R N A A G G A ..., the slide shows the window ATGTA at four successive alignments under the text, and so on.
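The whole search from this example can be sketched as follows. This is a minimal illustration rather than the lecturer's implementation: it assumes a plain ACGT pattern, a text that may contain IUPAC class symbols, and a window check done by simple membership tests. The name `horspool_search` is hypothetical.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def horspool_search(text, pattern):
    """Return the start positions where the pattern is compatible with the
    text, i.e. every pattern base belongs to the class of the text symbol."""
    m = len(pattern)
    # Plain Horspool shifts, then the minimum over each class.
    plain = {base: m for base in "ACGT"}
    for i, base in enumerate(pattern[:-1]):
        plain[base] = m - 1 - i
    shift = {s: min(plain[b] for b in bases) for s, bases in IUPAC.items()}

    hits, pos = [], 0
    while pos + m <= len(text):
        window = text[pos:pos + m]
        if all(p in IUPAC[t] for p, t in zip(pattern, window)):
            hits.append(pos)
        # Shift according to the text symbol under the last window position.
        pos += shift[window[-1]]
    return hits


# Text and pattern from the slides; the windows visited start at 0, 1, 3, 7.
print(horspool_search("GTARTRNAAGGA", "ATGTA"))  # [3]
```

With this reading of "match", the window RTRNA at position 3 is compatible with ATGTA because R covers A and G, and N covers T. Note how the class symbols reduce the shifts (R gives 2, N gives 1), which is why classes in the text slow Horspool down.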

  10. Classes in the text. [Figure: the most efficient algorithms (Navarro & Raffinot). Regions for Horspool, BOM and BNDM, plotted by alphabet size |Σ| (2 to 64) against pattern length (2 to 256, with the word size w marked). BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching.]

  11. Exact search algorithms for one pattern (on-line text). [Figure: the most efficient algorithms (Navarro & Raffinot). Regions for Horspool, BOM and BNDM, plotted by alphabet size |Σ| (2 to 64) against pattern length (2 to 256, with the word size w marked). BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching.]

  12. Classes in the text: BOM. How does it perform the comparison? Text: … Pattern: … Automaton: factor oracle. How is the next window position determined? It checks whether the suffix is a factor of the pattern. But first, let us look at how the comparison is done…

  13. Classes in the text: BOM example. Suppose the pattern is ATGTATG. How does it perform the comparison? The automaton of the reversed pattern is built: [Figure: factor oracle; surviving labels G T A G T T A G T A]. And the search over the text: G T A R T R N A A T G… No improvement is possible!

  14. Exact search algorithms for many patterns. [Figure: four panels, for sets of 5, 10, 100 and 1000 words, showing the regions where Wu-Manber, SBOM and Ad AC are fastest, plotted by alphabet size |Σ| (2 to 8) against minimum pattern length (5 to 45).]

  15. Classes in the text: Set Horspool. Search for the patterns ATGTATG, TATG, ATAAT, ATGTG in the text ARTGNCTATGTGACA… [Figure: trie built from the pattern set; surviving labels G T A T A T G G T A T A A T A A]. No improvement is possible!

  16. Classes in the text. [Figure: four panels, for sets of 5, 10, 100 and 1000 words, showing the regions where Wu-Manber, SBOM and Ad AC are fastest, plotted by alphabet size |Σ| (2 to 8) against minimum pattern length (5 to 45).]

  17. Classes in the pattern. [Figure: the most efficient algorithms (Navarro & Raffinot). Regions for Horspool, BOM and BNDM, plotted by alphabet size |Σ| (2 to 64) against pattern length (2 to 256, with the word size w marked).]

  21. Master Course Second lecture: Second part: Regular expressions matching

  22. Regular expressions. A regular expression ℛ is a string over Σ ∪ { ε, |, ·, *, (, ) } defined recursively as follows: ε is a regular expression; a character of Σ is a regular expression; (ℛ) is a regular expression; ℛ1 · ℛ2 is a regular expression; ℛ1 | ℛ2 is a regular expression; ℛ* is a regular expression.

  23. Regular language. The language represented by a regular expression is the set of words that can be built from the regular expression. The problem of searching for a regular expression in a text is that of finding all the factors of the text that belong to the corresponding regular language.
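The search problem as stated (report every factor of the text that belongs to the language) can be illustrated with a brute-force sketch using Python's standard `re` module. Note that `re` writes concatenation by juxtaposition instead of `·`; the function name `regex_factors` and the example expression are assumptions made for illustration, and the quadratic enumeration of factors is for clarity only.

```python
import re


def regex_factors(text, expression):
    """Return (start, end, factor) for every factor text[start:end]
    that belongs to the language of the regular expression."""
    compiled = re.compile(expression)
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # fullmatch requires the whole factor to be a word of the language.
            if compiled.fullmatch(text[start:end]):
                found.append((start, end, text[start:end]))
    return found


# A · (C | G)* · T in the notation of the slides.
print(regex_factors("GACGTT", "A(C|G)*T"))  # [(1, 5, 'ACGT')]
```

Practical matchers avoid enumerating all factors by simulating an automaton built from the expression; this sketch only makes the problem definition concrete.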

  24. Master Course Second lecture: Third part: Approximate string matching

  25. Approximate string matching For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC search for the pattern ACTGA allowing one error… … but what is the meaning of “one error”?

  26. Edit distance. We accept three types of errors: 1. Mismatch: ACCGTGAT → ACCGAGAT. 2. Insertion: ACCGTGAT → ACCGATGAT. 3. Deletion: ACCGTGAT → ACCGGAT. Insertions and deletions are jointly called indels. The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one. Exercise: d(ACT,ACT)= ? d(ACT,AC)= ? d(ACT,C)= ? d(ACT,ε)= ? d(AC,ATC)= ? d(ACTTG,ATCTG)= ?

  27. Edit distance. We accept three types of errors: 1. Mismatch: ACCGTGAT → ACCGAGAT. 2. Insertion: ACCGTGAT → ACCGATGAT. 3. Deletion: ACCGTGAT → ACCGGAT. Insertions and deletions are jointly called indels. The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one. d(ACT,ACT)=0, d(ACT,AC)=1, d(ACT,C)=2, d(ACT,ε)= ? d(AC,ATC)= ? d(ACTTG,ATCTG)= ?

  28. Edit distance. We accept three types of errors: 1. Mismatch: ACCGTGAT → ACCGAGAT. 2. Insertion: ACCGTGAT → ACCGATGAT. 3. Deletion: ACCGTGAT → ACCGGAT. Insertions and deletions are jointly called indels. The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one. d(ACT,ACT)=0, d(ACT,AC)=1, d(ACT,C)=2, d(ACT,ε)=3, d(AC,ATC)=1, d(ACTTG,ATCTG)=2.
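The distances listed above can be checked with a short dynamic programming routine. This is a minimal sketch with unit cost for each mismatch, insertion and deletion; the function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to transform string a into string b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j - 1] + (ca != cb),  # match or mismatch
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
            ))
        prev = cur
    return prev[-1]


print(edit_distance("ACT", "ACT"))      # 0
print(edit_distance("ACT", ""))         # 3
print(edit_distance("AC", "ATC"))       # 1
print(edit_distance("ACTTG", "ATCTG"))  # 2
```

Only two rows are kept in memory because each cell depends on the previous row and on the cell to its left.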

  29. Edit distance and alignment of strings. The edit distance is related to the best alignment of strings. Given d(ACT,ACT)=0, d(ACT,AC)=1 and d(ACTTG,ATCTG)=2, which is the best alignment in each case?
      • ACT and ACT:
          ACT
          ACT
      • ACT and AC:
          ACT
          AC-
      • ACTTG and ATCTG (two alignments of cost 2):
          ACTTG      ACT-TG
          ATCTG      A-TCTG
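One way to recover an alignment from the distance computation is to keep the full matrix and trace back from the last cell. The sketch below assumes unit costs and prefers the diagonal move when several moves are optimal, so it returns one best alignment among possibly several; the name `align` is hypothetical.

```python
def align(a, b):
    """Return (distance, aligned_a, aligned_b) for one optimal alignment,
    using '-' for gaps."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)

    # Trace back from the bottom-right cell to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(align("ACT", "AC"))  # (1, 'ACT', 'AC-')
```

For ACTTG and ATCTG the routine returns an alignment of cost 2; which of the two alignments on the slide it produces depends on the tie-breaking order in the traceback.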

  30. Edit distance and alignment of strings. But what is the distance between the strings ACGCTATGCTATACG and ACGGTAGTGACGC, and what is the best alignment between them? This problem was first discussed in 1966, and the algorithm was proposed in 1968, 1970, … using the technique called “dynamic programming”.
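The same dynamic programming idea also answers the earlier search question (find ACTGA in the sequence allowing one error). A standard variant, sketched here under the assumption of unit costs, lets an occurrence start anywhere by fixing the cost of the empty pattern prefix at 0 for every text position, and reports each text position where the whole pattern is matched with at most k errors. The name `approx_search` is illustrative.

```python
def approx_search(text, pattern, k):
    """Return (end, errors) pairs: text[:end] has a suffix whose edit
    distance to the pattern is errors, with errors <= k."""
    m = len(pattern)
    # col[i] = best distance between pattern[:i] and a suffix of the text read so far.
    col = list(range(m + 1))
    hits = []
    for end, t in enumerate(text, 1):
        new = [0]  # an occurrence may start at any text position
        for i in range(1, m + 1):
            new.append(min(col[i - 1] + (pattern[i - 1] != t),  # match or mismatch
                           col[i] + 1,                           # extra text character
                           new[i - 1] + 1))                      # missing pattern character
        col = new
        if col[m] <= k:
            hits.append((end, col[m]))
    return hits


text = "CTACTACTACGTCTATACTGATCGTAGCTACTACATGC"
hits = approx_search(text, "ACTGA", 1)
print((21, 0) in hits)  # True: the exact occurrence text[16:21] == "ACTGA"
```

The other reported positions are occurrences with exactly one mismatch, insertion or deletion; neighbouring end positions often describe the same region of the text.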

  43. Edit distance and alignment of strings. [Dynamic programming matrix for the strings CTACTACTACGT and ACTGA; the cell values were lost in extraction.]

  45. Edit distance and alignment of strings. [Dynamic programming matrix for the strings CTACTACTACGT and ACTGA; the cell values were lost in extraction.] The highlighted cell contains the distance between AC and CTACT.
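The matrix on this slide can be rebuilt programmatically. The sketch below assumes that rows correspond to prefixes of ACTGA, columns to prefixes of CTACTACTACGT, and that all operations cost 1; the name `distance_matrix` is illustrative.

```python
def distance_matrix(rows_string, cols_string):
    """D[i][j] is the edit distance between rows_string[:i] and cols_string[:j]."""
    n, m = len(rows_string), len(cols_string)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all i characters
    for j in range(m + 1):
        D[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (rows_string[i - 1] != cols_string[j - 1]),
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D


D = distance_matrix("ACTGA", "CTACTACTACGT")
for label, row in zip("-ACTGA", D):
    print(label, row)

# Cell for the prefixes AC (2 characters) and CTACT (5 characters).
print(D[2][5])  # 3
```

The value 3 is the smallest possible here: the two strings differ in length by 3, and AC can be turned into CTACT with exactly three insertions.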
