1 / 34

Tècniques i Eines Bioinformàtiques

Tècniques i Eines Bioinformàtiques. Bioinformatics, Sequence and Genome Analysis David W. Mount Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot Algorithms on strings (2001) M. Crochemore, C. Hancart and T. Lecroq

Download Presentation

Tècniques i Eines Bioinformàtiques

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Tècniques i Eines Bioinformàtiques • Bioinformatics, Sequence and Genome Analysis • David W. Mount • Flexible Pattern Matching in Strings (2002) • Gonzalo Navarro and Mathieu Raffinot • Algorithms on strings (2001) • M. Crochemore, C. Hancart and T. Lecroq • http://www-igm.univ-mlv.fr/~lecroq/string/index.html

  2. Algorismes i estructures eficients de cerca String matching: definition of the problem (text,pattern) depends on what we have: text or patterns • Exact matching: • The patterns ---> Data structures for the patterns • 1 pattern ---> The algorithm depends on |p| and || • k patterns ---> The algorithm depends on k, |p| and || • The text ----> Data structure for the text (suffix tree, ...) • Approximate matching: • Dynamic programming • Sequence alignment (pairwise and multiple)

  3. Approximate string matching For instance, given the sequence CTACTACTACGTGACTAATACTGATCGTAGCTAC… search for the pattern ACTGA allowing one error… … but what is the meaning of “one error”?

  4. Edit distance Indel We accept three types of errors: 1. Mismatch: ACCGTGAT ACCGAGAT 2. Insertion: ACCGTGAT ACCGATGAT 3. Deletion: ACCGTGAT ACCGGAT The edit distance d between two strings is the minimum number of substitutions,insertions and deletions needed to transform the first string into the second one d(ACT,ACT)= d(ACT,AC)= d(ACT,C)= d(ACT,)= d(AC,ATC)= d(ACTTG,ATCTG)=

  5. Edit distance Indel We accept three types of errors: 1. Mismatch: ACCGTGAT ACCGAGAT 2. Insertion: ACCGTGAT ACCGATGAT 3. Deletion: ACCGTGAT ACCGGAT The edit distance d between two strings is the minimum number of substitutions,insertions and deletions needed to transform the first string into the second one d(ACT,ACT)= d(ACT,AC)= d(ACT,C)= d(ACT,)= d(AC,ATC)= d(ACTTG,ATCTG)= 0 1 2 3 1 2

  6. Edit distance and alignment of strings The Edit distance is related with the best alignment of strings Given d(ACT,ACT)=0 d(ACT,AC)=1 d(ACTTG,ATCTG)=2 which is the best alignment in every case? • ACT and ACT : ACT • ACT • ACT and AT: ACT • A -T • ACTTG and ATCTG: ACTTG ATCTG ACT - TG A - TCTG Then, the alignment suggest the substitutions, insertions and deletions to transform one string into the other

  7. Edit distance and alignment of strings But which is the distance between the strings ACGCTATGCTATACG and ACGGTAGTGACGC? … and the best alignment between them? 1966 was the first time this problem was discussed… and the algorithm was proposed in 1968,1970,… using the technique called “Dynamic programming”

  8. Edit distance and alignment of strings C T A C T A C T A C G T A C T G A

  9. Edit distance and alignment of strings C T A C T A C T A C G T A C T G A

  10. Edit distance and alignment of strings C T A C T A C T A C G T A C T G A The cell contains the distance between AC and CTACT.

  11. Edit distance and alignment of strings C T A C T A C T A C G T A C T G A ?

  12. Edit distance and alignment of strings C T A C T A C T A C G T 0 A C T G A ?

  13. Edit distance and alignment of strings C T A C T A C T A C G T 0 1 A C T G A ? - C

  14. Edit distance and alignment of strings C T A C T A C T A C G T 0 1 2 A C T G A ? - - CT

  15. Edit distance and alignment of strings C T A C T A C T A C G T 0 1 2 3 4 5 6 7 8 … A C T G A - - - - - - CTACTA

  16. Edit distance and alignment of strings C T A C T A C T A C G T 0 1 2 3 4 5 6 7 8 … A ? C ? T ? G A

  17. Edit distance and alignment of strings C T A C T A C T A C G T 0 1 2 3 4 5 6 7 8 … A 1 C 2 T 3 G… A ACT - - -

  18. Edit distance and alignment of strings - C C C C - BA(AC,CTA) BA(A,CTA) BA(A,CTAC) C T A C T A C T A C G T 0 1 2 3 4 5 6 7 8 … A 1 C 2 T 3 G A C T A C T A C T A C G T A C T G A d(AC,CTA)+1 d(A,CTA) BA(AC,CTAC)= best d(AC,CTAC)=min d(A,CTAC)+1

  19. Edit distance and alignment of strings Connect to http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm and use the global method.

  20. Edit distance and alignment of strings How this algorithm can be applied to the approximate search? to the K-approximate string searching?

  21. K-approximate string searching C T A C T A C T A C G T A C T G G T G A A … A C T G A This cell …

  22. K-approximate string searching C T A C T A C T A C G T A C T G G T G A A … A C T G A This cell gives the distance between (ACTGA, CT…GTA)… …but we only are interested in the last characters

  23. K-approximate string searching C T A C T A C T A C G T A C T G G T G A A … A C T G A This cell gives the distance between (ACTGA, CT…GTA)… …but we only are interested in the last characters

  24. K-approximate string searching * * * * * * C T A C G T A C T G G T G A A … A C T G A This cell gives the distance between (ACTGA, CT…GTA)… …but we only are interested in the last characters… …no matter where they appears in the text, then…

  25. K-approximate string searching * * * * * * C T A C G T A C T G G T G A A … 0 A C T G A This cell gives the distance between (ACTGA, CT…GTA)… …but we only are interested in the last characters… …no matter where they appears in the text, then…

  26. K-approximate string searching * * * * * * C T A C G T A C T G G T G A A … 0 A C T G A This cell gives the distance between (ACTGA, CT…GTA)… …but we only are interested in the last characters… …no matter where they appears in the text, then…

  27. K-approximate string searching C T A C T A C T A C G T A C T G G T G A A … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 A C T G A This cell gives the distance between (ACTGA, CT…GTA)… …but we only are interested in the last characters… …no matter where they appears in the text, then

  28. K-approximate string searching Connect to http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm and use the semi-global method.

  29. Bioinformatics Pairwise and multiple alignment

  30. Pairwise alignment + - s(A,CTAC)-2 s(AC,CTACT)=maximum s(A,CTA) 1 s(AC,CTA)-2 Edit distance: match=0 mismatch=1 indel=1 d(A,CTAC)+1 d(AC,CTACT)=minimum d(A,CTA)….+1 d(AC,CTA)+1 Similarity: match=1 mismatch=-1 indel=-2

  31. Pairwise alignment Connect to http://alggen.lsi.upc.es Links to TEACHING EMBER LePA

  32. Pairwise to multiple alignment S2 A C A -1 S3 __ S1 What happens with three strings? Let n be their lenght, then the cost becomes O(n3) “O(23)” “O(32)” And with k strings? O(nk 2k k2)

  33. Multiple alignment Programs of multialignment use different heuristics: • Clustal (Progressive alignment) http://www.ebi.ac.uk/clustalw • TCoffee (Progressive alignment + data bases) http://igs-server.cnrs-mrs.fr/Tcoffee_cgi/index.cgi • HMM (Hidden Markov Models)

  34. Multiple alignment Connect to http://alggen.lsi.upc.es/ and follow the links TEACHING EMBER.

More Related