
CAP5510 – Bioinformatics Multiple Sequence Alignment

CAP5510 – Bioinformatics Multiple Sequence Alignment. Tamer Kahveci CISE Department University of Florida. Goals. Understand What is multiple alignment Why align multiple sequences Learn How multiple alignments are scored Major multiple alignment methods Dynamic programming Standard


Presentation Transcript


  1. CAP5510 – Bioinformatics Multiple Sequence Alignment Tamer Kahveci CISE Department University of Florida

  2. Goals • Understand • What is multiple alignment • Why align multiple sequences • Learn • How multiple alignments are scored • Major multiple alignment methods • Dynamic programming • Standard • MSA • Progressive alignment • Star • CLUSTALW

  3. What is Multiple Alignment? • Alignment of more than two sequences • Global: multiple alignment • http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/

     scxa_buteu  vrdgyiaddk dcayfcgr.. .naycdeeck ...kgaesgk cwyagqygna
     scx1_titse  .kdgypveyd ncayicwnyd .naycdklck ..dkkadsgy cyw...vhil
     scx6_titse  .regypadsk gckitcflta .agycntect ..lkkgssgy caw.....pa
     scx1_cenno  .kdgylvdak gckkncyklg kndycnrecr mkhrggsygy c.....ygfg
     six2_leiqu  ..dgyirkrd gcklsclfg. .negcnkeck ..syggsygy cwt...wgla

     scxa_buteu  cwcyklpdwv pikqkvsgk. cn....
     scx1_titse  cycyglpdse ptktn..gk. cksgkk
     scx6_titse  cycyglpesv kiwtsetnk. c.....
     scx1_cenno  cyceglsdst ptwplp.nkt csgk..
     six2_leiqu  cwceglpd.e ktwksetn.t cg....

  4. What is Local Multiple Alignment? • Local: motif (http://blocks.fhcrc.org/blocks-bin/getblock.sh?PR00624)

     ID   HISTONEH5; BLOCK
     AC   PR00624A; distance from previous block=(9,12)
     DE   Histone H5 signature
     BL   adapted; width=22; seqs=9; 99.5%=986; strength=1407
     H10_HUMAN|P07305 ( 10) AKPKRAKASKKSTDHPKYSDMI   63
     H5A_XENLA|P22844 ( 11) AKPKRSKALKKSTDHPKYSDMI   71
     H10_RAT|P43278   ( 10) AKPKRAKAAKKSTDHPKYSDMI   70
     H10_MOUSE|P10922 ( 10) AKPKRAKASKKSTDHPKYSDMI   63
     Q91759           (  9) AKPRRSKASKKSTDHPKYSDMI   71
     H5B_XENLA|P22845 (  9) AKPRRSKASKKSTDHPKYSDMI   71
     H5_CHICK|P02259  ( 11) AKPKRVKASRRSASHPTYSEMI  100
     H5_CAIMO|P06513  ( 12) AKPKRAKAPRKPASHPSYSEMI   91
     H5_ANSAN|P02258  ( 12) AKPKRARAPRKPASHPTYSEMI  100

  5. Why Multiple Sequence Alignment • Basis for phylogeny • Helps find conserved regions in sets of proteins • Conserved regions • Provide insight into substitution patterns • Give hints about functional sites

  6. How to Evaluate Multiple Alignments

  7. Sum of Pairs (SP) • Sum of induced pairwise alignment score of all pairs • Ignore space pairs aligned together

     Multiple alignment:
       A cwcyklpdwv pikqkvsgk. cn....
       B cycyglpdse ptktn..gk. cksgkk
       C cycyglpesv kiwtsetnk. c.....
       D cyceglsdst ptwplp.nkt csgk..

     SP score = sum of the scores of the six induced pairwise alignments:
       A cwcyklpdwv pikqkvsgk cn....
       B cycyglpdse ptktn..gk cksgkk
     +
       A cwcyklpdwv pikqkvsgk cn
       C cycyglpesv kiwtsetnk c.
     +
       A cwcyklpdwv pikqkvsgk. cn..
       D cyceglsdst ptwplp.nkt csgk
     +
       B cycyglpdse ptktn..gk cksgkk
       C cycyglpesv kiwtsetnk c.....
     +
       B cycyglpdse ptktn.gk. cksgkk
       D cyceglsdst ptwplpnkt csgk..
     +
       C cycyglpesv kiwtsetnk. c...
       D cyceglsdst ptwplp.nkt csgk
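The SP score on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used in the course: the `match`, `mismatch` and `gap` values are assumed for the example (a substitution matrix such as BLOSUM62 would normally replace them), and the function names are hypothetical.

```python
from itertools import combinations

GAP_CHARS = {"-", "."}  # both gap symbols appear in the slides

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one column of an induced pairwise alignment (assumed toy scoring)."""
    a_gap, b_gap = a in GAP_CHARS, b in GAP_CHARS
    if a_gap and b_gap:
        return 0  # two spaces aligned together are ignored
    if a_gap or b_gap:
        return gap
    return match if a == b else mismatch

def sum_of_pairs(rows, **scoring):
    """Add up the induced pairwise scores over all pairs of rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(a, b, **scoring)
        for x, y in combinations(rows, 2)
        for a, b in zip(x, y)
    )

rows = ["M-PE", "M-KE", "MSKE", "S-KE"]
print(sum_of_pairs(rows))
```

Because gap-gap columns contribute zero, scoring the rows of the multiple alignment directly gives the same total as scoring each induced pairwise alignment after deleting those columns.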

  8. BAliBASE Benchmark • Compare to a set of hand-aligned sequences • Check positions of letters • If the letters appear at the same position as the benchmark => good • Score between 0 (worst) and 1 (best) • http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/prog_scores.html
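One way to turn "letters appear at the same position as the benchmark" into a number between 0 and 1 is to count how many residue pairs aligned in the reference are also aligned in the test alignment. The sketch below is an assumption about how such a score can be computed, not a reproduction of the BAliBASE scoring program; `aligned_pairs` and `pair_recovery` are hypothetical names.

```python
from itertools import combinations

def aligned_pairs(rows, gaps="-."):
    """Return the set of residue pairs that share a column.

    A residue is identified by (row index, index of the residue within its
    ungapped sequence), so the result does not depend on column numbering.
    """
    counters = [0] * len(rows)
    pairs = set()
    for column in zip(*rows):
        present = []
        for k, ch in enumerate(column):
            if ch not in gaps:
                present.append((k, counters[k]))
                counters[k] += 1
        pairs.update(combinations(present, 2))
    return pairs

def pair_recovery(test_rows, reference_rows):
    """Fraction of reference-aligned residue pairs also aligned in the test."""
    reference = aligned_pairs(reference_rows)
    if not reference:
        return 1.0
    return len(reference & aligned_pairs(test_rows)) / len(reference)

reference = ["M-KE", "MSKE"]
print(pair_recovery(["M-KE", "MSKE"], reference))
print(pair_recovery(["MK-E", "MSKE"], reference))
```

The first call compares the reference with itself; the second misplaces the K, so one of the three reference pairs is lost.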

  9. Finding Multiple Sequence Alignments

  10. Dynamic Programming

  11. Dynamic Programming • Similar to pairwise alignment • Compare NV and NS: 2^2 − 1 = 3 cases for the last column

      S(NV, NS) = max { S(N, N) + δ(V, S),  S(N, NS) + δ(V, −),  S(NV, N) + δ(−, S) }

     • If k sequences are aligned • => k-dimensional matrix is filled
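The three-case recurrence for two sequences can be written out as a short Python sketch. The scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration and stand in for a real substitution matrix; the function name is hypothetical.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the 2-dimensional table; each cell takes the best of 3 cases."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap          # prefix of x against nothing
    for j in range(1, m + 1):
        S[0][j] = j * gap          # prefix of y against nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sub,   # letter aligned with letter
                S[i - 1][j] + gap,       # letter of x aligned with a space
                S[i][j - 1] + gap,       # letter of y aligned with a space
            )
    return S[n][m]

print(global_alignment_score("NV", "NS"))
print(global_alignment_score("MKE", "MSKE"))
```

For NV and NS the best choice under these assumed scores is the first case (V aligned with S), since either gap option costs more than one mismatch.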

  12. Dynamic Programming • k = 3: 2^k − 1 = 7 cases [Figure: 3-dimensional matrix, axes labeled A, S, V]

  13. Complexity • Space complexity: O(n^k) for k sequences each n long. • Computing at a cell: O(2^k) × cost of computing δ. • Time complexity: O(2^k n^k) × cost of computing δ. • Finding the optimal solution is exponential in k • Proven to be NP-complete for a number of cost functions
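The 2^k − 1 cases per cell and the k-dimensional table can both be made concrete with a small Python sketch. It minimizes an SP cost rather than maximizing a score, and the `mismatch` and `gap` costs are assumed values; all function names are hypothetical. It is only practical for a handful of very short sequences, which is exactly the point of the complexity bounds above.

```python
from functools import lru_cache
from itertools import combinations, product

def moves_for(k):
    """All non-zero 0/1 vectors: which sequences contribute a letter."""
    return [m for m in product((0, 1), repeat=k) if any(m)]

def column_cost(column, mismatch=1, gap=2):
    """SP cost of one column (this plays the role of delta)."""
    cost = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                      # space-space pairs are free
        cost += gap if "-" in (a, b) else (0 if a == b else mismatch)
    return cost

def optimal_sp_cost(seqs):
    """Exact minimum SP cost via a memoized k-dimensional recurrence."""
    moves = moves_for(len(seqs))

    @lru_cache(maxsize=None)
    def best(pos):
        if not any(pos):
            return 0                      # all prefixes empty
        candidates = []
        for move in moves:
            prev = tuple(p - m for p, m in zip(pos, move))
            if min(prev) < 0:
                continue                  # would step outside the table
            column = [s[p - 1] if m else "-" for s, p, m in zip(seqs, pos, move)]
            candidates.append(best(prev) + column_cost(column))
        return min(candidates)

    return best(tuple(len(s) for s in seqs))

print(len(moves_for(2)), len(moves_for(3)))
print(optimal_sp_cost(["MPE", "MKE", "MSKE"]))
```

Each call to `best` loops over every move and evaluates one column, so the work per cell grows as 2^k times the cost of `column_cost`, and the number of distinct cells grows as n^k.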

  14. MSA (Carrillo, Lipman ’88)

  15. MSA – Idea [Figure: labels 1, 2, 3]

  16. MSA algorithm (1/3) • Find pairwise alignment • Trial multiple alignment produced by a tree, cost = d • This provides a limit to the volume within which optimal alignments are found • Specifics • Sequences x_1, ..., x_r • Alignment A, cost = c(A) • Optimal alignment A* • A_ij = alignment of x_i, x_j induced by A • D(x_i, x_j) = cost of optimal pairwise alignment of x_i, x_j <= c(A_ij)

  17. MSA algorithm (2/3) (every Σ runs over pairs i < j with (i,j) ≠ (u,v))

      d >= c(A*) = c(A*_uv) + Σ c(A*_ij) >= c(A*_uv) + Σ D(x_i, x_j)
      c(A*_uv) <= d − Σ D(x_i, x_j) = B(u,v)

     • Compute B(u,v) for each pair u, v • Consider any cell f with projection (s,t) on the u,v plane. • If A* passes through f then A*_uv passes through (s,t) • best_uv(s,t) = cost of the best pairwise alignment of x_u, x_v that passes through (s,t). • best_uv(s,t) = distance of the prefixes up to (s,t) + cost of aligning position s of x_u with position t of x_v + distance of the suffixes after (s,t)

  18. MSA algorithm (3/3) • If best_uv(s,t) > B(u,v), then • A* cannot pass through cell f • Discard such cells from computation of DP
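The pruning test can be sketched in Python for a single pair of sequences. This is an illustration under stated assumptions, not the MSA program itself: costs are a toy `mismatch`/`gap` scheme, and "passes through (s,t)" is taken to mean passing through the lattice point between prefixes and suffixes, so the cost is simply prefix cost plus suffix cost. All function names are hypothetical.

```python
from itertools import combinations

def prefix_costs(x, y, mismatch=1, gap=2):
    """F[s][t] = cost of an optimal alignment of x[:s] with y[:t]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for s in range(len(x) + 1):
        for t in range(len(y) + 1):
            if s == 0 and t == 0:
                continue
            options = []
            if s and t:
                options.append(F[s - 1][t - 1] + (0 if x[s - 1] == y[t - 1] else mismatch))
            if s:
                options.append(F[s - 1][t] + gap)
            if t:
                options.append(F[s][t - 1] + gap)
            F[s][t] = min(options)
    return F

def best_through(x, y, **costs):
    """Cost of the best alignment of x, y through each lattice point (s, t)."""
    F = prefix_costs(x, y, **costs)
    R = prefix_costs(x[::-1], y[::-1], **costs)  # suffix costs via reversal
    n, m = len(x), len(y)
    return [[F[s][t] + R[n - s][m - t] for t in range(m + 1)] for s in range(n + 1)]

def bounds(seqs, d, **costs):
    """B(u,v) = d minus the optimal pairwise costs of all other pairs."""
    D = {(i, j): prefix_costs(seqs[i], seqs[j], **costs)[-1][-1]
         for i, j in combinations(range(len(seqs)), 2)}
    total = sum(D.values())
    return {pair: d - (total - D[pair]) for pair in D}

def allowed_points(x, y, bound, **costs):
    """Projections that survive the test best(s,t) <= B(u,v)."""
    best = best_through(x, y, **costs)
    return {(s, t) for s, row in enumerate(best) for t, b in enumerate(row) if b <= bound}

seqs = ["MPE", "MKE", "MSKE"]
B = bounds(seqs, d=6)          # d = cost of an assumed trial alignment
print(B)
print(sorted(allowed_points(seqs[0], seqs[1], B[(0, 1)])))
```

With a tight trial cost, only the main diagonal of the (MPE, MKE) plane survives, so any cell of the 3-dimensional table that projects elsewhere on that plane can be skipped.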

  19. Question • Align: s1: MPE • s2: MKE • s3: MSKE • s4: SKE • Scoring matrix: BLOSUM62

  20. Progressive Alignment

  21. Star Alignment

  22. Star Alignments • Heuristic method for multiple sequence alignments • Select a sequence c as the center of the star • For each sequence x1, …, xk such that xi ≠ c, perform a Needleman-Wunsch global alignment for xi and c

  23. Star Alignments Example • s1: MPE • s2: MKE • s3: MSKE • s4: SKE [Figure: star with s2 at the center, joined to s1, s3, s4]

      Pairwise alignments with the center s2:
        MPE     MSKE    SKE
        | |     | ||     ||
        MKE     M-KE    MKE

      Merged alignments, before and after adding s4:
        M-PE    M-PE
        M-KE    M-KE
        MSKE    MSKE
                S-KE

     • Every induced pairwise alignment with the center sequence is optimal. • How should we choose a center? (Exercise: try s4 as the center) • Try all of them?
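The example on this slide can be reproduced with a compact Python sketch of star alignment. Several things here are assumptions for illustration: the toy `match`/`mismatch`/`gap` scores replace BLOSUM62, the center is chosen by the largest sum of pairwise scores (one possible answer to "try all of them"), and the merge follows the common "once a gap, always a gap" rule. Function names are hypothetical.

```python
def align_pair(x, y, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch with traceback: returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    sub = lambda i, j: match if x[i - 1] == y[j - 1] else mismatch
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(i, j), S[i - 1][j] + gap, S[i][j - 1] + gap)
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return S[n][m], "".join(reversed(ax)), "".join(reversed(ay))

def choose_center(seqs):
    """Index of the sequence with the largest total pairwise score."""
    totals = [sum(align_pair(s, t)[0] for t in seqs if t is not s) for s in seqs]
    return totals.index(max(totals))

def star_alignment(seqs, center_index):
    """Merge the pairwise alignments with the center into one alignment."""
    c = seqs[center_index]
    width = [0] * (len(c) + 1)        # gap columns needed before each center letter
    inserts, hits = {}, {}
    for k, s in enumerate(seqs):
        if k == center_index:
            continue
        _, ac, a_s = align_pair(c, s)
        ins, hit, p = [""] * (len(c) + 1), [], 0
        for a, b in zip(ac, a_s):
            if a == "-":
                ins[p] += b           # letter of s placed in a gap of the center
            else:
                hit.append(b); p += 1 # letter (or gap) under center letter p
        inserts[k], hits[k] = ins, hit
        width = [max(w, len(i)) for w, i in zip(width, ins)]
    rows = []
    for k in range(len(seqs)):
        ins = inserts.get(k, [""] * (len(c) + 1))
        hit = hits.get(k, list(c))
        row = ""
        for p in range(len(c) + 1):
            row += ins[p].ljust(width[p], "-")
            if p < len(c):
                row += hit[p]
        rows.append(row)
    return rows

seqs = ["MPE", "MKE", "MSKE", "SKE"]
center = choose_center(seqs)
print(center, star_alignment(seqs, center))
```

Under these assumed scores the chosen center is s2 (index 1), and the merged rows match the four-row alignment in the example. Running `star_alignment(seqs, 3)` is one way to try the exercise with s4 as the center.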

  24. CLUSTAL-W (Thompson, Higgins, Gibson 1994)

  25. CLUSTAL-W (1/4) • Given sequences A, B, C, D, E • Compare all pairs and construct a distance matrix
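A distance matrix over all pairs can be sketched as follows. Note the substitution: this uses normalized edit distance as a simple stand-in, which is an assumption for illustration and not the distance measure CLUSTAL-W itself computes from its pairwise alignments. Function names are hypothetical.

```python
def edit_distance(x, y):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j - 1] + (a != b),  # substitute or match
                           prev[j] + 1,             # delete from x
                           cur[j - 1] + 1))         # insert into x
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric k x k matrix of distances scaled to the range 0..1."""
    k = len(seqs)
    D = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            longest = max(len(seqs[i]), len(seqs[j]), 1)
            D[i][j] = D[j][i] = edit_distance(seqs[i], seqs[j]) / longest
    return D

for row in distance_matrix(["MPE", "MKE", "MSKE", "SKE"]):
    print(["%.2f" % d for d in row])
```

All k(k − 1)/2 pairs are compared once; the matrix is then the input to the tree-building step.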

  26. CLUSTAL-W (2/4) • Find phylogenetic tree for A, B, C, D, E using neighbor joining [Figure: neighbor-joining steps producing a tree over A, B, C, D, E]

  27. CLUSTAL-W (3/4) • Align sequences starting from leaf level • Edge weights are used to compute the score of the alignment • O(k^2 n^2) time • O(n^2) space • Result depends on sequence order [Figure: tree over A, B, C, D, E]

  28. CLUSTAL-W (4/4) • Sample query using ClustalW • http://www.ebi.ac.uk/clustalw/

  29. CLUSTAL-W (4/4) • Sample query using ClustalW

  30. Other Progressive Methods • T-COFFEE • PILEUP • MUSCLE • …

  31. T-coffee (Notredame, Higgins, Heringa 2000) • Find a library of alignments between pairs of sequences. • Create a new scoring matrix for each pair of sequences using the library • Directly from alignment of s1 and s2 • Indirectly through alignment of s1, s3 and s3, s2. • Use these scoring matrices during progressive alignment [Figure: scoring matrix for s1 and s2]

  32. T-Coffee (1/2) • Given sequences A, B, C, D, E • Create primary library

  33. T-Coffee (2/2) • Create extended library • Create similarity matrix

  34. Iterative Alignment

  35. PRRP • 1. Find some initial alignment • 2. Construct phylogenetic tree based on multiple alignment • 3. Align sequences • Go back if the result has improved [Figure: tree over A, B, C, D, E]

      A cwcyklpdwv pikqkvsgk. cn....
      B cycyglpdse ptktn..gk. cksgkk
      C cycyglpesv kiwtsetnk. c.....
      D cyceglsdst ptwplp.nkt csgk..
      E cyceglpdst piwplp.nkt ctgk..

  36. Other methods • Genetic algorithm (machine learning) • Partial order graphs (graph matching) • HMMER (hidden Markov model) • For a comparison: • http://www.cise.ufl.edu/~tamer/papers/psb2006.pdf

  37. Motif Logos

      ID   HISTONEH5; BLOCK
      AC   PR00624A; distance from previous block=(9,12)
      DE   Histone H5 signature
      BL   adapted; width=22; seqs=9; 99.5%=986; strength=1407
      H10_HUMAN|P07305 ( 10) AKPKRAKASKKSTDHPKYSDMI   63
      H5A_XENLA|P22844 ( 11) AKPKRSKALKKSTDHPKYSDMI   71
      H10_RAT|P43278   ( 10) AKPKRAKAAKKSTDHPKYSDMI   70
      H10_MOUSE|P10922 ( 10) AKPKRAKASKKSTDHPKYSDMI   63
      Q91759           (  9) AKPRRSKASKKSTDHPKYSDMI   71
      H5B_XENLA|P22845 (  9) AKPRRSKASKKSTDHPKYSDMI   71
      H5_CHICK|P02259  ( 11) AKPKRVKASRRSASHPTYSEMI  100
      H5_CAIMO|P06513  ( 12) AKPKRAKAPRKPASHPSYSEMI   91
      H5_ANSAN|P02258  ( 12) AKPKRARAPRKPASHPTYSEMI  100
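A motif logo draws each column with a height that reflects how conserved it is. The Python sketch below computes one common measure for the block above, the information content log2(20) minus the column entropy. It is an illustration only: it assumes a uniform background over 20 amino acids and omits the small-sample correction that logo tools usually apply; the function name is hypothetical.

```python
from collections import Counter
from math import log2

BLOCK = [
    "AKPKRAKASKKSTDHPKYSDMI",
    "AKPKRSKALKKSTDHPKYSDMI",
    "AKPKRAKAAKKSTDHPKYSDMI",
    "AKPKRAKASKKSTDHPKYSDMI",
    "AKPRRSKASKKSTDHPKYSDMI",
    "AKPRRSKASKKSTDHPKYSDMI",
    "AKPKRVKASRRSASHPTYSEMI",
    "AKPKRAKAPRKPASHPSYSEMI",
    "AKPKRARAPRKPASHPTYSEMI",
]

def column_information(rows, alphabet_size=20):
    """Information content in bits for every column of a gap-free block."""
    info = []
    for column in zip(*rows):
        n = len(column)
        counts = Counter(column)
        entropy = -sum((c / n) * log2(c / n) for c in counts.values())
        info.append(log2(alphabet_size) - entropy)
    return info

for position, bits in enumerate(column_information(BLOCK), 1):
    print(position, round(bits, 2))
```

Fully conserved columns, such as the leading A, K, P, reach the maximum of log2(20) bits, while variable columns score lower and would be drawn shorter in the logo.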
