
Multiple Sequence Alignment

Based on slides by Irit Gat-Viks.


Presentation Transcript


  1. Multiple Sequence Alignment Based on slides by Irit Gat-Viks

  2. Example
     VTISCTGSSSNIGAG-NHVKWYQQLPG
     VTISCTGTSSNIGS--ITVNWYQQLPG
     LRLSCSSSGFIFSS--YAMYWVRQAPG
     LSLTCTVSGTSFDD--YYSTWVRQPPG
     PEVTCVVVDVSHEDPQVKFNWYVDG--
     ATLVCLISDFYPGA--VTVAWKADS--
     AALGCLVKDYFPEP--VTVSWNSG---
     VSLTCLVKGFYPSD--IAVEWWSNG--

  3. Why multiple sequence alignment?
     • Structural similarity: amino acids that play the same role in each structure are in the same column.
     • Evolutionary similarity: amino acids derived from the same ancestral residue are in the same column.
     • Functional similarity: amino acids with the same function are in the same column.
     • When the sequences are closely related, structural, evolutionary and functional similarity are equivalent.

  4. Multiple Alignment Definition
     Input: sequences S1, S2, …, Sk over the same alphabet
     Output: gapped sequences S'1, S'2, …, S'k of equal length
     • |S'1| = |S'2| = … = |S'k|
     • Removal of spaces from S'i gives Si, for all i

  5. Example
     S1 = AGGTC, S2 = GTTCG, S3 = TGAAC
     Possible alignment:
       S'1: AGGT-C-
       S'2: GTT--CG
       S'3: -TGAAC-
     Possible alignment:
       S'1: AGGT-C-
       S'2: -G-TTCG
       S'3: TG-A-AC
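The two conditions of the definition can be checked mechanically. The sketch below is only an illustration; the function name `is_valid_alignment` and the choice of `-` as the gap character are assumptions, not part of the slides.

```python
def is_valid_alignment(originals, gapped, gap="-"):
    """Check the two conditions of the multiple-alignment definition."""
    if len(originals) != len(gapped):
        return False
    # Condition 1: all gapped sequences have the same length.
    if len({len(row) for row in gapped}) > 1:
        return False
    # Condition 2: removing the gaps from S'_i gives back S_i.
    return all(row.replace(gap, "") == s for s, row in zip(originals, gapped))


seqs = ["AGGTC", "GTTCG", "TGAAC"]
candidate = ["AGGT-C-", "GTT--CG", "-TGAAC-"]
print(is_valid_alignment(seqs, candidate))
```

Many different gapped arrangements pass this check for the same input, which is why a scoring function is needed to choose among them.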

  6. Example Multiple sequence alignment of 7 neuroglobins using clustalx

  7. Human-centric beta globin Multiple Alignment http://globin.cse.psu.edu/

  8. MSA applications
     • Generate protein families
     • Extrapolation: membership of an uncharacterized sequence in a protein family.
     • Understand evolution: a preliminary step in molecular evolution analysis for constructing phylogenetic trees, e.g., is the duck evolutionarily closer to a lion or to a fruit fly?
     • Pattern identification: find the important (conserved) regions in the protein; conserved positions may characterize a function.
     • Domain identification: build a consensus/profile/motif that describes the protein family and helps to describe new members of the family.
     • DNA regulatory elements
     • Structure prediction (secondary and 3D model).
     • Alignment of multiple sequences may reveal weak signals

  9. Protein Phylogenies – Example Kinase domain

  10. Scoring alignments
     • Given input sequences S1, S2, …, Sk, find a multiple alignment of optimal score
     • Scores preview:
       • Sum of pairs
       • Consensus
       • Tree
     • Varying methods (and controversy)

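  11. Sum of Pairs score
     Def: Induced pairwise alignment: a pairwise alignment induced by the multiple alignment (keep the two rows and drop the columns in which both rows have a gap).
     Example:
       x: AC-GCGG-C
       y: AC-GC-GAG
       z: GCCGC-GAG
     Induces:
       x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
       y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
     • $S(M) = \sum_{k<l} \sigma(S'_k, S'_l)$, where $\sigma(S'_k, S'_l)$ is the score of the pairwise alignment that M induces on the two sequences.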
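As a small illustration of the induced pairwise alignment, the following sketch keeps two rows of a multiple alignment and drops the columns where both have a gap. The helper name `induced_pairwise` is hypothetical.

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project a multiple alignment onto two of its rows.

    Columns in which both rows hold a gap carry no information for
    this pair, so they are removed.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))
print(induced_pairwise(x, z))
print(induced_pairwise(y, z))
```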

  12. SOP Score Example
     Consider the following alignment:
       AC-CDB-
       -C-ADBD
       A-BCDAD
     Scoring scheme: match = 0, mismatch/indel = -1
     SP score: (-4) + (-3) + (-5) = -12
     Multiple alignment with SOP scores is NP-hard.
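The SP score of this example can be recomputed with a few lines of code. This is a minimal sketch under the slide's scoring scheme (match 0, mismatch/indel -1); the function names are illustrative, and columns where both rows have a gap are skipped because they do not appear in the induced pairwise alignment.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=0, penalty=-1, gap="-"):
    """Score the pairwise alignment induced on two rows."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # column is absent from the induced alignment
        score += match if a == b else penalty
    return score


def sum_of_pairs(rows, **scheme):
    """Sum the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b, **scheme) for a, b in combinations(rows, 2))


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print([pair_score(a, b) for a, b in combinations(rows, 2)])
print(sum_of_pairs(rows))
```

Evaluating the score of a given alignment is cheap; the hard part, as the slide notes, is finding the alignment that optimizes it.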

  13. The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains its optimal value.
     Aligned:
       -SCGPFIRV
       MSCGPGLRA
       -SCTPHL-A
       MSC-PKIRG
       MS-LPLLRN
       MSHKPALRA
     Unaligned:
       SCGPFIRV
       MSCGPGLRA
       SCTPHLA
       MSCPKIRG
       MSLPLLRN
       MSHKPALRA
     Can we simply generalize the dynamic programming method? In theory yes, in practice no. The running time and the memory both grow as N^K, where N is the sequence length and K is the number of sequences, which is practically infeasible for K > 3.
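To get a feeling for this growth: the direct generalization of pairwise dynamic programming fills a K-dimensional table with (N+1)^K cells, and each cell looks at up to 2^K - 1 predecessor cells. The sketch below only does this arithmetic; the function names and the choice N = 100 are illustrative assumptions.

```python
def dp_cells(n, k):
    """Number of cells in the k-dimensional DP table for k sequences of length n."""
    return (n + 1) ** k


def predecessors_per_cell(k):
    """Each cell is reached from every non-empty subset of 'advance one step' moves."""
    return 2 ** k - 1


for k in (2, 3, 4, 5):
    print(k, dp_cells(100, k), predecessors_per_cell(k))
```

For N = 100 the table already has about a million cells at K = 3 and about ten billion at K = 5.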

  14. Consensus MSA
     • Score: the sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence
     • More difficult to find/define, as the consensus sequence itself is difficult to define
     • Used mainly for computational proofs
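A minimal sketch of the consensus score, assuming the distance is the number of positions at which a gapped row differs from the consensus (a Hamming distance; other distance functions are possible). The consensus string used here is hypothetical and chosen only for the example.

```python
def consensus_score(rows, consensus):
    """Sum over all rows of the number of positions that differ from the consensus."""
    if any(len(row) != len(consensus) for row in rows):
        raise ValueError("rows and consensus must have the same length")
    return sum(sum(a != c for a, c in zip(row, consensus)) for row in rows)


rows = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A"]
hypothetical_consensus = "MSCGPHLRA"  # assumption, not taken from the slides
print(consensus_score(rows, hypothetical_consensus))
```

A lower total means the rows sit closer to the chosen consensus; the difficulty mentioned above is that the consensus itself has to be chosen as part of the optimization.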

  15. Scoring metrics: examples
     Sum of pairs:
       -SCGPFIRV
       MSCGPGLRA
       -SCTPHL-A
     The three induced pairwise alignments have 5, 3 and 5 matching positions, for a total of 13.
     Distance from consensus:
       -SCGPFIRV
       MSCGPGLRA
       -SCTPHL-A
       MSC-PKIRG
       MS-LPLLRN
       MSHKPALRA
     [Figure: per-column "total homogeneity" and "total distance" values for this alignment]

  16. Tree MSA
     • Input: tree T, a string for each leaf
     • Phylogenetic alignment for T: assignment of a string to each internal node
     • Score: (weighted) sum of scores along the edges
     • Goal: find a phylogenetic alignment of optimal score
     • Consensus = phylogenetic alignment where T is a star
     [Figure: example tree with node strings CTGG, CTGG, CCGG, GTTG, CTTG, GTTC, GTTG]
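The tree score can be sketched as a sum over edges. In the example below the node strings are the ones visible in the slide's figure, but the tree topology, the node names and the use of Hamming distance as the edge score are assumptions made only for illustration.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def tree_score(labels, edges):
    """(Weighted) sum of edge scores; edges are (parent, child, weight) triples."""
    return sum(w * hamming(labels[u], labels[v]) for u, v, w in edges)


# Hypothetical topology: a root with two internal children, each with two leaves.
labels = {
    "root": "CTTG", "left": "CTGG", "right": "GTTG",
    "leaf1": "CTGG", "leaf2": "CCGG", "leaf3": "GTTC", "leaf4": "GTTG",
}
edges = [
    ("root", "left", 1), ("root", "right", 1),
    ("left", "leaf1", 1), ("left", "leaf2", 1),
    ("right", "leaf3", 1), ("right", "leaf4", 1),
]
print(tree_score(labels, edges))
```

When T is a star, every edge connects the single internal node to a leaf, and the same function computes the consensus score.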

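  17. Profile Representation of MA
       - A G G C T A T C A C C T G
       T A G - C T A C C A - - - G
       C A G - C T A C C A - - - G
       C A G - C T A T C A C - G G
       C A G - C T A T C G C - G G

     Profile (fraction of each character in each column):
       col   1   2   3   4   5   6   7   8   9  10  11  12  13  14
       A     0   1   0   0   0   0   1   0   0  .8   0   0   0   0
       C    .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
       G     0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
       T    .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
       -    .2   0   0  .8   0   0   0   0   0   0  .4  .8  .4   0

     • Alternatively, use log odds:
       • $p_i(a)$ = fraction of a's in column i
       • $p(a)$ = fraction of a's overall
       • $\log \frac{p_i(a)}{p(a)}$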
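The profile above can be recomputed from the five aligned rows. The sketch below is illustrative: the function names are assumptions, gaps are written as `-`, and the log-odds variant simply leaves out characters that never occur in a column (their log odds would be minus infinity).

```python
import math
from collections import Counter


def profile(rows):
    """Per-column character frequencies of an alignment."""
    n = len(rows)
    return [{ch: cnt / n for ch, cnt in Counter(col).items()} for col in zip(*rows)]


def log_odds(rows):
    """log(p_i(a) / p(a)) for every character a that occurs in column i."""
    total = Counter("".join(rows))
    size = sum(total.values())
    return [
        {ch: math.log(f / (total[ch] / size)) for ch, f in col.items()}
        for col in profile(rows)
    ]


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
prof = profile(rows)
print(prof[0])   # column 1
print(prof[7])   # column 8
```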

  18. Aligning a sequence to a profile
     • The key in pairwise alignment is scoring two positions x, y: $\sigma(x, y)$
     • For a letter x and a column y in a profile, $\sigma(x, y)$ = value of x in column y
     • Invent a score for $\sigma(x, -)$
     • Run the DP algorithm for pairwise alignment
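A minimal sketch of this procedure, assuming a global (Needleman-Wunsch style) recurrence. The score of a letter against a profile column is the column's value for that letter, a gap in the sequence is scored by the column's value for `-`, and the "invented" score for a letter against a newly opened gap column is the parameter `gap_score`. All names and the toy profile are hypothetical.

```python
def align_to_profile(seq, prof, gap_score=-1.0, gap="-"):
    """Globally align seq to a profile (list of {char: frequency} dicts).

    Returns the best score and a list of (character, profile column index)
    pairs; the index is None when the letter opens a new column.
    """
    n, m = len(seq), len(prof)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0], back[i][0] = dp[i - 1][0] + gap_score, "up"
    for j in range(1, m + 1):
        dp[0][j], back[0][j] = dp[0][j - 1] + prof[j - 1].get(gap, 0.0), "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            col = prof[j - 1]
            options = [
                (dp[i - 1][j - 1] + col.get(seq[i - 1], 0.0), "diag"),  # letter vs column
                (dp[i - 1][j] + gap_score, "up"),                       # letter vs new gap column
                (dp[i][j - 1] + col.get(gap, 0.0), "left"),             # gap vs column
            ]
            dp[i][j], back[i][j] = max(options, key=lambda t: t[0])
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace the chosen moves back to the origin
        move = back[i][j]
        if move == "diag":
            pairs.append((seq[i - 1], j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            pairs.append((seq[i - 1], None))
            i -= 1
        else:
            pairs.append((gap, j - 1))
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


toy_profile = [{"A": 1.0}, {"C": 0.5, "-": 0.5}, {"G": 1.0}]
print(align_to_profile("AG", toy_profile))
```

With frequencies as scores every match contributes a non-negative value, so the result depends strongly on `gap_score`; log-odds values would give mismatches a negative contribution.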

  19. Aligning alignments
     • Given two alignments, how can we align them?
     • Hint: use DP on the corresponding profiles.
     Alignment 1:
       x GGGCACTGCAT
       y GGTTACGTC--
       z GGGAACTGCAG
     Alignment 2:
       w GGACGTACC--
       v GGACCT-----
     Combined alignment:
       x GGGCACTGCAT
       y GGTTACGTC--
       z GGGAACTGCAG
       w GGACGTACC--
       v GGACCT-----

  20. Multiple Alignment: Greedy Heuristic
     • Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.
     Before (k sequences):
       u1 = ACGTACGTACGT…
       u2 = TTAATTAATTAA…
       u3 = ACTACTACTACT…
       …
       uk = CCGGCCGGCCGG
     After (k-1 sequences/profiles):
       u1 = ACg/tTACg/tTACg/cT…
       u2 = TTAATTAATTAA…
       …
       uk = CCGGCCGGCCGG…

  21. ClustalW (Thompson, Higgins, Gibson 94)
     • Popular multiple alignment tool today
     • 'W' = 'weighted' (different parts of the alignment are weighted differently)
     • Three-step process:
       1.) Construct pairwise alignments
       2.) Build guide tree
       3.) Progressive alignment guided by the tree

  22. Step 1: Pairwise Alignment
     • Aligns each sequence against every other sequence, giving a similarity matrix
     • Similarity = exact matches / sequence length (percent identity)
            v1    v2    v3    v4
       v1   -
       v2   .17   -
       v3   .87   .28   -
       v4   .59   .33   .62   -
     (.17 means 17% identical)
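As an illustration of the similarity measure, the sketch below computes percent identity for two rows of an existing pairwise alignment. Which "sequence length" goes into the denominator is not specified on the slide; the length of the aligned rows is used here as an assumption, and the example rows are made up.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Exact matches divided by the alignment length (gap columns never match)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(a == b and a != gap for a, b in zip(row_a, row_b))
    return matches / len(row_a)


print(percent_identity("ACGT-A", "ACCTTA"))
```

Filling the matrix then amounts to calling this function once for each of the k(k-1)/2 pairwise alignments.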

  23. Step 2: Guide Tree
     • Use the similarity matrix to create a guide tree by applying some clustering method
     • The guide tree roughly reflects evolutionary relations
     • ClustalW uses the neighbor-joining method, which iteratively:
       • Selects the closest pair of sequences/subtrees
       • Combines them into a single subtree
       • Re-computes the distances from the new subtree to all the other sequences/subtrees

  24. Step 2: Guide Tree (cont'd)
            v1    v2    v3    v4
       v1   -
       v2   .17   -
       v3   .87   .28   -
       v4   .59   .33   .62   -
     Guide tree (leaf order): v1, v3, v4, v2
     Calculate:
       v1,3 = alignment(v1, v3)
       v1,3,4 = alignment((v1,3), v4)
       v1,2,3,4 = alignment((v1,3,4), v2)
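The merge order on this slide can be reproduced with a simple greedy clustering of the similarity matrix. Note that the sketch below uses average linkage on similarities as a stand-in; it is not the neighbor-joining method that ClustalW uses, and the function name `merge_order` is hypothetical.

```python
def merge_order(names, sim):
    """Repeatedly join the two most similar clusters (average linkage).

    sim maps frozenset({a, b}) to the similarity of sequences a and b.
    Returns the list of (cluster, cluster) pairs in the order they were joined.
    """
    def cluster_sim(c1, c2):
        vals = [sim[frozenset((a, b))] for a in c1 for b in c2]
        return sum(vals) / len(vals)

    clusters = [(name,) for name in names]
    order = []
    while len(clusters) > 1:
        candidates = [
            (c1, c2) for idx, c1 in enumerate(clusters) for c2 in clusters[idx + 1:]
        ]
        c1, c2 = max(candidates, key=lambda pair: cluster_sim(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        order.append((c1, c2))
    return order


sim = {
    frozenset(("v1", "v2")): 0.17, frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28, frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33, frozenset(("v3", "v4")): 0.62,
}
for left, right in merge_order(["v1", "v2", "v3", "v4"], sim):
    print(left, "+", right)
```

On this matrix the order is v1 with v3, then v4, then v2, which is the order of the three alignment steps listed above.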

  25. Step 3: Progressive Alignment
     • Start by aligning the two most similar sequences
     • Using the guide tree, add in the next most similar pair (seq-seq, seq-prof or prof-prof)
     • Insert gaps as necessary
     • Many ad-hoc rules: weighting, different matrices, special gap scores…
       FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
       FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
       FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
       FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
       FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
     Below the alignment, a line of dots ('.', ':') and stars ('*') shows how well conserved each column is.

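  26. CLUSTALW algorithm
     1. Build a table of edit distances between every pair of sequences.
     2. Build a tree by joining the similar sequences in order of their similarity.
        [Figure: example tree with leaves Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
     3. Build the alignment in the order dictated by the tree. There are three cases:
        - aligning a pair of sequences
        - aligning two alignments
        - adding a single sequence to an existing alignment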

  27. CLUSTALW algorithm
     • Fine points:
       • The sequences are given different weights, so that if there are several very similar sequences, the relative weight of each one decreases.
       • A special method for setting the cost of inserting gaps.
     • Disadvantages:
       • Since the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs.
       • Gaps do not disappear: once a gap, always a gap.
       • High sensitivity to the gap penalty.
     • Advantages:
       • Fast
       • Reasonable
       • Many parameters can be set
       • Everyone uses it

  28. CLUSTALW algorithm
     • We can deduce a pairwise alignment for each two sequences in the multiple alignment (projected pairwise alignment).
     • The projected pairwise alignment is not necessarily the best (optimal) pairwise alignment for the two sequences.
     • CLUSTALW is not an optimal algorithm. Better alignments might exist! The algorithm yields a possible alignment, but not necessarily the best one.
     [Figure: best (optimal) pairwise alignment compared with the projected pairwise alignment]

  29. ClustalW at EMBL
     http://www.ebi.ac.uk/clustalw
     Clustalw at the SRS site at EBI
     http://www.cs.tau.ac.il/~ulitskyi/cg/GBA.fasta.txt

  30. ClustalW Output : Aln format

  31. MSA algorithms • Progressive methods (CLUSTALW,T-Coffee) • Iterative methods (Dialign) • Direct optimization (Monte Carlo, genetic algorithms) • Local methods: eMotifs, Blocks, Psi-blast

  32. MSA Editing: Jalview Conservation www.es.embnet.org/Services/MolBio/jalview/index.html

  33. MSA formats - fasta

  34. MSA formats - Aln

  35. MSA formats - MSF

  36. Example 1a: a good MSA

  37. Example 1b: making MSA of distantly related proteins

  38. Example 1c: including more distant relatives in the MSA

  39. Example 2: Isopenicillin N Synthase
     [Figure: reaction scheme, ACV converted to isopenicillin N; labels include Fe2+ and ascorbate]
     • Mononuclear iron proteins: electron carrier proteins. Iron atoms are bound to amino acid side chains.
     • In IPNS the metal ion is coordinated by three protein residues.
     • IPNS is involved in the biosynthesis of penicillin.

  40. Research IPNS • Goal: Identify Fe+2 binding residues. • Possible solutions: • In the lab... • Bioinformatic approach (comparing different IPNS sequences).

  41. Step 1: Multiple alignment of known IPNS
     Implementation:
       1. Obtain the sequences (e.g., for NCBI, the query: IPNS AND Bacteria[Organism])
       2. MSA (clustalw) and search for conserved residues in the MSA

  42. MSA – bacteria only Not enough variation!

  43. MSA: bacteria & fungi
     [Figure: alignment with panels labeled "bacteria" and "bacteria & fungi"]
     Not enough variation!

  44. Step 2
     Goal: Add more enzymes, similar to IPNS
     Implementation:
       • Search in http://www.expasy.org/tools/blast/
       • Blast IPNS_CEPAC as query
       • Select sequences similar to the query over the entire length
       • Export in FASTA format
       • Run CLUSTALW


  46. New multiple alignment, narrowing down the possibilities

  47. Simple multiple alignment
     • The known IPNS sequences are very similar.
     • The sequences of closely related enzymes are also quite similar.
     • Not enough variability to identify the active-site residues.
     • We need to obtain even more distant sequences (distant homologs).

  48. Step 3: Using the results of the MSA for further searches
     Implementation:
       1. Obtain an MSA (clustalw).
       2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search.
       3. MSA (clustalw) and search for conserved residues in the MSA.

  49. Consensus Sequence
     • We can deduce a consensus sequence from the multiple sequence alignment.
     • Each position of the consensus sequence holds the most frequent character in the corresponding column of the alignment.
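A minimal sketch of deriving such a consensus, assuming the gap character counts like any other character and that ties are broken by whichever character appears first in the column (both are choices made for this example, not rules from the slides).

```python
from collections import Counter


def consensus(rows):
    """Most frequent character of each alignment column."""
    # Counter.most_common keeps first-seen order among equal counts,
    # so a tie goes to the character nearest the top of the column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(consensus(rows))
```

The resulting string can be used directly as a query for a new database search, as described in Step 3.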
