Multiple Sequence Alignment

Multiple Sequence Alignment Lesson 5

Example
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Why multiple sequence alignment?
• Structural similarity: amino acids that play the same role in each structure are in the same column.
• Evolutionary similarity: amino acids derived from the same ancestral residue are in the same column.
• Functional similarity: amino acids with the same function are in the same column.
• Sequence similarity: the alignment with maximal similarity; it need not have any biological meaning.
• When the sequences are closely related, structural, evolutionary and functional similarity are equivalent.

Why multiple alignment - example
• Histones: small, abundant proteins present in all eukaryotic chromosomes.
• Their multiple sequence alignment is remarkably conserved.
• Conservation of structure and function (they aid in DNA packaging).

MSA applications
• Generate protein families.
• Extrapolation: assign an uncharacterized sequence to a protein family.
• Understand evolution: a preliminary step in molecular evolution analysis, used for constructing phylogenetic trees (e.g., is the duck evolutionarily closer to a lion or to a fruit fly?).
• Pattern identification: find the important (conserved) regions in the protein; conserved positions may characterize a function.
• Domain identification: build a consensus/profile/motif that describes the protein family and helps to describe new members of the family.
• DNA regulatory elements.
• Structure prediction (secondary structure and 3D model).

Multiple sequence alignment
The pairwise solution might be very different from the multiple-alignment solution.

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains its optimal value.

Unaligned:
SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA

Aligned:
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA

Can we simply generalize the dynamic programming method? In theory yes, in practice no. Running time and memory both grow as N^K, where N is the sequence length and K is the number of sequences. This is effectively impossible for K>3.
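To get a feel for the N^K growth, here is a minimal sketch that counts the cells of the K-dimensional dynamic-programming table. The function name `dp_cells` and the chosen length of 300 residues are illustrative assumptions, not part of the lesson.

```python
def dp_cells(n: int, k: int) -> int:
    """Cells in a k-dimensional DP table for k sequences of length n.

    Each dimension needs n + 1 entries (one extra for the empty prefix).
    """
    return (n + 1) ** k


if __name__ == "__main__":
    # Hypothetical protein length of 300 residues.
    for k in (2, 3, 4, 6):
        print(f"K={k}: {dp_cells(300, k):,} cells")
```

Already for a handful of sequences the table no longer fits in memory, which is why the heuristic methods on the following slides are needed.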

The problems we have to answer:
1. Choosing the sequences
2. Scoring metrics
3. Approximation/heuristic algorithms
4. MSA formats
5. Interpreting the MSA
Since the computational problem is hard, no method we propose will guarantee an optimal solution. We look for a method that gives good results in most cases. The method we choose may depend on the reason for performing the alignment.

Scoring metrics
• Distance from consensus: the consensus of an alignment is the string of the most common character in each column of the alignment. The total distance is the number of characters that differ from the consensus character of their column. Let C be the consensus sequence; then the total distance is sum_i D(Si, C).
• Evolutionary tree alignment: the weight of the lightest evolutionary tree that can be constructed from the sequences. The weight of a tree is the number of changes between pairs of sequences that correspond to two adjacent nodes in the tree, summed over all such pairs.
• Sum of pairs: the sum of pairwise distances between all pairs of sequences, sum_{i<j} D(Si, Sj).

Scoring metrics - examples

Sum of pairs (three sequences):
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
Pairwise distances: 5, 3, 5; sum of pairs: 13

Distance from consensus (six sequences):
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA
Homogeneity per column: 4 6 4 2 6 1 4 5 3 (total homogeneity 35)
Distance per column:    2 0 2 4 0 5 2 1 3 (total distance 19)
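The two column-based metrics can be sketched in a few lines. This is a minimal sketch that assumes D counts mismatching columns and treats a gap as an ordinary character; the lesson's own D may be defined differently, and all function names are illustrative. Under this assumption the sketch reproduces the totals 13 and 19 quoted on the slide.

```python
from collections import Counter
from itertools import combinations


def mismatches(a: str, b: str) -> int:
    """Assumed distance D: number of columns where two aligned rows differ."""
    return sum(x != y for x, y in zip(a, b))


def sum_of_pairs(rows: list[str]) -> int:
    """Sum of D over all pairs of rows."""
    return sum(mismatches(a, b) for a, b in combinations(rows, 2))


def consensus(rows: list[str]) -> str:
    """Most common character of each column (ties: first one seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def distance_from_consensus(rows: list[str]) -> int:
    """Sum of D between every row and the consensus."""
    c = consensus(rows)
    return sum(mismatches(row, c) for row in rows)


three = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A"]
six = three + ["MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]

if __name__ == "__main__":
    print("sum of pairs:", sum_of_pairs(three))
    print("distance from consensus:", distance_from_consensus(six))
```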

MSA algorithms
• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, PSI-BLAST

CLUSTALW algorithm
• Compare all sequence pairs (pairwise alignment).
• Generate a hierarchy for the alignment (guide tree).
• Build the multiple alignment step by step according to the guide tree: first align the most similar pair, then add another sequence or another pairwise alignment, and so on.

CLUSTALW algorithm
(1) Pairwise alignment (prepare a guide tree): 6 pairwise alignments, then cluster analysis
(2) Multiple alignment following the tree from (1): successive alignments

CLUSTALW algorithm
1. Build a table of edit distances between every pair of sequences (here: Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale).
2. Build a tree by joining the sequences in order of their similarity.
3. Build the alignment in the order dictated by the tree. There are three cases:
- aligning a pair of sequences
- aligning two alignments
- adding a single sequence to an existing alignment
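Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not ClustalW's implementation: ClustalW derives its distances from pairwise alignment scores and builds the guide tree by neighbour joining, whereas this sketch uses plain edit distance and average-linkage clustering as a simpler stand-in. The function names and the toy sequences are assumptions made for the example.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match / substitute
        prev = cur
    return prev[-1]


def guide_tree(seqs: dict[str, str]):
    """Return a nested tuple giving the order in which to align sequences."""
    # Step 1: table of pairwise distances.
    d = {frozenset(p): edit_distance(seqs[p[0]], seqs[p[1]])
         for p in combinations(seqs, 2)}
    # Step 2: repeatedly join the two closest clusters (average linkage).
    clusters = [(name, [name]) for name in seqs]  # (subtree, leaf names)
    while len(clusters) > 1:
        def avg(pair):
            (_, a), (_, b) = clusters[pair[0]], clusters[pair[1]]
            total = sum(d[frozenset((x, y))] for x in a for y in b)
            return total / (len(a) * len(b))
        i, j = min(combinations(range(len(clusters)), 2), key=avg)
        (ti, li), (tj, lj) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((ti, tj), li + lj))
    return clusters[0][0]


if __name__ == "__main__":
    toy = {"a": "AAAA", "b": "AAAT", "c": "GGGG", "d": "GGGC"}  # hypothetical
    print(guide_tree(toy))
```

The nested tuple is read from the inside out: the innermost pairs are aligned first, and each enclosing tuple is an alignment of two earlier results.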

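Two alignments (or a sequence and an alignment) can be aligned to each other by a fairly natural extension of the dynamic programming method. Letters are inserted and deleted in the usual way; the cost of matching two columns is the average cost over the different letter combinations. A special cost must be assigned to gaps.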
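The "average cost over the different combinations" can be sketched for a single pair of columns. This is a minimal sketch under assumed costs (mismatch 1, letter against gap 2, gap against gap 0); the actual values and the gap treatment used by ClustalW are not given in the lesson, and `column_cost` is an illustrative name.

```python
def column_cost(col_a: str, col_b: str,
                mismatch: float = 1.0, gap: float = 2.0) -> float:
    """Average cost over all letter pairs drawn from two alignment columns.

    col_a and col_b hold the characters of one column from each alignment.
    """
    total = 0.0
    for x in col_a:
        for y in col_b:
            if x == "-" and y == "-":
                continue          # assumed: gap against gap is free
            if x == "-" or y == "-":
                total += gap      # assumed special cost for a gap
            elif x != y:
                total += mismatch
    return total / (len(col_a) * len(col_b))


if __name__ == "__main__":
    # Column "AC" from a two-row alignment against column "A" from a single sequence.
    print(column_cost("AC", "A"))
```

Plugging this cost into the usual pairwise recurrence, with columns in place of single letters, gives the alignment of two alignments.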

CLUSTALW algorithm
Subtle points:
• Sequences are given different weights, so that when several sequences are very similar the relative weight of each one decreases.
• A special method is used to set the cost of inserting gaps.
Disadvantages:
• Because the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs.
• Gaps do not disappear: once a gap, always a gap.
• High sensitivity to the gap penalty.
Advantages:
• Fast
• Reasonable
• Many parameters can be set
• Everybody uses it

CLUSTALW algorithm
[Figure: best (optimal) pairwise alignment vs. projected pairwise alignment]
• For any two sequences in the multiple alignment we can deduce a pairwise alignment (the projected pairwise alignment).
• The projected pairwise alignment is not necessarily the best pairwise alignment of the two sequences.
• CLUSTALW is not an optimal algorithm; better alignments might exist. The algorithm yields a possible alignment, but not necessarily the best one.
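Projecting a pairwise alignment out of an MSA is simple: keep the two rows and drop every column in which both rows have a gap. A minimal sketch follows; the function name `project` is an illustrative assumption.

```python
def project(msa: list[str], i: int, j: int) -> tuple[str, str]:
    """Projected pairwise alignment of rows i and j of an MSA."""
    cols = [(a, b) for a, b in zip(msa[i], msa[j])
            if not (a == "-" and b == "-")]   # drop gap-only columns
    return ("".join(a for a, _ in cols),
            "".join(b for _, b in cols))


if __name__ == "__main__":
    msa = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A"]
    top, bottom = project(msa, 0, 2)
    print(top)
    print(bottom)
```

Comparing the score of this projection with the score of an optimal pairwise alignment of the same two sequences shows how much the multiple alignment gave up for that pair.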

ClustalW at EMBL: http://www.ebi.ac.uk/clustalw
ClustalW at the SRS site at EBI

ClustalW Output : Aln format

MSA Editing: Jalview Conservation www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - fasta

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase (IPNS)
[Figure: reaction scheme, ACV converted to isopenicillin N; labels include Fe2+ and ascorbate]
• Mononuclear iron proteins: electron carrier proteins. Iron atoms are bound to amino acid side chains.
• In IPNS the metal ion is coordinated by three protein residues.
• IPNS is involved in the biosynthesis of penicillin.

Research IPNS
• Goal: identify the Fe2+ binding residues.
• Possible solutions:
• In the lab...
• Bioinformatic approach (comparing different IPNS sequences).

Step 1
Multiple alignment of known IPNS sequences.
Implementation:
1. Obtain the sequences (e.g., from NCBI: IPNS AND Bacteria[Organism]).
2. Run an MSA (ClustalW) and search for conserved residues in the MSA.

MSA – bacteria only Not enough variation!

MSA – bacteria & fungi
[Figure panels: bacteria; bacteria & fungi]
Not enough variation!

Step 2
Goal: add more enzymes similar to IPNS.
Implementation:
1. Search at http://www.expasy.org/tools/blast/ (BLAST with IPNS_CEPAC as the query).
2. Select sequences that are similar to the query over its entire length.
3. Export them in FASTA format.
4. Run CLUSTALW.

New multiple alignment, narrowing down the possibilities

Simple multiple alignment
• The known IPNS sequences are very similar.
• Sequences of closely related enzymes are also quite similar.
• There is not enough variability to single out the active-site residues.
• We need to obtain even more distant sequences (distant homologs).

Step 3
Using the results of the MSA for further searches.
Implementation:
1. Obtain an MSA (ClustalW).
2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search.
3. Run the MSA again (ClustalW) and search for conserved residues in it.

Consensus Sequence
• A consensus sequence can be deduced from the multiple sequence alignment.
• Consensus: each position holds the most frequent character found in that column of the alignment.

Profile
• A statistical model describing the multiple sequence alignment can be deduced from it.
• Profile: each position holds the frequency of every character found in that column of the alignment.

Profile vs. Consensus • The following multiple alignments will have the same consensus

Profile vs. Consensus • But have a different profile
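The difference between the two summaries can be sketched with two small alignments. The alignments `aln_a` and `aln_b` below are hypothetical toy data, not the ones shown on the slides, and the function names are illustrative.

```python
from collections import Counter


def profile(rows: list[str]) -> list[dict[str, float]]:
    """Per-column relative frequency of every character."""
    n = len(rows)
    return [{ch: count / n for ch, count in Counter(col).items()}
            for col in zip(*rows)]


def consensus(rows: list[str]) -> str:
    """Most frequent character of each column, read off the profile."""
    return "".join(max(col, key=col.get) for col in profile(rows))


# Hypothetical toy alignments: identical consensus, different column statistics.
aln_a = ["ACGT", "ACGT", "ACGT", "ACGT"]
aln_b = ["ACGT", "ACGT", "ACGT", "TGGT"]

if __name__ == "__main__":
    print(consensus(aln_a), consensus(aln_b))
    print(profile(aln_a)[0], profile(aln_b)[0])
```

The consensus discards the information that the first two columns of `aln_b` are variable, while the profile keeps it; this is why a profile search can be more sensitive than a consensus search.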

Sequence LOGO
AACCGCTCTT
AGCCGCGC-T
A-CAGAGCCT
AAGCACGC-T
ACGGGTGCTT
ATGC-CGC-T
http://weblogo.berkeley.edu
A .. g/c c g .. G C .. T

PSI-BLAST
• Position-Specific Iterated BLAST: an automatic profile-like search.
1. Regular BLAST search
2. Construct a profile from the BLAST results
3. BLAST search with the profile
4. Final results
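The iterate-and-rebuild idea can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions: sequences are ungapped and of equal length, the profile is a table of raw column frequencies, and a hit is any sequence whose score reaches a fixed threshold. Real PSI-BLAST works with gapped local alignments, a log-odds position-specific scoring matrix and E-value cutoffs. All names and data are hypothetical.

```python
from collections import Counter


def build_profile(seqs: list[str]) -> list[dict[str, float]]:
    """Column-wise letter frequencies of equal-length, ungapped sequences."""
    return [{ch: n / len(seqs) for ch, n in Counter(col).items()}
            for col in zip(*seqs)]


def score(prof: list[dict[str, float]], seq: str) -> float:
    """Sum, over columns, of the profile frequency of the letter in seq."""
    return sum(col.get(ch, 0.0) for col, ch in zip(prof, seq))


def iterated_search(query: str, database: list[str],
                    threshold: float, max_rounds: int = 10) -> list[str]:
    """Search, rebuild the profile from the hits, search again, until stable."""
    hits = [query]
    for _ in range(max_rounds):
        prof = build_profile(hits)
        new_hits = [query] + [s for s in database if score(prof, s) >= threshold]
        if new_hits == hits:      # converged: no change in the hit list
            break
        hits = new_hits
    return hits[1:]               # database hits only


if __name__ == "__main__":
    db = ["AAAATT", "AATTTT", "TTTTTT"]        # hypothetical database
    print(iterated_search("AAAAAA", db, threshold=3, max_rounds=1))
    print(iterated_search("AAAAAA", db, threshold=3))
```

In this toy run the single-round search finds only the close relative, while the iterated search also picks up the more distant sequence because the profile built from the first hit tolerates T in the last columns.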

Alignment with distantly related proteins.

Isopenicillin N Synthase
• Experimental evidence supports the finding that His212, Asp214 and His268 are the endogenous ligands that bind Fe2+ in IPNS.

Enzyme      Relative activity  Km (mM)  kcat (min-1)  kcat/Km (mM-1*min-1)
Wild type   100%               0.4      38.8          96.9
His48Ala    16%                0.56     7.5           13.4
His63Ala    31%                1.0      14.2          14.2
His114Ala   28%                0.85     12.5          14.7
His124Ala   48%                0.84     32.1          38.1
His135Ala   22%                0.59     11.7          19.8
His212Ala   <0.007             n.d.     n.d.
His268Ala   <0.003             n.d.     n.d.
Asp14Ala    5%                 0.86     0.56          0.7
Asp113Ala   63%                0.45     23.8          52.8
Asp131Ala   68%                0.48     36.3          75.5
Asp203Ala   32%                0.91     12.3          13.5
Asp214Ala   <0.004             n.d.     n.d.

