Multiple Sequence Alignment

Multiple Sequence Alignment Based on slides by Irit Gat-Viks

Example
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Why multiple sequence alignment?
• Structural similarity: amino acids that play the same role in each structure are in the same column.
• Evolutionary similarity: amino acids derived from the same ancestral residue are in the same column.
• Functional similarity: amino acids with the same function are in the same column.
• When the sequences are closely related, structural, evolutionary, and functional similarity are equivalent.

Multiple Alignment Definition
Input: Sequences S1, S2, …, Sk over the same alphabet
Output: Gapped sequences S’1, S’2, …, S’k of equal length
• |S’1| = |S’2| = … = |S’k|
• Removal of spaces from S’i gives Si for all i

Example
S1 = AGGTC
S2 = GTTCG
S3 = TGAAC
[Figure: two possible alignments of S1, S2, S3]

Example Multiple sequence alignment of 7 neuroglobins using clustalx

Human-centric beta globin Multiple Alignment http://globin.cse.psu.edu/

MSA applications
• Generate protein families.
• Extrapolation: membership of an uncharacterized sequence in a protein family.
• Understand evolution: a preliminary step in molecular evolution analysis for constructing phylogenetic trees, e.g., is the duck evolutionarily closer to a lion or to a fruit fly?
• Pattern identification: find the important (conserved) regions in the protein. Conserved positions may characterize a function.
• Domain identification: build a consensus/profile/motif that describes the protein family and helps to describe new members of the family.
• DNA regulatory elements.
• Structure prediction (secondary and 3D model).
• Alignment of multiple sequences may reveal weak signals.

Protein Phylogenies – Example Kinase domain

Scoring alignments
• Given input sequences S1, S2, …, Sk, find a multiple alignment of optimal score
• Scores preview:
  - Sum of pairs
  - Consensus
  - Tree
• Varying methods (and controversy)

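Sum of Pairs score
Def: Induced pairwise alignment: a pairwise alignment induced by the multiple alignment (take two rows and drop every column in which both rows have a gap)
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
• $S(M) = \sum_{k<l} \sigma(S'_k, S'_l)$, where $\sigma(S'_k, S'_l)$ is the score of the pairwise alignment induced on rows k and l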
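The induced pairwise alignment can be sketched in a few lines of Python. This is only an illustration: the function name `induced_pair` is made up here, and gaps are assumed to be written as `-`.

```python
def induced_pair(row_a, row_b, gap="-"):
    """Project two rows of a multiple alignment onto a pairwise alignment.

    Columns in which both rows hold a gap are dropped; every other
    column is kept unchanged.
    """
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


# The x, y, z rows from the example on this slide.
x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pair(x, y))  # ('ACGCGG-C', 'ACGC-GAG')
print(induced_pair(y, z))  # ('AC-GCGAG', 'GCCGCGAG')
```

The pair (x, z) shares no all-gap column, so its induced alignment is just the two rows as they stand.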

SOP Score Example
Consider the following alignment:
AC-CDB-
-C-ADBD
A-BCDAD
Scoring scheme: match = 0, mismatch/indel = -1 (a column with a gap in both rows of a pair scores 0)
SP score: (-4) + (-3) + (-5) = -12
Multiple alignment with SOP scores is NP-hard
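As a rough check of the example, the SP score can be computed with a short Python sketch. The helper names `pair_score` and `sp_score` are hypothetical, and treating a gap-gap column as contributing 0 is an assumption that the slide's total of -12 is consistent with.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=0, mismatch=-1, indel=-1, gap="-"):
    """Score the pairwise alignment induced by two rows of an MSA."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # assumption: a gap-gap column contributes nothing
        if a == gap or b == gap:
            score += indel
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def sp_score(rows, **scheme):
    """Sum-of-pairs score: add up the scores of all induced pairs."""
    return sum(pair_score(a, b, **scheme) for a, b in combinations(rows, 2))


alignment = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(alignment))  # -12
```

The three pairs contribute -3 (rows 1 and 2), -4 (rows 1 and 3), and -5 (rows 2 and 3).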

The computational problem: given several sequences, find their alignment to a common frame (by inserting gaps) such that a distance function attains an optimal value.
Unaligned:
SCGPFIRV
MSCGPGLRA
SCTPHLA
MSCPKIRG
MSLPLLRN
MSHKPALRA
Aligned:
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA
Can we simply generalize the dynamic programming method? In theory yes, in practice no. The running time and the memory both grow as N^K, where N is the sequence length and K is the number of sequences. In practice this is infeasible for K > 3.
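To get a feel for the N^K growth, here is a small back-of-the-envelope sketch in Python. It assumes the standard generalization of the pairwise DP, in which the table has (N+1)^K cells and each cell looks back at 2^K - 1 neighbours; the function names are illustrative only.

```python
def dp_cells(n, k):
    """Cells in the k-dimensional DP table for k sequences of length n."""
    return (n + 1) ** k


def dp_predecessors(k):
    """Neighbours consulted per cell: every non-empty subset of the
    k sequences may advance by one character."""
    return 2 ** k - 1


# Table size and per-cell work for sequences of length 100.
for k in (2, 3, 4, 6):
    print(k, dp_cells(100, k), dp_predecessors(k))
```

For N = 100, going from two to four sequences already takes the table from about ten thousand cells to about a hundred million.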

Consensus MSA
• Score: sum of the distances from the sequences (with gaps as in the alignment) to some consensus sequence
• More difficult to find/define, as the consensus sequence itself is difficult to define
• Used mainly for computational proofs

Scoring metrics: examples
Sum of pairs (number of matching columns in each induced pair):
-SCGPFIRV
MSCGPGLRA    5

-SCGPFIRV
-SCTPHL-A    3

MSCGPGLRA
-SCTPHL-A    5
Total: 13

Distance from consensus:
-SCGPFIRV
MSCGPGLRA
-SCTPHL-A
MSC-PKIRG
MS-LPLLRN
MSHKPALRA
Homogeneity per column: 4 6 4 2 6 1 4 5 3    Total homogeneity: 35
Distance per column:    2 0 2 4 0 5 2 1 3    Total distance: 19
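The two metrics on this slide can be reproduced with a short Python sketch. It assumes that "homogeneity" of a column is the count of its most frequent character and that "distance" is the number of rows that differ from it; the function names are invented for the illustration.

```python
from collections import Counter
from itertools import combinations

rows = ["-SCGPFIRV", "MSCGPGLRA", "-SCTPHL-A",
        "MSC-PKIRG", "MS-LPLLRN", "MSHKPALRA"]


def column_homogeneity(rows):
    """Per column: how many rows hold the most frequent character."""
    return [Counter(col).most_common(1)[0][1] for col in zip(*rows)]


def consensus_distance(rows):
    """Per column: how many rows differ from the most frequent character."""
    return [len(rows) - h for h in column_homogeneity(rows)]


def matching_columns(a, b, gap="-"):
    """Columns in which two rows hold the same non-gap character."""
    return sum(1 for x, y in zip(a, b) if x == y and x != gap)


print(sum(column_homogeneity(rows)))   # 35
print(sum(consensus_distance(rows)))   # 19
# Sum of pairs over the first three rows only, as in the slide.
print(sum(matching_columns(a, b) for a, b in combinations(rows[:3], 2)))  # 13
```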

Tree MSA
• Input: Tree T, a string for each leaf
• Phylogenetic alignment for T: assignment of a string to each internal node
• Score: (weighted) sum of scores along edges
• Goal: find a phylogenetic alignment of optimal score
• Consensus = phylogenetic alignment where T is a star
[Figure: tree with node labels CTGG, CTGG, CCGG, GTTG, CTTG, GTTC, GTTG]

Profile Representation of MA
    -   A   G   G   C   T   A   T   C   A   C   C   T   G
    T   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   T   C   A   C   -   G   G
    C   A   G   -   C   T   A   T   C   G   C   -   G   G

A   0   1   0   0   0   0   1   0   0  .8   0   0   0   0
C  .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
G   0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
T  .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
-  .2   0   0  .8   0   0   0   0   0   0  .4  .8  .4   0

• Alternatively, use log odds:
  - p_i(a) = fraction of a’s in column i
  - p(a) = fraction of a’s overall
  - score: log(p_i(a) / p(a))
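A minimal Python sketch of how such a profile could be computed from the five aligned rows above. The function names are illustrative; the natural logarithm and the choice to return negative infinity for characters absent from a column are assumptions, since the slide does not fix either.

```python
import math
from collections import Counter

rows = ["-AGGCTATCACCTG",
        "TAG-CTACCA---G",
        "CAG-CTACCA---G",
        "CAG-CTATCAC-GG",
        "CAG-CTATCGC-GG"]


def build_profile(rows, alphabet="ACGT-"):
    """One dict per column, mapping each character to its frequency."""
    profile = []
    for col in zip(*rows):
        counts = Counter(col)
        profile.append({ch: counts[ch] / len(col) for ch in alphabet})
    return profile


def log_odds(rows, alphabet="ACGT-"):
    """log(p_i(a) / p(a)) per column; -inf where a is absent (assumption)."""
    profile = build_profile(rows, alphabet)
    total = sum(len(r) for r in rows)
    background = {ch: sum(r.count(ch) for r in rows) / total for ch in alphabet}
    return [
        {ch: math.log(col[ch] / background[ch]) if col[ch] > 0 else float("-inf")
         for ch in alphabet}
        for col in profile
    ]


profile = build_profile(rows)
print(profile[0])  # first column: C 0.6, T 0.2, gap 0.2
```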

Aligning a sequence to a profile
• Key in pairwise alignment is scoring two positions x, y: σ(x,y)
• For a letter x and a column y in a profile, σ(x,y) = value of x in column y
• Invent a score for σ(x,-)
• Run the DP algorithm for pairwise alignment
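The recipe above can be sketched as a global (Needleman-Wunsch style) DP in Python. This is one possible reading, not a reference implementation: `gap_penalty` is the "invented" score for a letter against a gap, skipping a profile column is scored with that column's own gap frequency, and the toy profile is hypothetical.

```python
def align_to_profile(seq, profile, gap_penalty=-1.0):
    """Best global score for aligning seq against a list of profile columns.

    profile[j] maps characters (including '-') to their frequency in column j.
    """
    n, m = len(seq), len(profile)
    # score[i][j]: best score for seq[:i] against the first j columns
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap_penalty
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + profile[j - 1].get("-", 0.0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                # letter seq[i-1] placed in column j
                score[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], 0.0),
                # letter seq[i-1] against a new gap column (invented penalty)
                score[i - 1][j] + gap_penalty,
                # column j skipped: the sequence gets a gap there
                score[i][j - 1] + profile[j - 1].get("-", 0.0),
            )
    return score[n][m]


# Hypothetical three-column profile, for illustration only.
toy = [{"A": 1.0}, {"C": 0.5, "-": 0.5}, {"G": 1.0}]
print(align_to_profile("ACG", toy))  # 2.5
print(align_to_profile("AG", toy))   # 2.5
```

Returning only the score keeps the sketch short; a traceback over the same table would recover the alignment itself.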

Aligning alignments
• Given two alignments, how can we align them?
• Hint: use DP on the corresponding profiles.
Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
Alignment 2:
w GGACGTACC--
v GGACCT-----
Combined alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----

Multiple Alignment: Greedy Heuristic
• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.
Before (k sequences):
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG
After one step (k-1 sequences/profiles; u1 is now a profile):
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

ClustalW (Thompson, Higgins, Gibson, 1994)
• Popular multiple alignment tool today
• ‘W’ = ‘weighted’ (different parts of the alignment are weighted differently)
• Three-step process:
  1. Construct pairwise alignments
  2. Build guide tree
  3. Progressive alignment guided by the tree

Step 1: Pairwise Alignment
• Aligns each sequence against every other sequence, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)
      v1    v2    v3    v4
v1     -
v2   .17     -
v3   .87   .28     -
v4   .59   .33   .62     -
(.17 means 17% identical)
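A minimal sketch of the similarity measure in Python, assuming the two sequences have already been pairwise aligned and that "sequence length" means the length of that alignment (the slide does not say which length is used). The function name and the toy inputs are hypothetical.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Fraction of alignment columns holding the same non-gap character."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must come from the same pairwise alignment")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return matches / len(row_a)


print(percent_identity("ACGT", "ACGA"))  # 0.75
print(percent_identity("AC-T", "A-GT"))  # 0.5
```

Filling the matrix then amounts to calling this once per pair of sequences.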

Step 2: Guide Tree
• Use the similarity matrix to create a guide tree by applying some clustering method
• The guide tree roughly reflects evolutionary relations
• ClustalW uses the neighbor-joining method, which iteratively:
  - Selects the closest pair of sequences/subtrees
  - Combines them into a single subtree
  - Re-computes the distances from the new subtree to all the other sequences/subtrees

Step 2: Guide Tree (cont’d)
      v1    v2    v3    v4
v1     -
v2   .17     -
v3   .87   .28     -
v4   .59   .33   .62     -
[Figure: guide tree with leaves v1, v3, v4, v2]
Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
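The merge order on this slide can be reproduced with a simple clustering sketch in Python. Note the swap: this uses average-linkage agglomerative clustering on the similarity values, a simplification chosen for brevity, not the neighbor-joining method that ClustalW actually applies. The function name and data layout are assumptions.

```python
def guide_tree(names, similarity):
    """Greedy average-linkage clustering.

    similarity maps frozenset({a, b}) to the similarity of sequences a and b.
    Returns the tree as nested tuples.
    """
    # Each cluster is (subtree, member names).
    clusters = [(name, (name,)) for name in names]

    def avg_sim(c1, c2):
        vals = [similarity[frozenset((a, b))] for a in c1[1] for b in c2[1]]
        return sum(vals) / len(vals)

    while len(clusters) > 1:
        # Select the most similar pair of clusters.
        i, j = max(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        # Combine them into a single subtree and replace the pair.
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


sim = {
    frozenset(("v1", "v2")): 0.17, frozenset(("v1", "v3")): 0.87,
    frozenset(("v2", "v3")): 0.28, frozenset(("v1", "v4")): 0.59,
    frozenset(("v2", "v4")): 0.33, frozenset(("v3", "v4")): 0.62,
}
print(guide_tree(["v1", "v2", "v3", "v4"], sim))
# ('v2', ('v4', ('v1', 'v3')))
```

With the matrix above, v1 and v3 merge first, v4 joins next, and v2 joins last, matching the order of the alignment calls listed on the slide.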

Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Using the guide tree, add in the next most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores…
FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
Conservation line: . . : ** . :.. *:.* * . * **:
Dots and stars show how well-conserved a column is.

CLUSTALW algorithm
1. Build a table of edit distances between every pair of sequences.
[Figure: sequences Hbb Human, Hbb Horse, Hba Human, Hba Horse, Myg Whale]
2. Build a tree by joining the similar sequences in order of their similarity.
3. Build the alignment in the order dictated by the tree. There are three cases:
  - Aligning a pair of sequences
  - Aligning two alignments
  - Adding a single sequence to an existing alignment

CLUSTALW algorithm
• Subtle points:
  - Sequences are given different weights, so that if there are several very similar sequences, the relative weight of each one decreases.
  - A special method is used to set the penalty for inserting gaps.
• Disadvantages:
  - Since the construction is irreversible, the result depends very strongly on the quality of the alignment of the first pairs.
  - Gaps never disappear: once a gap, always a gap.
  - High sensitivity to the gap penalty.
• Advantages:
  - Fast
  - Reasonable
  - Many parameters can be set
  - Everyone uses it

CLUSTALW algorithm
• We can deduce a pairwise alignment for every two sequences in the multiple alignment (the projected pairwise alignment).
• The projected pairwise alignment is not necessarily the best pairwise alignment for the two sequences.
• CLUSTALW is not an optimal algorithm. Better alignments might exist! The algorithm yields a possible alignment, but not necessarily the best one.
[Figure: best (optimal) pairwise alignment vs. projected pairwise alignment]

ClustalW at EMBL
http://www.ebi.ac.uk/clustalw
Clustalw at the SRS site at EBI
http://www.cs.tau.ac.il/~ulitskyi/cg/GBA.fasta.txt

ClustalW Output : Aln format

MSA algorithms
• Progressive methods (CLUSTALW, T-Coffee)
• Iterative methods (Dialign)
• Direct optimization (Monte Carlo, genetic algorithms)
• Local methods: eMotifs, Blocks, Psi-blast

MSA Editing: Jalview Conservation www.es.embnet.org/Services/MolBio/jalview/index.html

MSA formats - fasta

MSA formats - Aln

MSA formats - MSF

Example 1a: a good MSA

Example 1b: making MSA of distantly related proteins

Example 1c: including more distant relatives in the MSA

Example 2: Isopenicillin N Synthase
[Figure: reaction scheme from ACV to isopenicillin N; labels: O2, Fe+2, ascorbate, 2 H2O]
• Mononuclear iron proteins: electron carrier proteins. Iron atoms are bound to amino acid side chains.
• In IPNS the metal ion is coordinated by three protein residues.
• IPNS is involved in the biosynthesis of penicillin.

Research IPNS
• Goal: Identify Fe+2 binding residues.
• Possible solutions:
  - In the lab...
  - Bioinformatic approach (comparing different IPNS sequences).

Step 1
Multiple alignment of known IPNS
Implementation:
1. Obtain sequences (e.g., from NCBI, with the query: IPNS AND Bacteria[Organism])
2. MSA (clustalw) and search for conserved residues in the MSA

MSA – bacteria only Not enough variation!

MSA – bacteria & fungi
[Figure: alignments labeled "bacteria" and "bacteria & fungi"]
Not enough variation!

Step 2
Goal: Add more enzymes, similar to IPNS
Implementation:
• Search in http://www.expasy.org/tools/blast/
• Blast IPNS_CEPAC as query
• Select sequences similar to the query over its entire length
• Export in FASTA format
• Run CLUSTALW

New multiple alignment, narrowing down the possibilities

Simple multiple alignment
• The known IPNS sequences are very similar.
• Sequences of closely related enzymes are also quite similar.
• There is not enough variability to identify the active sites.
• We need to obtain even more distant sequences (distant homologs).

Step 3
Using the results of the MSA for further searches
Implementation:
1. Obtain an MSA (clustalw).
2. Either construct a consensus sequence and perform a new search, or construct a profile and perform a new search.
3. MSA (clustalw) and search for conserved residues in the MSA.

Consensus Sequence
• We can deduce a consensus sequence from the multiple sequence alignment.
• Each position of the consensus sequence holds the most frequent character in the corresponding column of the alignment.
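Deriving a consensus sequence can be sketched in Python as follows. The function name and the toy alignment are hypothetical, and how to break ties between equally frequent characters is left open by the slide; this sketch simply takes whichever character `Counter` reports first.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character of each alignment column, joined into a string."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


# Toy alignment for illustration only.
toy = ["ACGT",
       "ACGA",
       "TCGA"]
print(consensus(toy))  # ACGA
```

The resulting string can then be used as the query for a new database search, as described in Step 3.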
