Multiple Sequence Alignment


Definition • Homology: related by descent • Homologous sequence positions
[Figure: example sequences ATTGCGC, AT-ACGC and ATACGC]

Reasons for aligning sets of sequences • Organise data to reflect sequence homology • Estimate evolutionary distance • Infer phylogenetic trees from homologous sites • Highlight conserved sites/regions • Highlight variable sites/regions • Uncover changes in gene structure • Look for evidence of selection • Summarise information

Alignments help to • Organise • Visualise • Analyse sequence data

The process of aligning sequences is a game involving playing off gaps and mismatches

Ways of aligning multiple sequences • By hand • Automated • Combination

Definition • Optimality criteria: some kind of rule or scoring scheme that helps you decide which alignment you consider to be the best
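One way to make an optimality criterion concrete is a column-by-column score. The sketch below is only an illustration: the function name `score_alignment` and the match, mismatch and gap values are assumptions, not values taken from these slides.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Score two aligned strings of equal length, one column at a time.

    The scoring values are illustrative defaults, not a recommendation.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap        # a gap in either sequence
        elif x == y:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


# Two candidate alignments of ATTGCGC with ATACGC.
for top, bottom in [("ATTGCGC", "AT-ACGC"), ("ATTGCGC", "ATA-CGC")]:
    print(top, bottom, score_alignment(top, bottom))
```

Under this simple scheme both candidates receive the same score, which shows that a scoring scheme alone does not always single out one best alignment.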

Pairwise vs Multiple Sequences • Pairs of sequences are typically aligned using exhaustive algorithms (dynamic programming) • The complexity of exhaustive methods is O(2^n m^n), where n = number of sequences and m = sequence length • Multiple sequence alignment is usually performed using heuristic methods
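For the pairwise case, exhaustive dynamic programming can be sketched as a global (Needleman-Wunsch style) alignment. This is a minimal illustration with a linear gap penalty; the function name and the scoring values are assumptions, not parameters from any particular program.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty.

    Returns (score, aligned_s, aligned_t). Scoring values are illustrative.
    """
    n, m = len(s), len(t)
    # F[i][j] holds the best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # align two residues
                          F[i - 1][j] + gap,       # gap in t
                          F[i][j - 1] + gap)       # gap in s
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("ATTGCGC", "ATACGC")
print(score)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells for two sequences; extending the same idea to many sequences adds one dimension per sequence, which is why heuristics are used for multiple alignment.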

ATTGCGC  ATA-CGC The Correct Alignment  ATTGCGC ATTGCGC ATTGCGC  AT-ACGC ATTGCGC  ATACGC A

The Correct Alignment

Sequence alignment is easy with sufficiently closely related sequences • Below a certain level of identity, sequence alignment may become meaningless • The twilight zone for amino acid sequences is at roughly 30% identity • In the twilight zone it is good to make use of additional information if possible (e.g. structure)
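Percent identity can be computed directly from an aligned pair. Several definitions are in use; the sketch below assumes one of them (identical columns divided by columns where neither sequence has a gap), and the function name is hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap.

    This is one of several possible definitions; others divide by the
    alignment length or by the length of the shorter sequence.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue            # skip gapped columns
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


TWILIGHT_ZONE = 30.0  # approximate threshold quoted on the slide for proteins

identity = percent_identity("ATTGCGC", "AT-ACGC")
print(round(identity, 1), identity < TWILIGHT_ZONE)
```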

Consensus Sequences • Simplest form: a single sequence which represents the most common amino acid/base in each position
    Y D D G A   V   - E A L
    Y D G G -   -   - E A L
    F E G G I   L   V E A L
    F D - G I   L   V Q A V
    Y E G G A   V   V Q A L
    -----------------------
    Y D G G A/I V/L V E A L   (consensus)
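The simplest form of consensus can be sketched as a per-column majority vote. The handling of ties (joined with "/") and of gaps (ignored when counting) is an assumption chosen to reproduce the example on the slide.

```python
from collections import Counter

# The five aligned sequences from the consensus example.
alignment = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]


def consensus(rows):
    """Return the most common residue per column; ties are joined with '/'."""
    result = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != "-")  # gaps are ignored
        if not counts:
            result.append("-")      # column contains only gaps
            continue
        top = max(counts.values())
        # Keep tied residues in order of first appearance in the column.
        winners = [c for c in dict.fromkeys(column)
                   if c != "-" and counts[c] == top]
        result.append("/".join(winners))
    return result


print(" ".join(consensus(alignment)))  # Y D G G A/I V/L V E A L
```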

Multiple Alignment Formats e.g. Clustal, Phylip, MSF, MEGA etc. etc.

Clustal Format
CLUSTAL X (1.81) multiple sequence alignment

CAS1_BOVIN      MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
CAS1_SHEEP      MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
CAS1_PIG        MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
CAS1_HUMAN      MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
CAS1_RABBIT     MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE      MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT        MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
                *:***: **.*.*:* : . :

Phylip Format (Interleaved)
7 100
SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL

           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
           AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL

Phylip Format (Sequential)
3 100
Rat       ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
          TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
Mouse     ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
          TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
Rabbit    ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
          TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG
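The sequential layout is simple enough to write by hand. The sketch below assumes the classic convention of a header line giving the number of sequences and the alignment length, followed by one record per sequence with the name padded to 10 characters; the function name is hypothetical, and real programs may be stricter or more relaxed about names and line wrapping.

```python
def to_phylip_sequential(records):
    """Format a dict of name -> aligned sequence as sequential PHYLIP text.

    Assumes all sequences are already aligned to the same length.
    Names are truncated or padded to exactly 10 characters.
    """
    names = list(records)
    length = len(records[names[0]])
    if any(len(seq) != length for seq in records.values()):
        raise ValueError("all aligned sequences must have the same length")
    lines = [f"{len(names)} {length}"]          # header: count and length
    for name in names:
        lines.append(f"{name[:10]:<10}{records[name]}")
    return "\n".join(lines)


example = {
    "Rat": "ATGGTGCACCTGACTGAT",
    "Mouse": "ATGGTGCACCTGACTGAT",
    "Rabbit": "ATGGTGCATCTGTCCAGT",
}
print(to_phylip_sequential(example))
```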

Mega Format
#mega
TITLE: No title

#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Oppossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT

Progressive Multiple Alignment • Heuristic • Perform pairwise alignments • Align sequences to alignments, or alignments to existing alignments (profile alignments) • Do the alignments in some sensible order

Progressive versus Simultaneous • speed versus accuracy • simultaneous methods are capable of working out an ‘exact’ solution to the problem of multiple sequence alignment (e.g. NCBI’s MSA – user interface QAlign)

Iterative methods • Several progressive alignment methods can be iterated • e.g. Barton-Sternberg, ClustalX

ClustalX Algorithm • Perform pairwise alignments and calculate distances for all pairs of sequences • Construct a guide tree (dendrogram) joining the most similar sequences using Neighbour Joining • Align sequences, starting at the leaves of the guide tree. This involves pairwise comparisons as well as comparison of a single sequence with a group of sequences (profile)
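The guide-tree step can be sketched from a matrix of pairwise distances. Note the substitution: ClustalX builds its guide tree with Neighbour Joining, whereas the sketch below uses UPGMA (average linkage) because it is shorter to write. The function name and the distance values are hypothetical.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a guide tree as nested tuples by repeatedly joining the closest pair.

    distances maps (name_a, name_b) tuples to a pairwise distance.
    This is UPGMA, a simpler stand-in for Neighbour Joining.
    """
    size = {name: 1 for name in names}               # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(size) > 1:
        # Pick the two clusters with the smallest distance between them.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c == a or c == b:
                continue
            # Distance to the new cluster is the size-weighted average.
            d[frozenset((merged, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


# Hypothetical distances, for illustration only.
names = ["Rat", "Mouse", "Rabbit", "Human"]
distances = {
    ("Rat", "Mouse"): 0.1, ("Rabbit", "Human"): 0.2,
    ("Rat", "Rabbit"): 0.4, ("Rat", "Human"): 0.4,
    ("Mouse", "Rabbit"): 0.4, ("Mouse", "Human"): 0.4,
}
print(upgma(names, distances))
```

Reading the nested tuples from the innermost pair outwards gives the order in which a progressive aligner would combine sequences and profiles.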

ClustalX is not optimal • There are known areas in which ClustalX performs badly e.g. • errors introduced early cannot be corrected by subsequent information • alignments of sequences of differing lengths cause strange guide trees and unpredictable effects • edges: ClustalX does not penalise gaps at edges • There are alternatives to ClustalX available

T-Coffee • JMB 2000 • Also a progressive alignment method • Designed to solve some of the problems with Clustal (in particular, Clustal's inability to correct errors that appear early in the alignment process) • Can consider global and local pairwise alignments

Using ClustalX • Start with sequences in FASTA format (or an existing alignment in Clustal format) • [Do Alignment] on the alignment menu

ClustalX Parameters • Scoring Matrix • Gap opening penalty • Gap extension penalty • Protein gap parameters • Additional algorithm parameters • Secondary structure penalties

Score Matrices • Pairwise matrices and multiple alignment matrix series • PAM (Dayhoff), BLOSUM (Henikoff), GONNET (default), user defined • Transition (A<->G, C<->T)/transversion (purine<->pyrimidine, e.g. A<->C) ratio: low for distantly related sequences
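The transition/transversion distinction can be checked by counting substitutions between two aligned DNA sequences. This is a raw count under stated assumptions (gapped and ambiguous columns are skipped, no correction for multiple hits); the function name is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_counts(a, b):
    """Count transitions and transversions between two aligned DNA strings.

    Columns with gaps or symbols other than A, C, G, T are skipped.
    Returns (transitions, transversions).
    """
    transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return transitions, transversions


# First 18 columns of the Rat and Rabbit sequences from the format examples.
print(ts_tv_counts("ATGGTGCACCTGACTGAT", "ATGGTGCATCTGTCCAGT"))
```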

Gap Penalties • Linear gap penalties • Affine gap penalties: p = o + l × e (o = gap opening penalty, e = gap extension penalty, l = gap length) • Gap opening • Gap extension • Protein specific penalties (on by default) • Increase the probability of gaps associated with certain residues • Increase the chances of gaps in loop regions (> 5 hydrophilic residues)
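The difference between the two gap models is easiest to see with numbers. The sketch below implements the affine formula p = o + l × e; the function names and the penalty values are illustrative assumptions, not program defaults.

```python
def linear_gap_penalty(length, per_residue):
    """Linear model: every gapped position costs the same."""
    return length * per_residue


def affine_gap_penalty(length, opening, extension):
    """Affine model: p = o + l * e for a gap of length l (0 if there is no gap)."""
    if length <= 0:
        return 0
    return opening + length * extension


# Illustrative values only.
opening, extension = 10, 0.5

one_long_gap = affine_gap_penalty(4, opening, extension)
four_short_gaps = 4 * affine_gap_penalty(1, opening, extension)
print(one_long_gap, four_short_gaps)  # 12.0 42.0
```

With an affine penalty, one gap of length four is much cheaper than four separate gaps of length one, so the aligner prefers a few longer gaps over many scattered ones.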

Algorithm parameters • Slow-accurate pair-wise alignment • Do alignment from guide tree • Reset gaps before aligning (iteration) • Delay Divergent sequences (%)

Additional displays • Column Scores • Low quality regions • Exceptional residues

Multiple Alignment Tips • Align pairs of sequences using an optimal method • Progressive alignment programs such as ClustalX for multiple alignment • Choose representative sequences to align carefully • Choose sequences of comparable lengths • Progressive alignment programs may be combined • Review alignment by eye and edit • If you have a choice align amino acid sequences rather than nucleotides

Alignment of coding regions • Nucleotide sequences are much harder to align accurately than proteins • Protein coding sequences can be aligned using the protein sequences • e.g. BioEdit: toggle translation to amino acid, call ClustalW to align, edit alignment by hand, toggle back to nucleotide • In-frame nucleotide alignments can be used, e.g. to determine non-synonymous and synonymous distances separately
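The "toggle back to nucleotide" step amounts to threading each coding sequence onto its aligned protein, three bases per residue and three gap characters per protein gap. A minimal sketch, assuming the coding sequence is in frame and actually encodes the protein (this is not checked); the function name is hypothetical.

```python
def thread_codons(aligned_protein, cds):
    """Build an in-frame nucleotide alignment row from an aligned protein row.

    aligned_protein: protein sequence with '-' for gaps.
    cds: the unaligned coding sequence for that protein, in frame.
    The translation itself is not verified here.
    """
    residues = sum(1 for c in aligned_protein if c != "-")
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence is too short for this protein")
    out = []
    pos = 0
    for c in aligned_protein:
        if c == "-":
            out.append("---")             # one protein gap = three nucleotide gaps
        else:
            out.append(cds[pos:pos + 3])  # next codon
            pos += 3
    return "".join(out)


print(thread_codons("M-VH", "ATGGTGCAC"))  # ATG---GTGCAC
```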

Multiple Alignments and Phylogenetic Trees • You can make a more accurate multiple sequence alignment if you know the tree already • A phylogenetic tree is only as good as the alignment from which it was produced • The process of constructing a multiple alignment (unlike pair-wise) needs to take account of phylogenetic relationships

Editing a multiple sequence alignment • It is NOT fraud to edit a multiple sequence alignment • Incorporate additional knowledge if possible • Alignment editors help to keep the data organised and help to prevent unwanted mistakes

Alignment Editors • e.g. GDE, BioEdit, Seaview, Jalview etc. • Some alignment editors have begun to function as sequence analysis platforms (e.g. tools on BioEdit, GDE) • Construct sub-sequences (GDE, Seaview) • Annotate sequences (Seaview)

Aligning weakly similar sequences

Sequence contains conserved regions • e.g. DIALIGN (Morgenstern, Dress, Werner) http://bibiserv.techfak.uni-bielefeld.de/ • re-aligns regions between conserved blocks • useful if the sequences contain consistent conserved blocks • Block Maker http://blocks.fhcrc.org/ searches for conserved words that may be inconsistent

Profile Alignment (Gribskov et al. 1987) • Position specific scores • Allows addition of extra sequence(s) to an alignment • Allows alignment of alignments • Gaps introduced as whole columns in the separate alignments • Optimal alignment in time O(a^2 l^2), where a = alphabet size and l = sequence length • Information about the degree of conservation of sequence positions is included
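Position specific scores can be illustrated with a very reduced profile: per-column residue frequencies. This is a simplification made for the sketch; profiles in the Gribskov sense also combine the column composition with a substitution matrix and position specific gap penalties. Function names are hypothetical.

```python
from collections import Counter


def build_profile(rows):
    """Return one dict per column mapping each symbol to its frequency."""
    profile = []
    for column in zip(*rows):
        counts = Counter(column)          # gaps are counted as a symbol
        total = len(column)
        profile.append({symbol: n / total for symbol, n in counts.items()})
    return profile


def score_against_profile(profile, seq):
    """Sum the column frequency of each residue of an already-placed sequence.

    Assumes seq has been laid out against the profile column by column;
    no alignment or gap insertion is attempted here.
    """
    return sum(column.get(residue, 0.0) for column, residue in zip(profile, seq))


profile = build_profile(["AC", "AG", "AC", "TC"])
print(profile[0])                          # {'A': 0.75, 'T': 0.25}
print(score_against_profile(profile, "AC"))  # 1.5
```

A conserved column contributes a high score only for the conserved residue, which is how the degree of conservation enters the comparison.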

Good reasons to use profile alignments • Adding a new sequence to an existing multiple alignment that you want to keep fixed (align sequence to profile) • Searching a database for new members of your protein family (pfsearch) • Searching a database of profiles to find out which one your sequence belongs to (pfscan) • Combining two multiple sequence alignments (profile to profile)

Profile Alignment Using ClustalX • Profile Alignment Mode • Align sequence to profile • Align profile 1 to profile 2 • Secondary structure parameters

Profile searching using PSI-BLAST • Position Specific Iterative • Perform search – construct profile – perform search • Convergence (hopefully…) • Increased sensitivity for distantly related sequences • Available on-line (NCBI)

Databases of Aligned Sequences • Hovergen http://pbil.univ-lyon1.fr/databases/hovergen.html (vertebrate alignments) • Pfam http://www.sanger.ac.uk/Software/Pfam/ (protein domain alignments and profile HMMs) • BLOCKS http://blocks.fhcrc.org/ • Ribosomal Database Project http://rdp.cme.msu.edu/html/ alignments and trees derived from rRNA sequences • Interpro – combines information from other sources • Many more…

Probabilistic Models of Sequence Alignment • Hidden Markov Models • Sequence of states and associated symbol probabilities • Produces a probabilistic model of a sequence alignment • Align a sequence to a profile Hidden Markov Model • Algorithms exist to find the most probable path through the model (e.g. the Viterbi algorithm)

Markov Chain: a chain of things in which the probability of the next thing depends only on the current thing.
Hidden Markov Model: a sequence of states which form a Markov chain. The states are not observable. The observable characters have “emission” probabilities which depend on the current state.
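The definitions above can be made concrete with a toy two-state HMM and the Viterbi algorithm, which recovers the most probable sequence of hidden states for an observed sequence. The state names and all probabilities below are invented for illustration; this is not a profile HMM.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path for the observed symbols.

    Works in log space; every probability used must be greater than zero.
    """
    # Best log-probability of any path ending in each state after one symbol.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best previous state to arrive from.
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (V[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        V.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Toy model with invented numbers: the hidden state is the local base composition.
states = ["AT", "GC"]
start_p = {"AT": 0.5, "GC": 0.5}
trans_p = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
emit_p = {
    "AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}
print(viterbi("AATTGGCC", states, start_p, trans_p, emit_p))
```

Only the bases are observed; the path of "AT" and "GC" states is inferred, which is the sense in which the states are hidden.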

Some more recent developments • The need to align genomes • Alignment tools are required that can align very large regions of genomes • This poses a computational challenge • Programs such as DIALIGN can be run in parallel on multiprocessor machines

Some more recent developments • MUSCLE • Faster (uses k-mer frequencies to estimate the first pairwise distances, rather than full pairwise alignments) • Progressive (repeats the MSA using the more accurate Kimura distance between aligned amino acid sequences) • Has a third optimisation stage that involves making profile alignments of sub-trees and accepting the new alignment if it improves the SP (sum-of-pairs) score
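The SP (sum-of-pairs) score used as the acceptance test in the refinement stage can be sketched as follows. The scoring values are simple assumptions for illustration; real programs score residue pairs with a substitution matrix and use more elaborate gap handling.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: score every pair of rows in every column and add up.

    A pair of two gap characters contributes nothing.
    """
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue              # gap against gap is ignored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


before = ["ATTGCGC", "AT-ACGC", "ATTGCGC"]
after = ["ATTGCGC", "ATA-CGC", "ATTGCGC"]
# A refinement step would keep the new alignment only if its score is higher.
print(sp_score(before), sp_score(after))
```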

MuSiC - multiple sequence alignment with constraints • web server that allows a user to enter a set of
