180 likes | 269 Views
Multiple Sequence Alignment “An Inexact Science”. Why Do Multiple Sequence Alignments? Since Biological Sequences often occur in families, we may want to know to which family our sequence of interest belongs.
E N D
Multiple Sequence Alignment “An Inexact Science”
Why Do Multiple Sequence Alignments? Since Biological Sequences often occur in families, we may want to know to which family our sequence of interest belongs. For the most part we are interested in families of proteins since homologous sequences retain similar structures and functions. Thus, if we know a feature of one of the proteins, we can possibly identify similar features in homologous proteins and predict that they have similar functions. Since most proteins have been identified by the sequencing of genomic DNA, the functions of most proteins have been assigned on the basis of homology to other known proteins rather than on the basis of results from biochemical or functional (cell biological) assays. Aligned residues tend to occupy the corresponding positions in the 3D-structure of each aligned protein. While protein structures evolve over time, their sequences evolve more rapidly than the structures, so a good multiple sequence alignment can tell us something about the evolution of the protein.
For these reasons multiple sequence alignments are used to: • provide insight into likely function and structure of a protein • provide a more sensitive method for finding distantly related members of a protein family • find conserved residues or motifs in a subset of the results from a database search • gain understanding of the evolution of a particular protein • help understand the gene products of a newly sequenced genome
Any Multiple Sequence Alignment starts with a primary sequence. In the following we start with hMSH3 - Swiss Prot Accession Number P20585 sp|P20585|MSH3_HUMAN GRGTSTHDGIAIAYATLEYFIRDVKSLTLFVTHYPPVCELEKNYSHQVGNYHMGFLVS 1035 sp|P43246|MSH2_HUMAN GRGTSTYDGFGLAWAISEYIATKIGA--FCMFATHFHELTALAN-QIPTVNNLHVTALTT807 sp|O15457|MSH4_HUMAN GRGTNTEEGIGICYAVCEYLLSLKAF---TLFATHFLELCHIDALYPNVENMHFEVQHVK 818 sp|O43196|MSH5_HUMAN GKGTNTVDGLALLAAVLRHWLARGPTCPHIFVATNFLSLVQLQLLPQGPLVQYLTMETCE 733 sp|P52701|MSH6_HUMAN GRGTATFDGTAIANAVVKELAETIKC--RTLFSTHYHSLVEDYSQNVAVRLGHMACMVEN 1273 sp|P13705|MSH3_MOUSE GRGTSTHDGIAIAYATLEYFIRDVKS--LTLFVTHYPPVCELEKCYPEQVGNYHMGFLVN 989 sp|P25336|MSH3_YEAST GRGTGTHDGIAISYALIKYFSELSDCP-LILFTTHFPMLGEIKS---PLIRNYHMDYVEE 957 tr|Q73MQ8 KKGKWLVAVDNLKLTIAEDDIEVCENQEKLKLSKPIVSIISDTSAPSSRPS--------- 740 :*. : : . . . : Interpreting the notation at the bottom of the alignment * = entirely conserved column : = all residues have approximately the same size and hydropathy . = one of either size or hydropathy have been preserved In general, our lab manual says that a good block in an alignment is at least 10-30 aa long and contains at least one to three *’s, five to seven :’s, and a few .’s. If this is the case, we suspect that we have identified a conserved region within the alignment.
The previous example clearly illustrates the following observations: • We can not expect two protein structures with different sequences to be completely superposable (i.e. to be able to completely superimpose one upon the other.) • Research by Chothia & Lesk in 1986 found that given two protein sequence alignments that were clearly homologous (30% identical) usually only about 50% of the individual residues were superposable.
Evolutionarily correct alignment is more difficult to infer than structural alignment. • Evolutionary history of the residues from a sequence of a family is not usually known from any source. • It must be inferred from sequence alignment. • Sequence alignment has an independent source of reference – this may not be the “common” ancestor. This does not say that multiple sequence alignment is straight forward – there is no way to define an unambiguously correct alignment.
In general, humans can do better than machines aligning sequences. Many Biologists can do high quality alignments by hand. • Some of the Factors to be considered are: • Highly conserved sequences • Buried hydorphobic residues • Expected patterns of insertions and deletions that tend to alternate with conserved sequences • Phylogenetic relationships • Influence of secondary and tertiary structure e.g. alternation of hydrophobic and hydrophilic columns in an exposed β – sheet. But, it is HARD and TEDIOUS work!! Most times it is best to start with a machine generated sequence alignment.
The basic algorithm is called progressive alignment developed in the late 1980’s by Da-Fei Feng and Russell Doolittle. • The algorithm follows these steps: • First a group of proteins is chosen to be aligned • Every protein in the group is globally aligned using Needleman-Wunsch with every other protein in the group to be aligned • A distance matrix is used to score the pairwise alignments. These scores are used to generate a guide tree to construct the alignment. • The guide tree which reflects the relatedness of the sequences to be aligned shows the order in which sequences should be added to the multiple alignment starting with the two most closely related sequences • One by one the next most closely related sequences are added to the alignment (for example, one naïve way is to create a consensus alignment for the first two, which may include gaps, then align the next sequence with this consensus alignment – continue the process until all of the sequences are aligned.)
We will illustrate this process for the following three sequences: >gi|62901522|sp|P01865|GCAM_MOUSE Ig gamma-2A chain C region, membrane-bound form KTTAPSVYPLAPVCGDTTGSSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVT SSTWPSQSITCNVAHPASSTKVDKKIEPRGPTIKPCPPCKCPAPNLLGGPSVFIFPPKIKDVLMISLSPI VTCVVVDVSEDDPDVQISWFVNNVEVHTAQTQTHREDYNSTLRVVSALPIQHQDWMSGKEFKCKVNNKDL PAPIERTISKPKGSVRAPQVYVLPPPEEEMTKKQVTLTCMVTDFMPEDIYVEWTNNGKTELNYKNTEPVL DSDGSYFMYSKLRVEKKNWVERNSYSCSVVHEGLHNHHTTKSFSRTPGLDLDDVCAEAQDGELDGLWTTI TIFISLFLLSVCYSASVTLFKVKWIFSSVVELKQTISPDYRNMIGQGA >gi|121048|sp|P01863|GCAA_MOUSE Ig gamma-2A chain C region, A allele AKTTAPSVYPLAPVCGDTTGSSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTV TSSTWPSQSITCNVAHPASSTKVDKKIEPRGPTIKPCPPCKCPAPNLLGGPSVFIFPPKIKDVLMISLSP IVTCVVVDVSEDDPDVQISWFVNNVEVHTAQTQTHREDYNSTLRVVSALPIQHQDWMSGKEFKCKVNNKD LPAPIERTISKPKGSVRAPQVYVLPPPEEEMTKKQVTLTCMVTDFMPEDIYVEWTNNGKTELNYKNTEPV LDSDGSYFMYSKLRVEKKNWVERNSYSCSVVHEGLHNHHTTKSFSRTPGK >gi|113588|sp|P01878|IGHA_MOUSE Ig alpha chain C region ESARNPTIYPLTLPPALSSDPVIIGCLIHDYFPSGTMNVTWGKSGKDITTVNFPPALASGGRYTMSNQLT LPAVECPEGESVKCSVQHDSNPVQELDVNCSGPTPPPPITIPSCQPSLSLQRPALEDLLLGSDASITCTL NGLRNPEGAVFTWEPSTGKDAVQKKAVQNSCGCYSVSSVLPGCAERWNSGASFKCTVTHPESGTLTGTIA KVTVNTFPPQVHLLPPPSEELALNELLSLTCLVRAFNPKEVLVRWLHGNEELSPESYLVFEPLKEPGEGA TTYLVTSVLRVSAETWKQGDQYSCMVGHEALPMNFTQKTIDRLSGKPTNVSVSVIMSEGDGICY
Pairwise alignments: SeqA Name Len(aa)SeqB Name Len(aa) Score =================================================================================================== 1 gi|62901522|sp|P01865|GCAM_MOU 398 2 gi|121048|sp|P01863|GCAA_MOUSE 330 99 1 gi|62901522|sp|P01865|GCAM_MOU 398 3 gi|113588|sp|P01878|IGHA_MOUSE 344 24 2 gi|121048|sp|P01863|GCAA_MOUSE 330 3 gi|113588|sp|P01878|IGHA_MOUSE 344 25 =================================================================================================== Guide Tree ( gi|62901522|sp|P01865|GCAM_MOU:0.00966, gi|121048|sp|P01863|GCAA_MOUSE:-0.00360, gi|113588|sp|P01878|IGHA_MOUSE:0.74906);
Basic problem is assigning a score to the alignment. • Producing a meaningful score is a very inexact science at this point and may continue to be that. • A good scoring system should take into account • The fact that some positions are more conserved than others – thus, we infer that it needs to be position specific scoring. • The fact that some sequences are not independent, but related by a phylogenetic tree. Sometimes there is a small set of sequences that can be aligned unambiguously and used as a starting point for the complete alignment.
We will tend to ignore the phylogenetic tree and concentrate on the first criterion. Simplifying assumption: individual columns of alignment are statistically independent. Note: Gaps will be aligned with gaps. Only restriction is that every column must have at least one non-gap character. Typical scoring function: Let m be some alignment S(m) = G + sum( S( mi ) ) Where mi is column i in the alignment S( mi ) is the score of that column, and G is a function for scoring the comparison of gaps.
The score for a column, S(mi), is usually computed as a Sum of Pairs, i.e. each residue in the column is paired with the residues of previous column entries and evaluated using a standard scoring scheme (BLOSUMn or PAMnn = your favorite number). Gaps next to a residue are usually scored using the gap penalty and adjacent gaps are usually scored as 0. The formula for the score of a column:
Problem with sum of pairs scoring: • Suppose you are comparing N sequences and using BLOSUM50 as your comparison matrix. • Suppose all N rows in a column have L (leucine). The BLOWSUM50 score for comparing L to L is 5. for a total score for the column of 5*N*(N – 1)/2 (5 times the number of symbol pairs in the column) • Seq 1: ……….. L………………. • Seq 2: ……….. L………………. • Seq 3: ……….. L………………. • Seq 4: ……….. L………………. • Score: • (Seq 2 & Seq 1) 5 (Seq 4 & Seq 1) 5 • (Seq 3 & Seq 1) 5 (Seq 4 & Seq 2) 5 Column Score = 30 • (Seq 3 & Seq 2) 5 (Seq 4 & Seq 3) 5
Now suppose one of the L’s is replaced by a G (Glycine) The BLOSUM50 score for comparing L to G is -4 instead of 5 so (N – 1) pairs are reduced by 9 (in our example 3 pairs are reduced by 9) • Seq 1: ……….. L………………. • Seq 2: ……….. L………………. • Seq 3: ……….. G………………. • Seq 4: ……….. L………………. • Score: • (Seq 2 & Seq 1) 5 (Seq 4 & Seq 1) 5 • (Seq 3 & Seq 1) -4 (Seq 4 & Seq 2) 5 Column Score = 3 • (Seq 3 & Seq 2) -4 (Seq 4 & Seq 3) -4 • The fraction reduction in the SP score is:
However, if we were comparing a similar collection of 60 sequences, the score for a column of all L’s is: 1770 * 5 = 8850 If one of the L’s were a G, then the score would be reduced by a rescoring of 59 pairs or: 59 * 9 = 531 The fraction reduction in the SP score is only: In general if we have N sequences to be aligned and suppose they all contain an L in the same column except for one G then the fraction reduction in the SP score as a result of the G is: The problem is N in the denominator. As more sequences are added (and say they all have L’s) the effect of this G is reduced. It should be amplified to show a disparity.
None the less SP is commonly used. But the problem does not stop here. To find the best scoring alignment multidimensional Dynamic Programming similar to that we did earlier is used This becomes intractable very fast. Suppose only aligning 3 sequences.