Burkhard Morgenstern
Institut für Mikrobiologie und Genetik
Molecular Evolution and Reconstruction of Phylogenetic Trees (Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen)
Winter semester (WS) 2006/2007
Goal: Phylogeny reconstruction based on molecular sequence data (DNA, RNA, protein sequences)
Multiple sequence alignment
• Molecular phylogeny reconstruction relies on comparative nucleic acid and protein sequence analysis
• Alignment most important tool for sequence comparison
• Multiple alignment contains more information than pair-wise alignment
Tools for multiple sequence alignment
  Y I M Q E V Q Q E R
• The sequence is duplicated in the course of its history (e.g. by a speciation event):
  Y I M Q E V Q Q E R
  Y I M Q E V Q Q E R
Tools for multiple sequence alignment
  Y I M Q E A Q Q E R
  Y L M Q E V Q Q E R
• Substitutions occur
Tools for multiple sequence alignment
  Y A I M Q E A Q Q E R
  Y L M - - V Q Q E R V
• Insertions/deletions (indels) occur
Tools for multiple sequence alignment
  Y A I M Q E A Q Q E R
  Y L M V Q Q E R V
• Because of insertions/deletions: sequence similarity no longer immediately visible!
Tools for multiple sequence alignment
  Y A I M Q E A Q Q E R -
  Y - L M V - - Q Q E R V
• Alignment brings together related parts of the sequences by inserting gaps into sequences
Tools for multiple sequence alignment
  Y A I M Q E A Q Q E R -
  Y - L M V - - Q Q E R V
• Mismatches correspond to substitutions
• Gaps correspond to indels
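As a minimal sketch of how these two rules can be read off a pairwise alignment, the following Python function counts match, mismatch, and gap columns. The name `classify_columns` and the use of `-` as gap character are assumptions made for illustration. Note that it counts gap columns, so a run of adjacent gaps caused by a single indel event contributes once per column.

```python
def classify_columns(row_a, row_b, gap="-"):
    """Count (matches, mismatches, gap_columns) in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gap_columns = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            gap_columns += 1   # column is (part of) an insertion/deletion
        elif a == b:
            matches += 1       # residue unchanged in both sequences
        else:
            mismatches += 1    # column corresponds to a substitution
    return matches, mismatches, gap_columns


# The alignment from the slide, written without spaces.
print(classify_columns("YAIMQEAQQER-", "Y-LMV--QQERV"))
```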
Tools for multiple sequence alignment
• Pairwise alignment: alignment of two sequences
• Multiple alignment: alignment of N > 2 sequences
Tools for multiple sequence alignment
  s1  R Y I M R E A Q Y E S A Q
  s2  R C I V M R E A Y E
  s3  Y I M Q E V Q Q E R
  s4  W R Y I A M R E Q Y E
• Assumption: sequence family related by common ancestry; similarity due to common history
• Sequence similarity not obvious (insertions and deletions may have happened)
Tools for multiple sequence alignment
  s1  - R Y I - M R E A Q Y E S A Q
  s2  - R C I V M R E A - Y E - - -
  s3  - - Y I - M Q E V Q Q E R - -
  s4  W R Y I A M R E - Q Y E - - -
• Multiple alignment = arrangement of sequences by introducing gaps
• Alignment reveals sequence similarities
Tools for multiple sequence alignment
  s1  - R Y I - M R E A Q Y E S A Q
  s2  - R C I V M R E A - Y E - - -
  s3  - - Y I - M Q E V Q Q E R - -
  s4  W R Y I A M R E - Q Y E - - -
General information in multiple alignment:
• Functionally important regions more conserved than non-functional regions
• Local sequence conservation indicates functionality!
Tools for multiple sequence alignment
  s1  - R Y I - M R E A Q Y E S A Q
  s2  - R C I V M R E A - Y E - - -
  s3  - - Y I - M Q E V Q Q E R - -
  s4  W R Y I A M R E - Q Y E - - -
Phylogeny reconstruction based on multiple alignment:
• Estimate pairwise distances between sequences (distance-based methods for tree reconstruction)
• Estimate evolutionary events (parsimony and maximum likelihood methods)
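The first bullet can be illustrated with the simplest possible distance, the p-distance: the fraction of differing residues among the columns where neither sequence has a gap. This is only a sketch under that assumption; the function names are made up here, and distance-based tree methods typically apply a correction for multiple substitutions at the same site rather than using the raw fraction.

```python
def p_distance(row_a, row_b, gap="-"):
    """Fraction of mismatching columns among the gap-free columns."""
    compared = differing = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue           # skip columns with a gap in either row
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Map every ordered pair of sequence names to its p-distance."""
    names = list(alignment)
    return {(x, y): p_distance(alignment[x], alignment[y])
            for x in names for y in names}


# The four-sequence alignment from the slide, written without spaces.
alignment = {
    "s1": "-RYI-MREAQYESAQ",
    "s2": "-RCIVMREA-YE---",
    "s3": "--YI-MQEVQQER--",
    "s4": "WRYIAMRE-QYE---",
}
D = distance_matrix(alignment)
for (x, y), d in D.items():
    if x < y:
        print(x, y, round(d, 3))
```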
Tools for multiple sequence alignment
  s1  - R Y I - M R E A Q Y E S A Q
  s2  - R C I V M R E A - Y E - - -
  s3  - - Y I - M Q E V Q Q E R - -
  s4  W R Y I A M R E - Q Y E - - -
Task in bioinformatics: Find best multiple alignment for given sequence set
Tools for multiple sequence alignment
  s1  - R Y I - M R E A Q Y E S A Q
  s2  - R C I V M R E A - Y E - - -
  s3  - - Y I - M Q E V Q Q E R - -
  s4  W R Y I A M R E - Q Y E - - -
Astronomical number of possible alignments! For example, the same sequences can also be aligned as:
  s1  - R Y I - M R E A Q Y E S A Q
  s2  - R C I V M R E A - - - Y E -
  s3  Y I - - - M Q E V Q Q E R - -
  s4  W R Y I A M R E - Q Y E - - -
The computer has to decide: which one is best?
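To get a feeling for "astronomical", the following sketch counts the possible alignments of just two sequences. It assumes that every distinct arrangement of columns counts as a different alignment (so a gap in the first sequence followed by a gap in the second is distinct from the reverse order); other counting conventions give smaller numbers. The function name is illustrative.

```python
def count_alignments(m, n):
    """Number of global alignments of two sequences of lengths m and n.

    The last column is either (residue, residue), (residue, gap) or
    (gap, residue), which gives a three-term recurrence.
    """
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = (table[i - 1][j]          # last column: residue / gap
                           + table[i][j - 1]        # last column: gap / residue
                           + table[i - 1][j - 1])   # last column: residue / residue
    return table[m][n]


for length in (2, 5, 10, 50):
    print(length, count_alignments(length, length))
```

The count grows exponentially with sequence length even for two sequences, and faster still for N > 2.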
Tools for multiple sequence alignment
Questions in development of alignment programs:
(1) What is a good alignment? → objective function ("score")
(2) How to find a good alignment? → optimization algorithm
First question far more important!
Tools for multiple sequence alignment
Before defining an objective function (scoring scheme):
• What is a biologically good alignment?
Tools for multiple sequence alignment
Criteria for alignment quality:
• 3D structure: align residues at corresponding positions in 3D structure of protein!
Tools for multiple sequence alignment Species related by common history
Tools for multiple sequence alignment Genes / proteins related by common history
Tools for multiple sequence alignment
Criteria for alignment quality:
• 3D structure: align residues at corresponding positions in 3D structure of protein!
• Evolution: align residues with common ancestors!
Tools for multiple sequence alignment
  s1  - R Y I - M R E A Q Y E S A Q
  s2  - R C I V M R E A - Y E - - -
  s3  - - Y I - M Q E V Q Q E R - -
  s4  W R Y I A M R E - Q Y E - - -
An alignment is a hypothesis about sequence evolution
• Mismatches correspond to substitutions
• Gaps correspond to insertions/deletions
Tools for multiple sequence alignment
  s1  - R Y I - M R E A Q Y E S A Q
  s2  - R C I V M R E A - Y E - - -
  s3  - - Y I - M Q E V Q Q E R - -
  s4  W R Y I A M R E - Q Y E - - -
Alternative with the gaps in s3 placed differently:
  s1  - R Y I - M R E A Q Y E S A Q
  s2  - R C I V M R E A - Y E - - -
  s3  - Y - I - M Q E V Q Q E R - -
  s4  W R Y I A M R E - Q Y E - - -
An alignment is a hypothesis about sequence evolution
• Search for most plausible scenario!
• Estimate probabilities for individual evolutionary events: insertions/deletions, substitutions
Tools for multiple sequence alignment
Compute a score s(a,b) for the degree of similarity between amino acids a and b, based on the probability $p_{a,b}$ of a substitution a → b (or b → a)
(Extremely simplified!)
Tools for multiple sequence alignment
Reason for different substitution probabilities $p_{a,b}$:
• Different physical and chemical properties of amino acids
• Amino acids with similar properties more likely to be substituted for each other
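One common way to turn such probabilities into a score s(a,b), used by the PAM and BLOSUM matrix families, is a log-odds ratio. The sketch below assumes that $p_{a,b}$ is read as the probability of finding a and b aligned in related sequences; the background frequencies $q_a$ and $q_b$ of the two amino acids are symbols introduced here and do not appear on the slides.

```latex
s(a,b) = \log \frac{p_{a,b}}{q_a \, q_b}
```

Under this reading the score is positive when a and b are aligned more often than expected by chance, and negative when they are aligned less often.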
Tools for multiple sequence alignment
Use penalty for gaps introduced into alignment
• Simplest approach: linear gap costs: penalty proportional to gap length
• Non-linear gap penalties more realistic: long gap caused by single insertion/deletion
• Most frequently used: affine linear gap penalties: more realistic, but efficient to calculate!
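The difference between the two schemes can be sketched in a few lines of Python. The parameter names and default values (`g`, `gap_open`, `gap_extend`) are illustrative assumptions, not values from the lecture, and the affine form shown here charges the opening cost for the first gap position and the extension cost for every further position, which is one of several equivalent conventions.

```python
def linear_gap_penalty(length, g=2.0):
    """Linear gap costs: penalty proportional to gap length."""
    return g * length


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine gap costs: opening a gap is expensive, extending it is cheap."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


for length in (1, 2, 5, 10):
    print(length, linear_gap_penalty(length), affine_gap_penalty(length))
```

With affine costs one long gap is cheaper than several short ones of the same total length, which matches the idea that a long gap may stem from a single insertion or deletion.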
Traditional objective functions
Define the score of an alignment as
• Sum of individual similarity scores s(a,b)
• Minus gap penalties
Needleman-Wunsch scoring system for pairwise alignment (1970)
Pair-wise sequence alignment
  T Y W I V
  T - - L V
Example: Score = s(T,T) + s(I,L) + s(V,V) - 2g
Assumption: linear gap penalty!
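The example score can be reproduced with a short function that walks over the alignment columns. This is a sketch under stated assumptions: `toy_similarity` is a hypothetical stand-in for a real substitution matrix (+1 for identical residues, -1 otherwise), and the value g = 2 is chosen only for illustration.

```python
def score_alignment(row_a, row_b, s, g, gap="-"):
    """Sum of similarity scores minus a linear penalty g per gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            total -= g          # each gap position costs g
        else:
            total += s(a, b)    # aligned residue pair
    return total


def toy_similarity(a, b):
    """Hypothetical similarity score; a real program would use a matrix."""
    return 1 if a == b else -1


# s(T,T) + s(I,L) + s(V,V) - 2g with the toy scores and g = 2
print(score_alignment("TYWIV", "T--LV", toy_similarity, g=2))
```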
Pair-wise sequence alignment
  T Y W I V
  T - - L V
Dynamic-programming algorithm finds alignment with best score. (Needleman and Wunsch, 1970)
Pair-wise sequence alignment
  T Y W I V
  T - - L V
• Running time proportional to the product of the sequence lengths
• Time complexity O(l1 * l2)
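A minimal sketch of the dynamic-programming idea, assuming a simple match/mismatch score in place of a substitution matrix and a linear gap penalty (all parameter values are illustrative). The two nested loops over the sequence positions are where the O(l1 * l2) running time comes from.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=2):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    # F[i][j] = best score of an alignment of x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          F[i - 1][j] - gap,     # x[i-1] against a gap
                          F[i][j - 1] - gap)     # y[j-1] against a gap
    # Traceback from the bottom-right cell to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("TYWIV", "TLV")
print(score)
print(top)
print(bottom)
```

Several alignments can share the best score; the traceback above returns one of them, which need not be the one drawn on the slide.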
Pair-wise sequence alignment
• Algorithm for pairwise alignment can be generalized to multiple alignment of N sequences
• Time complexity O(l1 * l2 * … * lN)
• Not feasible in practice (running time far too long!)
• Heuristic necessary, i.e. a fast algorithm that does not necessarily produce the mathematically best alignment
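A quick calculation shows why the generalized algorithm is impractical. The sketch below only counts the cells of the N-dimensional dynamic-programming table, assuming one extra row per dimension for the empty prefix; the function name and the sequence length of 100 are illustrative.

```python
from math import prod


def dp_table_cells(lengths):
    """Number of cells in the dynamic-programming table for the given lengths."""
    return prod(length + 1 for length in lengths)


for n_sequences in (2, 3, 5, 10):
    print(n_sequences, dp_table_cells([100] * n_sequences))
```

The cell count, and with it the running time, is multiplied by roughly the sequence length for every sequence added.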
'Progressive' Alignment
Most popular approach to (global) multiple sequence alignment:
• Progressive alignment
Since the mid-1980s: Feng/Doolittle, Higgins/Sharp, Taylor, …
'Progressive' Alignment
  WCEAQTKNGQGWVPSNYITPVN
  WWRLNDKEGYVPRNLLGLYP
  AVVIQDNSDIKVVPKAKIIRD
  YAVESEAHPGSFQPVAALERIN
  WLNYNETTGERGDFPGTYVEYIGRKKISP
[Figure: guide tree]
'Progressive' Alignment
  WCEAQTKNGQGWVPSNYITPVN
  WW--RLNDKEGYVPRNLLGLYP-
  AVVIQDNSDIKVVP--KAKIIRD
  YAVESEASFQPVAALERIN
  WLNYNEERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”
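The profile step can be sketched as follows. This is not the procedure of any particular program: it assumes a toy match/mismatch score averaged over all residue pairs of two columns, a linear gap penalty, and a fixed merge order standing in for the guide tree (a real guide tree would also merge sub-profiles with each other). All function names are hypothetical. Gaps are inserted as whole columns, so a gap once placed in a profile is never removed later.

```python
from itertools import product


def column_score(col_a, col_b, match=1, mismatch=-1):
    """Average pairwise score between two profile columns; gaps score 0."""
    total = 0
    for a, b in product(col_a, col_b):
        if a != "-" and b != "-":
            total += match if a == b else mismatch
    return total / (len(col_a) * len(col_b))


def align_profiles(P, Q, gap=2):
    """Align two profiles (lists of equal-length gapped strings)."""
    cols_p, cols_q = list(zip(*P)), list(zip(*Q))
    n, m = len(cols_p), len(cols_q)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + column_score(cols_p[i - 1], cols_q[j - 1]),
                          F[i - 1][j] - gap,
                          F[i][j - 1] - gap)
    # Traceback; a gap is always a whole column of '-' in one of the profiles.
    gap_p, gap_q = ("-",) * len(P), ("-",) * len(Q)
    out_p, out_q = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                F[i][j] == F[i - 1][j - 1] + column_score(cols_p[i - 1], cols_q[j - 1])):
            out_p.append(cols_p[i - 1]); out_q.append(cols_q[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            out_p.append(cols_p[i - 1]); out_q.append(gap_q)
            i -= 1
        else:
            out_p.append(gap_p); out_q.append(cols_q[j - 1])
            j -= 1
    out_p.reverse(); out_q.reverse()
    return (["".join(row) for row in zip(*out_p)] +
            ["".join(row) for row in zip(*out_q)])


def progressive_align(seqs, order):
    """Add sequences one at a time in the given order (stand-in for a guide tree)."""
    profile = [seqs[order[0]]]
    for k in order[1:]:
        profile = align_profiles(profile, [seqs[k]])
    return profile


seqs = ["RYIMREAQYESAQ", "RCIVMREAYE", "YIMQEVQQER", "WRYIAMREQYE"]
for row in progressive_align(seqs, [0, 1, 2, 3]):
    print(row)
```

Because earlier gap decisions are frozen, the result depends on the merge order and is a heuristic solution, not necessarily the best-scoring multiple alignment.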