Phylogenetic Analysis YTSLLLSRQ- YASLLW-RQA PASIILSRQA GRSIVLTRQM
Phylogenetics What do I need to do? Get related sequences of interest Perform multiple sequence alignments Edit alignment Estimate phylogenetic relationships Interpret results correctly
Phylogenetics Get related sequences of interest Perform multiple sequence alignments Edit alignment Estimate phylogenetic relationships Interpret results correctly
So you have a sequence…now what? MKILLLCIIFLYYVNAFKNTQKDGVSLQILKKKRSNQVNFLNRKNDYNLIKNKNPSSSLKSTFDDIKKIISKQLSVEEDKIQMNSNFTKDLGADSLDLVELIMALEEKFNVTISDQDALKINTVQDAIDYIEKNNKQ
#1: What is it? Does the source organism have its own genome database? Yes: BLAST @ genome database (GeneDB, PlasmoDB, etc.). Unknown/No: BLAST @ PubMed.
Why start with a genome-specific database? Alongside BLAST it offers: genome location/structure, strain variability, expression data, pathway data.
Blastp PubMed BLAST
Phylogenetics Get related sequences of interest Perform multiple sequence alignments Edit alignment Estimate phylogenetic relationships Interpret results correctly
Pair-wise sequence alignment
Global:
GYTSLLLSRQNED--G
G--SLLLSHK-D-HTG
Overlap:
GYTSLLLSRQNEDG--
--GSLLLSHK-D-HTG
Local (Smith-Waterman):
TSLLLSR
TSLLLSH
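A minimal sketch of local alignment in the Smith-Waterman style, to show how the "Local" case differs from the global one: scores are floored at zero and the traceback starts at the best-scoring cell. The match/mismatch/gap values are illustrative assumptions (the slide does not state a scoring scheme), and the two input strings are the slide's global-alignment sequences with their gaps removed.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned segment of a, aligned segment of b).

    Scoring values are assumptions for illustration, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    seg_a, seg_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            seg_a.append(a[i - 1])
            seg_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            seg_a.append(a[i - 1])
            seg_b.append("-")
            i -= 1
        else:
            seg_a.append("-")
            seg_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(seg_a)), "".join(reversed(seg_b))


if __name__ == "__main__":
    print(smith_waterman("GYTSLLLSRQNEDG", "GSLLLSHKDHTG"))
```

In practice a substitution matrix such as BLOSUM or PAM would replace the flat match/mismatch scores used here.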
Aligning 2 sequences globally
Sequences: YTSLLLSRQ and YASLLWRQA
Result:
YTSLLLSRQ-
YASLLW-RQA
[Figure: dynamic programming score matrix, with the two sequences on the axes; the first row and column run from -4 to -36 in steps of -4]
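The global alignment of the two sequences can be sketched with a Needleman-Wunsch style dynamic program. This is an illustrative sketch, not the scoring used on the slide: the gap penalty of -4 is chosen to match the slide's first row and column (-4, -8, …, -36), while the flat match (+4) and mismatch (-1) scores are assumptions standing in for a real substitution matrix.

```python
def needleman_wunsch(a, b, match=4, mismatch=-1, gap=-4):
    """Return (score, aligned a, aligned b) for a global alignment.

    match/mismatch are assumed values; gap=-4 mirrors the slide's border cells.
    """
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Border cells: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("YTSLLLSRQ", "YASLLWRQA")
    print(score)
    print(top)
    print(bottom)
```

Several alignments can tie for the best score, so the exact gap placement returned depends on the tie-breaking order in the traceback.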
Multiple sequence alignment: Progressive
Align 2 closest sequences:
YTSLLLSRQ-
YASLLW-RQA
Add in next closest sequence:
YTSLLLSRQ-
YASLLW-RQA
PASIILSRQA
Continue adding….
YTSLLLSRQ-
YASLLW-RQA
PASIILSRQA
GRSIVLTRQM
Hyper-dependent on initial matches.
Multiple sequence alignment: Iterative
Initial MSA, score (low):
YTTSLLLSRQ--
YATSLLWRQA--
PASIILSRQA--
GRTSIVLTRQMA
Optimize MSA score:
YTTSLLLSRQ--
YATSLLW-RQ-A
PA-SIILSRQ-A
GRTSIVLTRQMA
Probabilistic methods don’t always generate the same answer
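One common way to put a number on "MSA score" is a sum-of-pairs score, sketched below for the two alignments on this slide. The slides do not say which objective the iterative programs optimize, so both the sum-of-pairs choice and the scoring values (match +1, mismatch 0, residue-against-gap -1, gap-against-gap ignored) are assumptions for illustration.

```python
from itertools import combinations


def sum_of_pairs(msa, match=1, mismatch=0, gap=-1):
    """Sum a simple pairwise score over every column of an alignment.

    The scoring values are illustrative assumptions, not from the slides.
    """
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all rows of the alignment must have the same length")
    total = 0
    for column in zip(*msa):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue            # gap against gap carries no information
            if x == "-" or y == "-":
                total += gap        # residue aligned against a gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


initial = ["YTTSLLLSRQ--", "YATSLLWRQA--", "PASIILSRQA--", "GRTSIVLTRQMA"]
optimized = ["YTTSLLLSRQ--", "YATSLLW-RQ-A", "PA-SIILSRQ-A", "GRTSIVLTRQMA"]

if __name__ == "__main__":
    print("initial:  ", sum_of_pairs(initial))
    print("optimized:", sum_of_pairs(optimized))
```

An iterative aligner would repeatedly propose changes to the alignment and keep those that raise a score of this kind.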
Multiple sequence alignment programs
Classified by pair-wise alignment type (global, local) and MSA alignment type (progressive, iterative)
Progressive: ClustalX, T-Coffee, POA
Iterative: HMMs, GAs, Dialign
Multiple Sequence Alignments POAVIZ – progressive local CLUSTAL – progressive global
CLUSTALX Parameters
CLUSTALX – Protein Weight Matrices
• 1) BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches.
• 2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
• 3) GONNET. These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger dataset.
BLOSUM (BLOck SUbstitution Matrix)
BLOSUM99 (>99% identity) -----> BLOSUM62 (>62% identity)
BLOSUM62: built from blocks of aligned proteins, with sequences of at least 62% identity clustered together, to obtain actual observed substitution rates
Pros: Best bet for distantly divergent sequences
PAM (point accepted mutation)
PAM1 (99% identity) -----> PAM250 (20% identity)
Gather the substitution rates for PAM1 (99% identical sequences)
Assuming that those substitution rates are consistent over time…: (# point mutations / 100 amino acids)
Pros: Very good for closely related sequences
Cons: Rare mutations under-represented; substitution rates not constant over time (both are problems for phylogenetic estimation)
CLUSTAL vs POAVIZ (global vs local) POAVIZ CLUSTAL
Phylogenetics Get related sequences of interest Perform multiple sequence alignments Edit alignment Estimate phylogenetic relationships Interpret results correctly
BioEdit – Alignment manipulation: Open the “.aln” file.
BioEdit – Alignment manipulation: “Back colored view” gives more contrast. Select “Edit” from the mode dropdown.
BioEdit – Alignment manipulation: Select “Insert” so that you don’t accidentally lose part of your sequence. Then select the unaligned beginning (or end) sequence and delete it.
BioEdit – Alignment manipulation: Now save as a different file (.fasta).
Phylogenetics Get related sequences of interest Perform multiple sequence alignments Edit alignment Estimate phylogenetic relationships Interpret results correctly
Tree terminology: root; outgroup; common ancestor (node, branch point); lineage (branch, edge); branch length; operational taxonomic units (OTUs, leaves): A, B, C, D, E, F, G
Topology 1: B C D E F G A
Topology 2: B C E F G D A
Topology 3: E F G C D B A
Group types: monophyletic, paraphyletic, polyphyletic
Sequence homology – orthologues and paralogues
Ancestral gene → duplication → genes A and B in the last common ancestor → speciation → Human A, Rat A, Human B, Rat B
Orthologues: Human A and Rat A; Human B and Rat B
Paralogues: A genes versus B genes
Methods of estimating phylogenetic relationships
Character-based: Maximum Parsimony (MP)
Distance-based: Neighbor-Joining (NJ), Minimum Evolution (ME)
Probabilistic: Maximum likelihood (ML), Bayesian inference
Methods of estimating phylogenetic relationships: Maximum Parsimony (MP)
Taxa1 AAG
Taxa2 AAA
Taxa3 GGA
Taxa4 AGA
Tree 1, tips AAG AAA GGA AGA: 3 changes required (best tree)
Tree 2, tips AAG AGA AAA GGA: 4 changes required
Tree 3, tips AAG GGA AAA AGA: 4 changes required
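The change counts for the three candidate trees can be reproduced with a small sketch of Fitch's parsimony counting, applied one alignment column at a time. The slide does not name an algorithm, so Fitch's method is swapped in here as one standard way to count changes; the nested-tuple tree encoding and the function names are hypothetical choices made for this example.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one column.

    `tree` is a taxon name (leaf) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, seqs):
    """Total number of changes the tree requires over all columns."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for site in range(n_sites):
        states = {name: seq[site] for name, seq in seqs.items()}
        total += fitch_site(tree, states)[1]
    return total


seqs = {"Taxa1": "AAG", "Taxa2": "AAA", "Taxa3": "GGA", "Taxa4": "AGA"}
topologies = [
    (("Taxa1", "Taxa2"), ("Taxa3", "Taxa4")),
    (("Taxa1", "Taxa4"), ("Taxa2", "Taxa3")),
    (("Taxa1", "Taxa3"), ("Taxa2", "Taxa4")),
]

if __name__ == "__main__":
    for tree in topologies:
        print(tree, parsimony_length(tree, seqs))
```

With four taxa there are only three unrooted topologies, so all of them can be scored; for larger data sets the number of topologies grows too fast for exhaustive scoring.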
Methods of estimating phylogenetic relationships: Distance-based
Neighbor-Joining (NJ) Method: The NJ method involves clustering of neighbor species that are joined by one node. It does not evaluate all the possible tree topologies, so it is not guaranteed to obtain the optimal tree.
Minimum Evolution (ME) Method: Estimates the total branch length of each topology exhaustively, then chooses the topology with the least total branch length. Time intensive for large numbers of taxa.
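As a rough sketch of how a distance-based method starts, the code below computes pairwise p-distances (fraction of differing sites, skipping columns with a gap) from the four-sequence alignment shown earlier, then picks the first pair of neighbors using the standard NJ criterion Q(i, j) = (n - 2)·d(i, j) - Σ d(i, k) - Σ d(j, k). Only the first joining step is shown, and the use of uncorrected p-distances and pairwise gap deletion are assumptions; the slides do not specify a distance measure.

```python
def p_distance(x, y):
    """Fraction of differing sites between two aligned sequences (gaps skipped)."""
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites between the two sequences")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(msa):
    """Symmetric matrix of pairwise p-distances for an alignment."""
    n = len(msa)
    return [[0.0 if i == j else p_distance(msa[i], msa[j]) for j in range(n)]
            for i in range(n)]


def first_nj_pair(D):
    """Return ((i, j), Q) for the pair that neighbor-joining would join first."""
    n = len(D)
    totals = [sum(row) for row in D]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


msa = ["YTSLLLSRQ-", "YASLLW-RQA", "PASIILSRQA", "GRSIVLTRQM"]

if __name__ == "__main__":
    D = distance_matrix(msa)
    for row in D:
        print(["%.3f" % d for d in row])
    print(first_nj_pair(D))
```

A full NJ implementation would replace the joined pair with a new node, recompute distances to it, and repeat until the tree is resolved.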
Methods of estimating phylogenetic relationships: Probabilistic methods
Maximum likelihood (ML): Prob ( data | model + tree )
Search all possible topologies to optimize probability
More likely topology found
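To make Prob ( data | model + tree ) concrete, here is a deliberately tiny sketch: the "tree" is a single branch joining two DNA sequences, the model is assumed to be Jukes-Cantor (JC69, equal base frequencies and equal substitution rates), and the only parameter optimized is the branch length. None of these choices come from the slides; real ML programs also search over topologies and use richer models.

```python
import math


def jc69_log_likelihood(x, y, d):
    """log Prob(x, y | JC69 model, branch length d) for two aligned DNA sequences.

    d is the expected number of substitutions per site and must be > 0.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    log_l = 0.0
    for a, b in zip(x, y):
        # 0.25 is the assumed equilibrium frequency of the base in x.
        log_l += math.log(0.25 * (p_same if a == b else p_diff))
    return log_l


def ml_distance(x, y, step=0.001, max_d=5.0):
    """Grid search for the branch length that maximizes the likelihood."""
    grid = [k * step for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(x, y, d))


if __name__ == "__main__":
    x, y = "AAG", "AAA"   # Taxa1 and Taxa2 from the parsimony example
    d_hat = ml_distance(x, y)
    print("ML branch length:", d_hat)
    print("log-likelihood:  ", jc69_log_likelihood(x, y, d_hat))
```

For this two-sequence case the grid search should land close to the closed-form JC69 distance, -3/4 · ln(1 - 4p/3), where p is the fraction of differing sites.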
Bayesian inference Prior information Model for selection need both for everyone in the class