Phylogenetic analysis
Taken from http://allserv.rug.ac.be/~avierstr and http://www.cs.otago.ac.nz/cosc348/Lectures/MSAPhylogeny.htm and the Introduction to Bioinformatics course slides.
Purpose of phylogenetics:
• Reconstruct the evolutionary relationship between species. Experience shows that closely related organisms have similar sequences, while more distantly related organisms have more dissimilar sequences.
• Estimate the time of divergence between two organisms since they last shared a common ancestor.
But…
• The theory and practical applications of the different models are not universally accepted.
• It is important to start with a good alignment (garbage in, garbage out).
• A tree based on an alignment of one gene represents the relationship between the genes, which is not necessarily the relationship between the whole organisms. Trees calculated from different genes of the same organisms can yield different relationships.
Why is phylogeny important?
• Determining the tree of life (e.g., placing a new organism)
• Determining gene function
• Understanding which parts of the gene/regulatory sequences are important
• Tracing the evolution of genes: horizontal gene transfer etc.
Protein or DNA?
• As with Multiple Sequence Alignment, proteins are preferred:
  • More informative
  • Shorter in length
  • Less chance of multiple mutations at the same site
• When DNA?
  • A non-coding sequence
  • Proteins too similar
Terminology:
• node: represents a taxonomic unit. This can be a taxon (an existing species) or an ancestor (an unknown species that represents the ancestor of 2 or more species).
• branch: defines the relationship between the taxa in terms of descent and ancestry.
• topology: the branching pattern.
• branch length: often represents the number of changes that have occurred in that branch.
• root: the common ancestor of all taxa.
• distance scale: a scale which represents the number of differences between sequences (e.g. 0.1 means 10 % difference).
Possible ways of drawing a tree:
• Unscaled branches: the length is not proportional to the number of changes.
• Scaled branches: the length of the branch is proportional to the number of changes (usually in PAMs). The distance between 2 species is the sum of the lengths of all branches connecting them.
• Rooted tree: the root is the common ancestor. The direction of each path from the root corresponds to evolutionary time.
• Unrooted tree: specifies the relationships among species and does not define the evolutionary path.
Rooted vs. unrooted trees (figure: the taxa 1, 2 and 3 drawn as a rooted tree and as an unrooted tree)
Rooted vs. unrooted: the position of the root does not affect the maximum parsimony (MP) score.
Intuition why rooting doesn't change the score (figure: a tree over s1–s5 with the 0/1 state of gene number 1 at each leaf and "1 or 0" at an internal node). The change will always be on the same branch, no matter where the root is positioned…
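The intuition above can be checked with a small sketch of Fitch's parsimony counting for a single 0/1 character. The tree shape and the assignment of states to s1–s5 are assumptions made for illustration (the figure itself is not recoverable from the text), and the rootings are written out by hand.

```python
def fitch(tree, states):
    """Fitch's algorithm for one character on a rooted binary tree.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping leaf name -> observed state
    Returns (candidate ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


# Assumed leaf states: presence (1) / absence (0) of gene number 1.
states = {"s1": 1, "s4": 1, "s3": 1, "s2": 0, "s5": 0}

# Four rootings of one assumed unrooted topology: (s1,s4) - s3 - (s2,s5).
rootings = {
    "internal edge next to (s1,s4)": (("s1", "s4"), ("s3", ("s2", "s5"))),
    "internal edge next to (s2,s5)": ((("s1", "s4"), "s3"), ("s2", "s5")),
    "branch leading to s1": ("s1", ("s4", ("s3", ("s2", "s5")))),
    "branch leading to s5": ("s5", ("s2", ("s3", ("s1", "s4")))),
}

scores = {where: fitch(tree, states)[1] for where, tree in rootings.items()}
for where, score in scores.items():
    print(f"root on {where}: {score} change(s)")
```

Every rooting gives the same count, because the single change always falls on the branch separating the 0-leaves from the 1-leaves.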
We want rooted trees! How can we root the tree?
• Gorilla gorilla (gorilla)
• Pan troglodytes (chimpanzee)
• Homo sapiens (human)
• Gallus gallus (chicken)
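Evaluate all 3 possible UNROOTED trees and pick the MP tree (figure: the three unrooted topologies for the four taxa):
• (Human, Chimp) | (Gorilla, Chicken)
• (Human, Gorilla) | (Chimp, Chicken)
• (Human, Chicken) | (Chimp, Gorilla)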
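A minimal sketch of that evaluation, assuming Fitch counting summed over alignment columns as the parsimony score. The four sequences are hypothetical toy data, not real gene sequences, and each unrooted tree is rooted on its internal edge only so that it can be traversed (rooting does not change the score).

```python
def fitch_site(tree, column):
    """Minimum number of changes for one alignment column on a rooted binary tree."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], column)
    right_set, right_cost = fitch_site(tree[1], column)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-column Fitch counts over the whole alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


def quartet_topologies(a, b, c, d):
    """The three possible unrooted trees for four taxa, as pairs of pairs."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


# Hypothetical toy alignment (illustration only).
alignment = {
    "human":   "ACGTACGTAC",
    "chimp":   "ACGTACGTAT",
    "gorilla": "ACGTTCGAAT",
    "chicken": "TCGATCGAGG",
}

topologies = quartet_topologies("human", "chimp", "gorilla", "chicken")
scores = [parsimony_score(t, alignment) for t in topologies]
mp_tree = topologies[scores.index(min(scores))]

for tree, score in zip(topologies, scores):
    print(tree, score)
print("MP tree:", mp_tree)
```

With only four taxa the search is exhaustive; the number of unrooted topologies grows very quickly with the number of taxa, so real programs use heuristics instead of enumerating every tree.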
Rooting based on a priori knowledge (figure: the unrooted tree of Human, Chimp, Gorilla and Chicken, and the same tree after rooting)
Ingroup / Outgroup:
• OUTGROUP: Chicken
• INGROUP: Gorilla, Human, Chimp
Distance-based methods
• Compress all of the individual differences between a pair of sequences into a single number: the distance.
• Starting from an alignment, pairwise distances are calculated between DNA sequences as the sum of all base pair differences between two sequences (the most similar sequences are assumed to be closely related). This creates a distance matrix.
• From the distance matrix, a phylogenetic tree is calculated with clustering algorithms. These methods construct a tree by linking the least distant pair of taxa, then successively linking more distant taxa.
• Algorithms: UPGMA clustering, Neighbor Joining.
• UPGMA assumes a molecular clock; Neighbor Joining does not.
• Tool: ClustalW
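The steps above can be sketched as follows: count differences per pair to fill a distance matrix, then repeatedly link the least distant pair of clusters (UPGMA, using the size-weighted average as the distance to a merged cluster). The sequences are hypothetical toy data and the tree is returned as nested tuples without branch lengths; this is an illustration, not how ClustalW is implemented.

```python
from itertools import combinations


def p_distance(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names -- list of taxon names
    dist  -- dict mapping frozenset({a, b}) -> distance
    Returns the tree topology as nested tuples.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Link the least distant pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the new cluster: average over all leaves it contains.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical toy alignment (illustration only).
seqs = {
    "human":   "ACGTACGTAC",
    "chimp":   "ACGTACGTAT",
    "gorilla": "ACGTTCGAAT",
    "chicken": "TCGATCGAGG",
}

dist = {frozenset((a, b)): p_distance(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)}
tree = upgma(list(seqs), dist)
print(tree)
```

Because every merge averages distances as if all lineages had evolved at the same rate, this procedure is only reliable when the molecular clock assumption roughly holds.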
Cladistic methods
• Trees are calculated by considering the various possible pathways of evolution and are based on parsimony or likelihood methods. These methods use each alignment position as evolutionary information to build a tree.
• Parsimony: looks for the most parsimonious tree, i.e. the tree with the fewest evolutionary changes needed for all sequences to derive from a common ancestor.
  • Slower than distance methods.
  • Does not require a molecular clock.
  • Tool: PHYLIP
• Maximum Likelihood: looks for the tree with the maximum likelihood, i.e. the most probable tree.
  • This is the slowest method of all, but it seems to give the best result and the most information about the tree.
  • No molecular clock assumption.
  • Tool: PHYLIP
Even the best evolutionary models can't solve this problem... Two homologous DNA sequences which descended from an ancestral sequence and accumulated mutations since their divergence from each other. Note that although 12 mutations have accumulated, differences can be detected at only three nucleotide sites.
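As an illustration of how a substitution model tries to compensate for hidden changes, here is a minimal sketch of the Jukes-Cantor correction. This model is not named in the slides; it is brought in only as a standard example. It converts the observed fraction of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3), and it remains an estimate: the true number of mutations cannot be recovered exactly.

```python
import math


def jukes_cantor(p):
    """Estimated substitutions per site from the observed fraction of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("the Jukes-Cantor correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.30, 0.60):
    print(f"observed p = {p:.2f} -> corrected d = {jukes_cantor(p):.3f}")
```

The corrected distance is always at least as large as the observed one, and the gap widens as the sequences become more divergent.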
Molecular clocks (Dickerson, 1971)
• Assumption: constant rate of evolution
• Different rate for different genes
(figure: plot with axis label "Millions of years since divergence")
Problems with molecular clocks
Surprisingly, insulin from the guinea pig evolved seven times faster than insulin from other species. Why? Guinea pig insulin does not bind two zinc ions, while insulin molecules from most other species do. The structural constraints on these molecules were relaxed, and so the genes diverged rapidly.
Building trees with ClustalW
http://www.ebi.ac.uk/clustalw/
(screenshot callouts: "Place alignment here", "Choose a tree here")
PHYLIP • A suite of phylogeny tools • Both web servers and stand-alone applications • Used for distance/parsimony/maximum likelihood • http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
Bootstrapping
• Assigns confidence to individual tree branches
• Columns of the alignment are randomly sampled (with replacement) and the tree is recomputed; this is repeated for X iterations
• Bootstrap value of a branch = the number of iterations in which that branch appeared.
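The resampling step can be sketched as below. To keep the example short, the "tree" is reduced to a single grouping (the least distant pair of taxa); that stand-in is an assumption for illustration, and a real analysis would rebuild the full tree for every replicate. The sequences are hypothetical toy data.

```python
import random
from itertools import combinations


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree building: the pair of taxa with the fewest differences."""
    def differences(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return frozenset(min(combinations(alignment, 2), key=differences))


def bootstrap_support(alignment, grouping, n_iter=100, seed=0):
    """Fraction of bootstrap replicates in which the grouping is recovered."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_columns(alignment, rng)) == grouping
        for _ in range(n_iter)
    )
    return hits / n_iter


# Hypothetical toy alignment (illustration only).
alignment = {
    "human":   "ACGTACGTAC",
    "chimp":   "ACGTACGTAT",
    "gorilla": "ACGTTCGAAT",
    "chicken": "TCGATCGAGG",
}

support = bootstrap_support(alignment, frozenset({"human", "chimp"}))
print(f"bootstrap support for (human, chimp): {support:.2f}")
```

Sampling with replacement means some columns appear several times in a replicate and others not at all, so a grouping supported by only a few columns will drop out of many replicates.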
Collections of homologous genes
• Homologene @ Entrez
  • http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene
• COG: Clusters of Orthologous Groups
  • Results of all-vs-all BLAST between genomes. Genes within the same COG are "pairwise best hits"
  • http://www.ncbi.nlm.nih.gov/COG/
• RDP: ribosomal sequences
  • The "standard" sequences for doing species phylogeny
  • Focused on Bacteria
  • http://rdp8.cme.msu.edu/html/
Orthologs Homologous sequences are orthologous if they were separated by a speciation event: If a gene exists in a species, and that species diverges into two species, then the copies of this gene in the resulting species are orthologous.
Orthologs
• Orthologs typically retain the same or a similar function in the course of evolution.
• Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.
Orthologs (figure: gene a in the ancestor; after speciation, a copy of a is present in descendant 1 (e.g., human) and in descendant 2 (e.g., dog))
Paralogs Homologous sequences are paralogous if they were separated by a gene duplication event: If a gene in an organism is duplicated, then the two copies are paralogous.
Paralogs
• Orthologs will typically have the same or similar function.
• This is not always true for paralogs: one copy of the duplicated gene is no longer under the original selective pressure, so it is free to mutate and acquire new functions.
Paralogs (figure: gene a is duplicated, giving the copies a and b)
Evolutionary rate and conservation
Functionally or structurally important sites are conserved:
• Conserved sites = "slow-evolving" sites
• Variable sites = "fast-evolving" sites
Sites which are under a functional/structural constraint are conserved, and evolve slowly.
Conservation in an MSA
S1  KITAYCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN
S2  MPFERCELARTLKRMADADIRGVSLANWVCLAKWFWDGG
S3  MPFERCELARTLKRMMDADIRGVSLANWVCLAKWFWDGG
From the MSA (and the tree), one can determine how conserved a gene is.
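One simple way to quantify this, sketched here as an assumption rather than the method used in the course, is to score each alignment column by the fraction of sequences that share its most common residue. The three sequences are taken from the slide and are assumed to be aligned column by column exactly as written.

```python
from collections import Counter

msa = {
    "S1": "KITAYCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN",
    "S2": "MPFERCELARTLKRMADADIRGVSLANWVCLAKWFWDGG",
    "S3": "MPFERCELARTLKRMMDADIRGVSLANWVCLAKWFWDGG",
}


def column_conservation(sequences):
    """For each column, the fraction of sequences sharing the most common residue."""
    scores = []
    for column in zip(*sequences):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


scores = column_conservation(list(msa.values()))

# 1-based positions where all three sequences agree.
fully_conserved = [i + 1 for i, s in enumerate(scores) if s == 1.0]
print("fully conserved positions:", fully_conserved)
print("mean conservation:", round(sum(scores) / len(scores), 2))
```

The fully conserved positions form two runs (CELART and GVSLANWVCLAKW), which under the reasoning above would be the candidate functionally or structurally constrained regions.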
“Inverse relation between evolutionary rate and age of mammalian genes”: Protocol
Step 1 - BLAST Build the dataset of mammalian genes
Step 1 – BLAST: build the dataset of mammalian genes, based on mouse-human ortholog pairs
• The orthologs are defined as pairs of reciprocal BLAST hits.
• Eliminate genes with more than one potential orthologous sequence.
• Select only genes for which the human protein was functionally annotated.
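The reciprocal-hit criterion can be sketched as follows. The input format (lists of (subject, score) pairs per query), the gene identifiers, and the rule of discarding a query whose top score is tied are all assumptions made for illustration; the slides do not specify how ambiguous cases were detected.

```python
def best_hit(hits):
    """Return the top-scoring subject, or None if there is no unique top hit."""
    if not hits:
        return None
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: more than one potential ortholog
    return ranked[0][0]


def reciprocal_best_hits(human_vs_mouse, mouse_vs_human):
    """Pairs (human_gene, mouse_gene) that are each other's best BLAST hit."""
    pairs = []
    for human_gene, hits in human_vs_mouse.items():
        mouse_gene = best_hit(hits)
        if mouse_gene is None:
            continue
        if best_hit(mouse_vs_human.get(mouse_gene, [])) == human_gene:
            pairs.append((human_gene, mouse_gene))
    return pairs


# Hypothetical BLAST results: query -> [(subject, bit score), ...].
human_vs_mouse = {
    "hA": [("mA", 250.0), ("mB", 90.0)],
    "hB": [("mB", 180.0)],
    "hC": [("mA", 120.0)],
}
mouse_vs_human = {
    "mA": [("hA", 245.0), ("hC", 118.0)],
    "mB": [("hB", 175.0), ("hA", 88.0)],
}

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

Here hC is dropped because its best mouse hit, mA, points back to hA rather than to hC.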
Step 2 – Calculate Evolutionary Rates (Conservation)
For each orthologous pair:
• Alignment at the amino acid level.
• Measure evolutionary rate.
The dataset contained 6,776 human-mouse gene pairs.
Step 3 – Assignment of Temporal Categories How old is each gene? Used BLAST to find homologs in 6 different eukaryotic genomes
• Schizosaccharomyces pombe
• Takifugu rubripes
• Caenorhabditis elegans
• Drosophila melanogaster
• Arabidopsis thaliana
• Saccharomyces cerevisiae
What is Old?
• Presence of any homolog in all 6 genomes.
What is Presence?
• A BLAST hit with an e-value cutoff of 10^-4.
Temporal categories (figure labels): OLD, METAZOANS, DEUTEROSTOMES, TETRAPODS
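The "old" rule can be written down as a short sketch. The genome list and the 10^-4 cutoff come from the slides; the input format, the example e-values, and treating an e-value exactly equal to the cutoff as present are assumptions. The rules for the other categories are not given in the text, so they are not modelled here.

```python
E_VALUE_CUTOFF = 1e-4

GENOMES = [
    "Schizosaccharomyces pombe",
    "Takifugu rubripes",
    "Caenorhabditis elegans",
    "Drosophila melanogaster",
    "Arabidopsis thaliana",
    "Saccharomyces cerevisiae",
]


def is_present(e_value, cutoff=E_VALUE_CUTOFF):
    """A homolog counts as present if the best BLAST hit passes the cutoff."""
    return e_value is not None and e_value <= cutoff


def is_old(best_e_values):
    """A gene is 'old' if a homolog is present in all six genomes.

    best_e_values -- dict mapping genome name -> best e-value (None if no hit)
    """
    return all(is_present(best_e_values.get(genome)) for genome in GENOMES)


# Hypothetical best e-values for two genes (illustration only).
gene_with_hits_everywhere = {genome: 1e-30 for genome in GENOMES}
gene_missing_in_yeast = dict(gene_with_hits_everywhere)
gene_missing_in_yeast["Saccharomyces cerevisiae"] = 0.5

print(is_old(gene_with_hits_everywhere))
print(is_old(gene_missing_in_yeast))
```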