Building Trees
What is a Tree? • A tree is a visualization of the mathematical analysis of a comparison of characteristics in multiple individuals or species. The multiples can also be tissues or developmental stages in the case of microarrays. • The closer branches share more similarities and the more distant branches are less similar.
Phylogeny (phylo = tribe + genesis) 1. Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.) 2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest
Start with a group of species and establish relationships based on measurements: birds, snakes, rodents, primates, crocodiles, marsupials, lizards
[Figure: tree with tips crocodiles, birds, lizards, snakes, rodents, primates, marsupials] This is an example of a phylogenetic tree.
Homology & Similarity • Homology • Conserved sequences arising from a common ancestor • Orthologs: homologous genes that share a common ancestor in the absence of any gene duplication (Mouse and Human Hemoglobin) • Paralogs: genes related through gene duplication (one gene is a copy of another - Fetal and Adult Hemoglobin) • Similarity • Genes that share common sequences but are not necessarily related
Sequences As Modules • Proteins are derived from a limited number of basic building blocks (Modules) • Evolution has shuffled these modules, giving rise to a diverse repertoire of protein sequences • Proteins can share a global relationship, or a local relationship specific to a single DOMAIN [Figure: global vs. local relationships]
Sequence Domains Modules Define Functional/Structural Domains
Defining A Sequence Family Family B Family E Family D Family A Family C
Global vs. Local Alignments • Global • Search for alignments, matching over entire sequences • Local • Examine regions of sequence for conserved segments • Both Consider: Matches, Mismatches, Gaps
Global Sequence Alignments Yeast Prion-Like Proteins
How To Make A Global MSA • On The Web • http://pir.georgetown.edu/pirwww/search/multaln.html • On Your Computer • ClustalX: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/
MSA Example Sequences (Standard FASTA Sequence Format)
>KSYK_HUMAN
FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVAHGRKAHHYTIERELNGTYAIAGGRTHASPADLCHYH
>ZA70_HUMAN
WYHSSLTREEAERKLYSGAQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYL
>KSYK_PIG
WFHGKISRDESEQIVLIGSKTNGKFLIRARDNGSYALGLLHEGKVLHYRIDKDKTGKLSIPGGKNFDTLWQLVEHY
>MATK_HUMAN
WFHGKISGQEAVQQLQPPEDGLFLVRESARHPGDYVLCVSFGRDVIHYRVLHRDGHLTIDEAVFFCNLMDMVEHY
>CSK_CHICK
WFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCEGKVEHYRIIYSSSKLSIDEEVYFENLMQLVEHY
>CRKL_HUMAN
WYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVLSVSENSRVSHYIINSLPNRRFKIGDQEFDHLPALLEFY
>YES_XIPHE
WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKLDNGGYYITTRTQFMSLQMLVKHY
>FGR_HUMAN
WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKLDMGGYYITTRVQFNSVQELVQHY
>SRC_RSVP
WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKLYSGGFYITSRTQFGSLQQLVAYY
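A minimal sketch of reading the FASTA format shown on this slide, assuming each record is a `>` header line followed by one or more sequence lines. The function name `parse_fasta` is illustrative and not part of any tool named in these slides.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header line: everything after '>'
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence may span several lines
    return records


# Two of the example records from the slide.
example = """>KSYK_PIG
WFHGKISRDESEQIVLIGSKTNGKFLIRARDNGSYALGLLHEGKVLHYRIDKDKTGKLSIPGGKNFDTLWQLVEHY
>CSK_CHICK
WFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCEGKVEHYRIIYSSSKLSIDEEVYFENLMQLVEHY
"""
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```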
MSA Example Result
YES_XIPHE   WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKL
FGR_HUMAN   WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKL
SRC_RSVP    WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKL
MATK_HUMAN  WFHGKISGQEAVQQLQPPED--GLFLVRESARHPGDYVLCVS-----FGRDVIHYRVLHR
CSK_CHICK   WFHGKITREQAERLLYPPET--GLFLVRESTNYPGDYTLCVS-----CEGKVEHYRIIYS
CRKL_HUMAN  WYMGPVSRQEAQTRLQGQRH--GMFLVRDSSTCPGDYVLSVS-----ENSRVSHYIINSL
ZA70_HUMAN  WYHSSLTREEAERKLYSGAQTDGKFLLRPRK-EQGTYALSLI-----YGKTVYHYLISQD
KSYK_PIG    WFHGKISRDESEQIVLIGSKTNGKFLIRAR--DNGSYALGLL-----HEGKVLHYRIDKD
KSYK_HUMAN  FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVA-----HGRKAHHYTIERE
            :: . : :: : * :*:* * : * : ** :

YES_XIPHE   DNGGYYITTRTQFMSLQMLVKHY
FGR_HUMAN   DMGGYYITTRVQFNSVQELVQHY
SRC_RSVP    YSGGFYITSRTQFGSLQQLVAYY
MATK_HUMAN  -DGHLTIDEAVFFCNLMDMVEHY
CSK_CHICK   -SSKLSIDEEVYFENLMQLVEHY
CRKL_HUMAN  PNRRFKIGDQE-FDHLPALLEFY
ZA70_HUMAN  KAGKYCIPEGTKFDTLWQLVEYL
KSYK_PIG    KTGKLSIPGGKNFDTLWQLVEHY
KSYK_HUMAN  LNGTYAIAGGRTHASPADLCHYH
            * . : .
Steps to Build Trees from MSA 1) identify taxa to be considered 2) choose characters (independent, “unit”) 3) construct character matrix for each taxon 4) After performing alignment, use mathematical formula to describe degree of similarity for each taxon, e.g. simple matching coefficient: S = (# matches) / (total # of characters)
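The simple matching coefficient from step 4 can be sketched in a few lines. This assumes each taxon's characters are stored as a string of equal length, one symbol per character; the example strings are hypothetical, not data from the slides.

```python
def simple_matching(a, b):
    """S = (# matches) / (total # of characters) for two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("character strings must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical 0/1 character states for two taxa (three of four characters agree).
s = simple_matching("1101", "1001")
print(s)
```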
Steps to Build Trees 5) construct matrix with pairwise S values 6) use clustering technique to produce a tree (dendrogram) • Unweighted/Equal weighting = all characters given equal consideration • UPGMA (Unweighted Pair Group Method with Arithmetic Averaging) • Neighbour-joining • Unweighting is a form of weighting
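Step 5, the matrix of pairwise S values, might look like the sketch below. The character matrix is hypothetical, and `pairwise_s` is an illustrative name; the S value is the simple matching coefficient, with every character weighted equally.

```python
from itertools import combinations


def simple_matching(a, b):
    """Fraction of positions at which two equal-length character strings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("character strings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_s(matrix):
    """Map each unordered pair of taxa to its S value."""
    return {
        (x, y): simple_matching(matrix[x], matrix[y])
        for x, y in combinations(sorted(matrix), 2)
    }


# Hypothetical character matrix: one row of 0/1 character states per taxon.
characters = {"A": "11010", "B": "10011", "C": "00011", "D": "11000"}
s_values = pairwise_s(characters)
for pair, s in s_values.items():
    print(pair, s)
```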
Building Matrices Character Matrix S-value Matrix
Joining Clusters into a Tree • Closest: A&D = 0.7 • 2nd closest: B&C = 0.5 • When does A&D join B&C? At the average of the four between-cluster values: [(A&B) + (A&C) + (D&B) + (D&C)] / 4 = (0.3 + 0.4 + 0.4 + 0.3)/4 = 0.35
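The joining order on this slide can be reproduced with a small UPGMA-style sketch. It assumes clustering runs directly on similarities (join the most similar pair first, then average between-cluster values weighted by cluster size); many implementations work on distances instead. The function name `upgma` and the nested-tuple output are illustrative choices.

```python
from itertools import combinations


def upgma(sim):
    """Cluster by average similarity.

    sim maps frozenset({x, y}) -> similarity for every pair of leaf labels.
    Returns a list of (cluster, similarity_at_join) in joining order.
    """
    sim = dict(sim)
    leaves = sorted({leaf for pair in sim for leaf in pair})
    sizes = {leaf: 1 for leaf in leaves}  # cluster -> number of leaves in it
    joins = []
    while len(sizes) > 1:
        # Pick the most similar pair of current clusters.
        a, b = max(combinations(sizes, 2), key=lambda p: sim[frozenset(p)])
        level = sim[frozenset((a, b))]
        new = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Size-weighted arithmetic average to every remaining cluster.
        for c in sizes:
            sim[frozenset((new, c))] = (
                size_a * sim[frozenset((a, c))] + size_b * sim[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[new] = size_a + size_b
        joins.append((new, level))
    return joins


# S values from the slide.
s_values = {
    frozenset("AD"): 0.7, frozenset("BC"): 0.5,
    frozenset("AB"): 0.3, frozenset("AC"): 0.4,
    frozenset("DB"): 0.4, frozenset("DC"): 0.3,
}
joins = upgma(s_values)
for cluster, level in joins:
    print(cluster, round(level, 2))
```

A&D joins first, B&C second, and the two clusters join at the average of the four between-cluster values.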
Problems • Different methods or characters = different dendrograms • If we use all possible characteristics this would be a natural classification • The tree is an accurate phylogeny if differences in characters between taxa are proportional to the time elapsed since their common ancestor
Convergent Evolution • Similar phenotypic response to similar ecological conditions • Different developmental pathways
Reversal of Evolution • An altered character reverts to the ancestral form. • In a DNA molecule, a nucleotide position may change from a C to a T and then back to a C. This frog reverted to teeth.
Trees are hypotheses about evolutionary history • Different methods may result in different trees. • How to choose between the different models? • One way is to compare different types of character data and see if the trees make sense.
Haplotype Network in 3 Elephant Species with 3 DNA sequences
Parsimonious choices reflect fewer changes • The assumptions of parsimony • Reversals and convergence require more changes • Parsimonious trees represent best estimates of phylogenetic relationships
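One way to count the changes a tree requires is Fitch's small-parsimony algorithm, which these slides do not name; it is used here only as an illustration. The sketch assumes a rooted, bifurcating tree written as nested 2-tuples and a single character with hypothetical states.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes).

    tree is a leaf name (str) or a 2-tuple of subtrees;
    states maps each leaf name to its observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra change at this node.
        return common, left_changes + right_changes
    # Children disagree: one change is needed somewhere below this node.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical nucleotide states at one alignment column.
states = {"A": "C", "B": "C", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
changes_1 = fitch(tree_1, states)[1]
changes_2 = fitch(tree_2, states)[1]
print(changes_1, changes_2)
```

Under parsimony the tree needing fewer changes (here `tree_1`) is preferred; the second tree would require a convergence or reversal to explain the same data.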
Use of DNA, RNA, or Protein • For phylogeny, DNA can be more informative. • The protein-coding portion of DNA has synonymous and nonsynonymous substitutions. • Some DNA changes do not have corresponding protein changes • See arrows 14, 21, 25, 27, 29 in the retinol-binding protein figure.
For phylogeny, DNA can be more informative. • If the synonymous substitution rate (dS) is greater than the nonsynonymous substitution rate (dN), the DNA sequence is under negative (purifying) selection. • This limits change in the sequence. • If dS < dN, positive selection occurs. • For example, a duplicated gene may evolve rapidly to assume new functions.
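The dS versus dN rule above can be written as a small decision function. The rates passed in are hypothetical numbers, and treating dS = dN as neutral is an added assumption that the slide does not state.

```python
def selection_regime(dn, ds):
    """Classify selection from nonsynonymous (dN) and synonymous (dS) rates."""
    if ds > dn:
        return "negative (purifying) selection"
    if ds < dn:
        return "positive selection"
    return "neutral"  # assumption: equal rates are read as no net selection


# Hypothetical rates, for illustration only.
print(selection_regime(dn=0.02, ds=0.30))
print(selection_regime(dn=0.45, ds=0.15))
```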
Models of nucleotide substitution: Transitions > Transversions [Figure: A ↔ G and C ↔ T are transitions; changes between a purine (A, G) and a pyrimidine (C, T) are transversions]
Some substitutions in a DNA sequence alignment can be directly observed: • single nucleotide substitutions • sequential substitutions • coincidental substitutions
Additional mutational events can be inferred by analysis of ancestral sequences. These changes include • parallel substitutions • convergent substitutions • back substitutions
Advantages of DNA • Noncoding regions (such as 5’ and 3’ untranslated regions) may be analyzed using molecular phylogeny. • See Figure 11.10 (arrows 4-10 and 35-38) • Pseudogenes (nonfunctional genes) are studied by molecular phylogeny • Rates of transitions and transversions can be measured. • Transitions: purine to purine (A to G) or pyrimidine to pyrimidine (C to T) substitutions • Transversions: purine to pyrimidine, or pyrimidine to purine
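Counting transitions and transversions between two aligned DNA sequences can be sketched as follows. The sequences are hypothetical, and positions containing gaps or characters other than A, C, G, T are simply skipped, which is a simplifying assumption.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1, seq2):
    """Return (transitions, transversions) between two aligned sequences."""
    transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue  # identical, gapped, or ambiguous position
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return transitions, transversions


# Hypothetical aligned sequences: A>G is a transition; A>C and G>T are transversions.
ts, tv = count_ts_tv("AACGT", "GCCTT")
print(ts, tv)
```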
Protein sequences are also used for phylogeny • Proteins have 20 states (amino acids) instead of only four for DNA, so there is more phylogenetic information. • Nucleotides are unordered characters: any one nucleotide can change to any other in one step. • An ordered character must pass through one or more intermediate states before reaching the final state. • Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value.
Amino acid sequences • From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation while others require two or even three changes in the DNA sequence • Some amino acids can replace one another with relatively little effect on the structure and function of the final protein while other replacements can be functionally devastating • Tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks are used: PAM and BLOSUM
Sequence-Based Comparisons • Identify sequences within an organism that are related to each other and/or across different species • Within: Fetal and adult hemoglobin • Across: Human and chimpanzee hemoglobin • Generate an evolutionary history of related genes • Locate insertions, deletions, and substitutions that have occurred during evolution
[Figure: example sequences CREATE, CREASE, -RELAPSE, and GREASER, labeled as ancestor and progenitors. Residue key: (C) Cysteine, (R) Arginine, (E) Glutamate, (A) Alanine, (T) Threonine, (S) Serine, (L) Leucine, (P) Proline, (G) Glycine]
Multiple Sequence Alignments • Place residues in columns that are derived from a common ancestral residue • Identify Matches, Mismatches, and Gaps • MSA can reveal sequence patterns • Demonstration of homology between >2 sequences • Identification of functionally important sites • Protein function prediction • Structure prediction
Unaligned sequences: CREASE, CREATE, RELAPSE, GREASER
SeqA  CRE-A-TE-
SeqB  CRE-A-SE-
SeqC  GRE-A-SER
SeqD  -RELAPSE-
      123456789
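Labeling each column of the example alignment as a match, mismatch, or gap can be sketched like this. The rule used (any `-` makes the column a gap column, otherwise all-identical is a match) is one simple convention chosen for illustration.

```python
def classify_columns(alignment):
    """Label each alignment column as 'gap', 'match', or 'mismatch'."""
    labels = []
    for column in zip(*alignment.values()):
        if "-" in column:
            labels.append("gap")
        elif len(set(column)) == 1:
            labels.append("match")
        else:
            labels.append("mismatch")
    return labels


# The four aligned sequences from the slide.
alignment = {
    "SeqA": "CRE-A-TE-",
    "SeqB": "CRE-A-SE-",
    "SeqC": "GRE-A-SER",
    "SeqD": "-RELAPSE-",
}
labels = classify_columns(alignment)
for position, label in enumerate(labels, start=1):
    print(position, label)
```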
MSA and Tree Relationship • “The optimal alignment of several sequences can be thought of as minimizing the number of mutational steps in an evolutionary tree for which the sequences are the leaves” (Mount, 2001)
[Figure: evolutionary tree whose leaves are the aligned sequences CRE-A-TE- (SeqA), CRE-A-SE- (SeqB), GRE-A-SER (SeqC), and -RELAPSE- (SeqD), with intermediate sequences CREATE, CREASE, and GREASE and branches labeled with mutational steps: T to S, C to G, +R, +L, +P, -G]
Multiple Sequence Alignments • Confirm that all sequences are homologous • Adjust gap creation and extension penalties as needed to optimize the alignment • Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data). • Many experts recommend that you delete any column of an alignment that contains gaps (even if the gap occurs in only one taxon)
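Deleting every column that contains a gap, as recommended above, can be sketched as follows. The input is the CREATE/CREASE example alignment used earlier in these slides; `drop_gap_columns` is an illustrative name.

```python
def drop_gap_columns(alignment):
    """Remove every column in which at least one sequence has a gap ('-')."""
    columns = list(zip(*alignment.values()))
    keep = [i for i, column in enumerate(columns) if "-" not in column]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


alignment = {
    "SeqA": "CRE-A-TE-",
    "SeqB": "CRE-A-SE-",
    "SeqC": "GRE-A-SER",
    "SeqD": "-RELAPSE-",
}
trimmed = drop_gap_columns(alignment)
for name, seq in trimmed.items():
    print(name, seq)
```

Only the five columns with data for all four sequences survive, which shows how much signal strict gap removal can discard on a short alignment.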
Problems in Reconstructing Phylogeny • Characters sometimes conflict • It is sometimes difficult to tell homology from homoplasy • Analogy: characters similar because of convergent evolution • Reversal: character reverts to ancestral form • With morphological characters, careful examination may distinguish homoplasy (analogy) from homology • With molecular characters (DNA/protein sequences), homoplasy is sometimes impossible to distinguish from homology, and orthologs from paralogs.
A Phylogenetic Tree • Taxon -- Any named group of organisms – evolutionary theory not necessarily involved. • Clade -- A monophyletic taxon (evolutionary theory utilized)
A phylogenetic tree with branch lengths • Branch length can be significant • In this case it is: mouse is slightly more similar to fly than human is to fly (the sum of branches 1+2+3 is less than the sum of 1+2+4)
Common Phylogenetic Tree Terminology • Terminal Nodes (A, B, C, D, E): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny • Branches or Lineages • Internal Nodes or Divergence Points (represent hypothetical ancestors of the taxa) • Ancestral Node or ROOT of the Tree
Phylogenetic trees diagram the evolutionary relationships between the taxa
[Figure: tree with tips Taxon B, Taxon C, Taxon A, Taxon D, Taxon E] • There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom. • The other dimension (along the branches) either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
((A,(B,C)),(D,E)) = the above phylogeny as nested parentheses • These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
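The nested-parentheses notation can be read mechanically. The sketch below is a minimal parser that assumes plain taxon names with no spaces, branch lengths, or quoting (the full Newick format allows more); `parse_nested`, `leaves`, and `clades` are illustrative names.

```python
def parse_nested(text):
    """Parse a string like '((A,(B,C)),(D,E))' into nested tuples of names."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # consume ')'
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]  # a taxon name

    tree = node()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return tree


def leaves(tree):
    """All taxon names below a node, in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def clades(tree):
    """Every group of taxa that descends from one internal node."""
    if isinstance(tree, str):
        return []
    result = [sorted(leaves(tree))]
    for child in tree:
        result.extend(clades(child))
    return result


tree = parse_nested("((A,(B,C)),(D,E))")
groups = clades(tree)
for group in groups:
    print(group)
```

The printed groups include B with C, then A with B and C, and D with E, matching the relationships read off the parentheses by eye.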
Three types of trees • Cladogram: branch lengths have no meaning • Phylogram: branch lengths show genetic change • Ultrametric tree: branch lengths show time
[Figure: the same tree of Taxon A, Taxon B, Taxon C, and Taxon D drawn in each of the three styles] All show the same evolutionary relationships, or branching orders, between the taxa.
Types of trees: Cladogram (no time scale) • Shows relative recency of common descent. • Does not imply that ancestors on the same line necessarily speciated at the same time. • t1 can be before or after t2, but not before t3 [Figure: cladogram with nodes t1, t2, t3]
Types of trees: Phylogram (additive tree: branch lengths can be summed) • Shows relative recency of common descent, and divergence • Branch lengths = amount of change
Types of trees: Ultrametric tree (linearized tree) • All tree tips are equidistant from the root • Amount of change can be scaled to time • Scale = time
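Whether a tree with branch lengths is ultrametric can be checked by summing branch lengths from the root to each tip. The tree encoding below (an internal node is a list of `(child, branch_length)` pairs, a tip is a name) and both example trees are hypothetical.

```python
def tip_depths(tree, depth=0.0):
    """Return {tip name: summed branch length from the root}."""
    if isinstance(tree, str):
        return {tree: depth}
    depths = {}
    for child, length in tree:
        depths.update(tip_depths(child, depth + length))
    return depths


def is_ultrametric(tree, tol=1e-9):
    """True if all tips are equidistant from the root (within tol)."""
    depths = list(tip_depths(tree).values())
    return max(depths) - min(depths) <= tol


# Hypothetical trees with the same branching order but different branch lengths.
clock_like = [([("B", 1.0), ("C", 1.0)], 2.0), ("A", 3.0)]
additive = [([("B", 1.0), ("C", 3.0)], 1.0), ("A", 5.0)]
print(tip_depths(clock_like), is_ultrametric(clock_like))
print(tip_depths(additive), is_ultrametric(additive))
```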
The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees • Completely unresolved or "star" phylogeny • Partially resolved phylogeny • Fully resolved, bifurcating phylogeny
[Figure: three trees of taxa A to E showing each case, with a polytomy (multifurcation) and a bifurcation labeled]
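The three levels of resolution can be told apart by counting the children of each internal node. This sketch assumes rooted trees written as nested tuples; the three example topologies are hypothetical, since the original figure's trees are not recoverable from the text.

```python
def child_counts(tree):
    """Number of children at every internal node of a nested-tuple tree."""
    if isinstance(tree, str):
        return []
    counts = [len(tree)]
    for child in tree:
        counts.extend(child_counts(child))
    return counts


def resolution(tree):
    """Classify a rooted tree as fully resolved, star, or partially resolved."""
    counts = child_counts(tree)
    if all(n == 2 for n in counts):
        return "fully resolved"  # every internal node is a bifurcation
    if len(counts) == 1:
        return "star"  # a single polytomy joins all taxa
    return "partially resolved"  # at least one polytomy remains


# Hypothetical topologies for taxa A to E.
star = ("A", "B", "C", "D", "E")
partial = (("A", "B"), "C", ("D", "E"))
resolved = ((("A", "B"), "C"), ("D", "E"))
for tree in (star, partial, resolved):
    print(resolution(tree))
```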