Phylogenetic Shadowing

  1. Phylogenetic Shadowing Daniel L. Ong

  2. Abstract • The human genome contains about 3 billion base pairs • Algorithms that analyze these sequences must run in linear time to be tractable • Finding genes is important to molecular biologists: it is a first step toward understanding the genome RUGS, UC Berkeley

  3. Outline • Introduction • Alignments • Phylogenetic trees • Sequence models • Example: mRNA and scRNA models • Conclusions

  4. Introduction to Biosequences • DNA has four nucleotides (A, C, G, T): A pairs with T; G pairs with C • In RNA, U replaces T • The NIH GenBank has 188 GB of sequence data; UC Santa Cruz has another 128 GB • The central dogma: http://web.mit.edu/esgbio/www/dogma/dogma.html
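
The pairing rules on this slide can be sketched in a few lines of Python. This is only an illustration; the function names `reverse_complement` and `transcribe` are assumptions made for the example and do not come from the talk.

```python
# Watson-Crick pairing on DNA: A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def transcribe(coding_strand: str) -> str:
    """Return the RNA copy of the coding strand: same letters, U in place of T."""
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    print(reverse_complement("ATGCGT"))  # ACGCAT
    print(transcribe("ATGCGT"))          # AUGCGU
```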

  5. Alignments • Alignment: given two sequences, insert gaps and allow mismatches so that a cost function is minimized • Similar to edit distance • Generalizes to n sequences • Exploited to predict genes • Protein-coding genes show greater similarity across species • Base-paired positions in structural RNA genes mutate as a pair • http://hanuman.math.berkeley.edu/kbrowser (Chakrabarti & Pachter, 2004)
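
As a minimal sketch of the cost-minimization view of alignment, the dynamic program below computes the cheapest global alignment of two sequences. The unit gap and mismatch costs are assumptions chosen so that the result equals the edit distance; real aligners use tuned scoring schemes. Note that this textbook recurrence takes time proportional to the product of the two lengths.

```python
def alignment_cost(a: str, b: str, gap: int = 1, mismatch: int = 1) -> int:
    """Minimum total cost of globally aligning a and b.

    cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    With gap = mismatch = 1 this is the edit (Levenshtein) distance.
    """
    rows, cols = len(a) + 1, len(b) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # a[:i] aligned against nothing: all gaps
        cost[i][0] = i * gap
    for j in range(1, cols):          # b[:j] aligned against nothing: all gaps
        cost[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                cost[i - 1][j] + gap,               # gap in b
                cost[i][j - 1] + gap,               # gap in a
            )
    return cost[-1][-1]


if __name__ == "__main__":
    print(alignment_cost("ACGT", "AGT"))  # 1: one gap is enough
```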

  6. Multiple alignment • Considering multiple sequences allows us to leverage the comparative genomics paradigm • Functionally important regions of the genome are more likely to be conserved across species • The converse is also true • Genomes should be closely related • About 5-7 species of a family (Boffelli et al., 2003) • Additional genomes increase sensitivity (the true-positive rate) and decrease specificity (the true-negative rate)
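
To make "conserved across species" concrete, the sketch below scores each column of a multiple alignment by the fraction of sequences that share the most common base. This is a deliberately crude stand-in for illustration, not the likelihood-based method of Boffelli et al.; the function name and the toy alignment are assumptions.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[float]:
    """Per-column fraction of sequences carrying the most common base.

    Gap characters ('-') never count as the conserved base, but they do
    count in the denominator, so gappy columns score lower.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        bases = [c for c in column if c != "-"]
        top = Counter(bases).most_common(1)[0][1] if bases else 0
        scores.append(top / len(column))
    return scores


if __name__ == "__main__":
    toy = ["ACGT-A", "ACGTTA", "ACCTTA", "ACGATA"]  # four hypothetical species
    print(column_conservation(toy))
```

Columns that score near 1.0 across several closely related species would be candidates for functional regions under the comparative genomics argument on this slide.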

  7. Phylogenetic Trees [Durbin et al., 1998] • Use a directed binary tree to track the relationships between organisms • Each node represents the nucleotide at a particular position in an aligned sequence • Present-day organisms are the leaves of the tree (observed) • Internal nodes are common ancestors (unobserved) • Edges correspond to speciation events and carry an “evolutionary distance” as an extra parameter • Assume each nucleotide evolves independently (site-independent evolution)
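
Under the site-independence assumption, the probability of one alignment column can be computed by summing over the unobserved ancestral bases with Felsenstein's pruning recursion. The sketch below assumes the Jukes-Cantor substitution model and a uniform root distribution; the slide does not name a substitution model, so both are illustrative choices, as are the tree encoding and the species names.

```python
import math

BASES = "ACGT"


def jc_prob(x: str, y: str, t: float) -> float:
    """Jukes-Cantor probability that base x is replaced by y over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def partial(node, column: dict) -> dict:
    """Return {base: P(observed leaves below node | node carries base)}.

    A leaf is a species name (str); an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):
        return {b: 1.0 if b == column[node] else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    p_left, p_right = partial(left, column), partial(right, column)
    return {
        x: sum(jc_prob(x, y, t_left) * p_left[y] for y in BASES)
        * sum(jc_prob(x, z, t_right) * p_right[z] for z in BASES)
        for x in BASES
    }


def column_likelihood(tree, column: dict) -> float:
    """Probability of one column, with a uniform prior on the root base."""
    return sum(0.25 * p for p in partial(tree, column).values())


def alignment_log_likelihood(tree, columns: list) -> float:
    """Site independence: the log-likelihoods of the columns simply add up."""
    return sum(math.log(column_likelihood(tree, c)) for c in columns)


if __name__ == "__main__":
    tree = (("human", 0.01, "chimp", 0.01), 0.02, "baboon", 0.03)
    column = {"human": "A", "chimp": "A", "baboon": "G"}
    print(column_likelihood(tree, column))
```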

  8. Phylogenetic Tree • The site-independent model computes the probability of independent columns • Used for protein-coding genes • The pairwise site-dependent model computes the probability of base-paired columns • Used for scRNA genes • Image: Marty Yanofsky, http://www-biology.ucsd.edu/labs/yanofsky/images/mads/phylogenetic%20tree.jpg

  9. How to Find a Phylogenetic Tree? • Given n sequences, we want to find the correct tree topology • Exhaustive search works only for small n • Maximum likelihood: choose the tree that maximizes the probability of the alignment
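
A quick count shows why exhaustive search is limited to small n. The number of distinct rooted binary tree topologies on n labeled leaves is the double factorial (2n-3)!! = 1 · 3 · 5 · ... · (2n-3); the helper name below is an assumption made for the example.

```python
def rooted_topologies(n: int) -> int:
    """Number of rooted binary tree topologies on n labeled leaves: (2n-3)!!."""
    if n < 1:
        raise ValueError("need at least one leaf")
    count = 1
    for k in range(3, 2 * n - 2, 2):  # k = 3, 5, ..., 2n-3
        count *= k
    return count


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 10, 20):
        print(n, rooted_topologies(n))
```

Ten species already give tens of millions of candidate topologies, each of which would need its own likelihood evaluation, so larger n calls for heuristic search.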

  10. Biosequence analysis • Phylogenetic trees encapsulate evolutionary time across sequences • A sequence model predicts changes along the length of a particular sequence • Sequence models are typically HMMs

  11. Example: mRNA genes • Suppose we want to identify coding genes with an HMM • Exon: DNA segment that is transcribed and retained in the mature mRNA • The HMM has states corresponding to exon regions (Alexandersson et al., 2003) • Other types of RNA that are transcribed from DNA but not translated into protein are noncoding
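
The idea of HMM states that correspond to exon regions can be illustrated with a two-state model decoded by the Viterbi algorithm. This is a toy sketch, not the generalized pair HMM of Alexandersson et al.; every probability below is invented for the example (exons are assumed slightly GC-rich, and states are assumed to persist).

```python
import math

STATES = ("exon", "noncoding")
# All parameters are illustrative assumptions, not estimates from data.
START = {"exon": 0.5, "noncoding": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "noncoding": 0.1},
    "noncoding": {"exon": 0.1, "noncoding": 0.9},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq: str) -> list:
    """Most probable state path for seq, computed in log space."""
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            pointers[s] = best
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("GCGCGCGCATATATAT"))
```

Decoding takes time linear in the sequence length for a fixed number of states, which is the property the abstract asks for.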

  12. Structural RNA (scRNA) • A sequence with many self-binding sites, which fold into a stable structure • Implicated in regulating critical biochemical pathways • Image: Michael W. King, http://www.indstate.edu/thcme/mwking/trna.gif

  13. Example: Structural RNA [Chakrabarti & Ong, 2004] • Due to the semi-palindromic structure, the sequence model would be a PCFG • Violates the site-independent assumption of phylogenetic trees • Modify the tree model to allow pairwise site dependencies in addition to non-matches • Gene length can be in the thousands • Limit the length of an scRNA to a constant L; time O(L^3 + N*L^2), where N = length of the multiple alignment
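
To show where a cubic term in L comes from, here is a Nussinov-style dynamic program that maximizes the number of nested base pairs in a single RNA sequence. It is a simpler, non-probabilistic technique swapped in for illustration and is not the PCFG of the poster; the pairing set and the `min_loop` parameter are assumptions.

```python
# Watson-Crick pairs only; G-U wobble pairs are left out to keep the sketch small.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in rna.

    best[i][j] is the optimum for rna[i..j]. Paired positions must be
    separated by at least min_loop unpaired bases. Three nested loops
    over positions give O(L^3) time and O(L^2) space for length L.
    """
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]                       # case 1: i is unpaired
            for k in range(i + min_loop + 1, j + 1):     # case 2: i pairs with k
                if (rna[i], rna[k]) in PAIRS:
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCCC"))  # 3: a three-pair stem closing a hairpin loop
```

Capping the window at a constant L, as the slide suggests, keeps the fold computation bounded while the scan along the alignment stays linear in N.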

  14. Example completed [Chakrabarti & Ong, 2004] • The HMM and the PCFG can be combined to form a supermodel • Use a generic framework to identify mRNA, scRNA, and other regions

  15. Phylogenetic shadowing • Use a multiple alignment of several closely related genomes • Analysis of data becomes more reliable (Boffelli et al., 2003) • More genomes reduce the probability of false positives • Still need closely related species to decrease the chance of false negatives

  16. Conclusions • Phylogenetic shadowing uses a multiple alignment to analyze several genomes simultaneously, which makes the analysis more reliable • AI techniques have proven useful in computational biology • Still many more problems to solve

  17. References • M. Alexandersson, S. Cawley, and L. Pachter. “SLAM: Cross-Species Gene Finding and Alignment with a Generalized Pair Hidden Markov Model.” Genome Research 13 (2003), p. 496-502. http://www.genome.org/cgi/content/abstract/13/3/496 • D. Boffelli, J. McAuliffe, D. Ovcharenko, K.D. Lewis, I. Ovcharenko, L. Pachter, and E.M. Rubin. “Phylogenetic shadowing of primate sequences to find functional regions of the human genome.” Science 299 (2003), p. 1391-1394. http://www.sciencemag.org/cgi/content/short/299/5611/1391 • K. Chakrabarti and D.L. Ong. “Computational Identification of Noncoding RNA Genes through Phylogenetic Shadowing.” ACM/ISCB RECOMB 8 (2004), poster. http://recomb04.sdsc.edu/posters/kushalcATuclink.berkeley.edu_168.pdf • K. Chakrabarti and L. Pachter. “Visualization of multiple genome annotations and alignments with the K-BROWSER.” Genome Research 14 (2004), p. 716-720. http://www.genome.org/cgi/content/abstract/14/4/716 • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.” New York: Cambridge University Press, 1998.
