Phylogenetic Shadowing
Daniel L. Ong
Abstract
• The human genome contains about 3 billion base pairs
• Algorithms that analyze these sequences must run in time linear in the sequence length to be tractable
• Finding genes matters to molecular biologists as a first step toward understanding the genome
RUGS, UC Berkeley
Outline
• Introduction
• Alignments
• Phylogenetic trees
• Sequence models
• Example: mRNA and scRNA models
• Conclusions
Introduction to Biosequences
• DNA has 4 nucleotides: A pairs with T; G pairs with C
• In RNA, U replaces T
• The NIH GenBank holds 188 GB of sequence data; UC Santa Cruz holds another 128 GB
• The central dogma: http://web.mit.edu/esgbio/www/dogma/dogma.html
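The pairing rules above can be sketched in a few lines of Python. This is only an illustration: the function names `reverse_complement` and `transcribe` are made up for this example and do not come from the slides.

```python
# Watson-Crick pairing as listed on the slide: A<->T and G<->C.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.upper().translate(DNA_COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Write a DNA coding-strand sequence as RNA by replacing T with U."""
    return dna.upper().replace("T", "U")
```

For instance, `reverse_complement("ATGC")` pairs each base and reverses the result, while `transcribe("ATGC")` only swaps T for U.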
Alignments
• Alignment: given two sequences, insert gaps and allow mismatches so that a cost function is minimized
• Similar to edit distance
• Generalizes to n sequences
• Exploited to predict genes
• Protein-coding genes show greater similarity across species
• Base-paired positions in structural RNA genes mutate as a pair
http://hanuman.math.berkeley.edu/kbrowser (Chakrabarti & Pachter, 2004)
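A minimal sketch of pairwise global alignment by dynamic programming, assuming a constant cost per mismatch and per gap (the slide does not say which cost function is used). With both costs set to 1 it reduces to the edit distance mentioned above. The name `alignment_cost` and its parameters are illustrative.

```python
def alignment_cost(a: str, b: str, mismatch: int = 1, gap: int = 1) -> int:
    """Minimum total cost of globally aligning sequences a and b."""
    # prev[j] holds the cost of aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix needs i gaps
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] is aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] is aligned to a gap
            curr.append(min(substitute, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]
```

This version uses time proportional to the product of the two lengths and keeps only one row of the table at a time; recovering the alignment itself, rather than its cost, would require keeping traceback pointers.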
Multiple alignment
• Considering multiple sequences lets us apply the comparative genomics paradigm
• Functionally important regions of the genome are more likely to be conserved across species
• The converse is also true: conserved regions are more likely to be functionally important
• Genomes should be closely related
• About 5-7 species of a family (Boffelli et al., 2003)
• Additional genomes increase sensitivity (true positive rate) and decrease specificity (true negative rate)
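One simple way to see conservation in a multiple alignment is to score each column by how many sequences agree. The sketch below is a crude illustration, not the statistic used by Boffelli et al.; it assumes equal-length rows with `-` marking gaps, and `column_conservation` is a hypothetical name.

```python
from collections import Counter


def column_conservation(rows: list[str]) -> list[float]:
    """Per column, the fraction of sequences that carry the most common base.

    Gaps ('-') never count as agreement, so a gappy column scores lower.
    """
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    scores = []
    for column in zip(*rows):
        counts = Counter(base for base in column if base != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(rows))
    return scores
```

Columns that score near 1.0 across closely related species are candidates for conserved, and therefore possibly functional, positions.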
Phylogenetic Trees (Durbin et al., 1998)
• Use a directed binary tree to track the relationships between organisms
• Each node represents the nucleotide at a particular position in an aligned sequence
• Current organisms are the leaves of the tree (observed)
• Internal nodes are common ancestors (unobserved)
• Edges represent lineages between speciation events and carry an "evolutionary distance" as an extra parameter
• Assume each nucleotide evolves independently (site-independent evolution)
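A minimal sketch of how a site-independent model can assign a probability to one alignment column. It assumes the Jukes-Cantor substitution model and Felsenstein's pruning recursion, neither of which is named on the slide. The tree encoding (a leaf is a base; an internal node is a tuple `(left, left_distance, right, right_distance)`) and all function names are made up for this example.

```python
import math

BASES = "ACGT"


def jc_prob(x: str, y: str, t: float) -> float:
    """Jukes-Cantor probability that base x becomes y over distance t
    (t is measured in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def partial_likelihoods(node) -> dict:
    """Map each base b to P(observed leaves below node | node has base b)."""
    if isinstance(node, str):  # leaf: the base is observed
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    below_left = partial_likelihoods(left)
    below_right = partial_likelihoods(right)
    return {
        b: sum(jc_prob(b, c, t_left) * below_left[c] for c in BASES)
        * sum(jc_prob(b, c, t_right) * below_right[c] for c in BASES)
        for b in BASES
    }


def column_likelihood(tree) -> float:
    """Probability of one alignment column, with a uniform prior at the root."""
    return sum(0.25 * p for p in partial_likelihoods(tree).values())
```

Because sites are assumed independent, the probability of a whole alignment under this sketch would be the product of `column_likelihood` over its columns. A conserved column such as `("A", 0.1, "A", 0.1)` scores higher than a mismatched one such as `("A", 0.1, "C", 0.1)`.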
Phylogenetic Tree
• Site-independent model computes the probability of independent columns
• Used for protein-coding genes
• Pairwise site-dependent model computes the probability of base-paired columns
• Used for scRNA genes
Image credit: Marty Yanofsky, http://www-biology.ucsd.edu/labs/yanofsky/images/mads/phylogenetic%20tree.jpg
How to find a Phylogenetic Tree?
• Given n sequences, we want to find the correct tree topology
• Exhaustive search works only for small n, because the number of topologies grows very quickly with n
• Maximum likelihood: choose the tree that maximizes the probability of the alignment
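To see why search only works for small n, it helps to count candidates. The number of distinct rooted binary topologies on n labeled leaves is the double factorial (2n - 3)!! = 3 x 5 x ... x (2n - 3); this is a standard combinatorial result, not something stated on the slide, and `rooted_topologies` is an illustrative name.

```python
def rooted_topologies(n: int) -> int:
    """Number of rooted binary tree topologies with n labeled leaves."""
    if n < 1:
        raise ValueError("need at least one leaf")
    count = 1
    for odd in range(3, 2 * n - 2, 2):  # the odd numbers 3, 5, ..., 2n - 3
        count *= odd
    return count
```

Five species give 105 candidate trees, while ten species already give more than 34 million, so scoring every topology by maximum likelihood quickly becomes impractical.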
Biosequence analysis
• Phylogenetic trees encapsulate evolutionary time across sequences
• Sequence model predicts changes along the length of a particular sequence
• Sequence models are typically hidden Markov models (HMMs)
Example: mRNA genes
• Suppose we want to identify coding genes with an HMM
• Exon: DNA segment that is transcribed and retained in the mature mRNA
• Have states in the HMM corresponding to exon regions (Alexandersson et al., 2003)
• Other types of RNA that are transcribed from DNA but not translated into protein are noncoding
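A toy sketch of labeling a sequence with an HMM by Viterbi decoding. It is far simpler than the generalized pair HMM of Alexandersson et al.: there are only two states, and every probability below is an invented value chosen so that "exon" prefers G/C and "intergenic" prefers A/T. All names and numbers are assumptions made for illustration.

```python
import math

# Hypothetical two-state model; none of these numbers come from real data.
STATES = ("exon", "intergenic")
START = {"exon": 0.5, "intergenic": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intergenic": 0.1},
    "intergenic": {"exon": 0.1, "intergenic": 0.9},
}
EMIT = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "intergenic": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return the most probable state path for the observed sequence."""
    if not obs:
        return []
    # Work in log space so long sequences do not underflow.
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        pointers, new_score = {}, {}
        for s in states:
            best = max(states, key=lambda r: score[r] + math.log(trans[r][s]))
            pointers[s] = best
            new_score[s] = (
                score[best] + math.log(trans[best][s]) + math.log(emit[s][symbol])
            )
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

On the input `"AATTGCGCGC"` this toy model labels the AT-rich prefix as intergenic and the GC-rich suffix as exon. The running time is linear in the sequence length for a fixed number of states, which matches the tractability requirement in the abstract.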
Structural RNA (scRNA)
• A sequence with many self-binding sites, forming a stable structure
• Implicated in regulating critical biochemical pathways
Image credit: Michael W. King, http://www.indstate.edu/thcme/mwking/trna.gif
Example: Structural RNA (Chakrabarti & Ong, 2004)
• Because of the semi-palindromic structure, the sequence model would be a probabilistic context-free grammar (PCFG)
• Violates the site-independence assumption of phylogenetic trees
• Modify the tree model to allow pairwise site dependencies in addition to non-matches
• Gene length can be in the thousands
• Limit the length of an scRNA to a constant L; time O(L^3 + N*L^2), where N = length of the multiple alignment
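To give a feel for where a cubic term in L can come from, here is a sketch of Nussinov-style base-pair maximization. This is a simpler, non-probabilistic relative of PCFG parsing, swapped in for illustration; it is not the model of Chakrabarti & Ong. It assumes only Watson-Crick pairs and a minimum hairpin loop of `min_loop` unpaired bases; the names are hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in a single RNA sequence."""
    n = len(rna)
    # best[i][j] = maximum number of pairs within rna[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

The three nested loops over `span`, `i`, and `k` make the running time cubic in the sequence length, which is why capping the scRNA length at a constant L keeps a scan along a long alignment tractable.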
Example completed (Chakrabarti & Ong, 2004)
• Can combine the HMM and the PCFG to form a supermodel
• Use a generic framework to identify mRNA, scRNA, and other regions
Phylogenetic shadowing
• Use a multiple alignment of several closely related genomes
• Analysis of the data becomes more reliable (Boffelli et al., 2003)
• More genomes reduce the probability of false positives
• Still need closely related species to decrease the chance of false negatives
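The false positive and false negative terms above map onto sensitivity and specificity as follows. This is a generic sketch using the standard definitions; the function name and the boolean per-position labels are assumptions made for the example.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for per-position boolean labels.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    A rate is NaN when its denominator is zero.
    """
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

Fewer false positives raise specificity, and fewer false negatives raise sensitivity.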
Conclusions
• Phylogenetic shadowing uses a multiple alignment to analyze several genomes simultaneously, which makes the analysis more reliable
• AI techniques have proven useful in computational biology
• Many problems remain to be solved
References
• M. Alexandersson, S. Cawley, and L. Pachter. "SLAM: Cross-Species Gene Finding and Alignment with a Generalized Pair Hidden Markov Model." Genome Research 13 (2003), p. 496-502. http://www.genome.org/cgi/content/abstract/13/3/496
• D. Boffelli, J. McAuliffe, D. Ovcharenko, K.D. Lewis, I. Ovcharenko, L. Pachter, and E.M. Rubin. "Phylogenetic shadowing of primate sequences to find functional regions of the human genome." Science 299 (2003), p. 1391-1394. http://www.sciencemag.org/cgi/content/short/299/5611/1391
• K. Chakrabarti and D.L. Ong. "Computational Identification of Noncoding RNA Genes through Phylogenetic Shadowing." ACM/ISCB RECOMB 8 (2004), poster. http://recomb04.sdsc.edu/posters/kushalcATuclink.berkeley.edu_168.pdf
• K. Chakrabarti and L. Pachter. "Visualization of multiple genome annotations and alignments with the K-BROWSER." Genome Research 14 (2004), p. 716-720. http://www.genome.org/cgi/content/abstract/14/4/716
• R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. "Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids." New York: Cambridge University Press, 1998.