1 / 30

The modern RNA world:

The modern RNA world:. Not all genes encode proteins. Eddy lab HHMI/Washington University, Saint Louis. GeneSweep; or, Ewan’s definition of a gene. http://www.ensembl.org/Genesweep. Rule 2.

kassia
Download Presentation

The modern RNA world:

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The modern RNA world: Not all genes encode proteins. Eddy lab HHMI/Washington University, Saint Louis

  2. GeneSweep; or, Ewan’s definition of a gene http://www.ensembl.org/Genesweep Rule 2. “A gene is a set of connected transcripts.... A transcript is a set of exons.... one transcript must encode a protein (see footnotes).” Footnote 1. “We are restricting ourselves to protein coding genes to allow an effective assessment. RNA genes were considered too difficult to assess by 2003.”

  3. life with 6000 genes Life with 6000 Genes A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston, E.J. Louis, H.W. Mewes, Y. Murakami, P. Phillippsen, H. Tettelin, S.G. Oliver Science 274:546-567, 1996 but besides the ~6000 protein-coding genes, there’s also: 140 ribosomal RNA genes, 275 transfer RNA genes, 40 small nuclear RNA genes, ~100 small nucleolar RNA genes, ... ?

  4. Structure of the large ribosomal subunit Haloarcula marismortui Ban et. al., Science 289:905-920, 2000

  5. inside-out genes? Tycowski, Shu, and Steitz Nature 379:464-466, 1996 Human UHG (U22 host gene) no significant ORFs; not conserved with mouse; rapidly degraded Eight intron-encoded snoRNAs conserved with mouse; stable

  6. pRNA in f29 rotary packaging motor Simpson et al, Nature 408:745-750, 2000 “Structure of the bacteriophage f29 DNA packaging motor”

  7. Cartilage-hair hypoplasia mapped to an RNA Ridanpaa et al. Cell 104:195-203, 2001 RMRP: Human RNase MRP, 267 nt

  8. The human Prader-Willi critical region Cavaille et al., PNAS 97:14035-7, 2000

  9. RNA genes can be hard to detect UGAGGUAGUAGGUUGUAUAGU C. elegans let-7; 21 nt Pasquinelli et al. Nature 408:86-89, 2000 • often small • sometimes multicopy and redundant • often not polyadenylated (and remember EST libraries are poly-A selected) • immune to frameshift and nonsense mutation • no open reading frame or codon bias • often rapidly evolving in primary sequence

  10. The Altuvia screen Argaman et al., Current Biology 11:941-50, 2001 “Novel small RNA-encoding genes in the intergenic regions of E. coli” “Over a period of about 30 years, only four bona fide regulatory RNAs have been discovered in E. coli. Here we report on the discovery of 14 novel small RNA-encoding genes....” sraA 120 nt sraB 149-168 nt rprA 105 nt sraC 234-249 nt sraD 70 nt gcvB 205 nt sraE 88 nt sraF 189 nt sraG 146-174 nt sraH 88-108 nt sraI 91-94 nt sraJ 172 nt sraK 245 nt sraL 140 nt • start w/ “intergenic” regions • computational identification of putative promoter and terminator, 50-400 nt apart • select regions conserved with other bacteria by BLAST

  11. The Gottesman screen Wassarman et al., Genes Dev. 15:1637-51, 2001 “Identification of novel small RNAs using comparative genomics and microarrays” rydB 60 nt ryeE 86 nt ryfA 320 nt ryhA 45 nt (sraH) ryhB 90 nt (sraI) ryiA 210 nt ryjA 92 nt rybB 80 nt ryiB 270 nt (sraK, csrC) rybA 205 nt rygA 89 nt (sraE) rygB 83 nt ryeA 275 nt ryeB 100 nt ryeC 107,143 nt ryeD 102,137 nt rygC 107,139 nt “... a multifaceted search strategy to predict sRNA genes was validated by our discovery of 17 novel sRNAs....” • intergenic regions >= 180 nt • conserved w/ other bacteria by BLAST • manual inspection of location & sequence • expression detected on high-density oligo probe array

  12. Two computational analysis problems • Similarity search (e.g. BLAST): • I give you a query; you find sequences in a database that • look like the query. • For RNA, you want to take the secondary structure • of the query into account. • 2. Genefinding (e.g. GENSCAN): • Based solely on a priori knowledge of what a “gene” • looks like, find genes in a genome sequence. • For RNA – with no open reading frame and no codon • bias – what do you look for?

  13. RNA structure: nested pairwise correlations

  14. Context-free grammars pioneered in comp bio by David Searls Basic CFG “production rules” a CFG “derivation”

  15. HMM and SCFG algorithms R Durbin, SR Eddy, GJ Mitchison, A Krogh Biological Sequence Analysis:Probabilistic Models of Proteins and Nucleic Acids Cambridge Univ. Press, 1998 Goal optimal alignment P(sequence | model) EM parameter estimation memory complexity: time complexity (general): time complexity (as used): HMM algorithm Viterbi Forward Forward-Backward O(MN) O(M2N) O(MN) SCFG algorithm CYK Inside Inside-Outside O(MN2) O(M3N3) O(MN3) • we can analyze target sequences with secondary structure models; • but the algorithms are computationally expensive.

  16. SCFG-based RNA similarity search profile HMMs from A. Krogh, D. Haussler  “profile SCFGs”, or “covariance models” • COVE (Eddy and Durbin, 1994) structural profiles of RNA sequence families • tRNAscan-SE (Lowe and Eddy, 1997) fast prescreens + COVE model of tRNA – large scale tRNA detection • snoscan (Lowe and Eddy, 1998) C/D box small nucleolar RNA detection in yeast genome • snoRNAs detected in Archaea (Omer et al. 2000) C/D box snoRNA homologues detected in many Archaea collaboration with Pat Dennis’ lab at UBC Vancouver • FOLDALIGN (Jan Gorodkin) – automatic recognition and alignment of consensus secondary structure elements also Y. Sakakibara, F. Lefebvre, B. Knudsen, I. Holmes, others...

  17. SCFGs for RNA folding • Minimum energy RNA folding by dynamic programming – Michael Zuker • Partition function calculations (weighted summations over ensemble of all • structures) – J. McCaskill • SCFG analogue of the Zuker program; • maximum likelihood folds by the CYK algorithm; • summations by the Inside algorithm – E. Rivas and S.R. Eddy, 2000

  18. Genefinding by comparative analysis Jonathan Badger, Gary Olsen: CRITICA Most comparative analysis relies just on differential rates of evolution. However, the pattern of mutation is also informative. The OTHER model: score with terms P(a,b | OTH) models divergence only the CODING model: score with terms P(aaa,bbb | COD) models divergence, constrained by amino acid substitution matrix and codon bias

  19. add: a comparative model of structural RNAs The RNA model: terms: P(a-a’, b-b’ | RNA) models DNA divergence constrained by a secondary structure

  20. Technical issues • The structure is unknown; must do weighted sum over all possible structures. • We extended an SCFG model of RNA folding (Rivas and Eddy, 2000) • to a pair-SCFG, and we use an Inside algorithm to score it. • model must deal with gapped alignments that are heterogeneous w.r.t. • models – e.g. BLAST may align beyond the edge of the real RNA. We use • pair-grammar formalisms for all three models, and include flanking models • of conserved nonstructured alignment. • though we want to sum over all structures, we also want to recover maximum • likelihood start/end points of an RNA within a longer alignment. We • use the generalized HMM parsing trick introduced by Stormo and • Haussler (aka “semi Markov models” in Burge’s GENSCAN), and • treat our RNA model as an i,j feature score in a generalized HMM. • divergence times of the three models must be the same. We tie all • model parameters to a choice of amino acid substitution matrix. • -

  21. Three models – examples of their scores

  22. A screen for novel ncRNAs in E. coli E. Rivas, R. Klein, T. Jones, S.R. Eddy, submitted 2367 E. coli intergenic sequences >50 nt in length WUBLASTN vs. S. typhi, S. paratyphi, S. enteriditis, K. pneumoniae gave 23,674 WUBLASTN alignments w/ E<0.01, length >50 nt, >65% identity QRNA classified: 556 candidate RNA loci 160 candidate small ORFs (not examined further) 281 candidate loci are explainable: cis-regulatory RNA structures (terminators, attenuators, etc.) and certain inverted repeat elements leaves 275 candidate ncRNA gene loci Northerns on 49 candidates: 11/49 are expressed as small stable RNAs in exponentially growing E. coli in rich media

  23. Northern blots confirming E. coli RNAs

  24. Summary of three E. coli screens 10/14 of the RNAs found by the Altuvia screen are in QRNA candidate list 3 are just below 5 bit cutoff; one (sraI) completely missed 14/17 of the RNAs found by the Gottesman screen are in candidate list 2 are just below cutoff; 1 was thrown out mistakenly (QRNA found it, we thought it was just a terminator) Conclusions: Sensitivity of QRNA is respectable; most E. coli ncRNAs conserve secondary structure Only 4/11 of our confirmed ncRNAs are in the Altuvia or Gottesman genes Conclusions: These screens have not saturated E. coli for new ncRNAs; A total of 34 new ncRNAs confirmed. We have >200 other candidates in testing; We have confirmed transcripts as short as 40 nt; The functions of these RNAs are unknown.

  25. human/mouse ncRNA detection the cartilage-hair hypoplasia region: QRNA is a general genefinder for structural ncRNA genes.

  26. The ancient RNA World Gesteland & Atkins, The RNA World, CSHL Press, 1999

  27. RNA is very good at recognizing RNA RA Lease & M Belfort, PNAS 97:9919-24, 2000 “A trans-acting RNA as a control switch in Escherichia coli...”

  28. A closing idea: The modern RNA world Hypothesis: When a cell needs to make a molecule X that specifically recognizes a target RNA molecule, and the function of X is either: - catalytically unsophisticated (e.g. steric repression of translation); or - something that can be abstracted onto a single protein (e.g. many guide snoRNAs, one catalytic methylase) then RNA may be the material of choice. Small, highly specific complementary RNAs can be generated by simply duplicating part of the antisense strand of the target RNA. Specific RNA-binding proteins are big, expensive, and more difficult to evolve.

  29. Summary • Noncoding RNAs are genes too. • Methods to find homologous RNAs by structural similarity have been • greatly improved, using stochastic context free grammar algorithms. • Methods to find novel RNAs by de novo genefinding are our current • aim. Two different screens detect new structural RNAs: • - a simple GC screen in AT-rich hyperthermophile genomes; • - QRNA, an RNA genefinder using comparative sequence analysis. [SR Eddy, Curr Opin Genet Dev 9:965, 1999] [R Durbin et al., Biological Sequence Analysis, Cambridge U. Press 1998] [RJ Klein, Z Misulovin, SR Eddy, in preparation] [E Rivas, RJ Klein, TA Jones, SR Eddy, submitted]

  30. Acknowledgements http://www.genetics.wustl.edu/eddy/ The QRNA comparative analysis screen: Elena Rivas Tom Jones Robbie Klein

More Related