example of complexities observed by ENCODE
(A) annotated exons (black rectangles) and novel transcriptionally active regions or TARs (hollow rectangles); conventional annotation identifies only 4 genes, or just a fraction of the transcripts reported (dashed lines are introns)
(B) observed transcripts are shown alongside the sequences that regulate them (gray circles); note that some of the enhancers are actually promoters for novel splice isoforms
proposed redefinition of “gene” requires it to have a biological role
Gerstein MB, …, Snyder M. 2007. Genome Res 17: 669-681
a redefinition of the “gene”
1. a gene is a genomic sequence directly encoding functional product molecules, either RNAs or proteins
2. when there are several functional products that share overlapping regions, take the union of all overlapping genomic sequences encoding them
3. this union must be coherent, done separately for protein and RNA products, but it does not require that all the products necessarily share a common subsequence
concisely summarized as: a union of genomic sequences encoding a coherent set of potentially overlapping functional products
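The three rules above can be read as an interval-clustering procedure. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the tuple layout, and the toy coordinates are all hypothetical, intervals are half-open, and "coherent" is approximated by transitive overlap within one product type.

```python
def merge_intervals(intervals):
    """Return the sorted union of half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps(a, b):
    """True if any interval in a overlaps any interval in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def define_genes(products):
    """Cluster products into genes by the union rule.

    products: list of (name, kind, intervals); kind is "protein" or "RNA",
    intervals are the sequences directly encoding the product (for a
    protein, the translated regions only, so shared UTRs do not count).
    Protein and RNA products are clustered separately (rule 3).
    """
    genes = []
    for kind in ("protein", "RNA"):
        clusters = []
        for name, k, intervals in products:
            if k != kind:
                continue
            hits = [c for c in clusters if overlaps(c["intervals"], intervals)]
            rest = [c for c in clusters if not overlaps(c["intervals"], intervals)]
            new = {"kind": kind, "products": [name], "intervals": list(intervals)}
            for c in hits:  # merging is transitive: no common subsequence needed
                new["products"] = c["products"] + new["products"]
                new["intervals"] += c["intervals"]
            clusters = rest + [new]
        for c in clusters:
            c["intervals"] = merge_intervals(c["intervals"])
        genes.extend(clusters)
    return genes


# Toy locus with made-up coordinates (hypothetical, for illustration only).
toy = [
    ("P1", "protein", [(100, 200), (300, 400)]),
    ("P2", "protein", [(350, 450)]),
    ("P3", "protein", [(500, 600)]),
    ("R1", "RNA", [(380, 520)]),
]
genes = define_genes(toy)
for g in genes:
    print(g["kind"], g["products"], g["intervals"])
```

With these made-up coordinates the sketch yields three genes: P1 and P2 merge because their translated regions overlap, P3 stands alone, and R1 stays a separate gene even though it shares sequence with the protein-coding segments.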
4 genes defined in this one locus
there are three primary transcripts, two of which encode five proteins, while the third encodes a noncoding RNA
two primary transcripts share a 5’ untranslated region, but they are considered different genes because the translated regions (D and E) do not overlap
there is a noncoding RNA, but the fact that it shares its genomic sequence (X and Y) with the protein-coding genomic segments A and E does not make it a co-product of these genes
there are four genes in this one locus by the new definition
gene number estimates as a function of time and methodology
[figure labels: genome is sequenced; observed transcripts; genes; dark matter; sequence annotation; x axis: time]
dark matter is reproducible, but it’s poorly transcribed, poorly conserved, non-protein-coding, and outnumbers validated microRNAs by ~1000-fold
cDNA sequencing reveals an abundance of non-coding genes
mouse cDNAs by Okazaki Y, …, Hayashizaki Y. 2002. Nature 420: 563, or human cDNAs by Imanishi T, …, Sugano S. 2004. PLoS Biol 2: e162
neutral evolution of non-coding cDNAs from mouse transcriptome
ncRNAs are known RNA genes; intron1 and intergenic are negative controls
communications arising: Wang J, …, Wong GK. 2004. Nature 431: after p757
tiling array data are riddled with unexplained signal anomalies too
[figure label: mystery BURST]
human thymus polyA+ cDNAs profiled at the locus of the Ewing sarcoma breakpoint region 1 gene; from Johnson JM, …, Schadt EE. 2005. Trends Genet 21: 93
do not assume that non-coding cDNAs are equivalent to tiling-array exons
indications of biological relevance: transcription, conservation, both lines of evidence, or neither?
possible dark matter explanations:
1. biological noise, i.e. real transcripts with no biological roles
2. RNA genes unique to a species
3. long RNAs are precursors for short (and conserved) RNAs
NB: dark matter based on tiling arrays with 150 bp exons is not equivalent to cDNA sequences with 1800 bp exons
nuclear and cytosolic polyadenylated RNAs longer than 200 nt (long RNAs, lRNAs) and whole-cell RNAs shorter than 200 nt (short RNAs, sRNAs), profiled across the non-repetitive portion of the human genome
64% of poly(A)+ transcription (nucleus and cytosol) does not align with annotated exons, yet some 80% of the 265,237 annotated exons are detected
hypothesis: unannotated long RNAs are precursors for short RNAs
Kapranov P, …, Gingeras TR. 2007. Science 316: 1484-1488
lRNAs that overlap with sRNAs are more PhastCons conserved (i)
PhastCons identifies evolutionarily conserved elements from a multi-species sequence alignment, given the species' phylogenetic tree, using a statistical model of evolution called a phylogenetic hidden Markov model (phylo-HMM)
lRNAs that overlap with sRNAs are more PhastCons conserved (ii)
quantile-quantile plot of PhastCons scores for long RNAs that do (x axis) and do not (y axis) overlap with short RNAs; conservatively, 3.1% of HepG2 and 2.4% of HeLa nuclear lRNA transfrags might be parts of precursors of sRNAs
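For readers unfamiliar with quantile-quantile plots, the points on such a plot can be computed as below. This is a minimal sketch using only the Python standard library; the function name `qq_points` and its arguments are hypothetical, and the per-transfrag scores are assumed to have been extracted already. With overlapping lRNAs on the x axis, as in the slide, points falling below the diagonal would indicate higher scores in the overlapping set.

```python
from statistics import quantiles


def qq_points(x_scores, y_scores, n=100):
    """Pair up matching quantiles of two score samples.

    x_scores: scores for lRNAs that overlap sRNAs (x axis)
    y_scores: scores for lRNAs that do not overlap sRNAs (y axis)
    n: number of quantile groups; n - 1 cut points are returned
    """
    qx = quantiles(x_scores, n=n, method="inclusive")
    qy = quantiles(y_scores, n=n, method="inclusive")
    return list(zip(qx, qy))


def fraction_below_diagonal(points):
    """Share of quantile pairs where the x sample scores higher than y."""
    return sum(1 for x, y in points if x > y) / len(points)
```

The two samples need not be the same size, because each is reduced to the same number of cut points before pairing.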
sRNAs associate with 5’ and 3’ boundaries of annotated transcripts
enrichment over random expectation is plotted as a function of distance from 5’ and 3’ termini for sRNAs on the same (sense) or opposite (antisense) strand as the annotated transcripts; comparison is made against random regions with matched G+C content
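The enrichment curve described above can be approximated as a ratio of binned distance distributions. The sketch below is illustrative only and is not the published pipeline: the function names, bin size, and distance cutoff are assumptions, the control positions are assumed to be G+C-matched already, and strand handling (sense versus antisense) is left out, so it would be run once per strand and per terminus type.

```python
from bisect import bisect_left
from collections import Counter


def nearest_offset(pos, termini):
    """Signed distance from pos to the nearest terminus (termini sorted, non-empty)."""
    i = bisect_left(termini, pos)
    candidates = termini[max(i - 1, 0):i + 1]
    return min((pos - t for t in candidates), key=abs)


def binned_counts(positions, termini, bin_size, max_dist):
    """Count positions per distance bin, keyed by floor(offset / bin_size)."""
    counts = Counter()
    for p in positions:
        d = nearest_offset(p, termini)
        if abs(d) <= max_dist:
            counts[d // bin_size] += 1
    return counts


def enrichment(observed, control, termini, bin_size=50, max_dist=500):
    """Observed fraction over control fraction, per distance bin.

    Returns {bin start in bp: ratio}; bins with no control hits are skipped.
    """
    obs = binned_counts(observed, termini, bin_size, max_dist)
    ctl = binned_counts(control, termini, bin_size, max_dist)
    scale = len(control) / len(observed)
    return {b * bin_size: scale * obs[b] / ctl[b]
            for b in sorted(obs) if ctl[b] > 0}
```

A ratio above 1 in the bins nearest zero would correspond to sRNAs clustering at transcript boundaries more often than the matched random regions do.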