Dark matters in the genomes Shin-Han Shiu Plant Biology / Genetics / EEBB / QBMI
DNA A G G C G T A G A G A G A T C C T T G A T T C C G C A A C T C T C A A G G A A C A A
DNA and Genome • The genome is all the DNA in a cell, made up of A, T, G, and C. • How many A's, T's, G's, and C's are there in the human genome? 3,200,000,000 letters • A sizable book, say, The Lord of the Rings: The Fellowship of the Ring: 764,470 characters in 410 pages, ~2,000 characters per page • The book of our life: 1,600,000 pages, or about 4,186 copies of The Fellowship of the Ring
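The book comparison on this slide is plain arithmetic, and a short sketch makes the steps explicit. The variable names are illustrative; the numbers are the ones on the slide, with the per-page figure rounded to 2,000 as the slide does.

```python
# Numbers taken from the slide.
GENOME_LETTERS = 3_200_000_000   # letters in the human genome
BOOK_CHARS = 764_470             # characters in the book
BOOK_PAGES = 410                 # pages in the book
CHARS_PER_PAGE = 2_000           # the slide's rounded characters-per-page figure

# Exact characters per page before rounding (about 1,865).
exact_chars_per_page = BOOK_CHARS / BOOK_PAGES

# Pages needed to print the genome at ~2,000 characters per page.
genome_pages = GENOME_LETTERS // CHARS_PER_PAGE

# Number of copies of the book with the same total character count.
genome_books = round(GENOME_LETTERS / BOOK_CHARS)

print(f"{exact_chars_per_page:.0f} characters per page (unrounded)")
print(f"{genome_pages:,} pages, about {genome_books:,} books")
```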
Between human and other animals • How much do the human and chimp genomes differ? 0.1% / 1% / 10% / 50% / 90% • How many genes do you think we share with the worm? 1% / 10% / 50% / 75% / 99%
Our research interest TTGGCTATCCTTTATATTTTAAGGGTTATTAGGATATTTTTTATTATGACTACATGGGATAAATGTTTAAAAAAAATAAAAAAAAACCTTTCTACGTTTGAGTATAAGACGTGGATAAAGCCTATCCATGTGGAGCAAAA TAGTAACTTATTCACAGTTTACTGTAACAATGAATATTTCAAAAAACATATAAAATCTAAGTATGGAAATCTTATTTTATCAACAATCCAAGAGTGTCATGGTAATGATTTAATTATTGAATATTCTAATAAAAAATTCT CTGGCGAAAAAATTACTGAGGTTATCACAGCTGGACCACAAGCTAATTTTTTTAGCACAACAAGTGTTGAGATAAAAGATGAATCAGAAGATACAAAAGTAGTACAAGAACCTAAAATATCAAAGAAGTCTAATAGTAAA GACTTTTCTTCATCACAAGAGTTATTCGGTTTTGACGAAGCTATGCTAATTACAGCAAAAGAAGATGAGGAATACTCTTTTGGTTTACCGTTAAAAGAAAAATATGTTTTTGATAGTTTTGTTGTTGGAGATGCTAACAA AATTGCTAGAGCAGCGGCTATGCAGGTATCGATAAATCCAGGTAAATTACATAACCCTTTATTCATTTATGGTGGTAGTGGTTTAGGTAAAACTCACTTAATGCAAGCAATAGGTAATCATGCAAGAGAAGTTAATCCTA ATGCCAAAATTATTTATACAAATTCAGAACAATTTATTAAAGATTATGTAAATTCTATTCGTTTACAAGATCAAGATGAGTTTCAAAGAGTTTATAGATCTGCGGATATACTTTTGATTGATGATATTCAATTTATCGCT GGTAAAGAGGGTACTGCTCAGGAGTTTTTCCATACTTTTAATGCATTGTATGAAAATGGTAAACAGATAATTCTAACTAGTGATAAGTATCCAAATGAAATAGAAGGGCTTGAAGAAAGACTAGTTTCGCGTTTTGGTTA TGGTTTAACAGTTTCTGTTGATATGCCAGATTTAGAAACCAGAATTGCTATCTTGCTCAAAAAAGCTCATGATTTAGGTCAGAAATTACCTAACGAAACAGCAGCTTTTATTGCTGAGAATGTACGTACTAATGTCAGAG AACTAGAAGGTGCTCTAAATAGGGTTCTTACTACCTCTAAATTTAATCATAAAGATCCTACTATCGAAGTAGCACAAGCTTGCTTAAGAGATGTTATAAAAATACAAGAAAAGAAAGTAAAAATAGATAATATCCAAAAG GTTGTTGCTGATTTTTATAGAATCAGGGTAAAAGATTTAACTTCTAATCAAAGAAGTAGAAATATAGCTAGACCAAGACAGATAGCAATGAGTTTAGCACGTGAACTAACATCACATAGTTTGCCAGAAATAGGCAATGC TTTTGGTGGTAGAGACCATACGACAGTTATGCATGCTGTCAAAGCTATAACTAAATTAAGACAAAGCAATACTTCAATATCGGATGATTATGAGTTGCTTTTAAATAAAATTTCTCGTTAAATAAAATTAGTAACTTTAT CAAAGGGGTTTTAAAAAATGAATTTTGTACTAAATAGAGATGACTTACTAAAGCCTTTGCAATCTATGCTCTCAGTTGCAAATAGTAAGAGTACAATGCCTTTATTATCATGTATCTTATTTGATATTGATAATAATAAT CTCAAAATTACGGCTTCGGATCTTGATACAGAGATATCATGCAATATAGCAGTTAGTTGTAACACAACTATTAAGTTAGCATTAAATGCTGACAAAATTTATAACATTGTCAGAAGCTTAAATGAAAATTCAATGATTGA TTTTAGAATTAATGAAAATAAGGTAACTATTGTTTCTAATAATAGTACTTTTAACCTTATATCACTAAATGCTGACAACTATCCTCTTATTGATAGTAATATCAATGAGCAAGCAAGTTTTGATCTTTCTCAACAAGATT 
TTCATCATATTATTTCAAAAGTAGATTTCTCAATGGCTAATGATGATACTCGATATTTCTTAAATGGGATGTTTTGGGAAATCAACGCAAATCTACTAAGAGCAGTATCTACAGATGGTCATAGAATGTCTATCACAGAG GCTATAATTGATAGTAAAGTGTTAGATAGTGCTTCTCAGTCGATAATTCCAAAAAAAGCGATTTTAGAGCTTAAAAAGATAGTTGGCAAAACAGAAGAAAATATCAAAATTTGTCTTGGCAAAAATTATCTAAAAGCGAT TTTTGGTAATTATGCTTTTATATCAAAGCTTATAGATGGTCGCTATCCTGATTACCAAAAAGTAATCCCTAAAAATAATACAAAACTATTAGCAGTTGATAAGCAGTTTTTCAAAAATTCATTATTAAGAACATCAATAC TTGCTAATGATAAATATAAAGGTGTTCGTCTTAACATATCTCAAAATCAATTACTTCTATCAGCTAATAACCCTGATAATGAAAAGGCTGAAGATAAAATCGAAGTTCAATATAATGATCAACCAATGGAAATTTGTTTT AATTACAAATATCTTTTGGATATTATAAATGTACTTAGTGAAGAAACTATGTCTATCTACCTTGATAATCCAAATATGAGTGCTTTAGTTAAAGATGAGAAAGATAATAGTTTGTTTATTATTATGCCAATGAAAATTTA AGTAATAAGTAGTTTTAGGAAATAACTATTTTTATAAGCCTTTTGGAATGAATAATAAAGCAATAAAAAAAGGTATGCATAAAAACATTATATAGAAAGCTGGGATTAGATAATTTCCAGTAGTAGTAATTAATAAAGTC ATAAGAAAGGCAACAGTACCTCCAAATAAAGAAACGCTTATATTAAAACTTATTCCAAAACCAGTATTTCTATTTCTAACAGGAAATAGCCCTGCCGTATTTGCAAATATAGGTCCTATTACAGCACCACTTATAATAGC AAGAGAAAAAATAGCTATAGATACTAATTGATGATTTTTTATAATAATTTGGTATATTGGTAAAACAGCTATAAATAAGACTATACAAGAATACATCAGAACTTTTTTACCACCAATTCTATCAGCAATATATCCAAATA TAATTGAAGAAAGCATTAATACTATAGTTAATCCGAGAGTATTTTGTTAAACACATAAGAGATAGCATCACAAAATTTTTCAAAACTATTATTCACTTTTCTAAATATTTTTTTAAAGTTAGCCCAAACCTTTTCAATAG GATTTAAATCTGGAGAATACGGAGGTAGATATAATATTTGTACATCAAATTTATTGGCTATTTCAATCAGCTTAGAGGATTTATGGAAACTAGCATTATCCATTACTATAGTAGTTTTAGGTTTTAATGATGGGCATAAG TGTTCCTCAAACCATTGATTAAAAATTTCAGTATTGGTATATCCACTGTACTCTAATGGAGCTATAATCTTTTTATCTGCATAATTATATCCAGCAACAATACTTCTTCTTTGTGTTTGATATGCTAAAACCTCACCATA ACTAGGCTCACCAATTAGTGACCATCCTCTTAGGATAGAAAGCTTATTGTCACACCCCATCTCATCTATATAAAATAACAAGTTTTGAGCTATTTCTTTTAGTTTTTCTATATACTCCAACCTTTCATGTTCTTTTCTTT GCTTATATTTTGGAGTCTTTTTTTAAAACTAAAACCAAGTCTATTAAGACAATCATAAAATGTACTTCTTGGAATATCAGGGGCTAATGCTTCTTTTATATCTAATGCACTTGCATCTGGATGATCTATCAAATACTGTT CAATCAATGTTTTATCGGTAAAGCTAGCGACTCTGCCACAACCAACTCCTTGCTTTGAACTATAATCTCCGGTTCTTTTATAAAACTCTATCCATGAAACAACTGTACGCTTATCTATGTTAAAAAACTTACTCAGCTCG 
AACTCCGTCATACCTTCTTCATATTTATTAATTACGATGTCTCTAAAATATTGGCTATATGATGGCATTTTTATTAGACATTATAACATTTCTACAAATATCTTTTTCTACAAATATCTTTCGGATTAACTATATAAGTA GAGTCAACAACCATCCAAATCACCCAATTATCTATAATTTTCTGCTTGCTAAAAAAACGCATACCAATGATGCTACACTTGTAAAACCATCCATATATGGCGTTGTTGAATCAGTATAAAATATAAGTAGTTGCGAAACT AGTAACCCAAATACTACAATGCTTACTAGAACTTTTAACCAACCAATGATTTTAAGTCTATGAACAACTATCTTTTTGTGACTAAAGTTGGGTTGCCAACTATACCAACCGTATCCAAAGCTAAATAAAAGAATCATCTG CAATATAGCATCGGCATATAGTCCACTAACAGAAAATAAACCCGCACTCATGATCAAACCAACTATCTCCACAGGCCAACCAATGACATAAAGCCTTGCCAGCAAAAAGGTACACAAAAGATTAACAATCATTGTACAAA AATCAAAAATATGCAGCATATTTATTTTACTAATCAAAGTATTATAAATATTATAATAACTTTGAAGTTGGCGTATTAAAGCCATAAACTTTAGTAGGTTAGTGTTTATACCAATATTTTGAGATGCTTTCTGCAAGCTA ATAACATTTAGCTATCTAGCCTAAATAATTAATATACAAAACTTTCAAGCTTATTGAATTTTTCAACAGATACAGCGCGTTATAACAAATAAGTAATTGACTAAATTAAAAAGCAAGTATAATATCGATTGTGTTTATTA CATAATATAAAACGAGGATAAAAAAAATATGAAATTAAGAAAAGTATTAATCGCGACATTATTAGGAGCTTCTGCTTTATCTTTAAGTAGTTGTTGGTTACTTGTTGGTGCAGCTGTTGGTGGTGGAACTGCTGCGTATA TTTCTGGTGAGTATTCAATGAATATGAGTGGCAGTGTAAAAGATATTTACAATGCTACTTTAAAAGCTGTTCAAAGCAATGATGATTTTGTAATTACTAAAAAATCTATTACTTCTGTTGATGCAGTTGTTGATGGTAGT ACTAAGGTAGACTCAACAAGTTTCTATGTTAAAATAGAAAAACTTACTGATAATGCTTCAAAAGTTACAATTAAGTTTGGTACTTTTGGTGACCAAGCAATGTCAGCAACATTAATGGATCAAATCCAAAAGAATCTTTA ATTAAATAGGTAATTACTATAATGACTTTTCTAAAGAAAGCTTTTATTGCAACTATAGTTTCTATTTCAGCATTAGTTCTAAATAGTTGTATTGTTGCAGCAATAGCTGTTGGTGGTGGAACAGTTGCCTATATTGATGG AAATTATTTTATGAATATAGAAGGCAACTATAAAGCTGTCTATAAAGCTACTCTTAAAGCTATTAATGATAATAATGACTTTGTTCTAGTATCAAAAGATCTTGATCAAACAAAGCAAAATGCCGACATTGAAGGTGCTA CTAAAATTGATAGTACGAGTTTTAGTGTCAAAATTGAAAGACTGACAGATCAGGCTACTAAAGTGACAATCAAATTTGGTACTTTTGGCGATCAAGCAATGTCATCAACATTAATGGATCAGATCCAGGCAGCTGTACAT AAAGCTTCTTAGAAATGTACAAAAAACTCTACTTAATTATATTATCCACAATAATCGCAATCTCTCTTAATAGTTGTGTTGTTGCCGCTGTTTTAATTGGTACAGCAGTTGTTGCTGGAGGTACAGTATATTACATCAAT GGTAACTATATAATCGAAGTCCCTAAAGATATTAGAAGTGTATACAATGCTACAATCAAGACTATACAGATGGATAGTCAAAATAAACTAATAAGTCAAACCTATAATACTAAATCTGCTATAATTAAAGCTTTACAAAA 
AGGTGAAAAAATTAGTATAGATTTAAGCAATATTGATAGTCGTTCAACAGAGATAAAAATTCGTATAGGTGTACTTGGCGATGAGAAAAAATCTGCTGATTTAGCAAACTCAATAACAAAAAATATCACCTAAGCAATAT TTCTCGAACTTTGGTTAACTTTTTCTTTTTAAAAACTTTCAAAAATGTATAATTTGTGTTAGTTTGCAAACTACCCTTATATCCATAATGAGTAATAAGGTATTAGATACATATTATAAAAACAATCGACATATTTGGGT GCTAGTACTATCTGGTGCTGTTATAGGCACAATGATTGGTCTTCTAGCAACAGCATTTCAGCTACTCCTAGACTTTATTTTTAAAATTAAGCTGGCTCTTTTTTCTTTCAGTGGTGGTAATCTTTTTATCGAAATCGCTA TGTCAATATCATTAAGTATTGTGATGGTATTAATTTCGATTTTTATTGTTAAAAAATTTGCGAAAGAGGCTGGTGGTAGCGGTATCCAAGAGGTTGAGGGTGCTTTAAAAGGCTGCCGCAAAATACGTAAAAGAGTTATG CCCGTGAAGTTTATAAGTGGACTTTTTTCGTTAGGCTCAGGTTTAAGTTTAGGTAAAGAGGGACCATCAATTCATATGGCTGCTGCATTAGCGCAGTTTTTTGTTGATAAATTTAAACTTACTACAAAATATGCTAATGC GGTTATCTCTGCTGGGGCTGGAGCTGGACTAGCAGCTGCTTTTAATACCCCACTTTCTGGGATTATCTTTGTTATTGAAGAGATGAATAGAAAGTTTAGATTTAGTGTTTCGGCAATAAAGTGTGTGCTAGTAGCATGTA TCATGAGTACAGTTATCTCTAGAGCTATTATGGGTAATCCTCCAGCAATACGCGTAGAAACTTTCAGCTCAGTACCACAAAATACTCTTTGGTTATTTATGGTATTAGGGATTATATTTGGTTATTTTGGTTTACTATTT AACAAATCCTTAATCAAAGTGGCAAACTTTTTCTCAGAAGGCTCCAAGAAGAGGTATTGGACTTTAGTTATAATTGTTTGCATAATTTTTGGTATTGGTGTTGTTCTATCTCCAAATGCTGTTGGCGGTGGCTATATTGT CATAGCAAATACTCTTGATTATAACTTATCAATCAAGATGCTTTTAGTGCTTTTTGTACTTCGTTTTGCTGGAGTTATTTTCTCATATGGCACCGGCGTTACTGGTGGGATATTCGCACCAATGATTGCGCTTGGTACTG
Evolution of genome sizes • C-value: 1 pg ≈ 0.98 Gb • Thale cress (Arabidopsis thaliana): 0.16 pg • Fruit fly (Drosophila melanogaster): 0.18 pg • Pufferfish (Takifugu rubripes): 0.4 pg • Human (Homo sapiens): 3.5 pg • Onion (Allium cepa): 16.75 pg • Tiger salamander (Ambystoma tigrinum): 32 pg • Marbled lungfish (Protopterus aethiopicus): 132 pg http://www.rbgkew.org.uk/
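To compare these C-values on the same scale as the "3,200,000,000 letters" figure, they can be converted from picograms to gigabases. The sketch below assumes the commonly cited conversion of roughly 978 Mb per picogram; the function name and the dictionary are illustrative, and the C-values are the ones listed on the slide.

```python
# Assumption: about 978 Mb of DNA per picogram (a widely used conversion factor).
MB_PER_PG = 978.0

# C-values (pg) as listed on the slide.
C_VALUES_PG = {
    "Thale cress": 0.16,
    "Fruit fly": 0.18,
    "Pufferfish": 0.4,
    "Human": 3.5,
    "Onion": 16.75,
    "Tiger salamander": 32.0,
    "Marbled lungfish": 132.0,
}

def c_value_to_gb(pg, mb_per_pg=MB_PER_PG):
    """Convert a C-value in picograms to an approximate genome size in Gb."""
    return pg * mb_per_pg / 1000.0

for species, pg in sorted(C_VALUES_PG.items(), key=lambda item: item[1]):
    print(f"{species:18s} {pg:7.2f} pg  ~ {c_value_to_gb(pg):8.2f} Gb")

# Fold difference between the largest and smallest genome on the list.
fold_range = max(C_VALUES_PG.values()) / min(C_VALUES_PG.values())
print(f"Largest / smallest: {fold_range:.0f}-fold")
```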
Genic region and genome size Dan Graur
What's in the genome • Annotated genes (exon, UTR, intron) • Cis-regulatory elements • Dead genes (pseudogenes) • Novel genes • Selfish elements
"Non-genic": repetitive elements • E.g. Human genome • Exons take up? • Introns account for? • Repetitive elements occupy? • Unknown? • A B C • 1% 24% 25% • 24% 1% 25% • 35% 60% 45% • 40% 15% 5% Venter et al. (2001) Science 291:1304
What is in the unknown regions? • Investigate with tiling arrays (figure: cDNA array vs. tiling array; probe size: 25 bp, gap size: 10 bp) • Number of features: • Arabidopsis, 135 Mb, 1 chip, ~6×10^6 features • Human, 3 Gb, 7 chips, ~4.2×10^7 features
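A tiling design with 25 bp probes separated by 10 bp gaps places one probe every 35 bp. The sketch below counts and lists probe positions for a region under that assumption only. The function names are hypothetical, and real array designs differ (for example in strand coverage and repeat masking), so this is not expected to reproduce the feature counts quoted on the slide.

```python
def tiling_probe_starts(region_bp, probe_bp=25, gap_bp=10):
    """Return 0-based start positions of probes tiled along one strand."""
    step = probe_bp + gap_bp
    # A probe must fit entirely inside the region.
    return list(range(0, region_bp - probe_bp + 1, step))

def tiling_probe_count(region_bp, probe_bp=25, gap_bp=10):
    """Count the probes without materializing the list of positions."""
    if region_bp < probe_bp:
        return 0
    return (region_bp - probe_bp) // (probe_bp + gap_bp) + 1

# Small example: a 100 bp region holds probes starting at 0, 35, and 70.
print(tiling_probe_starts(100))

# One-strand estimate for a 135 Mb genome.
print(f"{tiling_probe_count(135_000_000):,} probes on one strand")
```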
"Non-genic": unannotated genes • Tiling array analysis of human Chr 21, 22 Kapranov et al., 2002. Science
Tiling array analysis of human transcriptome • Human Chr 21, 22 • What do you think these expressed regions represent? Kapranov et al., 2002. Science
Difficulties for coding gene prediction • Training data • You need to know something... • “Biased” toward the properties of the majority. • Real genes that are shorter tend to be much harder to predict.
Table 3. Accuracy of GISMO, Glimmer and CRITICA in predicting short genes (<300 bp)
Gene finder    Cor     Sn      Snfk (%)    Sp
GISMO          0.64    63.0    86.4        69.0
Glimmer        0.54    72.0    83.7        44.0
CRITICA        0.60    46.0    67.4        84.0
Snfk denotes the sensitivity in detecting function-known genes. Krause et al., 2006. Nucleic Acids Res. 35:540
Novel coding sequence identification • Arabidopsis thaliana as an example • 135Mb, ~50% occupied by annotated genes. • Focus on coding sequences 90-300bp long. • What would you do next to eliminate ORFs that are likely false predictions? 133,090 sORFs
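The starting set of candidate sORFs can be pictured as a six-frame scan for short open reading frames. The following is a minimal sketch, assuming an ORF means an ATG followed in frame by the first stop codon and that the 90-300 bp length includes the stop codon. These definitions and the function names are assumptions for illustration, not the procedure used to obtain the 133,090 sORFs on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_sorfs(seq, min_len=90, max_len=300):
    """Find ATG...stop ORFs of min_len..max_len bp in all six frames.

    Returns (strand, start, end, orf) tuples; coordinates are 0-based,
    end-exclusive, and relative to the strand that was scanned.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start         # length includes the stop
                    if min_len <= length <= max_len:
                        hits.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # close the ORF
    return hits

# Toy example: a 96 bp ORF embedded in flanking sequence.
toy_orf = "ATG" + "GCT" * 30 + "TAA"
for hit in find_sorfs("CC" + toy_orf + "CC"):
    print(hit[0], hit[1], hit[2], len(hit[3]))
```

Every ORF that passes the length filter is only a candidate; the criteria on the following slides are what separate likely coding sequences from chance ORFs.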
Criterion 1: Codon usage bias • Some codons are used more frequently than others http://www.cbs.dtu.dk/services/GenomeAtlas/
Criterion 1: Codon usage bias • For example: codons for proline • Suppose you have the following two sequences, both coding for poly-proline. Which one is more likely to be a real coding sequence? Seq1 CCT CCA CCT Seq2 CCC CCG CCC
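One way to make the comparison concrete is to score each sequence by how much its codons are favored relative to equal use of the four synonymous codons. The sketch below uses made-up relative usage values for the four CC* codons (placeholders, not measured frequencies for any organism), so the outcome only illustrates the idea; which sequence looks more "real" depends on the codon usage of the genome in question.

```python
import math

# Hypothetical relative usage of the four CC* codons (sums to 1).
# These are placeholder values for illustration, not measured data.
CODON_USAGE = {"CCT": 0.38, "CCA": 0.33, "CCG": 0.17, "CCC": 0.12}

def codon_log_odds(seq, usage=CODON_USAGE):
    """Sum of log2(observed usage / uniform usage) over the codons of seq."""
    uniform = 1.0 / len(usage)
    usable = len(seq) - len(seq) % 3           # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return sum(math.log2(usage[c] / uniform) for c in codons)

seq1 = "CCTCCACCT"
seq2 = "CCCCCGCCC"
print(f"Seq1 score: {codon_log_odds(seq1):+.2f}")
print(f"Seq2 score: {codon_log_odds(seq2):+.2f}")
```

A positive score means the sequence uses codons that are preferred under the assumed usage table; a negative score means it leans on rarely used codons.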
Novel CDS identification • Posterior probability calculation • Bayes' theorem
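For reference, Bayes' theorem applied to this two-class problem can be written as follows. The notation is an assumption for illustration: S is the candidate sequence, CDS and NCDS are the coding and non-coding classes, and the priors P(CDS) and P(NCDS) sum to 1.

```latex
P(\mathrm{CDS} \mid S) =
  \frac{P(S \mid \mathrm{CDS})\, P(\mathrm{CDS})}
       {P(S \mid \mathrm{CDS})\, P(\mathrm{CDS}) + P(S \mid \mathrm{NCDS})\, P(\mathrm{NCDS})}
```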
Novel CDS identification • Determine base composition probabilities • Feature tables: coding sequences → CDS parameters (c1, c2, c3, c4, c5, c6); non-coding sequences → NCDS parameters (n)
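A much simplified version of this idea can be sketched as follows: estimate base composition at each codon position from known coding sequences, estimate a single base composition from non-coding sequences, and combine the two likelihoods into a posterior. This is a toy model under stated assumptions (independent positions, a prior of 0.5, +1 pseudocounts, tiny made-up training sets) and is not the parameterization used in the cited work.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_freqs(seqs, period):
    """Base frequencies at each position modulo `period`, with +1 pseudocounts."""
    counts = [Counter({b: 1 for b in BASES}) for _ in range(period)]
    for s in seqs:
        for i, b in enumerate(s):
            if b in BASES:
                counts[i % period][b] += 1
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]

def log_likelihood(seq, freqs):
    """Log-likelihood of seq under position-specific base frequencies."""
    period = len(freqs)
    return sum(math.log(freqs[i % period][b])
               for i, b in enumerate(seq) if b in BASES)

def posterior_coding(seq, cds_freqs, ncds_freqs, prior_cds=0.5):
    """Posterior probability that seq is coding, via Bayes' theorem."""
    log_c = log_likelihood(seq, cds_freqs) + math.log(prior_cds)
    log_n = log_likelihood(seq, ncds_freqs) + math.log(1.0 - prior_cds)
    m = max(log_c, log_n)                      # subtract max to avoid underflow
    e_c, e_n = math.exp(log_c - m), math.exp(log_n - m)
    return e_c / (e_c + e_n)

# Made-up training sequences, for illustration only.
toy_cds = ["ATGGCTGCTGCTGCTGCTTAA", "ATGGCAGCTGCCGCTGCTTGA"]
toy_ncds = ["ATATATTTAATTATATAT", "TTTAAATATATAAATTTA"]

cds_freqs = position_freqs(toy_cds, period=3)    # one table per codon position
ncds_freqs = position_freqs(toy_ncds, period=1)  # one table overall

for query in ("GCTGCTGCTGCT", "ATATATATATAT"):
    print(query, f"{posterior_coding(query, cds_freqs, ncds_freqs):.3f}")
```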
Posterior probability of coding sequence • Compare known non-coding and coding sequences Hanada et al., 2007. Genome Res.
Posterior probability of coding sequence • Scanning Arabidopsis genome Hanada et al., 2007. Genome Res.
After applying the first criterion 7,442 coding sORFs
How good is the CDS finding measure? • For the training data • For 18 Arabidopsis small protein genes • All 18 are predicted as CDS. • For 84 yeast small protein genes • All 84 are predicted as CDS.
So what does this mean? • If a sequence is a true coding sequence • Our approach can predict it with high accuracy. • So, the sensitivity is very good. • Is this good enough? • What about specificity? • Namely, how good are the criteria at excluding false positives?
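The distinction between the two measures can be made explicit in a few lines. The sensitivity example uses the counts from the previous slide (18 of 18 and 84 of 84 known small protein genes recovered). The specificity example uses made-up counts, because the slides give no negative set at this point. Note that some gene-finding papers use "specificity" for TP / (TP + FP); the sketch uses the TN / (TN + FP) definition, which is an assumption about what is meant here.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of true coding sequences that are predicted as coding."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of non-coding sequences that are correctly rejected."""
    return true_neg / (true_neg + false_pos)

# From the slides: all 18 Arabidopsis and all 84 yeast genes were recovered.
known_genes_found = 18 + 84
print(f"Sensitivity: {sensitivity(known_genes_found, 0):.2f}")

# Hypothetical negative set, for illustration only: 100 known non-coding
# sequences, of which 10 are wrongly called coding.
print(f"Specificity (hypothetical counts): {specificity(90, 10):.2f}")
```

A predictor that calls everything coding would also reach a sensitivity of 1.0, which is why the specificity question has to be answered separately.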
Criterion 2: Expression • What expression level would you expect for true CDS compared to false CDS? (Figure: tiling array, probe size 25 bp, gap size 10 bp; histogram of frequency vs. expression level)
Comparison of expression levels • Exon, intron, tRNA, rRNA, our predictions A: Exon B: Intron C: Predicted novel CDS D: tRNA E: rRNA
Applying the second criterion • Prediction significantly enriched in expressed sequences 2,996 transcribed sORFs
Criterion 3: Purifying selection • Compare known coding and non-coding sequences
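Purifying selection leaves a recognizable trace in coding sequences: changes that alter the amino acid are removed more often than changes that do not. A crude way to see this is to count synonymous and nonsynonymous codon differences between two aligned, in-frame sequences, as in the sketch below. This is only a proxy under simplifying assumptions (gap-free alignment, whole-codon comparison, standard genetic code); a proper Ka/Ks estimate also normalizes by the number of synonymous and nonsynonymous sites. The function name and the toy sequences are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

def count_codon_changes(seq_a, seq_b):
    """Count (synonymous, nonsynonymous) codon differences.

    Both sequences must be aligned, gap-free, in frame, and of equal length.
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    syn = nonsyn = 0
    usable = len(seq_a) - len(seq_a) % 3       # ignore a trailing partial codon
    for i in range(0, usable, 3):
        a, b = seq_a[i:i + 3], seq_b[i:i + 3]
        if a == b:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1                           # same amino acid
        else:
            nonsyn += 1                        # amino acid changed
    return syn, nonsyn

# Toy pair: CCT->CCC and CTT->CTA are silent, GGA->GAA changes Gly to Glu.
print(count_codon_changes("ATGCCTCTTGGA", "ATGCCCCTAGAA"))
```

For a sequence under purifying selection, synonymous differences are expected to outnumber nonsynonymous ones by more than they would in non-coding DNA.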
Our research interests 17,000 6,000 45,000 10,000 30,000 25,000
Duplication Mechanism and Loss Rate • Gene duplications: mechanisms, preferential retention, consequences
Duplication mechanisms • Whole genome duplication • Tandem duplication • Segmental duplication • Duplicative transposition
Differences in Duplicability • Duplicability • The propensity for the retention of a duplicate gene • Computational analysis of genome-wide trend
Functional Consequences of Duplication • Functional divergence and conservation • Is it because of changes in cis-regulatory elements or in coding sequences? • How are duplicates retained: subfunctionalization or neofunctionalization?
Acknowledgement • Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou • TIGR: Chris Town, Hank Wu • University of Chicago: Wen-Hsiung Li, Justin O. Borevitz, Xu Zhang • Funding