Predicting Genes in Eukaryotic Genomes By Computer. Hao Bailin ( 郝柏林 ) T-Life Research Center, Fudan University Beijing Genomics Institute , Academia Sinica Institute of Theoretical Physics, Academia Sinica (www.itp.ac.cn/~hao/).
The Central Dogma of Molecular Biology • Replication: DNA → DNA • Transcription: DNA → mRNA • Reverse transcription: mRNA → cDNA • Translation: mRNA → Protein/Enzyme • Folding: Protein → Structure • Interaction: Structure → Function
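DNA (deoxyribonucleic acid) sequences • Composed of 4 letters (nucleotides, or bases: a, c, g, t) • Length: a single chromosome ranges from a few thousand to tens of millions of letters • Humans have 23 pairs of chromosomes; chimpanzees have 24 pairs; mice have 20 pairs (19 autosomal pairs plus the sex chromosomes); rice has 12 pairs; kiwifruit has 300 pairs • Part of a chromosome encodes proteins; the rest consists of control signals, repeated segments, "random" strings of letters with no known meaning, and so on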
Large-Scale DNA Sequencing Since 1977 • Sanger method: chain termination (stopping polymerization) • Maxam-Gilbert method: chemical degradation • Each reaction: 500-600 bp (a single read) • Clone-by-clone vs. whole-genome shotgun • Sequence assembly: reads → contigs → scaffolds → superscaffolds • Automatic sequencer: MegaBACE, 96 or 384 channels
Letter production at BGI (Beijing + Hangzhou) Daily: 5 × 10^7 Yearly: 10^10
Eukaryotic genomes already sequenced • Budding yeast (Saccharomyces cerevisiae) • Fission yeast (Schizosaccharomyces pombe) • Nematode (Caenorhabditis elegans) • Fruit fly (Drosophila melanogaster) • Malaria parasite (Plasmodium falciparum) • African malaria mosquito (Anopheles gambiae) • Human (Homo sapiens), chimpanzee (Pan troglodytes) • Mouse (Mus musculus), rat (Rattus norvegicus) • Dog (Canis familiaris), chicken (Gallus gallus), pig (Sus scrofa) • Pufferfish (Fugu rubripes) • Silkworm (Bombyx mori), honeybee (Apis mellifera) • Thale cress (Arabidopsis thaliana), rice (Oryza sativa) • Maize (Zea mays)
cccaatatcttgcttcagcaagatattgggtatttctagctttcctttcttcaaaaattgctatatgttagcagaaaagccttatccattaagagatggaacttcaagagcagctaggtctagagggaagttgtgagcattacgttcgtgcattacttccataccaagattagcacggttgatgatatcagcccaagtattaataacgcgaccttggctatcaactacagattggttgaaattgaatccgtttagattgaaagccatagtactaatacctaaagcagtgaaccaaatccctactacaggccaagcagccaagaagaagtgtaaagaacgagagttgttaaaactagcatattggaagattaatcggccaaaataaccatgagcggccacaatattataagtttcttcctcttgaccaaatctgtaaccctcattagcagattcgttttcagtggtttccctgatcaaactagaggttaccaaggaaccatgcatagcactgaatagggaaccgccgaatacaccagctacacctaacatgtgaaatggatgcataaggatgttatgctctgcctggaatacaatcataaagttgaaagtaccagatattcctaaaggcataccatcagagaaacttccttgaccaatagggtaaatcaagaaaacagcagtagcagctgcaacaggagctgaatatgcaacagcaatccaaggacgcatacccagacggaaactcagttcccactcacgacccatataacaagctacaccaagtaagaagtgtagaacaattagctcataaggaccaccattgtataaccactcatcaacagatgcagcttcccaaattgggtaaaagtgcaatccgatcgccgcagaagtaggaataatggcaccagagataatattgtttccgtaaagtaaagaaccagaaacaggctcacgaataccatcaatatctactggaggggcagcgatgaaggcgataataaatacagaagttgcggtcaataaggtagggatcatcaaaacaccgaaccatccgatgtaaagacggttttcggtgctagttatccagttgcagaagcgaccccacaggcttgtactttcgcgtctctctaaaattgcagtcatggtaagatcttggtttattcaaattgcaaggactcccaagcacacgtattaactagaaagataatagaaggcttgttatttaacagtataatatagactatataccaatgtcaaccaagccagccccgacagttgtatatccatacaacaaaatttaccaaaccaaaaaattttgtaaatgaagtgagtgaaaaatcaaaactcagattgctcctttctagtttccatatgggttgcccgggactcgaacccggaactagtcggatggagtagataattattccttgttacaatagagaaaaaacctctccccaaatcgtgcttgcatttttcattgcacacgactttccctatgtagaaataggctatttctattccgaagaggaagtctactaatttttttagtagtaagttgattcacttactatttattatagtacagagaacatttcagaatggaaactgtgaaagttttaccttgatcatttatcaatcatttctagtttattagttttgtttaatgattaattaagaggattcaccagatcattgatacggagaatatccaaataccaaatacgctcactgtgcgatccacggaaagaaaagtaagttgttttggcgaacatcaaagaaaaaacttgctcttcttccgtaaaaaattcttctaaaaataccgaacccaaccattgcataaaagctcgtaccgtgcttttatgtttacgagctaaagttctagcgcatgaaagtcgaagtatatactttagtcgatacaaagtcttcttttttgaagatccactgtgataatgaaaaagatttctacatatccgaccaaaccgatcaagaatatcccaatc
cgataaatcggtccaaattggtttactaataggatgccccgatccagtacaaaattgggcttttgctaaagatccaatgagaggagtaacagggactttggtatcgaattttttcatttgagtatctattagaaatgaattctccagcatttgattccttactaacaaagaatttattggtacacttgaaaagtaccccagaaaatcgaagcaagagttttctaattggtttagatggatcctttgcggttgagtccaaaaagagaaagaatattgccacaaacggacaaggtaacatttccatttcttcttcaaaagaagagttccttttgatgcaagaattgcctttccttgatatcgaacataatgcataaggggatccataacgaaccatatggttttccgaaaaaaagcagggtacattaacccaaaatgttccatcttcctagaaaagatgattcgttccagaaaggttccggaagaagttaatcgcaagcaagaagattgtttacgaagaaacaacaagaaaaattcatattctgatacataagagttatataggaaccgaaatagtcttttattttcttttttcaaaataaaaatggatttcattgaagtaataaaactattccaattcgagtagtagttgagaaagaatcgcaataaatgcaaggatggaacatcttggatccggtattgaaggagttgaagcaagatatccaaatggataggatagggtatttctatatgtgctagataatgtaagtgcaaaaatttgtcttctaaaaaaggaaatattgaatgaatagatcgtaaattctgaaactttggtatttctttttcttccggacaagactgttctcgtagcgagaatgggatttctacaacgatcgcaaacccctcagatagaatctgagaataaaactcagaataaaaaaaattgttgtaatccaataatcgatcttggttaggatgattaaccaaattaatccaaaaattctgctgatacattcgaatcattaaccgtttcacaagtagtgaactaaatttcttgttattagaaccaataatttcgacaagttcggaaccatttaatccataatcatgggcaaacacataaatgtactcctgaaagagtagtgggtagacgaaatattgtctaggaaatttaagtttttctgaataaccctcgaatttttccatttgtatttctacttgaatcagagagagagaaatatttctcggtttatcaaatggtgatacatagtacaatatggtcagaacagggtgttgcattttttaatacaaacccctggggaagaaaaggagtctaatccacggatctttttccgctccttttctatccaatttgtttatgtttgttctaattacaaaagagaacaaatcctttatttttgcaggccaattgctcttttgactttgggatacagtctctttatcaatatactgcttcttttacacattcaatccataacatccttttcaatccaaaatcaagaataattaggatttctaaaaaaaaaagaaaaaatcaaaggtctactcataggaaaaccagcttttccctacatcaggcactaatctatttttaacgtctaattagatcagggagttcttccaattaagaagttaagctcgttgctttttgttttaccagaattggagccaggctctatccatttattcattagacccagaaaatcagaatttttttattccattccaaaaatccaaaataagaaattgattttattacgacatgctattttttccattcattacccttgaggatcagtcgcggtcttatagactctaccaagagtctggacgaattttttgcttcatccaaatgtgtaaaagatcatagtcgcacttaaaagccgagtactctaccattgagttagcaacccagataaactaggatcttagatacgatcgaaatccaaaaatcaatggaattacaccgcacacccctgtca
aaatcttaaaatagcaagacattaaaagaaagattttatcaccattgaaaacactcagataccaaaaggaacgggtctggttaaatttcactaaggttaaaagtggcaccaatcacgatcgtaaaattgtcatttttttagcatttttatttaaataaataaataaatcttgtatgagagtacaaacaagagggacaaccctaccatttgagcaaagtgtaggcaaaaaacctaatagggagtgaggataaagagacttatccatctacaaattctagatgttcaatggacctttgtcaatggaaatacaatggtaagaaaaaaattagatagaaaaactcaaaaaaataaaggcttatgttggattggcacgacataaatccagtcaaaaataggattaagaaagaggcaaattatttctaaatagttagacaacaagggatactagtgagcctctcctagttttttattcatttagttcttcaattaactcaaagttctttctttttctttaaagaattccgccttccttaaaatatcagaaacggttcttgtaggttgagcacctttttcaaggaaatagagaatagctggaacatttaaacaagtttgattctttatcggatcataaaaacctacttttcgaagatctcttccttctcttcgagatcgaacatcaattgcaacgattcgatagacagcttattgggatagatgtagataaataaagccccccctagaaacgtataggaggttttctcctcatacggctcgagaatatgacttgcattaatttccgtacagaaaaaacaaatttcatttatactcatgactcaagttgactaattttgattgacagacttgaaagaaaaaaatcctttgaaattttttgagtcgtctctaaactcttttctttgcctcatctcgaacaaattcacttttattccttattccggtccaattctattgttgagacagttgaaaatcgtgtttacttgttcgggaatcctttatctttgatttgtgaaatccttgggtttaaacattacttcgggaattcttattcttttttctttcaaaagagtagcaacatacccttttttcttatttccttcgataaagcatttccctcttctatagaaatcgaatatgagcgattgattctgatagactttaatcaaaagagttttcccatatcttccaaaattggactttcttcttattttaaccttttgatttctatattatttcgatttctatattaagggtagaatgacaaagttggcctaatttattagttttcactaaccctagattctttcccttgataaaaaataaattctgtcctctcgagctccatcgtgtactatttacttagcttacttacaaacaacccagcgaaaattcggttcgggacgaatagaacagactatgtcgagccaagagcattttcattactatggaaaatggtggatagcaaaatccacaatcgatcgtgtccttcaagtcgcacgttgctttctaccacatcgttttaaacgaagttttaacataacattcctctaatttcattgcaaagtgttatagggaattgatccaatatggatggaatcatgaatagtcattagtttcgttttttgtatactaattcaaacttgctttgctatctatggagaaatatgaataaaagaaattaagtatttatcgggaaagactccgcaaagagccaatttatttaaacccatattctatcatatgaatgaaatatagttcgaaaaaagggaataaacaagtttgcttaagacttatttattatggaatttccatcctcaacagaggactcgagatgatcaatccaatcctgaaatgataagagaagaattgactcttctccaacaaataaactatcaacctcccgtttaattaatttaattaatatattagattagcaatctatttttccataccatttttccgtaacaaaactaattaactattaa
ctagttaaactattgcaatgaaaagaaagttttttggtagttatagaattctcgtatttcttcgactcgaataccaaaagaaagaaaaaaatgaagtaaaaaaaacgcatttcctgtaaagtaaaattaaggtctttgcttttacttattttttcttttacctaaaagaagcaactccaaatcaaaattgaatccattctatctaacgagcagttcttatcttatctttaccgggatggatcattctggatatttaaaaaatcgcggatcgagatcgtttttgcttaaccaaagaaagaaaaagaagaaggaaccttttttactaataaaatactataaaaaaaatttatctctatcataaatctatctctaccataaaggaataggtctcgttttttatacaatgttctacgtcaagtttaaaattttttcatgaaaaaaagattttcaatttgactggacttgacactggattatgttttctgagacagaaaatgaacgcattaggactgcatcgaatctaagagtttataagagaaaaaaattctctttaataaactttatgtctcgtgcagaatacaatacgatttcatctttcgtttcatcagaaaaaatctgggacggaaggattcgaacctccgagtaacgggaccaaaacccgctgccttaccacttggccacgccccatttcgggttttatgcgacactaataaacagtattatgtttatttcttattcgtcaatcctacttcaattacataaaaatggggggtattctcttggtaggattctagacatgcgaataatatagaatccaaaaaatgcattgatcattacatggaattctattaagatattatatgaaagtcgaatttcttccactctcatttgagagtgcgaatacaaggaggtattttgtgtttgggaaagtccgaagaaaaaaggattttgaatcctccttttcctttttcccttagaaaaataactcaatcaaaatccaattatctactctacaagaacgaaacgcttgttatgcctaatatacttagtttaacctgtatttgttttaattctgttatttatccgactagttttttcttcgccaaattgcccgaagcttatgccattttcaatccaatcgtggattttatgcctgtcatacctgtactcttttttctattagcctttgtttggcaagctgctgtaagttttcgatgaaatctttactactctgtctgccaaattgaatcatgtattcattctaaaaaaattcgaaaaatggataagagccgagaagtcttatattatgaaccttcgattctaaaattcaaattcttctacattgaatgtatagctgcagcaataaatttggatcagcctttctactccctgcatctacgttgagcaggtatctttaggtaaccgcacaatacctaacctaatttattgataagagtgcttattataaatcaattcttgcaatttttttcaaaaattgatttttgcatttttaggtgtcaaaataaacaaaacccatcctagtggatttgtgtggtaaggaaaaacgggtaatctattccttaaaaaaaaatcttggagattatgtaatgcttactctcaaactttttgtttatacagtagtgatattctttgtttccctctttatctttggattcttatctaatgatccaggacgtaatcctgggcgtgacgagtaaaaatccaaaattttttcttacaaattggatttgtttcatacatttatctacgagaaaatccgggggtcagaattccttccaattcgaaagtcccaaacgatccgagggggcggaaagagagggattcgaaccctcggtacaaaaaaattgtacaacggattagcaatccgccgctttagtccactcagccatctctccccgttccaaatcgaaaggtttccgtgatatgacagaggcaagaaataacgattgcaaaaaatccttcctttttctttcaaaagt
tcaaaaaaattatattgccaattccattttagttatattcttttttcttaatgttaataaaaaaaagaagaaaattcttcttttttctttctaattctaaaattggatattggctaaaagacaatcagatagattttctcttcagcaggcatttccatataggacttgttataataaaacaagcaggttatagaaaaaaactcttttttttattatttatcaacaaagcaaaaaggggtcttatcaaaccaacccaccccataaaattggaaagaaagataaagtaagtggacctgactccttgaatgaggcctctatccgctattctgatatataaattcgatgtagatgaaattgtataagtggatttttttgtatttccttagacttagaccacgcaaggcaagaatttctcgctatttactatttcatattcttgttactagatgttctataggaataagaagaaatcgcaacccctttccgctacacataaaaatggatttcgaaagtcaatttttcttttcaatatctttactttttttcagaatcctatttttgttcttatacccatgcaatagagagcgagtgggaaaagggaggttactttttttcattttttccttaaaaaataggctttcttggaaataggaatcatggaataatctgaattccaatgtttatttctatagtataagaaaaactaattgaatcaaattcatggatttaccacgacctcggctgtgaccccatagataaaaatgcaaaatttctatcttcgagaccattgaaaaaaggcattgaacgagaaaaaatcgtccacagataatctatcgtatgccttggaagtgatataaggtgctcggaaatggttgaagtaattgaataggaggatcactatgactatagcccttggtagagttactaaagaagaaaatgatttatttgatattatggacgactggttacgaagggaccgttttgtttttgtaggatggtctggcctattgctttttccttgtgcttatttcgctttaggaggttggtttacagggacaacttttgtaacttcttggtatacccatggattggcgagttcctatttggaaggttgcaatttcttaaccgcagcagtttccacccctgccaatagtttagcacactctttgttgctactatggggcccggaagcacaaggggattttactcgttggtgtcaattaggtggtctgtggacttttgttgctctccatggggcttttgcactaataggtttcatgttacgtcaatttgaacttgctcggtctgttcaattgcggccttataatgcaatttcattctctggcccaatcgctgtttttgtttccgtattcctgatttatccactggggcaatccggttggttctttgcgccgagttttggcgtagcagcgatatttcgattcatcctcttcttccaaggatttcataattggacgttgaacccatttcatatgatgggagttgccggagtattaggcgcggctctgctatgcgctattcatggggcaaccgtggacccaatatcttgcttcagcaagatattgggtatttctagctttcctttcttcaaaaattgctatatgttagcagaaaagccttatccattaagagatggaacttcaagagcagctaggtctagagggaagttgtgagcattacgttcgtgcattacttccataccaagattagcacggttgatgatatcagcccaagtattaataacgcgaccttggctatcaactacagattggttgaaattgaatccgtttagattgaaagccatagtactaatacctaaagcagtgaaccaaatccctactacaggccaagcagccaagaagaagtgtaaagaacgagagttgttaaaactagcatattggaagattaatcggccaaaataaccatgagcggccacaatattataagtttcttcctcttgaccaaatctgtaaccctca
ttagcagattcgttttcagtggtttccctgatcaaactagaggttaccaaggaaccatgcatagcactgaatagggaaccgccgaatacaccagctacacctaacatgtgaaatggatgcataaggatgttatgctctgcctggaatacaatcataaagttgaaagtaccagatattcctaaaggcataccatcagagaaacttccttgaccaatagggtaaatcaagaaaacagcagtagcagctgcaacaggagctgaatatgcaacagcaatccaaggacgcatacccagacggaaactcagttcccactcacgacccatataacaagctacaccaagtaagaagtgtagaacaattagctcataaggaccaccattgtataaccactcatcaacagatgcagcttcccaaattgggtaaaagtgcaatccgatcgccgcagaagtaggaataatggcaccagagataatattgtttccgtaaagtaaagaaccagaaacaggctcacgaataccatcaatatctactggaggggcagcgatgaaggcgataataaatacagaagttgcggtcaataaggtagggatcatcaaaacaccgaaccatccgatgtaaagacggttttcggtgctagttatccagttgcagaagcgaccccacaggcttgtactttcgcgtctctctaaaattgcagtcatggtaagatcttggtttattcaaattgcaaggactcccaagcacacgtattaactagaaagataatagaaggcttgttatttaacagtataatatagactatataccaatgtcaaccaagccagccccgacagttgtatatccatacaacaaaatttaccaaaccaaaaaattttgtaaatgaagtgagtgaaaaatcaaaactcagattgctcctttctagtttccatatgggttgcccgggactcgaacccggaactagtcggatggagtagataattattccttgttacaatagagaaaaaacctctccccaaatcgtgcttgcatttttcattgcacacgactttccctatgtagaaataggctatttctattccgaagaggaagtctactaatttttttagtagtaagttgattcacttactatttattatagtacagagaacatttcagaatggaaactgtgaaagttttaccttgatcatttatcaatcatttctagtttattagttttgtttaatgattaattaagaggattcaccagatcattgatacggagaatatccaaataccaaatacgctcactgtgcgatccacggaaagaaaagtaagttgttttggcgaacatcaaagaaaaaacttgctcttcttccgtaaaaaattcttctaaaaataccgaacccaaccattgcataaaagctcgtaccgtgcttttatgtttacgagctaaagttctagcgcatgaaagtcgaagtatatactttagtcgatacaaagtcttcttttttgaagatccactgtgataatgaaaaagatttctacatatccgaccaaaccgatcaagaatatcccaatccgataaatcggtccaaattggtttactaataggatgccccgatccagtacaaaattgggcttttgctaaagatccaatgagaggagtaacagggactttggtatcgaattttttcatttgagtatctattagaaatgaattctccagcatttgattccttactaacaaagaatttattggtacacttgaaaagtaccccagaaaatcgaagcaagagttttctaattggtttagatggatcctttgcggttgagtccaaaaagagaaagaatattgccacaaacggacaaggtaacatttccatttcttcttcaaaagaagagttccttttgatgcaagaattgcctttccttgatatcgaacataatgcataaggggatccataacgaaccatatggttttccgaaaaaaagcagggtacattaacccaaaatgttccatc
ttcctagaaaagatgattcgttccagaaaggttccggaagaagttaatcgcaagcaagaagattgtttacgaagaaacaacaagaaaaattcatattctgatacataagagttatataggaaccgaaatagtcttttattttcttttttcaaaataaaaatggatttcattgaagtaataaaactattccaattcgagtagtagttgagaaagaatcgcaataaatgcaaggatggaacatcttggatccggtattgaaggagttgaagcaagatatccaaatggataggatagggtatttctatatgtgctagataatgtaagtgcaaaaatttgtcttctaaaaaaggaaatattgaatgaatagatcgtaaattctgaaactttggtatttctttttcttccggacaagactgttctcgtagcgagaatgggatttctacaacgatcgcaaacccctcagatagaatctgagaataaaactcagaataaaaaaaattgttgtaatccaataatcgatcttggttaggatgattaaccaaattaatccaaaaattctgctgatacattcgaatcattaaccgtttcacaagtagtgaactaaatttcttgttattagaaccaataatttcgacaagttcggaaccatttaatccataatcatgggcaaacacataaatgtactcctgaaagagtagtgggtagacgaaatattgtctaggaaatttaagtttttctgaataaccctcgaatttttccatttgtatttctacttgaatcagagagagagaaatatttctcggtttatcaaatggtgatacatagtacaatatggtcagaacagggtgttgcattttttaatacaaacccctggggaagaaaaggagtctaatccacggatctttttccgctccttttctatccaatttgtttatgtttgttctaattacaaaagagaacaaatcctttatttttgcaggccaattgctcttttgactttgggatacagtctctttatcaatatactgcttcttttacacattcaatccataacatccttttcaatccaaaatcaagaataattaggatttctaaaaaaaaaagaaaaaatcaaaggtctactcataggaaaaccagcttttccctacatcaggcactaatctatttttaacgtctaattagatcagggagttcttccaattaagaagttaagctcgttgctttttgttttaccagaattggagccaggctctatccatttattcattagacccagaaaatcagaatttttttattccattccaaaaatccaaaataagaaattgattttattacgacatgctattttttccattcattacccttgaggatcagtcgcggtcttatagactctaccaagagtctggacgaattttttgcttcatccaaatgtgtaaaagatcatagtcgcacttaaaagccgagtactctaccattgagttagcaacccagataaactaggatcttagatacgatcgaaatccaaaaatcaatggaattacaccgcacacccctgtcaaaatcttaaaatagcaagacattaaaagaaagattttatcaccattgaaaacactcagataccaaaaggaacgggtctggttaaatttcactaaggttaaaagtggcaccaatcacgatcgtaaaattgtcatttttttagcatttttatttaaataaataaataaatcttgtatgagagtacaaacaagagggacaaccctaccatttgagcaaagtgtaggcaaaaaacctaatagggagtgaggataaagagacttatccatctacaaattctagatgttcaatggacctttgtcaatggaaatacaatggtaagaaaaaaattagatagaaaaactcaaaaaaataaaggcttatgttggattggcacgacataaatccagtcaaaaataggattaagaaagaggcaaattatttctaaatagttagacaacaaggga
tactagtgagcctctcctagttttttattcatttagttcttcaattaactcaaagttctttctttttctttaaagaattccgccttccttaaaatatcagaaacggttcttgtaggttgagcacctttttcaaggaaatagagaatagctggaacatttaaacaagtttgattctttatcggatcataaaaacctacttttcgaagatctcttccttctcttcgagatcgaacatcaattgcaacgattcgatagacagcttattgggatagatgtagataaataaagccccccctagaaacgtataggaggttttctcctcatacggctcgagaatatgacttgcattaatttccgtacagaaaaaacaaatttcatttatactcatgactcaagttgactaattttgattgacagacttgaaagaaaaaaatcctttgaaattttttgagtcgtctctaaactcttttctttgcctcatctcgaacaaattcacttttattccttattccggtccaattctattgttgagacagttgaaaatcgtgtttacttgttcgggaatcctttatctttgatttgtgaaatccttgggtttaaacattacttcgggaattcttattcttttttctttcaaaagagtagcaacatacccttttttcttatttccttcgataaagcatttccctcttctatagaaatcgaatatgagcgattgattctgatagactttaatcaaaagagttttcccatatcttccaaaattggactttcttcttattttaaccttttgatttctatattatttcgatttctatattaagggtagaatgacaaagttggcctaatttattagttttcactaaccctagattctttcccttgataaaaaataaattctgtcctctcgagctccatcgtgtactatttacttagcttacttacaaacaacccagcgaaaattcggttcgggacgaatagaacagactatgtcgagccaagagcattttcattactatggaaaatggtggatagcaaaatccacaatcgatcgtgtccttcaagtcgcacgttgctttctaccacatcgttttaaacgaagttttaacataacattcctctaatttcattgcaaagtgttatagggaattgatccaatatggatggaatcatgaatagtcattagtttcgttttttgtatactaattcaaacttgctttgctatctatggagaaatatgaataaaagaaattaagtatttatcgggaaagactccgcaaagagccaatttatttaaacccatattctatcatatgaatgaaatatagttcgaaaaaagggaataaacaagtttgcttaagacttatttattatggaatttccatcctcaacagaggactcgagatgatcaatccaatcctgaaatgataagagaagaattgactcttctccaacaaataaactatcaacctcccgtttaattaatttaattaatatattagattagcaatctatttttccataccatttttccgtaacaaaactaattaactattaactagttaaactattgcaatgaaaagaaagttttttggtagttatagaattctcgtatttcttcgactcgaataccaaaagaaagaaaaaaatgaagtaaaaaaaacgcatttcctgtaaagtaaaattaaggtctttgcttttacttattttttcttttacctaaaagaagcaactccaaatcaaaattgaatccattctatctaacgagcagttcttatcttatctttaccgggatggatcattctggatatttaaaaaatcgcggatcgagatcgtttttgcttaaccaaagaaagaaaaagaagaaggaaccttttttactaataaaatactataaaaaaaatttatctctatcataaatctatctctaccataaaggaataggtctcgttttttatacaatgttctacgtcaagtttaaaattttttcatgaaaaaaa
gattttcaatttgactggacttgacactggattatgttttctgagacagaaaatgaacgcattaggactgcatcgaatctaagagtttataagagaaaaaaattctctttaataaactttatgtctcgtgcagaatacaatacgatttcatctttcgtttcatcagaaaaaatctgggacggaaggattcgaacctccgagtaacgggaccaaaacccgctgccttaccacttggccacgccccatttcgggttttatgcgacactaataaacagtattatgtttatttcttattcgtcaatcctacttcaattacataaaaatggggggtattctcttggtaggattctagacatgcgaataatatagaatccaaaaaatgcattgatcattacatggaattctattaagatattatatgaaagtcgaatttcttccactctcatttgagagtgcgaatacaaggaggtattttgtgtttgggaaagtccgaagaaaaaaggattttgaatcctccttttcctttttcccttagaaaaataactcaatcaaaatccaattatctactctacaagaacgaaacgcttgttatgcctaatatacttagtttaacctgtatttgttttaattctgttatttatccgactagttttttcttcgccaaattgcccgaagcttatgccattttcaatccaatcgtggattttatgcctgtcatacctgtactcttttttctattagcctttgtttggcaagctgctgtaagttttcgatgaaatctttactactctgtctgccaaattgaatcatgtattcattctaaaaaaattcgaaaaatggataagagccgagaagtcttatattatgaaccttcgattctaaaattcaaattcttctacattgaatgtatagctgcagcaataaatttggatcagcctttctactccctgcatctacgttgagcaggtatctttaggtaaccgcacaatacctaacctaatttattgataagagtgcttattataaatcaattcttgcaatttttttcaaaaattgatttttgcatttttaggtgtcaaaataaacaaaacccatcctagtggatttgtgtggtaaggaaaaacgggtaatctattccttaaaaaaaaatcttggagattatgtaatgcttactctcaaactttttgtttatacagtagtgatattctttgtttccctctttatctttggattcttatctaatgatccaggacgtaatcctgggcgtgacgagtaaaaatccaaaattttttcttacaaattggatttgtttcatacatttatctacgagaaaatccgggggtcagaattccttccaattcgaaagtcccaaacgatccgagggggcggaaagagagggattcgaaccctcggtacaaaaaaattgtacaacggattagcaatccgccgctttagtccactcagccatctctccccgttccaaatcgaaaggtttccgtgatatgacagaggcaagaaataacgattgcaaaaaatccttcctttttctttcaaaagttcaaaaaaattatattgccaattccattttagttatattcttttttcttaatgttaataaaaaaaagaagaaaattcttcttttttctttctaattctaaaattggatattggctaaaagacaatcagatagattttctcttcagcaggcatttccatataggacttgttataataaaacaagcaggttatagaaaaaaactcttttttttattatttatcaacaaagcaaaaaggggtcttatcaaaccaacccaccccataaaattggaaagaaagataaagtaagtggacctgactccttgaatgaggcctctatccgctattctgatatataaattcgatgtagatgaaattgtataagtggatttttttgtatttccttagacttagaccacgcaaggcaagaatttctcgctatttactatttcatattcttgtta
ctagatgttctataggaataagaagaaatcgcaacccctttccgctacacataaaaatggatttcgaaagtcaatttttcttttcaatatctttactttttttcagaatcctatttttgttcttatacccatgcaatagagagcgagtgggaaaagggaggttactttttttcattttttccttaaaaaataggctttcttggaaataggaatcatggaataatctgaattccaatgtttatttctatagtataagaaaaactaattgaatcaaattcatggatttaccacgacctcggctgtgaccccatagataaaaatgcaaaatttctatcttcgagaccattgaaaaaaggcattgaacgagaaaaaatcgtccacagataatctatcgtatgccttggaagtgatataaggtgctcggaaatggttgaagtaattgaataggaggatcactatgactatagcccttggtagagttactaaagaagaaaatgatttatttgatattatggacgactggttacgaagggaccgttttgtttttgtaggatggtctggcctattgctttttccttgtgcttatttcgctttaggaggttggtttacagggacaacttttgtaacttcttggtatacccatggattggcgagttcctatttggaaggttgcaatttcttaaccgcagcagtttccacccctgccaatagtttagcacactctttgttgctactatggggcccggaagcacaaggggattttactcgttggtgtcaattaggtggtctgtggacttttgttgctctccatggggcttttgcactaataggtttcatgttacgtcaatttgaacttgctcggtctgttcaattgcggccttataatgcaatttcattctctggcccaatcgctgtttttgtttccgtattcctgatttatccactggggcaatccggttggttctttgcgccgagttttggcgtagcagcgatatttcgattcatcctcttcttccaaggatttcataattggacgttgaacccatttcatatgatgggagttgccggagtattaggcgcggctctgctatgcgctattcatggggcaaccgtgga
Protein sequences: 20 letters (amino acids, AA) Length: 50 – 6000 AA Example: a human immunoglobulin ID A1BG_HUMAN STANDARD; PRT; 495 AA. ... ... ... KW Immunoglobulin domain; Glycoprotein; Plasma; Repeat; Signal. ... ... ... SQ SEQUENCE 495 AA; 54209 MW; 87A50C21CE89459C CRC64; MSMLVVFLLL WGVTWGPVTE AAIFYETQPS LWAESESLLK PLANVTLTCQ ARLETPDFQL FKNGVAQEPV HLDSPAIKHQ FLLTGDTQGR YRCRSGLSTG WTQLGKLLEL TGPKSLPAPW LSMAPVPWIT PGLKTTAVCR GVLRGETFLL RREGDHEFLE VPEAQEDVEA TFPVHQPGNY SCSYRTDGEG ALSEPSATVT IEELAAPPPP VLMHHGESSQ VLHPGNKVTL TCVAPLSGVD FQLRRGEKEL LVPRSSTSPD RIFFHLNAVA LGDGGHYTCR YRLHDNQNGW SGDSAPVELI LSDETLPAPE FSPEPESGRA LRLRCLAPLE GARFALVRED RGGRRVHRFQ SPAGTEALFE LHNISVADSA NYSCVYVDLK PPFGGSAPSE RLELHVDGPP PRPQLRATWS GAALAGRDAV LRCEGPIPDV TFELLREGET KAVKTIPTPG AAANLELIFV GPQHAGNYRC RYRSWVPHTF ESELSDPVEL LVAES //
Gene-Finding by Computer Starting from the early 1980s: • “Ab initio” or “de novo” algorithms: GeneMark, GenScan, FgeneSH, Genie, … based on gene-structure models and training data. (Our ongoing project: BGF, the BGI Gene Finder) • Homology-based methods: sequence alignment with known genes in databases, and comparative genomics of not-too-distant species • Mixed approach using both strategies: TwinScan
Different Stages of Gene-Finding • Use all possible existing programs and services on the web with a public-domain or home-made genome viewer • Write your own gene-finder, trained for the specific organism • A dream for the time being: design a self-training and self-developing program “for any species” which would improve itself iteratively starting from a few available reads, cDNAs, and ESTs
Performance of Gene-Finders in Eukaryotic Genomes • M. Q. Zhang, Nature Reviews Genetics 3 (2002) 698-710 (mostly for the human genome): nucleotide level 80%; exon level 45%; whole gene structure 20% • FgeneSH and BGF for rice (our tests on 128 cDNA-confirmed single-gene genomic sequences): nucleotide level 90%; exon level 60%; whole gene structure 40%
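The slide does not state which accuracy measure the percentages refer to. A common convention in gene-finder evaluation is to report sensitivity (fraction of true coding positions that are predicted) and specificity (fraction of predicted coding positions that are true). The following is a minimal sketch of that convention at the nucleotide level; the function name and the mask representation are assumptions made for illustration, not part of the talk.

```python
def nucleotide_accuracy(true_mask, pred_mask):
    """Nucleotide-level sensitivity and specificity.

    Both arguments are equal-length sequences of 0/1 flags, where 1 marks
    a coding position (hypothetical representation for this sketch).
    Specificity follows the gene-finding convention TP / (TP + FP).
    """
    pairs = [(bool(t), bool(p)) for t, p in zip(true_mask, pred_mask)]
    tp = sum(1 for t, p in pairs if t and p)        # coding, predicted coding
    fn = sum(1 for t, p in pairs if t and not p)    # coding, missed
    fp = sum(1 for t, p in pairs if p and not t)    # non-coding, predicted coding
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity
```

Exon-level and whole-gene measures are stricter because they require exact boundaries, which is consistent with the lower figures quoted for those levels.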
The two DNA strands (each read 5’ → 3’) • Each strand carries the same amount of information, but a different set of genes. • The two strands are equivalent in information content. • The two strands are not equivalent in gene content. • Biological processing (replication, transcription) goes from 5’ to 3’. • Genes can be found on one strand at a time or on both strands at once: two-pass or one-pass programs.
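Because both strands are read 5’ to 3’, a program that scans one strand at a time has to build the reverse complement for its second pass. A minimal sketch of that step follows; the function names and the coordinate convention for the minus strand are assumptions made for illustration.

```python
# Complement table for lowercase DNA letters (the slides use a, c, g, t).
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_both_strands(seq: str, motif: str) -> dict:
    """Report 0-based start positions of `motif` on each strand.

    Minus-strand positions are given in the coordinates of the
    reverse-complemented sequence (an assumed convention for this sketch).
    """
    def positions(s):
        hits, i = [], s.find(motif)
        while i != -1:
            hits.append(i)
            i = s.find(motif, i + 1)   # allow overlapping matches
        return hits
    return {"+": positions(seq), "-": positions(reverse_complement(seq))}
```

For example, `find_both_strands(seq, "atg")` lists candidate start codons on both strands in two passes over the data.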
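Genomic DNA (5’ → 3’, with start and stop signals) → transcription (RNA Pol II + …) → pre-mRNA • Pre-mRNA → splicing (spliceosome: U1, U2, U4, U5, U6 snRNPs) → mRNA (with 5’-UTR and 3’-UTR) • mRNA → translation (ribosome; initiation, elongation, and termination factors) → amino acid sequence (protein primary sequence) • Amino acid sequence → folding (chaperonins) → folded protein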
Three Scales of Search • Local: signals with a minimal signature (start, stop, splicing); movable signals (caps, promoters, polyAs, branching points, some very weak): clustering, discriminant analysis, various statistical models • Intermediate: exons, introns, intergenic regions: Markov, semi-Markov, and hidden Markov models; intron length distribution • Global: optimal combination of the above: dynamic programming
Order of signals along a gene: transcription start, translation start, translation end, transcription end: {()【(.)(.)(.)】()} Signals: • { transcription start (downstream of promoters) • } transcription end (upstream of poly-A) • 【 translation start (atg, 1/64 in a random seq.) • 】 translation end (tag, tga, taa, 3/64) • ( splicing donor site (minimal signal = gt, 1/16) • ) splicing acceptor site (ag, 1/16) • . branching point (very weak …a…)
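The random-sequence frequencies quoted for the signals follow from assuming that each of the four bases is equally likely and independent: a specific k-letter word then starts at a given position with probability (1/4)^k. The sketch below reproduces those numbers and scans a sequence for the minimal signals, taking atg as the canonical start codon; the dictionary and function names are assumptions made for illustration.

```python
from fractions import Fraction

# Minimal signatures of the local signals (lowercase, as in the slides).
SIGNALS = {
    "translation start": ["atg"],
    "translation end": ["tag", "tga", "taa"],
    "splice donor": ["gt"],
    "splice acceptor": ["ag"],
}

def random_frequency(motifs):
    """Chance that a position starts one of `motifs` in an equiprobable
    random sequence; the motifs here are distinct words of equal length."""
    return sum(Fraction(1, 4 ** len(m)) for m in motifs)

def scan(seq, motifs):
    """0-based positions where any of `motifs` begins."""
    return [i for i in range(len(seq))
            if any(seq.startswith(m, i) for m in motifs)]
```

The high background rates (one gt or ag every 16 bases on average) show why the minimal signatures alone cannot locate splice sites and why statistical models of the surrounding sequence are needed.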
Order of signals along a gene: transcription start, translation start, translation end, transcription end: {()【(.)(.)(.)】()} Regions delimited by pairs of signals: • 【( First exon • )( Internal exon • )】 Last exon • {( Non-coding 5’ exon • )【 Non-coding 5’ exon • (.) Intron • 】( Non-coding 3’ exon (rare) • )} Non-coding 3’ exon (rare) • }{ Intergenic region
Signal and Sequence Models • eiid: equal probability independently and identically distributed • niid: non-equal probability independently and identically distributed • WWM: Windowed weight matrix, etc. • MMn: Markov chain model of order n: homogeneous and period-3 MM5 are used in many gene-finders • Consensus sequence
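To make the MMn entry concrete, here is a minimal sketch of a homogeneous Markov chain of order n: it estimates the probability of each base given the preceding n bases and scores a sequence by its log-likelihood. Setting the order to 0 gives the niid model. The function names and the add-one smoothing are assumptions made for illustration; a period-3 model would keep three such tables, one per codon position, which this sketch does not do.

```python
import math
from collections import defaultdict

ALPHABET = "acgt"

def train_markov(seqs, order, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Returns a function prob(context, base). The pseudocount (an assumption
    of this sketch) keeps unseen transitions from getting probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + pseudocount * len(ALPHABET)
        return (c.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order):
    """Log-probability of `seq` given its first `order` bases."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))
```

A gene-finder would train one such model on coding sequence and another on non-coding sequence, then compare the two log-likelihoods over a candidate region.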
Consensus Sequences (numbers give the percent occurrence of each consensus base) • TATAAT (Pribnow or -10 box): T(80) A(95) T(45) A(60) A(50) T(96) • TTGACA (-35 box): T(82) T(84) G(78) A(65) C(54) A(45) • CAAT (CAAT or -75 box): GGYCAATCT • TATA (TATA or Goldberg-Hogness box): TATAWAW • ATG (translation start codon). However, in Aful: ATG 76%, GTG 22%, TTG 2%
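A consensus with per-position frequencies can be turned into a simple weight matrix and used to score candidate sites. The sketch below does this for the -10 box using the percentages on the slide. Two things are assumptions of the sketch, not data from the talk: the probability left over at each position is split equally among the other three bases, and the background is uniform (1/4 per base).

```python
import math

# (consensus base, frequency) for the -10 box, from the slide.
PRIBNOW = [("t", 0.80), ("a", 0.95), ("t", 0.45),
           ("a", 0.60), ("a", 0.50), ("t", 0.96)]

def build_matrix(consensus):
    """One column per position: probability of each base.

    Assumption: non-consensus bases share the remaining probability equally.
    """
    matrix = []
    for base, p in consensus:
        other = (1.0 - p) / 3.0
        matrix.append({b: (p if b == base else other) for b in "acgt"})
    return matrix

def log_odds(site, matrix, background=0.25):
    """Log2 odds of `site` under the matrix versus a uniform background."""
    return sum(math.log2(col[b] / background) for b, col in zip(site, matrix))

def best_site(seq, matrix):
    """(score, position) of the best window; `seq` must be at least as
    long as the matrix."""
    w = len(matrix)
    return max((log_odds(seq[i:i + w], matrix), i)
               for i in range(len(seq) - w + 1))
```

Positions with a weak consensus, such as the third base at 45%, contribute little to the score, which is how a weight matrix expresses that some positions of a signal matter more than others.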
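GT-AG Rule for Introns (numbers give the percent occurrence of each consensus base) • 5’ splicing donor site: exon … A(64) G(73) G(100) T(100) A(62) A(68) G(84) T(63) … • 3’ splicing acceptor site: … (Py)12 N C(65) A(100) G(100) N … exon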
[Figure: exon and intron size distributions for Arabidopsis, rice, and human]
Algorithms • Sequence models and scores for signals • Dynamic programming: optimal parse • Hidden Markov model (HMM): implies a geometric distribution of intron lengths • Hidden semi-Markov model (semi-HMM): needs a sequence-generating model and a length distribution for each node • Language theory approach
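The dynamic-programming "optimal parse" and the geometric length distribution of an ordinary HMM can both be seen in a toy example. The sketch below is a two-state HMM (non-coding N, coding C) decoded with the Viterbi algorithm. All states and probabilities are invented for illustration and bear no relation to the parameters of GenScan or BGF.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy labels)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},  # AT-rich (made up)
    "C": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},  # GC-rich (made up)
}

def viterbi(seq):
    """Most probable state path for a non-empty `seq` (log-space DP)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for x in seq[1:]:
        new, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][x])
            ptr[s] = prev
        score = new
        back.append(ptr)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):   # trace the pointers back to the start
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

def geometric_length_pmf(p_stay, length):
    """P(a state lasts exactly `length` steps) given self-transition p_stay."""
    return p_stay ** (length - 1) * (1.0 - p_stay)
```

Because a state is left with a fixed probability at every step, its duration is necessarily geometric; a semi-HMM replaces this with an explicit length distribution per state, which is what allows realistic exon and intron lengths.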
Flow Chart of GenScan Chris Burge (1996): A 27-state semi-HMM A simpler model: 19-state A model taking UTR introns into account: 35-state
Figure: N, intergenic region; P, promoter; F, 5’ UTR; T, 3’ UTR; A, polyadenylation signal. The remaining states are the single-exon gene, initial exon, phase k internal exon, terminal exon, and phase k intron.
Problems: Minor and Major • Ambiguity symbols (N, W, S, R, …) • (1-p) at flanking D-type nodes • Indels and frame-shifts • Gradient effects in gene structure • Introns in 5’-UTRs and 3’-UTRs, which lead to 35-state Markov models • Alternative splicing and sub-optimal paths • Limits of probabilistic models • Deterministic approaches
Dyck language: A language of nested parentheses • Many types of parentheses • Finite depth of nesting • Context-free language Our case: • Only 3 types of parentheses • Shallow nesting • Conjecture: may be a regular language
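The role of nesting depth can be illustrated with a small checker for the three bracket types used in the signal notation. When the depth is bounded, the stack can only take finitely many configurations, so a finite automaton suffices; that is the standard reason a depth-bounded Dyck language is regular. The sketch below only illustrates this point and says nothing about whether the full gene grammar is regular; the function name and the `max_depth` parameter are assumptions.

```python
# Closing bracket -> matching opening bracket, for the three signal pairs.
PAIRS = {")": "(", "】": "【", "}": "{"}
OPENERS = set(PAIRS.values())

def is_nested(s, max_depth=None):
    """True if the brackets in `s` are properly nested.

    With `max_depth` set, strings nested more deeply are rejected.
    Non-bracket characters (such as the branch point '.') are ignored.
    """
    stack = []
    for ch in s:
        if ch in OPENERS:
            stack.append(ch)
            if max_depth is not None and len(stack) > max_depth:
                return False
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack
```

The gene pattern `{()【(.)(.)(.)】()}` is accepted with a depth limit of 3, which matches the "shallow nesting" noted on the slide.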
Two Subspecies of Rice • Oryza sativa ssp. indica (籼稻) • Oryza sativa ssp. japonica (粳稻) The difference was described in Xu Shen’s dictionary Shuowen Jiezi (许慎《说文解字》) of the Eastern Han Dynasty (~2nd century AD). J.H. Zhang et al., Rice cultivation of Jianhu Remains in Henan Province, Science J. (《科学》杂志), 53(4), 2002, 3 (in Chinese)
Two Test Datasets for Rice Gene-Finders • The 28469 japonica full-length cDNAs (Kikuchi et al., Science 301 (18 July 2003)) • Select a high-quality subset without overlaps with publicly available cDNAs • A single-gene set: 500 sequences with one gene in each • A multi-gene set: 46 sequences with 199 genes in total (at least 4 genes in a sequence)
Assessment of Gene-Finders Tests done between 22 July and 2 August 2003 • FgeneSH (trained on monocotyledons) • GeneMark.hmm • RiceHMM • GlimmerR • GenScan (trained on maize) • BGF (rise.genomics.org.cn/bgf/)
Our Ultimate Goal • An iterative, self-training, self-improving gene-finder “for any species”, starting from a small number of reads with or without EST and cDNA support • Annotation and re-annotation of the rice genomes • Plant comparative genomics, especially that of Gramineae and Crucifers
tRNA features • tRNA gene → pre-tRNA → mature tRNA • Mature tRNA: 75 – 95 bases • Cloverleaf-like structure • Five arms: acceptor arm, D arm, anticodon arm, V loop (extra arm), TΨC arm
How many tRNA genes are present in an organism? • Codon → tRNA → amino acid • 61 sense (amino-acid-encoding) codons • 20 amino acids • Are there 61 species of tRNA with all possible anticodons? • Met (M) has one codon but two tRNAs (an initiator and an elongator)
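The figure of 61 comes from the 64 possible triplets minus the three stop codons, and the anticodon that pairs perfectly with a codon is its reverse complement. A minimal sketch of both facts follows; the variable and function names are chosen for illustration.

```python
from itertools import product

STOP_CODONS = {"UAA", "UAG", "UGA"}
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

# All 4^3 triplets over the RNA alphabet, and those that encode an amino acid.
codons = ["".join(c) for c in product("UCAG", repeat=3)]
sense_codons = [c for c in codons if c not in STOP_CODONS]

def anticodon(codon):
    """Watson-Crick anticodon, written 5'->3' (reverse complement)."""
    return codon.translate(RNA_COMPLEMENT)[::-1]
```

With strict Watson-Crick pairing every sense codon would need its own tRNA, giving 61 anticodons; the question on the slide is whether organisms actually carry all of them.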
Wobble hypothesis (Crick, 1966) • Many tRNAs recognize more than one codon • Through non-Watson-Crick base pairings • Fewer than 61 tRNAs are needed
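Crick's original pairing rules can be written down as a small table: the first (5’) base of the anticodon may pair with more than one base at the third position of the codon, while the other two positions pair strictly. The sketch below lists the codons read by a given anticodon under those rules; the names are chosen for illustration, and the later modifications of the hypothesis are not modeled.

```python
# Crick (1966): 5' anticodon base -> codon third-position bases it can read.
# "I" stands for inosine.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

def codons_read(anticodon):
    """Codons (5'->3') recognized by an anticodon written 5'->3'."""
    first, second, third = anticodon
    # Pairing is antiparallel: anticodon base 3 faces codon base 1.
    prefix = WATSON_CRICK[third] + WATSON_CRICK[second]
    return sorted(prefix + w for w in WOBBLE[first])
```

For instance, the anticodon GAA reads both phenylalanine codons, UUC and UUU, so a separate tRNA with anticodon AAA is not required.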
The Modified Wobble Hypothesis (Guthrie & Abelson 1982) • In eukaryotes, 46 different tRNA species would be enough. • The modified wobble hypothesis holds almost perfectly in H. sapiens, S. cerevisiae, A. thaliana, and C. elegans, whose complete tRNA sets are now known.
tRNA copies in Arabidopsis (A), C. elegans (C), and Human (H)
Column groups, repeated four times per row: aa, codon, A C H, anti (anticodon). Each row gives four codon/anticodon pairs, then the A C H counts.
1
UUU AAA UCU AGA UAU AUA UGU ACA 0 0 0 37 14 10 0 0 0 0 0 1
UUC GAA UCC GGA UAC GUA UGC GCA 16 16 14 0 0 76 19 11 15 13 30
UUA UAA UCA UGA UAA UUA UGA UCA 6 5 8 9 7 5 0 0 1 0 0 0
UUG CAA UCG CGA UAG CUA UGG CCA 10 7 6 4 5 4 0 0 1 14 11 7
CUU AAG CCU AGG CAU AUG CGU ACG 11 18 13 16 6 11 0 0 0 9 18 9 1 1
CUC GAG CCC GGG CAC GUG CGC GCG 0 0 0 0 0 10 17 12 0 0
CUA UAG CCA UGG CAA UUG CGA UCG 10 3 2 39 34 10 8 18 11 6 10 7
CUG CAG CCG CGG CAG CUG CGG CCG 3 5 6 5 3 4 9 7 21 4 3 5 1
AUU AAU ACU AGU AAU AUU AGU ACU 20 19 13 10 17 8 0 0 0 0 0 1
AUC GAU ACC GGU AAC GUU AGC GCU 0 0 0 0 0 16 20 33 13 9 7
AUA UAU ACA UGU AAA UUU AGA UCU 5 8 5 8 11 10 13 16 16 9 7 5
AUG CAU ACG CGU AAG CUU AGG CCU 23 20 17 6 7 7 18 33 22 8 3 4 1
GUU AAC GCU AGC GAU AUC GGU ACC 15 19 20 16 21 25 0 0 0 0 0
GUC GAC GCC GGC GAC GUC GGC GCC 0 0 0 0 0 0 23 22 10 23 14 11
GUA UAC GCA UGC GAA UUC GGA UCC 7 6 5 10 10 10 12 17 14 12 33 5
GUG CAC GCG CGC GAG CUC GGG CCC 8 5 19 7 4 5 13 20 8 5 3 8
Amino acids (one-letter codes; * = stop): F C Y S * * * W L H R P Q I N S T K R M D V A G E
tRNA Genes in the Rice Genome (Found by tRNAscan-SE + BLASTN)
Chloroplast tRNA genes in ssp. indica and japonica • 33 tRNA genes were found in each of the indica and japonica chloroplast genomes. • They are completely identical; no mutation was found (E. C. Kemmerer and Ray Wu found two tRNA genes perfectly conserved). • It is remarkable that, in spite of more than 9000 years of separation, no mutation could be observed in the chloroplast tRNA genes of the two subspecies.
The End Thank you!