560 likes | 666 Views
”Gene Finding in Eukaryotic Genomes” PhD course #27803 Spring 2003. Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark nikob @cbs.dtu.dk. Human Genome Published HUGO: Nature, 15.feb.2001 Celera: Science, 16.feb.2001.
E N D
”Gene Finding in Eukaryotic Genomes” PhD course #27803 Spring 2003 Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark nikob@cbs.dtu.dk
Human Genome Published HUGO: Nature, 15.feb.2001 Celera: Science, 16.feb.2001
We Have the Human Genome Sequence...now what? • So, what is the problem? • Well... • We don’t know how many genes there are! • We don’t know where they are! • We don’t know what they do!
The cellular machinery recognize genes without access to GenBank, SwissProt or computers – can we?
Needles in Haystacks... • Only 2% of human genome is coding regions • Intron-exon structure of genes • Large introns (average 3365 bp ) • Small exons (average 145 bp) • Long genes (average 27 kb)
AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTCCAGCTATTGTACTGTTTCTCTGCTTTTAATTTATTTTTATTTATTTATTTATTTATTTATTTATTTATTTTTGAGATGGAGCTTCACTCTTGTTGCCCAGGCTGGAGCGCAATGGCGCGATCTCAGCTCACCGCAACCTCTACTTCCCGAATTCAAGTGATTGTCCTGCCTCAGCCTCCCGAGTAGCCGGGATTACAGGCATGCGCCACCACGCCTGGCTAATTTTGTACTTTTAGTAGAGACGGGGTTTCTCCATGTTGCTCAGCCTGGTCACAAACTCCCGATCTCAGGTGATCTGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCACGCCCCACCGTCTCTGTTCTCTTTTAAAGCACAATCCCTCAACACAAGTGTCTATACTCAGCGTCTCCACTTTCCCTCCATCTGGTCTTCCCAGTGCCCCCTTGTCAGGTTTTCACCCCATGCTCCTCCAGGGCTAGTCTGCTCTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTGACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTACCTCAGTGAAAAGCTTTACTGGTTCCTCCACATCTCCCAGACCTCCAGTAATAACAGGAATGTACCATGCCATTGCTCTCTCTCTCTCCTTTTTTTTTTTTTTTTTTTTTTTTTGTTGAGACAGAGTCTCAATTTTATCACCCAGACTGAAGCACAATGGCATGATCATAGCTCATTGCAGTCTCGAACTCGTGGGCTCAAGCAATCCTCCCACCTCAGCCTCCTGAATAGCTGGGACTACAAGCAACACCACCATGCCCAGCTAACTTTCTATTTTTTATTTTTATTTTTTGTAGAGATGAGGTTTTACTATGTTGCCTAGGCTAGTCTTGAACTCCTGGGCCCAAATGATCCTCCCACCTTGGTCTCCCAAAGTGCTGGGATTATAGGCGTGAGCCACCGTGTCCAACTTCTCTTTCTTAATGGAATTTAGGCAAAAGTTATTACTCATGGCCTTGGAATGCTCTTTCCTCAGATAGCCACATGGCTCACCATTACTTCCTTCCAGCTTTCTTCAAAGATCCACTTCTCAGTGAAGCTTTGTCCTGACCACCCAGCTGAAAATTGCAATCCTCTTCTGTCTACCATGTACATACTCTCTATTTGCTTTCCTTCCTTTATTTCTCTCTGTAGGTGTGACCTAACATAACATATAATTTACTTCTGTACCTTGTTTGCTTTCTGTCTTCCCCTTTAGAACATAAGCTCCATGAGGGAAGGCGTTTTTGCCTGCTTTAGTCACTTTATCTCCAGCAACTACAACTATATGTATATATACACACACATATATATACACACACATATATATACACACACATATATATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAGAAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTCCAGCTATTGTACTGTTTCTCTGCTTTTAATTTATTTTTATTTATTTATTTATTTATTTATTTATTTATTTTTGAGATGGAGCTTCACTCTTGTTGCCCAGGCTGGAGCGCAATGGCGCGATCTCAGCTCACCGCAACCTCTACTTCCCGAATTCAAGTGATTGTCCTGCCTCAGCCTCCCGAGTAGCCGGGATTACAGGCATGCGCCACCACGCCTGGCTAATTTTGTACTTTTAGTAGAGACGGGGTTTCTCCATGTTGCTCAGCCTGGTCACAAACTCCCGATCTCAGGTGATCTGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCACGCCCCACCGTCTCTGTTCTCTTTTAAAGCACAATCCCTCAACACAAGTGTCTATACTCAGCGTCTCCACTTTCCCTCCATCTGGTCTTCCCAGTGCCCCCTTGTCAGGTTTTCACCCCATGCTCCTCCAGGGCTAGTCTGCTCTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTGACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTACCTCAGTGAAAAGCTTTACTGGTTCCTCCACATCTCCCAGACCTCCAGTAATAACAGGAATGTACCATGCCATTGCTCTCTCTCTCTCCTTTTTTTTTTTTTTTTTTTTTTTTTGTTGAGACAGAGTCTCAATTTTATCACCCAGACTGAAGCACAATGGCATGATCATAGCTCATTGCAGTCTCGAACTCGTGGGCTCAAGCAATCCTCCCACCTCAGCCTCCTGAATAGCTGGGACTACAAGCAACACCACCATGCCCAGCTAACTTTCTATTTTTTATTTTTATTTTTTGTAGAGATGAGGTTTTACTATGTTGCCTAGGCTAGTCTTGAACTCCTGGGCCCAAATGATCCTCCCACCTTGGTCTCCCAAAGTGCTGGGATTATAGGCGTGAGCCACCGTGTCCAACTTCTCTTTCTTAATGGAATTTAGGCAAAAGTTATTACTCATGGCCTTGGAATGCTCTTTCCTCAGATAGCCACATGGCTCACCATTACTTCCTTCCAGCTTTCTTCAAAGATCCACTTCTCAGTGAAGCTTTGTCCTGACCACCCAGCTGAAAATTGCAATCCTCTTCTGTCTACCATGTACATACTCTCTATTTGCTTTCCTTCCTTTATTTCTCTCTGTAGGTGTGACCTAACATAACATATAATTTACTTCTGTACCTTGTTTGCTTTCTGTCTTCCCCTTTAGAACATAAGCTCCATGAGGGAAGGCGTTTTTGCCTGCTTTAGTCACTTTATCTCCAGCAACTACAACTATATGTATATATACACACACATATATATACACACACATATATATACACACACATATATATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG
AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTCCAGCTATTGTACTGTTTCTCTGCTTTTAATTTATTTTTATTTATTTATTTATTTATTTATTTATTTATTTTTGAGATGGAGCTTCACTCTTGTTGCCCAGGCTGGAGCGCAATGGCGCGATCTCAGCTCACCGCAACCTCTACTTCCCGAATTCAAGTGATTGTCCTGCCTCAGCCTCCCGAGTAGCCGGGATTACAGGCATGCGCCACCACGCCTGGCTAATTTTGTACTTTTAGTAGAGACGGGGTTTCTCCATGTTGCTCAGCCTGGTCACAAACTCCCGATCTCAGGTGATCTGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCACGCCCCACCGTCTCTGTTCTCTTTTAAAGCACAATCCCTCAACACAAGTGTCTATACTCAGCGTCTCCACTTTCCCTCCATCTGGTCTTCCCAGTGCCCCCTTGTCAGGTTTTCACCCCATGCTCCTCCAGGGCTAGTCTGCTCTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTGACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTACCTCAGTGAAAAGCTTTACTGGTTCCTCCACATCTCCCAGACCTCCAGTAATAACAGGAATGTACCATGCCATTGCTCTCTCTCTCTCCTTTTTTTTTTTTTTTTTTTTTTTTTGTTGAGACAGAGTCTCAATTTTATCACCCAGACTGAAGCACAATGGCATGATCATAGCTCATTGCAGTCTCGAACTCGTGGGCTCAAGCAATCCTCCCACCTCAGCCTCCTGAATAGCTGGGACTACAAGCAACACCACCATGCCCAGCTAACTTTCTATTTTTTATTTTTATTTTTTGTAGAGATGAGGTTTTACTATGTTGCCTAGGCTAGTCTTGAACTCCTGGGCCCAAATGATCCTCCCACCTTGGTCTCCCAAAGTGCTGGGATTATAGGCGTGAGCCACCGTGTCCAACTTCTCTTTCTTAATGGAATTTAGGCAAAAGTTATTACTCATGGCCTTGGAATGCTCTTTCCTCAGATAGCCACATGGCTCACCATTACTTCCTTCCAGCTTTCTTCAAAGATCCACTTCTCAGTGAAGCTTTGTCCTGACCACCCAGCTGAAAATTGCAATCCTCTTCTGTCTACCATGTACATACTCTCTATTTGCTTTCCTTCCTTTATTTCTCTCTGTAGGTGTGACCTAACATAACATATAATTTACTTCTGTACCTTGTTTGCTTTCTGTCTTCCCCTTTAGAACATAAGCTCCATGAGGGAAGGCGTTTTTGCCTGCTTTAGTCACTTTATCTCCAGCAACTACAACTATATGTATATATACACACACATATATATACACACACATATATATACACACACATATATATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAGAAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTCCAGCTATTGTACTGTTTCTCTGCTTTTAATTTATTTTTATTTATTTATTTATTTATTTATTTATTTATTTTTGAGATGGAGCTTCACTCTTGTTGCCCAGGCTGGAGCGCAATGGCGCGATCTCAGCTCACCGCAACCTCTACTTCCCGAATTCAAGTGATTGTCCTGCCTCAGCCTCCCGAGTAGCCGGGATTACAGGCATGCGCCACCACGCCTGGCTAATTTTGTACTTTTAGTAGAGACGGGGTTTCTCCATGTTGCTCAGCCTGGTCACAAACTCCCGATCTCAGGTGATCTGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCACGCCCCACCGTCTCTGTTCTCTTTTAAAGCACAATCCCTCAACACAAGTGTCTATACTCAGCGTCTCCACTTTCCCTCCATCTGGTCTTCCCAGTGCCCCCTTGTCAGGTTTTCACCCCATGCTCCTCCAGGGCTAGTCTGCTCTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTGACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTACCTCAGTGAAAAGCTTTACTGGTTCCTCCACATCTCCCAGACCTCCAGTAATAACAGGAATGTACCATGCCATTGCTCTCTCTCTCTCCTTTTTTTTTTTTTTTTTTTTTTTTTGTTGAGACAGAGTCTCAATTTTATCACCCAGACTGAAGCACAATGGCATGATCATAGCTCATTGCAGTCTCGAACTCGTGGGCTCAAGCAATCCTCCCACCTCAGCCTCCTGAATAGCTGGGACTACAAGCAACACCACCATGCCCAGCTAACTTTCTATTTTTTATTTTTATTTTTTGTAGAGATGAGGTTTTACTATGTTGCCTAGGCTAGTCTTGAACTCCTGGGCCCAAATGATCCTCCCACCTTGGTCTCCCAAAGTGCTGGGATTATAGGCGTGAGCCACCGTGTCCAACTTCTCTTTCTTAATGGAATTTAGGCAAAAGTTATTACTCATGGCCTTGGAATGCTCTTTCCTCAGATAGCCACATGGCTCACCATTACTTCCTTCCAGCTTTCTTCAAAGATCCACTTCTCAGTGAAGCTTTGTCCTGACCACCCAGCTGAAAATTGCAATCCTCTTCTGTCTACCATGTACATACTCTCTATTTGCTTTCCTTCCTTTATTTCTCTCTGTAGGTGTGACCTAACATAACATATAATTTACTTCTGTACCTTGTTTGCTTTCTGTCTTCCCCTTTAGAACATAAGCTCCATGAGGGAAGGCGTTTTTGCCTGCTTTAGTCACTTTATCTCCAGCAACTACAACTATATGTATATATACACACACATATATATACACACACATATATATACACACACATATATATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG
Gene Features • Codon frequency/bias • Organism dependent • Hexamer statistics • Transcriptional • Promoters/enhancers • Exon/introns • Length distributions • ORFs • Splicing • Donor/acceptor sites • Branchpoints • Translational • Ribosome binding sites
Codon Bias • Gene Finders are often organism specific • Coding regions often modelled by 5th order Markov chain (hexamers/di-codons)
Gene Finding Challenges • Need the correct reading frame • Introns can interrupt an exon in mid-codon • There is no hard and fast rule for identifying donor and acceptor splice sites • Signals are very weak
Overpredicting Genes • Easy to predict all exons • Report all sequences flanked by ..AG and GT.. as exons • Sensitivity = 100% • Specificity ~ 0%
Sensor-based methods • Similarity searches misses some/many genes • cDNA/EST libraries are not perfect • Ab initio Gene Finders • HMM-based • GenScan • HMMgene • Neural network-based • GRAIL • NetGene2 (splice sites)
Gene Prediction • ”Isolated” methods • Predict individual features • E.g. splice sites, coding regions • NetGene (Neural network) • http://www.cbs.dtu.dk/services/NetGene2/ • ”Integrated” methods • Predict genes in context • ”Grammar” of genes • Certain elements in specific order are required • HMMgene http://www.cbs.dtu.dk/services/HMMgene/ • GenScan (HMM-based) http://genes.mit.edu/GENSCAN.html
Gene Grammar Isolated features HAPPYEUGENEAWASGUYFINDER
Gene Grammar Isolated features HAPPYEUGENEAWASGUYFINDER Intron 3’UTR Exon Promoter Exon RBS
Gene Grammar Integrated features EUGENEFINDERWASAHAPPYGUY HAPPYEUGENEAWASGUYFINDER
Gene Grammar Integrated features EUGENEFINDERWASAHAPPYGUY PromRBSExonIntronExon3’UTR
Gene Grammar ”Isolated” methods (e.g.NN): HAPPYEUGENEAWASGUYFINDER ”Integrated” methods (e.g.HMM): EUGENEFINDERWASAHAPPYGUY
HMMs for genefinding • GenScan principle • E=exon • I=intron • F=5’ UTR • T=3’ UTR • P=promoter • N=intergenic
HMMgene http://www.cbs.dtu.dk/services/HMMgene/ • Columns • Sequence identifier • Program name • Prediction (see table below for the meaning). • Beginning • End • Score between 0 and 1 • Strand: $+$ for direct and $-$ for complementary • Frame (for exons it is the position of the donor in the frame) • Group to which prediction belong. If several CDS's are found they will be called cds_1, cds_2, etc. `bestparse:' is there because alternative predictions will also be available (see below). NameMeaning firstex The coding part of the first coding exon starting with the first base of the start codon. exon_N The N'th predicted internal coding exon. lastex The coding part of the last coding exon ending with the last base of the stop codon. singleex The coding part of an exon in a gene with only one coding exon. CDS Coding region composed of the exon predictions prior to this line.
Defining the term ’exon’ • Gene Prediction programs often use • Exon = CDS (coding sequence) • Real exons may contain 5’ or 3’ UTRs (untranslated regions)
NIX – Visualizing Gene Predictions http://www.hgmp.mrc.ac.uk/NIX/
Repeatmasker • Repetitive sequences in human/eukaryotic genomes are a problem • Run gene predictions on large genomic regions before and after masking of repetitive sequence: • http://ftp.genome.washington.edu/cgi-bin/RepeatMasker • Up to 45% of human genomic sequence derived from transposable/repetitive elements
Future Challenges • Bootstrapping: prediction improves as more genes become known • ’Extreme’ genes (long/short) still difficult • Initial and terminal exons are predicted with lower confidence • Combine with Sequence Similarity Matches • Non-coding RNAs • Most gene prediction programs only predict protein-coding genes • tRNA and rRNA genes are not predicted • Prokaryotic gene finding • Much easier (no introns), but still not perfect • Especially short genes (<300 bp) difficult
Gene Prediction • Take home messages • Human genome sequence is known • Number of human genes is unknown! • Before 2001: est.30,000-140,000 • Anno 2003: 30,000-40,000 • Location, structure and function of many human genes is unknown! • Genes may be discovered by different means and methods • ...
Gene Prediction • Take home messages • Genes may be predicted by computer programs • Masking of repetitive sequences may be required for large genomic sequences • ’Unusual’ genes are difficult (high GC%, short or terminal exons) • HMM-based gene prediction programs are suitable for “Gene Grammar” • Prediction methods are not perfect!
Gene Prediction Exercises I. Gene Finding in Prokaryotic Sequence II. Gene Finding in Eukaryotic Sequence Exercises at: http://www.cbs.dtu.dk/phdcourse/programme.html http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/pro.html http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/euk.html
Gene Prediction Exercise http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/exercises/gf_exercise_solution.html