[Slide: several kilobases of raw, unannotated genomic DNA sequence]
The different strategies to build the structure of genes
• experimental
• predictive
  • extrinsic / comparative
  • intrinsic / ab initio
Methods to localize genes on genome sequences
• The experimental approach
identify and clone the cognate transcripts (as cDNA), sequence them, and compare cDNA and gDNA
it is the ONLY secure method!
The experimental approach
Even this method has its bottlenecks:
• cDNAs are rarely full length
• there are often alternative transcripts, but only one or a few are cloned or considered for analysis
• the nucleic acid sequence does not provide experimental information on the translation product(s)
A minimum of bioinformatics is needed:
• cDNA and gDNA sequence comparison
• exact localization of splice sites at intron-exon borders: NNNag/Gtaagt……AG/gtNNN
This requires specific software for high throughput, e.g. Sim4
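As an illustration of the splice-site check mentioned above, here is a minimal Python sketch. It assumes the cDNA has already been aligned to the gDNA (for instance by a tool such as Sim4) and that the exon blocks are available as 1-based, inclusive coordinates on the gDNA. The function name `intron_borders` and the toy sequence are hypothetical; this is not Sim4's algorithm, only a check of the canonical GT...AG dinucleotides at the deduced intron borders.

```python
def intron_borders(gdna, exons):
    """Report the dinucleotides at each intron border.

    gdna  : genomic sequence (string)
    exons : sorted list of (start, end) exon blocks, 1-based inclusive,
            as deduced from a cDNA/gDNA alignment (assumed input format)
    Returns one tuple per intron:
    (intron_start, intron_end, donor, acceptor, is_canonical)
    """
    report = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        # Intron spans positions end1+1 .. start2-1 (1-based inclusive)
        intron = gdna[end1:start2 - 1].upper()
        donor, acceptor = intron[:2], intron[-2:]
        canonical = donor == "GT" and acceptor == "AG"
        report.append((end1 + 1, start2 - 1, donor, acceptor, canonical))
    return report


# Toy example: two exons separated by one intron (hypothetical sequence)
toy_gdna = "ATGGCC" + "GTAAGTTTTTCAG" + "GGATAA"
toy_exons = [(1, 6), (20, 25)]
toy_report = intron_borders(toy_gdna, toy_exons)
```

Introns whose borders are not GT...AG are not necessarily wrong (non-canonical introns exist), but they are worth inspecting, since they often indicate a misplaced exon boundary in the alignment.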
Methods to localize genes on genome sequences
• Predictive Methods
the extrinsic (comparative) method
Methods to localize genes on genome sequences
• Predictive Methods: the extrinsic method
search for similarities in protein and nucleic acid sequence databases
rationale: many genes and proteins are already documented; the genomic DNA may contain one of them, or at least a close or distant homologue
Predictive Methods: the extrinsic method
protein databases
• due to a richer alphabet (20 amino acids compared to 4 nucleotides), protein sequence databases are the most efficient and the most informative
• in the best case, a hit in a database search indicates:
  • the existence of a gene
  • the complete exon-intron structure of this gene
  • the function this gene codes for
Multiple alignment, instead of one-to-one comparison, makes it possible to find outliers among database homologues [e.g. partial sequences] or to point to peculiarities of the gene product that is the object of the search: here, the N-terminal extension indicates organelle subcellular localization
Predictive Methods: the extrinsic method
limits & bottlenecks
• closely homologous sequences need to be present in databases: orphan and fast-evolving genes are typically not found this way
• partial and wrong sequences cause problems
• this approach identifies and gives the structure for a fraction of the genes in a complete genome (e.g. 40%) and incomplete information for another fraction (e.g. 20%)
Predictive Methods: the extrinsic method
flaws & bottlenecks
• protein searches rely on correct gene annotation in databases
• does a given database hit refer to an experimentally documented entity or to a virtual one?
• how to track the source of information and validate the features given in databases?
Predictive Methods: the extrinsic method
gDNA versus mRNAs
The EST case: what are they, really?
Expressed Sequence Tags:
• obtained from mRNA isolated from a given organ
• cloned as cDNA in large libraries
• sequenced from one end (often 3’) in a single pass, as far as possible (100-800 bp)
Predictive Methods: the extrinsic method
EST pros & cons
+
• the closest to the experimental method
• no assumption needed
• alternative transcripts are often found this way
-
• poor quality of EST sequences (error rate > 1%)
• unequal coverage, depending on gene expression level
• partial sequences (though they may be assembled)
• directional: 3’ (and 5’) exons are best covered
• many ESTs are needed for correct annotation: > 10^6 for human
Predictive Methods: the extrinsic method
gDNA versus gDNA
The “Conserved Exon” Method: comparison of non-documented genomic DNA with another non-documented gDNA
Rationale: since coding sequences are more conserved in evolution, (coding) exons should appear more similar to each other than introns and intergenic regions
No need for transcript or protein data.
Applies well to comparisons between genomes of closely related species, e.g. mouse-human…
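The rationale of the conserved exon method can be sketched in a few lines of Python. The sketch assumes that the two genomic sequences have already been aligned (equal-length strings, with "-" for gaps) and simply reports the windows whose percent identity exceeds a threshold. The function names, window size, step and threshold are all hypothetical choices for illustration, not those of any published tool.

```python
def window_identity(aln_a, aln_b, window=30, step=10):
    """Fraction of identical, non-gap columns in sliding windows
    of a pairwise alignment (two equal-length gapped strings)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    scores = []
    for start in range(0, len(aln_a) - window + 1, step):
        cols = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in cols if x == y and x != "-")
        scores.append((start, matches / window))
    return scores


def conserved_windows(aln_a, aln_b, window=30, step=10, threshold=0.8):
    """Start positions (0-based, alignment coordinates) of windows whose
    identity is at least `threshold`: candidate conserved exons."""
    return [start
            for start, identity in window_identity(aln_a, aln_b, window, step)
            if identity >= threshold]


# Toy alignment: a conserved block followed by a diverged block
toy_a = "ACGTACGTAC" * 3 + "A" * 30
toy_b = "ACGTACGTAC" * 3 + "C" * 30
toy_hits = conserved_windows(toy_a, toy_b)
```

In practice conserved non-coding elements also score highly, so such windows are candidates to be confirmed by other evidence rather than exons in themselves.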
Methods to localize genes on genome sequences
• Predictive Methods
the intrinsic (ab initio) method
Intrinsic Gene Prediction
• Not every DNA sequence is a gene
• Sequences of genes have specific features, which are often linked to the expression of these genes:
• this applies to properties of sequences as a whole
  • coding sequences: 3-bp periodicity, codon usage, GC content
• or to local signals
  • translation start and stop, splice sites, polyA site, TATA box, promoter cis-acting motifs...
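The global properties listed above are easy to measure. The following Python sketch computes the GC content, the codon usage, and the GC content at each of the three codon positions of a candidate coding sequence; a marked difference between the three positions is one simple manifestation of 3-bp periodicity. Function names are hypothetical, and the input is assumed to be a non-empty, in-frame sequence.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G or C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds):
    """Count each codon of an in-frame coding sequence."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def gc_by_codon_position(cds):
    """GC content at codon positions 1, 2 and 3 (incomplete last codon ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [gc_content(cds[pos:usable:3]) for pos in range(3)]


toy_cds = "ATGGCGTAA"  # hypothetical toy sequence: ATG GCG TAA
toy_gc = gc_content(toy_cds)
toy_usage = codon_usage(toy_cds)
toy_gc123 = gc_by_codon_position(toy_cds)
```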
Intrinsic Gene Prediction
The case of prokaryotic (bacterial) genomes:
• genes do not contain introns and are generally close to each other
• the task then consists essentially in finding potential protein coding sequences (CDS)
Intrinsic Gene Prediction
Finding Protein Coding Sequences
• search for n-mers (hexamers)
• 3-periodic Markov models (GeneMark, Glimmer)
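To make the hexamer idea concrete, here is a minimal Python sketch of a log-odds hexamer score. It is a deliberate simplification and not the model used by GeneMark or Glimmer: in-frame hexamer frequencies are learned from known coding sequences, background frequencies from non-coding sequences, and a candidate sequence is scored by summing log-ratios over its in-frame hexamers. All names, the pseudocount and the toy training sets are assumptions made for illustration.

```python
import math
from collections import Counter


def hexamer_model(seqs, step, pseudo=1.0):
    """Return a function giving the smoothed frequency of a hexamer.

    step=3 samples in-frame hexamers (coding training set),
    step=1 samples every position (background training set).
    """
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    total = sum(counts.values()) + pseudo * 4 ** 6
    return lambda hexamer: (counts[hexamer] + pseudo) / total


def coding_score(seq, p_coding, p_background):
    """Sum of log-odds over the in-frame hexamers of seq (frame 1).
    Positive values favour the coding model."""
    seq = seq.upper()
    return sum(
        math.log(p_coding(seq[i:i + 6]) / p_background(seq[i:i + 6]))
        for i in range(0, len(seq) - 5, 3)
    )


# Toy training sets (hypothetical, far too small for real use)
p_cod = hexamer_model(["ATGGCGGCGGCGGCGTAA"], step=3)
p_bg = hexamer_model(["ATATATATATATATATAT"], step=1)
score_coding_like = coding_score("GCGGCGGCG", p_cod, p_bg)
score_background_like = coding_score("ATATATATA", p_cod, p_bg)
```

Scoring the same sequence in each of the six frames and comparing the results is the simplest way to use such a score; real tools train on thousands of genes and use richer models.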
Why is this frame coding, and not any of the other 5?
[Figure: the six reading frames of a sequence, numbered 1 to 6]
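A first, naive answer to this question is to look for the longest open reading frame in each of the six frames. The Python sketch below does this, assuming a plain A/C/G/T sequence and the standard stop codons, and defining an ORF as running from an ATG to the first in-frame stop. The frame numbering (1-3 on the forward strand, 4-6 on the reverse strand) and the function names are conventions chosen for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_orf_per_frame(seq):
    """Length (in nt, stop included) of the longest ATG..stop ORF
    in each of the six reading frames."""
    result = {}
    for strand, s in (("+", seq.upper()), ("-", revcomp(seq))):
        for offset in range(3):
            best, start = 0, None
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a new ORF at the first ATG
                elif start is not None and codon in STOPS:
                    best = max(best, i + 3 - start)
                    start = None  # close the ORF at the stop codon
            frame = offset + 1 if strand == "+" else offset + 4
            result[frame] = best
    return result


toy_seq = "ATGAAACCCGGGTTTTAA"  # hypothetical toy sequence
toy_orfs = longest_orf_per_frame(toy_seq)
```

ORF length alone cannot discriminate short genes from chance ORFs, which is why content statistics such as hexamer or Markov model scores are used in addition.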
Intrinsic Gene Prediction
The case of eukaryotic genomes:
• genes quite often contain introns, which may sometimes be numerous and/or large (example)
• the space between genes (intergenic regions) may be large and may contain transposons and repeats
[Figure: structure of a eukaryotic gene and of its transcript.
The gene: transcription start site; 5’UTR exon and 5’UTR intron (non coding); start exon with the ATG (translation initiation); internal exons and internal introns (coding); stop exon with the stop codon; 3’UTR intron and 3’UTR exon (non coding).
The transcript: CAP, 5’UTR, coding sequence (CDS) from ATG to stop, 3’UTR, AAAAAAA.]
Intrinsic Gene Prediction
• relies on combinatorial, statistical and/or A.I. methods
• may integrate several individual sensors
• needs training sets of documented genes
Intrinsic Gene Prediction
Is not universal!
Each (group of) species has its own genome “style”. Therefore:
• each method has to be trained and even adapted for a given genome, and needs a species-specific gene set for this purpose
• the performance of a given algorithm or integrated software may vary a lot from one species to another...
EUGENE as an example of an integrated gene prediction and modeling platform
EuGene, a Black Box?
[Figure: EuGene data flow.
Input: genome sequence.
Intrinsic modules: content potential for coding, intron and intergenic (Poplar IMM); splice sites; start ATG (translation start site prediction).
Extrinsic modules, tools: Blastn, TBlastx, RepeatMasker, SpliceMachine. Data sources: Arabidopsis genome, Poplar RepBase, Poplar cDNA & EST.
All modules feed the EuGene DAG.
Output: gene models, e.g. join(9265..9395,9749..99342) and complement(join(10164..10295,10349..10420,10467..10514,10566..10626,10681..10770,10823..10949,11001)).]
[Figure: EuGene data flow, extended with protein evidence.
Input: genome sequence.
Intrinsic modules: content potential for coding, intron and intergenic (Poplar IMM); splice sites; start ATG (translation start site prediction).
Extrinsic modules, tools: TBlastx, Blastn, Blastx, RepeatMasker, SpliceMachine. Data sources: Arabidopsis genome, Poplar RepBase, Poplar cDNA & EST, and protein sets (Poplar proteins, Arabidopsis FLcDNA supported proteins, other At proteins, other plant proteins, SwissProt, PIR).
All modules feed the EuGene DAG.
Output: gene models, e.g. join(9265..9395,9749..99342) and complement(join(10164..10295,10349..10420,10467..10514,10566..10626,10681..10770,10823..10949,11001)).]
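The gene models in the figures above are written as feature locations of the join(...) / complement(join(...)) kind. As a small illustration of how such output can be turned into exon coordinates, here is a Python sketch. It handles only the forms seen above (ranges, single positions, join, complement, and a trailing period); the function name `parse_location` is hypothetical, and this is not a full parser for feature-table locations.

```python
import re


def parse_location(location):
    """Parse a simple join/complement location string.

    Returns (strand, exons) where strand is "+" or "-" and exons is a
    list of (start, end) tuples, 1-based inclusive, in the order given.
    A single position such as "11001" is returned as (11001, 11001).
    """
    loc = location.strip().rstrip(".")
    strand = "+"
    match = re.fullmatch(r"complement\((.*)\)", loc)
    if match:
        strand, loc = "-", match.group(1)
    match = re.fullmatch(r"join\((.*)\)", loc)
    if match:
        loc = match.group(1)
    exons = []
    for part in loc.split(","):
        bounds = [int(x) for x in part.split("..")]
        exons.append((bounds[0], bounds[-1]))
    return strand, exons


forward_model = parse_location("join(9265..9395,9749..99342).")
reverse_model = parse_location(
    "complement(join(10164..10295,10349..10420,10467..10514,"
    "10566..10626,10681..10770,10823..10949,11001))"
)
```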
EuGene self-training of intrinsic modules
[Figure: self-training workflow. Labels:
• TBlastN against Arabidopsis full-length proteins; discard cDNAs giving no hit
• training set of poplar cDNAs mapped on the genome sequence
• let EuGene make predictions based on extrinsic data
• select predicted genes covered by FL cDNA
Extrinsic modules, tools: Blastx, Blastn, RepeatMasker, SpliceMachine. Data sources: Arabidopsis proteins, Poplar RepBase, Poplar cDNA & EST.
Intrinsic modules: coding potential CDS (Poplar IMM); splice sites; start ATG (translation start site prediction).
All modules feed the EuGene DAG.]