Exploiting Genome Comparison for Gene Structure Prediction in Plants
Michael Brent, Ping Hu
Performance Improvement on single gene set: Explicit Intron Length Model
Performance Improvement on genome set: Explicit Intron Length Model
Splice Donor Model for both GT/AG and GC/AG introns
• GC/AG introns represent:
  • 252/33350 = 0.75% in Arabidopsis
  • 0.7% of total human pre-mRNA introns
  • ~0.6% in C. elegans (Nucleic Acids Research 30(15):3360-3368)
  • 27/2034 = 1.3% in crypto
• The old model cannot predict GC/AG introns
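The percentages above follow directly from the intron counts on the slide. A minimal sketch of that arithmetic, where the function name `gc_ag_percent` is illustrative and not from the original deck:

```python
def gc_ag_percent(gc_ag_count: int, total_introns: int) -> float:
    """Return the share of GC/AG introns as a percentage of all introns."""
    if total_introns <= 0:
        raise ValueError("total_introns must be positive")
    return 100.0 * gc_ag_count / total_introns


# Counts taken from the slide; no other data is assumed.
arabidopsis = gc_ag_percent(252, 33350)
crypto = gc_ag_percent(27, 2034)

print(f"Arabidopsis: {arabidopsis:.3f}%")
print(f"crypto:      {crypto:.3f}%")
```

To three decimals the Arabidopsis ratio is about 0.756%, which the slide reports as 0.75%.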
Decision Tree Model for GT/GC Donors
Donor sites
• NNNG₁T₂NNNN | NNNG₁C₂NNNN
• NNNGTNNG₅N | NNNGTNNĞ₅N
• NNG₋₁GTNNGN | NNĞ₋₁GTNNGN
• NA₋₂GGTNNGN | NĂ₋₂GGTNNGN
• NAGGTNNGT₆ | NAGGTNNGŤ₆
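The splits listed on this slide can be read as a chain of position tests on a 9-base donor window spanning positions -3 to +6. The sketch below is one possible reading, not the authors' implementation. It assumes that each later split refines only the branch that matched the previous test, that the GC branch is not subdivided, and that an accented base means "not that base". The function name `donor_leaf` and the leaf labels are hypothetical.

```python
def donor_leaf(site: str) -> str:
    """Assign a 9-base donor window (positions -3..+6) to a tree leaf.

    Index map (assumed): 0 -> -3, 1 -> -2, 2 -> -1, 3 -> +1, 4 -> +2,
    5 -> +3, 6 -> +4, 7 -> +5, 8 -> +6.
    """
    site = site.upper()
    if len(site) != 9:
        raise ValueError("expected a 9-base window covering -3..+6")
    if site[3] != "G" or site[4] not in ("T", "C"):
        raise ValueError("not a GT or GC donor")

    # First split: GT versus GC at positions +1/+2.
    if site[4] == "C":
        return "GC donor"
    # Remaining splits, applied in the order shown on the slide.
    if site[7] != "G":
        return "GT, not G at +5"
    if site[2] != "G":
        return "GT, G at +5, not G at -1"
    if site[1] != "A":
        return "GT, G at +5, G at -1, not A at -2"
    if site[8] != "T":
        return "GT, G at +5, G at -1, A at -2, not T at +6"
    return "GT, G at +5, G at -1, A at -2, T at +6"


print(donor_leaf("AAGGTAAGT"))
print(donor_leaf("AAGGCAAGT"))
```

In a full model each leaf would carry its own position-specific scoring parameters. The sketch covers only the routing step.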
Breakdown of Arabidopsis Predictions
• Total predictions: 30634 / Total annotations: 28581
• Identical to annotation: 15063
• Not identical to annotation: 15588
  • Overlap with confirmed annotation: 3246
  • No overlap with confirmed annotation: 12342
    • No overlap with any annotation: 4394
    • Overlap with other annotation: 7948
      • Same start / same stop: 2358
      • Diff start / same stop: 2770
      • Same start / diff stop: 1879
      • Diff start / diff stop: 941
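The counts in this breakdown nest inside one another, so each level can be checked against the sum of its parts. A small sketch of that check uses only the numbers on the slide; the helper name `residual` is illustrative. The three lower levels add up exactly. At the top level, identical plus not identical gives 30651, which is 17 more than the stated total of 30634. The sketch reports that difference without guessing which figure is off.

```python
def residual(parent: int, children: list[int]) -> int:
    """Return parent minus the sum of its sub-categories (0 means consistent)."""
    return parent - sum(children)


total_predictions = 30634
identical = 15063
not_identical = 15588
overlap_confirmed = 3246
no_overlap_confirmed = 12342
no_overlap_any = 4394
overlap_other = 7948
start_stop_classes = [2358, 2770, 1879, 941]  # same/same, diff/same, same/diff, diff/diff

checks = {
    "total vs identical + not identical": residual(total_predictions, [identical, not_identical]),
    "not identical": residual(not_identical, [overlap_confirmed, no_overlap_confirmed]),
    "no overlap with confirmed": residual(no_overlap_confirmed, [no_overlap_any, overlap_other]),
    "overlap with other": residual(overlap_other, start_stop_classes),
}

for name, diff in checks.items():
    print(f"{name}: residual {diff}")
```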
First Experiment Result
[Figure: marker lane M with sizes 2000, 1650, 1000, 850, 650, 500, 400, 300, 200, 100; sample lanes 1 to 12]
Rice Annotation Data Set
• TIGR data set:
  • Most annotations were based on FGENESH
  • Manually curated contigs: 3171 genes
  • May still have been influenced by FGENESH
• GenBank cDNA-confirmed data set:
  • Genes with full-length cDNA downloaded from GenBank
  • Total: 1084 mRNA and 443 DNA
  • After filtering out bad genes with an in-frame stop codon: 341 DNA left
• Limitations of this data set:
  • UTR and intergenic regions are very limited
  • All on the positive strand; small
• Other data sets are all from automatic pipelines
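The filtering step above, dropping genes with an in-frame stop codon, can be sketched as follows. This is a minimal illustration and not the authors' pipeline. It assumes each coding sequence starts in frame at its first base, uses the standard stop codons TAA, TAG and TGA, and treats the final complete codon as the legitimate terminal stop. The names `has_inframe_stop` and `filter_genes` and the toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_inframe_stop(cds: str) -> bool:
    """Return True if any codon before the last complete codon is a stop."""
    cds = cds.upper()
    # Start index of the last complete codon, which is allowed to be a stop.
    last_codon_start = len(cds) - len(cds) % 3 - 3
    for i in range(0, last_codon_start, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return True
    return False


def filter_genes(genes: dict[str, str]) -> dict[str, str]:
    """Keep only genes whose coding sequence has no premature in-frame stop."""
    return {name: cds for name, cds in genes.items() if not has_inframe_stop(cds)}


# Toy sequences for illustration only; not taken from the rice data set.
toy_genes = {
    "good_gene": "ATGAAACCCTAA",
    "bad_gene": "ATGTGAAAATAA",
}
kept = filter_genes(toy_genes)
print(sorted(kept))
```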
Performance Improvement on Rice GenBank cDNA-Confirmed Data