Gene Prediction in Zea Mays 07/20/06
Project Summary
• Building a Training Set
• Curation and filtering
• Training, fine tuning of Twinscan
• Parameter Estimation
• Model Redesign and Performance Analysis
• Maize Twinscan 1.0
• Prediction of novel genes in Maize BACs
• Selection, contig masking
• Identification of strong novel candidates
• Prospectus
Learning
• Twinscan must be trained for each new species, much as a person learns a new language: Parameter Estimation
• Building a Training Set
• Curation
• Cleaning
• Processing
• Model Development
• Training Set Analysis
• Revision
• iParameterEstimation
GenBank
• 508 mRNA records from GenBank
• EST-Genome
• 126 were bad alignments
• Bad splice sites
• In-frame stop codons
• 382 good alignments remain
• Protein BLAST yielded 273 clusters
• No known retrotransposons found
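The two rejection criteria on this slide (bad splice sites, in-frame stop codons) can be sketched as a simple filter. This is a minimal illustration, not the project's actual curation code: the function names, the forward-strand 0-based half-open exon coordinates, the assumption that the exons cover only coding sequence, and the restriction to GT...AG introns are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_canonical_splice_sites(genome, exons):
    """Check that every intron between consecutive exons is GT...AG.

    exons: sorted list of (start, end) 0-based half-open intervals on the
    forward strand (an assumption of this sketch).
    """
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = genome[prev_end:next_start]
        if len(intron) < 4:
            return False
        if not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True


def has_inframe_stop(cds):
    """True if a stop codon occurs in frame before the final codon."""
    # range stops before the last codon so the terminal stop is not counted
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 3, 3))
    return any(codon in STOP_CODONS for codon in codons)


def is_good_alignment(genome, exons):
    """Keep an alignment only if it passes both checks."""
    cds = "".join(genome[start:end] for start, end in exons)
    return has_canonical_splice_sites(genome, exons) and not has_inframe_stop(cds)


# Toy example: two exons separated by a GT...AG intron.
toy_genome = "ATGAAA" + "GTCCCCAG" + "CCCTAA"
toy_exons = [(0, 6), (14, 20)]
print(is_good_alignment(toy_genome, toy_exons))
```

A real pipeline would also need to handle reverse-strand genes and untranslated regions, which this sketch ignores.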
Public/Private Monsanto cDNA Sequences
• EST-Genome against Zea Mays 4.0 release
• Cleaned for non-canonical splice sites
• Clustered redundant genes from GenBank
• In all, 1,257 training sequences
• 809 Public Monsanto mRNAs
• 212 Proprietary Monsanto mRNAs
• 273 GenBank mRNAs
Applying iParameterEstimation
• Several retraining iterations steadily improved the performance of Maize Twinscan:
• Donor site revamp
• Addition of more training data
• Addition of a geometric tail to the intron length distribution
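One way to picture the geometric tail: intron lengths up to some cutoff keep their observed frequencies, and the probability mass beyond the cutoff decays geometrically. The sketch below is an illustration under stated assumptions, not Twinscan's implementation. The function name, the `cutoff` parameter, the unsmoothed head counts, and the maximum-likelihood fit of the tail are all choices made for this example.

```python
from collections import Counter


def fit_intron_length_model(lengths, cutoff):
    """Return prob(length): empirical up to cutoff, geometric tail beyond it."""
    n = len(lengths)
    head_counts = Counter(l for l in lengths if l <= cutoff)
    # Tail lengths measured as distance past the cutoff: 1, 2, 3, ...
    tail = [l - cutoff for l in lengths if l > cutoff]
    tail_mass = len(tail) / n
    # Maximum-likelihood estimate for a geometric distribution on 1, 2, 3, ...
    p = len(tail) / sum(tail) if tail else 1.0

    def prob(length):
        if length <= cutoff:
            return head_counts.get(length, 0) / n
        k = length - cutoff
        return tail_mass * p * (1 - p) ** (k - 1)

    return prob


# Toy data: five short introns and two long ones (made-up values).
prob = fit_intron_length_model([80, 80, 100, 100, 120, 300, 500], cutoff=200)
print(prob(80), prob(300), prob(500))
```

The appeal of a geometric tail in an HMM-style gene finder is that a state with a self-transition already produces geometrically distributed durations, so long introns need no explicit table entry for every possible length.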
Benchmarking Performance
• 4-fold cross-validation
• Maize Twinscan 1.0: Current Best
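For readers unfamiliar with 4-fold cross-validation: the training set is split into four parts, and each part in turn is held out for testing while the model is trained on the other three. The sketch below shows only the splitting step. The function name, the fixed random seed, and the use of integer IDs in place of gene records are assumptions for illustration; the slides do not say how the folds were actually drawn.

```python
import random


def k_fold_splits(items, k=4, seed=0):
    """Yield (train, test) pairs; every item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items round-robin into k folds.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# 1257 is the training-set size quoted in the slides; the IDs are placeholders.
for train, test in k_fold_splits(range(1257), k=4):
    print(len(train), len(test))
```

Reported accuracy would then be combined over the four held-out folds, so no gene is scored by a model that saw it during training.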
Novel Predictions
• Release 4 of AZM
• Collaborative effort between Wash U, CSH, Iowa State and Arizona
• Contains 65,325 contigs from 1,573 BACs
• Isolated and repeat-masked
• Danforth Center repeat library and RepBase
• Run with and without masking
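Repeat masking, in its simplest "hard" form, replaces every base inside an annotated repeat with N so that the gene finder cannot call exons there. The sketch below assumes the repeat hits are already available as 0-based half-open intervals. The function name and that interval format are hypothetical, and the slides do not say which masking tool or mode was used.

```python
def hard_mask(sequence, repeat_intervals, mask_char="N"):
    """Replace bases inside each (start, end) interval with mask_char."""
    masked = list(sequence)
    for start, end in repeat_intervals:
        # Clamp intervals that run past the ends of the contig.
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


# Toy contig with two made-up repeat hits; the second overruns the contig end.
contig = "ACGTACGTAC"
print(hard_mask(contig, [(2, 5), (8, 20)]))
```

Running the gene finder on both `contig` and its masked version, as the slide describes, shows which predictions depend on repeat sequence.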
Novel Candidates
• Filter the remaining contig-internal predictions
• BLAST against all of MaizeGDB
• Search predictions for retrotransposons from rice and Arabidopsis
• 30 rice homologs found in the remainder
• 330 remaining putative novel genes
Prospectus for the Next 6 Months
• PCR 150 of the novel candidates
• Continue improving Twinscan
• Train on RT-PCR products
• Use rice-trained Twinscan
• Make predictions on new sequence as it arrives
• Only 1,573 of an estimated 18k BACs are sequenced
• Total estimated gene count is 45k-50k
• Explore Other Cereals
• Sorghum - JGI 5x Shotgun, done later this year
• Soy - JGI, possibly finished in 2007