Performance Improvement For Plant Gene Prediction 7-9-2003 Lab Meeting Ping Hu
Rice Gene Prediction
• Problems for rice gene prediction
• Most annotations were based on FgeneSH
• Very hard to train
• Performance only ~20% GSN, even for self-training, self-testing results
• Major differences: splice site prediction
Current Improvement For Rice
• Focus on the splice sites, using the TIGR manually curated data set (may still be affected by FgeneSH), 3272 genes
• Modification of the parameters
• Change the acceptor site model from 1/2 order to 3rd order
• Revise and update the donor site, exon length and transition probabilities using the GTF module
• Double-check some default settings for parameter estimation, such as the pseudocount and smoothing window
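The slide does not name the gene finder or show its training code, so the following is only a minimal sketch of what estimating a 3rd-order Markov model with a pseudocount can look like. The function name `train_markov`, the additive-smoothing scheme and the default pseudocount of 1.0 are assumptions made for illustration, not the settings used in the talk.

```python
from collections import defaultdict
from itertools import product


def train_markov(seqs, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) with additive smoothing.

    Hypothetical helper: the smoothing scheme and defaults are assumptions.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous bases such as N.
            if set(context + base) <= set("ACGT"):
                counts[context][base] += 1

    model = {}
    for ctx in map("".join, product("ACGT", repeat=order)):
        total = sum(counts[ctx].values()) + 4 * pseudocount
        model[ctx] = {b: (counts[ctx][b] + pseudocount) / total for b in "ACGT"}
    return model


model = train_markov(["ACGTACGT"], order=3)
print(len(model), model["ACG"]["T"])
```

With a pseudocount, contexts that never occur in the training windows fall back to a uniform distribution instead of producing zero probabilities, which is why the default setting is worth double-checking when the training set is small.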
Improvement of Arabidopsis
• Use the single-gene data set to check whether the performance improvement is due to overfitting
• Total gene number: 7833
• 3rd order: 4^4 = 256 patterns (3 context bases plus the predicted base); average exon number 5.2 per gene; 7833 × 5.2 / 256 ≈ 159 observations per pattern, which should be enough
• All the changes have been tested separately by cross-validation
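The data-sufficiency estimate on this slide can be reproduced with a few lines of arithmetic. This is a sketch only: like the slide, it treats each exon as contributing one training site, and the function name `observations_per_pattern` is made up for illustration.

```python
def observations_per_pattern(n_genes, exons_per_gene, order):
    """Average number of training sites per (context, base) pattern.

    Assumes one site per exon, as the slide's estimate does.
    """
    patterns = 4 ** (order + 1)  # `order` context bases plus the predicted base
    sites = n_genes * exons_per_gene
    return patterns, sites / patterns


patterns, per_pattern = observations_per_pattern(7833, 5.2, 3)
print(patterns, round(per_pattern))  # prints: 256 159
```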
Current Release of Arabidopsis Annotation
• Total: 27178 protein-coding genes, 28581 transcripts with clear CDS annotation
• Bad start: 12
• Bad stop: 2
• Bad donor: 521
• Bad acceptor: 41
• In-frame stop: 41, 21 1-2 bp terminal exon
• Bad fasta: 510
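The slide does not state the criteria behind each category, so the sketch below assumes the canonical signals: an ATG start codon, a TAA/TAG/TGA stop codon, GT at the intron donor end and AG at the acceptor end. The function names and the 0-based, half-open, plus-strand coordinates are also assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}


def check_cds(cds):
    """Return the problems found in a spliced coding sequence (assumed criteria)."""
    cds = cds.upper()
    problems = []
    if not cds.startswith("ATG"):
        problems.append("bad start")
    # Split into complete codons; a trailing partial codon is ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    if not codons or codons[-1] not in STOPS:
        problems.append("bad stop")
    if any(c in STOPS for c in codons[:-1]):
        problems.append("in-frame stop")
    return problems


def check_introns(genome, introns):
    """Check each intron (start, end) for canonical GT...AG boundaries."""
    problems = []
    for start, end in introns:
        if genome[start:start + 2].upper() != "GT":
            problems.append(("bad donor", start))
        if genome[end - 2:end].upper() != "AG":
            problems.append(("bad acceptor", end))
    return problems


print(check_cds("ATGTAAAAATGA"))
print(check_introns("AAAGTCCCCAGTTT", [(3, 11)]))
```

Non-canonical but real splice sites (for example GC donors) would be flagged as bad under these assumptions, so counts produced this way are an upper bound on true annotation errors.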
Decide UTR Length
• Select the mRNA annotations whose start and stop differ from those of the CDS annotation
• 5’UTR number: 16258
• 5’UTR average length: 204.77
• 3’UTR number: 17788
• 3’UTR average length: 255.57
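One way to derive such numbers from an annotation is sketched below. It assumes each transcript is given as a list of exon intervals plus the CDS start and end, all in 0-based half-open coordinates on the plus strand (for minus-strand transcripts the two results swap roles). The names `utr_lengths` and `summarize` are hypothetical, and whether the slide's averages are spliced lengths or genomic spans is not stated.

```python
from statistics import mean


def utr_lengths(exons, cds_start, cds_end):
    """Spliced 5' and 3' UTR lengths for one plus-strand transcript."""
    # Exon bases lying before the CDS start.
    five = sum(max(0, min(end, cds_start) - start) for start, end in exons)
    # Exon bases lying after the CDS end.
    three = sum(max(0, end - max(start, cds_end)) for start, end in exons)
    return five, three


def summarize(lengths):
    """Count transcripts with a UTR and average over those only."""
    present = [n for n in lengths if n > 0]
    return len(present), (mean(present) if present else 0.0)


exons = [(0, 100), (200, 300), (400, 500)]
print(utr_lengths(exons, cds_start=250, cds_end=450))  # prints: (150, 50)
```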
Next Step
• Run on the new data set with the correct UTR length
• Find the differences from the current annotation, and decide on the test set for the PCR reactions
• Spare time: further improve the rice prediction