Using Gene Prediction to Guide Experiments by Summing Over Consistent Gene Models
Chaochun Wei
Advisor: Michael Brent
Genome Informatics, 5/5/2003
http://genes.cs.wustl.edu/
Outline
• Background
  • cDNA and EST
  • MGC project
  • Twinscan model
• Methods
  • Simple method
  • Summation method
• Result
• Future work
ESTs and cDNAs
[Figure: cDNA clones with their 5’ and 3’ ESTs. Labels: full-ORF cDNA, 5’-truncated ORF, ATG, 5’ EST, 3’ EST]
FACT: A 5’ EST containing the start of translation indicates that the cDNA contains a full ORF.
MGC (Mammalian Gene Collection) Project (http://mgc.nci.nih.gov)
• Goal: to find a cDNA clone containing a full ORF for each human and mouse gene.
• Pipeline:
  • cDNA library construction
  • Generate 5’ and 3’ ESTs
  • Select candidate clones with a full-length ORF by analyzing ESTs
  • Full-length sequencing
Twinscan model
• A gene prediction model developed in the Brent Lab at Washington University (Korf, I., et al., 2001)
• A generalized HMM that also uses conservation information derived from genome comparison
• One of the best gene predictors currently available (Flicek, P., 2002)
Motivation
• Develop a Twinscan-based method that ranks EST sequences by their potential to contain the start of translation.
Problem
• Input:
  • EST sequences
  • The genomic sequence
  • The Twinscan model
• Output:
  • A score for each EST sequence according to its potential to contain the start of translation
Method: Ranking ESTs
• Two methods
  1. Simple method: find the most likely gene model using Twinscan, then compare the prediction with the EST alignment.
  2. Summation method: use the EST information from the start, then sum the probabilities of all gene models consistent with the EST constraints.
[Figure: for each method, how the EST alignment and the gene prediction are combined]
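The comparison step of the simple method could be sketched as follows. This is an illustration only: the slides do not say how the prediction is compared with the alignment, so the block representation, the function name `start_codon_in_est`, and the rule "the predicted start codon must lie entirely inside one aligned block" are all assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: an EST alignment is a list of genomic blocks,
# each a (start, end) pair, 0-based and end-exclusive.
Block = Tuple[int, int]


def start_codon_in_est(predicted_atg: Optional[int], est_blocks: List[Block]) -> bool:
    """Return True if the predicted start codon lies entirely inside one aligned block.

    predicted_atg is the genomic position of the A in the ATG of the single
    most likely gene model, or None if no start codon was predicted.
    A start codon split across an intron is not handled in this sketch.
    """
    if predicted_atg is None:
        return False
    return any(start <= predicted_atg and predicted_atg + 3 <= end
               for start, end in est_blocks)


# Toy coordinates, not real data.
est_blocks = [(100, 180), (420, 500)]
print(start_codon_in_est(150, est_blocks))  # start codon inside the first block
print(start_codon_in_est(300, est_blocks))  # start codon in the unaligned gap
```

Because this check looks at one predicted model only, it gives a yes/no answer rather than a graded score, which is one reason a summation over many models can be used for ranking instead.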
Difficulties of the Problem
• The quality of the ESTs is low (5% error rate).
• EST alignment is not straightforward.
• It is not clear what the best way is to incorporate EST information into Twinscan.
Twinscan Prediction Paths and EST Alignments
[Figure: an EST alignment on the genomic sequence (5’UTR, intron, 3’UTR, intergenic) compared with Twinscan state paths through intergenic, promoter, 5’UTR, initial exon, intron, and internal exon states. Path labels: consistent path, good EST; consistent path, bad EST; inconsistent path]
Twinscan Prediction Paths and EST Alignments
[Figure: an EST alignment on the genomic sequence (5’UTR, intron, 3’UTR, intergenic) compared with Twinscan state paths through 5’UTR, initial exon, intron, exon, terminal exon, and 3’UTR states. Path labels: not strictly consistent; inconsistent path]
Forward Algorithm
• Use the forward algorithm to determine whether the start of translation falls within an aligned EST region.
• Total-probability: P(all consistent paths | EST alignment)
• No-probability: P(all consistent paths without an initiation codon in the EST region | EST alignment)
• Yes-probability: P(all consistent paths with an initiation codon in the EST region | EST alignment)
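The three quantities above can be illustrated with a constrained forward pass on a toy model. This is a minimal sketch and not Twinscan itself: it uses a plain four-state HMM rather than a generalized HMM, invented transition and emission values, a one-position start state, and the assumed consistency rule that intergenic sequence may not overlap the EST. The forward sums here are joint probabilities of the sequence with the allowed paths; the ratio Yes/Total, used as the score, is an assumption about how the values would be normalized.

```python
# Toy states: N = intergenic, U = 5'UTR, A = start of translation, C = coding.
STATES = ["N", "U", "A", "C"]
INIT = {"N": 0.5, "U": 0.5}
TRANS = {
    "N": {"N": 0.8, "U": 0.2},
    "U": {"U": 0.9, "A": 0.1},
    "A": {"C": 1.0},
    "C": {"C": 1.0},
}


def emit(state, base):
    """Toy emissions: the start state only emits 'A'; other states are uniform."""
    if state == "A":
        return 1.0 if base == "A" else 0.0
    return 0.25


def forward_sum(seq, allowed):
    """Forward algorithm restricted to paths with path[t] in allowed[t]."""
    f = {s: (INIT.get(s, 0.0) * emit(s, seq[0]) if s in allowed[0] else 0.0)
         for s in STATES}
    for t in range(1, len(seq)):
        f = {s: (sum(f[r] * TRANS[r].get(s, 0.0) for r in STATES) * emit(s, seq[t])
                 if s in allowed[t] else 0.0)
             for s in STATES}
    return sum(f.values())


def est_probabilities(seq, est_region):
    """Return (total, no, yes) for one EST covering the positions in est_region."""
    everything = set(STATES)
    # Assumed consistency rule: no intergenic state inside the aligned region.
    consistent = [everything - {"N"} if t in est_region else everything
                  for t in range(len(seq))]
    # Additionally forbid the start state inside the aligned region.
    no_start = [a - {"A"} if t in est_region else a
                for t, a in enumerate(consistent)]
    total = forward_sum(seq, consistent)
    no = forward_sum(seq, no_start)
    return total, no, total - no  # consistent paths split into "no" and "yes"


total, no, yes = est_probabilities("GCATGG", range(1, 5))
print("score =", yes / total)
```

Computing Yes as Total minus No works because every consistent path either places a start of translation inside the EST region or does not, so the two sets partition the consistent paths.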
Result
Datasets:
• Align human MGC ESTs (26991) to RefSeqs (13526), keeping the 3611 ESTs with identity > 95%
• Classify the ESTs into two classes:
  • good ESTs (2558)
  • bad ESTs (1053)
Evaluate the result by:
1. The rate of good ESTs predicted as good ESTs (true positive rate)
2. The rate of bad ESTs predicted as good ESTs (false positive rate)
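The two evaluation rates can be computed from per-EST scores and the good/bad labels as in the sketch below. The function name `rates_at_threshold`, the "score at or above the threshold means predicted good" rule, and the toy numbers are assumptions for illustration; they are not the values from the talk.

```python
from typing import List, Tuple


def rates_at_threshold(scores: List[float], is_good: List[bool],
                       threshold: float) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate).

    An EST is predicted good when its score is >= threshold.
    Assumes at least one good and one bad EST in the input.
    """
    tp = sum(1 for s, g in zip(scores, is_good) if g and s >= threshold)
    fp = sum(1 for s, g in zip(scores, is_good) if not g and s >= threshold)
    n_good = sum(is_good)
    n_bad = len(is_good) - n_good
    return tp / n_good, fp / n_bad


# Toy example: three good ESTs and two bad ESTs.
scores = [0.9, 0.8, 0.4, 0.7, 0.2]
labels = [True, True, True, False, False]
tpr, fpr = rates_at_threshold(scores, labels, threshold=0.5)
print(tpr, fpr)
```

Sweeping the threshold over the range of scores gives one (false positive rate, true positive rate) pair per threshold, which is a convenient way to compare ranking methods.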
Result: The Summation Method
[Figure: plot not recoverable; the only surviving label is 2.6]
Result: The Summation Method
Result Comparison
• The results come from different datasets (3255 RefSeqs; MGC Program Team, Dec. 2002).
Future Work
• Find the right way to use EST alignments in our method
• Tune the parameters of the method
• Extend the method to other applications
Acknowledgements
• Advisor: Dr. Michael Brent
• Members of the Brent Lab
• MGC for data
• NHGRI for funding