GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences
Authors: Michael M. Yin and Jason T. L. Wang
Source: Information Sciences, 163(1-3), pp. 201-218, 2004
Advisor: Min-Shiang Hwang
Speaker: Chun-Ta Li
Outline
• Introduction
• Related work
• The proposed approach
• Experiments and results
• Conclusion
• Comments
Introduction – 1/4
• Data mining – knowledge discovery from data
• Data mining in life sciences:
  • Finding clustering rules for gene expressions
  • Discovering classification rules for proteins
  • Detecting associations between metabolic pathways
  • Predicting genes in genomic DNA sequences
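Introduction – 2/4
• Glossary: codon (a triplet of nucleotides), intron (non-coding intervening sequence), exon (coding sequence), donor (the splice site at the 5' end of an intron)
• A genomic DNA sequence
  • Four types of nucleotides (A, C, G, T)
• The basic structure of a vertebrate gene
• A sequence fragment containing an exon of 296 nucleotides
[Figure label: coding sequences]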
Introduction – 3/4
[Figure label: coding region]
Introduction – 4/4
• A number of programs have been developed for locating gene coding regions (exons).
• Why they are insufficient:
  • The vertebrate DNA sequence signals involved in gene determination are usually ill-defined.
  • Automated interpretation of genomic data without experimental validation is still a myth.
• Motivation:
  • GeneScout: developing accurate methods for automatically detecting vertebrate genomic DNA structures.
  • Exon boundaries: start sites, donor sites and acceptor sites at splice junctions
Related work – 1/2
• NN-based techniques (neural networks)
  • Gene structure prediction
  • Training
Related work – 2/2
• HMM-based techniques (hidden Markov models)
  • Describe sequential data or processes
  • Use a number of states
  • Probabilistic state transitions
  • Example: rolling a die that is either Normal (fair) or Fake (loaded)
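The die example can be sketched in code. The following is a minimal illustration of a two-state HMM (Normal and Fake) decoded with the Viterbi algorithm; every probability below is an assumption chosen for illustration, and none of the numbers come from the slides or the paper.

```python
import math

# Hypothetical parameters for illustration only; not taken from the slides.
STATES = ("Normal", "Fake")
START = {"Normal": 0.5, "Fake": 0.5}
TRANS = {
    "Normal": {"Normal": 0.95, "Fake": 0.05},
    "Fake": {"Normal": 0.10, "Fake": 0.90},
}
EMIT = {
    "Normal": {face: 1 / 6 for face in range(1, 7)},          # fair die
    "Fake": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},  # loaded toward 6
}


def viterbi(rolls):
    """Return the most probable state path for a non-empty list of die faces."""
    # score[s] = best log-probability of any state path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []  # one dict of back-pointers per roll after the first
    for face in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][face])
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

The hidden state plays the same role here that a functional site or coding region plays in a gene-finding HMM: it is not observed directly and is inferred from the emitted symbols.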
The proposed approach – 1/4
• HMM models for predicting functional sites
  • Start Site Model
[Figure: Start Site Model; label: start codon]
The proposed approach – 2/4
• An HMM model for computing coding potentials
  • The Codon Model
    • When the first state is base T
    • and the second state is base A or G,
    • the third state can only be C or T (A and G are not defined)
  • Codons excluded by this rule: TAA, TAG, TGA, TGG (TAA, TAG and TGA are the stop codons; TGG encodes tryptophan)
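The rule on this slide can be written as a small check. This is a sketch of one reading of the slide (the restriction on the third base applies only when the first base is T and the second is A or G); the function names are hypothetical and this is not the Codon Model's actual HMM, which also assigns probabilities.

```python
def allowed_by_codon_rule(codon):
    """Return True if the codon passes the rule: T, then A or G, then only C or T."""
    codon = codon.upper()
    if len(codon) != 3 or any(base not in "ACGT" for base in codon):
        raise ValueError(f"not a DNA codon: {codon!r}")
    first, second, third = codon
    if first == "T" and second in "AG":
        # Excludes TAA, TAG, TGA (stop codons) and also TGG.
        return third in "CT"
    return True


def first_excluded_codon(sequence, frame=0):
    """Return the index of the first in-frame codon the rule excludes, or -1."""
    for i in range(frame, len(sequence) - 2, 3):
        if not allowed_by_codon_rule(sequence[i:i + 3]):
            return i
    return -1
```

For instance, `first_excluded_codon("ATGGCCTAA")` scans the codons ATG, GCC and TAA in frame 0 and reports the position of TAA.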
The proposed approach – 3/4
• Graph representation of the gene detection problem
  • DNA sequence → directed acyclic graph → dynamic programming algorithm → optimal path
  • Candidate exon, candidate intron, candidate gene
[Figure legend: exon, intron]
The proposed approach – 4/4
• A dynamic programming algorithm
  • Weight of the vertex v – W(v)
  • Weight of the edge (v1, v2) – W(v1, v2)
[Figure: graph with vertices start, acceptor, donor, stop]
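A generic way to find an optimal path through such a graph is a highest-weight-path dynamic program over the vertices in topological order. The sketch below assumes the path score is the sum of W(v) and W(v1, v2) along the path, which the slide does not state; the example graph and all its weights are hypothetical, and this is not presented as GeneScout's actual algorithm.

```python
def best_path(vertices, vertex_weight, edge_weight, source, sink):
    """Return (score, path) for the highest-weight path from source to sink.

    vertices must be in topological order (for example, by sequence position).
    edge_weight maps (v1, v2) to W(v1, v2); a missing pair means no edge.
    Returns (None, []) if sink cannot be reached from source.
    """
    incoming = {v: [] for v in vertices}
    for (v1, v2), w in edge_weight.items():
        incoming[v2].append((v1, w))

    best = {source: vertex_weight[source]}  # best score of a path ending at v
    parent = {}
    for v2 in vertices:
        for v1, w in incoming[v2]:
            if v1 not in best:  # v1 is not reachable from source
                continue
            candidate = best[v1] + w + vertex_weight[v2]
            if v2 not in best or candidate > best[v2]:
                best[v2] = candidate
                parent[v2] = v1

    if sink not in best:
        return None, []
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[sink], path[::-1]


# Hypothetical example: site vertices in sequence order, with made-up weights.
EXAMPLE_VERTICES = ["start", "donor1", "acceptor1", "donor2", "acceptor2", "stop"]
EXAMPLE_W_VERTEX = {"start": 2, "donor1": 1, "acceptor1": 1,
                    "donor2": 2, "acceptor2": 2, "stop": 2}
EXAMPLE_W_EDGE = {
    ("start", "donor1"): 4, ("start", "donor2"): 2,          # candidate exons
    ("donor1", "acceptor1"): 1, ("donor1", "acceptor2"): 1,  # candidate introns
    ("donor2", "acceptor2"): 1,
    ("acceptor1", "donor2"): 1,                              # candidate exon
    ("acceptor1", "stop"): 3, ("acceptor2", "stop"): 2,      # candidate exons
}
```

Calling `best_path(EXAMPLE_VERTICES, EXAMPLE_W_VERTEX, EXAMPLE_W_EDGE, "start", "stop")` returns the score and the chain of sites of the best candidate gene in this toy graph.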
Experiments and results – 1/3
• Data:
  • GenBank: 570 vertebrate sequences (28,992,149 nucleotides) containing 2649 exons (444,498 nucleotides)
  • Start codon – ATG
  • Donor site – GT
  • Acceptor site – AG
• Evaluation method:
  • 10-fold cross-validation
  • 570 sequences → 10 sets → 9 sets as training data, 1 set as test data
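The evaluation scheme above can be sketched as follows. The shuffling, the fixed seed and the function name are assumptions for illustration; the slide does not say how the 570 sequences were assigned to the 10 sets.

```python
import random


def ten_fold_splits(items, folds=10, seed=0):
    """Yield (training, test) lists; each set serves as the test data once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment to sets
    parts = [shuffled[i::folds] for i in range(folds)]
    for i, test in enumerate(parts):
        training = [x for j, part in enumerate(parts) if j != i for x in part]
        yield training, test
```

With 570 sequences this gives 10 rounds, each training on 513 sequences and testing on the remaining 57.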
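Experiments and results – 2/3
• Accuracy measures:
  • The fraction of coding nucleotides correctly identified
  • Correctly identified coding nucleotides relative to nucleotides mistakenly identified as coding
  • The overall prediction accuracy at the nucleotide level (ranging from 1 to -1)
  • The fraction of exons correctly identified
  • Correctly identified exons relative to exons mistakenly identified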
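The slide describes these measures only in words. The sketch below assumes the definitions commonly used when evaluating gene predictors (sensitivity, specificity as TP / (TP + FP), and a correlation coefficient ranging from 1 to -1); the formulas and function names are assumptions, not taken from the slides.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation) from nucleotide counts.

    tp: coding nucleotides predicted as coding
    fp: non-coding nucleotides predicted as coding
    tn: non-coding nucleotides predicted as non-coding
    fn: coding nucleotides predicted as non-coding
    Raises ZeroDivisionError if a denominator is zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)  # gene-finding convention, assumed here
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    correlation = (tp * tn - fn * fp) / denom
    return sensitivity, specificity, correlation


def exon_accuracy(predicted, actual):
    """Return (sensitivity, specificity) for exons given as (start, end) pairs.

    An exon counts as correct only if both boundaries match exactly.
    """
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    return correct / len(actual), correct / len(predicted)
```

Under these definitions a perfect prediction gives a correlation of 1 and a completely inverted prediction gives -1, which matches the range noted on the slide.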
Experiments and results – 3/3
• For 8 sequences, GeneScout correctly detected about 85% of the coding nucleotides, whereas GeneScan did not correctly predict any coding nucleotide
• GeneScout runs much faster than GeneScan
Conclusion
• GeneScout uses hidden Markov models to detect functional sites.
• A vertebrate genomic DNA sequence → a directed acyclic graph → a dynamic programming algorithm → optimal path
• Experimental results show that GeneScout can detect 51% of the exons in the data set.
Comments
• Enhancing the accuracy of gene detection in DNA sequences:
  • More models or rules
  • Association rules: known exons → rules
  • Rules + DNA sequences → candidate exons