
GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences

Authors: Michael M. Yin and Jason T. L. Wang. Source: Information Sciences, 163(1-3), pp. 201-218, 2004. Advisor: Min-Shiang Hwang. Speaker: Chun-Ta Li.


Presentation Transcript


  1. GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences • Authors: Michael M. Yin and Jason T. L. Wang • Source: Information Sciences, 163(1-3), pp. 201-218, 2004 • Advisor: Min-Shiang Hwang • Speaker: Chun-Ta Li

  2. Outline • Introduction • Related work • The proposed approach • Experiments and results • Conclusion • Comments

  3. Introduction – 1/4 • Data mining – knowledge discovery from data • Data mining in life sciences: • Finding clustering rules for gene expressions • Discovering classification rules for proteins • Detecting associations between metabolic pathways • Predicting genes in genomic DNA sequences

  4. Introduction – 2/4 (terms: codon, intron, exon [coding sequence], donor) • A genomic DNA sequence • Four types of nucleotides (A, C, G, T) • The basic structure of a vertebrate gene • A sequence fragment containing an exon of 296 coding nucleotides

  5. Introduction – 3/4 coding region

  6. Introduction – 4/4 • A number of programs have been developed for locating gene coding regions (exons). • Why they are insufficient: • The vertebrate DNA sequence signals involved in gene determination are usually ill defined. • Automated interpretation of genomic data without experimental validation is still a myth. • Motivation: • GeneScout: developing accurate methods for automatically detecting vertebrate genomic DNA structures. • Exon signals: start sites, junction donor sites, acceptor sites

  7. Related work – 1/2 • NN-based techniques (Neural Network) • Gene structure prediction • Training

  8. Related work – 2/2 • HMM-based techniques (Hidden Markov Models) • Describe sequential data or processes • Use a number of states • Probabilistic state transitions • Example: casting a die, with hidden states Normal and Fake
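The die example on this slide can be sketched as a two-state HMM decoded with the Viterbi algorithm. The slide gives only the state names, so every probability below is a hypothetical value chosen for illustration, and `viterbi` is an illustrative helper, not code from GeneScout.

```python
import math

# Hypothetical parameters: the slide names only the two states.
STATES = ("Normal", "Fake")
START = {"Normal": 0.5, "Fake": 0.5}
TRANS = {
    "Normal": {"Normal": 0.95, "Fake": 0.05},
    "Fake": {"Normal": 0.10, "Fake": 0.90},
}
EMIT = {
    "Normal": {face: 1 / 6 for face in range(1, 7)},
    # Assumed loaded die: a six comes up half of the time.
    "Fake": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(rolls):
    """Return the most probable hidden-state sequence for a non-empty list of rolls."""
    # score[s] is the best log-probability of any state path ending in s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []  # one back-pointer table per roll after the first
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][roll])
            )
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Log-probabilities are used so that long roll sequences do not underflow.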

  9. The proposed approach – 1/4 • HMM models for predicting functional sites • Start Site Model (figure: start codon)

  10. The proposed approach – 2/4 • An HMM model for computing coding potentials • The Codon Model • First state is base T • Second state is base A or G • Third state can only be C or T (A and G are not defined) • Stop codons: TAA, TAG, TGA (the slide also lists TGG, which encodes tryptophan in the standard genetic code)
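As a small illustration of why stop codons matter for a coding region, the sketch below scans a reading frame for the first in-frame stop codon. It assumes the standard genetic code (stop codons TAA, TAG, TGA); `first_in_frame_stop` is a hypothetical helper and not part of the Codon Model itself.

```python
# Standard genetic code stop codons (an assumption stated in the lead-in).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(sequence, frame=0):
    """Return the 0-based index of the first in-frame stop codon, or -1 if none."""
    sequence = sequence.upper()
    # Step through the sequence three bases at a time, starting at `frame`.
    for i in range(frame, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STOP_CODONS:
            return i
    return -1
```

For example, `first_in_frame_stop("ATGAAATGGTAACCC")` reads the codons ATG, AAA, TGG, TAA and reports the stop at index 9.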

  11. The proposed approach – 3/4 (figure legend: exon, intron) • Graph representation of the gene detection problem • DNA sequence → directed acyclic graph → dynamic programming algorithm → optimal path • Candidate exon, candidate intron, candidate gene

  12. The proposed approach – 4/4 • A dynamic programming algorithm • Weight of the vertex v: W(v) • Weight of the edge (v1, v2): W(v1, v2) • Vertices in the figure: start, acceptor, donor, stop
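A minimal sketch of dynamic programming over a weighted directed acyclic graph follows. The slide defines W(v) and W(v1, v2) but not how they are combined, so the sketch assumes the path score is the sum of all vertex and edge weights on the path and that the best path is the one with the maximum score. The function name, graph, and weights are hypothetical.

```python
def best_path(vertices, edges, vertex_weight, source, sink):
    """Find a maximum-weight path from source to sink in a DAG.

    vertices: vertex names in topological order (e.g. by sequence position).
    edges: dict mapping (v1, v2) to the edge weight W(v1, v2).
    vertex_weight: dict mapping v to the vertex weight W(v).
    """
    neg_inf = float("-inf")
    best = {v: neg_inf for v in vertices}  # best score of a path ending at v
    prev = {}                              # predecessor of v on that path
    best[source] = vertex_weight[source]

    incoming = {}
    for (u, v), w in edges.items():
        incoming.setdefault(v, []).append((u, w))

    # Topological order guarantees best[u] is final before v is processed.
    for v in vertices:
        for u, w in incoming.get(v, []):
            if best[u] == neg_inf:
                continue  # u is not reachable from the source
            candidate = best[u] + w + vertex_weight[v]
            if candidate > best[v]:
                best[v] = candidate
                prev[v] = u

    if best[sink] == neg_inf:
        return None, neg_inf
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[sink]


# Hypothetical toy graph: exon edges run start->donor and acceptor->stop,
# intron edges run donor->acceptor.
vertices = ["start", "donor1", "donor2", "acceptor1", "stop"]
vertex_weight = {"start": 3, "donor1": 2, "donor2": 4, "acceptor1": 1, "stop": 3}
edges = {
    ("start", "donor1"): 5,
    ("start", "donor2"): 1,
    ("donor1", "acceptor1"): 2,
    ("donor2", "acceptor1"): 2,
    ("acceptor1", "stop"): 4,
    ("start", "stop"): 1,
}
path, score = best_path(vertices, edges, vertex_weight, "start", "stop")
```

Because each vertex is visited once and each edge is relaxed once, the running time is linear in the size of the graph.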

  13. Experiments and results – 1/3 • Data: • GenBank: 570 vertebrate sequences (28,992,149 nucleotides) containing 2649 exons (444,498 nucleotides) • Start codon: ATG • Donor site: GT • Acceptor site: AG • Evaluation method: • 10-fold cross-validation • 570 sequences → 10 sets; 9 sets → training data; 1 set → test data
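The two mechanical steps on this slide, locating the consensus signals (ATG, GT, AG) and splitting the sequences into ten folds, can be sketched as below. Both helpers are hypothetical illustrations; the slide does not say how the sequences were assigned to the 10 sets, so the seeded random shuffle is an assumption.

```python
import random


def find_signal(sequence, signal):
    """Return the 0-based start positions of every occurrence of signal."""
    positions = []
    i = sequence.find(signal)
    while i != -1:
        positions.append(i)
        i = sequence.find(signal, i + 1)  # step by 1 so overlaps are found
    return positions


def k_fold_splits(items, k=10, seed=0):
    """Yield (training, test) lists; each fold serves as the test set once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]
```

With 570 sequences and k = 10, every test set holds 57 sequences and every training set holds 513. For instance, `find_signal("CCATGGTAAGATG", "ATG")` returns `[2, 10]`.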

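  14. Experiments and results – 2/3 • Measures reported in the results table: • Proportion of nucleotides correctly identified • Proportion of nucleotides correctly identified relative to nucleotides wrongly identified as coding • Overall prediction accuracy at the nucleotide level (ranges from 1 to -1) • Proportion of exons correctly identified • Proportion of exons correctly identified relative to regions wrongly identified as exons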
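The slide's symbols and formulas did not survive extraction, so the sketch below uses the conventional gene-prediction definitions as an assumption: sensitivity TP / (TP + FN), specificity TP / (TP + FP), and a correlation coefficient that ranges from -1 to 1. The function name and input encoding (1 for coding, 0 for non-coding) are hypothetical.

```python
import math


def nucleotide_metrics(actual, predicted):
    """Compare per-nucleotide labels (1 = coding, 0 = non-coding).

    Returns (sensitivity, specificity, correlation); an entry is None
    when its denominator is zero.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tp / (tp + fp) if tp + fp else None
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    correlation = (tp * tn - fn * fp) / denom if denom else None
    return sensitivity, specificity, correlation
```

For actual labels `[1, 1, 1, 1, 0, 0, 0, 0]` and predictions `[1, 1, 1, 0, 1, 0, 0, 0]`, the counts are TP = 3, TN = 3, FP = 1, FN = 1, giving sensitivity 0.75, specificity 0.75, and correlation 0.5.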

  15. Experiments and results – 3/3 • On 8 sequences, GeneScout correctly detected about 85% of the nucleotides, whereas GeneScan did not correctly predict any coding nucleotide • GeneScout runs much faster than GeneScan

  16. Conclusion • GeneScout uses hidden Markov models to detect functional sites. • A vertebrate genomic DNA sequence → a directed acyclic graph → a dynamic programming algorithm → optimal path • Experimental results show that GeneScout can detect 51% of the exons in the data set.

  17. Comments • Improving the accuracy of gene detection in DNA sequences: • More models or rules • Association rules: known exons → rules • Rules → DNA sequences → candidate exons
