
An MLE-based clustering method on DNA barcode

Ching-Ray Yu, Statistics Department, Rutgers University, USA. chingray@stat.rutgers.edu. 07/07/2006.


Presentation Transcript


  1. An MLE-based clustering method on DNA barcode. Ching-Ray Yu, Statistics Department, Rutgers University, USA. chingray@stat.rutgers.edu. 07/07/2006

  2. Outline
     • Problem & Motivation
     • Statistical modeling
     • Preliminary result
     • Future work

  3. Problem & Motivation
     • Species clustering is a central problem in the DNA barcode project.
     • MEGA3 by S. Kumar, K. Tamura, and M. Nei (2004) has had limited success when applied to the real data set.
     • A good method for assigning new biological specimens to the proper species is very important. This is also one of the goals of this DNA initiative.

  4. Statistical modeling
     • Statistical model.
     • Assumption: nucleotides are i.i.d. random variables that follow a multinomial distribution with parameter [symbol missing], where i indexes the DNA (COI) sequence and j indexes the position.
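The symbols on this slide are missing from the transcript. One way to write a model consistent with the surrounding text is sketched below. All notation here (X, p, the species superscript k, and the sequence length J) is assumed for illustration and is not taken from the slides.

```latex
% Assumed notation (not from the original slides):
%   X_{ij} \in \{A, C, G, T\} : nucleotide at position j of COI sequence i
%   p^{(k)}_{ja}              : probability of nucleotide a at position j in species k
\[
P\bigl(X_{ij} = a \mid \text{sequence } i \text{ belongs to species } k\bigr) = p^{(k)}_{ja},
\qquad
\sum_{a \in \{A, C, G, T\}} p^{(k)}_{ja} = 1 .
\]
% Treating positions as independent, the likelihood of a sequence
% x = (x_1, \dots, x_J) under species k is a product over positions:
\[
L_k(x) = \prod_{j=1}^{J} p^{(k)}_{j x_j} .
\]
```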

  5. MLE-based clustering method
     • Maximum likelihood estimator (MLE)
     • Suppose [symbols missing] are the MLEs of [symbols missing] in species k.
     • The clustering rule: a new unidentified specimen [symbol missing], where [condition missing], is assigned to species k if [condition missing].
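A minimal Python sketch of one plausible reading of this rule: estimate per-species, per-position nucleotide frequencies from training sequences, then assign a new specimen to the species with the largest log-likelihood. The function names, the assumption of aligned equal-length sequences, and the `pseudocount` smoothing are illustrative choices, not part of the slides. A pure MLE corresponds to a pseudocount of zero, which this sketch does not support because unseen nucleotides would then have probability zero.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def fit_mle(sequences, labels, pseudocount=0.5):
    """Estimate nucleotide probabilities per species and position.

    Assumes aligned sequences of equal length. pseudocount must be > 0;
    it is a smoothing assumption added here, not part of the original method.
    """
    length = len(sequences[0])
    counts = defaultdict(lambda: [Counter() for _ in range(length)])
    for seq, species in zip(sequences, labels):
        for j, base in enumerate(seq):
            if base in ALPHABET:  # gaps and ambiguity codes are skipped
                counts[species][j][base] += 1
    params = {}
    for species, per_position in counts.items():
        params[species] = []
        for c in per_position:
            total = sum(c.values()) + pseudocount * len(ALPHABET)
            params[species].append(
                {b: (c[b] + pseudocount) / total for b in ALPHABET}
            )
    return params


def log_likelihood(seq, species_params):
    """Sum of log-probabilities over positions, skipping non-ACGT symbols."""
    return sum(
        math.log(p[base])
        for base, p in zip(seq, species_params)
        if base in ALPHABET
    )


def classify(seq, params):
    """Assign seq to the species with the highest log-likelihood."""
    return max(params, key=lambda k: log_likelihood(seq, params[k]))


# Toy data (hypothetical), only to show the call pattern.
toy_params = fit_mle(["ACGT", "ACGA", "TTGT", "TTGA"], ["sp1", "sp1", "sp2", "sp2"])
toy_prediction = classify("ACGT", toy_params)
```

Working with log-likelihoods rather than raw products avoids numerical underflow on sequences that are several hundred positions long.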

  6. Preliminary result
     • The training dataset provided by DIMACS contains 150 species with 1623 COI sequences.
     • To assess the misclassification rate of the MLE-based clustering method, I randomly selected 200 COI sequences as a test set. The MLEs of the parameters were estimated from the remaining 1423 COI sequences. The misclassification rate is about 16%.
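The evaluation described above is a single random hold-out split. A minimal sketch follows, assuming the classifier is passed in as two callables. The names `fit`, `predict`, `toy_fit`, and `toy_predict` are hypothetical, and the toy classifier is only a stand-in to make the example runnable. With 1623 sequences and a test set of 200, 1423 remain for training, and a rate of about 16% corresponds to roughly 32 misclassified test sequences.

```python
import random
from collections import Counter, defaultdict


def holdout_misclassification_rate(sequences, labels, n_test, fit, predict, seed=0):
    """Hold out n_test random sequences, train on the rest, return the error rate."""
    if not 0 < n_test < len(sequences):
        raise ValueError("n_test must be between 1 and len(sequences) - 1")
    rng = random.Random(seed)
    indices = list(range(len(sequences)))
    rng.shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    model = fit([sequences[i] for i in train_idx], [labels[i] for i in train_idx])
    errors = sum(predict(sequences[i], model) != labels[i] for i in test_idx)
    return errors / n_test


# Toy stand-in classifier (hypothetical): predict the most common training
# label among sequences that share the same first nucleotide.
def toy_fit(seqs, labs):
    by_base = defaultdict(Counter)
    for s, lab in zip(seqs, labs):
        by_base[s[0]][lab] += 1
    return {base: c.most_common(1)[0][0] for base, c in by_base.items()}


def toy_predict(seq, model):
    return model.get(seq[0])


toy_sequences = ["ACGT"] * 10 + ["TCGT"] * 10
toy_labels = ["sp1"] * 10 + ["sp2"] * 10
rate = holdout_misclassification_rate(toy_sequences, toy_labels, 5, toy_fit, toy_predict)
```

A single split gives a noisy estimate. Repeating it with different seeds and averaging would give a more stable figure, though the slides report only one split.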

  7. Future work
     • The misclassification rate could be improved by considering more complex models.
     • For example:
     • gaps in the COI sequences.
     • the Kimura two-parameter (K2P) model (1980), which allows different substitution rates for transitions and transversions.
     • synonymous and non-synonymous substitutions.
     • Remark: Non-synonymous substitutions change the encoded amino acid; synonymous substitutions do not.
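For reference, the K2P distance between two aligned sequences is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions. The sketch below computes it in Python. Skipping any site where either sequence has a gap or ambiguous symbol is an assumption made here, and the function name is illustrative.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences.

    Sites where either sequence has a non-ACGT symbol (for example a gap)
    are ignored. math.log raises ValueError if the sequences are too
    divergent for the formula to be defined.
    """
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a in "ACGT" and b in "ACGT"]
    n = len(pairs)
    if n == 0:
        raise ValueError("no comparable sites")
    differences = [(a, b) for a, b in pairs if a != b]
    # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
    transitions = sum(
        1 for a, b in differences
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    )
    transversions = len(differences) - transitions
    p = transitions / n
    q = transversions / n
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

For example, `k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")` has one transition in ten sites, so P = 0.1 and Q = 0, and the distance reduces to -0.5 ln(0.8).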

  8. Proposed deliverables
     • If the MLE-based clustering method works well on barcode sequences, it can be written up as a paper, with R code made available to users of barcode data.

  9. Acknowledgement
     • NSF grant
     • Data Analysis Working Group
     • DIMACS at Rutgers
     • My advisor: Dr. Javier Cabrera

  10. References
      • M. Kimura (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.
      • Sudhir Kumar, Koichiro Tamura, and Masatoshi Nei (2004). MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Briefings in Bioinformatics 5(2): 150-163.
