An MLE-based clustering method on DNA barcode

An MLE-based clustering method on DNA barcode Ching-Ray Yu Statistics Department Rutgers University, USA chingray@stat.rutgers.edu 07/07/2006

Outline • Problem & Motivation • Statistical modeling • Preliminary result • Future work Rutgers University

Problem & Motivation • Species clustering problem is a very important issue on DNA barcode project. • MEGA3 by S. Kumar, K. Tamura, and M. Nei (2004) has limited success when applying the true data set. • A good method to cluster new biological specimens to the proper species is very important. This is also one of the goals of this DNA initiative. Rutgers University

Statistical modeling • Statistical model. • Assumption: • Nucleotides are iid. random variables which are multinomial distributed with parameter , where i stands for the DNA (COI) sequence and j indicates the position. Rutgers University

MLE-based clustering method • Maximum likelihood estimator (MLE) • Suppose are the MLE of in species k. • The clustering rule : A new unidentified specimen , where , is assigned to species k if Rutgers University

Preliminary result • There are 150 species with 1623 COI sequences in the training datasets provide by DIMACS. • In order to performance the misclassification rate of the MLE-based clustering method, I randomly select 200 COI sequences as testing set. The MLE of parameters is estimated from the rest 1423 COI sequences. The misclassification rate is about 16%. Rutgers University

Future work • The misclassification rate could be improved if more complicated models are considered. • For example: • gaps of the COI sequences. • Kimura-two-parameter (K2P) model (1980) considers the biological diversity with different evolution rate. • synonymous and non-synonymous substitutions. • Remark:Synonymous substitutions will change the encoded amino acid. Rutgers University

Proposed deliverables • If the MLE-based clustering method works well on barcode sequences, it can be written as a paper with R-code for public users of barcode data. Rutgers University

Acknowledgement • NSF grant • Data Analysis Working Group • DIMACS at Rutgers • My advisor: Dr. Javier Cabrera Rutgers University

References • M. Kimura (1980) A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120. • Sudhir Kumar, Koichior Tamura and Masatoshi Nei (2004) MEGA 3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Briefings in Bioinformatics. Vol 5. No 2. 150-163. Rutgers University

An MLE-based clustering method on DNA barcode

An MLE-based clustering method on DNA barcode

Presentation Transcript

Software Clustering Based on Information Loss Minimization

Clustering Method for Repeat Analysis in DNA sequences

Personalization in Folksonomies Based on Tag Clustering

Density Biased Sampling An improved method for clustering

Density based Clustering

Pattern-based Clustering

Graph Clustering based on Random Walk

An On-line Document Clustering Method Based on Forgetting Factors

A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining

Auto administration of databases based on clustering

An Automatic Lip-reading Method Based on Polynomial Fitting

Integrating an MLE with Voyager

A Similarity-Based Robust Clustering Method

Top-Down Clustering Method Based On TV-Tree

KCK-means A Clustering Method based on Kernel Canonical Correlation Analysis

SURVEY PROJECT “ A Clustering Method for Repeat Analysis in Dna Sequences”

An Adaptive Clustering Algorithm Based on Data Field in Complex Networks

A clustering method for repeat analysis in DNA sequences

Hot wire measurement method based on inverse method