The Barcode of Life: Integrating machine learning techniques for species prediction and discovery (www.barcodinglife.com)
Welcome to the Meeting • Barcode of Life: a great opportunity to contribute to a fast-growing area of research. • Some questions to keep in mind for the afternoon…
Open Questions • Species discovery vs. prediction • Data structure, missing data, sample sizes • New visualization tools for both discovery and prediction • Confidence measures for species discovery and individual specimen assignments: controlling the number of false discoveries, power of detection
Barcoding Data – A first look at the data. DIMACS BOL Data Analysis Working Group meeting, September 26, 2005. Rebecka Jornsten, Department of Statistics, Rutgers University. http://www.stat.rutgers.edu/~rebecka/DIMACSBOL/DimacsMeetingDATA/ Thanks to Kerri-Ann Norton
Outline • Data Structure and Data Retrieval • Sequencing and Base Calling • Distance metrics – Sequence information • Clustering • Classification • Open Questions – Discussion • Questions to think about are highlighted in red.
Sequencing • Peak finding • Deconvolution • Denoising • Normalization • Base calling • Quality assessment • (ABI base caller, Phred)
Sample Data (www.barcodinglife.com) • Leptasterias data – six-rayed sea stars • Astraptes data – moths • Collembola data – springtails
Sample Data • Leptasterias data – six-rayed sea stars: 5 species, 21 specimens; sample sizes 3-7; sequence length 1644 • Astraptes data – moths: 12 species, 451 specimens; sample sizes 3-96, 8 with more than 20; sequence length 594 • Collembola data – springtails: 18 species, 54 specimens; sample sizes 1-5; sequence length 635
Sequence Information • We can compute the information content for each nucleotide. • Is there a lot of variability between species at locus j? • Is there a lot of variability within a species at locus j? • Are the same loci discriminating between multiple species?
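One way to make "information content at locus j" concrete is the Shannon entropy of the bases observed at that locus within a species. The sketch below assumes aligned sequences of equal length; the function names and the toy sequences are illustrative and are not taken from the slides or the barcode data.

```python
import math
from collections import Counter

def column_entropy(bases):
    """Shannon entropy (in bits) of the bases observed at one locus."""
    n = len(bases)
    counts = Counter(bases)
    # sum of p * log2(1/p) over the observed bases
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def within_species_entropy(sequences):
    """Entropy at every locus for aligned sequences of a single species."""
    return [column_entropy(col) for col in zip(*sequences)]

# Toy aligned sequences (made up for illustration).
species_a = ["acgt", "acgt", "acga", "acga"]
profile = within_species_entropy(species_a)
print(profile)  # loci 0-2 are constant; locus 3 is split evenly between t and a
```

A locus with entropy 0 is constant within the species; higher values indicate within-species variability at that locus.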
Astraptes: Within-species entropy for the 9 species with 20+ specimens (figure annotation: “pure”)
Pairwise (Mutual Information) • Panels: 10 vs. 11, 10 vs. 12, 10 vs. 12, 2 vs. 10, 2 vs. 12
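The slides do not define the pairwise mutual information precisely. A plausible reading, taken here as an assumption, is the mutual information between the base at a locus and the species label, computed over the specimens of two species. The helper names and toy sequences below are hypothetical.

```python
import math
from collections import Counter

def mutual_information(bases, labels):
    """I(base; species) in bits at one locus, from empirical frequencies."""
    n = len(bases)
    joint = Counter(zip(bases, labels))
    n_base = Counter(bases)
    n_label = Counter(labels)
    mi = 0.0
    for (b, s), c in joint.items():
        # p(b, s) * log2( p(b, s) / (p(b) * p(s)) )
        mi += (c / n) * math.log2(c * n / (n_base[b] * n_label[s]))
    return mi

def pairwise_mi_profile(seqs_a, seqs_b):
    """Mutual information at every locus for two species' aligned sequences."""
    labels = ["A"] * len(seqs_a) + ["B"] * len(seqs_b)
    return [mutual_information(col, labels) for col in zip(*(seqs_a + seqs_b))]

# Toy data: locus 2 separates the species, locus 3 varies equally in both.
species_a = ["acgt", "acgt", "acga"]
species_b = ["acct", "acct", "acca"]
mi_profile = pairwise_mi_profile(species_a, species_b)
print(mi_profile)
```

Under this reading, a locus that varies only within species gets a value near 0, while a locus that separates the two species perfectly gets the full entropy of the species label.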
Distance Metric • To group the specimens in an unsupervised fashion we need to come up with a distance metric. • Without prior information about which loci are informative, we compute distances using the entire sequence (for Astraptes, 594 bases). • The 0-1 distance metric is the most commonly used. • However, some bases are ‘uncalled’, usually denoted by a letter other than a, c, g, t. • How should we take this into account?
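As a minimal sketch of the 0-1 distance, the code below counts mismatches over aligned positions. How to treat uncalled bases is left open on the slide; skipping any locus where either base is uncalled and normalizing by the number of compared loci is just one convention, chosen here as an assumption.

```python
CALLED = set("acgt")

def zero_one_distance(seq1, seq2):
    """Proportion of mismatches over loci where both bases are called.

    Skipping loci with an uncalled base is an assumption of this sketch;
    counting them as mismatches or as partial matches are alternatives.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq1.lower(), seq2.lower()):
        if a in CALLED and b in CALLED:
            compared += 1
            mismatches += a != b
    # No comparable locus: the distance is undefined.
    return mismatches / compared if compared else float("nan")

# 'n' marks an uncalled base in this toy example.
d = zero_one_distance("acgtn", "acgan")
print(d)
```

Normalizing by the number of compared loci keeps distances comparable between specimens with different numbers of uncalled bases, at the cost of ignoring whatever information those positions carry.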
Another example: Leptasterias – 5 species, 21 specimens • Panels: All groups; Groups 3 vs. 4
Hierarchical Clustering: Leptasterias • Dendrogram labels: Group 1, Group 2 • Problem…
PAM Clustering: Leptasterias • Selecting the number of clusters via silhouette width, CV, etc. leads to combining species 3 and 4: these data do not support a separate species.
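The silhouette width mentioned above can be computed directly from a distance matrix and a cluster assignment. The sketch below uses the standard definition s(i) = (b - a) / max(a, b), where a is the mean distance from specimen i to its own cluster and b is the smallest mean distance to any other cluster. It does not implement PAM itself, and the toy points are made up.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of every item, given a full distance matrix."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    n = len(labels)
    widths = []
    for i in range(n):
        mean_to = {}
        for k in clusters:
            members = [j for j in range(n) if labels[j] == k and j != i]
            if members:
                mean_to[k] = sum(dist[i][j] for j in members) / len(members)
        own = labels[i]
        if own not in mean_to:
            widths.append(0.0)  # singleton cluster: width set to 0 by convention
            continue
        a = mean_to[own]
        b = min(v for k, v in mean_to.items() if k != own)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths

def average_silhouette(dist, labels):
    widths = silhouette_widths(dist, labels)
    return sum(widths) / len(widths)

# Toy example: four points on a line forming two tight groups.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
good = average_silhouette(dist, [0, 0, 1, 1])
bad = average_silhouette(dist, [0, 1, 0, 1])
print(good, bad)
```

Comparing the average silhouette width across candidate numbers of clusters is one common way to choose among them; a partition that splits a homogeneous group tends to score lower than one that keeps it together.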
Classification • A classifier that in principle closely resembles the hierarchical clustering approach is kNN. • Leave-one-out cross-validation: • On the Leptasterias data, 1-2 specimens are misallocated with this classifier. • Both of these specimens are in group 4 (and mislabeled as 3). • Via cross-validation we see that one observation is labeled as 4 only if it’s in the training set, otherwise as 3. • The other mislabeled observation fails in 15 out of 20 training scenarios. Both these specimens may have been mislabeled?
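A minimal sketch of kNN with leave-one-out cross-validation follows, using the plain mismatch count as the distance. The toy sequences and labels are invented to mimic a specimen labeled as one group but sitting closer to another; they are not the Leptasterias data.

```python
from collections import Counter

def mismatches(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(query, train_seqs, train_labels, k=1):
    """Majority label among the k training sequences closest to the query."""
    order = sorted(range(len(train_seqs)),
                   key=lambda j: mismatches(query, train_seqs[j]))
    votes = Counter(train_labels[j] for j in order[:k])
    return votes.most_common(1)[0][0]

def loo_misallocated(seqs, labels, k=1):
    """Indices of specimens misclassified when each is held out in turn."""
    wrong = []
    for i in range(len(seqs)):
        rest_seqs = seqs[:i] + seqs[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(seqs[i], rest_seqs, rest_labels, k) != labels[i]:
            wrong.append(i)
    return wrong

# Toy data: the last specimen is labeled sp4 but resembles sp3.
seqs = ["aaaaaa", "aaaaat", "aaaata", "gggggg", "gggggt", "aaaagg"]
labels = ["sp3", "sp3", "sp3", "sp4", "sp4", "sp4"]
errors = loo_misallocated(seqs, labels, k=1)
print(errors)
```

Ties in distance or in votes are broken by input order in this sketch, which matters with sample sizes as small as those in the sample data.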
Classification • A simple alternative is to use a centroid-based classifier. • Assign each new specimen to the species whose consensus sequence it is closest to. • We can match specimens to a consensus sequence based on the 0-1 distance, or • use the position weights of each base in the consensus sequence.
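The consensus-based classifier can be sketched as follows. The slide does not spell out how position weights enter the score, so this example assumes the weight of a base at a locus is its relative frequency within the species, and scores a specimen by the average weight of its own bases. All names and sequences are illustrative.

```python
from collections import Counter

def position_weights(seqs):
    """Relative frequency of each base at each locus within one species."""
    n = len(seqs)
    return [{b: c / n for b, c in Counter(col).items()} for col in zip(*seqs)]

def consensus(seqs):
    """Most frequent base at each locus."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def weighted_similarity(seq, weights):
    """Average within-species frequency of the specimen's base at each locus."""
    return sum(w.get(b, 0.0) for b, w in zip(seq, weights)) / len(seq)

def classify(seq, training):
    """training maps species name -> list of aligned sequences."""
    scores = {sp: weighted_similarity(seq, position_weights(seqs))
              for sp, seqs in training.items()}
    return max(scores, key=scores.get), scores

training = {
    "sp1": ["acgt", "acgt", "acga"],
    "sp2": ["ttgt", "ttgt", "ttgt"],
}
label, scores = classify("acga", training)
print(consensus(training["sp1"]), label, scores)
```

Replacing `weighted_similarity` with a mismatch count against `consensus(seqs)` gives the unweighted 0-1 variant named on the slide.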
Classification • On the Leptasterias data, the consensus sequence (CS) based classifier makes 2 errors (LOO CV) • “Vote of confidence”=weighted 0/1 distance to CS
Classification • Relative voting (RV) strength illustrates that species 3 and 4 are difficult to separate, and the misallocated specimens are associated with low relative votes. • RV = [max(weighted similarity) - (max-1)(weighted similarity)] / (max-1)(weighted similarity), where (max-1) denotes the second-largest value.
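Reading RV as the gap between the largest and second-largest weighted similarity, divided by the second-largest (an interpretation of the slide's formula), a small sketch looks like this. The similarity values are invented for illustration.

```python
def relative_vote(similarities):
    """(largest - second largest) / second largest weighted similarity.

    similarities maps species name -> weighted similarity of one specimen.
    """
    ranked = sorted(similarities.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need similarities to at least two species")
    best, runner_up = ranked[0], ranked[1]
    if runner_up == 0:
        return float("inf") if best > 0 else 0.0
    return (best - runner_up) / runner_up

# A specimen that is hard to place between two close species...
close_call = relative_vote({"sp3": 0.96, "sp4": 0.95, "sp1": 0.60})
# ...versus one with a clear winner.
clear_call = relative_vote({"sp1": 0.90, "sp3": 0.60})
print(close_call, clear_call)
```

A value near 0 means the top two species are almost tied for that specimen, which is the situation the slide describes for species 3 and 4.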
Base calling is not perfect: errors are made, and there are programs (e.g. Phred) that can analyze the ABI traces and assign confidence measures to each base. An interesting question: can we obtain similar error rates for species prediction and discovery with smaller sample sizes if quality measures are incorporated into the analysis?
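One hypothetical way to bring base-call quality into the distance is sketched below. The conversion from a Phred score Q to an error probability, P = 10^(-Q/10), is the standard Phred definition. Weighting each locus by the probability that both calls are correct is an assumption of this sketch, not a method from the slides.

```python
def phred_to_error_prob(q):
    """Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)

def quality_weighted_distance(seq1, qual1, seq2, qual2):
    """0-1 distance with each locus weighted by the probability that
    both base calls are correct (a hypothetical weighting scheme)."""
    num = 0.0
    den = 0.0
    for a, qa, b, qb in zip(seq1, qual1, seq2, qual2):
        w = (1 - phred_to_error_prob(qa)) * (1 - phred_to_error_prob(qb))
        num += w * (a != b)
        den += w
    return num / den if den else float("nan")

# The only mismatch falls on a low-quality call (Q = 3) in the first read.
d_weighted = quality_weighted_distance("acgt", [40, 40, 40, 3],
                                       "acga", [40, 40, 40, 40])
print(d_weighted)  # smaller than the unweighted 0-1 distance of 0.25
```

Down-weighting low-quality positions reduces the influence of likely base-calling errors on the distance; whether this helps classification at small sample sizes is exactly the open question posed above.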
Before the Discussion Session • Try out some clustering techniques on the sample data • Number of uncalled bases? • Length of sequences? • Sample sizes – effect on clustering? • Sample sizes – effect on classification?