General Data Analysis Issues and Approaches in Metabolomics
Bruce S. Kristal, Ph.D.
Department of Neurosurgery, Brigham and Women’s Hospital
Department of Surgery, Harvard Medical School (Pending)
Secretary, Metabolomics Society
…the statistician’s task, in fact, is limited to the extraction of the whole of the available information on any particular issue. R.A. Fisher
Statistics: What is the probability that what was observed occurred by chance?
Informatics: What was observed?
Data vs Information
Hierarchical Clustering
Probably Happy Probably Sad
Megavariate Analysis
• Clustering
• Principal components
• Pattern recognition
HUMANS DO MEGAVARIATE ANALYSIS INNATELY
What is Multi-/Megavariate Analysis?
• Simplifying large data sets for human consideration
  • Clustering and Principal Components
• Pattern Recognition:
  • Classifying unknowns into previously defined groups
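As an illustration of "classifying unknowns into previously defined groups", here is a minimal sketch of a nearest-centroid classifier. It is only one of many pattern recognition methods and is not taken from the slides; the two-variable profiles, the group labels "A" and "B", and the function names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(rows):
    """Mean profile of a group: the average of each variable (column)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def classify(unknown, groups):
    """Return the label of the group whose centroid is closest to `unknown`."""
    centroids = {label: centroid(rows) for label, rows in groups.items()}
    return min(centroids, key=lambda label: dist(unknown, centroids[label]))


# Hypothetical two-variable profiles for two previously defined groups.
groups = {
    "A": [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]],
    "B": [[4.0, 5.0], [4.2, 5.1], [3.8, 4.9]],
}

label_near_a = classify([1.1, 2.1], groups)
label_near_b = classify([3.9, 5.2], groups)
print(label_near_a, label_near_b)
```

With real data the variables would usually be scaled first, since Euclidean distance is dominated by whichever variable has the largest numeric range.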
What is Multi-/Megavariate Analysis?
• Data-mining
  • How many customers who buy pretzels also buy potato chips?
• Estimation and prediction
  • Multivariate regression
  • Which variables are most important?
• Mathematical modeling
• Outlier diagnostics
• Enables data-driven approaches
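The pretzel and potato chip question can be answered by simple counting. The sketch below assumes each purchase is recorded as a set of item names; the baskets and the function name `co_purchase` are made up for illustration.

```python
# Hypothetical purchase records, one set of items per customer.
baskets = [
    {"pretzels", "potato chips", "soda"},
    {"pretzels", "potato chips"},
    {"pretzels", "beer"},
    {"potato chips", "salsa"},
    {"bread", "milk"},
]


def co_purchase(baskets, item_a, item_b):
    """Count buyers of item_a, buyers of both items, and the fraction of
    item_a buyers who also bought item_b."""
    with_a = [b for b in baskets if item_a in b]
    with_both = [b for b in with_a if item_b in b]
    fraction = len(with_both) / len(with_a) if with_a else 0.0
    return len(with_a), len(with_both), fraction


n_pretzel, n_both, fraction = co_purchase(baskets, "pretzels", "potato chips")
print(f"{n_both} of {n_pretzel} pretzel buyers also bought potato chips "
      f"({fraction:.0%})")
```

The same counting applies to any pair of binary observations, for example whether two metabolites are both above a detection threshold in the same sample.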
[Figure: overview of a metabolomics study. Panels: Sample Collection; Sample Analysis (chromatogram, Response (µA) vs. Retention time (minutes), with samples labeled AL1 to AL8 and DR1 to DR8); Database Curation; Computational Modeling of Metabolic Serotypes; Objectively Defining Class Identity (Observed Values vs. Predicted Values); Modeling Metabolic Interactions; Following Biochemical Pathways; Bioinformatics. Applications listed: Mechanistic Insight, Drug Development, Toxicology, Classification, Prediction, Functional genomics, Sub-threshold studies, Others.]
Informatics: An example classification workflow
• Data Validation, Data Normalization, Missing Data Decisions, Inclusion/Exclusion Criteria
• Subgroups, Class-specific models
• Outlier removal, scaling, transformations
• Unsupervised: Clustering, SOMs, PCA
• Supervised: kNN, SIMCA, PLS, PLS-DA, Random Forest
• Machine learning: Neural Nets, GAs, GPs
• Overfit tests
• Internal validation, optimization
• External validation, optimization
• 2° validation
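To make the scaling step of such a workflow concrete, here is a minimal sketch of autoscaling (mean-centering and dividing by the standard deviation). Autoscaling is an assumption here, one common choice among several, and the slides do not say which scaling was used. The data and function names are hypothetical. The sketch learns the scaling parameters from the training samples only and reuses them for the held-out samples, so that no information from the validation data enters model building.

```python
from statistics import mean, stdev


def fit_autoscale(train_rows):
    """Learn per-variable mean and standard deviation from training data only."""
    return [(mean(col), stdev(col)) for col in zip(*train_rows)]


def apply_autoscale(rows, params):
    """Center and scale each variable using previously learned parameters."""
    return [
        [(x - m) / s if s > 0 else 0.0 for x, (m, s) in zip(row, params)]
        for row in rows
    ]


# Hypothetical data: two variables on very different numeric scales.
train = [[10.0, 0.1], [12.0, 0.3], [14.0, 0.2]]
test = [[16.0, 0.2]]

params = fit_autoscale(train)
train_scaled = apply_autoscale(train, params)
test_scaled = apply_autoscale(test, params)  # reuses the training parameters
print(train_scaled)
print(test_scaled)
```

After scaling, both variables contribute on a comparable footing to distance-based methods such as clustering or kNN.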
Multiple Approaches
• Mathematical robustness
• Megavariate analysis is not word processing
• Different algorithms see different things!
• Different answers can be both right, or both wrong
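A small sketch of how "different algorithms see different things": the same six one-dimensional points are grouped into two clusters by agglomerative clustering, once with single linkage and once with complete linkage. The points are hypothetical and chosen so that the two linkage rules disagree; the function name `agglomerate` is illustrative and not from the slides.

```python
def agglomerate(points, n_clusters, linkage):
    """Naive agglomerative clustering of 1-D points.

    linkage="single" merges on the smallest pairwise distance between
    clusters; linkage="complete" merges on the largest.
    """
    clusters = [[p] for p in points]

    def cluster_distance(a, b):
        pairwise = [abs(x - y) for x in a for y in b]
        return min(pairwise) if linkage == "single" else max(pairwise)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under the chosen linkage rule.
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical points with slowly widening gaps (10, 11, 12, 13, 14).
points = [0, 10, 21, 33, 46, 60]

single = agglomerate(points, 2, "single")
complete = agglomerate(points, 2, "complete")
print("single linkage:  ", single)
print("complete linkage:", complete)
```

Single linkage chains the first five points together and leaves 60 on its own, while complete linkage splits the data into a group of four and a group of two. Neither answer is wrong; each reflects a different definition of "close".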
…the statistician’s task, in fact, is limited to the extraction of the whole of the available information on any particular issue. R.A. Fisher
“THE” Problem: Overfitting
• Beware the power of today’s tools
  • PLS-DA/O-PLS
  • GAs/GPs, neural nets, machine learning
• Try to understand your tools
  • At least conceptually
• PCA and selective reporting
  • Choosing components is not objective
  • Beware of “low value” components
• Clustering and rotations
• DO NOT search until you like what you see
• Choosing multiple tools/conditions is fine – in the model building phase
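The danger of searching until you like what you see can be shown with pure noise. In this sketch, which is a hypothetical simulation and not an analysis from the slides, the class labels carry no information at all. Searching 1000 random variables for the single best cutoff still tends to yield an impressive-looking classifier on 20 training samples, which then falls back to roughly chance level on new samples. The sample sizes, the seed, and the function names are arbitrary assumptions.

```python
import random


def best_stump(X, y):
    """Search every variable, cutoff and direction for the best training accuracy."""
    best = (0.0, None, None, None)  # accuracy, variable index, threshold, sign
    n = len(y)
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        for threshold in column:
            hits = sum((v >= threshold) == bool(label)
                       for v, label in zip(column, y))
            for sign, acc in ((1, hits / n), (-1, 1 - hits / n)):
                if acc > best[0]:
                    best = (acc, j, threshold, sign)
    return best


def accuracy(X, y, j, threshold, sign):
    """Accuracy of one fixed rule (variable j, threshold, direction) on new data."""
    hits = sum((row[j] >= threshold) == bool(label) for row, label in zip(X, y))
    acc = hits / len(y)
    return acc if sign == 1 else 1 - acc


rng = random.Random(0)
n_train, n_test, n_vars = 20, 500, 1000


def noise(n):
    """Random variables with alternating labels: no real class difference."""
    X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y


X_train, y_train = noise(n_train)
X_test, y_test = noise(n_test)

train_acc, j, threshold, sign = best_stump(X_train, y_train)
test_acc = accuracy(X_test, y_test, j, threshold, sign)
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

More flexible tools than a single cutoff can fit noise even more easily, which is why performance on the data used for model building says little on its own.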
“Solutions”
• Data analysis is not word processing
• Permutation Testing is a step in the right direction
• The Gold Standard is biological replication
• Training Sets and test sets should have no members in common
  • Rarely recognized
  • Not always possible…
• Set up design as rigorously as possible
  • In advance…
• Our definition:
  • Training sets are proof of principle
  • Test sets are, theoretically, validation
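A minimal sketch of permutation testing, assuming the statistic of interest is the difference in group means for a single variable. The values, group sizes, and function name are hypothetical, and the same idea extends to a model's classification accuracy. The labels are shuffled many times to see how often chance alone produces a difference at least as large as the one observed.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels give a mean difference at least
    as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any link between values and labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical concentrations of one metabolite in two groups.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
group_b = [6.9, 7.1, 7.3, 7.0, 6.8, 7.2]

p_value = permutation_p_value(group_a, group_b)
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value here only says the difference is unlikely under label shuffling within this data set. It does not replace an independent test set or biological replication.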
Three “final” thoughts
• There is an inherent statistical and informatics minefield that arises when the number of variables queried far exceeds the number of observations (“N vs P problem”)
• Caution: mathematical validation is NOT biological validation
• Report what you do
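One facet of the N vs P problem can be put into numbers. The sketch below assumes P independent variables, none of which truly differs between groups, each tested at significance level alpha. The independence assumption, the values of P and alpha, and the function name are illustrative and not from the slides.

```python
def chance_findings(n_variables, alpha=0.05):
    """Expected false positives and the probability of at least one,
    assuming independent tests with no true effect."""
    expected = alpha * n_variables
    p_at_least_one = 1 - (1 - alpha) ** n_variables
    return expected, p_at_least_one


for n_variables in (10, 1000):
    expected, p_any = chance_findings(n_variables)
    print(f"P = {n_variables}: expect {expected:.1f} false positives, "
          f"P(at least one) = {p_any:.3f}")

expected_1000, p_any_1000 = chance_findings(1000)
expected_10, p_any_10 = chance_findings(10)
```

Under these assumptions, testing 1000 variables at alpha = 0.05 is expected to flag about 50 of them by chance alone, regardless of how few samples were measured.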