General Data Analysis Issues and Approaches in Metabolomics

Bruce S. Kristal, Ph.D. Department of Neurosurgery, Brigham and Women’s Hospital Department of Surgery, Harvard Medical School (Pending) Secretary, Metabolomics Society

…the statistician’s task, in fact, is limited to the extraction of the whole of the available information on any particular issue. R.A. Fisher

Working Definitions

Statistics: What is the probability that what was observed occurred by chance?

Informatics: What was observed?

Data vs Information

Data Information

Can you group these?

Partitional Clustering
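As a concrete illustration of partitional clustering, here is a minimal k-means sketch (Lloyd's algorithm). The slide does not name a specific algorithm, so k-means is an assumption, and the `kmeans` helper and the six data points are hypothetical.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                              + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, labels

# Hypothetical data: two visually obvious groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The number of groups k has to be chosen by the analyst, which is one place where different choices can give different, equally defensible answers.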

Can you group these?

Hierarchical Clustering
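Hierarchical clustering builds a tree of merges rather than a fixed number of groups. The sketch below uses single linkage with Euclidean distance; both are assumptions (the slide names no linkage or metric), and the `single_linkage` helper and its data are hypothetical.

```python
def single_linkage(points):
    """Agglomerative clustering; returns the merge history as
    (cluster_a, cluster_b, distance) tuples in the order merges happen."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[i] for i in range(len(points))]  # every sample starts alone
    merges = []
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical data: two tight pairs that are far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
merges = single_linkage(points)
for left, right, d in merges:
    print(left, right, d)
```

Cutting the tree at any distance between the early, short merges and the final, long merge recovers the two pairs; a different linkage rule can produce a different tree from the same data.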

How much information is enough?

Principal Components Analysis
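One way to make "how much information is enough?" concrete is the fraction of total variance each principal component captures. The sketch below is deliberately restricted to two variables, where the eigenvalues of the covariance matrix have a closed form; the `explained_variance_2d` helper and the data are hypothetical.

```python
def explained_variance_2d(xs, ys):
    """Fraction of total variance captured by each principal component
    of two variables, from the eigenvalues of their 2x2 covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = (((sxx - syy) / 2) ** 2 + sxy ** 2) ** 0.5
    lam1, lam2 = half_trace + root, half_trace - root
    total = lam1 + lam2
    return lam1 / total, lam2 / total

# Hypothetical data: two strongly correlated measurements.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
pc1, pc2 = explained_variance_2d(xs, ys)
print(f"PC1 share: {pc1:.3f}, PC2 share: {pc2:.3f}")
```

When two variables are this strongly correlated, the first component carries nearly all of the variance, so one axis summarizes what two measurements were saying.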

Given experience, what can we know about unknowns

Probably Happy Probably Sad

Pattern Recognition
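Classifying an unknown from prior examples can be sketched with a k-nearest-neighbors vote. The "happy"/"sad" labels echo the slide; the two numeric features, the training values, and the `knn_predict` helper are hypothetical.

```python
from collections import Counter

def knn_predict(train, unknown, k=3):
    """Classify `unknown` by majority vote among its k nearest training
    samples. `train` is a list of (feature_vector, label) pairs."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], unknown)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical "experience": previously labeled examples.
train = [((0.9, 0.8), "happy"), ((0.8, 0.9), "happy"), ((0.7, 0.7), "happy"),
         ((0.1, 0.2), "sad"), ((0.2, 0.1), "sad"), ((0.3, 0.3), "sad")]

print("Probably", knn_predict(train, (0.85, 0.75)))
print("Probably", knn_predict(train, (0.15, 0.25)))
```

The classifier can only assign an unknown to a group it has already seen, which is why the slide says "probably".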

Megavariate Analysis
• Clustering
• Principal components
• Pattern recognition
HUMANS DO MEGAVARIATE ANALYSIS INNATELY

What we don’t do so well…

What is Multi-/Megavariate Analysis?
• Simplifying large data sets for human consideration
  • Clustering and Principal Components
• Pattern Recognition:
  • Classifying unknowns into previously defined groups

What is Multi-/Megavariate Analysis?
• Data-mining
  • How many customers who buy pretzels also buy potato chips?
• Estimation and prediction
  • Multivariate regression
  • Which variables are most important?
• Mathematical modeling
• Outlier diagnostics
• Enables data-driven approaches
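The pretzels-and-chips question is a co-occurrence count. A minimal sketch, assuming each purchase is recorded as a set of items; the baskets and the `co_purchase` helper are hypothetical.

```python
def co_purchase(baskets, item, other):
    """Among baskets containing `item`, count how many also contain `other`.
    Returns (both, with_item, fraction), or None if `item` never appears."""
    with_item = [b for b in baskets if item in b]
    if not with_item:
        return None
    both = sum(1 for b in with_item if other in b)
    return both, len(with_item), both / len(with_item)

# Hypothetical purchase records.
baskets = [
    {"pretzels", "potato chips"},
    {"pretzels"},
    {"pretzels", "potato chips", "soda"},
    {"soda"},
    {"potato chips"},
]
result = co_purchase(baskets, "pretzels", "potato chips")
print(result)
```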

Why do it?

Omics datasets are otherwise beyond human comprehension

Informatics in Metabolomics

[Figure: overview of informatics in metabolomics. Panel labels: Sample Collection; Sample Analysis (chromatogram, Response (µA) vs. Retention time (minutes)); Database Curation; Computational Modeling of Metabolic Serotypes; Objectively Defining Class Identity (clustering of samples AL1–AL8 and DR1–DR8); Observed Values vs. Predicted Values (Actual vs. Predicted); Modeling Metabolic Interactions; Following Biochemical Pathways; Bioinformatics. Other labels: Mechanistic Insight, Drug Development, Toxicology, Classification, Prediction, Functional genomics, Sub-threshold studies, Others.]

Informatics: An example classification workflow
• Data validation, data normalization, missing data decisions, inclusion/exclusion criteria
• Subgroups, class-specific models
• Outlier removal, scaling, transformations
• Unsupervised: clustering, SOMs, PCA
• Supervised: kNN, SIMCA, PLS, PLS-DA, Random Forest
• Machine learning: neural nets, GAs, GPs
• Overfit tests; internal validation, optimization; external validation, optimization; 2° validation
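The scaling and transformation step can be sketched with two common choices: a log transform followed by autoscaling (mean-centering and dividing by the standard deviation). The workflow does not say which scaling or transformation is used, so both are assumptions, and the response values are hypothetical.

```python
import math
import statistics

def log_transform(values):
    """Base-10 log; assumes every value is strictly positive."""
    return [math.log10(v) for v in values]

def autoscale(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical responses for one metabolite spanning three orders of magnitude.
raw = [0.2, 2.0, 20.0, 200.0]
scaled = autoscale(log_transform(raw))
print([round(v, 3) for v in scaled])
```

Without a step like this, high-abundance variables tend to dominate PCA and clustering simply because their numbers are larger. Whichever choice is made, it belongs in the report.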

Practicality important – not theory

Multivariate Analysis is Easy

But…

Art – Not Science

Multiple Approaches
• Mathematical robustness
• Megavariate analysis is not word processing
• Different algorithms see different things!
• Different answers can be both right, or both wrong

Multivariate Analysis can be easy – or too easy

…the statistician’s task, in fact, is limited to the extraction of the whole of the available information on any particular issue. R.A. Fisher

“THE” Problem: Overfitting
• Beware the power of today’s tools
  • PLS-DA/O-PLS
  • GAs/GPs, neural nets, machine learning
• Try to understand your tools
  • At least conceptually
• PCA and selective reporting
  • Choosing components is not objective
  • Beware of “low value” components
• Clustering and rotations
• DO NOT search until you like what you see
  • Choosing multiple tools/conditions is fine – in the model building phase
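How easily a search overfits can be shown with pure noise. In this sketch every feature is random and the labels are arbitrary, yet an exhaustive search over features and thresholds still finds a rule that fits the training samples. The sample sizes, the single-feature threshold rule, and all names are illustrative assumptions, not a method from the slides.

```python
import random

rng = random.Random(0)
N_TRAIN, N_TEST, P = 10, 1000, 500  # far more variables than training samples

def noise_samples(n):
    """n samples of P features each, all uniform random noise."""
    return [[rng.random() for _ in range(P)] for _ in range(n)]

X_train, X_test = noise_samples(N_TRAIN), noise_samples(N_TEST)
y_train = [i % 2 for i in range(N_TRAIN)]            # balanced, arbitrary labels
y_test = [rng.randint(0, 1) for _ in range(N_TEST)]  # labels unrelated to X_test

def accuracy(X, y, feature, threshold, high_is_one):
    """Accuracy of the rule: predict class 1 on one side of the threshold."""
    preds = [int((row[feature] > threshold) == high_is_one) for row in X]
    return sum(p == label for p, label in zip(preds, y)) / len(y)

# "Search until you like what you see": every feature, threshold, direction.
train_acc, feature, threshold, high_is_one = max(
    (accuracy(X_train, y_train, f, row[f], hi), f, row[f], hi)
    for f in range(P) for row in X_train for hi in (True, False)
)
test_acc = accuracy(X_test, y_test, feature, threshold, high_is_one)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

With this many candidate rules, training accuracy should come out at or near perfect while accuracy on the unseen samples stays near chance. The apparent signal comes from the search itself, not from the data.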

“Solutions”
• Data analysis is not word processing
• Permutation testing is a step in the right direction
• The gold standard is biological replication
• Training sets and test sets should have no members in common
  • Rarely recognized
  • Not always possible…
• Set up design as rigorously as possible
  • In advance…
• Our definition:
  • Training sets are proof of principle
  • Test sets are, theoretically, validation
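Two of these points lend themselves to a short sketch: a permutation test, and a check that training and test sets share no members. The test statistic (absolute difference in group means), the helper names, the sample IDs, and the data are all hypothetical choices made for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Estimate how often randomly shuffled group labels give a mean
    difference at least as large as the one actually observed."""
    rng = random.Random(seed)

    def mean_gap(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_gap(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any real link between value and label
        if mean_gap(pooled[:n_a], pooled[n_a:]) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)

def shared_members(train_ids, test_ids):
    """Sample IDs that appear in both sets; should be empty."""
    return set(train_ids) & set(test_ids)

# Hypothetical measurements from two clearly separated groups.
group_a = [1.0, 1.2, 0.9, 1.1, 1.3]
group_b = [2.0, 2.2, 1.9, 2.1, 2.3]
p_value = permutation_p_value(group_a, group_b)
overlap = shared_members(["s1", "s2", "s3"], ["s4", "s5"])
print(p_value, overlap)
```

A small p-value here says only that the observed difference is unlikely under shuffled labels. It does not replace an independent biological replicate.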

Three “final” thoughts
• There is an inherent statistical and informatics minefield that arises when the number of variables queried far exceeds the number of observations (“N vs P problem”)
• Caution: mathematical validation is NOT biological validation
• Report what you do
