
General Data Analysis Issues and Approaches in Metabolomics

General Data Analysis Issues and Approaches in Metabolomics. Bruce S. Kristal, Ph.D.; Department of Neurosurgery, Brigham and Women’s Hospital; Department of Surgery, Harvard Medical School (Pending); Secretary, Metabolomics Society.


Presentation Transcript


  1. General Data Analysis Issues and Approaches in Metabolomics • Bruce S. Kristal, Ph.D. • Department of Neurosurgery, Brigham and Women’s Hospital • Department of Surgery, Harvard Medical School (Pending) • Secretary, Metabolomics Society

  2. …the statistician’s task, in fact, is limited to the extraction of the whole of the available information on any particular issue. R.A. Fisher

  3. Working Definitions

  4. Statistics: What is the probability that what was observed occurred by chance?

  5. Informatics: What was observed?

  6. Data vs Information

  7. Data Information

  8. Can you group these?

  9. Partitional Clustering

  10. Can you group these?

  11. Hierarchical Clustering

  12. How much information is enough?

  13. How much information is enough?

  14. How much information is enough?

  15. How much information is enough?

  16. How much information is enough?

  17. Principal Components Analysis

  18. Given experience, what can we know about unknowns

  19. Probably Happy Probably Sad

  20. Pattern Recognition

  21. Megavariate Analysis • Clustering • Principal components • Pattern recognition • HUMANS DO MEGAVARIATE ANALYSIS INNATELY

  22. What we don’t do so well…

  23. What is Multi-/Megavariate Analysis? • Simplifying large data sets for human consideration • Clustering and Principal Components • Pattern Recognition: • Classifying unknowns into previously defined groups

  24. What is Multi-/Megavariate Analysis? • Data-mining • How many customers who buy pretzels also buy potato chips? • Estimation and prediction • Multivariate regression • Which variables are most important? • Mathematical modeling • Outlier diagnostics • Enables data-driven approaches

  25. Why do it?

  26. Omics datasets are otherwise beyond human comprehension

  27. Informatics in Metabolomics

  28. [Figure: informatics pipeline in metabolomics. Panel titles: Sample Collection; Sample Analysis (plot of response in µA vs. retention time in minutes); Database Curation (sample labels AL1–AL8 and DR1–DR8); Computational Modeling of Metabolic Serotypes; Objectively Defining Class Identity; Observed Values vs. Predicted Values; Modeling Metabolic Interactions; Following Biochemical Pathways; Bioinformatics. Listed outcomes: Mechanistic Insight, Drug Development, Toxicology, Classification, Prediction, Functional genomics, Sub-threshold studies, Others.]

  29. Informatics: An example classification workflow • Data Validation, Data Normalization, Missing Data Decisions, Inclusion/Exclusion Criteria • Subgroups, Class-specific models • Outlier removal → scaling → transformations • Unsupervised: Clustering, SOMs, PCA • Supervised: kNN, SIMCA, PLS, PLS-DA, Random Forest • Machine learning: Neural Nets, GAs, GPs • Overfit tests, Internal validation, optimization, External validation, optimization, 2° validation
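
The "scaling" and "transformations" steps in the workflow on slide 29 are not specified further in the deck. As one possible illustration, the sketch below log-transforms a single metabolite column and then autoscales it (mean-centering, unit variance). The choice of log transform and autoscaling, and the function name `log_autoscale`, are assumptions made for this example and are not taken from the presentation.

```python
import math
import statistics

def log_autoscale(column):
    """Log-transform one metabolite column, then mean-center and scale to unit variance.

    Assumes every value is strictly positive and that the column is not constant
    (a constant column has zero standard deviation and cannot be scaled this way).
    """
    logged = [math.log(v) for v in column]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation
    return [(v - mean) / sd for v in logged]

# Hypothetical concentrations spanning two orders of magnitude.
scaled = log_autoscale([1.0, 10.0, 100.0])
print(scaled)
```

Whatever preprocessing is chosen, it is a modeling decision like any other, so it belongs among the things to report.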

  30. Practicality important – not theory

  31. Multivariate Analysis is Easy

  32. But…

  33. Art – Not Science

  34. Multiple Approaches • Mathematical robustness • Megavariate analysis is not word processing • Different algorithms see different things! • Different answers can be both right, or both wrong

  35. Multivariate Analysis can be easy – or too easy

  36. …the statistician’s task, in fact, is limited to the extraction of the whole of the available information on any particular issue. R.A. Fisher

  37. “THE” Problem: Overfitting • Beware the power of today’s tools • PLS-DA/O-PLS • GAs/GPs, neural nets, machine learning • Try to understand your tools • At least conceptually • PCA and selective reporting • choosing components is not objective • Beware of “low value” components • Clustering and rotations • DO NOT search until you like what you see • Choosing multiple tools/conditions is fine – in the model building phase

  38. “Solutions” • Data analysis is not word processing • Permutation Testing is a step in the right direction • The Gold Standard is biological replication • Training Sets and test sets should have no members in common • Rarely recognized • Not always possible… • Set up design as rigorously as possible • In advance… • Our definition: • Training sets are proof of principle • Test sets are, theoretically, validation
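
To make the permutation testing bullet on slide 38 concrete, here is a minimal sketch of a label-permutation test. It uses a deliberately simple statistic (the absolute difference in group means for one variable) rather than a full classification model; in a PLS-DA setting the same idea would be applied by refitting the model on each set of shuffled labels. The function name, the statistic, and the example values are assumptions for illustration only.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values is equivalent to shuffling the class labels.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        stat = abs(sum(perm_a) / n_a - sum(perm_b) / len(perm_b))
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimate from ever being exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical measurements of one metabolite in two classes.
p_separated = permutation_p_value([5.1, 5.3, 4.9, 5.2, 5.0, 5.4],
                                  [1.0, 1.2, 0.9, 1.1, 1.3, 0.8])
p_identical = permutation_p_value([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(p_separated, p_identical)
```

A small p-value only says that the observed separation is unlikely under shuffled labels for this dataset. It does not replace biological replication or a test set that shares no members with the training set.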

  39. Three “final” thoughts • There is an inherent statistical and informatics minefield that arises when the number of variables queried far exceeds the number of observations (“N vs P problem”) • Caution: mathematical validation is NOT biological validation • Report what you do
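
The "N vs P problem" on slide 39 can be demonstrated with pure noise. The sketch below generates data with no class signal at all, searches 2000 variables for the single-variable threshold rule that best separates 10 training samples, and then scores that rule on an independently generated test set that shares no members with the training set. All names, sizes, and the one-variable rule are assumptions chosen to keep the example small; the point is only that an exhaustive search over many variables can fit a handful of observations by chance.

```python
import random

def rule_accuracy(X, y, feature, threshold, sign):
    """Fraction of samples for which the one-feature rule matches the label."""
    hits = sum(
        (sign * (row[feature] - threshold) > 0) == bool(label)
        for row, label in zip(X, y)
    )
    return hits / len(X)

def best_single_feature(X, y):
    """Search every feature, midpoint threshold, and direction for the best training fit."""
    best = (0.0, 0, 0.0, 1)
    for j in range(len(X[0])):
        values = sorted(row[j] for row in X)
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for sign in (1, -1):
                acc = rule_accuracy(X, y, j, threshold, sign)
                if acc > best[0]:
                    best = (acc, j, threshold, sign)
    return best

def noise_dataset(n_samples, n_features, rng):
    """Gaussian noise with alternating 0/1 labels, so no feature carries real signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

rng = random.Random(1)
X_train, y_train = noise_dataset(10, 2000, rng)   # N = 10 observations, P = 2000 variables
X_test, y_test = noise_dataset(200, 2000, rng)    # independent samples, no overlap with training

train_acc, feature, threshold, sign = best_single_feature(X_train, y_train)
test_acc = rule_accuracy(X_test, y_test, feature, threshold, sign)
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

With these sizes the selected rule typically looks excellent on the training samples while performing near chance on the test samples, which is the gap that internal "mathematical" validation alone can hide.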

  40. Informatics: An example classification workflow • Data Validation, Data Normalization, Missing Data Decisions, Inclusion/Exclusion Criteria • Subgroups, Class-specific models • Outlier removal → scaling → transformations • Unsupervised: Clustering, SOMs, PCA • Supervised: kNN, SIMCA, PLS, PLS-DA, Random Forest • Machine learning: Neural Nets, GAs, GPs • Overfit tests, Internal validation, optimization, External validation, optimization, 2° validation
