Genome Center Bioinformatics Technology Forum Tobias Kind – September 2006

Machine learning for metabolomics. Genome Center Bioinformatics Technology Forum, Tobias Kind – September 2006. Topics: a quick introduction to machine learning; algorithms and tools and the importance of feature selection; machine learning for classification and prediction of cancer data.

Presentation Transcript


  1. Machine learning for metabolomics. Genome Center Bioinformatics Technology Forum, Tobias Kind – September 2006
     • A quick introduction to machine learning
     • Algorithms and tools, and the importance of feature selection
     • Machine learning for classification and prediction of cancer data

  2. Artificial Intelligence: Machine Learning Algorithms
     • Unsupervised learning: clustering methods
     • Supervised learning: support vector machines; MARS (multivariate adaptive regression splines); neural networks; Random Forest, boosting trees, honest trees, decision trees; CART (classification and regression trees); genetic programming
     • Transduction: Bayesian Committee Machine; Transductive Support Vector Machine
     ...thanks to WIKI and Stuart Gansky

  3. Applications in metabolomics
     • Classification: genotype/wildtype, sick/healthy, happy*/unhappy, sleepy*/awake, love*/hate, old/young*
     • Regression: predicting biological activities and molecular properties of unknown substances (QSAR and QSPR)
     • Optimization: using experimental design (DOE) for optimizing experiments (LC, GC, extraction of metabolites) with multiple variables as input (trial and error is not "old school", it's pstudi)
     * solved at the end
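
The optimization bullet above mentions experimental design with several input variables. As a minimal sketch of one common design type (a full factorial design, which the slide does not name explicitly), the following enumerates every combination of factor levels. The factor names and levels are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical LC factors and levels, invented for this illustration.
factors = {
    "column_temp_C": [30, 40, 50],
    "flow_rate_mL_min": [0.2, 0.4],
    "organic_solvent_percent": [5, 20],
}


def full_factorial(factors):
    """Return one run (dict of factor -> level) per combination of levels."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


design = full_factorial(factors)
for run_number, run in enumerate(design, start=1):
    print(run_number, run)
```

With 3 x 2 x 2 levels this yields 12 planned runs, instead of an unstructured trial-and-error search. Fractional or response-surface designs reduce the number of runs further but are not shown here.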

  4. Algorithms we use in metabolomics

  5. Concept of predictive data mining for classification
     • Data Preparation: basic statistics; remove extreme outliers; transform or normalize datasets; mark sets with zero variances
     • Feature Selection: predict important features with MARS, PLS, NN, SVM, GDA, GA; apply voting or meta-learning
     • Model Training + Cross Validation: use only important features; apply bootstrapping if only few datasets are available; use GDA, CART, CHAID, MARS, NN, SVM, Naive Bayes, kNN for prediction
     • Model Testing: calculate performance with percent disagreement and chi-square statistics
     • Model Deployment: deploy model for unknown data; use PMML, VB, C++, Java
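
The workflow on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Statistica Dataminer implementation: the toy data are invented, kNN (one of the methods listed on the slide) is used with k = 3, and the helper names are hypothetical. It covers the data preparation, cross-validation, and model testing steps; feature selection and deployment are left out.

```python
from collections import Counter
from statistics import mean, pstdev


def nonconstant_columns(X):
    """Data preparation: indices of features whose variance is not zero."""
    return [j for j in range(len(X[0])) if pstdev([row[j] for row in X]) > 0]


def standardize(X):
    """Data preparation: z-score every column (assumes no constant columns)."""
    columns = list(zip(*X))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in X]


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training samples closest to x."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def cross_validate(X, y, folds=3, k=3):
    """Predict each sample from a model trained on the other folds."""
    predictions = [None] * len(X)
    for fold in range(folds):
        train = [i for i in range(len(X)) if i % folds != fold]
        for i in range(fold, len(X), folds):
            predictions[i] = knn_predict(
                [X[t] for t in train], [y[t] for t in train], X[i], k
            )
    return predictions


def percent_disagreement(y_true, y_pred):
    """Model testing: percentage of samples predicted incorrectly."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return 100.0 * wrong / len(y_true)


# Invented toy data: 6 samples, 3 features; the middle feature is constant.
X = [
    [1.0, 7.0, 0.5],
    [1.2, 7.0, 0.4],
    [0.8, 7.0, 0.6],
    [3.0, 7.0, 2.5],
    [3.2, 7.0, 2.4],
    [2.8, 7.0, 2.6],
]
y = ["healthy", "healthy", "healthy", "sick", "sick", "sick"]

kept = nonconstant_columns(X)
# For brevity the scaling is fitted once on all samples; a stricter version
# would fit it inside each training fold to avoid information leakage.
X_prepared = standardize([[row[j] for j in kept] for row in X])
predicted = cross_validate(X_prepared, y)
print("kept features:", kept)
print("percent disagreement:", percent_disagreement(y, predicted))
```

The constant middle feature is flagged and dropped before scaling, since a zero-variance column cannot be z-scored and carries no class information.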

  6. Automated machine learning workflow implemented in Statistica Dataminer

  7. Occam meets Epicurus, aka feature selection
     • William of Ockham (1285-1349), Occam's Razor: "Of two equivalent theories or explanations, all other things being equal, the simpler one is to be preferred."
     • Epicurus (Επίκουρος, 341 BC – 270 BC), principle of multiple explanations: "all consistent models should be retained"
     ...thanks to WIKI

  8. What's the deal with feature selection?
     • Reduces computational complexity
     • Curse of dimensionality is avoided
     • Improves accuracy
     • The selected features can provide insights about the nature of the problem*
     * Margin Based Feature Selection Theory and Algorithms; Amir Navot

  9. What's the deal with feature selection? Example! Principal component analysis (PCA) example (here of microarray data) WITHOUT feature selection: no separation possible (red and green points overlap). Data: Golub, 1999, Science.

  10. Feature selection: important variables. Certain variables are more important/useful than others. All other variables could be noise or slow down computation. Machine learning or classification/regression still needs to be applied afterwards. Feature selection is a powerful pre-filter (based on different algorithms).

  11. With feature selection. The same dataset, but with only the important variables: classification is now possible. Certain algorithms have built-in feature selection (like MARS, PLS, NN). Currently it is always useful to perform a feature selection.
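
The effect described on slides 9 to 11 can be reproduced on a tiny scale. The sketch below is an assumption-laden illustration, not the method used in the talk: it swaps in a simple univariate filter (a signal-to-noise score, the difference of class means divided by the sum of class standard deviations) for the MARS/PLS/NN-based selection named on the slides, and the toy data are invented so that three noisy features dominate the distance between samples.

```python
from statistics import mean, stdev


def signal_to_noise(values, labels):
    """|difference of class means| / (sum of class standard deviations)."""
    a = [v for v, label in zip(values, labels) if label == 0]
    b = [v for v, label in zip(values, labels) if label == 1]
    return abs(mean(a) - mean(b)) / (stdev(a) + stdev(b) + 1e-12)


def rank_features(X, y):
    """Feature indices ordered from most to least class-separating."""
    scores = [signal_to_noise([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def loo_error(X, y):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier."""
    errors = 0
    for i, x in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = min(
            others, key=lambda j: sum((a - b) ** 2 for a, b in zip(X[j], x))
        )
        errors += y[nearest] != y[i]
    return errors / len(X)


# Invented toy data: only feature 1 separates the two classes;
# features 0, 2 and 3 are noise unrelated to the class label.
X = [
    [5.0, 1.0, 20.0, 0.3],
    [9.0, 1.2, 35.0, 0.1],
    [2.0, 0.9, 50.0, 0.2],
    [5.5, 3.0, 21.0, 0.2],
    [8.5, 3.1, 36.0, 0.3],
    [2.5, 2.8, 49.0, 0.1],
]
y = [0, 0, 0, 1, 1, 1]

ranking = rank_features(X, y)
top = ranking[:1]
X_selected = [[row[j] for j in top] for row in X]

print("feature ranking:", ranking)
print("error with all features:", loo_error(X, y))
print("error with selected feature:", loo_error(X_selected, y))
```

On this toy data the filter ranks feature 1 first, and the nearest-neighbour classifier goes from misclassifying every sample to misclassifying none once the noise features are removed. Real datasets will show a less extreme difference.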

  12. Machine learning and statistics tools
     • Panels shown: response curves, PLS, tree model, cluster analysis, neural network, feature selection, machine learning (kNN)
     • We use Statistica Dataminer as a comprehensive data mining tool. WEKA, YALE, and R are free but currently not as powerful as the Dataminer.
     • Multiprocessor support is still absent in most versions = that sucks...

  13. LIVE demo with Statistica Dataminer (~10-15 min): classification of cancer data from LC-MS and GC-MS experiments

  14. QUIZ solutions
     • happy/unhappy: serotonin in bananas makes us happy
     • sleepy/awake: tryptophan in turkey and watching sport on TV make us sleepy
     • love/hate: oxytocin makes the baby love the mommy and vice versa
     • old/young: secret (mindset + genotype)

  15. Machine learning for metabolomics. Genome Center Bioinformatics Technology Forum, Tobias Kind – September 2006. Thank you! Thanks to the FiehnLab!
