Genome Center Bioinformatics Technology Forum Tobias Kind – September 2006

Machine learning for metabolomics
• A quick introduction to machine learning
• Algorithms and tools, and the importance of feature selection
• Machine learning for classification and prediction of cancer data

Artificial Intelligence: Machine Learning Algorithms
unsupervised learning: Clustering methods
supervised learning: Support vector machines; MARS (multivariate adaptive regression splines); Neural networks; Random Forest, Boosting trees, Honest trees, Decision trees; CART (Classification and regression trees); Genetic programming
transduction: Bayesian Committee Machine; Transductive Support Vector Machine
...thanks to WIKI and Stuart Gansky

Applications in metabolomics
• Classification: genotype/wildtype, sick/healthy, happy*/unhappy, sleepy*/awake, love*/hate, old/young*
• Regression: predicting biological activities and molecular properties of unknown substances (QSAR and QSPR)
• Optimization: using experimental design (DOE) for optimizing experiments (LC, GC, extraction of metabolites) with multiple variables as input (trial and error is not “old school”, it's stupid)
* solved at the end
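The DOE idea in the optimization bullet can be sketched as a full factorial design, the simplest experimental design: every combination of factor levels becomes one planned run. This is only an illustration; the factor names and levels below are hypothetical and not taken from the talk, and real DOE software usually offers more economical designs (fractional factorial, central composite, and so on).

```python
from itertools import product


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


# Hypothetical LC method factors and levels, chosen only for illustration.
factors = {
    "column_temp_C": [30, 40, 50],
    "flow_mL_min": [0.2, 0.4],
    "gradient_min": [10, 20],
}

design = full_factorial(factors)
print(len(design))   # number of planned runs: 3 * 2 * 2
for run in design[:3]:
    print(run)
```

Each dict is one experiment to run; the measured response (for example peak resolution) would then be modeled as a function of the factors instead of being tuned by trial and error.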

Algorithms we use in metabolomics

Concept of predictive data mining for classification
• Data Preparation: basic statistics; remove extreme outliers; transform or normalize datasets; mark sets with zero variance
• Feature Selection: predict important features with MARS, PLS, NN, SVM, GDA, GA; apply voting or meta-learning
• Model Training + Cross Validation: use only important features; apply bootstrapping if only a few datasets are available; use GDA, CART, CHAID, MARS, NN, SVM, Naive Bayes, kNN for prediction
• Model Testing: calculate performance with percent disagreement and chi-square statistics
• Model Deployment: deploy the model for unknown data; use PMML, VB, C++, JAVA
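A minimal sketch of three of the stages above (data preparation, training with cross validation, testing), written in plain Python rather than Statistica Dataminer. It assumes kNN as the classifier, leave-one-out as the cross-validation scheme, and percent disagreement as the only performance measure; the toy data and all function names are made up for illustration.

```python
import math
import statistics


def zero_variance_columns(rows):
    """Indices of features whose value never changes across samples."""
    return [j for j, col in enumerate(zip(*rows)) if statistics.pvariance(col) == 0]


def autoscale(rows, skip):
    """Z-score each feature (mean 0, sd 1), dropping the columns listed in `skip`."""
    keep = [j for j in range(len(rows[0])) if j not in skip]
    stats = []
    for j in keep:
        col = [r[j] for r in rows]
        stats.append((statistics.mean(col), statistics.stdev(col)))
    return [[(r[j] - m) / s for j, (m, s) in zip(keep, stats)] for r in rows]


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def loo_percent_disagreement(xs, ys, k=3):
    """Leave-one-out cross validation; returns the percentage misclassified."""
    wrong = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) != ys[i]:
            wrong += 1
    return 100.0 * wrong / len(xs)


# Made-up toy data: three features per sample, the last one constant.
X = [
    [1.0, 10.0, 5.0], [1.2, 11.0, 5.0], [0.9, 9.5, 5.0], [1.1, 10.5, 5.0],
    [3.0, 20.0, 5.0], [3.2, 21.0, 5.0], [2.9, 19.5, 5.0], [3.1, 20.5, 5.0],
]
y = ["healthy"] * 4 + ["sick"] * 4

flagged = zero_variance_columns(X)      # data preparation: mark zero variance
# Simplification: scaling uses all samples, including the one held out later.
X_scaled = autoscale(X, skip=flagged)   # data preparation: normalize
error = loo_percent_disagreement(X_scaled, y, k=3)
print("zero-variance columns:", flagged)
print("percent disagreement:", error)
```

The feature selection and deployment stages are left out here, and a chi-square test on the resulting confusion table would be the natural companion to the percent disagreement figure.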

Automated machine learning workflow implemented in Statistica Dataminer

Occam meets Epicurus, aka feature selection
Epicurus (Ἐπίκουρος), 341 BC - 270 BC; William of Ockham, 1285-1349
Occam's Razor: “Of two equivalent theories or explanations, all other things being equal, the simpler one is to be preferred.”
Epicurus, principle of multiple explanations: “all consistent models should be retained”
...thanks to WIKI

What's the deal with feature selection?
• Reduces computational complexity
• Avoids the curse of dimensionality
• Improves accuracy
• The selected features can provide insights about the nature of the problem*
* Margin Based Feature Selection Theory and Algorithms; Amir Navot

What's the deal with feature selection? Example!
Principal component analysis (PCA) example (here of microarray data) WITHOUT feature selection: no separation possible (red and green points overlap)
Golub, 1999, Science

Feature selection: important variables
Certain variables are more important or useful than others; the remaining variables may be noise or merely slow down computation. Feature selection is a powerful pre-filter (based on different algorithms); machine learning for classification or regression still needs to be applied afterwards.
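A minimal sketch of such a pre-filter, assuming a simple univariate signal-to-noise score (difference of class means divided by the sum of class standard deviations) instead of the MARS, PLS, NN or SVM based selection named in the workflow. The metabolite names and intensities are made up for illustration.

```python
import statistics


def signal_to_noise(values, labels, positive):
    """|difference of class means| / (sum of class standard deviations)."""
    a = [v for v, lab in zip(values, labels) if lab == positive]
    b = [v for v, lab in zip(values, labels) if lab != positive]
    spread = statistics.stdev(a) + statistics.stdev(b)
    if spread == 0:
        return 0.0
    return abs(statistics.mean(a) - statistics.mean(b)) / spread


def rank_features(rows, labels, names, positive):
    """Return feature names sorted from most to least discriminating."""
    scores = {
        name: signal_to_noise(col, labels, positive)
        for name, col in zip(names, zip(*rows))
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical intensities for three metabolites in six samples.
names = ["met_A", "met_B", "met_C"]
rows = [
    [5.1, 0.9, 2.0],
    [4.9, 1.4, 2.1],
    [5.0, 0.7, 1.9],
    [1.0, 1.1, 2.2],
    [1.2, 0.8, 1.8],
    [0.9, 1.3, 2.0],
]
labels = ["sick"] * 3 + ["healthy"] * 3

ranking = rank_features(rows, labels, names, positive="sick")
selected = ranking[:1]   # keep only the top-ranked variable(s)
print(ranking)
print(selected)
```

Only the selected columns would then be passed on to the classifier; the filter itself does not classify anything.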

With feature selection
The same dataset, but using only the important variables: classification is now possible.
Certain algorithms have built-in feature selection (like MARS, PLS, NN).
Currently it is always useful to perform a feature selection.

Machine learning and statistics tools
[Figure panels: Response curves; PLS; Tree model; Cluster Analysis; Neural Network; Feature selection; Machine Learning (KNN)]
We use Statistica Dataminer as a comprehensive data mining tool. WEKA, YALE and R are free but currently not as powerful as the Dataminer. Multiprocessor support is still absent in most versions = that sucks...

LIVE Demo with Statistica Dataminer (~10-15 min) Classification of cancer data from LC-MS and GC-MS experiments

QUIZ solutions
• happy/unhappy - serotonin in bananas makes us happy
• sleepy/awake - tryptophan in turkey and watching sport on TV make us sleepy
• love/hate - oxytocin makes the baby love the mommy and vice versa
• old/young - secret (mindset + genotype)

Thank you! Thanks to the FiehnLab!
