An idiot's guide to data analysis using Analyze/StripMiner™. Standard scripts, predictive modeling, outlier detection, visualization, and code specifics.
Analyze/StripMiner™ • Analyze/StripMiner™ Overview • To obtain an idiot’s guide, type “analyze > readme.txt” • Standard Analyze Scripts • Predicting on Blind Data • PLS (Please Listen to Svante Wold) • LOO, BOO and n-Fold Cross-Validation Error Measures • Albumin Data Set and Feature Selection • Bio-Informatics
Analyze/StripMiner™
Modeling • ANN (Neural Networks) • SVM (Support Vector Machines) • PLS (Partial Least Squares) • GA-based regression clustering • PCA regression • Local Learning • Outlier Detection (GAMOL)
Data Processing • Interface with RECON • Different Scaling Modes • Outlier detection/data cleansing
Visualization • Correlation Plots • 2-D Sensitivity Plots • Outlier Visualization Plots • Different Scaling Options • Cluster Ranking Plots • Standard ROC curves • Continuous ROC curves
Learning Modes • Bootstrapping • Bagging • Boosting • Leave-one-out cross-validation
Code Specifics • Tight Classic C-code (< 15000 lines) • Script-Based Shell Program • Runs on all Platforms • Ultra Fast
Use: TransScan – GE - KODAK • Doppler broadening • Macro-Economics Analysis
Feature Selection • Sensitivity Analysis • Genetic Algorithms • Correlation GA (GAFEAT) • Method specific DDASSL
Analyze/StripMiner ™ Coding Philosophy • Standard C code that compiles on all platforms • WINDOWS™ and Linux platforms • Supporting visualizations use Java and/or gnuplot • Flexible GUI with sample problems and demos • Fastest code possible with efficient memory requirements • Long history of code use with variety of users for troubleshooting • Flexible code based on scripts and operators • Operates on a numeric standard data mining format file
Practical Tips for PCA • The NIPALS algorithm assumes the features are zero-centered • It is standard practice to do a Mahalanobis scaling of the data • PCA regression does not consider the response data • The t’s are called the scores • It is common practice to drop 4-sigma outlier features (if there are many features)
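The tips above can be illustrated with a minimal Python sketch of NIPALS PCA. It is not the StripMiner C code. It assumes that "Mahalanobis scaling" means the usual autoscaling (subtract each column mean, divide by the column standard deviation), and the names autoscale and nipals_pca are hypothetical.

```python
import numpy as np

def autoscale(X):
    """Center each column to zero mean and scale it to unit standard deviation.
    Assumption: this is what the slides call Mahalanobis scaling."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0.0] = 1.0  # constant columns become all zeros instead of NaN
    return (X - mean) / std, mean, std

def nipals_pca(X, n_components, tol=1e-10, max_iter=1000):
    """NIPALS PCA on data that is already zero-centered.
    Returns scores T (n x k, the t's) and loadings P (m x k)."""
    X = np.array(X, dtype=float)  # work on a copy, X is deflated in place
    n, m = X.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for k in range(n_components):
        # Start from the column with the largest variance.
        t = X[:, np.argmax(X.var(axis=0))].copy()
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)       # project the data onto the current scores
            p /= np.linalg.norm(p)      # loadings are kept at unit length
            t_new = X @ p               # updated scores
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, k] = t
        P[:, k] = p
        X -= np.outer(t, p)             # deflate: remove the extracted component
    return T, P
```

A typical call would be Z, mean, std = autoscale(X) followed by T, P = nipals_pca(Z, 2). The response never enters the computation, which is the sense in which PCA regression does not consider the response data.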
StripMiner Script Examples • PCA visualization (pca.bat) • Pharma-plot (pharma.bat) • Prediction for iris with PCA (iris.bat) • Bootstrap prediction for iris (iris_boo.bat) • Predicting with an external test set example (iris_ext.bat) • PLS and ROC curve for iris problem (roc.bat) • Leave-One-Out PLS for HIV (loo_hiv.bat) • Feature selection for HIV (prune.bat) • Starplots (star.bat)
File Flow for PCA.bat Script • Files: num_eg.txt, stats.txt, la_sscala.txt, iris.txt.txt.txt.txt • num_eg.txt contains the number of principal components (2-10) • Usually the data are first Mahalanobis scaled (option #-3: “PLS scaling”, data only)
File Flow for pharma.bat Script • Files: num_eg.txt, stats.txt, la_sscala.txt, dmatrix.txt, a.txt, pharmaplot • num_eg.txt has to contain a 4 for a pharmaplot • Use pharmaplot.m for visualization in MATLAB • Adjust the color-setting threshold in pharmaplot.m
File Flow for iris.bat Script: Predicting Class • Files: stats.txt, la_sscala.txt, a.txt, cmatrix.txt, dmatrix.txt, resultss.xxx, resultss.ttt, results.xxx, results.ttt, num_eg.txt • Don’t use 0 as the random seed in the splitting routine (a seed of 0 preserves the original order) • The test set is really only for validation purposes (the answer is known) • Note: descaling from PLS uses the la_sscala.txt file • Notice the q2, Q2, and RMSE error measures
File Flow for iris_boo.bat Script: Bootstrap Validation for Estimating Prediction Confidence • Files: stats.txt, la_sscala.txt, a.txt, resultss.xxx, resultss.ttt, results.ttt, num_eg.txt • We use bootstrap cross-validation (e.g., leave 7 out 100 times) • Use the MATLAB script dos_mbotw results.ttt to display results for the test set • Use the MATLAB script dos_mbotw resultss.xxx to display results for the training set • Notice the q2, Q2, and RMSE error measures
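The "leave 7 out 100 times" scheme mentioned above amounts to repeated random hold-out. The Python sketch below shows only the resampling loop. A one-variable least-squares line stands in for the PLS model, and the function names are hypothetical, not part of StripMiner.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a stand-in for the PLS model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def leave_k_out(xs, ys, k=7, repeats=100, seed=1):
    """Repeatedly hold out k random samples, fit on the rest, and predict
    the held-out samples. Returns (sample_index, observed, predicted) tuples."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    records = []
    for _ in range(repeats):
        held_out = set(rng.sample(indices, k))
        train = [i for i in indices if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in sorted(held_out):
            records.append((i, ys[i], a + b * xs[i]))
    return records
```

Because each sample is held out many times, the spread of its predictions across repeats gives a rough indication of prediction confidence. Running the loop again with a different seed is a cheap stability check.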
Error Measure Criteria
For the training set we use:
- RMSE: root mean square error for the training set
- r²: squared correlation coefficient for the training set
- R²: PRESS R²
For the validation/test set we use:
- RMSE: root mean square error for the validation set
- q²: 1 − r²_test
- Q²: PRESS/SD
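The validation measures listed above can be computed as in the following Python sketch. Two points are assumptions rather than statements from the slides: PRESS is taken as the sum of squared prediction errors, and SD as the sum of squared deviations of the observed responses from their mean. The function names are hypothetical.

```python
import math

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def squared_correlation(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    soo = sum((o - mo) ** 2 for o in obs)
    spp = sum((p - mp) ** 2 for p in pred)
    return sop ** 2 / (soo * spp)

def validation_measures(obs, pred):
    """Return RMSE, q2 = 1 - r2_test and Q2 = PRESS/SD for a test set."""
    mean_obs = sum(obs) / len(obs)
    press = sum((o - p) ** 2 for o, p in zip(obs, pred))  # assumed meaning of PRESS
    sd = sum((o - mean_obs) ** 2 for o in obs)            # assumed meaning of SD
    return {
        "RMSE": rmse(obs, pred),
        "q2": 1.0 - squared_correlation(obs, pred),
        "Q2": press / sd,
    }
```

Under these definitions both q² and Q² are 0 for a perfect prediction, and smaller is better. For obs = [1, 2, 3, 4] and pred = [2, 3, 4, 5] the constant offset leaves q² at 0 but gives Q² = 0.8, so Q² penalizes bias that q² does not see.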
Script for Scaling with an External Test Set • 3305 scatterplot (Java) • -3305 scatterplot (gnuplot) • 3313 errorplot (Java) • -3313 errorplot (gnuplot)
Docking Ligands is a Nonlinear Problem • DDASSL: Drug Design and Semi-Supervised Learning
Feature Selection (data strip mining) • PLS, K-PLS, SVM, ANN • Fuzzy Expert System Rules • GA or Sensitivity Analysis to select descriptors
Script for ALBUMIN_LOO.BAT: PLS-LOO Validation for Albumin Data • Files: cmatrix.ori, dmatrix.ori, num_eg.txt, stats.txt, la_sscala.txt, a.txt, results.xxx, results.ttt, sel_lbls.txt, bbmatrixx.txt, bbmatrixxx.txt • PLS-LOO stands for leave-one-out PLS cross-validation • The training set is in cmatrix.ori and the external validation set is in dmatrix.ori • The external validation set has -999 or 0 in the activity field • Note that we create generic labels and that there is a test set • Notice the dropping of non-changing features and 4-sigma outliers • Notice the acrobatics for displaying metrics (visualize with dos_mbotw)
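The data-cleansing step mentioned above (dropping non-changing features and 4-sigma outliers) might look like the Python sketch below. The slides do not define a "4-sigma outlier feature". The sketch assumes it means a feature with at least one value more than four standard deviations from that feature's mean. The function name is hypothetical.

```python
import statistics

def drop_uninformative_features(rows, labels, n_sigma=4.0):
    """rows: list of samples, each a list of feature values.
    labels: one name per feature column.
    Drops (a) columns that never change and (b) columns holding any value
    more than n_sigma standard deviations from the column mean
    (assumed reading of "4-sigma outlier features")."""
    kept = []
    for j in range(len(labels)):
        column = [row[j] for row in rows]
        sd = statistics.pstdev(column)
        if sd == 0.0:
            continue  # non-changing feature: carries no information
        mean = statistics.fmean(column)
        if any(abs(v - mean) > n_sigma * sd for v in column):
            continue  # feature dominated by an extreme value
        kept.append(j)
    cleaned = [[row[j] for j in kept] for row in rows]
    return cleaned, [labels[j] for j in kept]
```

With few samples no value can lie four population standard deviations from the mean (the largest possible z-score for n samples is sqrt(n − 1)), so the outlier rule only has an effect from about 18 samples upward.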
PLS Feature Selection Script for Albumin Data • Files: aa.pat, bbmatrixx.txt, sel_lbls.txt, select.txt, sel_lbls.txt, aa.pat, aa.tes, bbmatrixx.txt, bbmatrixxx.txt • Do several iterative prunings, typically leaving 7 out 100 times • Use different seeds • Example schedule for the number of selected features: 400, 300, 200, 150, 120, 100, 80, 60, 50, 45, …
STARPLOT.BAT: Starplot for Selected Features for Albumin • Files: sel_lbls.txt, aa.pat, bbmatrixxx.txt, sel_lbls.txt, starplot.txt, starplot • First generate bbmatrixxx.txt, which contains all sensitivities for (e.g.) 30 bootstraps, using PLS bootstrap option 33 • Generate starplot.txt from bbmatrixxx.txt using option 3320 • Use the MATLAB routine starplot.m (operates on starplot.txt and sel_lbls.txt)