
Analyze/StripMiner™: Efficient Data Mining and Modeling Software

An idiot's guide to data analysis using Analyze/StripMiner™. Standard scripts, predictive modeling, outlier detection, visualization, and code specifics.


Presentation Transcript


  1. Analyze/StripMiner™
     • Analyze/StripMiner™ Overview
     • To obtain an idiot’s guide, type “analyze > readme.txt”
     • Standard Analyze Scripts
     • Predicting on Blind Data
     • PLS (Please Listen to Svante Wold)
     • LOO, BOO and n-Fold Cross-Validation Error Measures
     • Albumin Data Set and Feature Selection
     • Bio-Informatics

  2. Analyze/StripMiner™
     • Modeling
         - ANN (Neural Networks)
         - SVM (Support Vector Machines)
         - PLS (Partial-Least Squares)
         - GA-based regression clustering
         - PCA regression
         - Local Learning
         - Outlier Detection (GAMOL)
     • Data Processing
         - Interface with RECON
         - Different Scaling Modes
         - Outlier detection/data cleansing
     • Visualization
         - Correlation Plots
         - 2-D Sensitivity Plots
         - Outlier Visualization Plots
         - Different Scaling Options
         - Cluster Ranking Plots
         - Standard ROC curves
         - Continuous ROC curves
     • Learning Modes
         - Bootstrapping
         - Bagging
         - Boosting
         - Leave-one-out cross-validation
     • Code Specifics
         - Tight Classic C-code (< 15000 lines)
         - Script-Based Shell Program
         - Runs on all Platforms
         - Ultra Fast
         - Use: TransScan – GE - KODAK
         - Doppler broadening
         - Macro-Economics Analysis
     • Feature Selection
         - Sensitivity Analysis
         - Genetic Algorithms
         - Correlation GA (GAFEAT)
         - Method specific DDASSL

  3. Analyze/StripMiner™ Coding Philosophy
     • Standard C code that compiles on all platforms
     • WINDOWS™ and Linux platforms
     • Supporting visualizations use Java and/or gnuplot
     • Flexible GUI with sample problems and demos
     • Fastest code possible with efficient memory requirements
     • Long history of code use with a variety of users for troubleshooting
     • Flexible code based on scripts and operators
     • Operates on a numeric standard data mining format file

  4. Practical Tips for PCA
     • NIPALS algorithm assumes the features are zero centered
     • It is standard practice to do a Mahalanobis scaling of the data
     • PCA regression does not consider the response data
     • The t’s are called the scores
     • It is common practice to drop 4-sigma outlier features (if there are many features)
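The first two tips can be illustrated with a small sketch. It assumes that "Mahalanobis scaling" here means centering each feature and dividing by its sample standard deviation, and it extracts only the first principal component with a textbook NIPALS iteration. The function names `mahalanobis_scale` and `nipals_component` are hypothetical and this is not the Analyze/StripMiner C code.

```python
import math

def mahalanobis_scale(X):
    """Center each column and divide by its sample standard deviation.

    Assumption: this is what the slides mean by Mahalanobis scaling.
    Returns the scaled rows plus the means and stds needed to descale.
    """
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = []
    for j in range(m):
        var = sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)
        stds.append(math.sqrt(var) or 1.0)  # constant column: avoid dividing by 0
    Z = [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in X]
    return Z, means, stds

def nipals_component(X, tol=1e-12, max_iter=1000):
    """Return (scores t, loadings p) of the first principal component.

    X must already be zero centered, as NIPALS assumes.
    """
    n, m = len(X), len(X[0])
    # Start from the column with the largest sum of squares.
    j0 = max(range(m), key=lambda j: sum(row[j] ** 2 for row in X))
    t = [row[j0] for row in X]
    p = [0.0] * m
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: project the columns of X onto the current scores.
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project the rows of X onto the normalized loadings.
        t_new = [sum(X[i][j] * p[j] for j in range(m)) for i in range(n)]
        diff = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if diff < tol * sum(v * v for v in t):
            break
    return t, p
```

Further components would be obtained by subtracting the outer product of `t` and `p` from the data and repeating the iteration on the residual.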

  5. StripMiner Script Examples
     • PCA visualization (pca.bat)
     • Pharma-plot (pharma.bat)
     • Prediction for iris with PCA (iris.bat)
     • Bootstrap prediction for iris (iris_boo.bat)
     • Predicting with an external test set example (iris_ext.bat)
     • PLS and ROC curve for iris problem (roc.bat)
     • Leave-One-Out PLS for HIV (loo_hiv.bat)
     • Feature selection for HIV (prune.bat)
     • Starplots (star.bat)

  6. File Flow for PCA.bat Script
     Files: num_eg.txt, stats.txt, la_sscala.txt, iris.txt.txt.txt.txt
     • num_eg.txt contains the number of principal components (2-10)
     • usually data are first Mahalanobis scaled (option #-3: “PLS scaling”, data only)

  7. File Flow for pharma.bat Script
     Files: num_eg.txt, stats.txt, la_sscala.txt, dmatrix.txt, a.txt, pharmaplot
     • num_eg.txt has to contain a 4 for a pharmaplot
     • use pharmaplot.m for visualization in MATLAB
     • adjust color setting threshold in pharmaplot.m

  8. File Flow for iris.bat Script: Predicting Class
     Files: stats.txt, la_sscala.txt, a.txt, cmatrix.txt, dmatrix.txt, resultss.xxx, resultss.ttt, results.xxx, results.ttt, num_eg.txt
     • For the random seed in the splitting routine, don’t use 0 (it preserves the order)
     • The test set is really only for validation purposes (answer is known)
     • Note: descaling from PLS uses the la_sscala.txt file
     • Notice the q2, Q2, and RMSE error measures
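The seed convention and the descaling step can be sketched as follows. This only mimics the behaviour described on the slide (seed 0 keeps the original order); the function names `split_rows` and `descale` are hypothetical, and the mean and standard deviation are passed in directly because the layout of la_sscala.txt is not described in the slides.

```python
import random

def split_rows(rows, n_test, seed):
    """Split rows into (train, test), holding out the last n_test rows.

    Seed 0 leaves the rows in their original order, as the slide warns;
    any other seed shuffles them reproducibly before the split.
    """
    idx = list(range(len(rows)))
    if seed != 0:
        random.Random(seed).shuffle(idx)
    cut = len(rows) - n_test
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

def descale(scaled_values, mean, std):
    """Map scaled predictions back to the original response units.

    Assumption: the response was scaled as (y - mean) / std.
    """
    return [v * std + mean for v in scaled_values]
```

With seed 0 and ordered data (for example, iris rows sorted by class), the held-out rows would all come from the end of the file, which is presumably why the slide advises against it.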

  9. File Flow for iris_boo.bat Script: Bootstrap Validation for Estimating Prediction Confidence
     Files: stats.txt, la_sscala.txt, a.txt, resultss.xxx, resultss.ttt, results.ttt, num_eg.txt
     • We use bootstrap cross-validation (e.g., leave 7 out 100 times)
     • Use MATLAB script dos_mbotw results.ttt to display results for the test set
     • Use MATLAB script dos_mbotw resultss.xxx to display results for the training set
     • Notice the q2, Q2, and RMSE error measures
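A rough sketch of the "leave 7 out 100 times" scheme is given below. It draws the held-out rows at random without replacement on each repeat, which is one reading of the slide's "bootstrap cross-validation". The names `leave_k_out_splits` and `holdout_rmse` are hypothetical, and the training-set mean is used as a stand-in predictor where the real script would fit a PCA or PLS model.

```python
import math
import random

def leave_k_out_splits(n, k, repeats, seed=1):
    """Yield (train_idx, test_idx) pairs; k rows are held out each time."""
    rng = random.Random(seed)
    for _ in range(repeats):
        test = sorted(rng.sample(range(n), k))
        held = set(test)
        train = [i for i in range(n) if i not in held]
        yield train, test

def holdout_rmse(y, k=7, repeats=100, seed=1):
    """Pooled RMSE over all held-out predictions.

    Placeholder model: predict the mean of the training responses.
    """
    sq_err, count = 0.0, 0
    for train, test in leave_k_out_splits(len(y), k, repeats, seed):
        pred = sum(y[i] for i in train) / len(train)
        for i in test:
            sq_err += (y[i] - pred) ** 2
            count += 1
    return math.sqrt(sq_err / count)
```

Because each row is held out many times across the repeats, the spread of its predictions gives a rough idea of prediction confidence.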

  10. Error Measure Criteria
     For the training set we use:
         - RMSE: root mean square error for the training set
         - r2: correlation coefficient for the training set
         - R2: PRESS R2
     For the validation/test set we use:
         - RMSE: root mean square error for the validation set
         - q2: 1 – r2 (test set)
         - Q2: PRESS/SD
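The test-set measures can be written out as a short sketch. The slide does not define PRESS or SD, so the code assumes PRESS is the sum of squared prediction errors and SD is the sum of squared deviations of the observed responses from their mean; r2 is taken as the squared Pearson correlation between observed and predicted values. All function names are hypothetical.

```python
import math

def rmse(y, yhat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def r2(y, yhat):
    """Squared Pearson correlation between observed and predicted values."""
    my, mh = sum(y) / len(y), sum(yhat) / len(yhat)
    cov = sum((a - my) * (b - mh) for a, b in zip(y, yhat))
    syy = sum((a - my) ** 2 for a in y)
    shh = sum((b - mh) ** 2 for b in yhat)
    return cov * cov / (syy * shh)

def q2(y_test, yhat_test):
    """q2 = 1 - r2 on the test set; 0 means perfectly correlated."""
    return 1.0 - r2(y_test, yhat_test)

def big_q2(y_test, yhat_test):
    """Q2 = PRESS / SD under the assumptions stated above."""
    mean = sum(y_test) / len(y_test)
    press = sum((a - b) ** 2 for a, b in zip(y_test, yhat_test))
    sd = sum((a - mean) ** 2 for a in y_test)
    return press / sd
```

Under these definitions both q2 and Q2 are "smaller is better". q2 only measures correlation, so a prediction with a constant offset can have q2 near 0 while Q2 is large.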

  11. Script for Scaling with an External Test Set
     • 3305 scatterplot (Java)
     • -3305 scatterplot (gnuplot)
     • 3313 errorplot (Java)
     • -3313 errorplot (gnuplot)

  12. Docking Ligands is a Nonlinear Problem
     DDASSL: Drug Design and Semi-Supervised Learning

  13. Feature Selection (data strip mining)
     • PLS, K-PLS, SVM, ANN
     • Fuzzy Expert System Rules
     • GA or Sensitivity Analysis to select descriptors

  14. Script for ALBUMIN_LOO.BAT: PLS-LOO Validation for Albumin Data
     Files: cmatrix.ori, dmatrix.ori, num_eg.txt, stats.txt, la_sscala.txt, a.txt, results.xxx, results.ttt, sel_lbls.txt, bbmatrixx.txt, bbmatrixxx.txt
     • PLS-LOO stands for leave-one-out PLS cross-validation
     • Training set is in cmatrix.ori and external validation set in dmatrix.ori
     • External validation set has -999 or 0 in the activity field
     • Note that we create generic labels and that there is a test set
     • Notice the dropping of non-changing features and 4-sigma outliers
     • Notice the acrobatics for displaying metrics (visualize with dos_mbotw)
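The feature-dropping step might look like the sketch below. It reads "4-sigma outliers" as features for which at least one sample lies more than four standard deviations from that feature's mean, which is an interpretation and not something the slides spell out. The name `drop_features` is hypothetical.

```python
import math

def drop_features(X, labels, n_sigma=4.0):
    """Remove constant columns and columns containing an n_sigma outlier.

    X is a list of rows; labels holds one name per column.
    Returns the reduced rows and the labels of the surviving columns.
    """
    n = len(X)
    keep = []
    for j in range(len(labels)):
        col = [row[j] for row in X]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        if std == 0.0:
            continue  # non-changing feature carries no information
        if any(abs(v - mean) > n_sigma * std for v in col):
            continue  # at least one sample is an n_sigma outlier
        keep.append(j)
    reduced = [[row[j] for j in keep] for row in X]
    return reduced, [labels[j] for j in keep]
```

With a sample standard deviation, a single value can only exceed four sigma when there are roughly 19 or more rows, so this filter has no effect on very small data sets.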

  15. PLS Feature Selection Script for Albumin Data
     Files: aa.pat, bbmatrixx.txt, sel_lbls.txt, select.txt, sel_lbls.txt, aa.pat, aa.tes, bbmatrixx.txt, bbmatrixxx.txt
     • Do several iterative prunings, typically leave 7 out 100 times
     • Use different seeds
     • Example numbers of selected features: 400, 300, 200, 150, 120, 100, 80, 60, 50, 45, …
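The pruning loop can be outlined as follows. This is a sketch under several assumptions: features are ranked by their mean absolute sensitivity over the bootstrap runs, the seed is simply incremented at each step, and `sensitivity_fn` is a hypothetical callback standing in for a PLS bootstrap run that returns one row of sensitivities per bootstrap. Feature labels are assumed to be unique.

```python
def prune_iteratively(labels, schedule, sensitivity_fn, first_seed=1):
    """Shrink the feature set step by step along a pruning schedule.

    labels         -- unique feature names to start from
    schedule       -- feature counts to keep, e.g. [400, 300, 200]
    sensitivity_fn -- hypothetical callback (labels, seed) -> list of rows,
                      one row of per-feature sensitivities per bootstrap
    Returns the list of surviving labels after each step.
    """
    current = list(labels)
    history = []
    for step, n_keep in enumerate(schedule):
        rows = sensitivity_fn(current, first_seed + step)  # new seed each step
        n_boot = len(rows)
        score = {
            lab: sum(abs(row[j]) for row in rows) / n_boot
            for j, lab in enumerate(current)
        }
        ranked = sorted(current, key=lambda lab: score[lab], reverse=True)
        current = ranked[:n_keep]
        history.append(list(current))
    return history
```

Re-estimating the sensitivities after every step matters because a feature's sensitivity can change once correlated features have been removed.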

  16. DDASSL

  17. STARPLOT.BAT: Starplot for Selected Features for Albumin
     Files: sel_lbls.txt, aa.pat, bbmatrixxx.txt, sel_lbls.txt, starplot.txt, starplot
     • First generate bbmatrixxx.txt, which contains all sensitivities for (e.g.) 30 bootstraps, using PLS bootstrap option 33
     • Generate starplot.txt from bbmatrixxx.txt using option 3320
     • Use the MATLAB routine starplot.m (operates on starplot.txt and sel_lbls.txt)
