DIMACS Mixer Series, September 19, 2002
Datascope - a new tool for Logical Analysis of Data (LAD)
Sorin Alexe, RUTCOR, Rutgers University, Piscataway, NJ
e-mail: salexe@rutcor.rutgers.edu
URL: rutcor.rutgers.edu/~salexe
LAD - Problem
[Figure labels: Dataset, Hidden Function, LAD Approximation]

LAD - Patterns
[Figure labels: Positive Pattern, Negative Pattern]

LAD - Theories, Models, Classifications
[Figure labels: Positive Theory, Negative Theory, Model]

Datascope Functions
• Support Set Identification
• Space Discretization
• Pattern Detection
• Model Construction
• Discriminant / Prognostic Index
• Classification
• Feature Analysis

Datascope Dataflow
[Diagram labels: Raw Data; Pre-Processing; Discretization; Cutpoints, Support Set; Pandect Generation; Pattern Report; Feature Analysis; Significant Features; Theories/Models; Discriminant Construction; User Excel Model; Internal Solver; Matlab Solver; Pattern Space; Diagnosis; Prognosis; Risk Stratification]

1. Support Set Identification
• Selects Small Subset of Significant Features
• Preserves Hidden Knowledge
• Feature Ranking Criteria:
  • Statistical Correlation with Outcome
  • Combinatorial Entropy
  • Distribution Monotonicity
  • Class Separation
  • Envelope Eccentricity
• E.g., 10 proteins selected out of 15,144
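
As an illustration of one ranking criterion from this slide (correlation with outcome), the following is a minimal sketch, not Datascope's actual procedure. The function names `pearson` and `rank_features`, the toy data, and the choice of absolute Pearson correlation as the score are all assumptions made for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(vx * vy)

def rank_features(rows, outcome, k):
    """Indices of the k features with the largest |correlation| to the outcome."""
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, outcome)), j))
    # Highest score first; ties are broken by feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]

# Toy data (hypothetical): 4 observations, 3 features, binary outcome.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.8],
]
outcome = [0, 0, 1, 1]
top = rank_features(rows, outcome, k=2)  # [0, 2]
```

Feature 1 is constant and therefore ranked last; a real support set would typically combine several of the listed criteria rather than rely on correlation alone.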

Data
• Spreadsheet Oriented
• OLE (via Clipboard) / Excel Spreadsheet / dBase tables
• Training / Test Generation:
  • Bootstrap
  • k-Folding
  • Jackknife
• New Features
• Correlation

2. Space Discretization
• Criteria:
  • Entropy
  • Correlation with Output
  • Bins (equipartitioning)
  • Intervals
  • Clustered
  • Class Separation
• Parameter Choice:
  • User Defined
  • Minimizing Support Set
• Quality Measures:
  • Entropy
  • Separability
[Figure panels: Entropy, Correlation with Output, Bins, Intervals, Clustered, Class Separation]
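
To make the entropy criterion concrete, here is a minimal sketch of choosing a single cutpoint for one numeric feature by minimizing the weighted class entropy of the two sides. This is an assumed, textbook-style formulation; the names `entropy` and `best_cutpoint` and the use of midpoints between distinct values as candidates are illustrative choices, not a description of Datascope internals.

```python
from math import log2

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def best_cutpoint(values, labels):
    """Return (cutpoint, weighted_entropy) minimizing entropy after the split.

    Candidates are midpoints between consecutive distinct values.
    Returns None if the feature has fewer than two distinct values.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in pairs if x < cut]
        right = [y for x, y in pairs if x >= cut]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (cut, score)
    return best

# Toy feature (hypothetical): low values are negative, high values positive.
values = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]
cut, score = best_cutpoint(values, labels)  # cut == 4.5, score == 0.0
```

Applying the same search recursively to each side would give several cutpoints per feature; the equipartitioning ("Bins") criterion would instead place cutpoints at fixed quantiles without looking at the labels.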

3. Generation of Maximal Patterns
• Pattern Type Selection:
  • Prime
  • Cones
  • Intervals
  • Spanned
• Parameter Bound Settings:
  • Prevalence:
    • % of positive observations
    • % of negative observations
  • Homogeneity:
    • on positive patterns
    • on negative patterns
  • Degree
• Post-Generation Filters:
  • By Characteristics
  • Maximality
  • Strongness
Positive Patterns
[Table columns: Pattern Definition, Training Set, Test Set]

Negative Patterns
[Table columns: Pattern Definition, Training Set, Test Set]
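
The pattern parameters named above (degree, prevalence, homogeneity) can be computed for a candidate pattern as in the sketch below. It assumes the usual LAD reading of these terms: a pattern is a conjunction of interval conditions, degree is the number of conditions, prevalence is the share of a class that the pattern covers, and homogeneity is the share of covered observations that belong to the pattern's class. The representation (a dict from feature index to bounds) and the names `covers` and `pattern_stats` are hypothetical.

```python
def covers(pattern, row):
    """True if the row satisfies every interval condition of the pattern."""
    return all(lo <= row[j] <= hi for j, (lo, hi) in pattern.items())

def pattern_stats(pattern, rows, labels):
    """Degree, prevalences and positive homogeneity of a pattern on a dataset."""
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = len(labels) - pos_total
    pos_cov = sum(1 for r, y in zip(rows, labels) if y == 1 and covers(pattern, r))
    neg_cov = sum(1 for r, y in zip(rows, labels) if y == 0 and covers(pattern, r))
    covered = pos_cov + neg_cov
    return {
        "degree": len(pattern),
        "pos_prevalence": pos_cov / pos_total if pos_total else 0.0,
        "neg_prevalence": neg_cov / neg_total if neg_total else 0.0,
        "pos_homogeneity": pos_cov / covered if covered else 0.0,
    }

# Toy data (hypothetical): two features, labels 1 = positive, 0 = negative.
rows = [[0.2, 5], [0.4, 7], [0.9, 6], [0.8, 2], [0.3, 1]]
labels = [1, 1, 0, 0, 1]

pure = pattern_stats({0: (0.0, 0.5)}, rows, labels)   # covers only positives
mixed = pattern_stats({1: (4, 10)}, rows, labels)     # covers 2 positives, 1 negative
```

A generator would keep only patterns whose statistics meet the user's bounds, for example a minimum positive prevalence and a minimum homogeneity, and a maximum degree.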

4. Theories and Models
• Pandect
• Theory Selection via:
  • Greedy
  • Bottleneck Greedy
  • Lexicographic Greedy
  • Set Covering Heuristics
• Model Selection:
  • 2 Set-Covering Problems
  • Quadratic Set-Covering Problem
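
The plain greedy option for theory selection can be sketched as the standard greedy set-covering heuristic: repeatedly pick the pattern that covers the most still-uncovered observations. This is the textbook heuristic under assumed inputs (a dict from pattern name to the set of observation ids it covers); it does not reproduce the bottleneck or lexicographic variants, and the name `greedy_cover` is hypothetical.

```python
def greedy_cover(coverage, universe):
    """Greedy set cover.

    coverage: dict mapping pattern name -> set of covered observation ids
    universe: iterable of observation ids that should be covered
    Returns (chosen pattern names in order, set of ids left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Iterate names in sorted order so ties are broken deterministically.
        best = max(sorted(coverage), key=lambda p: len(coverage[p] & uncovered))
        gain = coverage[best] & uncovered
        if not gain:
            break  # remaining observations are covered by no pattern
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered

# Toy pandect (hypothetical): four positive patterns over five observations.
coverage = {
    "P1": {1, 2, 3},
    "P2": {3, 4},
    "P3": {4, 5},
    "P4": {1, 5},
}
theory, missed = greedy_cover(coverage, universe=range(1, 6))  # (["P1", "P3"], set())
```

Running the same routine once on the positive patterns and once on the negative patterns corresponds to the "2 Set-Covering Problems" reading of model selection, under the assumption that each class is covered separately.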

5. Discriminants (weighted sums of patterns)
• Weight Selection Methods:
  • Direct
    1. Prognostic Index
    2. Weighted Prognostic Index
  • LP-Based
    3. Distance Maximizing Separator (SVM)
    4. Cost Minimizing Separator
    5. Expected Value Separator
  • NLP-Based
    6. Regression in Pattern Space (ANN)
    7. Best Correlation with Output
[Figure panels: Prognostic Index, Weighted Prognostic Index, Expected Value Separator, Distance Maximizing Separator, Cost Minimizing Separator, Best Correlation with Output]
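
A discriminant as a weighted sum of patterns can be sketched as follows. The default weights (1/|P| for each positive pattern, 1/|N| for each negative pattern) are an assumption standing in for the unweighted prognostic index; the slide does not give the exact formula. The names `covers`, `discriminant`, and `classify`, and the zero threshold, are likewise illustrative.

```python
def covers(pattern, row):
    """True if the row satisfies every interval condition of the pattern."""
    return all(lo <= row[j] <= hi for j, (lo, hi) in pattern.items())

def discriminant(row, pos_patterns, neg_patterns, pos_weights=None, neg_weights=None):
    """Weighted sum of covering positive patterns minus covering negative patterns."""
    if pos_weights is None:  # assumed default: equal weights summing to 1
        pos_weights = [1 / len(pos_patterns)] * len(pos_patterns) if pos_patterns else []
    if neg_weights is None:
        neg_weights = [1 / len(neg_patterns)] * len(neg_patterns) if neg_patterns else []
    score = sum(w for p, w in zip(pos_patterns, pos_weights) if covers(p, row))
    score -= sum(w for p, w in zip(neg_patterns, neg_weights) if covers(p, row))
    return score

def classify(score, threshold=0.0):
    """Positive above the threshold, negative below its negation, else unclassified."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "unclassified"

# Toy model (hypothetical): two positive patterns and one negative pattern.
pos_patterns = [{0: (0.0, 0.5)}, {1: (4, 10)}]
neg_patterns = [{0: (0.6, 1.0)}]

s1 = discriminant([0.2, 5], pos_patterns, neg_patterns)   # 1.0
s2 = discriminant([0.9, 6], pos_patterns, neg_patterns)   # -0.5
s3 = discriminant([0.55, 1], pos_patterns, neg_patterns)  # 0, covered by no pattern
```

Passing explicit `pos_weights` and `neg_weights` gives a weighted index; the LP-based and NLP-based methods on the slide would differ only in how those weights are obtained.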
[Figure labels: Accuracy, Specificity, Sensitivity]

Reporting
• Cutpoints
• Discretized Space
• Pandect
• Coverage of Observations by Patterns
• Pattern Report (Compact/Full Versions)
• Theories/Models
• Attribute Analysis
• Log File

Pattern Space
[Figure panels: Training, Test; labels: Patterns, Positive Observations, Negative Observations, Unclassified Observations]

Validation Procedures
[Diagram labels: Raw Data; Stratified Random Partition, Bootstrap, K-Folding, Jackknife; LAD Model on Training Set; Performance Evaluation: Accuracy, Sensitivity, Specificity]
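
Two pieces of this validation flow, a stratified k-fold partition and the three performance measures, can be sketched as below. The function names, the fixed random seed, and the round-robin fold assignment are assumptions for illustration; sensitivity and specificity use their standard definitions (true-positive rate and true-negative rate).

```python
import random

def stratified_folds(labels, k, seed=0):
    """Split observation indices into k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def evaluate(true, pred):
    """Accuracy, sensitivity and specificity for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Toy example (hypothetical labels and predictions).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
folds = stratified_folds(labels, k=2)

true = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = evaluate(true, pred)  # accuracy, sensitivity, specificity all 0.75
```

In a k-folding run, each fold in turn would serve as the test set while a model is built on the remaining folds, and the metrics would be averaged over the k runs.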

Special Features
• User Model Generation (Excel Files)
• Datascope Macro Language
• Multiple and Complex Experiments
• Interface with Other Applications (Datascope Server)

Performance
Tjen-Sien Lim, Wei-Yin Loh and Yu-Shan Shih, "A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms", Machine Learning, 40, 203-229 (2000)
http://www.ics.uci.edu/~mlearn/MLRepository.html

LAD Case Studies
• Assessing Long-Term Mortality Risk After Exercise Electrocardiography
• Ovarian Cancer Detection Using Proteomic Data
• Combinatorial Analysis of Breast Cancer Data from Image Cytometry and Gene Expression Microarrays
• Cell Proliferation on Medical Implants
• Country Risk Rating