Spooky Stuff: Data Mining in Metric Space
Rich Caruana, Alex Niculescu
Cornell University
Motivation #1: Many Learning Algorithms
• Neural nets
• Logistic regression
• Linear perceptron
• K-nearest neighbor
• Decision trees
• ILP (Inductive Logic Programming)
• SVMs (Support Vector Machines)
• Bagging X
• Boosting X
• Rule learners (CN2, …)
• Ripper
• Random Forests (forests of decision trees)
• Gaussian Processes
• Bayes Nets
• …
• No single learning method dominates the others
Motivation #2: SLAC B/Bbar
• Particle accelerator generates B/Bbar particles
• Use machine learning to classify tracks as B or Bbar
• Domain-specific performance measure: SLQ-Score
• A 5% increase in SLQ can save $1M in accelerator time
• SLAC researchers tried various DM/ML methods
• Good, but not great, SLQ performance
• We tried standard methods, got similar results
• We studied the SLQ metric:
  • similar to probability calibration
  • tried bagged probabilistic decision trees (which had worked well on the C-section problem)
Motivation #2: Bagged Probabilistic Trees
• Draw N bootstrap samples of the data
• Train a tree on each sample ==> N trees
• Final prediction = average prediction of the N trees
[Figure: averaging the trees' predicted probabilities, e.g. (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / #Trees = 0.24]
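The bagging scheme above is simple enough to sketch in a few lines. The sketch below uses scikit-learn and synthetic data (make_classification) as stand-ins; the actual SLAC data, tree settings, and number of trees used in the study are not shown on the slides.

```python
# Minimal sketch of bagged probabilistic decision trees (illustrative data/settings only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_trees = 100
rng = np.random.default_rng(0)
prob_sum = np.zeros(len(y_test))
for _ in range(n_trees):
    # Draw one bootstrap sample (sample with replacement, same size as the train set)
    idx = rng.integers(0, len(y_train), size=len(y_train))
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    # Accumulate each tree's predicted P(class = 1)
    prob_sum += tree.predict_proba(X_test)[:, 1]

# Final prediction = average of the N trees' predicted probabilities
bagged_prob = prob_sum / n_trees
```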
Motivation #2: Improves Calibration by an Order of Magnitude
[Figure: calibration plots; a single tree shows poor calibration, 100 bagged trees show excellent calibration.]
Motivation #2: Significantly Improves SLQ
[Figure: SLQ scores for 100 bagged trees vs. a single tree.]
Motivation #2
• Can we automate this analysis of performance metrics so that it is easier to recognize which metrics are similar to each other?
Scary Stuff
• In an ideal world:
  • Learn a model that predicts the true conditional probabilities (Bayes optimal)
  • Such a model yields optimal performance on any reasonable metric
• In the real world:
  • Finite data
  • 0/1 targets instead of conditional probabilities
  • Hard to learn this ideal model
  • We don't have good metrics for recognizing the ideal model
  • The ideal model isn't always needed
• In practice:
  • Learning is done with many different metrics: ACC, AUC, CXE, RMS, …
  • Each metric represents different tradeoffs
  • Because of this, it is usually important to optimize to the appropriate metric
In this work we compare nine commonly used performance metrics (plus one new combined metric, SAR) by applying data mining to the results of a massive empirical study.
• Goals:
  • Discover relationships between performance metrics
  • Are the metrics really that different?
  • If you optimize to metric X, do you also get good performance on metric Y?
  • If you need to optimize to metric Y, which metric X should you optimize to?
  • Which metrics are more/less robust?
  • Can we design new, better metrics?
10 Binary Classification Performance Metrics
• Threshold Metrics:
  • Accuracy
  • F-Score
  • Lift
• Ordering/Ranking Metrics:
  • ROC Area
  • Average Precision
  • Precision/Recall Break-Even Point
• Probability Metrics:
  • Root-Mean-Squared-Error
  • Cross-Entropy
  • Probability Calibration
• SAR = ((1 - Squared Error) + Accuracy + ROC Area) / 3
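Most of these metrics have standard implementations. The sketch below shows one way to compute several of them, plus the SAR combination, with scikit-learn. It assumes NumPy arrays y_true in {0,1} and y_prob of predicted probabilities; the 0.5 threshold is only illustrative, and taking the squared-error term in SAR to be RMS error is an assumption about the slide's formula (lift, break-even point, and calibration are not shown here).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             average_precision_score, log_loss, mean_squared_error)

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])                    # toy labels
y_prob = np.array([0.1, 0.8, 0.65, 0.3, 0.9, 0.45, 0.7, 0.55])  # toy predicted P(class=1)
y_pred = (y_prob >= 0.5).astype(int)                             # illustrative threshold

acc = accuracy_score(y_true, y_pred)               # threshold metric
fsc = f1_score(y_true, y_pred)                     # threshold metric
auc = roc_auc_score(y_true, y_prob)                # ordering metric
apr = average_precision_score(y_true, y_prob)      # ordering metric
rms = np.sqrt(mean_squared_error(y_true, y_prob))  # probability metric (RMS error)
cxe = log_loss(y_true, y_prob)                     # probability metric (cross-entropy)

# SAR as defined on the slide, with RMS error used for the squared-error term (assumption)
sar = ((1 - rms) + acc + auc) / 3
```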
Accuracy
(predictions thresholded at a chosen value)

              Predicted 1   Predicted 0
  True 1           a             b
  True 0           c             d

accuracy = (a + d) / (a + b + c + d)      (a and d are correct; b and c are incorrect)
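For concreteness, a tiny sketch (with made-up numbers) of how the four cells and accuracy are computed from thresholded predictions:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])                    # toy labels
y_prob = np.array([0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.8, 0.3])    # toy predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)                            # apply the threshold

a = np.sum((y_true == 1) & (y_pred == 1))  # correct:   true 1, predicted 1
b = np.sum((y_true == 1) & (y_pred == 0))  # incorrect: true 1, predicted 0
c = np.sum((y_true == 0) & (y_pred == 1))  # incorrect: true 0, predicted 1
d = np.sum((y_true == 0) & (y_pred == 0))  # correct:   true 0, predicted 0

accuracy = (a + d) / (a + b + c + d)
```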
Lift
• Not interested in accuracy on the entire dataset
• Want accurate predictions for the top 5%, 10%, or 20% of the dataset
• Don't care about the remaining 95%, 90%, 80%, respectively
• Typical application: marketing
• Lift measures how much better than random prediction the model is on the fraction of the dataset predicted true (f(x) > threshold)
Lift
(same confusion matrix, with the threshold set so that a chosen fraction of the dataset is predicted true)

              Predicted 1   Predicted 0
  True 1           a             b
  True 0           c             d

lift = [a / (a + b)] / [(a + c) / (a + b + c + d)]
     = fraction of true positives captured / fraction of the dataset predicted true
[Figure: lift curve; lift = 3.5 if mailings are sent to the top 20% of customers.]
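A minimal sketch of lift at a fixed "mailing depth", matching the definition above. The helper lift_at_depth is hypothetical, and the inputs are assumed to be NumPy arrays of 0/1 labels and predicted scores.

```python
import numpy as np

def lift_at_depth(y_true, y_prob, depth=0.20):
    """Lift when the top `depth` fraction of cases (by predicted score) is predicted true."""
    n = len(y_true)
    k = int(np.ceil(depth * n))            # number of cases above the threshold
    top = np.argsort(-y_prob)[:k]          # indices of the top-scoring cases
    precision_at_k = y_true[top].mean()    # positive rate among the predicted-true cases
    base_rate = y_true.mean()              # positive rate of a random selection
    return precision_at_k / base_rate

# e.g. lift_at_depth(y_true, y_prob, 0.20) ~ 3.5 would mean the model finds 3.5x more
# positives in its top 20% than random mailing would.
```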
Precision/Recall, F-Score, Break-Even Point
• precision = a / (a + c), recall = a / (a + b)
• F-Score = harmonic mean of precision and recall = 2 · precision · recall / (precision + recall)
• Break-even point = the value at which precision equals recall as the threshold is swept
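A small sketch of these quantities using scikit-learn's precision_recall_curve. Approximating the break-even point as the point on the curve where precision and recall are closest is one simple convention, not necessarily the one used in the study.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])                    # toy labels
y_prob = np.array([0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.8, 0.3])    # toy predicted probabilities

# One (precision, recall) point per threshold as the threshold is swept
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# F-score (harmonic mean) at every point on the curve, guarding against 0/0
f_score = 2 * precision * recall / np.maximum(precision + recall, 1e-12)

# Break-even point: where precision is closest to recall along the curve
bep = precision[np.argmin(np.abs(precision - recall))]
```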
[Figure: precision/recall plot showing regions of better and worse performance.]
The same confusion matrix cells go by several names:

              Predicted 1               Predicted 0
  True 1      true positive (TP)        false negative (FN)
  True 0      false positive (FP)       true negative (TN)

              Predicted 1               Predicted 0
  True 1      hits, P(pr1|tr1)          misses, P(pr0|tr1)
  True 0      false alarms, P(pr1|tr0)  correct rejections, P(pr0|tr0)
ROC Plot and ROC Area
• Receiver Operating Characteristic
• Developed in WWII to statistically model false positive and false negative detections of radar operators
• Better statistical foundations than most other measures
• Standard measure in medicine and biology
• Becoming more popular in ML
• Sweep the threshold and plot:
  • TPR vs. FPR
  • Sensitivity vs. 1 - Specificity
  • P(true|true) vs. P(true|false)
• Sensitivity = a / (a + b) = Recall = the LIFT numerator
• 1 - Specificity = 1 - d / (c + d) = c / (c + d)
[Figure: ROC plot; the diagonal line corresponds to random prediction.]
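To make the threshold sweep explicit, here is a sketch that builds the ROC points directly rather than calling a library routine. The helper roc_points is hypothetical and assumes NumPy arrays of 0/1 labels and predicted scores.

```python
import numpy as np

def roc_points(y_true, y_prob):
    """Sweep the threshold over every distinct score and collect (FPR, TPR) pairs."""
    points = []
    for t in np.sort(np.unique(y_prob))[::-1]:
        y_pred = (y_prob >= t).astype(int)
        tpr = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)  # sensitivity / recall
        fpr = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)  # 1 - specificity
        points.append((fpr, tpr))
    return points  # plotting these traces out the ROC curve; its area is the ROC Area (AUC)
```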
Calibration
• Good calibration:
  • If 1000 x's have pred(x) = 0.2, about 200 of them should be positive
Calibration
• A model can be accurate but poorly calibrated
  • the threshold can be well placed even when the probabilities themselves are uncalibrated
• A model can have a good ROC but be poorly calibrated
  • ROC is insensitive to scaling/stretching of the predictions
  • only the ordering has to be correct, not the probabilities themselves
• A model can have very high variance, but be well calibrated
• A model can be stupid, but be well calibrated
• Calibration is a real oddball
Measuring Calibration
• Bucket method: partition the predictions into 10 equal-width buckets spanning 0.0 to 1.0
• In each bucket:
  • measure the observed c-section rate
  • measure the predicted c-section rate (the average of the predicted probabilities)
  • if the observed rate is similar to the predicted rate => good calibration in that bucket
[Figure: predictions binned into buckets with centers 0.05, 0.15, …, 0.95.]
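A sketch of the bucket method with 10 equal-width buckets, mirroring the 0.05, 0.15, …, 0.95 bucket centers above. The helper bucket_calibration is hypothetical and assumes NumPy arrays of 0/1 labels and predicted probabilities.

```python
import numpy as np

def bucket_calibration(y_true, y_prob, n_buckets=10):
    """Compare observed vs. predicted positive rate in equal-width probability buckets."""
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi < edges[-1]:
            in_bucket = (y_prob >= lo) & (y_prob < hi)
        else:  # include the right edge in the last bucket so p = 1.0 is counted
            in_bucket = (y_prob >= lo) & (y_prob <= hi)
        if in_bucket.sum() == 0:
            continue
        observed = y_true[in_bucket].mean()    # observed positive rate in the bucket
        predicted = y_prob[in_bucket].mean()   # average predicted probability in the bucket
        rows.append((lo, hi, observed, predicted))
    return rows  # a well-calibrated model has observed ~ predicted in every bucket
```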
Base-Level Learning Methods
• Decision trees
• K-nearest neighbor
• Neural nets
• SVMs
• Bagged Decision Trees
• Boosted Decision Trees
• Boosted Stumps
• Each optimizes different things
• Each is best in different regimes
• Each algorithm has many variations and free parameters
• We generate about 2000 models on each test problem
Data Sets
• 7 binary classification data sets:
  • Adult
  • Cover Type
  • Letter.p1 (balanced)
  • Letter.p2 (unbalanced)
  • Pneumonia (University of Pittsburgh)
  • Hyper Spectral (NASA Goddard Space Flight Center)
  • Particle Physics (Stanford Linear Accelerator)
• 4k training sets
• Large final test sets (usually 20k)
Massive Empirical Comparison

  7 base-level learning methods
  × 100's of parameter settings per method
  = ~2000 models per problem
  × 7 test problems
  = 14,000 models
  × 10 performance metrics
  = 140,000 model performance evaluations
Scaling, Ranking, and Normalizing
• Problem:
  • for some metrics, 1.00 is best (e.g. ACC)
  • for some metrics, 0.00 is best (e.g. RMS)
  • for some metrics, baseline performance is 0.50 (e.g. AUC)
  • for some problems/metrics, 0.60 is excellent performance
  • for some problems/metrics, 0.99 is poor performance
• Solution 1: Normalized Scores:
  • baseline performance => 0.00
  • best observed performance => 1.00 (a proxy for Bayes optimal)
  • puts all metrics on an equal footing
• Solution 2: Scale by standard deviation
• Solution 3: Rank correlation
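Normalized scores are a one-line transformation. The sketch below assumes the baseline and best observed performance for a given problem/metric pair are already known; the function name is hypothetical.

```python
def normalized_score(raw, baseline, best):
    """Map a raw metric value so that baseline performance -> 0.0 and best observed -> 1.0.

    Works for error-style metrics (RMS, MXE) too: there baseline > best, and the sign
    cancels, so better-than-baseline performance still maps toward 1.0.
    """
    return (raw - baseline) / (best - baseline)

# Example: accuracy 0.86 on a problem with baseline 0.70 and best observed 0.94
# normalized_score(0.86, baseline=0.70, best=0.94) -> 0.666...
```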
Multi-Dimensional Scaling
• Find a low-dimensional embedding of the 10 x 14,000 performance data
• The 10 metrics span a 2-5 dimensional subspace
Multi-Dimensional Scaling
• Look at 2-D MDS plots:
  • scaled by standard deviation
  • normalized scores
  • MDS of rank correlations
  • MDS on each problem individually
  • MDS averaged across all problems
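One plausible way to produce this kind of plot is to build a metric-by-metric distance matrix from rank correlations and embed it with MDS. The distance choice below (1 minus the absolute rank correlation) and the random placeholder score table are assumptions for the sketch, not the study's exact procedure.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

# scores: (n_models, n_metrics) table of the 14,000 x 10 results (random placeholder here)
scores = np.random.rand(200, 10)

# Distance between two metrics = 1 - |rank correlation| over all models (assumed choice)
corr, _ = spearmanr(scores)            # 10 x 10 rank-correlation matrix over the columns
dist = 1.0 - np.abs(corr)

embedding = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = embedding.fit_transform(dist)  # one 2-D point per metric, ready to plot
```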
[Figure: 2-D MDS plots of the 10 metrics, under normalized-score scaling and under rank-correlation distance.]
[Figure: per-problem 2-D MDS plots for Adult, Covertype, Hyper-Spectral, Letter, Medis, and SLAC.]
Correlation Analysis
• ~2000 performances for each metric on each problem
• Compute the correlation between all pairs of metrics
  • 10 metrics => 45 pairwise correlations
• Average the correlations over the 7 test problems
• Both standard correlation and rank correlation were computed; rank correlations are presented here
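A sketch of the pairwise rank-correlation computation for a single problem, using SciPy's spearmanr; the metric names and the random placeholder score table are illustrative only.

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

metric_names = ["ACC", "FSC", "LFT", "AUC", "APR", "BEP", "RMS", "MXE", "CAL", "SAR"]

# scores: (n_models, 10) performance table for one problem (placeholder values here)
scores = np.random.rand(2000, 10)

pairwise = {}
for i, j in combinations(range(len(metric_names)), 2):
    rho, _ = spearmanr(scores[:, i], scores[:, j])      # rank correlation for one metric pair
    pairwise[(metric_names[i], metric_names[j])] = rho  # 45 pairs in total

# Averaging these tables over the 7 problems gives the rank correlations reported next.
```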
Rank Correlations
• The correlation analysis is consistent with the MDS analysis
• Ordering metrics have high correlations to each other
• ACC, AUC, and RMS have the best correlations to other metrics within their respective metric classes
• RMS has good correlation to the other metrics
• SAR has the best correlation to the other metrics
Summary
• The 10 metrics span a 2-5 dimensional subspace
• Consistent results across problems and scalings
• Ordering metrics cluster: AUC ~ APR ~ BEP
• CAL is far from the ordering metrics
• CAL is nearest to RMS/MXE
• RMS ~ MXE, but RMS is much more centrally located
• The threshold metrics ACC and FSC do not cluster as tightly as the ordering metrics and RMS/MXE
• Lift behaves more like an ordering metric than a threshold metric
• Old friends ACC, AUC, and RMS are the most representative
• The new SAR metric is good, but not much better than RMS
New Resources
• Want to borrow 14,000 models?
  • margin analysis
  • comparison to new algorithm X
  • …
• PERF code: software that calculates ~2 dozen performance metrics:
  • Accuracy (at different thresholds)
  • ROC Area and ROC plots
  • Precision and Recall plots
  • Break-even point, F-Score, Average Precision
  • Squared Error
  • Cross-Entropy
  • Lift
  • …
• Currently, most metrics are for boolean classification problems
• We are willing to add new metrics and new capabilities
• Available at: http://www.cs.cornell.edu/~caruana
Future/Related Work
• Ensemble method that optimizes any metric (ICML'04)
• Getting good probabilities from boosted trees (AISTATS'05)
• Comparison of learning algorithms across metrics (ICML'06)
• This is a first step in analyzing different performance metrics
• Develop new metrics with better properties
  • SAR is a good general-purpose metric
  • Does optimizing to SAR yield better models?
  • but RMS is nearly as good
  • attempts to make SAR better did not help much
• Extend to multi-class or hierarchical problems, where evaluating performance is more difficult