Spooky Stuff: Data Mining in Metric Space
Rich Caruana, Alex Niculescu
Cornell University
Motivation #1: Many Learning Algorithms
• Neural nets
• Logistic regression
• Linear perceptron
• K-nearest neighbor
• Decision trees
• ILP (Inductive Logic Programming)
• SVMs (Support Vector Machines)
• Bagging X
• Boosting X
• Rule learners (CN2, …)
• Ripper
• Random Forests (forests of decision trees)
• Gaussian Processes
• Bayes Nets
• …
• No single learning method dominates the others
Motivation #2: SLAC B/Bbar
• Particle accelerator generates B/Bbar particles
• Use machine learning to classify tracks as B or Bbar
• Domain-specific performance measure: SLQ-Score
• A 5% increase in SLQ can save $1M in accelerator time
• SLAC researchers tried various DM/ML methods
• Good, but not great, SLQ performance
• We tried standard methods, got similar results
• We studied the SLQ metric:
  • similar to probability calibration
  • tried bagged probabilistic decision trees (which had worked well on the C-section problem)
Motivation #2: Bagged Probabilistic Trees
• Draw N bootstrap samples of the data
• Train a tree on each sample ==> N trees
• Final prediction = average prediction of the N trees
[Figure: averaging the trees' predicted probabilities, e.g. (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / #Trees = 0.24]
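The bagging scheme above is simple enough to sketch in a few lines. The sketch below uses scikit-learn and synthetic data (make_classification) as stand-ins; the actual SLAC data, tree settings, and number of trees used in the study are not shown on the slides.

```python
# Minimal sketch of bagged probabilistic decision trees (illustrative data/settings only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_trees = 100
rng = np.random.default_rng(0)
prob_sum = np.zeros(len(y_test))
for _ in range(n_trees):
    # Draw one bootstrap sample (sample with replacement, same size as the train set)
    idx = rng.integers(0, len(y_train), size=len(y_train))
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    # Accumulate each tree's predicted P(class = 1)
    prob_sum += tree.predict_proba(X_test)[:, 1]

# Final prediction = average of the N trees' predicted probabilities
bagged_prob = prob_sum / n_trees
```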
Motivation #2: Improves Calibration by an Order of Magnitude
[Figure: calibration plots; a single tree shows poor calibration, 100 bagged trees show excellent calibration.]
Motivation #2: Significantly Improves SLQ
[Figure: SLQ scores for 100 bagged trees vs. a single tree.]
Motivation #2
• Can we automate this analysis of performance metrics so that it is easier to recognize which metrics are similar to each other?
Scary Stuff
• In an ideal world:
  • Learn a model that predicts the true conditional probabilities (Bayes optimal)
  • Such a model yields optimal performance on any reasonable metric
• In the real world:
  • Finite data
  • 0/1 targets instead of conditional probabilities
  • Hard to learn this ideal model
  • We don't have good metrics for recognizing the ideal model
  • The ideal model isn't always needed
• In practice:
  • Learning is done with many different metrics: ACC, AUC, CXE, RMS, …
  • Each metric represents different tradeoffs
  • Because of this, it is usually important to optimize to the appropriate metric
In this work we compare nine commonly used performance metrics (plus one new combined metric, SAR) by applying data mining to the results of a massive empirical study.
• Goals:
  • Discover relationships between performance metrics
  • Are the metrics really that different?
  • If you optimize to metric X, do you also get good performance on metric Y?
  • If you need to optimize to metric Y, which metric X should you optimize to?
  • Which metrics are more/less robust?
  • Can we design new, better metrics?
10 Binary Classification Performance Metrics
• Threshold Metrics:
  • Accuracy
  • F-Score
  • Lift
• Ordering/Ranking Metrics:
  • ROC Area
  • Average Precision
  • Precision/Recall Break-Even Point
• Probability Metrics:
  • Root-Mean-Squared-Error
  • Cross-Entropy
  • Probability Calibration
• SAR = ((1 - Squared Error) + Accuracy + ROC Area) / 3
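Most of these metrics have standard implementations. The sketch below shows one way to compute several of them, plus the SAR combination, with scikit-learn. It assumes NumPy arrays y_true in {0,1} and y_prob of predicted probabilities; the 0.5 threshold is only illustrative, and taking the squared-error term in SAR to be RMS error is an assumption about the slide's formula (lift, break-even point, and calibration are not shown here).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             average_precision_score, log_loss, mean_squared_error)

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])                    # toy labels
y_prob = np.array([0.1, 0.8, 0.65, 0.3, 0.9, 0.45, 0.7, 0.55])  # toy predicted P(class=1)
y_pred = (y_prob >= 0.5).astype(int)                             # illustrative threshold

acc = accuracy_score(y_true, y_pred)               # threshold metric
fsc = f1_score(y_true, y_pred)                     # threshold metric
auc = roc_auc_score(y_true, y_prob)                # ordering metric
apr = average_precision_score(y_true, y_prob)      # ordering metric
rms = np.sqrt(mean_squared_error(y_true, y_prob))  # probability metric (RMS error)
cxe = log_loss(y_true, y_prob)                     # probability metric (cross-entropy)

# SAR as defined on the slide, with RMS error used for the squared-error term (assumption)
sar = ((1 - rms) + acc + auc) / 3
```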
Accuracy
(predictions thresholded at a chosen value)

              Predicted 1   Predicted 0
  True 1           a             b
  True 0           c             d

accuracy = (a + d) / (a + b + c + d)      (a and d are correct; b and c are incorrect)
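For concreteness, a tiny sketch (with made-up numbers) of how the four cells and accuracy are computed from thresholded predictions:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])                    # toy labels
y_prob = np.array([0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.8, 0.3])    # toy predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)                            # apply the threshold

a = np.sum((y_true == 1) & (y_pred == 1))  # correct:   true 1, predicted 1
b = np.sum((y_true == 1) & (y_pred == 0))  # incorrect: true 1, predicted 0
c = np.sum((y_true == 0) & (y_pred == 1))  # incorrect: true 0, predicted 1
d = np.sum((y_true == 0) & (y_pred == 0))  # correct:   true 0, predicted 0

accuracy = (a + d) / (a + b + c + d)
```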
Lift
• Not interested in accuracy on the entire dataset
• Want accurate predictions for the top 5%, 10%, or 20% of the dataset
• Don't care about the remaining 95%, 90%, 80%, respectively
• Typical application: marketing
• Lift measures how much better than random prediction the model is on the fraction of the dataset predicted true (f(x) > threshold)
Lift
(same confusion matrix, with the threshold set so that a chosen fraction of the dataset is predicted true)

              Predicted 1   Predicted 0
  True 1           a             b
  True 0           c             d

lift = [a / (a + b)] / [(a + c) / (a + b + c + d)]
     = fraction of true positives captured / fraction of the dataset predicted true
[Figure: lift curve; lift = 3.5 if mailings are sent to the top 20% of customers.]
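A minimal sketch of lift at a fixed "mailing depth", matching the definition above. The helper lift_at_depth is hypothetical, and the inputs are assumed to be NumPy arrays of 0/1 labels and predicted scores.

```python
import numpy as np

def lift_at_depth(y_true, y_prob, depth=0.20):
    """Lift when the top `depth` fraction of cases (by predicted score) is predicted true."""
    n = len(y_true)
    k = int(np.ceil(depth * n))            # number of cases above the threshold
    top = np.argsort(-y_prob)[:k]          # indices of the top-scoring cases
    precision_at_k = y_true[top].mean()    # positive rate among the predicted-true cases
    base_rate = y_true.mean()              # positive rate of a random selection
    return precision_at_k / base_rate

# e.g. lift_at_depth(y_true, y_prob, 0.20) ~ 3.5 would mean the model finds 3.5x more
# positives in its top 20% than random mailing would.
```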
Precision/Recall, F-Score, Break-Even Point
• precision = a / (a + c), recall = a / (a + b)
• F-Score = harmonic mean of precision and recall = 2 · precision · recall / (precision + recall)
• Break-even point = the value at which precision equals recall as the threshold is swept
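A small sketch of these quantities using scikit-learn's precision_recall_curve. Approximating the break-even point as the point on the curve where precision and recall are closest is one simple convention, not necessarily the one used in the study.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])                    # toy labels
y_prob = np.array([0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.8, 0.3])    # toy predicted probabilities

# One (precision, recall) point per threshold as the threshold is swept
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# F-score (harmonic mean) at every point on the curve, guarding against 0/0
f_score = 2 * precision * recall / np.maximum(precision + recall, 1e-12)

# Break-even point: where precision is closest to recall along the curve
bep = precision[np.argmin(np.abs(precision - recall))]
```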
[Figure: precision/recall plot showing regions of better and worse performance.]
The same confusion matrix cells go by several names:

              Predicted 1               Predicted 0
  True 1      true positive (TP)        false negative (FN)
  True 0      false positive (FP)       true negative (TN)

              Predicted 1               Predicted 0
  True 1      hits, P(pr1|tr1)          misses, P(pr0|tr1)
  True 0      false alarms, P(pr1|tr0)  correct rejections, P(pr0|tr0)
ROC Plot and ROC Area
• Receiver Operating Characteristic
• Developed in WWII to statistically model false positive and false negative detections of radar operators
• Better statistical foundations than most other measures
• Standard measure in medicine and biology
• Becoming more popular in ML
• Sweep the threshold and plot:
  • TPR vs. FPR
  • Sensitivity vs. 1 - Specificity
  • P(true|true) vs. P(true|false)
• Sensitivity = a / (a + b) = Recall = the LIFT numerator
• 1 - Specificity = 1 - d / (c + d) = c / (c + d)
[Figure: ROC plot; the diagonal line corresponds to random prediction.]
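To make the threshold sweep explicit, here is a sketch that builds the ROC points directly rather than calling a library routine. The helper roc_points is hypothetical and assumes NumPy arrays of 0/1 labels and predicted scores.

```python
import numpy as np

def roc_points(y_true, y_prob):
    """Sweep the threshold over every distinct score and collect (FPR, TPR) pairs."""
    points = []
    for t in np.sort(np.unique(y_prob))[::-1]:
        y_pred = (y_prob >= t).astype(int)
        tpr = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)  # sensitivity / recall
        fpr = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)  # 1 - specificity
        points.append((fpr, tpr))
    return points  # plotting these traces out the ROC curve; its area is the ROC Area (AUC)
```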
Calibration
• Good calibration:
  • If 1000 x's have pred(x) = 0.2, about 200 of them should be positive
Calibration
• A model can be accurate but poorly calibrated
  • the threshold can be well placed even when the probabilities themselves are uncalibrated
• A model can have a good ROC but be poorly calibrated
  • ROC is insensitive to scaling/stretching of the predictions
  • only the ordering has to be correct, not the probabilities themselves
• A model can have very high variance, but be well calibrated
• A model can be stupid, but be well calibrated
• Calibration is a real oddball
Measuring Calibration
• Bucket method: partition the predictions into 10 equal-width buckets spanning 0.0 to 1.0
• In each bucket:
  • measure the observed c-section rate
  • measure the predicted c-section rate (the average of the predicted probabilities)
  • if the observed rate is similar to the predicted rate => good calibration in that bucket
[Figure: predictions binned into buckets with centers 0.05, 0.15, …, 0.95.]
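A sketch of the bucket method with 10 equal-width buckets, mirroring the 0.05, 0.15, …, 0.95 bucket centers above. The helper bucket_calibration is hypothetical and assumes NumPy arrays of 0/1 labels and predicted probabilities.

```python
import numpy as np

def bucket_calibration(y_true, y_prob, n_buckets=10):
    """Compare observed vs. predicted positive rate in equal-width probability buckets."""
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi < edges[-1]:
            in_bucket = (y_prob >= lo) & (y_prob < hi)
        else:  # include the right edge in the last bucket so p = 1.0 is counted
            in_bucket = (y_prob >= lo) & (y_prob <= hi)
        if in_bucket.sum() == 0:
            continue
        observed = y_true[in_bucket].mean()    # observed positive rate in the bucket
        predicted = y_prob[in_bucket].mean()   # average predicted probability in the bucket
        rows.append((lo, hi, observed, predicted))
    return rows  # a well-calibrated model has observed ~ predicted in every bucket
```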
Base-Level Learning Methods
• Decision trees
• K-nearest neighbor
• Neural nets
• SVMs
• Bagged Decision Trees
• Boosted Decision Trees
• Boosted Stumps
• Each optimizes different things
• Each is best in different regimes
• Each algorithm has many variations and free parameters
• We generate about 2000 models on each test problem
Data Sets
• 7 binary classification data sets:
  • Adult
  • Cover Type
  • Letter.p1 (balanced)
  • Letter.p2 (unbalanced)
  • Pneumonia (University of Pittsburgh)
  • Hyper Spectral (NASA Goddard Space Flight Center)
  • Particle Physics (Stanford Linear Accelerator)
• 4k training sets
• Large final test sets (usually 20k)
Massive Empirical Comparison

  7 base-level learning methods
  × 100's of parameter settings per method
  = ~2000 models per problem
  × 7 test problems
  = 14,000 models
  × 10 performance metrics
  = 140,000 model performance evaluations
Scaling, Ranking, and Normalizing
• Problem:
  • for some metrics, 1.00 is best (e.g. ACC)
  • for some metrics, 0.00 is best (e.g. RMS)
  • for some metrics, baseline performance is 0.50 (e.g. AUC)
  • for some problems/metrics, 0.60 is excellent performance
  • for some problems/metrics, 0.99 is poor performance
• Solution 1: Normalized Scores:
  • baseline performance => 0.00
  • best observed performance => 1.00 (a proxy for Bayes optimal)
  • puts all metrics on an equal footing
• Solution 2: Scale by standard deviation
• Solution 3: Rank correlation
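Normalized scores are a one-line transformation. The sketch below assumes the baseline and best observed performance for a given problem/metric pair are already known; the function name is hypothetical.

```python
def normalized_score(raw, baseline, best):
    """Map a raw metric value so that baseline performance -> 0.0 and best observed -> 1.0.

    Works for error-style metrics (RMS, MXE) too: there baseline > best, and the sign
    cancels, so better-than-baseline performance still maps toward 1.0.
    """
    return (raw - baseline) / (best - baseline)

# Example: accuracy 0.86 on a problem with baseline 0.70 and best observed 0.94
# normalized_score(0.86, baseline=0.70, best=0.94) -> 0.666...
```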
Multi-Dimensional Scaling
• Find a low-dimensional embedding of the 10 x 14,000 performance data
• The 10 metrics span a 2-5 dimensional subspace
Multi-Dimensional Scaling
• Look at 2-D MDS plots:
  • scaled by standard deviation
  • normalized scores
  • MDS of rank correlations
  • MDS on each problem individually
  • MDS averaged across all problems
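One plausible way to produce this kind of plot is to build a metric-by-metric distance matrix from rank correlations and embed it with MDS. The distance choice below (1 minus the absolute rank correlation) and the random placeholder score table are assumptions for the sketch, not the study's exact procedure.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

# scores: (n_models, n_metrics) table of the 14,000 x 10 results (random placeholder here)
scores = np.random.rand(200, 10)

# Distance between two metrics = 1 - |rank correlation| over all models (assumed choice)
corr, _ = spearmanr(scores)            # 10 x 10 rank-correlation matrix over the columns
dist = 1.0 - np.abs(corr)

embedding = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = embedding.fit_transform(dist)  # one 2-D point per metric, ready to plot
```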
[Figure: 2-D MDS plots of the 10 metrics, under normalized-score scaling and under rank-correlation distance.]
[Figure: per-problem 2-D MDS plots for Adult, Covertype, Hyper-Spectral, Letter, Medis, and SLAC.]
Correlation Analysis
• ~2000 performances for each metric on each problem
• Compute the correlation between all pairs of metrics
  • 10 metrics => 45 pairwise correlations
• Average the correlations over the 7 test problems
• Both standard correlation and rank correlation were computed; rank correlations are presented here
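A sketch of the pairwise rank-correlation computation for a single problem, using SciPy's spearmanr; the metric names and the random placeholder score table are illustrative only.

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

metric_names = ["ACC", "FSC", "LFT", "AUC", "APR", "BEP", "RMS", "MXE", "CAL", "SAR"]

# scores: (n_models, 10) performance table for one problem (placeholder values here)
scores = np.random.rand(2000, 10)

pairwise = {}
for i, j in combinations(range(len(metric_names)), 2):
    rho, _ = spearmanr(scores[:, i], scores[:, j])      # rank correlation for one metric pair
    pairwise[(metric_names[i], metric_names[j])] = rho  # 45 pairs in total

# Averaging these tables over the 7 problems gives the rank correlations reported next.
```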
Rank Correlations
• The correlation analysis is consistent with the MDS analysis
• Ordering metrics have high correlations to each other
• ACC, AUC, and RMS have the best correlations to other metrics within their respective metric classes
• RMS has good correlation to the other metrics
• SAR has the best correlation to the other metrics
Summary
• The 10 metrics span a 2-5 dimensional subspace
• Consistent results across problems and scalings
• Ordering metrics cluster: AUC ~ APR ~ BEP
• CAL is far from the ordering metrics
• CAL is nearest to RMS/MXE
• RMS ~ MXE, but RMS is much more centrally located
• The threshold metrics ACC and FSC do not cluster as tightly as the ordering metrics and RMS/MXE
• Lift behaves more like an ordering metric than a threshold metric
• Old friends ACC, AUC, and RMS are the most representative
• The new SAR metric is good, but not much better than RMS
New Resources
• Want to borrow 14,000 models?
  • margin analysis
  • comparison to new algorithm X
  • …
• PERF code: software that calculates ~2 dozen performance metrics:
  • Accuracy (at different thresholds)
  • ROC Area and ROC plots
  • Precision and Recall plots
  • Break-even point, F-Score, Average Precision
  • Squared Error
  • Cross-Entropy
  • Lift
  • …
• Currently, most metrics are for boolean classification problems
• We are willing to add new metrics and new capabilities
• Available at: http://www.cs.cornell.edu/~caruana
Future/Related Work
• Ensemble method that optimizes any metric (ICML'04)
• Getting good probabilities from boosted trees (AISTATS'05)
• Comparison of learning algorithms across metrics (ICML'06)
• This is a first step in analyzing different performance metrics
• Develop new metrics with better properties
  • SAR is a good general-purpose metric
  • Does optimizing to SAR yield better models?
  • but RMS is nearly as good
  • attempts to make SAR better did not help much
• Extend to multi-class or hierarchical problems, where evaluating performance is more difficult