Using Machine Learning to Model Standard Practice: Retrospective Analysis of Group C-Section Rate via Bagged Decision Trees
Rich Caruana (Cornell CS) • Stefan Niculescu (CMU CS) • Bharat Rao (Siemens Medical) • Cynthia Simms (Magee Hospital)
C-Section in the U.S.
• C-section rate in U.S. too high
• Western Europe has lower rates, but comparable outcomes
• C-section is major surgery => tough on mother
• C-section is expensive
• Why is U.S. rate so high?
  • Convenience (rate highest Fridays, before sporting events, …)
  • Litigation
  • Social and demographic issues
• Current controls
  • Financial: pay-per-patient instead of pay-per-procedure
  • Physician reviews: monthly/quarterly evaluation of rates
Risk Adjustment
• Some practices specialize in high-risk patients
• Some practices have low-risk demographics
• Must correct for the patient population seen by each practice
Need an accurate, objective, evidence-based method for assessing the c-section rate of different physician practices
Modeling "Standard Practice" with Machine Learning • Not trying to improve outcomes • Maintain quality of outcomes while reducing c-sec rate • Compare physicians/practices to other physicians/practices • Warn physicians if rate higher than other practitioners
Data
• 3 years of data: 1995-1997, from south-western PA
• 22,175 expectant mothers
• 16.8% c-section rate
• 144 attributes per patient (82 used for learning)
• 17 physician practices
• C-section rate varies from 13% to 23% across practices
Learning Methods
• Logistic Regression
• Artificial Neural Nets
• MML Decision Trees
  • Buntine's IND decision tree package
  • Maximum-size trees (1000's of nodes)
  • Probabilities generated by smoothing counts along the path to a leaf
• Bagged MML Decision Trees
  • Better accuracy
  • Better ROC area
  • Excellent calibration
Bagged Decision Trees
• Draw 100 bootstrap samples of the data
• Train a tree on each sample -> 100 trees
• Average the predictions of the trees on out-of-bag samples
Example: average prediction = (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / #Trees = 0.24
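The procedure can be sketched in a few lines. This is a minimal illustration only: it assumes numpy arrays as input and scikit-learn's DecisionTreeClassifier in place of Buntine's IND/MML trees, and it omits the count-smoothing at the leaves; the function and variable names are hypothetical.

```python
# Minimal bagging sketch: draw bootstrap samples, grow one unpruned tree per
# sample, and average each tree's predicted probability over the points that
# were out-of-bag for that tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_tree_oob_probabilities(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    prob_sum = np.zeros(n)    # running sum of out-of-bag probabilities
    oob_count = np.zeros(n)   # number of trees for which each point was out-of-bag

    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)   # points not drawn are out-of-bag
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])  # maximum-size tree
        prob_sum[oob] += tree.predict_proba(X[oob])[:, 1]
        oob_count[oob] += 1

    # averaged out-of-bag prediction, e.g. (0.23 + 0.19 + ...) / number of trees
    return prob_sum / np.maximum(oob_count, 1)
```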
Calibration
• Good calibration:
  • If 1000 patients have pred(x) = 0.1, ~100 should get a c-sec
Calibration
• A model can be accurate but poorly calibrated
  • a good threshold can be chosen even with uncalibrated probabilities
• A model can have good ROC but be poorly calibrated
  • ROC is insensitive to scaling/stretching
  • only the ordering has to be correct, not the probabilities themselves
• A model can have very high variance but still be well calibrated
  • if the expected value <pred(patient)> is correct -> good calibration
  • not worried about the prediction for any one patient:
    • aggregating predictions over groups of patients reduces variance
    • high variance of the per-patient predictions is OK
  • bias/variance tradeoff: favor reducing bias at the cost of increased variance
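A toy numeric illustration of the ROC-vs-calibration point (not from the paper; the numbers are invented): a strictly monotone rescaling of the predicted probabilities leaves the ROC area unchanged, because the ordering is preserved, but destroys calibration.

```python
# ROC area depends only on the ordering of the scores, so squashing the
# probabilities leaves it unchanged while ruining calibration.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.60, size=5000)        # invented per-patient risks
y = (rng.uniform(size=5000) < p_true).astype(int)  # simulated outcomes

calibrated = p_true              # scores on the correct probability scale
squashed = 0.4 + 0.5 * p_true    # same ordering, wrong scale

print(roc_auc_score(y, calibrated), roc_auc_score(y, squashed))  # identical ROC areas
print(y.mean(), calibrated.mean(), squashed.mean())  # only the calibrated mean matches the rate
```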
Measuring Calibration
• Bucket method
• In each bucket:
  • measure the observed c-sec rate
  • measure the predicted c-sec rate (average of the probabilities)
  • if the observed c-sec rate is similar to the predicted c-sec rate => good calibration in that bucket
[Figure: predictions grouped into ten equal-width buckets spanning 0.0-1.0, with bucket centers 0.05, 0.15, …, 0.95]
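A minimal sketch of the bucket method, assuming ten equal-width buckets over [0, 1] and a size-weighted mean absolute calibration error; the exact bucketing and weighting used in the study are assumptions here.

```python
# Bucket-method calibration: group predictions into equal-width buckets and
# compare the observed c-sec rate with the mean predicted rate in each bucket.
import numpy as np

def bucket_calibration(y, p, n_buckets=10):
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    buckets = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p < hi) if hi < 1.0 else (p >= lo) & (p <= hi)
        if mask.any():
            buckets.append((lo, hi, y[mask].mean(), p[mask].mean(), int(mask.sum())))
    # size-weighted mean absolute difference between observed and predicted rates
    total = sum(n for *_, n in buckets)
    cal_err = sum(n * abs(obs - pred) for _, _, obs, pred, n in buckets) / total
    return buckets, cal_err
```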
Overall Bagged MML DT Performance
• Out-of-bag test points (cross-validation for bagging)
• Accuracy: 90.1% (baseline 83.2%)
• ROC Area: 0.9233
• Mean absolute calibration error = 0.013
• Calibration is as good in the tails as in the center of the distribution
Aggregate Risk
• The aggregate risk of c-section for a group is just the average predicted probability of c-section over the patients in that group
• If a group's aggregate risk matches its observed c-sec rate, then the group is performing c-secs in accordance with standard practice (as modeled by the learned model)
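As a sketch (the names practice_ids, y, and p are hypothetical, assumed to be per-patient arrays aligned in the same order), the aggregate-risk comparison for each practice reduces to a few lines:

```python
# For each physician practice, compare the aggregate risk predicted by the
# standard-practice model with the practice's observed c-sec rate.
import numpy as np

def aggregate_risk_report(practice_ids, y, p):
    report = {}
    for group in np.unique(practice_ids):
        mask = practice_ids == group
        predicted = p[mask].mean()   # aggregate risk: average predicted probability
        observed = y[mask].mean()    # observed c-sec rate in this practice
        report[group] = {"predicted": predicted,
                         "observed": observed,
                         "excess": observed - predicted}
    return report
```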
Sanity Check
• Could the models have good calibration in the tails, but still be inaccurate on some groups of patients?
• Suppose there are two kinds of patients, A and B, both with elevated risk: p(A) = p(B) = 0.8
• Because patients A and B differ, the model might be more accurate or better calibrated on A than on B, so a practice seeing A patients might be over/under-estimated relative to a practice seeing B patients
• Are the models good on each physician group?
Assumptions
• Intrinsic factors: variables given to learning as inputs
• Extrinsic factors: variables not given to learning
  • provider (HMO vs. pay-per-procedure vs. no insurance vs. …)
  • patient/physician preference (e.g. socio-economic class)
• Learning compensates for intrinsic factors, not extrinsic ones
  • assumes extrinsic factors are not correlated with intrinsic ones
  • correlations may weaken sensitivity
• Differences between observed and predicted rates are due to extrinsic factors, model bias, and noise
• If the models are good, differences are due mainly to extrinsic factors such as provider and physician/patient preferences
Summary & Conclusions • Bagged probabilistic decision trees are excellent • Accuracy: 86.9% -> 90.1% • ROC Area: 0.891 -> 0.923 • Calibration: 0.169 -> 0.013! • Learned models of standard practice are good • 3 of 17 groups have high c-section rate • one has high-risk population • one has normal risk population • one has low-risk population!!! • One group has csec rate 4% below predicted rate • how do they do it? • are their outcomes good?
Future Work
• 3 more years of unused data: 1992-1994
• Do discrepancies from standard practice correlate with:
  • practice characteristics?
  • patient characteristics?
  • payor characteristics?
  • outcomes?
• Compare with data/models for patients in Europe
• Confidence intervals
• Research on machine learning to improve calibration
• Apply this methodology to other problems in medicine
Sub Population Assessment
• Accurately predict c-section rate for groups of patients
  • aggregate prediction
  • not worried about predictions for individual patients
• Retrospective analysis
  • not prediction for future patients (i.e. pregnant women)
  • prediction for pregnancies that have already gone full term
  • risk is assumed by the physician, not by machine learning
• Tell the physician/practice if their rate does not match the rate predicted by the standard-practice model
Calibration
• Good calibration:
  • Given 1000 patients with identical features, the observed c-sec rate should equal the average predicted rate on those patients
• Often the average prediction merely matches the overall c-sec rate, but that's not good enough
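In standard notation (a reconstruction, not taken verbatim from the slide), the contrast is between conditional calibration, which the bucket method checks, and the weaker property of merely matching the overall rate:

```latex
% Good (conditional) calibration: among patients with predicted risk p,
% the observed c-sec rate equals p, for every value of p.
\mathbb{E}\bigl[\, y \mid \mathrm{pred}(x) = p \,\bigr] = p \qquad \text{for all } p
% The weaker property that is often obtained: the mean prediction
% merely matches the overall observed rate.
\frac{1}{N}\sum_{i=1}^{N} \mathrm{pred}(x_i) \;=\; \frac{1}{N}\sum_{i=1}^{N} y_i
```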