MACHINE LEARNING FOR IMPROVED RISK STRATIFICATION OF NCD PATIENTS IN ESTONIA
Big Data and Machine Learning in Health Care
Marvin Ploetz, Philip Docena, Ojaswi Pandey, Aakash Mohpal
23 April 2019
Objectives
• Propose an alternative, machine-learning-based approach to patient risk stratification for Enhanced Care Management (ECM) in Estonia
• Illustrate the use and applicability of machine learning in other areas of work relevant to EHIF
Overview
• Big Data and Machine Learning in Health Care
• Machine Learning Basics
• Context of ECM
• Research Question
• Data Overview & Sample Construction
• Feature Engineering
• Evaluation & Modelling Choices
• Results
• Conclusions
Big Data and Machine Learning in Health Care
• Take advantage of massive amounts of data to provide the right intervention to the right patient at the right time
• Personalized care for the patient
• Potential benefits for all agents in the health care system: patient, provider, payer, management
Uses of Machine Learning in Health Care
(Diagram: personalized medicine — the right patient, the right intervention, at the right time — with benefits for patients, providers, and payers)
Example 1: Hip and knee replacement in the US
• Osteoarthritis: a common and painful chronic condition
• Often requires replacement of hips and knees
• More than 500,000 Medicare beneficiaries receive replacements each year
• Medical costs: roughly $15,000 per surgery
• Medical benefits: accrue over time, since the first months after surgery are painful and spent in disability
• Therefore, a joint replacement only makes sense if the patient lives long enough to enjoy it; if the patient dies soon after, the surgery was futile and painful
• Prediction/classification problem: Can we predict which surgeries will be futile using only data available at the time of surgery?
Example 1: Hip and knee replacement in the US
• Sample: 98,090 beneficiaries who had a hip or knee replacement in 2010, drawn from a 20% sample of 7.4 million Medicare beneficiaries
• 3,305 independent variables
• Train data: 65,395 observations — model to predict the riskiest patients
• Test data: 32,695 observations — 1.4% died within one month of surgery; 4.2% died within 1-12 months
• Traditional analysis is about averages; big data and ML analytics predict individual risks
Example 1: Hip and knee replacement in the US
The first column sorts the test sample by risk percentile. Among the riskiest 5 percent of patients, the observed mortality rate within 1-12 months of surgery was 43.5%. Reallocating these surgeries to patients at the median risk level (50th percentile) would have averted 1,984 futile procedures and redirected $30m to other beneficiaries.
Example 2: Diagnoses of pediatric conditions
• Apply natural language processing algorithms to extract data from EHRs
• Extracted 101.6m data points from 1.3m EHRs of pediatric patients
• High diagnostic accuracy across multiple organ systems, comparable to the performance of experienced pediatric physicians
Example 3: Breast cancer screening
• The most common form of cancer, afflicting 2.5 million patients worldwide in 2015
• Need to distinguish malignant tumors from benign ones; early detection is key
• Data: 62,219 mammography findings from the Wisconsin State Cancer Reporting System
• A neural-network-based algorithm classifies the tumors as well as radiologists do
Definition of Big Data • Collection of large and complex data sets which are difficult to process using common database management tools or traditional data processing applications • Not only about size: finding insights from complex, noisy, heterogeneous, and longitudinal data sets • This includes capturing, storing, searching, sharing and analyzing
Types of Machine Learning Problems
• Supervised – Making predictions using labeled/structured data
• Classification: use data to predict which category something falls into
• Examples: whether an image contains a store front or not; whether a patient is high risk or not
• Regression: use data to make predictions on a continuous scale
• Examples: predict the stock price of a company; given historical data, predict tomorrow's temperature
• Unsupervised – Detecting patterns from unstructured data
• Problems where we have little or no idea what the results should look like
• Provide algorithms with data and ask them to look for hidden features and to cluster the data in a way that makes sense
• Examples: identify patterns in genomics data; separate voice from noise in audio files
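The contrast between the two problem types can be sketched in a few lines of scikit-learn, the library used later in this deck (toy data, purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: labeled data (X, y) -> learn to predict the label.
X = np.array([[0.1], [0.2], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])  # e.g. 1 = "high-risk patient"
clf = LogisticRegression().fit(X, y)
pred = clf.predict([[0.95]])  # predict the category of a new case

# Unsupervised: no labels -> ask the algorithm to find structure.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster assignments discovered from the data alone
```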
Machine Learning Implementation
1. Collect data
2. Feature engineering / data construction
3. Standardize and clean the data
4. Split the data into train (80%) and test (20%) sets
5. Build the model using the train data
6. Validate model results using the test data
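The steps above can be sketched with scikit-learn; the data here is a synthetic stand-in for the engineered features, not the actual EHIF data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                         # stand-in features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)  # stand-in labels

# Step 4: split 80% train / 20% test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 3 + 5: standardize using training statistics only, then fit.
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)

# Step 6: validate on the held-out test data.
test_accuracy = model.score(scaler.transform(X_te), y_te)
```

Fitting the scaler on the training split only avoids leaking test-set information into the model.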
Assessing Model Performance: Precision and Recall Accuracy = (TP+TN)/All Precision = TP/(TP+FP) Recall = TP/(TP+FN)
Assessing Model Performance: Precision and Recall
Case I: High precision, low recall
• Accuracy = 190/230 = 83%
• Precision = 90/95 = 95%
• Recall = 90/125 = 72%
Case II: High recall, low precision
• Accuracy = 180/230 = 78%
• Precision = 100/145 = 69%
• Recall = 100/105 = 95%
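The Case I numbers can be reproduced directly from the confusion-matrix counts (TP = 90, FP = 5, FN = 35, and TN = 100, which gives the 230 cases in total):

```python
# Confusion-matrix counts for Case I above.
TP, FP, FN, TN = 90, 5, 35, 100

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # 190/230 = 83%
precision = TP / (TP + FP)                   # 90/95  = 95%
recall    = TP / (TP + FN)                   # 90/125 = 72%
```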
Assessing Model Performance: ROC Curve
• Plot the true positive rate against the false positive rate for every classification threshold
• A perfect model has a curve that passes through the upper left corner (AUC = 1)
• The diagonal (red line) represents random guessing (AUC = 0.5)
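In scikit-learn, the curve and its area come from the predicted probabilities; the scores below are made-up toy values:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]                # actual outcomes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # model's predicted probabilities

# One (fpr, tpr) point per classification threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve: 1.0 = perfect, 0.5 = random guessing.
auc = roc_auc_score(y_true, y_score)
```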
Decision Tree: Playing Golf
• A non-parametric supervised learning method used for classification and regression
• Built in the form of a tree structure
• Breaks the data down into smaller and smaller subsets while incrementally building the tree
• The final result is a tree with decision nodes and leaf nodes
Decision Tree: Playing Golf
Outlook?
├─ Sunny → No Golf
├─ Overcast → Golf
└─ Rainy → Windy?
   ├─ True → No Golf
   └─ False → Play Golf
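Read as code, the tree is just nested decision rules; this is a hedged reconstruction of the slide's diagram (the branch directions are inferred from the classic "play golf" example), not a fitted model:

```python
def play_golf(outlook: str, windy: bool) -> bool:
    """Decision tree from the slide: decision nodes are the ifs,
    leaf nodes are the returns."""
    if outlook == "sunny":
        return False          # Sunny -> No Golf
    if outlook == "overcast":
        return True           # Overcast -> Golf
    # Rainy: the decision depends on the wind.
    return not windy          # Rainy + windy -> No Golf, else Play Golf
```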
Decision Tree to Random Forest
• A collection of decision trees whose results are aggregated into one final output
• Each tree uses a different sub-sample of the data and a different subset of the features
• Helps reduce overfitting, mainly by lowering variance
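A minimal sketch of the idea in scikit-learn, on synthetic data: each of the 100 trees is grown on a bootstrap sample with a random feature subset per split, and predictions are aggregated by majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees to aggregate
    max_features="sqrt",  # random feature subset at each split
    random_state=0,
).fit(X_tr, y_tr)

acc = forest.score(X_te, y_te)  # accuracy of the aggregated prediction
```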
A Big Challenge of the Estonian Healthcare System
• Changes in the demand for health care due to population ageing and the rise of non-communicable diseases
• Chronic conditions are the driving force behind the need for better care integration
• Low coverage of preventive services and a considerable share of avoidable specialist and hospital care
• Opportunity to improve management of specific patient groups at the PHC level → care management for empaneled patients
• Predicting for which patients breaches in care coordination will occur → risk stratification of patients
Risk Stratification Until Now
1. DM / hypertension / hyperlipidemia? No → not eligible
2. Min. and max. number/combination of CVD / respiratory / mental health / functional impairment? No → not eligible
3. Dominant/complex condition (cancer, schizophrenia, rare disease, etc.)? Yes → not eligible
4. Review by GPs (behavioral and social factors, information not in the data)? No → not eligible; Yes → ECM candidate
• No actual prediction analysis done
• Involvement of providers to gain trust/understanding
• Behavioral and social criteria are key but sparsely available → use the insider knowledge of doctors
Enhanced Care Management So Far in Estonia
• Successful enhanced care management pilot with 15 GPs and < 1,000 patients to assess the feasibility and acceptability of enhanced care management
• Commitment of the Estonian Health Insurance Fund (EHIF) to scale up the care management pilot
• Model for risk stratification: clinical algorithm + provider intuition
• Need for a better risk-stratification approach!?
The Prediction Problem
• Target patients: Who benefits from care management? A combination of disease, social, and behavioral factors…
• Objective of ECM: ultimately improve health outcomes for patients with cardiovascular, respiratory, and mental disease
• What is the right proxy prediction variable in the data? There is no single relevant adverse event (e.g. death, hospital admission, health complication, high health care spending)
• After some discussion on how to choose the dependent variable: unplanned hospital admissions have a large negative impact on patients' lives, are costly, and are relatively frequent. Some are also avoidable…
Many Patients Repeatedly Have Hospitalizations 22 percent of patients need to be hospitalized again in the following year…
Predicting Hospital Admissions
• Hospital admissions are the main (avoidable) adverse health event
• But predicting hospitalizations is a hard problem
• Social factors matter a lot; patients may have many contacts with the health care system, or none at all…
• Tradeoff in choosing which hospitalizations to predict: admissions due to specific conditions vs. hospitalizations in general
Predicting Hospital Admissions
Key question: not "What is the best algorithm for predicting hospital admissions?" but "How can we obtain the most useful prediction of hospital admissions for a specific purpose?"
Administrative Claims Data (in Estonia)
• Very reliable; high-quality data available as of 2007/2008
• Comprehensive coding requirements for providers
• Reporting lag of the data is on average 2 weeks
• No information on clinical outcomes (e.g. test results)
• Limited information on social conditions and behavioral characteristics
• A lot of feature engineering needed to create "meaningful" variables at the patient level
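The feature engineering typically means collapsing claim-level rows into one row per patient. A minimal pandas sketch, with hypothetical column names (not EHIF's actual schema):

```python
import pandas as pd

# Toy claim-level data: one row per claim, hypothetical columns.
claims = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "year":       [2016, 2016, 2015, 2016, 2016],
    "admission":  [1, 0, 1, 0, 0],
    "diagnosis":  ["I10", "E11", "I10", "J45", "J45"],
})

# Flag admissions in the last observed year.
claims["adm_2016"] = (claims["year"] == 2016) & (claims["admission"] == 1)

# Collapse to patient-level features.
features = claims.groupby("patient_id").agg(
    n_claims=("diagnosis", "size"),          # total claims per patient
    n_admissions_2016=("adm_2016", "sum"),   # admissions in 2016
    n_distinct_dgn=("diagnosis", "nunique"), # distinct diagnosis codes
).reset_index()
```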
Characteristics of Patients in the ML sample vs. Total Population • Relative to the population, the ML sample is older and more likely to be female.
Most Common Chronic Conditions
The ML sample population is also sicker on average (i.e. the prevalence of chronic conditions is higher)
Feature Selection & Engineering Series of attempts with interim features to extract better performance… Final set: 141 features
Getting to Know the Data: Diagnoses and Admissions
• Examined single diagnoses (DGNs) and pairs of DGNs
• AFib (atrial fibrillation and flutter), CHF (congestive heart failure), HTN (hypertension), and ischemic HTD (ischemic heart disease) are strong indicators of potential admissions in the following year (2017)
• Patient groups with these conditions have a non-trivial (~10%) likelihood of hospital admission
• This likelihood increases to ~20-30% with one 2016 hospital admission, and to >50% with 3 or more admissions in 2016
ML Models Selected for Evaluation
• Selection criteria:
• Algorithms are readily available in easy-to-use, comprehensive, well-tested open-source Python libraries (scikit-learn)
• Algorithms and results are relatively easy to describe/explain (common algorithms)
• For interpretability and model familiarity, no attempt at exploring more complex models; no deep networks
• Included in the comparison:
• Decision Tree
• Random Forest and Extremely Randomized Trees (ExtraTrees)
• k-Nearest Neighbors*
• Gaussian Naïve Bayes**
• Logistic Regression (L1, L2)
• SVM (RBF, polynomial)***
• Multi-layer Perceptrons (1 hidden layer)
• AdaBoost (Decision Tree and Random Forest)
• Gradient Boosted Trees (scikit-learn GBT, not XGBoost)
• Calibrated (isotonic) variations of the above classifiers
• Neural Networks
• Eventually excluded: *kNN for execution time and memory requirements, **NB for weak performance, and ***SVMs for very slow training (but considered for the final paper)
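A sketch of the comparison set-up: fit several of the listed scikit-learn models on the same split and compare test AUC. The data is synthetic (with roughly the 7.5% positive rate mentioned on the next slide), so the numbers are not the study's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic unbalanced data: ~7.5% positives.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.925, 0.075], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "extra_trees":   ExtraTreesClassifier(random_state=0),
    "logistic_l2":   LogisticRegression(max_iter=1000),
}

# Same split for every model; compare by test AUC.
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
```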
Evaluation metrics
• Variable to be predicted: yes/no hospital admission in 2017
• Uses data from 2011-2016
• We deal with an unbalanced sample (7.5% of patients had an admission in 2017)
• Appropriate metrics of model performance on an unbalanced dataset: precision, recall, ROC curve and area under the curve (AUC)
• Problem-specific custom metric to penalize one type of error more heavily: cost of a false positive (cost of ECM) vs. cost of a missed positive (cost of a subsequent hospitalization)
• Different ML models have different strengths, but the differences should not be huge
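A sketch of such a problem-specific cost metric: weight a missed admission (false negative) more heavily than a false alarm (false positive). The weights here are illustrative placeholders, not EHIF's actual costs.

```python
from sklearn.metrics import confusion_matrix

COST_FP = 1.0  # e.g. cost of enrolling a patient in ECM unnecessarily
COST_FN = 5.0  # e.g. cost of a subsequent unplanned hospitalization

def expected_cost(y_true, y_pred):
    """Total cost of a set of predictions; lower is better."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return COST_FP * fp + COST_FN * fn

# One false positive and one false negative -> cost 1*1 + 5*1 = 6.
cost = expected_cost([0, 0, 1, 1, 0], [0, 1, 1, 0, 0])
```

Such a metric can be used to compare models, or to pick the classification threshold that minimizes expected cost.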
Intuitive Interpretation of Metrics
• Precision is the probability that a patient classified by the algorithm as having a hospital admission will actually have one.
• Recall is the probability that a patient who is going to have a hospital admission is classified as such by the algorithm.
• Which one is more important? It depends a lot on the application; there is a tradeoff between maximizing either of them…
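The tradeoff can be seen by sweeping the classification threshold on toy scores: lowering the threshold raises recall at the expense of precision, and vice versa.

```python
from sklearn.metrics import precision_score, recall_score

y_true  = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.55, 0.9, 0.45]

for threshold in (0.3, 0.5, 0.7):
    y_pred = [int(s >= threshold) for s in y_score]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    # Low threshold -> high recall, lower precision;
    # high threshold -> high precision, lower recall.
```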