Analyze employee health care data to identify major problems and generate a report with explanations, recommendations, and user-friendly graphics.
Outline • Summarization • KEFIR – Key Findings Reporter • WSARE – What is Strange About Recent Events
What is New? Old data new data
Summarization • Concisely summarize what is new and different, unexpected • with respect to previous values • with respect to expected values • … • Focus on what is actionable!
Problem: Healthcare Costs • Healthcare costs in US: 1 out of 7 GDP $ and rising • potential problems: fraud, misuse, … • understanding where the problems are is first step to fixing them • GTE – self insured for medical costs • GTE healthcare costs – $X00,000,000 • Task: Analyze employee health care data and generate a report that describes the major problems
GTE Key Findings Reporter: KEFIR • KEFIR Approach: • Analyze all possible deviations • Select interesting findings • Augment key findings with: • Explanations of plausible causes • Recommendations of appropriate actions • Convert findings to a user-friendly report with text and graphics
Deviation Detection • Drill Down through the search space • Generate a finding for each measure • deviation from previous period • deviation from norm • deviation projected for next period, if no action
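The three kinds of deviation can be sketched for a single measure as follows. This is a minimal illustration, not KEFIR's code: the function name, the use of fractional changes, and the projection rule (repeat the last absolute change) are all assumptions, since the slide does not say how the next period is projected.

```python
def deviations(previous, current, norm):
    """Return the three deviations for one measure, as fractions."""
    from_previous = (current - previous) / previous
    from_norm = (current - norm) / norm
    # Assumption: with no action, the last absolute change simply repeats.
    projected = current + (current - previous)
    projected_from_norm = (projected - norm) / norm
    return {
        "from_previous": from_previous,
        "from_norm": from_norm,
        "projected_from_norm": projected_from_norm,
    }


# Hypothetical values for one measure in one study area.
result = deviations(previous=100.0, current=120.0, norm=110.0)
```

A drill-down would call such a function once per measure in each cell of the search space and keep the resulting findings for the interestingness step.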
Interestingness of Deviations • Impact: how much the deviation affects the bottom line • Savings Percentage: how much of the deviation from the norm can be expected to be saved by the action
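One plausible way to turn these two quantities into a ranking is to multiply them, giving an estimate of the dollars an action could save. The product rule, the function name, and the sample findings below are assumptions made for illustration; the slide only defines the two inputs.

```python
def interestingness(impact_dollars, savings_percentage):
    """Assumed scoring rule: expected savings = impact x savings percentage."""
    return impact_dollars * savings_percentage


# Hypothetical findings: (measure, impact in dollars, savings percentage).
findings = [
    ("admission rate per 1000", 2_000_000, 0.20),
    ("average length of stay", 3_000_000, 0.10),
    ("payments per case", 500_000, 0.50),
]

# Rank findings from most to least interesting.
ranked = sorted(findings, key=lambda f: interestingness(f[1], f[2]), reverse=True)
```

Under this rule a large deviation that no action can address ranks below a smaller one that is mostly recoverable, which matches the emphasis on actionable findings.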
Recommendations Hierarchical recommendation rules define appropriate intervention strategies for important measures and study areas. Example: If measure = admission rate per 1000 & study_area = Inpatient admissions & percent_change > 0.10 Then Utilization review is needed in the area of admission certification. Expected Savings: 20%
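The example rule can be encoded as data and matched against findings. This is a sketch only: the `Finding` and `Rule` classes and the `recommend` function are hypothetical names, and the hierarchy among rules mentioned on the slide is not modeled (rules are simply tried in order).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    measure: str
    study_area: str
    percent_change: float


@dataclass
class Rule:
    measure: str
    study_area: str
    min_percent_change: float
    recommendation: str
    expected_savings: float

    def matches(self, finding: Finding) -> bool:
        return (
            finding.measure == self.measure
            and finding.study_area == self.study_area
            and finding.percent_change > self.min_percent_change
        )


# The single rule shown on the slide.
RULES = [
    Rule(
        measure="admission rate per 1000",
        study_area="Inpatient admissions",
        min_percent_change=0.10,
        recommendation="Utilization review is needed in the area of admission certification.",
        expected_savings=0.20,
    )
]


def recommend(finding: Finding, rules: List[Rule] = RULES) -> Optional[Rule]:
    """Return the first matching rule, or None if no rule applies."""
    for rule in rules:
        if rule.matches(finding):
            return rule
    return None
```

For example, `recommend(Finding("admission rate per 1000", "Inpatient admissions", 0.15))` returns the rule above, while a 5% change returns `None`.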
Explanation A measure is explained by finding the path of related measures with the highest impact. Example: The large increase in m1 in group s1 was caused by an increase in m3, which was caused by a rise in m5, primarily in sector s13.
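A simple way to picture this is a greedy walk through the related measures, always following the one with the highest impact. The graph, the impact values, and the greedy strategy itself are assumptions for illustration; the slide does not say how the highest-impact path is searched for.

```python
def explain(measure, related, impact):
    """Follow the highest-impact related measure until none remain.

    Assumes `related` is acyclic, so the walk always terminates.
    """
    path = [measure]
    while related.get(path[-1]):
        path.append(max(related[path[-1]], key=impact.get))
    return path


# Hypothetical relationships: m1 depends on m2 and m3; m3 on m4 and m5.
related = {"m1": ["m2", "m3"], "m3": ["m4", "m5"]}
impact = {"m2": 10.0, "m3": 40.0, "m4": 5.0, "m5": 30.0}

path = explain("m1", related, impact)
```

With these made-up numbers the walk visits m1, m3, and m5, mirroring the chain in the example sentence.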
Report Generation • Automatic generation of business-user-oriented reports • Natural language generation with template matching • Graphics • delivered via browser
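Template-based generation can be sketched as filling the slots of a fixed sentence with the fields of a finding. The template wording and the `render_finding` function are hypothetical, not KEFIR's actual templates.

```python
# Hypothetical sentence template with named slots.
TEMPLATE = (
    "In {study_area}, {measure} {direction} by {change:.0%} "
    "compared with the previous period."
)


def render_finding(measure, study_area, percent_change):
    """Fill the template; a zero change is not special-cased in this sketch."""
    direction = "rose" if percent_change > 0 else "fell"
    return TEMPLATE.format(
        measure=measure,
        study_area=study_area,
        direction=direction,
        change=abs(percent_change),
    )


sentence = render_finding("admission rate per 1000", "Inpatient admissions", 0.15)
```

A fuller system would select among several templates depending on the type of finding, which is presumably what "template matching" refers to.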
Sample KEFIR pages: Overview; Inpatient admissions
Status • Prototype implemented in GTE in 1995 • KEFIR received GTE’s highest award for technical achievement in 1995 • Key business user left GTE in 1996 and system was no longer used • Publication: • Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data, C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
What’s Strange About Recent Events (WSARE) • Weng-Keen Wong (Carnegie Mellon University) • Andrew Moore (Carnegie Mellon University) • Gregory Cooper (University of Pittsburgh) • Michael Wagner (University of Pittsburgh) • http://www.autonlab.org/wsare • Designed to be easily applicable to any date/time-indexed biosurveillance-relevant data stream
Motivation Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved)
Traditional Approaches We need to build a univariate detector to monitor each interesting combination of attributes: • Diarrhea cases among children • Number of cases involving people working in southern part of the city • Respiratory syndrome cases among females • Number of cases involving teenage girls living in the western part of the city • Viral syndrome cases involving senior citizens from eastern part of city • Botulinic syndrome cases • Number of children from downtown hospital • And so on… You’ll need hundreds of univariate detectors! We would like to identify the groups with the strangest behavior in recent events.
WSARE Approach • Rule-Based Anomaly Pattern Detection • Association rules used to characterize anomalous patterns. For example, a two-component rule would be: Gender = Male AND 40 <= Age < 50
WSARE v2.0 Overview 1. Obtain Recent and Baseline datasets 2. Search for rule with best score 3. Determine p-value of best scoring rule through randomization test 4. If p-value is less than threshold, signal alert
Step 1: Obtain Recent and Baseline Data • Recent Data: data from last 24 hours • Baseline: baseline data is assumed to capture non-outbreak behavior. We use data from 35, 42, 49 and 56 days prior to the current day
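The baseline days can be computed directly from the current day. The constant and function names below are illustrative. Because every lag is a multiple of 7, each baseline day falls on the same day of the week as the current day.

```python
from datetime import date, timedelta

# Lags stated on the slide, in days before the current day.
BASELINE_LAGS_DAYS = (35, 42, 49, 56)


def baseline_dates(current_day, lags=BASELINE_LAGS_DAYS):
    """Return the calendar days whose records form the baseline."""
    return [current_day - timedelta(days=lag) for lag in lags]


days = baseline_dates(date(2001, 12, 23))
```

Records dated on any of these days would form the baseline set, and records from the last 24 hours the recent set.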
Example Sat 12-23-2001 35.8% (48/134) of today's cases have 30 <= age < 40 17.0% (45/265) of other (baseline) cases have 30 <= age < 40
Step 2: Search for Best Rule • For each rule, form a 2x2 contingency table (cases that match the rule vs. cases that do not, in Recent vs. Baseline) • Perform Fisher’s Exact Test to get a p-value (score) for each rule (for this data 0.00005) • Find the rule RBEST with the lowest score • Caution: This score is not the true p-value of RBEST because of multiple tests
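Scoring a single rule can be sketched with a small pure-Python Fisher's exact test. The slide's contingency table did not survive extraction, so the counts below are taken from the Sat 12-23-2001 example (48 of 134 recent cases and 45 of 265 baseline cases with 30 <= age < 40), on the assumption that this is the table the slide showed. The two-sided variant is also an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x matches in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Rows: recent, baseline. Columns: matches the rule, does not match.
score = fisher_exact_two_sided(48, 134 - 48, 45, 265 - 45)
```

The rule search would compute such a score for every candidate rule and keep the one with the lowest value.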
Step 3: Randomization Test • Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DBRand • Find the rule with the best score on DBRand.
Step 3: Randomization Test (continued) • Repeat the procedure on the previous slide for 1000 iterations • Determine how many scores from the 1000 iterations are better than the original score • If the original score would place in the top 1% of the 1000 scores from the randomization test, we would be impressed and an alert should be raised • Estimated p-value of the rule is: # better scores / # iterations
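The randomization test can be sketched as below. Shuffling the recent/baseline labels stands in for shuffling the date field. To keep the example short, `best_gap_score` is a stand-in for the Fisher-based rule search (it is not WSARE's score), the data reuse the counts from the age example, and 200 iterations are used instead of 1000.

```python
import random


def best_gap_score(is_recent, features):
    """Stand-in rule search: negative of the largest recent-vs-baseline
    proportion gap over all feature values (lower is better)."""
    n_recent = sum(is_recent)
    n_base = len(is_recent) - n_recent
    best = 0.0
    for value in set(features):
        r = sum(1 for flag, f in zip(is_recent, features) if flag and f == value)
        b = sum(1 for flag, f in zip(is_recent, features) if not flag and f == value)
        best = min(best, -abs(r / n_recent - b / n_base))
    return best


def randomization_p_value(is_recent, features, best_score, iterations=1000, seed=0):
    """Estimate the p-value as (# better scores) / (# iterations)."""
    rng = random.Random(seed)
    original = best_score(is_recent, features)
    shuffled = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # break any link between date and attributes
        if best_score(shuffled, features) < original:
            better += 1
    return better / iterations


# 134 recent and 265 baseline cases, using the counts from the age example.
features = ["30-40"] * 48 + ["other"] * 86 + ["30-40"] * 45 + ["other"] * 220
is_recent = [True] * 134 + [False] * 265
p_value = randomization_p_value(is_recent, features, best_gap_score, iterations=200)
```

Because the best score is recomputed on every shuffled dataset, the estimate accounts for the multiple tests performed during the rule search.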
Results on Actual ED Data from 2001
1. Sat 2001-02-13: SCORE = -0.00000004 PVALUE = 0.00000000
   14.80% (74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
   7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13: SCORE = -0.00000464 PVALUE = 0.00000000
   12.42% (58/467) of today's cases have Respiratory Syndrome = True
   6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed 2001-06-30: SCORE = -0.00000013 PVALUE = 0.00000000
   1.44% (9/625) of today's cases have 100 <= Age < 110
   0.08% (8/10000) of baseline have 100 <= Age < 110
4. Sun 2001-08-08: SCORE = -0.00000007 PVALUE = 0.00000000
   83.80% (481/574) of today's cases have Unknown Syndrome = False
   74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu 2001-12-02: SCORE = -0.00000087 PVALUE = 0.00000000
   14.71% (70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
   7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
WSARE 3.0: Improving the Baseline Recall that the baseline was assumed to be captured by data that was from 35, 42, 49, and 56 days prior to the current day. What if this assumption isn’t true? What if data from 7, 14, 21 and 28 days prior is better? We would like to determine the baseline automatically!
Temporal Trends From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences, 99(8), 5237-5240.
WSARE v3.0 Generate the baseline… • “Taking into account recent flu levels…” • “Taking into account that today is a public holiday…” • “Taking into account that this is Spring…” • “Taking into account recent heatwave…” • “Taking into account that there’s a known natural Food-borne outbreak in progress…” Bonus: More efficient use of historical data
Idea: Bayesian Networks Bayesian Network: A graphical model representing the joint probability distribution of a set of random variables “Patients from West Park Hospital are less likely to be young” “On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems” “On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon” “The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic”
Obtaining Baseline Data 1. Learn Bayesian Network from all historical data 2. Generate baseline given today’s environment (what should be happening today given today’s environment)
Simulation [Diagram: network of simulator variables. Nodes: FLU LEVEL, DAY OF WEEK, SEASON, WEATHER, DATE, REGION, AGE, GENDER, Region Anthrax Concentration, Region Grassiness, Region Food Condition, Outside Activity, Immune System, Heart Health, Has Anthrax, Has Flu, Has Cold, Has Allergy, Has Sunburn, Has Heart Attack, Has Food Poisoning, Disease, Actual Symptom, REPORTED SYMPTOM, ACTION, DRUG.] Actions: None, Purchase Medication, ED visit, Absent. If Action is not None, output record to dataset.
Simulation • 100 different data sets • Each data set consisted of a two year period • Anthrax release occurred at a random point during the second year • Algorithms allowed to train on data from the current day back to the first day in the simulation • Any alerts before actual anthrax release are considered a false positive • Detection time calculated as first alert after anthrax release. If no alerts raised, cap detection time at 14 days
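The evaluation rules above can be sketched for one simulated data set as follows. The function name and the day-index representation are illustrative. Two details are assumptions: an alert on the release day itself counts as a detection with time 0, and detection times later than 14 days are also capped at 14.

```python
def evaluate_alerts(alert_days, release_day, cap_days=14):
    """Return (false positives, detection time in days) for one data set.

    alert_days and release_day are day indices within the simulation.
    """
    # Alerts strictly before the release are false positives.
    false_positives = sum(1 for day in alert_days if day < release_day)
    # Assumption: an alert on the release day itself counts as a detection.
    after_release = [day for day in alert_days if day >= release_day]
    if after_release:
        detection_time = min(min(after_release) - release_day, cap_days)
    else:
        detection_time = cap_days  # no alert raised after the release
    return false_positives, detection_time


# Hypothetical alert days for a release on day 400.
summary = evaluate_alerts([100, 250, 403, 410], release_day=400)
```

Averaging these two numbers over the 100 data sets would give the false positive count and detection time for each algorithm and alert threshold.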
Simulation Plot Anthrax release (not highest peak)
Summary • Summarization of what is new and interesting • Key ideas • search many possible findings • compare to past data and expected data • avoid overfitting • focus on actionable changes • Example systems • KEFIR (GTE, 1992-1995) • WSARE (CMU/Pitt, 2002-2003)