Data Mining of Blood Handling Incident Databases
Costas Tsatsoulis
Information and Telecommunication Technology Center
Dept. of Electrical Engineering and Computer Science
University of Kansas
tsatsoul@ittc.ku.edu
Background
• Incident reports collected for the handling of blood products
• An initial database was assembled to allow experimentation
• Goals:
  • Allow the generation of intelligence from data
    • Unique events
    • Event clusters
    • Event trends
    • Frequencies
  • Simplify the job of the QA staff
    • Similar reports
    • Less need for in-depth causal analysis
  • Allow cross-institutional analysis
Institute of Medicine Recommendations (November 1999)
• Establish a national focus of research to enhance the knowledge base about patient safety
• Identify and learn from errors through both mandatory and voluntary reporting systems
• Raise standards and expectations through oversight organizations
• Create safety systems through implementation of safe practices at the delivery level
Near-Miss Event Reporting
• Useful database for studying a system's failure points
• Many more near misses than actual adverse events
• Source of data for studying human recovery
• Dynamic means of understanding system operations
The Iceberg Model of Near-Miss Events
• 1 in 2,000,000: fatalities
• 1 in 38,000: ABO-incompatible transfusions
• 1 in 14,000: incorrect units transfused
• [Iceberg diagram: the reported events above are the visible tip; near-miss events form the much larger submerged base]
Intelligent Systems
• Developed two separate systems:
  • Case-Based Reasoning (CBR)
  • Information Retrieval (IR)
• Goal was to address most of the needs of the users:
  • Allow the generation of intelligence from data
    • Unique events
    • Event clusters
    • Event trends
    • Frequencies
  • Simplify the job of the QA staff
    • Similar reports
    • Less need for in-depth causal analysis
  • Allow cross-institutional analysis
Case-Based Reasoning
• A technique from Artificial Intelligence that solves problems based on previous experiences
• Of significance to us:
  • CBR must identify a similar situation/problem to know what to do and how to solve the problem
  • Use CBR's concept of "similarity" to identify:
    • similar reports
    • report clusters
    • frequencies
What Is a Case and How Do We Represent It?
• An incident report is a "case"
• Cases are represented by:
  • Indexes
    • descriptive features of a situation
    • surface or in-depth, or both
  • Their values
    • symbolic: "Technician"
    • numerical: "103 rpm"
    • sets: "{Monday, Tuesday, Wednesday}"
    • other (text, images, ...)
  • Weights
    • indicate the descriptive significance of the index
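To make this representation concrete, here is a minimal sketch in Python of a case as a set of indexed attributes, each carrying a value and a weight. This is an illustration only, not the project's actual implementation; the class names, field names, and the example attribute values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class CaseAttribute:
    """One index of a case: its value plus a weight giving its descriptive significance."""
    value: Any            # symbolic ("Technician"), numerical (103), a set, or free text
    weight: float = 1.0   # descriptive significance of this index

@dataclass
class IncidentCase:
    """An incident report represented as a CBR case."""
    case_id: str
    attributes: Dict[str, CaseAttribute] = field(default_factory=dict)

# Hypothetical blood-handling incident represented as a case
report = IncidentCase(
    case_id="R-0042",
    attributes={
        "reporter_role": CaseAttribute("MLT", weight=0.8),
        "location":      CaseAttribute("OR", weight=0.6),
        "shift":         CaseAttribute("12-4am", weight=0.4),
        "causal_code":   CaseAttribute("labeling error", weight=1.0),
    },
)
```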
Finding Similarity
• Define degrees of matching between attributes of an event report. For example:
  • "Resident" and "MD" are similar
  • "MLT," "MT," and "QA/QC" are similar
• A value may match perfectly or partially:
  • "MLT" to "MLT" (perfect)
  • "MLT" to "MT" (partial)
• Different attributes of the event report are weighted
• The sum of the matching attributes, scaled by their degree of match and their weights, defines similarity
• Cases matching above some predefined degree of similarity are retrieved and considered similar
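Continuing the sketch above, a hedged illustration of this weighted similarity computation: each attribute contributes its degree of match scaled by its weight, and the normalized sum is compared against a retrieval threshold. The partial-match scores and the 0.7 threshold below are made-up values, not the ones used in the system.

```python
# Hypothetical partial-match table: 1.0 = perfect match, 0.0 = no match
PARTIAL_MATCH = {
    ("MLT", "MT"): 0.7,
    ("MT", "QA/QC"): 0.7,
    ("Resident", "MD"): 0.8,
}

def attribute_match(a, b) -> float:
    """Degree of match between two attribute values."""
    if a == b:
        return 1.0
    return PARTIAL_MATCH.get((a, b), PARTIAL_MATCH.get((b, a), 0.0))

def case_similarity(query: IncidentCase, stored: IncidentCase) -> float:
    """Weighted, normalized sum of attribute matches (0..1)."""
    total, weight_sum = 0.0, 0.0
    for name, q_attr in query.attributes.items():
        s_attr = stored.attributes.get(name)
        if s_attr is None:
            continue
        total += q_attr.weight * attribute_match(q_attr.value, s_attr.value)
        weight_sum += q_attr.weight
    return total / weight_sum if weight_sum else 0.0

def retrieve(query, case_base, threshold=0.7):
    """Return cases whose similarity to the query meets the threshold, best first."""
    scored = [(case_similarity(query, c), c) for c in case_base]
    scored = [sc for sc in scored if sc[0] >= threshold]
    return sorted(scored, key=lambda sc: sc[0], reverse=True)
```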
Information Retrieval
• Index, search, and recall text without any domain information
• Preprocess each document:
  • remove stop words
  • stemming
• Use some representation for documents:
  • vector-space model: a vector of terms, each with weight = tf × idf
    • tf (term frequency) = (frequency of the word) / (frequency of the most frequent word in the document)
    • idf (inverse document frequency) = log10((total number of documents) / (number of documents containing the term))
• Use some similarity metric between documents:
  • vector algebra to find the cosine of the angle between the document vectors
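A bare-bones sketch of the vector-space model with the tf × idf weighting and cosine measure described above, written directly from the slide's definitions rather than from the project's code; the sample documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf*idf vectors for a list of tokenized documents."""
    n_docs = len(docs)
    df = Counter()                      # number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_freq = max(counts.values())  # frequency of the most frequent word
        vec = {
            term: (freq / max_freq) * math.log10(n_docs / df[term])
            for term, freq in counts.items()
        }
        vectors.append(vec)
    return vectors

def cosine(v1, v2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Hypothetical usage with already-tokenized incident descriptions
docs = [["unit", "mislabeled", "in", "or"], ["wrong", "unit", "issued"], ["unit", "mislabeled"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[2]))
```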
CBR for
• From the incident report features, selected a subset as indexes
• Semantic similarity defined, e.g.:
  • (OR, ER, ICU, L&D)
  • (12-4am, 4-8am), (8am-12pm, 12-4pm), (4-8pm, 8pm-12am)
• Domain-specific details defined
• Weights assigned:
  • fixed
  • conditional: e.g., the weight of some causal codes depends on whether they were established using a rough or an in-depth analysis
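One plausible way to encode these semantic similarity groups and the conditional causal-code weights, extending the earlier similarity sketch; the group memberships come from the slide, but the 0.6 in-group match score and the 0.5 weight discount are illustrative assumptions.

```python
# Values within a group are treated as partial matches (illustrative in-group score)
SEMANTIC_GROUPS = [
    {"OR", "ER", "ICU", "L&D"},     # locations considered similar
    {"12-4am", "4-8am"},            # overnight shifts
    {"8am-12pm", "12-4pm"},         # day shifts
    {"4-8pm", "8pm-12am"},          # evening shifts
]

def semantic_match(a, b, in_group_score=0.6):
    """1.0 for identical values, a partial score for values in the same semantic group."""
    if a == b:
        return 1.0
    for group in SEMANTIC_GROUPS:
        if a in group and b in group:
            return in_group_score
    return 0.0

def causal_code_weight(base_weight, analysis_depth):
    """Conditional weight: trust a causal code more when it came from in-depth analysis."""
    return base_weight if analysis_depth == "in-depth" else base_weight * 0.5
```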
IR for
• No deletion of stop words (e.g., the conjunction "or" vs. "OR" the operating room)
• No stemming
• Use the vector-space model and the cosine comparison measure
Experiments
• Database of approx. 600 cases
• Selected 24 reports to match against the case base
• Experiment 1: CBR retrieval (CBR_match_value)
• Experiment 2: IR retrieval (IR_match_value)
• Experiments 3-11: combined retrieval
  • score = W_CBR * CBR_match_value + W_IR * IR_match_value
  • weights range from 0.9 to 0.1 in increments of 0.1: (0.9, 0.1), (0.8, 0.2), (0.7, 0.3), ..., (0.2, 0.8), (0.1, 0.9)
• Experiment 12: CBR retrieval with all weights set to 1
• No retrieval threshold set
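A small sketch of the combined retrieval score and the weight sweep used in experiments 3-11, assuming CBR and IR match values are normalized to [0, 1]; the function names and the top-5 cutoff mirroring the evaluation are hypothetical scaffolding.

```python
def combined_score(cbr_match, ir_match, w_cbr, w_ir):
    """Linear combination of the CBR and IR match values."""
    return w_cbr * cbr_match + w_ir * ir_match

# Weight pairs for experiments 3-11: (0.9, 0.1), (0.8, 0.2), ..., (0.1, 0.9)
weight_pairs = [(round(w, 1), round(1.0 - w, 1))
                for w in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)]

def rank_reports(query_scores, w_cbr, w_ir, top_k=5):
    """query_scores: list of (report_id, cbr_match, ir_match). Returns the top-k reports."""
    ranked = sorted(
        query_scores,
        key=lambda r: combined_score(r[1], r[2], w_cbr, w_ir),
        reverse=True,
    )
    return ranked[:top_k]
```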
Evaluation
• Collected the top 5 cases for each report in each experiment
• Because of duplication, each report had 10-20 distinct cases retrieved across all 12 experiments
• A random case was added to each set
• Results were sent to experts, who rated each retrieved case as:
  • Almost Identical
  • Similar
  • Not Very Similar
  • Not Similar At All
Preliminary Analysis
• Determine agreement/disagreement with the expert's analysis:
  • Is a case similar?
  • Is a case dissimilar?
• Establish accuracy (recall is more difficult to measure)
• False positives vs. false negatives
• What is the influence of the IR component?
• Are the weights appropriate?
• What is the influence of varying selection thresholds?
Combined Results
[Chart: results of the combined CBR/IR retrieval, plotted against an increasing selection threshold]
Some Preliminary Conclusions
• The weights used in CBR seem to be appropriate and definitely improve retrieval
• In CBR, increasing the acceptance threshold improves selection of retrievable cases but also increases the number of false positives
• IR does an excellent job of identifying non-retrievable cases
• Even a 10% contribution of IR to the combined score greatly helps in identifying non-retrievable cases
Future Work
• Plot performance versus acceptance threshold to identify the best case-selection threshold
• Integrate the analysis of the second expert
• Examine how CBR and IR can be combined to exploit each one's strengths:
  • CBR performs the initial retrieval
  • IR eliminates bad cases that were retrieved
• Look into the temporal distribution of retrieved reports and adjust their matching accordingly
• Examine a natural language understanding (NLU) system for incident reports that have longer textual descriptions
• Re-run the experiments on different datasets
• Obtain large datasets and perform other types of data mining (rule induction, predictive models, probability networks, supervised and unsupervised clustering, etc.)