Mining Episode Rules in STULONG dataset. N. Méger (1), C. Leschi (1), N. Lucas (2) & C. Rigotti (1). (1) INSA Lyon - LIRIS FRE CNRS 2672; (2) Université d’Orsay – LRI CNRS UMR 8623. This work has been partially funded by the European Project AEGIS (IST-2000-26450).
Content • Motivation • About WinMiner • Data Mining Effort • Conclusion
Motivation: Data • STULONG data: a 20-year longitudinal study of risk factors related to atherosclerosis in a population of middle-aged men • Tables ENTRY and CONTROL: • 1216 patients described by: • Identification and social characteristics • Behavior • Health events • Physical and biochemical examinations • From 1 to 21 controls per patient, i.e. a sequence of controls for each patient
Motivation: Medical issues • risk factors have been identified • no treatment is available • a global risk must be considered instead of concentrating prevention efforts on individual risk factors • risky behaviors dramatically increase the emergence of cardiovascular disease, but no one knows when • Open questions: What are the relations between risk factors and clinical demonstration of atherosclerosis? Over which time intervals are these relations valid?
Motivation: WinMiner • WinMiner: a single optimised way to find sequential patterns in data along with their optimal time intervals, under user constraints • WinMiner suggests possible temporal dependencies among occurrences of event types to the experts • WinMiner outputs "small" collections of sequential patterns
About WinMiner: Mining context • large event sequences • episodes & episode rules • [Figure: example episodes and an episode rule built from event types A, B and C]
About WinMiner: Selecting patterns • support: how many times does an episode/episode rule occur within an event sequence? (illustrated on event types A, B, C) • confidence: what is the probability that the RHS of an episode rule occurs, given that its LHS has already occurred? (illustrated on event types A, B, C) • patterns are selected using: • a minimum support threshold • a minimum confidence threshold
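The two measures above can be made concrete with a small sketch. It is not WinMiner's actual counting scheme: as an assumption for illustration, an occurrence is counted once per start time of the first event type, provided the rest of the serial episode can be completed in order within `w` time units of that start. The names `index_events`, `support` and `confidence` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

def index_events(sequence):
    """Map each event type to the sorted list of its timestamps."""
    times = defaultdict(list)
    for t, event_type in sorted(sequence):
        times[event_type].append(t)
    return times

def support(times, episode, w):
    """Count the start times of episode[0] from which the whole serial
    episode can be completed, in order, no later than start + w.
    (Assumed occurrence definition, for illustration only.)"""
    count = 0
    for start in times.get(episode[0], []):
        t, ok = start, True
        for event_type in episode[1:]:
            ts = times.get(event_type, [])
            i = bisect_right(ts, t)  # earliest occurrence strictly after t
            if i == len(ts) or ts[i] > start + w:
                ok = False
                break
            t = ts[i]
        count += ok
    return count

def confidence(times, lhs, rhs, w):
    """support(lhs followed by rhs) / support(lhs), both within span w."""
    s_lhs = support(times, lhs, w)
    return support(times, lhs + rhs, w) / s_lhs if s_lhs else 0.0

# Toy event sequence of (timestamp, event type) pairs.
sequence = [(1, "A"), (2, "B"), (4, "C"), (6, "A"), (7, "B"),
            (12, "A"), (13, "B"), (14, "C")]
times = index_events(sequence)
supp_ab = support(times, ("A", "B"), 3)
supp_abc = support(times, ("A", "B", "C"), 3)
conf_rule = confidence(times, ("A", "B"), ("C",), 3)
```

On this toy sequence, the episode A then B occurs from all three occurrences of A within a span of 3, while only two of them can be extended with C in time, so the rule "A then B implies C" has confidence 2/3 under the assumed definition.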
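About WinMiner: Selecting the optimal window span • [Figure: confidence plotted against window span w, over the window spans for which the episode rule is frequent, with the minimum confidence threshold marked] • The optimal window span is the First Local Maximum (FLM) of the confidence: a window span whose confidence C1 reaches the minimum confidence and is followed by a drop to a confidence C2 <= C1 - C1*decRate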
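The first-local-maximum criterion on this slide can be sketched as a scan over window spans. This is a simplified reading of the figure, not WinMiner's exact definition: the function name `first_local_maximum` and its inputs (dictionaries from window span to confidence and to support) are assumptions made for illustration.

```python
def first_local_maximum(conf_by_w, supp_by_w, min_supp, min_conf, dec_rate):
    """Return (w, C1) for the first local maximum of confidence, or None.

    A span w qualifies when the rule is frequent at w, its confidence C1 is
    at least min_conf and strictly above every earlier frequent span, and a
    later span has confidence C2 <= C1 - C1 * dec_rate before any higher
    confidence is seen.
    """
    best_w = None  # frequent span with the highest confidence seen so far
    for w in sorted(conf_by_w):
        c = conf_by_w[w]
        if best_w is not None and conf_by_w[best_w] >= min_conf:
            c1 = conf_by_w[best_w]
            if c <= c1 - c1 * dec_rate:  # sufficient drop after the peak
                return best_w, c1
        if supp_by_w[w] >= min_supp and (best_w is None or c > conf_by_w[best_w]):
            best_w = w
    return None

# Hypothetical confidence and support curves for one rule.
conf_by_w = {1: 0.2, 2: 0.5, 3: 0.8, 4: 0.75, 5: 0.6, 6: 0.9}
supp_by_w = {1: 5, 2: 12, 3: 20, 4: 22, 5: 25, 6: 30}
flm = first_local_maximum(conf_by_w, supp_by_w,
                          min_supp=10, min_conf=0.7, dec_rate=0.2)
```

With these made-up curves, the peak at w = 3 (C1 = 0.8) is confirmed at w = 5, where the confidence 0.6 falls below 0.8 - 0.8*0.2 = 0.64. The later, higher value at w = 6 is ignored, because only the first local maximum is reported.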
About WinMiner • WinMiner: • checks all possible episode rules satisfying the frequency and confidence thresholds • outputs only the FLM-rules, along with their respective optimal window sizes • uses a maximal gap constraint
DM effort: Aims • Give the medical expert a means to follow the evolution of risk factors together with: (1) the impact of medical intervention (2) modifications in patients’ behavior • In addition, provide: • significant time periods of observation • frequency • probability
DM effort: Data preprocessing • Mainly focused on table CONTROL (1226 patients/10572 examinations) • Join operations to bring in information from table ENTRY • Categorization of some factors • Choice of relevant factors according to: • Medical expertise • Mining approach • Result: table Contr_Mod_2
DM Effort: Data preprocessing • Important factors (according to medical experts): • cholesterol • hypertension • smoking • physical activity • age • diabetes • alcohol consumption • BMI • family anamnesis • level of education
DM Effort: Data preprocessing • From Contr_mod_2 to a large event sequence • For each patient: a subsequence containing all his control examinations • The coding guarantees that events corresponding to 2 different patients cannot be associated in the same episode rule • Large event sequence: concatenation of all the subsequences constructed for the patients.
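The slide does not say how the coding keeps patients apart. One plausible way, shown here purely as an assumption, is to shift each patient's subsequence so that consecutive patients are separated by more than the maximal gap, which prevents any occurrence from spanning two patients. The function `build_event_sequence`, the event names and the month-based timestamps are all hypothetical.

```python
def build_event_sequence(controls_by_patient, max_gap):
    """Concatenate per-patient subsequences into one large event sequence.

    controls_by_patient: {patient_id: [(month, event_type), ...]}
    Each patient's events are shifted so that the distance to the previous
    patient's last event exceeds max_gap (assumed separation scheme).
    """
    sequence = []
    offset = 0
    for patient_id in sorted(controls_by_patient):
        controls = sorted(controls_by_patient[patient_id])
        if not controls:
            continue
        first_month = controls[0][0]
        for month, event_type in controls:
            sequence.append((offset + month - first_month, event_type))
        # Next patient starts more than max_gap after this patient's last event.
        offset = sequence[-1][0] + max_gap + 1
    return sequence

# Two hypothetical patients with made-up event codes.
controls_by_patient = {
    1: [(0, "chol_high"), (12, "chol_normal")],
    2: [(0, "smoker"), (6, "non_smoker")],
}
events = build_event_sequence(controls_by_patient, max_gap=24)
```

Here the second patient's first event lands at time 37, which is 25 time units after the first patient's last event at time 12, so a miner using a maximal gap of 24 cannot link events of the two patients.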
DM effort: Results • Examples: • "If the patient has no hypercholesterolemia, and if he sometimes follows his diet, then the patient has no hypercholesterolemia with a probability of 0.8 within 40 months. This rule is supported by 201 examples in the event sequence." • "If the patient eats less fat and carbohydrates, and claudication is observed some time later, then this claudication does not disappear with a probability of 0.8 over 30 months. This rule is supported by 21 examples."
DM effort: Results • Well-known phenomena: • an indication that both the pre-processing and the mining of the data are correct • Added value: suggestions concerning their temporal aspects • To be expected: • with new data and the new risk factors identified in the last decade, discovery of new phenomena along with their optimal window sizes
Conclusion • With STULONG data: search for temporal dependencies between atherosclerosis risk factors and clinical demonstration of atherosclerosis that have an optimal interval/window size • Offers the medical expert a way to make explicit the impact of a risk factor and to refine its role relative to other factors within a time interval • Only a few episode rules are obtained, which allows experts to analyse the outputs manually • Could be applied to other medical data sets to help find unknown phenomena • New perspectives for both data miners and physicians