
Mining Episode Rules in STULONG dataset

N. Méger (1), C. Leschi (1), N. Lucas (2) & C. Rigotti (1). (1) INSA Lyon - LIRIS FRE CNRS 2672; (2) Université d’Orsay – LRI CNRS UMR 8623. This work has been partially funded by the European Project AEGIS (IST-2000-26450).





Presentation Transcript


  1. Mining Episode Rules in STULONG dataset • N. Méger (1), C. Leschi (1), N. Lucas (2) & C. Rigotti (1) • (1) INSA Lyon - LIRIS FRE CNRS 2672 • (2) Université d’Orsay – LRI CNRS UMR 8623 • This work has been partially funded by the European Project AEGIS (IST-2000-26450).

  2. Content • Motivation • About WinMiner • Data Mining Effort • Conclusion

  3. Motivation: Data • STULONG data: a 20-year longitudinal study of risk factors related to atherosclerosis in a population of middle-aged men • Tables ENTRY and CONTROL: • 1216 patients described by: • Identification and social characteristics • Behavior • Health events • Physical and biochemical examinations • From 1 up to 21 controls per patient → a sequence of controls for each patient

  4. Motivation: Medical issues • Risk factors have been identified • No treatment is available • A global risk must be considered instead of concentrating prevention efforts on individual risk factors • Risky behaviors dramatically increase the emergence of cardiovascular disease, but no one knows when → Relations between risk factors and clinical manifestation of atherosclerosis? → Time intervals over which these relations are valid?

  5. Motivation: WinMiner • WinMiner: a single optimised way to find sequential patterns in data along with their optimal time intervals, under user constraints • WinMiner suggests possible temporal dependencies among occurrences of event types to experts • WinMiner outputs "small" collections of sequential patterns

  6. About WinMiner: Mining context • large event sequences • episodes & episode rules • [Figure: examples built from the event types A, B and C]

  7. About WinMiner: Selecting patterns • support: how many times does an episode/episode rule occur within an event sequence? (e.g. A → B, A → B → C) • confidence: what is the probability that the RHS of an episode rule occurs, knowing that its LHS has already occurred? (e.g. A → B ⇒ C) • patterns are selected using: • a minimum support threshold • a minimum confidence threshold
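The two measures on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WinMiner's actual implementation: the event sequence is assumed to be a time-ordered list of `(time, event_type)` pairs with distinct timestamps, episodes are assumed to be serial (ordered) tuples of event types, occurrences are counted as minimal occurrences whose span is at most `w`, and the function names are hypothetical.

```python
def minimal_occurrences(seq, episode):
    """Return sorted (start, end) times of the minimal occurrences of a serial episode.

    seq: list of (time, event_type), sorted by time, timestamps assumed distinct.
    episode: tuple of event types that must occur in this order.
    """
    found = {}  # end time -> latest start time whose earliest completion is that end
    for i, (t_start, etype) in enumerate(seq):
        if etype != episode[0]:
            continue
        pos, t_end = i, t_start
        for wanted in episode[1:]:
            # earliest later event of the wanted type
            pos = next((j for j in range(pos + 1, len(seq))
                        if seq[j][1] == wanted), None)
            if pos is None:
                break
            t_end = seq[pos][0]
        else:
            # starts are visited in increasing order, so a later start
            # reaching the same end replaces the earlier (non-minimal) one
            found[t_end] = t_start
    return sorted((s, e) for e, s in found.items())


def support(seq, episode, w):
    """Number of minimal occurrences whose span (end - start) is at most w."""
    return sum(1 for s, e in minimal_occurrences(seq, episode) if e - s <= w)


def confidence(seq, lhs, rhs, w):
    """Simplified confidence of the rule lhs => rhs for window span w."""
    lhs_support = support(seq, lhs, w)
    if lhs_support == 0:
        return 0.0
    return support(seq, lhs + (rhs,), w) / lhs_support


# Toy sequence (invented for illustration, not STULONG data)
events = [(1, "A"), (2, "B"), (3, "C"), (5, "A"), (6, "B"),
          (9, "A"), (10, "B"), (11, "C")]
print(support(events, ("A", "B"), 2))          # occurrences of A -> B
print(confidence(events, ("A", "B"), "C", 2))  # confidence of A -> B => C
```

A rule is then kept when both values reach the user-supplied minimum support and minimum confidence thresholds.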

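  8. About WinMiner • Selecting the optimal window span • [Figure: confidence plotted against window span w. The First Local Maximum (FLM) is reached at a confidence C1 that is at least the minimum confidence, for a window span such that the episode rule is frequent, and is followed by a drop to a confidence C2 <= C1 - C1*decRate. The window span at the FLM is the optimal window span.]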
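The selection of the optimal window span can be sketched as follows. This is a minimal reading of the slide, not WinMiner's code: it assumes the support and confidence of one rule have already been computed for a list of increasing window spans, and the function name and the `(span, support, confidence)` layout are hypothetical. A span qualifies when the rule is frequent and confident there and the confidence later falls to at most `C1 - C1*dec_rate` without first rising above `C1`.

```python
def first_local_maximum(points, min_sup, min_conf, dec_rate):
    """Return (window_span, confidence) of the first local maximum, or None.

    points: list of (window_span, support, confidence), sorted by window_span.
    """
    for i, (w, sup, c1) in enumerate(points):
        if sup < min_sup or c1 < min_conf:
            continue  # rule not frequent or not confident for this span
        for _, _, c2 in points[i + 1:]:
            if c2 > c1:
                break  # confidence still rising: w is not a local maximum
            if c2 <= c1 - c1 * dec_rate:
                return w, c1  # sufficient drop observed after w
    return None


# Invented values for illustration only
curve = [(10, 50, 0.5), (20, 60, 0.6), (30, 70, 0.8), (40, 75, 0.7), (50, 80, 0.5)]
print(first_local_maximum(curve, min_sup=20, min_conf=0.6, dec_rate=0.3))
```

With a confidence curve that only increases, the function returns `None`, which corresponds to a rule for which no optimal window span is reported.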

  9. About WinMiner • WinMiner: • checks all possible episode rules satisfying the frequency and confidence thresholds • outputs only the FLM-rules, along with their respective optimal window sizes • uses a maximal gap constraint

  10. DM effort: Aims • Give the medical expert a means to follow the evolution of risk factors together with: (1) the impact of medical intervention (2) modifications in patients’ behavior • In addition: • significant time periods of observation • frequency • probability

  11. DM effort: Data preprocessing • Mainly focused on table CONTROL (1226 patients/10572 examinations) • Join operations to bring in information from table ENTRY • Categorization of some factors • Choice of relevant factors according to: • Medical expertise • Mining approach → Table Contr_Mod_2
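As an illustration of what categorizing a factor can look like, the sketch below turns a numeric BMI value into a discrete event type. The slides do not say which cut-offs or labels were used for STULONG; the standard WHO BMI classes are used here purely as an assumption, and the label names are hypothetical.

```python
def bmi_event(bmi):
    """Map a numeric BMI (kg/m^2) to a categorical event type.

    Cut-offs follow the WHO classes (assumed here, not taken from the slides).
    """
    if bmi < 18.5:
        return "BMI_underweight"
    if bmi < 25:
        return "BMI_normal"
    if bmi < 30:
        return "BMI_overweight"
    return "BMI_obese"


print(bmi_event(27.3))
```

Each control examination can then contribute one such categorical event per selected factor to the patient's sequence.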

  12. DM Effort: Data preprocessing • Important factors (according to medical experts): • cholesterol • hypertension • smoking • physical activity • age • diabetes • alcohol consumption • BMI • family anamnesis • level of education

  13. DM Effort: Data preprocessing • Contr_Mod_2 → large event sequence • For each patient: a subsequence containing all of his control examinations • The coding guarantees that events corresponding to two different patients cannot be associated in the same episode rule • Large event sequence: concatenation of all the subsequences constructed for the patients.
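The slides do not describe the coding itself. One possible way to obtain the stated guarantee, shown here only as an assumption, is to shift each patient's timestamps so that consecutive patients are separated by more than the maximal gap allowed between events of an episode. The function name, the input layout and the event labels below are hypothetical.

```python
def build_event_sequence(controls, max_gap):
    """Concatenate per-patient subsequences into one large event sequence.

    controls: dict mapping patient id -> list of (month, event_type).
    max_gap: maximal time gap allowed between consecutive events of an episode.
    Patients are separated by more than max_gap, so no episode can span two patients.
    """
    sequence = []
    offset = 0
    for patient_id in sorted(controls):
        events = sorted(controls[patient_id])
        if not events:
            continue
        first_month = events[0][0]
        for month, event_type in events:
            sequence.append((offset + month - first_month, event_type))
        # next patient starts more than max_gap after this patient's last event
        offset = sequence[-1][0] + max_gap + 1
    return sequence


# Invented records for illustration only
controls = {
    1: [(0, "chol_high"), (12, "chol_normal")],
    2: [(3, "smoker"), (15, "non_smoker")],
}
print(build_event_sequence(controls, max_gap=60))
```

The design choice here is to rely on the maximal gap constraint rather than on patient-specific event names, so that the same rule can still be supported by occurrences from many patients.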

  14. DM effort: Results • Examples: • "If the patient has no hypercholesterolemia, and if he sometimes follows his diet, then the patient has no hypercholesterolemia with a probability of 0.8 within 40 months. This rule is supported by 201 examples in the event sequence." • "If the patient eats less fat and carbohydrates and claudication is observed some time later, then this claudication does not disappear with a probability of 0.8 over 30 months. This rule is supported by 21 examples."

  15. DM effort: Results • Well-known phenomena: • an indication that both the pre-processing and the mining of the data are correct • Added value: suggestions concerning their temporal aspects • To be expected: • with new data and the new risk factors brought to light in the last decade, the discovery of new phenomena along with their optimal window sizes

  16. Conclusion • With STULONG data: searching for temporal dependencies between atherosclerosis risk factors and clinical manifestation of atherosclerosis that have an optimal interval/window size • Offers the medical expert a way to make explicit the impact of a risk factor and to refine its role relative to other factors within a time interval • Only a few episode rules are obtained, which allows experts to analyse the outputs manually • Could be applied to other medical data sets to help find unknown phenomena → New perspectives both for data miners and physicians
