Trend Analysis and Risk Identification

Lenka Nováková (1), Jiří Kléma (1), Michal Jakob (1), Simon Rawles (2), Olga Štěpánková (1)
(1) The Gerstner Laboratory for Intelligent Decision Making and Control, Czech Technical University, Prague
(2) Department of Computer Science, University of Bristol, Bristol, UK
PKDD 2003, Discovery Challenge

Outline
• STULONG data, orientation towards CVD
• Tools used
  • SumatraTT, Statistica, Weka
• Techniques used
  • mainly statistical tests: ANOVA, chi-square, etc.
• Exploratory analysis and subgroup discovery
  • Entry table
• Trend analysis
  • Entry and Control tables
  • three principal ways of preprocessing
  • derived aggregated attributes
  • univariate and multivariate analysis

STULONG Data
• Four tables: Entry, Control, Letter, Death
• Dependent variable: CVD
  • CardioVascular Disease
  • boolean attribute derived from the A2 questionnaire (Control table)
• CVD = false: the patient has no coronary disease.
• CVD = true: at least one of the attributes Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14 is true. Conditions covered:
  • positive angina pectoris
  • (silent) myocardial infarction
  • ischaemic heart disease
  • cerebrovascular accident
• Patients who have only diabetes (Hodn4) or cancer (Hodn15) are removed.
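
The derivation of the CVD attribute can be sketched as follows. This is a minimal illustration, not the authors' preprocessing: the record layout (a dict of boolean Hodn flags per patient), the function name and the sample patients are hypothetical, and "only" is read here as "diabetes or cancer without any of the CVD attributes".

```python
# Attributes named on the slide; the dict-based record layout is an assumption.
CVD_FLAGS = ("Hodn1", "Hodn2", "Hodn3", "Hodn11", "Hodn13", "Hodn14")
EXCLUDE_FLAGS = ("Hodn4", "Hodn15")  # diabetes, cancer


def derive_cvd(record):
    """Return True/False for CVD, or None when the patient is removed."""
    if any(record.get(flag, False) for flag in CVD_FLAGS):
        return True
    if any(record.get(flag, False) for flag in EXCLUDE_FLAGS):
        return None  # diabetes or cancer only: excluded from the analysis
    return False


# Invented patients, keyed by ICO, purely for illustration.
patients = {
    101: {"Hodn2": True},
    102: {"Hodn4": True},
    103: {},
    104: {"Hodn1": True, "Hodn15": True},
}
labels = {ico: derive_cvd(rec) for ico, rec in patients.items()}
kept = {ico: cvd for ico, cvd in labels.items() if cvd is not None}
```

In this sketch patient 102 is dropped, while patient 104 stays in the CVD = true group because a CVD attribute is set alongside cancer.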

ENTRY - subgroup discovery
• AQ no. 6: Are there any differences in the ENTRY examination for different CVD groups?
• Statistica 6.0
  • module for interactive decision tree induction
  • two-tailed t-test or chi-square test to assess the significance of subgroups
• Dependencies are relatively weak
• Interesting dependencies found
  • social characteristics: derived attribute AGE_of_ENTRY
  • alcohol: positive effect of beer, no effect of wine
  • sugar consumption increases CVD risk
  • well-known dependencies are not mentioned (smoking, BMI, cholesterol)

ENTRY - general model
• General CVD model (in WEKA)
  • feature selection + modeling (e.g., decision trees)
  • tends to generate trivial models (always predicting false)
  • asymmetric error-cost matrix does not help
• Predict CVDrisk
  • Identify principal variables (chi-squared test)
  • Naïve Bayes + ROC evaluation
  • three independent variables
    • discretized AGE_of_ENTRY
    • discretized BMI
    • Cholrisk, derived from CHLST
  • AUC = 0.66
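
For readers less familiar with ROC evaluation, the AUC figure can be read as the probability that a randomly chosen CVD patient receives a higher score than a randomly chosen non-CVD patient. The sketch below computes it by direct pair counting; it is an illustration with invented scores, not the WEKA evaluation used in the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pair counting (ties count as 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented scores and labels, purely for illustration.
scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [True, False, True, False, False]

model_auc = auc(scores, labels)
trivial_auc = auc([0.0] * len(labels), labels)      # constant prediction
inverted_auc = auc([-s for s in scores], labels)    # reversed ranking
```

A model that gives every patient the same score gets an AUC of 0.5, which is why a trivial "always false" model is uninformative. A value below 0.5 means the ranking is inverted.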

CONTROL - trend analysis
• AQ no. 7: Are there any differences in the development of risk factors for different CVD groups?
• ENTRY table: ICO (primary key), year of birth, year of entry, smoking, alcohol, cholesterol, body mass index, blood pressure
• CONTR table: ICO, risk factors followed during 20 years

Global Approach
• Risk factors to be observed are selected
  • SYST, DIAST, TRIGL, BMI, CHLSTMG
• Selected control examinations are transformed
  • pivoting
• Patients with no control entries are removed
  • about 60 patients
• Trend aggregates are calculated
[Figure: pivoted table with one row per patient (ICO_1, ICO_2, ...) and columns ICO, Entry, Contr1, Contr2, ..., ContrM, Aggr1, ..., AggrN]
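
The pivoting step can be sketched as below: control records stored one row per examination are regrouped into one series per patient and risk factor. The row layout (`ICO`, `year`, `month` keys), the function name and the sample rows are assumptions made for illustration; the actual transformation in the study was done with the tools listed in the outline.

```python
from collections import defaultdict

FACTORS = ("SYST", "DIAST", "TRIGL", "BMI", "CHLSTMG")


def pivot_controls(control_rows):
    """Turn one-row-per-examination records into per-patient series."""
    by_patient = defaultdict(list)
    for row in control_rows:
        by_patient[row["ICO"]].append(row)

    pivoted = {}
    for ico, rows in by_patient.items():
        rows.sort(key=lambda r: (r["year"], r["month"]))  # chronological order
        series = {f: [r.get(f) for r in rows] for f in FACTORS}
        # Decimal time: year + month / 12
        series["time"] = [r["year"] + r["month"] / 12 for r in rows]
        pivoted[ico] = series
    return pivoted


# Invented rows, purely for illustration.
control_rows = [
    {"ICO": 1, "year": 1980, "month": 6, "SYST": 125, "DIAST": 80},
    {"ICO": 1, "year": 1978, "month": 3, "SYST": 130, "DIAST": 85},
    {"ICO": 2, "year": 1979, "month": 12, "SYST": 140, "DIAST": 90},
]
entry_icos = [1, 2, 3]

pivoted = pivot_controls(control_rows)
# Patients without any control entry (here ICO 3) are removed.
kept = [ico for ico in entry_icos if ico in pivoted]
```

Factors that were not measured at an examination appear as `None` in the series, so every factor keeps the same length per patient.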

Derived trend attributes
[Figure: observed variable y plotted against x (decimal time ~ year + 1/12 month, referential time 1975); derived attributes: mean, standard deviation, gradient, intercept, correlation coefficient]
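
One way to compute such trend attributes is an ordinary least-squares line through a patient's measurements. The sketch below assumes that time is measured from the referential year 1975 (so the intercept is the fitted value at 1975) and that the standard deviation is the sample standard deviation; neither detail is confirmed by the slides, and the function name is hypothetical.

```python
import math

REFERENCE_YEAR = 1975  # referential time named on the slide


def trend_aggregates(times, values, reference=REFERENCE_YEAR):
    """Mean, standard deviation, gradient, intercept and correlation of y over time."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two paired observations")
    xs = [t - reference for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    if sxx == 0:
        raise ValueError("all examinations share the same time")
    gradient = sxy / sxx
    return {
        "mean": mean_y,
        "std": math.sqrt(syy / (n - 1)),
        "gradient": gradient,
        "intercept": mean_y - gradient * mean_x,
        # A constant series has no defined correlation; 0.0 is a convention here.
        "correlation": sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0,
    }


# Invented SYST readings rising by 2 units per year.
agg = trend_aggregates([1976.0, 1977.0, 1978.0, 1979.0], [120, 122, 124, 126])
```

With only a handful of points per patient the gradient is a noisy estimate, which matters when patients have different numbers of examinations.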

Global Approach - results
• The derived aggregates were discretized
  • e.g., the gradient can be strongly decreasing, decreasing, constant, increasing, strongly increasing
• Chi-square test for independence with respect to CVD
• A large number of aggregates, including gradients, proved to be significant (chi-square test, significance level 0.05)
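
The discretization and the independence test can be sketched as follows. The cut-off values `weak` and `strong` are hypothetical (the slides do not give the thresholds), the sample data are invented, and the code computes only the Pearson chi-square statistic and its degrees of freedom; the p-value would come from a statistics package or a table.

```python
from collections import Counter

CATEGORIES = ("strongly decreasing", "decreasing", "constant",
              "increasing", "strongly increasing")


def discretize_gradient(g, weak=0.5, strong=2.0):
    """Map a gradient to one of five categories (thresholds are assumptions)."""
    if g <= -strong:
        return CATEGORIES[0]
    if g <= -weak:
        return CATEGORIES[1]
    if g < weak:
        return CATEGORIES[2]
    if g < strong:
        return CATEGORIES[3]
    return CATEGORIES[4]


def contingency_table(categories, cvd_flags):
    """Rows: CVD false/true; columns: categories that actually occur."""
    counts = Counter(zip(cvd_flags, categories))
    used = [c for c in CATEGORIES if counts[(False, c)] + counts[(True, c)] > 0]
    return [[counts[(flag, c)] for c in used] for flag in (False, True)]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat, (len(row_totals) - 1) * (len(col_totals) - 1)


# Invented gradients and CVD flags, far too few for a real test.
gradients = [-3.0, -1.0, 0.0, 0.1, 1.0, 3.0, 2.5, -0.2]
cvd = [False, False, False, False, True, True, True, False]
table = contingency_table([discretize_gradient(g) for g in gradients], cvd)
stat, dof = chi_square(table)
```

The statistic is compared with the chi-square critical value for `dof` degrees of freedom at the chosen significance level (for example 9.488 for 4 degrees of freedom at 0.05).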

[Figure: discretized gradient categories: strongly decreasing, decreasing, constant, increasing, strongly increasing]

ControlCount vs. CVD
• ControlCount
  • number of examinations
  • strong relation with CVD
  • AUC = 0.35
  • the higher the ControlCount, the lower the CVD risk (AUC below 0.5)
  • anachronistic attribute (its value is known only once the follow-up is over)
  • introduced by the design of the study
• ControlCount influences the trend aggregates, e.g., how steep the gradients tend to be
• Conclusion: the global approach cannot be applied (at least with these aggregates)

Windowing Approach I.
• The same risk factors, the same pivoting transformation and similar trend aggregates
• BUT a constant number of examinations
• Issues:
  • window
    • time period vs. number of examinations
    • 5 examinations are enough to express a trend
    • patients : records (1 : ControlCount - 3)
    • entry is used as the first examination
    • records are dependent
  • CVD classification
    • time from the last examination to CVD
    • yes/no (yes = CVD in the next year or CVD in future)
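
The record count of ControlCount - 3 per patient follows from sliding a window of 5 examinations over the entry examination plus ControlCount controls. A minimal sketch, with hypothetical function names and invented examination times:

```python
WINDOW_SIZE = 5  # number of examinations per window, as stated on the slide


def sliding_windows(exam_times, window=WINDOW_SIZE):
    """All runs of `window` consecutive examinations (entry comes first)."""
    return [exam_times[i:i + window]
            for i in range(len(exam_times) - window + 1)]


def time_to_cvd(window, cvd_time):
    """Years from the last examination in the window to CVD; None if CVD never appears."""
    if cvd_time is None:
        return None
    return cvd_time - window[-1]


# Invented patient: entry in 1976 followed by 7 yearly controls.
control_count = 7
exam_times = [1976.0 + k for k in range(control_count + 1)]
windows = sliding_windows(exam_times)
```

Here 8 examinations give 4 windows, i.e. ControlCount - 3 records. Neighbouring windows share 4 of their 5 examinations, which is why the records are dependent. A patient with fewer than 5 examinations produces no record.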

[Figure: Windowing Approach I. Labels: Entry, Data, First vector, New vector]

Aggregate tests
[Table: Statistica t-test output; grouping: Time_round (Trend_all_nahrady in Trend_analysis.stw); Group 1: 1000, Group 2: 1]
• Trend aggregates approach the normal distribution in both specified CVD groups
• Two groups were selected: CVD never appears in the future (1000) vs. CVD appears at the next examination (1)
• A t-test for comparing the group means can be applied (p <= 0.05)
• Do the means of the calculated aggregates differ between the CVD groups?
  • Only a few of them do
  • only two gradients (!) are clearly significant
    • SYST and DIAST
  • two significant intercepts
    • TRIGL and CHLST
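
The comparison of group means can be sketched with a two-sample t statistic. The pooled-variance (Student) form is used here as an assumption, since the slides do not say which variant was run in Statistica; the two samples are invented.

```python
import math
from statistics import mean, variance


def two_sample_t(a, b):
    """Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    dof = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / dof
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, dof


# Invented gradients for the two groups (coded 1000 and 1 on the slide).
never_cvd = [1, 2, 3, 4, 5]
cvd_next_exam = [3, 4, 5, 6, 7]
t_stat, dof = two_sample_t(never_cvd, cvd_next_exam)
```

For a two-tailed test, the absolute value of the statistic is compared with the critical value of the t distribution for `dof` degrees of freedom, or a statistics package is used to obtain the p-value directly.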

Further tests of SYST, DIAST
[Table: Statistica t-test output; grouping: Time_round (Trend_all_nahrady in Trend_analysis.stw); Group 1: 1000, Group 2: 1]
• Test the gradients for all the CVD groups, not only the two extreme groups
• Repeated ANOVA can be applied: development of the SYST/DIAST trend for different CVD groups

Windowing Approach II.
• There are missing values of risk factors
• Windowing I.
  • skips missing values
  • different numbers of rows are generated for different factors
• Windowing II.
  • replaces the missing values
  • the same number of rows is generated for every factor
  • enables multivariate analysis
    • combination of different aggregates and their relation with CVD
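
The effect of the two strategies on the number of generated rows can be sketched as below. The slides do not say how missing values were replaced; carrying the last observed value forward (and back-filling a leading gap) is used here purely as a stand-in, and the function names and sample series are hypothetical.

```python
WINDOW_SIZE = 5


def skip_missing(series):
    """Windowing I. style: drop examinations where the factor is missing."""
    return [v for v in series if v is not None]


def fill_missing(series):
    """Windowing II. style stand-in: carry the last observed value forward."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    first = next((v for v in filled if v is not None), None)
    return [first if v is None else v for v in filled]  # back-fill leading gap


def window_count(series, window=WINDOW_SIZE):
    return max(0, len(series) - window + 1)


# Invented series for one patient: SYST has a gap, BMI is complete.
syst = [120, None, 125, 130, 128, 131]
bmi = [26.0, 26.4, 26.1, 27.0, 27.3, 27.1]

rows_skip = (window_count(skip_missing(syst)), window_count(skip_missing(bmi)))
rows_fill = (window_count(fill_missing(syst)), window_count(fill_missing(bmi)))
```

Skipping gives a different number of windows per factor, so the aggregates of different factors cannot be lined up row by row. Replacing the gap keeps the counts equal, which is what makes the multivariate analysis possible.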

[Figure: Windowing II. Labels: Entry, Data, First vector, New vector]

27 patients only!

Conclusions
• Main scope
  • AQ no. 7: Are there any differences in the development of risk factors for different CVD groups?
• Contributions
  • Pitfalls of the global approach revealed
  • Using windowing: differences proved for SYST and DIAST blood pressures
• Other assumptions and ideas:
  • interesting course of development of risk factors (DIAST first decreases, then increases, and then CVD appears)
  • other trends may have influence under specific conditions (BMITrend and overweight, etc.)
