Knowledge discovery with classification rules in a cardiovascular dataset

Advisor: Dr. Hsu
Presenter: Zih-Hui Lin
Authors: Vili Podgorelec, Peter Kokol, Milojka Molan Stiglic, Marjan Heričko, Ivan Rozman
Computer Methods and Programs in Biomedicine, Volume 80, Supplement 1, December 2005, Pages S39-S49

Outline
• Motivation
• Objective
• The AREX algorithm
• Experiment
• Conclusions

Motivation
• Modern medicine generates huge amounts of data, and there is an acute and widening gap between data collection and data comprehension.
• It is very difficult for a human to make use of such an amount of information (e.g. hundreds of attributes, thousands of images, several channels of 24-hour ECG or EEG signals).

Objective
• Confirm physicians' existing knowledge about a medical problem.
• Enable searching for new facts, which should reveal new interesting patterns and possibly improve the existing medical knowledge.

The AREX algorithm
1. A multi-population self-adapting genetic algorithm for the induction of decision trees.
   1.1 Build N decision trees upon objects from S.
   1.2 Classify object Oi with nt randomly chosen trees.
   1.3 If the frequency of the most frequent decision class classified by the nt trees > nt - ct (ct = nt/2)
   1.4 From all N decision trees create M initial classification rules.
2. Evolution of programs in an arbitrary programming language, which is used to evolve classification rules.
   2.1 Create m/2+1 rules (randomly).
   2.2 S*
   2.3 If s is not empty:
       • Add |s| randomly chosen objects from s* to s
       • ct = ct + 1
       • Repeat 1.1
3. An optimal set of classification rules is determined with a simple genetic algorithm.
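The agreement test in steps 1.2 and 1.3 can be sketched as follows. This is only an illustration under stated assumptions: trees are modeled as plain Python callables that return a class label, and the function name `agreement_test` is hypothetical. The slide does not say what happens once the test passes or fails, so the sketch only reports the outcome.

```python
import random
from collections import Counter

def agreement_test(obj, trees, nt, ct, rng):
    """Classify obj with nt randomly chosen trees (step 1.2) and check
    whether the most frequent class was voted more than nt - ct times
    (step 1.3). Returns (majority_class, passed)."""
    chosen = rng.sample(trees, nt)                 # nt trees, without repetition
    votes = Counter(tree(obj) for tree in chosen)  # class label -> number of votes
    majority_class, frequency = votes.most_common(1)[0]
    return majority_class, frequency > nt - ct

# Toy stand-ins for decision trees (assumption: a tree is any callable).
unanimous_trees = [lambda obj: "arrhythmia" for _ in range(5)]
split_trees = [lambda obj: "arrhythmia"] * 3 + [lambda obj: "chest pain"] * 3

rng = random.Random(0)
unanimous_result = agreement_test({}, unanimous_trees, nt=4, ct=2, rng=rng)
split_result = agreement_test({}, split_trees, nt=6, ct=1, rng=rng)
```

With four unanimous votes and ct = nt/2 = 2 the test passes (4 > 2); with a 3-3 split and ct = 1 it fails (3 is not greater than 5).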

Genetic algorithm for the construction of decision trees
[Figure: decision tree with a root, attribute nodes Xi, and null leaves; labels: M attributes, population]
1. Choose the number of attribute nodes M that will be in the tree.
2. Select an attribute Xi:
   (1) Continuous attributes → split constant
   (2) Discrete attributes → two randomly defined disjoint sets
3. Select an empty node (the deeper the node is in the tree, the lower its probability of being selected).
4. Randomly select an attribute Xi (attributes not yet selected have a higher probability).
• For each empty leaf, the following algorithm determines the appropriate decision class
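Steps 1 to 4 of the random tree construction can be sketched as below. This is a minimal sketch, not the authors' implementation: the slide gives no formulas for the two selection probabilities, so the weights `1 / (depth + 1)` and `1 / (1 + uses)` are assumptions, as are all names (`Node`, `random_test`, `build_random_tree`) and the example attribute values. Discrete attributes are assumed to have at least two distinct values.

```python
import random

class Node:
    def __init__(self, depth):
        self.depth = depth
        self.attribute = None   # None marks an empty node (a future leaf)
        self.test = None
        self.children = []

def random_test(values, rng):
    """Step 2: numeric attribute -> split constant;
    discrete attribute -> two randomly defined disjoint value sets."""
    if all(isinstance(v, (int, float)) for v in values):
        return ("threshold", rng.uniform(min(values), max(values)))
    distinct = sorted(set(values))
    rng.shuffle(distinct)
    cut = rng.randint(1, len(distinct) - 1)   # both sets stay non-empty
    return ("subsets", set(distinct[:cut]), set(distinct[cut:]))

def build_random_tree(attribute_values, m, rng):
    """Place m attribute nodes (step 1) into an initially empty tree."""
    root = Node(0)
    empty = [root]
    uses = {name: 0 for name in attribute_values}
    names = list(attribute_values)
    for _ in range(m):
        # Step 3: deeper empty nodes are less likely to be chosen (assumed weight).
        node = rng.choices(empty, weights=[1.0 / (n.depth + 1) for n in empty])[0]
        empty.remove(node)
        # Step 4: attributes not used yet are more likely to be chosen (assumed weight).
        name = rng.choices(names, weights=[1.0 / (1 + uses[a]) for a in names])[0]
        uses[name] += 1
        node.attribute = name
        node.test = random_test(attribute_values[name], rng)
        node.children = [Node(node.depth + 1), Node(node.depth + 1)]
        empty.extend(node.children)
    return root

def count_attribute_nodes(node):
    if node.attribute is None:
        return 0
    return 1 + sum(count_attribute_nodes(child) for child in node.children)

# Hypothetical attribute values, for illustration only.
example_attributes = {
    "age": [3, 7, 12],
    "sex": ["f", "m"],
    "murmur": ["none", "systolic", "diastolic"],
}
tree = build_random_tree(example_attributes, m=4, rng=random.Random(0))
```

The empty nodes left over after the loop are the leaves that still need a decision class, which the slide assigns in a separate step.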

proGenesys system & Finding the optimal set of rules

Introduction
• Advantage
  • Transparency of the classification process, which one can easily interpret, understand, and criticize
• Disadvantages
  • Poor processing of incomplete, noisy data
  • Inability to build several trees for the same dataset
  • Inability to use the preferred attributes, etc.

Dataset
• Contains data of 100 patients from Maribor Hospital.
• The attributes include:
  • general data (age, sex, etc.)
  • health status (data from family history and the child's previous illnesses)
  • general cardiovascular data (blood pressure, pulse, chest pain, etc.)
  • more specialized cardiovascular data: data from the child's cardiac history and clinical examinations (with findings of ultrasound, ECG, etc.)
• Five different diagnoses are possible in the dataset:
  • innocent heart murmur
  • congenital heart disease with left-to-right shunt
  • aortic valve disease with aorta coarctation
  • arrhythmias
  • chest pain

Classification result: training set
• Overfitting

Classification result: testing set

Classification result

Conclusions
• One of the most evident advantages of AREX is that it is simultaneously very good at:
  • Generalization → high and similar overall accuracy on both the training set and the test set
  • Specialization → high and very similar accuracy for all decision classes, including the least frequent ones
• AREX can equip physicians with a powerful technique to:
  (1) confirm their existing knowledge about some medical problem
  (2) enable searching for new facts, which should reveal new interesting patterns and possibly improve the existing medical knowledge.
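The two measures behind "generalization" and "specialization" can be computed as in the sketch below: overall accuracy over all objects, and accuracy separately for each decision class. The function name `accuracies` and the label lists are made up for illustration; they are not results from the paper.

```python
from collections import defaultdict

def accuracies(true_labels, predicted_labels):
    """Return (overall accuracy, {class: accuracy within that class})."""
    total = defaultdict(int)     # objects per true class
    correct = defaultdict(int)   # correctly classified objects per true class
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    overall = sum(correct.values()) / len(true_labels)
    per_class = {label: correct[label] / total[label] for label in total}
    return overall, per_class

# Toy labels, invented for this example only.
truth = ["murmur", "murmur", "arrhythmia", "arrhythmia", "arrhythmia"]
guess = ["murmur", "arrhythmia", "arrhythmia", "arrhythmia", "murmur"]
overall, per_class = accuracies(truth, guess)
```

Comparing `overall` between the training and test sets speaks to generalization, while the spread of values in `per_class` speaks to specialization.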

My opinion
• Advantage: weights are assigned according to attributes
• Disadvantage:
• Apply: Taichung Hospital?
