Constructing Binary Decision Tree for Predicting Deep Venous Thrombosis (DVT)
Christopher Nwosisi 1,2, Sung-Hyuk Cha 1, Yoo Jung An, Charles C. Tappert 1, Evan Lipsitz 2
1 Computer Science Department, Pace University, New York, USA
2 Vascular Laboratory, Montefiore Medical Center, New York, USA
Statement of Problem • Decision tree algorithms such as ID3 and C4.5 are promising for medical diagnostic applications, but the trees they produce often suffer from excessive complexity and can even be incomprehensible. • Especially in predicting DVT, which has high mortality, a simple and accurate decision model is preferred for potential patients, medical technologists, and physicians before sending patients for expensive medical examinations.
Proposed Approach • Use a Genetic Algorithm (GA) to minimize the complexity (size) and/or maximize the accuracy of the decision tree; a sketch of how such a GA might score candidate trees is given below. • The new approach found shorter and/or more accurate decision trees than those produced by the conventional ID3 and C4.5 algorithms.
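As an illustration only (not the authors' implementation), the sketch below shows one way a GA might score and select candidate trees when the goal is to maximize accuracy while minimizing size; the `evaluate`, `crossover`, and `mutate` helpers, the weight `alpha`, and the node cap are all assumptions.

```python
import random

def fitness(accuracy, num_nodes, alpha=0.9, max_nodes=63):
    """Higher is better: reward accuracy (a fraction in [0, 1]) and
    penalize tree size; 63 nodes (a full binary tree of depth 5) is an
    illustrative cap, not a value from the paper."""
    return alpha * accuracy - (1 - alpha) * min(num_nodes, max_nodes) / max_nodes

def next_generation(population, evaluate, crossover, mutate, elite_frac=0.2):
    """One GA step over candidate decision trees.
    `evaluate(tree)` is assumed to return (accuracy, node_count) on the
    training records; `crossover` typically swaps random subtrees and
    `mutate` flips a tested attribute or a leaf label."""
    ranked = sorted(population, key=lambda t: fitness(*evaluate(t)), reverse=True)
    elite = ranked[: max(2, int(elite_frac * len(ranked)))]
    children = [mutate(crossover(*random.sample(elite, 2)))
                for _ in range(len(population) - len(elite))]
    return elite + children
```

The key design choice is that tree size enters the fitness directly, so short trees are rewarded during the search itself rather than only trimmed afterwards.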
DVT / VTE: Magnitude of the Problem
• DVT: 2 million
• Post-thrombotic syndrome: 800,000
• PE: 600,000
• Silent PE: 1 million
• Pulmonary hypertension: 30,000
• Death: 60,000
• Estimated cost of VTE care: $1.5 billion/year
Goldhaber SZ, et al. Lancet 1999;353:1386-19.
Clinical Problem Patients with deep vein thrombosis have a painful, swollen leg that limits their mobility. Montefiore Hospital Vascular Laboratory, 2008
DVT – Duplex Evaluation Criteria for a positive diagnosis:
- incompressibility of a venous segment
- visualization of thrombus
- absence of flow
Montefiore Hospital Vascular Laboratory
Database Overview Two datasets are extracted from two databases:
• 515 records from the Laboratory
- 350 patients are positive for DVT
- 165 patients are negative for DVT
• 620 records from the general registry
- 420 patients are positive for DVT
- 200 patients are negative for DVT
Attribute categories: Medical History, Physical Exam, Diagnostic Tests
Table 1 – Database Attributes: Medical History
Table 2 – Database Attributes: Medical History, Physical Exam, Diagnostic Tests
Table 2.1.1.1 – DVT sample data set II, from the DVT database (Table 1)
Preprocessing (Binarization) The original table of heterogeneous-type attributes is converted into a table of homogeneous binary-type attributes, as sketched below.
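A minimal sketch of such a binarization step, assuming plain Python dictionaries; the field names and cut-offs are made up for illustration and are not taken from the DVT database.

```python
def binarize(record, thresholds, categories):
    """Map one record with heterogeneous attribute types to homogeneous
    0/1 attributes: numeric fields become threshold tests, and
    categorical fields become one indicator per category value."""
    out = {}
    for field, cut in thresholds.items():        # numeric -> "value >= cut-off?"
        out[f"{field}>={cut}"] = int(record[field] >= cut)
    for field, values in categories.items():     # categorical -> one-hot indicators
        for value in values:
            out[f"{field}={value}"] = int(record[field] == value)
    return out

# Hypothetical example record and its binary encoding:
rec = {"age": 67, "sex": "F", "swelling": "yes"}
print(binarize(rec,
               thresholds={"age": 60},
               categories={"sex": ["F", "M"], "swelling": ["yes", "no"]}))
# {'age>=60': 1, 'sex=F': 1, 'sex=M': 0, 'swelling=yes': 1, 'swelling=no': 0}
```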
Why Binary Attributes? • To use the GA to build a binary decision tree, the attributes must be binary. • Applying the GA to non-binary attributes is extremely difficult and remains an open problem.
Decision Tree
[Example decision tree over binary attributes SR, PE, HF, SW, and CR, with pos/neg leaves and their counts (10/10, 17/25, 11/12, 12/13)]
A decision tree's representation of acquired knowledge is intuitive and generally easy for humans to assimilate. In general, decision tree classifiers have accuracy comparable to more complex classifiers while remaining simple to understand and visualize.
Decision Tree Representation • Decision trees classify instances by sorting them down the tree from the root to a leaf node, which provides the classification of the instance. • Each internal node in the tree specifies a test of some attribute of the instance. • Each branch descending from that node corresponds to one of the possible values of the attribute. • Each leaf node assigns a classification. A minimal sketch of this representation is given below.
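A minimal sketch of the representation and of classifying by sorting an instance down the tree; the attribute names here are hypothetical, not the abbreviations used in the paper's datasets.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str                        # classification assigned at this leaf

@dataclass
class Node:
    attribute: str                    # binary attribute tested at this internal node
    if_false: "Union[Leaf, Node]"     # branch for attribute value 0
    if_true: "Union[Leaf, Node]"      # branch for attribute value 1

def classify(tree, instance):
    """Sort the instance down from the root: test each node's attribute,
    follow the matching branch, and return the label at the leaf reached."""
    while isinstance(tree, Node):
        tree = tree.if_true if instance[tree.attribute] else tree.if_false
    return tree.label

# Hypothetical three-node tree and one classification:
toy = Node("swelling", if_false=Leaf("neg"),
           if_true=Node("pain", Leaf("neg"), Leaf("pos")))
print(classify(toy, {"swelling": 1, "pain": 0}))   # -> 'neg'
```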
Decision Trees from Dataset I
[Three decision trees built from Dataset I over attributes such as SR, DB, PN, SB, A6, SS, GN, and SW: (a) 59.5% accuracy by C4.5, (b) 61.5% by the GA, (c) 64.5% by the GA]
The Best Measure of Efficiency (Shortness) for a DT • Average number of questions required to obtain a prediction (a sketch of computing it is given below). Other measures: • the depth of the tree • the number of nodes in the tree
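A small standalone sketch of computing this measure, assuming each leaf is annotated with the number of training records that reach it; the encoding and counts are illustrative.

```python
# Minimal encoding for this sketch: a leaf is (label, n_records);
# an internal node is (attribute, false_branch, true_branch).

def path_stats(tree, depth=0):
    """Return (sum over leaves of depth * records reaching the leaf,
    total records), i.e., the ingredients of the weighted average."""
    if len(tree) == 2:                       # leaf
        return depth * tree[1], tree[1]
    _, false_branch, true_branch = tree      # internal node = one question asked
    d0, n0 = path_stats(false_branch, depth + 1)
    d1, n1 = path_stats(true_branch, depth + 1)
    return d0 + d1, n0 + n1

def average_questions(tree):
    weighted, total = path_stats(tree)
    return weighted / total

# Toy tree: 25 records are decided after one question, 25 after two.
toy = ("A1", ("neg", 25), ("A2", ("neg", 10), ("pos", 15)))
print(average_questions(toy))   # 1.5
```

A wide, shallow tree and a narrow, deep one can have the same node count yet very different averages, which is why this record-weighted measure is the preferred notion of shortness here.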
Complexity of Decision Trees
[Chart comparing the complexity of decision trees produced by the GA, C4.5, and ID3]
From both the depth and the average-number-of-questions perspective, the decision tree in Figure 5(d) is considerably more efficient (simpler) than the decision tree produced by the C4.5 algorithm (Figure 5(a)).
Optimal DT
[Candidate optimal decision tree over attributes such as SR, PE, HF, SW, CR, A6, DB, and LP, with pos/neg leaves and their counts (e.g., 56/79, 43/52, 30/43, 20/22)]
This might be the optimal decision tree for the data, and it indicates that combining human knowledge with the machine's speed of processing can often produce a better result than either the human or the machine could produce separately.
Conclusion • Experimental results on two datasets suggest that more accurate and more efficient decision trees can be found by the GA. • The results shown here improve the ability to predict whether a patient has had, or will develop, DVT, which is an advance in the diagnosis of DVT. • The decision trees produced by the GA have significant clinical relevance.
Future Work The decision trees found by the GA tend to be almost full binary trees, i.e., wide but shallow. For future work, the C4.5 pruning mechanism could be applied to the decision trees produced by the GA to make them sparser and to further guard against over-fitting; a sketch of the general pruning idea follows.
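C4.5's own pruning is error-based (pessimistic) pruning; as a rough illustration of the general idea only, here is a sketch of the simpler reduced-error pruning against a held-out validation set, with an assumed tuple encoding of the trees.

```python
# Encoding for this sketch: a leaf is ('leaf', label);
# an internal node is ('node', attr_index, false_branch, true_branch).

def classify(tree, x):
    while tree[0] == 'node':
        tree = tree[3] if x[tree[1]] else tree[2]
    return tree[1]

def accuracy(tree, records):
    return sum(classify(tree, x) == y for x, y in records) / len(records)

def majority_label(records):
    labels = [y for _, y in records]
    return max(set(labels), key=labels.count)

def prune(tree, train, valid):
    """Bottom-up reduced-error pruning: collapse a subtree into a leaf
    predicting the majority training label whenever that does not hurt
    accuracy on the validation records reaching the subtree."""
    if tree[0] == 'leaf' or not train:
        return tree
    _, attr, f, t = tree

    def split(records, value):
        return [(x, y) for x, y in records if bool(x[attr]) == value]

    pruned = ('node', attr,
              prune(f, split(train, False), split(valid, False)),
              prune(t, split(train, True), split(valid, True)))
    leaf = ('leaf', majority_label(train))
    if valid and accuracy(leaf, valid) >= accuracy(pruned, valid):
        return leaf
    return pruned
```

Applied after the GA search, a step like this would thin out the wide, nearly full trees the GA tends to produce while checking that held-out accuracy does not degrade.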