Learning Decision Trees
Brief tutorial by M. Werner
Medical Diagnosis Example
• Goal – Diagnose a disease from a blood test
• Clinical use:
  • A blood sample is obtained from a patient
  • The blood is tested to measure the current expression levels of various proteins, say by using a DNA microarray
  • The data is analyzed to produce a Yes or No answer
Data Analysis
• Use a decision tree such as:
[Decision tree diagram: the root node tests P1 > K1; both of its branches test P2 > K2; deeper nodes test P3 > K3 or P4 > K4; each leaf gives a Yes or No diagnosis]
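A minimal sketch of how a tree of this shape could be evaluated in code (Python here; the exact topology is inferred from the diagram, and the thresholds K1–K4 and the example patient's values are hypothetical placeholders, not clinically meaningful numbers):

```python
# Sketch of walking the kind of decision tree shown above.
# Each internal node compares one protein's expression level to a
# threshold; thresholds K1..K4 are hypothetical placeholders.

def diagnose(levels, K1=1.0, K2=0.5, K3=2.0, K4=1.5):
    """Return "Yes" or "No" by following the tree from the root."""
    if levels["P1"] > K1:
        if levels["P2"] > K2:
            return "Yes" if levels["P3"] > K3 else "No"
        return "Yes" if levels["P4"] > K4 else "No"
    else:
        if levels["P2"] > K2:
            return "Yes" if levels["P4"] > K4 else "No"
        return "No"

patient = {"P1": 1.2, "P2": 0.4, "P3": 2.5, "P4": 0.9}
print(diagnose(patient))  # -> "No" (P1 > K1, P2 <= K2, P4 <= K4)
```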
How to Build the Decision Tree
• Start with blood samples from patients known either to have the disease or not (the training set)
• Suppose there are 20 patients, of whom 10 are known to have the disease and 10 are known not to
• From the training set, get expression levels for all proteins of interest
  • i.e. with 20 patients and 50 proteins, we get a 50 × 20 array of real numbers (sketched below)
  • Rows are proteins
  • Columns are patients
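Concretely, the training set can be held as a proteins × patients matrix plus one label per patient. A minimal NumPy sketch, with randomly generated stand-in values since the tutorial's actual measurements are not given:

```python
import numpy as np

rng = np.random.default_rng(0)

n_proteins, n_patients = 50, 20
# Rows are proteins, columns are patients, as described above.
X = rng.normal(size=(n_proteins, n_patients))  # stand-in expression levels
# The first 10 patients have the disease (label 1); the rest do not (0).
y = np.array([1] * 10 + [0] * 10)

print(X.shape)  # (50, 20)
```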
Choosing the decision nodes
• We would like the tree to be as short as possible
• Start with all 20 patients in one group (10 have the disease, 10 don't)
• Choose a protein and a level that gain the most information, as in the splits sketched below
[Diagram: a possible splitting condition Px > Kx divides the 10/10 group into a 9/3 subgroup (mostly diseased) and a 1/7 subgroup (mostly not diseased); an alternative condition Py > Ky divides it into 7/7 and 3/3 subgroups]
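A candidate split is just a (protein, threshold) pair, and applying it partitions the patients into two groups whose diseased/healthy counts can be tallied. A sketch under the same 50 × 20 layout (the protein index and threshold below are arbitrary placeholders):

```python
import numpy as np

def split_counts(X, y, protein, threshold):
    """Return (diseased, healthy) counts on each side of the test."""
    mask = X[protein, :] > threshold            # patients passing the test
    left  = (int(y[mask].sum()),  int(len(y[mask]) - y[mask].sum()))
    right = (int(y[~mask].sum()), int(len(y[~mask]) - y[~mask].sum()))
    return left, right

# Toy data in the 50 x 20 layout described above:
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = np.array([1] * 10 + [0] * 10)

# A good split like Px > Kx might yield ((9, 3), (1, 7)); a useless one
# like Py > Ky yields ((7, 7), (3, 3)).
print(split_counts(X, y, protein=0, threshold=0.0))
```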
How to determine information gain
• Purity – the degree to which the patients in a group share the same outcome
• A group that splits 1/7 is fairly pure – most of its patients don't have the disease
• A 0/8 split is even purer
• A 4/4 split is the opposite of pure. Such a group is said to have high entropy: knowing that a patient is in this group does not make her more or less likely to have the disease
• The decision tree should reduce entropy as test conditions are evaluated
Measuring Purity (Entropy)
• Let f(i, j) = Prob(Outcome = j in node i)
• e.g. if node 2 has a 9/3 split:
  • f(2, 0) = 9/12 = .75
  • f(2, 1) = 3/12 = .25
• Gini impurity: IG(i) = 1 − Σj f(i, j)²
• Entropy: H(i) = −Σj f(i, j) log₂ f(i, j)
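Both measures are easy to compute from a group's class counts. A minimal sketch that checks the fractions quoted above for the 9/3 node and the earlier 1/7, 0/8, and 4/4 examples:

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum over j of f(i, j)^2."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits: -sum over j of f(i, j) * log2 f(i, j)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(gini([9, 3]))      # 0.375, from f = (.75, .25)
print(entropy([9, 3]))   # ~0.811 bits
print(entropy([1, 7]))   # ~0.544 bits: fairly pure
print(entropy([0, 8]))   # 0.0 bits: perfectly pure
print(entropy([4, 4]))   # 1.0 bit: maximum entropy, no information
```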
The goal is to use the test that best reduces the total (size-weighted) entropy of the resulting subgroups – equivalently, the test with the greatest information gain.
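Concretely, the "total entropy in the subgroups" is the size-weighted average of the children's entropies, and information gain is the parent's entropy minus that average. A sketch comparing the two candidate splits from the earlier slide:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(left, right):
    """Size-weighted entropy of the two subgroups produced by a test."""
    n = sum(left) + sum(right)
    return (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)

# The 9/3 vs 1/7 split beats the uninformative 7/7 vs 3/3 split:
print(weighted_entropy((9, 3), (1, 7)))   # ~0.70 bits
print(weighted_entropy((7, 7), (3, 3)))   # 1.0 bit: no reduction at all

# Information gain = parent entropy minus weighted child entropy:
print(entropy((10, 10)) - weighted_entropy((9, 3), (1, 7)))  # ~0.30 bits
```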
Links
• http://www.ece.msstate.edu/research/isip/publications/courses/ece_8463/lectures/current/lecture_27/lecture_27.pdf
• Decision Trees & Data Mining
• Andrew Moore Tutorial