Classification supplemental
Scalable Decision Tree Induction Methods in Data Mining Studies
• SLIQ (EDBT'96 — Mehta et al.)
  • builds an index for each attribute; only the class list and the current attribute list reside in memory
• SPRINT (VLDB'96 — J. Shafer et al.)
  • constructs an attribute-list data structure
• PUBLIC (VLDB'98 — Rastogi & Shim)
  • integrates tree splitting and tree pruning: stops growing the tree earlier
• RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti)
  • separates the scalability aspects from the criteria that determine the quality of the tree
  • builds an AVC-list (attribute, value, class label)
SPRINT: designed for large data sets.
[Figure: example decision tree — root split Age < 25, second split Car = Sports; leaves labeled H, H, L]
Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - sum_j (p_j)^2
  where p_j is the relative frequency of class j in T.
• If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the gini index of the split is defined as
  ginisplit(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
• The attribute that provides the smallest ginisplit(T) is chosen to split the node (all possible splitting points must be enumerated for each attribute).
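As a runnable illustration (not part of the original slides), the two formulas above can be sketched in Python:

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j (p_j)^2, where p_j is the relative frequency of class j in T."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(t1, t2):
    """ginisplit(T) = (N1/N) gini(T1) + (N2/N) gini(T2) for a binary split T -> T1, T2."""
    n = len(t1) + len(t2)
    return len(t1) / n * gini(t1) + len(t2) / n * gini(t2)
```

For example, a set with four H and two L records has gini = 1 - (4/6)^2 - (2/6)^2 = 0.444, the value that recurs in the worked examples below.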
SPRINT: Partition(S)
  if all points of S are in the same class then
    return
  else
    for each attribute A do evaluate_splits on A
    use the best split to partition S into S1, S2
    Partition(S1)
    Partition(S2)
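A minimal runnable sketch of this recursion, assuming numeric attributes, midpoint split candidates, and the gini criterion from the previous slide (the dict-based tree representation is an illustration, not SPRINT's actual data structures):

```python
from collections import Counter

def gini(rows):
    """rows: list of (features, label) pairs; gini = 1 - sum_j p_j^2."""
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(lab for _, lab in rows).values())

def partition(rows, attrs):
    """Recursive partitioning; returns a leaf label or a nested split dict."""
    labels = [lab for _, lab in rows]
    if len(set(labels)) == 1:                 # all points of S are in the same class
        return labels[0]
    best = None
    for a in attrs:                           # evaluate_splits on each attribute
        values = sorted({feats[a] for feats, _ in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2               # candidate split point (midpoint)
            s1 = [r for r in rows if r[0][a] <= thr]
            s2 = [r for r in rows if r[0][a] > thr]
            g = len(s1) / len(rows) * gini(s1) + len(s2) / len(rows) * gini(s2)
            if best is None or g < best[0]:
                best = (g, a, thr, s1, s2)
    if best is None:                          # no usable split: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    _, a, thr, s1, s2 = best
    return {"split": (a, thr), "le": partition(s1, attrs), "gt": partition(s2, attrs)}
```

On the six-record Age example used later in these slides, the root split chosen by this sketch is Age <= 27.5.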
SPRINT data structures
[Figure: the training set and its attribute lists for Age and Car, each list entry carrying the attribute value and class label]
Splits
[Figure: the split Age < 27.5 partitions the attribute lists into Group1 and Group2]
Histograms
• For continuous attributes, two class histograms are associated with each node: C_below (records already processed) and C_above (records still to process).
Example (continuous attribute Age; classes in sorted order: H, H, H, L, H, L)
ginisplit0 = 0/6 gini(S1) + 6/6 gini(S2)
  gini(S2) = 1 - [(4/6)^2 + (2/6)^2] = 0.444
ginisplit1 = 1/6 gini(S1) + 5/6 gini(S2)
  gini(S1) = 1 - [(1/1)^2] = 0
  gini(S2) = 1 - [(3/5)^2 + (2/5)^2] = 0.48
ginisplit2 = 2/6 gini(S1) + 4/6 gini(S2)
  gini(S1) = 1 - [(2/2)^2] = 0
  gini(S2) = 1 - [(2/4)^2 + (2/4)^2] = 0.5
ginisplit3 = 3/6 gini(S1) + 3/6 gini(S2)
  gini(S1) = 1 - [(3/3)^2] = 0
  gini(S2) = 1 - [(1/3)^2 + (2/3)^2] = 0.444
ginisplit4 = 4/6 gini(S1) + 2/6 gini(S2)
  gini(S1) = 1 - [(3/4)^2 + (1/4)^2] = 0.375
  gini(S2) = 1 - [(1/2)^2 + (1/2)^2] = 0.5
ginisplit5 = 5/6 gini(S1) + 1/6 gini(S2)
  gini(S1) = 1 - [(4/5)^2 + (1/5)^2] = 0.32
  gini(S2) = 1 - [(1/1)^2] = 0
ginisplit6 = 6/6 gini(S1) + 0/6 gini(S2)
  gini(S1) = 1 - [(4/6)^2 + (2/6)^2] = 0.444
Summary: ginisplit0 = 0.444, ginisplit1 = 0.400, ginisplit2 = 0.333, ginisplit3 = 0.222, ginisplit4 = 0.417, ginisplit5 = 0.267, ginisplit6 = 0.444
The minimum is ginisplit3, giving the split Age <= 27.5.
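The C_below / C_above scan can be sketched as follows, assuming the sorted class sequence H, H, H, L, H, L implied by the per-split class counts above; each step moves one record's count from C_above to C_below and evaluates ginisplit incrementally:

```python
from collections import Counter

def continuous_splits(sorted_labels):
    """ginisplit at every candidate position i (the first i records form S1)."""
    n = len(sorted_labels)
    below = Counter()                  # C_below: records already processed
    above = Counter(sorted_labels)     # C_above: records still to process
    def gini(counts, total):
        if total == 0:
            return 0.0
        return 1.0 - sum((c / total) ** 2 for c in counts.values())
    ginis = []
    for i in range(n + 1):
        ginis.append(i / n * gini(below, i) + (n - i) / n * gini(above, n - i))
        if i < n:                      # move one record from C_above to C_below
            below[sorted_labels[i]] += 1
            above[sorted_labels[i]] -= 1
    return ginis
```

Running this on the sequence H, H, H, L, H, L returns its minimum at position 3, i.e. the split between the third and fourth sorted records.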
Splitting categorical attributes
• A single scan through the attribute list collects counts in a count matrix, one cell for each combination of class label and attribute value.
Example (categorical attribute Car Type)
ginisplit(family) = 3/6 gini(S1) + 3/6 gini(S2)
  gini(S1) = 1 - [(2/3)^2 + (1/3)^2] = 4/9
  gini(S2) = 1 - [(2/3)^2 + (1/3)^2] = 4/9
ginisplit(sports) = 2/6 gini(S1) + 4/6 gini(S2)
  gini(S1) = 1 - [(2/2)^2] = 0
  gini(S2) = 1 - [(2/4)^2 + (2/4)^2] = 0.5
ginisplit(truck) = 1/6 gini(S1) + 5/6 gini(S2)
  gini(S1) = 1 - [(1/1)^2] = 0
  gini(S2) = 1 - [(4/5)^2 + (1/5)^2] = 0.32
Summary: ginisplit(family) = 0.444, ginisplit(sports) = 0.333, ginisplit(truck) = 0.267
The best categorical split is Car Type = Truck.
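This count-matrix evaluation can be sketched in Python; the concrete (value, label) data below is an assumption chosen to reproduce the class counts in the example (family: H, H, L; sports: H, H; truck: L), not data given on the slide:

```python
from collections import Counter

def categorical_splits(pairs):
    """pairs: list of (attribute_value, class_label).
    Returns {value: ginisplit for S1 = {value} vs S2 = all other values}."""
    n = len(pairs)
    labels = sorted({l for _, l in pairs})
    matrix = Counter(pairs)                  # count matrix: (value, label) -> count
    totals = Counter(l for _, l in pairs)    # class totals across all values
    def gini(counts):
        t = sum(counts)
        if t == 0:
            return 0.0
        return 1.0 - sum((c / t) ** 2 for c in counts)
    result = {}
    for v in sorted({v for v, _ in pairs}):
        s1 = [matrix[(v, l)] for l in labels]              # records with value v
        s2 = [totals[l] - matrix[(v, l)] for l in labels]  # all other records
        result[v] = sum(s1) / n * gini(s1) + sum(s2) / n * gini(s2)
    return result
```

With the assumed data this yields ginisplit values of 4/9, 1/3, and 4/15 for family, sports, and truck respectively, matching the summary above.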
Example (2 attributes)
The winner is the split Age <= 27.5 (ginisplit = 0.222), which beats the best Car Type split (ginisplit = 0.267).
[Figure: the resulting node — Y branch a pure H leaf, N branch to be partitioned further]
Example for Bayes' Rule
• The patient either has cancer or does not.
• Prior knowledge: over the entire population, 0.008 have cancer.
• The lab test result (+ or -) is imperfect. It returns
  • a correct positive result in only 98% of the cases in which the cancer is actually present
  • a correct negative result in only 97% of the cases in which the cancer is not present
• What should we conclude about a new patient whose lab test returns +?
Example for Bayes' Rule
Pr(cancer) = 0.008          Pr(not cancer) = 0.992
Pr(+|cancer) = 0.98         Pr(-|cancer) = 0.02
Pr(+|not cancer) = 0.03     Pr(-|not cancer) = 0.97

Pr(+|cancer) Pr(cancer) = 0.98 * 0.008 = 0.0078
Pr(+|not cancer) Pr(not cancer) = 0.03 * 0.992 = 0.0298
Hence, Pr(cancer|+) = 0.0078 / (0.0078 + 0.0298) = 0.21
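The same posterior computation as a short runnable sketch (variable names are illustrative):

```python
# Priors and test characteristics from the slide
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer = 0.98      # correct positive rate when cancer is present
p_pos_given_not = 0.03         # false-positive rate: 1 - 0.97

# Unnormalized joint probabilities for each hypothesis given a + result
joint_cancer = p_pos_given_cancer * p_cancer   # 0.98 * 0.008
joint_not = p_pos_given_not * p_not            # 0.03 * 0.992

# Bayes' rule: normalize over both hypotheses
posterior = joint_cancer / (joint_cancer + joint_not)
print(round(posterior, 2))     # prints 0.21
```

Despite the positive test, the posterior probability of cancer is only about 21%, because the disease is rare in the population.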