C4.5 Demo
Andrew Rosenberg
CS4701
11/30/04

What is c4.5?
• c4.5 is a program that creates a decision tree based on a set of labeled input data.
• This decision tree can then be tested against unseen labeled test data to quantify how well it generalizes.

Running c4.5
• On cunix.columbia.edu
  • ~amr2104/c4.5/bin/c4.5 -u -f filestem
• On cluster.cs.columbia.edu
  • ~amaxwell/c4.5/bin/c4.5 -u -f filestem
• c4.5 expects to find 3 files
  • filestem.names
  • filestem.data
  • filestem.test

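
Before launching c4.5 it can help to confirm that all three input files are in place. The following is a minimal sketch, not part of c4.5 itself: the helper name `missing_c45_files` is hypothetical, and it only assumes the three extensions listed on the slide above.

```python
from pathlib import Path

# Extensions c4.5 looks for when given "-f filestem" (taken from the slide).
EXPECTED_EXTENSIONS = (".names", ".data", ".test")


def missing_c45_files(filestem):
    """Return the expected input files that do not exist for this filestem.

    Hypothetical helper for illustration; c4.5 does not provide it.
    """
    # Append each extension to the raw string so dots inside the stem survive.
    candidates = [Path(str(filestem) + ext) for ext in EXPECTED_EXTENSIONS]
    return [str(path) for path in candidates if not path.is_file()]
```

Calling `missing_c45_files("census")` from the directory that holds the data returns an empty list when census.names, census.data and census.test are all present.
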
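File Format: .names
• The file begins with a comma-separated list of classes ending with a period, followed by a blank line.
  • E.g., >50K, <=50K.
• Each remaining line describes one attribute in the following format (note the end-of-line period):
  • Attribute: {ignore, discrete n, continuous, list}.
  • That is, the attribute name and a colon, followed by one of: ignore, discrete n, continuous, or a comma-separated list of the allowed values.
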
Example: census.names
>50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, etc.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, etc.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, etc.
occupation: Tech-support, Craft-repair, Other-service, Sales, etc.
relationship: Wife, Own-child, Husband, Not-in-family, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, etc.

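
If the class and attribute lists already live in a script, the .names text can be generated instead of typed by hand. This is a sketch under the format described above; `format_names` is a hypothetical helper, and the two attributes in the usage note are just a shortened illustration.

```python
def format_names(classes, attributes):
    """Build the text of a .names file.

    classes:    list of class labels, e.g. [">50K", "<=50K"].
    attributes: list of (name, spec) pairs. spec is either a string such as
                "continuous", "ignore" or "discrete 5", or a list of the
                allowed values for a discrete attribute.
    Hypothetical helper for illustration; not part of c4.5.
    """
    # First line: the classes, ending with a period; then a blank line.
    lines = [", ".join(classes) + ".", ""]
    for name, spec in attributes:
        value = spec if isinstance(spec, str) else ", ".join(spec)
        # Every attribute line also ends with a period.
        lines.append(f"{name}: {value}.")
    return "\n".join(lines) + "\n"
```

For example, `format_names([">50K", "<=50K"], [("age", "continuous"), ("sex", ["Female", "Male"])])` produces the class line, a blank line, and then one line each for age and sex.
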
File Format: .data, .test
• Each line in these data files is a comma-separated list of attribute values ending with a class label followed by a period.
• The attributes must be in the same order as described in the .names file.
• Unavailable values can be entered as '?'.
• When creating a test set, make sure that you remove the test cases from the training data, so that no case appears in both files.

Example: adult.test
25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.
28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States, >50K.
18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K.
34, Private, 198693, 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.
29, ?, 227026, HS-grad, 9, Never-married, ?, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K.
63, Self-emp-not-inc, 104626, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 32, United-States, >50K.
24, Private, 369667, Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K.
55, Private, 104996, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K.
65, Private, 184454, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 6418, 0, 40, United-States, >50K.
36, Federal-gov, 212465, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K.

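
Lines in this format are easy to get subtly wrong (a missing period, one field too few). The sketch below shows one way to write and sanity-check a case line. Both `format_case` and `check_case` are hypothetical helpers, and they assume that no attribute value itself contains a comma.

```python
def format_case(values, label):
    """Format one case as a .data/.test line; None is written as '?'."""
    fields = ["?" if v is None else str(v) for v in values]
    return ", ".join(fields + [label]) + "."


def check_case(line, n_attributes, classes):
    """Return a list of problems found in one case line (empty if none).

    Hypothetical checker for illustration; assumes values contain no commas.
    """
    problems = []
    text = line.strip()
    if text.endswith("."):
        text = text[:-1]  # drop the end-of-line period before splitting
    else:
        problems.append("missing trailing period")
    fields = [f.strip() for f in text.split(",")]
    if len(fields) != n_attributes + 1:  # attributes plus the class label
        problems.append(f"expected {n_attributes + 1} fields, found {len(fields)}")
    elif fields[-1] not in classes:
        problems.append(f"unknown class label {fields[-1]!r}")
    return problems
```

For the census example above, `n_attributes` would be 14 and `classes` would be `[">50K", "<=50K"]`.
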
c4.5 Output
• The decision tree proper.
  • Each leaf is followed by (weighted training examples/weighted training errors).
• Tables of training error and testing error.
• Confusion matrix.
• You'll want to redirect the output of c4.5 to a text file for later viewing.
  • E.g., c4.5 -u -f filestem > filestem.results

Example output
capital-gain > 6849 : >50K (203.0/6.2)
|   capital-gain <= 6849 :
|   |   capital-gain > 6514 : <=50K (7.0/1.3)
|   |   capital-gain <= 6514 :
|   |   |   marital-status = Married-civ-spouse: >50K (18.0/1.3)
|   |   |   marital-status = Divorced: <=50K (2.0/1.0)
|   |   |   marital-status = Never-married: >50K (0.0)
|   |   |   marital-status = Separated: >50K (0.0)
|   |   |   marital-status = Widowed: >50K (0.0)
|   |   |   marital-status = Married-spouse-absent: >50K (0.0)
|   |   |   marital-status = Married-AF-spouse: >50K (0.0)

Tree saved

Evaluation on training data (4660 items):

     Before Pruning           After Pruning
    ----------------   ---------------------------
    Size      Errors   Size      Errors   Estimate

    1692  366( 7.9%)     92  659(14.1%)    (16.0%)   <<

Evaluation on test data (2376 items):

     Before Pruning           After Pruning
    ----------------   ---------------------------
    Size      Errors   Size      Errors   Estimate

    1692  421(17.7%)     92  354(14.9%)    (16.0%)   <<

      (a)  (b)    <-classified as
     ---- ----
      328  251    (a): class >50K
      103 1694    (b): class <=50K

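
The confusion matrix and the test-error table describe the same result: the off-diagonal entries are the misclassified test cases. As a small worked check, using only the numbers from the example output (the function name `error_summary` is hypothetical):

```python
# Confusion matrix from the example output: rows are true classes,
# columns are the classes the pruned tree assigned.
confusion = [
    [328, 251],   # (a): class >50K
    [103, 1694],  # (b): class <=50K
]


def error_summary(matrix):
    """Return (errors, total, error_rate) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    errors = total - correct
    return errors, total, errors / total


errors, total, rate = error_summary(confusion)
# 251 + 103 = 354 errors out of 2376 test items, i.e. 14.9%,
# matching the "After Pruning" column of the test-data table.
print(f"{errors} errors / {total} items = {rate:.1%}")
```
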
k-fold Cross Validation
• Start with one large data set.
• Using a script, randomly divide this data set into k sets.
• At each iteration, use k-1 sets to train the decision tree, and the remaining set to test the model.
• Repeat this k times, holding out a different set each time, and take the average testing error.
• The average error estimates how well the learning algorithm generalizes on this data set.

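
The splitting step above can be sketched as follows. This is one possible script, not the one used in the course: `k_fold_splits` and `average_error` are hypothetical names, the cases are assumed to be the lines of one large data file, and the fixed seed is only there to make the shuffle repeatable.

```python
import random


def k_fold_splits(cases, k, seed=0):
    """Yield k (train, test) pairs; each case is tested exactly once.

    Hypothetical helper for illustration.
    """
    if not 2 <= k <= len(cases):
        raise ValueError("k must be between 2 and the number of cases")
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # random division into k sets
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Train on the other k-1 folds so no test case leaks into training.
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, test


def average_error(test_errors):
    """Average the per-fold test error rates."""
    return sum(test_errors) / len(test_errors)
```

In use, each `train` list would be written to filestem.data and each `test` list to filestem.test before running c4.5, and the test error reported for each fold would be collected and passed to `average_error`.
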