C4.5 Demo

  1. C4.5 Demo. Andrew Rosenberg, CS4701, 11/30/04

  2. What is c4.5?
     • c4.5 is a program that creates a decision tree based on a set of labeled input data.
     • This decision tree can then be tested against unseen labeled test data to quantify how well it generalizes.

  3. Running c4.5
     • On cunix.columbia.edu
         ~amr2104/c4.5/bin/c4.5 -u -f filestem
     • On cluster.cs.columbia.edu
         ~amaxwell/c4.5/bin/c4.5 -u -f filestem
     • c4.5 expects to find 3 files
         • filestem.names
         • filestem.data
         • filestem.test
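Before launching c4.5 it can help to confirm that all three input files are present. The following is a minimal Python sketch, assuming the files share one filestem in the same directory. The helper names `missing_inputs` and `c45_command` are illustrative and not part of c4.5; only the `-u` and `-f` flags shown on the slide are used.

```python
import os

# The three file extensions c4.5 expects, as listed on the slide.
REQUIRED_EXTENSIONS = (".names", ".data", ".test")


def missing_inputs(filestem):
    """Return the expected input files that do not exist for this filestem."""
    return [filestem + ext for ext in REQUIRED_EXTENSIONS
            if not os.path.isfile(filestem + ext)]


def c45_command(c45_path, filestem):
    """Build the argument list for the invocation shown on the slide."""
    # Only -u and -f appear here because they are the flags the slide uses;
    # the path to the c4.5 binary is supplied by the caller.
    return [c45_path, "-u", "-f", filestem]
```

An empty list from `missing_inputs("census")` would mean that census.names, census.data and census.test were all found.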

  4. File Format: .names
     • The file begins with a comma-separated list of classes ending with a period, followed by a blank line.
         • E.g., >50K, <=50K.
     • The remaining lines have the following format (note the end-of-line period):
         • Attribute: {ignore, discrete n, continuous, list}.
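The layout described above can be generated from a script. This is a hedged Python sketch, not a tool shipped with c4.5: the function name `format_names` and its argument shapes are assumptions made for illustration.

```python
def format_names(classes, attributes):
    """Render the contents of a .names file as a string.

    classes    -- list of class labels, e.g. [">50K", "<=50K"]
    attributes -- list of (name, spec) pairs; spec is either a string such
                  as "continuous" or "ignore", or a list of discrete values
    """
    lines = [", ".join(classes) + ".", ""]  # class list, then a blank line
    for name, spec in attributes:
        if not isinstance(spec, str):
            spec = ", ".join(spec)  # a list of values becomes "a, b, c"
        lines.append(f"{name}: {spec}.")  # note the end-of-line period
    return "\n".join(lines) + "\n"
```

For example, `format_names([">50K", "<=50K"], [("age", "continuous"), ("sex", ["Female", "Male"])])` yields the class line, a blank line, and one attribute line each for `age` and `sex`.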

  5. Example: census.names

     >50K, <=50K.

     age: continuous.
     workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, etc.
     fnlwgt: continuous.
     education: Bachelors, Some-college, 11th, HS-grad, Prof-school, etc.
     education-num: continuous.
     marital-status: Married-civ-spouse, Divorced, Never-married, etc.
     occupation: Tech-support, Craft-repair, Other-service, Sales, etc.
     relationship: Wife, Own-child, Husband, Not-in-family, Unmarried.
     race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
     sex: Female, Male.
     capital-gain: continuous.
     capital-loss: continuous.
     hours-per-week: continuous.
     native-country: United-States, Cambodia, England, Puerto-Rico, Canada, etc.

  6. File Format: .data, .test
     • Each line in these data files is a comma-separated list of attribute values, followed by a class label and a closing period.
     • The attributes must be in the same order as described in the .names file.
     • Unavailable values can be entered as ‘?’.
     • When creating a test set, make sure its data points are removed from the training data, so that no example appears in both.

  7. Example: adult.test

     25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
     38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.
     28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
     44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States, >50K.
     18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K.
     34, Private, 198693, 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.
     29, ?, 227026, HS-grad, 9, Never-married, ?, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K.
     63, Self-emp-not-inc, 104626, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 32, United-States, >50K.
     24, Private, 369667, Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K.
     55, Private, 104996, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K.
     65, Private, 184454, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 6418, 0, 40, United-States, >50K.
     36, Federal-gov, 212465, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K.
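A record in this format can be checked against the .names file before c4.5 is run. The sketch below is illustrative only: `parse_record` is a hypothetical helper, and it assumes that attribute values themselves never contain commas.

```python
def parse_record(line, n_attributes):
    """Split one .data/.test line into (attribute_values, class_label)."""
    text = line.strip()
    if text.endswith("."):
        text = text[:-1]  # drop the period that ends the record
    fields = [field.strip() for field in text.split(",")]
    if len(fields) != n_attributes + 1:
        raise ValueError(
            f"expected {n_attributes} attributes plus a class label, "
            f"got {len(fields)} fields")
    # '?' marks an unavailable value; represent it as None.
    values = [None if field == "?" else field for field in fields[:-1]]
    return values, fields[-1]
```

With the 14 attributes of the census example, the record that begins `18, ?, 103497` parses to 14 values (two of them `None`) and the class label `<=50K`.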

  8. c4.5 Output
     • The decision tree proper.
         • Each leaf is annotated with (weighted training examples/weighted training error).
     • Tables of training error and testing error
     • Confusion matrix
     • You’ll want to redirect the output of c4.5 to a text file for later viewing.
         • E.g., c4.5 -u -f filestem > filestem.results
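The same redirection can be done from a script, which is convenient when c4.5 is run many times. This is a minimal sketch using Python's standard `subprocess` module; `run_and_save` is a hypothetical helper, and the command list is whatever invocation you would otherwise type at the shell.

```python
import subprocess


def run_and_save(command, results_path):
    """Run a command and write its standard output to results_path."""
    with open(results_path, "w") as results:
        # check=False: return the exit code instead of raising on failure.
        completed = subprocess.run(command, stdout=results, check=False)
    return completed.returncode
```

For instance, `run_and_save(["c4.5", "-u", "-f", "filestem"], "filestem.results")` would mirror the shell example above, assuming `c4.5` is on the search path.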

  9. Example output

     capital-gain > 6849 : >50K (203.0/6.2)
     |   capital-gain <= 6849 :
     |   |   capital-gain > 6514 : <=50K (7.0/1.3)
     |   |   capital-gain <= 6514 :
     |   |   |   marital-status = Married-civ-spouse: >50K (18.0/1.3)
     |   |   |   marital-status = Divorced: <=50K (2.0/1.0)
     |   |   |   marital-status = Never-married: >50K (0.0)
     |   |   |   marital-status = Separated: >50K (0.0)
     |   |   |   marital-status = Widowed: >50K (0.0)
     |   |   |   marital-status = Married-spouse-absent: >50K (0.0)
     |   |   |   marital-status = Married-AF-spouse: >50K (0.0)

     Tree saved

     Evaluation on training data (4660 items):

          Before Pruning           After Pruning
         ----------------   ---------------------------
         Size      Errors   Size      Errors   Estimate

         1692  366( 7.9%)     92  659(14.1%)    (16.0%)   <<

     Evaluation on test data (2376 items):

          Before Pruning           After Pruning
         ----------------   ---------------------------
         Size      Errors   Size      Errors   Estimate

         1692  421(17.7%)     92  354(14.9%)    (16.0%)   <<

          (a)  (b)    <-classified as
         ---- ----
          328  251    (a): class >50K
          103 1694    (b): class <=50K
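The confusion matrix and the test-error table report the same result in two forms. As a worked check in Python, summing the off-diagonal cells of the matrix reproduces the after-pruning test error; the dictionary layout is just one convenient way to hold the four numbers.

```python
# Confusion matrix from the example output: rows are the true classes,
# columns are the classes assigned by the pruned tree.
confusion = {
    ">50K":  {">50K": 328, "<=50K": 251},
    "<=50K": {">50K": 103, "<=50K": 1694},
}

total = sum(sum(row.values()) for row in confusion.values())
errors = sum(count
             for true_class, row in confusion.items()
             for predicted, count in row.items()
             if predicted != true_class)
error_rate = 100.0 * errors / total

print(total, errors, f"{error_rate:.1f}%")  # 2376 354 14.9%
```

The 251 + 103 = 354 misclassified items out of 2376 match the 354 (14.9%) entry in the test-data table.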

  10. k-fold Cross Validation
     • Start with one large data set.
     • Using a script, randomly divide this data set into k sets.
     • At each iteration, use k-1 sets to train the decision tree, and the remaining set to test the model.
     • Repeat this k times, so that each set is used for testing once, and take the average testing error.
     • The average error estimates how well the learning algorithm generalizes on this data set.
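The splitting script mentioned above could look like the following Python sketch. It is one possible approach, not the script used in the course: `k_fold_splits` and `average_error` are illustrative names, and the fixed `seed` is an assumption added so that the random division can be repeated.

```python
import random


def k_fold_splits(records, k, seed=0):
    """Yield a (train, test) pair of lists for each of the k folds."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # random but repeatable division
    folds = [shuffled[i::k] for i in range(k)]  # k nearly equal-sized sets
    for i in range(k):
        test = folds[i]
        # Training data is everything outside the held-out fold.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test


def average_error(error_rates):
    """Average the per-fold testing errors."""
    return sum(error_rates) / len(error_rates)
```

Each (train, test) pair would be written out as filestem.data and filestem.test before running c4.5, and the k testing errors read from the output would then be passed to `average_error`.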
