Weka: Practical Machine Learning Tools and Techniques with Java Implementations
Proceedings of the ICONIP/ANZIIS/ANNES'99 Workshop on Emerging Knowledge Engineering and Connectionist-Based Information Systems, pages 192-196, 1999. Dunedin, New Zealand.
Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo Cunningham.
Reporter: Jin-huei Dai
OUTLINE
1. Introduction
2. The command-line interface
3. The Explorer
4. The Knowledge Flow interface
5. The Experimenter
6. Conclusions
7. References
1. Introduction
Data mining is an experimental science. Machine learning provides the technical basis of data mining. The Weka workbench is a collection of state-of-the-art machine learning algorithms and data preprocessing tools. It is designed so that users can quickly try out existing methods on new datasets in flexible ways. It provides extensive support for the whole process of experimental data mining, including preparing the input data, evaluating learning schemes statistically, and visualizing the input data and the result of learning. Weka was developed at the University of Waikato in New Zealand, and the name stands for Waikato Environment for Knowledge Analysis.
1. Introduction (cont.)
Weka is freely available on the World-Wide Web and accompanies a new text on data mining which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform. The Weka software is written entirely in Java to facilitate the availability of data mining tools regardless of computer platform. The primary learning methods in Weka are “classifiers”, and they induce a rule set or decision tree that models the data. Weka also includes algorithms for learning association rules and clustering data.
3. The Explorer p.375

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     weather
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree

outlook = sunny
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)

Number of Leaves  : 5
Size of the tree  : 8
Time taken to build model: 0.08 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0.186
Mean absolute error                      0.2857
Root mean squared error                  0.4818
Relative absolute error                 60      %
Root relative squared error             97.6586 %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.778     0.6       0.7         0.778    0.737       yes
0.4       0.222     0.5         0.4      0.444       no

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no
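The summary figures above can be recomputed from the confusion matrix alone. The following is a minimal sketch in Python (not Weka code; all variable and function names are illustrative) that appears to reproduce the accuracy, the kappa statistic, and the per-class rates reported in the output.

```python
# Confusion matrix from the cross-validation run above:
# rows are actual classes, columns are predicted classes (order: yes, no).
matrix = [[7, 2],
          [3, 2]]
labels = ["yes", "no"]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Kappa compares the observed agreement with the agreement expected by chance,
# where chance agreement is derived from the row and column totals.
row_totals = [sum(row) for row in matrix]
col_totals = [sum(col) for col in zip(*matrix)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)


def per_class(i):
    """Return (TP rate, FP rate, precision, F-measure) for class index i."""
    tp = matrix[i][i]
    fn = row_totals[i] - tp
    fp = col_totals[i] - tp
    tn = total - tp - fn - fp
    tp_rate = tp / (tp + fn)          # identical to recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return tp_rate, fp_rate, precision, f_measure


stats = {label: per_class(i) for i, label in enumerate(labels)}

print(f"accuracy = {accuracy:.4%}, kappa = {kappa:.3f}")
for label, values in stats.items():
    print(label, [round(v, 3) for v in values])
```

The rows of the matrix are the actual classes, so the TP rate for "yes" is 7 out of the 9 actual "yes" instances, while the precision is 7 out of the 10 instances predicted as "yes".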
=== Run information ===

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0
Relation:     weather.symbolic
Instances:    14
Attributes:   5

=== Associator model (full training set) ===

Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6

Best rules found:

 1. humidity=normal windy=FALSE 4 ==> play=yes 4    conf:(1)
 2. temperature=cool 4 ==> humidity=normal 4    conf:(1)
 3. outlook=overcast 4 ==> play=yes 4    conf:(1)
 4. temperature=cool play=yes 3 ==> humidity=normal 3    conf:(1)
 5. outlook=rainy windy=FALSE 3 ==> play=yes 3    conf:(1)
 6. outlook=rainy play=yes 3 ==> windy=FALSE 3    conf:(1)
 7. outlook=sunny humidity=high 3 ==> play=no 3    conf:(1)
 8. outlook=sunny play=no 3 ==> humidity=high 3    conf:(1)
 9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2    conf:(1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2    conf:(1)
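Each rule above lists the number of instances matching the premise, the number matching premise and consequence together, and their ratio (the confidence). The sketch below shows that calculation in Python. It is not the Apriori search itself, only the scoring of a given rule. The 14 instances are an assumption: they are the commonly distributed nominal weather data, which the slides do not list, and the helper names are illustrative.

```python
# ASSUMPTION: these 14 rows are the standard nominal weather data
# (weather.symbolic); the slides show only the rules, not the instances.
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
data = [dict(zip(COLUMNS, row)) for row in ROWS]


def count(items):
    """Number of instances containing every attribute=value pair in items."""
    return sum(all(inst[a] == v for a, v in items.items()) for inst in data)


def confidence(premise, consequence):
    """Return (premise count, rule count, confidence) for premise ==> consequence."""
    n_premise = count(premise)
    n_rule = count({**premise, **consequence})
    return n_premise, n_rule, n_rule / n_premise


rule1 = confidence({"humidity": "normal", "windy": "FALSE"}, {"play": "yes"})
rule7 = confidence({"outlook": "sunny", "humidity": "high"}, {"play": "no"})
# A rule that falls below the 0.9 minimum confidence (-C 0.9) and is not listed.
weak = confidence({"windy": "TRUE"}, {"play": "no"})

print(rule1, rule7, weak)
```

Under the assumed data, rules 1 and 7 come out with the same counts as in the listing, while the last rule holds for only half of its matching instances.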
=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     soybean
Instances:    683
Attributes:   36
              date
              plant-stand
              precip
              temp
              hail
              crop-hist
              area-damaged
              severity
              seed-tmt
              germination
              plant-growth
              leaves
              leafspots-halo
              leafspots-marg
              leafspot-size
              leaf-shread
              leaf-malf
              leaf-mild
              stem
              lodging
              stem-cankers
              canker-lesion
              fruiting-bodies
              external-decay
              mycelium
              int-discolor
              sclerotia
              fruit-pods
              fruit-spots
              seed
              mold-growth
              seed-discolor
              seed-size
              shriveling
              roots
              class
Test mode:    evaluate on training data

J48 pruned tree
------------------

leafspot-size = lt-1/8
| canker-lesion = dna
| | leafspots-marg = w-s-marg
| | | seed-size = norm: bacterial-blight (21.0/1.0)
| | | seed-size = lt-norm: bacterial-pustule (3.23/1.23)
| | leafspots-marg = no-w-s-marg: bacterial-pustule (17.91/0.91)
| | leafspots-marg = dna: bacterial-blight (0.0)
| canker-lesion = brown: bacterial-blight (0.0)
| canker-lesion = dk-brown-blk: phytophthora-rot (4.78/0.1)
| canker-lesion = tan: purple-seed-stain (11.23/0.23)
leafspot-size = gt-1/8
| roots = norm
| | mold-growth = absent
| | | fruit-spots = absent
| | | | leaf-malf = absent
| | | | | fruiting-bodies = absent
| | | | | | date = april: brown-spot (5.0)
| | | | | | date = may: brown-spot (24.0/1.0)
| | | | | | date = june
| | | | | | | precip = lt-norm: phyllosticta-leaf-spot (4.0)
| | | | | | | precip = norm: brown-spot (5.0/2.0)
| | | | | | | precip = gt-norm: brown-spot (21.0)
| | | | | | date = july
| | | | | | | precip = lt-norm: phyllosticta-leaf-spot (1.0)
| | | | | | | precip = norm: phyllosticta-leaf-spot (2.0)
| | | | | | | precip = gt-norm: frog-eye-leaf-spot (11.0/5.0)
| | | | | | date = august
| | | | | | | leaf-shread = absent
| | | | | | | | seed-tmt = none: alternarialeaf-spot (16.0/4.0)
| | | | | | | | seed-tmt = fungicide
| | | | | | | | | plant-stand = normal: frog-eye-leaf-spot (6.0)
| | | | | | | | | plant-stand = lt-normal: alternarialeaf-spot (5.0/1.0)
| | | | | | | | seed-tmt = other: frog-eye-leaf-spot (3.0)
| | | | | | | leaf-shread = present: alternarialeaf-spot (2.0)
| | | | | | date = september
| | | | | | | stem = norm: alternarialeaf-spot (44.0/4.0)
| | | | | | | stem = abnorm: frog-eye-leaf-spot (2.0)
| | | | | | date = october: alternarialeaf-spot (31.0/1.0)
| | | | | fruiting-bodies = present: brown-spot (34.0)
| | | | leaf-malf = present: phyllosticta-leaf-spot (10.0)
| | | fruit-spots = colored
| | | | fruit-pods = norm: brown-spot (2.0)
| | | | fruit-pods = diseased: frog-eye-leaf-spot (62.0)
| | | | fruit-pods = few-present: frog-eye-leaf-spot (0.0)
| | | | fruit-pods = dna: frog-eye-leaf-spot (0.0)
| | | fruit-spots = brown-w/blk-specks
| | | | crop-hist = diff-lst-year: brown-spot (0.0)
| | | | crop-hist = same-lst-yr: brown-spot (2.0)
| | | | crop-hist = same-lst-two-yrs: brown-spot (0.0)
| | | | crop-hist = same-lst-sev-yrs: frog-eye-leaf-spot (2.0)
| | | fruit-spots = distort: brown-spot (0.0)
| | | fruit-spots = dna: brown-stem-rot (9.0)
| | mold-growth = present
| | | leaves = norm: diaporthe-pod-&-stem-blight (7.25)
| | | leaves = abnorm: downy-mildew (20.0)
| roots = rotted
| | area-damaged = scattered: herbicide-injury (1.1/0.1)
| | area-damaged = low-areas: phytophthora-rot (30.03)
| | area-damaged = upper-areas: phytophthora-rot (0.0)
| | area-damaged = whole-field: herbicide-injury (3.66/0.66)
| roots = galls-cysts: cyst-nematode (7.81/0.17)
leafspot-size = dna
| int-discolor = none
| | leaves = norm
| | | stem-cankers = absent
| | | | canker-lesion = dna: diaporthe-pod-&-stem-blight (5.53)
| | | | canker-lesion = brown: purple-seed-stain (0.0)
| | | | canker-lesion = dk-brown-blk: purple-seed-stain (0.0)
| | | | canker-lesion = tan: purple-seed-stain (9.0)
| | | stem-cankers = below-soil: rhizoctonia-root-rot (19.0)
| | | stem-cankers = above-soil: anthracnose (0.0)
| | | stem-cankers = above-sec-nde: anthracnose (24.0)
| | leaves = abnorm
| | | stem = norm
| | | | plant-growth = norm: powdery-mildew (22.0/2.0)
| | | | plant-growth = abnorm: cyst-nematode (4.3/0.39)
| | | stem = abnorm
| | | | plant-stand = normal
| | | | | leaf-malf = absent
| | | | | | seed = norm: diaporthe-stem-canker (21.0/1.0)
| | | | | | seed = abnorm: anthracnose (9.0)
| | | | | leaf-malf = present: 2-4-d-injury (3.0)
| | | | plant-stand = lt-normal
| | | | | fruiting-bodies = absent: phytophthora-rot (50.16/7.61)
| | | | | fruiting-bodies = present
| | | | | | roots = norm: anthracnose (11.0/1.0)
| | | | | | roots = rotted: phytophthora-rot (12.89/2.15)
| | | | | | roots = galls-cysts: phytophthora-rot (0.0)
| int-discolor = brown
| | leaf-malf = absent: brown-stem-rot (35.73/0.73)
| | leaf-malf = present: 2-4-d-injury (3.15/0.68)
| int-discolor = black: charcoal-rot (22.22/2.22)
Number of Leaves  : 61
Size of the tree  : 93

Time taken to build model: 0.05 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances         658               96.3397 %
Incorrectly Classified Instances        25                3.6603 %
Kappa statistic                          0.9598
Mean absolute error                      0.0104
Root mean squared error                  0.0625
Relative absolute error                 10.7981 %
Root relative squared error             28.5358 %
Total Number of Instances              683
=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
1         0.002     0.952       1        0.976       diaporthe-stem-canker
1         0         1           1        1           charcoal-rot
0.95      0         1           0.95     0.974       rhizoctonia-root-rot
1         0.008     0.946       1        0.972       phytophthora-rot
1         0         1           1        1           brown-stem-rot
1         0         1           1        1           powdery-mildew
1         0         1           1        1           downy-mildew
0.978     0.005     0.968       0.978    0.973       brown-spot
1         0.002     0.952       1        0.976       bacterial-blight
0.95      0         1           0.95     0.974       bacterial-pustule
1         0         1           1        1           purple-seed-stain
0.977     0         1           0.977    0.989       anthracnose
0.85      0         1           0.85     0.919       phyllosticta-leaf-spot
0.967     0.017     0.898       0.967    0.931       alternarialeaf-spot
0.89      0.008     0.942       0.89     0.915       frog-eye-leaf-spot
1         0         1           1        1           diaporthe-pod-&-stem-blight
1         0         1           1        1           cyst-nematode
1         0         1           1        1           2-4-d-injury
0.5       0         1           0.5      0.667       herbicide-injury
=== Confusion Matrix ===

  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s   <-- classified as
 20  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 | a = diaporthe-stem-canker
  0 20  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 | b = charcoal-rot
  1  0 19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 | c = rhizoctonia-root-rot
  0  0  0 88  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 | d = phytophthora-rot
  0  0  0  0 44  0  0  0  0  0  0  0  0  0  0  0  0  0  0 | e = brown-stem-rot
  0  0  0  0  0 20  0  0  0  0  0  0  0  0  0  0  0  0  0 | f = powdery-mildew
  0  0  0  0  0  0 20  0  0  0  0  0  0  0  0  0  0  0  0 | g = downy-mildew
  0  0  0  0  0  0  0 90  0  0  0  0  0  0  2  0  0  0  0 | h = brown-spot
  0  0  0  0  0  0  0  0 20  0  0  0  0  0  0  0  0  0  0 | i = bacterial-blight
  0  0  0  0  0  0  0  0  1 19  0  0  0  0  0  0  0  0  0 | j = bacterial-pustule
  0  0  0  0  0  0  0  0  0  0 20  0  0  0  0  0  0  0  0 | k = purple-seed-stain
  0  0  0  1  0  0  0  0  0  0  0 43  0  0  0  0  0  0  0 | l = anthracnose
  0  0  0  0  0  0  0  3  0  0  0  0 17  0  0  0  0  0  0 | m = phyllosticta-leaf-spot
  0  0  0  0  0  0  0  0  0  0  0  0  0 88  3  0  0  0  0 | n = alternarialeaf-spot
  0  0  0  0  0  0  0  0  0  0  0  0  0 10 81  0  0  0  0 | o = frog-eye-leaf-spot
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 15  0  0  0 | p = diaporthe-pod-&-stem-blight
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 14  0  0 | q = cyst-nematode
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 16  0 | r = 2-4-d-injury
  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  4 | s = herbicide-injury
Association rules
Weka contains an implementation of the Apriori learner for generating association rules, a commonly used technique in market basket analysis. This algorithm does not seek rules that predict a particular class attribute but rather looks for any rules that capture strong associations between different attributes.

Clustering
Clustering methods also do not seek rules that predict a particular class, but rather try to divide the data into natural groups or “clusters.” Weka includes an implementation of the EM algorithm, which can be used for unsupervised learning. It makes the assumption that all attributes are independent random variables.
PredictiveApriori

Best rules found:

 1. outlook=overcast 4 ==> play=yes 4    acc:(0.95323)
 2. temperature=cool 4 ==> humidity=normal 4    acc:(0.95323)
 3. humidity=normal windy=FALSE 4 ==> play=yes 4    acc:(0.95323)
 4. outlook=sunny humidity=high 3 ==> play=no 3    acc:(0.92093)
 5. outlook=sunny play=no 3 ==> humidity=high 3    acc:(0.92093)
 6. outlook=rainy windy=FALSE 3 ==> play=yes 3    acc:(0.92093)
 7. outlook=rainy play=yes 3 ==> windy=FALSE 3    acc:(0.92093)
 8. outlook=sunny temperature=hot 2 ==> humidity=high play=no 2    acc:(0.86233)
 9. outlook=sunny humidity=normal 2 ==> play=yes 2    acc:(0.86233)
10. outlook=sunny play=yes 2 ==> humidity=normal 2    acc:(0.86233)
11. outlook=overcast temperature=hot 2 ==> windy=FALSE play=yes 2    acc:(0.86233)
12. outlook=overcast windy=FALSE 2 ==> temperature=hot play=yes 2    acc:(0.86233)
13. outlook=rainy humidity=high 2 ==> temperature=mild 2    acc:(0.86233)
14. outlook=rainy windy=TRUE 2 ==> play=no 2    acc:(0.86233)
15. outlook=rainy play=no 2 ==> windy=TRUE 2    acc:(0.86233)
16. temperature=hot play=yes 2 ==> outlook=overcast windy=FALSE 2    acc:(0.86233)
17. temperature=hot play=no 2 ==> outlook=sunny humidity=high 2    acc:(0.86233)
18. temperature=mild humidity=normal 2 ==> play=yes 2    acc:(0.86233)
19. temperature=mild play=no 2 ==> humidity=high 2    acc:(0.86233)
20. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2    acc:(0.86233)
Scheme:       weka.clusterers.Cobweb -A 1.0 -C 0.0028209479177387815
Relation:     weather

Number of merges: 1
Number of splits: 0
Number of clusters: 21

node 0 [14]
| node 1 [5]
| | leaf 2 [1]
| node 1 [5]
| | leaf 3 [1]
| node 1 [5]
| | node 4 [2]
| | | leaf 5 [1]
| | node 4 [2]
| | | leaf 6 [1]
| node 1 [5]
| | leaf 7 [1]
node 0 [14]
node 0 [14]
| node 8 [6]
| | node 9 [2]
| | | leaf 10 [1]
| | node 9 [2]
| | | leaf 11 [1]
| node 8 [6]
| | leaf 12 [1]
| node 8 [6]
| | node 13 [3]
| | | leaf 14 [1]
| | node 13 [3]
| | | leaf 15 [1]
| | node 13 [3]
| | | leaf 16 [1]
node 0 [14]
| node 17 [3]
| | leaf 18 [1]
| node 17 [3]
| | leaf 19 [1]
| node 17 [3]
| | leaf 20 [1]
Select Attributes

=== Run information ===

Evaluator:    weka.attributeSelection.PrincipalComponents -R 0.95 -A 5
Search:       weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation:     weather
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Evaluation mode:    evaluate on all training data

Search Method:
        Attribute ranking.

Attribute Evaluator (unsupervised):
        Principal Components Attribute Transformer

Correlation matrix
  1      -0.47   -0.56    0.31    0.03    0.04
 -0.47    1      -0.47    0.14   -0.17   -0.09
 -0.56   -0.47    1      -0.44    0.13    0.04
  0.31    0.14   -0.44    1       0.32    0.33
  0.03   -0.17    0.13    0.32    1       0.2
  0.04   -0.09    0.04    0.33    0.2     1

eigenvalue   proportion   cumulative
1.94405      0.32401      0.32401      0.578temperature-0.571outlook=rainy+0.506outlook=sunny+0.227windy+0.164humidity...
1.58814      0.26469      0.5887       -0.68outlook=overcast+0.443humidity+0.424outlook=rainy+0.334windy+0.217outlook=sunny...
1.29207      0.21534      0.80404      0.567outlook=sunny-0.443windy-0.432outlook=overcast-0.414humidity-0.312temperature...
0.79269      0.13212      0.93616      0.738windy-0.667humidity-0.077temperature-0.052outlook=overcast+0.033outlook=rainy...
0.38305      0.06384      1            0.748temperature-0.4humidity+0.348outlook=rainy-0.308windy-0.191outlook=overcast...
Eigenvectors
 V1        V2        V3        V4        V5
 0.5064    0.2166    0.5674    0.0167   -0.1683   outlook=sunny
 0.0684   -0.6798   -0.4317   -0.0522   -0.1906   outlook=overcast
-0.5709    0.4244   -0.1603    0.0325    0.348    outlook=rainy
 0.5785    0.053    -0.3125   -0.0772    0.7476   temperature
 0.1639    0.4432   -0.4145   -0.6669   -0.4003   humidity
 0.227     0.3341   -0.4433    0.7384   -0.3083   windy

Ranked attributes:
0.675990846445273472   1   0.578temperature-0.571outlook=rainy+0.506outlook=sunny+0.227windy+0.164humidity...
0.411301353536642624   2   -0.68outlook=overcast+0.443humidity+0.424outlook=rainy+0.334windy+0.217outlook=sunny...
0.195956514975330624   3   0.567outlook=sunny-0.443windy-0.432outlook=overcast-0.414humidity-0.312temperature...
0.063841150150769192   4   0.738windy-0.667humidity-0.077temperature-0.052outlook=overcast+0.033outlook=rainy...
0.000000000000000111   5   0.748temperature-0.4humidity+0.348outlook=rainy-0.308windy-0.191outlook=overcast...

Selected attributes: 1,2,3,4,5 : 5
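The proportion and cumulative columns of the principal components output follow directly from the eigenvalues. The Python sketch below recomputes them. It also checks an observation made from the numbers in the output rather than from Weka documentation: each value in the "Ranked attributes" list appears to equal 1 minus the cumulative proportion. Variable names are illustrative.

```python
from itertools import accumulate

# Eigenvalues as printed in the principal components output above.
eigenvalues = [1.94405, 1.58814, 1.29207, 0.79269, 0.38305]

# The total is about 6: one unit of variance per standardized input column
# (outlook contributes three binary columns, plus temperature, humidity, windy).
total = sum(eigenvalues)

# Share of the total variance explained by each component, and the running sum.
proportions = [ev / total for ev in eigenvalues]
cumulative = list(accumulate(proportions))

# ASSUMPTION (inferred from the printed numbers): the ranking value is the
# share of variance still unexplained after including the component.
ranking = [1 - c for c in cumulative]

for ev, p, c, r in zip(eigenvalues, proportions, cumulative, ranking):
    print(f"{ev:8.5f} {p:8.5f} {c:8.5f} {r:8.5f}")
```

With -R 0.95, components are retained until 95 % of the variance is covered. The first four components reach only 0.93616, which is consistent with all five being selected.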
Search Method:
        Attribute ranking.

Attribute Evaluator (supervised, Class (nominal): 5 play):
        Symmetrical Uncertainty Ranking Filter

Ranked attributes:
 0.196   1 outlook
 0.05    4 windy
 0       3 humidity
 0       2 temperature

Selected attributes: 1,4,3,2 : 4

========================

Search Method:
        Attribute ranking.

OneR feature evaluator.
Using 10 fold cross validation for evaluating attributes.
Minimum bucket size for OneR: 6

Ranked attributes:
57.143   3 humidity
50       1 outlook
50       2 temperature
42.857   4 windy

Selected attributes: 3,1,2,4 : 4

=========================

Search Method:
        Attribute ranking.

Information Gain Ranking Filter

Ranked attributes:
 0.2467   1 outlook
 0.0481   4 windy
 0        3 humidity
 0        2 temperature

Selected attributes: 1,4,3,2 : 4
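The information gain and symmetrical uncertainty scores for the two nominal attributes can be checked by hand. The sketch below uses the textbook definitions: gain = H(class) - H(class | attribute), and SU = 2 * gain / (H(attribute) + H(class)). The class counts per attribute value are assumptions taken from the standard weather data (the outlook counts agree with the J48 tree shown earlier; the windy counts are not given in the slides). Function names are illustrative, not Weka's.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def class_totals(table):
    """Sum the [yes, no] counts over all attribute values."""
    return [sum(col) for col in zip(*table.values())]


def info_gain(table):
    """table maps each attribute value to its [yes count, no count]."""
    totals = class_totals(table)
    n = sum(totals)
    # Expected class entropy after splitting on the attribute.
    remainder = sum(sum(c) / n * entropy(c) for c in table.values())
    return entropy(totals) - remainder


def symmetrical_uncertainty(table):
    value_totals = [sum(c) for c in table.values()]
    return 2 * info_gain(table) / (entropy(value_totals) + entropy(class_totals(table)))


# ASSUMPTION: [play=yes, play=no] counts from the standard weather data.
outlook = {"sunny": [2, 3], "overcast": [4, 0], "rainy": [3, 2]}
windy = {"TRUE": [3, 3], "FALSE": [6, 2]}

for name, table in (("outlook", outlook), ("windy", windy)):
    print(name, round(info_gain(table), 4), round(symmetrical_uncertainty(table), 3))
```

Temperature and humidity are numeric in this relation, so they would have to be discretized before either score can be computed. The zeros in the output suggest that no useful split was found for them, but that step is not reproduced here.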
Search Method: Best first, Exhaustive Search.

Selected attributes: 1,4 : 2
        outlook
        windy

============================

Search Method:
        Genetic search.

Initial population
merit     scaled    subset
0         0.03362   2
0         0.03362   2
0.04999   0.0548    4
0.06572   0.06147   1 2 3 4
...
0.17354   0.10716   1 4
0         0.03362   3
0         0.03362   2
0         0.03362   2

Generation: 20
merit     scaled    subset
0.19601   0.2076    1
0.19601   0.2076    1
0.19601   0.2076    1
0.19601   0.2076    1
0.17354   0.16236   1 4
0.09292   0         1 3 4
...
0.19601   0.2076    1

Attribute Subset Evaluator (supervised, Class (nominal): 5 play):
        CFS Subset Evaluator
        Including locally predictive attributes

Selected attributes: 1,4 : 2
        outlook
        windy