W E K A: Waikato Environment for Knowledge Analysis
Goals of the workshop
• Acquisition of functional knowledge about the WEKA platform
• Ability to process (own) data in WEKA
The workshop follows the typical data-mining workflow: identifying a problem, transforming it into data, choosing an appropriate DM technique, applying it to the data, evaluating & interpreting the results, and writing the seminar work.
What is WEKA? Some basic facts about WEKA:
• WEKA (1) = a flightless bird with an inquisitive nature (found only on the islands of New Zealand)
• WEKA (2) = a software 'workbench' incorporating several standard ML/DM techniques
• Authors = Ian H. Witten, Eibe Frank (et al.)
• Programming language = Java
• Origin = The University of Waikato, New Zealand
• Literature = Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999
• Homepage = http://www.cs.waikato.ac.nz/~ml/weka
Objectives of WEKA
• make ML/DM techniques generally available
• apply them to practical problems (in agriculture)
• develop new ML/DM algorithms
• contribute to the theoretical framework of the field (ML/DM)
Versions of WEKA
There are several versions of WEKA:
• WEKA 3.0: "book version", compatible with the description in the data mining book
• WEKA 3.2: "GUI version", adds graphical user interfaces (the book version is command-line only)
• WEKA 3.4: "development version", with many improvements
This workshop is based on WEKA 3.4(.3).
The input to WEKA: the ARFF format ("flat" files)
Example: the play-tennis (weather) domain:

% this is an example of a knowledge
% domain in ARFF format
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
. . .

Conversion to the ARFF format? Example: converting from MS-EXCEL to ARFF (see the sketch below).
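One way to do the Excel-to-ARFF conversion is to export the spreadsheet as CSV and let WEKA's own converter classes do the rest. A minimal sketch, assuming weka.jar is on the CLASSPATH and the sheet has been saved as weather.csv with attribute names in the first row (both file names are placeholders):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class Csv2Arff {
    public static void main(String[] args) throws Exception {
        // read the CSV file exported from Excel (first row = attribute names)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("weather.csv"));   // assumed file name
        Instances data = loader.getDataSet();

        // write the same data set back out in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("weather.arff"));     // assumed file name
        saver.writeBatch();
    }
}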
A quick tour of the "explorer": the Preprocess panel
[Screenshot of the Preprocess panel, with callouts for: the Filters panel, the Domain info panel, the Attribute info panel, the Attributes panel, the Attribute visualization panel, the Status bar, and the Log file]
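The filtering done in the Filters panel can also be done programmatically. A minimal sketch, assuming the weather data is stored in weather.arff and using WEKA's Remove attribute filter (the file name and the choice of filter are assumptions for illustration):

import java.io.FileReader;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class FilterExample {
    public static void main(String[] args) throws Exception {
        // load the data set (file name is a placeholder)
        Instances data = new Instances(new FileReader("weather.arff"));

        // configure a filter that drops the first attribute ("outlook")
        Remove remove = new Remove();
        remove.setAttributeIndices("1");
        remove.setInputFormat(data);        // must be called before useFilter()

        Instances filtered = Filter.useFilter(data, remove);
        System.out.println(filtered.toSummaryString());
    }
}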
A quick tour of the "explorer": the Classify panel
[Screenshot of the Classify panel, with callouts for: the Classifier panel, the Test options panel, the Class attribute selector, the Result panel, and the Output panel]
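What the Classify panel does with the "Cross-validation" test option can be reproduced in a few lines of Java. A minimal sketch, assuming the weather data is in weather.arff (file name is a placeholder) and using J48 with default parameters:

import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CrossValidateJ48 {
    public static void main(String[] args) throws Exception {
        // load the data and mark the last attribute ("play") as the class
        Instances data = new Instances(new FileReader("weather.arff"));
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of a (pruned) J48 decision tree
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
    }
}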
A quick tour of the "explorer": the Visualize panel
The command line. Example: running J48 without arguments prints the usage message:

C:\Temp>java weka.classifiers.trees.J48
Weka exception: No training file and no object input file given.

General options:
-t <name of training file>   Sets training file.
-T <name of test file>   Sets test file. If missing, a cross-validation will be performed on the training data.
-c <class index>   Sets index of class attribute (default: last).
-x <number of folds>   Sets number of folds for cross-validation (default: 10).
-s <random number seed>   Sets random number seed for cross-validation (default: 1).
-m <name of file with cost matrix>   Sets file with cost matrix.
-l <name of input file>   Sets model input file.
-d <name of output file>   Sets model output file.
-v   Outputs no statistics for training data.
-o   Outputs statistics only, not the classifier.
-i   Outputs detailed information-retrieval statistics for each class.
-k   Outputs information-theoretic statistics.
-p   Only outputs predictions for test instances.
-r   Only outputs cumulative margin distribution.
-z <class name>   Only outputs the source representation of the classifier, giving it the supplied name.
-g   Only outputs the graph representation of the classifier.

Options specific to weka.classifiers.j48.J48:
-U   Use unpruned tree.
-C <pruning confidence>   Set confidence threshold for pruning. (default 0.25)
-M <minimum number of instances>   Set minimum number of instances per leaf. (default 2)
-R   Use reduced error pruning.
-N <number of folds>   Set number of folds for reduced error pruning. One fold is used as pruning set. (default 3)
-B   Use binary splits only.
-S   Don't perform subtree raising.
-L   Do not clean up after the tree has been built.
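For an actual run, using only the options listed above and assuming weka.jar is on the CLASSPATH (as in the slide) and the weather data has been saved as weather.arff, a 10-fold cross-validation of a pruned J48 tree could be started with:

C:\Temp>java weka.classifiers.trees.J48 -t weather.arff -x 10 -C 0.25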
GUI vs. command line
Command line (+):
• full functionality ('saving the model', batch processing)
Command line (-):
• only textual visualisation of models
• awkward to use
GUI (+):
• visualisation of data and (some) models
GUI (-):
• not all the parameters can be set (reduced functionality)
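To illustrate the 'saving the model' and batch-processing points with only the -t, -d, -l, -T and -p options from the command-line slide (all file names are assumptions), a model could be trained once and then reused on new data:

C:\Temp>java weka.classifiers.trees.J48 -t weather.arff -d j48.model
C:\Temp>java weka.classifiers.trees.J48 -l j48.model -T new.arff -p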
PROs & CONs of WEKA PROs: • open source (GNU licence) • platform-independent (JAVA) • easy to use • (relatively) easy to modify • CONs: • relatively slow (JAVA) • ‘incomplete’documentation • (some GUI features could • be explained better) • some features available • only from command line