WEKA is a versatile software 'workbench' for machine learning and data mining. This workshop introduces the basics of WEKA: its objectives, versions, input format, and practical usage through the GUI and the command line, and closes with the pros and cons of the platform. Participants learn to acquire, process, and analyse their own data using WEKA's ML algorithms.
W E K A: Waikato Environment for Knowledge Analysis
Goals of the workshop
• Acquisition of functional knowledge about the WEKA platform
• Ability to process (one's own) data in WEKA
[Workflow diagram: identify a problem → transform it into data → choose an appropriate DM technique → apply it to the data → evaluate & interpret the results → write the seminar work]
What is WEKA? Some basic facts about WEKA:
• WEKA (1) = a flightless bird with an inquisitive nature (found only on the islands of New Zealand)
• WEKA (2) = a software 'workbench' incorporating several standard ML/DM techniques
• Authors = Ian H. Witten, Eibe Frank (et al.)
• Programming language = Java
• Origin = The University of Waikato, New Zealand
• Literature = Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999
• Homepage = http://www.cs.waikato.ac.nz/~ml/weka
Objectives of WEKA
• make ML/DM techniques generally available
• apply them to practical problems (in agriculture)
• develop new ML/DM algorithms
• contribute to the theoretical framework of the field (ML/DM)
Versions of WEKA
There are several versions of WEKA:
• WEKA 3.0: "book version", compatible with the description in the data mining book
• WEKA 3.2: "GUI version", adds graphical user interfaces (the book version is command-line only)
• WEKA 3.4: "development version", with lots of improvements
• This workshop is based on WEKA 3.4(.3)
The input to WEKA
ARFF format ("flat" files)
• Example: the play-tennis (weather) domain

% this is an example of a knowledge
% domain in ARFF format
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
...

How do we convert existing data to the ARFF format?
• Example: converting from MS Excel to ARFF (see the sketch below)
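One way to carry out the Excel-to-ARFF conversion is to export the worksheet from Excel as a CSV file and let WEKA build the ARFF header from it. A minimal sketch, assuming a hypothetical file weather.csv and weka.jar already on the CLASSPATH (the CSVLoader converter ships with WEKA 3.4; check your version):

C:\Temp>java weka.core.converters.CSVLoader weather.csv > weather.arff

The loader guesses each attribute's type (nominal vs. numeric) from the values in the CSV file and prints the result in ARFF format to standard output; the redirection stores it as weather.arff, which can then be opened in the Explorer's Preprocess panel.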
A quick tour of the "Explorer": the Preprocess panel
[Screenshot of the Preprocess panel; the callouts label the filters panel, domain info panel, attribute info panel, attributes panel, attribute visualization panel, status bar and log file]
A quick tour of the "Explorer": the Classify panel
[Screenshot of the Classify panel; the callouts label the classifier panel, test options panel, class attribute selector, result panel and output panel]
A quick tour of the "Explorer": the Visualize panel
The command line
Example: running a classifier without arguments prints its usage:

C:\Temp>java weka.classifiers.trees.J48
Weka exception: No training file and no object input file given.

General options:
-t <name of training file>  Sets training file.
-T <name of test file>  Sets test file. If missing, a cross-validation will be performed on the training data.
-c <class index>  Sets index of class attribute (default: last).
-x <number of folds>  Sets number of folds for cross-validation (default: 10).
-s <random number seed>  Sets random number seed for cross-validation (default: 1).
-m <name of file with cost matrix>  Sets file with cost matrix.
-l <name of input file>  Sets model input file.
-d <name of output file>  Sets model output file.
-v  Outputs no statistics for training data.
-o  Outputs statistics only, not the classifier.
-i  Outputs detailed information-retrieval statistics for each class.
-k  Outputs information-theoretic statistics.
-p  Only outputs predictions for test instances.
-r  Only outputs cumulative margin distribution.
-z <class name>  Only outputs the source representation of the classifier, giving it the supplied name.
-g  Only outputs the graph representation of the classifier.

Options specific to weka.classifiers.j48.J48:
-U  Use unpruned tree.
-C <pruning confidence>  Set confidence threshold for pruning. (default 0.25)
-M <minimum number of instances>  Set minimum number of instances per leaf. (default 2)
-R  Use reduced error pruning.
-N <number of folds>  Set number of folds for reduced error pruning. One fold is used as pruning set. (default 3)
-B  Use binary splits only.
-S  Don't perform subtree raising.
-L  Do not clean up after the tree has been built.
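For illustration, a typical training run might look like the one below; weather.arff is only a placeholder name, and only options from the help text above are used:

C:\Temp>java weka.classifiers.trees.J48 -t weather.arff -x 10

With just a training file given, WEKA prints the induced tree, its error on the training data, and a stratified cross-validation estimate; -x 10 simply makes the default number of folds explicit.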
GUI vs. command line
Command line (+):
• full functionality ('saving the model'; see the example below)
• batch processing
Command line (-):
• only textual visualisation of models
• awkward to use
GUI (+):
• visualisation of data and (some) models
GUI (-):
• not all parameters can be set (reduced functionality)
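To illustrate the 'saving the model' point, the -d and -l options from the previous slide can be combined into a two-step batch run; the file names here are placeholders, not part of the original slides:

C:\Temp>java weka.classifiers.trees.J48 -t weather.arff -d weather.model
C:\Temp>java weka.classifiers.trees.J48 -l weather.model -T new-weather.arff

The first command trains the tree and writes the serialized classifier to weather.model; the second reloads it and evaluates it on a separate test file without retraining.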
PROs & CONs of WEKA PROs: • open source (GNU licence) • platform-independent (JAVA) • easy to use • (relatively) easy to modify • CONs: • relatively slow (JAVA) • ‘incomplete’documentation • (some GUI features could • be explained better) • some features available • only from command line