Weka: Just do it • Free and Open Source ML Suite • Ian Witten & Eibe Frank • University of Waikato, New Zealand
Overview • Classifiers, regressors, and clusterers • Multiple evaluation schemes • Bagging and boosting • Feature selection: the right features and data are key to successful learning • Experimenter • Visualizer • Text not up to date. • They welcome additions.
Learning Tasks • Classification: given examples labelled from a finite domain, generate a procedure for labelling unseen examples. • Regression: given examples labelled with a real value, generate a procedure for labelling unseen examples. • Clustering: given a set of examples, partition them into “interesting” groups. What scientists want.
Data Format: IRIS
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
Etc.
General form: @ATTRIBUTE attribute-name REAL, or a list of values in braces
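To make the file layout concrete, here is a minimal Python sketch that reads the simple subset of the format shown on this slide. The function name `parse_arff` is hypothetical and not part of Weka; the sketch assumes no quoted names, no sparse rows, and no missing values, and it keeps data values as strings.

```python
def parse_arff(text: str):
    """Parse a simple ARFF-style text into (relation, attributes, rows).

    Illustrative only: handles the plain layout shown on the slide.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and % comments
            continue
        if in_data:
            rows.append(line.split(","))  # values kept as strings
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.upper()
        if keyword == "@RELATION":
            relation = rest.strip()
        elif keyword == "@ATTRIBUTE":
            name, _, kind = rest.strip().partition(" ")
            kind = kind.strip()
            if kind.startswith("{"):  # nominal attribute: list of values
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, rows


sample = """@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(sample)
print(relation, len(attributes), len(rows))
```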
J48 = Decision Tree
Leaf counts (n/m): n = # instances under the node, m = number wrong
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
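Reading the tree top to bottom amounts to a chain of threshold tests. The following Python sketch is a hand transcription of the tree on this slide, not code produced by Weka; the function name `classify_iris` is hypothetical.

```python
def classify_iris(petal_length: float, petal_width: float) -> str:
    """Hand transcription of the J48 tree on the slide (illustrative only)."""
    if petal_width <= 0.6:
        return "Iris-setosa"
    if petal_width <= 1.7:
        if petal_length <= 4.9:
            return "Iris-versicolor"
        # petal_length > 4.9
        if petal_width <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    # petal_width > 1.7
    return "Iris-virginica"


# The three sample rows from the IRIS data slide: (petallength, petalwidth)
for petal_length, petal_width in [(1.4, 0.2), (1.4, 0.2), (1.3, 0.2)]:
    print(classify_iris(petal_length, petal_width))
```

Only two of the four attributes appear in the tree, so the sepal measurements are not needed to apply it.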
Cross-validation • Correctly Classified Instances 143 95.33% • Incorrectly Classified Instances 7 4.67% • Default: 10-fold cross-validation, i.e. • Split data into 10 equal-sized pieces • Train on 9 pieces and test on the remainder • Do for all possibilities and average
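The split, train, test, and average procedure can be sketched in a few lines of Python. This is a minimal illustration rather than Weka's implementation: the names `k_fold_indices` and `cross_validate` are hypothetical, the `evaluate` callback stands in for training and scoring a classifier, and the sketch shuffles but does not stratify by class.

```python
import random


def k_fold_indices(n: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal pieces
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n: int, k: int, evaluate) -> float:
    """Average evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


# With 150 instances and 10 folds, each round trains on 135 and tests on 15.
for train, test in k_fold_indices(150, 10):
    print(len(train), len(test))
```

Setting `k` equal to `n` gives one test instance per fold, which is the leave-one-out case.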
J48 Confusion Matrix
Old data set from statistics: 50 of each class
  a  b  c   <-- classified as
 49  1  0 | a = Iris-setosa
  0 47  3 | b = Iris-versicolor
  0  3 47 | c = Iris-virginica
Precision, Recall, and Accuracy • Precision: probability of being correct given your decision. • Precision of Iris-setosa is 49/49 = 100% • Positive predictive value in medical literature • Recall: probability of correctly identifying a class. • Recall for Iris-setosa is 49/50 = 98% • Sensitivity in medical literature • Accuracy: # right/total = 143/150 ≈ 95.3%
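These three measures can be read directly off the J48 confusion matrix. The Python sketch below assumes rows are actual classes and columns are predicted classes, as in the matrix slide; the helper names are hypothetical.

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows = actual class, columns = predicted class (J48 confusion matrix).
confusion = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]


def precision(matrix, c):
    """Correct predictions of class c divided by all predictions of c."""
    column_total = sum(row[c] for row in matrix)
    return matrix[c][c] / column_total


def recall(matrix, c):
    """Correct predictions of class c divided by all actual members of c."""
    return matrix[c][c] / sum(matrix[c])


def accuracy(matrix):
    """Diagonal total divided by the total number of instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for c, name in enumerate(labels):
    print(f"{name}: precision={precision(confusion, c):.3f} "
          f"recall={recall(confusion, c):.3f}")
print(f"accuracy={accuracy(confusion):.3f}")
```

Precision divides by a column total and recall by a row total, which is why the two differ whenever errors are not symmetric.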
Other Evaluation Schemes • Leave-one-out cross-validation • n-fold cross-validation where n = number of training instances • Specific train and test set • Allows for exact replication • OK if train/test large, e.g. in the 10,000 range.
Bootstrap sampling • Randomly select n instances with replacement from n • Expect about 2/3 to be chosen for training • Prob. of an instance not being chosen = (1 − 1/n)^n ≈ 1/e ≈ 0.37 • Test on the remainder • Repeat about 30 times and average. • Avoids partition bias
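The 1/e estimate is easy to check empirically. The Python sketch below draws bootstrap samples and compares the observed fraction of left-out instances with (1 − 1/n)^n; the function name `bootstrap_split`, the choice n = 150, and the fixed seed are assumptions made for illustration.

```python
import math
import random


def bootstrap_split(n: int, rng: random.Random):
    """Draw n indices with replacement; unchosen indices form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


n = 150                 # assumed data set size (matches the iris data)
rng = random.Random(0)  # fixed seed so the run is repeatable
fractions = []
for _ in range(30):     # repeat about 30 times, as on the slide
    train, test = bootstrap_split(n, rng)
    fractions.append(len(test) / n)

print("mean fraction left out:", sum(fractions) / len(fractions))
print("(1 - 1/n)^n =", (1 - 1 / n) ** n, " 1/e =", 1 / math.e)
```

In a full evaluation, each round would train on `train` and test on `test`, and the 30 scores would then be averaged.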