Learn how to use WEKA, an open-source machine learning workbench, to prepare data, build classifiers, and interpret results. Explore the different learning schemes, handy tools, and resources available in WEKA.
An Exercise in Machine Learning • http://www.cs.iastate.edu/~cs573x/bbsilab.html • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results • Test-driving WEKA
Machine Learning Software • Suites (general purpose) • WEKA (source: Java) • MLC++ (source: C++) • SIPINA • List from KDnuggets (various) • Task-specific tools • Classification: C4.5, SVMlight • Association rule mining • Bayesian networks … • Commercial vs. free vs. programming your own
What does WEKA do? • Implementations of state-of-the-art learning algorithms • Main strength is classification • Also includes regression, association rule, and clustering algorithms • Extensible, making it easy to try new learning schemes • Large variety of handy tools (dataset transformation, filters, visualization, etc.)
WEKA resources • API Documentation, Tutorial, Source code. • WEKA mailing list • Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations • Weka-related Projects: • Weka-Parallel - parallel processing for Weka • RWeka - linking R and Weka • YALE - Yet Another Learning Environment • Many others…
Getting Started • Installation (Java runtime + WEKA) • Setting up the environment (CLASSPATH) • Reference book and online API documentation • Preparing data sets • Running WEKA • Interpreting results
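Once weka.jar is on the CLASSPATH, both the GUI and individual learning schemes can be launched straight from the command line. A minimal sketch for a Unix-like shell (paths and file names here are placeholders):

```
# make WEKA's classes visible to the JVM
export CLASSPATH=./weka.jar:$CLASSPATH

# start the graphical Explorer
java weka.gui.explorer.Explorer

# or run a scheme directly: train J48 on a training file (-t)
java weka.classifiers.trees.J48 -t weather.arff
```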
ARFF Data Format • Attribute-Relation File Format • Header - describes the attribute types • Data - instances (examples) as comma-separated lists • Use the right data format: Filestem (C4.5) and CSV files must be converted to ARFF • Use C45Loader and CSVLoader to convert
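For reference, here is the toy weather dataset that ships with WEKA (the same data behind the decision tree output later in this exercise), abbreviated to the first few of its 14 instances:

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
```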
Data Filters • Useful support for data preprocessing • Removing or adding attributes, resampling the dataset, removing examples, etc. (see the filter sketch after this slide) • Can create stratified cross-validation folds of a given dataset, so that class distributions are approximately retained within each fold • A typical split reserves 2/3 of the data for training and 1/3 for testing
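As an illustration of the filter API, a minimal sketch that loads a dataset and drops its first attribute with the Remove filter (the file name is a placeholder):

```java
import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class FilterDemo {
    public static void main(String[] args) throws Exception {
        // load an ARFF file; the last attribute is the class
        Instances data = new Instances(
                new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // configure a filter that removes the first attribute
        Remove remove = new Remove();
        remove.setAttributeIndices("1");   // 1-based attribute range
        remove.setInputFormat(data);       // set format after all options

        Instances filtered = Filter.useFilter(data, remove);
        System.out.println(filtered.numAttributes() + " attributes remain");
    }
}
```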
Building Classifiers • A classifier is a model: a mapping from the dataset's attributes to the class (target) attribute. How the model is created, and what form it takes, differs between learning schemes (a build-and-predict sketch follows below). • Decision tree and Naïve Bayes classifiers • Which one is the best? • No free lunch!
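A minimal sketch of the common pattern: build a classifier on a dataset, then query it for predictions (file name is a placeholder):

```java
import java.io.BufferedReader;
import java.io.FileReader;

import weka.classifiers.trees.J48;
import weka.core.Instances;

public class BuildDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
                new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();          // C4.5 decision tree learner
        tree.buildClassifier(data);    // learn the attribute -> class mapping

        // predict the class of the first instance
        double label = tree.classifyInstance(data.instance(0));
        System.out.println("predicted: "
                + data.classAttribute().value((int) label));
    }
}
```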
(1) weka.classifiers.rules.ZeroR • Builds and uses a 0-R classifier: predicts the mean (for a numeric class) or the mode (for a nominal class). (2) weka.classifiers.bayes.NaiveBayes • Class for building a Naïve Bayes classifier
(3) weka.classifiers.trees.J48 • Class for generating an unpruned or a pruned C4.5 decision tree.
Test Options • Percentage split (2/3 training; 1/3 testing) • Cross-validation • estimates the generalization error by resampling when data are limited; the error estimates from the folds are averaged • stratified • 10-fold • leave-one-out (LOO) • 10-fold vs. LOO (a cross-validation sketch follows below)
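A sketch of running stratified 10-fold cross-validation through the API; the Evaluation class takes care of creating the folds and averaging the estimates:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CrossValDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
                new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // 10-fold stratified cross-validation, fixed seed for repeatability
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}
```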
Decision Tree Output (1)

J48 pruned tree
------------------
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves: 5
Size of the tree: 8

=== Error on training data ===
Correctly Classified Instances 14 (100 %)
Incorrectly Classified Instances 0 (0 %)
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 14

=== Detailed Accuracy By Class ===
TP Rate | FP Rate | Precision | Recall | F-Measure | Class
1 | 0 | 1 | 1 | 1 | yes
1 | 0 | 1 | 1 | 1 | no

=== Confusion Matrix ===
a b <-- classified as
9 0 | a = yes
0 5 | b = no
Decision Tree Output (2)

=== Stratified cross-validation ===
Correctly Classified Instances 9 (64.2857 %)
Incorrectly Classified Instances 5 (35.7143 %)
Kappa statistic 0.186
Mean absolute error 0.2857
Root mean squared error 0.4818
Relative absolute error 60 %
Root relative squared error 97.6586 %
Total Number of Instances 14

=== Detailed Accuracy By Class ===
TP Rate | FP Rate | Precision | Recall | F-Measure | Class
0.778 | 0.6 | 0.7 | 0.778 | 0.737 | yes
0.4 | 0.222 | 0.5 | 0.4 | 0.444 | no

=== Confusion Matrix ===
a b <-- classified as
7 2 | a = yes
3 2 | b = no
Performance Measures • Accuracy & error rate • Mean absolute error • Root mean squared error (square root of the average quadratic loss) • Confusion matrix - a contingency table • True-positive rate & false-positive rate • Precision & F-measure (worked through in the sketch below)
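These measures can be read straight off the confusion matrix. A quick sketch that re-derives the per-class figures for "yes" from the cross-validation output above:

```java
public class MeasuresDemo {
    public static void main(String[] args) {
        // Confusion matrix from the cross-validation run above:
        //            predicted yes   predicted no
        // actual yes       7              2
        // actual no        3              2
        double tp = 7, fn = 2, fp = 3, tn = 2;

        double tpRate    = tp / (tp + fn);   // 7/9  = 0.778 (= recall)
        double fpRate    = fp / (fp + tn);   // 3/5  = 0.6
        double precision = tp / (tp + fp);   // 7/10 = 0.7
        double f1 = 2 * precision * tpRate / (precision + tpRate); // 0.737

        System.out.printf("TP rate %.3f  FP rate %.3f  precision %.3f  "
                + "F-measure %.3f%n", tpRate, fpRate, precision, f1);
    }
}
```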
Decision Tree Pruning • Overcomes over-fitting • Pre-pruning and post-pruning • Reduced-error pruning • Subtree raising with different confidence factors • Compare tree size and accuracy (the corresponding J48 options are sketched below).
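J48 exposes these pruning choices as options. A sketch of the relevant setters, with the equivalent command-line flags noted in comments:

```java
import weka.classifiers.trees.J48;

public class PruningDemo {
    public static void main(String[] args) {
        J48 tree = new J48();

        tree.setUnpruned(false);            // -U builds an unpruned tree
        tree.setReducedErrorPruning(false); // -R uses reduced-error pruning
        tree.setSubtreeRaising(true);       // -S switches subtree raising off
        tree.setConfidenceFactor(0.25f);    // -C <value>: lower = more pruning

        // tree.buildClassifier(data);      // then train as usual
    }
}
```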
Subtree replacement • Bottom-up: a tree is considered for replacement once all its subtrees have been considered
Subtree Raising • Deletes a node and redistributes its instances • Slower than subtree replacement
Naïve Bayesian Classifier • Outputs CPTs (conditional probability tables) and the same set of performance measures • By default, numeric attributes are modeled with a normal distribution • A kernel density estimator can improve performance when the normality assumption does not hold (-K option, sketched below)
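A sketch of enabling the kernel estimator through the API:

```java
import weka.classifiers.bayes.NaiveBayes;

public class NaiveBayesDemo {
    public static void main(String[] args) {
        NaiveBayes nb = new NaiveBayes();
        // default: numeric attributes modeled by a normal distribution;
        // -K switches to kernel density estimation instead
        nb.setUseKernelEstimator(true);
        // nb.buildClassifier(data);   // then train as usual
    }
}
```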
Data Sets to work on • Data sets were preprocessed into ARFF format • Three data sets from the UCI repository • Two data sets from computational biology • Protein Function Prediction • Surface Residue Prediction
Protein Function Prediction • Build a decision tree classifier that assigns protein sequences to functional families based on their characteristic motif compositions • Each attribute (motif) has a PROSITE accession number: PS#### • Class labels use PROSITE documentation IDs: PDOC#### • 73 attributes (binary) & 10 classes (PDOC) • Suggested method: use 10-fold CV and prune the tree with the subtree raising method (a sketch follows below)
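Putting the earlier pieces together for this task; a sketch under the assumption that the lab dataset is saved as proteins.arff (a placeholder name):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class ProteinDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
                new BufferedReader(new FileReader("proteins.arff")));
        data.setClassIndex(data.numAttributes() - 1);   // PDOC class label

        J48 tree = new J48();
        tree.setSubtreeRaising(true);    // suggested pruning method

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1)); // 10-fold CV
        System.out.println(eval.toSummaryString());
    }
}
```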
Surface Residue Prediction • Prediction is based on the identity of the target residue and its 4 sequence neighbors • Window size = 5 • Is the target residue on the surface or not? • 5 attributes and a binary class • Suggested method: use a Naïve Bayes classifier with no kernel estimator