Explore various classification algorithms in WEKA, including decision trees, rules, and naïve Bayes, along with evaluation methods such as test data set and cross-validation.
Classification
• Predicted target must be categorical/nominal
• Implemented methods
  • Decision trees (J48, etc.)
  • Rules (ZeroR, OneR, etc.)
  • Naïve Bayes
• Evaluation methods
  • Test data set
  • Cross-validation
Classification Algorithms
• ZeroR: Ignores all attributes and relies only on the target class; always predicts the majority class value.
• OneR: Builds one rule per attribute (based on the most frequent class for each value of that attribute), then keeps the rule/attribute with the smallest error.
• Naïve Bayes: A probabilistic classifier based on Bayes' theorem; assumes all attributes are conditionally independent given the class.
(A code sketch of building these classifiers follows below.)
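A minimal sketch (assumed, not part of the slides) of building these three classifiers through the WEKA Java API and printing each learned model; the file name breast_cancer.arff is taken from the later slides.

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SimpleClassifiersDemo {
    public static void main(String[] args) throws Exception {
        // Load the data and mark the last attribute as the (nominal) class.
        Instances data = DataSource.read("breast_cancer.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // ZeroR: majority class; OneR: best single-attribute rule; NaiveBayes: Bayes' theorem.
        Classifier[] models = { new ZeroR(), new OneR(), new NaiveBayes() };
        for (Classifier model : models) {
            model.buildClassifier(data);
            System.out.println("=== " + model.getClass().getSimpleName() + " ===");
            System.out.println(model);   // each WEKA classifier prints its learned model
        }
    }
}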
Evaluation Methods
• Test data set
  • Train on all the data and test on the same data (not recommended).
  • Split the data (e.g. 66% for training, 34% for testing).
  • Use separate files: one with training instances, one with testing instances.
• Cross-validation (see the sketch below)
  • Divide the data set into groups (e.g. 10 groups of instances).
  • Choose one group for testing and use the rest for training.
  • Repeat multiple times with a different group held out for testing each time (e.g. repeat 10 times, using each of the 10 groups for testing once and the rest for training).
  • Average the results of all the test runs.
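A minimal sketch (assumed, not from the slides) of running 10-fold cross-validation through the WEKA Java API; breast_cancer.arff and the J48 classifier are taken from the later slides, and the random seed is arbitrary.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast_cancer.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation: WEKA splits the data into 10 folds,
        // trains on 9 and tests on the held-out fold, repeating 10 times.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        // Results are averaged over all 10 test folds.
        System.out.println(eval.toSummaryString("=== 10-fold cross-validation ===", false));
    }
}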
WEKA Data Formats
• Data can be imported from a file in various formats (a loading sketch follows below):
  • ARFF (Attribute-Relation File Format) has two sections:
    • the Header section defines the relation name and the attributes (name and type);
    • the Data section lists the data records (instances).
  • CSV: Comma-Separated Values (text file).
  • C4.5: a format used by the C4.5 decision tree induction algorithm; requires two separate files:
    • a names file that defines the names of the attributes;
    • a data file that lists the records (samples).
  • Binary.
• Data can also be read from a URL or from an SQL database using JDBC (Java DataBase Connectivity, a Java API that defines how a client may access a database).
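A minimal sketch (assumed, not part of the slides) of loading data programmatically; DataSource picks the appropriate converter from the file extension, and the file names are placeholders.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDataDemo {
    public static void main(String[] args) throws Exception {
        // DataSource chooses the right converter (ARFF, CSV, ...) based on the file extension.
        Instances fromArff = DataSource.read("breast_cancer.arff");
        Instances fromCsv  = DataSource.read("breast_cancer.csv");

        // The class attribute is not set automatically; the last attribute is a common convention.
        fromArff.setClassIndex(fromArff.numAttributes() - 1);

        System.out.println("ARFF: " + fromArff.numInstances() + " instances, "
                + fromArff.numAttributes() + " attributes");
        System.out.println("CSV:  " + fromCsv.numInstances() + " instances");
    }
}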
Attribute-Relation File Format (ARFF)
ARFF files consist of two distinct sections:
• the Header section defines attribute names, types and relations; each line starts with a keyword:
  • @relation <data-name>
  • @attribute <attribute-name> <type> or {range}
• the Data section lists the data records and starts with:
  • @data
  • followed by a list of data instances
• Comments: any line starting with %
Breast Cancer Data in ARFF

% Breast Cancer data: 286 instances (no-recurrence-events: 201, recurrence-events: 85)
% Part 1: Header section - attribute names, types and relations
@relation breast-cancer
@attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
@attribute menopause {'lt40','ge40','premeno'}
@attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59'}
@attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30-32','33-35','36-39'}
@attribute node-caps {'yes','no'}
@attribute deg-malig {'1','2','3'}
@attribute breast {'left','right'}
@attribute breast-quad {'left_up','left_low','right_up','right_low','central'}
@attribute irradiat {'yes','no'}
@attribute Class {'no-recurrence-events','recurrence-events'}
% Part 2: Data section
@data
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
……
% source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer
% NOTE: the single quotes are optional here; ARFF only requires quoting for values that contain spaces.
Interpreting the Output: The Confusion Matrix
The confusion matrix shows how many instances of each actual class were assigned to each predicted class.

  a  b   <-- classified as
 56  8 |  a = no-recurrence-events
 23 10 |  b = recurrence-events

• 56 no-recurrence events (a) were correctly classified as (a)
• 8 no-recurrence events (a) were incorrectly classified as (b)
• 23 recurrence events (b) were incorrectly classified as (a)
• 10 recurrence events (b) were correctly classified as (b)
• Items on the main diagonal are correct classifications
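The overall accuracy reported on the results slide follows directly from this matrix: correct = 56 + 10 = 66, wrong = 8 + 23 = 31, total = 97, so accuracy = 66 / 97 ≈ 68% and the error rate ≈ 32%.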
Interpreting the Output
Text representation of a tree:

J48 pruned tree
------------------
node-caps = yes
|   deg-malig = 1: recurrence-events (1.01/0.4)
|   deg-malig = 2: no-recurrence-events (26.2/8.0)
|   deg-malig = 3: recurrence-events (30.4/7.4)
node-caps = no: no-recurrence-events (228.39/53.4)

Number of Leaves  : 4
Size of the tree  : 6
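The numbers at each leaf are the (weighted) number of training instances reaching that leaf and, after the slash, how many of them are misclassified. A minimal sketch (assumed, not from the slides) of producing this textual tree programmatically: printing a trained J48 yields this same pruned-tree representation.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrintTreeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast_cancer.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train a pruned C4.5 decision tree (J48) on the full data set.
        J48 tree = new J48();
        tree.buildClassifier(data);

        // toString() returns the "J48 pruned tree" text shown in the Explorer output.
        System.out.println(tree);
    }
}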
WEKA Explorer
• Click Explorer on the WEKA GUI Chooser.
• On the Explorer window, click "Open file..."
• Open a data file, e.g. the Breast Cancer data: breast_cancer.arff
• Or (if you don't have this data set) use one from the data folder provided with the WEKA package, e.g. iris.arff or weather.nominal.arff
WEKA Explorer: Open Data File
Open the Breast Cancer data. Click an attribute, e.g. age, and its distribution will be displayed as a histogram.
WEKA Explorer: Classifiers
• After loading a data file, click the Classify tab.
• Choose a classifier under Classifier:
  • Click the Choose button.
  • From the drop-down menu, open the trees folder.
  • Select J48, a decision tree algorithm.
• Choose a test option:
  • Select the Percentage split radio button.
  • Use the default split: 66% for training, 34% for testing.
• Click the Start button to train and test the classifier.
• The training and testing information will be displayed in the Classifier output window.
(A programmatic equivalent of this percentage-split run is sketched below.)
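A minimal sketch (assumed, not from the slides) of the same percentage-split evaluation in Java; the randomization seed is an assumption, so the exact numbers may differ from the Explorer run.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplitDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast_cancer.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Shuffle, then split 66% / 34% like the Explorer's "Percentage split" option.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Train J48 on the training portion only.
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Evaluate on the held-out 34%.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}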
WEKA Explorer: Results
• 97 cases used in the test set
• Correctly classified: 66 (68%)
• Incorrectly classified: 31 (32%)
Result and Model Options
Point to an entry in the Result list window and right-click (or Option-click on a Mac). A menu displays the options available for that model.
View Classifier Errors
• Correctly predicted cases
• Incorrectly predicted cases
Save the Model and Results
Right-click (or Option-click) on a result entry. Choose Save model to save the trained classifier and Save result buffer to save the output text. (A programmatic equivalent is sketched below.)
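A minimal sketch (assumed, not from the slides) of saving and reloading a trained classifier from Java code; the file name j48.model is a placeholder, and the Explorer's Load model option reads the same serialized format.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class SaveModelDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast_cancer.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);

        // Serialize the trained model to disk.
        SerializationHelper.write("j48.model", tree);

        // Reload it later and use it to classify the first instance.
        J48 restored = (J48) SerializationHelper.read("j48.model");
        double label = restored.classifyInstance(data.instance(0));
        System.out.println("Predicted: " + data.classAttribute().value((int) label));
    }
}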
Summary
WEKA is open-source data mining software that offers:
• GUI interfaces: Explorer, Experimenter, Knowledge Flow
• Functions and tools:
  • Methods for classification: decision trees, rule learners, naïve Bayes, etc.
  • Methods for regression/prediction: linear regression, model tree generators, etc.
  • Methods for clustering
  • Methods for feature selection
  • And more...