150 likes | 169 Views
Applied Data Mining Basic Decision Trees in R. Myles Gartland Rockhurst University. To GUI or Not to GUI. CLI IDE GUI. Classification vs. Prediction. Classification predicts categorical class labels (discrete or nominal)
E N D
Applied Data MiningBasic Decision Trees in R Myles Gartland Rockhurst University Data Mining: Concepts and Techniques
To GUI or Not to GUI • CLI • IDE • GUI Data Mining: Concepts and Techniques
Classification vs. Prediction • Classification • predicts categorical class labels (discrete or nominal) • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Prediction • models continuous-valued functions, i.e., predicts unknown or missing values • Typical applications • Credit approval • Target marketing • Medical diagnosis • Fraud detection Data Mining: Concepts and Techniques
Classification—A Two-Step Process • Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction is training set • The model is represented as classification rules, decision trees, or mathematical formulae • Model usage: for classifying future or unknown objects • Estimate accuracy of the model • The known label of test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • Test set is independent of training set, otherwise over-fitting will occur • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known Data Mining: Concepts and Techniques
Supervised vs. Unsupervised Learning • Supervised learning (classification) • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set • Unsupervised learning(clustering) • The class labels of training data is unknown • Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data Data Mining: Concepts and Techniques
Training Data Classifier (Model) Process (1): Model Construction Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ Data Mining: Concepts and Techniques
Classifier Testing Data Unseen Data Process (2): Using the Model in Prediction (Jeff, Professor, 4) Tenured? Data Mining: Concepts and Techniques
Decision Tree with rpart • Recursive partitioning and regression trees. • Non-trademarked version of CART • Rules based on variables' values are selected to get the best split to differentiate observations based on the dependent variable • Once a rule is selected and splits a node into two, the same process is applied to each "child" node (i.e. it is a recursive procedure) Data Mining: Concepts and Techniques
Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met. (Alternatively, the data are split as much as possible and then the tree is later pruned.) • Each branch of the tree ends in a terminal node. Each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules. Data Mining: Concepts and Techniques
CP complexity parameter. • Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with anova splitting, this means that the overall R-squared must increase by cp at each step. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially ,the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it. Data Mining: Concepts and Techniques
What is a Decision Tree Data Mining: Concepts and Techniques
The process • Start with the business question: “what are the characteristics of people who are likely to purchase a personal savings account” • Build the model • Create training and validation samples • Build a functional formula: SA ~ EV • Build the tree rpart(formula, data, control) • Test for accuracy • Build back into business rules Data Mining: Concepts and Techniques
Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left Data Mining: Concepts and Techniques
Random Forest • Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees. Data Mining: Concepts and Techniques
Random Forest Limits • No missing data • Predictors cannot have more than 32 levels • rf.model <- randomForest(DV ~ ., data=, importance=TRUE, proximity=TRUE) Data Mining: Concepts and Techniques