Learn about two essential classification methods, decision trees & Naïve Bayes, for predictive analytics. Decision trees involve tree structures and predictive modeling, while Naïve Bayes relies on Bayes’ theorem to predict outcomes based on independent features.
Chap 4: Classification
Chapter Sections • Decision Trees • Naïve Bayes • Diagnostics of Classifiers • Additional Classification Models • Summary
Classification • Classification is widely used for prediction • Most classification methods are supervised • This chapter focuses on two fundamental classification methods • Decision trees • Naïve Bayes
Decision Trees • Tree structure specifies a sequence of decisions • Given input X={x1, x2,…, xn}, predict output Y • Input attributes/features can be categorical or continuous • Root node and internal nodes each test a particular input variable; leaf nodes return class labels • Depth of a node = minimum number of steps required to reach it from the root • Branch (connects two nodes) = specifies a decision • Two varieties of decision trees • Classification trees: categorical output, often binary • Regression trees: numeric output
Decision Trees: Overview of a Decision Tree • Example of a decision tree • Predicts whether customers will buy a product
Decision Trees: Overview of a Decision Tree • Example: will a bank client subscribe to a term deposit?
Decision Trees: The General Algorithm • Construct a tree T from training set S • Requires a measure of attribute information • Simplistic method (data from the previous figure) • Purity = probability of the corresponding class • E.g., P(no)=1789/2000=89.45%, P(yes)=10.55% • Entropy methods • Entropy measures the impurity of an attribute • Information gain measures the reduction in entropy achieved by splitting on an attribute
Decision Trees: The General Algorithm • Entropy methods of attribute information • H_X = the entropy of X • Information gain of an attribute = base entropy minus conditional entropy
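For reference, the standard definitions behind these quantities (a generic formulation; the chapter's exact notation may differ slightly):

H_X = -\sum_{x} P(X = x)\,\log_2 P(X = x)

H_{Y|X} = \sum_{x} P(X = x)\left(-\sum_{y} P(Y = y \mid X = x)\,\log_2 P(Y = y \mid X = x)\right)

\mathrm{InfoGain}(X) = H_Y - H_{Y|X}

so the information gain of an attribute is the base entropy of the output minus the conditional entropy after splitting on that attribute.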
Decision Trees: The General Algorithm • Construct a tree T from training set S • Choose root node = the most informative attribute A • Partition S according to A's values • Construct subtrees T1, T2, … for the subsets of S recursively until one of the following occurs • All leaf nodes satisfy the minimum purity threshold • The tree cannot be split further while meeting the minimum purity threshold • Another stopping criterion is satisfied, e.g., maximum depth
Decision Trees: Decision Tree Algorithms • ID3 Algorithm (T = training set, P = output variable, A = attribute): at each node, choose the attribute A with the highest information gain with respect to P, partition T on A's values, and recurse on each subset
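A minimal R sketch of the splitting measure ID3 relies on; the helper names (entropy, info_gain) and the example attribute columns are illustrative assumptions, not code from the book:

entropy <- function(y) {
  p <- table(y) / length(y)        # class proportions
  p <- p[p > 0]                    # drop empty levels so log2(0) never appears
  -sum(p * log2(p))                # Shannon entropy in bits
}

info_gain <- function(x, y) {
  base <- entropy(y)               # base entropy of the output variable
  cond <- sum(sapply(split(y, x),  # conditional entropy, weighted by subset size
                     function(ys) length(ys) / length(y) * entropy(ys)))
  base - cond                      # information gain = base minus conditional entropy
}

# ID3 splits each node on the attribute with the largest information gain, e.g.:
# sapply(banktrain[, c("job", "marital", "education")], info_gain, y = banktrain$subscribed)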
Decision Trees: Decision Tree Algorithms • C4.5 Algorithm • Handles missing data • Handles both categorical and continuous variables • Uses bottom-up pruning to address overfitting • CART (Classification And Regression Trees) • Also handles continuous variables • Uses the Gini diversity index as the information measure
Decision Trees: Evaluating a Decision Tree • Decision trees are greedy algorithms • Best option at each step, but possibly not the best overall • Addressed by ensemble methods such as random forest • The model might overfit the data (figure legend: blue = training set, red = test set) • To overcome overfitting: stop growing the tree early, or grow the full tree and then prune it
Decision Trees: Evaluating a Decision Tree • Decision trees -> rectangular decision regions
Decision Trees: Evaluating a Decision Tree • Advantages of decision trees • Computationally inexpensive • Outputs are easy to interpret – sequence of tests • Show importance of each input variable • Decision trees handle • Both numerical and categorical attributes • Categorical attributes with many distinct values • Variables with nonlinear effect on outcome • Variable interactions
Decision Trees: Evaluating a Decision Tree • Disadvantages of decision trees • Sensitive to small variations in the training data • Overfitting can occur because each split reduces training data for subsequent splits • Poor if dataset contains many irrelevant variables
Decision Trees: Decision Trees in R
# install packages rpart, rpart.plot
# put this code into an RStudio source file and execute lines via Ctrl/Enter
library("rpart")
library("rpart.plot")
setwd("c:/data/rstudiofiles/")
banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
## drop a few columns to simplify the tree
drops <- c("age", "balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]
summary(banktrain)
# Make a simple decision tree by only keeping the categorical variables
fit <- rpart(subscribed ~ job + marital + education + default + housing + loan + contact + poutcome,
             method="class",
             data=banktrain,
             control=rpart.control(minsplit=1),
             parms=list(split='information'))
summary(fit)
# Plot the tree
rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=3)
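Once fit has been built, rpart's predict() method returns class labels; a short follow-up sketch (assuming the same banktrain data frame is used for scoring):

pred <- predict(fit, newdata = banktrain, type = "class")   # predicted class labels
table(pred, banktrain$subscribed)                           # quick confusion matrix on the training data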
Naïve Bayes • The naïve Bayes classifier • Based on Bayes' theorem (or Bayes' law) • Assumes the features contribute independently • Features (variables) are generally categorical • Continuous variables can be discretized, i.e., converted into categorical ones • Output is usually a class label plus a probability score • Log probabilities are often used instead of raw probabilities
Naïve Bayes: Bayes' Theorem • Bayes' theorem: P(C|A) = P(A|C) P(C) / P(A), where C = class, A = observed attributes • Typical medical (diagnostic test) example • Used because doctors frequently get this reasoning wrong
Naïve Bayes Classifier • Conditional independence assumption: P(A|c_j) = P(a_1|c_j) · P(a_2|c_j) · … · P(a_n|c_j) • Dropping the common denominator P(A), the score for each class is P(c_j|A) ∝ P(c_j) · P(a_1|c_j) · … · P(a_n|c_j) • Find the c_j that maximizes P(c_j|A)
Naïve Bayes Classifier • Example: client subscribes to term deposit? • The following record is from a bank client. Is this client likely to subscribe to the term deposit?
Naïve Bayes Classifier • Compute probabilities for this record
Naïve Bayes Classifier • Compute Naïve Bayes classifier outputs: yes/no • The client is assigned the label subscribed = yes • The scores are small, but the ratio is what counts • Using logarithms helps avoid numerical underflow
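The arithmetic behind those scores can be sketched in R as below; the conditional probabilities here are placeholders standing in for the values read off the training-data tables, and only the priors (10.55% yes, 89.45% no) come from the text:

# Placeholder conditional probabilities P(a_i | class) for the client's attribute values;
# the real numbers come from the frequency tables built on the 2000-record training set.
p_a_given_yes <- c(0.75, 0.85, 0.70, 0.90)   # hypothetical values
p_a_given_no  <- c(0.10, 0.30, 0.20, 0.40)   # hypothetical values
p_yes <- 0.1055                              # prior P(yes) from the training set
p_no  <- 0.8945                              # prior P(no)

score_yes <- p_yes * prod(p_a_given_yes)     # proportional to P(yes | A)
score_no  <- p_no  * prod(p_a_given_no)      # proportional to P(no | A)

# The log form sums instead of multiplies, avoiding numerical underflow
log_yes <- log(p_yes) + sum(log(p_a_given_yes))
log_no  <- log(p_no)  + sum(log(p_a_given_no))

ifelse(score_yes > score_no, "yes", "no")    # assign the label with the larger score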
Smoothing • A smoothing technique assigns a small nonzero probability to rare events that are missing from the training data • E.g., Laplace smoothing counts every outcome as occurring one more time than it actually does in the dataset • Smoothing is essential: without it, a single zero conditional probability forces P(c_j|A)=0
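A standard add-one (Laplace) formulation of the smoothed conditional probability, with notation assumed here rather than taken from the book: count(a_i, c_j) is the number of training records of class c_j that have attribute value a_i, and m_i is the number of distinct values of attribute i.

P(a_i \mid c_j) = \frac{\mathrm{count}(a_i, c_j) + 1}{\mathrm{count}(c_j) + m_i}

A value never observed with class c_j thus still receives the small nonzero probability 1 / (count(c_j) + m_i).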
Diagnostics • Naïve Bayes advantages • Handles missing values • Robust to irrelevant variables • Simple to implement • Computationally efficient • Handles high-dimensional data efficiently • Often competitive with other learning algorithms • Reasonably resistant to overfitting • Naïve Bayes disadvantages • Assumes variables are conditionally independent • Therefore, sensitive to double counting correlated variables • In its simplest form, used only for categorical variables
Naïve Bayes in R • This section explores two methods of using the naïve Bayes Classifier • Manually compute probabilities from scratch • Tedious with many R calculations • Use naïve Bayes function from e1071 package • Much easier – starts on page 222 • Example: subscribing to term deposit
Naïve Bayes in R • Get the data and the e1071 package
> setwd("c:/data/rstudio/chapter07")
> sample <- read.table("sample1.csv", header=TRUE, sep=",")
> traindata <- as.data.frame(sample[1:14,])
> testdata <- as.data.frame(sample[15,])
> traindata    # lists train data
> testdata     # lists test data, no Enrolls variable
> install.packages("e1071", dep = TRUE)
> library(e1071)    # contains the naive Bayes function
Naïve Bayes in R • Perform modeling
> model <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire, traindata)
> model      # generates model output
> results <- predict(model, testdata)
> results    # provides the test prediction
Using a Laplace parameter gives the same result
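e1071's naiveBayes() exposes a laplace argument for the smoothing discussed earlier; a minimal sketch on the same data (as noted above, the predicted label does not change in this example):

> model_laplace <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire, traindata, laplace = 1)
> predict(model_laplace, testdata)   # same predicted label as the unsmoothed model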
Diagnostics of Classifiers • The book covered three classifiers • Logistic regression, decision trees, naïve Bayes • Tools to evaluate classifier performance • Confusion matrix
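For reference, the standard layout of a 2x2 confusion matrix for a binary classifier (positive class = "yes"):

                Predicted yes          Predicted no
Actual yes      True Positive (TP)     False Negative (FN)
Actual no       False Positive (FP)    True Negative (TN)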
Diagnostics of Classifiers • Bank marketing example • Training set of 2000 records • Test set of 100 records, evaluated below
Diagnostics of Classifiers • Evaluation metrics
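The usual metrics computed from the confusion matrix counts (standard definitions, listed here for reference):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
TPR (recall) = TP / (TP + FN)
FPR = FP / (FP + TN)
FNR = FN / (TP + FN)
Precision = TP / (TP + FP)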
Diagnostics of Classifiers • Evaluation metrics on the bank marketing 100-record test set (two of the reported metrics were flagged as poor)
Diagnostics of Classifiers • ROC curve: good for evaluating binary detection
Bank marketing: 2000-record training set + 100-record test set
> library(e1071)   # naiveBayes()
> library(ROCR)    # prediction() and performance()
> banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
> drops <- c("balance", "day", "campaign", "pdays", "previous", "month")
> banktrain <- banktrain[, !(names(banktrain) %in% drops)]
> banktest <- read.table("bank-sample-test.csv", header=TRUE, sep=",")
> banktest <- banktest[, !(names(banktest) %in% drops)]
> nb_model <- naiveBayes(subscribed ~ ., data=banktrain)
> nb_prediction <- predict(nb_model, banktest[, -ncol(banktest)], type='raw')
> score <- nb_prediction[, c("yes")]
> actual_class <- banktest$subscribed == 'yes'
> pred <- prediction(score, actual_class)   # ROCR prediction object
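With the ROCR prediction object in hand, the curve and its area can be obtained through ROCR's performance() function, roughly as follows:

> perf <- performance(pred, "tpr", "fpr")   # true positive rate vs. false positive rate
> plot(perf, lwd = 2)                       # draw the ROC curve
> auc <- performance(pred, "auc")
> auc@y.values[[1]]                         # area under the curve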
Diagnostics of Classifiers • ROC curve: good for evaluating binary detection • Bank marketing: 2000 training set + 100 test set
Additional Classification Methods • Ensemble methods that use multiple models • Bagging: bootstrap method that uses repeated sampling with replacement • Boosting: similar to bagging but an iterative procedure that focuses on previously misclassified records • Random forest: uses an ensemble of decision trees • These models usually perform better than a single decision tree • Support Vector Machine (SVM) • Fits a decision boundary defined by a small number of support vectors
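As a small illustration of the random forest idea, a sketch using the randomForest package on the bank data from earlier (the package choice and settings are assumptions, not the chapter's code):

library(randomForest)
banktrain$subscribed <- as.factor(banktrain$subscribed)    # classification needs a factor response
rf_model <- randomForest(subscribed ~ ., data = banktrain,
                         ntree = 100, importance = TRUE)   # ensemble of 100 trees (arbitrary choice)
rf_model                 # prints the out-of-bag (OOB) error estimate
importance(rf_model)     # per-variable importance measures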
Summary • How to choose a suitable classifier among • Decision trees, naïve Bayes, & logistic regression