Decision Trees in R. Connecticut R Users Group, Illya Mowerman, March 26th, 2013.
Summary
Decision trees have many uses: exploratory data analysis, variable selection, modeling and more. In today’s discussion we will cover:
• What are decision trees. Decision trees have many uses, are extremely versatile, easy to interpret, and require little data preparation.
• Decision tree packages in R. rpart (package used today), C50, Cubist.
• Enhancing tree outputs. One of the attractive features of trees is that they are easy to interpret. However, in the rpart package the output could use a little enhancing.
What are Trees • Some packages in R • Enhancing Tree Outputs • References
A decision tree is an algorithm that can have a continuous or categorical dependent variable (DV) and continuous or categorical independent variables (IVs).
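As a minimal sketch of the two cases, the snippet below fits one tree with a categorical DV and one with a continuous DV using rpart. The data sets (iris, mtcars) and the chosen predictors are illustrative assumptions, not examples from the talk.

```r
library(rpart)

# Categorical DV (Species) with continuous IVs: a classification tree
class_tree <- rpart(Species ~ ., data = iris, method = "class")

# Continuous DV (mpg) with two continuous IVs and one categorical IV:
# a regression tree. Recoding `am` as a factor is an illustrative choice.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
reg_tree <- rpart(mpg ~ wt + hp + am, data = cars, method = "anova")

print(class_tree)
print(reg_tree)
```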
There are many advantages to using trees¹.
• Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
• Requires little data preparation. Other techniques often require data normalisation, the creation of dummy variables and the removal of blank values.
• Able to handle both numerical and categorical data.
• Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic.
• Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
• Performs well with large data in a short time.
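The "little data preparation" point can be illustrated with a small sketch: rpart accepts a factor predictor without dummy coding and keeps rows that have missing predictor values. The data set, the blanked-out values and the `Size` variable are assumptions made up for this illustration.

```r
library(rpart)

set.seed(1)
df <- iris

# Blank out some predictor values; no imputation or row removal is done
df$Sepal.Length[sample(nrow(df), 15)] <- NA

# Add a categorical IV as a plain factor; no dummy variables are created
df$Size <- cut(iris$Petal.Length,
               breaks = c(0, 2.5, 5, Inf),
               labels = c("small", "medium", "large"))

fit <- rpart(Species ~ ., data = df, method = "class")

# The root node still holds every row, including those with a missing predictor
fit$frame$n[1]
```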
Some things to consider when coding the model:
• Splits (split). Gini or information.
• Type of DV (method). Classification (class), regression (anova), count (poisson), survival (exp).
• Minimum number of observations for a split (minsplit).
• Minimum number of observations in a node (minbucket).
• Cross-validation (xval). Used more in model building than in exploration.
• Complexity parameter (cp). This value is used for pruning. A smaller tree is perhaps less detailed, but it can have less error on new data.
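The settings above map onto arguments of rpart() and rpart.control(). The following is a sketch only: the kyphosis data set (shipped with rpart) and the specific values are assumptions chosen for illustration, and picking the cp with the lowest cross-validated error is one common pruning heuristic rather than the method used in the talk.

```r
library(rpart)

set.seed(42)  # cross-validation assigns folds at random

fit <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,
  method  = "class",                      # type of DV: classification
  parms   = list(split = "information"),  # or "gini", the default
  control = rpart.control(
    minsplit  = 20,    # minimum observations in a node to attempt a split
    minbucket = 7,     # minimum observations in any terminal node
    xval      = 10,    # number of cross-validations
    cp        = 0.01   # splits improving the fit by less than cp are dropped
  )
)

printcp(fit)  # cp table, including cross-validated error (xerror)

# Prune back to the cp value with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

The pruned tree is never larger than the original fit; comparing print(fit) with print(pruned) shows which splits were removed.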
R has many packages for similar or identical endeavors.
• rpart. Comes with R.
• C50.
• Cubist.
• rpart.plot. Makes rpart plots much nicer.
An alternative to the rpart plots is the prp function in the rpart.plot package.
• extra. Values 1 to 9 display extra information in the nodes.
• box.col. Define the colors of the node boxes.
• xflip. Flip the tree horizontally.
• nn. Add node numbers for easier interpretation.
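A minimal sketch of these prp options follows. It assumes rpart.plot is installed; the kyphosis data set and the colors are illustrative choices, not taken from the talk.

```r
library(rpart)
library(rpart.plot)  # assumed installed: install.packages("rpart.plot")

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# One color per node, indexed by the predicted class stored in fit$frame$yval
# (1 = absent, 2 = present for this two-level factor)
node_cols <- c("palegreen3", "pink")[fit$frame$yval]

prp(fit,
    extra   = 1,          # show the number of observations per class in each node
    box.col = node_cols,  # fill color of the node boxes
    xflip   = FALSE,      # set to TRUE to flip the tree horizontally
    nn      = TRUE)       # display node numbers
```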
References
• http://en.wikipedia.org/wiki/Decision_tree_learning
• http://www.stanford.edu/class/stats315b/minitech.pdf
• http://www.milbo.org/rpart-plot/prp.pdf