Dive into supervised learning techniques like multiple regression, penalized regression, tree-based models, and more. Explore cross-validation methods and regression model building strategies.
A Core Curriculum for Undergraduate Data Science. Chris Malone, Tisha Hooks, Todd Iverson, Brant Deppa, Silas Bergen, April Kerby. Winona State University
Supervised Learning (DSCI 425). Brant Deppa, Ph.D., Professor of Statistics & Data Science, Winona State University, bdeppa@winona.edu
Course Topics
• Introduction to Supervised Learning
Numeric Response
• Review of multiple regression (predictors vs. terms)
• Cross-validation for numeric response (discussed throughout)
• Automatic term selectors (ACE/AVAS/MARS)
• Projection pursuit regression (PPR) (neural networks)
• Penalized/regularized regression (ridge, LASSO, ElasticNet)
• Dimension reduction (PCR and PLS regression)
• Tree-based models (CART, bagging, random forests, boosting, treed regression)
• Nearest neighbor regression
Course Topics (cont’d)
Categorical/Nominal/Ordinal Response
• Introduction to classification problems
• Nearest neighbor classification
• Naïve Bayes classification
• Tree-based models for classification
• Discriminant analysis
• Neural networks
• Support vector machines (SVM)
• Multiple logistic regression
• Blending/stacking models (in progress)
Supervised Learning – General Problem. Cross-validation (CV) methods are generally used for this purpose.
K-Fold Cross-Validation. The data are divided into k roughly equal-sized subgroups (folds). Each subgroup acts as a validation set in turn, with the model fit to the remaining subgroups. The average error across all validation sets is computed and used to choose between rival models.
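The k-fold procedure can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, the candidate models are ordinary linear models, and the helper name `kfold_rmse` is hypothetical (it is not from the course materials).

```r
# Minimal k-fold CV sketch on simulated data (base R only).
# The data set and the helper kfold_rmse() are illustrative assumptions.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 + 3 * dat$x1 - dat$x2 + rnorm(n)

kfold_rmse <- function(formula, data, k = 10) {
  # Randomly assign each row to one of k roughly equal-sized folds
  fold <- sample(rep(1:k, length.out = nrow(data)))
  resp <- all.vars(formula)[1]          # name of the response variable
  rmse <- numeric(k)
  for (i in 1:k) {
    fit  <- lm(formula, data = data[fold != i, ])        # fit on the other folds
    pred <- predict(fit, newdata = data[fold == i, ])    # predict the held-out fold
    rmse[i] <- sqrt(mean((data[fold == i, resp] - pred)^2))
  }
  mean(rmse)   # average validation error across the k folds
}

# Compare two rival models by their cross-validated RMSE
cv_full  <- kfold_rmse(y ~ x1 + x2, dat)
cv_small <- kfold_rmse(y ~ x2, dat)
c(full = cv_full, small = cv_small)
```

The model with the smaller average validation error would be preferred; here the model that omits `x1` should do noticeably worse because `x1` drives most of the simulated response.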
Bootstrap Cross-Validation. A bootstrap sample $(x_1^*, y_1^*), \ldots, (x_n^*, y_n^*)$ is drawn, where each $(x_i^*, y_i^*)$ is a randomly selected observation from the original data drawn with replacement; here the $x_i$ are the p-dimensional predictor vectors. Observations not selected (i.e. out-of-bootstrap, OOB) constitute the validation set. We can then calculate quality-of-prediction metrics for each iteration. This process is then repeated a large number of times (B = 500, 1000, 5000, etc.).
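A minimal sketch of bootstrap cross-validation in base R is given below. The simulated data, the simple linear model, and the choice of RMSE as the prediction metric are assumptions made for illustration only.

```r
# Bootstrap CV sketch: fit on a bootstrap sample, validate on the OOB rows.
# Simulated data and the lm() model are illustrative assumptions.
set.seed(2)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 0.5 * dat$x + rnorm(n)

B <- 500
oob_rmse <- numeric(B)
for (b in 1:B) {
  idx  <- sample(n, replace = TRUE)     # bootstrap sample: n draws with replacement
  oob  <- setdiff(1:n, idx)             # rows never drawn form the validation set
  fit  <- lm(y ~ x, data = dat[idx, ])
  pred <- predict(fit, newdata = dat[oob, ])
  oob_rmse[b] <- sqrt(mean((dat$y[oob] - pred)^2))
}

mean(oob_rmse)   # aggregate the metric over the B iterations
```

On average roughly 37% of the observations are left out of each bootstrap sample, so every iteration has a validation set of reasonable size.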
Monte Carlo Cross-Validation (MCCV). As each of the cross-validation strategies has an element of randomness to it, we can expect that the results will vary from one CV run to the next. With MCCV we conduct split-sample CV multiple times and aggregate the results from each split to quantify predictive performance for a candidate model.
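A sketch of MCCV in base R, assuming an 80/20 training/validation split repeated M = 200 times; the split fraction, M, the simulated data, and the helper name `mccv` are illustrative assumptions rather than course settings.

```r
# MCCV sketch: repeat a random train/validation split M times and aggregate.
set.seed(3)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 0.5 * dat$x + rnorm(n)

mccv <- function(formula, data, p_train = 0.8, M = 200) {
  resp <- all.vars(formula)[1]
  n <- nrow(data)
  replicate(M, {
    train <- sample(n, size = floor(p_train * n))       # random training rows
    fit   <- lm(formula, data = data[train, ])
    pred  <- predict(fit, newdata = data[-train, ])     # predict validation rows
    sqrt(mean((data[-train, resp] - pred)^2))           # RMSE for this split
  })
}

rmse <- mccv(y ~ x, dat)
c(mean = mean(rmse), sd = sd(rmse))   # average performance and split-to-split variability
```

Reporting the standard deviation alongside the mean shows how much a single split-sample result could have varied.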
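Multiple Regression. Suppose we have a numeric response $Y$ and a set of predictors $X_1, \ldots, X_p$. The multiple regression model is given by $E(Y \mid X_1, \ldots, X_p) = \beta_0 + \beta_1 u_1 + \cdots + \beta_k u_k$, where $u_j$ is the $j$-th term. Terms are functions of predictors.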
Types of Terms
• Predictor terms: terms can be the predictors themselves, as long as the predictor is meaningfully numeric (i.e. a count or a measurement).
• Polynomial terms: integer powers of numeric predictors, e.g. $X^2$, $X^3$.
• Transformation terms: power transformations of a predictor from the Tukey family, e.g. $\log(X)$, $\sqrt{X}$.
• Dummy terms: 0/1 indicator terms.
• “Hockey stick” function terms: $(X - c)_+ = \max(0, X - c)$ or $(c - X)_+ = \max(0, c - X)$. These types of terms are used in fitting Multivariate Adaptive Regression Spline (MARS) models.
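To make the hockey stick terms concrete, the sketch below defines the two hinge functions in base R and uses them as terms in `lm()`. The helper names `hinge_right` and `hinge_left`, the knot at 4, and the simulated data are assumptions for illustration; MARS chooses knots automatically, whereas here the knot is fixed by hand.

```r
# Hockey stick (hinge) terms: (x - knot)_+ and (knot - x)_+
hinge_right <- function(x, knot) pmax(0, x - knot)
hinge_left  <- function(x, knot) pmax(0, knot - x)

# Simulated response that is flat up to x = 4 and then rises linearly
set.seed(4)
x <- runif(300, 0, 10)
y <- 2 + 1.5 * hinge_right(x, 4) + rnorm(300, sd = 0.3)

# Use both hinge functions as terms in an ordinary linear model
fit <- lm(y ~ hinge_left(x, 4) + hinge_right(x, 4))
coef(fit)
```

The coefficient on the right hinge should be close to the simulated slope of 1.5, and the coefficient on the left hinge close to zero.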
Types of Terms
• Factor terms: suppose the predictor is a nominal/ordinal variable with $m$ levels. Then we choose one of the levels as the reference group and create dummy terms for the remaining $m - 1$ levels.
• Interaction terms: products of two or more terms.
• Linear combination terms: PCR/PLS use these terms as basic building blocks.
• Spline basis terms
• Nonparametric terms: terms of the form $f_j(X_j)$, where the $f_j$ are estimated by smoothing an appropriate scatterplot.
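Several of these term types can be written directly in an R model formula. The sketch below is an illustration on simulated data; the variable names `carat`, `cut`, and `logPrice` echo the diamonds example used later in the deck, but the data frame here is made up and is not the course data set.

```r
# Predictor, transformation, polynomial, factor, and interaction terms in lm()
set.seed(5)
n <- 300
dat <- data.frame(
  carat = runif(n, 0.3, 2),
  cut   = factor(sample(c("Fair", "Good", "Ideal"), n, replace = TRUE))
)
dat$logPrice <- 7 + 1.8 * log(dat$carat) + 0.2 * (dat$cut == "Ideal") +
  rnorm(n, sd = 0.1)

fit <- lm(logPrice ~ log(carat) +      # transformation term
            I(carat^2) +               # polynomial term
            cut +                      # factor term (dummy terms, "Fair" is the reference)
            log(carat):cut,            # interaction terms
          data = dat)

# The model matrix shows the terms actually used in the fit
X <- model.matrix(fit)
colnames(X)
```

The three-level factor contributes two dummy terms because the first level is absorbed as the reference group.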
Types of Terms. Trigonometric terms: sine and cosine functions of a predictor at chosen periodicities. The periodicities used can be based on trial and error, knowledge of the physical phenomenon being studied (sunspots in this case), or tools like spectral analysis that identify important periodicities. We will not delve into harmonic analysis in this course; this is only presented to illustrate that by using appropriate terms we can develop models capable of fitting complex relationships between the response and the predictors. Rob Hyndman has a great online forecasting text with a supporting R library (fpp2).
Multivariate Adaptive Regression Splines (MARS). Hockey Stick Functions. Interactions. If we have factor terms, those are handled in the usual way: dummy variables for all but one level of the factor. These dummy terms can be involved in interactions as well.
Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991). The package earth contains functions for building and plotting MARS models.
mod = earth(y ~ ., data = yourdata, degree = 1)
Activity: Build a MARS model for the diamond price data.
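Tree-Based Models: CART (Breiman et al., 1984). For regression trees, assuming we are minimizing the RSS, the fitted value is the mean response in each terminal node region $R_m$: $\hat{c}_m = \text{ave}(y_i \mid x_i \in R_m)$.
Tree-based Models: CART. The task of determining neighborhoods is solved by determining a split coordinate or variate $j$, i.e. which variable to split on, and a split point $s$. A split coordinate and split point define the rectangles $R_1(j, s) = \{x : x_j \le s\}$ and $R_2(j, s) = \{x : x_j > s\}$. The residual sum of squares (RSS) for a split determined by $(j, s)$ is $RSS(j, s) = \sum_{x_i \in R_1(j,s)} (y_i - \hat{c}_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \hat{c}_2)^2$, where $\hat{c}_1$ and $\hat{c}_2$ are the mean responses in the two rectangles. The goal at any given stage is to find the pair $(j, s)$ such that $RSS(j, s)$ is minimal, i.e. the overall RSS is maximally reduced.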
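The search for the best split can be sketched as an exhaustive loop over variables and candidate split points. This is an illustrative base R implementation of a single split search, not the rpart algorithm itself; the function name `best_split`, the use of midpoints between sorted values as candidate split points, and the simulated data are assumptions.

```r
# Exhaustive search for the single split (variable j, point s) minimizing RSS
best_split <- function(X, y) {
  best <- list(rss = Inf)
  for (j in seq_len(ncol(X))) {
    vals <- sort(unique(X[, j]))
    if (length(vals) < 2) next
    cuts <- (head(vals, -1) + tail(vals, -1)) / 2   # midpoints between sorted values
    for (s in cuts) {
      left <- X[, j] <= s
      # RSS when each side is predicted by its own mean response
      rss <- sum((y[left] - mean(y[left]))^2) +
        sum((y[!left] - mean(y[!left]))^2)
      if (rss < best$rss) {
        best <- list(var = colnames(X)[j], split = s, rss = rss)
      }
    }
  }
  best
}

# Simulated data with a jump in the response at x2 = 0.6
set.seed(6)
X <- cbind(x1 = runif(200), x2 = runif(200))
y <- ifelse(X[, "x2"] > 0.6, 5, 1) + rnorm(200, sd = 0.2)

bs <- best_split(X, y)
bs$var     # variable chosen for the split
bs$split   # chosen split point
```

A full regression tree applies the same search recursively within each resulting rectangle until a stopping rule is met.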
Ensemble Tree-based Models. Ensemble models combine multiple models by averaging their predictions. For trees, the most common ensembling methods are: • Bagging (Breiman, 1996) • Random or Bootstrap Forests (Breiman, 2001) • Boosting (Friedman, 1999)
Tree-based Models: Bagging. Suppose we are interested in predicting a numeric response variable $y$, and let $\hat{f}(x)$ denote the prediction at $x$ from a model fit to the training sample. For example, $\hat{f}(x)$ might come from an MLR model or from an RPART model. Let $\bar{f}(x) = E[\hat{f}(x)]$, where the expectation is with respect to the distribution underlying the training sample (since, viewed as a random variable, $\hat{f}(x)$ is a function of the training sample, which can be viewed as a high-dimensional random variable) and not $x$ (which is considered fixed).
Bagging. If we could base our prediction on $\bar{f}(x) = E[\hat{f}(x)]$ instead of $\hat{f}(x)$, we would shrink the MSE(prediction) and improve the predictive performance of our model. How can we approximate $\bar{f}(x)$?
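One way to see why the averaged prediction helps is a standard variance decomposition. The notation here is an assumption, since the slide's own symbols did not survive: $\hat{f}(x)$ is the prediction from a model fit to one training sample, $\bar{f}(x) = E[\hat{f}(x)]$ is its expectation over training samples, and $(x, y)$ is held fixed.

```latex
\begin{aligned}
E\big[(y - \hat{f}(x))^2\big]
  &= (y - \bar{f}(x))^2
     + 2\,(y - \bar{f}(x))\,E\big[\bar{f}(x) - \hat{f}(x)\big]
     + E\big[(\hat{f}(x) - \bar{f}(x))^2\big] \\
  &= (y - \bar{f}(x))^2 + \operatorname{Var}\big(\hat{f}(x)\big) \\
  &\ge (y - \bar{f}(x))^2 .
\end{aligned}
```

The cross term vanishes because $E[\hat{f}(x)] = \bar{f}(x)$. The gain from using $\bar{f}(x)$ is therefore exactly the variance of $\hat{f}(x)$ across training samples, which is why averaging helps most for unstable fitting methods.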
Bagging. We will now use bagging to hopefully arrive at an even better model for the price of a diamond. For simplicity we will first use smaller and simpler trees to illustrate the idea of bagging. Below are four different trees fit to bootstrap samples drawn from the full diamonds dataset. For each tree fit I used cp = .005 and minsplit = 5. The bagged estimate of the diamond price is the mean of the predictions from the trees fit to the bootstrap samples.
These trees do not vary much. Thus averaging their predictions yields little benefit and will not produce a reasonable estimate of the expected prediction $E[\hat{f}(x)]$.
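The bagging idea can be sketched with rpart as follows. The settings cp = .005 and minsplit = 5 come from the slide; everything else is an assumption for illustration: the data are simulated rather than the diamonds data, the number of bootstrap trees B = 25 is arbitrary, and `carat` and `logPrice` are stand-in variable names.

```r
# Bagging sketch: average predictions from trees fit to bootstrap samples
library(rpart)

set.seed(7)
n <- 400
train <- data.frame(carat = runif(n, 0.3, 2))
train$logPrice <- 7 + 1.8 * log(train$carat) + rnorm(n, sd = 0.15)
test <- data.frame(carat = seq(0.4, 1.9, length.out = 50))

B <- 25   # number of bootstrap samples (arbitrary choice)
preds <- sapply(1:B, function(b) {
  boot <- train[sample(n, replace = TRUE), ]            # bootstrap sample
  tree <- rpart(logPrice ~ carat, data = boot,
                control = rpart.control(cp = 0.005, minsplit = 5))
  predict(tree, newdata = test)                          # predictions from this tree
})

# Bagged estimate: mean of the B tree predictions for each test case
bagged <- rowMeans(preds)
head(bagged)
```

Each column of `preds` holds one tree's predictions, so the spread across columns gives a rough sense of how much the individual trees disagree.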
Tree-based Models: Random Forests (Breiman, 2001). At each split, only a random subset of the predictors is considered as split candidates. This is the key feature of the random forest that breaks the correlation between trees fit to the bootstrap samples.
Random Forests. How to control the growth of your random forest model:
• ntree – number of trees to grow in your forest, like B or nbagg in bagging.
• mtry – number of predictors to choose randomly for each split (default = $p/3$ for regression problems and $\sqrt{p}$ for classification problems, where $p$ is the number of predictors).
• nodesize – minimum size of the terminal nodes in terms of the number of observations contained in them; default is 1 for classification problems and 5 for regression problems. Larger values here speed up the fitting process because trees in the forest will not be as big.
• maxnodes – maximum number of terminal nodes a tree can have in the forest. Smaller values will speed up fitting.
Activity: Diamonds in the Forest
Diamond.RF = randomForest(logPrice ~ ., data = DiaNew.train, mtry = ??, ntree = ??, nodesize = ??, maxnodes = ??)
Tree-based Models: Gradient Boosting. Key Features: • Each tree in the sequence is trying to explain variation left over from the previous tree! • Trees at each stage are simpler (fewer terminal nodes). • More “knobs” to adjust.
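The idea that each tree explains what the previous trees left over can be sketched by hand for squared-error loss. This is a hand-rolled illustration using rpart, not the gbm package's implementation; the simulated data, the number of trees, the shrinkage value, and the tree depth are all assumptions.

```r
# Gradient boosting sketch (squared-error loss): fit small trees to residuals
library(rpart)

set.seed(8)
n <- 400
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.2)

n_trees   <- 100
shrinkage <- 0.1
pred <- rep(mean(dat$y), n)      # start from the constant (mean) prediction
train_mse <- numeric(n_trees)
work <- dat

for (m in 1:n_trees) {
  work$resid <- dat$y - pred     # variation left over from the previous trees
  tree <- rpart(resid ~ x, data = work,
                control = rpart.control(maxdepth = 2, cp = 0))  # small tree
  # Add a shrunken version of the new tree's fit to the running prediction
  pred <- pred + shrinkage * predict(tree, newdata = work)
  train_mse[m] <- mean((dat$y - pred)^2)
}

round(train_mse[c(1, 10, 50, 100)], 3)
```

Training error can only go down as layers are added, so a validation set or cross-validation is needed to decide when to stop adding trees.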
Training a Gradient Boosted Model
• n.trees = number of layers (trees) in the sequence.
• interaction.depth = # of terminal nodes in each layer – 1.
• n.minobsinnode = minimum number of observations allowed in the terminal nodes of each tree.
• shrinkage = learning rate; the smaller the value, the more layers need to be run.
• bag.fraction = fraction of the training cases randomly selected for fitting the next layer.
• train.fraction = allows for a training/validation split.
Tree-based Models: Gradient Boosting Activity: Can you train a GBM that does better than our previous models?
Treed Regression. Rather than splitting our data into disjoint regions in the predictor/term space and then using the mean of the response in these regions as the basis of a prediction, treed regression fits a linear model in the “terminal nodes”. What? Show me please…
Treed Regression This approach seems like it might work well for these data as we have different qualities of diamonds based on color, clarity, cut, and we know that price is strongly associated with carat size. Thus a tree that breaks up diamonds in terms of quality first, then examines the relationship between price and carat size might work great?!?
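A heavily simplified sketch of the idea in base R: the "tree" step is done by hand as a single split on a quality factor, and a separate linear model in carat size is fit within each resulting node. A real treed regression chooses its splits algorithmically; the simulated data, the two-level `quality` factor, and the helper `predict_treed` are assumptions made for illustration.

```r
# Treed regression sketch: split on quality, then fit a linear model per node
set.seed(9)
n <- 300
dat <- data.frame(
  carat   = runif(n, 0.3, 2),
  quality = factor(sample(c("low", "high"), n, replace = TRUE))
)
# The price/carat relationship differs by quality group
slope <- ifelse(dat$quality == "high", 2.2, 1.4)
dat$logPrice <- 7 + slope * log(dat$carat) + rnorm(n, sd = 0.1)

# One linear model in each "terminal node" (here: each quality level)
node_fits <- lapply(split(dat, dat$quality),
                    function(d) lm(logPrice ~ log(carat), data = d))
slopes <- sapply(node_fits, function(f) unname(coef(f)[2]))
slopes

# Predict by routing each case to its node and using that node's model
predict_treed <- function(newdata) {
  out <- numeric(nrow(newdata))
  for (lev in names(node_fits)) {
    rows <- newdata$quality == lev
    if (any(rows)) {
      out[rows] <- predict(node_fits[[lev]], newdata = newdata[rows, ])
    }
  }
  out
}

head(predict_treed(dat))
```

The node-specific slopes recover the different price/carat relationships that a single tree with constant node means could only approximate with many additional splits.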
Classification Problems. Many of the methods we have considered can be used for classification problems as well, e.g. any of the tree-based methods can be used for classification. Others that are unique to classification include discriminant analysis, nearest neighbor classification, Naïve Bayes classifiers, support vector machines (SVM), etc. (Script file with a few examples in Block 4 folder on GitHub)