Random Forest 101 Using Random Forest as a Tool for Policy Analysis Reuben Ternes November 2012
Overview Part 1: Policy Analysis – Test Optional Part 2: The Weaknesses of Parametric Statistics Part 3: Data Mining: Random Forest as an Alternative Part 4: Real world example
Part 1 Policy Analysis – Test Optional Reuben Ternes November 2012
Test Optional? • Lots of institutions are considering going ‘test-optional’ these days. Should yours? • How can we use IR data to reach a reasonable policy recommendation? • Just as a caveat: OU is not considering changing its admissions rules; this is more of a theoretical exercise. • We’re already partially test-optional
Test-Optional Literature • A 2003 study (Geiser with Studley) suggested: • HSGPA was better than the SAT at predicting first-year GPAs. • 80,000+ sample size, University of California data • Used regression, logistic regression, and some HLM to reach their conclusions. • Fairly rigorous methodology. • A 2007 follow-up study (Geiser & Santelices) found the same pattern with 4-year outcomes. • Very influential
Test Optional Literature Since • The literature on the topic is vast. • Most of it supports the notion that HSGPA is a better predictor than SAT/ACT. • Many find that ACT/SATs add predictive validity. • Some do not, or find that the addition is trivial. • Almost all of the literature uses a parametric regression (of some kind) to estimate SAT/ACT’s predictive validity.
Part 2 The Weaknesses of Parametric Statistics Reuben Ternes November 2012
What’s Wrong with Regression? • OLS regression is a fantastic tool. • But its failings as a predictive tool are well known: • Missing data is difficult to deal with • Categorical data is difficult to deal with • Interactions must be modeled by hand • Non-linearities must be modeled by hand • It handles data sets with lots of variables poorly • Overfitting is common • It is not a good tool for understanding the predictive contribution of ACT scores.
Regression is a Parametric Technique • All parametric statistical techniques make certain assumptions about the data. • In regression: • Normality of errors • Homoscedasticity (constant error variance) • Linearity • No multicollinearity • Among others…
Parametric Assumptions • In practice, these assumptions are often incorrect. • We still use parametric statistics because they are useful. • But they are not perfect estimators of the predictive contributions of different variables! • And, they don’t always make good predictions!
Regression: Categorical Data • Imagine that you have 1 categorical variable with 10 categories. • In regression, you have to code this as 10 dummy variables (0,1). • If you have 10 such variables, then you have 100 additional variables in your regression model. • This reduces your degrees of freedom! • Now, imagine that you have interaction terms with 10 other potential continuous variables. • That’s 1000 different variables!
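A quick way to see the explosion, as a minimal sketch in Python (hypothetical column names and random data, not the OU admissions file): pandas’ get_dummies turns ten 10-category predictors into 100 indicator columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Ten hypothetical categorical predictors, each with 10 levels.
df = pd.DataFrame({
    f"cat_{j}": rng.choice([f"level_{i}" for i in range(10)], size=200)
    for j in range(10)
})

# Dummy coding turns 10 columns into 100 indicator (0/1) columns.
dummies = pd.get_dummies(df)
print(dummies.shape)  # (200, 100)
```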
Regression: Interaction & Non-Linearities • Now imagine that you have 10 continuous variables. • You should, at the very least, include quadratic and cubic versions of these variables in your model, in case they are not linearly related to the outcome. • Now you have 30 variables • Don’t forget your interaction terms!
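The same explosion for continuous predictors, sketched with scikit-learn on synthetic data (the degree and variable count match the slide): a full degree-3 expansion of 10 variables, interactions included, yields hundreds of terms.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(200, 10)  # ten hypothetical continuous predictors

# Degree-3 expansion adds quadratic and cubic terms plus all interactions.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_expanded = poly.fit_transform(X)
print(X_expanded.shape)  # (200, 285) -- 285 terms from the original 10
```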
Regression: Overfitting • But if you actually model all of this: • You’ve probably gone too far. • Eventually, you’ll start modeling noise, not ‘real’ patterns. • It is difficult to tell when you’ve overfitted your data when using regression. • What will happen? • Your model will fit the training data well, but its predictions on new data will be poor.
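A minimal demonstration of that failure mode, assuming synthetic data and scikit-learn: a high-degree polynomial fits the training data better than a straight line, yet predicts held-out data worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(60, 1))
y = X.ravel() + rng.normal(scale=0.5, size=60)  # truly linear, plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))
# The degree-15 fit scores higher on the training data but lower on
# the held-out data: it has modeled noise, not 'real' patterns.
```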
Regression: Missing Data • You must model missing data, or the data is lost. • It is common to impute the median or mean value for continuous data. • Or the most common response for categorical data. • Or to code values as missing. • If you don’t impute, then every case with missing data, even if it is mostly complete, won’t be used in the final analysis. • If the data isn’t missing at random, you could be in serious trouble. • Often, you don’t know why data is missing.
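A minimal imputation sketch (hypothetical GPA/ACT values, using scikit-learn’s SimpleImputer): without this step, both incomplete rows would simply be dropped.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Columns: HS GPA, ACT score (hypothetical values).
X = np.array([[3.2, 24.0],
              [2.8, np.nan],    # missing ACT score
              [np.nan, 28.0]])  # missing HS GPA

# Median imputation: each missing value is replaced by its column's median.
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X))
```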
Part 3 Data Mining: Random Forest as an Alternative Reuben Ternes November 2012
Recent History of Data Mining • Netflix Prize • Target • Yahoo • Amazon • Etc. • All are using prediction algorithms to match customers with products. • The prediction tools they are using are much more sophisticated than simple regression!
How Data Mining Can Help Inform Policy • There are other ways to understand predictive contributions • Data Mining/Machine Learning Algorithms • Have improved greatly over the past decade • Are now recognized to be much better predictors than many standard regression techniques • Random Forest, in particular, stands out.
Random Forest • Random Forest • Deals with missing data well • Robust to over-fitting • Relatively easy to use • Can handle hundreds of different variables • Categorical (i.e. non-numerical) data is OK • Makes no assumptions (non-parametric) • Overall good performance
How Does It Work? • It builds lots of (decision) trees • Randomly • (That’s why it’s called Random Forest)
How Random Forest Works: Overview • Step 1) Build each tree using random subsets of the predictor variables. • Number of predictors tried at each split = sqrt(p) for classification outcomes, or p/3 for continuous outcomes, where p is the total number of predictors. • Step 2) Grow each tree on N cases drawn from the dataset at random, with replacement. • For each tree, approx. 1/3 of the dataset isn’t used. • (Bootstrapping) • Step 3) After building the tree, ‘run’ the unused cases through it and record the results. • Step 4) Repeat this process 500-1000 times. • Classification probabilities are the proportion of ‘yes’ votes across trees; regression predictions are the average across trees.
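Here is how those four steps map onto scikit-learn’s RandomForestRegressor, as a minimal sketch on synthetic data (the slides don’t specify the software, and the variable names are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))  # nine hypothetical predictors
y = X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=1000)

forest = RandomForestRegressor(
    n_estimators=500,   # Step 4: build 500 trees
    max_features=1/3,   # Step 1: try ~1/3 of the predictors at each split (regression)
    bootstrap=True,     # Step 2: grow each tree on N cases drawn with replacement
    oob_score=True,     # Step 3: score each tree on its unused (out-of-bag) cases
    random_state=0,
)
forest.fit(X, y)
print(forest.oob_score_)  # proportion of variance explained, from out-of-bag cases
```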
The Random Part Is Important • You could build one giant decision tree with dozens of variables. • But it would be big. Too big. • It suffers from some of the same problems as standard regression techniques (it overfits, poorly models interaction effects, etc.) • Instead, Random Forest uses random elements to its advantage. • 1) It builds many smaller trees (500-1000) using a random sample of the predictors. • 2) It samples N cases with replacement.
Why Make Many Random Trees? • The trees are smaller. • Smaller trees are easier to deal with. • That means you can make a lot of them. • Aggregating lots of small trees does a better job of capturing interaction effects without overfitting. • Ditto with non-linearities. • (The split on any continuous predictor will be different for every tree)
Why Sample with Replacement? • It keeps N high, but creates a hold-out set. • This hold-out set is used to create an (unbiased) estimate of the error rate. • This means you don’t need a separate test data set! • (Essentially, every tree carries its own training set and test set rolled into one.) • There are known issues with sampling with replacement. • It does not affect the raw predictions. • It does affect variable importance measures.
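The “approx. 1/3 unused” figure is easy to verify directly; this sketch draws one bootstrap sample and counts the cases that were never selected:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap sample: n cases drawn with replacement.
sample = rng.integers(0, n, size=n)
out_of_bag = np.setdiff1d(np.arange(n), sample)

# About 1/e (~36.8%) of cases are never drawn -- the tree's built-in hold-out set.
print(len(out_of_bag) / n)
```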
Pause for Questions Let me pause for questions before continuing.
Random Forest: Results • Random Forest results are not like regression output. • You get a variable importance list. • Based on node purity measures (the Gini coefficient). • The numbers themselves are pretty much uninterpretable. • No explanation of how variables interact with the outcome. • No established method to create p-values. • You really only get: • Prediction results • A vague sense of how important each variable is • Either an error rate (categorical outcomes) or a percent of total variance explained (continuous outcomes).
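Concretely, this is about everything a fit returns; a sketch with scikit-learn and hypothetical variable names (the importance numbers rank variables but carry no units or p-values):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # hypothetical: HS GPA, ACT, SES proxy
y = X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=1000)

forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)

# Node-purity-based importances: a ranking, not an interpretable effect size.
for name, imp in zip(["hs_gpa", "act", "ses_proxy"], forest.feature_importances_):
    print(f"{name}: {imp:.3f}")
print(f"variance explained (out-of-bag): {forest.oob_score_:.3f}")
```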
Testing Policy with Random Forest • If you can’t get p-values, how can you do policy analysis with Random Forest? • What you can do is run various sets of predictions and look at their accuracy. • Systematically exclude the variables that you are interested in examining, as in the sketch below.
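A minimal sketch of that exclusion approach, on synthetic data with a hypothetical “ACT” column (the real analysis in Part 4 follows this same pattern): fit the forest with and without the variable and compare the variance explained.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))  # hypothetical predictors; column 1 stands in for ACT
y = X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.5, size=2000)

def variance_explained(X):
    forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
    return forest.fit(X, y).oob_score_

full = variance_explained(X)
without_act = variance_explained(np.delete(X, 1, axis=1))
print(f"all predictors: {full:.3f}")
print(f"ACT excluded:   {without_act:.3f}")
# The drop in variance explained is ACT's predictive contribution.
```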
Part 4 Real World Example of Random Forest Reuben Ternes November 2012
The Question • Should your institution go test-optional? • Another way to ask this question is: • How much do admissions tests tell us about future student outcomes? • We will test just first-year GPA. • But you could test anything (retention, graduation, etc.)
Admissions Models • I consider three models • 1) Saturated • An extreme (unrealistic) amount of data on incoming students. • Obtained late during the admissions cycle. • More information than a human could process to make a decision. • About 50 variables we collect during the admissions cycle. • 2) Just HS GPA and ACT scores. • 3) HS GPA, ACT scores, and one measure of SES. • Obtained by aggregating the % of Pell students by zip code for OU over the last 10 years. • I test this model because one of the common complaints against standardized tests is that they only measure SES.
Results: Saturated Model • Saturated model • Averaged over 5 trials (500 trees per trial) • All variables – 29.9% of total variance explained • Excluding ACT scores – 29.7% of total variance explained • Conclusion: ACT scores do not add much information to the total model. • But they probably do add something. • But this is an unrealistic model for admissions decisions, so it doesn’t answer our question.
Results: HS GPA + ACT Scores Only Model • HS GPA + ACT scores only model • Averaged over 5 trials (500 trees per trial) • HS GPA + ACT scores – 21.2% of total variance explained • Excluding ACT scores – 20.2% of total variance explained • Conclusion: ACT scores improve predictions by a noticeable, but still small, amount at OU.
Results: HS GPA, ACT Scores, SES Model • HS GPA, ACT, + SES model • Averaged over 5 trials (500 trees per trial) • HS GPA, ACT, SES – 25.0% of total variance explained • Excluding ACT scores – 21.6% of total variance explained • Conclusion 1: ACT scores improve predictions noticeably. • Conclusion 2: There are some very important and non-trivial interaction effects going on between ACT scores and SES. • If our goal is to develop predictive decision rules that correlate with academic success, we are leaving a lot of useful information out by not considering SES data.