1st Annual USD Mathematics Project Day, May 7, 2004, University of San Diego, San Diego, CA. Determining the Offensive Value of a Major League Baseball Player Using Linear Regression Analysis and Bootstrap. By Adam Carney.
Outline • Problem Formulation • Previous Work • Bootstrap Method • Regression Results • Evaluation Tools • Player Rankings & Comparisons • Conclusions & Future Plans
Problem Formulation • Given: MLB Season Statistics • Objective: Find offensive value of each player • Limitations • One semester • Use season data rather than game data • Lack data for certain statistics such as GIDP, ROE, SF and IBB.
Problem Formulation (cont.) • How to Model Mathematically? • Use regression analysis on team data for runs scored per game (RPG) • Apply this model to individual players • Hidden Assumptions/Constraints • Zero constant term in regression model • Seven assumptions of ordinary least squares regression, including • No Serial Correlation • No Perfect Multicollinearity • Normally distributed error term
Problem Formulation (cont.) • Based on this we propose, • Expected Magnitudes and Signs
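As a rough illustration of the modeling step above, the sketch below fits a least-squares model with no constant term to team-level per-game rates. It is not the SPSS procedure used in the project: the choice of regressors (singles, home runs, walks), the data, and the weights in `true_weights` are all made up for the example, and `ols_through_origin` and `solve_linear_system` are hypothetical helper names.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_through_origin(rows, y):
    """Least-squares coefficients for y = b1*x1 + ... + bk*xk (no constant)."""
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up team rates per game: (singles, home runs, walks).
team_rates = [
    (6.0, 1.0, 3.0),
    (5.5, 1.2, 3.5),
    (6.3, 0.8, 2.9),
    (5.8, 1.1, 3.2),
    (6.1, 0.9, 3.4),
]
true_weights = (0.5, 1.4, 0.3)  # illustrative values only, not project results
rpg = [sum(w * x for w, x in zip(true_weights, row)) for row in team_rates]

weights = ols_through_origin(team_rates, rpg)
print([round(w, 3) for w in weights])
```

Because the made-up RPG values are generated without noise, the fit simply recovers the illustrative weights; with real team data the residuals would be nonzero and the OLS assumptions listed above would need checking.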
Previous Work • Pete Palmer’s Linear Weights • Computer Simulation • Probability Theory • Runs Above Average • Bill James’ Runs Created • Simplified Version • Complex Version • Over 60 lines of calculations • Takes almost everything into account
Bootstrap Method • Resampling Technique • Used it to estimate bias and variance of a random sample of RPG data • Repeated with 99 and 999 resamples
Bootstrap Method • Estimate Bias and Variance • Results reported for R = 99 and R = 999 resamples
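A minimal sketch of the resampling idea, assuming the statistic of interest is the sample mean of RPG (the slides do not say which statistic was used). The data values are made up, and `bootstrap_bias_variance` is a hypothetical helper name.

```python
import random
import statistics


def bootstrap_bias_variance(sample, statistic, n_resamples, seed=0):
    """Estimate the bias and variance of `statistic` by resampling with replacement."""
    rng = random.Random(seed)
    observed = statistic(sample)
    replicates = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=len(sample))  # same size as the original
        replicates.append(statistic(resample))
    bias = statistics.fmean(replicates) - observed
    variance = statistics.variance(replicates)
    return bias, variance


# Made-up runs-per-game values for ten team seasons.
rpg_sample = [4.2, 4.8, 5.1, 4.5, 4.9, 5.3, 4.0, 4.6, 4.7, 5.0]

results = {}
for r in (99, 999):
    results[r] = bootstrap_bias_variance(rpg_sample, statistics.fmean, r)
    print(r, results[r])
```

Running it with both R = 99 and R = 999 shows how the estimates settle as the number of resamples grows.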
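Bootstrapping Residuals • Run OLS regression to find $\hat{\beta}$ and the residuals $e = y - x\hat{\beta}$ • Resample residuals $e^*$ for each y • Find $y^* = x\hat{\beta} + e^*$ • Run OLS regression on (x, y*) to get $\hat{\beta}^*$ • Repeat many times • Create prediction interval using the $\hat{\beta}^*$’s a) Resample out-of-sample errors b) Find the bootstrap prediction errors c) Build prediction interval with the point prediction and the quantiles of those errors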
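The steps on this slide can be sketched roughly as follows, under several simplifying assumptions: a single regressor with no constant term, raw (uncentered, unscaled) residuals, a "future" error drawn from the same residual pool, and prediction error defined as bootstrap prediction minus simulated future value. The data are made up, and `fit_slope` and `residual_bootstrap_interval` are hypothetical names.

```python
import random


def fit_slope(x, y):
    """OLS slope for y = b*x with no constant term."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def residual_bootstrap_interval(x, y, x_new, n_resamples=999, level=0.90, seed=0):
    """Prediction interval at x_new from bootstrapped residuals."""
    rng = random.Random(seed)
    slope = fit_slope(x, y)
    residuals = [yi - slope * xi for xi, yi in zip(x, y)]
    errors = []
    for _ in range(n_resamples):
        # Rebuild y* from the fitted values plus resampled residuals, then refit.
        y_star = [slope * xi + rng.choice(residuals) for xi in x]
        slope_star = fit_slope(x, y_star)
        # Simulated future observation at x_new and its prediction error.
        future = slope * x_new + rng.choice(residuals)
        errors.append(slope_star * x_new - future)
    errors.sort()
    alpha = (1 - level) / 2
    lo_idx = round(alpha * (n_resamples + 1)) - 1
    hi_idx = round((1 - alpha) * (n_resamples + 1)) - 1
    prediction = slope * x_new
    # error = prediction - actual, so actual = prediction - error.
    return prediction - errors[hi_idx], prediction, prediction - errors[lo_idx]


# Made-up team data: hits per game and runs per game.
hits = [8.1, 8.6, 9.0, 9.3, 8.8, 9.5, 8.4, 9.1]
rpg = [4.1, 4.5, 4.4, 4.9, 4.3, 4.8, 4.3, 4.6]

lower, prediction, upper = residual_bootstrap_interval(hits, rpg, x_new=9.0)
print(round(lower, 3), round(prediction, 3), round(upper, 3))
```

The interval endpoints come from the empirical quantiles of the bootstrap prediction errors, so no normality assumption on the error term is needed.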
Advantages of Bootstrap • No assumption on error terms other than independence • Estimates actual distribution of error terms allowing more accurate prediction intervals • Could also bootstrap cases rather than residuals • Does not assume constant variance • Simulated samples have different designs
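For comparison, a case bootstrap resamples whole (x, y) pairs instead of residuals, which is why each simulated sample has a different design. The sketch below again assumes a single regressor with no constant term and made-up data; `case_bootstrap_slopes` is a hypothetical name.

```python
import random
import statistics


def fit_slope(x, y):
    """OLS slope for y = b*x with no constant term."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def case_bootstrap_slopes(x, y, n_resamples=999, seed=0):
    """Refit the slope on resampled (x, y) pairs."""
    rng = random.Random(seed)
    cases = list(zip(x, y))
    slopes = []
    for _ in range(n_resamples):
        resample = rng.choices(cases, k=len(cases))  # pairs drawn with replacement
        xs, ys = zip(*resample)
        slopes.append(fit_slope(xs, ys))
    return slopes


# Made-up team data: hits per game and runs per game.
hits = [8.1, 8.6, 9.0, 9.3, 8.8, 9.5, 8.4, 9.1]
rpg = [4.1, 4.5, 4.4, 4.9, 4.3, 4.8, 4.3, 4.6]

slopes = case_bootstrap_slopes(hits, rpg)
slope_se = statistics.stdev(slopes)  # bootstrap standard error of the slope
print(round(slope_se, 4))
```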
Regression Results • Used SPSS • Ran separate regression analysis for each era without a constant term • Looked at several possible models for each era • Used best model to apply to individual players
Evaluation Tools • Ran Chow Test to see if two sets of regression coefficients are equivalent • df = Degrees of Freedom • Prob. Value = Probability of an F statistic at least this large if the two (or three) data sets have the same coefficients
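A sketch of the Chow F statistic for two data sets, assuming one regressor and no constant term (so k = 1 coefficient per model). The data are made up and `chow_f` is a hypothetical name. The standard library has no F distribution, so the block stops at the statistic and its degrees of freedom; the probability value could then be obtained from, for example, `scipy.stats.f.sf(f_stat, df_num, df_den)`.

```python
def ssr_through_origin(x, y):
    """Sum of squared residuals for the OLS fit y = b*x (no constant)."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return sum((b - slope * a) ** 2 for a, b in zip(x, y))


def chow_f(x1, y1, x2, y2):
    """Chow F statistic for equal slopes in two samples, one coefficient each."""
    k = 1  # number of coefficients in each model
    ssr_pooled = ssr_through_origin(x1 + x2, y1 + y2)
    ssr_split = ssr_through_origin(x1, y1) + ssr_through_origin(x2, y2)
    df_num = k
    df_den = len(x1) + len(x2) - 2 * k
    f_stat = ((ssr_pooled - ssr_split) / df_num) / (ssr_split / df_den)
    return f_stat, (df_num, df_den)


# Made-up samples standing in for two eras with clearly different slopes.
era_a_x, era_a_y = [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9]
era_b_x, era_b_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]

f_stat, dof = chow_f(era_a_x, era_a_y, era_b_x, era_b_y)
print(round(f_stat, 1), dof)
```

A large F relative to the F(df_num, df_den) distribution is evidence that the two sets of coefficients differ, which supports fitting each era separately.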
Evaluation Tools (cont.) • Error Analysis (used last 3 years of each era) • Calculated Mean Absolute Deviation and Mean Absolute Percentage Error
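The two error measures can be written in a few lines. The numbers below are made up, and the function names are hypothetical.

```python
def mean_absolute_deviation(actual, predicted):
    """Average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, expressed in percent."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


# Made-up out-of-sample RPG values and model predictions.
actual_rpg = [4.0, 5.0]
predicted_rpg = [4.2, 4.5]

mad = mean_absolute_deviation(actual_rpg, predicted_rpg)
mape = mean_absolute_percentage_error(actual_rpg, predicted_rpg)
print(round(mad, 2), round(mape, 1))
```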
Evaluation Tools (cont.) • 90% Prediction Intervals from Bootstrap • Calculated bootstrap prediction errors for each out-of-sample data point using matrices • Gave an N by 1000 matrix with each row containing the prediction errors for a different data point
Evaluation Tools (cont.) • Found prediction interval for each data point • Counted how many intervals contained the actual value
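A sketch of the counting step, assuming each row of the error matrix holds bootstrap prediction errors (prediction minus simulated actual) for one out-of-sample point. The tiny matrix here is made up, with 19 columns instead of 1000, and `count_covered` is a hypothetical name.

```python
def count_covered(predictions, error_matrix, actuals, level=0.90):
    """Count the data points whose actual value falls inside its interval."""
    alpha = (1 - level) / 2
    covered = 0
    for prediction, errors, actual in zip(predictions, error_matrix, actuals):
        ordered = sorted(errors)
        r = len(ordered)
        lo_idx = round(alpha * (r + 1)) - 1
        hi_idx = round((1 - alpha) * (r + 1)) - 1
        # error = prediction - actual, so actual = prediction - error.
        lower = prediction - ordered[hi_idx]
        upper = prediction - ordered[lo_idx]
        if lower <= actual <= upper:
            covered += 1
    return covered


# Made-up example: two data points, 19 bootstrap errors each (-0.9 to 0.9).
errors_row = [i / 10 for i in range(-9, 10)]
predictions = [5.0, 5.0]
error_matrix = [errors_row, errors_row]
actuals = [5.5, 6.5]  # the first lies inside [4.1, 5.9], the second does not

n_covered = count_covered(predictions, error_matrix, actuals)
print(n_covered, "of", len(actuals))
```

If the intervals are well calibrated, roughly 90% of out-of-sample points should be covered.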
Linear Run Values (LRV) • Applied regression results to individuals’ season statistics • Used the model corresponding to the era of that season • Computed adjusted LRV by dividing by the Batter Park Factor • Adjusted each stint separately if player played for more than one team in a season
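The adjustment described above might look like the following sketch. The event weights, the statistics, and the park factors are invented for illustration and are not the project's regression results; `adjusted_lrv` is a hypothetical name.

```python
def adjusted_lrv(stints, weights):
    """Sum park-adjusted linear run values over a player's stints in one season."""
    total = 0.0
    for stint in stints:
        raw = sum(weights[event] * count for event, count in stint["stats"].items())
        total += raw / stint["park_factor"]  # adjust each stint by its own park
    return total


# Hypothetical weights for one era (runs per event).
era_weights = {"1B": 0.5, "HR": 1.4, "BB": 0.3}

# Made-up season split across two teams.
season = [
    {"stats": {"1B": 100, "HR": 20, "BB": 50}, "park_factor": 1.0},
    {"stats": {"1B": 40, "HR": 10, "BB": 20}, "park_factor": 1.1},
]

lrv = adjusted_lrv(season, era_weights)
print(round(lrv, 2))
```

Dividing stint by stint means a hitter traded mid-season is not credited or penalized for a park he only played in for part of the year.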
Conclusions • Split data into five different eras • Ran regression analysis with and without bootstrap • Applied model to individual players • Compared results to those of Pete Palmer’s Linear Weights and Bill James’ Runs Created
Future Plans • Use game-by-game data • Find a more complete data set • Separate American & National Leagues • Forecast future LRV • Evaluate LRV vs. Salary • Find Pitchers’ and Fielders’ LRV
Leftover Statistics • Estimated time spent on project = 400 hours • # of weeks spent on project = 12 • # of hours per week = 33.33 • Predicted # of hours per week = 20 • Prediction error = (33.33-20)/20 = 66.7% • # of data points after bootstrapping = 1,700,000 • Size of data files after completion = 2,199,141,179 bytes (2.04 GB)