
1st Annual USD Mathematics Project Day
May 7, 2004, University of San Diego, San Diego, CA

Determining the Offensive Value of a Major League Baseball Player Using Linear Regression Analysis and Bootstrap
By Adam Carney

Outline
• Problem Formulation
• Previous Work
• Bootstrap Method
• Regression Results
• Evaluation Tools
• Player Rankings & Comparisons
• Conclusions & Future Plans

Problem Formulation
• Given: MLB Season Statistics
• Objective: Find offensive value of each player
• Limitations
  • One semester
  • Use season data rather than game data
  • Lack data for certain statistics such as GIDP, ROE, SF and IBB

Problem Formulation (cont.)
• How to Model Mathematically?
  • Use regression analysis on team data for runs scored per game (RPG)
  • Apply this model to individual players
• Hidden Assumptions/Constraints
  • Zero constant term in regression model
  • Seven assumptions of ordinary least squares regression, including
    • No Serial Correlation
    • No Perfect Multicollinearity
    • Normally distributed error term
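The modeling idea above (fit team RPG with no constant term, then apply the weights to a player) can be sketched as follows. This is a minimal illustration, not the SPSS analysis used in the project: the function names, the two chosen statistics, and all numbers are hypothetical.

```python
def ols_no_intercept(X, y):
    """Least-squares coefficients for y = X b with no constant term.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: use the largest remaining entry in this column.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("perfect multicollinearity: X'X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefficients, row):
    """Weighted sum of the statistics in `row` (no constant term)."""
    return sum(b * v for b, v in zip(coefficients, row))


# Hypothetical team-level data: singles and home runs per game (made up).
team_stats = [[6.0, 1.0], [5.5, 1.2], [6.5, 0.8], [5.8, 1.1]]
team_rpg = [4.4, 4.5, 4.3, 4.5]
weights = ols_no_intercept(team_stats, team_rpg)

# Apply the team-level weights to one hypothetical player's per-game rates.
player_value = predict(weights, [1.1, 0.25])
```

The singular-matrix check corresponds to the "No Perfect Multicollinearity" assumption: if one statistic is an exact linear combination of the others, the weights are not identified.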

Problem Formulation (cont.)
• Based on this we propose, [model equation not preserved in the transcript]
• Expected Magnitudes and Signs [table not preserved in the transcript]

Previous Work
• Pete Palmer’s Linear Weights
• Bill James’ Runs Created

Previous Work: Pete Palmer’s Linear Weights
• Computer Simulation
• Probability Theory
• Runs Above Average

Previous Work: Bill James’ Runs Created
• Simplified Version
• Complex Version
  • Over 60 lines of calculations
  • Takes almost everything into account

Bootstrap Method
• Resampling Technique
• Used it to estimate bias and variance of a random sample of RPG data
• Repeated with 99 and 999 resamples

Bootstrap Method
• Estimate Bias and Variance for R=99 and R=999 [results table not preserved in the transcript]
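As a rough sketch of how a bootstrap estimate of bias and variance can be computed, the snippet below resamples a small sample with replacement R times. The function name, the choice of the mean as the statistic, and the RPG values are assumptions for illustration only.

```python
import random
import statistics


def bootstrap_bias_variance(sample, statistic, R, seed=0):
    """Bootstrap estimates of the bias and variance of `statistic`."""
    rng = random.Random(seed)
    n = len(sample)
    observed = statistic(sample)
    # Each replicate is the statistic on a resample drawn with replacement.
    replicates = [
        statistic([rng.choice(sample) for _ in range(n)])
        for _ in range(R)
    ]
    bias = statistics.mean(replicates) - observed
    variance = statistics.variance(replicates)
    return bias, variance


# Hypothetical runs-per-game sample (made-up values).
rpg_sample = [3.9, 4.4, 5.1, 4.8, 4.2, 5.6, 4.0, 4.7]
for R in (99, 999):
    bias, variance = bootstrap_bias_variance(rpg_sample, statistics.mean, R)
    print(f"R={R}: bias={bias:.4f}, variance={variance:.4f}")
```

Running it with both R=99 and R=999 shows how stable the estimates are as the number of resamples grows.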

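Bootstrapping Residuals
• Run OLS regression to find the coefficient estimates and the residuals
• Resample residuals for each y
• Find y* by adding the resampled residuals to the fitted values
• Run OLS regression on (x, y*)
• Repeat many times
• Create prediction interval using the bootstrap coefficient estimates
  a) Resample out-of-sample errors
  b) Find the bootstrap predictions
  c) Build the prediction interval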
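The residual-bootstrap steps can be sketched as below. This is a simplified illustration under several assumptions that are not from the slides: a single regressor, a model through the origin, errors drawn from the in-sample residuals, and a plain percentile interval. All names and data are hypothetical.

```python
import random


def fit_through_origin(x, y):
    """OLS slope for y = b * x (no constant term)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def residual_bootstrap_interval(x, y, x_new, R=999, level=0.90, seed=0):
    """Percentile prediction interval at x_new from a residual bootstrap."""
    rng = random.Random(seed)
    beta = fit_through_origin(x, y)
    residuals = [yi - beta * xi for xi, yi in zip(x, y)]
    predictions = []
    for _ in range(R):
        # y* = fitted value + resampled residual, then refit on (x, y*).
        y_star = [beta * xi + rng.choice(residuals) for xi in x]
        beta_star = fit_through_origin(x, y_star)
        # Bootstrap prediction = refitted value + a resampled error.
        predictions.append(beta_star * x_new + rng.choice(residuals))
    predictions.sort()
    tail = (1.0 - level) / 2.0
    lo_idx = max(round(tail * (R + 1)) - 1, 0)
    hi_idx = min(round((1.0 - tail) * (R + 1)) - 1, R - 1)
    return predictions[lo_idx], predictions[hi_idx]


# Hypothetical data (made up): one predictor and a response.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
low, high = residual_bootstrap_interval(x, y, x_new=6.0, R=999)
```

With R=999 and a 90% level, the interval endpoints are the 50th and 950th ordered bootstrap predictions.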

Advantages of Bootstrap
• No assumption on error terms other than independence
• Estimates actual distribution of error terms allowing more accurate prediction intervals
• Could also bootstrap cases rather than residuals
  • Does not assume constant variance
  • Simulated samples have different designs
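For contrast with resampling residuals, a case bootstrap resamples whole (x, y) pairs, so each simulated sample has a different design. The sketch below assumes a one-regressor model through the origin and strictly positive x values; the function name and data are hypothetical.

```python
import random


def case_bootstrap_slopes(x, y, R=999, seed=0):
    """Slopes of y = b * x refitted on resampled (x, y) cases."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(R):
        # Draw n row indices with replacement; x and y stay paired.
        idx = [rng.randrange(n) for _ in range(n)]
        sxy = sum(x[i] * y[i] for i in idx)
        sxx = sum(x[i] * x[i] for i in idx)
        slopes.append(sxy / sxx)
    return slopes


# Hypothetical data (made up).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slopes = case_bootstrap_slopes(x, y, R=99)
```

Because no residuals are moved between observations, this version does not rely on the errors having constant variance.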

Regression Results
• Used SPSS
• Ran separate regression analysis for each era without a constant term
• Looked at several possible models for each era
• Used best model to apply to individual players

Regression Results

Evaluation Tools
• Ran Chow Test to see if two sets of regression coefficients are equivalent
• df = Degrees of Freedom
• Prob. Value = Probability of a test statistic at least this large if the two (or three) data sets have the same coefficients
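A sketch of the Chow test statistic in its standard textbook form, written so that it handles two or three groups (eras). It takes residual sums of squares from a pooled fit and from separate fits; the function name and the numbers are hypothetical, and this is not the SPSS output from the project.

```python
def chow_f(rss_pooled, group_rss, group_sizes, k):
    """Chow F statistic and its degrees of freedom.

    rss_pooled  -- residual sum of squares from one fit on all the data
    group_rss   -- residual sums of squares from each separate fit
    group_sizes -- number of observations in each group
    k           -- number of coefficients in the model
    """
    g = len(group_rss)
    rss_separate = sum(group_rss)
    df_num = (g - 1) * k
    df_den = sum(group_sizes) - g * k
    f_stat = ((rss_pooled - rss_separate) / df_num) / (rss_separate / df_den)
    return f_stat, df_num, df_den


# Hypothetical values for two eras (made up).
f_stat, df_num, df_den = chow_f(10.0, [4.0, 4.0], [20, 20], k=2)
```

The probability value would then come from the upper tail of an F distribution with (df_num, df_den) degrees of freedom, for example `scipy.stats.f.sf(f_stat, df_num, df_den)` if SciPy is available.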

Evaluation Tools (cont.)
• Error Analysis (used last 3 years of each era)
• Calculated Mean Absolute Deviation (MAD) and Mean Absolute Percentage Error (MAPE)
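The two error measures can be written out as follows, using their usual definitions. The function names and the sample values are hypothetical.

```python
def mean_absolute_deviation(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error as a percentage of the actual value."""
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical out-of-sample team RPG values and model predictions (made up).
actual_rpg = [4.0, 5.0, 4.6]
predicted_rpg = [4.4, 4.5, 4.6]
mad = mean_absolute_deviation(actual_rpg, predicted_rpg)
mape = mean_absolute_percentage_error(actual_rpg, predicted_rpg)
```

MAPE assumes no actual value is zero, which holds for team runs per game over a season.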

Evaluation Tools (cont.)
• 90% Prediction Intervals from Bootstrap
• Calculated for each out-of-sample data point using matrices
• This gave an N by 1000 matrix, with each row containing the bootstrap predictions for a different data point

Evaluation Tools (cont.)
• Found confidence interval for each data point
• Counted how many intervals contained the actual value
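One way to turn a matrix of bootstrap values into intervals and count coverage is sketched below. It assumes a simple percentile interval per row; the function names and the toy matrix are hypothetical.

```python
def percentile_interval(values, level=0.90):
    """Lower and upper percentile cutoffs of one row of bootstrap values."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lo_idx = min(max(round(tail * n), 0), n - 1)
    hi_idx = min(max(round((1.0 - tail) * n) - 1, 0), n - 1)
    return ordered[lo_idx], ordered[hi_idx]


def count_covered(bootstrap_matrix, actual, level=0.90):
    """Number of data points whose actual value lies inside its interval."""
    covered = 0
    for row, observed in zip(bootstrap_matrix, actual):
        lo, hi = percentile_interval(row, level)
        if lo <= observed <= hi:
            covered += 1
    return covered


# Toy 3 x 1000 matrix (made up): every row holds the values 1..1000.
toy_matrix = [list(range(1, 1001)) for _ in range(3)]
hits = count_covered(toy_matrix, [500, 5, 999])
```

Dividing the count by the number of data points gives the observed coverage rate, which can be compared with the nominal 90%.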

Linear Run Values (LRV)
• Applied regression results to individuals’ season statistics
• Used the model that corresponded to the era of that season
• Computed adjusted LRV by dividing by the Batter Park Factor
• Adjusted each stint separately if player played for more than one team in a season
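The stint-by-stint adjustment can be sketched like this. The coefficients, statistics, and park factors are made up, and the sketch assumes a park factor scaled so that 1.00 is neutral, which the slides do not specify.

```python
def linear_run_value(coefficients, stats):
    """Weighted sum of a player's statistics using era-specific weights."""
    return sum(coefficients[name] * stats[name] for name in coefficients)


def adjusted_season_lrv(coefficients, stints):
    """Season LRV with each stint divided by that team's park factor."""
    total = 0.0
    for stint in stints:
        raw = linear_run_value(coefficients, stint["stats"])
        total += raw / stint["park_factor"]
    return total


# Hypothetical era weights and a player traded mid-season (all made up).
era_weights = {"1B": 0.5, "HR": 1.4}
stints = [
    {"stats": {"1B": 100, "HR": 20}, "park_factor": 1.00},
    {"stats": {"1B": 50, "HR": 10}, "park_factor": 1.25},
]
season_lrv = adjusted_season_lrv(era_weights, stints)
```

Adjusting each stint before summing keeps a hitter-friendly park from inflating only the part of the season actually played there.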

Top LRV Rankings - Season

Top LRV Rankings - Career

Conclusions
• Split data into five different eras
• Ran regression analysis with and without bootstrap
• Applied model to individual players
• Compared results to those of Pete Palmer’s Linear Weights and Bill James’ Runs Created

Future Plans
• Use game-by-game data
• Find a more complete data set
• Separate American & National Leagues
• Forecast future LRV
• Evaluate LRV vs. Salary
• Find Pitchers’ and Fielders’ LRV

Leftover Statistics
• Estimated time spent on project = 400 hours
• # of weeks spent on project = 12
• # of hours per week = 33.33
• Predicted # of hours per week = 20
• Prediction error = (33.33-20)/20 = 66.7%
• # of data points after bootstrapping = 1,700,000
• Size of data files after completion = 2,199,141,179 bytes (2.04 GB)
