Baseball Statistics By Krishna Hajari Faraz Hyder William Walker
Objective • Our goal is to find out whether, over the past 10 years, any factor has consistently affected the winning percentage of the 30 teams in Major League Baseball.
Explanatory Variables • Team Batting Average • Baseball Stadium Dimensions • Team Payroll • Average Game Attendance • ERA (earned run average)
Explanation of Variables • Team Batting Average This is the statistic used to evaluate batting performance. It is calculated as hits / official at bats. • Stadium Dimensions Each stadium has a different field size, so we will test whether the distances from home plate to the left, center, and right field walls have an impact on a team’s performance.
Explanation of Variables • Team Payroll Each team’s payroll in MLB is different. In 2004, the highest-paying team, the Yankees, had a payroll of $184 million, more than 6 times that of the lowest-paying team, the Devil Rays, at $27.5 million. We propose that higher-paying teams perform better. The average payroll for the 2004 season was approximately $69 million, with a standard deviation of $33 million. • Average Game Attendance The average game attendance for 2004 was approximately 30.3k people, with a standard deviation of 8.9k.
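The payroll gap can be checked with a few lines of arithmetic. The sketch below uses only the four figures quoted on the slide; the variable names are ours, and the z-score is an added way of expressing how far a payroll sits from the league mean, not something the slide reports.

```python
# Payroll figures quoted on the slide (2004 season, millions of dollars).
yankees = 184.0
devil_rays = 27.5
mean_payroll = 69.0   # approximate league average
sd_payroll = 33.0     # approximate standard deviation

# Ratio of the highest payroll to the lowest.
ratio = yankees / devil_rays

# Distance from the league mean, in standard deviations (z-score).
z_yankees = (yankees - mean_payroll) / sd_payroll
z_devil_rays = (devil_rays - mean_payroll) / sd_payroll

print(f"highest / lowest payroll: {ratio:.2f}")
print(f"Yankees z-score: {z_yankees:.2f}")
print(f"Devil Rays z-score: {z_devil_rays:.2f}")
```

On these figures the Yankees sit roughly 3.5 standard deviations above the mean payroll, while the Devil Rays are only about 1.3 below it, so the payroll distribution is strongly skewed toward the top.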
Explanation of Variables • Earned Run Average This is the statistic used to evaluate a pitcher’s performance. It is calculated as (earned runs allowed × 9) / innings pitched.
Response Variable • Winning Percentage for Each Team • Games won / Games played
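The three ratio statistics defined on these slides (batting average, ERA, and winning percentage) can be written as small functions. This is a minimal sketch: the function names are ours, and the season totals in the example are hypothetical values chosen only to exercise the formulas.

```python
def batting_average(hits, at_bats):
    """Hits divided by official at bats."""
    return hits / at_bats


def earned_run_average(earned_runs, innings_pitched):
    """Earned runs allowed per nine innings pitched."""
    return 9 * earned_runs / innings_pitched


def winning_percentage(games_won, games_played):
    """Games won divided by games played."""
    return games_won / games_played


# Hypothetical season totals, not taken from the 2004 data.
print(f"batting average:    {batting_average(1500, 5600):.3f}")
print(f"earned run average: {earned_run_average(700, 1440):.2f}")
print(f"winning percentage: {winning_percentage(90, 162):.3f}")
```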
2004 Data • In this presentation we take one year, 2004, and show how we intend to analyze the data for each of the past 10 years, year by year.
Hypotheses • H0: None of these variables has an effect on winning percentage • Ha: At least one of the variables has an effect on winning percentage
Initial Summary This is the summary of the most general linear model, with all five explanatory variables present. The p-value is very small, so we reject H0 and conclude that at least one of the variables is significant.
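The slides do not say which software produced this summary. As an illustration of what such a fit involves, here is a minimal pure-Python sketch of ordinary least squares that returns the coefficients, R^2, and the overall F statistic used to test H0. The helper names `solve` and `fit_ols` are ours, the data rows are made up, and only two predictors are used to keep the example short. Turning the F statistic into a p-value needs the F distribution, which the standard library does not provide, so that step is left out.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(predictors, response):
    """Return (coefficients with intercept first, R^2, overall F statistic)."""
    rows = [[1.0] + list(p) for p in predictors]
    n, k = len(rows), len(rows[0])  # k counts the intercept
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(response) / n
    sse = sum((y - f) ** 2 for y, f in zip(response, fitted))
    sst = sum((y - mean_y) ** 2 for y in response)
    r2 = 1 - sse / sst
    f_stat = (r2 / (k - 1)) / ((1 - r2) / (n - k))
    return beta, r2, f_stat


# Made-up rows of (ERA, team batting average); not the 2004 data.
predictors = [(3.8, 0.280), (4.1, 0.270), (4.5, 0.265), (4.9, 0.258),
              (4.0, 0.262), (4.6, 0.275), (5.1, 0.268), (3.7, 0.259)]
win_pct = [0.620, 0.550, 0.490, 0.420, 0.520, 0.510, 0.430, 0.560]

beta, r2, f_stat = fit_ols(predictors, win_pct)
print("coefficients (intercept, ERA, batting average):", beta)
print("R^2:", r2)
print("overall F statistic:", f_stat)
```

A large F statistic corresponds to a small p-value for the overall test, which is the situation the summary describes.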
ANOVA Table This ANOVA table shows that at least three variables are significant, because their p-values are less than 0.05.
Variance Inflation Factor The VIF for each of the five explanatory variables is less than 10, so we will not exclude any of them from the regression on grounds of multicollinearity.
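For reference, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. Equivalently, the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix, which is what the sketch below computes. The function names and the three data columns are ours and purely illustrative; they are not the values behind the slide.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n:] for row in m]


def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(len(columns))]


# Made-up predictor columns (payroll in $M, attendance in thousands, ERA).
payroll = [60.0, 45.0, 120.0, 80.0, 35.0, 95.0]
attendance = [28.0, 22.0, 41.0, 33.0, 19.0, 30.0]
era = [4.2, 4.8, 4.0, 4.5, 4.9, 3.9]

for name, value in zip(["payroll", "attendance", "ERA"],
                       vif([payroll, attendance, era])):
    print(f"VIF({name}) = {value:.2f}")
```

A VIF of 1 means a predictor is uncorrelated with the others; the cutoff of 10 used on the slide is a common rule of thumb.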
Correlation Matrix The correlation matrix shows a somewhat high correlation between attendance and payroll. This is to be expected, since teams with higher attendance generate more revenue and can therefore afford higher payrolls.
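A correlation matrix of this kind holds the Pearson correlation for every pair of variables. The sketch below shows the calculation in plain Python; the function names and the three columns of numbers are ours and made up, so the printed values say nothing about the actual 2004 correlations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(data):
    """Map each pair of variable names to its Pearson correlation."""
    names = list(data)
    return {a: {b: pearson(data[a], data[b]) for b in names} for a in names}


# Made-up columns (payroll in $M, attendance in thousands, ERA).
data = {
    "payroll": [60.0, 45.0, 120.0, 80.0, 35.0, 95.0],
    "attendance": [28.0, 22.0, 41.0, 33.0, 19.0, 30.0],
    "era": [4.2, 4.8, 4.0, 4.5, 4.9, 3.9],
}

matrix = correlation_matrix(data)
for row_name, row in matrix.items():
    print(row_name, [f"{row[col]:+.2f}" for col in data])
```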
All Possible Regressions According to all the goodness criteria, the best model seems to be the one with ERA, Payroll, and Batting Average.
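An all-possible-regressions search fits a model for every subset of the candidate predictors and ranks the fits. The slide does not list which goodness criteria were used, so the sketch below assumes adjusted R^2 as a stand-in. The helper names and the data are ours and made up, and only three candidate predictors are included to keep the output short.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def adjusted_r2(columns, response):
    """Adjusted R^2 of an OLS fit with an intercept plus the given columns."""
    n = len(response)
    rows = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = len(rows[0])  # number of coefficients, intercept included
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(response) / n
    sse = sum((y - f) ** 2 for y, f in zip(response, fitted))
    sst = sum((y - mean_y) ** 2 for y in response)
    return 1 - (sse / (n - k)) / (sst / (n - 1))


def all_subsets(data, response):
    """Score every non-empty subset of predictors, best adjusted R^2 first."""
    names = list(data)
    results = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            score = adjusted_r2([data[name] for name in subset], response)
            results.append((score, subset))
    return sorted(results, reverse=True)


# Made-up data for eight teams; not the 2004 values.
data = {
    "era": [3.8, 4.1, 4.5, 4.9, 4.0, 4.6, 5.1, 3.7],
    "payroll": [120.0, 65.0, 50.0, 35.0, 90.0, 70.0, 45.0, 80.0],
    "batting_avg": [0.280, 0.270, 0.265, 0.258, 0.262, 0.275, 0.268, 0.259],
}
win_pct = [0.620, 0.550, 0.490, 0.420, 0.520, 0.510, 0.430, 0.560]

for score, subset in all_subsets(data, win_pct):
    print(f"{score:.3f}  {', '.join(subset)}")
```

Adjusted R^2 penalizes extra predictors, so the full model does not win automatically. Other criteria such as Mallows' Cp or AIC could be substituted in `adjusted_r2` without changing the enumeration.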
The residuals seem to be distributed evenly above and below the 0 line. However, the residuals tend to be negative when the predicted winning percentage falls below .45. The Q-Q plot indicates that the model is not a perfect fit, but the points are still close to a straight line.
Variance Test The variance test shows that most of the variances are very close to each other. This supports the assumption that the variances are approximately equal.
The only influential outlier, observation 19, is the New York Yankees. This is understandable given their astronomical payroll.
The Box-Cox plot indicates that a Box-Cox transformation with p = 2 can be used to improve the model.
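For reference, the Box-Cox transformation with power p maps a positive value y to (y^p - 1) / p, and to log(y) when p = 0. The sketch below assumes the transformation is applied to the response, winning percentage, which the slide does not state explicitly. The function name and the sample values are ours.

```python
from math import log


def box_cox(values, power):
    """Box-Cox power transform; every value must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if power == 0:
        return [log(v) for v in values]
    return [(v ** power - 1) / power for v in values]


# Made-up winning percentages, transformed with the power 2 from the slide.
win_pct = [0.42, 0.50, 0.58, 0.62]
transformed = box_cox(win_pct, 2)
print(transformed)
```

With p = 2 the transform is a shifted and scaled square of the response, so the regression is refit on the transformed values and predictions are mapped back with the inverse, sqrt(2 * z + 1).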
The Box-Cox Transformation has improved the model, and it can be seen in these graphs. The residuals appear to be much more normally distributed, and the line is much closer to 0 when the outlier is removed. The Q-Q plot is also closer to a straight line, indicating an improved model.
Conclusion • The Box-Cox Transformation improved the model. • Unexpectedly, payroll was determined to play a comparatively minor role in the 2004 season. It also does not appear in the stepwise regression models for 5 of the past 10 years. • The two explanatory variables that were consistent factors over the past 10 years were ERA and Batting Average.