Moneyball in the Classroom Using Baseball to Teach Statistics Josh Tabor

Josh Tabor
Canyon del Oro High School
joshtabor@hotmail.com

Objectives
By the end of the session, participants will:
• Obtain several classroom-tested examples that promote the real-world applications of mathematics and help students meet the Common Core State Standards
• Understand that the goal of a model should be to minimize the size of prediction errors
• Understand the properties of least-squares regression lines and how to interpret the slope and intercept
• Understand the concept of regression to the mean and what it reveals about future performances

Move over, Brad Pitt. Here is the real star of Moneyball:
Predicted winning percentage = RS² / (RS² + RA²)
Created by Bill James and called the “Pythagorean” expected winning percentage, this formula uses a team’s runs scored (RS) and runs allowed (RA) to predict its winning percentage. Does it work? Why did he use 2 for the exponent instead of some other value?

In 2012, the Oakland A’s scored 713 runs, allowed 614 runs, and won 94 games. According to the Pythagorean formula, a team with this many runs scored and runs allowed would be expected to win about 57.4% of its games. In a 162-game season, this is 0.574(162) = 92.99 expected wins. This means that Oakland won 94 – 92.99 = 1.01 more games than expected, based on their runs scored and allowed.
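The arithmetic on this slide can be reproduced in a few lines of Python. This is a minimal sketch and not part of the original presentation: the function name `pythagorean_pct` is hypothetical, and the winning percentage is rounded to three decimals before multiplying only so that the result matches the slide's 0.574.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=2):
    """Bill James's "Pythagorean" expected winning percentage."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

# 2012 Oakland A's figures quoted on the slide
pct = round(pythagorean_pct(713, 614), 3)  # rounded to match the slide's 0.574
expected_wins = pct * 162                  # about 92.99
extra_wins = 94 - expected_wins            # actual minus expected, about 1.01

print(pct, round(expected_wins, 2), round(extra_wins, 2))
```

Without the intermediate rounding, the expected win total comes out slightly higher (about 93.0), so the exact decimals depend on when you round.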

The difference between an actual value and a predicted value is called a residual.
residual = actual value – predicted value
In the Common Core State Standards, our students are expected to “informally assess the fit of a function by plotting and analyzing residuals” (S-ID-6.b).

Here is a partial table showing how the formula worked for other teams:

So, why did Bill James use 2 for the exponent? Will another value for the exponent work better? Here is a partial table using 1 for the exponent. Does this model work better?

Which model is better? In general, we prefer models that produce smaller residuals. To compare these two models, we can compare the sum of squared residuals (SSR).
For an exponent of 2, SSR = (-5.2)² + (0.6)² + … = 411
For an exponent of 1, SSR = (2.6)² + (6.8)² + … = 1300
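The comparison above can be sketched in Python as follows. The helper names `expected_wins` and `ssr` are hypothetical, and because the team table did not survive in this transcript, the list below holds only the Oakland row quoted earlier. A real comparison would list all 30 teams.

```python
def expected_wins(rs, ra, exponent, games=162):
    """Expected wins from the Pythagorean formula with a given exponent."""
    return games * rs**exponent / (rs**exponent + ra**exponent)

def ssr(teams, exponent):
    """Sum of squared residuals, where residual = actual wins - predicted wins."""
    return sum((wins - expected_wins(rs, ra, exponent)) ** 2
               for rs, ra, wins in teams)

# (runs scored, runs allowed, actual wins). Only Oakland's 2012 row is
# given in the text; the other 29 teams would be added here.
teams = [(713, 614, 94)]

ssr_2 = ssr(teams, 2)
ssr_1 = ssr(teams, 1)
print(f"exponent 2: SSR = {ssr_2:.2f}")
print(f"exponent 1: SSR = {ssr_1:.2f}")
```

With a single team these totals will not match the 411 and 1300 on the slide, which sum over every team in the table.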

The best model is the one that produces the smallest sum of squared residuals (SSR). This is called the least-squares criterion. Here is a scatterplot showing different exponents from 1 to 3 along with their corresponding SSR. Which exponent looks best?
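One way students could search for the best exponent is a simple grid search: try each exponent from 1 to 3 in small steps and keep the one with the smallest SSR. The sketch below assumes a step of 0.01 and uses hypothetical function names. The data list again contains only the Oakland row from the text, so it illustrates the mechanics rather than reproducing the scatterplot.

```python
def ssr_for_exponent(teams, exponent, games=162):
    """SSR of the Pythagorean model for one candidate exponent."""
    total = 0.0
    for rs, ra, wins in teams:
        predicted = games * rs**exponent / (rs**exponent + ra**exponent)
        total += (wins - predicted) ** 2
    return total

def best_exponent(teams, low=1.0, high=3.0, step=0.01):
    """Try exponents from low to high and return the one with the smallest SSR."""
    n_steps = int(round((high - low) / step))
    candidates = [low + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda e: ssr_for_exponent(teams, e))

# (runs scored, runs allowed, actual wins): Oakland 2012 only;
# a classroom version would load all 30 teams.
teams = [(713, 614, 94)]

best = best_exponent(teams)
print(f"best exponent on this data: {best:.2f}")
```

With one team, the search simply finds the exponent that reproduces that team's record, so the answer only becomes meaningful once the full league is included.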

Interestingly, there is a different “ideal” exponent for each sport. (Class activity alert!) For example, here is a scatterplot showing different exponents and SSR for NBA teams in 2009:

Part 2: Modeling Runs Scored
Now that we understand how to use runs scored and runs allowed to model predicted winning percentage, how can we model runs scored and runs allowed? Using team data from the 2012 season, we can look for variables that have a strong relationship with runs scored. Here is a scatterplot showing hits vs. runs scored for the 30 teams:

Because the association appears linear, we should use a line to model the relationship between hits and runs scored. But which line is best? Time for Fathom…

The “best” line is the one that makes the sum of squared residuals the least. Not surprisingly, it is called the least-squares regression line. Here is the scatterplot again, along with the least-squares regression line:
predicted RS = -79 + 0.556(hits)
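For classes without Fathom, the least-squares line can also be computed from the standard closed-form formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The sketch below is an illustration, not the presenter's method. The 30-team data set is not included in this transcript, so the check data are synthetic points placed exactly on the slide's line, which lets us confirm that the function recovers -79 and 0.556.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic check data (NOT real team data): points placed exactly on the
# slide's line, predicted RS = -79 + 0.556(hits).
hits = [1300, 1350, 1400, 1450, 1500]
runs = [-79 + 0.556 * h for h in hits]

intercept, slope = least_squares_line(hits, runs)
print(round(intercept, 3), round(slope, 3))
```

Real team data would scatter around the line, and the same function would then return the intercept and slope that minimize the SSR.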

CCSS: S-ID-7: Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data. The slope of the least-squares regression line is 0.556. How do we interpret this value? What about the intercept?
Slope: For each additional hit, the predicted number of runs scored increases by 0.556.
Intercept: If a team had 0 hits for the season, the predicted number of runs scored is -79. Realistic? Why not?

Suppose that Oakland has a chance to improve at one position and can expect to have 40 more hits. How many wins is that worth, assuming the performances of other players stay the same? For each additional hit, we predict 0.556 more runs. So, 40 additional hits are worth 40(0.556) = 22.24 more runs. This means Oakland would score 735.24 runs instead of 713. Using the Pythagorean formula with 735.24 runs scored and 614 runs allowed, the predicted winning percentage is 58.9%, and 58.9% of 162 is 95.42 wins. This means 2.43 additional expected wins (95.42 – 92.99 = 2.43).
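The chain of calculations in this example (hits to runs, runs to winning percentage, winning percentage to wins) can be checked with a short script. This is a sketch under the slide's own assumptions: the regression slope of 0.556, an exponent of 2, and winning percentages rounded to three decimals as on the slides. The variable and function names are hypothetical.

```python
def pythagorean_pct(rs, ra, exponent=2):
    """Pythagorean expected winning percentage."""
    return rs**exponent / (rs**exponent + ra**exponent)

SLOPE = 0.556                  # predicted runs per additional hit
GAMES = 162

extra_hits = 40
extra_runs = SLOPE * extra_hits            # 22.24 more runs
new_rs = 713 + extra_runs                  # 735.24 runs scored

# Round percentages to three decimals to follow the slides' arithmetic.
baseline_wins = round(pythagorean_pct(713, 614), 3) * GAMES   # about 92.99
new_pct = round(pythagorean_pct(new_rs, 614), 3)              # 0.589
new_wins = new_pct * GAMES                                    # about 95.42

added_wins = new_wins - baseline_wins                         # about 2.43
print(round(extra_runs, 2), new_pct, round(added_wins, 2))
```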

Which variable does the best job of modeling runs scored? Here are some scatterplots:

The best model is the one with the smallest sum of squared residuals (SSR). Here is a table showing the SSR when predicting runs scored using the following variables:

Part 3: Modeling Runs Allowed
Modeling runs allowed is much more difficult. However, sabermetricians have been making good progress in the last decade after a revolutionary discovery by Voros McCracken. He demonstrated that a pitcher has very little (if any) control over what happens to a ball once it is hit. BABIP (batting average on balls in play) is a measure of what happens during at-bats that don’t end in strikeouts, walks, or home runs. McCracken showed that a pitcher's BABIP is essentially random from year to year.

Here is a scatterplot showing the BABIP for pitchers in two consecutive years (2008 and 2009):

Because the outcome of batted balls is basically random, McCracken suggested that the best way to model runs allowed is to use variables that pitchers do have control over, such as strikeout rate, walk rate, and home run rate. Here is a scatterplot of strikeout rate in 2008 and 2009 for these same pitchers:

Part 4: Regression to the Mean
“It’s difficult to make predictions, especially about the future.” – Yogi Berra
So far, we have been investigating relationships between variables within the same season. What teams really want to know is how to make predictions about what will happen next year. Before we do that, let’s flip some coins…

Here is a scatterplot showing the outcomes of two sets of 10 coin flips, along with the line y = x. If we know a flipper did well the first time, what should we predict will happen the second time? What if a flipper did poorly the first time?
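If flipping real coins is impractical, the activity can be simulated. The sketch below is an assumption-laden stand-in for the classroom version: it uses 10,000 simulated flippers, a fair coin, a fixed random seed, and a hypothetical cutoff of 7 or more heads for "did well".

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

def heads_in_10():
    """Number of heads in 10 flips of a fair coin."""
    return sum(random.random() < 0.5 for _ in range(10))

# Each simulated flipper flips 10 coins, then 10 more.
flippers = [(heads_in_10(), heads_in_10()) for _ in range(10_000)]

# Flippers who "did well" the first time (7 or more heads).
lucky = [(first, second) for first, second in flippers if first >= 7]
avg_first = sum(first for first, _ in lucky) / len(lucky)
avg_second = sum(second for _, second in lucky) / len(lucky)

print(f"first set: {avg_first:.2f} heads, second set: {avg_second:.2f} heads")
```

Because coin flipping involves no skill, the "good" flippers should average close to 5 heads on their second set, which is the prediction the slide's questions point toward.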

Here again is the scatterplot of BABIP for two consecutive years, including the line y = x. If a pitcher had a bad (high) BABIP in 2008, what can we expect to happen the following year? Which players should a poor team like Oakland try to sign?

Now, let’s look at hitters in two consecutive years. Here is a scatterplot showing batting average in 2008 and 2009, along with the line y = x. Do we see the same thing?

Now, here is the same scatterplot with the least-squares regression line added as well. The line predicts that players who were above average in 2008 will be good, but not quite as good, in 2009. Likewise, it predicts that players who were below average in 2008 will be bad, but not quite as bad, in 2009. This is regression to the mean.

What causes regression to the mean? In sports, performance = ability + random chance. A good performance is usually a combination of good ability and good luck. In future performances, the good luck is unlikely to continue, even if the player's ability stays the same. This explains the SI Jinx and the Madden Curse.
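The model performance = ability + random chance can be simulated to show regression to the mean directly. Everything numeric below is an assumption chosen for illustration: 5,000 hypothetical players, ability and luck both drawn from a standard normal distribution, and "top performers" defined as the best 10% in year 1.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable

N = 5000
ability = [random.gauss(0.0, 1.0) for _ in range(N)]   # same in both years
year1 = [a + random.gauss(0.0, 1.0) for a in ability]  # ability + luck
year2 = [a + random.gauss(0.0, 1.0) for a in ability]  # same ability, new luck

# Follow the top 10% of year-1 performers into year 2.
cutoff = sorted(year1)[int(0.9 * N)]
top = [i for i in range(N) if year1[i] >= cutoff]

top_year1 = sum(year1[i] for i in top) / len(top)
top_year2 = sum(year2[i] for i in top) / len(top)
overall = sum(year2) / N

print(f"top group, year 1: {top_year1:.2f}")
print(f"top group, year 2: {top_year2:.2f}")
print(f"everyone, year 2:  {overall:.2f}")
```

The top group remains above average in year 2 because its ability is real, but its average falls back toward the overall mean because the year-1 luck does not carry over.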

This also applies to student performance on tests, especially multiple-choice tests: a good performance one year is likely due to good ability and good luck. What is likely to happen next year? What about an intervention class for students with low scores the previous year? Understanding regression to the mean is vital for making predictions about the future.
Evaluations: Session #466
