Mining Baseball Statistics

Data Mining – CSE881
Paul Cornwell, Kajal Miyan, Mojtaba Solgi
Project URL: http://kmp-cse881.appspot.com/

Overview of Baseball
• Baseball is a team sport
• There are two major leagues: AL (American) and NL (National)
• Many statistics characterizing player performance are published yearly
• Each league names one player MVP (Most Valuable Player) each year by vote
• People place bets on who will be MVP

Overview
• Application (motivation)
  • Can we predict who will be named MVP?
  • Learn how to do data mining
  • Learn about baseball
  • Impress sabermetricians
  • Baseball: it’s not diseases, crime, or pollution
• Baseball statistics
  • Main task: predict MVPs for a given year
  • Use SVM to rank players

Overview of Data and Mining
• Data: 5 CSV files (Batting, Fielding, Master, Awards, Salaries)
• Data Mining:
  • Ranking (similar to classification)
  • Anomaly detection (maybe)

Sample data (excerpt):

playerID   yearID  stint  teamID  lgID  Gbat  AB   R   H   2B  3B  HR  RBI  SB  SO
aasedo01   1985    1      BAL     AL    54    0    0   0   0   0   0   0    0   0
abregjo01  1985    1      CHN     NL    6     9    0   0   0   0   0   1    0   2
ackerji01  1985    1      TOR     AL    61    0    0   0   0   0   0   0    0   0
adamsri02  1985    1      SFN     NL    54    121  12  23  3   1   2   10   1   23
agostju01  1985    1      CHA     AL    54    0    0   0   0   0   0   0    0   0
aguaylu01  1985    1      PHI     NL    91    165  27  46  7   3   6   21   1   26
aguilri01  1985    1      NYN     NL    22    36   1   10  2   0   0   2    0   5

Methodology - Preprocessing
• Initial data: ~90,000 rows in the Batting table, 1871-2007
  • One row per player/year/stint/team
• Cut to 1985-2007 (~28,000 rows) because salary data begins in 1985 and because of rule changes
• Perl script to merge tables by playerID/yearID/stint
  • Batting + Fielding + Awards (MVP) + Salaries + Master = 48 columns
  • ~14 hours, but I got to relearn Perl!
• Discovered that WEKA was infeasible; needed to use SVM-Light
• Reformatted from CSV to space-delimited SVM-Light format
  • replace every “value” with “attribute:value”
  • replace commas, spaces
  • deleted 131 rows without a fielding record (top 3 by at-bats: 26, 21, 16)
  • create a (binary) rank value based on MVP status
  • replace all MM/DD/YYYY dates with YYYY
  • insert a “qid” column according to year/league (46 qids)
  • ...
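The reformatting steps above can be sketched in Python. The slides used a Perl script that is not shown, so this is only an illustration under stated assumptions: rows are shaped like `csv.DictReader` output, and the `isMVP` column name, the feature list, and all function names are hypothetical. Each output line follows the SVM-Light ranking layout `<target> qid:<n> <index>:<value> ...`, with one qid per year/league pair (23 seasons x 2 leagues gives the 46 qids mentioned above).

```python
def build_qid_index(rows):
    """Map each (year, league) pair to a 1-based query id."""
    pairs = sorted({(int(r["yearID"]), r["lgID"]) for r in rows})
    return {pair: i for i, pair in enumerate(pairs, start=1)}


def to_svmlight_line(row, feature_names, qid_index):
    """Format one merged player/season row as an SVM-Light ranking line."""
    # "isMVP" is a hypothetical column name for the binary rank value.
    target = 1 if row["isMVP"] == "1" else 0
    qid = qid_index[(int(row["yearID"]), row["lgID"])]
    parts = [str(target), f"qid:{qid}"]
    # Feature indices are 1-based and increasing; zero values are omitted
    # because the format is sparse.
    for i, name in enumerate(feature_names, start=1):
        value = float(row[name])
        if value != 0.0:
            parts.append(f"{i}:{value:g}")
    return " ".join(parts)


def convert(rows, feature_names):
    """Convert all rows, keeping each query's rows together."""
    qid_index = build_qid_index(rows)
    ordered = sorted(rows, key=lambda r: qid_index[(int(r["yearID"]), r["lgID"])])
    return [to_svmlight_line(r, feature_names, qid_index) for r in ordered]
```

Called as `convert(rows, ["HR", "RBI"])`, this yields one string per player/season, ready to be written to a training file. Non-numeric columns (teamID, birthState, and so on) would need to be encoded or dropped first; the sketch does not cover that.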

Methodology – Data Mining
• Classification is not apt to give good results, hence ranking with SVM-Light (Cornell University)
  • Training generates a model which can rank input
  • Training phase: leave one (year) out
  • Testing: rank the players for that year
• Postprocessing
  • SVM-Light returns only ranks of the players as integers
  • match ranks with corresponding players
  • Reformat data for visualization
  • Ranked the data for each attribute
• Anomaly detection (in progress)
  • KNN on 4 attributes (Gbat, R, HR, RBI) for players in >= 10 games
  • Compute z-scores for each attribute/year
  • Rank players by distance from nearest neighbor
  • Compare ranks in various attributes for detecting anomalies
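The anomaly-detection steps above (z-score each attribute, then rank players by distance to their nearest neighbor) might look like the following for a single year. This is a minimal sketch, not the team's Matlab or C# code: the function names are hypothetical, the input is assumed to be one numeric row of (Gbat, R, HR, RBI) per player already filtered to players in >= 10 games, and it uses the single nearest neighbor (k = 1) with Euclidean distance.

```python
import math
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each attribute (column) to mean 0 and unit std deviation."""
    columns = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [
        # A constant column carries no information, so map it to 0.
        [(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, stats)]
        for row in rows
    ]


def nearest_neighbor_distances(points):
    """Euclidean distance from each point to its closest other point."""
    # Requires at least two points; O(n^2), fine for ~1,300 players per year.
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def rank_anomalies(names, rows):
    """Return (name, distance) pairs, most isolated player first."""
    distances = nearest_neighbor_distances(zscore_columns(rows))
    return sorted(zip(names, distances), key=lambda t: t[1], reverse=True)
```

Standardizing first keeps an attribute with a large range, such as games played, from dominating the distance. Players at the top of the returned list are the candidates to compare against their ranks in individual attributes.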

Methodology - Visualization
• Bar charts of top 20 ranked players for various attributes
  • Python
  • Google App Engine
  • Google Charts tool
• U.S. map of player birthState density

Team Roles
• Roles of team members
  • Planning: everyone
  • Preprocessing: Paul Cornwell
  • Data Mining: Kajal Miyan
  • Visualization: Mojtaba Solgi

Related Work
• No apparent academic work on predicting MLB MVPs
• PECOTA
  • Baseball Prospectus
  • www.baseballprospectus.com/pecota/
  • Baseball “forecasting”
  • Makes statistical predictions about players
  • No MVP prediction evident
  • subscription service
• Books are available with baseball forecasts
  • apparently for one year only

Experimental Setup
• Raw data downloaded from http://baseball1.com/content/view/58/82/
• Preprocessing done using Perl, Nano, Excel, OpenOffice.org (OOo), TextPad
• Preprocessing yields a table with ~28K rows and 45 columns
• Experiments were conducted on a 2 GHz P4 machine running Kubuntu 8.04 with 1 GB RAM
• Data mining and postprocessing with SVM-Light, Visual C#, Matlab
• Visualization done using Python, Google App Engine

Experimental Evaluation
• Preliminary results
  • SVM-Light trained on 1985-2006 data
  • tested on 2007
    • ranked actual MVPs #1 and #11 (out of 1242 players) (2nd NL, #2)
    • (there is one MVP for each league each year: AL, NL)
  • 2006: ranks 7, 16 (1371 players)
  • 2005: ranks 1, 4 (1322 players)
  • 2004: ranks 1, 3 (1342 players)
  • 2003: ranks 3, 32 (1341 players)
  • 2002: ranks 1, 11 (1316 players)
• Final evaluation (pending)
  • Leave-one-out
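The evaluation loop behind numbers like these can be sketched as follows: hold out one year, train on the rest, score the held-out players, and report where the actual MVPs land. This is a hedged illustration only. The function names are hypothetical, training and scoring with SVM-Light are left out, and it assumes one score per player with a higher score meaning a better rank.

```python
def leave_one_year_out(years):
    """Yield (training_years, held_out_year) for every distinct year."""
    distinct = sorted(set(years))
    for held_out in distinct:
        yield [y for y in distinct if y != held_out], held_out


def mvp_ranks(scores, is_mvp):
    """Return the 1-based ranks of the actual MVPs among all scored players."""
    # Sort player indices by descending score; ties keep input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [rank for rank, i in enumerate(order, start=1) if is_mvp[i]]
```

With the 1985-2007 data this produces 23 folds, the last of which matches the preliminary run above (train on 1985-2006, test on 2007). A result such as "ranks 1, 11" would correspond to `mvp_ranks` returning `[1, 11]` for that year's players.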

Visualization Demo
• http://kmp-cse881.appspot.com/

Conclusions
• MVP ranking was surprisingly successful
  • Early results suggest that it is feasible to predict MVPs with some accuracy
• Lessons learned
  • Data mining is hard work
  • Baseball statistics are actually sort of interesting
• Future work
  • Leave-one-out validation
  • Incorporate team statistics in player evaluations (expert advice)
