Mining Baseball Statistics
Data Mining – CSE881
Paul Cornwell, Kajal Miyan, Mojtaba Solgi
Project URL: http://kmp-cse881.appspot.com/
Overview of Baseball
• Baseball is a team sport
• There are two major leagues: AL (American) and NL (National)
• Many statistics characterizing player performance are published yearly
• Each league names one player MVP (Most Valuable Player) each year by vote
• People place bets on who will be MVP
Overview
• Application (motivation):
  • Can we predict who will be named MVP?
  • Learn how to do data mining
  • Learn about baseball
  • Impress sabermetricians
  • Baseball: it’s not diseases, crime, or pollution
• Baseball statistics
• Main task: predict MVPs for a given year
  • Use SVM to rank players
Overview of Data and Mining

playerID   yearID  stint  teamID  lgID  Gbat   AB   R   H  2B  3B  HR  RBI  SB  SO
aasedo01   1985    1      BAL     AL      54    0   0   0   0   0   0    0   0   0
abregjo01  1985    1      CHN     NL       6    9   0   0   0   0   0    1   0   2
ackerji01  1985    1      TOR     AL      61    0   0   0   0   0   0    0   0   0
adamsri02  1985    1      SFN     NL      54  121  12  23   3   1   2   10   1  23
agostju01  1985    1      CHA     AL      54    0   0   0   0   0   0    0   0   0
aguaylu01  1985    1      PHI     NL      91  165  27  46   7   3   6   21   1  26
aguilri01  1985    1      NYN     NL      22   36   1  10   2   0   0    2   0   5

• Data: 5 CSV files (Batting, Fielding, Master, Awards, Salaries)
• Data Mining:
  • Ranking (similar to classification)
  • Anomaly detection (maybe)
Methodology - Preprocessing
• Initial data: ~90,000 rows in the Batting table, 1871-2007
  • One row: one player/year/stint/team
• Cut to 1985-2007, ~28,000 rows, because salary data begin in 1985 and because of rule changes
• Perl script to merge tables by playerID/yearID/stint
  • Batting + Fielding + Awards (MVP) + Salaries + Master = 48 columns
  • ~14 hours, but I got to relearn Perl!
• Discovered: infeasible to use WEKA, need to use SVM-Light
• Reformatted from CSV to space-delimited SVM-Light format
  • replaced every “value” with “attribute:value”
  • replaced commas and spaces
  • deleted 131 rows without a fielding record (the three largest had 26, 21, and 16 at-bats)
  • created a (binary) rank value based on MVP status
  • replaced all MM/DD/YYYY dates with YYYY
  • inserted a “qid” column according to year/league (46 qids)
  • ...
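The reformatting step above can be sketched in Python as follows. This is a minimal illustration, not the Perl script the team used: the feature subset in `FEATURES`, the helper names `qid_for` and `to_svmlight`, and the particular year/league-to-qid numbering are all assumptions. It only shows the general shape of an SVM-Light ranking line: a target value, a `qid`, and numbered `attribute:value` pairs.

```python
# Hypothetical subset of numeric columns; the slides do not say which
# of the merged columns were actually used as features.
FEATURES = ["Gbat", "AB", "R", "H", "HR", "RBI"]


def qid_for(year, league, first_year=1985):
    """Map (year, league) to a positive integer query id, two per year.

    Assumed numbering: 1985 AL -> 1, 1985 NL -> 2, ..., 2007 NL -> 46.
    """
    return 2 * (int(year) - first_year) + (1 if league == "AL" else 2)


def to_svmlight(row, is_mvp):
    """Format one merged row as '<target> qid:<q> 1:<v1> 2:<v2> ... # playerID'."""
    target = 1 if is_mvp else 0  # binary rank value from MVP status
    feats = " ".join(
        f"{i}:{float(row[name]):g}" for i, name in enumerate(FEATURES, start=1)
    )
    qid = qid_for(row["yearID"], row["lgID"])
    return f"{target} qid:{qid} {feats} # {row['playerID']}"


# One row taken from the Batting sample shown earlier in the deck.
row = {"playerID": "adamsri02", "yearID": "1985", "lgID": "NL",
       "Gbat": "54", "AB": "121", "R": "12", "H": "23", "HR": "2", "RBI": "10"}
print(to_svmlight(row, is_mvp=False))
# prints: 0 qid:2 1:54 2:121 3:12 4:23 5:2 6:10 # adamsri02
```

With 23 seasons and two leagues, this numbering yields the 46 qids mentioned on the slide.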
Methodology – Data Mining
• Classification is not apt to give good results, hence ranking with SVM-Light (Cornell University)
  • Training generates a model which can rank input
  • Training phase: leave one (year) out
  • Testing: rank the players for that year
• Postprocessing
  • SVM-Light returns only the ranks of the players as integers
  • Match ranks with the corresponding players
  • Reformat data for visualization
  • Ranked the data for each attribute
• Anomaly detection (in progress)
  • KNN on 4 attributes (Gbat, R, HR, RBI) for players in >= 10 games
  • Compute z-scores for each attribute/year
  • Rank players by distance from nearest neighbor
  • Compare ranks in various attributes to detect anomalies
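The anomaly-detection idea above (z-scores, then distance to the nearest neighbor) might look roughly like this in Python. It is a sketch under assumptions, not the team's in-progress implementation: the function names are hypothetical, Euclidean distance is assumed, and the z-scores are computed over whatever group of players is passed in, so the caller would pass one year at a time.

```python
import math
from statistics import mean, pstdev


def zscores(rows):
    """Standardise each column of a list of equal-length numeric rows."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero on constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def nn_distance_ranking(names, rows):
    """Return (name, distance to nearest neighbour) pairs, most isolated first."""
    z = zscores(rows)
    out = []
    for i, zi in enumerate(z):
        nearest = min(math.dist(zi, zj) for j, zj in enumerate(z) if j != i)
        out.append((names[i], nearest))
    return sorted(out, key=lambda pair: pair[1], reverse=True)


# (Gbat, R, HR, RBI) for the 1985 sample rows with at least 10 games.
names = ["aasedo01", "ackerji01", "adamsri02", "agostju01", "aguaylu01", "aguilri01"]
rows = [(54, 0, 0, 0), (61, 0, 0, 0), (54, 12, 2, 10),
        (54, 0, 0, 0), (91, 27, 6, 21), (22, 1, 0, 2)]
for name, dist in nn_distance_ranking(names, rows):
    print(f"{name}  {dist:.2f}")
```

Players with identical stat lines get a distance of zero and fall to the bottom of the list. A player far from everyone else in standardised space rises to the top as a candidate anomaly.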
Methodology - Visualization
• Bar charts of the top 20 ranked players for various attributes
  • Python
  • Google App Engine
  • Google Charts tool
• U.S. map of player birthState density
Team Roles
• Roles of team members
  • Planning – Everyone
  • Preprocessing – Paul Cornwell
  • Data Mining – Kajal Miyan
  • Visualization – Mojtaba Solgi
Related Work
• No apparent academic work on predicting MLB MVPs
• PECOTA
  • Baseball Prospectus
  • www.baseballprospectus.com/pecota/
  • Baseball “forecasting”
  • Makes statistical predictions about players
  • No MVP prediction evident
  • Subscription service
• Books are available with baseball forecasts
  • apparently for one year only
Experimental Setup
• Raw data downloaded from http://baseball1.com/content/view/58/82/
• Preprocessing done using Perl, Nano, Excel, OOo (OpenOffice.org), TextPad
• Preprocessing yields a table with ~28K rows and 45 columns
• Experiments were conducted on a 2 GHz P4 machine with 1 GB RAM running Kubuntu 8.04
• Data mining and postprocessing with SVM-Light, Visual C#, Matlab
• Visualization done using Python and Google App Engine
Experimental Evaluation
• Preliminary results
  • SVM-Light trained on 1985-2006 data, tested on 2007
  • Ranked the actual MVPs #1 and #11 (out of 1242 players) (2nd NL, #2)
  • (there is one MVP for each league each year: AL, NL)
  • 2006: ranks 7, 16 (1371 players)
  • 2005: ranks 1, 4 (1322 players)
  • 2004: ranks 1, 3 (1342 players)
  • 2003: ranks 3, 32 (1341 players)
  • 2002: ranks 1, 11 (1316 players)
• Final evaluation (pending)
  • Leave-one-out
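One way the evaluation above could be organised is sketched below: hold out one year at a time, then report where the actual MVPs land in the ranked list for that year. The function names are hypothetical, the scores are made-up illustration values rather than project results, and the sketch assumes that a higher model score means a better rank.

```python
def leave_one_year_out(years):
    """Yield (training_years, held_out_year) for every distinct year."""
    distinct = sorted(set(years))
    for held_out in distinct:
        yield [y for y in distinct if y != held_out], held_out


def mvp_positions(scores, is_mvp):
    """1-based positions of the actual MVPs when players are sorted best first.

    Assumes a higher score means a better (earlier) rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [pos for pos, i in enumerate(order, start=1) if is_mvp[i]]


# Illustration only: four players in a held-out year, one of them the MVP.
scores = [0.2, 0.9, 0.5, 0.7]
is_mvp = [False, True, False, False]
print(mvp_positions(scores, is_mvp))  # prints: [1]

for train_years, test_year in leave_one_year_out([1985, 1986, 1987]):
    print(test_year, train_years)
```

Reporting positions rather than a hit/miss accuracy matches how the preliminary results are stated on the slide (for example "ranks 1, 4" out of 1322 players).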
Visualization Demo
• http://kmp-cse881.appspot.com/
Conclusions
• MVP ranking was surprisingly successful
  • Early results suggest that it is feasible to predict MVPs with some accuracy
• Lessons learned
  • Data mining is hard work
  • Baseball statistics are actually sort of interesting
• Future work
  • Leave-one-out validation
  • Incorporate team statistics in player evaluations (expert advice)