Predicting the winner of the Cy Young Award • Advisor: Dr. 黃三益 • Team members: 尹川, 陳隆賢, 陳偉聖
Introduction • Baseball in Taiwan • CPBL (Chinese Professional Baseball League) • Baseball in the USA • MLB (Major League Baseball) • Cy Young Award, given since 1956 • Voted on by the Baseball Writers' Association of America • Weighted scores • Each league has one winner per year (since 1967; a single MLB-wide winner before that).
Measurements • There are no definite rules for judging the award. • Nevertheless, many measurements can be used to judge whether a pitcher is good or not. • Wins • ERA (earned run average) • WHIP (walks plus hits per inning pitched) • G/F (ground ball to fly ball ratio), etc.
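Two of the measurements above have simple closed forms, so a short sketch may help: ERA is earned runs per nine innings, and WHIP is walks plus hits per inning. The function names and the season line below are hypothetical and invented purely for illustration; they are not taken from the study's data.

```python
def innings_to_float(ip_text: str) -> float:
    """Convert box-score innings notation to real innings.

    In box scores, '7.1' means 7 innings plus 1 out, i.e. 7 + 1/3.
    """
    whole, _, outs = ip_text.partition(".")
    return int(whole) + (int(outs) if outs else 0) / 3


def era(earned_runs: int, innings: float) -> float:
    """Earned run average: earned runs allowed per nine innings."""
    return 9 * earned_runs / innings


def whip(walks: int, hits: int, innings: float) -> float:
    """Walks plus hits per inning pitched."""
    return (walks + hits) / innings


# Hypothetical season line (not real data).
ip = innings_to_float("200.0")
season_era = era(earned_runs=60, innings=ip)
season_whip = whip(walks=50, hits=150, innings=ip)
print(round(season_era, 2), round(season_whip, 2))
```

Lower is better for both numbers, which is why they are natural candidate attributes for a classifier.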
Aim of the study • To analyze the historical statistics of pitchers. • To build a predictive model. • To predict the Cy Young Award winner in future years.
Data mining procedure • Ten data mining methodology steps
Step 1: Translate the Problem • Directed data mining problem • Target variable: Cy Young Award winner • Task: classification • Method: decision tree • Purposes • Gambling • Predictive activities
Step 2: Select Appropriate Data • MLB statistics only (1871 ~ 2006) • Cy Young Award: 1956 ~ 2006 • 21456 records in total • List of Cy Young Award winners • “Time” factor • 1999 is the dividing year, because new statistical items are recorded from then on. • Variables: remove the items that are not representative of a pitcher.
Step 3: Get to know the data • All the data we used come from the official MLB site • These data have been publicly available for many years • The quality of the data is very good • Some attributes have values only since 1999
Step 4: Create a model set • We divide the data into training data and testing data • We do not create a balanced sample • MLB records are not seasonal data • We use the records from 1999 onward
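The model set described above can be sketched as follows. The slides do not say how the training and testing data were divided, so the split by season used here (the `test_from` parameter) is an assumption, as are the function name and the field name `Year` as a dictionary key.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def build_model_set(
    records: List[Record],
    first_year: int = 1999,   # dividing year named in the slides
    test_from: int = 2005,    # assumed cut-off; not stated in the slides
) -> Tuple[List[Record], List[Record]]:
    """Keep records from first_year on, then split them by season."""
    recent = [r for r in records if r["Year"] >= first_year]
    train = [r for r in recent if r["Year"] < test_from]
    test = [r for r in recent if r["Year"] >= test_from]
    return train, test


# Hypothetical records, one per season, for illustration only.
records = [{"Year": y} for y in (1998, 1999, 2004, 2005, 2006)]
train, test = build_model_set(records)
print([r["Year"] for r in train], [r["Year"] for r in test])
```

Splitting by season keeps every pitcher of a given year on the same side of the split, which matches the eventual use of the model on a season that has not been seen yet.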
Step 5: Fix problems with the data • These data are taken from the official MLB site • No missing values • Single source
Step 6: Transform data to bring information to the surface • We do not combine attributes • We delete some attributes • We add an attribute, Year • We add an attribute, CyYoungAward_Winner, for classification
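A minimal sketch of these transformations, assuming each season's statistics arrive as a list of dictionaries and the winners list is a set of (year, player) pairs. The attribute names `Year` and `CyYoungAward_Winner` come from the slide; the `Player` and `Team` fields, the `yes`/`no` labels, and the pitcher names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

Row = Dict[str, Any]


def add_target(
    rows: List[Row],
    year: int,
    winners: Set[Tuple[int, str]],
    drop: Iterable[str] = (),
) -> List[Row]:
    """Drop unwanted attributes, then add Year and the class label."""
    dropped = set(drop)
    labelled = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in dropped}
        new_row["Year"] = year
        is_winner = (year, row["Player"]) in winners
        new_row["CyYoungAward_Winner"] = "yes" if is_winner else "no"
        labelled.append(new_row)
    return labelled


# Hypothetical rows and winners list (placeholder names, not real data).
rows_2006 = [
    {"Player": "Pitcher A", "Team": "X", "Wins": 19},
    {"Player": "Pitcher B", "Team": "Y", "Wins": 12},
]
winners = {(2006, "Pitcher A")}
labelled = add_target(rows_2006, 2006, winners, drop=["Team"])
print(labelled)
```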
Step 7: Build Models • Tools Used • Weka Crash Problem • Blank Attributes • Build Model • Handling Blank Attributes
Weka Crash Problem • Raw data • 21456 data instances • 42 attributes • Weka crashed during model construction • Solution: give Weka more memory (for example, by raising the JVM heap limit with the -Xmx option)
Build Model • MLB 1956~2006, with blank attributes: ADTree • MLB 1956~2006, without blank attributes: ADTree • MLB 1999~2006: ADTree
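For readers unfamiliar with ADTree (alternating decision tree), the sketch below shows how such a tree scores one instance: every prediction node that can be reached contributes its value, and the sign of the sum gives the class. This is a hand-written illustration, not Weka's implementation, and the tree, its thresholds, and its prediction values are invented for the example rather than taken from the models built in the study.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Instance = Dict[str, float]


@dataclass
class Splitter:
    """A test on the instance with one prediction node per outcome."""
    condition: Callable[[Instance], bool]
    if_true: "Prediction"
    if_false: "Prediction"


@dataclass
class Prediction:
    """A real-valued vote; may have any number of splitters below it."""
    value: float
    splitters: List[Splitter] = field(default_factory=list)


def score(node: Prediction, x: Instance) -> float:
    """Sum the values of all prediction nodes reachable for instance x."""
    total = node.value
    for s in node.splitters:
        branch = s.if_true if s.condition(x) else s.if_false
        total += score(branch, x)
    return total


# Hypothetical tree: all thresholds and values are made up.
tree = Prediction(-1.0, [
    Splitter(lambda x: x["Wins"] >= 18,
             Prediction(0.8, [
                 Splitter(lambda x: x["ERA"] < 3.0,
                          Prediction(0.9), Prediction(-0.4)),
             ]),
             Prediction(-0.7)),
    Splitter(lambda x: x["WHIP"] < 1.1,
             Prediction(0.5), Prediction(-0.3)),
])

strong = {"Wins": 20, "ERA": 2.8, "WHIP": 1.05}
weak = {"Wins": 10, "ERA": 4.5, "WHIP": 1.40}
for name, x in (("strong", strong), ("weak", weak)):
    s = score(tree, x)
    print(name, "winner" if s > 0 else "not winner")
```

Unlike an ordinary decision tree, an instance follows several paths at once, so the magnitude of the sum can also serve as a rough confidence when ranking candidates.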
Step 8: Assess Models (1/2) • Not good enough for gambling
Step 8: Assess Models (2/2) • Some attributes are more important
Step 9: Deploy Models • Implement a computer program that uses the built model. • Make it easier to predict the Cy Young Award winner.
Step 10: Assess Results • Compare the predicted winner directly with the actual Cy Young Award winner. • The motivation is not “business” but “interest”. • Assessment relies on human judgment.
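The direct comparison can be sketched as a small program. It assumes the deployed model yields a numeric score per pitcher (higher meaning more likely to win) and that one winner is picked per year and league; the field names `Year`, `League`, `Player`, and `Score`, the function names, and all data below are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Key = Tuple[int, str]  # (year, league)


def predict_winners(scored: List[Dict[str, Any]]) -> Dict[Key, str]:
    """Pick the highest-scored pitcher for each (year, league)."""
    best: Dict[Key, Dict[str, Any]] = {}
    for row in scored:
        key = (row["Year"], row["League"])
        if key not in best or row["Score"] > best[key]["Score"]:
            best[key] = row
    return {key: row["Player"] for key, row in best.items()}


def hit_rate(predicted: Dict[Key, str], actual: Dict[Key, str]) -> float:
    """Fraction of awards where the predicted and actual winner agree."""
    keys = [k for k in actual if k in predicted]
    if not keys:
        return 0.0
    hits = sum(predicted[k] == actual[k] for k in keys)
    return hits / len(keys)


# Placeholder scores and winners, invented for illustration.
scored = [
    {"Year": 2006, "League": "AL", "Player": "Pitcher A", "Score": 1.2},
    {"Year": 2006, "League": "AL", "Player": "Pitcher B", "Score": -0.4},
    {"Year": 2006, "League": "NL", "Player": "Pitcher C", "Score": 0.1},
    {"Year": 2006, "League": "NL", "Player": "Pitcher D", "Score": 0.6},
]
actual = {(2006, "AL"): "Pitcher A", (2006, "NL"): "Pitcher C"}
predicted = predict_winners(scored)
print(predicted, hit_rate(predicted, actual))
```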
Conclusions • We used a classification technique to build the predictive model • We found that the accuracy of the built model is not high • There are some factors that we did not consider • The model cannot be used where substantial benefits are at stake • Just for fun