Senior Project – Computer Science – 2008
Machine Learning in Football
Andrew Finley
Advisor – Prof. Striegnitz

Classification using Decision Trees:
The idea behind this project is to use classification algorithms to train a program to predict NFL stats from collegiate stats. Classification is the process of training a program on a set of known instances so that it can predict unknown ones. I am using a decision tree algorithm to train the program. A decision tree algorithm:
• Creates a graph (tree) from the training data.
• Uses leaves to represent the classes and branches to represent attribute values.
• Aims to make the smallest tree possible that covers all instances.
• Uses the tree to make a set of classification rules.

Research Question:
Every year there are players who move from collegiate football to professional football with high expectations and never meet them. Likewise, there are players with low expectations who exceed them. This leads me to ask: is it possible to accurately predict the success of NFL players based on their collegiate performance? A player is generally considered successful if he is starting a majority of his games by his third season. The goal of this project is to build a program that will predict a player's professional statistics, given his collegiate statistics. For the sake of time, I am only looking at quarterbacks and running backs.

Data:
• Step 1: Gather data by parsing it off websites (NFL.com, NCAA.org) with Python scripts, and through Collegio Football (database program).
• Step 2: Use more Python scripts to combine the data into two large .csv files, one for quarterbacks and one for running backs.
• Step 3: Fix any leftover formatting errors, and fill in as many missing statistics as possible.
• Step 4: Input the data into Weka (ML software) and predict the desired statistics.
• Step 5: Evaluate accuracy using cross-validation.

- Sample input for running back data: blue columns are inputs, red columns are possible outputs.

Next Step:
• Continue trying different feature selections to push accuracy above the baseline.

Preliminary Results:
• Building trees from large sets of training data is difficult; better trees result when attributes are selected by hand.
• The baseline accuracy is 68%. This is the accuracy obtained when every prediction for "starting third season" is set to false and no tree is constructed.
• Accuracy of the program varies significantly with different feature sets, so feature selection is very important.
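
Step 2 of the data pipeline combines the scraped files into one .csv per position. The project's actual scripts are not shown, so the following is only a minimal sketch of one way such a merge could look using Python's standard `csv` module. The column names (`player`, `college_rush_yds`, `starting_third_season`), the helper names, and the sample rows are all hypothetical.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text into a list of dicts, one per player."""
    return list(csv.DictReader(io.StringIO(text)))

def merge_on_player(college_rows, nfl_rows, key="player"):
    """Join college and NFL rows that share the same player name.

    Players missing from the NFL data are kept with empty NFL fields,
    so the gaps can be filled in later (Step 3).
    """
    nfl_by_name = {row[key]: row for row in nfl_rows}
    nfl_fields = [f for f in (nfl_rows[0].keys() if nfl_rows else []) if f != key]
    merged = []
    for row in college_rows:
        pro = nfl_by_name.get(row[key], {})
        combined = dict(row)
        for field in nfl_fields:
            combined[field] = pro.get(field, "")
        merged.append(combined)
    return merged

def to_csv(rows):
    """Serialize merged rows back to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Made-up sample data, for illustration only.
college_csv = "player,college_rush_yds\nPlayer A,1500\nPlayer B,900\n"
nfl_csv = "player,starting_third_season\nPlayer A,true\n"

merged = merge_on_player(read_rows(college_csv), read_rows(nfl_csv))
print(to_csv(merged))
```

Joining on the player name alone is a simplification; real data would likely need a more robust key, since two players can share a name.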
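
The decision tree description above (build a tree from training data, with classes at the leaves and attribute values on the branches, then read classification rules off the tree) can be illustrated with a small ID3-style sketch. This is not necessarily the algorithm Weka applies in this project. The attribute names, the sample players, and the helper functions below are all assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split leaves the lowest weighted entropy."""
    def split_entropy(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)

def build_tree(rows, labels, attributes):
    """Return a class label (leaf) or an (attribute, {value: subtree}) node."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in sorted({row[attr] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        branches[value] = build_tree([r for r, _ in subset],
                                     [l for _, l in subset], remaining)
    return (attr, branches)

def classify(tree, row, default=False):
    """Follow branches until a leaf; unseen attribute values get the default."""
    while isinstance(tree, tuple):
        attr, branches = tree
        if row[attr] not in branches:
            return default
        tree = branches[row[attr]]
    return tree

def rules(tree, conditions=()):
    """Yield one (conditions, predicted class) rule per leaf of the tree."""
    if not isinstance(tree, tuple):
        yield (conditions, tree)
        return
    attr, branches = tree
    for value, subtree in branches.items():
        yield from rules(subtree, conditions + ((attr, value),))

# Made-up running backs; the label is "starting third season".
rows = [
    {"rush_yards": "high", "conference": "major"},
    {"rush_yards": "high", "conference": "minor"},
    {"rush_yards": "low", "conference": "major"},
    {"rush_yards": "low", "conference": "minor"},
    {"rush_yards": "high", "conference": "major"},
]
labels = [True, True, False, False, True]

tree = build_tree(rows, labels, ["rush_yards", "conference"])
for conditions, outcome in rules(tree):
    print(conditions, "->", outcome)
```

In this toy data a single split on `rush_yards` separates the classes perfectly, so the tree stops after one level, which matches the stated goal of the smallest tree that covers all instances.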
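
The 68% baseline and the cross-validation step can be made concrete with a short sketch. It computes the accuracy of always predicting the most common class, then runs a simple k-fold cross-validation over a stand-in "model" that does the same thing. The labels are synthetic, chosen only so that 68 of 100 players are non-starters to mirror the reported baseline; the function names are hypothetical, and Weka's own cross-validation may partition the data differently.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds of near-equal size (assumes n >= k)."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(rows, labels, train, predict, k=10):
    """Average accuracy over k folds, training on the other k-1 folds each time."""
    scores = []
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        model = train(train_rows, train_labels)
        correct = sum(predict(model, rows[i]) == labels[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)

# Synthetic labels: 68 players not starting by their third season, 32 starting.
labels = [False] * 68 + [True] * 32
rows = [{} for _ in labels]  # attributes are unused by the baseline "model"

def train_majority(train_rows, train_labels):
    """'Train' by remembering the most common label in the training folds."""
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, row):
    """Ignore the row and always predict the remembered label."""
    return model

baseline = majority_baseline_accuracy(labels)
cv_accuracy = cross_validate(rows, labels, train_majority, predict_majority, k=10)
print(f"baseline accuracy: {baseline:.2f}")
print(f"cross-validated accuracy of the majority model: {cv_accuracy:.2f}")
```

A real tree learner can be plugged in through the `train` and `predict` arguments; a feature set is only worth keeping if its cross-validated accuracy exceeds the baseline.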