Data Mining – Algorithms: Linear Models
Chapter 4, Section 4.6
Numeric Attributes
• Numeric prediction and/or numeric attributes as predictors
• Linear regression is a well-established statistical technique
• Designed to predict a numeric value based on numeric attributes
• Determines the optimal set of coefficients (weights) for the linear equation:
  $\mathrm{pred} = w_0 + w_1 a_1 + w_2 a_2 + \dots + w_n a_n$
• Optimal means the sum of squared prediction errors is minimized
• For data mining, the fit is done on training data so that it can then be evaluated on test data
• I hope that a CSC major could read a statistics book and then write the code to do this
• However, there is no need to, since the method is so widely available, unless you are seeking to create an improved version of it
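As a rough illustration of what "determining the optimal coefficients" involves, here is a minimal pure-Python sketch that minimizes the squared error by solving the normal equations. The function names `fit_linear` and `predict` and the tiny training set are assumptions made up for this example (they are not from the basketball spreadsheet), and in practice a library routine would be the more sensible choice.

```python
def fit_linear(rows, targets):
    """Return least-squares weights [w0, w1, ..., wn] via the normal equations."""
    # Prepend a constant 1 to every instance so that w0 acts as the intercept.
    X = [[1.0] + [float(a) for a in row] for row in rows]
    n = len(X[0])
    # Build the augmented system (X^T X | X^T y).
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("attributes are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, row):
    """pred = w0 + w1*a1 + ... + wn*an"""
    return weights[0] + sum(w * a for w, a in zip(weights[1:], row))


# Illustrative training data, generated from pred = 1 + 2*a1 + 3*a2.
train_rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
train_targets = [1, 3, 4, 6, 8]

weights = fit_linear(train_rows, train_targets)
print(weights)                   # approximately [1, 2, 3]
print(predict(weights, [3, 2]))  # approximately 13
```

Because the toy targets follow a linear equation exactly, the recovered weights match it up to floating-point rounding; real data would leave a nonzero residual error.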
Example
• <Show Basketball spreadsheet, Baskball sheet>
• NOTE: input values, weights, prediction vs. actual
• <Show testReg sheet: test on separate instances>
• NOTE: how it did (prediction vs. actual, difference, correlation)
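The two test-set measures noted above (the difference between prediction and actual, and their correlation) can be sketched in a few lines. The numbers below are invented for illustration and are not taken from the testReg sheet; the function names are likewise hypothetical.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average size of the difference between prediction and actual."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def correlation(predicted, actual):
    """Pearson correlation between predicted and actual values."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    spread_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual))
    return cov / (spread_p * spread_a)


# Made-up predictions and actual values for four test instances.
predicted = [10.0, 12.0, 15.0, 20.0]
actual = [11.0, 11.0, 16.0, 19.0]

print(mean_absolute_error(predicted, actual))
print(correlation(predicted, actual))
```

A correlation near 1 with a small average error suggests the weights learned on the training data carry over to the separate test instances.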
Using Regression for Classification
• Perform a regression for each class
• Set the output to be predicted = 1 for training instances that belong to the class
• Set the output to be predicted = 0 for training instances that do NOT belong to the class
• Do this for each class, and you will have a “membership function” equation for each class
• On test, plug the new instance into each equation; the class whose equation produces the highest value is the prediction to make
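The steps above can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`fit_linear`, `train_one_per_class`, `classify`) are hypothetical, the least-squares solver is a plain normal-equations routine, and the three well-separated clusters are invented data rather than the discretized basketball data.

```python
def fit_linear(rows, targets):
    """Least-squares weights [w0, w1, ..., wn] via the normal equations."""
    X = [[1.0] + [float(a) for a in row] for row in rows]
    n = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("attributes are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, row):
    return weights[0] + sum(w * a for w, a in zip(weights[1:], row))


def train_one_per_class(rows, labels):
    """One regression per class: target 1 for members, 0 for everything else."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1.0 if label == cls else 0.0 for label in labels]
        models[cls] = fit_linear(rows, targets)
    return models


def classify(models, row):
    """Plug the instance into every membership function; highest value wins."""
    return max(models, key=lambda cls: predict(models[cls], row))


# Invented training data: three clusters of four instances each.
rows = [[0, 0], [1, 0], [0, 1], [1, 1],    # low
        [4, 0], [5, 0], [4, 1], [5, 1],    # medium
        [2, 4], [3, 4], [2, 5], [3, 5]]    # high
labels = ["low"] * 4 + ["medium"] * 4 + ["high"] * 4

models = train_one_per_class(rows, labels)
print(classify(models, [0.5, 0.5]))
print(classify(models, [4.5, 0.5]))
print(classify(models, [2.5, 4.5]))
```

The membership values are not probabilities; they can fall below 0 or above 1, and only their ranking is used.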
Example
• <Show discretized sheet>
• NOTE: prep of data into low, medium, high
• NOTE: weights for 3 regressions (high, med, low)
• <Show Test sheet>
• NOTE: calcs for Hi, Med, Low
• (Doesn’t do that well; I suspect that the data may not be from the same source (NBA), and that the discretization was a bit of a problem (very few low))
More sophisticated
• Do as many pairwise competitions as necessary
• Training: two classes against each other
• Temporarily toss training instances that are not one of the two
• Set output = 1 for the class to be predicted and -1 for the other
• Test: do all pairwise competitions; the winner of each gets a vote
• E.g., say:
• Medium beats High
• Medium beats Low
• High beats Low
• Medium wins 2-1-0
• A conservative approach would be to predict nothing if no prediction dominates
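A minimal sketch of the pairwise scheme, assuming each pairwise regression is available as a callable that returns a positive value when its first class wins. The helper names and the stand-in lambdas (used here in place of trained regressions) are hypothetical; the "no prediction dominates" rule is interpreted as a tie for the most votes.

```python
from collections import Counter
from itertools import combinations


def pairwise_training_set(rows, labels, positive, negative):
    """Toss instances of other classes; target is +1 or -1 for the two kept."""
    kept = [(row, 1.0 if label == positive else -1.0)
            for row, label in zip(rows, labels)
            if label in (positive, negative)]
    return [row for row, _ in kept], [target for _, target in kept]


def vote(models, row, classes):
    """Run every pairwise competition; return (winner or None, vote counts)."""
    votes = Counter({cls: 0 for cls in classes})
    for positive, negative in combinations(classes, 2):
        output = models[(positive, negative)](row)
        votes[positive if output > 0 else negative] += 1
    ranked = votes.most_common()
    # Conservative choice: abstain when the top two classes are tied.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, dict(votes)
    return ranked[0][0], dict(votes)


classes = ["high", "medium", "low"]

# Stand-ins for trained pairwise regressions, reproducing the slide's example:
# Medium beats High, High beats Low, Medium beats Low.
slide_models = {
    ("high", "medium"): lambda row: -1.0,
    ("high", "low"): lambda row: 1.0,
    ("medium", "low"): lambda row: 1.0,
}
print(vote(slide_models, [0.0], classes))

# A cycle (High beats Medium, Medium beats Low, Low beats High) has no winner.
cycle_models = {
    ("high", "medium"): lambda row: 1.0,
    ("high", "low"): lambda row: -1.0,
    ("medium", "low"): lambda row: 1.0,
}
print(vote(cycle_models, [0.0], classes))
```

With k classes this needs k(k-1)/2 regressions, but each is trained on only the instances of its two classes.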
In Context
• Has been used for decades for various applications (e.g., social science research)
• Bias: only searches for linear equations (no squares, cubes, etc.)
• To work well, the data must fit a linear model; for classification, the classes must be “linearly separable”, i.e., divisible by a line in 2D, a plane in 3D, or a hyperplane in higher dimensions
• To work well, attributes should not be highly correlated with each other
• Depends on numeric attributes
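To make "linearly separable" concrete, the sketch below brute-forces small integer line coefficients for two tiny 2D problems. The function name and the AND/XOR data are illustrative assumptions, and a grid search like this only demonstrates the idea; it is not a general separability test.

```python
from itertools import product


def find_separating_line(points, labels, grid=range(-3, 4)):
    """Search integer (w1, w2, b) so that w1*x + w2*y + b > 0 exactly for label 1."""
    for w1, w2, b in product(grid, repeat=3):
        if all((w1 * x + w2 * y + b > 0) == (label == 1)
               for (x, y), label in zip(points, labels)):
            return w1, w2, b
    return None  # no line on this grid separates the two classes


corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]  # separable: only (1, 1) is in class 1
xor_labels = [0, 1, 1, 0]  # not separable by any single line

print(find_separating_line(corners, and_labels) is not None)
print(find_separating_line(corners, xor_labels))
```

For the XOR labelling no choice of coefficients can work, whatever the grid: the two class-1 corners force w1 + w2 + 2b > 0 while the class-0 corners force b <= 0 and w1 + w2 + b <= 0, which contradict each other. A purely linear model has no way to fit such data without added non-linear attributes.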
Let’s Look at WEKA
• Linear Regression with Basketball data
• No correctness measures
• Correlations
• Error
• Discretize points per minute
• Try logistic regression, a categorical prediction approach