Estimation As A Data Mining Task

Estimation As A Data Mining Task • Theoretical Understanding • Vinh Ngo & Mike Ellis

Introduction • Estimation • Predicting values not in predetermined categories • Three main techniques • Regression • Decision trees • Neural networks

Regression • Linear regression • Method of Least Squares • Easy technique to use • Excel “Data Analysis” • Excel chart example
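As a rough illustration of the method of least squares for a single predictor, the sketch below computes the intercept and slope in closed form. The function name and the data are made up for the example; they are not taken from the slides or from the Excel example.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = least_squares_fit(xs, ys)
print(intercept, slope)
```

On noisy data the same formulas give the line with the smallest total squared vertical distance to the points, which is what a spreadsheet trendline reports.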

Excel Linear Regression Example

Other Regression Forms • Multiple regression • Polynomial equation • Define new variables • Equation becomes
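The "define new variables" step can be sketched as follows. This assumes a quadratic y = b0 + b1·x + b2·x² (the degree is an assumption, not something stated on the slide): setting x1 = x and x2 = x² turns it into an ordinary multiple regression y = b0 + b1·x1 + b2·x2. The helper names and the data are hypothetical, and the normal equations are solved with a small hand-written Gaussian elimination so the example has no dependencies.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def multiple_regression(rows, ys):
    """Least-squares [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated from y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]
rows = [[x, x * x] for x in xs]  # new variables: x1 = x, x2 = x^2
coeffs = multiple_regression(rows, ys)
print(coeffs)
```

The model stays linear in its coefficients, which is why the same least-squares machinery applies even though the fitted curve is not a straight line.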

Decision Trees • Regression tree • Leaves → average values • Model tree • Leaves → linear regression models • Discretizing the data • Convert continuous data into discrete partitions • Threshold values
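To make the difference between the two leaf types concrete, here is a minimal one-split sketch. The threshold, the data, and the function names are all assumptions chosen for illustration: a regression tree leaf returns the average of its training targets, while a model tree leaf fits a small linear regression to them.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for one predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def split(xs, ys, threshold):
    """Partition the training pairs on x <= threshold."""
    left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    return left, right


def regression_tree_predict(xs, ys, threshold, x_new):
    """Leaf prediction = average target value in the leaf."""
    left, right = split(xs, ys, threshold)
    leaf = left if x_new <= threshold else right
    return mean([y for _, y in leaf])


def model_tree_predict(xs, ys, threshold, x_new):
    """Leaf prediction = linear model fitted to the leaf's points."""
    left, right = split(xs, ys, threshold)
    leaf = left if x_new <= threshold else right
    intercept, slope = fit_line([x for x, _ in leaf], [y for _, y in leaf])
    return intercept + slope * x_new


# Made-up data with two regimes separated around x = 4.5.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 20.0, 21.0, 22.0]
threshold = 4.5
print(regression_tree_predict(xs, ys, threshold, 2.5))
print(model_tree_predict(xs, ys, threshold, 2.5))
```

Within a leaf the regression tree gives every input the same answer, whereas the model tree still responds to the input value.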

Threshold Values • Entropy • Measure of purity • Information gain • Expected reduction in entropy due to partitioning • Maximize for best threshold
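The threshold search can be sketched as below, assuming the target has already been turned into class labels and that candidate thresholds are the midpoints between adjacent distinct values (a common choice, not one stated on the slide). Entropy here is the usual Shannon entropy in bits; the data and function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits; 0 for a pure set."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Expected reduction in entropy from splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Midpoint between adjacent distinct values with the highest gain."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: information_gain(values, labels, t))


# Made-up continuous attribute with two classes.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["low", "low", "low", "high", "high", "high"]
threshold = best_threshold(values, labels)
gain = information_gain(values, labels, threshold)
print(threshold, gain)
```

Here the split at 5.0 separates the classes perfectly, so the gain equals the full entropy of the unsplit set (1 bit).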

CART Algorithm • Classification and Regression Tree • Grow a tree that overfits data • Prune the tree • Select best subtree
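The grow-then-prune idea can be sketched on a one-predictor regression problem. Two simplifications should be stated plainly: splits are chosen by minimizing the sum of squared errors, and pruning uses simple reduced-error pruning against a hold-out set rather than CART's cost-complexity pruning. All names and data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def grow(points):
    """Grow a tree on (x, y) pairs until no split lowers the SSE (overfits)."""
    ys = [y for _, y in points]
    node = {"value": mean(ys)}
    xs = sorted({x for x, _ in points})
    best = None
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        cost = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or cost < best[0]:
            best = (cost, t, left, right)
    if best is not None and best[0] < sse(ys):
        node["threshold"] = best[1]
        node["left"] = grow(best[2])
        node["right"] = grow(best[3])
    return node


def predict(node, x):
    while "threshold" in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["value"]


def prune(node, holdout):
    """Bottom-up: replace a subtree by a leaf if hold-out error does not rise."""
    if "threshold" not in node:
        return node
    left = [p for p in holdout if p[0] <= node["threshold"]]
    right = [p for p in holdout if p[0] > node["threshold"]]
    node["left"] = prune(node["left"], left)
    node["right"] = prune(node["right"], right)
    subtree_err = sum((y - predict(node, x)) ** 2 for x, y in holdout)
    leaf_err = sum((y - node["value"]) ** 2 for _, y in holdout)
    return {"value": node["value"]} if leaf_err <= subtree_err else node


def count_leaves(node):
    if "threshold" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


# Made-up data: two noisy plateaus for training, clean plateaus held out.
train = [(1, 1.0), (2, 1.2), (3, 0.8), (4, 1.1),
         (5, 5.0), (6, 5.2), (7, 4.9), (8, 5.1)]
holdout = [(x, 1.0) for x in (1, 2, 3, 4)] + [(x, 5.0) for x in (5, 6, 7, 8)]
full = grow(train)
leaves_before = count_leaves(full)
pruned = prune(full, holdout)  # note: prune modifies the tree in place
leaves_after = count_leaves(pruned)
print(leaves_before, leaves_after)
```

The fully grown tree memorizes the training noise with one leaf per point; pruning against independent data keeps only the split that generalizes.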

Decision Trees • Strengths • Understandable • Which fields are most important • Weaknesses • Intended for discrete data • Time to grow and prune tree

Comparison Example

Linear Regression Result • PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
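For reference, the fitted equation can be evaluated directly. The coefficients below are copied from the slide; the function name and the sample input values are hypothetical and serve only to show the arithmetic.

```python
def predict_prp(myct, mmin, mmax, cach, chmin, chmax):
    """Evaluate the linear regression equation from the slide."""
    return (-56.1
            + 0.049 * myct
            + 0.015 * mmin
            + 0.006 * mmax
            + 0.630 * cach
            - 0.270 * chmin
            + 1.46 * chmax)


# Hypothetical attribute values, chosen only to illustrate the calculation.
estimate = predict_prp(myct=125, mmin=256, mmax=6000, cach=256, chmin=16, chmax=128)
print(round(estimate, 3))
```

With every attribute at zero the estimate is the intercept, -56.1, which shows that a linear model can produce values outside the sensible range for the target.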

Regression Tree

Model Tree

Side-By-Side • Linear regression: PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX • Regression tree • Model tree

Simple Neural Network

Building the Neural Net • Recursive process • Assign initial weights • Run training values through network • Compare results to actual value • Backpropagation • Pass errors back through net • Incorrect node gets less influence • Military metaphor • Recurrent networks • Genetic algorithms • Simulated annealing
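The training loop in the bullets above (assign initial weights, run training values through, compare with the actual value, pass errors back) can be sketched for a tiny network. The architecture is an assumption made for the example: one input, three tanh hidden nodes, one linear output, squared error, and per-sample gradient descent. The function name, learning rate, and data are hypothetical.

```python
import math
import random


def train_network(samples, hidden=3, epochs=2000, rate=0.1, seed=0):
    """Train a one-hidden-layer network; return (initial_mse, final_mse, predict)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Assign small random initial weights; the last entry of each list is a bias.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]

    def forward(x):
        h = [math.tanh(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
             for w in w_hidden]
        y = w_out[-1] + sum(wo * hi for wo, hi in zip(w_out, h))
        return h, y

    def mse():
        return sum((forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)

    initial = mse()
    for _ in range(epochs):
        for x, target in samples:
            h, y = forward(x)      # run a training value through the network
            err = y - target       # compare the result with the actual value
            # Pass the error back: each hidden node's share of the blame,
            # computed before the output weights are changed.
            deltas = [err * w_out[j] * (1 - h[j] ** 2) for j in range(hidden)]
            for j in range(hidden):
                w_out[j] -= rate * err * h[j]
            w_out[-1] -= rate * err
            for j in range(hidden):
                for i in range(n_in):
                    w_hidden[j][i] -= rate * deltas[j] * x[i]
                w_hidden[j][-1] -= rate * deltas[j]
    return initial, mse(), lambda x: forward(x)[1]


# Made-up training data: learn y = x on [-1, 1].
samples = [([k / 4], k / 4) for k in range(-4, 5)]
initial_mse, final_mse, predict = train_network(samples)
print(initial_mse, final_mse)
```

Weights that contributed more to a wrong answer receive larger corrections, which matches the "incorrect node gets less influence" description.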

Neural Networks • Strengths • Accurate • Fast to use • Handle missing or corrupt data well • Weaknesses • Not intuitive • Don’t handle large numbers of predictors well • Data preprocessing

Neural Net Example • Four years of 30-minute trading data, 1985-1988 • 1986 & 1987 for training • 1985 for testing • USD/CHF • Single layer model • Input nodes: 7 • Hidden nodes: 7 • Two layer model • Input nodes: 7 • Hidden nodes: 5/2 • Output • Value between -1.0 and 1.0 • Rise or fall?

Accuracy of Models • Overfitting • Applies to all three • Independent test data • Statistical measures

Statistical Measures
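Which measures this slide listed is not recoverable from the transcript. As an assumption, the sketch below shows three that are commonly used to score numeric predictions on independent test data: mean absolute error, root mean squared error, and the correlation coefficient. The data are made up.

```python
from math import sqrt


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


# Made-up test-set values and model predictions.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
r = correlation(actual, predicted)
print(mae, rmse, r)
```

RMSE penalizes large errors more heavily than MAE, while the correlation coefficient ignores any constant offset or scaling in the predictions, so the three can rank models differently.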

Any questions?

Special Bonus Slide
