Estimation As A Data Mining Task Theoretical Understanding Vinh Ngo & Mike Ellis
Introduction • Estimation • Predicting values not in predetermined categories • Three main techniques • Regression • Decision trees • Neural networks
Regression • Linear regression • Method of Least Squares • Easy technique to use • Excel “Data Analysis” • Excel chart example
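The method of least squares on this slide can be sketched in a few lines. This is a minimal illustration, not the presenters' Excel workflow: the function name `least_squares_fit` and the data points are made up for the example.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the line minimizing squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying close to y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = least_squares_fit(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} x")
```

A spreadsheet trendline fitted to the same points should give the same two coefficients, since it solves the same minimization.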
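Other Regression Forms • Multiple regression: several predictor variables in one linear equation • Polynomial equation, e.g. y = a + b1 x + b2 x^2 • Define new variables: x1 = x, x2 = x^2 • Equation becomes linear: y = a + b1 x1 + b2 x2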
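The "define new variables" step can be shown concretely: a polynomial in x becomes an ordinary multiple regression once x and x^2 are treated as separate predictors. The sketch below assumes a quadratic and solves the normal equations directly; `fit_multiple_regression` and the data are hypothetical, chosen so the true coefficients are known.

```python
def fit_multiple_regression(rows, ys):
    """Least-squares coefficients [b0, b1, ...] for y = b0 + b1*x1 + ..."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, ys))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical data generated exactly from y = 1 + 2x + 3x^2
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

# New variables: x1 = x and x2 = x^2, then fit a model that is linear in them
rows = [(x, x ** 2) for x in xs]
b0, b1, b2 = fit_multiple_regression(rows, ys)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

The same solver handles genuine multiple regression unchanged; only the construction of `rows` differs.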
Decision Trees • Regression tree • Leaves → average values • Model tree • Leaves → linear regression models • Discretizing the data • Convert continuous data into discrete partitions • Threshold values
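A single split of a regression tree illustrates both ideas on this slide: a threshold turns a continuous attribute into two partitions, and each leaf predicts the average of its training values. The sketch assumes squared error as the split criterion, which the slides do not specify; all names and data are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean of the values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    return best[1:]


def predict(x, threshold, left_mean, right_mean):
    # Each leaf of a regression tree predicts the average of its partition
    return left_mean if x <= threshold else right_mean


# Hypothetical data with an obvious break between x = 3 and x = 10
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, left_mean, right_mean = best_split(xs, ys)
print(threshold, left_mean, right_mean)
```

A model tree would keep the same split but fit a linear regression inside each partition instead of storing the two averages.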
Threshold Values • Entropy • Measure of impurity (zero when a partition is pure) • Information gain • Expected reduction in entropy due to partitioning • Maximize for best threshold
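Entropy and information gain can be computed directly to pick a threshold. The sketch below assumes candidate thresholds are the midpoints between adjacent distinct values, a common convention that the slides do not state; the function names and data are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits; 0 for a pure partition."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Expected reduction in entropy from splitting at the threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: information_gain(values, labels, t))


# Hypothetical continuous attribute with two classes
values = [1, 2, 3, 4, 5, 6]
labels = ["low", "low", "low", "high", "high", "high"]
threshold = best_threshold(values, labels)
gain = information_gain(values, labels, threshold)
print(threshold, gain)
```

Here the split at 3.5 separates the classes perfectly, so the gain equals the full entropy of the unsplit data (1 bit).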
CART Algorithm • Classification and Regression Tree • Grow a tree that overfits data • Prune the tree • Select best subtree
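The grow-then-prune idea can be sketched on one continuous attribute. Note the substitution: CART itself uses cost-complexity pruning, while this example uses the simpler reduced-error pruning against a held-out validation set to show the same overfit-then-simplify pattern. All names and data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def sse(points):
    ys = [y for _, y in points]
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def grow(points):
    """Grow a regression tree on (x, y) pairs until no split is possible."""
    node = {"value": mean([y for _, y in points])}
    xs = sorted({x for x, _ in points})
    if len(xs) < 2:
        return node  # leaf: only one distinct x value left
    best = None
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t, left, right)
    _, t, left, right = best
    node.update(threshold=t, left=grow(left), right=grow(right))
    return node


def predict(node, x):
    while "threshold" in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["value"]


def squared_error(node, points):
    return sum((y - predict(node, x)) ** 2 for x, y in points)


def prune(node, validation):
    """Bottom-up: collapse a subtree when held-out error does not increase."""
    if "threshold" not in node:
        return node
    t = node["threshold"]
    node["left"] = prune(node["left"], [p for p in validation if p[0] <= t])
    node["right"] = prune(node["right"], [p for p in validation if p[0] > t])
    leaf = {"value": node["value"]}
    if squared_error(leaf, validation) <= squared_error(node, validation):
        return leaf
    return node


def count_leaves(node):
    if "threshold" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


# Hypothetical noisy step function: about 10 below x = 5, about 20 above
training = [(1, 9), (2, 11), (3, 10), (4, 10), (6, 21), (7, 19), (8, 20), (9, 20)]
validation = [(1.5, 10), (2.5, 10), (3.5, 10), (6.5, 20), (7.5, 20), (8.5, 20)]

tree = grow(training)
full_leaves = count_leaves(tree)   # one leaf per training point: overfit
pruned = prune(tree, validation)   # prune() modifies the tree in place
print(full_leaves, count_leaves(pruned))
```

The fully grown tree memorizes the noise in the training values; pruning keeps only the split that also helps on data the tree has not seen.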
Decision Trees • Strengths • Understandable • Which fields are most important • Weaknesses • Intended for discrete data • Time to grow and prune tree
Linear Regression Result • PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
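Applying the fitted equation is a weighted sum of the six attributes plus the intercept. The coefficients below are copied from the slide; the machine description passed in is a hypothetical input chosen only to show the arithmetic.

```python
# Coefficients and intercept as given in the regression result above
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def estimate_prp(machine):
    """Estimate PRP for a dict holding the six attribute values."""
    return INTERCEPT + sum(coef * machine[name]
                           for name, coef in COEFFICIENTS.items())


# Hypothetical machine, values chosen for illustration only
machine = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
           "CACH": 256, "CHMIN": 16, "CHMAX": 128}
prp = estimate_prp(machine)
print(round(prp, 3))
```

Because the intercept is negative, small configurations can receive a negative estimate, which is one practical limitation of a single global linear model.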
Side-By-Side • Linear regression: PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX • Regression tree • Model tree
Building the Neural Net • Recursive process • Assign initial weights • Run training values through network • Compare results to actual value • Backpropagation • Pass errors back through net • Incorrect node gets less influence • Military metaphor • Recurrent networks • Genetic algorithms • Simulated annealing
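The training loop described on this slide (initial weights, forward run, comparison with the actual value, errors passed back) can be sketched for a tiny network. The architecture here is an assumption made for the example: one input, three tanh hidden nodes, one linear output, trained by plain gradient descent on squared error. The data, learning rate and epoch count are likewise illustrative.

```python
import math
import random


def train(samples, hidden=3, epochs=2000, rate=0.05, seed=0):
    rng = random.Random(seed)
    # Assign small random initial weights
    w_in = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]   # input -> hidden
    b_in = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]   # hidden biases
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden -> output
    b_out = 0.0

    def forward(x):
        h = [math.tanh(w_in[j] * x + b_in[j]) for j in range(hidden)]
        return h, sum(w_out[j] * h[j] for j in range(hidden)) + b_out

    def total_error():
        return sum((forward(x)[1] - y) ** 2 for x, y in samples)

    history = [total_error()]
    for _ in range(epochs):
        for x, y in samples:
            h, out = forward(x)      # run a training value through the network
            err = out - y            # compare the result with the actual value
            for j in range(hidden):  # pass the error back through the net
                # Error attributed to hidden node j (tanh derivative is 1 - h^2)
                grad_hidden = err * w_out[j] * (1 - h[j] ** 2)
                w_out[j] -= rate * err * h[j]
                w_in[j] -= rate * grad_hidden * x
                b_in[j] -= rate * grad_hidden
            b_out -= rate * err
        history.append(total_error())
    return (lambda x: forward(x)[1]), history


# Hypothetical training values: y = x^2 sampled on [-1, 1]
samples = [(x, x * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
predict, history = train(samples)
print(f"error before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

Each weight moves in proportion to its share of the error, so connections that contributed most to a wrong output are adjusted most.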
Neural Networks • Strengths • Accurate • Fast to use • Handle missing or corrupt data well • Weaknesses • Not intuitive • Don’t handle large numbers of predictors well • Data preprocessing
Neural Net Example • Four years of 30-minute trading data, 1985-1988 • 1986 & 1987 for training • 1985 for testing • USD/CHF • Single hidden layer model • Input nodes: 7 • Hidden nodes: 7 • Two hidden layer model • Input nodes: 7 • Hidden nodes: 5 and 2 • Output • Value between -1.0 and 1.0 • Rise or fall?
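The two architectures can be compared with a generic forward pass. This sketch assumes a single output node, tanh activations throughout (which keeps the output between -1.0 and 1.0), and that "5/2" means hidden layers of 5 and 2 nodes. The weights are random and untrained, the seven input values are made up, and reading a positive output as "rise" is also an assumption.

```python
import math
import random


def build_network(layer_sizes, seed=0):
    """Random weights for a fully connected net; last entry per node is the bias."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


def forward(network, inputs):
    activations = list(inputs)
    for layer in network:
        # Weighted sum of the previous layer plus bias, squashed by tanh
        activations = [
            math.tanh(sum(w * a for w, a in zip(node[:-1], activations)) + node[-1])
            for node in layer
        ]
    return activations[0]  # single output node


single_layer = build_network([7, 7, 1])   # 7 inputs, 7 hidden, 1 output
two_layer = build_network([7, 5, 2, 1])   # 7 inputs, 5 and 2 hidden, 1 output

# Hypothetical, already-scaled input features for one 30-minute interval
features = [0.2, -0.1, 0.05, 0.3, -0.4, 0.0, 0.1]

for name, net in (("single", single_layer), ("two", two_layer)):
    output = forward(net, features)
    direction = "rise" if output > 0 else "fall"
    print(f"{name}: {output:+.3f} -> {direction}")
```

With trained weights in place of the random ones, the same `forward` function would produce the rise-or-fall signal described on the slide.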
Accuracy of Models • Overfitting • Applies to all three • Independent test data • Statistical measures
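Statistical measures on independent test data can be computed the same way for all three techniques. The slides do not name specific measures, so the three below (mean absolute error, root mean squared error, correlation coefficient) are assumed as common choices; the actual and predicted values are hypothetical.

```python
import math


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    sap = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    saa = sum((a - mean_a) ** 2 for a in actual)
    spp = sum((p - mean_p) ** 2 for p in predicted)
    return sap / math.sqrt(saa * spp)


# Hypothetical held-out test set: values the model never saw during training
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
r = correlation(actual, predicted)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} r={r:.3f}")
```

Computing the same measures on the training data and on the test data gives a quick overfitting check: a model that scores much better on the data it was fitted to has probably memorized noise.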