
Improving Data Mining Utility with Projective Sampling

This research explores the use of projective sampling to optimize data mining utility by estimating the optimal training set size. It discusses learning curves, progressive sampling strategies, and empirical results.



Presentation Transcript


  1. Improving Data Mining Utility with Projective Sampling Mark Last, Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel. E-mail: mlast@bgu.ac.il Home Page: http://www.bgu.ac.il/~mlast/

  2. Agenda • Introduction • Learning Curves and Progressive Sampling • The Projective Sampling Strategy • Empirical Results • Conclusions and Future Research

  3. Motivation: Data is not “born” free • The training data is often scarce and costly • Real-world examples • A limited number of patient records stored by a hospital • Results of a costly engineering experiment • Seasonal records in an agricultural database • Even when the raw data is free, its preparation may still be labor intensive! • Critical question • Should we spend our resources (time and/or money) on acquiring more examples?

  4. Total Cost of the Classification Process (based on Weiss and Tian, 2008) • Total Cost = n·Ctr + err(n)·|S|·Cerr + CPU(n)·Ctime • Ctr – cost of acquiring and labeling each new training example • Cerr – cost of each misclassified example from the score set • Ctime – cost per one unit of CPU time • n – number of training set examples used to induce the model • S – the score set of future examples to be classified by the model • err(n) – the model error rate measured on the score set • CPU(n) – CPU time required to induce the model [Diagram: Training Set – used to induce the classification model; Score Set – future examples to be classified by the model]
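To make the cost model concrete, here is a minimal Python sketch of the total cost formula above; the function and parameter names are mine, and err and cpu stand in for any estimates of err(n) and CPU(n):

```python
def total_cost(n, err, cpu, c_tr, c_err, c_time, score_set_size):
    """Total cost of the classification process (Weiss and Tian, 2008):
    acquisition cost + expected misclassification cost on the score set
    + model induction (CPU) cost."""
    return n * c_tr + err(n) * score_set_size * c_err + cpu(n) * c_time

# Illustrative call: 1,000 training examples, a flat 10% error rate,
# negligible CPU cost, and a score set of 10,000 examples.
cost = total_cost(1000, err=lambda n: 0.10, cpu=lambda n: 0.0,
                  c_tr=1, c_err=100, c_time=0, score_set_size=10000)
```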

  5. What is this research about? • Problem Statement • Find the best training set size n* that is expected to maximize the overall utility (minimize the Total Cost) • Basic Idea - Projective Sampling • Estimate the optimal training set size using learning and run-time curves projected from a small subset of potentially available data • Research Objectives • Calculate the optimal training set size for a variety of learning curve equations (with and without CPU costs) • Improve the utility of the data mining process using the best fitting curves for a given dataset and an algorithm

  6. Some Learning Curves for a Decision-Tree Algorithm [Figure: error-rate learning curves for several datasets, annotated with phases such as “rapid rise”, “rapid rise with oscillations”, “slow rise”, and “plateau”]

  7. The Best Fit for a Learning Curve • Frey and Fisher (1999) • The power law is the best fit for modeling the C4.5 error rates • Last (2007) • The power law is the best fit for modeling the error rates of an oblivious decision-tree algorithm (Information Network) • Singh (2005) • The power law is only second best to the logarithmic regression for ID3, k-Nearest Neighbors, Support Vector Machines, and Artificial Neural Networks

  8. Progressive Sampling Strategy (Provost et al., 1999; Weiss and Tian, 2008) • General strategy • Start with some initial amount of training data n0 • Iteratively increase the training set until there is an increase in total cost • Popular schedules • Uniform (arithmetic) sampling • n0, n0+Δ, n0+2Δ, … • Geometric sampling • n0, a·n0, a²·n0, …
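The two schedules translate directly into code; a small sketch (function names are mine, the increment Δ is passed as delta):

```python
def uniform_schedule(n0, delta, n_max):
    """Arithmetic schedule: n0, n0 + delta, n0 + 2*delta, ..."""
    return list(range(n0, n_max + 1, delta))

def geometric_schedule(n0, a, n_max):
    """Geometric schedule: n0, a*n0, a^2*n0, ..."""
    sizes, n = [], n0
    while n <= n_max:
        sizes.append(round(n))
        n *= a
    return sizes

# Example: uniform_schedule(100, 100, 500)  -> [100, 200, 300, 400, 500]
#          geometric_schedule(100, 2, 1000) -> [100, 200, 400, 800]
```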

  9. Limitations of Progressive Sampling • Overfitting some local perturbations in the error rate • Progressive sampling costs may exceed the optimal ones by 10%-200% (Weiss and Tian, 2008) • Potential overhead associated with purchasing and pre-processing each sampling increment (especially with uniform sampling) • Our expectation • The projective sampling strategy should reduce data mining costs by estimating the optimal training set size from a small subset of potentially available data

  10. The Projective Sampling Strategy • Set a fixed sampling increment Δ • Each acquired sample = one data point • Do • Acquire a new data point • Compute Pearson's correlation coefficient for each candidate fitting function (given at least three data points) • Dependent variable: err(n) • Independent variable: training set size n • Find the function with the minimal correlation coefficient Best_Corr • Why minimal? The error rate decreases as n grows, so the best-fitting candidate is the one with the strongest negative correlation • While ((Best_Corr ≥ 0) and (n < nmax)) • Estimate the regression coefficients of the selected function • Estimate the optimal training set size n* • Induce the classification model M(n*) from n* examples
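A Python sketch of this loop, under stated assumptions: the TRANSFORMS table (anticipating slides 11-12) and all names are illustrative, acquire_point is a hypothetical callback that buys one increment and returns (n, err(n)), and the stopping rule follows the slide (stop once the best correlation turns negative or n reaches nmax):

```python
import numpy as np

# Linearizing (x, y) transforms for each candidate curve (see slides 11-12),
# so every candidate can be scored with Pearson's r on linearized points.
# Assumes err(n) > 0 wherever a logarithm is taken.
TRANSFORMS = {
    "logarithmic": lambda n, e: (np.log(n), e),          # err = a + b*log n
    "weiss_tian":  lambda n, e: (n / (n + 1.0), e),      # err = a + b*n/(n+1)
    "power_law":   lambda n, e: (np.log(n), np.log(e)),  # log err = log a + b*log n
    "exponential": lambda n, e: (n, np.log(e)),          # log err = log a + n*log b
}

def select_fitting_function(acquire_point, n_max):
    """Acquire one data point per sampling increment until the best-fitting
    candidate shows a negative correlation (or n reaches n_max)."""
    ns, errs, best = [], [], None
    while True:
        n, err = acquire_point()                  # one new increment
        ns.append(n); errs.append(err)
        if len(ns) >= 3:                          # need >= 3 points to correlate
            xs, ys = np.array(ns, float), np.array(errs, float)
            corrs = {name: np.corrcoef(*tf(xs, ys))[0, 1]
                     for name, tf in TRANSFORMS.items()}
            best, best_corr = min(corrs.items(), key=lambda kv: kv[1])
            if best_corr < 0 or n >= n_max:       # the slide's loop guard
                break
        elif n >= n_max:
            break
    return best, ns, errs   # next: estimate coefficients, then n*, then M(n*)
```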

  11. Candidate Fitting Functions • Learning Curves • Logarithmic: errLog(n) = a + b·log n • Weiss and Tian: errWT(n) = a + b·n/(n+1) • Power Law: errPL(n) = a·n^b • Exponential: errExp(n) = a·b^n • Run-time Curves • Linear: CPUL(n) = d·n • Power law: CPUPL(n) = c·n^d

  12. Converting Learning Curves into the Linear Form y = a’ + b’x • Logarithmic: y = err(n), x = log n, a’ = a, b’ = b • Weiss and Tian: y = err(n), x = n/(n+1), a’ = a, b’ = b • Power Law: y = log err(n), x = log n, a’ = log a, b’ = b • Exponential: y = log err(n), x = n, a’ = log a, b’ = log b
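Given these conversions, each family can be fitted by ordinary linear regression on its (x, y) coordinates and the coefficients mapped back; a sketch reusing the illustrative TRANSFORMS dictionary from the previous code block:

```python
import numpy as np

def fit_curve(ns, errs, family):
    """Fit the chosen curve family by least squares on its linearized
    coordinates, then back-transform (a', b') into the curve's (a, b)."""
    x, y = TRANSFORMS[family](np.array(ns, float), np.array(errs, float))
    b_prime, a_prime = np.polyfit(x, y, 1)        # fits y = a' + b'*x
    if family in ("logarithmic", "weiss_tian"):
        return a_prime, b_prime                   # a = a',   b = b'
    if family == "power_law":
        return np.exp(a_prime), b_prime           # a = e^a', b = b'
    if family == "exponential":
        return np.exp(a_prime), np.exp(b_prime)   # a = e^a', b = e^b'
    raise ValueError(f"unknown curve family: {family}")
```

(Natural logarithms are used throughout; any base works as long as the back-transform matches the transform.)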

  13. Pearson's Correlation Coefficient • r = (k·Σxy − Σx·Σy) / √[(k·Σx² − (Σx)²)·(k·Σy² − (Σy)²)] • k – number of data points

  14. Linear Regression Coefficients y = a + bx • The least squares estimate of the slope: b = (k·Σxy − Σx·Σy) / (k·Σx² − (Σx)²) • The least squares estimate of the intercept: a = (Σy − b·Σx) / k • k – number of data points
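These closed-form estimates translate directly into code; a self-contained sketch (function names are mine) that can double as a cross-check against library routines such as np.polyfit:

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient over k data points (slide 13)."""
    k = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (k * sxy - sx * sy) / math.sqrt(
        (k * sxx - sx ** 2) * (k * syy - sy ** 2))

def least_squares(xs, ys):
    """Least-squares slope b and intercept a for y = a + b*x (slide 14)."""
    k = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (k * sxy - sx * sy) / (k * sxx - sx ** 2)
    a = (sy - b * sx) / k
    return a, b
```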

  15. Total Cost Functions • Total CostLog(n) = n·Ctr + d·n·Ctime + |S|·Cerr·(a + b·log n) • Total CostWT(n) = n·Ctr + d·n·Ctime + |S|·Cerr·(a + b·n/(n+1)) • Total CostPL(n) = n·Ctr + d·n·Ctime + |S|·Cerr·a·n^b • Total CostExp(n) = n·Ctr + d·n·Ctime + |S|·Cerr·a·b^n

  16. Optimizing the Training Set Size • Let • R = Cerr / Ctr • Ctr = 1 • CPUL(n) = d·n • Logarithmic: • Total CostLog(n) = n + d·n·Ctime + |S|·R·(a + b·log n) • Weiss and Tian: • Total CostWT(n) = n + d·n·Ctime + |S|·R·(a + b·n/(n+1)) • Power Law: • Total CostPL(n) = n + d·n·Ctime + |S|·R·a·n^b • Exponential: • Total CostExp(n) = n + d·n·Ctime + |S|·R·a·b^n
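As one illustration (the slide itself does not show the algebra), setting the derivative of the power-law total cost to zero yields a closed form for n*; a sketch in LaTeX, assuming Ctr = 1 and the linear run-time curve CPUL(n) = d·n:

```latex
% TotalCost_PL(n) = n(1 + d\,C_{time}) + |S|\,R\,a\,n^{b}
\frac{d}{dn}\,\mathrm{TotalCost_{PL}}(n)
  = (1 + d\,C_{time}) + |S|\,R\,a\,b\,n^{b-1} = 0
\quad\Longrightarrow\quad
n^{*} = \left(\frac{-(1 + d\,C_{time})}{|S|\,R\,a\,b}\right)^{\frac{1}{b-1}}
```

For a decreasing learning curve b < 0, so the bracketed ratio is positive; the other curve families admit similar closed forms, and corner solutions (n* = 0 or n* = nmax) must be checked separately.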

  17. Experimental Settings • Ten benchmark datasets (see next slide) • Each dataset was randomly partitioned into 25%-50% test examples and 50%-75% examples potentially available for training • The sampling increment Δ was set to 1% of the maximum possible training set size • The error rate of each increment was averaged over 10 random partitions of the training set • Sampling schedules: Uniform, Geometric (a=2), Straw Man, Projective, Optimal • Cost Ratios (R): 1 – 50,000 • CPU Factors: 0 and 1 (per one millisecond of CPU time)

  18. Datasets Description

  19. Projected Fitting Functions

  20. Projected and Actual Learning Curves – Small Datasets

  21. Projected and Actual Learning Curves – Medium and Large Datasets

  22. Comparison of Sampling Schedules (R = Cerr / Ctr)

  23. Detailed Sampling Schedules without Induction Costs – Small Datasets [Chart annotations: Uniform; Geometric, Straw Man, Projected, Optimal]

  24. Detailed Sampling Schedules without Induction Costs – Medium and Large Datasets [Chart annotations: Geometric vs. Uniform, Straw Man, Projected, Optimal; Geometric, Optimal vs. Uniform, Straw Man, Projected]

  25. Conclusions • The projective sampling strategy estimates the optimal training set size by fitting an analytical function to a partial learning curve • The proposed methodology was evaluated on 10 benchmark datasets of variable size using a decision-tree algorithm • The results show that under negligible induction costs and high data acquisition costs, projective sampling outperforms, on average, the alternative progressive sampling techniques

  26. Future Research • Further optimization of projective sampling schedules, especially under substantial CPU costs • Improving utility of cost-sensitive data mining algorithms • Modeling learning curves for non-random (“active”) sampling and labeling techniques

  27. Thank you! Merci beaucoup!
