Adaptive Sampling Methods for Scaling up Knowledge Discovery Algorithms. From Ch. 8 of Instance Selection and Construction for Data Mining (2001), by Carlos Domingo et al., Kluwer Academic Publishers (summarized by Jinsan Yang, SNU Biointelligence Lab).
Abstract • Methods for handling large amounts of data • Adaptive sampling instead of batch random sampling • Keywords: Data Mining, Knowledge Discovery, Scalability, Adaptive Sampling, Concentration Bounds
Outline • Introduction • General Rule Selection Problem • Adaptive Sampling Algorithm • An Application of Adaselect • Problem and Algorithm • Experiments • Concluding Remarks
Introduction (1) • Analysis of large data: two approaches • Redesign a known algorithm • Reduce the data size • A typical task in data mining • Finding or selecting some rules or laws (General Rule Selection) • General Rule Selection by random sampling (batch sampling) • Proper sample size: given by concentration bounds or deviation bounds (Chernoff, Hoeffding bounds) • Problems • An immense sample size is needed for good accuracy and confidence • In batch sampling, the sample size must be fixed a priori for the worst case, so it is overestimated in most situations
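To make the batch-sampling cost concrete, here is a minimal sketch of the a priori sample size implied by the two-sided Hoeffding bound with a union bound over the candidate models. The function name and parameters (`epsilon` for accuracy, `delta` for failure probability, `num_models` for the number of candidates) are illustrative and not taken from the chapter, and utilities are assumed to be averages of values in [0, 1].

```python
import math


def hoeffding_sample_size(epsilon, delta, num_models=1):
    """Smallest n with 2 * num_models * exp(-2 * n * epsilon**2) <= delta.

    Assumes every observed value lies in [0, 1] (hypothetical setting).
    """
    if epsilon <= 0 or not (0 < delta < 1) or num_models < 1:
        raise ValueError("need epsilon > 0, 0 < delta < 1, num_models >= 1")
    return math.ceil(math.log(2 * num_models / delta) / (2 * epsilon ** 2))


# The required size grows as 1 / epsilon**2 but only logarithmically
# in the number of models and in 1 / delta.
for eps in (0.1, 0.05, 0.01):
    print(eps, hoeffding_sample_size(eps, delta=0.05, num_models=100))
```

Because this size is fixed before any data is seen, it has to cover the worst case, which is the overestimation the slide refers to.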
Introduction (2) • Remedies • Sample in an online, sequential fashion (one by one or block by block) • Adapt the sample size to the situation at hand (adaptive sampling)
General Rule Selection Problem • Given data D (discrete, categorical?) and a model set H, select the model h in H with the maximum utility U(h) (supervised learning)
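As a baseline, the selection problem can be sketched as an exhaustive scan of the full data set. This is only an illustration: the utility here is assumed to be plain accuracy, and the names `accuracy`, `select_rule`, and the toy rules are hypothetical.

```python
def accuracy(rule, data):
    """Fraction of labeled examples (x, y) on which rule(x) equals y."""
    return sum(1 for x, y in data if rule(x) == y) / len(data)


def select_rule(rules, data, utility=accuracy):
    """Return the name of the rule with maximum utility on all of the data."""
    return max(rules, key=lambda name: utility(rules[name], data))


# Toy data: two binary attributes, label equal to the second attribute.
data = [((0, 1), 1), ((1, 1), 1), ((0, 0), 0), ((1, 0), 0)]
rules = {
    "attr0": lambda x: x[0],
    "attr1": lambda x: x[1],
    "always1": lambda x: 1,
}
best = select_rule(rules, data)
print(best)
```

Each utility evaluation touches every example in D, which is the cost that sampling is meant to avoid.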
Adaptive Sampling Algorithm (1) • Extension of Hoeffding bound • Reliability of Algorithm
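The slide does not reproduce the algorithm, so the following is a simplified sequential-sampling sketch with a Hoeffding-style stopping rule, not the chapter's exact Adaselect procedure. It assumes every model's score on an example lies in [0, 1]. The names `adaptive_select` and `draw_scores`, and the per-step failure budget delta / (t (t + 1)), are choices made for this illustration.

```python
import math
import random


def adaptive_select(draw_scores, num_models, epsilon, delta, max_samples=100_000):
    """Sample one example at a time and stop as soon as the data allow it.

    draw_scores() returns a list with the score in [0, 1] of every model on
    one freshly drawn example. Returns (chosen model index, examples used).
    """
    sums = [0.0] * num_models
    best = 0
    for t in range(1, max_samples + 1):
        for i, score in enumerate(draw_scores()):
            sums[i] += score
        means = [s / t for s in sums]
        # Hoeffding radius at step t; the failure budgets delta / (t * (t + 1))
        # add up to at most delta over all steps and all models.
        alpha = math.sqrt(math.log(2 * num_models * t * (t + 1) / delta) / (2 * t))
        ranked = sorted(range(num_models), key=lambda i: means[i], reverse=True)
        best = ranked[0]
        runner_up = means[ranked[1]] if num_models > 1 else -math.inf
        # Stop when the leader is separated from the rest, or when every
        # estimate is already accurate to within epsilon / 2.
        if means[best] - runner_up > 2 * alpha or alpha <= epsilon / 2:
            return best, t
    return best, max_samples


rng = random.Random(0)


def draw():
    # Two hypothetical models with true utilities 0.9 and 0.1.
    return [float(rng.random() < 0.9), float(rng.random() < 0.1)]


chosen, used = adaptive_select(draw, num_models=2, epsilon=0.1, delta=0.05)
print(chosen, used)
```

When one model is clearly better, the loop stops after far fewer examples than a worst-case batch size would demand; when the models are close, it falls back to roughly the batch cost.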
An Application of Adaselect (1) • Can be applied as a tool for the General Rule Selection problem • Example chosen: a boosting-based classification algorithm that uses a simple decision stump learner as the base learner • Decision stump: a single-split decision tree • AdaBoost: boosting by sub-sampling or re-weighting • Apply adaptive sampling to the base learner (boosting by filtering) • Use MadaBoost, which keeps the weights bounded by their initial values
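Boosting by filtering can be sketched as rejection sampling: an example drawn from the data is passed to the base learner with probability equal to its current weight, which only works if the weights are bounded. The sketch below assumes a cap of 1.0 and takes the unbounded boosting weight as a caller-supplied function; `raw_weight`, `capped_weight`, and `filtered_stream` are hypothetical names, and the exact MadaBoost weight formula is not given in these slides.

```python
import random


def capped_weight(raw_weight, cap=1.0):
    """Bound a boosting weight from above so it can serve as a probability."""
    return min(raw_weight, cap)


def filtered_stream(examples, raw_weight, rng, cap=1.0):
    """Yield each example with probability capped_weight / cap."""
    for example in examples:
        if rng.random() < capped_weight(raw_weight(example), cap) / cap:
            yield example


rng = random.Random(0)
examples = list(range(1000))
# Hypothetical weights: even-numbered examples are "hard", odd ones "easy".
kept = list(filtered_stream(examples, lambda e: 0.9 if e % 2 == 0 else 0.1, rng))
print(len(kept))
```

The base learner then sees mostly high-weight examples without the full weighted data set ever being materialized.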
An Application of Adaselect (2) • Algorithm • Data: discrete instance vectors with labels • Classification rule: decision stump • 0-1 error measure; utility function U: average prediction
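A decision stump over discrete attributes with the 0-1 error measure could look like the sketch below. It assumes a stump that predicts the majority label for each value of a single attribute; the chapter may use a different stump form, and the function names are illustrative.

```python
from collections import Counter, defaultdict


def fit_stump(data, attribute):
    """Build a single-split rule: majority label for each attribute value."""
    counts = defaultdict(Counter)
    for x, y in data:
        counts[x[attribute]][y] += 1
    # Fallback label for attribute values never seen during fitting.
    overall = Counter(y for _, y in data).most_common(1)[0][0]
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda x: table.get(x[attribute], overall)


def zero_one_error(rule, data):
    """Fraction of examples that the rule misclassifies."""
    return sum(1 for x, y in data if rule(x) != y) / len(data)


def best_stump(data, num_attributes):
    """Return (attribute index, rule) with the lowest 0-1 error on data."""
    stumps = [(a, fit_stump(data, a)) for a in range(num_attributes)]
    return min(stumps, key=lambda s: zero_one_error(s[1], data))


data = [((0, 2), 1), ((1, 2), 1), ((0, 0), 0), ((1, 1), 0)]
attr, stump = best_stump(data, num_attributes=2)
print(attr, zero_one_error(stump, data))
```

In the adaptive setting, `data` would be the sample drawn so far rather than the whole data set.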
An Application of Adaselect (3) • Experiments • Discretize each continuous attribute into 5 intervals and treat a missing value as an additional value • Artificial inflation (100 copies) of the original UCI data • Two-class problems only • 10-fold cross-validation, with results averaged over 10 runs • Computer: Alpha CPU at 600 MHz, 250 MB memory, 4.3 GB hard disk, running Linux • C4.5 and Naïve Bayes classifiers for comparison • Boosting rounds: 10 • Number of all possible decision stumps: (set of weighted majorities of ten depth-1 decision trees)
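The preprocessing step can be sketched as follows. The slides do not say how the 5 intervals were chosen, so equal-width binning is an assumption here, as is the use of `None` to mark a missing value; `discretize` is a hypothetical helper name.

```python
def discretize(values, num_bins=5):
    """Map numbers to bin indices 0..num_bins-1; missing (None) gets num_bins."""
    present = [v for v in values if v is not None]
    if not present:
        return [num_bins] * len(values)
    low, high = min(present), max(present)
    width = (high - low) / num_bins
    result = []
    for v in values:
        if v is None:
            result.append(num_bins)  # extra category for missing values
        elif width == 0:
            result.append(0)  # all present values are identical
        else:
            # The maximum value would land in bin num_bins, so clamp it.
            result.append(min(int((v - low) / width), num_bins - 1))
    return result


print(discretize([0.0, 1.0, 2.5, None, 5.0, 10.0]))
```

Each attribute then takes at most six discrete values, which keeps the set of candidate decision stumps small and finite.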
An Application of Adaselect (5) • AdaSel is faster than C4.5 • The speed advantage is larger for large sample sizes
Concluding Remarks • Justification and efficiency analysis • Applied in the design of a base learner for a boosting algorithm