Progressive Sampling
Instance Selection and Construction for Data Mining, Ch. 9
F. Provost, D. Jensen, and T. Oates
2001.5.16, 신수용
Introduction
• Increasing the amount of data leads to greater computational cost.
• Progressive sampling attempts to maximize accuracy as efficiently as possible, starting with a small sample and using progressively larger ones until model accuracy no longer improves.
• A central component of progressive sampling is a sampling schedule S = {n0, n1, …, nk}, where ni is the size of the i-th sample.
Three fundamental questions for progressive sampling
• What is an efficient sampling schedule?
• How can convergence be detected effectively and efficiently?
• As sampling progresses, can the schedule be adapted to be more efficient?
Learning curves
• A learning curve depicts the relationship between sample size and model accuracy.
Def. 1.
• Given a data set, a sampling procedure, and an induction algorithm, nmin is the size of the smallest sufficient training set: models built from smaller training sets have lower accuracy than models built from training sets of size nmin, and models built from larger training sets have no higher accuracy.
Progressive Sampling
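The basic procedure can be sketched as a simple loop (a minimal Python sketch, not the chapter's exact pseudocode; `train`, `evaluate`, and `converged` are hypothetical placeholders for the induction algorithm, its accuracy estimate, and a convergence test):

```python
def progressive_sampling(data, schedule, train, evaluate, converged):
    """Train on progressively larger samples until accuracy converges.

    schedule  -- increasing sample sizes {n0, n1, ..., nk}
    train     -- the underlying induction algorithm (hypothetical callable)
    evaluate  -- returns model accuracy on held-out data (hypothetical)
    converged -- convergence test over the sizes/accuracies seen so far
    """
    accuracies = []
    model = None
    for n in schedule:
        sample = data[:n]              # draw a sample of size n (prefix for simplicity)
        model = train(sample)
        accuracies.append(evaluate(model))
        if converged(schedule[:len(accuracies)], accuracies):
            break                      # accuracy no longer improves
    return model
```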
Determining an efficient schedule
• Static sampling: computes a sample size without progressive sampling, based on a subsample's statistical similarity to the entire sample.
• Arithmetic sampling (John & Langley, 1996): increases the sample size by a fixed increment at each step. (Drawback) If nmin is a large multiple of the increment, the approach requires many runs of the underlying induction algorithm.
Determining an efficient schedule
• Geometric sampling: increases the sample size by a constant multiplicative factor at each step, escaping the limitations of arithmetic sampling (see the sketch below).
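A small sketch of the two schedules; the starting size, increment, and multiplier values below are illustrative, not taken from the chapter:

```python
def arithmetic_schedule(n0, delta, N):
    """S_a = {n0, n0 + delta, n0 + 2*delta, ...}, capped at data set size N."""
    sizes, n = [], n0
    while n < N:
        sizes.append(n)
        n += delta
    return sizes + [N]

def geometric_schedule(n0, a, N):
    """S_g = {n0, a*n0, a^2*n0, ...}, capped at N."""
    sizes, n = [], n0
    while n < N:
        sizes.append(n)
        n = int(a * n)
    return sizes + [N]

# Example of the drawback: when n_min is large, arithmetic sampling
# needs far more induction runs than geometric sampling.
print(len(arithmetic_schedule(100, 100, 100_000)))  # 1000 model builds
print(len(geometric_schedule(100, 2, 100_000)))     # 11 model builds
```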
Asymptotic optimality of geometric sampling
• For induction algorithms with polynomial time complexity f(n) no better than O(n), if convergence can also be detected in O(f(n)) time, then geometric progressive sampling is asymptotically optimal among progressive sampling methods in terms of run time.
Optimality with Respect to Expectations of Convergence
• In many cases there is no prior information about the likelihood of convergence occurring at any given n.
• But since in many cases nmin << N, it is often more reasonable to assume a more concentrated (roughly log-normal) distribution.
• Identifying the optimal schedule by dynamic programming requires O(N²) space and O(N³) time.
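The quantity such a dynamic program minimizes is the expected cost of a schedule under the prior over nmin. A sketch of that expectation (the uniform prior and the quadratic cost f(n) = n² below are illustrative assumptions, and `prior` is a hypothetical callable):

```python
def expected_cost(schedule, prior, f):
    """Expected run time of a schedule under a prior over n_min.

    schedule -- increasing sample sizes ending at N
    prior    -- prior(lo, hi) = P(lo < n_min <= hi), hypothetical callable
    f        -- cost of one induction run on a sample of size n
    """
    total, cum_cost, prev = 0.0, 0.0, 0
    for n in schedule:
        cum_cost += f(n)                    # cost of the run at size n
        total += prior(prev, n) * cum_cost  # convergence first detected here
        prev = n
    return total

# Illustrative: uniform prior over n_min in (0, N], quadratic induction cost.
N = 100_000
uniform = lambda lo, hi: (hi - lo) / N
print(expected_cost([100, 200, 400, 800, 1600, N], uniform, lambda n: n**2))
```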
Comparison of cost
• The costs for three different schedules.
Comparison of cost
• Dynamic programming with various f(n), given a uniform prior.
• Note that the optimal schedule depends on f(n).
Comparison of cost
• Dynamic programming with various f(n), given a log-normal prior.
Detecting convergence
• Linear regression with local sampling (LRLS):
• Begins at the latest scheduled sample size ni and samples l additional points in the local neighborhood of ni.
• These points are then used to estimate a linear regression line, whose slope is compared to zero.
• If the slope is sufficiently close to zero, convergence is detected.
• LRLS takes advantage of a common property of learning curves: the slope flattens as the curve approaches its plateau.
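A sketch of the LRLS test as described above; the neighborhood width, l, and the slope threshold are illustrative parameter choices, and `estimate_accuracy` is a hypothetical stand-in for building and evaluating a model at a given sample size:

```python
import numpy as np

def lrls_converged(n_i, estimate_accuracy, l=10, width=0.1, threshold=1e-6):
    """Linear regression with local sampling (LRLS).

    Samples l points in the local neighborhood of n_i, fits a regression
    line to the (size, accuracy) pairs, and reports convergence if the
    slope of that line is sufficiently close to zero.
    """
    sizes = np.linspace((1 - width) * n_i, (1 + width) * n_i, l).astype(int)
    accs = np.array([estimate_accuracy(n) for n in sizes])
    slope, _ = np.polyfit(sizes, accs, 1)  # degree-1 fit: (slope, intercept)
    return abs(slope) < threshold
```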
Empirical Comparison