Parameter tuning based on response surface models
An update on work in progress
EARG, Feb 27th, 2008
Presenter: Frank Hutter
Motivation
• Parameter tuning is important
• Recent approaches (ParamILS, racing, CALIBRA) "only" return the best parameter configuration
• Extra information would be nice, e.g.
  • The most important parameter is X
  • The effect of parameters X and Y is largely independent
  • For parameter X, options 1 and 2 are bad, 3 is best, 4 is decent
• ANOVA is one tool for that, but has limitations (e.g. discretization of parameters, linear model)
More motivation
• Support the actual design process by providing feedback about parameters
  • E.g. parameter X should always be i (code gets simpler!!)
• Predictive models of runtime are widely applicable
  • Predictions can be updated based on new information (such as "the algorithm has been running unsuccessfully for X seconds")
  • (True) portfolios of algorithms
  • Once we can learn a function f: Θ → runtime, learning a function g: Θ × X → runtime should be a simple extension (X = instance characteristics; Lin learns h: X → runtime)
The problem setting
• For now: static algorithm configuration, i.e. find the best fixed parameter setting across instances
  • But as mentioned above, this approach extends to PIAC (per-instance algorithm configuration)
• Randomized algorithms: variance for a single instance (runtime distributions)
• High inter-instance variance in hardness
• We focus on minimizing runtime
  • But the approach also applies to other objectives
  • (Special treatment of censoring and of the cost of gathering a data point is then simply not necessary)
• We focus on optimizing averages across instances
  • Generalization to other objectives may not be straightforward
Learning a predictive model
• Supervised learning problem, regression
• Given training data (x1, o1), …, (xn, on), learn a function f such that f(xi) ≈ oi
• What is a data point xi?
• 1) Predictive model of average cost
  • Average of how many instances/runs?
  • Not too many data points, but each one very costly
  • Doesn't have to be average cost, could be anything
• 2) Predictive model of single costs, get average cost by aggregation
  • Have to deal with tens of thousands of data points
  • If predictions are Gaussian, the aggregates are Gaussian (means and variances add; see the sketch below)
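To make the aggregation step concrete, here is a minimal sketch (in Python, not the talk's actual code) of how independent per-run Gaussian predictions could be combined into a Gaussian over the average cost; the function name and the numbers are illustrative.

```python
import numpy as np

def aggregate_average(means, variances):
    """Gaussian over the average of n independent Gaussian per-run predictions.
    For the sum, means and variances add; for the average, divide the summed
    mean by n and the summed variance by n^2."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    n = len(means)
    return means.sum() / n, variances.sum() / n**2

# Example: predictions for three runs of one configuration
mu, var = aggregate_average([2.0, 3.0, 10.0], [0.5, 0.5, 2.0])
print(mu, var)  # 5.0, 0.333...
```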
Desired properties of the model
• 1) Discrete and continuous inputs
  • Parameters are discrete/continuous
  • Instance features are (so far) all continuous
• 2) Censoring
  • When a run times out we only have a lower bound on its true runtime
• 3) Scalability: tens of thousands of points
• 4) Explicit predictive uncertainties
• 5) Accuracy of predictions
• Considered models:
  • Linear regression (basis functions? especially for discrete inputs)
  • Regression trees (no uncertainty estimates)
  • Gaussian processes (4 & 5 ok, 1 done, 2 almost done, hopefully 3)
Coming up
• 1) Implemented: model average runtimes, optimize based on that model
  • Censoring "almost" integrated
• 2) Further TODOs:
  • Active learning criterion under noise
  • Scaling: Bayesian committee machine
Active learning for function optimization
• EGO [Jones, Schonlau & Welch, 1998]
  • Assumes deterministic functions
  • Here: averages over 100 instances
• Start with a Latin hypercube design
  • Run the algorithm, get (xi, oi) pairs
• While not terminated:
  • Fit the model (kernel parameter optimization, all continuous)
  • Find the best point to sample (optimization in the space of parameter configurations)
  • Run the algorithm at that point, add the new (x, y) pair
• (A runnable toy version of this loop is sketched below)
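The following is a toy, runnable version of the loop above, assuming a synthetic "runtime" function and scikit-learn's GP as a stand-in for the actual model; the helper names, kernel settings, and the random candidate search (instead of DIRECT or local search) are illustrative choices, not the talk's implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

def toy_runtime(x):
    """Stand-in for 'run the algorithm and average over instances'."""
    return (x[0] - 0.3) ** 2 + 2 * (x[1] - 0.7) ** 2 + 0.01 * rng.standard_normal()

def latin_hypercube(n, d):
    """Simple Latin hypercube design on [0, 1]^d."""
    cells = (np.arange(n)[:, None] + rng.random((n, d))) / n
    for j in range(d):
        cells[:, j] = rng.permutation(cells[:, j])
    return cells

def expected_improvement(mu, sigma, f_min):
    z = (f_min - mu) / np.maximum(sigma, 1e-12)
    return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)

d, n_init, budget = 2, 10, 20
X = latin_hypercube(n_init, d)                    # initial design
y = np.array([toy_runtime(x) for x in X])         # measured (average) costs

for _ in range(budget):
    # refit the model (kernel parameters are re-optimized inside .fit)
    gp = GaussianProcessRegressor(kernel=RBF([0.2] * d) + WhiteKernel(1e-4),
                                  normalize_y=True).fit(X, y)
    cand = rng.random((1000, d))                  # random candidates instead of DIRECT
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.vstack([X, x_next])
    y = np.append(y, toy_runtime(x_next))

print("best configuration found:", X[np.argmin(y)], "cost:", y.min())
```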
Active learning criterion
• EGO uses maximum expected improvement
• EI(x) = ∫ p(y | μx, σx²) · max(0, fmin − y) dy
  • Easy to evaluate (can be solved in closed form)
• Problem in EGO: sometimes it is not the actual runtime y that is modeled, but a transformation, e.g. log(y)
• Expected improvement then needs to be adapted:
  • EI(x) = ∫ p(y | μx, σx²) · max(0, fmin − exp(y)) dy
  • Easy to evaluate (can still be solved in closed form)
• Take into account the cost of a sample:
  • EI(x) = ∫ p(y | μx, σx²) · (1/exp(y)) · max(0, fmin − exp(y)) dy
  • Easy to evaluate (can still be solved in closed form)
  • Not implemented yet (the others are implemented)
• (The first two closed forms are written out in the sketch below)
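For reference, a sketch of the closed forms of the first two criteria (mu and sigma are the predictive mean and standard deviation at a configuration, f_min the best value seen so far); the Monte Carlo line at the end is only a sanity check, and the cost-weighted variant is left out here as on the slide.

```python
import numpy as np
from scipy.stats import norm

def ei(mu, sigma, f_min):
    """Standard EI: E[max(0, f_min - Y)] with Y ~ N(mu, sigma^2)."""
    z = (f_min - mu) / sigma
    return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def ei_log(mu, sigma, f_min):
    """EI when the model predicts log-runtime: E[max(0, f_min - exp(Y))]."""
    v = (np.log(f_min) - mu) / sigma
    return f_min * norm.cdf(v) - np.exp(mu + 0.5 * sigma**2) * norm.cdf(v - sigma)

# Quick Monte Carlo sanity check of the log-transformed version
mu, sigma, f_min = 0.5, 0.8, 2.0
y = np.random.default_rng(1).normal(mu, sigma, 1_000_000)
print(ei_log(mu, sigma, f_min), np.maximum(0.0, f_min - np.exp(y)).mean())
```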
Which kernel to use?
• Kernel: distance measure between two data points
  • Low distance → high correlation
• Squared exponential, Matérn, etc.:
  • SE: k(x, x') = σs · exp(−∑i λi (xi − xi')²)
• For discrete parameters: new Hamming-distance kernel
  • k(x, x') = σs · exp(−∑i λi [xi ≠ xi'])
  • Positive definite by reduction to string kernels
• "Automatic relevance determination"
  • One length-scale parameter λi per dimension
  • Many kernel parameters lead to
    • Problems with overfitting
    • Very long runtimes for kernel parameter optimization
    • For CPLEX: 60 extra parameters, about 15h for a single kernel parameter optimization using DIRECT, without any improvement
  • Thus: no length-scale parameters. Only two parameters: noise σn, and overall variability of the signal, σs
• (Both kernels are sketched in code below)
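A small numeric sketch of the two kernels (not the original code); sigma_s is the overall signal-variability parameter and lam holds the optional per-dimension weights, defaulting to all ones since the talk drops the length scales.

```python
import numpy as np

def se_kernel(x1, x2, sigma_s=1.0, lam=None):
    """Squared exponential: sigma_s * exp(-sum_i lam_i * (x1_i - x2_i)^2)."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    lam = np.ones_like(x1) if lam is None else np.asarray(lam, float)
    return sigma_s * np.exp(-np.sum(lam * (x1 - x2) ** 2))

def hamming_kernel(x1, x2, sigma_s=1.0, lam=None):
    """Weighted Hamming kernel: sigma_s * exp(-sum_i lam_i * [x1_i != x2_i])."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    lam = np.ones(len(x1)) if lam is None else np.asarray(lam, float)
    return sigma_s * np.exp(-np.sum(lam * (x1 != x2)))

print(se_kernel([0.1, 0.2, 0.9], [0.1, 0.5, 0.7]))      # continuous parameters
print(hamming_kernel(["on", 3, "a"], ["off", 3, "a"]))   # discrete parameters
```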
How to optimize kernel parameters?
• Objective
  • Standard: maximizing the marginal likelihood
    • Doesn't work under censoring
  • Alternative: maximizing the likelihood of unseen data using cross-validation
    • Efficient when not too many folds k are used:
    • Marginal likelihood requires inversion of an N × N matrix
    • Cross-validation with k = 2 requires inversion of two N/2 × N/2 matrices. In practice still quite a bit slower (some algebra tricks may help)
    • (A sketch of the cross-validated objective follows below)
• Algorithm
  • Using DIRECT (DIviding RECTangles), a global sampling-based method (does not scale to high dimensions)
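A sketch of the cross-validated objective, using scikit-learn's GP as a stand-in and holding the kernel parameters fixed (optimizer=None); the toy data, the RBF-plus-white-noise kernel, and k = 2 are illustrative. An outer optimizer such as DIRECT would then search over (length_scale, noise) to maximize this score.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import KFold

def cv_log_likelihood(length_scale, noise, X, y, k=2):
    """Sum of log N(y_test | mu, sigma^2) over k cross-validation folds,
    with the kernel parameters held fixed."""
    total = 0.0
    for train, test in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        gp = GaussianProcessRegressor(
            kernel=RBF(length_scale) + WhiteKernel(noise),
            optimizer=None).fit(X[train], y[train])
        mu, sigma = gp.predict(X[test], return_std=True)
        total += norm.logpdf(y[test], mu, sigma).sum()
    return total

# Toy data; score one candidate setting of (length_scale, noise)
rng = np.random.default_rng(0)
X = rng.random((40, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(40)
print(cv_log_likelihood(0.3, 1e-2, X, y))
```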
How to optimize expected improvement?
• Currently only 3 algorithms to be tuned:
  • SAPS (4 continuous parameters)
  • SPEAR (26 parameters, about half of them discrete)
    • For now the continuous ones are discretized
  • CPLEX (60 parameters, 50 of them discrete)
    • For now the continuous ones are discretized
• Purely continuous / purely discrete optimization
  • DIRECT / multiple-restart local search (a toy version of the latter is sketched below)
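A toy version of the multiple-restart local search for the discrete case; the parameter domains and the scoring function (standing in for expected improvement under the model) are made up for illustration.

```python
import random

domains = {                       # illustrative discretized parameter domains
    "alpha": [1.05, 1.1, 1.2, 1.3],
    "rho":   [0.0, 0.17, 0.333, 0.5, 0.666, 0.83, 1.0],
    "wp":    [0.01, 0.03, 0.06],
}

def toy_score(cfg):               # stand-in for EI(cfg) under the model
    return -(cfg["alpha"] - 1.2) ** 2 - (cfg["rho"] - 0.5) ** 2 - cfg["wp"]

def local_search(score, domains, restarts=10, seed=0):
    """Hill-climb by changing one parameter value at a time, from several
    random starting configurations; return the best configuration found."""
    rng, best, best_val = random.Random(seed), None, float("-inf")
    for _ in range(restarts):
        cfg = {p: rng.choice(vals) for p, vals in domains.items()}
        improved = True
        while improved:
            improved = False
            for p, vals in domains.items():
                for v in vals:
                    cand = dict(cfg, **{p: v})
                    if score(cand) > score(cfg):
                        cfg, improved = cand, True
        if score(cfg) > best_val:
            best, best_val = cfg, score(cfg)
    return best

print(local_search(toy_score, domains))
```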
TODO: integrate censoring
• Learning with censored data almost done
  • (needs solid testing since it'll be central later)
• Active selection of the censoring threshold?
  • Something simple might suffice, such as picking the cutoff equal to the predicted runtime or to the best runtime so far
  • The integration bounds in expected improvement would change, but nothing else
• Runtime
  • With censoring about 3 times slower than without (Newton's method takes about 3 steps)
  • "Good" scaling
    • 42 points: 19 seconds; 402 points: 143 seconds
  • Maybe Newton does not need as many steps with more points
• (A sketch of the truncated-normal expectation used for censored runs follows below)
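The Schmee & Hahn procedure mentioned on the next slide is usually described as repeatedly replacing each censored (timed-out) observation by its expected value under the current model, conditional on exceeding the cutoff. A minimal sketch of that conditional expectation, assuming a Gaussian prediction on the modeled scale; whether this matches the talk's exact treatment is an assumption.

```python
import numpy as np
from scipy.stats import norm

def truncated_normal_mean(mu, sigma, cutoff):
    """E[Y | Y > cutoff] for Y ~ N(mu, sigma^2)."""
    alpha = (cutoff - mu) / sigma
    # norm.sf = 1 - cdf; guard against the tail probability underflowing to 0
    return mu + sigma * norm.pdf(alpha) / np.maximum(norm.sf(alpha), 1e-300)

# Example: the model predicts mu=4, sigma=2, and the run was cut off at 5
print(truncated_normal_mean(4.0, 2.0, 5.0))   # a value somewhat above 5
```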
Anecdotal: Lin’s original implementation of Schmee & Hahn, on my machine – beware of normpdf
A counterintuitive example from practice (same hyperparameters in same rows)
Preliminary results and demo
• Experiments with a noise-free kernel
  • Great cross-validation results for SPEAR & CPLEX
  • Poor cross-validation results for SAPS
• Explanation
  • Even when averaging 100 instances, the response is NOT noise-free
  • SAPS is continuous:
    • can pick configurations arbitrarily close to each other
    • if results differ substantially, the SE kernel must have huge variance → very poor results
  • The Matérn kernel works better for SAPS
TODOs
• Finish censoring
• Consider noise (present even when averaging over instances), change the active learning criterion
• Scaling
• Efficiency of implementation: reusing work for multiple predictions
TODO: Active learning under noise
• [Williams, Santner, and Notz, 2000]
  • Very heavy on notation
  • But there is good stuff in there
• 1) Actively choose a parameter setting
  • The best setting so far is not known → fmin is now a random variable
  • Take joint samples of performance from the predictive distributions for all settings tried so far, take the min of those samples, and compute expected improvement as if that min were the deterministic fmin
  • Average the expected improvements computed for 100 independent samples (see the sketch below)
• 2) Actively choose an instance to run for that parameter setting: minimizing posterior variance
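A sketch of step 1, with made-up numbers: sample the joint performance at the previously tried settings, use each sample's minimum as a surrogate fmin, and average the resulting expected improvements.

```python
import numpy as np
from scipy.stats import norm

def ei(mu, sigma, f_min):
    z = (f_min - mu) / sigma
    return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def noisy_ei(mu_cand, sigma_cand, mu_prev, cov_prev, n_samples=100, seed=0):
    """EI at a candidate, averaged over sampled surrogate fmin values drawn
    from the joint predictive distribution at previously tried settings."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu_prev, cov_prev, size=n_samples)
    f_mins = samples.min(axis=1)              # one surrogate fmin per sample
    return np.mean([ei(mu_cand, sigma_cand, f) for f in f_mins])

# Joint prediction at three previously tried settings (illustrative numbers)
mu_prev = np.array([3.0, 2.5, 4.0])
cov_prev = np.array([[0.4, 0.1, 0.0],
                     [0.1, 0.3, 0.1],
                     [0.0, 0.1, 0.5]])
print(noisy_ei(mu_cand=2.2, sigma_cand=0.6, mu_prev=mu_prev, cov_prev=cov_prev))
```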
TODO: scaling
• Bayesian committee machine
  • More or less a mixture of GPs, each of them trained on a small subset of the data (cluster the data ahead of time)
  • Fairly straightforward wrapper around GP code (or really any code that provides Gaussian predictions); the combination rule is sketched below
  • Maximizing cross-validated performance is easy
  • In principle we could update by just updating one component at a time
    • But in practice, once we re-optimize hyperparameters we're changing every component anyway
    • Likewise we could do rank-1 updates for the basic GPs, but a single matrix inversion is really not the expensive part (rather the 1000s of matrix inversions for kernel parameter optimization)
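The BCM combination rule for a single test point, as a sketch (the member means, variances, and the shared prior variance are illustrative numbers): predictions are combined in precision space, with a correction for the prior that all committee members share.

```python
import numpy as np

def bcm_combine(means, variances, prior_var):
    """Combine M Gaussian predictions (mu_i, var_i) for one test point."""
    means, variances = np.asarray(means, float), np.asarray(variances, float)
    M = len(means)
    # precision of the combined prediction, correcting for the shared prior
    precision = np.sum(1.0 / variances) - (M - 1) / prior_var
    var = 1.0 / precision
    mean = var * np.sum(means / variances)
    return mean, var

# Three committee members, prior variance 2.0 (illustrative numbers)
print(bcm_combine([1.9, 2.1, 2.4], [0.5, 0.4, 0.8], prior_var=2.0))
```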
Future work
• We can get main effects and interaction effects, much like in ANOVA
  • The integrals seem to be solvable in closed form
• We can get plots of predicted mean and variance as one parameter is varied, marginalized over all others (a Monte Carlo stand-in is sketched below)
  • Similarly as two or three are varied
  • This allows for plots of interactions
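The closed-form integrals are not worked out here; as a simple stand-in, this sketch approximates the marginal effect of one parameter by Monte Carlo, averaging model predictions over random settings of the remaining parameters (the toy predictive mean replaces a fitted model).

```python
import numpy as np

def marginal_effect(predict, dim, grid, d, n_samples=500, seed=0):
    """Mean prediction as parameter `dim` varies over `grid`, averaged over
    random values of the other d-1 parameters in [0, 1]."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_samples, d))
    means = []
    for v in grid:
        Xv = X.copy()
        Xv[:, dim] = v
        means.append(np.mean(predict(Xv)))
    return np.array(means)

# Toy predictive mean in place of the fitted GP
toy_mean = lambda X: (X[:, 0] - 0.3) ** 2 + X[:, 1]
print(marginal_effect(toy_mean, dim=0, grid=np.linspace(0, 1, 5), d=2))
```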