Bayesian Optimization (BO) Javad Azimi Fall 2010 http://web.engr.oregonstate.edu/~azimi/
Outline • Formal Definition • Application • Bayesian Optimization Steps • Surrogate Function (Gaussian Process) • Acquisition Function • PMAX • IEMAX • MPI • MEI • UCB • GP-Hedge
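Formal Definition • Input: an unknown function f(x) over an input space, where each evaluation (experiment) is expensive. • Goal: find the maximizer of f using as few evaluations as possible.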
Fuel Cell Application • This is how a microbial fuel cell (MFC) works. [Figure: MFC schematic with labels: anode, cathode, bacteria, fuel (organic matter), oxidation products (CO2), e-, H+, O2, H2O.] [Figure: SEM image of bacteria sp. on Ni nanoparticle enhanced carbon fibers.] • The nano-structure of the anode significantly impacts electricity production. • We should optimize the anode nano-structure to maximize power by selecting a set of experiments.
Big Picture • Since running an experiment is very expensive, we use BO. • Select one experiment to run at a time, based on the results of previous experiments. [Diagram: loop of Current Experiments -> Our Current Model -> Select Single Experiment -> Run Experiment -> Current Experiments]
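The loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the objective `run_experiment`, the squared exponential kernel with length scale 0.2, the candidate grid, and the upper-confidence selection rule with `k=2.0` are all hypothetical choices, not taken from the slides.

```python
import numpy as np

def rbf(a, b, length=0.2):
    """Squared exponential kernel between two 1-D arrays of points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(x_obs, y_obs, x_cand, jitter=1e-6):
    """Zero-mean GP posterior mean and standard deviation at x_cand."""
    K = rbf(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    Ks = rbf(x_cand, x_obs)
    mean = Ks @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def run_experiment(x):
    """Hypothetical stand-in for one expensive experiment."""
    return float(np.sin(3.0 * x) + 0.5 * x)

def bayes_opt(n_iter=15, k=2.0):
    x_cand = np.linspace(0.0, 2.0, 201)          # candidate experiments
    x_obs = np.array([0.1, 1.9])                 # two initial experiments
    y_obs = np.array([run_experiment(x) for x in x_obs])
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, x_cand)  # our current model
        x_next = x_cand[np.argmax(mean + k * std)]      # select single experiment
        y_next = run_experiment(x_next)                 # run experiment
        x_obs = np.append(x_obs, x_next)                # update current experiments
        y_obs = np.append(y_obs, y_next)
    best = int(np.argmax(y_obs))
    return x_obs[best], y_obs[best]

best_x, best_y = bayes_opt()
```

Each pass through the `for` loop is one trip around the diagram: refit the model, pick one experiment, run it, and add the result to the data.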
BO Main Steps • Surrogate Function (Response Surface, Model) • Builds a posterior over unobserved points from the prior and the observed data. • Its parameters might be based on the prior. Remember, it is a BAYESIAN approach. • Acquisition Criterion (Function) • Decides which sample should be selected next.
Surrogate Function • Models the distribution of the unknown function based on the prior. • Deterministic (Classical Linear Regression, …) • There is a deterministic prediction for each point x in the input space. • Stochastic (Bayesian regression, Gaussian Process, …) • There is a distribution over the prediction for each point x in the input space (e.g., a Normal distribution). • Example • Deterministic: f(x1) = y1, f(x2) = y2 • Stochastic: f(x1) ~ N(y1, 2), f(x2) ~ N(y2, 5)
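Gaussian Process (GP) • A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. • It must satisfy a consistency requirement, also called the marginalization property. • Marginalization property: examining a larger set of variables does not change the distribution of a smaller subset.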
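The marginalization property can be written out for a partitioned Gaussian vector. The block notation below (subscripts 1 and 2 for the two groups of variables) is an assumption of this sketch, not the slide's own notation.

```latex
\begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix},
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\right)
\;\Longrightarrow\;
\mathbf{y}_1 \sim \mathcal{N}\!\left(\boldsymbol{\mu}_1, \Sigma_{11}\right)
```

In words: dropping the variables in the second group leaves the mean and covariance of the first group unchanged.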
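Gaussian Process (GP) • Formal prediction: the prediction at an unobserved point is a Gaussian whose mean and variance are computed from the covariance function and the observations. • Interesting points: • The squared exponential covariance function corresponds to Bayesian linear regression with an infinite number of basis functions. • The predictive variance does not depend on the observed values, only on the input locations. • The mean is a linear combination of the observations. • If the covariance function specifies the entries of the covariance matrix, marginalization is satisfied!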
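For reference, the standard predictive equations of a zero-mean, noise-free GP are given below. The symbols are assumptions of this sketch: X and y are the observed inputs and values, k is the covariance function, K = k(X, X), and k_* = k(X, x_*) for a new point x_*.

```latex
\mu(x_*) = \mathbf{k}_*^{\top} K^{-1} \mathbf{y},
\qquad
\sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^{\top} K^{-1} \mathbf{k}_*
```

Both "interesting points" can be read off directly: the mean is linear in y with weights K^{-1} k_*, and y does not appear in the variance at all.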
Gaussian Process (GP) • A Gaussian Process is: • An exact interpolating regression method. • It predicts the training data perfectly in the noise-free case (not true in classical regression). • A natural generalization of linear regression. • A nonlinear regression approach! • A simple example of a GP can be obtained from Bayesian regression. • The results are identical. • It specifies a distribution over functions.
Gaussian Process (2): distribution over functions [Figure: three sampled functions, with the 95% confidence interval shown for each point x.]
Gaussian Process (2): GP vs. Bayesian regression • Bayesian regression: • Distribution over weights • The prior is defined over the weights. • Gaussian Process: • Distribution over functions • The prior is defined over the function space. • These are the same model seen from different views.
Short Summary • Given any unobserved point z, we can define a normal distribution over its predicted value such that: • Its mean is a linear combination of the observed values. • Its variance is related to its distance from the observed points (the closer to observed data, the smaller the variance).
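Both statements in this summary can be checked numerically. The sketch below assumes a zero-mean, noise-free GP with a unit-variance squared exponential kernel; the function names and the toy data are hypothetical.

```python
import numpy as np

def sq_exp(a, b, length=1.0):
    """Squared exponential kernel between two 1-D arrays of points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def predict(x_obs, y_obs, z, length=1.0):
    """Return (mean, variance, weights) of the GP prediction at point z."""
    K = sq_exp(x_obs, x_obs, length)
    k_z = sq_exp(np.array([z]), x_obs, length)[0]
    weights = np.linalg.solve(K, k_z)   # K^{-1} k_z, independent of y_obs
    mean = weights @ y_obs              # linear combination of observations
    var = 1.0 - weights @ k_z           # depends only on input locations
    return mean, var, weights

x_obs = np.array([0.0, 1.0, 2.0])
y_obs = np.array([1.0, 3.0, 2.0])

mean_near, var_near, _ = predict(x_obs, y_obs, 2.1)  # close to observed data
mean_far, var_far, _ = predict(x_obs, y_obs, 4.0)    # far from observed data
```

Here `var_near` comes out smaller than `var_far`, and the `weights` vector never touches `y_obs`, which is why the mean is a linear combination of the observed values.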
BO Main Steps • Surrogate Function (Response Surface, Model) • Builds a posterior over unobserved points from the prior and the observed data. • Its parameters might be based on the prior. Remember, it is a BAYESIAN approach. • Acquisition Criterion (Function) • Decides which sample should be selected next.
Bayesian Optimization: Acquisition Criterion • Remember: we are looking for the maximizer of the unknown function. • Input: • The set of observed data. • A set of candidate points with their corresponding mean and variance. • Goal: decide which point should be selected next to reach the maximizer of the function faster. • There are different acquisition criteria (also called acquisition functions or policies).
Policies • Maximum Mean (MM). • Maximum Upper Interval (MUI). • Maximum Probability of Improvement (MPI). • Maximum Expected Improvement (MEI).
Policies: Maximum Mean (MM) • Returns the point with the highest expected value. • Advantage: • If the model is stable and has been learned well, it performs very well. • Disadvantage: • There is a high chance of getting stuck in a local optimum (it only exploits). • Is it guaranteed to converge to the global optimum eventually? • No.
Policies: Maximum Upper Interval (MUI) • Returns the point with the highest upper bound of the 95% confidence interval. • Advantage: • Combines mean and variance (exploitation and exploration). • Disadvantage: • Dominated by the variance, so it mainly explores the input space. • Can it converge to the global optimum eventually? • Yes. • But it needs an almost infinite number of samples.
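The contrast between Maximum Mean and Maximum Upper Interval shows up on a toy set of candidates. The sketch assumes the "95% upper interval" is the upper end of a two-sided 95% Gaussian interval, mean + 1.96 * std; the numbers are made up for illustration.

```python
import numpy as np

def maximum_mean(mean, std):
    """MM: index of the candidate with the highest predicted mean."""
    return int(np.argmax(mean))

def maximum_upper_interval(mean, std, z=1.96):
    """MUI: index of the candidate with the highest upper 95% bound."""
    return int(np.argmax(mean + z * std))

# Three candidates: two well-explored, one barely explored.
mean = np.array([1.0, 0.9, 0.2])
std = np.array([0.05, 0.1, 1.0])

mm_pick = maximum_mean(mean, std)              # exploits the best-known point
mui_pick = maximum_upper_interval(mean, std)   # drawn to the uncertain point
```

MM picks the first candidate, while MUI picks the third because its large standard deviation dominates the score, which is the exploration bias the slide describes.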
Policies: Maximum Probability of Improvement (MPI) • Selects the sample with the highest probability of improving on the current best observation (ymax) by some margin m.
Policies: Maximum Probability of Improvement (MPI) • Advantage: • Considers the mean, the variance, and ymax in the policy (smarter than MUI). • Disadvantage: • Depends on an ad-hoc parameter m. • Large value of m? • Exploration • Small value of m? • Exploitation
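The effect of the margin m can be seen in a small example. This sketch assumes an additive margin, i.e. the policy scores each candidate by P(f(x) >= ymax + m) under its Gaussian prediction; the slide does not show the formula, so that form and the toy numbers are assumptions.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mean, std, y_max, m):
    """P(f(x) >= y_max + m) for each candidate with f(x) ~ N(mean, std^2)."""
    z = (np.asarray(mean) - y_max - m) / np.asarray(std)
    return np.array([normal_cdf(v) for v in z])

# Candidate 0: slightly above y_max and nearly certain.
# Candidate 1: lower mean but very uncertain.
mean = np.array([1.05, 0.6])
std = np.array([0.05, 0.8])
y_max = 1.0

pick_small_m = int(np.argmax(probability_of_improvement(mean, std, y_max, m=0.0)))
pick_large_m = int(np.argmax(probability_of_improvement(mean, std, y_max, m=0.5)))
```

With m = 0 the safe candidate wins (exploitation); with m = 0.5 only the uncertain candidate has a realistic chance of clearing the bar, so it wins (exploration).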
Policies: Maximum Expected Improvement (MEI) • Selects the point with the maximum expected improvement. • Question: Expectation over which variable? • The margin m.
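One way to read "expectation over m" is that the expected improvement equals the probability of improvement integrated over all margins m >= 0. The sketch below checks this numerically against the standard closed form for a Gaussian prediction. The function names, the additive margin, and the test values are assumptions for illustration.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, y_max):
    """Closed-form E[max(0, f - y_max)] for f ~ N(mean, std^2)."""
    z = (mean - y_max) / std
    return (mean - y_max) * normal_cdf(z) + std * normal_pdf(z)

def ei_by_integrating_over_m(mean, std, y_max, m_max=10.0, steps=20000):
    """Trapezoidal integral of P(f >= y_max + m) over m in [0, m_max]."""
    h = m_max / steps
    total = 0.0
    for i in range(steps + 1):
        m = i * h
        p = 1.0 - normal_cdf((y_max + m - mean) / std)
        total += p * (0.5 if i in (0, steps) else 1.0)
    return total * h

closed_form = expected_improvement(0.5, 1.0, 1.0)
integrated = ei_by_integrating_over_m(0.5, 1.0, 1.0)
```

The two values agree to within the integration error, so MEI removes the ad-hoc choice of m that MPI requires by averaging over it.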
Policies: Upper Confidence Bounds (UCB) • Selects based on the mean and variance of each point. • The choice of the trade-off parameter k is left to the user. • Recently, a principled approach to selecting this parameter has been proposed.
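The slide's formula did not survive extraction; the usual form of an upper confidence bound rule is shown here as an assumption, with mu and sigma denoting the surrogate's predictive mean and standard deviation at x.

```latex
x_{\text{next}} = \arg\max_{x} \; \mu(x) + k \, \sigma(x)
```

Under this reading, k = 0 reduces to Maximum Mean, and k of about 1.96 matches a Maximum Upper Interval rule based on a two-sided 95% interval.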
Summary • We introduced several approaches, each of which has advantages and disadvantages. • MM • MUI • MPI • MEI • GP-UCB • Which one should be selected for an unknown model?
GP-Hedge • GP-Hedge (2010) • It selects one of the baseline policies at each step, based on theoretical results from the multi-armed bandit problem, although the objective is a bit different! • The authors show that it can perform better than (or as well as) the best baseline policy in some framework.
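The selection idea can be sketched with a Hedge-style (exponential weights) rule over the baseline policies. This is a simplified illustration, not the published algorithm: the learning rate `eta`, the function names, and the choice to reward each policy with the surrogate mean at its nominated point are all assumptions.

```python
import numpy as np

def hedge_probabilities(gains, eta=1.0):
    """Selection probabilities proportional to exp(eta * cumulative gain)."""
    g = np.asarray(gains, dtype=float)
    w = np.exp(eta * (g - g.max()))   # subtract max for numerical stability
    return w / w.sum()

def hedge_step(gains, nominee_means, rng, eta=1.0):
    """Choose whose nominee to evaluate, then reward every policy.

    nominee_means[i] is assumed to be the surrogate's predicted mean at
    the point nominated by policy i; the caller supplies it.
    """
    probs = hedge_probabilities(gains, eta)
    chosen = int(rng.choice(len(probs), p=probs))
    new_gains = np.asarray(gains, dtype=float) + np.asarray(nominee_means)
    return chosen, new_gains

rng = np.random.default_rng(0)
gains = np.zeros(3)   # one entry per baseline policy, e.g. MPI, MEI, UCB
chosen, gains = hedge_step(gains, [0.2, 0.9, 0.5], rng)
probs_after = hedge_probabilities(gains)
```

After one step the policy whose nominee looked best (index 1 here) receives the largest selection probability, so policies that keep nominating promising points are chosen more often over time.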
Future Work • Method selection smarter than GP-Hedge, with theoretical analysis. • Batch Bayesian optimization. • Scheduling Bayesian optimization.