Bayesian Optimization (BO) Javad Azimi Fall 2010 http://web.engr.oregonstate.edu/~azimi/
Outline • Formal Definition • Application • Bayesian Optimization Steps • Surrogate Function (Gaussian Process) • Acquisition Function • PMAX • IEMAX • MPI • MEI • UCB • GP-Hedge
Formal Definition • Input: an expensive-to-evaluate black-box function f: X → R over a bounded input space X (each evaluation is a costly experiment). • Goal: find the maximizer x* = argmax_{x ∈ X} f(x) using as few evaluations as possible.
Fuel Cell Application • This is how a microbial fuel cell (MFC) works: bacteria oxidize fuel (organic matter) at the anode, releasing electrons and H+; oxygen is reduced to H2O at the cathode, with CO2 as the oxidation product. • [Figure: MFC schematic; SEM image of bacteria on Ni-nanoparticle-enhanced carbon fibers.] • The nano-structure of the anode significantly impacts electricity production. • We should optimize the anode nano-structure to maximize power by selecting a set of experiments.
Big Picture • Since running an experiment is very expensive, we use BO. • Select one experiment to run at a time, based on the results of previous experiments, as in the sketch below. • [Diagram: loop — current experiments → our current model → select single experiment → run experiment → back to current experiments.]
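This loop can be written as a short Python sketch. This is not code from the slides; `fit_surrogate`, `acquisition`, and `run_experiment` are hypothetical stand-ins for the GP model, the selection policy, and the expensive experiment, and are passed in as callables.

```python
import numpy as np

def bayesian_optimization(run_experiment, fit_surrogate, acquisition,
                          candidates, X_init, y_init, n_iterations):
    """Sequential BO: fit the surrogate, pick one experiment, run it, repeat."""
    X, y = list(X_init), list(y_init)
    for _ in range(n_iterations):
        model = fit_surrogate(X, y)                           # posterior over the unknown f
        scores = [acquisition(model, x) for x in candidates]  # score each unobserved point
        x_next = candidates[int(np.argmax(scores))]           # single experiment to run
        X.append(x_next)
        y.append(run_experiment(x_next))                      # expensive evaluation
    best = int(np.argmax(y))
    return X[best], y[best]
```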
BO Main Steps • Surrogate Function (Response Surface, Model) • Builds a posterior over unobserved points based on the prior. • Its parameters might be based on the prior; remember, it is a BAYESIAN approach. • Acquisition Criterion (Function) • Decides which sample should be selected next.
Surrogate Function • Simulates the unknown function's distribution based on the prior. • Deterministic (classical linear regression, …) • There is a deterministic prediction for each point x in the input space. • Stochastic (Bayesian regression, Gaussian process, …) • There is a distribution over the prediction for each point x in the input space (e.g., a normal distribution). • Example • Deterministic: f(x1)=y1, f(x2)=y2 • Stochastic: f(x1)=N(y1,2), f(x2)=N(y2,5)
Gaussian Process (GP) • A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. • Consistency requirement, or marginalization property. • Marginalization property: if (y1, y2) ~ N(μ, Σ), then y1 ~ N(μ1, Σ11), where μ1 and Σ11 are the corresponding blocks of μ and Σ.
Gaussian Process (GP) • Formal prediction: given observations (X, y) and a test point x*, the posterior is f(x*) ~ N(μ(x*), σ²(x*)) with μ(x*) = k*ᵀK⁻¹y and σ²(x*) = k(x*, x*) − k*ᵀK⁻¹k*, where K is the covariance matrix of the training inputs and k* is the vector of covariances between x* and the training inputs. • Interesting points: • The squared-exponential covariance function corresponds to Bayesian linear regression with an infinite number of basis functions. • The variance is independent of the observed values. • The mean is a linear combination of the observed values. • If the covariance function specifies the entries of the covariance matrix, the marginalization property is satisfied!
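A minimal numpy sketch of these prediction equations, assuming a squared-exponential kernel and a small noise term added for numerical stability (the length-scale value is illustrative, not from the slides):

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance: k(a, b) = exp(-||a - b||^2 / (2 * l^2))."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise=1e-6):
    """GP posterior mean and variance at X_test given observations (X_train, y_train)."""
    K = sq_exp_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_train, X_test)
    K_ss = sq_exp_kernel(X_test, X_test)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train    # a linear combination of the observed values
    cov = K_ss - K_s.T @ K_inv @ K_s  # does not depend on the observed values y
    return mean, np.diag(cov)
```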
Gaussian Process (GP) • A Gaussian process is: • An exact interpolating regression method. • It predicts the (noise-free) training data perfectly (not true in classical regression). • A natural generalization of linear regression. • A nonlinear regression approach! • A simple example of a GP can be obtained from Bayesian regression. • Identical results. • It specifies a distribution over functions.
Gaussian Process (2): Distribution over Functions • [Figure: three functions sampled from the GP, with the 95% confidence interval shown for each point x.] • See the sketch below for how such samples are drawn.
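To illustrate the "distribution over functions" view, the full posterior covariance can be used to draw whole sample functions and a 95% band. The training points below are made up for illustration; `sq_exp_kernel` is assumed from the earlier sketch.

```python
import numpy as np

X_train = np.array([[-2.0], [0.0], [1.5]])        # illustrative observations
y_train = np.array([-1.0, 0.5, 1.0])
X_test = np.linspace(-3.0, 3.0, 100).reshape(-1, 1)

K = sq_exp_kernel(X_train, X_train) + 1e-6 * np.eye(len(X_train))
K_s = sq_exp_kernel(X_train, X_test)
K_ss = sq_exp_kernel(X_test, X_test)
K_inv = np.linalg.inv(K)
mean = K_s.T @ K_inv @ y_train
cov = K_ss - K_s.T @ K_inv @ K_s

samples = np.random.default_rng(0).multivariate_normal(mean, cov, size=3)  # three sampled functions
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
lower, upper = mean - 1.96 * std, mean + 1.96 * std                        # 95% confidence interval
```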
Gaussian Process (2): GP vs. Bayesian Regression • Bayesian regression: • Distribution over weights. • The prior is defined over the weights. • Gaussian process: • Distribution over functions. • The prior is defined over the function space. • These are the same model, viewed from different perspectives.
Short Summary • Given any unobserved point z, we can define a normal distribution over its predicted value such that: • Its mean is a linear combination of the observed values. • Its variance is related to its distance from the observed data (the closer to the observed data, the smaller the variance).
BO Main Steps • Surrogate Function (Response Surface, Model) • Builds a posterior over unobserved points based on the prior. • Its parameters might be based on the prior; remember, it is a BAYESIAN approach. • Acquisition Criterion (Function) • Decides which sample should be selected next.
Bayesian Optimization: Acquisition Criterion • Remember: we are looking for x* = argmax_{x ∈ X} f(x). • Input: • The set of observed data. • A set of candidate points with their corresponding means and variances. • Goal: decide which point should be selected next to reach the maximizer of the function faster. • Different acquisition criteria (acquisition functions, or policies).
Policies • Maximum Mean (MM). • Maximum Upper Interval (MUI). • Maximum Probability of Improvement (MPI). • Maximum Expected Improvement (MEI).
Policies: Maximum Mean (MM) • Returns the point with the highest expected value. • Advantage: • If the model is stable and has been learned well, it performs very well. • Disadvantage: • There is a high chance of getting stuck in a local optimum (pure exploitation). • Can it eventually converge to the global optimum? • No.
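As a sketch, MM over a discrete candidate set is just an arg-max of the posterior means; `mu` and `candidates` are assumed to come from the surrogate and the experimental design, respectively.

```python
import numpy as np

def maximum_mean(mu, candidates):
    """MM: pick the candidate with the highest posterior mean (pure exploitation)."""
    return candidates[int(np.argmax(mu))]
```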
Policies: Maximum Upper Interval (MUI) • Returns the point with the highest 95% upper confidence limit. • Advantage: • Combines mean and variance (exploitation and exploration). • Disadvantage: • Dominated by the variance, so it mainly explores the input space. • Can it eventually converge to the global optimum? • Yes. • But it may need an almost infinite number of samples.
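MUI replaces the mean with the 95% upper limit of each candidate's predictive distribution; a sketch with the same assumed `mu`, `sigma`, and `candidates` arrays:

```python
import numpy as np

def maximum_upper_interval(mu, sigma, candidates):
    """MUI: pick the candidate with the highest 95% upper confidence limit."""
    return candidates[int(np.argmax(mu + 1.96 * sigma))]
```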
Policies: Maximum Probability of Improvement (MPI) • Selects the sample with the highest probability of improving on the current best observation (ymax) by some margin m: MPI(x) = P(f(x) ≥ ymax + m) = Φ((μ(x) − ymax − m) / σ(x)).
Policies: Maximum Probability of Improvement (MPI) • Advantage: • Considers the mean, the variance, and ymax in the policy (smarter than MUI). • Disadvantage: • The ad-hoc parameter m. • Large value of m? • Exploration. • Small value of m? • Exploitation.
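A sketch of MPI using the formula above; the default margin m is an illustrative value, not one recommended by the slides.

```python
import numpy as np
from scipy.stats import norm

def maximum_probability_of_improvement(mu, sigma, candidates, y_max, m=0.1):
    """MPI: probability of improving on y_max by at least margin m."""
    pi = norm.cdf((mu - (y_max + m)) / np.maximum(sigma, 1e-12))
    return candidates[int(np.argmax(pi))]
```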
Policies: Maximum Expected Improvement (MEI) • Maximum expected improvement: EI(x) = E[max(0, f(x) − ymax)]. • Question: Expectation over which variable? • The improvement margin m (equivalently, over the GP posterior at x), since EI(x) = ∫₀^∞ P(f(x) ≥ ymax + m) dm.
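A sketch of MEI using the closed-form expected improvement for a Gaussian posterior (the standard formula, stated here rather than taken from the slides):

```python
import numpy as np
from scipy.stats import norm

def maximum_expected_improvement(mu, sigma, candidates, y_max):
    """MEI: expected improvement over y_max, i.e. MPI integrated over all margins m."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y_max) / sigma
    ei = (mu - y_max) * norm.cdf(z) + sigma * norm.pdf(z)
    return candidates[int(np.argmax(ei))]
```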
Policies: Upper Confidence Bound (UCB) • Selects based on the mean and variance of each point: UCB(x) = μ(x) + κσ(x). • The selection of κ is left to the user. • Recently, a principled approach to selecting this parameter has been proposed (GP-UCB).
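A sketch of the UCB rule; the default kappa is an arbitrary illustrative choice (GP-UCB derives a schedule for it):

```python
import numpy as np

def upper_confidence_bound(mu, sigma, candidates, kappa=2.0):
    """UCB: trade off exploitation (mu) and exploration (sigma) with weight kappa."""
    return candidates[int(np.argmax(mu + kappa * sigma))]
```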
Summary • We introduced several approaches, each of which has advantages and disadvantages: • MM • MUI • MPI • MEI • GP-UCB • Which one should be selected for an unknown model?
GP-Hedge • GP-Hedge (2010). • It selects one of the baseline policies at each step, based on theoretical results for the multi-armed bandit problem, although the objective is a bit different! • The authors show that it can perform better than (or as well as) the best baseline policy in some settings.
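A simplified sketch of the hedging idea over a discrete candidate grid. This is not the published GP-Hedge algorithm verbatim: `score_fns` are acquisition scores like those sketched above, and the gain update (posterior mean at each policy's own nominee, using the current `mu`) is a simplified reading of the paper's reward.

```python
import numpy as np

def gp_hedge_round(score_fns, gains, mu, sigma, eta=1.0, rng=np.random.default_rng(0)):
    """One hedge round: each policy nominates its arg-max candidate, one policy is
    drawn with probability proportional to exp(eta * gain), and its nominee becomes
    the next experiment. Returns the chosen candidate index and the updated gains."""
    nominees = [int(np.argmax(f(mu, sigma))) for f in score_fns]
    logits = eta * (gains - np.max(gains))           # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    chosen_policy = int(rng.choice(len(score_fns), p=probs))
    new_gains = gains + mu[nominees]                 # reward each policy at its own nominee
    return nominees[chosen_policy], new_gains
```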
Future Work • Method selection smarter than GP-Hedge, with theoretical analysis. • Batch Bayesian optimization. • Scheduling Bayesian optimization.