This paper explores the application of Gaussian Process Optimization in the bandit setting, where the goal is to adaptively choose inputs to maximize an unknown function. It provides theoretical guarantees and convergence rates for the optimization process.
Gaussian Process Optimization in the Bandit Setting: No Regret & Experimental Design
Niranjan Srinivas (Caltech), Andreas Krause (Caltech), Sham Kakade (Wharton), Matthias Seeger (Saarland)
rsrg@caltech ..where theory and practice collide
Optimizing Noisy, Unknown Functions
• Given: set of possible inputs D; black-box access to an unknown function f
• Want: adaptive choice of inputs $x_1, x_2, \dots$ from D maximizing $\sum_t f(x_t)$
• Many applications: robotic control [Lizotte et al. '07], sponsored search [Pandey & Olston '07], clinical trials, ...
• Sampling is expensive
• Algorithms evaluated using "regret": $r_t = f(x^*) - f(x_t)$, where $x^* = \arg\max_{x \in D} f(x)$
• Goal: minimize cumulative regret $R_T = \sum_{t=1}^T r_t$
Running example: Noisy Search
• How to find the hottest point in a building?
• Many noisy sensors available, but sampling is expensive
• D: set of sensors; $f(x)$: temperature at the sensor x chosen at step t; observe $y_t = f(x_t) + \epsilon_t$
• Goal: find $x^* = \arg\max_{x \in D} f(x)$ with a minimal number of queries
Key insight: Exploit correlation
• Temperature is spatially correlated
• Sampling f(x) at one point x yields information about f(x') for points x' near x
• In this paper: model correlation using a Gaussian process (GP) prior for f
Gaussian Processes to model payoff f
• Normal distribution (1-D Gaussian) → multivariate normal (n-D Gaussian) → Gaussian process (∞-D Gaussian)
• Gaussian process (GP) = normal distribution over functions
• Finite marginals are multivariate Gaussians
• Closed-form formulae for Bayesian posterior updates exist
• Parameterized by covariance function K(x,x') = Cov(f(x), f(x'))
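As a concrete illustration of the closed-form posterior update, here is a minimal numpy sketch (not from the paper; the kernel, noise level, and names are illustrative assumptions):

```python
import numpy as np

def se_kernel(X1, X2, h=0.3):
    """Squared exponential kernel K(x, x') = exp(-(x - x')^2 / h^2)."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-d**2 / h**2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.01, h=0.3):
    """Closed-form GP posterior mean and variance at the test points."""
    K = se_kernel(X_train, X_train, h) + noise_var * np.eye(len(X_train))
    K_s = se_kernel(X_train, X_test, h)
    K_ss = se_kernel(X_test, X_test, h)
    alpha = np.linalg.solve(K, y_train)                    # (K + sigma^2 I)^{-1} y
    mu = K_s.T @ alpha                                     # posterior mean
    v = np.linalg.solve(K, K_s)
    var = np.clip(np.diag(K_ss - K_s.T @ v), 0.0, None)   # posterior variance
    return mu, var
```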
Examples of GPs
• Squared exponential kernel: K(x,x') = exp(-(x-x')² / h²)
[Figure: samples from P(f) for bandwidth h = 0.1 and h = 0.3; kernel plotted against distance |x-x'|]
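Such prior samples can be reproduced by drawing from the finite marginal on a grid of inputs; a minimal sketch under the same assumed kernel:

```python
import numpy as np

# Draw GP prior samples on a grid; a smaller bandwidth h gives wigglier functions.
x = np.linspace(-3, 3, 200)
for h in (0.1, 0.3):
    K = np.exp(-(x[:, None] - x[None, :])**2 / h**2)
    K += 1e-8 * np.eye(len(x))   # jitter for numerical stability
    samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
```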
Gaussian process optimization [e.g., Jones et al. '98]
• Goal: adaptively pick inputs $x_1, x_2, \dots$ such that $R_T / T \to 0$ (no regret)
• Key question: how should we pick samples?
• So far, only heuristics:
• Expected Improvement [Močkus et al. '78]
• Most Probable Improvement [Močkus '89]
• Used successfully in machine learning [Ginsbourger et al. '08, Jones '01, Lizotte et al. '07]
• No theoretical guarantees on their regret!
Simple algorithm for GP optimization
In each round t:
• Pick $x_t = \arg\max_{x \in D} \mu_{t-1}(x)$
• Observe $y_t = f(x_t) + \epsilon_t$
• Use Bayes' rule to get the posterior mean $\mu_t$
Can get stuck in local maxima!
Uncertainty sampling
• Pick: $x_t = \arg\max_{x \in D} \sigma_{t-1}(x)$
• That's equivalent to (greedily) maximizing information gain (spelled out below)
• Popular objective in Bayesian experimental design (where the goal is pure exploration of f)
• But... wastes samples by exploring f everywhere!
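Spelling out the equivalence as a worked equation (a restatement in the notation used later in the talk, assuming the standard Gaussian noise model with variance σ²):

```latex
% Information gained by sampling f at a set A of inputs:
F(A) = I(y_A; f) = \tfrac{1}{2}\log\bigl|I + \sigma^{-2} K_A\bigr|,
\qquad K_A = \bigl[K(x,x')\bigr]_{x,x' \in A}.
% Marginal gain of one more sample at x, given the observations so far:
F(A \cup \{x\}) - F(A) = \tfrac{1}{2}\log\bigl(1 + \sigma^{-2}\sigma_{t-1}^2(x)\bigr),
% which is monotone in \sigma_{t-1}(x), so greedy info-gain maximization
% picks x_t = \arg\max_{x} \sigma_{t-1}(x).
```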
Avoiding unnecessary samples
• Key insight: never need to sample where the Upper Confidence Bound (UCB) is below the best lower bound!
[Figure: confidence bands; the best lower bound rules out regions whose UCB falls below it]
Upper Confidence Bound (UCB) Algorithm
• Pick the input that maximizes the Upper Confidence Bound: $x_t = \arg\max_{x \in D} \mu_{t-1}(x) + \beta_t^{1/2}\sigma_{t-1}(x)$ (popular heuristic)
• How should we choose $\beta_t$? Need theory!
• Naturally trades off exploration and exploitation; no samples wasted
• Regret bounds: classic setting [Auer '02] & linear f [Dani et al. '07]
• But none in the GP optimization setting!
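A minimal sketch of the GP-UCB loop, reusing the gp_posterior helper sketched above (the finite candidate set, noisy oracle f_noisy, and defaults are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

def gp_ucb(f_noisy, candidates, T, noise_var=0.01, h=0.3, delta=0.1):
    """Run T rounds of GP-UCB over a finite 1-D candidate set."""
    X, y = [], []
    for t in range(1, T + 1):
        # beta_t = 2 log(|D| t^2 pi^2 / (6 delta)), the Theorem 1 schedule
        beta = 2 * np.log(len(candidates) * t**2 * np.pi**2 / (6 * delta))
        if X:
            mu, var = gp_posterior(np.array(X), np.array(y),
                                   candidates, noise_var, h)
        else:  # no data yet: prior mean 0, prior variance 1
            mu, var = np.zeros(len(candidates)), np.ones(len(candidates))
        x_t = candidates[np.argmax(mu + np.sqrt(beta * var))]
        X.append(x_t)
        y.append(f_noisy(x_t))
    return X
```

For example, `gp_ucb(lambda x: np.sin(3 * x) + 0.1 * np.random.randn(), np.linspace(-3, 3, 100), T=50)` concentrates its later queries near the maximizer of the underlying function.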
How well does UCB work?
[Figure: confidence bands for bandwidth h = 0.1 ("hard") and h = 0.3 ("easy")]
• The quicker the confidence bands collapse, the easier the function is to learn
• Key idea: rate of collapse ↔ growth of information gain
• Intuitively, performance should depend on how "learnable" the function is
Learnability and information gain
• We show that regret bounds depend on how quickly we can gain information
• Mathematically: $\gamma_T = \max_{A \subseteq D, |A| \le T} I(y_A; f)$
• Establishes a novel connection between GP optimization and Bayesian experimental design
Performance of optimistic sampling
Theorem: If we choose $\beta_t = \Theta(\log t)$, then with high probability, $R_T = O^*(\sqrt{T\,\beta_T\,\gamma_T})$
Hereby $\gamma_T = \max_{A \subseteq D, |A| \le T} I(y_A; f)$ is the maximal information gain due to sampling
• The slower $\gamma_T$ grows, the easier f is to learn
• Key question: how quickly does $\gamma_T$ grow?
Learnability and information gain
• Information gain exhibits diminishing returns (submodularity) [Krause & Guestrin '05]
• Our bounds depend on the "rate" of diminishment
[Figure: little diminishing returns vs. returns that diminish fast]
Dealing with high dimensions
Theorem: For various popular kernels, we have:
• Linear: $\gamma_T = O(d \log T)$
• Squared-exponential: $\gamma_T = O((\log T)^{d+1})$
• Matérn with $\nu > 1$: $\gamma_T = O(T^{d(d+1)/(2\nu + d(d+1))} \log T)$
Smoothness of f helps battle the curse of dimensionality!
Our bounds rely on submodularity of the information gain $I(y_A; f)$
What if f is not from a GP?
• In practice, f may not be Gaussian
Theorem: Let f lie in the RKHS of kernel K with $\|f\|_K^2 \le B$, and let the noise be bounded by $\sigma$ almost surely. Choose $\beta_t = 2B + 300\,\gamma_t \log^3(t/\delta)$. Then with high probability, $R_T = O^*(\sqrt{T\,\beta_T\,\gamma_T})$
• Frees us from knowing the "true prior"
• Intuitively, the bound depends on the "complexity" of the function through its RKHS norm
Experiments: UCB vs. heuristics
• Temperature data
• 46 sensors deployed at Intel Research, Berkeley
• Collected data for 5 days (1 sample/minute)
• Want to adaptively find the highest temperature as quickly as possible
• Traffic data
• Speed data from 357 sensors deployed along highway I-880 South
• Collected during 6am-11am, for one month
• Want to find the most congested (lowest speed) area as quickly as possible
Comparison: UCB vs. heuristics
GP-UCB compares favorably with existing heuristics
Conclusions
• First theoretical guarantees and convergence rates for GP optimization
• Both the true-prior and the agnostic case covered
• Performance depends on "learnability", captured by the maximal information gain
• Connects GP bandit optimization & experimental design!
• Performance on real data comparable to other heuristics
Relating the two problems
• Adaptive choice of $x_1, \dots, x_T$ from decision set D
• Min regret: $r_T^* = \min_{t \le T} r_t$
• Easy to see that $r_T^* \le R_T / T$
• We bound the cumulative regret $R_T$; the bound also applies to $r_T^*$
Performance of optimistic sampling
Theorem [Srinivas, Krause, Kakade, Seeger]: Let $\delta \in (0,1)$ and $D \subset [0,r]^d$ be compact and convex. Pick $\beta_t = 2\log(t^2 2\pi^2/(3\delta)) + 2d\log(t^2 d b r \sqrt{\log(4da/\delta)})$. If K satisfies weak regularity conditions, we have $\Pr\{R_T \le \sqrt{C_1 T \beta_T \gamma_T} + 2 \;\;\forall T\} \ge 1 - \delta$, where $C_1$ is independent of T.
$\gamma_T$ is the maximal information gain due to sampling
First performance guarantee for GP optimization! "It pays to be optimistic"
Bounding Information Gain
Theorem: For finite D, submodularity and the greedy algorithm yield $\gamma_T \le \frac{1/2}{1 - e^{-1}} \max_{(m_t)} \sum_t \log(1 + \sigma^{-2} m_t \hat\lambda_t)$, subject to $\sum_t m_t = T$
• $m_t$: # samples; $\hat\lambda_t$: eigenvalues of the covariance matrix
• Greedy algorithm samples eigenvectors
• Maximal info-gain depends on spectral properties of the kernel
• The faster the eigenvalues decay, the better the performance
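To see the spectral dependence concretely, a small sketch (illustrative, not the paper's code) computes the information gain ½ log|I + σ⁻²K| both directly and via the eigenvalues of the Gram matrix:

```python
import numpy as np

# Information gain of sampling every point of a finite set with Gram matrix K:
# F = 1/2 log|I + sigma^{-2} K| = 1/2 sum_i log(1 + sigma^{-2} lambda_i).
sigma2 = 0.01
x = np.linspace(-3, 3, 50)
K = np.exp(-(x[:, None] - x[None, :])**2 / 0.3**2)   # SE kernel, h = 0.3

eigvals = np.linalg.eigvalsh(K)
gain_spectral = 0.5 * np.sum(np.log(1 + eigvals / sigma2))
gain_direct = 0.5 * np.linalg.slogdet(np.eye(len(x)) + K / sigma2)[1]
# The two agree; with fast eigenvalue decay, most terms contribute almost 0,
# so the total gain (and hence gamma_T) grows slowly.
```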
What if f is not from a GP?
• In practice, f may not be Gaussian
Theorem: Let $\delta \in (0,1)$. Assume that f lies in the RKHS of kernel K. Let $\epsilon_t$ be a noise process that has zero mean conditioned on the history and is bounded by $\sigma$ almost surely. Assume $\|f\|_K^2 \le B$ and pick $\beta_t = 2B + 300\,\gamma_t \log^3(t/\delta)$. Then with probability $\ge 1 - \delta$, we have $R_T \le \sqrt{C_1 T \beta_T \gamma_T}$
• Frees us from knowing the "true prior"
• For f ~ GP, $\|f\|_K = \infty$ almost surely; sample paths are rougher than RKHS functions
• Therefore neither theorem subsumes the other
Examples of GPs
• Exponential kernel: K(x,x') = exp(-|x-x'| / h)
[Figure: samples from P(f) for bandwidth h = 0.3 and h = 1; kernel plotted against distance |x-x'|]
Multi-armed bandits
• At each time step, pick arm i; get an independent payoff with probability $p_i$
• Classic model for the exploration-exploitation tradeoff
• Extensively studied (Robbins '52, Gittins '79)
• Typically assumes each arm is tried multiple times
[Figure: k arms with payoff probabilities p1, p2, ..., pk]
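For contrast with the GP setting, a minimal sketch of the classic UCB rule for finitely many independent Bernoulli arms, in the spirit of [Auer '02] (the arm probabilities and constants are illustrative):

```python
import numpy as np

def ucb1(p, T, seed=0):
    """UCB1 on k independent Bernoulli arms with success probabilities p."""
    rng = np.random.default_rng(seed)
    k = len(p)
    counts = np.zeros(k)
    means = np.zeros(k)
    for t in range(T):
        if t < k:
            arm = t                                   # try each arm once first
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)   # optimism bonus
            arm = int(np.argmax(means + bonus))
        reward = float(rng.random() < p[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts
```

Note that the first loop tries every arm once, which is exactly what becomes infeasible on the next slide.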
Infinite-armed bandits
• In many applications, the number of arms is huge (sponsored search, sensor selection)
• Cannot try each arm even once
• Assumptions on the payoff function f are essential
[Figure: infinitely many arms with payoff probabilities p1, p2, ..., p∞]
Thinking about GPs
• Kernel function K(x, x') specifies the covariance
• Encodes smoothness assumptions about f
[Figure: GP sample functions f(x); marginal distribution P(f(x)) at a fixed x]
Examples of GPs
• Linear kernel with features: K(x,x') = Φ(x)ᵀΦ(x')
• E.g., Φ(x) = sin(x)
• E.g., Φ(x) = [0, x, x²]
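A quick sketch of such a feature-based kernel (the feature map is one of the slide's examples; the helper name is illustrative):

```python
import numpy as np

def linear_feature_kernel(x1, x2, phi):
    """K(x, x') = phi(x)^T phi(x') for an explicit feature map phi."""
    return phi(x1) @ phi(x2)

phi = lambda x: np.array([0.0, x, x**2])        # phi(x) = [0, x, x^2]
k_val = linear_feature_kernel(0.5, -1.0, phi)   # covariance of f(0.5) and f(-1.0)
```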
Assumptions on f
• Linear? [Dani et al. '07]: fast convergence, but a strong assumption
• Lipschitz-continuous (bounded slope)? [Kleinberg '08]: very flexible, but convergence is slow (curse of dimensionality)