Tutorial on Gaussian Processes
DAGS ’07, 9/10/07
Jonathan Laserson and Ben Packer
Outline
• Linear Regression
• Bayesian Inference Solution
• Gaussian Processes
• Gaussian Process Solution
• Kernels
• Implications
Linear Regression
• Task: predict y given x
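The equations and plots on the original slide are not reproduced in this transcript; as a hedged reconstruction, the standard setup being described is (notation assumed to match the later slides, with the training inputs as rows of X):

```latex
% Assumed notation: rows of X are the training inputs x_i^T, y stacks the targets y_i.
f_w(x) = w^\top x, \qquad y_i = f_w(x_i) + \varepsilon_i
w_{\mathrm{LS}} \;=\; \arg\min_w \sum_i \bigl(y_i - w^\top x_i\bigr)^2 \;=\; (X^\top X)^{-1} X^\top y
```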
L2-Regularized Linear Regression
• Predicting y given x, now with an L2 penalty on w
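Again the slide equations are missing; the standard ridge objective, and its MAP reading under the w ~ N(0, σ2I) prior and ν2 noise variance used on the later slides, would be:

```latex
% Ridge / L2-regularized least squares; under w ~ N(0, sigma^2 I) and Gaussian noise
% of variance nu^2, the MAP estimate corresponds to lambda = nu^2 / sigma^2.
w_{\mathrm{ridge}} \;=\; \arg\min_w \sum_i \bigl(y_i - w^\top x_i\bigr)^2 + \lambda \lVert w \rVert^2
\;=\; (X^\top X + \lambda I)^{-1} X^\top y
```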
Bayesian Inference Instead of MAP
• Instead of using wMAP = argmaxw P(y,w|X) to predict y*, why not use the entire distribution P(y,w|X) to estimate P(y*|X,y,x*)?
• We have P(y|w,X) and P(w)
• Combine these to get P(y,w|X)
• Marginalize out w to get P(y|X)
• Same as P(y,y*|X,x*)
• Condition the joint Gaussian to get P(y*|y,X,x*)
• Error bars!
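The "joint Gaussian to conditional" step referenced above is the standard Gaussian conditioning identity, stated here for reference (not taken from the slides); the conditional covariance is what supplies the error bars:

```latex
% Gaussian conditioning: y observed, y_* to be predicted, jointly Gaussian.
\begin{bmatrix} y \\ y_* \end{bmatrix}
\sim \mathcal{N}\!\left(
  \begin{bmatrix} \mu \\ \mu_* \end{bmatrix},
  \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix}
\right)
\;\Longrightarrow\;
y_* \mid y \;\sim\; \mathcal{N}\!\bigl(\mu_* + C^\top A^{-1}(y - \mu),\; B - C^\top A^{-1} C\bigr)
```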
Gaussian Process
• We saw a distribution over y directly. Why not start from here?
• Instead of choosing a prior over w and defining fw(x), put your prior over f directly
• Since y = f(x) + noise, this induces a prior over y
• Next: how to put a prior on f(x)
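To make "this induces a prior over y" concrete, the usual noise model (written with the ν2 noise variance that appears on the later slides, and zero mean for simplicity) is:

```latex
% i.i.d. Gaussian observation noise on top of the latent function f:
y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \nu^2)
\;\Longrightarrow\;
(y_1, \ldots, y_k) \sim \mathcal{N}\bigl(0,\; K + \nu^2 I\bigr), \qquad K_{ij} = C(x_i, x_j)
```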
What is a random process?
• It’s a prior over functions
• A stochastic process is a collection of random variables, f(x), indexed by x
• It is specified by giving the joint probability of every finite subset of variables f(x1), f(x2), …, f(xk)
• In a consistent way!
What is a Gaussian process?
• A Gaussian process is a stochastic process in which the joint distribution of every finite subset f(x1), f(x2), …, f(xk) is a multivariate Gaussian
• So it is enough to specify a mean function and a covariance function:
• μ(x) = E[f(x)]
• C(x,x’) = E[ (f(x) − μ(x)) (f(x’) − μ(x’)) ]
• f(x1), …, f(xk) ~ N( [μ(x1) … μ(xk)], K ), where Ki,j = C(xi, xj)
• For simplicity, we’ll assume μ(x) = 0
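A minimal sketch (not from the slides) of what specifying a GP through μ and C means operationally: pick a finite set of inputs, build K from C, and sample the corresponding multivariate Gaussian. Function and parameter names here are illustrative.

```python
# Draw a few functions from a zero-mean GP prior with a Gaussian (squared-exponential)
# covariance, evaluated on a finite grid of inputs.
import numpy as np

def sq_exp_cov(x1, x2, length_scale=1.0):
    """C(x, x') = exp(-0.5 * (x - x')^2 / length_scale^2) for 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

xs = np.linspace(-5, 5, 200)                      # finite subset of input locations
K = sq_exp_cov(xs, xs) + 1e-8 * np.eye(len(xs))   # jitter for numerical stability
samples = np.random.multivariate_normal(np.zeros(len(xs)), K, size=3)
# Each row of `samples` is one draw of (f(x1), ..., f(xk)) from the prior.
```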
Back to Linear Regression
• Recall: we want to put a prior directly on f
• A Gaussian process can do this. How do we choose μ and C?
• Use our knowledge of the prior over w: w ~ N(0, σ2I)
• μ(x) = E[f(x)] = E[wTx] = E[wT]x = 0
• C(x,x’) = E[ (f(x) − μ(x)) (f(x’) − μ(x’)) ] = E[f(x)f(x’)] = xTE[wwT]x’ = xT(σ2I)x’ = σ2xTx’
• We can also use a feature map: f(x) = wTΦ(x)
Back to Linear Regression
• μ(x) = 0, C(x,x’) = σ2xTx’, f ~ GP(μ, C)
• It follows that
• f(x1), f(x2), …, f(xk) ~ N(0, K)
• y1, y2, …, yk ~ N(0, ν2I + K), where K = σ2XXT
• Same predictions as the L2-regularized least squares solution!
• If we use a different C, we’ll have a different K
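As a sketch of how the prediction is actually computed (conditioning y* on y in the joint Gaussian, with zero mean), something along these lines works for any covariance function; the function name and the nu2 default are illustrative, not from the slides.

```python
# GP regression: posterior mean and variance of f* at test inputs Xstar.
import numpy as np

def gp_predict(X, y, Xstar, kernel, nu2=0.1):
    K = kernel(X, X) + nu2 * np.eye(len(X))     # cov of observed y: K + nu^2 I
    Ks = kernel(X, Xstar)                       # cross-covariance, n x m
    Kss = kernel(Xstar, Xstar)                  # test covariance, m x m
    alpha = np.linalg.solve(K, y)
    mean = Ks.T @ alpha                         # k_*^T (K + nu^2 I)^{-1} y
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)   # k_** - k_*^T (K + nu^2 I)^{-1} k_*
    return mean, np.diag(cov)                   # diagonal variances give error bars
```

With the linear covariance kernel(X1, X2) = sigma2 * X1 @ X2.T this reproduces the regularized linear-regression predictions; swapping in a different kernel changes K and hence the predictor, which is the point of the next slide.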
Kernels
• If we use a different C, we’ll have a different K
• What do these look like?
• Linear: C(x,x’) = σ2xTx’
• Polynomial: C(x,x’) = (1 + xTx’)2
• Gaussian: C(x,x’) = exp{ −0.5 (x − x’)2 }
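Illustrative NumPy versions of the three covariance functions listed above, with inputs stored as rows of X1 and X2 (the plots from the original slides are not reproduced):

```python
import numpy as np

def linear_kernel(X1, X2, sigma2=1.0):
    return sigma2 * X1 @ X2.T                    # C(x,x') = sigma^2 x^T x'

def poly_kernel(X1, X2, degree=2):
    return (1.0 + X1 @ X2.T) ** degree           # C(x,x') = (1 + x^T x')^2

def gaussian_kernel(X1, X2):
    # squared Euclidean distance generalizes the 1-D (x - x')^2 on the slide
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2)                     # C(x,x') = exp(-0.5 ||x - x'||^2)
```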
Learning a kernel
• Parameterize a family of kernel functions using θ
• Learn θ (and hence K) using the gradient of the likelihood
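The gradient-based learning alluded to here typically maximizes the log marginal likelihood of the data; the standard expressions for a zero-mean GP, with K_y = K_θ + ν2I, are:

```latex
% Log marginal likelihood and its gradient with respect to a kernel hyperparameter theta_j:
\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top K_y^{-1} y \;-\; \tfrac{1}{2}\log\lvert K_y\rvert \;-\; \tfrac{n}{2}\log 2\pi
\frac{\partial}{\partial \theta_j}\log p(y \mid X, \theta)
  = \tfrac{1}{2}\, y^\top K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} y
  \;-\; \tfrac{1}{2}\operatorname{tr}\!\Bigl(K_y^{-1}\frac{\partial K_y}{\partial \theta_j}\Bigr)
```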
Starting point
• For details, see:
• Rasmussen’s NIPS 2006 tutorial: http://www.kyb.mpg.de/bs/people/carl/gpnt06.pdf
• Williams’s Gaussian Processes paper: http://www.dai.ed.ac.uk/homes/ckiw/postscript/hbtnn.ps.gz
• Further topics: GPs for classification (approximation), sparse methods, connection to SVMs