PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 3: LINEAR MODELS FOR REGRESSION
Outline Discuss tutorial. Regression Examples. The Gaussian distribution. Linear Regression. Maximum Likelihood estimation. Polynomial Curve Fitting. Academia Example.
Academia Example Predict: the final percentage mark for a student. Features: 6 assignment grades, midterm exam, final exam, project, age. Questions we could ask: I forgot the weights of the components. Can you recover them from a spreadsheet of the final grades? I lost the final exam grades. How well can I still predict the final mark? How important is each component, actually? How well could I guess someone's final mark from their assignments alone? From their exams alone?
Central Limit Theorem The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. Example: N uniform [0,1] random variables.
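A small NumPy/Matplotlib sketch (not part of the slides) of the uniform example: averaging N uniform [0,1] variables and plotting the histogram shows the distribution becoming increasingly Gaussian as N grows.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative demo: the mean of N uniform [0,1] variables approaches a Gaussian.
rng = np.random.default_rng(0)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, N in zip(axes, [1, 2, 10]):
    means = rng.uniform(0.0, 1.0, size=(100_000, N)).mean(axis=1)
    ax.hist(means, bins=50, density=True)
    ax.set_title(f"mean of N = {N} uniforms")
plt.tight_layout()
plt.show()
```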
Reading exponential probability formulas In an infinite space we cannot give every outcome the same probability: the sum Σ_x p(x) would grow to infinity. Instead, use an exponentially decaying form, e.g. p(n) = (1/2)^n. Suppose there is a relevant feature f(x) and I want to express that "the greater f(x) is, the less probable x is". Use p(x) ∝ exp(−f(x)).
Example: exponential form, sample size Fair coin: the larger the sample size n, the less likely any particular sample is: p(n) = 2^(−n). [Plot: ln p(n) decreases linearly with sample size n.]
Exponential Form: Gaussian mean The further x is from the mean μ, the less likely it is. [Plot: ln p(x) decreases linearly with (x − μ)².]
Smaller variance decreases probability The smaller the variance σ², the less likely an x away from the mean is. Equivalently: the greater the precision β = 1/σ², the less likely such an x is. [Plot: ln p(x) decreases linearly with the precision β = 1/σ².]
Minimal energy = max probability The greater the energy E(x) of the joint state, the less probable the state is. [Plot: ln p(x) decreases linearly with the energy E(x).]
Linear Basis Function Models (1) Generally, y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = wᵀφ(x), where the φ_j(x) are known as basis functions. Typically φ_0(x) = 1, so that w_0 acts as a bias. In the simplest case, we use linear basis functions: φ_d(x) = x_d.
Linear Basis Function Models (2) Polynomial basis functions: φ_j(x) = x^j. These are global; a small change in x affects all basis functions.
Linear Basis Function Models (3) Gaussian basis functions: φ_j(x) = exp(−(x − μ_j)² / (2s²)). These are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width). Related to kernel methods.
Linear Basis Function Models (4) Sigmoidal basis functions: φ_j(x) = σ((x − μ_j)/s), where σ(a) = 1/(1 + exp(−a)). These are also local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (slope).
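The three families above can be summarized in code. The following is an illustrative NumPy sketch (the function names and the bias-column convention are mine, not from the slides) that builds an N×M design matrix Φ for polynomial, Gaussian, and sigmoidal basis functions on scalar inputs.

```python
import numpy as np

def polynomial_design(x, M):
    # phi_j(x) = x**j for j = 0..M-1; the j = 0 column of ones acts as the bias.
    return np.vander(x, M, increasing=True)

def gaussian_design(x, centres, s):
    # phi_j(x) = exp(-(x - mu_j)**2 / (2 s**2)), plus a bias column of ones.
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

def sigmoid_design(x, centres, s):
    # phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a)), plus a bias column.
    a = (x[:, None] - centres[None, :]) / s
    return np.hstack([np.ones((len(x), 1)), 1.0 / (1.0 + np.exp(-a))])

x = np.linspace(0.0, 1.0, 25)
Phi = gaussian_design(x, centres=np.linspace(0.0, 1.0, 9), s=0.1)
print(Phi.shape)  # (25, 10): bias column + 9 Gaussian basis functions
```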
Maximum Likelihood and Least Squares (1) Assume observations come from a deterministic function with added Gaussian noise: t = y(x, w) + ε with ε ~ N(0, β⁻¹), which is the same as saying p(t | x, w, β) = N(t | y(x, w), β⁻¹). Given observed inputs X = {x_1, …, x_N} and targets t = (t_1, …, t_N)ᵀ, we obtain the likelihood function p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | wᵀφ(x_n), β⁻¹).
Maximum Likelihood and Least Squares (2) Taking the logarithm, we get ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w), where E_D(w) = (1/2) Σ_{n=1}^{N} (t_n − wᵀφ(x_n))² is the sum-of-squares error.
Maximum Likelihood and Least Squares (3) Computing the gradient and setting it to zero yields ∇_w ln p(t | w, β) = β Σ_{n=1}^{N} (t_n − wᵀφ(x_n)) φ(x_n)ᵀ = 0. Solving for w, we get w_ML = (ΦᵀΦ)⁻¹Φᵀt = Φ†t, where Φ is the N×M design matrix with entries Φ_{nj} = φ_j(x_n) and Φ† = (ΦᵀΦ)⁻¹Φᵀ is the Moore-Penrose pseudo-inverse.
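As an illustration, here is a hedged NumPy sketch of this closed-form solution on synthetic sinusoidal data (the data, the 9 Gaussian centres, and the width s = 0.1 are assumptions made for the example, not taken from the slides).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)  # noisy sinusoidal targets

# Design matrix: bias column + 9 Gaussian basis functions on [0, 1].
centres = np.linspace(0.0, 1.0, 9)
Phi = np.hstack([np.ones((len(x), 1)),
                 np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * 0.1 ** 2))])

# w_ML = (Phi^T Phi)^{-1} Phi^T t via the Moore-Penrose pseudo-inverse;
# np.linalg.lstsq is the numerically preferred equivalent.
w_ml = np.linalg.pinv(Phi) @ t
w_ml_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML noise precision: 1/beta_ML is the mean squared residual.
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
```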
Linear Algebra/Geometry of Least Squares Consider t as a vector in an N-dimensional space. The M-dimensional subspace S is spanned by the vectors φ_j = (φ_j(x_1), …, φ_j(x_N))ᵀ, the columns of Φ. The least-squares prediction y = Φw_ML is the orthogonal projection of t onto S, so w_ML minimizes the distance between t and y.
Maximum Likelihood and Least Squares (4) Maximizing with respect to the bias, w_0, alone, we see that w_0 = t̄ − Σ_{j=1}^{M−1} w_j φ̄_j, where t̄ = (1/N) Σ_n t_n and φ̄_j = (1/N) Σ_n φ_j(x_n): the bias compensates for the difference between the mean target and the weighted sum of the mean basis-function values. We can also maximize with respect to β, giving 1/β_ML = (1/N) Σ_{n=1}^{N} (t_n − w_MLᵀφ(x_n))².
Over-fitting Root-Mean-Square (RMS) error: E_RMS = √(2E(w*)/N). The division by N allows errors on data sets of different sizes to be compared, and the square root puts the error on the same scale as the target variable t.
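A one-function sketch of this formula, assuming a design matrix Phi, target vector t, and fitted weights w as in the earlier examples (names are mine).

```python
import numpy as np

def rms_error(Phi, t, w):
    # E_RMS = sqrt(2 E(w) / N), with E(w) the sum-of-squares error.
    residual = t - Phi @ w
    E = 0.5 * residual @ residual
    return np.sqrt(2.0 * E / len(t))
```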
Data Set Size: 9th Order Polynomial [Two figure slides: the same 9th-order polynomial fit to increasingly large data sets; for a given model complexity, over-fitting becomes less severe as the data set size grows.]
Quadratic Regularization Penalize large coefficient values by minimizing Ẽ(w) = (1/2) Σ_{n=1}^{N} (t_n − wᵀφ(x_n))² + (λ/2) ‖w‖².
Regularized Least Squares (1) Consider the error function E_D(w) + λE_W(w): a data term plus a regularization term, where λ is called the regularization coefficient. With the sum-of-squares error function and a quadratic regularizer E_W(w) = (1/2)wᵀw, we get (1/2) Σ_{n=1}^{N} (t_n − wᵀφ(x_n))² + (λ/2) wᵀw, which is minimized by w = (λI + ΦᵀΦ)⁻¹Φᵀt.
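A minimal NumPy sketch of this closed-form solution (the function name ridge_fit is mine; lam denotes λ).

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    # w = (lam * I + Phi^T Phi)^{-1} Phi^T t; solve() is preferable to an explicit inverse.
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)
```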
Regularized Least Squares (2) With a more general regularizer, we have (1/2) Σ_{n=1}^{N} (t_n − wᵀφ(x_n))² + (λ/2) Σ_{j=1}^{M} |w_j|^q. The case q = 1 is known as the lasso; q = 2 is the quadratic regularizer.
Regularized Least Squares (3) Lasso tends to generate sparser solutions than a quadratic regularizer.
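A quick way to see this empirically is to fit both regularizers on synthetic data with only a few relevant features. The sketch below uses scikit-learn's Lasso and Ridge estimators as a convenience; the library, the data, and the regularization strength alpha = 0.1 are assumptions of this example, not part of the slides.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
t = X[:, 0] - 2 * X[:, 1] + rng.normal(0.0, 0.1, size=100)  # only 2 relevant features

lasso = Lasso(alpha=0.1).fit(X, t)
ridge = Ridge(alpha=0.1).fit(X, t)
print("lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))  # typically few
print("ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))  # typically all 20
```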
Bayesian Linear Regression (1) Define a conjugate shrinkage prior over the weight vector w: p(w | α) = N(w | 0, α⁻¹I). Combining this with the likelihood function and using the results for marginal and conditional Gaussian distributions gives a Gaussian posterior over w. The log of the posterior is, up to an additive constant, the negative of a sum-of-squares error term plus a quadratic regularization term with λ = α/β.
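A minimal sketch of the resulting Gaussian posterior N(w | m_N, S_N), using the standard closed forms S_N⁻¹ = αI + βΦᵀΦ and m_N = βS_NΦᵀt (the function and variable names are mine).

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    # Gaussian posterior over w given the design matrix Phi and targets t:
    #   S_N^{-1} = alpha * I + beta * Phi^T Phi,   m_N = beta * S_N Phi^T t.
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```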
Bayesian Linear Regression (3) 0 data points observed. [Figure: prior over w and sample functions in data space.]
Bayesian Linear Regression (4) 1 data point observed. [Figure: likelihood, posterior, and sample functions in data space.]
Bayesian Linear Regression (5) 2 data points observed. [Figure: likelihood, posterior, and sample functions in data space.]
Bayesian Linear Regression (6) 20 data points observed. [Figure: likelihood, posterior, and sample functions in data space.]
Predictive Distribution (1) Predict t for new values of x by integrating over w: p(t | x, D) = ∫ p(t | x, w, β) p(w | D) dw, where D is the observed data. This can be solved analytically, giving p(t | x, D) = N(t | m_Nᵀφ(x), σ_N²(x)) with σ_N²(x) = 1/β + φ(x)ᵀS_Nφ(x).
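A small sketch of these two formulas, assuming m_N and S_N come from a posterior computation like the one above and Phi_new stacks φ(x) row-wise for the query points (names are mine).

```python
import numpy as np

def predictive(Phi_new, m_N, S_N, beta):
    # Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x), per query row.
    mean = Phi_new @ m_N
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, S_N, Phi_new)
    return mean, var
```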
Predictive Distribution (2) Example: Sinusoidal data, 9 Gaussian basis functions, 1 data point
Predictive Distribution (3) Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points
Predictive Distribution (4) Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points
Predictive Distribution (5) Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points
Limitations of Fixed Basis Functions Using M basis functions along each dimension of a D-dimensional input space requires M^D basis functions: the curse of dimensionality. In later chapters, we shall see how we can get away with fewer basis functions by choosing them using the training data.