Parameter Estimation: Bayesian Estimation
Chapter 3 (Duda et al.) – Sections 3.3-3.7
CS479/679 Pattern Recognition
Dr. George Bebis
Parameter Estimation: Main Methods
• Maximum Likelihood (ML)
  • Views the parameters θ as quantities whose values are fixed but unknown.
  • Estimates θ by maximizing the likelihood p(D/θ) of obtaining the samples observed.
• Bayesian Estimation (BE)
  • Views the parameters θ as random variables having some known prior distribution p(θ).
  • Observing new samples D converts the prior p(θ) into a posterior density p(θ/D) (i.e., the samples D revise our estimate of the parameters).
Parameter Estimation: Main Methods (cont’d)
• Before we observe the data, the parameters are described by a prior density p(θ).
• Once we obtain data D, we use Bayes’ theorem to find the posterior p(θ/D).
• Ideally we want the data to sharpen the posterior p(θ/D), that is, to reduce our uncertainty about the parameters.
[Figure: likelihood p(D/θ) and posterior p(θ/D) curves, with the posterior becoming sharper as data are observed]
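A minimal numerical sketch of this sharpening effect, assuming a Gaussian likelihood with unit variance, a broad Gaussian prior over θ, and hypothetical samples; all names and values are illustrative:

```python
import numpy as np

# Grid of candidate values for the unknown parameter theta (here: a mean).
theta = np.linspace(-5, 5, 1001)
dtheta = theta[1] - theta[0]

# Broad prior p(theta): zero-mean Gaussian with std 3 (illustrative choice).
prior = np.exp(-0.5 * (theta / 3.0) ** 2)
prior /= prior.sum() * dtheta                        # normalize to a density on the grid

# Hypothetical samples from N(1.5, 1) -- illustrative data only.
rng = np.random.default_rng(0)
D = rng.normal(1.5, 1.0, size=50)

for n in (1, 5, 50):
    # Likelihood p(D/theta) of the first n samples for each candidate theta.
    lik = np.exp(-0.5 * (D[:n, None] - theta[None, :]) ** 2).prod(axis=0)
    post = prior * lik                               # Bayes' theorem (unnormalized)
    post /= post.sum() * dtheta                      # posterior p(theta/D)
    mean = (theta * post).sum() * dtheta
    std = np.sqrt(((theta - mean) ** 2 * post).sum() * dtheta)
    print(f"n = {n:2d}: posterior mean {mean:.2f}, posterior std {std:.3f}")
```

The printed posterior standard deviation shrinks as n grows, which is exactly the sharpening described above.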
Role of Training Examples in Classification
• Bayes’ rule allows us to compute the posterior probabilities P(ωi/x):
P(ωi/x) = p(x/ωi) P(ωi) / p(x)
• Consider the role of the training examples D by introducing them into the computation of the posterior probabilities:
P(ωi/x, D) = p(x/ωi, D) P(ωi/D) / p(x/D)
Role of Training Examples (cont’d)
P(ωi/x, D) = p(x/ωi, D) P(ωi/D) / Σj p(x/ωj, D) P(ωj/D)   (the denominator marginalizes p(x/D) over the classes)
• Using only the samples from class i (or j) to estimate each term: p(x/ωi, D) = p(x/ωi, Di) and P(ωi/D) = P(ωi/Di).
Role of Training Examples (cont’d)
• The training examples are important in determining both the class-conditional densities and the prior probabilities:
P(ωi/x, D) = p(x/ωi, Di) P(ωi/Di) / Σj p(x/ωj, Dj) P(ωj/Dj)
• For simplicity, replace P(ωi/Di) with P(ωi) (i.e., assume the priors are known):
P(ωi/x, D) = p(x/ωi, Di) P(ωi) / Σj p(x/ωj, Dj) P(ωj)
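A small sketch of this per-class decomposition, assuming 1-D data and a simple Gaussian stand-in for each p(x/ωi, Di) (any estimate, ML or Bayesian, could be plugged in); the class samples and priors below are illustrative:

```python
import numpy as np

def class_posteriors(x, class_data, priors):
    """P(w_i/x, D) ∝ p(x/w_i, D_i) P(w_i): each density uses only its own class samples D_i."""
    scores = []
    for D_i, P_i in zip(class_data, priors):
        mu, var = np.mean(D_i), np.var(D_i)          # simple stand-in for p(x/w_i, D_i)
        density = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        scores.append(density * P_i)
    scores = np.asarray(scores)
    return scores / scores.sum()                     # normalize over the c classes

# Illustrative per-class training samples D_1, D_2 and known priors P(w_1), P(w_2).
D1 = np.array([1.9, 2.3, 2.1, 1.7, 2.0])
D2 = np.array([4.2, 3.8, 4.5, 4.0])
print(class_posteriors(x=3.0, class_data=[D1, D2], priors=[0.6, 0.4]))
```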
Bayesian Estimation (BE)
• We need to estimate p(x/ωi, Di) for every class ωi.
• If the samples in Dj give no information about θi (j ≠ i), we need to solve c independent problems of the form: “Given D, estimate p(x/D)”.
BE Approach
• Estimate p(x/D) by marginalizing over the parameters θ of the assumed model (e.g., Gaussian):
p(x/D) = ∫ p(x, θ/D) dθ = ∫ p(x/θ, D) p(θ/D) dθ
• Since p(x/θ, D) = p(x/θ) (x is independent of D given θ), we have:
p(x/D) = ∫ p(x/θ) p(θ/D) dθ
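A grid-based sketch of this marginalization, assuming the model p(x/θ) is Gaussian with mean θ and unit variance, and that a posterior p(θ/D) is already available on a grid (e.g., from the earlier sketch); the numbers are illustrative:

```python
import numpy as np

def predictive_density(x, theta, posterior):
    """p(x/D) = ∫ p(x/theta) p(theta/D) d(theta), approximated by a sum over a theta grid."""
    dtheta = theta[1] - theta[0]
    lik = np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)   # assumed model p(x/theta)
    return np.sum(lik * posterior) * dtheta

# Illustrative posterior p(theta/D): Gaussian centered at 1.4 with std 0.2, on a grid.
theta = np.linspace(-5, 5, 1001)
post = np.exp(-0.5 * ((theta - 1.4) / 0.2) ** 2)
post /= post.sum() * (theta[1] - theta[0])

print(predictive_density(2.0, theta, post))          # density assigned to a new sample x = 2.0
```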
BE vs ML/MAP
• ML/MAP makes a point estimate: θ̂ = argmaxθ p(D/θ) (ML) or θ̂ = argmaxθ p(θ/D) (MAP).
• BE estimates a distribution: p(x/D) = ∫ p(x/θ) p(θ/D) dθ
• Note that the BE solution might not be of the exact parametric form assumed.
Interpretation of BE Solution
• The BE solution implies that if we are less certain about the exact value of θ, we should consider a weighted average of p(x/θ) over the possible values of θ:
p(x/D) = ∫ p(x/θ) p(θ/D) dθ
• The samples D exert their influence on p(x/D) through p(θ/D).
Relation to ML Solution
• If p(D/θ) peaks sharply at θ̂ (i.e., the ML solution), then p(θ/D) will, in general, also peak sharply at θ̂ (assuming p(θ) is broad and smooth).
• Therefore, ML can be seen as a special case of BE!
[Figure: sharply peaked posterior p(θ/D) centered at the ML estimate θ̂]
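In equation form (a sketch of the approximation behind this statement, assuming the posterior is concentrated enough to behave like a delta function at its peak θ̂):

p(x/D) = ∫ p(x/θ) p(θ/D) dθ ≈ ∫ p(x/θ) δ(θ − θ̂) dθ = p(x/θ̂)

i.e., the BE predictive density reduces to the ML/MAP plug-in density p(x/θ̂).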
BE Main Steps
(1) Compute p(θ/D):
p(θ/D) = p(D/θ) p(θ) / ∫ p(D/θ) p(θ) dθ
(2) Compute p(x/D):
p(x/D) = ∫ p(x/θ) p(θ/D) dθ
Case 1: Univariate Gaussian, Unknown μ (known σ²)
• Assume p(x/μ) ~ N(μ, σ²) with σ² known, and a prior p(μ) ~ N(μ0, σ0²).
• D = {x1, x2, …, xn} (independently drawn)
(1) Compute p(μ/D):
p(μ/D) ∝ p(D/μ) p(μ) = [∏k p(xk/μ)] p(μ)
Case 1: Univariate Gaussian, Unknown μ (cont’d)
• It can be shown that p(μ/D) has the following form:
p(μ/D) ~ N(μn, σn²), where
μn = (n σ0² / (n σ0² + σ²)) μ̂n + (σ² / (n σ0² + σ²)) μ0,   σn² = σ0² σ² / (n σ0² + σ²),
and μ̂n = (1/n) Σk xk is the sample mean.
• p(μ/D) peaks at μn.
Case 1: Univariate Gaussian, Unknown μ (cont’d)
• μn is a weighted average of the sample mean μ̂n and the prior mean μ0 (i.e., it lies between them).
• As n → ∞, μn → μ̂n (the ML estimate) and σn² → 0 (the posterior becomes sharper and sharper).
Case 1: Univariate Gaussian, Unknown μ (cont’d)
Bayesian Learning
[Figure: the posterior p(μ/D) for an increasing number of samples; it becomes narrower and peaks at μn as n grows]
Case 1: Univariate Gaussian, Unknown μ (cont’d)
(2) Compute p(x/D):
p(x/D) = ∫ p(x/μ) p(μ/D) dμ ~ N(μn, σ² + σn²)
(the factor of the integrand that is independent of μ can be pulled outside the integral)
• Note that we assumed p(x/μ) ~ N(μ, σ²); however, p(x/D) ~ N(μn, σ² + σn²).
• As the number of samples increases, p(x/D) converges to p(x/μ).
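A short numerical sketch of these closed-form updates, assuming σ² is known; the prior parameters and the samples below are illustrative:

```python
import numpy as np

def univariate_bayes_mean(D, sigma2, mu0, sigma0_2):
    """Return mu_n, sigma_n^2 of p(mu/D) ~ N(mu_n, sigma_n^2) and the predictive variance sigma^2 + sigma_n^2."""
    n = len(D)
    sample_mean = np.mean(D)
    mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * sample_mean \
         + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
    sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2, sigma2 + sigma_n2

# Illustrative setup: known sigma^2 = 1, broad prior N(0, 10), hypothetical samples.
D = np.array([2.1, 1.7, 2.4, 1.9, 2.2])
mu_n, sigma_n2, pred_var = univariate_bayes_mean(D, sigma2=1.0, mu0=0.0, sigma0_2=10.0)
print(mu_n, sigma_n2, pred_var)   # mu_n lies between the sample mean (2.06) and mu0 (0)
```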
Case 2: Multivariate Gaussian, Unknown μ
• Assume p(x/μ) ~ N(μ, Σ) with Σ known, and a prior p(μ) ~ N(μ0, Σ0).
• D = {x1, x2, …, xn} (independently drawn)
(1) Compute p(μ/D):
p(μ/D) ∝ p(D/μ) p(μ) = [∏k p(xk/μ)] p(μ)
Case 2: Multivariate Gaussian, Unknown μ (cont’d)
• It can be shown that p(μ/D) has the following form:
p(μ/D) ~ N(μn, Σn), where:
μn = Σ0 (Σ0 + (1/n)Σ)⁻¹ μ̂n + (1/n)Σ (Σ0 + (1/n)Σ)⁻¹ μ0
Σn = Σ0 (Σ0 + (1/n)Σ)⁻¹ (1/n)Σ
and μ̂n = (1/n) Σk xk is the sample mean.
Case 2: Multivariate Gaussian, Unknown μ (cont’d)
(2) Compute p(x/D):
p(x/D) = ∫ p(x/μ) p(μ/D) dμ ~ N(μn, Σ + Σn)
• Note that we assumed p(x/μ) ~ N(μ, Σ); however, p(x/D) ~ N(μn, Σ + Σn).
• As the number of samples increases, p(x/D) converges to p(x/μ).
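A sketch of the multivariate counterpart, assuming Σ is known; the prior and the 2-D samples below are illustrative:

```python
import numpy as np

def multivariate_bayes_mean(D, Sigma, mu0, Sigma0):
    """Return mu_n, Sigma_n of p(mu/D) ~ N(mu_n, Sigma_n) and the predictive covariance Sigma + Sigma_n."""
    n = len(D)
    sample_mean = D.mean(axis=0)
    A = np.linalg.inv(Sigma0 + Sigma / n)            # shared factor (Sigma0 + Sigma/n)^-1
    mu_n = Sigma0 @ A @ sample_mean + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n, Sigma + Sigma_n

# Illustrative setup: known Sigma, broad prior N(mu0, Sigma0), hypothetical 2-D samples.
Sigma  = np.array([[1.0, 0.2], [0.2, 1.0]])
Sigma0 = 10.0 * np.eye(2)
mu0    = np.zeros(2)
D = np.array([[2.0, 1.0], [1.5, 0.8], [2.2, 1.3], [1.8, 1.1]])
mu_n, Sigma_n, pred_cov = multivariate_bayes_mean(D, Sigma, mu0, Sigma0)
print(mu_n)        # pulled slightly from the sample mean toward mu0
print(Sigma_n)     # shrinks toward 0 as n grows
```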
Recursive Bayes Learning
• Idea: develop an incremental learning algorithm. Let D^n = {x1, x2, …, xn-1, xn} denote the first n samples.
• Rewrite the likelihood as follows: p(D^n/θ) = p(xn/θ) p(D^(n-1)/θ)
• Substitute this into Bayes’ rule, expressing p(θ/D^n) in terms of p(θ/D^(n-1)).
Recursive Bayes Learning (cont’d)
• Substituting the rewritten likelihood into Bayes’ rule and using the conditional probability p(θ/D^(n-1)) (with the denominator obtained by marginalizing over θ) gives the recursion:
p(θ/D^n) = p(xn/θ) p(θ/D^(n-1)) / ∫ p(xn/θ) p(θ/D^(n-1)) dθ,   n = 1, 2, …
• Start the recursion with p(θ/D^0) = p(θ).
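A grid-based sketch of this recursion, assuming a 1-D Gaussian model p(x/θ) with unit variance (any likelihood evaluated on a grid would work the same way); the prior and the samples are illustrative. After each sample, the current posterior plays the role of the prior for the next update:

```python
import numpy as np

theta = np.linspace(-5, 5, 1001)
dtheta = theta[1] - theta[0]

# p(theta/D^0) = p(theta): broad Gaussian prior (illustrative).
post = np.exp(-0.5 * (theta / 3.0) ** 2)
post /= post.sum() * dtheta

def likelihood(x, th):
    return np.exp(-0.5 * (x - th) ** 2)              # assumed model p(x/theta), unit variance

for n, x_n in enumerate([2.1, 1.7, 2.4, 1.9], start=1):   # hypothetical samples x_1..x_4
    post = likelihood(x_n, theta) * post             # numerator p(x_n/theta) p(theta/D^(n-1))
    post /= post.sum() * dtheta                      # divide by the marginal (normalize)
    print(f"n = {n}: p(theta/D^{n}) peaks at {theta[np.argmax(post)]:.2f}")
```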
Example
[Figure: the prior p(θ) before any samples are observed]
Example (cont’d)
• After observing the fourth sample (x4 = 8), the posterior p(θ/D^4) is obtained from p(θ/D^3).
• In general, each new sample xn updates p(θ/D^(n-1)) to p(θ/D^n) via the recursion above.
Example (cont’d)
[Figure: the sequence of posteriors from p(θ) = p(θ/D^0) through p(θ/D^4) over the iterations; p(θ/D^4) peaks at a single value, and the resulting ML and Bayesian estimates are compared]
ML vs Bayesian Estimation
• Number of training data
  • The two methods become equivalent for an infinite number of training samples (and prior distributions that do not exclude the true solution).
  • For small training sets, they give different results in most cases.
• Computational complexity
  • ML uses differential calculus or gradient search to maximize the likelihood.
  • Bayesian estimation requires complex multidimensional integration techniques.
ML vs Bayesian Estimation (cont’d)
• Solution interpretation
  • ML solutions are easier to interpret (i.e., they must be of the assumed parametric form).
  • A Bayesian estimation solution might not be of the parametric form assumed.
• Prior distribution
  • If the prior distribution p(θ) is uniform, the posterior is proportional to the likelihood and the Bayesian solution is essentially equivalent to the ML solution.
  • In general, the two methods give different solutions.
Computational Complexity: ML Estimation
(dimensionality: d, # training samples: n, # classes: c)
• Learning complexity: the individual computations (e.g., the sample mean and the sample covariance) cost O(dn), O(d²n), O(d²), O(n), O(d³), and O(1); for n > d the overall cost is dominated by the O(d²n) term.
• These computations must be repeated c times (once for each class).
Computational Complexity (cont’d)
(dimensionality: d, # training samples: n, # classes: c)
• Classification complexity: evaluating the discriminant for a test point costs O(d²) (dominated by the quadratic term) plus O(1) for the remaining terms.
• These computations must be repeated c times and the maximum taken.
Computational Complexity: Bayesian Estimation
• Learning complexity: higher than ML.
• Classification complexity: same as ML.
Main Sources of Error in Classifier Design
• Bayes error: the error due to overlapping class-conditional densities p(x/ωi) (irreducible).
• Model error: the error due to choosing an incorrect model.
• Estimation error: the error due to incorrectly estimated parameters (e.g., from too few training samples).