Bayesian Learning & VC Dimension
Jahwan Kim, 2000. 5. 24, AIPR Lab., Dept. of CS, KAIST
Contents
• Bayesian learning: general idea and an example
• Parametric vs. nonparametric statistical inference
• Model capacity and generalizability
• Further readings
Bayesian learning
• Conclusions are drawn from hypotheses, each weighted by how well it is supported by the given data.
• Predictions are made from all hypotheses, weighted by their posterior probabilities.
Bayesian learning: Formulation
• P(X|D) = Σ_i P(X|H_i) P(H_i|D), where X is the prediction, the H_i are the hypotheses, and D is the given data.
• This requires computing P(H|D) for all H, which is intractable in many cases.
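A minimal sketch of this full Bayesian prediction over a finite hypothesis space (the hypotheses, prior, and data below are made up for illustration, not from the slides):

```python
# Each hypothesis H_i is a coin bias; predict the next outcome by
# averaging P(X|H_i) weighted by the posterior P(H_i|D).

biases = [0.1, 0.3, 0.5, 0.7, 0.9]   # hypotheses H_i: P(heads)
prior = [0.2] * 5                    # uniform prior P(H_i)
data = [1, 1, 0, 1]                  # observed flips (1 = heads)

def likelihood(theta, d):
    """P(D|H): product over i.i.d. flips."""
    p = 1.0
    for x in d:
        p *= theta if x == 1 else (1.0 - theta)
    return p

# Posterior P(H_i|D) by Bayes' rule (normalize over all hypotheses)
joint = [likelihood(t, data) * p for t, p in zip(biases, prior)]
posterior = [j / sum(joint) for j in joint]

# Prediction P(next = heads | D) = sum_i P(heads|H_i) P(H_i|D)
p_heads = sum(t * p for t, p in zip(biases, posterior))
print(p_heads)
```

Note that even this tiny example must evaluate the posterior of every hypothesis; with a large or continuous hypothesis space, this is exactly the intractability the slide mentions.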
Bayesian learning: Maximum a posteriori hypothesis
• Take the H that maximizes the posterior probability P(H|D).
• How do we find such an H? Use Bayes' rule: P(H|D) = P(D|H) P(H) / P(D).
Bayesian learning, continued
• P(D) is the same for all H, so it can be ignored when maximizing.
• P(D|H) is the likelihood of observing the given data under H.
• P(H), the prior probability, has been a source of debate.
• If the prior is too biased, we get underfitting.
• Sometimes a uniform prior is appropriate; in that case, maximizing the posterior reduces to choosing the maximum likelihood (ML) hypothesis.
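A sketch contrasting MAP and ML selection on the same illustrative hypothesis space as above (the biased prior is made up to show the two choices can differ):

```python
biases = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.05, 0.15, 0.6, 0.15, 0.05]   # prior biased toward a fair coin
data = [1, 1, 0, 1]

def likelihood(theta, d):
    p = 1.0
    for x in d:
        p *= theta if x == 1 else (1.0 - theta)
    return p

liks = [likelihood(t, data) for t in biases]

# ML maximizes P(D|H); MAP maximizes P(D|H) * P(H).
# P(D) is constant in H, so it drops out of the argmax.
h_ml = max(zip(biases, liks), key=lambda tp: tp[1])[0]
h_map = max(zip(biases, [l * p for l, p in zip(liks, prior)]),
            key=lambda tp: tp[1])[0]
print(h_ml, h_map)   # 0.7 vs. 0.5: the prior pulls MAP toward the fair coin
```

With a uniform prior the two selections coincide, which is the slide's last point.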
Bayesian learning: Parameter estimation
• Problem: find p(x|D) when
• we know the form of the pdf, i.e., it is parametrized by θ and written p(x|θ);
• the prior pdf p(θ) is known;
• the data D are given.
• We only need to find p(θ|D), since then we may use p(x|D) = ∫ p(x|θ) p(θ|D) dθ.
Parameter estimation, continued
• By Bayes' rule, p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ.
• Assume also that each sample in D is drawn independently from the same pdf, i.e., the samples are i.i.d. Then p(D|θ) = Π_{k=1}^{n} p(x_k|θ).
• Together these give the formal solution to the problem.
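A minimal numerical sketch of this formal solution (illustrative, not from the slides): approximate p(θ|D) on a uniform grid for a Bernoulli model, then sum out θ to get the predictive probability. NumPy is assumed to be available.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter
prior = np.ones_like(theta)              # uniform prior p(theta)
data = np.array([1, 1, 0, 1, 1])         # i.i.d. Bernoulli samples

# i.i.d. likelihood: p(D|theta) = prod_k p(x_k|theta)
heads = data.sum()
lik = theta**heads * (1 - theta)**(len(data) - heads)

# Bayes' rule; on a uniform grid the normalizer is just the sum
post = lik * prior
post /= post.sum()

# Predictive probability: p(x=1|D) = integral of theta * p(theta|D)
p_next = (theta * post).sum()
print(p_next)   # ~ (4+1)/(5+2) = 0.714, Laplace's rule of succession
```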
Parameter estimation: Example
• One-dimensional normal distribution.
• Two parameters, μ and σ².
• Assume that p(μ) is normal with known mean m and variance s².
• Assume also that σ² is known, so only μ is estimated.
• Then p(μ|D) ∝ Π_{k=1}^{n} p(x_k|μ) · p(μ).
Example, continued
• A term quadratic in μ appears in the exponent of this expression (complete the square, or compute it directly).
• Namely, p(μ|D) is also normal.
• Its mean and variance are given by
μ_n = (n s² x̄_n + σ² m) / (n s² + σ²),  σ_n² = s² σ² / (n s² + σ²),
where x̄_n = (1/n) Σ_k x_k is the sample mean.
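A numerical check of these closed-form formulas (the prior parameters m, s² follow the slide's notation; the true mean and simulated data are made up; NumPy is assumed):

```python
import numpy as np

def posterior_of_mean(data, m, s2, sigma2):
    """Mean and variance of p(mu|D) for a N(m, s2) prior on mu
    and known data variance sigma2."""
    n = len(data)
    xbar = data.mean()
    mu_n = (n * s2 * xbar + sigma2 * m) / (n * s2 + sigma2)
    s2_n = (s2 * sigma2) / (n * s2 + sigma2)
    return mu_n, s2_n

rng = np.random.default_rng(0)
true_mu, sigma2 = 2.0, 1.0
for n in (1, 10, 100, 10000):
    data = rng.normal(true_mu, np.sqrt(sigma2), size=n)
    mu_n, s2_n = posterior_of_mean(data, m=0.0, s2=4.0, sigma2=sigma2)
    print(n, round(mu_n, 3), round(s2_n, 6))
# The posterior mean moves from the prior mean m toward the sample
# mean, and the posterior variance shrinks like 1/n.
```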
Estimation of the mean
• As n goes to infinity, σ_n² → 0 and p(μ|D) approaches a Dirac delta function centered at the sample mean.
Two main approaches to (statistical) inference
• Parametric inference
• The investigator must know the problem well.
• The model contains a finite number of unknown parameters.
• Nonparametric inference
• Used when there is no reliable a priori information about the problem.
• The number of samples required can be very large.
Capacity of models
• A well-known fact:
• If a model is too complicated, it does not generalize well;
• if it is too simple, it does not represent the data well.
• How do we measure model capacity?
• In classical statistics, by the number of parameters (degrees of freedom).
• In the (new) statistical learning theory, by the VC dimension.
VC dimension
• The Vapnik-Chervonenkis (VC) dimension is a measure of the capacity of a model.
• For a class of binary classifiers, it is the largest number of points the class can shatter, i.e., realize every possible labeling of them.
VC dimension: Examples
• It is not always equal to the number of parameters:
• the three-parameter family of half-planes {sgn(ax+by+c)} in the 2D plane has VC dimension 3, but
• the one-parameter family {sgn(sin ax)} on the real line has infinite VC dimension!
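A brute-force sketch of the second example, using Vapnik's classic construction (as reproduced in Burges' SVM tutorial): for points x_i = 10^(-i), a single frequency a can be chosen to realize any labeling.

```python
import itertools
import numpy as np

n = 4
xs = np.array([10.0 ** -(i + 1) for i in range(n)])   # 0.1, 0.01, ...

for labels in itertools.product([1, -1], repeat=n):
    y = np.array(labels)
    # Choose the single parameter a so that sgn(sin(a*x_i)) = y_i:
    # each point labeled -1 contributes 10^(i+1) to the frequency.
    a = np.pi * (1 + sum((1 - y[i]) / 2 * 10.0 ** (i + 1)
                         for i in range(n)))
    pred = np.sign(np.sin(a * xs))
    assert np.array_equal(pred, y), (labels, pred)

print(f"all {2**n} labelings of {n} points realized by sgn(sin(ax))")
```

The same construction works for any n, so this one-parameter family shatters arbitrarily large point sets: its VC dimension is infinite.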
Theorem from STL on VC dimension and generalizability
• The standard bound from Vapnik's statistical learning theory: with probability at least 1−η, every function in a class of VC dimension h satisfies
R(α) ≤ R_emp(α) + √( (h(ln(2n/h) + 1) − ln(η/4)) / n ),
where R is the expected risk, R_emp the empirical risk, and n the number of samples.
• The confidence term grows with h and shrinks with n: capacity must be controlled relative to the amount of data.
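A small sketch evaluating the confidence term of this bound for a fixed VC dimension and growing sample size (the values of h, n, and η are illustrative):

```python
import math

def vc_confidence(h, n, eta=0.05):
    """sqrt((h*(ln(2n/h) + 1) - ln(eta/4)) / n)"""
    return math.sqrt((h * (math.log(2 * n / h) + 1)
                      - math.log(eta / 4)) / n)

for n in (100, 1000, 10000, 100000):
    print(n, round(vc_confidence(h=10, n=n), 3))
# The term shrinks roughly like sqrt(h * ln(n) / n): more data or
# lower capacity tightens the guarantee on the expected risk.
```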
Further readings
• Vapnik, Statistical Learning Theory, Ch. 0 and Sections 1.1-1.3
• Haykin, Neural Networks, Sections 2.13-2.14
• Duda & Hart, Pattern Classification and Scene Analysis, Sections 3.3-3.5