CS b553: Algorithms for Optimization and Learning Parameter Learning (From data to distributions)
Agenda • Learning probability distributions from example data • Generative vs. discriminative models • Maximum likelihood estimation (MLE) • Bayesian estimation
Motivation • Past lectures have studied how to infer characteristics of a distribution, given a fully-specified Bayes net • Next few lectures: where does the Bayes net come from? • Setting for this lecture: • Given a set of examples drawn from a distribution • Each example is complete (fully observable) • BN structure is known, but the CPTs are unknown
Density Estimation • Given dataset D={d[1],…,d[M]} drawn from underlying distribution P* • Find a distribution that matches P* as closely as possible • High-level issues: • Usually, not enough data to get an accurate picture of P*, which forces us to approximate. • Even if we did have P*, how do we measure closeness? • How do we maximize closeness? • Two approaches: learning problems => • Optimization problems, or • Bayesian inference problems
Kullback-Leibler Divergence • Definition: given two probability distributions P and Q over X, the KL divergence (or relative entropy) from P to Q is given by: $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$ • Properties: • $D(P \| Q) \ge 0$ • $D(P \| Q) = 0$ iff P=Q “almost everywhere” • Not a true “metric”: non-symmetric
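The properties on this slide can be checked numerically. The following is a minimal sketch for discrete distributions given as lists of probabilities; the function name `kl_divergence` and the two example distributions are illustrative assumptions, not part of the lecture.

```python
import math

def kl_divergence(p, q):
    """D(P||Q) = sum_x P(x) log(P(x)/Q(x)) for discrete distributions.

    p and q are lists of probabilities over the same values (assumed layout).
    Terms with P(x) = 0 contribute 0; Q(x) = 0 where P(x) > 0 gives infinity.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

# Hypothetical example distributions over a binary variable.
p = [0.5, 0.5]
q = [0.9, 0.1]

d_pq = kl_divergence(p, q)  # divergence from P to Q
d_qp = kl_divergence(q, p)  # divergence from Q to P
d_pp = kl_divergence(p, p)  # zero when the distributions coincide

print(d_pq, d_qp, d_pp)
```

The two directions give different values, which is the non-symmetry noted above.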
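Applying KL Divergence to Learning • Approach: given underlying distribution P*, find P (within a class of distributions) so KL divergence is minimized • If we approximate P* with the draws in D, we get $D(P^* \| P) = E_{P^*}[\log P^*(x)] - E_{P^*}[\log P(x)] \approx \mathrm{const} - \frac{1}{M} \sum_{m=1}^{M} \log P(d[m])$, where the constant does not depend on P • Minimizing KL-divergence to the empirical distribution is the same as maximizing the empirical log-likelihood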
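A small numerical check of this equivalence, as a sketch only: the data set, the Bernoulli model class, and the grid search are all assumptions made for illustration. It compares the candidate that minimizes KL divergence to the empirical distribution with the candidate that maximizes the average log-likelihood.

```python
import math

# Hypothetical binary data set (1 and 0 are the two possible outcomes).
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
M = len(data)
freq = sum(data) / M  # empirical probability of outcome 1

def avg_log_likelihood(q):
    """Empirical log-likelihood (1/M) sum_m log P(d[m]) under Bernoulli(q)."""
    return sum(math.log(q if d == 1 else 1 - q) for d in data) / M

def kl_empirical_to_model(q):
    """KL divergence from the empirical distribution to Bernoulli(q)."""
    emp = {1: freq, 0: 1 - freq}
    model = {1: q, 0: 1 - q}
    return sum(emp[x] * math.log(emp[x] / model[x]) for x in (0, 1) if emp[x] > 0)

# Candidate parameters on a coarse grid (an assumption; any optimizer would do).
grid = [i / 100 for i in range(1, 100)]
best_kl = min(grid, key=kl_empirical_to_model)
best_ll = max(grid, key=avg_log_likelihood)

print(best_kl, best_ll)
```

For every candidate q, the sum of the two quantities is the same constant (the negative entropy of the empirical distribution), which is why the two searches agree.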
Another approach: Discriminative Learning • Do we really want to model P*? We may be more concerned with predicting the values of some subset of variables • E.g., for a Bayes net CPT, we want P(Y|PaY) but may not care about the distribution of PaY • Generative model: estimate P(X,Y) • Discriminative model: estimate P(Y|X), ignore P(X)
Training Discriminative Models • Define a loss function l(y,x,P) that is given the ground truth y,x • Measures the difference between the prediction P(Y|x) and the ground truth • Examples: • Classification error $I[y \ne \arg\max_{y'} P(y'|x)]$ • Conditional log likelihood loss $-\log P(y|x)$ • Strategy: minimize empirical loss
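The two example losses and the empirical loss can be sketched as follows. This assumes the model's prediction P(Y|x) for each example is already available as a dictionary from labels to probabilities; the labels and numbers are made up for illustration.

```python
import math

def zero_one_loss(y, probs):
    """Classification error: 1 if y is not the most probable label under P(Y|x)."""
    prediction = max(probs, key=probs.get)
    return 0 if prediction == y else 1

def neg_log_likelihood(y, probs):
    """Conditional log-likelihood loss: -log P(y|x)."""
    return -math.log(probs[y])

def empirical_loss(loss, examples):
    """Average of the loss over (ground truth, predicted distribution) pairs."""
    return sum(loss(y, probs) for y, probs in examples) / len(examples)

# Hypothetical examples: ground-truth label paired with the model's P(Y|x).
examples = [
    ("spam", {"spam": 0.8, "ham": 0.2}),
    ("ham", {"spam": 0.6, "ham": 0.4}),
    ("ham", {"spam": 0.1, "ham": 0.9}),
]

error_rate = empirical_loss(zero_one_loss, examples)
avg_nll = empirical_loss(neg_log_likelihood, examples)
print(error_rate, avg_nll)
```

Training a discriminative model would then mean searching over model parameters to make one of these averages as small as possible.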
Discriminative Vs Generative • Discriminative models: • Don’t model the input distribution, so may have more expressive power for the same level of complexity • May learn more accurate predictive models for same sized training dataset • Directly transcribe top-down evaluation of CPTs • Generative models: • More flexible, because they don’t require a priori selection of the dependent variable Y • Bottom-up inference is easier • Both useful in different situations
What class of Probability Models? • For small discrete distributions, just use a tabular representation • Very efficient learning techniques • For large discrete distributions or continuous ones, the choice of probability model is crucial • Increasing complexity => • Can represent complex distributions more accurately • Need more data to learn well (risk of overfitting) • More expensive to learn and to perform inference
Learning Coin Flips • Let the unknown fraction of cherries be q (hypothesis) • Probability of drawing a cherry is q • Suppose draws are independent and identically distributed (i.i.d.) • Observe that c out of N draws are cherries (data)
Learning Coin Flips • Let the unknown fraction of cherries be q (hypothesis) • Intuition: c/N might be a good hypothesis • (or it might not, depending on the draw!)
Maximum Likelihood • Likelihood of data $d=\{d_1,…,d_N\}$ given q • $P(d|q) = \prod_j P(d_j|q) = q^c (1-q)^{N-c}$ • First equality: i.i.d. assumption • Second equality: gather the c cherry terms together, then the N-c lime terms
Maximum Likelihood • Likelihood of data $d=\{d_1,…,d_N\}$ given q • $P(d|q) = q^c (1-q)^{N-c}$
Maximum Likelihood • Peaks of likelihood function seem to hover around the fraction of cherries… • Sharpness indicates some notion of certainty…
Maximum Likelihood • Let P(d|q) be the likelihood function • The quantity $\arg\max_q P(d|q)$ is known as the maximum likelihood estimate (MLE)
Maximum Likelihood • $l(q) = \log P(d|q) = \log[q^c (1-q)^{N-c}] = \log[q^c] + \log[(1-q)^{N-c}] = c \log q + (N-c) \log(1-q)$
Maximum Likelihood • $l(q) = \log P(d|q) = c \log q + (N-c) \log(1-q)$ • Setting $dl/dq = 0$ gives the maximum likelihood estimate
Maximum Likelihood • $dl/dq = c/q - (N-c)/(1-q)$ • At the MLE, $c/q - (N-c)/(1-q) = 0$, so $c(1-q) = (N-c)q$ and therefore $q = c/N$ • c and N are known as sufficient statistics for the parameter q: no other values give additional information about q
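The closed-form result can be checked against a brute-force search over the log-likelihood. This is a sketch under assumed values c = 7 and N = 10; the function names and the grid resolution are illustrative choices.

```python
import math

def log_likelihood(q, c, n):
    """l(q) = c log q + (N - c) log(1 - q) for c successes in N i.i.d. draws."""
    return c * math.log(q) + (n - c) * math.log(1 - q)

def bernoulli_mle(c, n):
    """Closed-form maximum likelihood estimate q = c / N."""
    return c / n

# Hypothetical sufficient statistics: 7 cherries out of 10 draws.
c, n = 7, 10
q_hat = bernoulli_mle(c, n)

# Brute-force check: evaluate l(q) on a grid over the open interval (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
q_grid = max(grid, key=lambda q: log_likelihood(q, c, n))

print(q_hat, q_grid)
```

Only c and N enter the computation, which reflects their role as sufficient statistics.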
Other MLE results • Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (histogram) • Continuous Gaussian distributions • Mean = average of the data • Standard deviation = standard deviation of the data (with 1/N normalization)
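Both results amount to a few lines of counting and averaging. A minimal sketch follows; the sample values and function names are assumptions for illustration.

```python
import math
from collections import Counter

def categorical_mle(samples):
    """ML estimate of a categorical distribution: fraction of counts per value."""
    counts = Counter(samples)
    n = len(samples)
    return {value: count / n for value, count in counts.items()}

def gaussian_mle(samples):
    """ML estimate of a Gaussian: sample mean and the 1/N standard deviation."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n  # divides by N, not N-1
    return mean, math.sqrt(variance)

# Hypothetical data.
probs = categorical_mle(["a", "b", "a", "c", "a", "b"])
mean, std = gaussian_mle([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(probs)
print(mean, std)  # 5.0 2.0
```

Values that never appear in the samples get no entry here, which corresponds to an ML probability of zero.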
Maximum Likelihood for BN • For any BN, the ML parameters of any CPT are given by the fraction of observed values in the data, conditioned on matching parent values • Example (figure: Earthquake and Burglar are the parents of Alarm), N=1000 examples • E: 500, so P(E) = 0.5 • B: 200, so P(B) = 0.2 • Alarm counts by parent values: • A|E,B: 19/20 • A|B: 188/200 • A|E: 170/500 • A| : 1/380
Proof • Let BN have structure G over variables $X_1,…,X_n$ and parameters q • Given dataset D • $L(q; D) = \prod_m P_G(d[m]; q) = \prod_m \prod_i P_G(x_i[m] \mid pa_{X_i}[m]; q)$
Fitting CPTs • Each ML entry $P(x_i \mid pa_{X_i})$ is given by examining counts of $(x_i, pa_{X_i})$ in D and normalizing across rows of the CPT • Note that for large $k=|Pa_{X_i}|$, very few datapoints will share the values of $pa_{X_i}$! • $O(|D|/2^k)$ on average for binary parents, but some values may be even rarer • Large domains $|Val(X_i)|$ can also be a problem • Data fragmentation
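Counting and normalizing per parent assignment can be sketched as below. The data layout (one dictionary per fully observed example), the variable names B, E, A, and the tiny data set are all assumptions for illustration, not the counts from the lecture's example.

```python
from collections import Counter, defaultdict

def fit_cpt(data, child, parents):
    """ML estimate of P(child | parents) from complete data.

    Returns {parent_values: {child_value: probability}}, where each row
    is normalized over the examples that share those parent values.
    """
    joint = Counter()          # counts of (parent values, child value)
    parent_totals = Counter()  # counts of parent values alone
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        parent_totals[pa] += 1
    cpt = defaultdict(dict)
    for (pa, value), count in joint.items():
        cpt[pa][value] = count / parent_totals[pa]
    return dict(cpt)

# Hypothetical complete examples over Burglar (B), Earthquake (E), Alarm (A).
data = [
    {"B": 1, "E": 0, "A": 1},
    {"B": 1, "E": 0, "A": 1},
    {"B": 1, "E": 0, "A": 0},
    {"B": 0, "E": 1, "A": 1},
    {"B": 0, "E": 1, "A": 0},
    {"B": 0, "E": 0, "A": 0},
    {"B": 0, "E": 0, "A": 0},
    {"B": 0, "E": 0, "A": 0},
]

cpt = fit_cpt(data, "A", ["B", "E"])
print(cpt)
```

In this toy data no example has B=1 and E=1, so that CPT row cannot be estimated at all: a small-scale instance of the data fragmentation problem.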
Proof • Let BN have structure G over variables $X_1,…,X_n$ and parameters q • Given dataset D • $L(q; D) = \prod_m P_G(d[m]; q) = \prod_m \prod_i P_G(x_i[m] \mid pa_{X_i}[m]; q) = \prod_i \left[ \prod_m P_G(x_i[m] \mid pa_{X_i}[m]; q) \right]$ • $\prod_m P_G(x_i[m] \mid pa_{X_i}[m]; q)$ is the likelihood of the local CPT of $X_i$: $L(q_{X_i}; D)$ • Each CPT depends on a disjoint set of parameters $q_{X_i}$ • => maximizing $L(q; D)$ over all parameters q is equivalent to maximizing $L(q_{X_i}; D)$ over each individual $q_{X_i}$
An Alternative approach: Bayesian Estimation • $P(q|d) = \frac{1}{Z} P(d|q) P(q)$ is the posterior • Distribution of hypotheses given the data • P(d|q) is the likelihood • P(q) is the hypothesis prior • (Figure: graphical model over q, d[1], d[2], …, d[M])
Assumption: Uniform prior, Bernoulli Distribution • Assume P(q) is uniform • $P(q|d) = \frac{1}{Z} P(d|q) = \frac{1}{Z} q^c (1-q)^{N-c}$ • What’s P(Y|D)? • (Figure: graphical model over qi, Y, d[1], d[2], …, d[M])
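Assumption: Uniform prior, Bernoulli Distribution • Normalizing the posterior, $Z = \int_0^1 q^c (1-q)^{N-c}\,dq$ => $Z = c!\,(N-c)!\,/\,(N+1)!$ • Predicting with the posterior, $P(Y|D) = \int_0^1 q\,P(q|d)\,dq$ => $P(Y|D) = \frac{1}{Z}\,(c+1)!\,(N-c)!\,/\,(N+2)! = (c+1)/(N+2)$ • Can think of this as a “correction” using “virtual counts” • (Figure: graphical model over qi, Y, d[1], d[2], …, d[M])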
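The closed forms can be compared against direct numerical integration over q. This is a sketch only: the midpoint rule, the step count, and the example values c = 3, N = 3 are assumptions chosen for illustration.

```python
from math import factorial

def normalizer(c, n):
    """Closed form Z = c! (N-c)! / (N+1)! for the uniform-prior posterior."""
    return factorial(c) * factorial(n - c) / factorial(n + 1)

def predictive(c, n):
    """Closed-form prediction (c + 1) / (N + 2)."""
    return (c + 1) / (n + 2)

def predictive_numeric(c, n, steps=20000):
    """Midpoint-rule estimate of the posterior mean of q under a uniform prior."""
    h = 1.0 / steps
    numerator = 0.0    # approximates the integral of q * q^c (1-q)^(N-c)
    denominator = 0.0  # approximates Z, the integral of q^c (1-q)^(N-c)
    for i in range(steps):
        q = (i + 0.5) * h
        weight = q ** c * (1 - q) ** (n - c)
        numerator += q * weight * h
        denominator += weight * h
    return numerator / denominator, denominator

# Hypothetical data: 3 draws, all cherries. The MLE would be c/N = 1.0.
c, n = 3, 3
p_closed = predictive(c, n)
p_numeric, z_numeric = predictive_numeric(c, n)

print(p_closed, p_numeric)
print(normalizer(c, n), z_numeric)
```

With three cherries in three draws the ML estimate claims certainty, while the Bayesian prediction stays below 1, which is the "correction" described above.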
Nonuniform priors • $P(q|d) \propto P(d|q) P(q) = q^c (1-q)^{N-c} P(q)$ • Define, for all q, the probability that I believe in q • (Figure: plot of the prior P(q) against q on the interval from 0 to 1)
Beta Distribution • $\mathrm{Beta}_{a,b}(q) = g\, q^{a-1} (1-q)^{b-1}$ • a, b hyperparameters > 0 • g is a normalization constant • a=b=1 is the uniform distribution
Posterior with Beta Prior • Posterior $\propto q^c (1-q)^{N-c} P(q) = g\, q^{c+a-1} (1-q)^{N-c+b-1} \propto \mathrm{Beta}_{a+c,\,b+N-c}(q)$ • Prediction = mean $E[q] = (c+a)/(N+a+b)$
Posterior with Beta Prior • What does this mean? • Prior specifies a “virtual count” of a-1 heads and b-1 tails • See heads, increment a • See tails, increment b • Effect of prior diminishes with more data
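The increment rule and the diminishing effect of the prior can be sketched as follows. The function names, the choice of a "weak" prior (a=b=1) and a "strong" prior (a=b=10), and the observation sequences are assumptions for illustration.

```python
def beta_update(a, b, observations):
    """Update Beta hyperparameters: increment a per heads, b per tails."""
    for is_heads in observations:
        if is_heads:
            a += 1
        else:
            b += 1
    return a, b

def beta_mean(a, b):
    """Mean of a Beta distribution with hyperparameters a, b: a / (a + b)."""
    return a / (a + b)

# Few observations: three heads in a row.
few = [True] * 3
weak_few = beta_mean(*beta_update(1, 1, few))      # uniform prior
strong_few = beta_mean(*beta_update(10, 10, few))  # strong belief in a fair coin

# Many observations: 700 heads and 300 tails.
many = [True] * 700 + [False] * 300
weak_many = beta_mean(*beta_update(1, 1, many))
strong_many = beta_mean(*beta_update(10, 10, many))

print(weak_few, strong_few)    # priors disagree noticeably
print(weak_many, strong_many)  # both close to the observed fraction 0.7
```

The posterior mean after the update equals (c+a)/(N+a+b), since the update leaves hyperparameters a+c and b+N-c.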
Choosing a Prior • Part of the design process; must be chosen according to your intuition • Uninformed belief a=b=1, strong belief => a,b high
Extensions of Beta Priors • Parameters of categorical distributions: Dirichlet prior • Mathematical expression more complex, but in practice still takes the form of “virtual counts” • Mean, standard deviation for Gaussian distributions: Gamma prior • Conjugate priors preserve the representation of prior and posterior distributions, but do not necessarily exist for general distributions
Dirichlet Prior • Categorical variable |Val(X)|=k with $P(X=i) = q_i$ • Parameter space $q_1,…,q_k$ with $q_i \ge 0$, $\sum_i q_i = 1$ • Maximum likelihood estimate given counts $c_1,…,c_k$ in the data D: • $q_i^{ML} = c_i/N$ • Dirichlet prior is $\mathrm{Dirichlet}(a_1,…,a_k) = \frac{\Gamma(a_T)}{\prod_i \Gamma(a_i)} \prod_i q_i^{a_i-1}$ • Mean is $(a_1/a_T,…,a_k/a_T)$ with $a_T = \sum_i a_i$ • Posterior P(q|D) is $\mathrm{Dirichlet}(a_1+c_1,…,a_k+c_k)$
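The posterior update and its mean take only a few lines. A minimal sketch, assuming a three-valued variable, a Dirichlet(1,1,1) prior, and made-up counts:

```python
def dirichlet_posterior(alphas, counts):
    """Posterior hyperparameters: add the observed count to each prior a_i."""
    return [a + c for a, c in zip(alphas, counts)]

def dirichlet_mean(alphas):
    """Mean of a Dirichlet distribution: (a_1/a_T, ..., a_k/a_T)."""
    total = sum(alphas)
    return [a / total for a in alphas]

# Hypothetical counts for a variable with k = 3 values; value 2 never observed.
counts = [3, 0, 1]
n = sum(counts)

ml_estimate = [c / n for c in counts]        # q_i = c_i / N
prior = [1, 1, 1]                            # assumed uniform Dirichlet prior
posterior = dirichlet_posterior(prior, counts)
posterior_mean = dirichlet_mean(posterior)

print(ml_estimate)
print(posterior, posterior_mean)
```

The ML estimate assigns probability zero to the unobserved value, while the posterior mean keeps it positive because of the prior's virtual counts.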
Recap • Learning => optimization problem (ML) • Learning => inference problem (Bayesian estimation) • Learning parameters of Bayesian networks • Conjugate priors