CS b553: Algorithms for Optimization and Learning. Parameter Learning (From data to distributions)
Agenda • Learning probability distributions from example data • Generative vs. discriminative models • Maximum likelihood estimation (MLE) • Bayesian estimation
Motivation • Past lectures have studied how to infer characteristics of a distribution, given a fully-specified Bayes net • Next few lectures: where does the Bayes net come from? • Setting for this lecture: • Given a set of examples drawn from a distribution • Each example is complete (fully observable) • BN structure is known, but the CPTs are unknown
Density Estimation • Given dataset D={d[1],…,d[M]} drawn from underlying distribution P* • Find a distribution that matches P* as closely as possible • High-level issues: • Usually, there is not enough data to get an accurate picture of P*, which forces us to approximate • Even if we did have P*, how do we measure closeness? • How do we maximize closeness? • Two approaches: cast learning problems as => • Optimization problems, or • Bayesian inference problems
Kullback-Leibler Divergence • Definition: given two probability distributions P and Q over X, the KL divergence (or relative entropy) from P to Q is given by: D(P || Q) = Σx P(x) log [P(x) / Q(x)] • Properties: • D(P || Q) ≥ 0, with D(P || Q) = 0 iff P=Q “almost everywhere” • Not a true “metric” – non-symmetric: D(P || Q) ≠ D(Q || P) in general
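The definition above can be sketched for finite discrete distributions. This is an illustrative helper, not from the slides: the function name and list-of-probabilities representation are my own choices.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) for two discrete distributions given as
    lists of probabilities over the same outcomes (natural log)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with P(x) = 0 contribute 0 by convention
            total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # 0.0, since D(P || Q) = 0 iff P = Q
print(kl_divergence(p, q))  # positive, and differs from kl_divergence(q, p)
```

Running both orderings of the last call makes the non-symmetry concrete: D(P || Q) and D(Q || P) generally disagree.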
Applying KL Divergence to Learning • Approach: given underlying distribution P*, find P (within a class of distributions) so that D(P* || P) is minimized • Since D(P* || P) = Σx P*(x) log P*(x) - Σx P*(x) log P(x), and the first term does not depend on P, minimizing it is equivalent to maximizing E_P*[log P(x)] • If we approximate the expectation under P* with draws from D, we get argmaxP Σm log P(d[m]) • Minimizing KL divergence to the empirical distribution is the same as maximizing the empirical log-likelihood
Another approach: Discriminative Learning • Do we really want to model P*? We may be more concerned with predicting the values of some subset of variables • E.g., for a Bayes net CPT, we want P(Y|PaY) but may not care about the distribution of PaY • Generative model: estimate P(X,Y) • Discriminative model: estimate P(Y|X), ignore P(X)
Training Discriminative Models • Define a loss function l(y,x,P) that is given the ground truth y,x • Measures the difference between the prediction P(Y|x) and the ground truth • Examples: • Classification error I[y ≠ argmaxy' P(y'|x)] • Conditional log likelihood: -log P(y|x) • Strategy: minimize the empirical loss over the training data
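The two example losses above can be sketched for a single example. This is a minimal illustration; the function names, the dict-of-probabilities representation, and the spam/ham labels are hypothetical choices of mine, not from the slides.

```python
import math

def zero_one_loss(y, probs):
    """Classification error I[y != argmax_y' P(y'|x)] for one example.
    probs maps each label to its predicted conditional probability P(y'|x)."""
    prediction = max(probs, key=probs.get)
    return 0 if prediction == y else 1

def neg_log_likelihood(y, probs):
    """Conditional log loss -log P(y|x) for one example."""
    return -math.log(probs[y])

# Hypothetical example: ground truth y = "spam", model puts 0.7 on it.
probs = {"spam": 0.7, "ham": 0.3}
print(zero_one_loss("spam", probs))       # 0 (prediction matches)
print(neg_log_likelihood("spam", probs))  # about 0.357
```

Note the difference in granularity: the 0/1 loss only checks the argmax, while the log loss also rewards putting more probability mass on the correct label.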
Discriminative vs. Generative • Discriminative models: • Don’t model the input distribution, so may have more expressive power for the same level of complexity • May learn more accurate predictive models for the same sized training dataset • Directly transcribe top-down evaluation of CPTs • Generative models: • More flexible, because they don’t require a priori selection of the dependent variable Y • Bottom-up inference is easier • Both are useful in different situations
What class of Probability Models? • For small discrete distributions, just use a tabular representation • Very efficient learning techniques • For large discrete distributions or continuous ones, the choice of probability model is crucial • Increasing complexity => • Can represent complex distributions more accurately • Need more data to learn well (risk of overfitting) • More expensive to learn and to perform inference
Learning Coin Flips • Let the unknown fraction of cherries be q (hypothesis) • Probability of drawing a cherry is q • Suppose draws are independent and identically distributed (i.i.d.) • Observe that c out of N draws are cherries (data)
Learning Coin Flips • Let the unknown fraction of cherries be q (hypothesis) • Intuition: c/N might be a good hypothesis • (or it might not, depending on the draw!)
Maximum Likelihood • Likelihood of data d={d1,…,dN} given q: P(d|q) = ∏j P(dj|q) = q^c (1-q)^(N-c) • First equality: i.i.d. assumption; second: gather the c cherry terms together, then the N-c lime terms
Maximum Likelihood • Peaks of the likelihood function hover around the fraction of cherries in the data… • Sharpness of the peak indicates some notion of certainty in the estimate…
Maximum Likelihood • Let P(d|q) be the likelihood function • The quantity argmaxq P(d|q) is known as the maximum likelihood estimate (MLE)
Maximum Likelihood • l(q) = log P(d|q) = log [q^c (1-q)^(N-c)] = log q^c + log (1-q)^(N-c) = c log q + (N-c) log (1-q)
Maximum Likelihood • l(q) = log P(d|q) = c log q + (N-c) log (1-q) • Setting dl/dq = 0 gives the maximum likelihood estimate
Maximum Likelihood • dl/dq = c/q - (N-c)/(1-q) • At the MLE, c/q - (N-c)/(1-q) = 0 => q = c/N • c and N are known as sufficient statistics for the parameter q – no other values computed from the data give additional information about q
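The derivation above can be checked numerically: the closed-form estimate q = c/N should score at least as high under the log-likelihood as any other candidate. A minimal sketch (the values c = 30, N = 100 are my own illustrative choice):

```python
import math

def log_likelihood(q, c, n):
    """l(q) = c log q + (N - c) log(1 - q) for c cherries out of N draws."""
    return c * math.log(q) + (n - c) * math.log(1 - q)

c, n = 30, 100
q_mle = c / n  # closed-form maximum likelihood estimate

# The MLE should beat (or tie) any other candidate value of q.
for q in (0.1, 0.2, 0.4, 0.5, 0.9):
    assert log_likelihood(q_mle, c, n) >= log_likelihood(q, c, n)
print(q_mle)  # 0.3
```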
Other MLE results • Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (histogram) • Continuous Gaussian distributions: • Mean = average of the data • Standard deviation = standard deviation of the data
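The Gaussian case above can be sketched directly; note that the maximum-likelihood standard deviation divides by N (not N-1). The function name and the example data are my own:

```python
import math

def gaussian_mle(data):
    """ML estimates for a Gaussian: the sample mean, and the (biased,
    maximum-likelihood) standard deviation, which divides by N."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    return mean, math.sqrt(var)

mean, std = gaussian_mle([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean, std)  # 5.0 2.0
```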
Maximum Likelihood for BN • For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values • Example (Earthquake -> Alarm <- Burglar network): N=1000 examples, with B: 200 and E: 500 => P(B) = 0.2, P(E) = 0.5 • Estimated alarm CPT entries from counts: A|E,B: 19/20; A|B: 188/200; A|E: 170/500; A|¬E,¬B: 1/380
Fitting CPTs • Each ML entry P(xi|paXi) is given by examining counts of (xi,paXi) in D and normalizing across rows of the CPT • Note that for a large number of parents k=|PaXi|, very few datapoints will share any given assignment of paXi! • On average O(|D|/2^k) for binary parents, but some parent assignments may be even rarer • Large domains |Val(Xi)| can also be a problem • This is known as data fragmentation
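The count-and-normalize procedure above can be sketched as follows. This is a minimal illustration under my own assumptions: examples are dicts mapping variable names to values, and the alarm-style data at the bottom is hypothetical, not the counts from the slides.

```python
from collections import Counter

def fit_cpt(data, child, parents):
    """ML estimate of P(child | parents) from complete data: count each
    (parent assignment, child value) pair and normalize within each row."""
    joint = Counter()
    row_totals = Counter()
    for example in data:  # each example maps variable name -> value
        pa = tuple(example[p] for p in parents)
        joint[(pa, example[child])] += 1
        row_totals[pa] += 1
    return {key: n / row_totals[key[0]] for key, n in joint.items()}

# Hypothetical complete data for an alarm-style network: A depends on B, E.
data = [
    {"B": 1, "E": 0, "A": 1},
    {"B": 1, "E": 0, "A": 1},
    {"B": 1, "E": 0, "A": 1},
    {"B": 1, "E": 0, "A": 0},
    {"B": 0, "E": 0, "A": 0},
]
cpt = fit_cpt(data, "A", ["B", "E"])
print(cpt[((1, 0), 1)])  # 0.75, i.e. 3 of the 4 matching examples
```

Data fragmentation shows up here directly: the row for (B=0, E=0) is estimated from a single example, and the rows for E=1 have no examples at all.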
Proof • Let BN have structure G over variables X1,…,Xn and parameters q • Given dataset D • L(q; D) = ∏m PG(d[m]; q) = ∏m ∏i PG(xi[m] | paXi[m]; q) = ∏i [∏m PG(xi[m] | paXi[m]; q)] • ∏m PG(xi[m] | paXi[m]; q) is the likelihood of the local CPT of Xi: L(qXi; D) • Each CPT depends on a disjoint set of parameters qXi • => maximizing L(q; D) over all parameters q is equivalent to maximizing L(qXi; D) over each individual qXi
An Alternative Approach: Bayesian Estimation • P(q|d) = 1/Z P(d|q) P(q) is the posterior • Distribution of hypotheses given the data • P(d|q) is the likelihood • P(q) is the hypothesis prior • (Plate model: parameter q is the parent of observations d[1], …, d[M])
Assumption: Uniform Prior, Bernoulli Distribution • Assume P(q) is uniform • P(q|d) = 1/Z P(d|q) = 1/Z q^c (1-q)^(N-c) • What’s P(Y|d), the probability that the next draw Y is a cherry?
Assumption: Uniform Prior, Bernoulli Distribution • => Z = ∫01 q^c (1-q)^(N-c) dq = c! (N-c)! / (N+1)! • => P(Y|d) = 1/Z ∫01 q^(c+1) (1-q)^(N-c) dq = 1/Z (c+1)! (N-c)! / (N+2)! = (c+1) / (N+2) • Can think of this as a “correction” to the ML estimate c/N using “virtual counts”
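The prediction (c+1)/(N+2) derived above (Laplace’s rule of succession) is one line of code; the function name is my own:

```python
def laplace_estimate(c, n):
    """Bayesian prediction under a uniform prior: (c + 1) / (N + 2).
    Equivalent to adding one virtual count to each of the two outcomes."""
    return (c + 1) / (n + 2)

# With no data at all, the prediction is the uninformed 1/2.
print(laplace_estimate(0, 0))   # 0.5
# Unlike the MLE c/N, it never returns exactly 0 or 1.
print(laplace_estimate(0, 10))  # 1/12 ≈ 0.083
```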
Nonuniform Priors • P(q|d) ∝ P(d|q) P(q) = q^c (1-q)^(N-c) P(q) • Define, for all q in [0,1], a prior P(q): the probability that I believe in hypothesis q
Beta Distribution • Beta_a,b(q) = γ q^(a-1) (1-q)^(b-1) • a, b are hyperparameters > 0 • γ is a normalization constant • a=b=1 gives the uniform distribution
Posterior with Beta Prior • Posterior ∝ q^c (1-q)^(N-c) P(q) = γ q^(c+a-1) (1-q)^(N-c+b-1) = Beta_a+c,b+N-c(q) • Prediction = mean: E[q] = (c+a) / (N+a+b)
Posterior with Beta Prior • What does this mean? • The prior specifies “virtual counts” of a-1 heads and b-1 tails • See heads, increment a • See tails, increment b • The effect of the prior diminishes as more data is seen
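The increment-a/increment-b update and the posterior mean can be sketched in a few lines. The function names and the prior Beta(2, 2) are my own illustrative choices:

```python
def beta_update(a, b, c, n):
    """Posterior hyperparameters after observing c heads in N flips,
    starting from a Beta(a, b) prior: Beta(a + c, b + N - c)."""
    return a + c, b + (n - c)

def beta_mean(a, b):
    """Mean of Beta(a, b), used as the predictive estimate of q."""
    return a / (a + b)

a, b = 2, 2                       # a weak prior pulling toward q = 1/2
a, b = beta_update(a, b, c=7, n=10)
print((a, b))                     # (9, 5)
print(beta_mean(a, b))            # (7 + 2) / (10 + 2 + 2) = 9/14 ≈ 0.643
```

This matches the slide’s formula E[q] = (c+a)/(N+a+b), and sits between the MLE 7/10 and the prior mean 1/2, as expected of a virtual-count correction.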
Choosing a Prior • Part of the design process; must be chosen according to your intuition • Uninformed belief: a=b=1; strong belief: large a, b
Extensions of Beta Priors • Parameters of categorical distributions: Dirichlet prior • Mathematical expression is more complex, but in practice it still takes the form of “virtual counts” • Mean and variance of Gaussian distributions: Normal and Gamma priors • Conjugate priors preserve the representation of prior and posterior distributions, but do not necessarily exist for general distributions
Dirichlet Prior • Categorical variable with |Val(X)|=k and P(X=i) = qi • Parameter space q1,…,qk with qi ≥ 0, Σi qi = 1 • Maximum likelihood estimate given counts c1,…,ck in the data D: qi^ML = ci/N • Dirichlet prior: Dirichlet(a1,…,ak)(q) ∝ ∏i qi^(ai-1) • Mean is (a1/aT,…,ak/aT) with aT = Σi ai • Posterior P(q|D) is Dirichlet(a1+c1,…,ak+ck)
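The Dirichlet posterior update and mean above generalize the Beta case directly. A minimal sketch, with function names and the 3-valued example of my own choosing:

```python
def dirichlet_update(alphas, counts):
    """Posterior after observing counts c1..ck under a Dirichlet prior:
    Dirichlet(a1 + c1, ..., ak + ck)."""
    return [a + c for a, c in zip(alphas, counts)]

def dirichlet_mean(alphas):
    """Mean of Dirichlet(a1,...,ak): (a1/aT, ..., ak/aT) with aT = sum ai."""
    total = sum(alphas)
    return [a / total for a in alphas]

alphas = [1, 1, 1]                 # uniform prior over a 3-valued variable
post = dirichlet_update(alphas, [5, 3, 0])
print(post)                        # [6, 4, 1]
print(dirichlet_mean(post))        # [6/11, 4/11, 1/11]
```

Note that the value never observed in the data still gets nonzero probability 1/11, the virtual-count correction at work.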
Recap • Learning => optimization problem (ML) • Learning => inference problem (Bayesian estimation) • Learning parameters of Bayesian networks • Conjugate priors