Parameterized Models
• We attempt to model the underlying distribution D(x, y) or D(y | x).
• To do that, we assume a model P(x, y | θ) or P(y | x, θ), where θ is the set of parameters of the model.
• Example: probabilistic context-free grammars
  • We assume a model of independent rules. Therefore, P(x, y | θ) is the product of rule probabilities.
  • The parameters are the rule probabilities.
• Example: log-linear models
  • We assume a model of the form P(x, y | θ) = exp{θ·Φ(x, y)} / Σy' exp{θ·Φ(x, y')}
  • where θ is a vector of parameters and Φ(x, y) is a vector of features.
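As a concrete illustration of the log-linear form, here is a minimal sketch in Python. The two-label task, the feature function phi, and the weights theta are made-up toy choices, not anything from the slides; the point is only how exp{θ·Φ(x, y)} gets normalized over the labels.

```python
import math

# Minimal sketch of a log-linear (softmax) model over a small label set.
# The feature function phi and the weights theta below are made-up toy values.

LABELS = ["noun", "verb"]

def phi(x, y):
    """Toy feature vector for an (input, label) pair."""
    return [1.0 if x.endswith("ing") and y == "verb" else 0.0,
            1.0 if y == "noun" else 0.0]

def p_y_given_x(x, y, theta):
    """P(y | x, theta) = exp{theta . phi(x, y)} / sum over y' of exp{theta . phi(x, y')}"""
    def score(label):
        return sum(t * f for t, f in zip(theta, phi(x, label)))
    z = sum(math.exp(score(label)) for label in LABELS)
    return math.exp(score(y)) / z

theta = [2.0, 0.5]                      # toy parameter vector
print(p_y_given_x("running", "verb", theta))
print(p_y_given_x("table", "noun", theta))
```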
Log-Likelihood
• We often look at the log-likelihood. Why?
• It makes mathematically explicit that independence becomes linearity: the log of a product of independent terms is a sum of logs.
• Given training samples (xi, yi), maximize the log-likelihood
  L(θ) = Σi log P(xi, yi | θ)   or   L(θ) = Σi log P(yi | xi, θ)
• Monotonicity of log(·) means the θ that maximizes the log-likelihood is the same θ that maximizes the likelihood.
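A minimal numerical sketch of this point, assuming a toy Bernoulli (coin) model and made-up data: the log-likelihood is just a sum of per-sample log-probabilities, and it is maximized by the same parameter as the likelihood itself.

```python
import math

# Minimal sketch: for i.i.d. samples, the log-likelihood is a sum of
# log-probabilities, and its argmax equals the argmax of the likelihood.
# The Bernoulli model and the data below are illustrative assumptions.

data = [1, 0, 1, 1, 0, 1]     # toy coin flips (1 = heads)

def log_likelihood(p, xs):
    return sum(math.log(p if x == 1 else 1.0 - p) for x in xs)

def likelihood(p, xs):
    prod = 1.0
    for x in xs:
        prod *= p if x == 1 else 1.0 - p
    return prod

# Both are maximized by the same p (here the sample mean, 4/6).
best_log = max((log_likelihood(p / 100, data), p / 100) for p in range(1, 100))
best_lik = max((likelihood(p / 100, data), p / 100) for p in range(1, 100))
print(best_log[1], best_lik[1])   # same argmax
```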
Parameterized Maximum Likelihood Estimate
• Model: the coin has a weighting p that determines its behavior; tosses are independent.
• m tosses of the (p, 1-p) coin yield HTTHTHHHTTH…HTT; in our model only the counts matter: k Heads, m-k Tails.
• H is the hypothesis space over p, say [0, 1]. What is hML?
• If p is the probability of Heads, the probability of the observed data is: P(D | p) = p^k (1-p)^(m-k)
• The log-likelihood: L(p) = log P(D | p) = k log(p) + (m-k) log(1-p)
• To maximize, set the derivative w.r.t. p equal to 0: dL(p)/dp = k/p - (m-k)/(1-p)
• Solving for p gives p = k/m (the sample average of Heads).
• Notice we disregard any interesting prior over H (especially striking at low m).
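A small sketch of the same calculation, using a made-up toss string: the closed-form estimate k/m matches what a brute-force search over candidate values of p finds.

```python
import math

# Minimal sketch of the coin MLE: the closed-form estimate p = k/m agrees
# with a brute-force search over candidate values of p. Toss data is made up.

tosses = "HTTHTHHHTTH"
k = tosses.count("H")
m = len(tosses)

def log_likelihood(p):
    return k * math.log(p) + (m - k) * math.log(1.0 - p)

closed_form = k / m
grid_search = max((log_likelihood(p / 1000), p / 1000) for p in range(1, 1000))[1]
print(closed_form, grid_search)   # both are (approximately) k/m
```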
Justification
• Assumption: our selection of the model is good; that is, there is some parameter setting θ* such that P(x, y | θ*) approximates D(x, y).
• Define the maximum-likelihood estimate: θML = argmaxθ L(θ)
• Property of maximum-likelihood estimates:
  • As the training sample size goes to ∞, P(x, y | θML) converges to the best approximation of D(x, y) in the model space.
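A quick simulation sketch of this convergence property on the coin model; the true parameter value 0.3 is an arbitrary illustrative choice.

```python
import random

# Minimal sketch of consistency on the coin model: as the sample size grows,
# the ML estimate k/m approaches the true parameter. true_p is made up.

random.seed(0)
true_p = 0.3

for m in (10, 100, 1000, 10000):
    k = sum(random.random() < true_p for _ in range(m))
    print(m, k / m)   # estimates drift toward 0.3 as m grows
```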
Other Examples (Bigrams)
• Model 2: there are 3 characters, A, B, C, plus a stop/space symbol SP.
• At any point we can generate any of them, according to:
  Start:  Ps(A)=0.4; Ps(B)=0.4; Ps(C)=0.2
  From A: PA(A)=0.5; PA(B)=0.3; PA(C)=0.1; PA(SP)=0.1
  From B: PB(A)=0.1; PB(B)=0.4; PB(C)=0.4; PB(SP)=0.1
  From C: PC(A)=0.3; PC(B)=0.4; PC(C)=0.2; PC(SP)=0.1
• The generative model (bigram) we assume is:
  P(U) = P(x1, x2, …, xk) = P(x1) Πi=2..k P(xi | x1, x2, …, xi-1) = P(x1) Πi=2..k P(xi | xi-1)
• This is a second-order Markov model.
• The parameters of the model are the generation probabilities.
• Goal: determine which of two strings U, V is more likely.
• The Bayesian way: compute the probability of each string, and decide which is more likely.
• Learning here is learning the parameters (just more of them).
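A minimal sketch of the "which string is more likely" computation under exactly these start and transition probabilities; the two strings U and V are made-up examples.

```python
import math

# Minimal sketch of the bigram example: score two strings under the given
# start and transition probabilities and report which is more likely.
# The strings U and V below are made-up examples.

start = {"A": 0.4, "B": 0.4, "C": 0.2}
trans = {
    "A": {"A": 0.5, "B": 0.3, "C": 0.1, "SP": 0.1},
    "B": {"A": 0.1, "B": 0.4, "C": 0.4, "SP": 0.1},
    "C": {"A": 0.3, "B": 0.4, "C": 0.2, "SP": 0.1},
}

def log_prob(chars):
    """log P(x1) + sum of log P(x_i | x_{i-1}) under the bigram model."""
    lp = math.log(start[chars[0]])
    for prev, cur in zip(chars, chars[1:]):
        lp += math.log(trans[prev][cur])
    return lp

U = ["A", "A", "B", "C"]
V = ["C", "A", "B", "B"]
print("U" if log_prob(U) > log_prob(V) else "V", "is more likely")
```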
Markov Models by letter; train on English text
• Zeroth order: one parameter – equiprobable letters
  XFOML RXKHRJFFJUJ ZLPWCFWKCYJ…
• First order: 27 parameters, one for each letter
  OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH…
• Second order: ~700 parameters – letter pairs (~bigrams)
  ON IE ANTSOUTINYS ARE T INCTORE ST BE…
• Third order: ~20,000 parameters – letter triplets of English
  IN NO IST LAT WHEY CRATICT FROURE BERS GROCID…
• Fourth order: ~500,000 parameters
  THE GENERATED JOB PROVIDUAL BETTER TRAND THE…
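A small sketch of how such a letter-pair ("second order") model is estimated and sampled. The one-sentence corpus is a stand-in for real English text, so the output is far less convincing than the examples above.

```python
import random
from collections import Counter, defaultdict

# Minimal sketch of the letter-pair approximation: estimate
# P(next letter | current letter) by counting over a tiny corpus, then sample.
# The corpus string below is an illustrative stand-in for real English text.

corpus = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"

counts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    counts[prev][cur] += 1

def sample_next(prev):
    letters, freqs = zip(*counts[prev].items())
    return random.choices(letters, weights=freqs)[0]

random.seed(0)
out = "t"
for _ in range(60):
    out += sample_next(out[-1])
print(out)   # gibberish with roughly English-like letter-pair statistics
```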
Markov Models by word; train on English text
• First order: 10^4 parameters, one for each word
  REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE AN IN CAME THE…
• Second order: 10^8 parameters – word pairs (bigrams)
  THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD…
• Could use additional features: syntactic parts of speech, semantic classes (animate / inanimate / physical / mental…)
• Could model a probabilistic grammar with nonterminals and terminals
• Could train on English subsets (e.g., speeches of George Bush yield a generative model of George Bush utterances)
Other Examples (HMMs)
[Figure: a two-state HMM with states B and I, annotated with the probabilities below, plus an example run B I I I B generating the observations a a c d d.]
• Model 3: there are 5 characters, x ∈ {a, b, c, d, e}, and 2 states, s ∈ {B, I}.
• We generate characters according to:
  • Initial state probabilities: p(B)=0.5; p(I)=0.5
  • State transition probabilities: p(B→B)=0.8, p(B→I)=0.2, p(I→B)=0.5, p(I→I)=0.5
  • Output probabilities: p(a|B)=0.25, p(b|B)=0.10, p(c|B)=0.10, … ; p(a|I)=0.25, …
• We can follow the generation process to get the observed sequence.
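A minimal sketch of generating from this HMM. The initial and transition probabilities match the slide; the emission tables are only partially given, so the remaining entries below are illustrative assumptions chosen to sum to 1.

```python
import random

# Minimal sketch of generation from the two-state HMM. Transition and initial
# probabilities follow the slide; the emission tables are only partially given
# there, so the values marked "assumed" fill in the missing entries.

random.seed(0)

initial = {"B": 0.5, "I": 0.5}
transition = {"B": {"B": 0.8, "I": 0.2}, "I": {"B": 0.5, "I": 0.5}}
emission = {
    "B": {"a": 0.25, "b": 0.10, "c": 0.10, "d": 0.30, "e": 0.25},  # d, e assumed
    "I": {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.15, "e": 0.10},  # b..e assumed
}

def draw(dist):
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs)[0]

def generate(length):
    state = draw(initial)
    states, observations = [], []
    for _ in range(length):
        states.append(state)
        observations.append(draw(emission[state]))
        state = draw(transition[state])
    return states, observations

states, obs = generate(5)
print(states, obs)   # e.g. a hidden path like B I I I B and its outputs
```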
Bayesian Networks (belief networks, …)
• Generative model
• Compact representation of a joint probability distribution when there are redundancies in the joint
• Conditional independence: P(A | B, C) = P(A | C), "A is conditionally independent of B given C"
• Then B is also conditionally independent of A given C
• C can still depend on both A and B: P(C | A) ≠ P(C) and P(C | B) ≠ P(C)
• Given C, A and B do not influence each other
• If we do NOT know C, then A and B MAY influence each other
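A small numerical sketch of these two claims, assuming a made-up joint of the form P(C) P(A | C) P(B | C): once C is known, B tells us nothing more about A, but unconditionally A and B are correlated.

```python
from itertools import product

# Minimal sketch of the conditional-independence claim: in a joint built as
# P(C) P(A | C) P(B | C) (all numbers below are illustrative assumptions),
# A and B are independent given C, but not independent unconditionally.

p_c = 0.3
p_a_given_c = {True: 0.9, False: 0.2}
p_b_given_c = {True: 0.8, False: 0.1}

def joint(c, a, b):
    pc = p_c if c else 1 - p_c
    pa = p_a_given_c[c] if a else 1 - p_a_given_c[c]
    pb = p_b_given_c[c] if b else 1 - p_b_given_c[c]
    return pc * pa * pb

def prob(**fixed):
    """Marginal probability of an assignment to any subset of {c, a, b}."""
    total = 0.0
    for c, a, b in product((True, False), repeat=3):
        values = {"c": c, "a": a, "b": b}
        if all(values[k] == v for k, v in fixed.items()):
            total += joint(c, a, b)
    return total

# P(A | B, C) == P(A | C): given C, observing B changes nothing about A.
print(prob(a=True, b=True, c=True) / prob(b=True, c=True),
      prob(a=True, c=True) / prob(c=True))

# But P(A | B) != P(A): without C, observing B does shift our belief about A.
print(prob(a=True, b=True) / prob(b=True), prob(a=True))
```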
Bayesian Networks (belief networks, …)
• Graphical model: probability + graph theory
  • Nodes – discrete random variables
  • Directed arcs – direct influence
    • Usually "causality" (class, text, …); sometimes "dependency"
  • Conditional Probability Tables (CPTs)
  • No directed cycles (DAG)
• General graphical models
  • Markov random fields (undirected dependencies)
  • Dynamic Bayesian nets
  • others
Dentist Example
[Figure: a Bayesian network with node C as the parent of nodes A and B.]
• 3 Boolean random variables:
  C – patient has a cavity
  A – patient reports a toothache
  B – dentist's probe catches on tooth
• C directly influences A and B.
• A and B are conditionally independent given C.
• A Conditional Probability Table (CPT) at each node specifies a probability distribution for each context (configuration of parents).
• Note how the graph represents our conditional independence assumptions.
How Many Numbers?
• Distribution(s) at C: P(cavity), P(¬cavity). (How many numbers?) Need one number.
• Distribution(s) at A: P(ache | cavity), P(¬ache | cavity), P(ache | ¬cavity), P(¬ache | ¬cavity). (How many numbers?) Two: P(ache | cavity) & P(ache | ¬cavity).
• Distributions at B: like A, two numbers.
• Total: 1 + 2 + 2 = 5.
• How many numbers for the full joint? 2^3 - 1 = 7 (a savings of 2!)
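A minimal sketch of this counting argument: the five numbers below (illustrative values, not given on the slides) determine all 2^3 joint entries through P(C, A, B) = P(C) P(A | C) P(B | C).

```python
from itertools import product

# Minimal sketch of the dentist network: 5 CPT numbers (illustrative values,
# not from the slides) determine the full joint over 2^3 outcomes via
# P(C, A, B) = P(C) P(A | C) P(B | C).

p_cavity = 0.2                                   # 1 number at C
p_ache_given = {True: 0.7, False: 0.1}           # 2 numbers at A: P(ache | C)
p_catch_given = {True: 0.8, False: 0.05}         # 2 numbers at B: P(catch | C)

def joint(c, a, b):
    pc = p_cavity if c else 1 - p_cavity
    pa = p_ache_given[c] if a else 1 - p_ache_given[c]
    pb = p_catch_given[c] if b else 1 - p_catch_given[c]
    return pc * pa * pb

# The 8 joint entries sum to 1, so the unrestricted joint needs 2^3 - 1 = 7
# free numbers, while the network needs only 1 + 2 + 2 = 5.
total = sum(joint(c, a, b) for c, a, b in product([True, False], repeat=3))
print(total)   # 1.0
```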
Learning BNs from data
• Easy: estimating maximum-likelihood CPT values from data for fully observable systems, given the structure. (How?) Count!
  • Need a significant amount of data for every important CPT entry. (Which ones are not important? E.g., hit by a meteor / car while crossing the street.) But it's not just small CPT values!
• Not too hard: estimating ML CPT values from data for partially observable systems, given the structure (EM – later).
• Hard: learning an adequate structure from data. (Why?)
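A small sketch of the "count!" recipe for the fully observable case, using made-up records of (cavity, ache, catch): each CPT entry is just a normalized count.

```python
from collections import Counter

# Minimal sketch of ML estimation of the dentist network's CPTs from fully
# observed data: count and normalize. The records below are made up.

data = [  # (cavity, ache, catch) observations
    (True, True, True), (True, True, False), (True, False, True),
    (False, False, False), (False, False, False), (False, True, False),
    (False, False, False), (True, True, True),
]

n = len(data)
cavity_counts = Counter(c for c, _, _ in data)
p_cavity = cavity_counts[True] / n

def conditional(child_index):
    """Estimate P(child = True | cavity) for each value of cavity by counting."""
    est = {}
    for cval in (True, False):
        rows = [r for r in data if r[0] == cval]
        est[cval] = sum(r[child_index] for r in rows) / len(rows)
    return est

p_ache_given_cavity = conditional(1)
p_catch_given_cavity = conditional(2)
print(p_cavity, p_ache_given_cavity, p_catch_given_cavity)
```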
Inference with BNs
• Calculate posteriors of interest given some evidence (observations).
• Exact inference – closed-form expressions
  • CPTs
  • Marginalizing
  • Bayes rule
• Polytrees
  • Singly connected BNs
  • No undirected cycles
• Depends on knowing the conditional independences.
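A minimal sketch of exact inference by enumeration on the dentist network: marginalize the joint over the unobserved variable and normalize via Bayes rule. The CPT numbers are the same illustrative assumptions used in the earlier sketches, not values from the slides.

```python
from itertools import product

# Minimal sketch of exact inference by enumeration: compute P(cavity | ache)
# by summing out the unobserved variable (catch) and normalizing.
# The CPT values are illustrative assumptions, not given on the slides.

p_cavity = 0.2
p_ache_given = {True: 0.7, False: 0.1}
p_catch_given = {True: 0.8, False: 0.05}

def joint(c, a, b):
    pc = p_cavity if c else 1 - p_cavity
    pa = p_ache_given[c] if a else 1 - p_ache_given[c]
    pb = p_catch_given[c] if b else 1 - p_catch_given[c]
    return pc * pa * pb

def posterior_cavity(ache):
    """P(cavity | ache) = sum_b P(cavity, ache, b) / sum_{c,b} P(c, ache, b)"""
    numer = sum(joint(True, ache, b) for b in (True, False))
    denom = sum(joint(c, ache, b) for c, b in product((True, False), repeat=2))
    return numer / denom

print(posterior_cavity(True))    # probability of a cavity given a toothache
```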
Naïve Bayes
• A particularly important form of Bayes nets
• Friday (Li-Lun)