Homework – BN for 3 Coin Problem
Toss coin 0. If heads, toss coin 1 four times; if tails, toss coin 2 four times.
Generating the sequence HHHHT, THTHT, HHHHT, HHTTH, … produced by Coin 0, Coin 1, and Coin 2.
Erase the result of flipping coin 0 (the red flips).
Observing the sequence HHHT, HTHT, HHHT, HTTH, … produced by Coin 1 and/or Coin 2.
Estimate the most likely values for p, q, and the bias of coin 0.
There is no known analytical solution for computing the parameters that maximize the likelihood of the data.
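Since there is no closed-form maximizer, such parameters are usually fit iteratively. Below is a minimal EM sketch for this problem, assuming the names pi for the bias of coin 0 and p, q for coins 1 and 2; the starting values are illustrative assumptions, not part of the homework.

```python
# Minimal EM sketch for the 3-coin problem (parameter names pi, p, q and
# the starting values are assumptions for illustration).
# Each observation is a 4-flip string with the coin-0 flip erased.

def em_three_coins(sequences, pi, p, q, iters=100):
    data = [(s.count('H'), len(s)) for s in sequences]
    for _ in range(iters):
        # E-step: posterior probability that coin 1 produced each sequence
        resp = []
        for h, n in data:
            like1 = pi * (p ** h) * ((1 - p) ** (n - h))        # coin 0 came up H
            like2 = (1 - pi) * (q ** h) * ((1 - q) ** (n - h))  # coin 0 came up T
            resp.append(like1 / (like1 + like2))
        # M-step: re-estimate pi, p, q from the expected counts
        pi = sum(resp) / len(resp)
        p = (sum(r * h for r, (h, n) in zip(resp, data))
             / sum(r * n for r, (h, n) in zip(resp, data)))
        q = (sum((1 - r) * h for r, (h, n) in zip(resp, data))
             / sum((1 - r) * n for r, (h, n) in zip(resp, data)))
    return pi, p, q

print(em_three_coins(["HHHT", "HTHT", "HHHT", "HTTH"], pi=0.6, p=0.7, q=0.4))
```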
Best Bayesian Network Structure
• Three Bernoulli distributions (binomial less “n”)
• One parameter each
• Parameters: p, q, and the bias of coin 0
• Whether we draw from p or q depends on the (unobserved) outcome
• Best formalized as two random variables (not three, not one, not five…)
• A: first coin used (the unobserved first flip)
• B: second coin used (the next four flips)
• Node A needs one distribution, hence one parameter: Pr(A), the bias of coin 0
• Node B needs two distributions, hence two parameters: Pr(B|A), i.e. p, and Pr(B|¬A), i.e. q (see the likelihood sketch below)
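As a small illustration of this parameterization, the sketch below computes the likelihood of one observed four-flip string by marginalizing out the unobserved variable A; the names pi, p, q and the numeric values are assumptions for illustration.

```python
# Sketch: likelihood of one observed 4-flip string under the two-variable
# model, marginalizing out the unobserved coin-0 outcome A.
# pi = Pr(A=H), p = Pr(B=H | A=H), q = Pr(B=H | A=T); names are assumed.

def sequence_likelihood(flips, pi, p, q):
    h = flips.count('H')
    t = len(flips) - h
    return pi * (p ** h) * ((1 - p) ** t) + (1 - pi) * (q ** h) * ((1 - q) ** t)

print(sequence_likelihood("HHHT", pi=0.5, p=0.8, q=0.3))
```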
In general (in classes, in life, …)
• Technical mastery is important (in fact, essential)
• Easily taught
• Easily learned (well… can be tedious)
• Learning how / when to apply techniques is harder
• Less tedious & more interesting
• Coin problem
• Given three distributions
• NOT (necessarily) given three random variables
• Decide how to model the world – what are the random variables
• More natural / unnatural than right / wrong
• Better model = better fit of the world phenomenon to our techniques
• ALWAYS look for the deeper / more interesting understanding
• Master the techniques. But do not be content with that!
Markov Networks (Markov Random Fields)
• Graphical Models
  • Graphical structure
  • Parameters
  • Graphical semantics dictates/limits evidence propagation
• Bayesian Networks
  • Conditional independence via directed arcs
  • Parameters: Conditional Probability Tables (CPTs)
• Markov Networks
  • Undirected arcs
  • Potential functions over maximal cliques
  • Somewhat simplifies evidence propagation (at least makes it more intuitive)
Bayesian Networks
Recall evidence propagation by d-separation.
X and Y are conditionally independent if, given E, all paths from X to Y are d-separated.
Bayesian Networks: d-separation
(Figure: the burglary network, with arcs B → A, E → A, A → J, A → M.)
B – a burglary is in progress
E – an earthquake is in progress
A – the alarm is sounding
J – John calls
M – Mary calls
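One way to sanity-check a d-separation claim is to build the joint distribution from the CPTs and test the independence numerically. The sketch below does this for J ⊥ M given A; the CPT numbers are illustrative assumptions, not values from the slides.

```python
# Sketch: checking a d-separation claim numerically on the burglary network.
# The CPT numbers below are illustrative assumptions.
from itertools import product

P_B, P_E = 0.01, 0.02
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # Pr(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                       # Pr(J=true | A)
P_M = {True: 0.70, False: 0.01}                       # Pr(M=true | A)

def joint(b, e, a, j, m):
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pb * pe * pa * pj * pm

def prob(pred):
    return sum(joint(*v) for v in product([True, False], repeat=5) if pred(*v))

# Check Pr(J, M | A=true) == Pr(J | A=true) * Pr(M | A=true)
pa  = prob(lambda b, e, a, j, m: a)
pjm = prob(lambda b, e, a, j, m: a and j and m) / pa
pj  = prob(lambda b, e, a, j, m: a and j) / pa
pm  = prob(lambda b, e, a, j, m: a and m) / pa
print(abs(pjm - pj * pm) < 1e-12)   # True: A d-separates J and M
```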
Markov Network: Markov Blanket
(Figure: node X surrounded by its neighbors E, which form its Markov blanket.)
X is conditionally independent of all other nodes given its Markov blanket.
This is also true of BNs, but the “Markov blanket” definition is more complex for BNs.
Markov Network: d-separation
(Figure: node sets A and B separated by evidence set E.)
• Sets A and B are conditionally independent given evidence E
• d-separation is easier / more intuitive
• “Remove” evidence nodes
• No remaining path means conditional independence (see the sketch below)
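A minimal sketch of the “remove the evidence and look for a remaining path” test, on a hypothetical adjacency-list graph:

```python
# Sketch: testing separation in a Markov network by deleting the evidence
# nodes and searching for a remaining path (breadth-first search).
from collections import deque

def separated(graph, A, B, E):
    """True if every path from a node in A to a node in B passes through E."""
    A, B, E = set(A), set(B), set(E)
    seen, frontier = set(A), deque(A)
    while frontier:
        u = frontier.popleft()
        for v in graph[u]:
            if v in E or v in seen:      # evidence nodes are "removed"
                continue
            if v in B:
                return False             # found an unblocked path
            seen.add(v)
            frontier.append(v)
    return True

graph = {"a": {"e1"}, "e1": {"a", "b"}, "b": {"e1", "c"}, "c": {"b"}}
print(separated(graph, {"a"}, {"c"}, {"e1"}))   # True: e1 blocks every a-c path
print(separated(graph, {"a"}, {"c"}, set()))    # False: a-e1-b-c is open
```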
Markov Network Parameters
(Figure: a graph with cliques labeled A, B, C, D.)
• Maximal cliques
• Clique: subset of nodes with every pair connected
• Maximal clique: no node can be added and remain a clique
• In the figure, A, B, and C are maximal cliques; D is a non-maximal clique
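For illustration, maximal cliques can be enumerated with the networkx library (assumed to be available); the graph below is a hypothetical example in the spirit of the figure.

```python
# Sketch: enumerating maximal cliques with networkx (assumed available).
# Edges are chosen so that three triangles are maximal cliques, while any
# single edge inside a triangle is a clique but not a maximal one.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),   # clique A
    ("a3", "b1"), ("b1", "b2"), ("a3", "b2"),   # clique B shares node a3
    ("b2", "c1"), ("c1", "c2"), ("b2", "c2"),   # clique C shares node b2
])

for clique in nx.find_cliques(G):   # yields only the *maximal* cliques
    print(sorted(clique))
# The edge ("a1", "a2") alone is a clique but not maximal: a3 can be added.
```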
Markov Network Parameters
• Potential functions φ_k
  • One potential function for each maximal clique k, k = 1, …, #max cliques
  • Assigns a real number to each configuration or assignment of its variables
  • Denotes a measure of relative preference over assignments
  • Not normalized – these are not probabilities
• Distribution is factorable into potential functions
  • Probability distribution can be recovered: P(x) = (1/Z) ∏_k φ_k(x_{C_k}), where x_{C_k} is x restricted to clique C_k
• Potential functions
  • Have the requisite algebraic structure
  • But need to be normalized: Z = Σ_x ∏_k φ_k(x_{C_k}) (see the sketch below)
  • Inference may not require normalization (think conditioning)
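A brute-force sketch of recovering probabilities from potentials on a tiny binary chain a – b – c; the potential tables are illustrative assumptions.

```python
# Sketch: recovering a probability distribution from clique potentials by
# brute-force normalization over a tiny binary Markov network a - b - c.
from itertools import product

# Maximal cliques {a, b} and {b, c}; potentials prefer agreeing neighbors.
phi_ab = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnormalized(a, b, c):
    return phi_ab[(a, b)] * phi_bc[(b, c)]

Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

def prob(a, b, c):
    return unnormalized(a, b, c) / Z

print(Z, prob(0, 0, 0))   # after dividing by Z the values sum to one
```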
Markov Network Inference
(Figure: overlapping maximal cliques A, B, C.)
• Given values for some evidence random variables
• Compute probabilities for query random variables
• In general maximal cliques will overlap
• Among all maximal cliques, balance competing preferences given the evidence
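Continuing in the same spirit, the sketch below answers a conditional query on a tiny chain a – b – c by summing unnormalized potentials; the normalizer Z cancels in the ratio, which illustrates why conditioning does not require normalization. The potential tables are again illustrative assumptions.

```python
# Sketch: a conditional query answered directly from unnormalized clique
# potentials; Z cancels in the numerator/denominator ratio.
phi_ab = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnormalized(a, b, c):
    return phi_ab[(a, b)] * phi_bc[(b, c)]

def prob_a1_given_c(c):
    """Pr(a = 1 | c) on the chain a - b - c, summing out b."""
    num = sum(unnormalized(1, b, c) for b in (0, 1))
    den = sum(unnormalized(a, b, c) for a in (0, 1) for b in (0, 1))
    return num / den

print(prob_a1_given_c(0))
```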
A little Information Theory
• Manipulating / inferring with probability distributions
• If the world / phenomenon is a probability distribution
  • How accurately should we capture it with our representations?
  • How can we measure accuracy?
  • How much information is needed to achieve desired accuracy?
• Entropy, Joint Entropy
• Mutual Information
• KL (Kullback–Leibler) divergence (distance?)
Entropy
The entropy, H(M), of a discrete random variable M is a measure of the amount of uncertainty one has about the value of M: H(M) = −Σ_m P(m) log P(m).
Increasing our information about M decreases our uncertainty, which is why entropy can be used to guide growing decision trees.
The joint entropy of two discrete (not necessarily independent) random variables X and Y is just the entropy of their pairing (X, Y): H(X,Y) = −Σ_{x,y} P(x,y) log P(x,y).
If variables X and Y are independent, the joint entropy is just the sum of the individual entropies: H(X,Y) = H(X) + H(Y).
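A small sketch of these definitions in code, using base-2 logs and an illustrative pair of independent coins:

```python
# Sketch: entropy and joint entropy of discrete distributions given as
# dictionaries mapping values (or value pairs) to probabilities.
import math

def entropy(dist):
    """H = -sum p log2 p over outcomes with non-zero probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A fair coin and an independent biased coin (illustrative numbers).
px = {'H': 0.5, 'T': 0.5}
py = {'H': 0.9, 'T': 0.1}
pxy = {(x, y): px[x] * py[y] for x in px for y in py}   # independent pairing

print(entropy(px), entropy(py))
print(entropy(pxy))   # equals H(X) + H(Y) because X and Y are independent
```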
Conditional Entropy
• The conditional entropy of X given Y is the remaining entropy of X, averaged over knowing Y
• Different from knowing that Y is a particular value – averaged over all values of Y:
  H(X|Y) = Σ_y P(y) H(X|Y=y) = −Σ_{x,y} P(x,y) log P(x|y)
• It follows that H(X|Y) = H(X,Y) − H(Y)
• This should not surprise us
  • Conditional probabilities: P(x|y) = P(x,y) / P(y)
  • The log in entropy turns that quotient into a difference
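The identity can be checked numerically; the joint table below is an illustrative assumption.

```python
# Sketch: conditional entropy from a joint table, plus a numerical check
# of H(X|Y) = H(X,Y) - H(Y). The joint probabilities are illustrative.
import math

pxy = {('a', 0): 0.30, ('a', 1): 0.20,
       ('b', 0): 0.10, ('b', 1): 0.40}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

py = {}
for (x, y), p in pxy.items():
    py[y] = py.get(y, 0.0) + p            # marginal over Y

# H(X|Y) = -sum_{x,y} P(x,y) log2 P(x|y), averaged over all values of Y
h_x_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in pxy.items() if p > 0)

print(h_x_given_y)
print(entropy(pxy) - entropy(py))         # matches: H(X|Y) = H(X,Y) - H(Y)
```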
Mutual Information
• Mutual information measures how much information can be obtained about one random variable by observing another:
  I(X;Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X,Y)
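A sketch computing mutual information from a joint table via I(X;Y) = H(X) + H(Y) − H(X,Y); the table is an illustrative assumption.

```python
# Sketch: mutual information from a joint table of illustrative probabilities.
import math

pxy = {('a', 0): 0.30, ('a', 1): 0.20,
       ('b', 0): 0.10, ('b', 1): 0.40}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

px, py = {}, {}
for (x, y), p in pxy.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

mi = entropy(px) + entropy(py) - entropy(pxy)
print(mi)   # zero exactly when X and Y are independent
```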
Relative Entropy or Kullback–Leibler Divergence
• KL divergence is D(P‖Q) = Σ_i P(i) log( P(i) / Q(i) )
• P and Q are distributions over the same domain X; we sum over all i ∈ X
• KL divergence measures the relative inefficiency of modeling a distribution as Q when the true distribution is P
• It is not a true distance (the distance there is different from the distance back)
• This can be made intuitive by remembering it is an inefficiency measure
  • Consider the same discrepancy d
    • At i=j, where P(j) is small and Q(j) is too large
    • At i=k, where P(k) is large and Q(k) is too small
  • The penalty contribution is less where P is small, even though the discrepancy is the same
    • Since P is the true distribution, we are less likely to exercise the discrepancy
    • Also, the penalty is less since we divide by Q
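A sketch of the divergence and its asymmetry; the distributions P and Q below are illustrative assumptions.

```python
# Sketch: KL divergence and its asymmetry, on illustrative distributions.
import math

def kl(p, q):
    """D(P || Q) = sum_i P(i) log2( P(i) / Q(i) ), with 0 log 0 taken as 0."""
    return sum(pi * math.log2(pi / q[i]) for i, pi in p.items() if pi > 0)

P = {'x': 0.7, 'y': 0.2, 'z': 0.1}
Q = {'x': 0.5, 'y': 0.3, 'z': 0.2}

print(kl(P, Q), kl(Q, P))   # the two directions differ: not a true distance
```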