Introduction to Bayesian Networks. Instructor: Dan Geiger What's all the fuss about? A happy marriage between probability theory and graph theory. Its successful children: algorithms for fault diagnosis, error-correcting codes, and models of complex systems. Applications in a wide variety of fields. A course offering a rather personal view of a topic I have researched for many years. Web page: www.cs.technion.ac.il/~dang/courseBN http://webcourse.cs.technion.ac.il/236372 Email: dang@cs.technion.ac.il Phone: 829 4339 Office: Taub 616.
What is it all about? • How to use graphs to represent probability distributions over thousands of random variables? • How to encode conditional independence in directed and undirected graphs? • How to use such representations for efficient computation of the probabilities of events of interest? • How to learn such models from data?
Course Information Meetings: • Lecture: Wednesdays 10:30 – 12:30 • Tutorial: Wednesdays 16:30 – 17:30 Grade: • 30% for 5 question sets. These question sets are obligatory; each contains mostly theoretical problems. Submit in pairs before the due time (three weeks). • 40% for a one- or two-hour lecture (priority to graduate students). • 10% for attending at least 12 lectures and recitation classes, for a passing grade (in very special circumstances, 2% per missing item). • 20% for checking homework sets and presenting homework solutions. Prerequisites: • Data Structures 1 (cs234218) • Algorithms 1 (cs234247) • Probability (any course) Information and handouts: • http://webcourse.cs.technion.ac.il/236372 • http://www.cs.technion.ac.il/~dang/courseBN/ (lecture slides only)
Relations to Some Other Courses Tell me who your friends are and I will tell you who you are. • Introduction to Artificial Intelligence (cs236501) • Introduction to Machine Learning (cs236756) • Introduction to Neural Networks (cs236950) • Algorithms in Computational Biology (cs236522) • Error correcting codes • Data mining
[Course schedule slide: a legend marking student lectures (8) and tentative student lectures (7).]
• Mathematical Foundations (4 weeks, including student lectures; based on Pearl's Chapter 3 + papers).
  • Properties of conditional independence (soundness and completeness of marginal independence, graphoid axioms and their interpretation as "irrelevance", incompleteness of conditional independence, no disjunctive axioms possible).
  • Properties of graph separation (Paz and Pearl 85, Theorem 3); soundness and completeness of saturated independence statements. Undirected graphs as I-maps of probability distributions. Markov blankets, pairwise-independence basis. Representation theorems (Pearl and Paz, from each basis to I-maps). Markov networks, the Hammersley-Clifford (HC) representation theorem, completeness theorem. Markov chains.
  • Bayesian networks: d-separation, soundness, completeness.
  • Chordal graphs as the intersection of Bayesian networks and Markov networks; equivalence of their four definitions.
• Combinatorial Optimization of Exact Inference in Graphical Models (3 weeks, including student lectures).
  • HMMs.
  • Exact inference and its combinatorial optimization.
  • The clique-tree algorithm. Conditioning.
  • Tree-width. Feedback vertex set.
• Learning (5 weeks, including student lectures).
  • Introduction to Bayesian statistics.
  • Learning Bayesian networks.
  • Chow and Liu's algorithm; the TAN model.
  • Structural EM.
  • Searching for Bayesian networks.
• Applications (2 weeks, including student lectures).
Homeworks • HMW #1: Read Chapters 3.1 & 3.2.1. Answer Questions 3.1 and 3.2, prove Eq. (3.5b), and fully expand/fix the proof of Theorem 2. Submit in pairs no later than 28/10/09 (two weeks from now). • HMW #2: Read Chapter 2 in Pearl's book and answer 6 questions of your choice at its end. Submit in pairs no later than 11/11/09. • HMW #3: Read all of Chapter 3. Answer 5 additional questions of your choice at its end. Submit in pairs no later than 2/12/09. • HMW #4: Submit in pairs by 23/12/09. • HMW #5: Submit in pairs by 6/1/10. • Pearl's book contains all the notations that I happen not to define in these slides; consult it often. It is also a unique and interesting classic textbook.
The Traditional View of Probability in Text Books • Probability theory texts give the impression that we must literally represent a joint distribution explicitly, as P(x1,…,xn) over all propositions and their combinations. Such a representation is consistent and exhaustive. • This representation stands in sharp contrast to human reasoning: it requires exponentially many computations to obtain a marginal probability like P(x1) or a conditional like P(x1|x2). • Humans judge pairwise conditionals swiftly, while conjunctions are judged hesitantly. Numerical calculation does not reflect how simple these reasoning tasks are for people.
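To make the exponential cost concrete, here is a minimal Python sketch (the uniform toy distribution and the variable count are my own, not from the slides): an explicit joint over n binary variables needs 2^n table entries, and even one marginal requires summing roughly half of them.

```python
import itertools

n = 16  # 2**16 = 65,536 table entries; n = 100 would be hopeless

# Hypothetical explicit joint table: uniform, purely for illustration.
joint = {x: 1.0 / 2**n for x in itertools.product([0, 1], repeat=n)}

# Marginal P(X1 = 1): sum over all 2**(n-1) assignments with x1 = 1.
p_x1 = sum(p for x, p in joint.items() if x[0] == 1)
print(p_x1)  # 0.5, but obtained by summing 32,768 entries
```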
The Traditional View of Probability in Text Books (cont.) Is P(e | h) given? Estimated or computed? Think of h as a rare viral disease and of e as high fever.
The Qualitative Notion of Dependence • Marginal independence is defined numerically as P(x,y) = P(x) P(y). The truth of this equation is hard for humans to judge, while judging whether X and Y are dependent is often easy. • Example: "burglary within a day" and "nuclear war within five years" are immediately judged independent. • Likewise, people tend to judge three-place relationships of conditional dependence (does X influence Y once Z is known?) directly and swiftly.
• The notions of relevance and dependence are far more basic than the numerical values. In a reasoning system, a dependence should be asserted once and should not be sensitive to numerical changes. • Acquisition of new facts may destroy existing dependencies as well as create new ones. • Learning a child's age Z destroys the dependency between height X and reading ability Y. • Learning the symptoms Z of a patient creates dependencies between the diseases X and Y that could account for them. Probability theory provides such a device in principle, via P(X | Y, K) = P(X | K). But can we model the dynamics of dependency changes based on logic alone, without reference to numerical quantities?
Definition of Marginal Independence Definition: Ip(X,Y) iff for all x ∈ DX and y ∈ DY: Pr(X=x, Y=y) = Pr(X=x) Pr(Y=y) Comments: • Each variable X has a domain DX, with value (or state) x in DX. • We often abbreviate via P(x,y) = P(x) P(y). • When Y is the empty set, we get the trivial equality Pr(X=x) = Pr(X=x). • Alternative notations to Ip(X,Y) include I(X,Y) and X ⊥ Y. • The next few slides on properties of marginal independence are based on "Axioms and algorithms for inferences involving probabilistic independence."
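A minimal sketch of this definition in Python (the function name and the dictionary encoding are mine): the joint is a dictionary from (x, y) pairs to probabilities, with absent pairs read as probability zero.

```python
from itertools import product

def marginally_independent(p, dx, dy, tol=1e-12):
    """Check Ip(X,Y): P(x,y) = P(x)P(y) for all x in D_X, y in D_Y.
    `p` maps (x, y) pairs to probabilities; absent pairs have probability 0."""
    px = {x: sum(p.get((x, y), 0.0) for y in dy) for x in dx}
    py = {y: sum(p.get((x, y), 0.0) for x in dx) for y in dy}
    return all(abs(p.get((x, y), 0.0) - px[x] * py[y]) <= tol
               for x, y in product(dx, dy))

# Toy example: two independent fair coin flips.
p = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(marginally_independent(p, (0, 1), (0, 1)))  # True
```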
Properties of Marginal Independence Trivial Independence: Ip(X, ∅) Symmetry: Ip(X,Y) ⇒ Ip(Y,X) Decomposition: Ip(X,YW) ⇒ Ip(X,Y) Mixing: Ip(X,Y) and Ip(XY,W) ⇒ Ip(X,YW) Proof (Soundness). Trivial Independence and Symmetry follow directly from the definition. Decomposition: given P(x,y,w) = P(x) P(y,w), simply sum over w on both sides of the given equation. Mixing: given P(x,y) = P(x) P(y) and P(x,y,w) = P(x,y) P(w), we get P(x,y,w) = P(x) P(y) P(w) = P(x) P(y,w).
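As a sanity check on the Decomposition step of the proof, here is a short sketch (the random construction is mine): build a joint satisfying Ip(X, YW) by construction, sum over w exactly as the proof does, and confirm that P(x,y) factors.

```python
import random
from itertools import product

random.seed(0)

def rand_dist(n):
    w = [random.random() for _ in range(n)]
    return [v / sum(w) for v in w]

# Joint with P(x,y,w) = P(x) * Q(y,w), so Ip(X, YW) holds by construction.
px = rand_dist(2)
qyw = dict(zip(product((0, 1), (0, 1)), rand_dist(4)))
joint = {(x, y, w): px[x] * qyw[(y, w)]
         for x, y, w in product((0, 1), (0, 1), (0, 1))}

# Sum over w on both sides, as in the proof: P(x,y) = P(x) * P(y).
pxy = {(x, y): sum(joint[(x, y, w)] for w in (0, 1))
       for x, y in product((0, 1), (0, 1))}
py = {y: sum(pxy[(x, y)] for x in (0, 1)) for y in (0, 1)}
assert all(abs(pxy[(x, y)] - px[x] * py[y]) < 1e-12
           for x, y in product((0, 1), (0, 1)))
print("Decomposition verified on a random instance")
```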
Properties of Marginal Independence Are there more such independent properties of independence? No, there are none! Horn axioms are of the form σ1 & … & σn ⇒ τ, where each of σ1,…,σn and τ stands for an independence statement. We use the symbol Σ for a set of independence statements. Namely: σ is derivable from Σ via these properties if and only if σ is entailed by Σ (i.e., σ holds in all probability distributions that satisfy Σ). Put differently: for every set Σ and a statement σ not derivable from Σ, there exists a probability distribution P that satisfies Σ and not σ.
Properties of Marginal Independence Can we use these properties to infer a new independence statement from a set of given independence statements in polynomial time? YES. The "membership algorithm" and completeness proof are given in the recitation class (Paper P2). Comment: the question "does Σ entail σ?" could in principle be undecidable; it drops to decidable via a complete set of axioms, and then drops to polynomial with this claim.
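For intuition only, here is a naive fixed-point closure under the four properties above (the representation and names are mine). This is NOT the polynomial membership algorithm of Paper P2; it is brute force, exponential in the worst case, but it shows what "derivable via these properties" means.

```python
from itertools import combinations

def subsets(s):
    s = tuple(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def closure(stmts):
    """Close a set of statements Ip(A,B), encoded as pairs of frozensets,
    under Symmetry, Decomposition and Mixing (Trivial Independence falls
    out of Decomposition with the empty subset)."""
    derived = set(stmts)
    while True:
        new = set()
        for a, b in derived:
            new.add((b, a))                      # Symmetry
            for b2 in subsets(b):                # Decomposition
                new.add((a, b2))
            for c, d in derived:                 # Mixing
                if c == a | b:
                    new.add((a, b | d))
        if new <= derived:
            return derived
        derived |= new

# From Ip(X,Y) and Ip(XY,W), Mixing should yield Ip(X,YW):
X, Y, W = frozenset('X'), frozenset('Y'), frozenset('W')
print((X, Y | W) in closure({(X, Y), (X | Y, W)}))  # True
```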
Properties of Marginal Independence • Can we check the consistency of a set Σ+ of independence statements plus a set Σ− of negated independence statements? The membership algorithm from the previous slide applies only when Σ− includes a single negated statement: simply use the algorithm to check that it is not entailed by Σ+. But another property of independence, called the "Armstrong relation" property, guarantees that consistency can indeed be verified by checking separately (in isolation) that each statement in Σ− is not entailed by Σ+.
Properties of Marginal Independence We define a "product" of two distributions P1 and P2 over a set of n variables X1,…,Xn as follows: P*(⟨a1,b1⟩, …, ⟨an,bn⟩) ≜ P1(a1,…,an) P2(b1,…,bn) Note that the domains of the Xi change in this definition: each value is now a pair! This definition implies that for every independence statement σ: σ holds in P1 and σ holds in P2 if and only if σ holds in P*. Hence, consistency can be checked statement by statement.
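A sketch of this construction (the encoding is mine; `marginally_independent` is the helper from the earlier slide): each new value is a pair, and a statement fails in P* as soon as it fails in either factor.

```python
def product_dist(p1, p2):
    """P*(<a1,b1>,...,<an,bn>) = P1(a1,...,an) * P2(b1,...,bn).
    p1 and p2 map n-tuples of values to probabilities."""
    return {tuple(zip(a, b)): pa * pb
            for a, pa in p1.items() for b, pb in p2.items()}

# X1 and X2 are independent in p1 but perfectly correlated in p2,
# so Ip(X1,X2) must fail in the product.
p1 = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
p2 = {(0, 0): 0.5, (1, 1): 0.5}
pstar = product_dist(p1, p2)
dx = {k[0] for k in pstar}
dy = {k[1] for k in pstar}
print(marginally_independent(pstar, tuple(dx), tuple(dy)))  # False
```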
Definitions of Conditional Independence Ip(X,Z,Y) if and only if, whenever P(Y=y, Z=z) > 0, Pr(X=x | Y=y, Z=z) = Pr(X=x | Z=z). (3.1)
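A direct transcription of (3.1) into Python (the function name and the triple-keyed encoding are mine): the equality is checked only where P(y,z) > 0, exactly as the definition requires.

```python
def cond_independent(p, tol=1e-12):
    """Check Ip(X,Z,Y) via Eq. (3.1): whenever P(Y=y,Z=z) > 0,
    P(x | y, z) = P(x | z). `p` maps (x, y, z) triples to probabilities."""
    xs = {x for x, _, _ in p}
    ys = {y for _, y, _ in p}
    zs = {z for _, _, z in p}
    for y in ys:
        for z in zs:
            pyz = sum(p.get((x, y, z), 0.0) for x in xs)
            if pyz == 0:
                continue  # (3.1) constrains only cases with P(y,z) > 0
            pz = sum(p.get((x, y2, z), 0.0) for x in xs for y2 in ys)
            for x in xs:
                pxz = sum(p.get((x, y2, z), 0.0) for y2 in ys)
                if abs(p.get((x, y, z), 0.0) / pyz - pxz / pz) > tol:
                    return False
    return True
```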
Properties of Conditional Independence The same properties hold for conditional independence: Symmetry: I(X,Z,Y) ⇒ I(Y,Z,X) Decomposition: I(X,Z,YW) ⇒ I(X,Z,Y) Mixing: I(X,Z,Y) and I(XY,Z,W) ⇒ I(X,Z,YW) BAD NEWS. Are there more properties of independence? Yes, infinitely many independent Horn axioms. No answer is known for the membership problem, nor for the consistency problem.
Important Properties of Conditional Independence Two further properties used throughout: Weak Union: I(X,Z,YW) ⇒ I(X,ZW,Y) Contraction: I(X,Z,Y) and I(X,ZY,W) ⇒ I(X,Z,YW) Recall that there are some more notations in use; consult Pearl's book.
Dependency Models – an abstraction of probability distributions: a dependency model is any set M of triplets (X,Z,Y), each interpreted as an independence statement I(X,Z,Y), whether or not it arises from some probability distribution.
The Intersection Property Revisited Intersection: I(X,ZW,Y) and I(X,ZY,W) ⇒ I(X,Z,YW) This property holds for strictly positive distributions. Counterexample (when positivity fails): take X = Y = W and Z the empty set. A set of independence statements I(X,Z,Y) (namely, a dependency model) that satisfies Symmetry, Decomposition, Weak Union, Contraction, and Intersection is called a graphoid; without Intersection it is called a semi-graphoid.
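The counterexample can be checked mechanically with the `cond_independent` sketch from the slide on Eq. (3.1) (the encoding of the three statements as triples is mine): with X = Y = W a copied fair coin and Z empty, both premises of Intersection hold but the conclusion fails.

```python
# One fair coin copied three times: keys are (x, y, w).
p = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

# Premise I(X, ZW, Y) with Z empty: condition on W (the third slot).
print(cond_independent(p))                                             # True
# Premise I(X, ZY, W) with Z empty: swap the roles of Y and W.
print(cond_independent({(x, w, y): v for (x, y, w), v in p.items()}))  # True
# Conclusion I(X, Z, YW): bundle (y, w) and condition on a constant.
print(cond_independent({(x, (y, w), 0): v for (x, y, w), v in p.items()}))  # False
```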