EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan. Fall 2006.
Overview • Bayesian networks and other probabilistic graphical models
Bayesian networks (informal) • A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions • Syntax: • a set of nodes, one per variable • a directed, acyclic graph (link ≈ "directly influences") • a conditional distribution for each node given its parents: P (Xi | Parents (Xi)) • In the simplest case, conditional distribution represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values
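As a concrete illustration of the syntax, here is a minimal Python sketch (not part of the lecture) of how a single node's CPT could be stored: keys are combinations of parent values, and each entry gives the probability that the variable is true. The node names and numbers are illustrative assumptions.

```python
# Sketch: one possible CPT representation for a single Boolean node.
# Keys are tuples of parent values; values are P(X = True | parent values).
alarm_cpt = {
    # (Burglary, Earthquake): P(Alarm = True | Burglary, Earthquake)
    (True,  True):  0.95,   # illustrative numbers, not given in the slides
    (True,  False): 0.94,
    (False, True):  0.29,
    (False, False): 0.001,
}

def p_alarm(alarm, burglary, earthquake):
    """Return P(Alarm = alarm | Burglary = burglary, Earthquake = earthquake)."""
    p_true = alarm_cpt[(burglary, earthquake)]
    return p_true if alarm else 1.0 - p_true

print(p_alarm(True, True, False))    # 0.94
print(p_alarm(False, False, False))  # 0.999
```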
Example • Topology of network encodes conditional independence assertions: • Weather is independent of the other variables • Toothache and Catch are conditionally independent given Cavity
Example • I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar? • Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls • Network topology reflects "causal" knowledge: • A burglar can set the alarm off • An earthquake can set the alarm off • The alarm can cause Mary to call • The alarm can cause John to call
Semantics • The full joint distribution is defined as the product of the local conditional distributions: P(X1, …, Xn) = ∏i=1..n P(Xi | Parents(Xi)) • e.g., P(j ∧ m ∧ a ∧ b ∧ e) = P(j | a) P(m | a) P(a | b, e) P(b) P(e)
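A small Python sketch of this semantics for the burglary network: the full joint is evaluated as the product of the local conditional distributions. The CPT numbers are illustrative placeholders; the slides do not specify them.

```python
# Sketch: the full joint as a product of local conditional probabilities.
# CPT numbers below are illustrative (the slides do not give them).
P_B = {True: 0.001, False: 0.999}          # P(Burglary)
P_E = {True: 0.002, False: 0.998}          # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm = True | B, E)
P_J = {True: 0.90, False: 0.05}            # P(JohnCalls = True | Alarm)
P_M = {True: 0.70, False: 0.01}            # P(MaryCalls = True | Alarm)

def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) = P(b) P(e) P(a|b,e) P(j|a) P(m|a)."""
    return (P_B[b] * P_E[e]
            * bernoulli(P_A[(b, e)], a)
            * bernoulli(P_J[a], j)
            * bernoulli(P_M[a], m))

# e.g. the joint probability with every variable true:
print(joint(True, True, True, True, True))
```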
Inference • Given the data that “neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call”, how do we make a decision about the following four possible explanations: • Nothing at all • Burglary but not Earthquake • Earthquake but not Burglary • Burglary and Earthquake
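One way to answer this, sketched below under the same illustrative CPTs, is inference by enumeration: sum the joint over the hidden variable Alarm for each of the four burglary/earthquake hypotheses, then normalize by the probability of the evidence. The code reuses `joint()` from the previous sketch.

```python
# Sketch: posterior over the four explanations by enumeration (reuses joint() above).
# Evidence: JohnCalls = True, MaryCalls = False.
def posterior_burglary_earthquake(j=True, m=False):
    weights = {}
    for b in (True, False):
        for e in (True, False):
            # Sum out the hidden variable Alarm.
            weights[(b, e)] = sum(joint(b, e, a, j, m) for a in (True, False))
    z = sum(weights.values())                      # P(j, m), the normalizer
    return {k: w / z for k, w in weights.items()}  # P(B, E | j, m)

for (b, e), p in sorted(posterior_burglary_earthquake().items()):
    print(f"Burglary={b!s:5} Earthquake={e!s:5} -> {p:.4f}")
```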
Learning • Suppose we only have a joint distribution; how do we “learn” the topology of a BN?
Application: Clustering Users • Input: TV shows that each user watches • Output: TV show “clusters” • Assumption: shows watched by the same users are similar • Class 1: Power rangers, Animaniacs, X-men, Tazmania, Spider man • Class 2: Young and restless, Bold and the beautiful, As the world turns, Price is right, CBS eve news • Class 3: Tonight show, Conan O’Brien, NBC nightly news, Later with Kinnear, Seinfeld • Class 4: 60 minutes, NBC nightly news, CBS eve news, Murder she wrote, Matlock • Class 5: Seinfeld, Friends, Mad about you, ER, Frasier
App.: Finding Regulatory Networks • [Figure: a module network relating Experiment, Module, Gene, and Expression, with regulators such as HAP4, CMK1, BMH1, and GIC2] • P(Level | Module, Regulators): the expression level in each module is a function of the expression of its regulators • Question: what module does gene “g” belong to?
App.: Finding Regulatory Networks • [Figure: inferred regulatory network over modules grouped into DNA and RNA processing, energy and cAMP signaling, and amino acid metabolism; legend: module (number), inferred regulation, regulation supported in literature, regulator (signaling molecule), regulator (transcription factor), enriched cis-regulatory motif, experimentally tested regulator]
Constructing Bayesian networks • Base: • We know the joint distribution of X = X1, …, Xn • We know the “topology” of X: for each Xi ∈ X, we know the parents of Xi • Goal: we want to create a Bayesian network that captures the joint distribution according to the topology • Theorem: such a BN exists
Proof by Construction • A leaf in X is an Xi ∈ X such that Xi has no child • For each Xi: • add Xi to the network • select parents from X1, …, Xi−1 such that P(Xi | Parents(Xi)) = P(Xi | X1, …, Xi−1) • set X = X − {Xi} • This choice of parents guarantees: P(X1, …, Xn) = ∏i=1..n P(Xi | X1, …, Xi−1) (chain rule) = ∏i=1..n P(Xi | Parents(Xi)) (by construction)
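A brute-force sketch of this construction (illustrative, not from the lecture): given a full joint table over ordered binary variables, pick for each Xi the smallest parent set among X1, …, Xi−1 whose conditional matches P(Xi | X1, …, Xi−1). Variable and function names are made up for illustration.

```python
# Sketch: brute-force parent selection from a full joint table over ordered
# binary variables X_0, ..., X_{n-1}.  For X_i we pick the smallest subset S of
# {0..i-1} with P(X_i | S) = P(X_i | X_0..X_{i-1}).
from itertools import combinations, product

def conditional(joint, query, given, n):
    """Map each assignment of the 'given' variables to P(X_query = 1 | assignment)."""
    table = {}
    for vals in product((0, 1), repeat=len(given)):
        num = den = 0.0
        for full in product((0, 1), repeat=n):
            if all(full[g] == v for g, v in zip(given, vals)):
                den += joint[full]
                if full[query] == 1:
                    num += joint[full]
        table[vals] = num / den if den > 0 else None
    return table

def pick_parents(joint, i, n, tol=1e-9):
    """Return a smallest parent set for X_i that preserves P(X_i | X_0..X_{i-1})."""
    prev = list(range(i))
    full_cond = conditional(joint, i, prev, n)
    for size in range(len(prev) + 1):
        for cand in combinations(prev, size):
            cand_cond = conditional(joint, i, list(cand), n)
            if all(full_cond[vals] is None or
                   abs(full_cond[vals] - cand_cond[tuple(vals[j] for j in cand)]) < tol
                   for vals in full_cond):
                return cand
    return tuple(prev)

# Tiny usage: X_2 depends only on X_1, so its parent set should be (1,), not (0, 1).
p_x0_true, p_x1_given_x0, p_x2_given_x1 = 0.3, {0: 0.2, 1: 0.9}, {0: 0.4, 1: 0.7}
joint_table = {}
for x0, x1, x2 in product((0, 1), repeat=3):
    joint_table[(x0, x1, x2)] = ((p_x0_true if x0 else 1 - p_x0_true)
                                 * (p_x1_given_x0[x0] if x1 else 1 - p_x1_given_x0[x0])
                                 * (p_x2_given_x1[x1] if x2 else 1 - p_x2_given_x1[x1]))
print(pick_parents(joint_table, 2, 3))   # expected: (1,)
```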
Compactness • A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values • Each row requires one number p for Xi = true (the number for Xi = false is just 1 − p) • If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers • I.e., grows linearly with n, vs. O(2^n) for the full joint distribution • For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
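The counting argument can be checked directly; the parent counts below are those of the burglary network.

```python
# Sketch: parameter counts for the burglary network versus the full joint (all Boolean).
num_parents = {"Burglary": 0, "Earthquake": 0, "Alarm": 2, "JohnCalls": 1, "MaryCalls": 1}

bn_params = sum(2 ** k for k in num_parents.values())   # one number per CPT row
full_joint_params = 2 ** len(num_parents) - 1            # every outcome, minus normalization

print(bn_params)          # 1 + 1 + 4 + 2 + 2 = 10
print(full_joint_params)  # 2^5 - 1 = 31
```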
Reasoning: Probability Theory • Well understood framework for modeling uncertainty • Partial knowledge of the state of the world • Noisy observations • Phenomenon not covered by our model • Inherent stochasticity • Clear semantics • Can be learned from data
Probability Theory • A (discrete) probability P over (Ω, S = 2^Ω) is a mapping from elements of S to real values such that: • Ω is the set of all possible outcomes (the sample space) of a probabilistic experiment, and S is the set of “events” • P(α) ≥ 0 for all α ∈ S • P(Ω) = 1 • If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β) • Conditional probability: P(α | β) = P(α ∩ β) / P(β) • Chain rule: P(α ∩ β) = P(α | β) P(β) • Bayes rule: P(α | β) = P(β | α) P(α) / P(β) • Conditional independence: α is independent of β given γ if P(α | β ∩ γ) = P(α | γ)
Random Variables & Notation • Random variable: Function from to a non-negative real value such that summation of all the values is 1. • Val(X) – set of possible values of RV X • Upper case letters denote RVs (e.g., X, Y, Z) • Upper case bold letters denote set of RVs (e.g., X, Y) • Lower case letters denote RV values (e.g., x, y, z) • Lower case bold letters denote RV set values (e.g., x) • Eg. P(X = x), P(X) = {P(X=x) | x }
Joint Probability Distribution • Given a group of random variables X = X1, …, Xn, where each Xi takes values from a set Val(Xi), the joint probability distribution is a function that maps each element of Ω = ∏i Val(Xi) to a non-negative value such that the summation of all the values is 1. • For example, RV Weather takes four values “sunny, rainy, cloudy, snow” and RV Cavity takes two values “true, false”; P(Weather, Cavity) is a 4 × 2 table of values:
                 Weather = sunny   rainy   cloudy   snow
  Cavity = true            0.144   0.02    0.016    0.02
  Cavity = false           0.576   0.08    0.064    0.08
Marginal Probability • Given a set of RVs X and its joint probabilities, a marginal probability distribution over X′ ⊆ X is P(X′) = Σ_{X \ X′} P(X), i.e., the joint summed over the variables not in X′ • From the Weather/Cavity table above: P(Weather = sunny) = 0.144 + 0.576 = 0.72; P(Cavity = true) = 0.144 + 0.02 + 0.016 + 0.02 = 0.2
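The same marginalization done numerically on the table above, as a small NumPy sketch:

```python
import numpy as np

# The Weather x Cavity joint from the table above, Cavity as rows and Weather as columns.
weather_vals = ["sunny", "rainy", "cloudy", "snow"]
cavity_vals = [True, False]
joint = np.array([[0.144, 0.02, 0.016, 0.02],    # Cavity = true
                  [0.576, 0.08, 0.064, 0.08]])   # Cavity = false

p_weather = joint.sum(axis=0)   # marginalize out Cavity
p_cavity = joint.sum(axis=1)    # marginalize out Weather

print(dict(zip(weather_vals, p_weather)))  # P(Weather = sunny) = 0.72, ...
print(dict(zip(cavity_vals, p_cavity)))    # P(Cavity = true) = 0.2, P(Cavity = false) = 0.8
```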
Independence • Two RVs X, Y are independent, denoted X ⟂ Y, if P(X, Y) = P(X) P(Y) • Conditional independence: X is independent of Y given Z if P(X | Y, Z) = P(X | Z) • In the Weather/Cavity table above, Weather and Cavity are independent: every joint entry equals the product of the corresponding marginals
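Continuing the previous sketch, independence can be checked numerically: the table factorizes exactly into the product of its marginals.

```python
# Continuing the previous sketch: Weather and Cavity are independent iff every joint
# entry equals the product of the corresponding marginals.
outer = np.outer(p_cavity, p_weather)   # P(Cavity) * P(Weather), same shape as joint
print(np.allclose(joint, outer))        # True -- this particular table factorizes exactly
```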
Representing Joint Distributions • Random variables: X1, …, Xn • P is a joint distribution over X1, …, Xn • If X1, …, Xn are binary, we need 2^n parameters to describe P • Can we represent P more compactly? • Key: exploit independence properties
Independent Random Variables • If X and Y are independent then: P(X, Y) = P(X | Y) P(Y) = P(X) P(Y) • If X1, …, Xn are independent then: P(X1, …, Xn) = P(X1) … P(Xn) • O(n) parameters • All 2^n probabilities are implicitly defined • Cannot represent many types of distributions • We may need to consider conditional independence
Conditional Parameterization • S = score on test, Val(S) = {s0, s1} • I = intelligence, Val(I) = {i0, i1} • G = grade, Val(G) = {g0, g1, g2} • Assume that G and S are independent given I • Joint parameterization: 2 · 2 · 3 = 12 outcomes, so 12 − 1 = 11 independent parameters • Conditional parameterization: P(I, S, G) = P(I) P(S | I) P(G | I, S) = P(I) P(S | I) P(G | I) • P(I) – 1 independent parameter • P(S | I) – 2 · 1 = 2 independent parameters • P(G | I) – 2 · 2 = 4 independent parameters • 7 independent parameters in total
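The parameter counting as a small sanity check:

```python
# Sanity check of the parameter counts for the I, S, G example.
full_joint = 2 * 2 * 3 - 1             # 12 outcomes, minus 1 for normalization -> 11

p_i = 2 - 1                            # P(I): 1 independent parameter
p_s_given_i = 2 * (2 - 1)              # P(S | I): 1 parameter per value of I -> 2
p_g_given_i = 2 * (3 - 1)              # P(G | I): 2 parameters per value of I -> 4

print(full_joint, p_i + p_s_given_i + p_g_given_i)   # 11 vs. 7
```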
Naïve Bayes Model • Class variable C, Val(C) = {c1, …, ck} • Evidence variables X1, …, Xn • Naïve Bayes assumption: evidence variables are conditionally independent given C, so P(C, X1, …, Xn) = P(C) ∏i P(Xi | C) • Applications in medical diagnosis, text classification • Used as a classifier: P(C | x1, …, xn) ∝ P(C) ∏i P(xi | C) • Problem: double counting of correlated evidence
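A minimal Naïve Bayes classifier sketch in Python; the feature names, class names, and probabilities are hypothetical, not from the slides.

```python
# Sketch: P(C | x1..xn) is proportional to P(C) * prod_i P(xi | C).
def naive_bayes_posterior(evidence, prior, likelihoods):
    """
    evidence:    dict {feature: observed value}
    prior:       dict {class: P(class)}
    likelihoods: dict {class: {feature: {value: P(value | class)}}}
    """
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for feature, value in evidence.items():
            score *= likelihoods[c][feature][value]
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Hypothetical two-class, two-feature example:
prior = {"c1": 0.6, "c2": 0.4}
likelihoods = {
    "c1": {"x1": {True: 0.8, False: 0.2}, "x2": {True: 0.3, False: 0.7}},
    "c2": {"x1": {True: 0.1, False: 0.9}, "x2": {True: 0.6, False: 0.4}},
}
print(naive_bayes_posterior({"x1": True, "x2": False}, prior, likelihoods))
```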
Bayesian Networks: A Formal Study • A Bayesian network on a group of random variables X = X1, …, Xn is a tuple (T, P) such that: • the topology T ⊆ X × X is a directed acyclic graph • P is a joint distribution such that for all i ∈ [1, n] and all possible values xi and xS: P(Xi = xi | XS = xS) = P(Xi = xi | Parents(Xi) = xS), where S = the non-descendants of Xi in X • In other words, Xi is conditionally independent of its non-descendant variables given Parents(Xi)
Factorization Theorem • If G is an independence map (I-map) of P, then P(X1, …, Xn) = ∏i P(Xi | Pa(Xi)) • Proof: • Let X1, …, Xn be an ordering consistent with G • By the chain rule: P(X1, …, Xn) = ∏i P(Xi | X1, …, Xi−1) • Since G is an I-map, (Xi ⟂ NonDesc(Xi) | Pa(Xi)) ∈ I(P), and X1, …, Xi−1 are non-descendants of Xi, so P(Xi | X1, …, Xi−1) = P(Xi | Pa(Xi))
Factorization Implies I-Map • Conversely, if P factorizes according to G, then G is an I-map of P • Proof sketch: • Need to show that P(Xi | ND(Xi)) = P(Xi | Pa(Xi)) • where D denotes the descendants of node i and ND denotes all nodes except i and D
Probabilistic Graphical Models • Tool for representing complex systems and performing sophisticated reasoning tasks • Fundamental notion: modularity • Complex systems are built by combining simpler parts • Why have a model? • Compact and modular representation of complex systems • Ability to execute complex reasoning patterns • Make predictions • Generalize from particular problems
Probabilistic Graphical Models • Increasingly important in machine learning • Many classical probabilistic problems in statistics, information theory, pattern recognition, and statistical mechanics are special cases of the formalism • Graphical models provide a common framework • Advantage: specialized techniques developed in one field can be transferred between research communities
Representation: Graphs • Intuitive data structure for modeling highly-interacting sets of variables • Explicit model for modularity • Data structure that allows for design of efficient general-purpose algorithms
Reference • “Bayesian Networks and Beyond”, Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)