L11: Uses of Bayesian Networks Nevin L. Zhang Department of Computer Science & Engineering The Hong Kong University of Science & Technology http://www.cse.ust.hk/~lzhang/
Outline • Traditional Uses • Structure Discovery • Density Estimation • Classification • Clustering • An HKUST Project
Traditional Uses • Probabilistic Expert Systems • Diagnosis • Prediction • Example: BN for diagnosing "blue baby" syndrome over the phone at a London hospital; comparable to specialists, better than others
Traditional Uses • Language for describing probabilistic models in Science & Engineering • Example: BN for turbo code
Traditional Uses • Language for describing probabilistic models in Science & Engineering • Example: BN from Bioinformatics
Outline • Traditional Uses • Structure Discovery • Density Estimation • Classification • Clustering • An HKUST Project
BN for Structure Discovery • Given: Data set D on variables X1, X2, …, Xn • Discover dependence, independence, and even causal relationships among the variables • Example: evolution trees
Phylogenetic Trees • Assumption: all organisms on Earth have a common ancestor • This implies that any set of species is related • Phylogeny: the relationship between any set of species • Phylogenetic tree: usually, the relationship can be represented by a tree, called a phylogenetic (evolution) tree • This is not always true
Phylogenetic Trees • [Figure: a phylogenetic tree over time, with current-day species at the bottom: giant panda, lesser panda, moose, duck, goshawk, vulture, alligator] • Current-day species at the bottom
Phylogenetic Trees • [Figure: bifurcating tree over time, with ancestral sequences (e.g. AAGACTT, AAGGCCT, AGCACTT, AAGGCAT) at internal nodes and current-day sequences (AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT) at the leaves] • Taxa (sequences) identify species • Edge lengths represent evolution time • Assumption: bifurcating tree topology
Probabilistic Models of Evolution • [Figure: binary tree with root X7, internal nodes X5 and X6, leaves S1-S4, and edge lengths t1-t6] • Characterize the relationship between taxa using the substitution probability P(x | y, t): the probability that ancestral sequence y evolves into sequence x along an edge of length t • Model: P(X7), P(X5|X7, t5), P(X6|X7, t6), P(S1|X5, t1), P(S2|X5, t2), …
Probabilistic Models of Evolution • What should P(x|y, t) be? • Two assumptions of commonly used models: • There are only substitutions, no insertions/deletions (the sequences are aligned): one-to-one correspondence between sites in different sequences • Each site evolves independently and identically: P(x|y, t) = ∏_{i=1}^{m} P(x(i) | y(i), t), where m is the sequence length
Probabilistic Models of Evolution • What should P(x(i)|y(i), t) be? • Jukes-Cantor (character evolution) model [1969]: over {A, C, G, T}, P(x(i)|y(i), t) = r_t if x(i) = y(i) and s_t otherwise, where r_t = 1/4 (1 + 3e^{-4at}) and s_t = 1/4 (1 - e^{-4at}) • Rate of substitution a (constant or parameter?) • Limit values: at t = 0, r_t = 1 and s_t = 0; as t → ∞, both tend to 1/4 • Multiplicativity (lack of memory)
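The model is easy to state in code. Below is a minimal sketch (my illustration, not code from the slides) computing the Jukes-Cantor probabilities and, via the site-independence assumption from the previous slide, the substitution probability for whole aligned sequences; the default rate a and the example edge length are hypothetical.

```python
import numpy as np

def jukes_cantor(t, a=1.0):
    """Jukes-Cantor substitution probabilities after time t with rate a.
    r_t: probability a site stays the same;
    s_t: probability it changes to one specific other nucleotide."""
    r = 0.25 * (1 + 3 * np.exp(-4 * a * t))
    s = 0.25 * (1 - np.exp(-4 * a * t))
    return r, s

def seq_transition_prob(x, y, t, a=1.0):
    """P(x | y, t) for aligned sequences: sites evolve i.i.d.,
    so the probability is a product over sites."""
    r, s = jukes_cantor(t, a)
    return float(np.prod([r if xi == yi else s for xi, yi in zip(x, y)]))

# Example: probability that ancestor AAGACTT evolves into AGCACTT
print(seq_transition_prob("AGCACTT", "AAGACTT", t=0.5))
```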
Tree Reconstruction • Given: a collection of current-day taxa, e.g. AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT • Find: a tree • Tree topology: T • Edge lengths: t • Maximum likelihood: find the tree that maximizes P(data | tree)
Tree Reconstruction • When restricted to one particular site, a phylogenetic tree is a latent tree (LT) model where • The structure is a binary tree and variables share the same state space • The conditional probabilities come from the character evolution model, parameterized by edge lengths instead of the usual parameterization • The model is the same for different sites
Tree Reconstruction • Current-day taxa: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT • These provide samples for the LT model, one sample per site; the samples are i.i.d. • 1st site: (A, T, T, A, A) • 2nd site: (G, A, A, G, G) • 3rd site: (G, G, G, C, C) • …
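A concrete illustration of the sample construction (a sketch, using the taxa listed above):

```python
taxa = ["AGGGCAT", "TAGCCCA", "TAGACTT", "AGCACAA", "AGCGCTT"]
sites = list(zip(*taxa))   # one i.i.d. sample per site
print(sites[0])            # ('A', 'T', 'T', 'A', 'A') -- the 1st site
print(sites[1])            # ('G', 'A', 'A', 'G', 'G') -- the 2nd site
```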
Tree Reconstruction • Finding the ML phylogenetic tree == finding the ML LT model • Model space: • Model structures: binary trees where all variables share the same state space, which is known • Parameterization: one parameter for each edge (in general, P(x|y) has (|x|-1)|y| free parameters) • The objective is to find relationships among variables • Applying new LTM algorithms to phylogenetic tree reconstruction?
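To make the correspondence concrete, here is a hedged sketch of the per-site likelihood for the four-leaf topology from the earlier figure (root X7, internal nodes X5 and X6, leaves S1-S4), brute-forcing the sum over the latent nucleotides. It reuses jukes_cantor from the sketch above; the uniform root prior and the edge-length dictionary are assumptions of the sketch.

```python
from itertools import product

NUC = "ACGT"

def site_likelihood(leaves, t, a=1.0):
    """Likelihood of one site (s1, s2, s3, s4) on the fixed topology
    X7 -> (X5, X6), X5 -> (S1, S2), X6 -> (S3, S4), summing out the
    latent nucleotides X5, X6, X7. t maps edge names to lengths."""
    def p(x, y, length):                  # Jukes-Cantor P(x | y, t)
        r, s = jukes_cantor(length, a)
        return r if x == y else s
    s1, s2, s3, s4 = leaves
    total = 0.0
    for x7, x5, x6 in product(NUC, repeat=3):
        total += (0.25                    # uniform root prior P(X7)
                  * p(x5, x7, t["t5"]) * p(x6, x7, t["t6"])
                  * p(s1, x5, t["t1"]) * p(s2, x5, t["t2"])
                  * p(s3, x6, t["t3"]) * p(s4, x6, t["t4"]))
    return total

# Sites are i.i.d., so the log-likelihood of the data adds over sites:
# sum(log(site_likelihood(site, t)) for site in sites)
```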
Outline • Traditional Uses • Structure Discovery • Density Estimation • Classification • Clustering • An HKUST Project
BN for Density Estimation • Given: Data set D on variables X1, X2, …, Xn • Estimate: P(X1, X2, …, Xn) under some constraints • Uses of the estimate: • Inference • Classification
BN Methods for Density Estimation • Chow-Liu tree with X1, X2, …, Xn as nodes • Easy to compute • Easy to use • Might not be a good estimate of the "true" distribution • BN with X1, X2, …, Xn as nodes • Can be a good estimate of the "true" distribution • Might be difficult to find • Might be complex to use
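For reference, a minimal sketch of the Chow-Liu procedure (my own illustration, not code from the slides): estimate pairwise mutual information from data, then take a maximum-weight spanning tree. It assumes networkx is available.

```python
import numpy as np
import networkx as nx

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) of two discrete columns, in nats."""
    n = len(xs)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    return sum((c / n) * np.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def chow_liu_tree(data):
    """data: (N, n_vars) array of discrete values. Returns the edges of
    the maximum spanning tree under pairwise mutual information."""
    n_vars = data.shape[1]
    g = nx.Graph()
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            g.add_edge(i, j, weight=mutual_information(data[:, i], data[:, j]))
    return list(nx.maximum_spanning_tree(g).edges())
```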
BN Methods for Density Estimation • LC model with X1, X2, …, Xn as manifest variables (Lowd and Domingos 2005) • Determine the cardinality of the latent variable using hold-out validation • Optimize the parameters using EM • Easy to compute • Can be a good estimate of the "true" distribution • Might be complex to use (the cardinality of the latent variable might be very large)
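A compact sketch of the EM core for such an LC model (an illustration under the assumptions stated in the code, not the authors' implementation); in practice one would run it for several cardinalities of the latent variable and keep the one with the best hold-out likelihood.

```python
import numpy as np

def lc_em(data, card, n_states, n_iter=100, seed=0):
    """EM for a latent class model: one latent Y with n_states values,
    manifest variables X1..Xn with cardinalities card[i].
    data: (N, n_vars) int array."""
    rng = np.random.default_rng(seed)
    N, n_vars = data.shape
    pY = rng.dirichlet(np.ones(n_states))
    pXgY = [rng.dirichlet(np.ones(card[i]), size=n_states) for i in range(n_vars)]
    for _ in range(n_iter):
        # E-step: posterior P(Y | x) for every data case, in log space
        logpost = np.zeros((N, n_states)) + np.log(pY)
        for i in range(n_vars):
            logpost += np.log(pXgY[i][:, data[:, i]].T)
        logpost -= logpost.max(axis=1, keepdims=True)
        post = np.exp(logpost)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from expected counts
        pY = post.mean(axis=0)
        for i in range(n_vars):
            for v in range(card[i]):
                pXgY[i][:, v] = post[data[:, i] == v].sum(axis=0) + 1e-6
            pXgY[i] /= pXgY[i].sum(axis=1, keepdims=True)
    return pY, pXgY
```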
BN Methods for Density Estimation • LT model for density estimation • Pearl 1988: As models over manifest variables, LTMs • Are computationally very simple to work with • Can represent complex relationships among manifest variables
BN Methods for Density Estimation • New approximate inference algorithm for Bayesian networks (Wang, Zhang and Chen, AAAI 08; JAIR 32: 879-900, 2008) • [Figure: Sample data from the original network, then Learn a simpler model to use for inference]
Outline • Traditional Uses • Structure Discovery • Density Estimation • Classification • Clustering • An HKUST Project
Bayesian Networks for Classification • The problem: given data, find a mapping (A1, A2, …, An) ↦ C • Possible solutions: • ANN • Decision tree (Quinlan) • … • (SVM: continuous data)
Bayesian Networks for Classification • Naïve Bayes model: from data, learn P(C) and P(Ai|C) • Classification: arg max_c P(C=c | A1=a1, …, An=an) • Very good in practice
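A minimal sketch of how this looks for discrete attributes (my illustration; the Laplace smoothing is an added assumption, not from the slides):

```python
import numpy as np

def nb_fit(X, y, card, n_classes, alpha=1.0):
    """Learn P(C) and P(Ai | C) from discrete data (Laplace smoothing alpha)."""
    pC = np.array([(y == c).sum() + alpha for c in range(n_classes)], float)
    pC /= pC.sum()
    pAgC = []
    for i, k in enumerate(card):
        tab = np.zeros((n_classes, k))
        for c in range(n_classes):
            col = X[y == c, i]
            tab[c] = [(col == v).sum() + alpha for v in range(k)]
            tab[c] /= tab[c].sum()
        pAgC.append(tab)
    return pC, pAgC

def nb_predict(x, pC, pAgC):
    """arg max_c P(C=c | a1, ..., an), computed in log space for stability."""
    logp = np.log(pC).copy()
    for i, v in enumerate(x):
        logp += np.log(pAgC[i][:, v])
    return int(np.argmax(logp))
```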
Bayesian Networks for Classification • Drawback of NB: • Attributes mutually independent given class variable • Often violated, leading to double counting. • Fixes: • General BN classifiers • Tree augmented Naïve Bayes (TAN) models • Hierarchical NB • Bayes rule + Density Estimation • …
Bayesian Networks for Classification • General BN classifier • Treat the class variable just as another variable • Learn a BN • Classify the next instance based on the values of the variables in the Markov blanket of the class variable • Performs poorly: because only the Markov boundary is used, it does not utilize all available information
Bayesian Networks for Classification • TAN model • Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29:131-163. • Captures dependence among attributes using a tree structure • During learning: • First learn a tree among the attributes using the Chow-Liu algorithm • Then add the class variable and estimate the parameters • Classification: arg max_c P(C=c | A1=a1, …, An=an)
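In Friedman et al.'s procedure the only change relative to the Chow-Liu sketch shown earlier is the edge weight: attribute pairs are weighed by conditional mutual information given the class. A short sketch of that weight, reusing mutual_information from the earlier sketch:

```python
def conditional_mi(xs, ys, cs):
    """Empirical I(X; Y | C) = sum_c P(C=c) * I(X; Y | C=c),
    reusing mutual_information from the Chow-Liu sketch above."""
    n = len(cs)
    cmi = 0.0
    for c in set(cs):
        idx = [k for k in range(n) if cs[k] == c]
        cmi += (len(idx) / n) * mutual_information(
            [xs[k] for k in idx], [ys[k] for k in idx])
    return cmi
```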
Bayesian Networks for Classification • Hierarchical Naïve Bayes models • N. L. Zhang, T. D. Nielsen, and F. V. Jensen (2002). Latent variable discovery in classification models. Artificial Intelligence in Medicine, to appear. • Capture dependence among attributes using latent variables • Detect interesting latent structures besides classification • Learning algorithm in the style of DHC
Bayesian Networks for Classification • Bayes rule: P(C | A1, …, An) ∝ P(C) P(A1, …, An | C) • Estimate P(A1, …, An | C) by density estimation with: • Chow-Liu • LC model • LT model • Wang Yi: Bayes rule + LT model is far superior
Outline • Traditional Uses • Structure Discovery • Density Estimation • Classification • Clustering • An HKUST Project
BN for Clustering • Latent class (LC) model • One latent variable • A set of manifest variables • Conditional Independence Assumption: • Xi’s mutually independent given Y. • Also known as Local Independence Assumption • Used for cluster analysis of categorical data • Determine cardinality of Y: number of clusters • Determine P(Xi|Y): characteristics of clusters
BN for Clustering: Clustering Criteria • Distance-based clustering: • Minimizes intra-cluster variation and/or maximizes inter-cluster variation • LC model-based clustering: • The criterion follows from the conditional independence assumption • Divide data into clusters such that, in each cluster, the manifest variables are mutually independent under the empirical distribution
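Given an LC model fitted as in the lc_em sketch from the density-estimation section, clustering amounts to assigning each case to the class with the highest posterior. A brief sketch:

```python
import numpy as np

def assign_clusters(data, pY, pXgY):
    """Hard assignment: arg max_y P(Y=y | x) for each data case,
    using the parameters returned by lc_em."""
    logpost = np.zeros((data.shape[0], len(pY))) + np.log(pY)
    for i in range(data.shape[1]):
        logpost += np.log(pXgY[i][:, data[:, i]].T)
    return np.argmax(logpost, axis=1)
```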
BN for Clustering • Local independence assumption often not true • LT models generalize LC models • Relax the independence assumption • Each latent variable gives a way to partition data… multidimensional clustering
ICAC Data
// 31 variables, 1200 samples
C_City: s0 s1 s2 s3 // very common, quite common, uncommon, …
C_Gov: s0 s1 s2 s3
C_Bus: s0 s1 s2 s3
Tolerance_C_Gov: s0 s1 s2 s3 // totally intolerable, intolerable, tolerable, …
Tolerance_C_Bus: s0 s1 s2 s3
WillingReport_C: s0 s1 s2 // yes, no, depends
LeaveContactInfo: s0 s1 // yes, no
I_EncourageReport: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, …
I_Effectiveness: s0 s1 s2 s3 s4 // very effective, effective, average, ineffective, very ineffective
I_Deterrence: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, …
…
-1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 1 1 -1 -1 2 0 2 2 1 3 1 1 4 1 0 1.0
-1 -1 -1 0 0 -1 -1 1 1 -1 -1 0 0 -1 1 -1 1 3 2 2 0 0 0 2 1 2 0 0 2 1 0 1.0
-1 -1 -1 0 0 -1 -1 2 1 2 0 0 0 2 -1 -1 1 1 1 0 2 0 1 2 -1 2 0 1 2 1 0 1.0
…
Latent Structure Discovery Y2: Demographic info; Y3: Tolerance toward corruption Y4: ICAC performance; Y7: ICAC accountability Y5: Change in level of corruption; Y6: Level of corruption (Zhang, Poon, Wang and Chen 2008)
Interpreting Partition • Y2 partitions the population into 4 clusters • What is the partition about? What is the "criterion"? • On what manifest variables do the clusters differ the most? • Mutual information: the larger I(Y2; X), the more the 4 clusters differ on X
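A small sketch of that ranking (my illustration): given joint tables P(Y2, X) obtained from the learned model, compute I(Y2; X) for each manifest variable and sort. The dictionary `joints` is a hypothetical placeholder.

```python
import numpy as np

def mi_from_joint(pxy):
    """I(Y; X) from a joint probability table pxy[y, x]."""
    px, py = pxy.sum(axis=0), pxy.sum(axis=1)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / np.outer(py, px)[mask])).sum())

# joints: {variable_name: joint table P(Y2, X)} computed by inference
# in the learned latent tree model (hypothetical placeholder).
# ranking = sorted(joints, key=lambda v: mi_from_joint(joints[v]), reverse=True)
```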
Interpreting Partition • Information curves: • Partition of Y2 is based on Income, Age, Education, Sex • Interpretation: Y2 --- Represents a partition of the population based on demographic information • Y3 --- Represents a partition based on Tolerance toward Corruption
Interpreting Clusters Y2=s0: Low income youngsters; Y2=s1: Women with no/low income Y2=s2: people with good education and good income; Y2=s3: people with poor education and average income
Interpreting Clusters • Y3=s0: people who find corruption totally intolerable; 57% • Y3=s1: people who find corruption intolerable; 27% • Y3=s2: people who find corruption tolerable; 15% • Interesting finding: • Y3=s2: 29+19=48% find C-Gov totally intolerable or intolerable; 5% for C-Bus • Y3=s1: 54% find C-Gov totally intolerable; 2% for C-Bus • Y3=s0: same attitude toward C-Gov and C-Bus • People who are tough on corruption are equally tough toward C-Gov and C-Bus; people who are relaxed about corruption are more relaxed toward C-Bus than C-Gov
Relationship Between Dimensions • Interesting finding: relationship between background and tolerance toward corruption • Y2=s2 (good education and good income): the least tolerant; 4% find corruption tolerable • Y2=s3 (poor education and average income): the most tolerant; 32% find corruption tolerable • The other two classes are in between
Result of LCA • Partition not meaningful • Reason: local independence does not hold • Another way to look at it: • LCA assumes that all the manifest variables jointly define a meaningful way to cluster the data • Obviously not true for the ICAC data • Instead, one should look for subsets of variables that do define meaningful partitions and perform cluster analysis on them • This is what we do with LTA
Finite Mixture Models • Y: discrete latent variable • Xi: continuous • P(X1, X2, …, Xn|Y): usually multivariate Gaussian • No independence assumption • Assume the states of Y are 1, 2, …, k: P(X1, X2, …, Xn) = Σ_{i=1}^{k} P(Y=i) P(X1, X2, …, Xn|Y=i), a mixture of k Gaussian components
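A minimal sketch of that mixture density, assuming scipy is available; the component parameters in the example are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, weights, means, covs):
    """P(x) = sum_i P(Y=i) * N(x; mu_i, Sigma_i)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Example: k = 2 hypothetical components in 2 dimensions
w = [0.6, 0.4]
mu = [np.zeros(2), 3.0 * np.ones(2)]
cov = [np.eye(2), 0.5 * np.eye(2)]
print(mixture_density(np.array([1.0, 1.0]), w, mu, cov))
```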
Finite Mixture Models • Used to cluster continuous data • Learning: determine • k: the number of clusters • P(Y) • P(X1, …, Xn|Y) • Also assumes that all attributes define one coherent partition • Not realistic • LT models are a natural framework for clustering high-dimensional data
Outline • Traditional Uses • Structure Discovery • Density Estimation • Classification • Clustering • An HKUST Project
Observation on How the Human Brain Thinks • Human beings often invoke latent variables to explain regularities that we observe • Example 1 • Observed regularity: beer and diapers are often bought together in the early evening • Hypothesize a (latent) cause: there must be a common (latent) cause • Identify the cause and explain the regularity: fathers of babies shopping on the way home from work • Based on our understanding of the world
Observation on How the Human Brain Thinks • Example 2 • Background: at night, watching lights through the windows of apartments in big buildings • Observed regularity: the lighting from several apartments was changing in brightness and color at the same times and in perfect synchrony • Hypothesize a common (latent) cause: there must be a common (latent) cause • Identify the cause and explain the phenomenon: people watching the same TV channel • Based on understanding of the world
Back to Ancient Times • Observed regularity: several symptoms often occur together • 'intolerance to cold', 'cold limbs', and 'cold lumbus and back' • Hypothesize a common latent cause: there must be a common latent cause • Identify the cause • Answer based on the (primitive) understanding of the world at that time • Conclusion: Yang deficiency (阳虚) • Explanation: Yang is like the sun; it warms your body. If you don't have enough of it, you feel cold.
Back to Ancient Times • Observed regularity: several symptoms often occur together • Tidal fever (潮热), heat sensation in palms and feet (手足心热), palpitation (心慌心跳), thready and rapid pulse (脉细数) • Hypothesize a common latent cause: there must be a common latent cause • Identify the cause and explain the regularity: Yin deficiency causing internal heat (阴虚内热) • Yin and Yang should be in balance; if Yin is deficient, Yang will be relatively in excess, and hence causes heat.