Exploiting Parameter Domain Knowledge for Learning in Bayesian Networks. ~ Thesis Defense ~ Stefan Niculescu Carnegie Mellon University, July 2005. Thesis Committee: Tom Mitchell (Chair) John Lafferty Andrew Moore Bharat Rao (Siemens Medical Solutions). Domain Knowledge.
Domain Knowledge
• In the real world, data is often too sparse to build an accurate model
• Domain knowledge can help alleviate this problem
• Several types of domain knowledge:
  • Relevance of variables (feature selection)
  • Conditional independences among variables
  • Parameter Domain Knowledge
Parameter Domain Knowledge
• A Bayesian Network for a real-world domain:
  • can have a huge number of parameters
  • often comes with too little data to estimate them accurately
• Parameter Domain Knowledge constraints:
  • reduce the space of feasible parameters
  • reduce the variance of parameter estimates
Parameter Domain Knowledge Examples:
• DK: “If a person has a family history of Heart Attack, Race and Pollution are not significant factors for the probability of getting a Heart Attack.”
• DK: “Two voxels in the brain may exhibit the same activation patterns during a cognitive task, but with different amplitudes.”
• DK: “Two countries may have different Heart Disease rates, but the relative proportion of Heart Attack to CHF is the same.”
• DK: “The aggregate probability of Adverbs in English is less than the aggregate probability of Verbs.”
Thesis
Standard methods for parameter estimation in Bayesian Networks can be naturally extended to take advantage of parameter domain knowledge provided by a domain expert. The resulting learning algorithms perform better (in terms of probability density estimation) than existing ones.
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
Parameter Domain Knowledge Framework ~ Domain Knowledge Constraints ~
Parameter Domain Knowledge Framework ~ Frequentist Approach, Complete Data ~
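One way to read the frequentist, complete-data setting is as constrained maximum likelihood. The block below is a hedged sketch, not the slide's own statement: the symbols g_i (equality constraints), h_j (inequality constraints), and the multipliers λ_i are assumptions introduced here for illustration.

```latex
\hat{\theta} \;=\; \arg\max_{\theta}\; \log P(D \mid \theta)
\quad \text{subject to} \quad
g_i(\theta) = 0,\; i = 1,\dots,m,
\qquad
h_j(\theta) \le 0,\; j = 1,\dots,k.

% With equality constraints only, stationary points satisfy
\nabla_{\theta} \Big[ \log P(D \mid \theta) \;-\; \sum_{i=1}^{m} \lambda_i \, g_i(\theta) \Big] = 0,
\qquad g_i(\theta) = 0 .
```

Under this reading, parameter sharing is the special case where each g_i states that two parameters are equal.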
Parameter Domain Knowledge Framework ~ Frequentist Approach, Incomplete Data ~ EM Algorithm. Repeat until convergence:
Parameter Domain Knowledge Framework ~ Frequentist Approach, Incomplete Data ~ Discrete Variables ~ EM Algorithm. Repeat until convergence:
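A minimal sketch of EM under a parameter-sharing constraint, for a single discrete distribution. The setup is an assumption made for illustration: each example is observed only as a set of candidate values (a singleton set means fully observed), and the M-step reuses the shared-parameter estimate theta_i = N_i / (k_i * N) with expected counts in place of observed ones. The function and argument names are hypothetical.

```python
def em_shared(observations, group_of, k, iters=50):
    """EM for one discrete distribution with parameter sharing.

    observations: list of sets; each set holds the values the hidden outcome
                  may have taken (assumed coarsening of the data).
    group_of:     group_of[x] is the index of the parameter used by value x.
    k:            k[i] is the number of values that share parameter i.
    Returns theta, where theta[i] is the probability of EACH value in group i.
    """
    theta = [1.0 / sum(k)] * len(k)  # start from the uniform distribution
    n = len(observations)
    for _ in range(iters):
        expected = [0.0] * len(k)
        # E-step: distribute each example over its candidate values
        # in proportion to the current parameters.
        for candidates in observations:
            z = sum(theta[group_of[x]] for x in candidates)
            for x in candidates:
                expected[group_of[x]] += theta[group_of[x]] / z
        # M-step: constrained maximum likelihood with expected counts.
        theta = [e / (k_i * n) for e, k_i in zip(expected, k)]
    return theta


# Usage: values 0..5 share one parameter, values 6..13 share another.
group_of = [0] * 6 + [1] * 8
theta = em_shared([{0}, {1}, {6}, {0, 6}], group_of, k=[6, 8])
```

Each iteration keeps the constraint sum_i k_i * theta_i = 1 satisfied, so the result is always a valid distribution over the 14 values.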
Parameter Domain Knowledge Framework ~ Computing the Normalization Constant ~
Parameter Domain Knowledge Framework ~ Computing the Normalization Constant ~ H(2) In H7: ε = 0.5
Simple Parameter Sharing ~ Maximum Likelihood Estimators ~
• Example: a cubical die, cut symmetrically at each corner: k1 = 6 (faces), k2 = 8 (corners)
• In general, the i-th parameter is shared in ki places
• Theorem. The Maximum Likelihood parameters are given by:
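A minimal sketch of the estimator for the cut-die example, under stated assumptions: N_i is the number of observations that landed on any of the k_i places sharing parameter i, and N is the total number of observations. Maximizing sum_i N_i log(theta_i) subject to sum_i k_i theta_i = 1 with a Lagrange multiplier gives theta_i = N_i / (k_i * N). The function name and the counts below are illustrative, not taken from the slides.

```python
def shared_parameter_mle(counts, k):
    """Maximum likelihood under parameter sharing.

    counts[i]: observations falling on any place that uses parameter i (N_i).
    k[i]:      number of places sharing parameter i (k_i).
    Returns theta[i], the probability of EACH single place in group i.
    """
    n = sum(counts)
    return [n_i / (k_i * n) for n_i, k_i in zip(counts, k)]


# Cut die: 6 original faces share theta_1, 8 cut corners share theta_2.
# Hypothetical data: 90 rolls ended on a face, 10 on a corner.
theta_face, theta_corner = shared_parameter_mle([90, 10], [6, 8])
```

With these counts each face gets probability 90/600 and each corner 10/800, and the 14 probabilities sum to one. Only two numbers are estimated instead of thirteen free parameters, which is where the variance reduction comes from.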
Simple Parameter Sharing ~ Variance Reduction in Parameter Estimates ~
Simple Parameter Sharing ~ Experiments: Learning a Probability Distribution ~
• Synthetic Dataset:
  • Probability distribution over 50 values
  • 50 randomly generated parameters:
    • 6 shared between 2 and 5 times each, together accounting for half of the values
    • The rest “not shared” (shared exactly once)
  • 1000 examples sampled from this distribution
• Purpose:
  • Domain Knowledge readily available
  • Study the effect of training set size (up to 1000)
  • Compare the estimated distribution to the true distribution
• Models:
  • STBN (Standard Bayesian Network)
  • PDKBN (Bayesian Network with PDK)
Experimental Results
• PDKBN performs better than STBN
  • Largest difference: 0.05 (30 examples)
  • On average, STBN needs 1.86 times more examples to catch up in KL:
    • 40 (PDKBN) ~ 103 (STBN)
    • 200 (PDKBN) ~ 516 (STBN)
    • 650 (PDKBN) ~ >1000 (STBN)
• The difference between PDKBN and STBN shrinks as the training set grows, but PDKBN is much better when training data is scarce.
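The comparison above can be imitated on a toy problem. The sketch below is not the experiment from the slides: the group sizes, the random weights, the seed, and the 0.5 pseudo-count (added so that KL stays finite when a value is never sampled) are all assumptions. It estimates one distribution with and without pooling counts inside the known sharing groups and reports KL(true || estimate) for both.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def estimate(sample, group_of, share, pseudo=0.5):
    """Smoothed relative frequencies. With share=True, counts are pooled
    within each sharing group and split evenly among its members."""
    counts = [pseudo] * len(group_of)  # pseudo-count is an assumption
    for x in sample:
        counts[x] += 1
    if share:
        totals, sizes = {}, {}
        for x, g in enumerate(group_of):
            totals[g] = totals.get(g, 0.0) + counts[x]
            sizes[g] = sizes.get(g, 0) + 1
        counts = [totals[g] / sizes[g] for g in group_of]
    n = sum(counts)
    return [c / n for c in counts]


def make_problem(rng, group_sizes=(5, 5, 5, 4, 3, 3), n_values=50):
    """Hypothetical setup: six shared groups, all other values unshared."""
    group_of = []
    for g, size in enumerate(group_sizes):
        group_of += [g] * size
    n_unshared = n_values - len(group_of)
    group_of += range(len(group_sizes), len(group_sizes) + n_unshared)
    weights = [rng.random() + 0.1 for _ in range(max(group_of) + 1)]
    raw = [weights[g] for g in group_of]
    total = sum(raw)
    return group_of, [w / total for w in raw]


def compare(n_examples, seed=0):
    """Return (KL for the standard estimate, KL for the sharing estimate)."""
    rng = random.Random(seed)
    group_of, true_p = make_problem(rng)
    sample = rng.choices(range(len(true_p)), weights=true_p, k=n_examples)
    standard = estimate(sample, group_of, share=False)
    shared = estimate(sample, group_of, share=True)
    return kl_divergence(true_p, standard), kl_divergence(true_p, shared)


for n in (30, 200, 1000):
    print(n, compare(n))
```

Averaging `compare` over many seeds would give a smoother picture of how the gap changes with training set size.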
Hidden Process Models
• One observation (trial):
• N different trials:
• All trials and all processes have equal length T
Parameter Sharing in HPMs • similar shape activity • different amplitudes Xv
Parameter Sharing in HPMs ~ Maximum Likelihood Estimation ~
• l’(P,C) is quadratic in (P,C), but
  • linear in P!
  • linear in C!
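Because the problem is linear in P for fixed C and linear in C for fixed P, one natural strategy is to alternate between the two. The sketch below illustrates that idea on a deliberately simplified model, which is an assumption and not the full HPM: a single process, no overlap between processes, and Gaussian noise with one common variance, so that maximum likelihood reduces to least squares on X[v][t] ≈ c[v] * p[t]. All names are hypothetical.

```python
def fit_shared_shape(X, iters=20):
    """Alternating least squares for X[v][t] ~ c[v] * p[t].

    X[v][t]: signal of voxel v at time t.
    p:       shared activity shape (one value per time slice).
    c:       per-voxel amplitude.
    Assumes X is not all zeros, so the denominators stay non-zero.
    """
    n_voxels, n_times = len(X), len(X[0])
    c = [1.0] * n_voxels
    p = [0.0] * n_times
    for _ in range(iters):
        # Fix c: each p[t] solves a one-dimensional linear least-squares problem.
        cc = sum(cv * cv for cv in c)
        p = [sum(c[v] * X[v][t] for v in range(n_voxels)) / cc
             for t in range(n_times)]
        # Fix p: each c[v] solves a one-dimensional linear least-squares problem.
        pp = sum(pt * pt for pt in p)
        c = [sum(p[t] * X[v][t] for t in range(n_times)) / pp
             for v in range(n_voxels)]
    return c, p


# Usage: three voxels with the same shape and different amplitudes.
shape = [0.0, 1.0, 2.0, 1.0]
amplitudes = [1.0, 2.0, 0.5]
X = [[a * s for s in shape] for a in amplitudes]
c, p = fit_shared_shape(X)
```

The pair (c, p) is only identified up to a scale factor, since multiplying c by a constant and dividing p by the same constant leaves the fit unchanged; only the products c[v] * p[t] are meaningful.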
Starplus Dataset
• Trial:
  • read sentence
  • view picture
  • answer whether sentence describes picture
• 40 trials, 32 time slices (2/sec)
  • picture presented first in half of the trials
  • sentence presented first in the other half
• Three possible objects: star, dollar, plus
• Collected by Just et al.
• IDEA: model using HPMs with two processes: “Sentence” and “Picture”
  • We assume a process starts when its stimulus is presented
  • Will use Shared HPMs where possible
Parameter Sharing in HPMs ~ Hierarchical Partitioning Algorithm ~
Parameter Sharing in HPMs ~ Experiments ~
• We compare three models, based on average (per-trial) likelihood:
  • StHPM: standard, per-voxel HPM
  • ShHPM: one HPM for all voxels in an ROI (24 total)
  • HieHPM: hierarchical HPM
• Effect of training set size (6 to 40) in CALC:
  • ShHPM is biased here
    • Better than StHPM at small sample sizes
    • Worse at 40 examples
  • HieHPM is the best
    • It can represent both models
    • e^106 times better data likelihood than StHPM at 40 examples
    • StHPM needs 2.9 times more examples to catch up
Parameter Sharing in HPMs ~ Experiments ~
Performance over the whole brain (40 examples):
• HieHPM is the best
  • e^1792 times better data likelihood than StHPM
  • Better than StHPM in 23/24 ROIs
  • Better than ShHPM in 12/24 ROIs, equal in 11/24
• ShHPM is second best
  • e^464 times better data likelihood than StHPM
  • Better than StHPM in 18/24 ROIs
  • It has bias, but it makes sense to share whole ROIs that are not involved in the cognitive task
Learned Voxel Clusters
• In the whole brain:
  • ~ 300 clusters
  • ~ 15 voxels / cluster
• In CALC:
  • ~ 60 clusters
  • ~ 5 voxels / cluster
Parameter Domain Knowledge Types
• DISCRETE:
  • Known Parameter Values
  • Parameter Sharing and Proportionality Constants: One Distribution
  • Sum Sharing and Ratio Sharing: One Distribution
  • Parameter Sharing and Hierarchical Sharing: Multiple Distributions
  • Sum Sharing and Ratio Sharing: Multiple Distributions
• CONTINUOUS (Gaussian Distributions):
  • Parameter Sharing and Proportionality Constants: One Distribution
  • Parameter Sharing in Hidden Process Models
• INEQUALITY CONSTRAINTS:
  • Between Sums of Parameters: One Distribution
  • Upper Bounds on Sums of Parameters: One Distribution
Probability Ratio Sharing
• Want to model P(Word|Language)
• Two languages: English, Spanish
  • Different sets of words
• Domain Knowledge:
  • Word groups, e.g. T1: Computer Words, T2: Business Words
    • About computers: computer, keyboard, monitor, etc.
  • Relative frequency of “computer” to “keyboard” is the same in both languages
  • Aggregate mass of a group can differ between languages
Probability Ratio Sharing
• DK: Parameters of a given color preserve their relative ratios across all distributions!
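A hedged sketch of what ratio sharing implies for estimation, in a simplified case that is an assumption here: every word belongs to exactly one group, and word i of group g in one language corresponds to word i of group g in the other. Writing each probability as (group mass in that language) × (within-group ratio shared by all languages), the log-likelihood splits into two independent parts, so the group masses come from per-language counts and the ratios from counts pooled over languages. The function name and the counts are hypothetical.

```python
def ratio_sharing_mle(counts):
    """counts[c][g][i]: count of the i-th word of group g in distribution c.

    Returns theta[c][g][i] = mass[c][g] * ratio[g][i], where ratio[g] is
    shared by all distributions and mass[c][g] is specific to distribution c.
    Assumes every group has at least one observation overall.
    """
    n_groups = len(counts[0])
    ratios = []
    for g in range(n_groups):
        size = len(counts[0][g])
        # Pool the counts of each word position over all distributions.
        pooled = [sum(dist[g][i] for dist in counts) for i in range(size)]
        total = sum(pooled)
        ratios.append([x / total for x in pooled])
    theta = []
    for dist in counts:
        n = sum(sum(group) for group in dist)
        theta.append([[(sum(group) / n) * r for r in ratios[g]]
                      for g, group in enumerate(dist)])
    return theta


# Usage with made-up counts: group 0 = computer words, group 1 = business words.
english = [[30, 10], [40, 20]]
spanish = [[6, 6], [24, 4]]
theta = ratio_sharing_mle([english, spanish])
```

In the result, the ratio between the two computer words is the same in both languages, while the total mass of the computer group differs between them.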
Inequalities between Sums of Parameters
• In spoken language:
  • Each Adverb comes along with a Verb
  • Each Adjective comes with a Noun or Pronoun
• Therefore it is reasonable to expect that:
  • The frequency of Adverbs is less than that of Verbs
  • The frequency of Adjectives is less than that of Nouns and Pronouns
• Equivalently:
• In general, within the same distribution:
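A minimal sketch of maximum likelihood under a single constraint of this kind, sum(theta[A]) <= sum(theta[B]), for one discrete distribution. It assumes A and B are disjoint index sets. If the relative frequencies already satisfy the constraint they are the answer; otherwise the constraint is active, A and B each receive half of their combined mass, and inside each set the mass is split in proportion to the counts. The function name and the counts are illustrative.

```python
def mle_with_sum_inequality(counts, A, B):
    """MLE of a discrete distribution subject to
    sum(theta[i] for i in A) <= sum(theta[j] for j in B).
    A and B are assumed to be disjoint lists of indices."""
    n = sum(counts)
    theta = [c / n for c in counts]
    n_a = sum(counts[i] for i in A)
    n_b = sum(counts[j] for j in B)
    if n_a <= n_b:
        return theta  # the unconstrained estimate is already feasible
    # Constraint is active: both sets get the same total mass.
    half = (n_a + n_b) / (2 * n)
    for i in A:
        theta[i] = half * counts[i] / n_a
    for j in B:
        # With no observations in B, any split is equally likely; use an even one.
        theta[j] = half * counts[j] / n_b if n_b > 0 else half / len(B)
    return theta


# Usage with made-up counts: index 0 = adverbs, 1 = verbs, 2 = everything else.
theta = mle_with_sum_inequality([30, 20, 50], A=[0], B=[1])
```

Here adverbs were observed more often than verbs, so the estimate is pulled onto the boundary: both classes get probability 0.25 and the rest keeps 0.5.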
Dirichlet Priors in a Bayes Net
• Prior Belief: the domain expert specifies an assignment of parameters
• Spread: leaves room for some error (variance)
• Several types:
  • Standard
  • Dirichlet Tree Priors
  • Dependent Dirichlet
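For reference, a minimal sketch of the closed-form MAP estimate under a standard Dirichlet prior on one multinomial, the baseline that the later slides compare against. The hyperparameters `alpha` encode the expert's prior belief and its spread; the function name and the numbers are illustrative.

```python
def dirichlet_map(counts, alpha):
    """MAP estimate of multinomial parameters under a Dirichlet(alpha) prior:
    theta_i = (N_i + alpha_i - 1) / (N + sum(alpha) - K).
    Requires N_i + alpha_i > 1 for every i so that the mode is interior."""
    adjusted = [n + a - 1.0 for n, a in zip(counts, alpha)]
    if min(adjusted) <= 0:
        raise ValueError("need counts[i] + alpha[i] > 1 for every i")
    total = sum(adjusted)
    return [x / total for x in adjusted]


# Usage: counts (3, 1) with a symmetric prior alpha = (2, 2).
theta = dirichlet_map([3, 1], [2.0, 2.0])
```

Larger values of alpha correspond to a tighter spread around the expert's assignment, so the data moves the estimate less.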
Markov Models
Module Networks
• In a Module:
  • Same parents
  • Same CPTs
(Image from “Learning Module Networks” by Eran Segal and Daphne Koller)
Context Specific Independence (figure: network with nodes Burglary, Set, Alarm)
Limitations of Current Models
• Dirichlet priors
  • When the number of parameters is huge, specifying a useful prior is difficult
  • Unable to enforce even simple constraints:
    • Need additional hyperparameters to enforce basic parameter sharing, but then no closed-form MAP estimates can be computed!
  • Dependent Dirichlet Priors are not conjugate priors
    • Our priors are dependent and also conjugate!
• Markov Models, Module Networks and CSI
  • Particular cases of our Parameter Sharing DK
  • Do not allow sharing at the parameter level of granularity
Summary
• Parameter-related Domain Knowledge is needed when data is scarce
  • Reduces the number of free parameters
  • Reduces the variance in parameter estimates (illustrated on Simple Parameter Sharing)
• Developed a unified Parameter Domain Knowledge Framework
  • From both a frequentist and a Bayesian point of view
  • For both complete and incomplete data
• Developed efficient learning algorithms for several types of PDK:
  • Closed-form solutions for most of these types
  • For both discrete and continuous variables
  • For both equality and inequality constraints
  • Particular cases of our parameter sharing framework: Markov Models, Module Nets, Context Specific Independence
• Developed a method for automatically learning the domain knowledge (illustrated on HPMs)
• Experiments show the superiority of models using PDK