Query-Specific Learning and Inference for Probabilistic Graphical Models. Anton Chechetka. Thesis committee: Carlos Guestrin, Eric Xing, J. Andrew Bagnell, Pedro Domingos (University of Washington). 14 June 2011
Motivation. Fundamental problem: reason accurately about noisy, high-dimensional data with local interactions
Sensor networks • noisy: sensors fail, readings are noisy • high-dimensional: many sensors, several readings (temperature, humidity, …) per sensor • local interactions: nearby locations have high correlations
Hypertext classification • noisy: automated text understanding is far from perfect • high-dimensional: a variable for every webpage • local interactions: directly linked pages have correlated topics
Image segmentation • noisy: local information is not enough (camera sensor noise, compression artifacts) • high-dimensional: a variable for every patch • local interactions: cows are next to grass, airplanes next to sky
Probabilistic graphical models. Noisy, high-dimensional data with local interactions → a graph over many variables that encodes only direct interactions → probabilistic inference P(query | evidence)
Graphical models semantics. A graph structure over X1, …, X7 corresponds to a factorized distribution: factors are defined over small subsets Xα of X (cliques joined by separators), giving a compact representation
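A minimal way to write the factorization this slide refers to (generic notation, assumed rather than taken from the slides):

```latex
P(X_1,\dots,X_n) \;=\; \frac{1}{Z} \prod_{\alpha} \psi_\alpha(X_\alpha),
\qquad
Z \;=\; \sum_{x_1,\dots,x_n} \prod_{\alpha} \psi_\alpha(x_\alpha)
```

Each X_α is a small subset of the variables (a clique of the graph), so the number of parameters grows with the clique sizes rather than with |X|.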
Graphical models workflow: learn/construct structure → learn/define parameters → inference P(Q | E=E)
Graphical models fundamental problems (with compounding errors): learn/construct structure: NP-complete; learn/define parameters: exp(|X|); inference P(Q | E=E): #P-complete (exact), NP-complete (approximate)
Domain knowledge structures don't help: domain knowledge-based structures (e.g. webpage link graphs) do not support tractable inference
This thesis: general directions • Emphasizing the computational aspects of the graph • Learn accurate and tractable models • Compensate for reduced expressive power with exact inference and optimal parameters • Gain significant speedups • Inference speedups via better prioritization of computation • Estimate the long-term effects of propagating information through the graph • Use long-term estimates to prioritize updates. New algorithms for learning and inference in graphical models to make answering queries better
Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
Generative learning • Useful when E is not known in advance • Sensors fail unpredictably • Measurements are expensive (e.g. user time), want adaptive evidence selection • Learning goal: P(Q, E); query goal: P(Q | E=E)
Tractable vs. intractable models workflow. Tractable models: learn simple tractable structure from domain knowledge + data → optimal parameters, exact inference → approx. P(Q | E=E). Intractable models: construct intractable structure from domain knowledge, or learn intractable structure from data → approximate algorithms with no quality guarantees → approx. P(Q | E=E)
Tractability via low treewidth • Exact inference (sum-product) is exponential in treewidth • Treewidth is NP-complete to compute in general • Low-treewidth graphs are easy to construct • Convenient representation: junction tree • Other tractable model classes exist too. Treewidth: size of the largest clique in a triangulated graph, minus one
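For a concrete feel for these quantities, here is a small sketch (not from the thesis) that bounds the treewidth of an example graph with NetworkX's min-degree heuristic; the 7-node graph is only an illustrative stand-in for the slide's figure.

```python
# Treewidth is NP-complete to compute exactly, so use networkx's min-degree
# heuristic, which returns an upper bound plus a tree decomposition.
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

# Illustrative 7-node graph (a stand-in for the slide's example figure).
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (1, 5), (2, 5), (3, 5),
                  (4, 5), (4, 6), (5, 6), (2, 7)])

width, decomposition = treewidth_min_degree(G)
print("treewidth upper bound:", width)
# Each node of `decomposition` is a bag (frozenset of variables); exact
# sum-product inference costs O(exp(bag size)) per bag.
for bag in decomposition.nodes:
    print(sorted(bag))
```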
Junction trees • Cliques connected by edges with separators (example: cliques {X1,X4,X5}, {X4,X5,X6}, {X1,X2,X5}, {X1,X3,X5}, {X1,X2,X7} with separators such as {X1,X5}, {X4,X5}, {X1,X2}) • Running intersection property • Finding the most likely junction tree of a given treewidth > 1 is NP-complete • We will look for good approximations
Independencies in low-treewidth distributions. P(X) factorizes according to a JT ⇒ conditional independencies hold ⇔ the corresponding conditional mutual informations are zero; it works in the other direction too. Example: given the separator {X1, X5}, the variables {X2, X3, X7} on one side of the tree are conditionally independent of {X4, X6} on the other
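The quantity behind these statements is conditional mutual information; in standard (assumed) notation:

```latex
I(A ; B \mid S) \;=\; \sum_{a,b,s} P(a,b,s)\,
\log \frac{P(a,b \mid s)}{P(a \mid s)\, P(b \mid s)},
\qquad
I(A ; B \mid S) = 0 \;\Longleftrightarrow\; A \perp B \mid S
```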
Constraint-based structure learning. Look for JTs where I(X_A, X_B | S) is small (constraint-based structure learning): for every candidate separator S from the set of all variables X, partition the remaining variables into weakly dependent subsets, then find a consistent junction tree
Mutual information complexity. I(X_A, X_{-A} | S) = H(X_A | S) − H(X_A | X_{-A}, S), where X_{-A} is everything except X_A and the H(· | ·) are conditional entropies. I(X_A, X_{-A} | S) depends on all assignments to X: exp(|X|) complexity in general. Our contribution: a polynomial-time upper bound
Mutual info upper bound: intuition. Computing I(A, B | C) directly is hard; computing I(D, F | C) for small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ k is easy • Only look at small subsets D, F • Polynomial number of small subsets • Polynomial complexity for every pair. Any conclusions about I(A, B | C)? In general, no; if a good junction tree exists, yes
Contribution: mutual info upper bound. Theorem: Suppose an ε-JT of treewidth k for P(A, B, C) exists. Let δ = max I(D, F | C) over all D ⊆ A, F ⊆ B with |D ∪ F| ≤ k + 1. Then I(A, B | C) ≤ |A ∪ B ∪ C| (δ + ε)
Mutual info upper bound: complexity • Direct computation: complexity exp(|A ∪ B ∪ C|) • Our upper bound: O(|A ∪ B|^(treewidth+1)) small subsets, exp(|C| + treewidth) time each • |C| = treewidth for structure learning ⇒ polynomial(|A ∪ B ∪ C|) complexity
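A schematic sketch of the bound as stated in the theorem above, under assumed helper names (estimate_cmi is hypothetical and would estimate I(D, F | C) of small variable sets from samples):

```python
# Polynomial-time upper bound on I(A, B | C): instead of the exponential-size
# exact computation, take the max of I(D, F | C) over all small subsets
# D of A and F of B with |D| + |F| <= treewidth + 1.
from itertools import combinations

def cmi_upper_bound(A, B, C, k, samples, estimate_cmi, epsilon=0.0):
    """Upper-bounds I(A, B | C) by |A u B u C| * (delta + epsilon),
    assuming an epsilon-JT of treewidth k exists for P(A, B, C), where
    delta = max I(D, F | C) over D in A, F in B with |D| + |F| <= k + 1."""
    delta = 0.0
    for total in range(2, k + 2):              # |D| + |F| <= k + 1
        for d_size in range(1, total):
            f_size = total - d_size
            for D in combinations(A, d_size):
                for F in combinations(B, f_size):
                    delta = max(delta, estimate_cmi(D, F, C, samples))
    return (len(A) + len(B) + len(C)) * (delta + epsilon)
```

For fixed treewidth k, the number of (D, F) pairs is polynomial in |A ∪ B|, matching the complexity claimed on the slide.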
Guarantees on learned model quality. Theorem: Suppose a strongly connected ε-JT of treewidth k for P(X) exists. Then, with probability at least (1 − δ) for a user-chosen δ, our algorithm finds a JT meeting the quality guarantee, using polynomially many samples and polynomial time. Corollary: strongly connected junction trees are PAC-learnable
Results – typical convergence time: [plot of test log-likelihood over time; higher is better] good results early on in practice
Results – log-likelihood: [plots comparing our method (higher is better) against baselines] OBS: local search in limited in-degree Bayes nets; Chow-Liu: most likely JTs of treewidth 1; Karger-Srebro: constant-factor approximation JTs
Conclusions • A tractable upper bound on conditional mutual info • Graceful quality degradation and PAC learnability guarantees • Analysis of when dynamic programming works [in the thesis] • Dealing with an unknown mutual information threshold [in the thesis] • Speedups preserving the guarantees • Further speedups without guarantees
Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
Discriminative learning • Useful when the evidence variables E are always the same • Non-adaptive, one-shot observation • Image pixels → scene description • Document text → topic, named entities • Better accuracy than generative models • Learning goal: P(Q | E); query goal: P(Q | E=E)
Discriminative log-linear models • Weights: learned from data • Features: domain knowledge • Evidence-dependent normalization • Don't sum over all values of E • Don't model P(E) • No need for structure over E, only over the query
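The standard conditional log-linear (CRF) form the slide describes, written in assumed notation:

```latex
P(Q \mid E) \;=\; \frac{1}{Z(E)}
\exp\!\Big( \sum_i w_i\, f_i(Q, E) \Big),
\qquad
Z(E) \;=\; \sum_{Q} \exp\!\Big( \sum_i w_i\, f_i(Q, E) \Big)
```

The normalization Z(E) sums only over query assignments for the observed evidence, which is why no structure or distribution over E is needed.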
Model tractability still important. Observation #1: tractable models are necessary for exact inference and parameter learning in the discriminative setting. Tractability is determined by the structure over the query
Simple local models: motivation. Locally, the query is almost a linear function of the evidence (Q ≈ f(E)). Exploiting evidence values overcomes the expressive power deficit of simple models. We will learn local tractable models
Context-specific independence (for some evidence values no edge between query variables is needed). Observation #2: use evidence values at test time to tune the structure of the models; do not commit to a single tractable model
Low-dimensional dependencies in generative structure learning. Generative structure learning often relies only on low-dimensional marginals (cliques and separators): junction trees have decomposable scores and low-dimensional independence tests, so small changes to the structure allow quick score recomputation. Discriminative structure learning: needs inference in the full model for every datapoint, even for small changes in structure
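A concrete instance of such a decomposable score (given here only as an assumed example, in line with the Chow-Liu algorithm referenced later) is the Chow-Liu tree objective, which depends only on pairwise marginals:

```latex
T^\star \;=\; \arg\max_{T\ \text{spanning tree}} \;\; \sum_{(i,j) \in T} I(Q_i ; Q_j)
```

Swapping a single edge of T changes the score by the difference of just two edge weights, which is why small structure changes are cheap to re-score in the generative case.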
Leverage generative learning. Observation #3: generative structure learning algorithms have very useful properties; can we leverage them?
Observations so far • The discriminative setting has extra information, including evidence values at test time • We want to use it to learn local tractable models • Good structure learning algorithms exist for the generative setting that only require low-dimensional marginals of P(Q) • Approach: 1. use local conditionals P(Q | E=E) as “fake marginals” to learn local tractable structures; 2. learn exact discriminative feature weights
Evidence-specific CRF overview • Approach: 1. use local conditionals P(Q | E=E) as “fake marginals” to learn local tractable structures; 2. learn exact discriminative feature weights • Pipeline: local conditional density estimators P(Q | E) + evidence value E=E → P(Q | E=E) → generative structure learning algorithm → tractable structure for E=E; together with feature weights w → tractable evidence-specific CRF
Evidence-specific CRF formalism. Observation: an identically zero feature (f ≡ 0) does not affect the model. Add extra “structural” parameters u and an evidence-specific structure indicator I(E, u) ∈ {0, 1} for each feature: evidence-specific model = fixed dense model × evidence-specific tree “mask”, with a different mask (and thus a different tree) for each evidence value E=E1, E=E2, E=E3, …
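Putting the pieces together in assumed notation (the indicator I_i zeroes out every feature outside the evidence-specific tree):

```latex
P(Q \mid E, w, u) \;=\; \frac{1}{Z(E, w, u)}
\exp\!\Big( \sum_i w_i \, I_i(E, u)\, f_i(Q, E) \Big)
```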
Evidence-specific CRF learning. Learning proceeds in the same order as testing: local conditional density estimators P(Q | E) + evidence value E=E → P(Q | E=E) → generative structure learning algorithm → tractable structure for E=E; together with feature weights w → tractable evidence-specific CRF
Plug in generative structure learning. I(E, u) encodes the output of the chosen structure learning algorithm; directly generalize generative algorithms. Generative: P(Qi, Qj) (pairwise marginals) + Chow-Liu algorithm = optimal tree. Discriminative: P(Qi, Qj | E=E) (pairwise conditionals) + Chow-Liu algorithm = good tree for E=E
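A minimal sketch of that Chow-Liu step, assuming a hypothetical helper `pairwise_cmi(qi, qj, evidence)` that returns an estimate of the pairwise (conditional) mutual information from the learned local models:

```python
# Chow-Liu step for one evidence value: weight a complete graph over the
# query variables by estimated pairwise mutual information (conditioned on
# the observed evidence) and keep a maximum spanning tree.
import networkx as nx

def evidence_specific_tree(query_vars, evidence, pairwise_cmi):
    G = nx.Graph()
    for idx, qi in enumerate(query_vars):
        for qj in query_vars[idx + 1:]:
            # In the generative setting the weight would be I(Q_i; Q_j);
            # here it is I(Q_i; Q_j | E = evidence) from the local estimators.
            G.add_edge(qi, qj, weight=pairwise_cmi(qi, qj, evidence))
    return nx.maximum_spanning_tree(G, weight="weight")
```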
Evidence-specific CRF learning: structure. Choose a generative structure learning algorithm A (e.g. Chow-Liu). Identify the low-dimensional subsets Qβ that A may need (for Chow-Liu: all pairs (Qi, Qj)). The original problem over (Q, E) decomposes into low-dimensional pairwise problems: (Q1, Q2 | E), (Q1, Q3 | E), (Q3, Q4 | E), …
Estimating low-dimensional conditionals. Use the same features as the baseline high-treewidth model: restrict the scope of the baseline CRF to each low-dimensional subset and train the resulting low-dimensional model. End result: optimal parameters u
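One way to write the scope-restricted estimator this slide sketches (assumed notation, with one parameter vector u_β per subset):

```latex
P(Q_\beta \mid E, u_\beta) \;=\; \frac{1}{Z_\beta(E, u_\beta)}
\exp\!\Big( \sum_{i \,:\, \mathrm{scope}(f_i)\, \subseteq\, Q_\beta}
u_{\beta i}\, f_i(Q_\beta, E) \Big)
```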
Evidence-specific CRF learning: weights • The algorithm behind I(E, u) is already chosen • The parameters u are already learned, so the “effective features” I(E, u) · f(Q, E) are fixed • Only need to learn the feature weights w • log P(Q | E, w, u) is concave in w ⇒ unique global optimum
Evidence-specific CRF learning: weights. For every training example (E=Em, Q=Qm), the masked model (fixed dense model × evidence-specific tree “mask”) is tree-structured, so the gradient of the log-likelihood with respect to w can be computed exactly by inference in that tree; the overall (dense) gradient is the sum of these exact tree-structured gradients over the examples
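Written out under the same assumed notation, the masked log-linear gradient this corresponds to is:

```latex
\frac{\partial}{\partial w_i} \sum_m \log P(Q_m \mid E_m, w, u)
\;=\; \sum_m I_i(E_m, u)\,
\Big( f_i(Q_m, E_m) - \mathbb{E}_{P(Q \mid E_m, w, u)}\big[ f_i(Q, E_m)\big] \Big)
```

Each expectation is taken over a tree-structured distribution, so it is exact and cheap to compute.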
Results – WebKB (text + links → webpage topic): [plots of prediction error and time, lower is better, comparing: ignore links, standard dense CRF, our work, max-margin model]
Image segmentation - accuracy (local segment features + neighbor segments → type of object): [accuracy plot, higher is better, comparing: ignore links, standard dense CRF, our work]
Image segmentation - time: [train time and test time plots (log scale), comparing: ignore links, standard dense CRF, our work]
Conclusions • Using evidence values to tune low-treewidth model structure • Compensates for the reduced expressive power • Order of magnitude speedup at test time (sometimes at train time too) • General framework for plugging in existing generative structure learners • Straightforward relational extension [in the thesis]
Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
Why high-treewidth models? • A dense model expressing laws of nature • Protein folding • Max-margin parameters don’t work well (yet?) with evidence-specific structures