Query-Specific Learning and Inference for Probabilistic Graphical Models. Anton Chechetka. Thesis committee: Carlos Guestrin, Eric Xing, J. Andrew Bagnell, Pedro Domingos (University of Washington). 14 June 2011
Motivation. Fundamental problem: reason accurately about noisy, high-dimensional data with local interactions
Sensor networks • noisy: sensors fail, readings are noisy • high-dimensional: many sensors, several readings (temperature, humidity, …) per sensor • local interactions: nearby locations have high correlations
Hypertext classification • noisy: automated text understanding is far from perfect • high-dimensional: a variable for every webpage • local interactions: directly linked pages have correlated topics
Image segmentation • noisy: local information is not enough (camera sensor noise, compression artifacts) • high-dimensional: a variable for every patch • local interactions: cows are next to grass, airplanes next to sky
Probabilistic graphical models. Noisy, high-dimensional data with local interactions → a graph over many variables that encodes only direct interactions → probabilistic inference P(query | evidence)
Graphical models semantics. A graph structure over X1, …, X7 corresponds to a factorized distribution: factors are defined over small subsets Xα of X (cliques joined by separators), giving a compact representation
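A minimal way to write the factorization this slide refers to (generic notation, assumed rather than taken from the slides):

```latex
P(X_1,\dots,X_n) \;=\; \frac{1}{Z} \prod_{\alpha} \psi_\alpha(X_\alpha),
\qquad
Z \;=\; \sum_{x_1,\dots,x_n} \prod_{\alpha} \psi_\alpha(x_\alpha)
```

Each X_α is a small subset of the variables (a clique of the graph), so the number of parameters grows with the clique sizes rather than with |X|.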
Graphical models workflow: learn/construct structure → learn/define parameters → inference P(Q | E=E)
Graphical models fundamental problems (with compounding errors): learn/construct structure: NP-complete; learn/define parameters: exp(|X|); inference P(Q | E=E): #P-complete (exact), NP-complete (approximate)
Domain knowledge structures don't help: domain knowledge-based structures (e.g. webpage link graphs) do not support tractable inference
This thesis: general directions • Emphasizing the computational aspects of the graph • Learn accurate and tractable models • Compensate for reduced expressive power with exact inference and optimal parameters • Gain significant speedups • Inference speedups via better prioritization of computation • Estimate the long-term effects of propagating information through the graph • Use long-term estimates to prioritize updates. New algorithms for learning and inference in graphical models to make answering queries better
Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
Generative learning • Useful when E is not known in advance • Sensors fail unpredictably • Measurements are expensive (e.g. user time), want adaptive evidence selection • Learning goal: P(Q, E); query goal: P(Q | E=E)
Tractable vs. intractable models workflow. Tractable models: learn simple tractable structure from domain knowledge + data → optimal parameters, exact inference → approx. P(Q | E=E). Intractable models: construct intractable structure from domain knowledge, or learn intractable structure from data → approximate algorithms with no quality guarantees → approx. P(Q | E=E)
Tractability via low treewidth • Exact inference (sum-product) is exponential in treewidth • Treewidth is NP-complete to compute in general • Low-treewidth graphs are easy to construct • Convenient representation: junction tree • Other tractable model classes exist too. Treewidth: size of the largest clique in a triangulated graph, minus one
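For a concrete feel for these quantities, here is a small sketch (not from the thesis) that bounds the treewidth of an example graph with NetworkX's min-degree heuristic; the 7-node graph is only an illustrative stand-in for the slide's figure.

```python
# Treewidth is NP-complete to compute exactly, so use networkx's min-degree
# heuristic, which returns an upper bound plus a tree decomposition.
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

# Illustrative 7-node graph (a stand-in for the slide's example figure).
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (1, 5), (2, 5), (3, 5),
                  (4, 5), (4, 6), (5, 6), (2, 7)])

width, decomposition = treewidth_min_degree(G)
print("treewidth upper bound:", width)
# Each node of `decomposition` is a bag (frozenset of variables); exact
# sum-product inference costs O(exp(bag size)) per bag.
for bag in decomposition.nodes:
    print(sorted(bag))
```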
Junction trees • Cliques connected by edges with separators (example: cliques {X1,X4,X5}, {X4,X5,X6}, {X1,X2,X5}, {X1,X3,X5}, {X1,X2,X7} with separators such as {X1,X5}, {X4,X5}, {X1,X2}) • Running intersection property • Finding the most likely junction tree of a given treewidth > 1 is NP-complete • We will look for good approximations
Independencies in low-treewidth distributions. P(X) factorizes according to a JT ⇒ conditional independencies hold ⇔ the corresponding conditional mutual informations are zero; it works in the other direction too. Example: given the separator {X1, X5}, the variables {X2, X3, X7} on one side of the tree are conditionally independent of {X4, X6} on the other
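The quantity behind these statements is conditional mutual information; in standard (assumed) notation:

```latex
I(A ; B \mid S) \;=\; \sum_{a,b,s} P(a,b,s)\,
\log \frac{P(a,b \mid s)}{P(a \mid s)\, P(b \mid s)},
\qquad
I(A ; B \mid S) = 0 \;\Longleftrightarrow\; A \perp B \mid S
```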
Constraint-based structure learning. Look for JTs where I(X_A, X_B | S) is small (constraint-based structure learning): for every candidate separator S from the set of all variables X, partition the remaining variables into weakly dependent subsets, then find a consistent junction tree
Mutual information complexity. I(X_A, X_{-A} | S) = H(X_A | S) − H(X_A | X_{-A}, S), where X_{-A} is everything except X_A and the H(· | ·) are conditional entropies. I(X_A, X_{-A} | S) depends on all assignments to X: exp(|X|) complexity in general. Our contribution: a polynomial-time upper bound
Mutual info upper bound: intuition. Computing I(A, B | C) directly is hard; computing I(D, F | C) for small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ k is easy • Only look at small subsets D, F • Polynomial number of small subsets • Polynomial complexity for every pair. Any conclusions about I(A, B | C)? In general, no; if a good junction tree exists, yes
Contribution: mutual info upper bound. Theorem: Suppose an ε-JT of treewidth k for P(A, B, C) exists. Let δ = max I(D, F | C) over all D ⊆ A, F ⊆ B with |D ∪ F| ≤ k + 1. Then I(A, B | C) ≤ |A ∪ B ∪ C| (δ + ε)
Mutual info upper bound: complexity • Direct computation: complexity exp(|A ∪ B ∪ C|) • Our upper bound: O(|A ∪ B|^(treewidth+1)) small subsets, exp(|C| + treewidth) time each • |C| = treewidth for structure learning ⇒ polynomial(|A ∪ B ∪ C|) complexity
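A schematic sketch of the bound as stated in the theorem above, under assumed helper names (estimate_cmi is hypothetical and would estimate I(D, F | C) of small variable sets from samples):

```python
# Polynomial-time upper bound on I(A, B | C): instead of the exponential-size
# exact computation, take the max of I(D, F | C) over all small subsets
# D of A and F of B with |D| + |F| <= treewidth + 1.
from itertools import combinations

def cmi_upper_bound(A, B, C, k, samples, estimate_cmi, epsilon=0.0):
    """Upper-bounds I(A, B | C) by |A u B u C| * (delta + epsilon),
    assuming an epsilon-JT of treewidth k exists for P(A, B, C), where
    delta = max I(D, F | C) over D in A, F in B with |D| + |F| <= k + 1."""
    delta = 0.0
    for total in range(2, k + 2):              # |D| + |F| <= k + 1
        for d_size in range(1, total):
            f_size = total - d_size
            for D in combinations(A, d_size):
                for F in combinations(B, f_size):
                    delta = max(delta, estimate_cmi(D, F, C, samples))
    return (len(A) + len(B) + len(C)) * (delta + epsilon)
```

For fixed treewidth k, the number of (D, F) pairs is polynomial in |A ∪ B|, matching the complexity claimed on the slide.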
Guarantees on learned model quality. Theorem: Suppose a strongly connected ε-JT of treewidth k for P(X) exists. Then, with probability at least (1 − δ) for a user-chosen δ, our algorithm finds a JT meeting the quality guarantee, using polynomially many samples and polynomial time. Corollary: strongly connected junction trees are PAC-learnable
Results – typical convergence time: [plot of test log-likelihood over time; higher is better] good results early on in practice
Results – log-likelihood: [plots comparing our method (higher is better) against baselines] OBS: local search in limited in-degree Bayes nets; Chow-Liu: most likely JTs of treewidth 1; Karger-Srebro: constant-factor approximation JTs
Conclusions • A tractable upper bound on conditional mutual info • Graceful quality degradation and PAC learnability guarantees • Analysis of when dynamic programming works [in the thesis] • Dealing with an unknown mutual information threshold [in the thesis] • Speedups preserving the guarantees • Further speedups without guarantees
Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
Discriminative learning • Useful when the evidence variables E are always the same • Non-adaptive, one-shot observation • Image pixels → scene description • Document text → topic, named entities • Better accuracy than generative models • Learning goal: P(Q | E); query goal: P(Q | E=E)
Discriminative log-linear models • Weights: learned from data • Features: domain knowledge • Evidence-dependent normalization • Don't sum over all values of E • Don't model P(E) • No need for structure over E, only over the query
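The standard conditional log-linear (CRF) form the slide describes, written in assumed notation:

```latex
P(Q \mid E) \;=\; \frac{1}{Z(E)}
\exp\!\Big( \sum_i w_i\, f_i(Q, E) \Big),
\qquad
Z(E) \;=\; \sum_{Q} \exp\!\Big( \sum_i w_i\, f_i(Q, E) \Big)
```

The normalization Z(E) sums only over query assignments for the observed evidence, which is why no structure or distribution over E is needed.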
Model tractability still important. Observation #1: tractable models are necessary for exact inference and parameter learning in the discriminative setting. Tractability is determined by the structure over the query
Simple local models: motivation. Locally, the query is almost a linear function of the evidence (Q ≈ f(E)). Exploiting evidence values overcomes the expressive power deficit of simple models. We will learn local tractable models
Context-specific independence (for some evidence values no edge between query variables is needed). Observation #2: use evidence values at test time to tune the structure of the models; do not commit to a single tractable model
Low-dimensional dependencies in generative structure learning. Generative structure learning often relies only on low-dimensional marginals (cliques and separators): junction trees have decomposable scores and low-dimensional independence tests, so small changes to the structure allow quick score recomputation. Discriminative structure learning: needs inference in the full model for every datapoint, even for small changes in structure
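A concrete instance of such a decomposable score (given here only as an assumed example, in line with the Chow-Liu algorithm referenced later) is the Chow-Liu tree objective, which depends only on pairwise marginals:

```latex
T^\star \;=\; \arg\max_{T\ \text{spanning tree}} \;\; \sum_{(i,j) \in T} I(Q_i ; Q_j)
```

Swapping a single edge of T changes the score by the difference of just two edge weights, which is why small structure changes are cheap to re-score in the generative case.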
Leverage generative learning. Observation #3: generative structure learning algorithms have very useful properties; can we leverage them?
Observations so far • The discriminative setting has extra information, including evidence values at test time • We want to use it to learn local tractable models • Good structure learning algorithms exist for the generative setting that only require low-dimensional marginals of P(Q) • Approach: 1. use local conditionals P(Q | E=E) as “fake marginals” to learn local tractable structures; 2. learn exact discriminative feature weights
Evidence-specific CRF overview • Approach: 1. use local conditionals P(Q | E=E) as “fake marginals” to learn local tractable structures; 2. learn exact discriminative feature weights • Pipeline: local conditional density estimators P(Q | E) + evidence value E=E → P(Q | E=E) → generative structure learning algorithm → tractable structure for E=E; together with feature weights w → tractable evidence-specific CRF
Evidence-specific CRF formalism. Observation: an identically zero feature (f ≡ 0) does not affect the model. Add extra “structural” parameters u and an evidence-specific structure indicator I(E, u) ∈ {0, 1} for each feature: evidence-specific model = fixed dense model × evidence-specific tree “mask”, with a different mask (and thus a different tree) for each evidence value E=E1, E=E2, E=E3, …
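Putting the pieces together in assumed notation (the indicator I_i zeroes out every feature outside the evidence-specific tree):

```latex
P(Q \mid E, w, u) \;=\; \frac{1}{Z(E, w, u)}
\exp\!\Big( \sum_i w_i \, I_i(E, u)\, f_i(Q, E) \Big)
```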
Evidence-specific CRF learning. Learning proceeds in the same order as testing: local conditional density estimators P(Q | E) + evidence value E=E → P(Q | E=E) → generative structure learning algorithm → tractable structure for E=E; together with feature weights w → tractable evidence-specific CRF
Plug in generative structure learning. I(E, u) encodes the output of the chosen structure learning algorithm; directly generalize generative algorithms. Generative: P(Qi, Qj) (pairwise marginals) + Chow-Liu algorithm = optimal tree. Discriminative: P(Qi, Qj | E=E) (pairwise conditionals) + Chow-Liu algorithm = good tree for E=E
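A minimal sketch of that Chow-Liu step, assuming a hypothetical helper `pairwise_cmi(qi, qj, evidence)` that returns an estimate of the pairwise (conditional) mutual information from the learned local models:

```python
# Chow-Liu step for one evidence value: weight a complete graph over the
# query variables by estimated pairwise mutual information (conditioned on
# the observed evidence) and keep a maximum spanning tree.
import networkx as nx

def evidence_specific_tree(query_vars, evidence, pairwise_cmi):
    G = nx.Graph()
    for idx, qi in enumerate(query_vars):
        for qj in query_vars[idx + 1:]:
            # In the generative setting the weight would be I(Q_i; Q_j);
            # here it is I(Q_i; Q_j | E = evidence) from the local estimators.
            G.add_edge(qi, qj, weight=pairwise_cmi(qi, qj, evidence))
    return nx.maximum_spanning_tree(G, weight="weight")
```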
Evidence-specific CRF learning: structure. Choose a generative structure learning algorithm A (e.g. Chow-Liu). Identify the low-dimensional subsets Qβ that A may need (for Chow-Liu: all pairs (Qi, Qj)). The original problem over (Q, E) decomposes into low-dimensional pairwise problems: (Q1, Q2 | E), (Q1, Q3 | E), (Q3, Q4 | E), …
Estimating low-dimensional conditionals. Use the same features as the baseline high-treewidth model: restrict the scope of the baseline CRF to each low-dimensional subset and train the resulting low-dimensional model. End result: optimal parameters u
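One way to write the scope-restricted estimator this slide sketches (assumed notation, with one parameter vector u_β per subset):

```latex
P(Q_\beta \mid E, u_\beta) \;=\; \frac{1}{Z_\beta(E, u_\beta)}
\exp\!\Big( \sum_{i \,:\, \mathrm{scope}(f_i)\, \subseteq\, Q_\beta}
u_{\beta i}\, f_i(Q_\beta, E) \Big)
```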
Evidence-specific CRF learning: weights • The algorithm behind I(E, u) is already chosen • The parameters u are already learned, so the “effective features” I(E, u) · f(Q, E) are fixed • Only need to learn the feature weights w • log P(Q | E, w, u) is concave in w ⇒ unique global optimum
Evidence-specific CRF learning: weights. For every training example (E=Em, Q=Qm), the masked model (fixed dense model × evidence-specific tree “mask”) is tree-structured, so the gradient of the log-likelihood with respect to w can be computed exactly by inference in that tree; the overall (dense) gradient is the sum of these exact tree-structured gradients over the examples
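Written out under the same assumed notation, the masked log-linear gradient this corresponds to is:

```latex
\frac{\partial}{\partial w_i} \sum_m \log P(Q_m \mid E_m, w, u)
\;=\; \sum_m I_i(E_m, u)\,
\Big( f_i(Q_m, E_m) - \mathbb{E}_{P(Q \mid E_m, w, u)}\big[ f_i(Q, E_m)\big] \Big)
```

Each expectation is taken over a tree-structured distribution, so it is exact and cheap to compute.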
Results – WebKB (text + links → webpage topic): [plots of prediction error and time, lower is better, comparing: ignore links, standard dense CRF, our work, max-margin model]
Image segmentation - accuracy (local segment features + neighbor segments → type of object): [accuracy plot, higher is better, comparing: ignore links, standard dense CRF, our work]
Image segmentation - time: [train time and test time plots (log scale), comparing: ignore links, standard dense CRF, our work]
Conclusions • Using evidence values to tune low-treewidth model structure • Compensates for the reduced expressive power • Order of magnitude speedup at test time (sometimes at train time too) • General framework for plugging in existing generative structure learners • Straightforward relational extension [in the thesis]
Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
Why high-treewidth models? • A dense model expressing laws of nature • Protein folding • Max-margin parameters don’t work well (yet?) with evidence-specific structures