
Query-Specific Learning and Inference for Probabilistic Graphical Models



  1. Query-Specific Learning and Inference for Probabilistic Graphical Models. Anton Chechetka. Thesis committee: Carlos Guestrin, Eric Xing, J. Andrew Bagnell, Pedro Domingos (University of Washington). 14 June 2011

  2. Motivation. Fundamental problem: to reason accurately about noisy, high-dimensional data with local interactions.

  3. Sensor networks • noisy: sensors fail, noise in readings • high-dimensional: many sensors, several readings (temperature, humidity, …) per sensor • local interactions: nearby locations have high correlations

  4. Hypertext classification • noisy: automated text understanding is far from perfect • high-dimensional: a variable for every webpage • local interactions: directly linked pages have correlated topics

  5. Image segmentation • noisy: local information is not enough (camera sensor noise, compression artifacts) • high-dimensional: a variable for every patch • local interactions: cows are next to grass, airplanes next to sky

  6. Probabilistic graphical models. Noisy, high-dimensional data with local interactions → probabilistic inference with a graph that encodes only direct interactions over many variables; the variables split into query and evidence.

  7. Graphical models semantics. Factorized distributions ↔ graph structure: P(X) = (1/Z) ∏α ψα(Xα), where the factor scopes Xα are small subsets of X ⇒ compact representation. [Figure: example graph over X1–X7 with a separator highlighted.]
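
As a concrete illustration of the factorized form above, here is a minimal Python sketch (the variables, factors, and factor values are hypothetical, not taken from the slides): pairwise factors over small subsets of three binary variables, normalized by brute force.

import itertools

# Toy factorized distribution over three binary variables; factor values are made up.
factors = {
    ("X1", "X2"): lambda x1, x2: 2.0 if x1 == x2 else 0.5,  # psi(X1, X2)
    ("X2", "X3"): lambda x2, x3: 3.0 if x2 == x3 else 1.0,  # psi(X2, X3)
}
variables = ["X1", "X2", "X3"]

def unnormalized(assignment):
    """Product of all factors, each evaluated only on its small scope."""
    p = 1.0
    for scope, psi in factors.items():
        p *= psi(*(assignment[v] for v in scope))
    return p

# Brute-force partition function Z (feasible only for tiny models;
# low-treewidth structure is what keeps this tractable at scale).
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in itertools.product([0, 1], repeat=len(variables)))

def prob(assignment):
    return unnormalized(assignment) / Z

print(prob({"X1": 0, "X2": 0, "X3": 0}))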

  8. Graphical models workflow: learn/construct the structure, learn/define the parameters, then run inference to compute P(Q | E=E). [Figure: factorized distribution and the example graph over X1–X7.]

  9. Graphical models: fundamental problems. Learning/constructing the structure is NP-complete; learning/defining the parameters takes exp(|X|) time; inference for P(Q | E=E) is #P-complete (exact) and NP-complete (approximate); errors compound across these stages.

  10. Domain knowledge structures don't help. Domain knowledge-based structures do not support tractable inference (webpages).

  11. This thesis: general directions • Emphasizing the computational aspects of the graph • Learn accurate and tractable models • Compensate for reduced expressive power with exact inference and optimal parameters • Gain significant speedups • Inference speedups via better prioritization of computation • Estimate the long-term effects of propagating information through the graph • Use long-term estimates to prioritize updates. Overall: new algorithms for learning and inference in graphical models to make answering the queries better.

  12. Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

  13. Generative learning • Useful when E is not known in advance • Sensors fail unpredictably • Measurements are expensive (e.g. user time), want adaptive evidence selection. Learning goal: P(Q, E); query goal: P(Q | E=E).

  14. Tractable vs intractable models workflow. Tractable models: learn a simple tractable structure from domain knowledge + data; optimal parameters, exact inference; approximate P(Q | E=E). Intractable models: construct an intractable structure from domain knowledge, or learn an intractable structure from data; approximate algorithms with no quality guarantees; approximate P(Q | E=E).

  15. Tractability via low treewidth. Treewidth: size of the largest clique in a triangulated graph. • Exact inference is exponential in treewidth (sum-product) • Treewidth is NP-complete to compute in general • Low-treewidth graphs are easy to construct • Convenient representation: junction tree • Other tractable model classes exist too. [Figure: example graph over nodes 1–7.]
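
Exact treewidth is NP-complete, so in practice one settles for heuristic upper bounds. A minimal sketch, assuming networkx (≥ 2.2) with its approximation package is available; the 7-node example graph is made up, loosely echoing the slide figure.

import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

# Hypothetical graph; edges chosen only for illustration.
G = nx.Graph([(1, 2), (1, 3), (1, 5), (2, 5), (3, 5),
              (4, 5), (4, 6), (5, 6), (2, 7)])

# Min-degree heuristic: returns an upper bound on treewidth plus a
# tree decomposition whose nodes are the bags (variable subsets).
width, decomposition = treewidth_min_degree(G)
print("heuristic treewidth upper bound:", width)
print("bags:", list(decomposition.nodes()))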

  16. Junction trees • Cliques connected by edges with separators • Running intersection property • Finding the most likely junction tree of given treewidth >1 is NP-complete • We will look for good approximations. [Figure: example junction tree with cliques C1–C5 such as {X1,X4,X5}, {X4,X5,X6}, {X1,X2,X5}, {X1,X3,X5}, {X1,X2,X7} and separators such as {X4,X5}, {X1,X5}, {X1,X2}.]
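
For concreteness, a minimal sketch (assuming networkx) of checking the running intersection property: every variable shared by two cliques must appear in every clique on the path between them. The clique sets follow the slide figure; the tree edges are a plausible reconstruction, not taken from the slides.

import networkx as nx

# Cliques from the slide figure; the edges below form an illustrative tree over them.
cliques = [frozenset({"X1", "X4", "X5"}), frozenset({"X4", "X5", "X6"}),
           frozenset({"X1", "X2", "X5"}), frozenset({"X1", "X3", "X5"}),
           frozenset({"X1", "X2", "X7"})]
T = nx.Graph()
T.add_nodes_from(cliques)
T.add_edges_from([(cliques[0], cliques[1]), (cliques[0], cliques[2]),
                  (cliques[2], cliques[3]), (cliques[2], cliques[4])])

def has_running_intersection(tree):
    """Check that every shared variable appears in all cliques on the connecting path."""
    for a in tree.nodes():
        for b in tree.nodes():
            if a == b:
                continue
            path = nx.shortest_path(tree, a, b)   # cliques along the unique tree path
            for shared in a & b:
                if not all(shared in clique for clique in path):
                    return False
    return True

print(has_running_intersection(T))  # True for this reconstruction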

  17. Independencies in low-treewidth distributions. P(X) factorizes according to a junction tree ⇒ conditional independencies hold (zero conditional mutual information across each separator), and it works in the other way too! [Figure: the example junction tree; for the separator {X1,X5}, one side contains {X2, X3, X7} and the other {X4, X6}.]
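
Written out for that separator (the grouping of variables is read off the slide figure; the equivalence itself is the standard fact that conditional mutual information vanishes exactly under conditional independence):

I(X_{\{2,3,7\}};\, X_{\{4,6\}} \mid X_1, X_5) \;=\; 0
\quad\Longleftrightarrow\quad
\{X_2, X_3, X_7\} \;\perp\; \{X_4, X_6\} \mid \{X_1, X_5\}.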

  18. Constraint-based structure learning. Look for junction trees in which I(Xα, Xβ | S) < ε holds across the separators (constraint-based structure learning): for every candidate separator S, partition the remaining variables into weakly dependent subsets, then find a consistent junction tree. [Figure: candidate separators S1, S3, S7, S8, … and cliques C1–C5 over all variables X.]

  19. Mutual information complexity. I(X_A, X_{-A} | S) = H(X_A | S) − H(X_A | X_{-A}, S), where X_{-A} is everything except X_A and H(· | ·) is conditional entropy. I(X_A, X_{-A} | S) depends on all assignments to X: exp(|X|) complexity in general. Our contribution: a polynomial-time upper bound.

  20. Mutual info upper bound: intuition. Computing I(A, B | C) directly is hard; computing I(D, F | C) for small subsets D ⊆ A, F ⊆ B with |DF| ≤ k is easy. • Only look at small subsets D, F • Polynomial number of small subsets • Polynomial complexity for every pair. Any conclusions about I(A, B | C)? In general, no; if a good junction tree exists, yes.

  21. Contribution: mutual info upper bound. Theorem: suppose an ε-JT of treewidth k for P(ABC) exists, and let δ = max I(D, F | C) over subsets D ⊆ A, F ⊆ B with |DF| ≤ k+1. Then I(A, B | C) ≤ |ABC| (ε + δ).
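
A minimal sketch of what the bound buys computationally, assuming discrete variables and a joint distribution given as a dictionary from assignment tuples to probabilities (the function names, helper structure, and toy uniform distribution are all illustrative): the helper evaluates I(D, F | C) over the small set D ∪ F ∪ C only, and the bound enumerates the polynomially many small subsets instead of touching all of A and B at once.

import itertools
from collections import defaultdict
from math import log

def marginal(joint, var_order, keep):
    """Marginalize a discrete joint {assignment tuple: probability} onto `keep`."""
    idx = [var_order.index(v) for v in keep]
    out = defaultdict(float)
    for assignment, p in joint.items():
        out[tuple(assignment[i] for i in idx)] += p
    return out

def cond_mutual_info(joint, var_order, D, F, C):
    """I(D; F | C): exponential only in |D| + |F| + |C|, not in the total number of variables."""
    pDFC = marginal(joint, var_order, D + F + C)
    pDC = marginal(joint, var_order, D + C)
    pFC = marginal(joint, var_order, F + C)
    pC = marginal(joint, var_order, C)
    mi = 0.0
    for a, p in pDFC.items():
        if p <= 0.0:
            continue
        d, f, c = a[:len(D)], a[len(D):len(D) + len(F)], a[len(D) + len(F):]
        mi += p * log(p * pC[c] / (pDC[d + c] * pFC[f + c]))
    return mi

def mi_upper_bound(joint, var_order, A, B, C, treewidth, eps):
    """Bound of the theorem: |ABC| * (eps + max over small D, F of I(D; F | C))."""
    delta, max_size = 0.0, treewidth + 1
    for size_d in range(1, max_size):
        for size_f in range(1, max_size - size_d + 1):
            for D in itertools.combinations(A, size_d):
                for F in itertools.combinations(B, size_f):
                    delta = max(delta, cond_mutual_info(joint, var_order, list(D), list(F), C))
    return len(A + B + C) * (eps + delta)

# Toy usage: a uniform joint over three binary variables (made-up example).
var_order = ["A1", "B1", "C1"]
joint = {vals: 1.0 / 8 for vals in itertools.product([0, 1], repeat=3)}
print(mi_upper_bound(joint, var_order, ["A1"], ["B1"], ["C1"], treewidth=1, eps=0.0))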

  22. Mutual info upper bound: complexity • Direct computation: exp(|ABC|) complexity • Our upper bound: O(|AB|^(treewidth+1)) small subsets, exp(|C| + treewidth) time each • |C| = treewidth for structure learning ⇒ polynomial(|ABC|) complexity.

  23. Guarantees on learned model quality. Theorem: suppose a strongly connected ε-JT of treewidth k for P(X) exists. Then, with probability at least (1 − δ) for a chosen failure probability δ, our algorithm finds a JT satisfying a quality guarantee, using polynomially many samples and polynomial time. Corollary: strongly connected junction trees are PAC-learnable.

  24. Related work

  25. Results – typical convergence time. [Plot: test log-likelihood over training time; higher is better.] Good results early on in practice.

  26. Results – log-likelihood. Baselines: OBS – local search in limited in-degree Bayes nets; Chow-Liu – most likely JTs of treewidth 1; Karger-Srebro – constant-factor approximation JTs. [Plot: test log-likelihood comparison, higher is better, with our method highlighted.]

  27. Conclusions • A tractable upper bound on conditional mutual info • Graceful quality degradation and PAC learnability guarantees • Analysis of when dynamic programming works [in the thesis] • Dealing with an unknown mutual information threshold [in the thesis] • Speedups preserving the guarantees • Further speedups without guarantees

  28. Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

  29. Discriminative learning • Useful when the variables E are always the same • Non-adaptive, one-shot observation • Image pixels → scene description • Document text → topic, named entities • Better accuracy than generative models. Learning goal: P(Q | E); query goal: P(Q | E=E).

  30. Discriminative log-linear models: P(Q | E) = (1/Z(E)) exp( Σi wi fi(Q, E) ), with weights wi learned from data, features fi from domain knowledge, and an evidence-dependent normalization Z(E). • Don't sum over all values of E • Don't model P(E) • No need for structure over E. [Figure: query variables connected by features such as f12, f34, each also depending on the evidence.]
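
To make the evidence-dependent normalization concrete, here is a minimal sketch for a toy conditional log-linear model over two binary query variables; the features, weights, and evidence encoding are all made up for illustration.

import itertools
from math import exp

def f1(q, e):
    """Single-node feature on Q1, driven by the (toy) evidence value."""
    return float(q["Q1"] == 1) * e[0]

def f2(q, e):
    """Pairwise agreement feature tying Q1 and Q2 together."""
    return float(q["Q1"] == q["Q2"])

features = [f1, f2]
w = [1.5, 0.8]   # weights would normally be learned from data

def score(q, e):
    return exp(sum(wi * fi(q, e) for wi, fi in zip(w, features)))

def conditional(q, e):
    """P(Q=q | E=e): the normalization Z(e) sums over query assignments only,
    so the evidence values are simply plugged in and P(E) is never modeled."""
    Z = sum(score(dict(zip(("Q1", "Q2"), vals)), e)
            for vals in itertools.product([0, 1], repeat=2))
    return score(q, e) / Z

print(conditional({"Q1": 1, "Q2": 1}, e=[2.0]))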

  31. Model tractability is still important. Observation #1: tractable models are necessary for exact inference and parameter learning in the discriminative setting. Tractability is determined by the structure over the query.

  32. Simple local models: motivation. [Plot: query Q against evidence E for some Q = f(E); the relationship is locally almost linear.] Exploiting evidence values overcomes the expressive power deficit of simple models. We will learn local tractable models.

  33. Context-specific independence. [Figure: for some evidence values there is no edge between particular query variables.] Observation #2: use evidence values at test time to tune the structure of the models; do not commit to a single tractable model.

  34. Low-dimensional dependencies in generative structure learning. Generative structure learning often relies only on low-dimensional marginals (over cliques and separators): junction trees have decomposable scores, independence tests are low-dimensional, and small changes to the structure allow quick score recomputation. Discriminative structure learning instead needs inference in the full model for every datapoint, even for small changes in structure.
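
For reference, the decomposable score alluded to here is the standard junction tree factorization of the log-likelihood (a textbook identity, not specific to this thesis): each term depends only on a low-dimensional clique or separator marginal,

\log P(X) \;=\; \sum_{C \in \mathrm{cliques}} \log P(X_C) \;-\; \sum_{S \in \mathrm{separators}} \log P(X_S),

so rescoring after a local structure change only touches the marginals of the cliques and separators that changed.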

  35. Leverage generative learning. Observation #3: generative structure learning algorithms have very useful properties; can we leverage them?

  36. Observations so far • The discriminative setting has extra information, including evidence values at test time • Want to use it to learn local tractable models • Good structure learning algorithms exist for the generative setting that only require low-dimensional marginals P(Q) • Approach: 1. use local conditionals P(Q | E=E) as “fake marginals” to learn local tractable structures; 2. learn exact discriminative feature weights

  37. Evidence-specific CRF overview • Approach: 1. use local conditionals P(Q | E=E) as “fake marginals” to learn local tractable structures; 2. learn exact discriminative feature weights. [Pipeline diagram: local conditional density estimators P(Q | E) + evidence value E=E → generative structure learning algorithm → tractable structure for E=E; together with feature weights w this yields the tractable evidence-specific CRF P(Q | E=E).]

  38. Evidence-specific CRF formalism. Observation: an identically zero feature (f ≡ 0) does not affect the model. Introduce extra “structural” parameters u and an evidence-specific structure indicator I(E, u) ∈ {0, 1}: the evidence-specific model equals the fixed dense model with evidence-specific feature values multiplied by an evidence-specific tree “mask” (a different mask for each evidence value E=E1, E=E2, E=E3, …).
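
A minimal sketch of the masking idea with made-up names: each candidate pairwise feature of a dense model is multiplied by a 0/1 indicator that switches it off whenever its edge is not in the tree chosen for the current evidence, which is the same as making that feature identically zero.

# Hypothetical dense set of candidate pairwise features over query variables.
candidate_edges = [("Q1", "Q2"), ("Q1", "Q3"), ("Q2", "Q3"), ("Q3", "Q4")]

def edge_indicator(edge, evidence_specific_tree):
    """I(E, u) for one candidate edge: 1 if the structure chosen for this
    evidence keeps the edge, 0 otherwise (making its feature identically zero)."""
    return 1.0 if edge in evidence_specific_tree else 0.0

def masked_feature(edge, q, evidence_specific_tree):
    """Effective feature: dense feature value times the evidence-specific mask."""
    dense_value = float(q[edge[0]] == q[edge[1]])   # simplified agreement feature
    return edge_indicator(edge, evidence_specific_tree) * dense_value

# Suppose the structure learner picked this tree for the current evidence value:
tree_for_this_evidence = {("Q1", "Q2"), ("Q2", "Q3"), ("Q3", "Q4")}
q = {"Q1": 1, "Q2": 1, "Q3": 0, "Q4": 0}
for edge in candidate_edges:
    print(edge, masked_feature(edge, q, tree_for_this_evidence))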

  39. Evidence-specific CRF learning. Learning proceeds in the same order as testing. [Pipeline diagram as on slide 37: conditional density estimators P(Q | E) and the evidence value E=E feed the generative structure learning algorithm, which outputs a tractable structure for E=E; combined with feature weights w this gives the tractable evidence-specific CRF P(Q | E=E).]

  40. Plug in generative structure learning. I(E, u) encodes the output of the chosen structure learning algorithm, so generative algorithms generalize directly. Generative: P(Qi, Qj) (pairwise marginals) + Chow-Liu algorithm = optimal tree. Discriminative: P(Qi, Qj | E=E) (pairwise conditionals) + Chow-Liu algorithm = good tree for E=E.
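
For reference, a minimal sketch of the Chow-Liu step itself, assuming networkx; the pairwise mutual-information inputs are a made-up dictionary, and in the discriminative variant they would be conditional mutual informations estimated for the observed E=E.

import networkx as nx

def chow_liu_tree(pairwise_mutual_info):
    """Maximum-weight spanning tree over query variables, weighted by the
    (conditional) mutual information between each pair."""
    G = nx.Graph()
    for (qi, qj), mi in pairwise_mutual_info.items():
        G.add_edge(qi, qj, weight=mi)
    return nx.maximum_spanning_tree(G)

# Toy mutual-information estimates (illustrative numbers only).
mi = {("Q1", "Q2"): 0.40, ("Q1", "Q3"): 0.10,
      ("Q2", "Q3"): 0.35, ("Q2", "Q4"): 0.05, ("Q3", "Q4"): 0.30}
tree = chow_liu_tree(mi)
print(sorted(tree.edges()))   # e.g. [('Q1', 'Q2'), ('Q2', 'Q3'), ('Q3', 'Q4')]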

  41. Evidence-specific CRF learning: structure. Choose a generative structure learning algorithm A, e.g. Chow-Liu. Identify the low-dimensional subsets Qβ that A may need; for Chow-Liu, all pairs (Qi, Qj). [Figure: the original problem over (Q, E) splits into low-dimensional pairwise problems such as (Q1,Q2 | E), (Q1,Q3 | E), (Q3,Q4 | E), …]

  42. Estimating low-dimensional conditionals. Use the same features as the baseline high-treewidth model: restricting the scope of the baseline CRF gives the low-dimensional model. End result: optimal u.

  43. Evidence-specific CRF learning: weights • Already chose the algorithm behind I(E, u) • Already learned the parameters u, giving “effective features” • Only need to learn the feature weights w: log P(Q | E, w, u) is concave in w ⇒ unique global optimum.
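
The concavity is the usual log-linear one; with u fixed, the gradient of the conditional log-likelihood has the standard observed-minus-expected-features form (a textbook CRF identity, written here with the masked features I(E, u) · f):

\frac{\partial}{\partial w_i} \log P(Q \mid E, w, u)
  \;=\; I_i(E, u)\, f_i(Q, E)
  \;-\; \mathbb{E}_{Q' \sim P(\cdot \mid E, w, u)}\!\left[ I_i(E, u)\, f_i(Q', E) \right],

and because the evidence-specific structure is a tree, the expectation can be computed exactly with sum-product.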

  44. Evidence-specific CRF learning: weights. For each training pair (E=Ei, Q=Qi) the distribution is tree-structured, so exact tree-structured gradients with respect to w are available; the overall (dense) gradient is the sum over datapoints of the fixed dense model's gradients masked by each evidence-specific tree.

  45. Results – WebKB (text + links → webpage topic). [Plots: prediction error and running time for four methods: ignore links, standard dense CRF, our work, max-margin model; lower is better.]

  46. Image segmentation - accuracy (local segment features + neighboring segments → type of object). [Plot: accuracy for ignore links, standard dense CRF, and our work; higher is better.]

  47. Image segmentation - time. [Plots: train time and test time (log scale) for ignore links, standard dense CRF, and our work; lower is better.]

  48. Conclusions • Using evidence values to tune low-treewidth model structure • Compensates for the reduced expressive power • Order of magnitude speedup at test time (sometimes train time too) • General framework for plugging in existing generative structure learners • Straightforward relational extension [in the thesis]

  49. Thesis contributions • Learn accurate and tractable models • In the generative setting P(Q,E) [NIPS 2007] • In the discriminative setting P(Q|E) [NIPS 2010] • Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]

  50. Why high-treewidth models? • A dense model expressing laws of nature • Protein folding • Max-margin parameters don’t work well (yet?) with evidence-specific structures
