Thesis Proposal. Algorithms for Answering Queries with Graphical Models. Anton Chechetka. Thesis committee: Carlos Guestrin, Eric Xing, Drew Bagnell, Pedro Domingos (UW). 21 May 2009.
Thesis Proposal
Algorithms for Answering Queries with Graphical Models
Anton Chechetka
Thesis committee: Carlos Guestrin, Eric Xing, Drew Bagnell, Pedro Domingos (UW)
21 May 2009
Motivation
• Activity recognition
• Sensor networks
• Patient monitoring & diagnosis
Image credits: http://www.dremed.com, [Pentney+al:2006]
Motivation
Common problem: compute P(Q | E = e)
• Query: true temperature in a room? Evidence: sensor 3 reads 25°C
• Query: has the person finished cooking? Evidence: the person is next to the kitchen sink (RFID)
• Query: is the patient well? Evidence: heart rate is 70 BPM
Common solution
Common problem: compute P(Q | E = e) (the query)
Common solution: probabilistic graphical models (PGMs) [Pentney+al:2006] [Deshpande+al:2004] [Beinlich+al:1988]
This thesis: new algorithms for learning and inference in PGMs to make answering queries better
Graphical models
Represent factorized distributions: P(X) = (1/Z) ∏_α f_α(X_α)
• X_α are small subsets of X ⇒ compact representation
• The factorization has a corresponding graph structure
[Figure: example graph over X1..X5]
Pipeline: learn/construct structure → learn/define parameters → inference P(Q|E=e)
Fundamental problems:
• P(Q|E=e) given a PGM? #P-complete / NP-complete
• Best parameters f given the structure? exp(|X|) complexity
• Optimal structure (i.e. sets X_α)? NP-complete
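To make the query P(Q|E=e) on a factorized distribution concrete, here is a minimal brute-force sketch. The three binary variables, the two factors, and the function names are all hypothetical and chosen only for illustration; they do not come from the slides. The enumeration over every assignment is exactly the exponential cost that motivates the rest of the talk.

```python
from itertools import product

# Hypothetical model: a chain X1 - X2 - X3 of binary variables.
# Each factor is (scope, function returning a nonnegative value).
variables = ["X1", "X2", "X3"]
factors = [
    (("X1", "X2"), lambda a, b: 2.0 if a == b else 1.0),
    (("X2", "X3"), lambda b, c: 3.0 if b == c else 1.0),
]

def unnormalized(assignment):
    """Product of all factor values for one full assignment."""
    p = 1.0
    for scope, f in factors:
        p *= f(*(assignment[v] for v in scope))
    return p

def conditional(query, evidence):
    """P(query | evidence) by summing over all assignments (exponential in |X|)."""
    scores = {0: 0.0, 1: 0.0}
    free = [v for v in variables if v not in evidence]
    for values in product([0, 1], repeat=len(free)):
        assignment = dict(evidence)
        assignment.update(zip(free, values))
        scores[assignment[query]] += unnormalized(assignment)
    z = sum(scores.values())
    return {q: s / z for q, s in scores.items()}

print(conditional("X1", {"X3": 1}))
```

The normalization constant Z never has to be computed separately here, because it cancels when the scores are renormalized over the query values.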
This thesis
Pipeline: learn/construct structure → learn/define parameters → inference P(Q|E=e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Learning tractable models
Pipeline: learn/construct structure → learn/define parameters → inference P(Q|E=e)
• Every step in the pipeline is computationally hard for general PGMs
• Errors compound from step to step
• But there are exact inference and parameter learning algorithms with exp(graph treewidth) complexity
• So if we learn low-treewidth models, all the rest is easy!
Treewidth
• Learn low-treewidth models ⇒ all the rest is easy!
• Treewidth = size of the largest clique in a triangulated graph, minus one
• Computing treewidth is NP-complete in general
• But it is easy to construct graphs with a given treewidth
• Convenient representation: junction tree
[Figure: a graph over X1..X7 and its junction tree with cliques C1 = {X1,X4,X5}, C2 = {X1,X2,X5}, C3 = {X1,X2,X7}, C4 = {X4,X5,X6}, C5 = {X1,X3,X5} and separators {X4,X5}, {X1,X5}, {X1,X5}, {X1,X2}]
Junction trees
• Learn junction trees ⇒ all the rest is easy!
• Other classes of tractable models exist, e.g. [Lowd+Domingos:2008]
• Running intersection property: the cliques containing any given variable form a connected subtree
• Finding the most likely junction tree of fixed treewidth > 1 is NP-complete
• We will look for good approximations
[Figure: junction tree with cliques {X1,X4,X5}, {X4,X5,X6}, {X1,X2,X5}, {X1,X3,X5}, {X1,X2,X7} (C1..C5) and separators {X4,X5}, {X1,X5}, {X1,X2}]
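The running intersection property can be checked mechanically: for every variable, the cliques that contain it must form a connected subtree. The sketch below does that for the example junction tree. The clique contents follow the figure, but the edge list is a reading of the garbled figure and should be treated as an assumption, as should the function name.

```python
from collections import defaultdict

# Cliques as in the example figure; the edge list is an assumed reading of it.
cliques = {
    "C1": {"X1", "X4", "X5"},
    "C2": {"X1", "X2", "X5"},
    "C3": {"X1", "X2", "X7"},
    "C4": {"X4", "X5", "X6"},
    "C5": {"X1", "X3", "X5"},
}
edges = [("C1", "C4"), ("C1", "C2"), ("C2", "C5"), ("C2", "C3")]

def has_running_intersection(cliques, edges):
    """True if, for every variable, the cliques containing it are connected."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    for var in set().union(*cliques.values()):
        holders = {c for c, members in cliques.items() if var in members}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:  # depth-first search restricted to cliques holding var
            node = stack.pop()
            for nb in adj[node]:
                if nb in holders and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if seen != holders:
            return False
    return True

treewidth = max(len(c) for c in cliques.values()) - 1
print(has_running_intersection(cliques, edges), treewidth)
```

Moving C5 so that it hangs off C4 instead of C2 breaks the property, because X1 would then appear in two disconnected parts of the tree.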
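Independencies in low-treewidth distributions
• P(X) factorizes according to a JT ⇒ conditional independencies hold, e.g. for separator {X1,X5}: (X4,X6) ⊥ (X2,X3,X7) | (X1,X5)
• Equivalently, the conditional mutual information is zero: I(X4X6, X2X3X7 | X1X5) = 0
• This works the other way too!
[Figure: junction tree with cliques {X4,X5,X6}, {X1,X4,X5}, {X1,X2,X5}, {X1,X3,X5}, {X1,X2,X7}, split at separator {X1,X5}]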
Constraint-based structure learning
We will look for JTs where this holds.
1. Take all candidate separators S1, S2, S3, S4, …
2. For each separator S, partition the remaining variables into weakly dependent subsets: find V s.t. I(V, X\VS | S) < ??
3. Construct a junction tree from the partitions (e.g. using dynamic programming)
[Figure: candidate separators S1..S4, each with its partition of the remaining variables X\S]
Mutual information estimation
I(V, X\VS | S) < ??
• Definition: I(A,B|S) = H(A|S) – H(A|BS)
• Naïve estimation of I(V, X\VS | S) costs exp(|X|), too expensive: it sums over all 2^|X| assignments to X
• Our work: an upper bound on I(V, X\VS | S), using the values of I(Y,Z|S) for |Y∪Z| ≤ treewidth+1
• There are O(|X|^(treewidth+1)) subsets Y and Z ⇒ complexity polynomial in |X|
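The definition I(A,B|S) = H(A|S) – H(A|BS) can be estimated from data by plugging in empirical entropies. The sketch below is a plain plug-in estimator over small variable sets, written as an illustration; it is not the estimator or the upper bound from the thesis, and the function names and toy samples are assumptions.

```python
import math
from collections import Counter

def entropy(samples, cols):
    """Empirical joint entropy (in nats) of the columns listed in cols."""
    counts = Counter(tuple(row[c] for c in cols) for row in samples)
    n = len(samples)
    return -sum((k / n) * math.log(k / n) for k in counts.values())

def cond_mutual_info(samples, a, b, s):
    """I(A,B|S) = H(A|S) - H(A|B,S), expanded into joint entropies:
    H(A,S) - H(S) - H(A,B,S) + H(B,S)."""
    return (entropy(samples, a + s) - entropy(samples, s)
            - entropy(samples, a + b + s) + entropy(samples, b + s))

# Toy samples with columns (A, B, S): A always equals B, S is independent.
samples = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
print(cond_mutual_info(samples, [0], [1], [2]))  # equals log 2 for this data
```

The cost of each call grows with the number of distinct joint values of the columns involved, which is why restricting the sets to |Y∪Z| ≤ treewidth+1 keeps the overall procedure polynomial in |X|.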
Mutual information estimation
Hard: I(V, X\VS | S) = ?? Easy: I(A,B|S) for A ⊆ V, B ⊆ X\VS with |A∪B| ≤ treewidth+1
Theorem: suppose that P(X), S, V are s.t.
• an ε-JT of treewidth k for P(X) exists
• for every A ⊆ V, B ⊆ X\VS s.t. |A∪B| ≤ k+1: I(A,B|S) ≤ δ
Then I(V, X\VS | S) ≤ |X|(ε + δ)
• Complexity O(|X|^(k+1)) ⇒ exponential speedup
• No need to know the ε-JT, only that it exists
• The bound is loose only when there is no hope to learn a good JT
Guarantees on learned model quality
Theorem: suppose that P(X) is s.t. a strongly connected ε-JT of treewidth k for P(X) exists. Then our algorithm will, with probability at least (1 − γ), find a JT (C,E) that satisfies a quality guarantee, using polynomially many samples and polynomial time.
Corollary: strongly connected junction trees are PAC-learnable
Results – typical convergence time
[Plot: test log-likelihood]
Good results early on in practice
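Results – log-likelihood
Baselines:
• OBS: local search in limited in-degree Bayes nets
• Chow-Liu: most likely JTs of treewidth 1
• Karger-Srebro: constant-factor approximation JTs
[Plot: log-likelihood of our method and the baselines, with the direction of better values marked]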
Approximate inference is still useful
• Often learning a tractable graphical model is not an option
  – Need domain knowledge
  – Templatized models: Markov logic nets, probabilistic relational models, dynamic Bayesian nets
• This part: the (intractable) PGM is a given
• What can we do with the inference?
• What if we know the query variables Q and evidence E=e?
Query-specific simplification
This part: the (intractable) PGM is a given
• Observation: often many variables are unknown, but also not important to the user
• Suppose we know the variables Q of interest (the query)
• Observation: usually, variables far away from the query do not affect P(Q) much
[Figure: model graph with the query marked; the distant regions have little effect on P(Q)]
Query-specific simplification
• Observation: variables far away from the query do not affect P(Q) much
• Idea: discard parts of the model that have little effect on the query
• Observation: values of potentials are important
[Figure: model graph with the query marked; one region has little effect on P(Q), another is the part we want first]
Our work:
• edge importance from values of potentials
• efficient algorithms for model simplification
• focused inference as soft model simplification
Belief propagation [Pearl:1988]
• For every edge Xi–Xj and variable, a message m_ij
• Belief about the marginal over Xi: computed from the messages incoming to Xi
• Algorithm: until convergence, update the messages
• A fixed point of BP(·) is the solution
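The slide's own message and belief formulas did not survive extraction, so the sketch below uses the standard sum-product updates for a pairwise model: the message from i to j sums, over the values of Xi, the product of the unary potential, the pairwise potential, and all messages into i except the one from j. The data layout, the function name, and the toy chain are assumptions made for illustration.

```python
import math

def loopy_bp(unary, pairwise, iters=50):
    """Sum-product BP. unary: {i: [psi_i(x)]}; pairwise: {(i, j): psi[x_i][x_j]}."""
    neighbors = {i: set() for i in unary}
    for i, j in pairwise:
        neighbors[i].add(j)
        neighbors[j].add(i)

    def psi(i, j, xi, xj):
        # Potentials are stored once per undirected edge.
        return pairwise[(i, j)][xi][xj] if (i, j) in pairwise else pairwise[(j, i)][xj][xi]

    msgs = {(i, j): [1.0] * len(unary[j]) for i in unary for j in neighbors[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:  # synchronous update of every message
            out = []
            for xj in range(len(unary[j])):
                total = 0.0
                for xi in range(len(unary[i])):
                    incoming = math.prod(msgs[(k, i)][xi] for k in neighbors[i] - {j})
                    total += unary[i][xi] * psi(i, j, xi, xj) * incoming
                out.append(total)
            z = sum(out)
            new[(i, j)] = [v / z for v in out]
        msgs = new

    beliefs = {}
    for i in unary:
        b = [unary[i][x] * math.prod(msgs[(k, i)][x] for k in neighbors[i])
             for x in range(len(unary[i]))]
        z = sum(b)
        beliefs[i] = [v / z for v in b]
    return beliefs

# Toy chain 1 - 2 - 3; the evidence X3 = 1 is encoded in the unary potential.
unary = {1: [1.0, 1.0], 2: [1.0, 1.0], 3: [0.0, 1.0]}
pairwise = {(1, 2): [[2.0, 1.0], [1.0, 2.0]], (2, 3): [[3.0, 1.0], [1.0, 3.0]]}
print(loopy_bp(unary, pairwise)[1])
```

On a tree such as this chain the fixed point gives exact marginals; on graphs with cycles the same updates are only an approximation and may not converge.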
Model simplification problem
Which messages to skip updating s.t.
• inference cost gets small enough
• the BP fixed point for P(Q) does not change much
[Figure: model graph with the query marked]
Edge costs
• Inference cost IC(ij): complexity of one BP update for m_ij
• Approximation value AV(ij): measure of influence of m_ij on the belief P(Q)
Model simplification problem: find the set E' ⊆ E of edges s.t.
• Σ_{(ij)∈E'} AV(ij) → max (maximize fit quality)
• Σ_{(ij)∈E'} IC(ij) ≤ inference budget (keep inference affordable)
Lemma: the model simplification problem is NP-hard
Greedy edge selection achieves a guaranteed approximation factor
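The budgeted selection problem above has the shape of a knapsack problem, and one common greedy rule for it is to take edges in decreasing order of value per unit cost while the budget allows. The slides do not say which greedy rule is used or what factor it achieves, so the sketch below is only one plausible instance; the function name, the edge labels, and the numbers are hypothetical.

```python
def greedy_select(edges, budget):
    """edges: {edge: (approximation_value, inference_cost)} with positive costs.
    Picks edges by decreasing value/cost ratio while the total cost fits the budget."""
    chosen, spent = [], 0.0
    ranked = sorted(edges.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    for edge, (value, cost) in ranked:
        if spent + cost <= budget:
            chosen.append(edge)
            spent += cost
    return chosen, spent

# Hypothetical AV and IC values for three edges.
edges = {"a": (10.0, 5.0), "b": (6.0, 2.0), "c": (3.0, 3.0)}
print(greedy_select(edges, budget=7.0))
```

Plain ratio-greedy can be arbitrarily bad on some knapsack instances unless it is combined with a check against the single most valuable item, which is one reason the exact rule behind the stated guarantee matters.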
Approximation values
• Approximation value AV(ij): measure of influence of m_ij on the belief P(Q)
• How important is an edge (ij)?
• Take a simple path π from an edge (vn) to an edge (rq) at the query, and fix all messages not in π; then m_rq = BP*(m_vn)
• Define: path strength(π) = the largest derivative of m_rq with respect to m_vn along the path
• Define: AV(ij) = max_{π ∋ (ij)} path strength(π)
• The max-sensitivity approximation value AV(ij) is the single strongest dependency (in derivative) that (ij) participates in
Efficient model simplification
The max-sensitivity approximation value AV(ij) is the single strongest dependency (in derivative) that (ij) participates in.
Lemma: with max-sensitivity edge values, we can find the optimal submodel
• as the first M edges expanded by best-first search
• with constant-time computation per expanded edge (using [Mooij+Kappen:2007])
Simplification complexity is independent of the size of the full model (it only depends on the solution size)
Templated models: only instantiate model parts that are in the solution
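The best-first search in the lemma can be sketched as a Dijkstra-style expansion outward from the query. The sketch assumes that a path's strength is the product of per-step strengths in [0, 1], and that those per-step values are already available (for example from derivative bounds in the spirit of [Mooij+Kappen:2007]); both points are assumptions for illustration, as are the function name and the toy dependency graph.

```python
import heapq

def best_first_messages(deps, query, max_edges):
    """deps: {node: [(predecessor, step_strength in [0, 1]), ...]}, where nodes stand
    for messages and `query` is a root standing for the query belief.
    Returns up to max_edges (message, importance) pairs in expansion order."""
    heap = [(-1.0, query)]  # max-heap via negated strengths
    importance, order = {}, []
    while heap and len(order) < max_edges:
        neg, node = heapq.heappop(heap)
        if node in importance:
            continue  # already expanded through a stronger path
        importance[node] = -neg
        if node != query:
            order.append((node, -neg))
        for pred, step in deps.get(node, []):
            if pred not in importance:
                heapq.heappush(heap, (neg * step, pred))
    return order

# Hypothetical dependency graph: "q" depends on "a" and "b", and so on.
deps = {
    "q": [("a", 0.5), ("b", 0.9)],
    "b": [("a", 0.8), ("c", 0.1)],
    "a": [("c", 0.5)],
}
print(best_first_messages(deps, "q", max_edges=2))
```

Because every step strength is at most 1, strengths can only shrink along a path, so the first time a message is popped it is reached through its strongest path. The search also touches only the part of the graph near the expanded messages, which matches the claim that the cost depends on the solution size rather than on the full model.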
Future work: multi-path dependencies
Want to take several paths through (ij) to the query edge (rq) into account, not only the strongest one.
• All paths: possible, but expensive: O(|E|^3)
• k strongest paths?
• AV(ij) = max_{π1,…,πk ∋ (ij)} Σ_m path strength(π_m)
• Best-first search with at most k visits of an edge?
Perturbation approximation values
• Take a simple path π from (vn) to (rq) at the query and fix all messages not in π: m_rq = BP*(m_vn)
• Path strength(π) is the largest derivative value along the path w.r.t. the endpoint message
• Observation: it does not take the possible range of the endpoint message into account
• Upper bound on the change in m_rq: from the mean value theorem, with a tighter bound from BP message properties
• Define: path strength*(π) = the resulting upper bound on the change in m_rq
Efficient model simplification
Define: max-perturbation AV(ij) = max_{π ∋ (ij)} path strength*(π)
Lemma: with max-perturbation edge values, assuming that the message derivatives along paths are known, we can find the optimal submodel
• as the first M edges expanded by best-first search
• with constant-time computation per expanded edge
Extra work: need to know the derivatives along paths
Solution: max-sensitivity best-first search as a subroutine
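Future work: efficient max-perturbation simplification
• Extra work: need to know derivatives along paths. Not always!
• The exact derivative is only needed if the derivative bound falls in a certain range relative to the current lower bound on path strength* from BFS
[Figure: range of derivative values for which the exact derivative is needed, relative to the current lower bound on path strength* from BFS]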
Future work: computation trees
• Prune computation trees according to edge importance
• Computation tree traversal ~ message update schedule
[Figure: a graph over nodes 1..4 and its unrolled computation tree]
Focused inference
• BP proceeds until all beliefs converge
• But we only care about query beliefs
• Residual importance weighting for convergence testing
• For residual BP: more attention to more important regions
• Weigh residuals by edge importance
[Figure: model graph; convergence is more important near the query and less important far from it]
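One way to read "weigh residuals by edge importance" is to scale each message's residual by the importance of its edge, use the scaled value to pick the next update, and declare convergence once all scaled residuals are small. The sketch below shows that scheduling rule only; the residual definition (largest absolute change), the function names, and the tolerance are assumptions rather than details taken from the slides.

```python
def weighted_residuals(old_msgs, new_msgs, importance):
    """Residual of each message (largest absolute change), scaled by edge importance."""
    scores = {}
    for edge, new in new_msgs.items():
        residual = max(abs(a - b) for a, b in zip(new, old_msgs[edge]))
        scores[edge] = importance.get(edge, 0.0) * residual
    return scores

def pick_next_update(old_msgs, new_msgs, importance, tol=1e-6):
    """Edge with the largest weighted residual, or None once all are below tol."""
    scores = weighted_residuals(old_msgs, new_msgs, importance)
    edge = max(scores, key=scores.get)
    return edge if scores[edge] > tol else None

# Hypothetical messages: the far edge changed more, but matters less to the query.
old = {("a", "q"): [0.5, 0.5], ("z", "y"): [0.5, 0.5]}
new = {("a", "q"): [0.6, 0.4], ("z", "y"): [0.9, 0.1]}
importance = {("a", "q"): 1.0, ("z", "y"): 0.1}
print(pick_next_update(old, new, importance))
```

With unweighted residuals the far edge would be updated first; with the weighting, the edge next to the query wins even though its raw change is smaller.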
Related work
• Minimal submodel to have exactly the same P(Q|E=e) regardless of the values of potentials: knowledge-based model construction [Wellman+al:1992, Richardson+Domingos:2006]
• Graph distance as edge importance measure [Pentney+al:2006]
• Empirical mutual information as variable importance measure [Pentney+al:2007]
• Inference in the simplified model to quantify the effect of an extra edge exactly [Kjaerulff:1993, Choi+Darwiche:2008]
Local models: motivation
• Common approach: learn/construct structure → approximate parameters → approximate inference P(Q|E=e)
• This talk, part 1: learn tractable structure → optimal parameters → exact inference P(Q|E=e)
• What if no single tractable structure fits well?
Local models: motivation
What if no single tractable structure fits well?
Regression analogy: no single line fits q = f(e) well, but locally the dependence is almost linear
[Plot: query q against evidence e, with the curve q = f(e)]
Solution: learn local tractable models
• Before: learn tractable structure → optimal parameters → exact inference P(Q|E=e)
• Now: get evidence assignment E=e → learn tractable structure for E=e → parameters for E=e → exact inference P(Q|E=e)
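Local models: example
Get evidence assignment E=e → learn tractable structure for E=e → parameters for E=e → exact inference P(Q|E=e)
Example: local conditional random fields (CRFs)
• Global CRF: weight × feature, with one structure for all evidence values E=e1, E=e2, …, E=en
• Local CRF: weight × feature × query-specific structure indicator I(E) ∈ {0, 1}, with a separate structure for each of E=e1, E=e2, …, E=en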
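A local CRF in the sense above can be sketched as a log-linear model whose features are switched on or off by an evidence-dependent indicator. The exponential form P(Q|E=e) ∝ exp(Σ_k w_k · I_k(e) · f_k(q, e)) is the usual CRF parametrization and is assumed here, since the slide's formula did not survive extraction; the feature functions, weights, and structure rule are hypothetical, and the brute-force normalization is only for illustration.

```python
import math
from itertools import product

def local_crf(weights, features, structure, evidence, n_query):
    """P(Q | E=e) proportional to exp(sum_k w_k * I_k(e) * f_k(q, e)),
    normalized by enumerating all binary assignments q."""
    active = structure(evidence)  # one 0/1 indicator per feature
    scores = {}
    for q in product([0, 1], repeat=n_query):
        total = sum(w * i * float(f(q, evidence))
                    for w, i, f in zip(weights, active, features))
        scores[q] = math.exp(total)
    z = sum(scores.values())
    return {q: s / z for q, s in scores.items()}

# Hypothetical features over two query variables and one evidence value.
features = [
    lambda q, e: q[0] == q[1],  # agreement between the query variables
    lambda q, e: q[0] == e,     # agreement between Q1 and the evidence
]
weights = [1.0, 2.0]
# Hypothetical query-specific structure: the second feature is used only when e == 1.
structure = lambda e: [1, 1 if e == 1 else 0]

print(local_crf(weights, features, structure, evidence=0, n_query=2))
```

Setting every indicator to 1 recovers the global CRF, so the local model differs only in which features the evidence lets through.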
Learning local models
Local CRF with query-specific (QS) structure I(E) ∈ {0, 1}
Need to learn the weights w and the QS structure I(E). Iterate:
• Known structures for every training point (E=e1, Q=q1), …, (E=en, Q=qn) ⇒ optimal weights w (convex optimization)
• Known weights w ⇒ good local structures for E=e1, …, E=en (e.g. by local search)
Problem: finding good local structures needs the query values, so it cannot be used at test time
Learning local models
Parametrize I(E) by V: I = I(E,V). Learn w and the QS structure parameters V. Iterate:
• Known structures for every training point (E=e1, Q=q1), …, (E=en, Q=qn) ⇒ optimal weights w (convex optimization)
• Known weights w ⇒ good local structures for E=e1, …, E=en (e.g. by local search)
• Optimize V so that I(E,V) mimics the good local structures well on the training data (E=e1, V), …, (E=en, V)
Future work: better exploration
Need to avoid shallow local minima:
• multiple structures per datapoint
• stochastic optimization: sample structures (will these be different?)
Context: the iteration between known structures for every training point ⇒ optimal weights w (convex optimization) and known weights w ⇒ good local structures (e.g. by local search)
Future work: multi-query optimization
A separate structure for every query may be too costly
Query clustering:
• directly using evidence
• using inferred model parameters (given w and V)
Context: the iteration between optimal weights w (convex optimization), good local structures (e.g. by local search), and optimizing V so that I(E,V) mimics the good local structures well on the training data
Future work: faster local search
Need efficient structure learning:
• amortize inference cost for scoring multiple search steps
• need support for nuisance variables in structure scores
[Figure: training points with query variables and nuisance variables marked]
Context: the iteration between optimal weights w (convex optimization), good local structures (e.g. by local search), and optimizing V so that I(E,V) mimics the good local structures well on the training data
Recap
Pipeline: learn/construct structure → learn/define parameters → inference P(Q|E=e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning local tractable models by exploiting evidence assignments
Timeline
• Validation of QS model simplification
  – Activity recognition data, MLN data
• QS simplification
  – Multi-path extensions for edge importance measures
  – Computation trees connections
  – Max-perturbation computation speedups
• QS learning
  – Better exploration (stochastic optimization / multiple structures per datapoint)
  – Multi-query optimization
  – Validation
• QS learning
  – Nuisance variables support
  – Local search speedups
  – Quality guarantees
  – Validation
• Write thesis, defend
Milestones: Summer 2009, Fall 2009, Spring 2010, Summer 2010
Thank you! Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf
Speeding things up
Constraint-based algorithm:
• set L = ∅
• for every potential separator S ⊂ X s.t. |S| = k (there are O(|X|^k) separators here): do I() estimation, change L
• find a junction tree (C,E) consistent with L
Observation: there are only |X|−k separators in (C,E), so the I() estimations for the remaining O(|X|^k) separators are wasted
Faster heuristic:
• until (C,E) passes checks:
  – do I() estimation, change L
  – find a junction tree (C,E) consistent with L
Speeding things up
Recall that our upper bound on I() uses all Y ⊆ X\S for |Y| ≤ k
Idea: get a rough estimate of I(V, X\VS | S) by only looking at smaller Y (e.g. |Y| = 2), i.e. at I(Y∩V, Y∩(X\VS) | S)
Faster heuristic:
• estimate I() with |Y| = 2, form L
• do
  – find a junction tree (C,E) consistent with L
  – estimate I(· | S) with |Y| = k for the separators S used in (C,E), update L
  – check if (C,E) is still an ε-JT with the updated I(· | S)
• until (C,E) passes checks