Thesis Proposal: Algorithms for Answering Queries with Graphical Models
Anton Chechetka
Thesis committee: Carlos Guestrin, Eric Xing, Drew Bagnell, Pedro Domingos (UW)
21 May 2009
Motivation
• Activity recognition
• Sensor networks
• Patient monitoring & diagnosis
Image credits: http://www.dremed.com, [Pentney+al:2006]
Motivation
Common problem: compute P(Q | E = e)
• True temperature in a room? Sensor 3 reads 25°C
• Has the person finished cooking? The person is next to the kitchen sink (RFID)
• Is the patient well? Heart rate is 70 BPM
Common solution
Common problem: compute P(Q | E = e) (the query)
Common solution: probabilistic graphical models [Pentney+al:2006] [Deshpande+al:2004] [Beinlich+al:1988]
This thesis: new algorithms for learning and inference in PGMs to make answering queries better
Graphical models
• Represent factorized distributions: P(X) = (1/Z) ∏α fα(Xα), where the Xα are small subsets of X
• Compact representation, with a corresponding graph structure (example graph over X1..X5 on the slide)
Pipeline: Learn/construct structure → Learn/define parameters → Inference P(Q|E=e)
Fundamental problems:
• P(Q|E=e) given a PGM? #P-complete / NP-complete
• Best parameters f given the structure? exp(|X|) complexity
• Optimal structure (i.e. the sets Xα)? NP-complete
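As a concrete illustration of the query-answering problem, here is a minimal Python sketch that computes P(Q | E = e) by brute-force enumeration in a tiny factorized model. The factor scopes and values are invented for illustration; the exp(|X|) loop is exactly the cost the algorithms in this proposal try to avoid.

```python
import itertools
import numpy as np

# Toy factorized model over binary X1..X5; the scopes below are illustrative,
# not the exact factorization from the slide's figure.
rng = np.random.default_rng(0)
scopes = [(0, 3, 4), (3, 4), (0, 1, 4), (0, 2, 4)]           # small subsets X_alpha
factors = [rng.random((2,) * len(s)) + 0.1 for s in scopes]   # positive factor tables

def unnormalized_p(x):
    """Product of factor values for a full assignment x (tuple of 0/1)."""
    p = 1.0
    for scope, f in zip(scopes, factors):
        p *= f[tuple(x[i] for i in scope)]
    return p

def query(q_var, e_var, e_val):
    """Brute-force P(Q | E=e): sums over all assignments, exp(|X|) cost."""
    marg = np.zeros(2)
    for x in itertools.product([0, 1], repeat=5):
        if x[e_var] != e_val:
            continue
        marg[x[q_var]] += unnormalized_p(x)
    return marg / marg.sum()

print(query(q_var=0, e_var=2, e_val=1))   # P(X1 | X3 = 1)
```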
This thesis
Pipeline: Learn/construct structure → Learn/define parameters → Inference P(Q|E=e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Learning tractable models
• Every step in the pipeline (structure → parameters → inference P(Q|E=e)) is computationally hard for general PGMs
• Errors compound along the pipeline
• But there are exact inference and parameter learning algorithms with exp(graph treewidth) complexity
• So if we learn low-treewidth models, all the rest is easy!
Treewidth
• Learn low-treewidth models → all the rest is easy!
• Treewidth = size of the largest clique in an optimal triangulation of the graph, minus one
• Computing treewidth is NP-complete in general
• But it is easy to construct graphs with a given treewidth
• Convenient representation: junction tree
[Figure: example junction tree over X1..X7 with cliques C1 = {X1,X4,X5}, C2 = {X1,X2,X5}, C3 = {X1,X2,X7}, C4 = {X4,X5,X6}, C5 = {X1,X3,X5} and separators such as {X4,X5}, {X1,X5}, {X1,X2}]
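A quick sketch of getting a treewidth upper bound in practice, assuming networkx with its approximation module is available. The edge list is reconstructed from the slide's clique figure, and the min-degree elimination heuristic is a generic stand-in, not the method used in the proposal.

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

# Graph roughly matching the slide's 7-variable figure (edges reconstructed
# from the clique list, so they are illustrative).
G = nx.Graph()
G.add_edges_from([(1, 4), (1, 5), (4, 5), (4, 6), (5, 6),
                  (1, 2), (2, 5), (1, 3), (3, 5), (2, 7), (1, 7)])

# Min-degree elimination heuristic: returns an upper bound on treewidth
# and the corresponding tree decomposition (nodes are frozensets of variables).
width, decomposition = treewidth_min_degree(G)
print("treewidth upper bound:", width)
for bag in decomposition.nodes():
    print("bag:", sorted(bag))
```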
Junction trees
• Learn junction trees → all the rest is easy!
• Other classes of tractable models exist, e.g. [Lowd+Domingos:2008]
• A junction tree must satisfy the running intersection property
• Finding the most likely junction tree of fixed treewidth > 1 is NP-complete
• We will look for good approximations
[Figure: the example junction tree, with cliques C1 through C5 and separators as on the previous slide]
Independencies in low-treewidth distributions
• If P(X) factorizes according to a JT, conditional independencies hold across every separator: the conditional mutual information I(Xleft, Xright | S) = 0, where Xleft and Xright are the variables on the two sides of separator S
• It works in the other direction too: if these conditional independencies hold, then P(X) factorizes according to the JT
[Figure: in the example JT, e.g. I({X4, X6}, {X2, X3, X7} | {X1, X5}) = 0]
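A minimal sketch of estimating conditional mutual information from samples of discrete variables (a plug-in estimator, not the proposal's estimator); this is the quantity the independence tests on the following slides rely on.

```python
from collections import Counter
from math import log
import random

def cond_mutual_info(samples, A, B, S):
    """Empirical I(A;B|S) in nats from a list of assignment tuples.
    A, B, S are disjoint tuples of variable indices."""
    n = len(samples)
    c_abs, c_as, c_bs, c_s = Counter(), Counter(), Counter(), Counter()
    for x in samples:
        a = tuple(x[i] for i in A)
        b = tuple(x[i] for i in B)
        s = tuple(x[i] for i in S)
        c_abs[a, b, s] += 1
        c_as[a, s] += 1
        c_bs[b, s] += 1
        c_s[s] += 1
    mi = 0.0
    for (a, b, s), n_abs in c_abs.items():
        # p(a,b,s) * log( p(a,b,s) p(s) / (p(a,s) p(b,s)) ), counts cancel the 1/n factors
        mi += (n_abs / n) * log(n_abs * c_s[s] / (c_as[a, s] * c_bs[b, s]))
    return mi

# Quick check: X0 and X1 are independent given X2 in this generator,
# so I(X0;X1|X2) should be near zero while I(X0;X1) is clearly positive.
random.seed(0)
samples = []
for _ in range(20000):
    x2 = random.random() < 0.5
    x0 = random.random() < (0.9 if x2 else 0.1)
    x1 = random.random() < (0.8 if x2 else 0.2)
    samples.append((int(x0), int(x1), int(x2)))
print(cond_mutual_info(samples, (0,), (1,), (2,)))   # ~0
print(cond_mutual_info(samples, (0,), (1,), ()))     # clearly > 0
```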
Constraint-based structure learning
We will look for JTs where these (approximate) independencies hold:
• Take all candidate separators S
• Partition the remaining variables into weakly dependent subsets, testing I(V, X \ VS | S) < ε (a simplified partitioning sketch follows below)
• Construct a junction tree consistent with these partitions (e.g. using dynamic programming)
[Figure: candidate separators S1..S4 and the partitions of X \ Si they induce]
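A much-simplified stand-in for the partitioning step: given pairwise dependence scores for one candidate separator, group variables into weakly dependent components. The actual algorithm tests set-wise quantities I(V, X \ VS | S); the scores and threshold here are hypothetical.

```python
def weak_dependence_partition(variables, pairwise_cmi, eps):
    """Partition `variables` into groups such that variables in different
    groups have (estimated) conditional MI below `eps`.
    `pairwise_cmi` maps unordered pairs (i, j) to I(Xi; Xj | S) estimates."""
    # Union-find: merge variables whose dependence exceeds the threshold.
    parent = {v: v for v in variables}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for (i, j), score in pairwise_cmi.items():
        if score >= eps:
            parent[find(i)] = find(j)
    groups = {}
    for v in variables:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

# Hypothetical scores for variables 1..5 given some candidate separator S.
scores = {(1, 2): 0.30, (1, 3): 0.02, (2, 3): 0.01,
          (3, 4): 0.25, (4, 5): 0.01, (1, 4): 0.02,
          (1, 5): 0.01, (2, 4): 0.01, (2, 5): 0.01, (3, 5): 0.02}
print(weak_dependence_partition([1, 2, 3, 4, 5], scores, eps=0.1))
# -> [[1, 2], [3, 4], [5]] : weakly dependent components
```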
Mutual information estimation
I(V, X \ VS | S) < ε?
• Definition: I(A, B | S) = H(A | S) − H(A | B, S)
• Naïve estimation costs exp(|X|): it sums over all 2^|X| assignments to X, which is too expensive
• Our work: an upper bound on I(V, X \ VS | S) using only values I(Y, Z | S) for |Y ∪ Z| ≤ treewidth + 1
• There are O(|X|^(treewidth+1)) such subsets Y and Z, so the complexity is polynomial in |X|
Mutual information estimation
Hard: I(V, X \ VS | S) = ?   Easy: I(A, B | S) for small subsets, |A ∪ B| ≤ treewidth + 1
Theorem: suppose that P(X), S, V are such that
• an ε-JT of treewidth k for P(X) exists, and
• for every A ⊆ V, B ⊆ X \ VS with |A ∪ B| ≤ k + 1, I(A, B | S) ≤ δ.
Then I(V, X \ VS | S) ≤ |X| (ε + δ).
• Complexity O(|X|^(k+1)): an exponential speedup (see the sketch below)
• No need to know the ε-JT, only that it exists
• The bound is loose only when there is no hope to learn a good JT
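A sketch of scanning the small-subset terms the theorem needs. `cmi_fn` is a placeholder oracle (in practice one would plug in an empirical estimator such as the one sketched earlier), and the variable indices are illustrative.

```python
from itertools import combinations

def small_subset_bound(cmi_fn, V, W, S, k):
    """Scan I(A;B|S) over all A in V, B in W with |A|+|B| <= k+1 and return the
    largest value delta. By the theorem above, if an eps-JT of treewidth k
    exists, then I(V; W | S) <= |X| * (eps + delta)."""
    delta = 0.0
    for a_size in range(1, k + 1):
        for b_size in range(1, k + 2 - a_size):
            for A in combinations(V, a_size):
                for B in combinations(W, b_size):
                    delta = max(delta, cmi_fn(A, B, S))
    return delta

# Stand-in oracle for illustration only; in practice plug in an empirical
# estimator such as cond_mutual_info(samples, A, B, S) from the earlier sketch.
fake_cmi = lambda A, B, S: 0.01 * (len(A) + len(B))
print(small_subset_bound(fake_cmi, V=(1, 2), W=(4, 6, 7), S=(3, 5), k=2))
```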
Guarantees on learned model quality
Theorem: suppose that P(X) is such that a strongly connected ε-JT of treewidth k for P(X) exists. Then our algorithm will, with probability at least (1 − γ), find a JT (C, E) satisfying the stated quality guarantee, using a number of samples and an amount of time polynomial in |X|.
Corollary: strongly connected junction trees are PAC-learnable.
Results – typical convergence time
[Plot: test log-likelihood vs. runtime]
Good results early on in practice.
Results – log-likelihood
Baselines: OBS (local search in limited in-degree Bayes nets), Chow-Liu (most likely JTs of treewidth 1), Karger-Srebro (constant-factor approximation JTs)
[Plot: test log-likelihood for our method vs. the baselines; higher is better]
This thesis
Pipeline: Learn/construct structure → Learn/define parameters → Inference P(Q|E=e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Approximate inference is still useful
• Often learning a tractable graphical model is not an option:
  • need to encode domain knowledge
  • templatized models: Markov logic networks, probabilistic relational models, dynamic Bayesian networks
• This part: the (intractable) PGM is a given
• What can we do on the inference side?
• In particular, what if we know the query variables Q and the evidence E = e?
Query-specific simplification
This part: the (intractable) PGM is a given.
• Observation: often many variables are unknown, but also not important to the user
• Suppose we know the variables Q of interest (the query)
• Observation: usually, variables far away from the query do not affect P(Q) much; they have little effect on P(Q)
Query-specific simplification
• Observation: variables far away from the query have little effect on P(Q)
• Idea: discard parts of the model that have little effect on the query
• Observation: the values of the potentials also matter; we want the most influential part of the model first
Our work:
• edge importance measures derived from the values of potentials
• efficient algorithms for model simplification
• focused inference as soft model simplification
Belief propagation [Pearl:1988]
• For every edge Xi–Xj and every value xj, a message: mi→j(xj) ∝ Σxi ψi(xi) ψij(xi, xj) ∏k∈N(i)\j mk→i(xi)
• Belief about the marginal over Xi: P(Xi = xi) ∝ ψi(xi) ∏k∈N(i) mk→i(xi)
• Algorithm: update messages until convergence
• A fixed point of BP(·) is the solution
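For concreteness, a small self-contained loopy BP sketch for a pairwise MRF with binary variables: standard sum-product updates on a made-up 3-node model, not the proposal's code.

```python
import numpy as np

# Minimal loopy BP for a pairwise MRF with binary variables.
unary = {0: np.array([0.7, 0.3]), 1: np.array([0.4, 0.6]), 2: np.array([0.5, 0.5])}
pair = np.array([[1.2, 0.8], [0.8, 1.2]])          # same coupling on every edge
edges = [(0, 1), (1, 2), (0, 2)]

# msgs[i, j] is the message from i to j, initialized uniform (both directions)
msgs = {}
for i, j in edges:
    msgs[i, j] = np.ones(2) / 2
    msgs[j, i] = np.ones(2) / 2

def neighbors(i):
    return [j for a, b in edges for j in ((b,) if a == i else (a,) if b == i else ())]

for _ in range(50):                                  # "until convergence" (fixed sweeps here)
    for (i, j) in list(msgs):
        # m_{i->j}(x_j) ~ sum_{x_i} psi_i(x_i) psi_ij(x_i,x_j) prod_{k in N(i)\j} m_{k->i}(x_i)
        incoming = unary[i].copy()
        for k in neighbors(i):
            if k != j:
                incoming *= msgs[k, i]
        new = pair.T @ incoming
        msgs[i, j] = new / new.sum()

def belief(i):
    """Belief about the marginal of X_i from the current messages."""
    b = unary[i].copy()
    for k in neighbors(i):
        b *= msgs[k, i]
    return b / b.sum()

print(belief(0), belief(1), belief(2))
```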
Model simplification problem
Model simplification problem: decide which message updates to skip, such that
• the inference cost gets small enough, and
• the BP fixed point for P(Q) does not change much.
Edge costs
• Inference cost IC(ij): complexity of one BP update for mij
• Approximation value AV(ij): a measure of the influence of mij on the belief P(Q)
Model simplification problem: find the set E′ ⊆ E of edges such that
• Σ(ij)∈E′ AV(ij) → max (maximize fit quality)
• Σ(ij)∈E′ IC(ij) ≤ inference budget (keep inference affordable)
Lemma: the model simplification problem is NP-hard.
Greedy edge selection gives a constant-factor approximation (sketch below).
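A sketch of the greedy selection mentioned in the lemma, using a simple value-per-cost ordering under a budget; the edge names and AV/IC numbers are hypothetical.

```python
def greedy_simplify(edges, approx_value, infer_cost, budget):
    """Greedily pick edges (message updates to keep) by value-per-cost until
    the inference budget is exhausted: a sketch of the budgeted selection
    problem on the slide, not the proposal's exact algorithm."""
    chosen, spent = [], 0.0
    by_ratio = sorted(edges, key=lambda e: approx_value[e] / infer_cost[e], reverse=True)
    for e in by_ratio:
        if spent + infer_cost[e] <= budget:
            chosen.append(e)
            spent += infer_cost[e]
    return chosen

# Hypothetical per-edge scores: AV = influence on P(Q), IC = update cost.
av = {'a': 0.9, 'b': 0.5, 'c': 0.4, 'd': 0.1}
ic = {'a': 3.0, 'b': 1.0, 'c': 1.0, 'd': 0.5}
print(greedy_simplify(av.keys(), av, ic, budget=4.0))   # -> ['b', 'c', 'd'] under this budget
```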
Approximation values
• Approximation value AV(ij): a measure of the influence of mij on the belief P(Q)
• How important is a message mvn? Consider a simple path λ from the edge (v→n) to the query edge (r→q). Fix all messages not on λ; then mrq = BP*λ(mvn) becomes a function of mvn alone.
• Define path strength(λ) as the sensitivity (derivative magnitude) of mrq with respect to mvn along λ
• Max-sensitivity approximation value: AV(ij) is the single strongest dependency (in derivative) that (ij) participates in
• Define AV(ij) = maxλ∋(ij) path strength(λ)
Efficient model simplification
Max-sensitivity approximation value: AV(ij) is the single strongest dependency (in derivative) that (ij) participates in.
Lemma: with max-sensitivity edge values, one can find the optimal submodel
• as the first M edges expanded by best-first search,
• with constant-time computation per expanded edge (using [Mooij+Kappen:2007]).
Simplification complexity is independent of the size of the full model (it depends only on the solution size).
Templated models: only instantiate the model parts that end up in the solution.
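A sketch of the best-first expansion idea: a Dijkstra-style max-product search from the query edge expands edges in order of their strongest-path strength. Edge names, sensitivities and the adjacency map are made up, and the constant-time-per-edge bookkeeping of [Mooij+Kappen:2007] is not reproduced.

```python
import heapq

def strongest_path_order(edge_weight, adjacency, query_edges, M):
    """Best-first expansion of edges by the strength of their strongest path
    to the query (max-product of per-edge sensitivities in [0, 1]).
    Returns the first M edges expanded.
    edge_weight[e]: sensitivity bound for edge e; adjacency[e]: edges whose
    messages feed into e."""
    heap = [(-1.0, q) for q in query_edges]   # max-heap via negated strengths
    heapq.heapify(heap)
    best, order = {}, []
    while heap and len(order) < M:
        neg_strength, e = heapq.heappop(heap)
        if e in best:
            continue
        best[e] = -neg_strength
        order.append(e)
        for prev in adjacency.get(e, ()):
            if prev not in best:
                heapq.heappush(heap, (neg_strength * edge_weight[prev], prev))
    return order

# Hypothetical directed message graph: edge names with sensitivity bounds.
w = {'rq': 1.0, 'ur': 0.8, 'vr': 0.3, 'tu': 0.9, 'sv': 0.7}
adj = {'rq': ['ur', 'vr'], 'ur': ['tu'], 'vr': ['sv']}
print(strongest_path_order(w, adj, query_edges=['rq'], M=4))
```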
Future work: multi-path dependencies
• The query may depend on an edge through several paths; we want to take all of them into account
• Considering all paths is possible, but expensive: O(|E|^3)
• k strongest paths instead? e.g. AV(ij) = Σm path strength(λm) over the k strongest paths λ1, …, λk through (ij)
• Best-first search with at most k visits of an edge?
Perturbation approximation values
• As before, fix all messages not on a simple path λ from (v→n) to the query edge, so that mrq = BP*λ(mvn)
• Max-sensitivity path strength(λ) is the largest derivative value along the path with respect to the endpoint message; by the mean value theorem it upper-bounds the change in mrq
• Observation: it does not take the possible range of the endpoint message into account
• BP message properties give a tighter bound on the change in mrq: define path strength*(λ) to also account for how much the endpoint message mvn can actually change
Efficient model simplification
Define the max-perturbation value AV(ij) = maxλ∋(ij) path strength*(λ).
Lemma: with max-perturbation edge values, assuming the message derivatives along paths are known, one can find the optimal submodel
• as the first M edges expanded by best-first search,
• with constant-time computation per expanded edge.
Extra work: we need to know the derivatives along the paths.
Solution: use max-sensitivity best-first search as a subroutine.
Future work: efficient max-perturbation simplification
• Extra work: max-perturbation values need derivatives along paths, but not always!
• We only need the exact derivative for an edge if the cheap bounds (from the potential norms ||f||) fall in the undecided range around the current lower bound on path strength* from the best-first search
• Otherwise the cheap bounds already decide the comparison
Future work: computation trees
• Prune computation trees according to edge importance
• Computation tree traversal ~ message update schedule
[Figure: a small model and its unrolled computation tree]
Focused inference
• BP proceeds until all beliefs converge, but we only care about the query beliefs: convergence near the query is more important than convergence far from it
• Residual importance weighting for convergence testing
• For residual BP: pay more attention to more important regions by weighting residuals by edge importance (sketch below)
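A tiny sketch of importance-weighted residual scheduling: the priority of a message update is its residual scaled by the edge's query-specific importance. Names and numbers are hypothetical.

```python
import heapq

def push_update(queue, edge, residual, importance):
    """Residual BP scheduling, but residuals are weighted by the edge's
    query-specific importance AV(edge): updates far from the query matter
    less, so they sink in the priority queue."""
    priority = -importance * residual        # max-heap via negation
    heapq.heappush(queue, (priority, edge))

queue = []
push_update(queue, 'near_query_edge', residual=0.01, importance=0.9)
push_update(queue, 'far_edge',        residual=0.05, importance=0.05)
print(heapq.heappop(queue)[1])   # 'near_query_edge' is scheduled first
```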
Related work
• Minimal submodel that yields exactly the same P(Q|E=e) regardless of the values of potentials
• Knowledge-based model construction [Wellman+al:1992, Richardson+Domingos:2006]
• Graph distance as an edge importance measure [Pentney+al:2006]
• Empirical mutual information as a variable importance measure [Pentney+al:2007]
• Inference in the simplified model to quantify the effect of an extra edge exactly [Kjaerulff:1993, Choi+Darwiche:2008]
This thesis
Pipeline: Learn/construct structure → Learn/define parameters → Inference P(Q|E=e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning simple local models by exploiting evidence assignments
Local models: motivation
Common approach: learn/construct structure → approximate parameters → approximate inference for P(Q|E=e)
This talk, part 1: learn tractable structure → optimal parameters → exact inference for P(Q|E=e)
What if no single tractable structure fits well?
Local models: motivation
• What if no single tractable structure fits well?
• Regression analogy: no single line q = f(e) fits the whole data well, but locally the dependence is almost linear
• Solution: learn local tractable models
New pipeline: get the evidence assignment E = e → learn a tractable structure for E = e → parameters for E = e → exact inference for P(Q|E=e)
Local models: example
Pipeline: get the evidence assignment E = e → learn a tractable structure for E = e → parameters for E = e → exact inference for P(Q|E=e)
Example: local conditional random fields (CRFs)
• Global CRF: P(Q | E = e) ∝ exp( Σi wi fi(Q, e) ), with weights wi and features fi
• Local CRF: P(Q | E = e) ∝ exp( Σi wi Ii(e) fi(Q, e) ), where Ii(E) ∈ {0, 1} is a query-specific structure indicator that switches features on or off per evidence assignment
Learning local models
• Need to learn both the weights w and the QS structure I(E)
• Problem: finding good local structures needs the query values, which cannot be used at test time
• Iterate over the training data (E = e1, Q = q1), …, (E = en, Q = qn):
  • with weights w known, find good local structures for every training point (e.g. by local search)
  • with structures known for every training point, find the optimal weights w (convex optimization)
(a structural sketch of this alternating scheme follows below)
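A structural sketch of the alternating scheme, assuming hypothetical callbacks for the two subproblems; it shows only the control flow, not the actual convex optimization or local search.

```python
def learn_local_crf(train, fit_weights, local_structure_search, n_iters=10):
    """Alternating scheme from the slide, as a structural sketch only.
    `train` is a list of (evidence, query) pairs; `fit_weights` (convex weight
    optimization given fixed structures) and `local_structure_search`
    (per-datapoint search for a tractable structure given fixed weights)
    are hypothetical callbacks, not the proposal's code."""
    structures = [None] * len(train)   # start with a default / empty gating
    w = None
    for _ in range(n_iters):
        # with structures fixed, weight learning is a convex problem
        w = fit_weights(train, structures)
        # with weights fixed, search for a good tractable structure per training
        # point; this uses the query values q, so it only works at training time
        structures = [local_structure_search(e, q, w) for (e, q) in train]
    return w, structures
```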
Learning local models
• The per-point structures cannot be used at test time, so parametrize the structure: I = I(E, V) with parameters V
• Learn w and the QS structure parameters V
• Optimize V so that I(E, V) mimics the good local structures found for the training data
• Iterate: find good local structures per training point (local search) → optimal weights w (convex optimization) → update V to mimic the structures
Future work: better exploration
• Will the structures found across iterations really be different? Local search can get stuck
• Need to avoid shallow local minima: multiple structures per datapoint, stochastic optimization, sampling structures
Future work: multi-query optimization
• A separate structure for every query may be too costly
• Query clustering: directly using the evidence, or using the inferred model parameters (given w and V)
Future work: faster local search
• Need efficient structure learning: amortize the inference cost when scoring multiple search steps
• Need support for nuisance variables (neither query nor evidence) in the structure scores
Recap
Pipeline: Learn/construct structure → Learn/define parameters → Inference P(Q|E=e)
Thesis contributions:
1. Learning tractable models efficiently and with quality guarantees [NIPS 2007]
2. Simplifying large-scale models / focusing inference on the query
3. Learning local tractable models by exploiting evidence assignments
Timeline
Planned periods: Summer 2009, Fall 2009, Spring 2010, Summer 2010.
• Validation of QS model simplification: activity recognition data, MLN data
• QS simplification: multi-path extensions for edge importance measures, computation tree connections, max-perturbation computation speedups
• QS learning: better exploration (stochastic optimization / multiple structures per datapoint), multi-query optimization, validation
• QS learning: nuisance variable support, local search speedups, quality guarantees, validation
• Write thesis, defend
Thank you! Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf
Speeding things up
Constraint-based algorithm:
• set L = ∅
• for every potential separator S ⊆ X with |S| = k (there are O(|X|^k) of them): do the I(·) estimation, update L
• find a junction tree (C, E) consistent with L
Observation: only |X| − k separators actually appear in (C, E), so the I(·) estimations for the remaining O(|X|^k) candidate separators are wasted.
Faster heuristic:
• until (C, E) passes the checks: do the I(·) estimation, update L; find a junction tree (C, E) consistent with L
Speeding things up
• Recall that our upper bound on I(V, X \ VS | S) uses all Y ⊆ X \ S with |Y| ≤ k
• Idea: get a rough estimate by only looking at smaller Y (e.g. |Y| = 2), i.e. I(YV, YX\VS | S) for small YV ⊆ V and YX\VS ⊆ X \ VS
Faster heuristic:
• estimate I(·) with |Y| = 2, form L
• do:
  • find a junction tree (C, E) consistent with L
  • estimate I(· | S) with |Y| = k for the separators S of (C, E), update L
  • check whether (C, E) is still an ε-JT with the updated I(· | S)
• until (C, E) passes the checks