Learning Large-Scale Conditional Random Fields
Thesis Defense
Joseph K. Bradley
Committee: Carlos Guestrin (U. of Washington, Chair), Tom Mitchell, John Lafferty (U. of Chicago), Andrew McCallum (U. of Massachusetts at Amherst)
1/18/2013
Modeling Distributions
Goal: Model distribution P(X) over random variables X
E.g.: Model life of a grad student.
X1: losing sleep? X2: deadline? X3: sick? X4: losing hair? X5: overeating? X6: loud roommate? X7: taking classes? X8: cold weather? X9: exercising? X10: gaining weight? X11: single?
Modeling Distributions
Goal: Model distribution P(X) over random variables X
E.g.: Model life of a grad student.
Example query: P(X1, X5 | X2, X7) = P( losing sleep, overeating | deadline, taking classes )
Markov Random Fields (MRFs)
Goal: Model distribution P(X) over random variables X
E.g.: Model life of a grad student.
X1: losing sleep? X2: deadline? X3: sick? X4: losing hair? X5: overeating? X6: loud roommate? X7: taking classes? X8: cold weather? X9: exercising? X10: gaining weight? X11: single?
Markov Random Fields (MRFs)
Goal: Model distribution P(X) over random variables X
[Figure: graph over X1, ..., X11, annotated with "factor (parameters)" and "graphical structure".]
Conditional Random Fields (CRFs) (Lafferty et al., 2001)
MRFs: P(X)
CRFs: P(Y|X)
• Simpler structure (over Y only)
• Do not model P(X)
[Figure: CRF graph over outputs Y1, ..., Y5 with inputs X1, ..., X6.]
MRFs & CRFs
• Benefits
  • Principled statistical and computational framework
  • Large body of literature
• Applications
  • Natural language processing (e.g., Lafferty et al., 2001)
  • Vision (e.g., Tappen et al., 2007)
  • Activity recognition (e.g., Vail et al., 2007)
  • Medical applications (e.g., Schmidt et al., 2008)
  • ...
Challenges
Goal: Given data, learn CRF structure and parameters.
• Big structured optimization problem: NP-hard in general (Srebro, 2003)
• Many learning methods require inference, i.e., answering queries P(A|B): NP-hard to approximate (Roth, 1996)
• Approximations often lack strong guarantees.
Thesis Statement CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems. We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.
Outline
Scaling core methods
• Parameter Learning: Learning without intractable inference
• Structure Learning: Learning tractable structures
  (solve via Parallel Regression)
Parallel scaling
• Parallel Regression: Multicore sparse regression
Log-linear MRFs
Goal: Model distribution P(X) over random variables X
Pθ(X) = (1/Z(θ)) exp( θᵀφ(X) ), with parameters θ, features φ(X), and normalizer Z(θ)
All results generalize to CRFs.
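A minimal sketch of a log-linear MRF on a tiny example, assuming binary variables and a hypothetical 3-variable chain with hand-picked features (the `features` function and the values in `theta` are illustrative only, not from the talk). It computes the normalizer by enumerating all 2^n assignments, which is exactly the step that stops scaling as n grows.

```python
import itertools
import numpy as np

def features(x):
    """Hypothetical feature vector phi(x) for a binary chain X1 - X2 - X3:
    one indicator per variable plus one agreement indicator per edge."""
    x1, x2, x3 = x
    return np.array([x1, x2, x3, float(x1 == x2), float(x2 == x3)])

def log_partition(theta, n_vars=3):
    """log Z(theta), summing over all 2^n_vars assignments (small n only)."""
    scores = [theta @ features(x)
              for x in itertools.product([0, 1], repeat=n_vars)]
    return np.logaddexp.reduce(scores)

def log_prob(theta, x):
    """log P_theta(x) = theta . phi(x) - log Z(theta)."""
    return theta @ features(x) - log_partition(theta)

# Illustrative parameters (assumed): positive edge weights favor agreement.
theta = np.array([0.5, -0.2, 0.1, 1.0, 1.0])
all_states = list(itertools.product([0, 1], repeat=3))
probs = np.array([np.exp(log_prob(theta, x)) for x in all_states])
```

Here `probs` is the full joint table over the 8 assignments; for realistic models this table is never materialized.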
Parameter Learning: MLE
Parameter Learning: Given structure Φ and samples from Pθ*(X), learn parameters θ.
Traditional method: max-likelihood estimation (MLE)
Minimize objective: the loss (negative log-likelihood of the samples)
Gold Standard: MLE is (optimally) statistically efficient.
Parameter Learning: MLE
Inference makes learning hard. Can we learn without intractable inference?
• MLE requires inference.
• Provably hard for general MRFs. (Roth, 1996)
Parameter Learning: MLE
Inference makes learning hard. Can we learn without intractable inference?
• Approximate inference & objectives
  • Many works: Hinton (2002), Sutton & McCallum (2005), Wainwright (2006), ...
  • Many lack strong theory.
  • Almost no guarantees for general MRFs or CRFs.
Our Solution
Bradley, Guestrin (2012)
• Max Likelihood Estimation (MLE): sample complexity Optimal; computational complexity High; parallel optimization Difficult
• Max Pseudolikelihood Estimation (MPLE): sample complexity High; computational complexity Low; parallel optimization Easy
PAC learnability for many MRFs!
Our Solution
Bradley, Guestrin (2012)
• Max Likelihood Estimation (MLE): sample complexity Optimal; computational complexity High; parallel optimization Difficult
• Max Composite Likelihood Estimation (MCLE): sample complexity Low; computational complexity Low; parallel optimization Easy
• Max Pseudolikelihood Estimation (MPLE): sample complexity High; computational complexity Low; parallel optimization Easy
Choose MCLE structure to optimize trade-offs
Deriving Pseudolikelihood (MPLE)
MLE: the objective involves Pθ(X), which is hard to compute. So replace it!
[Figure: graph over X1, ..., X5.]
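Deriving Pseudolikelihood (MPLE)
MLE: the objective involves Pθ(X).
MPLE: replace Pθ(X) with the product over i of the conditionals Pθ(Xi|X\i). (Besag, 1975)
Estimate each Pθ(Xi|X\i) via regression: Tractable inference!
[Figure: graph over X1, ..., X5.]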
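The contrast between the two objectives can be sketched as follows, assuming a pairwise binary MRF with P(x) proportional to exp(b·x + ½ xᵀWx), x in {0,1}^n, W symmetric with zero diagonal. This parameterization and the function names are assumptions made for illustration; the talk's results cover general log-linear models. Each conditional P(xi | x\i) is then a logistic function of the neighbors, so the pseudolikelihood needs no sum over joint states, while the likelihood needs log Z.

```python
import itertools
import numpy as np

def neg_log_pseudolikelihood(b, W, X):
    """Average over samples of -sum_i log P(x_i | x_{-i}).
    b: (n,) biases; W: (n, n) symmetric, zero diagonal; X: (m, n) in {0, 1}."""
    logits = b + X @ W  # logits[s, i] = b_i + sum_j W_ij * x_sj
    # log P(x_i | x_-i) = x_i * logit_i - log(1 + exp(logit_i))
    log_cond = X * logits - np.logaddexp(0.0, logits)
    return -log_cond.sum(axis=1).mean()

def neg_log_likelihood(b, W, X):
    """Average of -log P(x). Needs log Z: a sum over all 2^n joint states."""
    n = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

    def score(S):
        return S @ b + 0.5 * np.einsum("si,ij,sj->s", S, W, S)

    log_z = np.logaddexp.reduce(score(states))
    return -(score(X) - log_z).mean()
```

The pseudolikelihood costs O(m·n²) here regardless of graph structure; the likelihood costs O(2^n) per evaluation in this brute-force form.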
Pseudolikelihood (MPLE) (Besag, 1975)
• Cons
  • Less statistically efficient than MLE (Liang & Jordan, 2008)
  • No PAC bounds
• Pros
  • No intractable inference!
  • Consistent estimator
PAC = Probably Approximately Correct (Valiant, 1984)
Sample Complexity: MLE
Our Theorem: Bound on n (# training examples needed), in terms of:
• probability of failure
• parameter error (L1)
• # parameters (length of θ)
• Λmin: min eigenvalue of Hessian of loss at θ*
Recall: Requires intractable inference.
Sample Complexity: MPLE
Our Theorem: Bound on n (# training examples needed), in terms of:
• probability of failure
• parameter error (L1)
• # parameters (length of θ)
• Λmin: min over i of [ min eigenvalue of Hessian of component i at θ* ]
PAC learnability for many MRFs!
Recall: Tractable inference.
Sample Complexity: MPLE
Our Theorem: Bound on n (# training examples needed)
PAC learnability for many MRFs!
• Related Work
  • Ravikumar et al. (2010)
    • Regression Yi~X with Ising models
    • Basis of our theory
  • Liang & Jordan (2008)
    • Asymptotic analysis of MLE, MPLE
    • Our bounds match theirs
  • Abbeel et al. (2006)
    • Only previous method with PAC bounds for high-treewidth MRFs
    • We extend their work: extension to CRFs, algorithmic improvements, analysis
    • Their method is very similar to MPLE.
Trade-offs: MLE & MPLE
Our Theorem: Bound on n (# training examples needed)
• MLE: Larger Λmin => Lower sample complexity; Higher computational complexity
• MPLE: Smaller Λmin => Higher sample complexity; Lower computational complexity
Sample complexity vs. computational complexity trade-off
Trade-offs: MPLE
• Joint optimization for MPLE: Lower sample complexity
• Disjoint optimization for MPLE: Data-parallel
  • 2 estimates of each shared parameter (one from each endpoint's regression, e.g., X1 on X2 and X2 on X1)
  • Average estimates
Sample complexity vs. parallelism trade-off
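Disjoint optimization can be sketched as one independent logistic regression per node, followed by averaging the two estimates of each edge weight. This is a rough sketch under the same assumed pairwise binary parameterization (biases `b`, symmetric weights `W`); the helper names, the Newton solver, the tiny ridge term, and the 3-node chain are illustrative choices, not the implementation from the thesis.

```python
import itertools
import numpy as np

def sample_pairwise_mrf(b, W, m, rng):
    """Draw m exact samples by enumerating all 2^n states (small n only)."""
    n = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    scores = states @ b + 0.5 * np.einsum("si,ij,sj->s", states, W, states)
    p = np.exp(scores - np.logaddexp.reduce(scores))
    return states[rng.choice(len(states), size=m, p=p)]

def fit_logistic(A, y, iters=25, ridge=1e-6):
    """Logistic regression via Newton's method; A includes a bias column."""
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(A @ w)))
        grad = A.T @ (p - y) + ridge * w
        hess = (A * (p * (1 - p))[:, None]).T @ A + ridge * np.eye(A.shape[1])
        w -= np.linalg.solve(hess, grad)
    return w

def disjoint_mple(X, edges):
    """Regress each X_i on its neighbors separately (these fits are
    independent, so they could run in parallel), then average the two
    estimates of each edge weight."""
    n = X.shape[1]
    nbrs = {i: sorted({j for e in edges for j in e if i in e and j != i})
            for i in range(n)}
    est = np.zeros((n, n))  # est[i, j]: estimate of W_ij from node i's fit
    for i in range(n):
        A = np.column_stack([np.ones(len(X))] + [X[:, j] for j in nbrs[i]])
        w = fit_logistic(A, X[:, i])
        for k, j in enumerate(nbrs[i]):
            est[i, j] = w[k + 1]
    return est, (est + est.T) / 2.0

rng = np.random.default_rng(0)
b_true = np.array([0.2, -0.3, 0.1])
W_true = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, -0.8], [0.0, -0.8, 0.0]])
X = sample_pairwise_mrf(b_true, W_true, 20000, rng)
per_node, averaged = disjoint_mple(X, edges=[(0, 1), (1, 2)])
```

`per_node[0, 1]` and `per_node[1, 0]` are the two separate estimates of the same edge weight; `averaged` combines them. Joint optimization would instead tie the two during fitting, which is what buys the lower sample complexity at the cost of parallelism.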
Synthetic CRFs
• Structures: Chains, Stars, Grids
• Factors: Random, Associative
• Factor strength = strength of variable interactions
Predictive Power of Bounds
Errors should be ordered: MLE < MPLE < MPLE-disjoint
[Plot: L1 param error ε (lower is better) vs. # training examples, for MLE, MPLE, and MPLE-disjoint. Length-4 chains. Factors: random, fixed strength.]
Predictive Power of Bounds
[Plot: MLE & MPLE. Labels: "Sample Complexity: MLE", "Actual ε", with arrows marked "better" and "harder". Length-6 chains. Factors: random. 10,000 train exs.]
Failure Modes of MPLE
How do Λmin(MLE) and Λmin(MPLE) vary for different models?
Sample complexity as a function of:
• Model diameter
• Factor strength
• Node degree
Λmin: Model Diameter
Λmin ratio: MLE/MPLE (Higher = MLE better)
Relative MPLE performance is independent of diameter in chains. (Same for random factors)
[Plot: Λmin ratio vs. model diameter. Chains. Factors: associative, fixed strength.]
Λmin: Factor Strength
Λmin ratio: MLE/MPLE (Higher = MLE better)
MPLE performs poorly with strong factors. (Same for random factors, and star & grid models)
[Plot: Λmin ratio vs. factor strength. Length-8 chains. Factors: associative.]
Λmin: Node Degree
Λmin ratio: MLE/MPLE (Higher = MLE better)
MPLE performs poorly with high-degree nodes. (Same for random factors)
[Plot: Λmin ratio vs. node degree. Stars. Factors: associative, fixed strength.]
Failure Modes of MPLE
How do Λmin(MLE) and Λmin(MPLE) vary for different models?
Sample complexity as a function of:
• Model diameter
• Factor strength
• Node degree
We can often fix this!
Composite Likelihood (MCLE) MLE: Estimate P(Y) all at once
Composite Likelihood (MCLE)
MLE: Estimate P(Y) all at once
MPLE: Estimate P(Yi|Y\i) separately
Composite Likelihood (MCLE)
MLE: Estimate P(Y) all at once
MPLE: Estimate P(Yi|Y\i) separately
Something in between?
Composite Likelihood (MCLE): Estimate P(YAi|Y\Ai) separately. (Lindsay, 1988)
Composite Likelihood (MCLE)
MCLE Class: Node-disjoint subgraphs which cover graph.
Generalizes MLE, MPLE; analogous:
• Objective
• Sample complexity
• Joint & disjoint optimization
Composite Likelihood (MCLE)
MCLE Class: Node-disjoint subgraphs which cover graph.
Generalizes MLE, MPLE; analogous:
• Objective
• Sample complexity
• Joint & disjoint optimization
Choosing components (e.g., combs):
• Trees (tractable inference)
• Follow structure of P(X)
• Cover star structures
• Cover strong factors
• Choose large components
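One way to picture comb components on a grid is sketched below. The specific layout is an assumption made for illustration (top row plus even interior columns for one comb, bottom row plus odd interior columns for the other); the slides do not spell out the exact construction. The check confirms the properties the MCLE class asks for: the two node sets are disjoint, cover the grid, and each induces a tree, so inference within a component stays tractable.

```python
def grid_combs(rows, cols):
    """Split a rows x cols grid into two interlocking comb-shaped node sets.
    Comb A: top row plus even columns of the interior rows.
    Comb B: bottom row plus odd columns of the interior rows."""
    comb_a = {(0, c) for c in range(cols)}
    comb_b = {(rows - 1, c) for c in range(cols)}
    for r in range(1, rows - 1):
        for c in range(cols):
            (comb_a if c % 2 == 0 else comb_b).add((r, c))
    return comb_a, comb_b

def induced_edges(nodes):
    """Grid edges (4-neighborhood) with both endpoints in `nodes`."""
    edges = []
    for (r, c) in nodes:
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in nodes:
                edges.append(((r, c), nb))
    return edges

def is_tree(nodes):
    """A connected graph with |V| - 1 edges is a tree."""
    edges = induced_edges(nodes)
    if len(edges) != len(nodes) - 1:
        return False
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(nodes)

comb_a, comb_b = grid_combs(5, 6)
```

Each comb spans roughly half the grid, which matches the "choose large components" guideline while keeping every component a tree.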
Structured MCLE on a Grid
Grid. Associative factors. 10,000 train exs. Gibbs sampling.
[Plots: log loss ratio (other/MLE) vs. grid size |X|, and training time (sec) vs. grid size |X|. Methods shown: MLE, MPLE, MCLE (combs).]
MCLE (combs) lowers sample complexity...
...without increasing computation!
MCLE tailored to model structure. Also in thesis: tailoring to correlations in data.
Summary: Parameter Learning
• Finite sample complexity bounds for general MRFs, CRFs
• PAC learnability for certain classes
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data
Trade-offs:
• Likelihood (MLE): sample complexity Optimal; computational complexity High; parallel optimization Difficult
• Composite Likelihood (MCLE): sample complexity Low; computational complexity Low; parallel optimization Easy
• Pseudolikelihood (MPLE): sample complexity High; computational complexity Low; parallel optimization Easy
Outline
Scaling core methods
• Structure Learning: Learning tractable structures
• Parameter Learning: Learning without intractable inference
  (solve via Parallel Regression)
Parallel scaling
• Parallel Regression: Multicore sparse regression
CRF Structure Learning
• Structure learning: Choose YC, i.e., learn conditional independence
• Evidence selection: Choose XD, i.e., select X relevant to each YC
E.g.: X1: loud roommate? X2: taking classes? X3: deadline? Y1: losing sleep? Y2: losing hair? Y3: sick?
Related Work
• Most similar to our work:
  • They focus on selecting treewidth-k structures.
  • We focus on the choice of edge weight.
Tree CRFs with Local Evidence
Bradley, Guestrin (2010)
Goal
• Given:
  • Data
  • Local evidence (Xi relevant to each Yi)
• Learn tree CRF structure (fast inference at test-time)
• Via a scalable method
Chow-Liu for MRFs
Chow & Liu (1968)
Algorithm
• Weight edges with mutual information: w(i,j) = I(Yi; Yj)
[Figure: graph over Y1, Y2, Y3.]
Chow-Liu for MRFs
Chow & Liu (1968)
Algorithm
• Weight edges with mutual information: w(i,j) = I(Yi; Yj)
• Choose max-weight spanning tree.
Chow-Liu finds a max-likelihood structure.
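The two steps above can be sketched in a few lines, assuming discrete data given as a list of tuples and empirical (plug-in) mutual information; the function names and the toy data are illustrative. The spanning tree is built with Kruskal's algorithm over edges sorted by decreasing weight.

```python
import itertools
import math
from collections import Counter

def mutual_information(data, i, j):
    """Empirical mutual information I(Y_i; Y_j) in nats."""
    m = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum((c / m) * math.log(c * m / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def chow_liu_tree(data, n_vars):
    """Max-weight spanning tree over MI edge weights (Kruskal + union-find)."""
    weights = {(i, j): mutual_information(data, i, j)
               for i, j in itertools.combinations(range(n_vars), 2)}
    parent = list(range(n_vars))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:  # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy data from a chain Y0 - Y1 - Y2: each variable copies its predecessor
# 90% of the time (counts chosen so the empirical distribution is exact).
data = []
for y0 in (0, 1):
    for flip1 in (0, 1):
        for flip2 in (0, 1):
            y1, y2 = y0 ^ flip1, (y0 ^ flip1) ^ flip2
            count = (1 if flip1 else 9) * (1 if flip2 else 9)
            data.extend([(y0, y1, y2)] * count)

tree = chow_liu_tree(data, n_vars=3)
```

On this toy data the direct edges (0, 1) and (1, 2) carry more mutual information than the indirect pair (0, 2), so the chain is recovered.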
Chow-Liu for CRFs?
Algorithm
• Weight each possible edge. What edge weight? It must be efficient to compute.
• Choose max-weight spanning tree.
Global Conditional Mutual Information (CMI): w(i,j) = I(Yi; Yj | X)
• Pro: Finds max-likelihood structure (with enough data)
• Con: Intractable for large |X|
Generalized Edge Weights
Global CMI
Local Linear Entropy Scores (LLES): w(i,j) = linear combination of entropies over Yi, Yj, Xi, Xj
Theorem: No LLES can recover all tree CRFs (even with non-trivial parameters and exact entropies).