
Thesis Defense

Presentation Transcript


  1. Learning Large-Scale Conditional Random Fields Thesis Defense Joseph K. Bradley Committee Carlos Guestrin (U. of Washington, Chair) Tom Mitchell John Lafferty (U. of Chicago) Andrew McCallum (U. of Massachusetts at Amherst) 1 / 18 / 2013

  2. Modeling Distributions Goal: Model distribution P(X) over random variables X E.g.: Model life of a grad student. X6: loud roommate? X1: losing sleep? X7: taking classes? X4: losing hair? X2: deadline? X3: sick? X8: cold weather? X9: exercising? X5: overeating? X10: gaining weight? X11: single?

  3. Modeling Distributions Goal: Model distribution P(X) over random variables X E.g.: Model life of a grad student. = P( losing sleep, overeating | deadline, taking classes ) X6: loud roommate? X1: losing sleep? X7: taking classes? X4: losing hair? X2: deadline? X3: sick? X8: cold weather? X9: exercising? X5: overeating? X10: gaining weight? X11: single?

  4. Markov Random Fields (MRFs) Goal: Model distribution P(X) over random variables X E.g.: Model life of a grad student. X6: loud roommate? X1: losing sleep? X7: taking classes? X4: losing hair? X2: deadline? X3: sick? X8: cold weather? X9: exercising? X5: overeating? X10: gaining weight? X11: single?

  5. Markov Random Fields (MRFs) Goal: Model distribution P(X) over random variables X. Figure: the variables X1 through X11 as nodes in a graphical structure, with factors (parameters) over connected variables.

  6. Conditional Random Fields (CRFs) (Lafferty et al., 2001) MRFs: P(X) CRFs: P(Y|X) X1 Y1 X3 Y4 X2 Y3 X4 X5 Y5 Y2 X6 Simpler structure (over Y only) Do not model P(X)
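
For reference, the log-linear CRF of Lafferty et al. (2001) can be written as below. This is the standard form, sketched here because the slide's own formula images are not in the transcript; the factor/feature notation is an assumption of this sketch:

```latex
P_\theta(y \mid x) = \frac{1}{Z(x,\theta)} \exp\Big( \sum_{c} \theta_c \, \phi_c(y_c, x) \Big),
\qquad
Z(x,\theta) = \sum_{y'} \exp\Big( \sum_{c} \theta_c \, \phi_c(y'_c, x) \Big).
```

Because the normalizer Z(x, θ) is computed per input x, the model's graphical structure is over Y only and P(X) is never modeled, as the slide notes.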

  7. MRFs & CRFs • Benefits • Principled statistical and computational framework • Large body of literature • Applications • Natural language processing (e.g., Lafferty et al., 2001) • Vision (e.g., Tappen et al., 2007) • Activity recognition (e.g., Vail et al., 2007) • Medical applications (e.g., Schmidt et al., 2008) • ...

  8. Challenges Goal: Given data, learn CRF structure and parameters. This is a big structured optimization problem: NP-hard in general (Srebro, 2003). Many learning methods require inference, i.e., answering queries P(A|B): NP-hard to approximate (Roth, 1996). Approximations often lack strong guarantees.

  9. Thesis Statement CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems. We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

  10. Outline Scaling core methods: Parameter Learning (learning without intractable inference) and Structure Learning (learning tractable structures), both solved via Parallel Regression. Parallel scaling: multicore sparse regression.

  11. Outline Scaling core methods: Parameter Learning (learning without intractable inference) and Structure Learning (learning tractable structures), both solved via Parallel Regression. Parallel scaling: multicore sparse regression.

  12. Log-linear MRFs Goal: Model distribution P(X) over random variables X as a log-linear model, i.e., in terms of parameters and features over the graphical structure. All results generalize to CRFs.
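
The "parameters" and "features" on this slide refer to the standard log-linear parameterization; a sketch of that form (the exact notation on the slide is not visible in the transcript):

```latex
P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big( \sum_{c} \theta_c \, \phi_c(x_c) \Big),
\qquad
Z(\theta) = \sum_{x'} \exp\Big( \sum_{c} \theta_c \, \phi_c(x'_c) \Big),
```

where c ranges over the factors of the graphical structure, θ are the parameters, and φ are the features.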

  13. Parameter Learning: MLE Given structure Φ and samples from Pθ*(X), learn parameters θ. Traditional method: max-likelihood estimation (MLE). Minimize the loss objective. Gold standard: MLE is (optimally) statistically efficient.
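
The loss minimized here is presumably the usual negative average log-likelihood; for the log-linear form sketched above it reads (the standard objective, not necessarily the slide's exact formula):

```latex
\ell_{\mathrm{MLE}}(\theta)
= -\frac{1}{n} \sum_{m=1}^{n} \log P_\theta\big(x^{(m)}\big)
= \log Z(\theta) - \frac{1}{n} \sum_{m=1}^{n} \sum_{c} \theta_c \, \phi_c\big(x^{(m)}_c\big).
```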

  14. Parameter Learning: MLE

  15. Parameter Learning: MLE Inference makes learning hard. Can we learn without intractable inference? • MLE requires inference. • Provably hard for general MRFs (Roth, 1996).
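
One sentence on why MLE requires inference: the gradient of log Z(θ) is the expectation of the features under the current model, so every gradient step of the MLE objective needs marginal inference. In the notation of the sketch above:

```latex
\nabla_\theta \, \ell_{\mathrm{MLE}}(\theta)
= \mathbb{E}_{P_\theta}\big[\phi(X)\big] - \frac{1}{n} \sum_{m=1}^{n} \phi\big(x^{(m)}\big),
```

and computing the model expectation is exactly the inference problem that is provably hard in general.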

  16. Parameter Learning: MLE Inference makes learning hard. Can we learn without intractable inference? • Approximate inference & objectives • Many works: Hinton (2002), Sutton & McCallum (2005), Wainwright (2006), ... • Many lack strong theory. • Almost no guarantees for general MRFs or CRFs.

  17. Our Solution Bradley, Guestrin (2012)
      Method | Sample complexity | Parallel optimization | Computational complexity
      Max Likelihood Estimation (MLE) | Optimal | Difficult | High
      Max Pseudolikelihood Estimation (MPLE) | High | Easy | Low
      PAC learnability for many MRFs!

  18. Our Solution Bradley, Guestrin (2012)
      Method | Sample complexity | Parallel optimization | Computational complexity
      Max Likelihood Estimation (MLE) | Optimal | Difficult | High
      Max Pseudolikelihood Estimation (MPLE) | High | Easy | Low
      PAC learnability for many MRFs!

  19. Our Solution Bradley, Guestrin (2012)
      Method | Sample complexity | Parallel optimization | Computational complexity
      Max Likelihood Estimation (MLE) | Optimal | Difficult | High
      Max Pseudolikelihood Estimation (MPLE) | High | Easy | Low
      Max Composite Likelihood Estimation (MCLE) | Low | Easy | Low
      Choose MCLE structure to optimize trade-offs.

  20. Deriving Pseudolikelihood (MPLE) MLE: the objective involves the global normalization (partition function), which is hard to compute. So replace it!

  21. Deriving Pseudolikelihood (MPLE) MPLE: estimate each conditional P(Xi|X\i) via regression (Besag, 1975). Tractable inference!
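
Written out, the pseudolikelihood objective replaces the joint likelihood with a sum of node-conditional terms (Besag, 1975); a sketch in the notation used above:

```latex
\ell_{\mathrm{MPLE}}(\theta)
= -\frac{1}{n} \sum_{m=1}^{n} \sum_{i} \log P_\theta\big(x^{(m)}_i \,\big|\, x^{(m)}_{\setminus i}\big),
```

where each conditional depends only on the neighbors of X_i, so its normalizer is a sum over the values of a single variable; this is the sense in which each term is a regression.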

  22. Pseudolikelihood (MPLE) (Besag, 1975) • Cons • Less statistically efficient than MLE (Liang & Jordan, 2008) • No PAC bounds (PAC = Probably Approximately Correct; Valiant, 1984) • Pros • No intractable inference! • Consistent estimator

  23. Sample Complexity: MLE Our Theorem: a bound on n (# training examples needed) in terms of the probability of failure, the parameter error (L1), the # of parameters (length of θ), and Λmin, the min eigenvalue of the Hessian of the loss at θ*. Recall: MLE requires intractable inference.

  24. Sample Complexity: MPLE Our Theorem: a bound on n (# training examples needed) in terms of the probability of failure, the parameter error (L1), the # of parameters (length of θ), and Λmin = mini [ min eigenvalue of the Hessian of component i at θ* ]. PAC learnability for many MRFs! Recall: MPLE requires only tractable inference.

  25. Sample Complexity: MPLE Our Theorem: Bound on n (# training examples needed) PAC learnability for many MRFs! • Related Work • Ravikumar et al. (2010) • Regression Yi~X with Ising models • Basis of our theory • Liang & Jordan (2008) • Asymptotic analysis of MLE, MPLE • Our bounds match theirs • Abbeel et al. (2006) • Only previous method with PAC bounds for high-treewidth MRFs • We extend their work: • Extension to CRFs, algorithmic improvements, analysis • Their method is very similar to MPLE.

  26. Trade-offs: MLE & MPLE Our Theorem bounds n (# training examples needed). MLE: larger Λmin, hence lower sample complexity, but higher computational complexity. MPLE: smaller Λmin, hence higher sample complexity, but lower computational complexity. A trade-off between sample complexity and computational complexity.

  27. Trade-offs: MPLE Joint optimization for MPLE: lower sample complexity. Disjoint optimization for MPLE: data-parallel; a parameter shared by two components (e.g., an edge between X1 and X2) gets 2 estimates, which are averaged. A trade-off between sample complexity and parallelism.
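
To make the disjoint scheme concrete, here is a minimal sketch for the special case of a pairwise Ising model, where each node-conditional is a logistic regression and the two estimates of each edge parameter are averaged. This is an illustration under my own assumptions (the function name, the zero-field Ising parameterization, and the use of scikit-learn are not from the thesis):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def disjoint_mple_ising(X):
    """Disjoint pseudolikelihood for a pairwise Ising model (entries in {-1, +1}).

    Assumes P(x) proportional to exp(sum_{i<j} theta_ij * x_i * x_j), so each
    conditional is P(x_i = +1 | x_-i) = sigmoid(2 * sum_j theta_ij * x_j).
    Each node is fit as an independent logistic regression (data-parallel);
    the two estimates of every edge parameter are then averaged.
    """
    n, d = X.shape
    W = np.zeros((d, d))
    for i in range(d):
        features = np.delete(X, i, axis=1)            # x_-i
        target = (X[:, i] > 0).astype(int)            # map {-1, +1} -> {0, 1}
        clf = LogisticRegression(C=1e6, fit_intercept=False)  # ~unregularized
        clf.fit(features, target)
        W[i, np.arange(d) != i] = clf.coef_[0]        # W[i, j] ~ 2 * theta_ij
    return (W + W.T) / 4.0                            # average the two estimates
```

Each of the d regressions touches only one node's conditional, so they can run on separate cores or machines; the averaging step is the price paid in sample complexity relative to the joint MPLE optimization.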

  28. Synthetic CRFs Model shapes: chains, stars, grids. Factor types: random, associative. Factor strength = strength of variable interactions.

  29. Predictive Power of Bounds Plot: L1 param error ε vs. # training examples for MLE, MPLE, and MPLE-disjoint, on length-4 chains (factors: random, fixed strength). Errors should be ordered: MLE < MPLE < MPLE-disjoint.

  30. Predictive Power of Bounds: MLE & MPLE Plot: actual ε against the MLE sample complexity bound, with 10,000 training examples, on length-6 chains (factors: random).

  31. Failure Modes of MPLE How do Λmin(MLE) and Λmin(MPLE) vary for different models? (Λmin drives the sample complexity.) We examine model diameter, factor strength, and node degree.

  32. Λmin: Model Diameter Plot: Λmin ratio MLE/MPLE (higher = MLE better) vs. model diameter, for chains (factors: associative, fixed strength). Relative MPLE performance is independent of diameter in chains. (Same for random factors.)

  33. Λmin: Factor Strength Plot: Λmin ratio MLE/MPLE (higher = MLE better) vs. factor strength, for length-8 chains (factors: associative). MPLE performs poorly with strong factors. (Same for random factors, and for star & grid models.)

  34. Λmin: Node Degree Plot: Λmin ratio MLE/MPLE (higher = MLE better) vs. node degree, for stars (factors: associative, fixed strength). MPLE performs poorly with high-degree nodes. (Same for random factors.)

  35. Failure Modes of MPLE How do Λmin(MLE) and Λmin(MPLE) vary for different models? Across model diameter, factor strength, and node degree, MPLE suffers mainly from strong factors and high-degree nodes. We can often fix this!

  36. Composite Likelihood (MCLE) MLE: Estimate P(Y) all at once

  37. Composite Likelihood (MCLE) MLE: Estimate P(Y) all at once. MPLE: Estimate P(Yi|Y\i) separately.

  38. Composite Likelihood (MCLE) MLE: Estimate P(Y) all at once. MPLE: Estimate P(Yi|Y\i) separately. Something in between? Composite Likelihood (MCLE): Estimate P(YAi|Y\Ai) for blocks of variables YAi separately (Lindsay, 1988).
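
Written as an objective in the same style as the MLE and MPLE losses above, composite likelihood conditions whole blocks Y_{A_k} at a time (Lindsay, 1988); a sketch:

```latex
\ell_{\mathrm{MCLE}}(\theta)
= -\frac{1}{n} \sum_{m=1}^{n} \sum_{k} \log P_\theta\big(y^{(m)}_{A_k} \,\big|\, y^{(m)}_{\setminus A_k}\big),
```

with MLE recovered by the single block A_1 = {all of Y} and MPLE by singleton blocks A_i = {Y_i}.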

  39. Composite Likelihood (MCLE) Generalizes MLE, MPLE; analogous: • Objective • Sample complexity • Joint & disjoint optimization MCLE Class: Node-disjoint subgraphs which cover graph.

  40. Composite Likelihood (MCLE): Combs MCLE class: node-disjoint subgraphs which cover the graph. Generalizes MLE, MPLE; analogous objective, sample complexity, and joint & disjoint optimization. Choosing components: • Trees (tractable inference) • Follow structure of P(X) • Cover star structures • Cover strong factors • Choose large components

  41. Structured MCLE on a Grid Grid, associative factors, 10,000 training examples, Gibbs sampling. Plots: log loss ratio (other/MLE) vs. grid size |X|, and training time (sec) vs. grid size |X|, for MLE, MPLE, and MCLE (combs). MCLE (combs) lowers sample complexity without increasing computation. MCLE is tailored to model structure; also in thesis: tailoring to correlations in data.

  42. Summary: Parameter Learning • Finite sample complexity bounds for general MRFs, CRFs • PAC learnability for certain classes • Empirical analysis • Guidelines for choosing MCLE structures: tailor to model, data
      Method | Sample complexity | Parallel optimization | Computational complexity
      Likelihood (MLE) | Optimal | Difficult | High
      Pseudolikelihood (MPLE) | High | Easy | Low
      Composite Likelihood (MCLE) | Low | Easy | Low

  43. Outline Scaling core methods: Structure Learning (learning tractable structures) and Parameter Learning (learning without intractable inference), both solved via Parallel Regression. Parallel scaling: multicore sparse regression.

  44. CRF Structure Learning Structure learning: choose YC, i.e., learn conditional independence. Evidence selection: choose XD, i.e., select the X relevant to each YC. (Example variables: X1: loud roommate? X2: taking classes? X3: deadline? Y1: losing sleep? Y2: losing hair? Y3: sick?)

  45. Related Work • Most similar to our work: • They focus on selecting treewidth-k structures. • We focus on the choice of edge weight.

  46. Tree CRFs with Local Evidence Bradley, Guestrin (2010) Goal: given data and local evidence (Xi relevant to each Yi), learn a tree CRF structure (fast inference at test-time) via a scalable method.

  47. Chow-Liu for MRFs Chow & Liu (1968) Algorithm: weight each edge (i, j) with the mutual information between Yi and Yj.

  48. Chow-Liu for MRFs Chow & Liu (1968) Algorithm: weight each edge (i, j) with the mutual information between Yi and Yj; choose the max-weight spanning tree. Chow-Liu finds a max-likelihood structure.
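
A minimal sketch of the procedure as stated on this slide: estimate pairwise mutual information from samples and take a maximum-weight spanning tree. The helper names and the use of networkx are choices of this sketch, not code from the thesis:

```python
import itertools
import numpy as np
import networkx as nx

def empirical_mi(yi, yj):
    """Plug-in estimate of the mutual information I(Yi; Yj) from discrete samples."""
    mi = 0.0
    for a in np.unique(yi):
        for b in np.unique(yj):
            p_ab = np.mean((yi == a) & (yj == b))
            p_a, p_b = np.mean(yi == a), np.mean(yj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(Y):
    """Chow-Liu: max-weight spanning tree under pairwise empirical mutual information."""
    d = Y.shape[1]
    g = nx.Graph()
    for i, j in itertools.combinations(range(d), 2):
        g.add_edge(i, j, weight=empirical_mi(Y[:, i], Y[:, j]))
    return nx.maximum_spanning_tree(g)   # a Graph whose edges form the learned tree
```

The same skeleton applies to the CRF case on the next slide; only the edge weight changes.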

  49. Chow-Liu for CRFs? Algorithm: weight each possible edge, then choose the max-weight spanning tree. What edge weight? It must be efficient to compute. One candidate: global Conditional Mutual Information (CMI). Pro: finds a max-likelihood structure (with enough data). Con: intractable for large |X|.
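
For concreteness, the global conditional mutual information weight named here is the standard quantity below (a sketch of the textbook definition); the outer sum over all configurations of X is what makes it intractable for large |X|:

```latex
w(i,j) = I(Y_i; Y_j \mid X)
= \sum_{x} P(x) \sum_{y_i, y_j} P(y_i, y_j \mid x)\,
  \log \frac{P(y_i, y_j \mid x)}{P(y_i \mid x)\, P(y_j \mid x)}.
```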

  50. Generalized Edge Weights (beyond global CMI) Local Linear Entropy Scores (LLES): w(i,j) = a linear combination of entropies over Yi, Yj, Xi, Xj. Theorem: No LLES can recover all tree CRFs (even with non-trivial parameters and exact entropies).
