Joseph K. Bradley

Sample Complexity of CRF Parameter Learning. Joseph K. Bradley. Joint work with Carlos Guestrin. CMU Machine Learning Lunch talk on work appearing in AISTATS 2012. 4/9/2012.

Presentation Transcript


  1. Sample Complexity of CRF Parameter Learning. Joseph K. Bradley. Joint work with Carlos Guestrin. CMU Machine Learning Lunch talk on work appearing in AISTATS 2012. 4/9/2012.

  2. Markov Random Fields (MRFs). Goal: model the distribution P(X) over random variables X. E.g., P(X1 | X2, X4) = P( deadline | bags under eyes, losing hair ). Example graph: X1: deadline? X2: bags under eyes? X3: sick? X4: losing hair? X5: overeating?

  3. Markov Random Fields (MRFs). Goal: model the distribution P(X) over random variables X. Example graph, with a factor labeled: X1: deadline? X2: bags under eyes? X3: sick? X4: losing hair? X5: overeating?

  4. Log-linear MRFs. The distribution is defined by parameters θ and features, for binary X or real X. Our goal: given structure Φ and data, learn parameters θ.
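As a hedged illustration of what "log-linear" means here, the standard form of such a model is written out below. The symbols φ (feature vector) and Z(θ) (partition function) are assumed notation for this sketch and are not taken from the slide; for real-valued X the sum becomes an integral.

```latex
P_\theta(X) = \frac{1}{Z(\theta)} \exp\!\left(\theta^\top \phi(X)\right),
\qquad
Z(\theta) = \sum_{x} \exp\!\left(\theta^\top \phi(x)\right)
```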

  5. Parameter Learning: MLE. Traditional learning: max-likelihood estimation (MLE). Given data: n i.i.d. samples from Pθ*(X). Minimize objective: loss + regularization. L2 regularization is more common; our analysis applies to L1 & L2. Gold standard: MLE is (optimally) statistically efficient.

  6. Parameter Learning: MLE. Traditional learning: max-likelihood estimation (MLE). Given data: n i.i.d. samples from Pθ*(X). Minimize objective: loss + regularization. Algorithm: iterate (1) compute gradient, (2) step along gradient.

  7. Parameter Learning: MLE. Traditional learning: max-likelihood estimation (MLE). Algorithm: iterate (1) compute gradient, (2) step along gradient.

  8. Parameter Learning: MLE. Traditional learning: max-likelihood estimation (MLE). Algorithm: iterate (1) compute gradient, (2) step along gradient. Computing the gradient requires inference, which is provably hard for general MRFs. Inference makes learning hard. Can we learn without intractable inference?
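The gradient loop above can be sketched for a tiny model to show where inference enters. This is a minimal sketch, not the talk's implementation: it assumes binary variables in {0,1}, one feature x_i per node and one feature x_i·x_j per edge, L2 regularization, and exact inference by enumerating all 2^n states. The function names (`features`, `expected_features`, `mle_fit`) and the default step size are hypothetical choices for illustration.

```python
import itertools
import numpy as np

def features(x, edges):
    """Feature vector: one entry per node (x_i), then one per edge (x_i * x_j)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, [x[i] * x[j] for i, j in edges]])

def expected_features(theta, n_vars, edges):
    """Exact E_theta[phi(X)] by enumerating all 2^n states (the inference step)."""
    states = list(itertools.product([0, 1], repeat=n_vars))
    phis = np.array([features(s, edges) for s in states])
    scores = phis @ theta
    probs = np.exp(scores - scores.max())  # subtract max for numerical stability
    probs /= probs.sum()                   # division by Z(theta)
    return probs @ phis

def mle_fit(data, n_vars, edges, lam=0.01, step=0.5, iters=500):
    """Gradient descent on the L2-regularized negative log-likelihood."""
    phi_data = np.mean([features(x, edges) for x in data], axis=0)
    theta = np.zeros(n_vars + len(edges))
    for _ in range(iters):
        # Gradient = model expectation - empirical expectation + regularizer.
        grad = expected_features(theta, n_vars, edges) - phi_data + lam * theta
        theta -= step * grad
    return theta
```

Every iteration calls `expected_features`, whose cost grows as 2^n in this sketch; that call is the inference step the slide identifies as the bottleneck.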

  9. Conditional Random Fields (CRFs) (Lafferty et al., 2001). MRFs model P(X); CRFs model P(X|E). Inference is exponential in |X|, not |E|. Variables X: X1: deadline? X2: bags under eyes? X3: sick? X4: losing hair? X5: overeating? Evidence E: E1: weather, E2: full moon, E3: Steelers game, …

  10. Conditional Random Fields (CRFs) (Lafferty et al., 2001). MRFs model P(X); CRFs model P(X|E). Inference is exponential in |X|, not |E|. But Z depends on E! The objective requires computing Z(e) for every training example. Inference makes learning even harder for CRFs. Can we learn without intractable inference?
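To make the dependence on E concrete, a regularized conditional log-likelihood objective for a log-linear CRF can be written as below. This is a hedged sketch in assumed notation: φ(x, e) is a feature vector, R(θ) stands for the L1 or L2 regularizer, and λ is its weight; none of these symbols are taken from the slide.

```latex
\min_\theta \; -\frac{1}{n}\sum_{i=1}^{n}
\Big[\theta^\top \phi\big(x^{(i)}, e^{(i)}\big) - \log Z\big(e^{(i)};\theta\big)\Big]
+ \lambda R(\theta),
\qquad
Z(e;\theta) = \sum_{x} \exp\!\big(\theta^\top \phi(x, e)\big)
```

Each training example contributes its own term log Z(e^(i); θ), so one inference problem must be solved per example per gradient step.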

  11. Outline. • Parameter learning. Before: no PAC learning results for general MRFs or CRFs. • Sample complexity results: PAC learning via pseudolikelihood for general MRFs and CRFs. • Empirical analysis of bounds: tightness & dependence on model. • Structured composite likelihood: lowering sample complexity.

  12. Related Work. • Ravikumar et al. (2010): PAC bounds for regression Yi~X with Ising factors. Our theory is largely derived from this work. • Liang and Jordan (2008): asymptotic bounds for pseudolikelihood and composite likelihood. Our finite-sample bounds are of the same order. • Learning with approximate inference: no PAC-style bounds for general MRFs or CRFs; cf. Hinton (2002), Koller & Friedman (2009), Wainwright (2006).

  14. Avoiding Intractable Inference. The MLE loss is hard to compute. So replace it!

  15. Pseudolikelihood (MPLE). In place of the MLE loss, use the pseudolikelihood (MPLE) loss (Besag, 1975). Intuition: approximate the distribution as a product of local conditionals. Example graph: X1: deadline? X2: bags under eyes? X3: sick? X4: losing hair? X5: overeating?

  18. Pseudolikelihood (MPLE). In place of the MLE loss, use the pseudolikelihood (MPLE) loss (Besag, 1975). No intractable inference required! Previous work: • Pro: consistent estimator. • Con: less statistically efficient than MLE. • Con: no PAC bounds.
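As a hedged sketch of the "product of local conditionals" idea, the function below evaluates a negative log-pseudolikelihood for a small binary pairwise model. It assumes variables in {0,1}, node features x_i and edge features x_i·x_j, and a parameter layout of node weights followed by edge weights; the name `neg_log_pseudolikelihood` and that layout are illustrative assumptions, not the talk's code.

```python
import numpy as np

def neg_log_pseudolikelihood(theta, data, n_vars, edges):
    """Average over samples of -sum_i log P_theta(x_i | x_{-i}).

    Assumed layout of theta: n_vars node weights, then one weight per edge.
    """
    data = np.asarray(data, dtype=float)
    node_w = np.asarray(theta[:n_vars], dtype=float)
    # Symmetric matrix of edge weights (zero diagonal).
    W = np.zeros((n_vars, n_vars))
    for k, (i, j) in enumerate(edges):
        W[i, j] = W[j, i] = theta[n_vars + k]
    # Log-odds of x_i = 1 given the rest; only the neighbors of i contribute.
    logits = node_w + data @ W
    # log P(x_i | x_{-i}) = x_i * logit - log(1 + exp(logit))
    log_cond = data * logits - np.logaddexp(0.0, logits)
    return -log_cond.sum(axis=1).mean()
```

Each conditional normalizes over a single variable, so no partition function over all 2^n joint states is ever computed.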

  20. Sample Complexity: MLE. Theorem: Given n i.i.d. samples from Pθ*(X), MLE using L1 or L2 regularization achieves avg. per-parameter error ε with probability ≥ 1-δ if n exceeds a bound that depends on r, ε, δ, and Λmin, where r = # parameters (length of θ) and Λmin = min eigenvalue of the Hessian of the loss at θ*.

  21. Sample Complexity: MPLE. Same form as for MLE, with r = length of θ, ε = avg. per-parameter error, δ = probability of failure. For MLE: Λmin = min eigval of Hessian of loss at θ*. For MPLE: Λmin = min_i [ min eigval of Hessian of loss component i at θ* ].
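For intuition about Λmin in the MLE case, the sketch below computes it exactly for a tiny model. It relies on a standard fact about log-linear models: the Hessian of the negative log-likelihood equals the covariance matrix of the features under the model. The binary {0,1} variables, the node and edge features, the enumeration over all states, and the name `lambda_min_mle` are assumptions of this sketch; the MPLE case would need one such Hessian per loss component and is not shown.

```python
import itertools
import numpy as np

def lambda_min_mle(theta, n_vars, edges):
    """Min eigenvalue of Cov_theta[phi(X)], the Hessian of the MLE loss at theta.

    Assumes at least one edge; theta holds node weights, then edge weights.
    """
    states = np.array(list(itertools.product([0, 1], repeat=n_vars)), dtype=float)
    edge_feats = np.stack([states[:, i] * states[:, j] for i, j in edges], axis=1)
    phis = np.hstack([states, edge_feats])      # one feature row per joint state
    scores = phis @ theta
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    centered = phis - probs @ phis              # subtract the mean feature vector
    cov = centered.T @ (centered * probs[:, None])
    return np.linalg.eigvalsh(cov)[0]           # eigvalsh sorts ascending
```

Calling this at the true parameters θ* for models of different sizes or factor strengths gives a direct way to see how Λmin, and hence the bound, changes with the model.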

  22. Joint vs. Disjoint Optimization. Pseudolikelihood (MPLE) loss. Intuition: approximate the distribution as a product of local conditionals. Example graph: X1: deadline? X2: bags under eyes? X3: sick? X4: losing hair? X5: overeating?

  23. Joint vs. Disjoint Optimization. Joint: MPLE. Disjoint: regress Xi~X-i for each i, then average the parameter estimates. Example graph: X1: deadline? X2: bags under eyes? X3: sick? X4: losing hair? X5: overeating?
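The disjoint variant can be sketched as one logistic regression per variable followed by averaging. This is a minimal sketch under stated assumptions, not the authors' code: variables are binary in {0,1}, the model has node features x_i and edge features x_i·x_j, each regression uses only the neighbors of X_i (the other variables do not affect the conditional), and plain gradient descent with a small L2 penalty stands in for whatever solver one would actually use. The names `fit_logistic` and `disjoint_mple` are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, lam=1e-3, step=0.5, iters=2000):
    """L2-regularized logistic regression by gradient descent: (bias, weights)."""
    n, d = X.shape
    A = np.hstack([np.ones((n, 1)), X])         # prepend a bias column
    w = np.zeros(d + 1)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(A @ w)))
        w -= step * (A.T @ (p - y) / n + lam * w)
    return w[0], w[1:]

def disjoint_mple(data, n_vars, edges):
    """Regress each X_i on its neighbors, then average duplicate edge estimates."""
    data = np.asarray(data, dtype=float)
    neighbors = {i: [] for i in range(n_vars)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    node_w = np.zeros(n_vars)
    edge_estimates = {tuple(sorted(e)): [] for e in edges}
    for i in range(n_vars):                     # each regression is independent
        nbrs = neighbors[i]
        bias, weights = fit_logistic(data[:, nbrs], data[:, i])
        node_w[i] = bias
        for j, w_ij in zip(nbrs, weights):
            edge_estimates[tuple(sorted((i, j)))].append(w_ij)
    # Each edge weight is estimated twice (once from each endpoint); average them.
    edge_w = np.array([np.mean(edge_estimates[tuple(sorted(e))]) for e in edges])
    return node_w, edge_w
```

Because the per-variable regressions share no state, they could run on separate workers, which is the sense in which the disjoint approach is data parallel.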

  24. Joint vs. Disjoint Optimization. Sample complexity bounds for joint MPLE and disjoint MPLE. Disjoint MPLE: con: worse bound; pro: data parallel.

  25. Bounds for Log Loss. We have seen the MLE & MPLE sample complexity bounds. Theorem: If the parameter estimation error ε is small, then log loss converges quadratically in ε; else log loss converges linearly in ε. (Matches rates from Liang and Jordan, 2008.)

  27. Synthetic CRFs. Factor types: random, or associative (the factor value, set by the factor strength, depends on whether the two variables agree). Structures: chains, stars, grids.

  28. Tightness of Bounds. Chain, |X|=4, random factors. Bounds examined: parameter estimation error ≤ f(sample size); log loss ≤ f(parameter estimation error). Methods: MLE, MPLE, MPLE-disjoint.

  29. Tightness of Bounds. Chain, |X|=4, random factors. [Plot: L1 param error and L1 param error bound vs. training set size, for MLE, MPLE, and MPLE-disjoint.]

  30. Tightness of Bounds. Chain, |X|=4, random factors. [Plots: L1 param error and L1 param error bound vs. training set size; log (base e) loss and log loss bound, given params, vs. training set size; for MLE, MPLE, and MPLE-disjoint.]

  31. Tightness of Bounds. Parameter estimation error ≤ f(sample size): looser. Log loss ≤ f(parameter estimation error): tighter.

  32. Predictive Power of Bounds. Parameter estimation error ≤ f(sample size) is the looser bound. Is it still useful (predictive)? Examine dependence on Λmin and r.

  33. Predictive Power of Bounds. Chains, random factors, 10,000 training examples, MLE (similar results for MPLE). [Plot: L1 param error and L1 param error bound vs. 1/Λmin, for r=5, r=11, r=23.] Actual error vs. bound: • Different constants. • Similar behavior. • Nearly independent of r.

  34. Recall: Λmin. The sample complexity depends on Λmin. For MLE: Λmin = min eigval of Hessian of loss at θ*. For MPLE: Λmin = min_i [ min eigval of Hessian of loss component i at θ* ]. How do Λmin(MLE) and Λmin(MPLE) vary for different models?

  35. Λmin ratio, MLE/MPLE: chains. [Plots, for random factors and for associative factors: Λmin ratio vs. model size |Y| (fixed factor strength), and Λmin ratio vs. factor strength (fixed |Y|=8).]

  36. Λmin ratio, MLE/MPLE: stars. [Plots, for random factors and for associative factors: Λmin ratio vs. model size |Y| (fixed factor strength), and Λmin ratio vs. factor strength (fixed |Y|=8).]

  38. Grid Example. MLE: estimate P(Y) all at once.

  39. Grid Example. MLE: estimate P(Y) all at once. MPLE: estimate P(Yi|Y-i) separately for each Yi.

  40. Grid Example. MLE: estimate P(Y) all at once. MPLE: estimate P(Yi|Y-i) separately. Something in between? Estimate a larger component, but keep inference tractable. Composite Likelihood (MCLE) (Lindsay, 1988): estimate P(YAi|Y-Ai) separately, where each YAi is a subset of Y.
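The three estimators above differ only in which conditionals they fit. As a hedged summary in assumed notation (the symbol ℓ_CL and the expectation over the data distribution P are not from the slide), the composite likelihood loss over components A_1, …, A_k can be written as:

```latex
\ell_{\mathrm{CL}}(\theta) = -\sum_{i=1}^{k}
\mathbb{E}_{Y \sim P}\Big[\log P_\theta\big(Y_{A_i} \mid Y_{-A_i}\big)\Big]
```

MPLE is the special case where each component is a single variable, A_i = {i}, and MLE is the special case of one component containing all of Y.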

  41. Grid Example. Composite Likelihood (MCLE): estimate P(YAi|Y-Ai) separately, where each YAi is a subset of Y. Choosing MCLE components YAi: • Larger is better. • Keep inference tractable. • Choose using model structure. Example: with weak horizontal factors and strong vertical factors, a good choice is vertical combs.

  42. Λmin ratio, MLE vs. MPLE and MCLE: grids. [Plots, for random factors and for associative factors, comparing MPLE and combs: Λmin ratio vs. grid width (fixed factor strength), and Λmin ratio vs. factor strength (fixed |Y|=8).]

  43. Structured MCLE on a Grid. Grid with associative factors (fixed strength); 10,000 training samples; Gibbs sampling for inference. [Plots: log loss ratio (other/MLE) and training time (sec), each vs. grid size |X|, for MLE, MPLE, and combs.] Combs (MCLE) lower sample complexity... without increasing computation!

  44. Averaging MCLE Estimators. The MLE & MPLE sample complexity depends on Λmin: Λmin(MLE) = min eigval of Hessian of loss at θ*; Λmin(MPLE) = min_i [ min eigval of Hessian of loss component i at θ* ]. The MCLE sample complexity depends on ρmin and Mmax: ρmin = min_j [ sum, over components Ai which estimate θj, of the min eigval of the Hessian of loss component i at θ* ]; Mmax = max_j [ number of components Ai which estimate θj ].

  45. Averaging MCLE Estimators. MCLE sample complexity, compared with MLE & MPLE: ρmin = min_j [ sum, over components Ai which estimate θj, of the min eigval of the Hessian of loss component i at θ* ]; Mmax = max_j [ number of components Ai which estimate θj ]. [Diagram: overlapping components, with some parameters estimated by one component and some by both; examples with Mmax = 2 and Mmax = 3.]
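The two quantities defined above are simple to compute once each component's parameters and minimum Hessian eigenvalue are known. The sketch below assumes those are given as inputs; the function names `m_max` and `rho_min`, and the representation of components as sets of parameter indices, are hypothetical choices for illustration.

```python
def m_max(components):
    """Largest number of components that estimate any single parameter.

    components: list of sets; components[i] holds the indices j of the
    parameters theta_j estimated by component A_i.
    """
    counts = {}
    for params in components:
        for j in params:
            counts[j] = counts.get(j, 0) + 1
    return max(counts.values())

def rho_min(components, lambda_mins):
    """min over j of the sum of lambda_mins[i] over components i estimating theta_j.

    lambda_mins[i] is the min eigenvalue of the Hessian of loss component i.
    """
    totals = {}
    for params, lam in zip(components, lambda_mins):
        for j in params:
            totals[j] = totals.get(j, 0.0) + lam
    return min(totals.values())
```

For example, with `components = [{0, 1}, {1, 2}]` and `lambda_mins = [0.2, 0.5]`, parameter 1 is covered by both components, so its total is the sum of both eigenvalues, while parameters 0 and 2 each rely on a single component.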

  46. Averaging MCLE Estimators. MLE & MPLE sample complexity: for MPLE, a single bad estimator P(Xi|X-i) can give a bad bound. MCLE sample complexity: for MCLE, the effect of a bad estimator P(XAi|X-Ai) can be averaged out by other good estimators.

  47. Structured MCLE on a Grid. Grid with strong vertical (associative) factors. [Plot: Λmin vs. grid width, for MLE, MPLE, Comb-vert, Comb-horiz, and Combs-both.]

  48. Summary: MLE vs. MPLE/MCLE. Relative performance of estimators: • Increasing model diameter has little effect. • MPLE/MCLE get worse with increasing factor strength, node degree, and grid width. Structured MCLE partly solves these problems: • Choose the MCLE structure according to factor strength, node degree, and grid structure. • Same computational cost as MPLE.

  49. Summary. • PAC learning via MPLE & MCLE for general MRFs and CRFs. • Empirical analysis: bounds are predictive of empirical behavior; strong factors and high-degree nodes hurt MPLE. • Structured MCLE: can have lower sample complexity than MPLE but the same computational complexity. • Future work: choosing MCLE structure on natural graphs; parallel learning (improving the statistical efficiency of disjoint optimization via limited communication); comparing with MLE using approximate inference. Thank you!

  50. Canonical Parametrization. Abbeel et al. (2006): only previous method for PAC-learning high-treewidth discrete MRFs. • PAC bounds for low-degree factor graphs over discrete X. • Main idea: re-write P(X) as a ratio of many small factors P( XCi | X-Ci ) (fine print: each factor is instantiated 2^|Ci| times using a reference assignment); estimate each small factor P( XCi | X-Ci ) from data; plug the factors into the big expression for P(X). Theorem: If the canonical parametrization uses the factorization of P(X), it is equivalent to MPLE with disjoint optimization. • Computing MPLE directly is faster. • Our analysis covers their learning method.
