Thesis Defense

Learning Large-Scale Conditional Random Fields. Thesis Defense. Joseph K. Bradley. Committee: Carlos Guestrin (U. of Washington, Chair), Tom Mitchell, John Lafferty (U. of Chicago), Andrew McCallum (U. of Massachusetts at Amherst). 1/18/2013

Modeling Distributions Goal: Model distribution P(X) over random variables X E.g.: Model life of a grad student. X1: losing sleep? X2: deadline? X3: sick? X4: losing hair? X5: overeating? X6: loud roommate? X7: taking classes? X8: cold weather? X9: exercising? X10: gaining weight? X11: single?

Modeling Distributions Goal: Model distribution P(X) over random variables X E.g.: Model life of a grad student. Example query: P(X1, X5 | X2, X7) = P( losing sleep, overeating | deadline, taking classes ) X1: losing sleep? X2: deadline? X3: sick? X4: losing hair? X5: overeating? X6: loud roommate? X7: taking classes? X8: cold weather? X9: exercising? X10: gaining weight? X11: single?

Markov Random Fields (MRFs) Goal: Model distribution P(X) over random variables X E.g.: Model life of a grad student. X1: losing sleep? X2: deadline? X3: sick? X4: losing hair? X5: overeating? X6: loud roommate? X7: taking classes? X8: cold weather? X9: exercising? X10: gaining weight? X11: single?

Markov Random Fields (MRFs) Goal: Model distribution P(X) over random variables X [Figure: graph over X1 through X11, labeled with its graphical structure and a factor (parameters).]

Conditional Random Fields (CRFs) (Lafferty et al., 2001) MRFs: P(X) CRFs: P(Y|X) • Simpler structure (over Y only) • Do not model P(X) [Figure: CRF over outputs Y1 through Y5 with inputs X1 through X6.]

MRFs & CRFs • Benefits • Principled statistical and computational framework • Large body of literature • Applications • Natural language processing (e.g., Lafferty et al., 2001) • Vision (e.g., Tappen et al., 2007) • Activity recognition (e.g., Vail et al., 2007) • Medical applications (e.g., Schmidt et al., 2008) • ...

Challenges Goal: Given data, learn CRF structure and parameters. • Big structured optimization problem: NP hard in general (Srebro, 2003) • Many learning methods require inference, i.e., answering queries P(A|B): NP hard to approximate (Roth, 1996) • Approximations often lack strong guarantees.

Thesis Statement CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems. We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

Outline • Scaling core methods • Parameter Learning: learning without intractable inference • Structure Learning: learning tractable structures • Both: solve via Parallel Regression • Parallel scaling • Parallel Regression: multicore sparse regression

Log-linear MRFs Goal: Model distribution P(X) over random variables X. The model is specified by parameters and features. All results generalize to CRFs.
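For reference, the standard log-linear form that the "Parameters" and "Features" labels most likely annotated. The symbols φ (feature vector) and Z (partition function) are conventional notation assumed here, not read off the slide.

```latex
P_\theta(X) \;=\; \frac{1}{Z(\theta)} \exp\!\big(\theta^\top \phi(X)\big),
\qquad
Z(\theta) \;=\; \sum_{x} \exp\!\big(\theta^\top \phi(x)\big)
```

Here θ is the parameter vector and φ(X) the feature vector; the sum defining Z(θ) ranges over every joint assignment x.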

Parameter Learning: MLE Parameter learning: Given structure Φ and samples from Pθ*(X), learn parameters θ. Traditional method: max-likelihood estimation (MLE). Minimize objective: the loss (negative log-likelihood of the samples). Gold standard: MLE is (optimally) statistically efficient.
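A minimal sketch of the MLE loss for a tiny binary log-linear MRF, to show where the cost comes from. The agreement features, the function names, and the sample data are illustrative assumptions, not the thesis's setup; the partition function is computed by brute-force enumeration, which is only feasible for a handful of variables.

```python
import itertools
import numpy as np

def features(x, edges):
    """Hypothetical feature map: one 'endpoints agree' indicator per edge."""
    return np.array([1.0 if x[i] == x[j] else 0.0 for i, j in edges])

def log_partition(theta, n_vars, edges):
    """log Z(theta), summing over all 2^n binary assignments (the inference step)."""
    scores = np.array([theta @ features(x, edges)
                       for x in itertools.product([0, 1], repeat=n_vars)])
    return np.logaddexp.reduce(scores)

def mle_loss(theta, samples, n_vars, edges):
    """Average negative log-likelihood of the samples under P_theta."""
    avg_score = np.mean([theta @ features(x, edges) for x in samples])
    return log_partition(theta, n_vars, edges) - avg_score

edges = [(0, 1), (1, 2)]                      # a length-3 chain
samples = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]   # made-up data
print(mle_loss(np.zeros(len(edges)), samples, 3, edges))  # equals log(8) at theta = 0
```

Every evaluation of the loss calls `log_partition`, whose cost grows as 2^n; that term is what the later slides replace.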

Parameter Learning: MLE

Parameter Learning: MLE Inference makes learning hard. Can we learn without intractable inference? • MLE requires inference. • Provably hard for general MRFs. (Roth, 1996)

Parameter Learning: MLE Inference makes learning hard. Can we learn without intractable inference? • Approximate inference & objectives • Many works: Hinton (2002), Sutton & McCallum (2005), Wainwright (2006), ... • Many lack strong theory. • Almost no guarantees for general MRFs or CRFs.

Our Solution Bradley, Guestrin (2012) • Max Likelihood Estimation (MLE): sample complexity Optimal; parallel optimization Difficult; computational complexity High • Max Pseudolikelihood Estimation (MPLE): sample complexity High; parallel optimization Easy; computational complexity Low PAC learnability for many MRFs!

Our Solution Bradley, Guestrin (2012) • Max Likelihood Estimation (MLE): sample complexity Optimal; parallel optimization Difficult; computational complexity High • Max Composite Likelihood Estimation (MCLE): sample complexity Low; parallel optimization Easy; computational complexity Low • Max Pseudolikelihood Estimation (MPLE): sample complexity High; parallel optimization Easy; computational complexity Low Choose MCLE structure to optimize trade-offs

Deriving Pseudolikelihood (MPLE) MLE: Hard to compute. So replace it! [Figure: example graph over X1 through X5.]

Deriving Pseudolikelihood (MPLE) (Besag, 1975) MLE: estimate P(X) all at once. MPLE: estimate each P(Xi|X\i) separately, via regression. Tractable inference! [Figure: example graph over X1 through X5.]
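A hedged sketch of the pseudolikelihood loss for a small binary log-linear MRF, under the same illustrative assumptions as before (agreement features, hypothetical function names). Each conditional P(Xi | X\i) is normalized over the two values of Xi only, so no sum over all joint assignments is needed.

```python
import numpy as np

def features(x, edges):
    """Hypothetical feature map: one 'endpoints agree' indicator per edge."""
    return np.array([1.0 if x[i] == x[j] else 0.0 for i, j in edges])

def mple_loss(theta, samples, n_vars, edges):
    """Average over samples of sum_i -log P_theta(x_i | x_rest)."""
    total = 0.0
    for x in samples:
        for i in range(n_vars):
            # Score the sample with x_i set to 0 and to 1; the rest stays fixed.
            scores = []
            for v in (0, 1):
                y = list(x)
                y[i] = v
                scores.append(theta @ features(y, edges))
            # Local normalizer has two terms, versus 2^n terms for the MLE loss.
            total += np.logaddexp(scores[0], scores[1]) - scores[x[i]]
    return total / len(samples)

edges = [(0, 1), (1, 2)]                      # a length-3 chain
samples = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]   # made-up data
print(mple_loss(np.zeros(len(edges)), samples, 3, edges))  # equals 3 * log(2) at theta = 0
```

Rescoring the whole sample for each value of Xi is wasteful (only factors touching Xi change), but it keeps the sketch short; each inner term is a logistic regression loss of Xi on the other variables.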

Pseudolikelihood (MPLE) (Besag, 1975) • Cons • Less statistically efficient than MLE (Liang & Jordan, 2008) • No PAC bounds • Pros • No intractable inference! • Consistent estimator PAC = Probably Approximately Correct (Valiant, 1984)

Sample Complexity: MLE Our Theorem: Bound on n (# training examples needed), stated in terms of: • probability of failure • parameter error (L1) • # parameters (length of θ) • Λmin: min eigenvalue of Hessian of loss at θ* Recall: Requires intractable inference.

Sample Complexity: MPLE Our Theorem: Bound on n (# training examples needed), stated in terms of: • probability of failure • parameter error (L1) • # parameters (length of θ) • Λmin: min over i of [ min eigenvalue of Hessian of component i at θ* ] PAC learnability for many MRFs! Recall: Tractable inference.

Sample Complexity: MPLE Our Theorem: Bound on n (# training examples needed) PAC learnability for many MRFs! • Related Work • Ravikumar et al. (2010) • Regression Yi~X with Ising models • Basis of our theory • Liang & Jordan (2008) • Asymptotic analysis of MLE, MPLE • Our bounds match theirs • Abbeel et al. (2006) • Only previous method with PAC bounds for high-treewidth MRFs • We extend their work: • Extension to CRFs, algorithmic improvements, analysis • Their method is very similar to MPLE.

Trade-offs: MLE & MPLE Our Theorem: Bound on n (# training examples needed) • MLE: Larger Λmin => Lower sample complexity; Higher computational complexity • MPLE: Smaller Λmin => Higher sample complexity; Lower computational complexity Sample complexity vs. computational complexity trade-off

Trade-offs: MPLE • Joint optimization for MPLE: Lower sample complexity • Disjoint optimization for MPLE: Data-parallel • 2 estimates of each shared factor parameter • Average estimates Sample complexity vs. parallelism trade-off [Figure: two-variable example over X1, X2.]

Synthetic CRFs • Structures: Chains, Stars, Grids • Factors: Random, Associative • Factor strength = strength of variable interactions

Predictive Power of Bounds Errors should be ordered: MLE < MPLE < MPLE-disjoint [Plot: L1 param error ε vs. # training examples for MLE, MPLE, and MPLE-disjoint. Length-4 chains. Factors: random, fixed strength.]

Predictive Power of Bounds MLE & MPLE Sample Complexity [Plot (panel shown: MLE): actual ε, with "better" and "harder" directions marked. Length-6 chains. Factors: random. 10,000 train exs.]

Failure Modes of MPLE How do Λmin(MLE) and Λmin(MPLE) vary for different models? Sample complexity: Model diameter Factor strength Node degree

Λmin: Model Diameter Chains. Factors: associative, fixed strength. Relative MPLE performance is independent of diameter in chains. (Same for random factors) [Plot: Λmin ratio MLE/MPLE (Higher = MLE better) vs. model diameter.]

Λmin: Factor Strength Length-8 chains. Factors: associative. MPLE performs poorly with strong factors. (Same for random factors, and star & grid models) [Plot: Λmin ratio MLE/MPLE (Higher = MLE better) vs. factor strength.]

Λmin: Node Degree Stars. Factors: associative, fixed strength. MPLE performs poorly with high-degree nodes. (Same for random factors) [Plot: Λmin ratio MLE/MPLE (Higher = MLE better) vs. node degree.]

Failure Modes of MPLE How do Λmin(MLE) and Λmin(MPLE) vary for different models? Sample complexity: Model diameter Factor strength Node degree We can often fix this!

Composite Likelihood (MCLE) MLE: Estimate P(Y) all at once

Composite Likelihood (MCLE) MLE: Estimate P(Y) all at once MPLE: Estimate P(Yi|Y\i) separately Yi

Composite Likelihood (MCLE) MLE: Estimate P(Y) all at once MPLE: Estimate P(Yi|Y\i) separately Something in between? Composite Likelihood (MCLE): Estimate P(YAi|Y\Ai) separately. (Lindsay, 1988) YAi
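One way to write the objective implied by the three cases above, as a sketch. The hat on the expectation (an average over training examples) and the exact notation are assumptions, not copied from the slide.

```latex
\ell_{\mathrm{CL}}(\theta)
  \;=\; \hat{\mathbb{E}}\Big[ -\sum_{i} \log P_\theta\big(Y_{A_i} \,\big|\, Y_{\setminus A_i}\big) \Big]
```

With a single component containing every variable this reduces to the MLE loss; with singleton components A_i = {i} it reduces to the MPLE loss.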

Composite Likelihood (MCLE) Generalizes MLE, MPLE; analogous: • Objective • Sample complexity • Joint & disjoint optimization MCLE Class: Node-disjoint subgraphs which cover graph.

Composite Likelihood (MCLE) MCLE Class: Node-disjoint subgraphs which cover graph. Example: Combs. Generalizes MLE, MPLE; analogous: • Objective • Sample complexity • Joint & disjoint optimization Choosing components: • Trees (tractable inference) • Follow structure of P(X) • Cover star structures • Cover strong factors • Choose large components

Structured MCLE on a Grid Grid. Associative factors. 10,000 train exs. Gibbs sampling. MCLE (combs) lowers sample complexity ...without increasing computation! MCLE tailored to model structure. Also in thesis: tailoring to correlations in data. [Plots: log loss ratio (other/MLE) vs. grid size |X|, and training time (sec) vs. grid size |X|, for MLE, MPLE, and MCLE (combs).]

Summary: Parameter Learning • Likelihood (MLE): sample complexity Optimal; parallel optimization Difficult; computational complexity High • Composite Likelihood (MCLE): sample complexity Low; parallel optimization Easy; computational complexity Low • Pseudolikelihood (MPLE): sample complexity High; parallel optimization Easy; computational complexity Low • Finite sample complexity bounds for general MRFs, CRFs • PAC learnability for certain classes • Empirical analysis • Guidelines for choosing MCLE structures: tailor to model, data

Outline • Scaling core methods • Structure Learning: learning tractable structures • Parameter Learning: learning without intractable inference • Both: solve via Parallel Regression • Parallel scaling • Parallel Regression: multicore sparse regression

CRF Structure Learning Structure learning: Choose YC I.e., learn conditional independence Evidence selection: Choose XD I.e., select X relevant to each YC X1: loud roommate? Y1: losing sleep? X2: taking classes? X3: deadline? Y3: sick? Y2: losing hair?

Related Work • Most similar to our work: • They focus on selecting treewidth-k structures. • We focus on the choice of edge weight.

Tree CRFs with Local Evidence Bradley, Guestrin (2010) Goal • Given: • Data • Local evidence (Xi relevant to each Yi) • Learn tree CRF structure (fast inference at test-time) • Via a scalable method

Chow-Liu for MRFs Chow & Liu (1968) Algorithm • Weight edges with mutual information [Figure: example graph over Y1, Y2, Y3.]

Chow-Liu for MRFs Chow & Liu (1968) Algorithm • Weight edges with mutual information • Choose max-weight spanning tree. Chow-Liu finds a max-likelihood structure. [Figure: example graph over Y1, Y2, Y3.]
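The two steps above can be sketched as follows: empirical mutual information for every pair of discrete variables, then a max-weight spanning tree. Kruskal's algorithm with union-find is one common choice for the second step and is an assumption here, as are the function names and the toy data.

```python
import itertools
import math
from collections import Counter

def mutual_information(data, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum((c / n) * math.log(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def chow_liu_tree(data, n_vars):
    """Max-weight spanning tree over variables, with edges weighted by MI."""
    weighted = sorted(((mutual_information(data, i, j), i, j)
                       for i, j in itertools.combinations(range(n_vars), 2)),
                      reverse=True)
    parent = list(range(n_vars))          # union-find forest

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:                       # edge joins two components: keep it
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy data: Y0 and Y1 always agree; Y2 is independent of both.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
tree = chow_liu_tree(data, 3)
print(tree)  # contains the edge (0, 1) plus one zero-weight edge to variable 2
```

With n variables the sketch scores n(n-1)/2 pairs, each from pairwise counts only, which is what makes the MRF version cheap.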

Chow-Liu for CRFs? Algorithm • Weight each possible edge • Choose max-weight spanning tree. What edge weight? It must be efficient to compute. Global Conditional Mutual Information (CMI): • Pro: Finds max-likelihood structure (with enough data) • Con: Intractable for large |X|
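To illustrate why the global CMI weight becomes intractable, here is a sketch of a plug-in estimate of I(Yi; Yj | X) for discrete data. Reading "Global CMI" as conditioning on the full input vector X is an assumption, and the function name and data are hypothetical. The estimate groups examples by the complete assignment to X, and the number of such groups can grow exponentially in |X|.

```python
import math
from collections import Counter, defaultdict

def conditional_mutual_information(ys_i, ys_j, xs):
    """Plug-in estimate of I(Yi; Yj | X) = sum_x p(x) * I(Yi; Yj | X = x)."""
    n = len(xs)
    groups = defaultdict(list)
    for a, b, x in zip(ys_i, ys_j, xs):
        groups[tuple(x)].append((a, b))    # one group per full assignment to X
    cmi = 0.0
    for pairs in groups.values():
        m = len(pairs)
        pab = Counter(pairs)
        pa = Counter(a for a, _ in pairs)
        pb = Counter(b for _, b in pairs)
        mi = sum((c / m) * math.log(c * m / (pa[a] * pb[b]))
                 for (a, b), c in pab.items())
        cmi += (m / n) * mi
    return cmi

# Made-up data: Yi and Yj agree when X = 0 and disagree when X = 1.
xs = [(0,), (0,), (1,), (1,)]
ys_i = [0, 1, 0, 1]
ys_j = [0, 1, 1, 0]
print(conditional_mutual_information(ys_i, ys_j, xs))  # equals log(2)
```

In this toy example Yi and Yj are marginally independent yet fully dependent given X, so conditioning on the evidence changes which edges look important.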

Generalized Edge Weights Global CMI Local Linear Entropy Scores (LLES): w(i,j) = linear combination of entropies over Yi,Yj,Xi,Xj Theorem No LLES can recover all tree CRFs (even with non-trivial parameters and exact entropies).
