Scalable training of L1-regularized log-linear models Galen Andrew (Joint work with Jianfeng Gao) ICML, 2007
Minimizing regularized loss • Many parametric ML models are trained by minimizing a regularized loss of the form f(w) = ℓ(w) + C · r(w) • ℓ is a loss function quantifying “fit to the data” • Negative log-likelihood of training data • Distance from decision boundary of incorrect examples • If zero is a reasonable “default” parameter value, we can use r(w) = ‖w‖, where ‖·‖ is a norm, penalizing large vectors, and C is a constant
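As a concrete sketch of this general form (assuming an arbitrary loss callable; the names `loss_fn`, `C`, and the toy loss below are illustrative, not from the slides):

```python
import numpy as np

def regularized_objective(w, loss_fn, C=1.0):
    """The general form on this slide, f(w) = loss(w) + C * r(w),
    here with the L1 norm as the penalty r(w) = ||w||_1."""
    return loss_fn(w) + C * np.sum(np.abs(w))

# Toy usage with a made-up squared-error loss over 3 parameters
loss_fn = lambda w: np.sum((w - np.array([0.5, -0.2, 0.0])) ** 2)
print(regularized_objective(np.zeros(3), loss_fn, C=0.1))
```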
Types of norms • A norm precisely defines the “size” of a vector • [Figure: contours of the L2-norm in 2D vs. contours of the L1-norm in 2D]
A nice property of L1 • Gradients of the L2- and L1-norm • [Figure] “Negative gradient” of the L1-norm (direction of steepest descent) points toward the coordinate axes • Negative gradient of the L2-norm always points directly toward 0
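A small numpy check of the two gradient directions (the point `w` is an arbitrary illustrative choice):

```python
import numpy as np

w = np.array([0.3, -2.0])          # arbitrary nonzero point in 2D

grad_l2 = w / np.linalg.norm(w)    # gradient of ||w||_2; its negative points
                                   # straight back toward the origin
grad_l1 = np.sign(w)               # gradient of ||w||_1; its negative shrinks every
                                   # coordinate at the same rate, so the smallest
                                   # coordinate hits zero (a coordinate axis) first
print(-grad_l2, -grad_l1)          # approx. [-0.15, 0.99]  vs.  [-1., 1.]
```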
A nice property of L1 • 1-D slice of an L1-regularized objective • [Figure] Sharp bend causes optimal value at x = 0
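A worked 1-D example (with an illustrative quadratic loss, not taken from the slides) shows why the bend pins the minimizer at exactly zero:

$$
\min_x \; \tfrac{1}{2}(x - a)^2 + C\,|x|
\quad\Longrightarrow\quad
x^{*} = \operatorname{sign}(a)\,\max(|a| - C,\, 0)
$$

So whenever |a| ≤ C the optimum is exactly x* = 0, whereas the L2 analogue ½(x − a)² + Cx² gives x* = a / (1 + 2C): small, but never exactly zero.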
A nice property of L1 • At the global optimum, many parameters have value exactly zero • L2 would give small, nonzero values • Thus L1 does continuous feature selection • More interpretable, computationally manageable models • The C parameter tunes the sparsity/accuracy tradeoff • In our experiments, only 1.5% of features remain
A nasty property of L1 • The sharp bend at zero is also a problem: • Objective is non-differentiable • Cannot solve with standard gradient-based methods • [Figure] Non-differentiable at sharp bend (gradient undefined)
Digression: Newton’s method • To optimize a function f: • Form the 2nd-order Taylor expansion around x0: f(x) ≈ f(x0) + ∇f(x0)ᵀ(x − x0) + ½(x − x0)ᵀH(x − x0) • Jump to its minimum: xnew = x0 − H⁻¹∇f(x0) (Actually, line search in the direction of xnew) • Repeat • Sort of an ideal • In practice, H is too large ((# vars)² entries)
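A minimal sketch of this Newton iteration (the callables `grad` and `hess` and the toy quadratic are assumptions for illustration):

```python
import numpy as np

def newton_minimize(grad, hess, x0, iters=20, tol=1e-8):
    """Newton's method: repeatedly jump to the minimum of the local
    2nd-order Taylor model, x_new = x - H^{-1} grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(H, g)   # (a practical solver would line-search here)
    return x

# Toy usage: minimize f(x) = 1/2 x^T A x - b^T x  (converges in one Newton step)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(newton_minimize(lambda x: A @ x - b, lambda x: A, np.zeros(2)))
```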
Limited-Memory Quasi-Newton • Approximate H⁻¹ with a low-rank matrix built using information from recent iterations • Approximate H⁻¹ and not H, so no need to invert the matrix or solve a linear system! • Most popular L-M Q-N method: L-BFGS • Storage and computation are O(# vars) • Very good theoretical convergence properties • Empirically, the best method for training large-scale log-linear models with L2 (Malouf ’02, Minka ’03)
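For a sense of what this looks like in practice, the sketch below trains a small L2-regularized log-linear (logistic) model with SciPy's L-BFGS implementation and a memory of 5 pairs (the memory setting used in the experiments later); the toy data and names are made up for illustration, not the paper's setup:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                  # toy feature matrix
y = np.sign(X @ rng.normal(size=50) + 0.1)      # toy labels in {-1, +1}
C = 1.0                                         # regularization constant

def l2_objective(w):
    """Negative log-likelihood of a logistic model plus C * ||w||_2^2."""
    margins = y * (X @ w)
    nll = np.sum(np.logaddexp(0.0, -margins))
    grad = -X.T @ (y * expit(-margins))
    return nll + C * (w @ w), grad + 2.0 * C * w

res = minimize(l2_objective, np.zeros(X.shape[1]), jac=True,
               method="L-BFGS-B", options={"maxcor": 5})
print(res.fun, np.count_nonzero(np.abs(res.x) > 1e-8))  # L2 leaves nearly all weights nonzero
```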
Orthant-Wise Limited-memory Quasi-Newton algorithm • Our algorithm (OWL-QN) uses the fact that the L1 term is differentiable on any given orthant • In fact, it is linear there, so it doesn’t affect the Hessian
OWL-QN (cont.) • For a given orthant, defined by a sign vector ξ ∈ {−1, 0, 1}ⁿ, the objective can be written f(w) = ℓ(w) + C · ξᵀw • The Hessian of f is determined by the loss alone; C · ξᵀw is a linear function of w, so its Hessian = 0 • Can use the gradient of the loss at previous iterations to estimate the Hessian of the objective on any orthant • Constrain steps to not cross orthant boundaries
OWL-QN (cont.) • Choose an orthant • Find a Quasi-Newton quadratic approximation to the objective on the orthant • Jump to the minimum of the quadratic (Actually, line search in the direction of the minimum) • Project back onto the orthant • Repeat steps 1-4 until convergence
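A simplified sketch of one such iteration: it uses OWL-QN's pseudo-gradient and orthant projection, but plain scaled steepest descent stands in for the L-BFGS quadratic model, and the backtracking line search and toy problem are illustrative simplifications, not the paper's implementation:

```python
import numpy as np

def pseudo_gradient(w, g, C):
    """Pseudo-gradient of loss(w) + C*||w||_1 defining the steepest-descent
    direction; g is the gradient of the smooth loss alone."""
    pg = np.where(w > 0, g + C, np.where(w < 0, g - C, 0.0))
    at_zero = (w == 0)
    pg = np.where(at_zero & (g + C < 0), g + C, pg)  # profitable to move right
    pg = np.where(at_zero & (g - C > 0), g - C, pg)  # profitable to move left
    return pg

def orthant_step(w, loss_and_grad, C, step0=1.0, shrink=0.5, max_tries=30):
    """One orthant-wise descent step with a projected backtracking line search."""
    loss, g = loss_and_grad(w)
    f = loss + C * np.abs(w).sum()
    d = -pseudo_gradient(w, g, C)                    # descent direction
    xi = np.where(w != 0, np.sign(w), np.sign(d))    # chosen orthant
    step = step0
    for _ in range(max_tries):
        w_new = w + step * d
        w_new = np.where(np.sign(w_new) == xi, w_new, 0.0)  # project onto orthant
        loss_new, _ = loss_and_grad(w_new)
        if loss_new + C * np.abs(w_new).sum() < f:
            return w_new
        step *= shrink
    return w

# Toy usage: L1-regularized least squares (made-up data)
A = np.array([[1.0, 0.2], [0.1, 1.0], [0.3, 0.4]])
b = np.array([1.0, -0.5, 0.2])
lg = lambda w: (0.5 * np.sum((A @ w - b) ** 2), A.T @ (A @ w - b))
w = np.zeros(2)
for _ in range(200):
    w = orthant_step(w, lg, C=0.3)
print(w)   # some coordinates typically end up exactly zero
```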
Choosing an orthant to explore • We use the orthant… • in which the current point sits • into which the direction of steepest descent points (Computing the direction of steepest descent given the gradient of the loss is easy; see the paper for details.)
Toy example • One iteration of OWL-QN: • Find the vector of steepest descent • Choose an orthant • Find the L-BFGS quadratic approximation • Jump to its minimum • Project back onto the orthant • Update the Hessian approximation using the gradient of the loss alone
Notes • Variables are added to or removed from the model as orthant boundaries are hit • A variable can change sign over two iterations • Glossing over some details: • Line search with projection at each iteration • Convenient for implementation to expand the notion of “orthant” to constrain some variables at zero • See the paper for complete details • In the paper we prove convergence to the optimum
Experiments • We ran experiments with the parse re-ranking model of Charniak & Johnson (2005) • Start with a set of candidate parses for each sentence (produced by a baseline parser) • Train a log-linear model to select the correct one • The model uses ~1.2M features of a parse • Train on Sections 2-19 of the PTB (36K sentences with 50 parses each) • Tune C to maximize F-measure on Sections 20-21 (4K sentences)
Training methods compared • Compared OWL-QN with three other methods: • Kazama & Tsujii’s (2003) paired-variable formulation for L1, implemented with AlgLib’s L-BFGS-B • L2 with our own implementation of L-BFGS (on which OWL-QN is based) • L2 with AlgLib’s implementation of L-BFGS • K&T turns L1 training into a constrained differentiable problem by doubling the variables • Similar to Goodman’s 2004 method, but with L-BFGS-B instead of GIS
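The paired-variable trick referred to here splits each weight as w = w⁺ − w⁻ with w⁺, w⁻ ≥ 0, so the L1 penalty becomes a smooth linear term under bound constraints (my paraphrase of the standard construction):

$$
\min_{w^{+} \ge 0,\; w^{-} \ge 0} \;\; \ell(w^{+} - w^{-}) \;+\; C \sum_i \bigl( w_i^{+} + w_i^{-} \bigr)
$$

The objective is now differentiable, at the cost of doubling the number of variables and adding bound constraints, which is why the bound-constrained variant L-BFGS-B is used.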
Comparison Methodology • For each problem (L1 and L2): • Run both algorithms until the value is nearly constant • Report time to reach within 1% of the best value • We also report the number of function evaluations • Implementation-independent comparison • Function evaluation dominates runtime • Results reported with the chosen value of C • L-BFGS memory parameter = 5 for all runs
Notes: • Our L-BFGS and AlgLib’s are comparable, so comparing OWL-QN against K&T with AlgLib is a fair comparison • In terms of both function evaluations and raw time, OWL-QN is orders of magnitude faster than K&T • The most expensive step of OWL-QN is computing the L-BFGS direction (not the projections, computing the steepest-descent vector, etc.) • Optimizing the L1 objective with OWL-QN is twice as fast as optimizing L2 with L-BFGS
Objective value during training • [Plot comparing L1 with OWL-QN, L2 with our L-BFGS, L1 with Kazama & Tsujii, and L2 with AlgLib’s L-BFGS]
Sparsity during training • Both algorithms start with ~5% of features, then gradually prune them away • At the second iteration, OWL-QN removes many features, then replaces them with the opposite sign • [Plot: OWL-QN vs. Kazama & Tsujii]
Extensions • For the ACL paper, we ran on 3 very different log-linear NLP models with up to 8M features • CMM sequence model for POS tagging • Reranking log-linear model for LM adaptation • Semi-CRF for Chinese word segmentation • Can use any smooth convex loss • We’ve also tried least-squares (LASSO regression) • A small change allows a non-convex loss • Only a local minimum is guaranteed
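As an illustration of swapping in a different smooth convex loss, here is a least-squares loss-and-gradient pair (hypothetical names); combined with the same L1 machinery it gives LASSO regression:

```python
import numpy as np

def squared_loss_and_grad(w, X, y):
    """Least-squares loss 1/2 * ||Xw - y||^2 and its gradient; using this in place
    of the negative log-likelihood yields L1-regularized least squares (LASSO)."""
    r = X @ w - y
    return 0.5 * np.sum(r ** 2), X.T @ r
```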
Software download • We’ve released the C++ OWL-QN source • The user can specify an arbitrary convex smooth loss • Also included are standalone trainers for L1-regularized logistic regression and least squares (LASSO) • Please visit my webpage for the download • (Find with the search engine of your choice)