Frank-Wolfe optimization insights in machine learning
Simon Lacoste-Julien
INRIA / École Normale Supérieure, SIERRA Project Team
SMILE – November 4th 2013
Outline
• Frank-Wolfe optimization
• Frank-Wolfe for structured prediction
  • links with previous algorithms
  • block-coordinate extension
  • results for sequence prediction
• Herding as Frank-Wolfe optimization
  • extension: weighted Herding
  • simulations for quadrature
Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient)
• algorithm for constrained optimization: $\min_{x \in \mathcal{M}} f(x)$, where $f$ is convex & cts. differentiable and $\mathcal{M}$ is convex & compact
• FW algorithm – repeat:
  1) Find a good feasible direction by minimizing the linearization of $f$ at $x_t$: $s_t \in \arg\min_{s \in \mathcal{M}} \langle s, \nabla f(x_t) \rangle$
  2) Take a convex step in that direction: $x_{t+1} = (1-\gamma_t)\, x_t + \gamma_t\, s_t$, with $\gamma_t \in [0,1]$
• Properties: O(1/T) rate • sparse iterates • get duality gap for free • affine invariant • rate holds even if the linear subproblem is solved approximately
Frank-Wolfe: properties
• convex steps => the iterate is a sparse convex combination of the visited corners: $x_T = \sum_{t \le T} \gamma'_t\, s_t$
• get a duality gap certificate for free: $g(x_t) := \max_{s \in \mathcal{M}} \langle x_t - s, \nabla f(x_t) \rangle \ge f(x_t) - f(x^*)$ (special case of the Fenchel duality gap), and it also converges as O(1/T)!
• only need to solve the linear subproblem *approximately* (additive/multiplicative bound)
• affine invariant! [see Jaggi ICML 2013]
(the two steps are spelled out in the sketch below)
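To make steps 1) and 2) concrete, here is a minimal Frank-Wolfe sketch in Python for the toy case where $\mathcal{M}$ is the probability simplex, so the linear subproblem is solved by putting all the mass on the smallest gradient coordinate; the quadratic objective, tolerance and step-size schedule are illustrative choices, not taken from the slides.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iters=100, tol=1e-8):
    """Frank-Wolfe over the probability simplex.

    grad: function returning the gradient of a convex objective at x.
    The linear subproblem argmin_{s in simplex} <s, grad(x)> is solved
    by putting all the mass on the smallest gradient coordinate.
    """
    x = x0.copy()
    for t in range(n_iters):
        g = grad(x)
        # 1) linear minimization oracle over the simplex (returns a corner)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        # duality gap certificate: <x - s, grad(x)> >= f(x) - f(x*)
        gap = g @ (x - s)
        if gap < tol:
            break
        # 2) convex step with the standard gamma_t = 2/(t+2) schedule
        gamma = 2.0 / (t + 2.0)
        x = (1 - gamma) * x + gamma * s
    return x

# illustrative use: minimize ||x - c||^2 over the simplex
c = np.array([0.1, 0.5, 0.9])
x_opt = frank_wolfe_simplex(lambda x: 2 * (x - c), np.ones(3) / 3)
print(x_opt)
```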
Block-Coordinate Frank-Wolfe Optimization for Structural SVMs [ICML 2013]
Martin Jaggi, Simon Lacoste-Julien, Patrick Pletscher, Mark Schmidt
Structured SVM optimization
• structured prediction: learn a classifier $h_w(x) = \arg\max_{y \in \mathcal{Y}} \langle w, \phi(x, y) \rangle$ (decoding)
• structured hinge loss: $\tilde{H}_i(w) := \max_{y \in \mathcal{Y}_i} \big[ L_i(y) - \langle w, \psi_i(y) \rangle \big]$ with $\psi_i(y) := \phi(x_i, y_i) - \phi(x_i, y)$ -> loss-augmented decoding (vs. the binary hinge loss $\max(0,\, 1 - y_i \langle w, x_i \rangle)$); see the sketch below
• structured SVM primal: $\min_{w} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^n \tilde{H}_i(w)$
• structured SVM dual: $\min_{\alpha} \; \frac{\lambda}{2} \|A\alpha\|^2 - b^\top \alpha$ over a product of simplices, with primal-dual pair $w = A\alpha$ (columns of $A$: $\psi_i(y)/(\lambda n)$; entries of $b$: $L_i(y)/n$) -> exp. number of variables!
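A tiny sketch of the structured hinge loss for one example, computed here by brute-force enumeration over a small set of candidate outputs; the array names are illustrative, and in real structured problems the maximization is done by a loss-augmented decoding oracle instead of enumeration.

```python
import numpy as np

def structured_hinge_loss(w, phi_true, phi_candidates, losses):
    """max_y [ L_i(y) - <w, psi_i(y)> ] with psi_i(y) = phi(x_i, y_i) - phi(x_i, y).

    phi_true:       feature map phi(x_i, y_i) of the ground-truth output.
    phi_candidates: (K, d) array of feature maps phi(x_i, y) for K candidate outputs
                    (the ground truth should be among them, so the loss is >= 0).
    losses:         (K,) array of task losses L_i(y) for the same candidates.
    """
    scores = losses + phi_candidates @ w - phi_true @ w   # = L_i(y) - <w, psi_i(y)>
    return float(scores.max())
```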
Structured SVM optimization (2)
• popular approaches:
  • stochastic subgradient method [Ratliff et al. 07, Shalev-Shwartz et al. 10]
    • pros: online!
    • cons: sensitive to step-size; don't know when to stop
  • cutting plane method (SVMstruct) [Tsochantaridis et al. 05, Joachims et al. 09]
    • pros: automatic step-size; duality gap
    • cons: batch! -> slow for large n
• our approach: block-coordinate Frank-Wolfe on the dual -> combines the best of both worlds:
  • online!
  • automatic step-size via analytic line search
  • duality gap
  • rates also hold for approximate oracles
• rate comparison (stated as the error after K passes through the data): see the BCFW properties slide below
Frank-Wolfe for structured SVM
• apply FW to the structured SVM dual: $\min_{\alpha} \; \frac{\lambda}{2}\|A\alpha\|^2 - b^\top \alpha$ over the product of simplices
• key insight:
  1) Finding a good feasible direction by minimizing the linearization of the dual objective decomposes, via the primal-dual link $w = A\alpha$, into loss-augmented decoding on each example i
  2) Taking the convex step in that direction becomes a batch subgradient step on the primal; choose $\gamma$ by analytic line search on the quadratic dual
• link between FW and the subgradient method: see [Bach 12]
FW for structured SVM: properties
• running FW on the dual ≈ batch subgradient on the primal
  • but with an adaptive step-size from the analytic line search (closed form given below)
  • and a duality gap stopping criterion
• 'fully corrective' FW on the dual ≈ cutting plane alg. (SVMstruct)
  • still O(1/T) rate; but provides a simpler proof of SVMstruct convergence + approximate oracle guarantees
  • not faster than simple FW in our experiments
• BUT: still batch => slow for large n...
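For concreteness, here is the analytic line search spelled out for the dual form reconstructed above; this is a short derivation from that quadratic, not a formula quoted from the slides. Writing $w = A\alpha$, $\ell = b^\top\alpha$ for the current iterate and $w_s = As$, $\ell_s = b^\top s$ for the FW corner $s$,

```latex
f\big(\alpha + \gamma (s - \alpha)\big)
  \;=\; \tfrac{\lambda}{2}\,\big\| w + \gamma (w_s - w) \big\|^2 \;-\; \big( \ell + \gamma (\ell_s - \ell) \big)
\quad\Longrightarrow\quad
\gamma^\star \;=\; \operatorname{clip}_{[0,1]}
  \frac{\lambda \,\langle w - w_s,\, w \rangle \;-\; \ell \;+\; \ell_s}{\lambda \,\| w - w_s \|^2}.
```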
Block-Coordinate Frank-Wolfe (new!)
• for constrained optimization over a compact product domain: $\min_{\alpha \in \mathcal{M}^{(1)} \times \dots \times \mathcal{M}^{(n)}} f(\alpha)$
• pick i at random; update only block i with a FW step:
  1) $s_{(i)} \in \arg\min_{s \in \mathcal{M}^{(i)}} \langle s, \nabla_{(i)} f(\alpha) \rangle$
  2) $\alpha_{(i)} \leftarrow (1-\gamma)\, \alpha_{(i)} + \gamma\, s_{(i)}$
• for the structured SVM dual, the block-i linear subproblem is exactly loss-augmented decoding on example i
• Properties: O(1/T) rate • sparse iterates • duality gap guarantees • affine invariant • rate holds even if the linear subproblem is solved approximately
• we proved the same O(1/T) rate as batch FW
  -> but each step is n times cheaper
  -> and the constant can be the same (e.g. for the SVM)
BCFW for structured SVM: properties
• each update requires only 1 oracle call (vs. n for SVMstruct), so we get error $O\!\big(\tfrac{1}{\lambda n K}\big)$ after K passes through the data (vs. $O\!\big(\tfrac{1}{\lambda K}\big)$ for SVMstruct)
• advantages over stochastic subgradient:
  • step-sizes by line-search -> more robust
  • duality gap certificate -> know when to stop
  • guarantees hold for approximate oracles
• implementation: https://github.com/ppletscher/BCFWstruct (see also the sketch below)
  • almost as simple as the stochastic subgradient method
  • caveat: need to store one parameter vector per example (or store the dual variables)
• for the binary SVM -> reduces to the DCA method [Hsieh et al. 08]
• interesting link with prox SDCA [Shalev-Shwartz et al. 12]
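A minimal Python sketch of the block-coordinate update, kept in primal variables and following the dual form reconstructed above; `max_oracle` is an assumed user-supplied callback performing loss-augmented decoding, and the line-search step is derived from the quadratic dual (same derivation as the closed form shown earlier), so treat this as an illustration rather than the reference BCFWstruct implementation.

```python
import numpy as np

def bcfw_ssvm(max_oracle, n, d, lam, n_passes=10, seed=0):
    """Block-coordinate Frank-Wolfe for the structured SVM, in primal variables.

    max_oracle(i, w) -> (psi, loss): loss-augmented decoding for example i at
    weights w, returning psi_i(y*) = phi(x_i, y_i) - phi(x_i, y*) and L_i(y*).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                  # w = A alpha (global primal variable)
    w_blocks = np.zeros((n, d))      # w_i = A_i alpha_i, one vector per example
    ell = 0.0                        # ell = b^T alpha
    ell_blocks = np.zeros(n)

    for _ in range(n_passes):
        for _ in range(n):
            i = int(rng.integers(n))
            psi, loss = max_oracle(i, w)             # 1 oracle call per update
            w_s, ell_s = psi / (lam * n), loss / n   # FW corner of block i, in primal form
            # analytic line search on the quadratic dual:
            #   gamma* = argmin_g  lam/2 ||w + g (w_s - w_i)||^2 - (ell + g (ell_s - ell_i))
            diff = w_blocks[i] - w_s
            denom = lam * (diff @ diff)
            gamma = (lam * (diff @ w) - ell_blocks[i] + ell_s) / denom if denom > 0 else 1.0
            gamma = float(np.clip(gamma, 0.0, 1.0))
            # block update, then propagate the change to the global variables
            new_wi = (1 - gamma) * w_blocks[i] + gamma * w_s
            new_elli = (1 - gamma) * ell_blocks[i] + gamma * ell_s
            w += new_wi - w_blocks[i]
            ell += new_elli - ell_blocks[i]
            w_blocks[i], ell_blocks[i] = new_wi, new_elli
    return w
```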
More info about constants...
• batch FW rate: governed by the "curvature" constant of the objective
• BCFW rate: governed by the "product curvature" (the sum of the per-block curvatures) -> remove with line-search
• comparing the constants:
  • for the structured SVM: same constants
  • identity Hessian + cube constraint: no speed-up
Sidenote: weighted averaging
• it is standard to average the iterates of the stochastic subgradient method: uniform averaging $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$ vs. t-weighted averaging $\bar{w}_T \propto \sum_{t=1}^T t\, w_t$ [L.-J. et al. 12], [Shamir & Zhang 13] (explicit form below)
• weighted averaging improves the duality gap for BCFW
• it also makes a big difference in test error!
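One common normalization of the t-weighted scheme (weights proportional to t, as used in [L.-J. et al. 12]) together with its online update; the exact convention in the talk may differ slightly.

```latex
\bar{w}_T \;=\; \frac{2}{T(T+1)} \sum_{t=1}^{T} t\, w_t ,
\qquad\text{computed online as}\qquad
\bar{w}_t \;=\; \Big(1 - \tfrac{2}{t+1}\Big)\, \bar{w}_{t-1} \;+\; \tfrac{2}{t+1}\, w_t .
```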
Experiments
• OCR dataset (plots)
• CoNLL dataset (plots)
Surprising test error though!
• CoNLL dataset: the ranking of the methods by test error is flipped compared to their ranking by optimization error! (plots)
Conclusions for 1st part
• applying FW on the dual of the structured SVM:
  • unified previous algorithms
  • provided a line-search version of batch subgradient
• new block-coordinate variant of the Frank-Wolfe algorithm:
  • same convergence rate but with cheaper iteration cost
  • yields a robust & fast algorithm for the structured SVM
• future work:
  • caching tricks
  • non-uniform sampling
  • regularization path
  • explain the weighted-averaging test error mystery
On the Equivalence between Herding and Conditional Gradient Algorithms [ICML 2012]
Guillaume Obozinski, Simon Lacoste-Julien, Francis Bach
A motivation: quadrature
• approximating integrals: random sampling yields $O(1/\sqrt{T})$ error, while Herding [Welling 2009] yields $O(1/T)$ error! [Chen et al. 2010] (like quasi-MC)
• this part -> links herding with an optimization algorithm (conditional gradient / Frank-Wolfe)
  • suggests extensions - e.g. a weighted version with better convergence for quadrature
  • BUT the extensions are worse for learning???
  • -> yields interesting insights on the properties of herding...
Outline
• Background:
  • Herding
  • [Conditional gradient algorithm]
• Equivalence between herding & cond. gradient
• Extensions
• New rates & theorems
• Simulations:
  • approximation of integrals with cond. gradient variants
  • learned distribution vs. max entropy
Review of herding [Welling ICML 2009]
• motivation: learning in an MRF with feature map $\phi$
• standard pipeline: data -> moment matching -> parameter learning by (approximate) ML / max. entropy -> (approximate) inference by sampling -> samples
• herding: shortcut that maps the data moments directly to (pseudo-)samples
Herding updates
• zero-temperature limit of the log-likelihood, the 'Tipi' function (thanks to Max Welling for the picture): $\ell_0(w) = \langle w, \mu \rangle - \max_{x} \langle w, \phi(x) \rangle$, where $\mu$ is the vector of data moments
• herding updates = subgradient ascent updates on $\ell_0$ with step size 1: $x_{t+1} \in \arg\max_{x} \langle w_t, \phi(x) \rangle$, then $w_{t+1} = w_t + \mu - \phi(x_{t+1})$ (see the sketch below)
• Properties:
  1) weakly chaotic -> entropy?
  2) moment matching: $\big\| \tfrac{1}{T}\sum_{t=1}^{T} \phi(x_t) - \mu \big\| = O(1/T)$ -> our focus
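A minimal sketch of the two herding updates on a finite domain; the array names and the initialization are illustrative assumptions.

```python
import numpy as np

def herding(features, mu, n_samples):
    """Herding pseudo-samples on a finite domain.

    features: (num_states, d) array, row s is the feature map phi(s).
    mu:       (d,) target moment vector (e.g. empirical mean of phi on the data).
    The running average of phi over the returned pseudo-samples matches mu
    at rate O(1/T), vs. O(1/sqrt(T)) for i.i.d. sampling.
    """
    w = mu.copy()                          # initialization (w_0 = 0 also works)
    samples = []
    for _ in range(n_samples):
        s = int(np.argmax(features @ w))   # x_{t+1} = argmax_x <w_t, phi(x)>
        w += mu - features[s]              # w_{t+1} = w_t + mu - phi(x_{t+1})
        samples.append(s)
    return samples

# illustrative use: d independent bits with phi(x) = x on {0,1}^d
d = 3
states = np.array(np.meshgrid(*[[0, 1]] * d)).reshape(d, -1).T.astype(float)
idx = herding(states, mu=np.array([0.2, 0.5, 0.8]), n_samples=100)
print(states[idx].mean(axis=0))            # close to mu
```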
Approx. integrals in RKHS
• reproducing property in an RKHS $\mathcal{H}$ with feature map $\Phi$: $f(x) = \langle f, \Phi(x) \rangle_{\mathcal{H}}$
• define the mean map $\mu := \mathbb{E}_{x \sim p}[\Phi(x)]$, so that $\int f \, dp = \langle f, \mu \rangle_{\mathcal{H}}$
• want to approximate integrals of the form $\int f \, dp$
• use a weighted sum of evaluations to get an approximated mean: $\hat{\mu} = \sum_{t} w_t\, \Phi(x_t)$, so $\sum_t w_t f(x_t) = \langle f, \hat{\mu} \rangle_{\mathcal{H}}$
• the approximation error is then bounded by $\|f\|_{\mathcal{H}} \, \|\mu - \hat{\mu}\|_{\mathcal{H}}$ (derivation below)
• => controlling the moment discrepancy $\|\mu - \hat{\mu}\|_{\mathcal{H}}$ is enough to control the error of integrals for all $f \in \mathcal{H}$
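Spelling out the one-line argument behind the bound above (a standard Cauchy-Schwarz step, using the reproducing property and the definitions of $\mu$ and $\hat{\mu}$):

```latex
\Big| \int f \, dp \;-\; \sum_{t} w_t\, f(x_t) \Big|
 \;=\; \big| \langle f,\; \mu - \hat{\mu} \rangle_{\mathcal{H}} \big|
 \;\le\; \| f \|_{\mathcal{H}} \, \| \mu - \hat{\mu} \|_{\mathcal{H}} .
```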
Conditional gradient algorithm (aka Frank-Wolfe)
• algorithm to optimize $\min_{g \in \mathcal{M}} J(g)$, with $J$ convex & (twice) cts. differentiable and $\mathcal{M}$ convex & compact
• repeat:
  1) Find a good feasible direction by minimizing the linearization of J: $\bar{g}_{t+1} \in \arg\min_{g \in \mathcal{M}} \langle g, J'(g_t) \rangle$
  2) Take a convex step in that direction: $g_{t+1} = (1-\rho_t)\, g_t + \rho_t\, \bar{g}_{t+1}$
-> converges in O(1/T) in general
Herding & cond. grad. are equiv.
• trick: look at conditional gradient on the dummy objective $J(g) = \tfrac{1}{2}\|g - \mu\|^2_{\mathcal{H}}$ over the marginal polytope $\mathcal{M} = \mathrm{conv}\{\Phi(x)\}$
• with the step-size $\rho_t = 1/(t+1)$ and the change of variable $w_t \propto \mu - g_t$, the cond. grad. updates are exactly the herding updates (a short check is given below)
• subgradient ascent and conditional gradient are Fenchel duals of each other! (see also [Bach 2012])
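A short check of the correspondence under the setup above, taking $w_0 = 0$ for simplicity (the indexing convention is mine; the slide's may differ slightly):

```latex
\bar{g}_{t+1} \in \arg\min_{g \in \mathcal{M}} \langle g,\, g_t - \mu \rangle
            \;=\; \Phi(x_{t+1}), \quad x_{t+1} \in \arg\max_{x} \langle \mu - g_t,\, \Phi(x) \rangle ,
\qquad
g_{t+1} = \big(1 - \tfrac{1}{t+1}\big)\, g_t + \tfrac{1}{t+1}\, \Phi(x_{t+1}) .
```

With $\rho_t = 1/(t+1)$ the iterate stays the running average $g_t = \tfrac{1}{t}\sum_{s \le t}\Phi(x_s)$, and setting $w_t := t\,(\mu - g_t)$ gives $x_{t+1} \in \arg\max_x \langle w_t, \Phi(x)\rangle$ and $w_{t+1} = w_t + \mu - \Phi(x_{t+1})$, i.e. exactly the herding updates.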
Extensions of herding
• more general step-sizes $\rho_t$ -> give a weighted sum $g_T = \sum_{t=1}^T w_t\, \Phi(x_t)$ instead of a uniform average
• two extensions:
  1) line search for $\rho_t$ (closed form below)
  2) min-norm point algorithm (min J(g) on the convex hull of the previously visited points)
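For extension 1), the line search has a closed form because J is quadratic; minimizing $J\big(g_t + \rho\,(\bar{g}_{t+1} - g_t)\big)$ over $\rho \in [0,1]$ (a short derivation under the setup above, not quoted from the slides) gives

```latex
\rho_t^{\star} \;=\; \operatorname{clip}_{[0,1]}
  \frac{\langle g_t - \mu,\; g_t - \bar{g}_{t+1} \rangle_{\mathcal{H}}}
       {\| g_t - \bar{g}_{t+1} \|^2_{\mathcal{H}}} .
```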
Rates of convergence & thms.
• no assumption: cond. grad. yields* $\|g_T - \mu\|_{\mathcal{H}} = O(1/\sqrt{T})$
• if we assume $\mu$ is in the rel. int. of $\mathcal{M}$ with radius $r > 0$:
  • [Chen et al. 2010] yields $O(1/T)$ for herding
  • whereas the line-search version yields a faster (linear) rate [Guélat & Marcotte 1986, Beck & Teboulle 2004]
• Propositions 1) & 2): this interiority assumption fails in the infinite-dimensional setting (i.e. the argument of [Chen et al. 2010] doesn't hold!)
Simulation 1: approx. integrals
• kernel herding on $\mathcal{X} = [0,1]$
• use the RKHS given by the Bernoulli polynomial kernel (infinite dimensional; closed form available)
Simulation 2: max entropy?
• learning independent bits
• (plots: error on the moments; error on the distribution)
Conclusions for 2nd part
• equivalence of herding and cond. gradient:
  -> yields better algorithms for quadrature based on moments
  -> but highlights the max entropy / moment matching tradeoff!
• other interesting points:
  • setting up fake optimization problems -> harvest the properties of known algorithms
  • the conditional gradient algorithm is useful to know...
  • the duality of subgradient & cond. gradient is more general
• recent related work:
  • link with Bayesian quadrature [Huszár & Duvenaud UAI 2012]
  • herded Gibbs sampling [Bornn et al. ICLR 2013]