1 / 19

Asynchronous Stochastic Quasi-Newton MCMC for Non-Convex Optimization

This paper explores the enhanced performance of the asynchronous Stochastic Quasi-Newton MCMC algorithm for non-convex optimization compared to traditional methods like synchronous L-BFGS. The algorithm leverages a combination of techniques including overlapping batches and robust versions for improved convergence and communication efficiency. By incorporating momentum and individual local memories, the algorithm achieves stability and efficient hessian updates. The algorithm demonstrates non-asymptotic bounds superior to synchronous counterparts and is comparable to asynchronous SGD in practice. Theoretical extensions for asynchronous SGD analysis and potential limitations due to dimension dependence are discussed.

galindo
Download Presentation

Asynchronous Stochastic Quasi-Newton MCMC for Non-Convex Optimization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Asynchronous Stochastic Quasi-Newton MCMC for Non-Convex Optimization Umut Simsekli, Cagatay Yildiz, Than Huy Nguyen, A. Taylan Cemgil, Gael Richard ;  Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4681-4690, 2018.

  2. Context • Objective: • First order methods: GD, SGD, VR variants, etc. • Second order methods: Newton, QN (L-BFGS, mb-L-BFGS), etc. • Parallelize: • Asynchronous SGD (or SVRG): Unstable (Wild!), often unscalable at large batch size, thus need too much communication practically. • Synchronous SGD (OSA, MB or Local SGD): diminishing returns on MB. • Can use QN methods with better bounds and more computation per node to win the communication v/s convergence trade off. • Hyper optimised implementations often a mixture of above two.

  3. mb-L-BFGS • The L-BFGS algorithm (Nocedal & Wright, 2006) is a limited-memory BATCH QN method. • Need to use a stochastic online version for reducing computation. • Berahas et al. (2016) proposed a parallel stochastic L-BFGS algorithm, called multi-batch L-BFGS (mb-L-BFGS), which is suitable for synchronous distributed architectures. (Synchronous L-BFGS!!) • It uses a different batch (deterministic???) at every iteration for QN (stable???). Better than MB, as batch sizes are large, winning the computation v/s communication trade-off. • To stabilize use overlapping batches (gradient consistency) and to make robust use non-random structure in the batches.

  4. mb-L-BFGS Performance • The bounds for synchronous L-BGFS are not easily interpretable, asymptotic and are given only for the finite horizon setting. On the other hand, the bounds are available for non-convex functions. • The performance of synchronous L-BGFS seems justifiably (w.r.t. extra computation) better and thus they can be used upon first order methods to improve performance. • A fault tolerant robust version was provided, which was also better than the normal mb-L-BFGS. • BUT, the bottom line is: Computation complexity of QN + communication + overlapping set calculation.

  5. Asynchronous L-BGFS? • mb-L-BFGS and its robust version are both synchronous parallel algorithms. • Synchronous vs asynchronous trade-off is equivalent to a “stability” vs “communication” (synchronization and coordination) trade-off. • The trade-off is still not very well understood, even for the first order methods! • Synchronous first order methods are the most widely used tools in deep learning tasks today. Infact, even the famous VR methods are not used very often for non-convex optimization. • To get as-L-BGFS we have two common directions, either the master/slave setting or the shared memory setting.

  6. as-L-BGFS (This paper) • MAP (regularized) objective and not MLE. • Reformulate as distribution sampling (conc. around optima). • Use SGMCMC to sample further (SGLD precisely). • Prove ergodic convergence to stationary distribution. • Major problems: • Inconsistent differences due to sub-sampling. • SPD of the hessian not ensured in expectation or o/w. • SGLD convergence can be shown to have nice properties for regular objectives (often convergence to neighbourhoods only). • Numerically unstable if simply SGLD (SGD with a smart noise) is used (2016).

  7. as-L-BGFS (This paper) • Introducing momentum/friction (delayed) solves the numerical instability problem. • Each worker has its own local L-BFGS memories since the master node would not be able to keep track of the gradient and iterate differences, which are received in an asynchronous manner. • To ensure that the hessian is SPD, they use an intricate update scheme (haven’t added the equations for brevity, similar to mb-L-BFGS). • The time complexity arises from the hessian-gradient product, which is O(Md) compared to previous O(Md^2) (2016 attempt).

  8. Under the constraints on regularity of the function (UB on variance, hessian, lipschitzness, etc.) and the stationary distribution (bounded second order etc.) the following result can be obtained. • O(1/sqrt(N)) rate can be obtained, which is optimal compared to state of the art results in asynchronous non-convex problems. • Due to exponential dependence in the dimension, the algorithm might get stuck onto local minima like other tempered algorithms (SGLD like).

  9. My opinion on this paper? • Finally, non-asymptotic bounds, tighter than the synchronous counterpart, more interpretable. • Bounds generalize to the less understood non-convex setting and QN convergence. • Algorithm performs better in practice than the synchronous counterpart and is even comparable to asynchronous SGD. • An interesting theoretical application of SG-MCMC algorithms: is it possible to extend such ideas to asynchronous SGD analysis? • It seems theoretical results are weak at the moment due to dimension dependence, but authors accept that limitation.

  10. Questions?

  11. Thank You

More Related