
A Stochastic Quasi-Newton Method for Large-Scale Learning
Jorge Nocedal, Northwestern University, with S. Hansen, R. Byrd and Y. Singer. IPAM, UCLA, Feb 2014.


Presentation Transcript


  1. A Stochastic Quasi-Newton Method for Large-Scale Learning. Jorge Nocedal, Northwestern University, with S. Hansen, R. Byrd and Y. Singer. IPAM, UCLA, Feb 2014

  2. Goal Propose a robust quasi-Newton method that operates in the stochastic approximation regime • purely stochastic method (not batch), to compete with the stochastic gradient (SG) method • full non-diagonal Hessian approximation • scalable to millions of parameters

  3. Outline Are iterations of the following form viable? - theoretical considerations; iteration costs - differencing noisy gradients? Key ideas: compute curvature information pointwise at regular intervals; build on the strength of BFGS updating, recalling that it is an overwriting (and not an averaging) process - results on text and speech problems - examine both training and testing errors

  4. Problem • Applications • Simulation optimization • Machine learning The algorithm is not (yet) applicable to simulation-based optimization

  5. Stochastic gradient method For the loss function, the Robbins-Monro or stochastic gradient method takes steps using a stochastic gradient (estimator) computed on a mini-batch
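For orientation, here is a minimal Python sketch of the mini-batch stochastic gradient (Robbins-Monro) iteration that SQN is meant to compete with. The names grad_f, data and alpha are placeholders for a per-example gradient, the training set, and a steplength schedule; this is an illustration, not code from the talk.

```python
import numpy as np

def sgd(w0, grad_f, data, alpha, b=50, iters=1000, seed=0):
    """Mini-batch stochastic gradient (Robbins-Monro) iteration:
    w_{k+1} = w_k - alpha(k) * (1/b) * sum of grad_f over a sampled mini-batch."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    N = len(data)
    for k in range(iters):
        batch = rng.choice(N, size=b, replace=False)                  # sample a mini-batch S_k
        g = np.mean([grad_f(w, data[i]) for i in batch], axis=0)      # stochastic gradient estimator
        w -= alpha(k) * g                                             # step with steplength alpha_k
    return w
```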

  6. Why it won’t work … Is there any reason to think that including a Hessian approximation will improve upon the stochastic gradient method? • Iteration costs are so high that even if the method is faster than SG in terms of training costs, it will be a weaker learner

  7. Theoretical Considerations Number of iterations needed to compute an epsilon-accurate solution: • for the stochastic gradient method, it depends on the condition number of the Hessian at the true solution • a Newton-like method completely removes the dependency on the condition number (Murata 98); cf. Bottou-Bousquet • its constant depends instead on the Hessian at the true solution and the gradient covariance matrix

  8. Computational cost • Assuming we obtain the efficiencies of classical quasi-Newton methods in limited memory form • Each iteration requires 4Md operations • M = memory in the limited memory implementation; M = 5 • d = dimension of the optimization problem (compare with the cost of the stochastic gradient method)

  9. Mini-batching • assuming a mini-batch b = 50 • cost of stochastic gradient = 50d • Use of small mini-batches will be a game-changer • b = 10, 50, 100 (see the worked comparison below)
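Putting the counts from slides 8 and 9 together (illustrative arithmetic only, per iteration):

```latex
\begin{align*}
\text{SG step with } b = 50:&\quad b\,d = 50d \\
\text{L-BFGS overhead with } M = 5:&\quad 4Md = 20d \\
\text{relative overhead}:&\quad \frac{4Md}{b\,d} = \frac{20d}{50d} = 0.4
\end{align*}
```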

  10. Game changer? Not quite… • Mini-batching makes operation counts favorable but does not resolve challenges related to noise • Avoid differencing noise • Curvature estimates cannot suffer from sporadic spikes in noise (Schraudolph et al. (99), Ribeiro et al. (2013)) • Quasi-Newton updating is an overwriting process, not an averaging process • Control the quality of curvature information • Cost of curvature computation • Use of small mini-batches will be a game-changer • b = 10, 50, 100

  11. Design of the Stochastic Quasi-Newton Method • Propose a method based on the famous BFGS formula • all components seem to fit together well • numerical performance appears to be strong • Propose a new quasi-Newton updating formula • specifically designed to deal with noisy gradients • work in progress

  12. Review of the deterministic BFGS method

  13. The remarkable properties of the BFGS method (convex case) Superlinear convergence; global convergence for strongly convex problems; self-correction properties. Only need to approximate the Hessian in a subspace (Powell 76, Byrd-N 89). A sketch of the classical update follows below.
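Since the equations of the review slide are not transcribed, here is a minimal sketch of the deterministic BFGS iteration the stochastic method builds on. The steplength alpha is assumed to come from a line search; grad is a user-supplied gradient function.

```python
import numpy as np

def bfgs_update(H, s, y):
    """Classical BFGS update of the inverse Hessian approximation H:
    H_{k+1} = (I - rho s y^T) H (I - rho y s^T) + rho s s^T,  rho = 1 / (y^T s)."""
    rho = 1.0 / y.dot(s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def bfgs_step(w, grad, H, alpha):
    """One deterministic BFGS iteration: d = -H grad(w), then a (line-search) step,
    then overwrite H with the curvature pair (s, y)."""
    g = grad(w)
    d = -H @ g
    w_new = w + alpha * d                 # in practice alpha comes from a Wolfe line search
    s, y = w_new - w, grad(w_new) - g
    return w_new, bfgs_update(H, s, y)
```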

  14. Adaptation to the stochastic setting Cannot mimic the classical approach and update after each iteration: since the batch size b is small, this would yield highly noisy curvature estimates. Instead: use a collection of iterates to define the correction pairs

  15. Stochastic BFGS: Approach 1 Define two collections of size L; define the average iterate and average gradient over each collection; form the new curvature pair from their differences (see the sketch below)
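The formulas on this slide are not transcribed, so the following is one plausible rendering of Approach 1 as described: average the iterates and the stochastic gradients over two consecutive windows of length L and difference the averages.

```python
import numpy as np

def curvature_pair_from_windows(xs_prev, gs_prev, xs_curr, gs_curr):
    """Approach 1 (as read from the slide): average iterates and stochastic
    gradients over two consecutive windows of length L, and take the new
    curvature pair as the difference of those averages."""
    x_bar_prev, g_bar_prev = np.mean(xs_prev, axis=0), np.mean(gs_prev, axis=0)
    x_bar_curr, g_bar_curr = np.mean(xs_curr, axis=0), np.mean(gs_curr, axis=0)
    s = x_bar_curr - x_bar_prev      # averaged iterate difference
    y = g_bar_curr - g_bar_prev      # averaged (still noisy) gradient difference
    return s, y
```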

  16. Stochastic L-BFGS: First Approach

  17. Two sources of error (Stochastic BFGS: Approach 1) • Sample variance • Lack of sample uniformity • Initial reaction: control the quality of the average gradients; use of sample variance … dynamic sampling. We could not make this work in a robust manner! • Proposed solution: control the quality of the curvature estimate y directly

  18. Key idea: avoid differencing • Standard definition • arises from • Hessian-vector products are often available. Define the curvature vector y for L-BFGS via a Hessian-vector product, performed only every L iterations
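A small sketch of this key idea under the stated reading: instead of differencing noisy gradients, obtain y by applying a sub-sampled Hessian at the averaged iterate to s, every L iterations. hess_vec is a user-supplied Hessian-vector product routine and hess_batch the sample it uses; both names are placeholders.

```python
def curvature_pair_hvp(x_bar_prev, x_bar_curr, hess_vec, hess_batch):
    """Curvature pair via a (sub-sampled) Hessian-vector product, computed
    only every L iterations:
        s = x_bar_curr - x_bar_prev,   y = Hessian(x_bar_curr; hess_batch) @ s."""
    s = x_bar_curr - x_bar_prev
    y = hess_vec(x_bar_curr, s, hess_batch)   # no gradient differencing involved
    return s, y
```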

  19. Structure of the Hessian-vector product • Code the Hessian-vector product directly • Achieve sample uniformity automatically (cf. Schraudolph) • Avoid numerical problems when ||s|| is small • Control the cost of the y computation • Mini-batch stochastic gradient
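As one concrete instance (the slides do not pin down the loss), a sub-sampled Hessian-vector product for l2-regularized binary logistic regression can be coded with two matrix-vector products, never forming the Hessian explicitly. The matrix A here holds the bH sampled rows; lam is an assumed regularization parameter.

```python
import numpy as np

def logistic_hess_vec(w, v, A, lam=0.0):
    """Sub-sampled Hessian-vector product for binary logistic regression:
        H v = (1 / b_H) A^T diag(p (1 - p)) A v + lam * v,   p = sigmoid(A w).
    Costs two matrix-vector multiplies with the Hessian mini-batch A."""
    z = A @ w
    p = 1.0 / (1.0 + np.exp(-z))
    d = p * (1.0 - p)                       # per-sample curvature weights
    return A.T @ (d * (A @ v)) / A.shape[0] + lam * v
```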

  20. The Proposed Algorithm

  21. Algorithmic Parameters • b: stochastic gradient batch size • bH: Hessian-vector batch size • L: controls frequency of quasi-Newton updating • M: memory parameter in L-BFGS updating (M = 5); use the limited memory form. A sketch of the full iteration follows below.
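To make slides 14-21 concrete, here is a minimal Python sketch of a stochastic quasi-Newton loop of the kind described: SGD-style steps along -H_k g_k, curvature pairs formed every L iterations from averaged iterates and a sub-sampled Hessian-vector product, and H_k applied through the standard L-BFGS two-loop recursion. grad_batch, hess_vec and alpha are user-supplied placeholders, and the positive-curvature skip is a safeguard I add for illustration; this is not the authors' reference implementation.

```python
import numpy as np
from collections import deque

def two_loop(g, pairs):
    """Standard L-BFGS two-loop recursion: returns H_k g from stored (s, y) pairs."""
    q = g.copy()
    alphas = []
    for s, y in reversed(pairs):                      # newest pair first
        rho = 1.0 / y.dot(s)
        a = rho * s.dot(q)
        q -= a * y
        alphas.append((a, rho, s, y))
    s_last, y_last = pairs[-1]
    q *= s_last.dot(y_last) / y_last.dot(y_last)      # initial scaling H^0 = gamma I
    for a, rho, s, y in reversed(alphas):             # oldest pair first
        b = rho * y.dot(q)
        q += (a - b) * s
    return q

def sqn(w, grad_batch, hess_vec, N, alpha, b=50, bH=1000, L=20, M=5, iters=10000, seed=0):
    """Sketch of the stochastic quasi-Newton scheme described on the slides
    (parameter names follow slide 21)."""
    rng = np.random.default_rng(seed)
    pairs = deque(maxlen=M)                           # limited-memory (s, y) storage
    x_sum, x_bar_prev = np.zeros_like(w), None
    for k in range(1, iters + 1):
        S = rng.choice(N, size=b, replace=False)
        g = grad_batch(w, S)                          # mini-batch stochastic gradient
        d = -two_loop(g, list(pairs)) if pairs else -g
        w = w + alpha(k) * d
        x_sum += w
        if k % L == 0:                                # curvature update every L iterations
            x_bar = x_sum / L
            if x_bar_prev is not None:
                s = x_bar - x_bar_prev
                SH = rng.choice(N, size=bH, replace=False)
                y = hess_vec(x_bar, s, SH)            # sub-sampled Hessian-vector product
                if y.dot(s) > 0:                      # keep only positive-curvature pairs
                    pairs.append((s, y))
            x_bar_prev, x_sum = x_bar, np.zeros_like(w)
    return w
```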

  22. Need the Hessian to implement a quasi-Newton method? Are you out of your mind? We don’t need the Hessian-vector product, but it has many advantages: complete freedom in sampling and accuracy

  23. Numerical Tests Stochastic gradient method (SGD) vs. stochastic quasi-Newton method (SQN). It is well known that SGD is highly sensitive to the choice of steplength, and so is the SQN method (though perhaps less so)

  24. RCV1 Problem: n = 112919, N = 688329 • b = 50, 300, 1000 • M = 5, L = 20, bH = 1000 [Figure: SGD vs. SQN; horizontal axis counts accessed data points, including Hessian-vector products]

  25. Speech Problem: n = 30315, N = 191607 • b = 100, 500; M = 5, L = 20, bH = 1000 [Figure: SGD vs. SQN]

  26. Varying the Hessian batch size bH: RCV1, b = 300

  27. Varying memory size M in limited memory BFGS: RCV1

  28. Varying L-BFGS Memory Size: Synthetic problem

  29. Generalization Error: RCV1 Problem [Figure: SQN vs. SGD test error]

  30. Test Problems • Synthetically generated logistic regression (Singer et al.) • n = 50, N = 7000 • Training data: • RCV1 dataset • n = 112919, N = 688329 • Training data: • SPEECH dataset • NF = 235, |C| = 129 • n = NF x |C| = 30315, N = 191607 • Training data:

  31. Iteration Costs • SGD: mini-batch stochastic gradient • SQN: mini-batch stochastic gradient + Hessian-vector product every L iterations + matrix-vector product

  32. Iteration Costs • Parameter ranges: b = 50-1000, bH = 100-1000, L = 10-20, M = 3-20 • Typical parameter values: b = 300, bH = 1000, L = 20, M = 5 • Cost per iteration with typical values: SGD 300n, SQN 370n
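As a consistency check, plugging the typical parameter values into the work-per-iteration expression given on slide 35 reproduces the quoted counts:

```latex
\begin{align*}
\text{SGD work/iter}:&\quad b\,n = 300n \\
\text{SQN work/iter}:&\quad b\,n + \frac{b_H\,n}{L} + 4Mn
  = 300n + \frac{1000\,n}{20} + 4 \cdot 5\,n = 300n + 50n + 20n = 370n
\end{align*}
```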

  33. Hasn’t this been done before? • Hessian-free Newton method: Martens (2010), Byrd et al (2011) - claim: stochastic Newton is not competitive with stochastic BFGS • Prior work: Schraudolph et al. - similar, but cannot ensure the quality of y - changes the BFGS formula into a one-sided form

  34. Supporting theory? Work in progress: Figen Oztoprak, Byrd, Solntsev - combine the classical analysis (Murata, Nemirovsky et al.) - with asymptotic quasi-Newton theory - effect on constants (condition number) - invoke the self-correction properties of BFGS. Practical implementation: limited memory BFGS - loses the superlinear convergence property - enjoys self-correction mechanisms

  35. Small batches: RCV1 Problem (bH = 1000, M = 5, L = 200) • SGD: b adp/iter; bn work/iter • SQN: b + bH/L adp/iter; bn + bHn/L + 4Mn work/iter • Parameters L, M and bH provide freedom in adapting the SQN method to a specific application (worked example below)
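For a concrete small-batch instance, take b = 10 (a value quoted on slide 10; this slide itself does not fix b) with the parameters above:

```latex
\begin{align*}
\text{SGD}:&\quad b = 10 \ \text{adp/iter}, \qquad b\,n = 10n \ \text{work/iter} \\
\text{SQN}:&\quad b + \frac{b_H}{L} = 10 + \frac{1000}{200} = 15 \ \text{adp/iter}, \qquad
  b\,n + \frac{b_H\,n}{L} + 4Mn = 10n + 5n + 20n = 35n \ \text{work/iter}
\end{align*}
```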

  36. Alternative quasi-Newton framework The BFGS method was not derived with noisy gradients in mind - how do we know it is an appropriate framework? - Start from scratch - derive quasi-Newton updating formulas tolerant to noise

  37. Foundations Define a quadratic model around a reference point z. Using a collection indexed by I, it is natural to require that the residuals are zero in expectation. This is not enough information to determine the whole model

  38. Mean square error Given a collection I, choose the model q to minimize the mean square error of the residuals. Differentiating with respect to g recovers the residual condition, which is encouraging
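The slide omits the formulas, so the following is only one plausible reading of the derivation: take the quadratic model q(x) = f(z) + g^T (x - z) + (1/2)(x - z)^T H (x - z), define gradient residuals r_i = grad f_i(x_i) - grad q(x_i), and minimize their mean square norm over g.

```latex
% Assumed model and residuals (not transcribed on the slides):
%   q(x) = f(z) + g^T (x - z) + \tfrac{1}{2}(x - z)^T H (x - z),
%   r_i  = \nabla f_i(x_i) - g - H (x_i - z)
\begin{align*}
\min_{g}\ \ & \frac{1}{|I|} \sum_{i \in I} \bigl\| \nabla f_i(x_i) - g - H (x_i - z) \bigr\|^2 \\
\frac{\partial}{\partial g} = 0:\ \ & -\frac{2}{|I|} \sum_{i \in I}
  \bigl( \nabla f_i(x_i) - g - H (x_i - z) \bigr) = 0
  \ \Longrightarrow\ \sum_{i \in I} r_i = 0 .
\end{align*}
```

That is, under this reading the least-squares choice of g automatically satisfies the zero-residual condition of the previous slide.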

  39. The End
