A Stochastic Quasi-Newton Method for Large-Scale Learning Jorge Nocedal Northwestern University With S. Hansen, R. Byrd and Y. Singer IPAM, UCLA, Feb 2014
Goal: Propose a robust quasi-Newton method that operates in the stochastic approximation regime • purely stochastic method (not batch) – to compete with the stochastic gradient (SG) method • full non-diagonal Hessian approximation • scalable to millions of parameters
Outline Are iterations of the following form viable? - theoretical considerations; iteration costs - differencing noisy gradients? Key ideas: compute curvature information pointwise at regular intervals; build on the strength of BFGS updating, recalling that it is an overwriting (not an averaging) process - results on text and speech problems - examine both training and testing errors
Problem • Applications • Simulation optimization • Machine learning The algorithm is not (yet) applicable to simulation-based optimization.
Stochastic gradient method For the loss function F, the Robbins-Monro or stochastic gradient method takes steps along a stochastic gradient (estimator) computed on a mini-batch.
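As a point of reference, a minimal sketch of the mini-batch stochastic gradient iteration described above; the function and variable names are illustrative, not from the talk.

```python
import numpy as np

def sgd(grad_fn, w0, X, y, batch_size=50, alpha=0.1, iters=1000, seed=0):
    """Minimal mini-batch SGD: w_{k+1} = w_k - alpha_k * stochastic gradient."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    N = X.shape[0]
    for k in range(iters):
        idx = rng.choice(N, size=batch_size, replace=False)  # sample a mini-batch
        g = grad_fn(w, X[idx], y[idx])                        # stochastic gradient estimate
        w -= (alpha / (k + 1)) * g                            # Robbins-Monro diminishing step
    return w
```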
Why it won't work … Is there any reason to think that including a Hessian approximation will improve upon the stochastic gradient method? • Iteration costs are so high that even if the method is faster than SG in terms of training costs, it will be a weaker learner.
Theoretical Considerations Number of iterations needed to compute an ε-accurate solution (Murata 98; cf. Bottou-Bousquet): • for the stochastic gradient method, the bound depends on the condition number of the Hessian at the true solution, and more precisely on the Hessian at the true solution and the gradient covariance matrix • scaling the iteration by (an approximation to) the inverse Hessian completely removes the dependency on the condition number
Computational cost • Assuming we obtain the efficiencies of classical quasi-Newton methods in limited-memory form, each iteration requires about 4Md additional operations (realized by the L-BFGS two-loop recursion sketched below) • M = memory in the limited-memory implementation; M = 5 • d = dimension of the optimization problem • compare with the cost of the stochastic gradient method
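For concreteness, the familiar L-BFGS two-loop recursion that applies the quasi-Newton matrix implicitly in roughly 4Md operations; this is standard textbook material written as a sketch, not the speakers' code.

```python
import numpy as np

def lbfgs_two_loop(g, s_list, y_list):
    """Compute H*g implicitly from the last M correction pairs (s_i, y_i),
    ordered oldest to newest.

    Each of the two loops costs about 2Md multiply-adds, giving ~4Md total,
    where M = len(s_list) and d = g.size.
    """
    q = g.copy()
    alphas = []
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * np.dot(s, q)
        q -= a * y
        alphas.append(a)                      # stored newest-first
    # Initial Hessian approximation H0 = gamma * I (standard scaling choice)
    gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    r = gamma * q
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * np.dot(y, r)
        r += (a - b) * s
    return r  # approximates H_k * g
```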
Mini-batching • Assuming a mini-batch of size b = 50, the cost of the stochastic gradient is 50d • Use of small mini-batches will be a game-changer • b = 10, 50, 100
Game changer? Not quite… • Mini-batching makes operation counts favorable but does not resolve the challenges related to noise • Avoid differencing noise • Curvature estimates cannot suffer from sporadic spikes in noise (Schraudolph et al. (99), Ribeiro et al. (2013)) • Quasi-Newton updating is an overwriting process, not an averaging process • Control the quality of curvature information • Control the cost of the curvature computation
Design of the Stochastic Quasi-Newton Method • Propose a method based on the famous BFGS formula • all components seem to fit together well • numerical performance appears to be strong • Propose a new quasi-Newton updating formula • specifically designed to deal with noisy gradients • work in progress
The remarkable properties of the BFGS method (convex case): superlinear convergence; global convergence for strongly convex problems; self-correction properties. Only need to approximate the Hessian in a subspace (Powell 76, Byrd-N 89).
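For reference, the standard inverse-Hessian form of the BFGS update being built upon here; this is textbook material rather than anything specific to the talk.

```latex
% Inverse-Hessian BFGS update, with s_k = w_{k+1} - w_k,
% y_k = \nabla F(w_{k+1}) - \nabla F(w_k), and \rho_k = 1/(y_k^T s_k):
H_{k+1} = (I - \rho_k s_k y_k^T)\, H_k \,(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T
```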
Adaptation to the stochastic setting We cannot mimic the classical approach and update after each iteration: since the batch size b is small, this would yield highly noisy curvature estimates. Instead, use a collection of iterates to define the correction pairs.
Stochastic BFGS: Approach 1 Define two collections of size L; define the average iterate and gradient over each collection; form a new curvature pair from their differences (see the sketch below).
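The displayed formulas did not survive the conversion; the following is a plausible reconstruction of the Approach-1 quantities, stated as an assumption about what the slide showed.

```latex
% Two consecutive collections of L iterates, I (new) and J (old);
% averages over each collection and the resulting correction pair:
\bar{w}_I = \frac{1}{L}\sum_{i \in I} w_i, \qquad
\bar{g}_I = \frac{1}{L}\sum_{i \in I} \widehat{\nabla} F(w_i),
\\
s = \bar{w}_I - \bar{w}_J, \qquad y = \bar{g}_I - \bar{g}_J .
```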
Stochastic BFGS: Approach 1 • Two sources of error: sample variance, lack of sample uniformity • Initial reaction: control the quality of the average gradients; use the sample variance … dynamic sampling We could not make this work in a robust manner! • Proposed solution: control the quality of the curvature estimate directly
Key idea: avoid differencing • The standard definition of y arises from the secant relation • Hessian-vector products are often available Define the curvature vector for L-BFGS via a Hessian-vector product, performed only every L iterations (see the sketch below).
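A sketch of the two definitions being contrasted, following the published SQN paper; the batch notation S_H is an assumption here.

```latex
% Standard (differencing) definition, motivated by the secant relation
% \nabla^2 F(w_{k+1})\, s_k \approx \nabla F(w_{k+1}) - \nabla F(w_k):
y_k = \nabla F(w_{k+1}) - \nabla F(w_k)

% Differencing-free definition used every L iterations, with \bar{w}_t the
% average of the last L iterates and S_H a Hessian sample of size b_H:
s_t = \bar{w}_t - \bar{w}_{t-1}, \qquad
y_t = \widehat{\nabla}^2 F(\bar{w}_t; S_H)\, s_t
```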
Structure of the Hessian-vector product • Code the Hessian-vector product directly • Achieve sample uniformity automatically (cf. Schraudolph) • Avoid numerical problems when ||s|| is small • Control the cost of the y computation (computed on its own mini-batch, analogous to the mini-batch stochastic gradient)
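As an illustration of coding the Hessian-vector product directly, here is a sketch for binary logistic regression with labels ±1 on a Hessian mini-batch; this is not the speakers' code.

```python
import numpy as np

def logistic_hessian_vector_product(w, v, X):
    """Return an estimate of H(w) @ v for the logistic loss
    f(w) = (1/b) * sum_i log(1 + exp(-y_i * x_i @ w)),  y_i in {-1, +1},
    computed on the Hessian mini-batch X without forming H explicitly.
    (The labels drop out of the Hessian for this loss.)
    """
    z = X @ w                       # b margins
    p = 1.0 / (1.0 + np.exp(-z))    # sigmoid(x_i @ w)
    d = p * (1.0 - p)               # diagonal of the Hessian weight matrix D
    Xv = X @ v                      # one matrix-vector product with X
    return X.T @ (d * Xv) / X.shape[0]   # (1/b) * X^T D X v
```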
Algorithmic Parameters • b: stochastic gradient batch size • bH: Hessian-vector product batch size • L: controls the frequency of quasi-Newton updating • M: memory parameter in L-BFGS updating; M = 5 • use the limited-memory form
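One way to collect these parameters in code, with defaults taken from the values quoted in the talk; the container itself is just a sketch.

```python
from dataclasses import dataclass

@dataclass
class SQNParams:
    b: int = 300      # stochastic gradient batch size
    bH: int = 1000    # Hessian-vector product batch size
    L: int = 20       # frequency of quasi-Newton (curvature pair) updates
    M: int = 5        # L-BFGS memory
```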
Need the Hessian to implement a quasi-Newton method? Are you out of your mind? We don't need the Hessian-vector product, but it has many advantages: complete freedom in sampling and accuracy.
Numerical Tests Stochastic gradient method (SGD) Stochastic quasi-Newton method (SQN) It is well known that SGD is highly sensitive to the choice of steplength, and so is the SQN method (though perhaps less so).
RCV1 Problem: n = 112919, N = 688329 • b = 50, 300, 1000 • M = 5, L = 20, bH = 1000 [Figure: objective for SGD vs. SQN, plotted against accessed data points (includes Hessian-vector products)]
Speech Problem: n = 30315, N = 191607 • b = 100, 500 • M = 5, L = 20, bH = 1000 [Figure: objective for SGD vs. SQN]
Test Problems • Synthetically generated logistic regression (Singer et al.): n = 50, N = 7000 • RCV1 dataset: n = 112919, N = 688329 • SPEECH dataset: NF = 235, |C| = 129, n = NF × |C| = 30315, N = 191607
Iteration Costs • SGD: mini-batch stochastic gradient • SQN: mini-batch stochastic gradient + Hessian-vector product every L iterations + matrix-vector product
Iteration Costs (parameter ranges vs. typical values) • b = 50-1000; typical b = 300 • bH = 100-1000; typical bH = 1000 • L = 10-20; typical L = 20 • M = 3-20; typical M = 5 With the typical values, the work per iteration is roughly 300n for SGD and 370n for SQN.
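The per-iteration work counts quoted above follow directly from the cost model on the previous slide (arithmetic only, using b = 300, bH = 1000, L = 20, M = 5):

```latex
\text{SGD: } b\,n = 300n, \qquad
\text{SQN: } b\,n + \frac{b_H\, n}{L} + 4Mn = 300n + 50n + 20n = 370n
```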
Hasn't this been done before? Hessian-free Newton method: Martens (2010), Byrd et al. (2011) - claim: stochastic Newton is not competitive with stochastic BFGS Prior work: Schraudolph et al. - similar, but cannot ensure the quality of y - changes the BFGS formula to a one-sided form
Supporting theory? Work in progress: Figen Oztoprak, Byrd, Solntsev - combine the classical analysis of Murata, Nemirovsky et al. - with asymptotic quasi-Newton theory - effect on constants (condition number) - invoke the self-correction properties of BFGS Practical implementation: limited-memory BFGS - loses the superlinear convergence property - enjoys self-correction mechanisms
Small batches: RCV1 Problem (bH = 1000, M = 5, L = 200) • SGD: b accessed data points/iteration, bn work/iteration • SQN: b + bH/L accessed data points/iteration, bn + bH·n/L + 4Mn work/iteration The parameters L, M and bH provide freedom in adapting the SQN method to a specific application.
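For illustration, plugging a small (hypothetical) gradient batch b = 10 into the same work-per-iteration formula, with the slide's bH = 1000, M = 5, L = 200:

```latex
\text{SGD: } b\,n = 10n, \qquad
\text{SQN: } b\,n + \frac{b_H\, n}{L} + 4Mn = 10n + 5n + 20n = 35n
```

Increasing L to 200 keeps the amortized Hessian-vector cost at 5n, which is the kind of adaptation the slide refers to.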
Alternative quasi-Newton framework The BFGS method was not derived with noisy gradients in mind - how do we know it is an appropriate framework? - Start from scratch: derive quasi-Newton updating formulas tolerant to noise
Foundations Define a quadratic model q around a reference point z. Using a collection of sample points indexed by I, it is natural to require that the residuals are zero in expectation. This alone is not enough information to determine the whole model.
Mean square error Given a collection I, choose the model q to minimize the mean square error of the residuals. Differentiating with respect to g recovers the residual condition, which is encouraging (see the sketch below).
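The displayed model and derivation were lost in conversion; the following is one natural instantiation, stated as an assumption rather than the speakers' exact formulation.

```latex
% Quadratic model around a reference point z, with unknowns g and A:
q(w) = F(z) + g^T (w - z) + \tfrac{1}{2}(w - z)^T A (w - z)

% Gradient residual on sample i in the collection I:
r_i = \widehat{\nabla} F(w_i) - \nabla q(w_i)
    = \widehat{\nabla} F(w_i) - g - A (w_i - z)

% Choosing the model to minimize the mean square error
\min_{g,\,A} \; \sum_{i \in I} \| r_i \|^2
% and differentiating with respect to g gives
\sum_{i \in I} r_i = 0,
% i.e. the residuals vanish on average over the collection,
% matching the condition required on the previous slide.
```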