Parallel coordinate descent methods. Peter Richtárik. Simons Institute for the Theory of Computing, Berkeley. Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013. Randomized Coordinate Descent in 2D. 2D Optimization. Contours of a function.
Parallel coordinate descent methods • Peter Richtárik • Simons Institute for the Theory of Computing, Berkeley • Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013
2D Optimization. Contours of a function. Goal: find the minimizer of the function.
Randomized Coordinate Descent in 2D [Animation, seven frames: on the contour plot, a compass marks the coordinate directions N, E, S, W; the iterates are numbered 1 to 7, one new iterate per frame; after iterate 7: SOLVED!]
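The animation above can be imitated with a minimal sketch. The quadratic test function, the starting point, and all names below are assumptions made for illustration; the slides do not say which function was plotted.

```python
import random

# Hypothetical test problem (not from the slides):
# f(x) = 0.5 * x^T Q x - b^T x with Q symmetric positive definite.
Q = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 1.0]

def f(x):
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(2) for j in range(2))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(2))

def partial_derivative(x, i):
    # i-th entry of the gradient Q x - b
    return sum(Q[i][j] * x[j] for j in range(2)) - b[i]

def randomized_coordinate_descent(x0, iterations, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iterations):
        i = rng.randrange(2)  # pick one of the two coordinates uniformly at random
        # Exact minimization along coordinate i (step 1 / Q_ii for a quadratic).
        x[i] -= partial_derivative(x, i) / Q[i][i]
    return x

x_final = randomized_coordinate_descent([4.0, -3.0], iterations=200)
```

For this particular Q and b the minimizer solves Q x = b, that is x = (0.4, 0.2), so `x_final` can be checked against it.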
Convergence of Randomized Coordinate Descent. Focus on n (big data = big n). Three cases: strongly convex F; smooth or ‘simple’ nonsmooth F; ‘difficult’ nonsmooth F.
Parallelization Dream: serial vs. parallel. What do we actually get, compared with what we WANT? It depends on the extent to which we can add up individual updates, which in turn depends on the properties of F and on the way coordinates are chosen at each iteration.
“Naive” parallelization: do the same thing as before, but for MORE or ALL coordinates, and ADD UP the updates.
Failure of naive parallelization [Animation: from point 0, the coordinate updates 1a and 1b are added up to give point 1; from point 1, the updates 2a and 2b are added up to give point 2. OOPS!]
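A small sketch of how adding up updates can fail. The function f(x1, x2) = (x1 + x2)^2 is a hypothetical example chosen because its two coordinates are strongly coupled; it is not taken from the slides.

```python
# Hypothetical example (not from the slides): f(x1, x2) = (x1 + x2)^2.
def f(x):
    return (x[0] + x[1]) ** 2

def coordinate_update(x, i):
    # Exact minimization of f along coordinate i moves x_i to -x_other,
    # so the update is h_i = -x_other - x_i.
    other = 1 - i
    return -x[other] - x[i]

def summation_step(x):
    # Both updates are computed from the SAME point and then added up.
    h = [coordinate_update(x, 0), coordinate_update(x, 1)]
    return [x[0] + h[0], x[1] + h[1]]

x = [1.0, 1.0]
values = [f(x)]
for _ in range(4):
    x = summation_step(x)
    values.append(f(x))
# The iterates jump between (1, 1) and (-1, -1); f never decreases.
```

Each update on its own would solve the problem in one step, but their sum overshoots, so `values` stays constant.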
Idea: averaging updates may help [Figure: from point 0, the updates 1a and 1b are averaged to give point 1. SOLVED!]
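Averaging can be too conservative [Figure: from point 0, averaging the updates 1a and 1b gives point 1; averaging 2a and 2b gives point 2; and so on...]
Averaging may be too conservative [Figure: the averaged step to point 2 (BAD!!!) next to the step we wanted (WANT)]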
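Both effects of averaging can be reproduced in a few lines. The two test functions are assumptions chosen for illustration: a coupled one, (x1 + x2)^2, where averaging solves the problem in one step, and a separable one, x1^2 + x2^2, where averaging only halves the distance each time while summation would be exact.

```python
def average_step(x, updates):
    # Move by the average of the per-coordinate updates.
    n = len(x)
    return [x[i] + updates[i] / n for i in range(n)]

def coupled_updates(x):
    # Exact coordinate minimizers of the hypothetical f = (x1 + x2)^2.
    s = x[0] + x[1]
    return [-s, -s]

def separable_updates(x):
    # Exact coordinate minimizers of the hypothetical f = x1^2 + x2^2.
    return [-x[0], -x[1]]

start = [1.0, 1.0]

# Coupled function: one averaged step lands on a minimizer.
coupled = average_step(start, coupled_updates(start))

# Separable function: averaging creeps toward the minimizer.
x = list(start)
separable_path = [x]
for _ in range(3):
    x = average_step(x, separable_updates(x))
    separable_path.append(x)

# Separable function: adding up the updates is exact in one step.
summed = [xi + hi for xi, hi in zip(start, separable_updates(start))]
```

The design question raised by the slides is visible here: neither rule is safe for every function, which motivates looking for a principled stepsize between the two.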
What to do? Notation: update to coordinate i; i-th unit coordinate vector. Two ways to combine the updates: averaging and summation. Figure out when one can safely use each.
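The two combination rules can be written out as follows. The symbols are assumptions, since the slide formulas did not survive: h_i stands for the update to coordinate i, e_i for the i-th unit coordinate vector, and x^+ for the new point.

```latex
\begin{align*}
\text{Averaging:} \quad & x^{+} = x + \frac{1}{n} \sum_{i=1}^{n} h_i e_i \\
\text{Summation:} \quad & x^{+} = x + \sum_{i=1}^{n} h_i e_i
\end{align*}
```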
Problem: Loss + Regularizer. Loss: convex (smooth or nonsmooth). Regularizer: convex (smooth or nonsmooth) - separable - allow
Regularizer: examples. No regularizer; weighted L1 norm (e.g., LASSO); weighted L2 norm; box constraints (e.g., SVM dual).
Loss: examples. Quadratic loss, logistic loss, square hinge loss [BKBG’11, RT’11b, TBRS’13, RT’13a]; L-infinity, L1 regression, exponential loss [FR’13].
3 models for f with small [symbol missing]: (1) smooth partially separable f [RT’11b]; (2) nonsmooth max-type f [FR’13]; (3) f with ‘bounded Hessian’ [BKBG’11, RT’13a].
Randomized Parallel Coordinate Descent Method. New iterate = current iterate + sum, over a random set of coordinates (sampling), of (update to i-th coordinate) × (i-th unit coordinate vector).
ESO: Expected Separable Overapproximation. Definition [RT’11b], with a shorthand notation. The overapproximation is minimized in h. It is separable in h, so one can minimize in parallel and compute updates only for the sampled coordinates.
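For reference, the ESO inequality is usually stated as below. This is a reconstruction from the form commonly given for [RT’11b], not a copy of the slide, and the notation is an assumption: \hat{S} is the random sampling, h_{[\hat{S}]} keeps only the sampled coordinates of h, w is a vector of positive weights, and \beta > 0 is the stepsize parameter.

```latex
\begin{gather*}
\mathbf{E}\left[ f\left(x + h_{[\hat{S}]}\right) \right]
  \le f(x) + \frac{\mathbf{E}\left[|\hat{S}|\right]}{n}
  \left( \langle \nabla f(x), h \rangle + \frac{\beta}{2} \|h\|_w^2 \right), \\
h_{[\hat{S}]} = \sum_{i \in \hat{S}} h_i e_i, \qquad
\|h\|_w^2 = \sum_{i=1}^{n} w_i h_i^2, \qquad
\text{shorthand: } (f, \hat{S}) \sim \mathrm{ESO}(\beta, w).
\end{gather*}
```

With no regularizer, minimizing the right-hand side in h decouples across coordinates and gives h_i = -\nabla_i f(x) / (\beta w_i), which is why each processor can compute its own update independently.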
Convergence rate: convex f. Theorem [RT’11b]: [bound] implies [guarantee] (formulas missing). Annotated quantities: stepsize parameter, # coordinates, # iterations, average # updated coordinates per iteration, error tolerance.
Convergence rate: strongly convex f. Theorem [RT’11b]: [bound] implies [guarantee] (formulas missing). Annotated quantities: strong convexity constant of the loss f, strong convexity constant of the regularizer.
Serial uniform sampling Probability law:
τ-nice sampling. Good for shared memory systems. Probability law:
Doubly uniform sampling Can model unreliable processors / machines Probability law:
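The three samplings can be sketched as below. The probability laws are the ones I understand [RT’11b] to use, stated here as assumptions because the slide formulas are missing: serial uniform picks a single coordinate with probability 1/n; τ-nice picks a subset of size τ uniformly among all such subsets; doubly uniform first draws a cardinality j with probability q_j and then a uniformly random subset of that size. Function and parameter names are hypothetical.

```python
import random

def serial_uniform_sampling(n, rng):
    # One coordinate, each chosen with probability 1/n.
    return {rng.randrange(n)}

def tau_nice_sampling(n, tau, rng):
    # A subset of exactly tau coordinates, uniform over all such subsets.
    return set(rng.sample(range(n), tau))

def doubly_uniform_sampling(n, size_probs, rng):
    # size_probs[j] is the probability q_j that the set has exactly j coordinates;
    # sets of equal cardinality are equally likely.
    sizes = list(range(len(size_probs)))
    size = rng.choices(sizes, weights=size_probs)[0]
    return set(rng.sample(range(n), size))

rng = random.Random(0)
S = tau_nice_sampling(1000, 8, rng)  # e.g. 8 cores on a shared memory machine
```

A random cardinality is one way to model unreliable processors: if each of several machines fails independently, the number of coordinates actually updated varies from iteration to iteration.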
ESO for partially separable functions and doubly uniform samplings. Model 1: smooth partially separable f [RT’11b]. Theorem [RT’11b]
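A minimal sketch of the parallel method for a sparse least-squares loss with no regularizer. It assumes that for a τ-nice sampling the ESO parameter takes the form β = 1 + (ω - 1)(τ - 1) / max(1, n - 1), with ω the degree of partial separability and weights w_i equal to the coordinate Lipschitz constants; this is the form I recall from [RT’11b] and should be checked against the paper. The data, names, and iteration count are hypothetical, and the "parallel" updates are simulated serially.

```python
import random

def pcdm_least_squares(rows, b, n, tau, iterations, seed=0):
    """Minimize 0.5 * sum_j (a_j . x - b_j)^2; each row is a dict {column: value}."""
    rng = random.Random(seed)
    omega = max(len(r) for r in rows)  # each term touches at most omega coordinates
    beta = 1.0 + (omega - 1) * (tau - 1) / max(1, n - 1)  # assumed ESO parameter
    w = [0.0] * n  # w_i = sum_j a_ji^2 (coordinate Lipschitz constants)
    for r in rows:
        for i, a in r.items():
            w[i] += a * a
    x = [0.0] * n
    for _ in range(iterations):
        S = rng.sample(range(n), tau)  # tau-nice sampling
        residual = [sum(a * x[i] for i, a in r.items()) - bj
                    for r, bj in zip(rows, b)]
        grad = {i: 0.0 for i in S}
        for r, res in zip(rows, residual):
            for i, a in r.items():
                if i in grad:
                    grad[i] += a * res
        # All updates use the same point x and are then ADDED UP.
        for i in S:
            x[i] -= grad[i] / (beta * w[i])
    return x

# Hypothetical data with omega = 2 and exact solution (1, 2, 3, 4).
rows = [{0: 1.0, 1: 1.0}, {1: 1.0, 2: -1.0}, {2: 1.0, 3: 1.0}, {3: 2.0}]
b = [3.0, -1.0, 7.0, 8.0]
x_hat = pcdm_least_squares(rows, b, n=4, tau=2, iterations=5000)
```

Dividing by β is what sits between the two extremes seen earlier: β = 1 corresponds to plain summation of the updates, and larger β shrinks the steps when coordinates interact.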
Theoretical speedup, in terms of the # coordinates, the degree of partial separability and the # coordinate updates / iter. LINEAR OR GOOD SPEEDUP: nearly separable (sparse) problems. WEAK OR NO SPEEDUP: non-separable (dense) problems. Much of Big Data is here!
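The speedup can be tabulated with a short helper. It assumes the speedup factor is τ / β with β = 1 + (ω - 1)(τ - 1) / max(1, n - 1), where n is the # coordinates, ω the degree of partial separability and τ the # coordinate updates per iteration; the formula is my recollection of [RT’11b], not something recoverable from the slide text.

```python
def eso_beta(n, omega, tau):
    # Assumed ESO parameter for a tau-nice sampling.
    return 1.0 + (omega - 1) * (tau - 1) / max(1, n - 1)

def theoretical_speedup(n, omega, tau):
    # Ratio of serial to parallel iteration counts under the assumption above.
    return tau / eso_beta(n, omega, tau)

n = 1000
separable = theoretical_speedup(n, omega=1, tau=100)   # fully separable problem
sparse = theoretical_speedup(n, omega=10, tau=100)     # nearly separable problem
dense = theoretical_speedup(n, omega=n, tau=100)       # non-separable problem
```

Under this formula a fully separable problem gives linear speedup (factor τ), while a fully dense one gives no speedup at all (factor 1), matching the two regimes named on the slide.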
Theory n = 1000 (# coordinates)
Practice n = 1000 (# coordinates)
Optimization with Big Data = Extreme* Mountain Climbing (* in a billion-dimensional space on a foggy day)