Accelerated, Parallel and PROXimal coordinate descent • Peter Richtárik • APPROX • Moscow, February 2014 • (Joint work with Olivier Fercoq - arXiv:1312.5799)
Problem • minimize F(x) = f(x) + ψ(x) over x in R^N • Loss f: convex (smooth or nonsmooth) • Regularizer ψ: convex (smooth or nonsmooth), separable, allowed to take the value +∞ (so constraints can be encoded)
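For reference, a minimal LaTeX rendering of the composite problem in the form used in arXiv:1312.5799; the block-separability of the regularizer is the key structural assumption:

```latex
% composite problem: smooth-ish loss plus separable regularizer
\min_{x \in \mathbb{R}^N} \Big\{ F(x) := f(x) + \psi(x) \Big\},
\qquad
\psi(x) = \sum_{i=1}^{n} \psi_i\big(x^{(i)}\big).
```

Here f is the convex loss and each ψ_i is convex, possibly nonsmooth, and possibly infinite-valued.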
Regularizer: examples
• No regularizer
• Weighted L1 norm (e.g., LASSO)
• Weighted L2 norm
• Box constraints (e.g., SVM dual)
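Because ψ is separable, the proximal subproblems arising in coordinate descent decompose coordinate-wise. A minimal sketch (my own illustration, not code from the talk) of the prox for the weighted L1 regularizer ψ_i(t) = lam*|t|, which is just soft-thresholding:

```python
def prox_weighted_l1(t, step, lam):
    """Solve argmin_z  (z - t)**2 / (2*step) + lam*abs(z): soft-thresholding."""
    if t > step * lam:
        return t - step * lam
    if t < -step * lam:
        return t + step * lam
    return 0.0
```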
Loss: examples
• Quadratic loss [BKBG’11, RT’11b, TBRS’13, RT’13a]
• Logistic loss
• Square hinge loss
• L-infinity
• L1 regression [FR’13]
• Exponential loss
2D Optimization • Contours of a function F are shown • Goal: find the minimizer of F
Randomized Coordinate Descent in 2D • [Figure sequence: starting from an initial point, each iteration moves along one randomly chosen axis (labelled N/E/W/S in the picture) to the best point on that line; after steps 1–7 the iterate reaches the minimizer. SOLVED!]
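To make the picture concrete, here is a minimal sketch of serial randomized coordinate descent (my own illustration on a simple convex quadratic, not code from the talk):

```python
import random

# F(x, y) = 2*x**2 + y**2 + x*y  (a simple convex quadratic for illustration)
def grad(v):
    x, y = v
    return [4 * x + y, 2 * y + x]

# coordinate-wise Lipschitz constants of the gradient: L1 = 4, L2 = 2
L = [4.0, 2.0]

v = [3.0, -2.0]                      # starting point
for k in range(50):
    i = random.randrange(2)          # pick one coordinate uniformly at random
    v[i] -= grad(v)[i] / L[i]        # coordinate step; for a quadratic this is exact line minimization
print(v)                             # approaches the minimizer (0, 0)
```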
Variants of Randomized Coordinate Descent Methods
• Block: can operate on “blocks” of coordinates, as opposed to just on individual coordinates
• General: applies to “general” (= smooth convex) functions, as opposed to special ones such as quadratics
• Proximal: admits a nonsmooth regularizer that is kept intact when solving the subproblems; the regularizer is not smoothed, nor approximated
• Parallel: operates on multiple blocks / coordinates in parallel, as opposed to just 1 block / coordinate at a time
• Accelerated: achieves an O(1/k²) convergence rate for convex functions, as opposed to O(1/k)
• Efficient: complexity of 1 iteration is O(1) per processor on sparse problems, as opposed to O(# coordinates): avoids adding two full vectors
Brief History of Randomized Coordinate Descent Methods • [Table of prior methods; this work adds new long stepsizes]
APPROX • A = “ACCELERATED” • P = “PARALLEL” • PROX = “PROXIMAL”
APPROX: Smooth Case • The update for coordinate i is driven by the partial derivative of f, ∇_i f, scaled by a coordinate stepsize • Want this step to be as large as possible
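A compact sketch of one way to write the smooth-case iteration, loosely following the algorithm in arXiv:1312.5799; treat the details, in particular the stepsize constants v_i and the θ update, as my reconstruction rather than authoritative pseudocode:

```python
import math
import random

def approx_smooth(grad_i, n, v, tau, x0, iters):
    """Sketch of accelerated parallel coordinate descent, smooth case (no regularizer).

    grad_i(y, i) -- partial derivative of f at y w.r.t. coordinate i
    v            -- stepsize parameters, one per coordinate (from the ESO assumption)
    tau          -- number of coordinates updated per iteration
    """
    x = list(x0)
    z = list(x0)
    theta = tau / n
    for _ in range(iters):
        # y_k = (1 - theta_k) * x_k + theta_k * z_k
        # (the efficient implementation avoids forming y explicitly; this sketch is O(n))
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        S = random.sample(range(n), tau)             # tau coordinates, uniformly at random
        dz = {i: -(tau / (n * theta * v[i])) * grad_i(y, i) for i in S}
        for i, d in dz.items():
            z[i] += d
        # x_{k+1} = y_k + (n/tau) * theta_k * (z_{k+1} - z_k)
        x = list(y)
        for i, d in dz.items():
            x[i] += (n / tau) * theta * d
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return x
```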
Convergence Rate • Key assumption: an expected separable overapproximation (ESO) of f with respect to the sampling, with parameters v_1, …, v_n • Theorem [FR’13b]: with n = # coordinates, τ = average # coordinates updated / iteration, and k = # iterations, k of order (n/τ)·√(1/ε) implies E[F(x_k)] − F* ≤ ε, i.e., an accelerated O(1/k²) rate
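For completeness, a LaTeX rendering of the kind of bound behind this statement; this is my reconstruction from memory of arXiv:1312.5799, so the exact constants may differ from the paper:

```latex
% reconstruction of the FR'13b bound (treat constants as approximate)
\mathbf{E}\big[F(x_k)\big] - F^* \;\le\;
\frac{4 n^2}{\big((k-1)\tau + 2n\big)^2}
\left( \Big(1 - \tfrac{\tau}{n}\Big)\big(F(x_0) - F^*\big)
      + \tfrac{1}{2}\,\|x_0 - x^*\|_v^2 \right),
\qquad
\|x\|_v^2 := \sum_{i=1}^{n} v_i \big(x^{(i)}\big)^2 .
```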
Special Case: Fully Parallel Variant • all coordinates are updated in each iteration (τ = n) • with the weights normalized to sum to n, the required # iterations no longer depends on n: k of order √(1/ε) implies E[F(x_k)] − F* ≤ ε
Special Case: Effect of New Stepsizes • With the new stepsizes (will mention later!), the complexity is governed by the average degree of separability and by an “average” of the coordinate Lipschitz constants
Cost of 1 Iteration of APPROX • Assume N = n (all blocks are of size 1), that the loss is built from scalar functions whose derivatives cost O(1) to evaluate, and that the data matrix A is sparse • Then the average cost of 1 iteration of APPROX is O(τ · c̄) arithmetic ops, where c̄ = average # nonzeros in a column of A
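The per-column cost comes from maintaining the residual r = A x incrementally rather than recomputing it. A minimal sketch (my own illustration, assuming a column-wise sparse representation of A; not code from the talk):

```python
def update_coordinate(cols, x, r, i, delta):
    """Apply x[i] += delta and keep the residual r = A @ x consistent.

    cols[i] is the sparse i-th column of A as a list of (row, value) pairs,
    so the update touches only the nonzeros of that column.
    """
    x[i] += delta
    for row, a_ji in cols[i]:
        r[row] += a_ji * delta
```

With r at hand, the partial derivative of a loss of the form f(x) = Σ_j φ_j(r_j) with respect to coordinate i is Σ_j φ_j'(r_j)·A_ji, again a sum over the nonzeros of column i only.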
L1-Regularized L1 Regression • [Figure: comparison of the Gradient Method, Nesterov’s Accelerated Gradient Method, SPCDM and APPROX on the Dorothea dataset]
L1-Regularized Least Squares (LASSO) • [Figure: PCDM vs APPROX on the KDDB dataset]
Training Linear SVMs • [Figure: results on the Malicious URL dataset]
Choice of Stepsizes: How (not) to Parallelize Coordinate Descent
Convergence of Randomized Coordinate Descent • Focus on n (big data = big n)
• Strongly convex F (Simple Method): O(n log(1/ε)) iterations
• ‘Difficult’ nonsmooth F (Simple Method): O(n/ε²) iterations
• ‘Difficult’ nonsmooth F (Accelerated Method) or smooth F (Simple Method): O(n/ε) iterations
• Smooth or ‘simple’ nonsmooth F (Accelerated Method): O(n/√ε) iterations
Parallelization Dream • Serial: 1 coordinate updated per iteration • Parallel: τ coordinates updated per iteration • WANT: a τ-fold speedup • What do we actually get? Depends on the extent to which we can add up the individual updates, which in turn depends on the properties of F and on the way coordinates are chosen at each iteration
“Naive” parallelization • Do the same thing as before, but for MORE or ALL coordinates & ADD UP the updates
Failure of naive parallelization • [Figure sequence: from point 0, the two coordinate-wise updates 1a and 1b are computed and added, giving point 1; repeating from 1 gives updates 2a and 2b and point 2, which overshoots the minimizer. OOPS!]
Idea: averaging updates may help • [Figure: from point 0, averaging the coordinate-wise updates 1a and 1b gives point 1, which lands on the minimizer. SOLVED!]
Averaging can be too conservative • [Figure sequence: from point 0, the averaged updates 1a, 1b give point 1; then 2a, 2b give point 2, and so on; progress toward the point we actually wanted is much slower. BAD!!!]
What to do? • Let h^(i) be the update to coordinate i and e_i the i-th unit coordinate vector • Averaging: x_{k+1} = x_k + (1/τ) Σ_{i∈S_k} h^(i) e_i • Summation: x_{k+1} = x_k + Σ_{i∈S_k} h^(i) e_i • Figure out when one can safely use the summation
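A tiny numerical illustration of the dilemma pictured above (my own toy example, not from the talk): on the coupled quadratic f(x, y) = 0.5·(x + y)², summing the two coordinate-wise steps just oscillates and makes no progress, while averaging solves the problem in one step; on the separable quadratic f(x, y) = 0.5·(x² + y²) the roles reverse: summation is optimal after one step and averaging is needlessly slow.

```python
def coord_steps(grad, diag, v):
    """Exact coordinate-wise minimization steps for a quadratic with Hessian
    diagonal `diag`: h_i = -grad_i(v) / diag_i, all computed at the same point v."""
    g = grad(v)
    return [-g[i] / diag[i] for i in range(len(v))]

def run(grad, diag, v, combine, iters=20):
    """combine = 1.0 sums the coordinate updates; combine = 0.5 averages them."""
    for _ in range(iters):
        h = coord_steps(grad, diag, v)
        v = [vi + combine * hi for vi, hi in zip(v, h)]
    return v

def grad_coupled(v):    # f(x, y) = 0.5*(x + y)**2, Hessian diagonal = [1, 1]
    return [v[0] + v[1], v[0] + v[1]]

def grad_separable(v):  # f(x, y) = 0.5*(x**2 + y**2), Hessian diagonal = [1, 1]
    return [v[0], v[1]]

print(run(grad_coupled,   [1, 1], [1.0, 3.0], 1.0))  # summation: oscillates, no progress
print(run(grad_coupled,   [1, 1], [1.0, 3.0], 0.5))  # averaging: optimal after one step
print(run(grad_separable, [1, 1], [1.0, 3.0], 1.0))  # summation: optimal after one step
print(run(grad_separable, [1, 1], [1.0, 3.0], 0.5))  # averaging: slow geometric decay
```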
5 Models for f Admitting Small Stepsize Parameters
1. Smooth partially separable f [RT’11b]
2. Nonsmooth max-type f [FR’13]
3. f with ‘bounded Hessian’ [BKBG’11, RT’13a]
4. Partially separable f with smooth components [NC’13]
5. Partially separable f with block smooth components [FR’13b]