Accelerated, Parallel and PROXimal coordinate descent • Peter Richtárik • APPROX • Moscow, February 2014 • (Joint work with Olivier Fercoq - arXiv:1312.5799)
Problem • minimize F(x) = f(x) + ψ(x) over x in R^N • Loss f: convex (smooth or nonsmooth) • Regularizer ψ: convex (smooth or nonsmooth), separable, allowed to take the value +∞ (so constraints can be encoded)
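For reference, a minimal LaTeX rendering of the composite problem in the form used in arXiv:1312.5799; the block-separability of the regularizer is the key structural assumption:

```latex
% composite problem: smooth-ish loss plus separable regularizer
\min_{x \in \mathbb{R}^N} \Big\{ F(x) := f(x) + \psi(x) \Big\},
\qquad
\psi(x) = \sum_{i=1}^{n} \psi_i\big(x^{(i)}\big).
```

Here f is the convex loss and each ψ_i is convex, possibly nonsmooth, and possibly infinite-valued.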
Regularizer: examples
• No regularizer
• Weighted L1 norm (e.g., LASSO)
• Weighted L2 norm
• Box constraints (e.g., SVM dual)
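Because ψ is separable, the proximal subproblems arising in coordinate descent decompose coordinate-wise. A minimal sketch (my own illustration, not code from the talk) of the prox for the weighted L1 regularizer ψ_i(t) = lam*|t|, which is just soft-thresholding:

```python
def prox_weighted_l1(t, step, lam):
    """Solve argmin_z  (z - t)**2 / (2*step) + lam*abs(z): soft-thresholding."""
    if t > step * lam:
        return t - step * lam
    if t < -step * lam:
        return t + step * lam
    return 0.0
```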
Loss: examples
• Quadratic loss [BKBG’11, RT’11b, TBRS’13, RT’13a]
• Logistic loss
• Square hinge loss
• L-infinity
• L1 regression [FR’13]
• Exponential loss
2D Optimization • Contours of a function F are shown • Goal: find the minimizer of F
Randomized Coordinate Descent in 2D • [Figure sequence: starting from an initial point, each iteration moves along one randomly chosen axis (labelled N/E/W/S in the picture) to the best point on that line; after steps 1–7 the iterate reaches the minimizer. SOLVED!]
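To make the picture concrete, here is a minimal sketch of serial randomized coordinate descent (my own illustration on a simple convex quadratic, not code from the talk):

```python
import random

# F(x, y) = 2*x**2 + y**2 + x*y  (a simple convex quadratic for illustration)
def grad(v):
    x, y = v
    return [4 * x + y, 2 * y + x]

# coordinate-wise Lipschitz constants of the gradient: L1 = 4, L2 = 2
L = [4.0, 2.0]

v = [3.0, -2.0]                      # starting point
for k in range(50):
    i = random.randrange(2)          # pick one coordinate uniformly at random
    v[i] -= grad(v)[i] / L[i]        # coordinate step; for a quadratic this is exact line minimization
print(v)                             # approaches the minimizer (0, 0)
```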
Variants of Randomized Coordinate Descent Methods
• Block: can operate on “blocks” of coordinates, as opposed to just on individual coordinates
• General: applies to “general” (= smooth convex) functions, as opposed to special ones such as quadratics
• Proximal: admits a nonsmooth regularizer that is kept intact when solving the subproblems; the regularizer is not smoothed, nor approximated
• Parallel: operates on multiple blocks / coordinates in parallel, as opposed to just 1 block / coordinate at a time
• Accelerated: achieves an O(1/k²) convergence rate for convex functions, as opposed to O(1/k)
• Efficient: complexity of 1 iteration is O(1) per processor on sparse problems, as opposed to O(# coordinates): avoids adding two full vectors
Brief History of Randomized Coordinate Descent Methods • [Table of prior methods; this work adds new long stepsizes]
APPROX • A = “ACCELERATED” • P = “PARALLEL” • PROX = “PROXIMAL”
APPROX: Smooth Case • The update for coordinate i is driven by the partial derivative of f, ∇_i f, scaled by a coordinate stepsize • Want this step to be as large as possible
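A compact sketch of one way to write the smooth-case iteration, loosely following the algorithm in arXiv:1312.5799; treat the details, in particular the stepsize constants v_i and the θ update, as my reconstruction rather than authoritative pseudocode:

```python
import math
import random

def approx_smooth(grad_i, n, v, tau, x0, iters):
    """Sketch of accelerated parallel coordinate descent, smooth case (no regularizer).

    grad_i(y, i) -- partial derivative of f at y w.r.t. coordinate i
    v            -- stepsize parameters, one per coordinate (from the ESO assumption)
    tau          -- number of coordinates updated per iteration
    """
    x = list(x0)
    z = list(x0)
    theta = tau / n
    for _ in range(iters):
        # y_k = (1 - theta_k) * x_k + theta_k * z_k
        # (the efficient implementation avoids forming y explicitly; this sketch is O(n))
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        S = random.sample(range(n), tau)             # tau coordinates, uniformly at random
        dz = {i: -(tau / (n * theta * v[i])) * grad_i(y, i) for i in S}
        for i, d in dz.items():
            z[i] += d
        # x_{k+1} = y_k + (n/tau) * theta_k * (z_{k+1} - z_k)
        x = list(y)
        for i, d in dz.items():
            x[i] += (n / tau) * theta * d
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return x
```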
Convergence Rate • Key assumption: an expected separable overapproximation (ESO) of f with respect to the sampling, with parameters v_1, …, v_n • Theorem [FR’13b]: with n = # coordinates, τ = average # coordinates updated / iteration, and k = # iterations, k of order (n/τ)·√(1/ε) implies E[F(x_k)] − F* ≤ ε, i.e., an accelerated O(1/k²) rate
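For completeness, a LaTeX rendering of the kind of bound behind this statement; this is my reconstruction from memory of arXiv:1312.5799, so the exact constants may differ from the paper:

```latex
% reconstruction of the FR'13b bound (treat constants as approximate)
\mathbf{E}\big[F(x_k)\big] - F^* \;\le\;
\frac{4 n^2}{\big((k-1)\tau + 2n\big)^2}
\left( \Big(1 - \tfrac{\tau}{n}\Big)\big(F(x_0) - F^*\big)
      + \tfrac{1}{2}\,\|x_0 - x^*\|_v^2 \right),
\qquad
\|x\|_v^2 := \sum_{i=1}^{n} v_i \big(x^{(i)}\big)^2 .
```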
Special Case: Fully Parallel Variant • all coordinates are updated in each iteration (τ = n) • with the weights normalized to sum to n, the required # iterations no longer depends on n: k of order √(1/ε) implies E[F(x_k)] − F* ≤ ε
Special Case: Effect of New Stepsizes • With the new stepsizes (will mention later!), the complexity is governed by the average degree of separability and by an “average” of the coordinate Lipschitz constants
Cost of 1 Iteration of APPROX • Assume N = n (all blocks are of size 1), that the loss is built from scalar functions whose derivatives cost O(1) to evaluate, and that the data matrix A is sparse • Then the average cost of 1 iteration of APPROX is O(τ · c̄) arithmetic ops, where c̄ = average # nonzeros in a column of A
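The per-column cost comes from maintaining the residual r = A x incrementally rather than recomputing it. A minimal sketch (my own illustration, assuming a column-wise sparse representation of A; not code from the talk):

```python
def update_coordinate(cols, x, r, i, delta):
    """Apply x[i] += delta and keep the residual r = A @ x consistent.

    cols[i] is the sparse i-th column of A as a list of (row, value) pairs,
    so the update touches only the nonzeros of that column.
    """
    x[i] += delta
    for row, a_ji in cols[i]:
        r[row] += a_ji * delta
```

With r at hand, the partial derivative of a loss of the form f(x) = Σ_j φ_j(r_j) with respect to coordinate i is Σ_j φ_j'(r_j)·A_ji, again a sum over the nonzeros of column i only.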
L1-Regularized L1 Regression • [Figure: comparison of the Gradient Method, Nesterov’s Accelerated Gradient Method, SPCDM and APPROX on the Dorothea dataset]
L1-Regularized Least Squares (LASSO) • [Figure: PCDM vs APPROX on the KDDB dataset]
Training Linear SVMs • [Figure: results on the Malicious URL dataset]
Choice of Stepsizes: How (not) to Parallelize Coordinate Descent
Convergence of Randomized Coordinate Descent • Focus on n (big data = big n)
• Strongly convex F (Simple Method): O(n log(1/ε)) iterations
• ‘Difficult’ nonsmooth F (Simple Method): O(n/ε²) iterations
• ‘Difficult’ nonsmooth F (Accelerated Method) or smooth F (Simple Method): O(n/ε) iterations
• Smooth or ‘simple’ nonsmooth F (Accelerated Method): O(n/√ε) iterations
Parallelization Dream • Serial: 1 coordinate updated per iteration • Parallel: τ coordinates updated per iteration • WANT: a τ-fold speedup • What do we actually get? Depends on the extent to which we can add up the individual updates, which in turn depends on the properties of F and on the way coordinates are chosen at each iteration
“Naive” parallelization • Do the same thing as before, but for MORE or ALL coordinates & ADD UP the updates
Failure of naive parallelization • [Figure sequence: from point 0, the two coordinate-wise updates 1a and 1b are computed and added, giving point 1; repeating from 1 gives updates 2a and 2b and point 2, which overshoots the minimizer. OOPS!]
Idea: averaging updates may help • [Figure: from point 0, averaging the coordinate-wise updates 1a and 1b gives point 1, which lands on the minimizer. SOLVED!]
Averaging can be too conservative • [Figure sequence: from point 0, the averaged updates 1a, 1b give point 1; then 2a, 2b give point 2, and so on; progress toward the point we actually wanted is much slower. BAD!!!]
What to do? • Let h^(i) be the update to coordinate i and e_i the i-th unit coordinate vector • Averaging: x_{k+1} = x_k + (1/τ) Σ_{i∈S_k} h^(i) e_i • Summation: x_{k+1} = x_k + Σ_{i∈S_k} h^(i) e_i • Figure out when one can safely use the summation
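A tiny numerical illustration of the dilemma pictured above (my own toy example, not from the talk): on the coupled quadratic f(x, y) = 0.5·(x + y)², summing the two coordinate-wise steps just oscillates and makes no progress, while averaging solves the problem in one step; on the separable quadratic f(x, y) = 0.5·(x² + y²) the roles reverse: summation is optimal after one step and averaging is needlessly slow.

```python
def coord_steps(grad, diag, v):
    """Exact coordinate-wise minimization steps for a quadratic with Hessian
    diagonal `diag`: h_i = -grad_i(v) / diag_i, all computed at the same point v."""
    g = grad(v)
    return [-g[i] / diag[i] for i in range(len(v))]

def run(grad, diag, v, combine, iters=20):
    """combine = 1.0 sums the coordinate updates; combine = 0.5 averages them."""
    for _ in range(iters):
        h = coord_steps(grad, diag, v)
        v = [vi + combine * hi for vi, hi in zip(v, h)]
    return v

def grad_coupled(v):    # f(x, y) = 0.5*(x + y)**2, Hessian diagonal = [1, 1]
    return [v[0] + v[1], v[0] + v[1]]

def grad_separable(v):  # f(x, y) = 0.5*(x**2 + y**2), Hessian diagonal = [1, 1]
    return [v[0], v[1]]

print(run(grad_coupled,   [1, 1], [1.0, 3.0], 1.0))  # summation: oscillates, no progress
print(run(grad_coupled,   [1, 1], [1.0, 3.0], 0.5))  # averaging: optimal after one step
print(run(grad_separable, [1, 1], [1.0, 3.0], 1.0))  # summation: optimal after one step
print(run(grad_separable, [1, 1], [1.0, 3.0], 0.5))  # averaging: slow geometric decay
```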
5 Models for f Admitting Small Stepsize Parameters
1. Smooth partially separable f [RT’11b]
2. Nonsmooth max-type f [FR’13]
3. f with ‘bounded Hessian’ [BKBG’11, RT’13a]
4. Partially separable f with smooth components [NC’13]
5. Partially separable f with block smooth components [FR’13b]