
Projection-free Online Learning



Presentation Transcript


1. Projection-free Online Learning. Dan Garber, Elad Hazan

2. Matrix completion. Super-linear operations are infeasible!

3. Online convex optimization. [Diagram: the player plays $x_1, x_2, \ldots$; after each play the cost function $f_1, f_2, \ldots, f_T$ is revealed, and the incurred losses are $f_1(x_1), f_2(x_2), \ldots, f_T(x_T)$.]
• Linear (convex) bounded cost functions
• Total loss = $\sum_t f_t(x_t)$
• Regret = $\sum_t f_t(x_t) - \min_{x^*} \sum_t f_t(x^*)$ (see the protocol sketch below)
• Matrix completion: decision set = low-rank matrices, $X_{ij}$ = prediction for user $i$, movie $j$; cost functions $f(X) = (X \bullet E_{ij} - (\pm 1))^2$
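To make the protocol concrete, here is a minimal Python sketch of the online loop and the loss bookkeeping. The `predict`/`update` player interface and the `mc_loss` helper are illustrative assumptions, not part of the slides.

```python
def mc_loss(X, i, j, y):
    """Slide-3 matrix-completion cost: f(X) = (X . E_ij - y)^2,
    where X . E_ij = <X, E_ij> = X[i, j] and y is an observed +/-1 rating."""
    return (X[i, j] - y) ** 2

def run_oco(player, loss_fns):
    """OCO protocol: the player commits to x_t, then f_t is revealed.
    Returns total loss sum_t f_t(x_t); regret subtracts the best fixed
    comparator's loss min_{x*} sum_t f_t(x*), computed offline."""
    total = 0.0
    for f in loss_fns:
        x = player.predict()   # commit to x_t before seeing f_t
        total += f(x)          # incur loss f_t(x_t)
        player.update(f)       # observe f_t, update internal state
    return total
```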

4. Online gradient descent. The algorithm: move in the direction of $-\nabla_t$ (the gradient of the current cost function), $y_{t+1} = x_t - \eta_t \nabla_t$, and project back to the convex set: $x_{t+1} = \Pi_K(y_{t+1})$. Thm [Zinkevich]: if $\eta_t = 1/\sqrt{t}$ then this algorithm attains worst-case regret $\sum_t f_t(x_t) - \sum_t f_t(x^*) = O(\sqrt{T})$.
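A minimal sketch of the OGD update in Python, assuming the decision set $K$ is the Euclidean ball (one of the easy-projection sets listed on the next slide); the function names and the gradient-callback interface are illustrative.

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto {x : ||x|| <= radius}, in closed form."""
    n = np.linalg.norm(y)
    return y if n <= radius else (radius / n) * y

def online_gradient_descent(grad_fns, x0, project=project_ball):
    """Zinkevich's OGD: y_{t+1} = x_t - eta_t * grad f_t(x_t), then
    x_{t+1} = Pi_K(y_{t+1}); with eta_t = 1/sqrt(t), regret is O(sqrt(T))."""
    x = x0
    for t, grad_f in enumerate(grad_fns, start=1):
        g = grad_f(x)                 # gradient of the revealed f_t at x_t
        eta = 1.0 / np.sqrt(t)        # step size from the theorem
        x = project(x - eta * g)      # gradient step, then project back to K
        yield x
```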

5. Computational efficiency? Gradient step: linear time. Projection step: a quadratic program!! (Online mirror descent: a general convex program.) Projecting onto the convex decision set K:
• In general: $O(m^{1/2} n^3)$
• Simplex / Euclidean ball / cube: linear time (see the sketch below)
• Flow polytope: conic optimization, $O(m^{1/2} n^3)$
• PSD cone (matrix completion): Cholesky decomposition, $O(n^3)$
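For instance, projection onto the probability simplex reduces to a thresholding rule. The sketch below is the common sort-based $O(n \log n)$ variant; a median-based version achieves the linear time quoted above.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection onto the simplex {x : x >= 0, sum(x) = 1}.
    Sort, find the threshold theta, then clip coordinates at zero."""
    u = np.sort(y)[::-1]                        # coordinates, descending
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - theta, 0.0)
```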

6. Matrix completion. Projections are out of the question!

7. Computationally difficult learning problems
• Matrix completion: K = SDP cone; Cholesky decomposition
• Online routing: K = flow polytope; conic optimization over the flow polytope
• Rotations: K = rotation matrices
• Matroids: K = matroid polytope

8. Results part 1 (Hazan + Kale, ICML '12)
• Projection-less stochastic/online algorithms with regret bounds:
• Projections <-> linear optimization
• Parameter free (no learning rate)
• Sparse predictions

9. Linear opt. vs. projections
• Matrix completion: K = SDP cone; projection = Cholesky decomposition; linear opt. = largest singular vector (sketch below)
• Online routing: K = flow polytope; projection = conic optimization over the flow polytope; linear opt. = shortest-path computation
• Rotations: K = rotation matrices; projection = convex opt.; linear opt. = Wahba's algorithm
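As an illustration of the gap, linear optimization over the trace-one PSD spectrahedron only needs a leading eigenvector, which power iteration approximates cheaply per step. This sketch (the shift trick and iteration count included) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def leading_eigvec(A, iters=200, seed=0):
    """Power iteration: approximate leading eigenvector of symmetric A."""
    v = np.random.default_rng(seed).standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

def linear_opt_spectrahedron(C, iters=200):
    """argmin of <C, X> over {X PSD, trace(X) = 1}: the rank-one matrix
    v v^T, where v is the eigenvector of C's smallest eigenvalue. Shift
    so that shift*I - C has that eigenvector as its leading one."""
    shift = np.linalg.norm(C, ord=2)       # >= largest |eigenvalue| of C
    v = leading_eigvec(shift * np.eye(C.shape[0]) - C, iters)
    return np.outer(v, v)
```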

10. The Frank-Wolfe algorithm [Diagram: from $x_t$, a linear-optimization step finds the vertex $v_{t+1}$; the next iterate $x_{t+1}$ lies on the segment between $x_t$ and $v_{t+1}$.]

11. The Frank-Wolfe algorithm (conditional gradient)
Thm [FW '56]: rate of convergence = $O(C/t)$ ($C$ = smoothness)
[Clarkson '06]: refined analysis
[Hazan '07]: SDP
[Jaggi '11]: generalization

12. The Frank-Wolfe algorithm
• At iteration t: the iterate is a convex combination of $\le t$ vertices, i.e., $(t, K)$-sparse
• No learning rate: the convex combination uses weight $\sim 1/t$ (independent of diameter, gradients, etc.); see the sketch below
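A minimal Frank-Wolfe sketch assuming access to a linear-optimization oracle over $K$; the $2/(t+2)$ schedule is the textbook choice matching the $\sim 1/t$ weights above.

```python
import numpy as np

def frank_wolfe(grad_f, linear_opt, x0, iters=100):
    """Conditional gradient: after t steps the iterate is a convex
    combination of at most t vertices (sparsity), and the schedule
    gamma_t needs no tuning against diameter or gradient norms."""
    x = x0
    for t in range(iters):
        v = linear_opt(grad_f(x))          # vertex minimizing <grad, v> over K
        gamma = 2.0 / (t + 2.0)            # fixed ~1/t step, no learning rate
        x = (1.0 - gamma) * x + gamma * v  # stays inside K by convexity
    return x

# Example oracle: over the simplex, linear optimization picks one coordinate.
simplex_lo = lambda g: np.eye(len(g))[np.argmin(g)]
```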

13. Online Conditional Gradient (OCG) [Diagram: per round, a single linear-optimization step finds $v_{t+1}$, and $x_{t+1}$ is a convex combination of $x_t$ and $v_{t+1}$.]
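A hedged sketch of the OCG idea following Hazan-Kale: per round, one linear-optimization call and a single Frank-Wolfe step on a regularized surrogate of the losses seen so far. The parameter choices below ($\eta \sim T^{-3/4}$, $\sigma_t \sim t^{-1/2}$) are illustrative; suitably tuned, OCG attains $O(T^{3/4})$ regret.

```python
import numpy as np

def ocg(grad_fns, linear_opt, x1):
    """Online Conditional Gradient (sketch). Surrogate at round t:
    F_t(x) = eta * <sum_{s<=t} g_s, x> + ||x - x1||^2; the update is one
    Frank-Wolfe step on F_t, i.e., one linear-opt call, no projection."""
    x1 = np.asarray(x1, dtype=float)
    T = len(grad_fns)
    eta = T ** (-0.75)                      # illustrative tuning
    x = x1.copy()
    grad_sum = np.zeros_like(x1)
    for t, grad_f in enumerate(grad_fns, start=1):
        grad_sum += grad_f(x)               # grad of f_t at the played x_t
        surrogate = eta * grad_sum + 2.0 * (x - x1)
        v = linear_opt(surrogate)           # single linear-opt call on K
        sigma = min(1.0, 2.0 / t ** 0.5)    # illustrative step-size schedule
        x = (1.0 - sigma) * x + sigma * v   # convex step keeps x in K
        yield x
```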

14. Projections <-> linear optimization
• Parameter free (no learning rate)
• Sparse predictions
• But can we get the optimal $O(\sqrt{T})$ rate??
• Barrier: existing projection-free algorithms were not linearly converging (in poly time)

15. New poly-time projection-free alg. [Garber, Hazan 2013]
• New algorithm with convergence rate $\sim e^{-t/n}$ (CS: "poly time"; Nemirovski: "linear rate")
• Only linear-optimization calls on the original polytope! (constantly many per iteration)

16. Linearly converging Frank-Wolfe
• Assume the optimum is within Euclidean distance $r$:
• Thm [easy]: rate of convergence = $e^{-t}$
• But useless: under a ball-intersection constraint, the step becomes quadratic optimization, equivalent to projection

17. Polytopes are OK!
• Can find a significantly smaller polytope (radius proportional to the Euclidean distance to OPT) that:
• Contains $x^*$
• Does not leave the original polytope
• Has the same shape

18. Implications for online optimization
• Projections <-> linear optimization
• Parameter free (no learning rate)
• Sparse predictions
• Optimal $O(\sqrt{T})$ rate

19. More research / open questions
• Projection-free algorithms: for many problems, linear step time vs. cubic or more
• For the main ML problems today, projection-free is the only feasible optimization method
• Completely poly-time (log dependence on smoothness / strong convexity / diameter)
• Can we attain poly-time optimization using only gradient information?
Thank you!
