Optimization via (too much?) Randomization
Why parallelizing like crazy and being lazy can be good
Peter Richtarik
Optimization as Mountain Climbing
Optimization with Big Data = Extreme* Mountain Climbing
(* in a billion-dimensional space, on a foggy day)
Big Data: BIG Volume, BIG Velocity, BIG Variety
• digital images & videos
• transaction records
• government records
• health records
• defence
• internet activity (social media, Wikipedia, ...)
• scientific measurements (physics, climate models, ...)
If You Are Not a God...
[figure: successive iterates x0, x1, x2, x3]
Randomized Parallel Coordinate Descent
[figure: iteration path from "start" to a "settle for this" point, with the "holy grail" optimum beyond]
• Arup (Truss Topology Design)
• Western General Hospital (Creutzfeldt-Jakob Disease)
• Ministry of Defence dstl lab (Algorithms for Data Simplicity)
• Royal Observatory (Optimal Planet Growth)
A Lock with 4 Dials
Setup: a function representing the "quality" of a combination x = (x1, x2, x3, x4),
F(x) = F(x1, x2, x3, x4);
the combination maximizing F opens the lock.
Optimization Problem: find the combination maximizing F.
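To make the metaphor concrete, here is a minimal toy sketch (not from the talk; the quality function and the secret combination are made up): with only 4 dials, brute force over all combinations finds the maximizer of F, which is exactly what stops working once there are a billion dials.

```python
# Toy 4-dial lock (illustrative only): each dial takes values 0-9 and
# F scores how "good" a combination is; the maximizer opens the lock.
import itertools

SECRET = (3, 1, 4, 1)  # hypothetical winning combination

def F(x1, x2, x3, x4):
    # Made-up quality function, peaking at the secret combination.
    return -sum((a - b) ** 2 for a, b in zip((x1, x2, x3, x4), SECRET))

# With 4 dials, brute force over 10^4 combinations is trivial;
# with a billion dials it is hopeless, hence the need for smarter methods.
best = max(itertools.product(range(10), repeat=4), key=lambda c: F(*c))
print(best)  # (3, 1, 4, 1)
```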
A System of a Billion Locks with Shared Dials
1) Nodes in the graph correspond to dials x1, x2, x3, ..., xn.
2) Nodes in the graph also correspond to locks: each lock (= node) owns the dials connected to it in the graph by an edge.
# dials = n = # locks
How Do We Measure the Quality of a Combination?
• Each lock j has its own quality function Fj, depending only on the dials it owns
• However, lock j does NOT open when Fj is maximized
• The system of locks opens when F = F1 + F2 + ... + Fn is maximized, where F : R^n → R
An Algorithm with (too much?) Randomization
1) Randomly select a lock
2) Randomly select a dial belonging to the lock
3) Adjust the value on the selected dial, based only on the information corresponding to the selected lock
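Below is a minimal sketch of this three-step randomized rule on a made-up system of locks with shared dials. The graph, the per-lock functions Fj, the finite-difference derivative and the step size are all illustrative assumptions, not the exact method or data from the cited papers.

```python
# Randomized coordinate ascent sketch on F = F_1 + ... + F_n, where lock j
# only "sees" the dials it owns. Everything below is made up for illustration:
# lock j owns dials j and j+1 on a ring and wants them to agree.
import numpy as np

rng = np.random.default_rng(0)
n = 6
owns = [[j, (j + 1) % n] for j in range(n)]      # dials owned by lock j

def F_j(j, x):
    xi, xk = x[owns[j]]
    return -(xi - xk) ** 2                       # hypothetical per-lock quality

def F(x):
    return sum(F_j(j, x) for j in range(n))      # the system opens when F is maximal

x = rng.standard_normal(n)
step, eps = 0.1, 1e-6
for _ in range(3000):
    j = int(rng.integers(n))                     # 1) randomly select a lock
    i = int(rng.choice(owns[j]))                 # 2) randomly select one of its dials
    # 3) adjust the selected dial using only lock j's information:
    e = np.zeros(n); e[i] = eps
    g = (F_j(j, x + e) - F_j(j, x - e)) / (2 * eps)   # partial derivative of F_j
    x[i] += step * g                             # uphill step (we are maximizing)
print(F(x))                                      # climbs toward 0, the maximum of this toy F
```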
Synchronous Parallelization
[diagram: Processors 1-3 run jobs J1-J9 in synchronized rounds; each processor sits IDLE until the slowest job of the round finishes. WASTEFUL]
Crazy (Lock-Free) Parallelization
[diagram: Processors 1-3 run jobs J1-J9 back to back, with no synchronization barriers and no idle time. NO WASTE]
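As a structural illustration of the lock-free pattern only (Python threads are serialized by the GIL, so no real speedup here; the least-squares objective, sizes and update rule are made up and are not the exact method from the cited papers), several workers update a shared vector with no barriers and no locks:

```python
# Hogwild-style sketch: each worker repeatedly picks a random coordinate and
# updates the shared iterate x without any synchronization. Stale reads and
# unsynchronized writes are accepted on purpose; that is the "crazy" part.
import threading
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 200
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = np.zeros(n)                                  # shared, lock-free iterate

def worker(seed, num_updates=1000):
    local_rng = np.random.default_rng(seed)
    for _ in range(num_updates):
        i = int(local_rng.integers(n))           # random coordinate
        g = A[:, i] @ (A @ x - b)                # partial derivative of 0.5*||Ax - b||^2
        x[i] -= g / (A[:, i] @ A[:, i])          # exact minimization along coordinate i

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(0.5 * np.linalg.norm(A @ x - b) ** 2)      # residual after asynchronous updates
```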
Theoretical Result
[formula relating: # processors, # locks, the average # of dials in a lock, and the average # of dials common between 2 locks]
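The formula itself did not survive extraction. As a hedged reconstruction only: the quantities listed above match the parallel coordinate descent speedup bound of the cited arXiv:1212.0873 paper, which (up to constants, and with the slide's averages possibly replacing the worst-case overlap) has roughly the following form; treat the exact expression as an assumption, not the verbatim slide content.

```latex
% Hedged reconstruction, not the verbatim slide formula.
% n      = number of locks (= number of dials)
% \tau   = number of processors
% \omega = degree of overlap: how many dials a lock touches / shares with other locks
\[
  \text{speedup with } \tau \text{ processors} \;\approx\;
  \frac{\tau}{\,1 + \dfrac{(\omega - 1)(\tau - 1)}{n - 1}\,},
  \qquad \text{i.e. nearly } \tau\text{-fold whenever } \omega \ll n .
\]
```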
Why parallelizing like crazy and being lazy can be good
Randomization: • Effectiveness • Tractability • Efficiency • Scalability (big data)
Parallelization: • Parallelism • Distribution • Asynchronicity
Optimization Methods for Big Data
• Randomized Coordinate Descent: P. R. and M. Takac, Parallel coordinate descent methods for big data optimization, arXiv:1212.0873 [can solve a problem with 1 billion variables in 2 hours using 24 processors]
• Stochastic (Sub)Gradient Descent: P. R. and M. Takac, Randomized lock-free methods for minimizing partially separable convex functions [can be applied to optimize an unknown function]
• Both of the above: M. Takac, A. Bijral, P. R. and N. Srebro, Mini-batch primal and dual methods for support vector machines, arXiv:1303.xxxx
Tools: Probability, HPC, Matrix Theory, Machine Learning