Machine Learning in the Cloud Carlos Guestrin Joe Hellerstein David O’Hallaron Yucheng Low Aapo Kyrola Danny Bickson Joey Gonzalez
Machine Learning in the Real World 13 Million Wikipedia Pages 500 Million Facebook Users 3.6 Billion Flickr Photos 24 Hours of Video Uploaded to YouTube Every Minute
Parallelism is Difficult • Wide array of different parallel architectures: GPUs, Multicore, Clusters, Clouds, Supercomputers • Different challenges for each architecture • High-level abstractions make things easier
MapReduce – Map Phase [Figure: input values such as 12.9, 42.3, 21.3, 25.8 are processed independently by CPUs 1–4 across successive animation frames] Embarrassingly Parallel: independent computation, no communication needed
MapReduce – Reduce Phase [Figure: CPUs 1–2 aggregate the mapped values into two partial results] Fold/Aggregation
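As a concrete illustration, here is a minimal sequential sketch of the map/fold pattern shown in the figures, written in plain C++ rather than any MapReduce framework; the doubling map function and the input values are illustrative only.

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
      // Input records, as in the figure (values are illustrative).
      std::vector<double> inputs = {12.9, 42.3, 21.3, 25.8};

      // Map phase: embarrassingly parallel; each element is transformed
      // independently, so no communication is needed.
      std::vector<double> mapped(inputs.size());
      std::transform(inputs.begin(), inputs.end(), mapped.begin(),
                     [](double x) { return 2.0 * x; });  // illustrative map

      // Reduce phase: fold/aggregation of the mapped values.
      double total = std::accumulate(mapped.begin(), mapped.end(), 0.0);
      std::cout << "total = " << total << "\n";
      return 0;
    }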
MapReduce and ML • Excellent for large data-parallel tasks! [Figure: two columns. Data-Parallel (MapReduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Complex Parallel Structure: ?] Is there more to Machine Learning?
Iterative Algorithms? • We can implement iterative algorithms in MapReduce: [Figure: data partitions spread across CPUs 1–3, with a barrier between iterations; a slow processor makes everyone wait at the barrier]
Iterative MapReduce • System is not optimized for iteration: [Figure: every iteration pays a startup penalty and a disk penalty on each CPU]
Iterative MapReduce • Only a subset of data needs computation (multi-phase iteration): [Figure: only a few data blocks change per iteration, yet every barrier-synchronized phase touches all of the data]
Structured Problems Example problem: Will I be successful in research? Success depends on the success of others. Interdependent computation is not map-reducible: we may not be able to safely update neighboring nodes simultaneously [e.g., Gibbs sampling]
Space of Problems • Sparse Computation Dependencies • Can be decomposed into local “computation-kernels” • Asynchronous Iterative Computation • Repeated iterations over local kernel computations
Parallel Computing and ML • Not all algorithms are efficiently data-parallel [Figure: two columns. Data-Parallel (MapReduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Structured Iterative Parallel (GraphLab): Tensor Factorization, Lasso, Kernel Methods, Belief Propagation, Learning Graphical Models, SVM, Sampling, Deep Belief Networks, Neural Networks]
GraphLab Goals • Designed for ML needs • Express data dependencies • Iterative • Simplifies the design of parallel programs: • Abstract away hardware issues • Addresses multiple hardware architectures • Multicore • Distributed • GPU and others
GraphLab Goals [Figure: chart of model complexity vs. data size. Today's data-parallel tools handle simple models on large data; GraphLab's goal is to support complex models on large data]
GraphLab A Domain-Specific Abstraction for Machine Learning
Everything on a Graph A graph with data associated with every vertex and edge
Update Functions Update Functions: operations applied to a vertex that transform the data in the scope of that vertex (the vertex itself, its adjacent edges, and its neighbors)
Update Functions An update function can schedule the computation of any other update function: FIFO scheduling, prioritized scheduling, randomized scheduling, etc. Scheduled computation is guaranteed to execute eventually.
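A minimal single-threaded sketch of this "update function + scheduler" model (toy types and names, not the actual GraphLab API): the engine drains a FIFO queue of scheduled vertices, and the update function transforms one vertex's data and may schedule further work.

    #include <iostream>
    #include <queue>
    #include <vector>

    struct Graph {
      std::vector<double> vertex_data;          // data on each vertex
      std::vector<std::vector<int>> neighbors;  // adjacency lists
    };

    // An update function: transforms the data in the scope of vertex v
    // and may schedule any other vertex for later recomputation.
    void decay_update(Graph& g, int v, std::queue<int>& scheduler) {
      g.vertex_data[v] *= 0.5;                  // transform data in scope
      if (g.vertex_data[v] > 0.01)              // not yet converged:
        for (int nbr : g.neighbors[v])          // schedule the neighbors
          scheduler.push(nbr);
    }

    int main() {
      Graph g{{8.0, 4.0}, {{1}, {0}}};          // two mutually adjacent vertices
      std::queue<int> scheduler;                // FIFO scheduling
      scheduler.push(0);
      scheduler.push(1);
      while (!scheduler.empty()) {              // the execution engine:
        int v = scheduler.front();              // scheduled work is
        scheduler.pop();                        // guaranteed to run eventually
        decay_update(g, v, scheduler);
      }
      std::cout << g.vertex_data[0] << " " << g.vertex_data[1] << "\n";
    }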
Example: PageRank Graph = WWW. Update function: multiply adjacent PageRank values with edge weights and combine them to get the current vertex's PageRank. "Prioritized" PageRank computation: skip converged vertices.
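A hedged single-threaded sketch of this update function on a three-vertex toy web graph (the graph, weights, and constants are illustrative, not GraphLab's implementation): each pop recomputes one vertex's rank from its in-neighbors, and neighbors are rescheduled only when the rank changed noticeably, which is exactly how converged vertices get skipped.

    #include <cmath>
    #include <iostream>
    #include <queue>
    #include <vector>

    struct Edge { int src; double weight; };

    int main() {
      const double kDamping = 0.85, kTolerance = 1e-4;
      // in_edges[v] lists the edges pointing into vertex v (toy graph).
      std::vector<std::vector<Edge>> in_edges = {
          {{1, 1.0}}, {{0, 0.5}, {2, 1.0}}, {{0, 0.5}}};
      std::vector<std::vector<int>> out_neighbors = {{1, 2}, {0}, {1}};
      std::vector<double> rank(3, 1.0);

      std::queue<int> scheduler;
      for (int v = 0; v < 3; ++v) scheduler.push(v);

      while (!scheduler.empty()) {
        int v = scheduler.front();
        scheduler.pop();
        // Multiply adjacent PageRank values with edge weights and sum.
        double sum = 0.0;
        for (const Edge& e : in_edges[v]) sum += e.weight * rank[e.src];
        double new_rank = (1.0 - kDamping) + kDamping * sum;
        double change = std::fabs(new_rank - rank[v]);
        rank[v] = new_rank;
        if (change > kTolerance)                // skip converged vertices
          for (int nbr : out_neighbors[v]) scheduler.push(nbr);
      }
      for (double r : rank) std::cout << r << "\n";
    }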
Example: K-Means Clustering Graph = (fully connected) bipartite graph between data vertices and cluster vertices. Update functions: Cluster update: compute the average of the data connected on a "marked" edge. Data update: pick the closest cluster, mark that edge, and unmark the remaining edges.
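A compact sketch of these two update functions (toy code; the "marked edge" of the bipartite graph is represented here by an assignment index per data vertex, and the 1-D data is illustrative).

    #include <cmath>
    #include <iostream>
    #include <vector>

    int main() {
      std::vector<double> data = {1.0, 1.2, 7.9, 8.1};
      std::vector<double> centers = {0.0, 10.0};
      std::vector<int> assignment(data.size(), -1);  // index of the "marked" edge

      for (int iter = 0; iter < 10; ++iter) {
        // Data update: pick the closest cluster and mark that edge.
        for (int i = 0; i < (int)data.size(); ++i) {
          int best = 0;
          for (int c = 1; c < (int)centers.size(); ++c)
            if (std::fabs(data[i] - centers[c]) <
                std::fabs(data[i] - centers[best]))
              best = c;
          assignment[i] = best;
        }
        // Cluster update: average the data connected on marked edges.
        for (int c = 0; c < (int)centers.size(); ++c) {
          double sum = 0.0;
          int count = 0;
          for (int i = 0; i < (int)data.size(); ++i)
            if (assignment[i] == c) { sum += data[i]; ++count; }
          if (count > 0) centers[c] = sum / count;
        }
      }
      std::cout << centers[0] << " " << centers[1] << "\n";  // ~1.1 and ~8.0
    }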
Example: MRF Sampling Graph = MRF. Update function: read the samples on adjacent vertices, read the edge potentials, and compute a new sample for the current vertex.
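A hedged sketch of this update on an Ising-chain MRF (toy model; the coupling constant and chain structure are illustrative): resampling a vertex reads its neighbors' samples and the edge potentials, then draws from the conditional distribution.

    #include <cmath>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
      const int n = 8;
      const double coupling = 0.5;             // edge potential strength
      std::vector<int> sample(n, 1);           // spins in {-1, +1}
      std::mt19937 rng(42);
      std::uniform_real_distribution<double> unif(0.0, 1.0);

      for (int sweep = 0; sweep < 100; ++sweep) {
        for (int v = 0; v < n; ++v) {
          // Read samples on adjacent vertices (chain neighbors),
          // weighted by the edge potential.
          double field = 0.0;
          if (v > 0)     field += coupling * sample[v - 1];
          if (v < n - 1) field += coupling * sample[v + 1];
          // Conditional P(spin = +1 | neighbors) for the Ising model.
          double p_plus = 1.0 / (1.0 + std::exp(-2.0 * field));
          sample[v] = (unif(rng) < p_plus) ? +1 : -1;
        }
      }
      for (int s : sample) std::cout << s << " ";
      std::cout << "\n";
    }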
Not Message Passing! The graph is a data structure. Update functions perform parallel modifications to the data structure.
Safety What happens if adjacent update functions execute simultaneously?
Importance of Consistency Is ML resilient to soft optimization? Can we permit races and settle for "best-effort" computation? True for some algorithms, but not for many: a racy implementation may work empirically on some datasets and fail on others.
Importance of Consistency Many algorithms require strict consistency, or perform significantly better under strict consistency. [Figure: Alternating Least Squares]
Importance of Consistency The fast ML algorithm development cycle (Build, Test, Debug, Tweak Model) requires the framework to behave predictably and consistently, avoiding problems caused by non-determinism. Otherwise: is the execution wrong, or is the model wrong?
Sequential Consistency GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions that produces the same result. [Figure: timeline comparing a parallel execution on CPUs 1–2 with an equivalent sequential execution on CPU 1]
Sequential Consistency GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions that produces the same result. This is a formalization of the intuitive concept of a "correct program": computation does not read outdated data from the past, and computation does not read results of computation that occurs in the future. This is the primary property of GraphLab.
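One way to state the guarantee formally (our phrasing in LaTeX, not a formula from the slides):

    % Our phrasing: a parallel execution that applies update functions
    % f_1, ..., f_k to the initial graph state G_0 and ends in state
    % G_final is sequentially consistent if some serial order of the
    % same update functions reaches the same state:
    \exists\, \text{a permutation } \pi \text{ of } \{1,\dots,k\} :\quad
    \bigl(f_{\pi(k)} \circ \cdots \circ f_{\pi(1)}\bigr)(G_0) = G_{\text{final}}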
Global Information What if we need global information? Algorithm Parameters? Sufficient Statistics? Sum of all the vertices?
Shared Variables • Global aggregation through the Sync operation • A global parallel reduction over the graph data • Synced variables are recomputed at defined intervals • Sync computation is sequentially consistent • Permits correct interleaving of Syncs and Updates Examples: Sync: Log-likelihood. Sync: Sum of vertex values.
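A minimal sketch of the Sync idea (toy code, not the GraphLab API): the synced shared variable is a reduction over all vertex data, recomputed at a defined interval of update executions.

    #include <iostream>
    #include <vector>

    int main() {
      std::vector<double> vertex_data = {1.0, 2.0, 3.5};
      double sum_of_vertices = 0.0;            // the synced shared variable
      const int sync_interval = 100;           // recompute every 100 updates

      for (int updates = 0; updates < 1000; ++updates) {
        // ... run one update function here ...
        if (updates % sync_interval == 0) {
          // Sync: a global reduction over the graph data. In the real
          // system this runs in parallel and interleaves consistently
          // with updates; here it is a plain sequential fold.
          sum_of_vertices = 0.0;
          for (double v : vertex_data) sum_of_vertices += v;
        }
      }
      std::cout << "sum of vertex values = " << sum_of_vertices << "\n";
    }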
Sequential Consistency GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions and Syncs that produces the same result. [Figure: timeline comparing parallel and sequential executions]
Moving towards the cloud… • Purchasing and maintaining computers is very expensive • Most computing resources are seldom used • Needed only for deadlines… • The cloud: buy time, access hundreds or thousands of processors, and pay only for the resources you need
Distributed GL Implementation • Mixed multi-threaded / distributed implementation (each machine runs only one instance) • Requires all data to be in memory; move computation to the data • MPI for management + TCP/IP for communication • Asynchronous C++ RPC layer • Ran on 64 EC2 HPC nodes = 512 processors
Underlying Network [Figure: every machine runs the same stack: execution threads drive an execution engine on top of shared data (a cache-coherent distributed key-value store), a distributed graph, and distributed locks, all communicating over the network through an RPC controller]
Write distributed programs easily • Asynchronous communication • Multithreaded support • Fast • Scalable • Easy To Use (Every machine runs the same binary)
I ♥ C++
Features • Easy RPC capabilities:

One-way calls:

    rpc.remote_call([target_machine ID], printf,
                    "%s %d %d %d\n", "hello world", 1, 2, 3);

Requests (calls with a return value):

    std::vector<int>& sort_vector(std::vector<int>& v) {
      std::sort(v.begin(), v.end());
      return v;
    }

    vec = rpc.remote_request([target_machine ID], sort_vector, vec);
Features • Object Instance Context • MPI-like primitives:

    dc.barrier()
    dc.gather(...)
    dc.send_to([target machine], [arbitrary object])
    dc.recv_from([source machine], [arbitrary object ref])

[Figure: each machine pairs a K-V object with its RPC controller] MPI-like safety
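A hedged sketch of how these primitives might be combined, using only the calls named on the slide; the type name distributed_control and the dc.procid() accessor are assumptions, not verified API details.

    #include <vector>

    // Hypothetical usage: machine 0 ships a vector to machine 1, then
    // all machines synchronize. Types and signatures are assumptions.
    void exchange_example(distributed_control& dc) {
      std::vector<int> local_result = {1, 2, 3};

      if (dc.procid() == 0) {
        dc.send_to(1, local_result);      // send an arbitrary object
      } else if (dc.procid() == 1) {
        std::vector<int> received;
        dc.recv_from(0, received);        // receive into an object ref
      }

      dc.barrier();                       // MPI-style barrier across machines
    }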
Request Latency [Plot: request latency; for reference, ping RTT = 90 µs]
One-Way Call Rate [Plot: one-way call rate; 1 Gbps physical peak]
Serialization Performance Benchmark: 100,000 one-way calls, each carrying a vector of 10 × {"hello", 3.14, 100}
Distributed Computing Challenges Q1: How do we efficiently distribute the state? (with a potentially varying number of machines) Q2: How do we ensure sequential consistency? Keeping in mind: limited bandwidth, high latency, and performance.