GraphLab 2: A Distributed Abstraction for Large-Scale Machine Learning
Carlos Guestrin, Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Haijie Gu
Needless to Say, We Need Machine Learning for Big Data
750 Million Facebook Users, 24 Million Wikipedia Pages, 6 Billion Flickr Photos, 48 Hours of YouTube Uploaded a Minute
"… data a new class of economic asset, like currency or gold."
Big Learning: How will we design and implement parallel learning systems?
A Shift Towards Parallelism
Graduate students face GPUs, multicore, clusters, clouds, supercomputers…
• ML experts repeatedly solve the same parallel design challenges: race conditions, distributed state, communication…
• The resulting code is difficult to maintain, extend, and debug.
Avoid these problems by using high-level abstractions.
Data Parallelism (MapReduce)
[Figure: blocks of independent records dispatched to CPU 1 through CPU 4.]
Solve a huge number of independent subproblems, as in the sketch below.
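A minimal Python sketch of this pattern, with illustrative data and an illustrative per-record computation: because the subproblems share no state, a pool of workers can process them in any order.

# A minimal sketch of the data-parallel pattern: every record is an
# independent subproblem, so workers share no state. The data and the
# per-record computation here are illustrative.
from multiprocessing import Pool

def solve_subproblem(record):
    # Any per-record computation works; here, the mean of the record.
    return sum(record) / len(record)

if __name__ == "__main__":
    data = [[6.7, 5.4], [2.3, 8.4], [3.1, 4.9], [1.8, 4.2]]
    with Pool(4) as pool:                  # e.g., CPU 1 through CPU 4
        print(pool.map(solve_subproblem, data))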
MapReduce for Data-Parallel ML
Excellent for large data-parallel tasks! But is there more to machine learning?
• Data-parallel (MapReduce): feature extraction, cross validation, computing sufficient statistics.
• Graph-parallel: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), graph analysis (PageRank, triangle counting).
Label a Face and Propagate: label one face as "grandma".
Pairwise similarity is not enough… [Figure: an unlabeled face ("who????") is not similar enough to the labeled "grandma" photo to be sure.]
Propagate Similarities & Co-occurrences for Accurate Predictions: similarity edges and co-occurring faces provide further evidence. Grandma!
Collaborative Filtering: Independent Case
Given Lord of the Rings, Star Wars IV, and Star Wars I, recommend Harry Potter and Pirates of the Caribbean.
Collaborative Filtering: Exploiting Dependencies
Given Women on the Verge of a Nervous Breakdown, The Celebration, and City of God, what do I recommend? Recommend Wild Strawberries and La Dolce Vita.
Latent Topic Modeling (LDA)
[Figure: example topic words such as cat, apple, growth, hat, plant.]
Machine Learning Pipeline
Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value from Data
For example: images / docs / movie ratings → faces / important words / side info → similar faces / shared words / rated movies → belief propagation / LDA / collaborative filtering → face labels / doc topics / movie recommendations.
Parallelizing Machine Learning
Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value from Data
Graph ingress (the left half of the pipeline) is mostly data-parallel; the graph-structured computation is graph-parallel.
ML Tasks Beyond Data-Parallelism
MapReduce covers the data-parallel tasks (feature extraction, cross validation, computing sufficient statistics). The graph-parallel tasks remain: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), and graph analysis (PageRank, triangle counting).
PageRank
What's the rank of this user? It depends on the rank of those who follow her, which depends on the rank of those who follow them… Loops in the graph mean we must iterate!
PageRank Iteration
Iterate until convergence ("my rank is the weighted average of my friends' ranks"):
R[i] = α + (1 − α) Σ_j w_ji R[j]
where α is the random reset probability and w_ji is the probability of transitioning (similarity) from j to i. A sketch of this iteration follows.
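A minimal Python sketch of that iteration, assuming an in-neighbor-list representation and a weight dict keyed by edge (j, i); the reset probability 0.15 and tolerance are illustrative defaults.

# A minimal sketch of the PageRank iteration above. in_nbrs[i] lists
# the vertices j linking to i, and w[(j, i)] is the transition weight;
# both representations are illustrative assumptions.
def pagerank(in_nbrs, w, alpha=0.15, tol=1e-6):
    R = {i: 1.0 for i in in_nbrs}
    while True:
        max_change = 0.0
        for i, js in in_nbrs.items():
            new_rank = alpha + (1 - alpha) * sum(w[(j, i)] * R[j] for j in js)
            max_change = max(max_change, abs(new_rank - R[i]))
            R[i] = new_rank
        if max_change < tol:          # iterate until convergence
            return R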
Properties of Graph-Parallel Algorithms
A dependency graph, local updates ("my rank" depends on my friends' ranks), and iterative computation.
Addressing Graph-Parallel ML
MapReduce handles the data-parallel tasks (feature extraction, cross validation, computing sufficient statistics). Can MapReduce handle the graph-parallel ones? We need a graph-parallel abstraction for graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), and data mining (PageRank, triangle counting).
Graph Computation: Synchronous v. Asynchronous
Bulk Synchronous Parallel Model: Pregel (Giraph) [Valiant '90]
Repeat: compute, communicate, barrier (sketched below).
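A minimal single-process Python sketch of the BSP pattern; update(v, msgs) is a hypothetical per-vertex function returning (destination, message) pairs, not Pregel's actual API.

# A minimal sketch of BSP supersteps: compute, communicate, barrier.
# update(v, msgs) is a hypothetical per-vertex function that returns
# (destination, message) pairs.
def bsp_run(vertices, update, num_supersteps):
    inboxes = {v: [] for v in vertices}
    for _ in range(num_supersteps):
        outboxes = {v: [] for v in vertices}
        for v in vertices:                       # compute phase
            for dst, msg in update(v, inboxes[v]):
                outboxes[dst].append(msg)        # communicate phase
        # Barrier: no vertex sees a new message until every vertex has
        # finished the superstep, so all updates act on stale state.
        inboxes = outboxes
    return inboxes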
Bulk synchronous parallel model provably inefficient for some ML tasks
Analyzing Belief Propagation [Gonzalez, Low, Guestrin '09]
[Figure: a chain from A to B; a priority queue focuses computation on the vertices with important influence.]
Smart scheduling: an asynchronous parallel model (rather than BSP) is fundamental for efficiency.
Asynchronous Belief Propagation
Challenge = boundaries. [Figure: a synthetic noisy image and its graphical model; the cumulative vertex updates concentrate many updates at the boundaries and few elsewhere.]
The algorithm identifies and focuses on the hidden sequential structure; see the scheduling sketch below.
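A minimal Python sketch of such priority-driven asynchronous scheduling, assuming hypothetical update(v) and residual(v) hooks: the vertices whose update would change the most are processed first, and affected neighbors are lazily rescheduled.

# A minimal sketch of priority-driven asynchronous scheduling.
# update(v) applies a vertex update in place; residual(v) measures how
# much v would still change. Both are hypothetical hooks, and stale
# queue entries are handled lazily.
import heapq

def async_run(graph, update, residual, eps=1e-3):
    pq = [(-residual(v), v) for v in graph]   # max-heap via negation
    heapq.heapify(pq)
    while pq:
        neg_r, v = heapq.heappop(pq)
        if -neg_r < eps:
            break                             # remaining work is converged
        update(v)
        for u in graph[v]:                    # reschedule affected neighbors
            heapq.heappush(pq, (-residual(u), u))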
BSP ML Problem: Synchronous Algorithms Can Be Inefficient
Theorem: bulk synchronous BP is O(#vertices) slower than asynchronous BP. [Figure: runtime of bulk synchronous BP (e.g., on Pregel) vs. asynchronous Splash BP.]
But an efficient parallel implementation was painful, painful, painful…
The Need for a New Abstraction
MapReduce covers data-parallel tasks, and BSP (e.g., Pregel) covers synchronous graph-parallel tasks; but we need asynchronous, dynamic parallel computation for graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), and data mining (PageRank, triangle counting).
The GraphLab Goals
• Designed specifically for ML: graph dependencies, iterative, asynchronous, dynamic.
• Simplifies the design of parallel programs: abstracts away hardware issues, synchronizes data automatically, and addresses multiple hardware architectures.
Know how to solve your ML problem on 1 machine? Get efficient parallel predictions.
Data Graph
Data is associated with vertices and edges.
• Graph: social network.
• Vertex data: user profile text, current interest estimates.
• Edge data: similarity weights.
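A minimal Python sketch of such a data graph; the field names are illustrative and are not GraphLab's actual API.

# A minimal sketch of a data graph: arbitrary user data attached to
# vertices and edges. Field names are illustrative, not GraphLab's API.
from dataclasses import dataclass, field

@dataclass
class VertexData:
    profile_text: str = ""
    interests: dict = field(default_factory=dict)  # current estimates

@dataclass
class EdgeData:
    similarity: float = 0.0                        # similarity weight

vertices = {"alice": VertexData("likes hiking"), "bob": VertexData("likes ml")}
edges = {("alice", "bob"): EdgeData(similarity=0.8)}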
Update Functions
A user-defined program, applied to a vertex, transforms the data in the scope of that vertex:

pagerank(i, scope) {
  // Get neighborhood data from the scope
  (R[i], w_ji, R[j]) ← scope;
  // Update the vertex data
  R[i] ← α + (1 − α) Σ_j w_ji R[j];
  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i);
}

The update function is applied (asynchronously) in parallel until convergence. Many schedulers are available to prioritize computation: this is dynamic computation.
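A runnable Python rendering of that pseudocode under illustrative assumptions (a FIFO scheduler, dict-based graph storage); real GraphLab schedulers and scopes are richer than this.

# A runnable rendering of the update-function idea, assuming a simple
# FIFO scheduler and dict-based storage; illustrative, not GraphLab's API.
from collections import deque

ALPHA, TOL = 0.15, 1e-6

def pagerank_update(i, R, w, in_nbrs, out_nbrs, scheduler):
    old = R[i]
    # Read neighborhood data (R[i], w_ji, R[j]) from the vertex's scope.
    R[i] = ALPHA + (1 - ALPHA) * sum(w[(j, i)] * R[j] for j in in_nbrs[i])
    if abs(R[i] - old) > TOL:
        # Dynamic computation: reschedule neighbors only when needed.
        scheduler.extend(out_nbrs[i])

def run(vertices, R, w, in_nbrs, out_nbrs):
    scheduler = deque(vertices)            # many smarter schedulers exist
    while scheduler:
        pagerank_update(scheduler.popleft(), R, w, in_nbrs, out_nbrs, scheduler)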
Ensuring Race-Free Code How much can computation overlap?
Consistency in Collaborative Filtering
[Figure: Netflix data, 8 cores; convergence with inconsistent vs. consistent updates.]
GraphLab guarantees consistent updates; user-tunable consistency levels trade off parallelism against consistency.
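One way to picture how consistent updates can be enforced is scope locking, sketched below with hypothetical helpers; GraphLab's actual consistency models are more refined than this.

# A minimal sketch of race-free updates via scope locking: acquire the
# locks of a vertex and its neighbors in a global order (preventing
# deadlock) before updating. Illustrative, not GraphLab's mechanism.
import threading

locks = {}  # vertex id -> threading.Lock (dict.setdefault is atomic in CPython)

def run_consistently(i, neighbors, update):
    scope = sorted([i, *neighbors])        # global lock order
    for v in scope:
        locks.setdefault(v, threading.Lock()).acquire()
    try:
        update(i)                          # no overlapping update can race
    finally:
        for v in reversed(scope):
            locks[v].release()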
The GraphLab Framework
A graph-based data representation, update functions (user computation), a scheduler, and a consistency model.
Algorithms implemented on GraphLab: alternating least squares, SVD, Splash sampler, CoEM, Bayesian tensor factorization, Lasso, belief propagation, PageRank, LDA, SVM, Gibbs sampling, dynamic block Gibbs sampling, K-means, matrix factorization, linear solvers, …and many others.
Never-Ending Learner Project (CoEM)
[Figure: GraphLab CoEM vs. optimal scaling.]
6x fewer CPUs, 15x faster: 0.3% of the Hadoop running time.
The Cost of the Wrong Abstraction
[Figure: runtime comparison across abstractions; note the log scale!]
Thus far… GraphLab 1 provided exciting scaling performance. But we couldn't scale up to the AltaVista web graph (2002): 1.4B vertices, 6.7B edges.
Natural Graphs [Image from WikiCommons]
Assumptions of Graph-Parallel Abstractions
• Idealized structure: small neighborhoods, low-degree vertices, vertices of similar degree, easy to partition.
• Natural graphs: large neighborhoods, high-degree vertices, power-law degree distribution, difficult to partition.
Natural Graphs: Power Law
[Figure: log-log degree distribution of the AltaVista web graph (1.4B vertices, 6.7B edges); the "power law" slope is α ≈ 2.]
The top 1% of vertices is adjacent to 53% of the edges!
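To make the concentration concrete, here is a small Python simulation under an assumed pure power-law degree model; the numbers it produces are illustrative, not the AltaVista measurement.

# A small simulation of degree concentration under an assumed power law
# P(degree = d) proportional to d**(-alpha), alpha ~ 2. Illustrative
# only; not the AltaVista measurement.
import random

def top1pct_edge_share(n=100_000, alpha=2.0, dmax=10_000, seed=0):
    random.seed(seed)
    weights = [d ** -alpha for d in range(1, dmax + 1)]
    degrees = random.choices(range(1, dmax + 1), weights=weights, k=n)
    degrees.sort(reverse=True)
    # Share of all edge endpoints held by the top 1% of vertices.
    return sum(degrees[: n // 100]) / sum(degrees)

print(top1pct_edge_share())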
High-Degree Vertices Are Common
• Netflix: popular movies connect to huge numbers of users.
• Social networks: "social" people (e.g., Obama) have huge numbers of connections.
• LDA: common words appear in many docs, and hyperparameters (α, B) touch everything. [Figure: the LDA graphical model with per-doc topic distributions θ, topic assignments Z, and words w.]
Problem: High-Degree Vertices Limit Parallelism
The edge information is too large for a single machine, and a sequential vertex update touches a large fraction of the graph (GraphLab 1) or produces many messages (Pregel). Asynchronous consistency requires heavy locking (GraphLab 1); synchronous consistency is prone to stragglers (Pregel).
Problem: High-Degree Vertices Mean High Communication for Distributed Updates
Data transmitted across the network is O(#cut edges); see the sketch below. [Figure: vertex Y split across Machine 1 and Machine 2.]
• Natural graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04].
• Popular partitioning tools (Metis, Chaco, …) perform poorly on them [Abou-Rjeili et al. 06]: extremely slow and requiring substantial memory.
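A minimal Python sketch of that communication cost on an illustrative toy graph: each edge whose endpoints land on different machines contributes network traffic.

# A minimal sketch of edge-cut communication cost: every edge whose
# endpoints sit on different machines sends data over the network, so
# traffic grows as O(#cut edges). The toy graph below is illustrative.
def cut_edges(edges, assignment):
    """edges: iterable of (u, v); assignment: vertex -> machine id."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
assignment = {"a": 0, "b": 0, "c": 1, "d": 1}
print(cut_edges(edges, assignment))   # 2 of the 4 edges cross machines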