
Presentation Transcript


  1. A Distributed Abstraction for Large-Scale Machine Learning. Carlos Guestrin, Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Haijie Gu

  2. Needless to Say, We Need Machine Learning for Big Data: 750 million Facebook users, 24 million Wikipedia pages, 6 billion Flickr photos, 48 hours of video uploaded to YouTube every minute. "… data a new class of economic asset, like currency or gold."

  3. Big Learning: How will we design and implement parallel learning systems?

  4. A Shift Towards Parallelism (graduate students; GPUs, multicore, clusters, clouds, supercomputers). ML experts repeatedly solve the same parallel design challenges: race conditions, distributed state, communication… The resulting code is difficult to maintain, extend, and debug. Avoid these problems by using high-level abstractions.

  5. Data Parallelism (MapReduce): solve a huge number of independent subproblems. [Figure: independent computations distributed across CPU 1 through CPU 4.]
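
  For reference, a minimal data-parallel sketch of the idea on slide 5, assuming a toy extract_feature step (not from the talk): each record is an independent subproblem, so a MapReduce-style map followed by a reduce covers it.

      # Hedged sketch: independent per-record work fanned out across worker processes.
      from multiprocessing import Pool

      def extract_feature(record):
          # stand-in for real per-record feature extraction
          return len(record)

      if __name__ == "__main__":
          records = ["user profile text", "another document", "movie ratings row"]
          with Pool(processes=4) as pool:                    # the "CPU 1..4" of the slide figure
              features = pool.map(extract_feature, records)  # map: independent subproblems
          print(sum(features))                               # reduce: combine partial results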

  6. MapReduce for Data-Parallel ML: excellent for large data-parallel tasks! Is there more to Machine Learning? Data-Parallel (MapReduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Graph-Parallel: Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.), Semi-Supervised Learning (Label Propagation, CoEM), Collaborative Filtering (Tensor Factorization), Graph Analysis (PageRank, Triangle Counting).

  7. What is this?

  8. It’s next to this…

  9. The Power of Dependencies: where the value is!

  10. Label a Face and Propagate ("grandma").

  11. Pairwise similarity is not enough… not similar enough to be sure. (Who? Grandma?)

  12. Propagate Similarities & Co-occurrences for Accurate Predictions: similarity edges and co-occurring faces provide further evidence. Grandma!

  13. Collaborative Filtering: Independent Case. Rated: Lord of the Rings, Star Wars IV, Star Wars I. Recommend: Harry Potter, Pirates of the Caribbean.

  14. Collaborative Filtering: Exploiting Dependencies. Rated: Women on the Verge of a Nervous Breakdown, The Celebration, City of God. What do I recommend? Recommend: Wild Strawberries, La Dolce Vita.

  15. Latent Topic Modeling (LDA). [Figure: example words such as Cat, Apple, Growth, Hat, Plant assigned to topics.]

  16. Example Topics Discovered from Wikipedia

  17. Machine Learning Pipeline: Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value from Data. Examples: images, docs, movie ratings → faces, important words, side info → similar faces, shared words, rated movies → belief propagation, LDA, collaborative filtering → face labels, doc topics, movie recommendations.

  18. Parallelizing Machine Learning: the same pipeline splits into Graph Ingress (mostly data-parallel: Data → Extract Features → Graph Formation) and Graph-Structured Computation (graph-parallel: Structured Machine Learning Algorithm → Value from Data).

  19. ML Tasks Beyond Data-Parallelism. Data-Parallel (MapReduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Graph-Parallel: Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.), Semi-Supervised Learning (Label Propagation, CoEM), Collaborative Filtering (Tensor Factorization), Graph Analysis (PageRank, Triangle Counting).

  20. Example of Graph Parallelism

  21. PageRank. What's the rank of this user? It depends on the rank of those who follow her, which depends on the rank of those who follow them… Loops in the graph mean we must iterate!

  22. PageRank Iteration. Iterate until convergence ("my rank is the weighted average of my friends' ranks"): R[i] = α + (1 − α) Σ_j w_ji R[j], where α is the random reset probability and w_ji is the probability of transitioning (similarity) from j to i.
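
  A minimal sketch of the iteration on slide 22, with a made-up three-vertex graph and tolerance (illustrative only):

      ALPHA = 0.15                       # random reset probability
      in_nbrs = {                        # i -> list of (j, w_ji) for incoming edges
          "a": [("b", 1.0)],
          "b": [("a", 0.5), ("c", 1.0)],
          "c": [("a", 0.5)],
      }
      R = {v: 1.0 for v in in_nbrs}      # initial ranks

      converged = False
      while not converged:               # iterate until convergence
          converged = True
          for i, nbrs in in_nbrs.items():
              new_rank = ALPHA + (1 - ALPHA) * sum(w * R[j] for j, w in nbrs)
              if abs(new_rank - R[i]) > 1e-6:
                  converged = False
              R[i] = new_rank
      print(R)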

  23. Properties of Graph-Parallel Algorithms: dependency graph, local updates, iterative computation (my rank depends on my friends' ranks).

  24. Addressing Graph-Parallel ML. Data-Parallel (MapReduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Graph-Parallel (MapReduce? A graph-parallel abstraction): Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.), Semi-Supervised Learning (Label Propagation, CoEM), Collaborative Filtering (Tensor Factorization), Data Mining (PageRank, Triangle Counting).

  25. Graph Computation: Synchronous v. Asynchronous

  26. Bulk Synchronous Parallel Model [Valiant '90]: Pregel (Giraph). Each superstep: compute, communicate, barrier.

  27. The bulk synchronous parallel model is provably inefficient for some ML tasks.

  28. Analyzing Belief Propagation [Gonzalez, Low, Guestrin '09]: smart scheduling with a priority queue focuses computation where the influence is important. An asynchronous parallel model (rather than BSP) is fundamental for efficiency.
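
  A hedged sketch of the "smart scheduling" idea on slide 28, as a priority queue over residuals (the function names and tolerance are illustrative, not the Splash BP implementation):

      import heapq

      def run_prioritized(vertices, update, tol=1e-3):
          # update(v) applies the vertex program at v and returns (neighbor, residual)
          # pairs; a residual estimates how much that neighbor still needs to change.
          priority = {v: 1.0 for v in vertices}         # start every vertex "hot"
          heap = [(-1.0, v) for v in vertices]
          heapq.heapify(heap)
          while heap:
              neg_p, v = heapq.heappop(heap)
              p = -neg_p
              if p < tol:                               # most important work is below
                  break                                 # tolerance: stop
              if p != priority.get(v, 0.0):             # stale heap entry, skip it
                  continue
              priority[v] = 0.0                         # v is now up to date
              for u, residual in update(v):             # focus computation where the
                  if residual > priority.get(u, 0.0):   # influence is important
                      priority[u] = residual
                      heapq.heappush(heap, (-residual, u))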

  29. Asynchronous Belief Propagation. Challenge = boundaries. [Figure: cumulative vertex updates on a synthetic noisy image; many updates near boundaries, few elsewhere.] The algorithm identifies and focuses on the hidden sequential structure of the graphical model.

  30. BSP ML Problem: Synchronous Algorithms Can Be Inefficient. Theorem: bulk synchronous BP is O(#vertices) slower than asynchronous (Splash) BP. [Figure: runtime of bulk synchronous (e.g., Pregel) vs. asynchronous Splash BP.] The efficient parallel implementation was painful, painful, painful…

  31. The Need for a New Abstraction. Need: asynchronous, dynamic parallel computations. Data-Parallel (MapReduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics. Graph-Parallel (BSP, e.g., Pregel): Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.), Semi-Supervised Learning (Label Propagation, CoEM), Collaborative Filtering (Tensor Factorization), Data Mining (PageRank, Triangle Counting).

  32. The GraphLab Goals. Designed specifically for ML: graph dependencies, iterative, asynchronous, dynamic. Simplifies the design of parallel programs: abstracts away hardware issues, automatic data synchronization, addresses multiple hardware architectures. Know how to solve the ML problem on 1 machine → efficient parallel predictions.

  33. GraphLab 1

  34. Data Graph: data is associated with vertices and edges. Graph: social network. Vertex data: user profile text, current interest estimates. Edge data: similarity weights.
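
  A minimal sketch of slide 34's data graph, assuming plain Python containers rather than GraphLab's actual structures:

      from dataclasses import dataclass, field

      @dataclass
      class VertexData:
          profile_text: str                              # user profile text
          interests: dict = field(default_factory=dict)  # current interest estimates

      @dataclass
      class DataGraph:
          vertex_data: dict = field(default_factory=dict)  # vertex id -> VertexData
          edge_data: dict = field(default_factory=dict)    # (u, v) -> similarity weight
          neighbors: dict = field(default_factory=dict)    # vertex id -> set of neighbors

          def add_edge(self, u, v, weight):
              self.edge_data[(u, v)] = weight
              self.neighbors.setdefault(u, set()).add(v)
              self.neighbors.setdefault(v, set()).add(u)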

  35. Update Functions. A user-defined program, applied to a vertex, transforms the data in the scope of that vertex:

      pagerank(i, scope) {
        // Get neighborhood data: (R[i], w_ji, R[j]) ← scope
        // Update the vertex data: R[i] = α + (1 − α) Σ_j w_ji R[j]
        // Reschedule neighbors if needed:
        if R[i] changes then reschedule_neighbors_of(i);
      }

  The update function is applied (asynchronously) in parallel until convergence. Many schedulers are available to prioritize computation (dynamic computation).
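
  A hedged sketch of this update-function pattern (the names are illustrative, not GraphLab's real API): the function reads its scope, rewrites the vertex data, and reschedules neighbors only when its value actually changed.

      from collections import deque

      def pagerank_update(i, R, in_nbrs, out_nbrs, alpha=0.15, tol=1e-6):
          new_rank = alpha + (1 - alpha) * sum(w * R[j] for j, w in in_nbrs[i])
          changed = abs(new_rank - R[i]) > tol
          R[i] = new_rank
          return out_nbrs[i] if changed else []      # neighbors to reschedule

      def run(update, vertices, *scope):
          pending = deque(vertices)                  # simple FIFO scheduler; GraphLab
          queued = set(vertices)                     # offers many, incl. prioritized ones
          while pending:
              i = pending.popleft()
              queued.discard(i)
              for j in update(i, *scope):            # dynamic computation: only changed
                  if j not in queued:                # regions get revisited
                      queued.add(j)
                      pending.append(j)

  Called as run(pagerank_update, list(R), R, in_nbrs, out_nbrs), the loop terminates once no vertex changes by more than tol.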

  36. Ensuring Race-Free Code: how much can computation overlap?

  37. Need for Consistency?

  38. Consistency in Collaborative Filtering. [Figure: convergence with inconsistent vs. consistent updates, Netflix data, 8 cores.] GraphLab guarantees consistent updates; user-tunable consistency levels trade off parallelism and consistency.
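
  An illustrative sketch (not GraphLab's implementation) of why consistent updates need the vertex's scope protected: acquire locks on the vertex and its neighbors in a fixed global order so overlapping updates cannot race on shared vertex or edge data.

      import threading

      locks = {}                                   # vertex id -> lock (pre-populate in practice)

      def lock_for(v):
          return locks.setdefault(v, threading.Lock())

      def run_consistent(i, neighbors, update):
          scope = sorted({i, *neighbors[i]})       # fixed order prevents deadlock
          for v in scope:
              lock_for(v).acquire()
          try:
              update(i)                            # no other update overlaps this scope
          finally:
              for v in reversed(scope):
                  lock_for(v).release()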

  39. The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, consistency model.

  40. Algorithms implemented on GraphLab: Alternating Least Squares, SVD, Splash Sampler, CoEM, Bayesian Tensor Factorization, Lasso, Belief Propagation, PageRank, LDA, SVM, Gibbs Sampling, Dynamic Block Gibbs Sampling, K-Means, Matrix Factorization, Linear Solvers, …many others…

  41. Never-Ending Learner Project (CoEM). GraphLab CoEM: 6x fewer CPUs, 15x faster, 0.3% of Hadoop time. [Figure: speedup relative to optimal; higher is better.]

  42. The Cost of the Wrong Abstraction (note: log scale!).

  43. Thus far… GraphLab 1 provided exciting scaling performance, but we couldn't scale up to the Altavista Webgraph 2002: 1.4B vertices, 6.7B edges.

  44. Natural Graphs [Image from WikiCommons]

  45. Assumptions of Graph-Parallel Abstractions. Idealized structure: small neighborhoods, low-degree vertices, vertices have similar degree, easy to partition. Natural graphs: large neighborhoods, high-degree vertices, power-law degree distribution, difficult to partition.

  46. Natural Graphs Have a Power-Law Degree Distribution. Altavista Web Graph (1.4B vertices, 6.7B edges): the top 1% of vertices is adjacent to 53% of the edges! The degree distribution follows a power law, P(degree = d) ∝ d^(−α), with slope α ≈ 2.
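
  A quick illustrative check (not from the talk) of how a power-law degree distribution concentrates edges on a few vertices: sample degrees with P(d) ∝ d^(−α), α ≈ 2, and measure the share of edge endpoints held by the top 1% of vertices. The exact number depends on the sample and the degree cap; for the Altavista graph the slide reports 53%.

      import random

      def sample_degree(alpha=2.0, d_min=1, d_max=10**6):
          # inverse-CDF sampling from a Pareto-like distribution, capped at d_max
          u = random.random()
          return min(d_max, int(d_min * (1 - u) ** (-1.0 / (alpha - 1))))

      degrees = sorted((sample_degree() for _ in range(100_000)), reverse=True)
      top = degrees[: len(degrees) // 100]          # top 1% of vertices by degree
      print(sum(top) / sum(degrees))                # their share of edge endpoints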

  47. High-Degree Vertices Are Common: popular movies in the Netflix users-movies graph, "social" people (e.g., Obama), and the hyperparameters and common words in LDA's docs-words graphical model.

  48. Problem: High-Degree Vertices Limit Parallelism. Edge information is too large for a single machine. Sequential vertex updates either touch a large fraction of the graph (GraphLab 1) or produce many messages (Pregel). Asynchronous consistency requires heavy locking (GraphLab 1); synchronous consistency is prone to stragglers (Pregel).

  49. Problem: High-Degree Vertices Cause High Communication for Distributed Updates. Data transmitted across the network is O(#cut edges). Natural graphs do not have low-cost balanced cuts [Leskovec et al. '08, Lang '04], and popular partitioning tools (Metis, Chaco, …) perform poorly [Abou-Rjeili et al. '06]: extremely slow and requiring substantial memory. [Figure: a vertex's neighborhood split across Machine 1 and Machine 2.]
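
  A small sketch of this communication cost (toy graph and partition made up): every edge whose endpoints land on different machines must cross the network, so traffic grows as O(#cut edges).

      def cut_edges(edges, machine_of):
          # edges: iterable of (u, v); machine_of: vertex -> machine id
          return sum(1 for u, v in edges if machine_of[u] != machine_of[v])

      edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
      machine_of = {"a": 0, "b": 0, "c": 1, "d": 1}   # a 2-machine partition
      print(cut_edges(edges, machine_of))             # -> 2 edges cross machines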
