GraphLab 2: A Distributed Abstraction for Large-Scale Machine Learning
Carlos Guestrin, Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Haijie Gu
Needless to Say, We Need Machine Learning for Big Data
750 Million Facebook Users, 24 Million Wikipedia Pages, 6 Billion Flickr Photos, 48 Hours of YouTube Uploaded a Minute
"… data a new class of economic asset, like currency or gold."
Big Learning: How will we design and implement parallel learning systems?
A Shift Towards Parallelism
Graduate students face GPUs, multicore, clusters, clouds, supercomputers…
• ML experts repeatedly solve the same parallel design challenges: race conditions, distributed state, communication…
• The resulting code is difficult to maintain, extend, and debug.
Avoid these problems by using high-level abstractions.
Data Parallelism (MapReduce)
[Figure: blocks of independent records dispatched to CPU 1 through CPU 4.]
Solve a huge number of independent subproblems, as in the sketch below.
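A minimal Python sketch of this pattern, with illustrative data and an illustrative per-record computation: because the subproblems share no state, a pool of workers can process them in any order.

# A minimal sketch of the data-parallel pattern: every record is an
# independent subproblem, so workers share no state. The data and the
# per-record computation here are illustrative.
from multiprocessing import Pool

def solve_subproblem(record):
    # Any per-record computation works; here, the mean of the record.
    return sum(record) / len(record)

if __name__ == "__main__":
    data = [[6.7, 5.4], [2.3, 8.4], [3.1, 4.9], [1.8, 4.2]]
    with Pool(4) as pool:                  # e.g., CPU 1 through CPU 4
        print(pool.map(solve_subproblem, data))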
MapReduce for Data-Parallel ML
Excellent for large data-parallel tasks! But is there more to machine learning?
• Data-parallel (MapReduce): feature extraction, cross validation, computing sufficient statistics.
• Graph-parallel: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), graph analysis (PageRank, triangle counting).
Label a Face and Propagate: label one face as "grandma".
Pairwise similarity is not enough… [Figure: an unlabeled face ("who????") is not similar enough to the labeled "grandma" photo to be sure.]
Propagate Similarities & Co-occurrences for Accurate Predictions: similarity edges and co-occurring faces provide further evidence. Grandma!
Collaborative Filtering: Independent Case
Given Lord of the Rings, Star Wars IV, and Star Wars I, recommend Harry Potter and Pirates of the Caribbean.
Collaborative Filtering: Exploiting Dependencies
Given Women on the Verge of a Nervous Breakdown, The Celebration, and City of God, what do I recommend? Recommend Wild Strawberries and La Dolce Vita.
Latent Topic Modeling (LDA)
[Figure: example topic words such as cat, apple, growth, hat, plant.]
Machine Learning Pipeline
Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value from Data
For example: images / docs / movie ratings → faces / important words / side info → similar faces / shared words / rated movies → belief propagation / LDA / collaborative filtering → face labels / doc topics / movie recommendations.
Parallelizing Machine Learning
Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value from Data
Graph ingress (the left half of the pipeline) is mostly data-parallel; the graph-structured computation is graph-parallel.
ML Tasks Beyond Data-Parallelism
MapReduce covers the data-parallel tasks (feature extraction, cross validation, computing sufficient statistics). The graph-parallel tasks remain: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), and graph analysis (PageRank, triangle counting).
PageRank
What's the rank of this user? It depends on the rank of those who follow her, which depends on the rank of those who follow them… Loops in the graph mean we must iterate!
PageRank Iteration
Iterate until convergence ("my rank is the weighted average of my friends' ranks"):
R[i] = α + (1 − α) Σ_j w_ji R[j]
where α is the random reset probability and w_ji is the probability of transitioning (similarity) from j to i. A sketch of this iteration follows.
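A minimal Python sketch of that iteration, assuming an in-neighbor-list representation and a weight dict keyed by edge (j, i); the reset probability 0.15 and tolerance are illustrative defaults.

# A minimal sketch of the PageRank iteration above. in_nbrs[i] lists
# the vertices j linking to i, and w[(j, i)] is the transition weight;
# both representations are illustrative assumptions.
def pagerank(in_nbrs, w, alpha=0.15, tol=1e-6):
    R = {i: 1.0 for i in in_nbrs}
    while True:
        max_change = 0.0
        for i, js in in_nbrs.items():
            new_rank = alpha + (1 - alpha) * sum(w[(j, i)] * R[j] for j in js)
            max_change = max(max_change, abs(new_rank - R[i]))
            R[i] = new_rank
        if max_change < tol:          # iterate until convergence
            return R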
Properties of Graph-Parallel Algorithms
A dependency graph, local updates ("my rank" depends on my friends' ranks), and iterative computation.
Addressing Graph-Parallel ML
MapReduce handles the data-parallel tasks (feature extraction, cross validation, computing sufficient statistics). Can MapReduce handle the graph-parallel ones? We need a graph-parallel abstraction for graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), and data mining (PageRank, triangle counting).
Graph Computation: Synchronous v. Asynchronous
Bulk Synchronous Parallel Model: Pregel (Giraph) [Valiant '90]
Repeat: compute, communicate, barrier (sketched below).
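A minimal single-process Python sketch of the BSP pattern; update(v, msgs) is a hypothetical per-vertex function returning (destination, message) pairs, not Pregel's actual API.

# A minimal sketch of BSP supersteps: compute, communicate, barrier.
# update(v, msgs) is a hypothetical per-vertex function that returns
# (destination, message) pairs.
def bsp_run(vertices, update, num_supersteps):
    inboxes = {v: [] for v in vertices}
    for _ in range(num_supersteps):
        outboxes = {v: [] for v in vertices}
        for v in vertices:                       # compute phase
            for dst, msg in update(v, inboxes[v]):
                outboxes[dst].append(msg)        # communicate phase
        # Barrier: no vertex sees a new message until every vertex has
        # finished the superstep, so all updates act on stale state.
        inboxes = outboxes
    return inboxes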
Bulk synchronous parallel model provably inefficient for some ML tasks
Analyzing Belief Propagation [Gonzalez, Low, Guestrin '09]
[Figure: a chain from A to B; a priority queue focuses computation on the vertices with important influence.]
Smart scheduling: an asynchronous parallel model (rather than BSP) is fundamental for efficiency.
Asynchronous Belief Propagation
Challenge = boundaries. [Figure: a synthetic noisy image and its graphical model; the cumulative vertex updates concentrate many updates at the boundaries and few elsewhere.]
The algorithm identifies and focuses on the hidden sequential structure; see the scheduling sketch below.
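A minimal Python sketch of such priority-driven asynchronous scheduling, assuming hypothetical update(v) and residual(v) hooks: the vertices whose update would change the most are processed first, and affected neighbors are lazily rescheduled.

# A minimal sketch of priority-driven asynchronous scheduling.
# update(v) applies a vertex update in place; residual(v) measures how
# much v would still change. Both are hypothetical hooks, and stale
# queue entries are handled lazily.
import heapq

def async_run(graph, update, residual, eps=1e-3):
    pq = [(-residual(v), v) for v in graph]   # max-heap via negation
    heapq.heapify(pq)
    while pq:
        neg_r, v = heapq.heappop(pq)
        if -neg_r < eps:
            break                             # remaining work is converged
        update(v)
        for u in graph[v]:                    # reschedule affected neighbors
            heapq.heappush(pq, (-residual(u), u))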
BSP ML Problem: Synchronous Algorithms Can Be Inefficient
Theorem: bulk synchronous BP is O(#vertices) slower than asynchronous BP. [Figure: runtime of bulk synchronous BP (e.g., on Pregel) vs. asynchronous Splash BP.]
But an efficient parallel implementation was painful, painful, painful…
The Need for a New Abstraction
MapReduce covers data-parallel tasks, and BSP (e.g., Pregel) covers synchronous graph-parallel tasks; but we need asynchronous, dynamic parallel computation for graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), collaborative filtering (tensor factorization), and data mining (PageRank, triangle counting).
The GraphLab Goals
• Designed specifically for ML: graph dependencies, iterative, asynchronous, dynamic.
• Simplifies the design of parallel programs: abstracts away hardware issues, synchronizes data automatically, and addresses multiple hardware architectures.
Know how to solve your ML problem on 1 machine? Get efficient parallel predictions.
Data Graph
Data is associated with vertices and edges.
• Graph: social network.
• Vertex data: user profile text, current interest estimates.
• Edge data: similarity weights.
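A minimal Python sketch of such a data graph; the field names are illustrative and are not GraphLab's actual API.

# A minimal sketch of a data graph: arbitrary user data attached to
# vertices and edges. Field names are illustrative, not GraphLab's API.
from dataclasses import dataclass, field

@dataclass
class VertexData:
    profile_text: str = ""
    interests: dict = field(default_factory=dict)  # current estimates

@dataclass
class EdgeData:
    similarity: float = 0.0                        # similarity weight

vertices = {"alice": VertexData("likes hiking"), "bob": VertexData("likes ml")}
edges = {("alice", "bob"): EdgeData(similarity=0.8)}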
Update Functions
A user-defined program, applied to a vertex, transforms the data in the scope of that vertex:

pagerank(i, scope) {
  // Get neighborhood data from the scope
  (R[i], w_ji, R[j]) ← scope;
  // Update the vertex data
  R[i] ← α + (1 − α) Σ_j w_ji R[j];
  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i);
}

The update function is applied (asynchronously) in parallel until convergence. Many schedulers are available to prioritize computation: this is dynamic computation.
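A runnable Python rendering of that pseudocode under illustrative assumptions (a FIFO scheduler, dict-based graph storage); real GraphLab schedulers and scopes are richer than this.

# A runnable rendering of the update-function idea, assuming a simple
# FIFO scheduler and dict-based storage; illustrative, not GraphLab's API.
from collections import deque

ALPHA, TOL = 0.15, 1e-6

def pagerank_update(i, R, w, in_nbrs, out_nbrs, scheduler):
    old = R[i]
    # Read neighborhood data (R[i], w_ji, R[j]) from the vertex's scope.
    R[i] = ALPHA + (1 - ALPHA) * sum(w[(j, i)] * R[j] for j in in_nbrs[i])
    if abs(R[i] - old) > TOL:
        # Dynamic computation: reschedule neighbors only when needed.
        scheduler.extend(out_nbrs[i])

def run(vertices, R, w, in_nbrs, out_nbrs):
    scheduler = deque(vertices)            # many smarter schedulers exist
    while scheduler:
        pagerank_update(scheduler.popleft(), R, w, in_nbrs, out_nbrs, scheduler)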
Ensuring Race-Free Code How much can computation overlap?
Consistency in Collaborative Filtering
[Figure: Netflix data, 8 cores; convergence with inconsistent vs. consistent updates.]
GraphLab guarantees consistent updates; user-tunable consistency levels trade off parallelism against consistency.
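One way to picture how consistent updates can be enforced is scope locking, sketched below with hypothetical helpers; GraphLab's actual consistency models are more refined than this.

# A minimal sketch of race-free updates via scope locking: acquire the
# locks of a vertex and its neighbors in a global order (preventing
# deadlock) before updating. Illustrative, not GraphLab's mechanism.
import threading

locks = {}  # vertex id -> threading.Lock (dict.setdefault is atomic in CPython)

def run_consistently(i, neighbors, update):
    scope = sorted([i, *neighbors])        # global lock order
    for v in scope:
        locks.setdefault(v, threading.Lock()).acquire()
    try:
        update(i)                          # no overlapping update can race
    finally:
        for v in reversed(scope):
            locks[v].release()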
The GraphLab Framework
A graph-based data representation, update functions (user computation), a scheduler, and a consistency model.
Algorithms implemented on GraphLab: alternating least squares, SVD, Splash sampler, CoEM, Bayesian tensor factorization, Lasso, belief propagation, PageRank, LDA, SVM, Gibbs sampling, dynamic block Gibbs sampling, K-means, matrix factorization, linear solvers, …and many others.
Never-Ending Learner Project (CoEM)
[Figure: GraphLab CoEM vs. optimal scaling.]
6x fewer CPUs, 15x faster: 0.3% of the Hadoop running time.
The Cost of the Wrong Abstraction
[Figure: runtime comparison across abstractions; note the log scale!]
Thus far… GraphLab 1 provided exciting scaling performance. But we couldn't scale up to the AltaVista web graph (2002): 1.4B vertices, 6.7B edges.
Natural Graphs [Image from WikiCommons]
Assumptions of Graph-Parallel Abstractions
• Idealized structure: small neighborhoods, low-degree vertices, vertices of similar degree, easy to partition.
• Natural graphs: large neighborhoods, high-degree vertices, power-law degree distribution, difficult to partition.
Natural Graphs: Power Law
[Figure: log-log degree distribution of the AltaVista web graph (1.4B vertices, 6.7B edges); the "power law" slope is α ≈ 2.]
The top 1% of vertices is adjacent to 53% of the edges!
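To make the concentration concrete, here is a small Python simulation under an assumed pure power-law degree model; the numbers it produces are illustrative, not the AltaVista measurement.

# A small simulation of degree concentration under an assumed power law
# P(degree = d) proportional to d**(-alpha), alpha ~ 2. Illustrative
# only; not the AltaVista measurement.
import random

def top1pct_edge_share(n=100_000, alpha=2.0, dmax=10_000, seed=0):
    random.seed(seed)
    weights = [d ** -alpha for d in range(1, dmax + 1)]
    degrees = random.choices(range(1, dmax + 1), weights=weights, k=n)
    degrees.sort(reverse=True)
    # Share of all edge endpoints held by the top 1% of vertices.
    return sum(degrees[: n // 100]) / sum(degrees)

print(top1pct_edge_share())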
High-Degree Vertices Are Common
• Netflix: popular movies connect to huge numbers of users.
• Social networks: "social" people (e.g., Obama) have huge numbers of connections.
• LDA: common words appear in many docs, and hyperparameters (α, B) touch everything. [Figure: the LDA graphical model with per-doc topic distributions θ, topic assignments Z, and words w.]
Problem: High-Degree Vertices Limit Parallelism
The edge information is too large for a single machine, and a sequential vertex update touches a large fraction of the graph (GraphLab 1) or produces many messages (Pregel). Asynchronous consistency requires heavy locking (GraphLab 1); synchronous consistency is prone to stragglers (Pregel).
Problem: High-Degree Vertices Mean High Communication for Distributed Updates
Data transmitted across the network is O(#cut edges); see the sketch below. [Figure: vertex Y split across Machine 1 and Machine 2.]
• Natural graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04].
• Popular partitioning tools (Metis, Chaco, …) perform poorly on them [Abou-Rjeili et al. 06]: extremely slow and requiring substantial memory.
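A minimal Python sketch of that communication cost on an illustrative toy graph: each edge whose endpoints land on different machines contributes network traffic.

# A minimal sketch of edge-cut communication cost: every edge whose
# endpoints sit on different machines sends data over the network, so
# traffic grows as O(#cut edges). The toy graph below is illustrative.
def cut_edges(edges, assignment):
    """edges: iterable of (u, v); assignment: vertex -> machine id."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
assignment = {"a": 0, "b": 0, "c": 1, "d": 1}
print(cut_edges(edges, assignment))   # 2 of the 4 edges cross machines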