The Next Generation of the GraphLab Abstraction
Joseph Gonzalez, joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O’Hallaron, and Alex Smola.
How will we design and implement parallel learning systems?
... a popular answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions.
Map-Reduce for Data-Parallel ML
• Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics.
Graph-Parallel (is there more to Machine Learning?): Label Propagation, Lasso, Belief Propagation, Kernel Methods, Tensor Factorization, PageRank, Neural Networks, Deep Belief Networks.
Concrete Example: Label Propagation
Label Propagation Algorithm
• Social Arithmetic: my interests are a weighted combination of my neighbors' interests. Example: 50% what I list on my profile (50% cameras, 50% biking) + 40% what Sue Ann likes (80% cameras, 20% biking) + 10% what Carlos likes (30% cameras, 70% biking) = I like 60% cameras, 40% biking.
• Recurrence Algorithm: Likes[i] ← Σj Wij × Likes[j]; iterate until convergence.
• Parallelism: compute all Likes[i] in parallel.
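To make the recurrence concrete, here is a minimal, self-contained C++ sketch (not GraphLab code) that iterates Likes[i] ← Σj Wij × Likes[j] on the four-vertex neighborhood implied by the slide; the graph layout, weights, and variable names are illustrative assumptions.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // likes[i] = {P(cameras), P(biking)} for vertex i.
  // Vertex 0 is "me"; vertices 1-3 (my profile, Sue Ann, Carlos) keep their
  // interests fixed by placing all of their weight on themselves.
  std::vector<std::vector<double>> likes = {
      {0.5, 0.5},   // 0: me (initial guess)
      {0.5, 0.5},   // 1: my profile
      {0.8, 0.2},   // 2: Sue Ann
      {0.3, 0.7}};  // 3: Carlos
  std::vector<std::vector<double>> weights = {
      {0.0, 0.5, 0.4, 0.1},  // me: 50% profile, 40% Sue Ann, 10% Carlos
      {0.0, 1.0, 0.0, 0.0},
      {0.0, 0.0, 1.0, 0.0},
      {0.0, 0.0, 0.0, 1.0}};

  // Iterate Likes[i] <- sum_j W_ij * Likes[j] until nothing changes.
  for (int iter = 0; iter < 100; ++iter) {
    std::vector<std::vector<double>> next = likes;
    double change = 0.0;
    for (size_t i = 0; i < likes.size(); ++i)
      for (size_t k = 0; k < likes[i].size(); ++k) {
        double acc = 0.0;
        for (size_t j = 0; j < likes.size(); ++j)
          acc += weights[i][j] * likes[j][k];
        change += std::fabs(acc - likes[i][k]);
        next[i][k] = acc;
      }
    likes = next;
    if (change < 1e-9) break;  // converged
  }
  std::printf("I like: %.0f%% cameras, %.0f%% biking\n",
              100 * likes[0][0], 100 * likes[0][1]);  // 60% cameras, 40% biking
  return 0;
}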
Properties of Graph-Parallel Algorithms: a dependency graph (what I like depends on what my friends like), factored computation, and iterative computation.
Map-Reduce for Data-Parallel ML
• Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics.
Graph-Parallel (Map Reduce?): Label Propagation, Lasso, Belief Propagation, Kernel Methods, Tensor Factorization, PageRank, Neural Networks, Deep Belief Networks.
MapAbuse: Iterative MapReduce
• Only a subset of the data needs computation in each iteration, yet every iteration reprocesses all of the data across all CPUs, with a barrier between iterations.
MapAbuse: Iterative MapReduce
• The system is not optimized for iteration: every iteration pays a job startup penalty and a disk penalty for reading and writing intermediate data.
Map-Reduce for Data-Parallel ML
• Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics.
Graph-Parallel (Map Reduce? Pregel / Giraph?): SVM, Lasso, Belief Propagation, Kernel Methods, Tensor Factorization, PageRank, Neural Networks, Deep Belief Networks.
Pregel (Giraph)
• Bulk Synchronous Parallel model: compute, communicate, barrier.
Problem with Bulk Synchronous
• Example algorithm: if any neighbor is red, then turn red.
• Bulk synchronous computation evaluates the condition on all vertices in every phase: 4 phases × 9 computations = 36 computations.
• Asynchronous (wave-front) computation evaluates the condition only when a neighbor changes: 4 phases × 2 computations = 8 computations.
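The counts above can be reproduced with a small experiment. The sketch below assumes the slide's figure is a 3×3 grid with one red corner (the figure itself is not reproduced in this transcript) and counts condition evaluations for a bulk-synchronous sweep versus an asynchronous, schedule-on-change execution with a de-duplicating work queue; it prints 36 and 8.

#include <cstdio>
#include <deque>
#include <vector>

static const int N = 3;
static int id(int r, int c) { return r * N + c; }

static std::vector<int> neighbors(int v) {
  int r = v / N, c = v % N;
  std::vector<int> out;
  if (r > 0) out.push_back(id(r - 1, c));
  if (r + 1 < N) out.push_back(id(r + 1, c));
  if (c > 0) out.push_back(id(r, c - 1));
  if (c + 1 < N) out.push_back(id(r, c + 1));
  return out;
}

static bool all_red(const std::vector<bool>& red) {
  for (bool b : red) if (!b) return false;
  return true;
}

int main() {
  // Bulk synchronous: evaluate the rule on every vertex in every phase.
  std::vector<bool> red(N * N, false);
  red[0] = true;                            // one corner starts red
  int sync_evals = 0;
  while (!all_red(red)) {
    std::vector<bool> next = red;
    for (int v = 0; v < N * N; ++v) {
      ++sync_evals;                         // rule checked on every vertex
      if (red[v]) continue;
      for (int u : neighbors(v))
        if (red[u]) next[v] = true;
    }
    red = next;
  }

  // Asynchronous: evaluate a vertex only when a neighbor turned red,
  // with the scheduler de-duplicating pending vertices.
  std::vector<bool> ared(N * N, false), pending(N * N, false);
  ared[0] = true;
  int async_evals = 0;
  std::deque<int> queue;
  for (int u : neighbors(0)) { queue.push_back(u); pending[u] = true; }
  while (!queue.empty()) {
    int v = queue.front(); queue.pop_front();
    ++async_evals;                          // rule checked only when scheduled
    ared[v] = true;                         // scheduled because a neighbor is red
    for (int u : neighbors(v))
      if (!ared[u] && !pending[u]) { queue.push_back(u); pending[u] = true; }
  }

  std::printf("bulk synchronous evaluations: %d\n", sync_evals);   // 36
  std::printf("asynchronous evaluations:     %d\n", async_evals);  // 8
  return 0;
}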
The Need for a New Abstraction
• Map-Reduce is not well suited for graph-parallelism.
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics.
Graph-Parallel (Pregel / Giraph): Belief Propagation, Kernel Methods, SVM, Tensor Factorization, PageRank, Lasso, Neural Networks, Deep Belief Networks.
The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, and consistency model.
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
• Graph: social network.
• Vertex data: user profile text, current interest estimates.
• Edge data: similarity weights.
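As a rough illustration of the data-graph idea (this mirrors the slide's description, not the actual GraphLab graph API), vertex and edge data can be arbitrary C++ structs attached to a graph container:

#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct VertexData {               // per-user state
  std::string profile_text;       // user profile text
  std::vector<double> interests;  // current interest estimates (e.g. cameras, biking)
};

struct EdgeData {                 // per-friendship state
  double similarity;              // similarity weight W_ij
};

struct Graph {
  std::vector<VertexData> vertices;
  std::map<std::pair<int, int>, EdgeData> edges;  // (i, j) -> edge data

  int add_vertex(VertexData d) {
    vertices.push_back(std::move(d));
    return static_cast<int>(vertices.size()) - 1;
  }
  void add_edge(int i, int j, EdgeData d) { edges[{i, j}] = d; }
};

int main() {
  Graph g;
  int me      = g.add_vertex({"likes photography", {0.5, 0.5}});
  int sue_ann = g.add_vertex({"camera collector",  {0.8, 0.2}});
  g.add_edge(me, sue_ann, {0.4});  // similarity weight between me and Sue Ann
  std::printf("W(me, sue_ann) = %.1f\n", g.edges[{me, sue_ann}].similarity);
  return 0;
}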
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], Wij, Likes[j]) ← scope;

  // Update the vertex data
  Likes[i] ← Σj Wij × Likes[j];

  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
}
The Scheduler
The scheduler determines the order in which vertices are updated: CPUs pull vertices from the scheduler and apply update functions to them, and those update functions can push new vertices back into the scheduler. The process repeats until the scheduler is empty.
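A sequential sketch of the scheduler contract (the chain graph, the hop-distance update rule, and all names are illustrative assumptions, not GraphLab's API): vertices are pulled off a work queue, the update function runs and may reschedule neighbors, and execution stops when the queue is empty.

#include <algorithm>
#include <cstdio>
#include <deque>
#include <vector>

int main() {
  // Toy graph: a chain 0 - 1 - 2 - 3. Each vertex stores its hop distance
  // from vertex 0; 100 stands in for "unknown / infinity".
  std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};
  std::vector<int> dist = {0, 100, 100, 100};

  std::deque<int> scheduler = {1};  // initially schedule a neighbor of the source

  // Update function: shrink dist[v] using the neighbors; if the vertex
  // changed, reschedule its neighbors (they may now shrink too).
  auto update = [&](int v) {
    int best = dist[v];
    for (int u : adj[v]) best = std::min(best, dist[u] + 1);
    if (best < dist[v]) {
      dist[v] = best;
      for (int u : adj[v]) scheduler.push_back(u);
    }
  };

  // The scheduler loop: pull a vertex, apply the update function,
  // repeat until the scheduler is empty.
  while (!scheduler.empty()) {
    int v = scheduler.front();
    scheduler.pop_front();
    update(v);
  }

  for (int v = 0; v < 4; ++v)
    std::printf("dist[%d] = %d\n", v, dist[v]);  // 0, 1, 2, 3
  return 0;
}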
The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, and consistency model.
Ensuring Race-Free Code • How much can computation overlap?
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result. (Figure: a parallel schedule over CPU 1 and CPU 2 versus an equivalent sequential schedule on a single CPU, plotted against time.)
Consistency Rules
Guaranteed sequential consistency for all update functions. (Figure: update-function scopes over the data graph under full consistency.)
Full Consistency
Obtaining More Parallelism: relaxing full consistency to the smaller edge-consistency scope.
Edge Consistency: the overlapping vertex is only read by both CPUs, so CPU 1 and CPU 2 can safely run concurrently.
Consistency Through R/W Locks
• Read/write locks implement the consistency models:
• Full consistency: write locks on the center vertex and all adjacent vertices.
• Edge consistency: a write lock on the center vertex, read locks on adjacent vertices.
• Locks are acquired in a canonical ordering so that concurrent updates cannot deadlock.
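A minimal C++17 sketch of edge consistency via reader/writer locks, acquired in a canonical order (here, increasing vertex id) so concurrent updates cannot deadlock. The LockedGraph type and its method names are illustrative, not GraphLab's locking code.

#include <algorithm>
#include <cstdio>
#include <shared_mutex>
#include <utility>
#include <vector>

struct LockedGraph {
  std::vector<double> data;              // vertex data
  std::vector<std::vector<int>> adj;     // adjacency lists
  std::vector<std::shared_mutex> locks;  // one R/W lock per vertex

  explicit LockedGraph(std::vector<std::vector<int>> a)
      : data(a.size(), 0.0), adj(std::move(a)), locks(adj.size()) {}

  // Acquire the scope of v (v plus neighbors) in canonical order, run fn, release.
  template <typename Fn>
  void edge_consistent_update(int v, Fn fn) {
    std::vector<int> scope = adj[v];
    scope.push_back(v);
    std::sort(scope.begin(), scope.end());  // canonical lock ordering
    for (int u : scope) {
      if (u == v) locks[u].lock();           // write lock on the center vertex
      else        locks[u].lock_shared();    // read locks on the neighbors
    }
    fn(v);
    for (int u : scope) {
      if (u == v) locks[u].unlock();
      else        locks[u].unlock_shared();
    }
  }
};

int main() {
  LockedGraph g({{1, 2}, {0, 2}, {0, 1}});  // a triangle
  g.edge_consistent_update(1, [&](int v) {
    double sum = 0.0;
    for (int u : g.adj[v]) sum += g.data[u];  // safe reads of neighbors
    g.data[v] = sum + 1.0;                    // safe write to the center vertex
  });
  std::printf("data[1] = %.1f\n", g.data[1]);
  return 0;
}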
GraphLab for Natural Graphs: the Achilles heel.
Problem: High-Degree Vertices
• Graphs with high-degree vertices are common:
• Power-law graphs (social networks): affect algorithms like label propagation.
• Probabilistic graphical models: hyperparameters which couple large sets of data; connectivity structure induced by natural phenomena.
• High-degree vertices kill parallelism: they pull a large amount of state, require heavy locking, and are processed sequentially.
Proposed Solutions
• Decomposable update functors: expose greater parallelism by further factoring update functions.
• Abelian group caching (concurrent revisions): allows for controllable races through diff operations.
• Stochastic scopes: reduce degree through sampling.
Decomposable Update Functors: breaking computation over the edges of the graph.
Decomposable Update Functors
• Decompose update functions into 3 phases over the scope of a vertex: Gather, Apply, Scatter.
• Gather (user-defined): collect values from adjacent edges and vertices; partial results are combined with a parallel sum, Δ = Δ1 + Δ2 + ...
• Apply (user-defined): apply the accumulated value Δ to the center vertex.
• Scatter (user-defined): update adjacent edges and vertices.
• Locks are acquired only for the region within a phase's scope, giving relaxed consistency.
Decomposable Update Functors
• Implementing label propagation with factorized update functions, using the update functor's state as the accumulator.
• The same (+) operator merges partial gathers: (F1 + F2)(scope) = Δ1 + Δ2.

Gather(i, j, scope) {
  // Get neighborhood data
  (Likes[i], Wij, Likes[j]) ← scope;
  // Emit accumulator
  emit Wij × Likes[j] as Δ;
}

Apply(i, scope, Δ) {
  // Get the center vertex data
  (Likes[i]) ← scope;
  // Update the vertex
  Likes[i] ← Δ;
}

Scatter(i, j, scope) {
  // Get neighborhood data
  (Likes[i], Wij, Likes[j]) ← scope;
  // Reschedule if changed
  if Likes[i] changed then reschedule(j);
}
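The factorized functor above can be sketched in plain C++ as three functions whose partial gathers merge with (+). This is an illustration of the gather/apply/scatter decomposition applied to the label-propagation recurrence, not the GraphLab functor API; the sketch runs the gathers sequentially, though they could run in parallel or in pieces.

#include <cmath>
#include <cstdio>
#include <vector>

struct Edge { int j; double w; };  // neighbor id and weight W_ij

// Gather: one edge's contribution to the accumulator.
double gather(const std::vector<double>& likes, const Edge& e) {
  return e.w * likes[e.j];
}

// Apply: commit the accumulated value to the center vertex;
// returns true if the vertex changed (so neighbors should be rescheduled).
bool apply(std::vector<double>& likes, int i, double delta) {
  bool changed = std::fabs(delta - likes[i]) > 1e-9;
  likes[i] = delta;
  return changed;
}

// Scatter: reschedule neighbors of the center vertex if it changed.
void scatter(const std::vector<Edge>& nbrs, bool changed,
             std::vector<int>& scheduler) {
  if (!changed) return;
  for (const Edge& e : nbrs) scheduler.push_back(e.j);
}

int main() {
  // Vertex 0's neighborhood: profile, Sue Ann, Carlos (weights from the slide).
  std::vector<double> likes = {0.5, 0.5, 0.8, 0.3};  // P(cameras) per vertex
  std::vector<Edge> nbrs = {{1, 0.5}, {2, 0.4}, {3, 0.1}};

  double delta = 0.0;  // parallel sum: delta1 + delta2 + delta3
  for (const Edge& e : nbrs) delta += gather(likes, e);
  std::vector<int> scheduler;
  bool changed = apply(likes, 0, delta);
  scatter(nbrs, changed, scheduler);

  std::printf("Likes[0] = %.2f, rescheduled %zu neighbors\n",
              likes[0], scheduler.size());  // 0.60, 3 neighbors
  return 0;
}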
Decomposable Functors
• Fits many algorithms: loopy belief propagation, label propagation, PageRank, ...
• Addresses the earlier concerns: large state becomes a parallel gather and scatter, heavy locking becomes fine-grained locking, and sequential processing becomes a distributed gather and scatter.
• Problem: high-degree vertices will still get locked frequently and will need to be up-to-date.
Abelian Group Caching: enabling eventually consistent data races.
Abelian Group Caching
• Issue: all earlier methods maintain a sequentially consistent view of data across all processors.
• Proposal: try to split the data instead of the computation.
• How can we split the graph without changing the update function?
Insight from WSDM paper
• Answer: allow eventually consistent data races.
• High-degree vertices admit slightly "stale" values: changes in a few elements have a negligible effect.
• High-degree vertex updates are typically a form of "sum" operation which has an "inverse". Examples: counts, averages, sufficient statistics. Counter-example: max.
• Goal: lazily synchronize duplicate data, similar to a version control system. Intermediate values may be only partially consistent; the final value at termination must be consistent.
Example
• Every processor initially has a copy of the same central value: the master holds 10, and Processors 1, 2, and 3 each hold 10.
• Each processor makes a small change to its local value: Processor 1: 10 → 11 (diff +1), Processor 2: 10 → 7 (diff -3), Processor 3: 10 → 13 (diff +3). True value: 10 + 1 - 3 + 3 = 11.
• Processors 1 and 2 send their delta values (diffs) +1 and -3 to the master and drop their old copies; the master applies them: 10 + 1 - 3 = 8. The master is now consistent with the first two processors' changes.
• The master decides to refresh the other processors: Processors 1 and 2 take the new value 8, and Processor 3 combines the refresh with its still-unsent local delta: 8 + 3 = 11.
• Processor 3 then decides to update the master and sends its diff +3, bringing the master to 8 + 3 = 11, the true value.
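The whole example fits in a few lines of code. The sketch below reproduces the numbers above (master 10; local deltas +1, -3, +3; master 8 after the first two diffs; 11 after the third); the Replica type and its method names are illustrative, not GraphLab's implementation of abelian group caching.

#include <cstdio>

struct Replica {
  double current, old;

  void local_add(double d) { current += d; }  // local change

  double take_diff() {                        // diff to send upstream
    double d = current - old;
    old = current;
    return d;
  }

  void refresh(double master_value) {         // master pushes a new value
    double unsent = current - old;            // keep the local, unsent delta
    old = master_value;
    current = master_value + unsent;
  }
};

int main() {
  double master = 10;
  Replica p1{10, 10}, p2{10, 10}, p3{10, 10};

  p1.local_add(+1);  // 10 -> 11
  p2.local_add(-3);  // 10 -> 7
  p3.local_add(+3);  // 10 -> 13
  // True value: 10 + 1 - 3 + 3 = 11.

  master += p1.take_diff();  // +1
  master += p2.take_diff();  // -3  -> master = 8
  std::printf("master after p1, p2 diffs: %.0f\n", master);  // 8

  p1.refresh(master);        // p1 current = 8
  p2.refresh(master);        // p2 current = 8
  p3.refresh(master);        // p3 current = 8 + 3 = 11 (keeps its unsent delta)

  master += p3.take_diff();  // +3  -> master = 11 = true value
  std::printf("master after p3 diff: %.0f\n", master);       // 11
  return 0;
}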