GraphLab 2: A System for Distributed Graph-Parallel Machine Learning
Joseph Gonzalez, Postdoc, UC Berkeley AMPLab (jegonzal@eecs.berkeley.edu)
The Team: Yucheng Low, Aapo Kyrola, Danny Bickson, Haijie Gu, Carlos Guestrin, Alex Smola, Guy Blelloch
Big Graph Data: More Signal, More Noise
Social Media • Web • Advertising • Science
• Graphs encode relationships between people, products, ideas, facts, and interests
• Big: billions of vertices and edges, and rich metadata
Graphs are Essential to Data-Mining and Machine Learning • Identify influential people and information • Find communities • Target ads and products • Model complex data dependencies
Estimate Political Bias
Given a few users labeled Liberal or Conservative, propagate labels across the social graph of users and their posts to the unknown users (Conditional Random Field + Belief Propagation).
Triangle Counting
• For each vertex in the graph, count the number of triangles containing it
• Measures both the "popularity" of the vertex and the "cohesiveness" of the vertex's community: fewer triangles indicate a weaker community, more triangles a stronger one
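To make the per-vertex count concrete, here is a minimal single-machine sketch in Python (not the distributed PowerGraph implementation discussed later); the example graph at the bottom is illustrative:

    from collections import defaultdict

    def triangle_counts(edges):
        # Build undirected adjacency sets.
        neighbors = defaultdict(set)
        for u, v in edges:
            if u != v:
                neighbors[u].add(v)
                neighbors[v].add(u)
        counts = {v: 0 for v in neighbors}
        # Visit each undirected edge once (u < v); every common neighbor w
        # closes a triangle {u, v, w} and is credited for it. Over the three
        # edges of a triangle, each of its vertices is credited exactly once.
        for u in neighbors:
            for v in neighbors[u]:
                if u < v:
                    for w in neighbors[u] & neighbors[v]:
                        counts[w] += 1
        return counts

    print(triangle_counts([(0, 1), (1, 2), (0, 2), (2, 3)]))
    # -> {0: 1, 1: 1, 2: 1, 3: 0}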
Collaborative Filtering: Exploiting Dependencies
If a user liked Women on the Verge of a Nervous Breakdown, The Celebration, and City of God, what do we recommend? Exploiting dependencies among users and movies suggests Wild Strawberries and La Dolce Vita.
PageRank
• Everyone starts with equal ranks
• Update ranks in parallel: R[i] = 0.15 + Σj∈neighbors(i) wji R[j]
  (the rank of user i is a weighted sum of the neighbors' ranks)
• Iterate until convergence
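A minimal sketch of this synchronous update in Python; the weight convention (e.g. wji = 0.85 / out-degree of j) and the tolerance are illustrative assumptions:

    def pagerank(in_neighbors, weight, num_iters=100, tol=1e-6):
        """in_neighbors: dict vertex -> list of in-neighbors.
        weight: dict (j, i) -> w_ji."""
        R = {i: 1.0 for i in in_neighbors}  # everyone starts with equal rank
        for _ in range(num_iters):
            new_R = {i: 0.15 + sum(weight[(j, i)] * R[j]
                                   for j in in_neighbors[i])
                     for i in in_neighbors}
            if max(abs(new_R[i] - R[i]) for i in R) < tol:  # converged
                return new_R
            R = new_R
        return R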
How should we program graph-parallel algorithms?
"Low-level tools like MPI and Pthreads?" - Me, during my first years of grad school
Threads, Locks, and MPI
• ML experts (graduate students) repeatedly solve the same parallel design challenges:
  • Implement and debug a complex parallel system
  • Tune for a single parallel platform
  • Six months later the conference paper contains: "We implemented ______ in parallel."
• The resulting code:
  • is difficult to maintain and extend
  • couples the learning model and its implementation
How should we program graph-parallel algorithms?
"High-level Abstractions!" - Me, now
The Graph-Parallel Pattern
• A user-defined Vertex-Program runs on each vertex
• The graph constrains interaction along edges:
  • using messages (e.g., Pregel [PODC'09, SIGMOD'10])
  • through shared state (e.g., GraphLab [UAI'10, VLDB'12, OSDI'12])
• Parallelism: run multiple vertex-programs simultaneously
"Think like a Vertex." - Malewicz et al. [SIGMOD'10]
Graph-Parallel Abstractions
• Pregel: messaging, synchronous execution
• GraphLab: shared state, dynamic, asynchronous execution - better for machine learning
The GraphLab Vertex Program
Vertex-programs directly access adjacent vertices and edges (in the figure, vertex 1 sums R[2]*w21 + R[3]*w31 + R[4]*w41 from its neighbors 2, 3, and 4):

    GraphLab_PageRank(i)
      // Compute sum over neighbors
      total = 0
      foreach (j in neighbors(i)):
        total = total + R[j] * wji
      // Update the PageRank
      R[i] = 0.15 + total
      // Trigger neighbors to run again
      if R[i] not converged then
        signal nbrsOf(i) to be recomputed

Signaled vertices are recomputed eventually.
GraphLab Asynchronous Execution
The scheduler determines the order in which vertices are executed: CPUs repeatedly pull the next signaled vertex from a shared scheduler, run its vertex-program, and push any newly signaled neighbors back into the scheduler. The scheduler can prioritize vertices.
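A rough single-threaded sketch of such a prioritized scheduler; the vertex_program callback is an assumption (it runs on one vertex and returns (priority, neighbor) pairs to signal), and the real engine runs this loop concurrently on many CPUs:

    import heapq

    def run(initial_signals, vertex_program):
        # heapq is a min-heap, so callers should negate priorities to run
        # high-priority work first.
        queue = list(initial_signals)      # (priority_key, vertex) pairs
        heapq.heapify(queue)
        pending = {v for _, v in queue}    # vertices currently scheduled
        while queue:
            _, v = heapq.heappop(queue)
            pending.discard(v)
            for key, u in vertex_program(v):   # may signal neighbors
                if u not in pending:           # avoid duplicate scheduling
                    pending.add(u)
                    heapq.heappush(queue, (key, u))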
Benefit of Dynamic PageRank
With dynamic scheduling, 51% of vertices were updated only once.
Asynchronous Belief Propagation
Challenge = boundaries: on a synthetic noisy image, the cumulative vertex-update counts show many updates along the boundaries of the graphical model and few updates elsewhere. The algorithm identifies and focuses on the hidden sequential structure.
GraphLab Ensures a Serializable Execution
• Needed for Gauss-Seidel, Gibbs Sampling, Graph Coloring, ...
• Adjacent vertex-programs share a conflict edge and must not execute simultaneously
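One standard way to obtain this guarantee is to lock a vertex's whole neighborhood in a global order before running its program; the sketch below only illustrates the idea, while the actual GraphLab engine uses finer-grained consistency protocols:

    import threading
    from collections import defaultdict

    locks = defaultdict(threading.Lock)  # one lock per vertex

    def run_serializably(v, neighbors, vertex_program):
        scope = sorted({v} | set(neighbors[v]))  # global order avoids deadlock
        for u in scope:
            locks[u].acquire()
        try:
            vertex_program(v)  # no adjacent program can run concurrently
        finally:
            for u in scope:
                locks[u].release()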
Never Ending Learner Project (CoEM)
• Language modeling: named entity recognition
• GraphLab used 6x fewer CPUs and ran 15x faster, finishing in 0.3% of the Hadoop running time
Thus far, GraphLab provided a powerful new abstraction.
But... we couldn't scale up to the Altavista Webgraph from 2002 (1.4B vertices, 6.6B edges).
Properties of Natural Graphs
Unlike a regular mesh, natural graphs have a power-law degree distribution.
Power-Law Degree Distribution
• AltaVista WebGraph: 1.4B vertices, 6.6B edges
• Plotting the number of vertices against degree gives a line of slope -α with α ≈ 2
• More than 10^8 vertices have one neighbor, while the top 1% of vertices (the high-degree vertices) are adjacent to 50% of the edges!
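An illustrative check of why heavy tails concentrate edges (assumed power law with α ≈ 2 and illustrative sample size, not AltaVista data): sample degrees and measure what share of edge endpoints belongs to the top 1% of vertices. With α this small the mean diverges, so the share is large but varies run to run.

    import random

    def top_share(n=100_000, alpha=2.0, top_frac=0.01):
        # Inverse-transform sampling: P(degree >= d) ~ d^(1 - alpha).
        degrees = sorted(
            (int((1.0 - random.random()) ** (-1.0 / (alpha - 1.0)))
             for _ in range(n)),
            reverse=True)
        k = int(top_frac * n)
        return sum(degrees[:k]) / sum(degrees)

    print(top_share())  # often above 0.5 for alpha ~= 2, varying run to run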
Power-Law Degree Distribution: the "Star-Like" Motif
Example: President Obama and his millions of followers form a star.
Challenges of High-Degree Vertices
• Edges are processed sequentially by a single vertex-program
• One vertex-program touches a large fraction of the graph
• Power-law graphs are provably difficult to partition
Random Partitioning
• GraphLab resorts to random (hashed) partitioning on natural graphs
• 10 machines → 90% of edges cut; 100 machines → 99% of edges cut!
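Those percentages follow directly from the hashing argument: an edge survives only if both endpoints hash to the same machine, which happens with probability 1/M.

    # Expected fraction of edges cut when vertices are hashed uniformly
    # to M machines: the expected cut fraction is 1 - 1/M.
    for machines in (10, 100):
        print(machines, "machines ->", 1 - 1 / machines)  # 0.9, 0.99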
GraphLab2: Program on this graph, run on this distributed graph
• Split high-degree vertices across machines (e.g., Machine 1 and Machine 2)
• A new abstraction ensures equivalence on the split vertices
A Common Pattern for Vertex-Programs

    GraphLab_PageRank(i)
      // Gather information about the neighborhood:
      //   compute sum over neighbors
      total = 0
      foreach (j in neighbors(i)):
        total = total + R[j] * wji
      // Update the vertex: update the PageRank
      R[i] = 0.15 + total
      // Signal neighbors and modify edge data:
      //   trigger neighbors to run again
      priority = |R[i] - oldR[i]|
      if R[i] not converged then
        signal neighbors(i) with priority
Pattern → Abstraction
• Gather(SrcV, Edge, DstV) → A : collect information from a neighbor
• Sum(A, A) → A : commutative, associative sum
• Apply(V, A) → V : update the vertex
• Scatter(SrcV, Edge, DstV) → (Edge, signal) : update edges and signal neighbors
PageRank in GraphLab2

    GraphLab2_PageRank(i)
      Gather(j → i) : return wji * R[j]
      Sum(a, b) : return a + b
      Apply(i, Σ) : R[i] = 0.15 + Σ
      Scatter(i → j) : if R[i] changed then trigger j to be recomputed
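Putting the two slides together, here is a single-machine Python sketch of a GAS engine with PageRank plugged in; the class and parameter names and the 1e-3 threshold are illustrative assumptions ("total" stands in for Sum, since sum is a Python builtin):

    from functools import reduce

    class PageRankProgram:
        def __init__(self, R, weight, tol=1e-3):
            self.R, self.weight, self.tol = R, weight, tol

        def gather(self, j, i):            # Gather(SrcV, Edge, DstV) -> A
            return self.weight[(j, i)] * self.R[j]

        def total(self, a, b):             # Sum(A, A) -> A
            return a + b

        def apply(self, i, acc):           # Apply(V, A) -> V
            old = self.R[i]
            self.R[i] = 0.15 + acc
            return abs(self.R[i] - old) > self.tol   # "changed"?

        def scatter(self, i, out_nbrs):    # Scatter -> vertices to signal
            return out_nbrs[i]             # trigger all out-neighbors

    def gas_engine(program, in_nbrs, out_nbrs, active):
        while active:                      # active: set of signaled vertices
            i = active.pop()
            acc = reduce(program.total,
                         (program.gather(j, i) for j in in_nbrs[i]), 0.0)
            if program.apply(i, acc):
                active.update(program.scatter(i, out_nbrs))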
GAS Decomposition
A high-degree vertex is split into a master and mirrors spread across machines (Machines 1-4). Gather runs in parallel on each machine, producing partial sums Σ1...Σ4; the partials are combined (Σ = Σ1 + Σ2 + Σ3 + Σ4) at the master, which runs Apply and sends the updated value Y' back to the mirrors; Scatter then runs in parallel on each machine.
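A toy simulation of this decomposition for one split vertex, with the machine layout passed in explicitly as an assumption: mirrors gather partial sums, the master combines and applies, and the new value is broadcast back.

    def distributed_gas_step(i, local_in_edges, gather, combine, apply_fn):
        # local_in_edges: one entry per machine, listing the in-neighbors
        # of vertex i whose edges live on that machine (assumed layout).
        partials = []
        for machine_edges in local_in_edges:
            acc = None                      # gather over local edges only
            for j in machine_edges:
                g = gather(j, i)
                acc = g if acc is None else combine(acc, g)
            if acc is not None:
                partials.append(acc)        # mirror ships partial to master
        total = None                        # master combines the partials
        for p in partials:
            total = p if total is None else combine(total, p)
        new_value = apply_fn(i, total)      # master runs Apply
        # Scatter would now run in parallel on each machine using new_value,
        # which the master broadcasts back to the mirrors.
        return new_value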
Minimizing Communication in PowerGraph
• Communication is linear in the number of machines each vertex spans, so a vertex-cut minimizes the machines each vertex spans
• New theorem: for any edge-cut we can directly construct a vertex-cut which requires strictly less communication and storage
• Percolation theory suggests that power-law graphs have good vertex-cuts [Albert et al. 2000]
Constructing Vertex-Cuts
• Evenly assign edges to machines, minimizing the machines spanned by each vertex
• Assign each edge as it is loaded, touching each edge only once
• We propose two distributed approaches: Random Vertex-Cut and Greedy Vertex-Cut
Random Vertex-Cut
• Randomly assign edges to machines, yielding a balanced vertex-cut
• In the example: vertex Y spans 3 machines, vertex Z spans 2 machines, and edges whose endpoints land on one machine are not cut
Analysis of Random Edge-Placement
• Expected number of machines spanned by a vertex of degree d on M machines: E[#spanned] = M(1 - (1 - 1/M)^d)
• On the Twitter follower graph (41 million vertices, 1.4 billion edges), this accurately estimates memory and communication overhead
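The formula is easy to evaluate: each of the M machines receives none of the vertex's d edges with probability (1 - 1/M)^d. The example degrees below are illustrative.

    def expected_span(d, M):
        # Expected machines spanned by a degree-d vertex under random
        # edge placement on M machines.
        return M * (1.0 - (1.0 - 1.0 / M) ** d)

    # A degree-1000 vertex on 64 machines spans ~64 machines on average,
    # while a degree-5 vertex spans only ~4.85.
    print(expected_span(1000, 64), expected_span(5, 64))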
Random Vertex-Cuts vs. Edge-Cuts
Expected improvement from vertex-cuts over edge-cuts: an order-of-magnitude reduction in communication.
Streaming Greedy Vertex-Cuts
• Place edges on machines which already have the vertices in that edge (e.g., an edge (A, B) goes to a machine already holding A and B)
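A sketch of the streaming heuristic, with least-loaded tie-breaking as an assumption (the PowerGraph paper gives the full placement rule): prefer machines holding both endpoints, then one endpoint, then the least-loaded machine.

    from collections import defaultdict

    def greedy_place(edges, num_machines):
        placed_on = defaultdict(set)   # vertex -> machines holding a replica
        load = [0] * num_machines      # edges per machine
        assignment = {}
        for u, v in edges:             # edges arrive one at a time
            both = placed_on[u] & placed_on[v]
            either = placed_on[u] | placed_on[v]
            candidates = both or either or set(range(num_machines))
            m = min(candidates, key=lambda k: load[k])  # least-loaded
            assignment[(u, v)] = m
            placed_on[u].add(m)
            placed_on[v].add(m)
            load[m] += 1
        return assignment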
Greedy Vertex-Cuts Improve Performance Greedy partitioning improves computation performance.
System Design
• PowerGraph (GraphLab2) is implemented as a C++ API on top of MPI/TCP-IP and PThreads, running on EC2 and HPC nodes
• Uses HDFS for graph input and output
• Fault tolerance is achieved by checkpointing; snapshot time < 5 seconds for the Twitter network
Implemented Many Algorithms • Collaborative Filtering • Alternating Least Squares • Stochastic Gradient Descent • Statistical Inference • Loopy Belief Propagation • Max-Product Linear Programs • Gibbs Sampling • Graph Analytics • PageRank • Triangle Counting • Shortest Path • Graph Coloring • K-core Decomposition • Computer Vision • Image stitching • Language Modeling • LDA
PageRank on the Twitter Follower Graph
• Natural graph with 40M users, 1.4 billion links
• GraphLab2 gains an order of magnitude by exploiting properties of natural graphs
• Hadoop results from [Kang et al. '11]; Twister (in-memory MapReduce) from [Ekanayake et al. '10]
GraphLab2 is Scalable
• Yahoo Altavista Web Graph (2002): one of the largest publicly available web graphs, with 1.4 billion webpages and 6.6 billion links
• 7 seconds per iteration: 1B links processed per second
• 30 lines of user code on 64 HPC nodes (1024 cores, 2048 HT)
Topic Modeling
• English-language Wikipedia: 2.6M documents, 8.3M words, 500M tokens
• A computationally intensive algorithm
• Previous system: 100 Yahoo! machines, specifically engineered for this task
• GraphLab2: 64 cc2.8xlarge EC2 nodes, 200 lines of code, and 4 human-hours
Triangle Counting on the Twitter Graph
• Identify individuals with strong communities
• Counted 34.8 billion triangles
• Hadoop [WWW'11]: 1536 machines, 423 minutes
• GraphLab2: 64 machines, 1.5 minutes - 282x faster
• Why? Hadoop uses the wrong abstraction: it broadcasts O(degree^2) messages per vertex
• S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW'11
Machine Learning and Data-Mining Toolkits
Graph Analytics • Graphical Models • Computer Vision • Clustering • Topic Modeling • Collaborative Filtering
All built on the GraphLab2 system (MPI/TCP-IP, PThreads, HDFS), running on EC2 and HPC nodes.
http://graphlab.org - Apache 2 License
GraphChi: Going small with GraphLab Solve huge problems on small or embedded devices? Key: Exploit non-volatile memory (starting with SSDs and HDs)
Triangle Counting on the Twitter Graph (40M users, 1.2B edges)
• Total: 34.8 billion triangles
• Hadoop: 1536 machines, 423 minutes (results from [Suri & Vassilvitskii '11])
• GraphLab2: 64 machines (1024 cores), 1.5 minutes
• GraphChi: 59 minutes on one Mac Mini!
• New ways to represent real-world graphs (splitting vertices across machines)
• New ways to execute graph algorithms
• Orders-of-magnitude improvements over existing systems