Introduction to Large-Scale Graph Computation + GraphLab and GraphChi Aapo Kyrola, akyrola@cs.cmu.edu Feb 27, 2013
Acknowledgments • Many slides (the pretty ones) are from Joey Gonzalez’ lecture (2012) • Many people involved in the research: Haijie Gu Danny Bickson Arthur Gretton Yucheng Low Joey Gonzalez Carlos Guestrin Guy Blelloch Joe Hellerstein David O’Hallaron Alex Smola
Contents • Introduction to Big Graphs • Properties of Real-World Graphs • Why Map-Reduce is not a good fit for big graphs → specialized systems • Vertex-Centric Programming Model • GraphLab -- distributed computation • GraphChi -- disk-based
Basic vocabulary • Graph (network) • Vertex (node) • Edge (link), in-edge, out-edge • Sparse graph / matrix • [Diagram: edge e from vertex A to vertex B. Terms: e is an out-edge of A, and in-edge of B.]
What is a “Big” Graph? • Definition changes rapidly: • GraphLab paper 2009: biggest graph 200M edges • GraphLab & GraphChi papers 2012: biggest graph 6.7B edges • GraphChi @ Twitter: many times bigger. • Depends on the computation as well • matrix factorization (collaborative filtering) or Belief Propagation is much more expensive than PageRank
What is a “Big” Graph? Big graphs are always extremely sparse. • Biggest graphs available to researchers • Altavista: 6.7B edges, 1.4B vertices • Twitter 2010: 1.5B edges, 68M vertices • Common Crawl (2012): 5 billion web pages • But the industry has even bigger ones: • Facebook (Oct 2012): 144B friendships, 1B users • Twitter (2011): 15B follower-edges • When reading about graph processing systems, be critical of the problem sizes – are they really big? • Shun, Blelloch (2013, PPoPP): use a single machine (256 GB RAM) for in-memory computation on the same graphs as the GraphLab/GraphChi papers.
Examples of Big Graphs • Twitter – what kind of graphs? • follow-graph • engagement graph • list-members graph • topic-authority graph (consumers -> producers)
Examples of Big Graphs • Facebook: extended social graph • FB friend-graph: how does it differ from Twitter’s graph? • [Slide from Facebook Engineering’s presentation]
Other Big Networks • WWW • Academic Citations • Internet traffic • Phone calls
What can we compute from social networks / web graphs? • Influence ranking • PageRank, TunkRank, SALSA, HITS • Analysis • triangle counting (clustering coefficient), community detection, information propagation, graph radii, ... • Recommendations • who-to-follow, who-to-follow for topic T • similarities • Search enhancements • Facebook’s Graph Search • But actually: it is a hard question by itself!
Sparse Matrices How to represent sparse matrices as graphs? • User x Item/Product matrices • explicit feedback (ratings) • implicit feedback (seen or not seen) • typically very sparse
Product – Item bipartite graph • [Diagram: a user connected to the movies she has rated – Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita – with ratings such as 4, 3, 2, 5 stored on the edges]
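As a rough illustration of this representation (a sketch with made-up names and values, not from the original slides), the user–item rating matrix can be stored as a bipartite edge list in which only observed ratings become edges:

// Sketch: a sparse user-item rating matrix stored as a bipartite edge list.
// Memory is proportional to the number of observed ratings (nonzeros),
// not to |users| x |items|.
#include <cstdint>
#include <vector>

struct RatingEdge {
    uint32_t user;   // vertex id on the user side
    uint32_t item;   // vertex id on the item side
    float rating;    // edge value, e.g. 1-5 stars; omit for implicit feedback
};

int main() {
    std::vector<RatingEdge> edges;
    edges.push_back({0, 42, 4.0f});  // user 0 rated item 42 with 4 stars
    edges.push_back({0, 7, 3.0f});
    edges.push_back({1, 42, 5.0f});
    // A sparse matrix with N nonzeros becomes a bipartite graph with N edges.
    return 0;
}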
What can we compute from user-item graphs? • Collaborative filtering (recommendations) • Recommend products that users with similar tastes have liked. • Similarity / distance metrics • Matrix factorization • Random walk based methods • Lots of algorithms available. See Danny Bickson’s CF toolkit for GraphChi: • http://bickson.blogspot.com/2012/08/collaborative-filtering-with-graphchi.html
Probabilistic Graphical Models • Each vertex represents a random variable • Edges between vertices represent dependencies • modelled with conditional probabilities • Bayes networks • Markov Random Fields • Conditional Random Fields • Goal: given evidence (observed variables), compute likelihood of the unobserved variables • Exact inference generally intractable • Need to use approximations.
[Figure: example graphical model relating Shopper 1, Shopper 2, and their interests (Cooking, Cameras)]
Image Denoising • [Figure: synthetic noisy image, its graphical model, and the result after a few updates]
Still more examples • CompBio • Protein-Protein interaction network • Activator/deactivator gene network • DNA assembly graph • Text modelling • word-document graphs • Knowledge bases • NELL project at CMU • Planar graphs • Road network • Implicit Graphs • k-NN graphs
Resources • Stanford SNAP datasets: • http://snap.stanford.edu/data/index.html • ClueWeb (CMU): • http://lemurproject.org/clueweb09/ • Univ. of Milan’s repository: • http://law.di.unimi.it/datasets.php
Properties of real-world graphs • [Background: Twitter network visualization by Akshay Java, 2009]
Natural Graphs [Image from WikiCommons]
Natural Graphs • Grids and other planar graphs are “easy”: separators are easy to find • The fundamental properties of natural graphs, in contrast, make them computationally challenging
Power-Law • Degree of a vertex = number of adjacent edges • in-degree and out-degree
Power-Law = Scale-free • Fraction of vertices having k neighbors: • P(k) ∝ k^(−α) • Generative models: • rich-get-richer (preferential attachment) • copy-model • Kronecker graphs (Leskovec, Faloutsos, et al.) • Other phenomena with power-law characteristics?
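As a minimal sketch of the rich-get-richer idea (the sizes and the helper code are illustrative, not from the slides): each new vertex attaches to an existing vertex chosen with probability proportional to its current degree, which produces a power-law degree distribution.

// Sketch: preferential attachment. Sampling a uniformly random entry of the
// endpoint list is the same as sampling a vertex proportionally to its degree.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    std::vector<int> endpoints = {0, 1};  // start with a single edge (0, 1)
    for (int v = 2; v < 100000; ++v) {
        std::uniform_int_distribution<size_t> pick(0, endpoints.size() - 1);
        int target = endpoints[pick(rng)];  // degree-proportional choice
        endpoints.push_back(v);
        endpoints.push_back(target);
    }
    std::printf("%zu edges generated\n", endpoints.size() / 2);
    return 0;
}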
Natural Graphs Power Law • Altavista web graph: 1.4B vertices, 6.7B edges • Top 1% of vertices is adjacent to 53% of the edges! • [Log-log degree distribution plot: “Power Law” with slope α ≈ 2]
Properties of Natural Graphs • Small diameter • expected distance between two nodes in Facebook: 4.74 (2011) • Nice local structure, but no global structure • Great talk by M. Mahoney: “Extracting insight from large networks: implications of small-scale and large-scale structure” • [Illustration from Michael Mahoney’s (Stanford) presentation]
Graph Compression • Local structure helps compression: • Blelloch et al. (2003): compress the web graph to 3-4 bits / link • WebGraph framework from Univ. of Milan • social graphs ~ 10 bits / edge (2009) • Basic idea: • order the vertices so that topologically close vertices have ids close to each other • difference encoding
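A minimal sketch of the difference-encoding idea (not the actual WebGraph format): sort each adjacency list, store the gaps between consecutive neighbor ids, and pack each gap with a variable-length byte code so that small gaps take a single byte. A good vertex ordering makes the gaps small.

// Sketch: gap + varint encoding of one sorted adjacency list.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Append 'value' as a variable-length byte sequence (7 data bits per byte).
void encode_varint(uint32_t value, std::vector<uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>(value) | 0x80);
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

int main() {
    std::vector<uint32_t> neighbors = {1000005, 1000001, 1000042, 1000017};
    std::sort(neighbors.begin(), neighbors.end());
    std::vector<uint8_t> encoded;
    uint32_t prev = 0;
    for (uint32_t v : neighbors) {
        encode_varint(v - prev, encoded);  // store the gap, not the absolute id
        prev = v;
    }
    std::printf("%zu neighbors -> %zu bytes\n", neighbors.size(), encoded.size());
    return 0;
}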
Computational Challenge • Natural graphs are very hard to partition • Hard to distribute computation to many nodes in a balanced way such that the number of edges crossing partitions is minimized • Why? Think about stars. • Graph partitioning algorithms: • METIS • Spectral clustering • Not feasible on very large graphs! • Vertex-cuts better than edge-cuts (more on this later with GraphLab)
Why MapReduce is not enough: large-scale graph computation systems
Parallel Graph Computation • Distributed computation and/or multicore parallelism • Sometimes confusing. We will talk mostly about distributed computation. • Are classic graph algorithms parallelizable? What about distributed? • Depth-first search? • Breadth-first search? • Priority-queue based traversals (Dijkstra’s, Prim’s algorithms)?
MapReduce for Graphs • Graph computation is almost always iterative • MapReduce ends up shipping the whole graph over the network on each iteration (map -> reduce -> map -> reduce -> ...) • Mappers and reducers are stateless
Iterative Computation is Difficult • System is not optimized for iteration • [Diagram: on every iteration, each CPU reloads its data from disk and writes results back, paying a startup penalty and a disk penalty per iteration]
MapReduce and Partitioning • Map-Reduce splits the keys randomly between mappers/reducers • But on natural graphs, high-degree vertices (keys) may have a million times more edges than the average • Extremely uneven distribution • Time of iteration = time of the slowest job.
Curse of the Slow Job • [Diagram: iterations separated by synchronization barriers; on every iteration all CPUs wait at the barrier for the slowest one] • http://www.www2011india.com/proceeding/proceedings/p607.pdf
Map-Reduce is Bulk-Synchronous Parallel • Bulk-Synchronous Parallel = BSP (Valiant, 80s) • Each iteration sees only the values of previous iteration. • In linear systems literature: Jacobi iterations • Pros: • Simple to program • Maximum parallelism • Simple fault-tolerance • Cons: • Slower convergence • Iteration time = time taken by the slowest node
Asynchronous Computation • Alternative to BSP • Linear systems: Gauss-Seidel iterations • When computing value for item X, can observe the most recently computed values of neighbors • Often relaxed: can see most recent values available on a certain node • Consistency issues: • Prevent parallel threads from over-writing or corrupting values (race conditions)
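To make the Jacobi vs. Gauss-Seidel contrast concrete, here is a minimal sketch for a toy fixed-point iteration x = A·x + b (assuming the iteration converges; the code and names are illustrative, not from the slides):

// Sketch: one sweep of Jacobi vs. Gauss-Seidel for the iteration x = A*x + b.
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Jacobi (BSP-style): every update reads only the previous iteration's values,
// so a second copy of the state is required.
Vec jacobi_sweep(const Mat& A, const Vec& b, const Vec& x_old) {
    Vec x_new(x_old.size());
    for (size_t i = 0; i < x_old.size(); ++i) {
        double s = b[i];
        for (size_t j = 0; j < x_old.size(); ++j) s += A[i][j] * x_old[j];
        x_new[i] = s;
    }
    return x_new;
}

// Gauss-Seidel (asynchronous-style): updates in place, so later components
// already see the freshest values; typically converges in fewer sweeps.
void gauss_seidel_sweep(const Mat& A, const Vec& b, Vec& x) {
    for (size_t i = 0; i < x.size(); ++i) {
        double s = b[i];
        for (size_t j = 0; j < x.size(); ++j) s += A[i][j] * x[j];
        x[i] = s;
    }
}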
MapReduce’s (Hadoop’s) poor performance on huge graphs has motivated the development of specialized graph-computation systems
Specialized Graph Computation Systems (Distributed) • Common to all: graph partitions are resident in memory on the computation nodes • Avoids shipping the graph over and over • Pregel (Google, 2010): • “Think like a vertex” • Messaging model • BSP • Open source: Giraph, Hama, Stanford GPS, ... • GraphLab (2010, 2012) [CMU] • Asynchronous (also BSP) • Version 2.1 (“PowerGraph”) uses vertex-partitioning → extremely good performance on natural graphs • + Others • But do you need a distributed framework?
Vertex-centric programming: “Think like a vertex”
Vertex-Centric Programming • “Think like a Vertex” (Google, 2010) • Historically, similar ideas were used in systolic computation, data-flow systems, the Connection Machine, and others. • Basic idea: each vertex computes its own value individually [in parallel] • Program state = vertex (and edge) values • Pregel: vertices send messages to each other • GraphLab/Chi: a vertex reads its neighbors’ and edge values, and modifies edge values (can be used to simulate messaging) • Iterative • Fixed-point computations are typical: iterate until the state does not change (much).
Computational Model (GraphLab and GraphChi) • Graph G = (V, E) • directed edges: e = (source, destination) • each edge and vertex associated with a value (user-defined type) • vertex and edge values can be modified • (GraphChi: structure modification also supported) • [Diagram: edge e from vertex A to vertex B, with a data value attached to every vertex and edge. Terms: e is an out-edge of A, and in-edge of B.]
Vertex Update Function • MyFunc(vertex) { // modify neighborhood } • [Diagram: the update function runs on a vertex and can read and modify the data on the vertex and on its in- and out-edges]
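As a rough sketch of what such an update function looks like in code (illustrative C++ types and names, not the real GraphLab/GraphChi API):

// Sketch: a vertex update function in the gather-apply-scatter spirit.
#include <vector>

struct Vertex {
    float value;                    // vertex value
    std::vector<float> in_edges;    // values read from in-edges
    std::vector<float*> out_edges;  // pointers to values stored on out-edges
};

void update(Vertex& v) {
    // 1. Gather: read the values that in-neighbors wrote on the in-edges.
    float sum = 0.0f;
    for (float x : v.in_edges) sum += x;

    // 2. Apply: compute the new vertex value (user-defined logic).
    v.value = 0.5f * v.value + 0.5f * sum;

    // 3. Scatter: write the new value onto the out-edges so that
    //    out-neighbors see it when they are updated next.
    for (float* e : v.out_edges) *e = v.value;
}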
Parallel Computation • Bulk-Synchronous: all vertices update in parallel (note: needs 2x memory – why?) • Asynchronous: • Basic idea: if two vertices are not connected, they can be updated in parallel • Two-hop connections • GraphLab supports different consistency models, allowing the user to specify the level of “protection” = locking • Efficient locking is complicated in distributed computation (hidden from the user) – why?
Scheduling • Often, some parts of the graph require more iterations to converge than others: • Remember power-law structure • Wasteful to update all vertices equal number of times.
The Scheduler • The scheduler determines the order in which vertices are updated • [Diagram: the scheduler holds a queue of vertices (a, b, c, ...); CPU 1 and CPU 2 pull vertices to update, and update functions may schedule further vertices] • The process repeats until the scheduler is empty
Types of Schedulers (GraphLab) • Round-robin • Selective scheduling (skipping): • round-robin, but jump over unscheduled vertices • FIFO • Priority scheduling • Approximations used in distributed computation (each node has its own priority queue) • Rarely used in practice (why?)
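A minimal sketch of selective scheduling (illustrative data structures, not GraphLab’s actual implementation): a FIFO queue plus a per-vertex flag so each vertex is queued at most once, and converged vertices are simply never re-added.

// Sketch: a selective FIFO scheduler.
#include <cstdint>
#include <deque>
#include <vector>

struct Scheduler {
    std::deque<uint32_t> queue;
    std::vector<bool> scheduled;

    explicit Scheduler(size_t num_vertices) : scheduled(num_vertices, false) {}

    // Called initially for all vertices, and later by update functions that
    // want their neighbors to be re-updated.
    void schedule(uint32_t v) {
        if (!scheduled[v]) { scheduled[v] = true; queue.push_back(v); }
    }

    bool empty() const { return queue.empty(); }

    // Pop the next vertex to update.
    uint32_t next() {
        uint32_t v = queue.front();
        queue.pop_front();
        scheduled[v] = false;
        return v;
    }
};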
Example: PageRank • Exercise: express PageRank in words in the vertex-centric model
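One common way to write it, as a hedged sketch reusing the illustrative Vertex type from the earlier update-function sketch (not the framework’s real API): each in-edge carries in_neighbor_rank / in_neighbor_out_degree, the vertex sums them, applies the damping factor, and writes its own rank share onto its out-edges.

// Sketch: PageRank as a vertex update (damping factor 0.85).
void pagerank_update(Vertex& v) {
    // In-edge values hold in_neighbor_rank / in_neighbor_out_degree.
    float sum = 0.0f;
    for (float x : v.in_edges) sum += x;

    v.value = 0.15f + 0.85f * sum;

    // Broadcast this vertex's rank share to its out-neighbors.
    if (!v.out_edges.empty()) {
        float share = v.value / static_cast<float>(v.out_edges.size());
        for (float* e : v.out_edges) *e = share;
    }
}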
Example: Connected Components • [Diagram: a small graph with vertices 1, 2, 3, 4, 5, 6, 7] • First iteration: each vertex chooses label = its own id.
Example: Connected Components • [Diagram: after one update the labels are 1, 1, 5, 1, 5, 2, 6] • Update: my label = minimum of my own and my neighbors’ labels.
Example: Connected Components • How many iterations are needed for convergence in the synchronous model? What about the asynchronous model? • [Diagram: converged labels 1, 1, 5, 1, 5, 1, 5] • Component id = leader id (smallest id in the component)
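For reference, a minimal label-propagation sketch (illustrative types and names, treating the graph as undirected): each vertex starts with its own id as its label and repeatedly takes the minimum over itself and its neighbors.

// Sketch: connected components by label propagation.
#include <algorithm>
#include <cstdint>
#include <vector>

void cc_update(uint32_t v, std::vector<uint32_t>& label,
               const std::vector<std::vector<uint32_t>>& neighbors) {
    uint32_t smallest = label[v];
    for (uint32_t u : neighbors[v]) smallest = std::min(smallest, label[u]);
    label[v] = smallest;  // converged when no label changes in a full sweep
}

In the synchronous model, the smallest id travels one hop per iteration, so the number of iterations is roughly the longest shortest-path distance from any vertex to its component’s leader (bounded by the component diameter). Asynchronously, a label can travel many hops within a single sweep, so convergence typically needs fewer iterations.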