GraphChi: Big Data – small machine

Aapo Kyrölä, Ph.D. candidate @ CMU • http://www.cs.cmu.edu/~akyrola • Twitter: @kyrpov • What is it good for – and what’s new?

GraphChi can compute on the full Twitter follow-graph with just a standard laptop, about as fast as a very large Hadoop cluster! (Size of the graph in Fall 2013: > 20B edges [Gupta et al. 2013].)

What is GraphChi? • Both in OSDI’12!

Parallel Sliding Windows • Only P large reads for each interval (sub-graph). • P^2 reads on one full pass. • Details: Kyrola, Blelloch, Guestrin: “Large-scale graph computation on just a PC” (OSDI 2012)
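The read counts above can be checked with a small in-memory sketch. It is not GraphChi's implementation: the function names `build_shards` and `psw_full_pass` are hypothetical, shards are plain Python lists rather than disk files, and a "read" is simply counted once per shard touched. It assumes the layout described in the OSDI paper: shard j holds the edges whose destination lies in interval j, sorted by source.

```python
from bisect import bisect_left, bisect_right


def build_shards(edges, intervals):
    """Shard j holds the edges whose destination lies in interval j, sorted by source."""
    shards = [[] for _ in intervals]
    for src, dst in edges:
        for j, (lo, hi) in enumerate(intervals):
            if lo <= dst <= hi:
                shards[j].append((src, dst))
                break
    for shard in shards:
        shard.sort()
    return shards


def psw_full_pass(shards, intervals):
    """Visit every interval once and count the large sequential reads."""
    reads = 0
    subgraphs = []
    for p, (lo, hi) in enumerate(intervals):
        # Memory shard: all in-edges of interval p, loaded in one read.
        in_edges = list(shards[p])
        reads += 1
        out_edges = []
        for j, shard in enumerate(shards):
            if j == p:
                continue
            # Sliding window: shard j is sorted by source, so the out-edges of
            # interval p form one contiguous block, fetched in one read.
            sources = [src for src, _ in shard]
            window = shard[bisect_left(sources, lo):bisect_right(sources, hi)]
            out_edges.extend(window)
            reads += 1
        subgraphs.append((in_edges, out_edges))
    return reads, subgraphs


intervals = [(0, 1), (2, 3), (4, 5)]  # P = 3 vertex intervals
edges = [(0, 2), (1, 4), (2, 0), (3, 5), (4, 1), (5, 3), (0, 1)]
shards = build_shards(edges, intervals)
reads, subgraphs = psw_full_pass(shards, intervals)
print(reads)  # P reads per interval, P * P for the full pass
```

With P = 3 intervals, each interval costs one read of its own shard plus one window from each of the other two shards, so the full pass counts 9 reads.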

Why GraphChi

Performance Comparison • On a Mac Mini: GraphChi can solve as big problems as existing large-scale systems, with comparable performance. • Benchmarks shown: PageRank; WebGraph Belief Propagation (U Kang et al.); Matrix Factorization (Alt. Least Sqr.); Triangle Counting. • See the paper for more comparisons. • Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all the other systems except GraphLab compute synchronously.

Scalability / Input Size [SSD] • Throughput: number of edges processed per second. • Conclusion: the throughput remains roughly constant when the graph size is increased. • No worries about running out of memory or buying more machines when your data grows. (Plot: performance vs. graph size.)

GraphChi^2 • Distributed graph system: Tasks 1 to 6 run on 6 machines; (significantly) less than 2x throughput with 2x machines. • Single-computer system (capable of big tasks): one task per machine; exactly 2x throughput with 2x machines (Tasks 1 to 12 on 12 machines). (Figure: task timelines for both setups over time T.)
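The throughput argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the slide gives no scaling model, so the distributed case is modelled with Amdahl's law and an assumed non-parallelizable fraction of 10%, and all function names are hypothetical.

```python
def independent_throughput(machines, task_time):
    """Tasks finished per unit time when each machine runs its own task."""
    return machines / task_time


def amdahl_speedup(machines, serial_fraction):
    """Amdahl's law: speedup of one task spread over several machines."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / machines)


def distributed_throughput(machines, single_machine_task_time, serial_fraction):
    """Tasks per unit time when all machines cooperate on one task at a time."""
    return amdahl_speedup(machines, serial_fraction) / single_machine_task_time


T = 1.0        # time for one machine to finish one task (arbitrary unit)
SERIAL = 0.10  # assumed non-parallelizable share, not a measured value

single_ratio = independent_throughput(12, T) / independent_throughput(6, T)
distributed_ratio = distributed_throughput(12, T, SERIAL) / distributed_throughput(6, T, SERIAL)
print(single_ratio, distributed_ratio)
```

Under these assumptions, doubling from 6 to 12 independent machines doubles throughput exactly, while the cooperating cluster gains clearly less than 2x.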

Applications for GraphChi
Graph Mining: • Connected components • Approx. shortest paths • Triangle counting • Community Detection
SpMV: • PageRank • Generic
Recommendations: • Random walks
Collaborative Filtering (by Danny Bickson): • ALS • SGD • Sparse-ALS • SVD, SVD++ • Item-CF + many more
Probabilistic Graphical Models: • Belief Propagation

Easy to Get Started • Java and C++ versions available • No installation, just run • Any machine, SSD or HD • http://graphchi.org • http://code.google.com/p/graphchi • http://code.google.com/p/graphchi-java

What’s New

Extensions • 1. Dynamic Edge and Vertex Values: divide shards into small (4 MB) blocks that can be resized separately. (Figure: Shard(j) divided into Block 1, Block 2, Block 3, up to Block N.) • 2. Integration with Hadoop / Pig • 3. Fast neighborhood queries over shards: sparse indices • 4. DrunkardMob: Random Walks (next…)

Random Walk Simulations • Personalized PageRank • Problem: using the power method would require O(V^2) memory to compute for all vertices. • Can be approximated by simulating random walks and computing the sample distribution. • Other applications: • Recommender systems: FolkRank (Hotho 2006), finding candidates • Knowledge-base inference (Lao, Cohen 2009)
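As a rough illustration of the approximation idea, the sketch below simulates walks from one source and normalizes the visit counts into a sample distribution. The function name, the reset probability of 0.15, the walk length, and the rule of jumping back to the source at dead ends are all assumptions made for this example, not details from the talk.

```python
import random
from collections import Counter


def approx_personalized_pagerank(graph, source, num_walks=2000, num_steps=10,
                                 reset_prob=0.15, seed=0):
    """Estimate the visit distribution of random walks that restart at `source`."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        vertex = source
        for _ in range(num_steps):
            neighbors = graph.get(vertex, [])
            # Assumption: reset to the source at dead ends or with reset_prob.
            if not neighbors or rng.random() < reset_prob:
                vertex = source
            else:
                vertex = rng.choice(neighbors)
            visits[vertex] += 1
    total = sum(visits.values())
    return {v: count / total for v, count in visits.items()}


# Tiny directed graph as adjacency lists; vertex 3 is unreachable from 0.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
ppr = approx_personalized_pagerank(graph, source=0)
print(sorted(ppr.items(), key=lambda kv: -kv[1]))
```

Memory use here grows with the number of distinct vertices visited from the source rather than with V^2, which is the point of the sampling approach.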

Random walk in an in-memory graph • Compute one walk at a time (multiple in parallel, of course):

    parfor walk in walks:
        for i = 1 to numsteps:
            vertex = walk.atVertex()
            walk.takeStep(vertex.randomNeighbor())

• Extremely slow in GraphChi / PSW! Each hop might require loading a new interval.
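To see why the walk-at-a-time loop suits disk-based execution poorly, the sketch below runs it in memory and counts how often a hop lands in a different vertex interval than the one currently loaded. It is a simplified cost model, not a measurement: `count_interval_loads` and the fixed `interval_size` are hypothetical, and every interval change is counted as one load.

```python
import random


def count_interval_loads(graph, sources, num_steps, interval_size, seed=0):
    """Run walks one at a time and count interval switches along the way."""
    rng = random.Random(seed)
    hops = 0
    loads = 0
    for source in sources:
        vertex = source
        loaded = vertex // interval_size  # interval currently in memory
        loads += 1
        for _ in range(num_steps):
            neighbors = graph.get(vertex, [])
            if not neighbors:
                break
            vertex = rng.choice(neighbors)
            hops += 1
            if vertex // interval_size != loaded:
                loaded = vertex // interval_size
                loads += 1
    return hops, loads


# Ring graph: each vertex points to the next one.
ring = {v: [(v + 1) % 8] for v in range(8)}
print(count_interval_loads(ring, sources=[0, 4], num_steps=5, interval_size=1))
print(count_interval_loads(ring, sources=[0, 4], num_steps=5, interval_size=8))
```

With one vertex per interval every hop forces a new load, while a single interval covering the whole graph (the in-memory case) needs only the initial load per walk.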

Random walks in GraphChi • DrunkardMob algorithm • Reverse thinking:

    parfor vertex in graph:
        mywalks = walkManager.getWalksAtVertex(vertex.id)
        foreach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

• Need to encode only the current vertex and the source vertex for each walk: a 4-byte integer per walk is sufficient. • With 144 GB RAM, could run 15 billion walks simultaneously (on Java): recommendations for 15 million users.
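The vertex-centric loop and the compact walk encoding can be sketched as follows. This is one possible layout chosen for illustration, not necessarily DrunkardMob's actual bit layout: walks are bucketed by vertex interval, and each walk is a 32-bit integer holding a 24-bit source index and an 8-bit offset within the interval. The names `encode`, `decode`, `start_walks`, `one_pass`, and `positions` are hypothetical, and the pass is simplified so that each walk takes exactly one hop. At 4 bytes per walk, 15 billion walks take 60 GB of raw walk storage, which is consistent with the 144 GB figure above.

```python
import random
from collections import defaultdict

INTERVAL_BITS = 8                  # assumption: 256 vertices per interval
INTERVAL_SIZE = 1 << INTERVAL_BITS
OFFSET_MASK = INTERVAL_SIZE - 1


def encode(source_idx, offset):
    """Pack a source index (up to 24 bits) and an in-interval offset (8 bits)."""
    return (source_idx << INTERVAL_BITS) | offset


def decode(walk):
    return walk >> INTERVAL_BITS, walk & OFFSET_MASK


def start_walks(sources):
    """Place one walk at each source vertex, bucketed by interval."""
    buckets = defaultdict(list)
    for source_idx, vertex in enumerate(sources):
        buckets[vertex // INTERVAL_SIZE].append(encode(source_idx, vertex % INTERVAL_SIZE))
    return buckets


def one_pass(graph, buckets, rng):
    """Sweep the intervals in order; every walk found there takes one hop."""
    next_buckets = defaultdict(list)
    for interval in sorted(buckets):
        for walk in buckets[interval]:
            source_idx, offset = decode(walk)
            vertex = interval * INTERVAL_SIZE + offset
            neighbors = graph.get(vertex, [])
            nxt = rng.choice(neighbors) if neighbors else vertex  # stay put at dead ends
            next_buckets[nxt // INTERVAL_SIZE].append(encode(source_idx, nxt % INTERVAL_SIZE))
    return next_buckets


def positions(buckets):
    """Map each source index to the current vertex of its walk."""
    return {decode(w)[0]: interval * INTERVAL_SIZE + decode(w)[1]
            for interval, walks in buckets.items() for w in walks}


ring = {v: [(v + 1) % 600] for v in range(600)}
buckets = one_pass(ring, start_walks([0, 255, 599]), random.Random(0))
print(positions(buckets))
```

Because walks are grouped by interval, one sequential sweep over the graph advances all of them, instead of each hop triggering its own interval load.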

Keeping track of walks • GraphChi: execution interval • Vertex walks table (WalkManager) • Walk Distribution Tracker (DrunkardCompanion): Source A top-N visits, Source B top-N visits
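The per-source bookkeeping can be pictured with a minimal in-memory sketch. `VisitTracker` is a hypothetical class written for this example; it is not the DrunkardCompanion API, and it keeps full visit counts per source rather than a bounded top-N structure.

```python
from collections import Counter, defaultdict


class VisitTracker:
    """Counts, per source vertex, how often its walks visited each vertex."""

    def __init__(self):
        self._visits = defaultdict(Counter)

    def record(self, source, vertex):
        """Register one visit of `vertex` by a walk that started at `source`."""
        self._visits[source][vertex] += 1

    def top_n(self, source, n):
        """Return the n most visited vertices for `source` with their counts."""
        return self._visits[source].most_common(n)


tracker = VisitTracker()
for vertex in [7, 5, 7, 9, 7, 5]:
    tracker.record("A", vertex)
tracker.record("B", 5)
print(tracker.top_n("A", 2))
```

In a real deployment the counts would have to be pruned or approximated to stay within memory; this sketch skips that step.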

Application: Twitter’s Who-to-Follow • Based on WWW’13 paper by Gupta et al. • Step 1: Compute Circle of Trust (CoT) for each user. • Step 2: Bipartite graph with CoT + CoT’s followees. • Step 3: Compute SALSA and pick top scored users as recommendations. • GraphChi features used: DrunkardMob; neighborhood queries over shards.

Conclusion • GraphChi can run your favorite graph computation on extremely large graphs – on your laptop • Unique features such as random walk simulations and dynamic graphs • Most popular: Collaborative Filtering toolkit (by Danny Bickson)

Thank you! Aapo Kyrölä Ph.D. candidate @ CMU – soon to graduate! http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov
