GraphChi: Big Data – small machine


Presentation Transcript


  1. GraphChi: Big Data – small machine. Aapo Kyrölä, Ph.D. candidate @ CMU. http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov. What is it good for – and what's new?

  2. GraphChi can compute on the full Twitter follow-graph with just a standard laptop, roughly as fast as a very large Hadoop cluster! (Size of the graph in Fall 2013: > 20B edges [Gupta et al. 2013].)

  3. What is GraphChi? (Figure slide. Caption: "Both in OSDI'12!")

  4. Details: Kyrola, Blelloch, Guestrin: "Large-scale graph computation on just a PC" (OSDI 2012). Parallel Sliding Windows: only P large reads for each interval (sub-graph), i.e. P² reads in one full pass over the graph.
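
  The access pattern behind those numbers can be sketched as follows. This is a minimal toy illustration of the Parallel Sliding Windows idea, not GraphChi's actual implementation; the shard layout, interval ranges, and names below are assumptions made for the example.

      # Toy sketch of Parallel Sliding Windows (PSW). Shard j holds the
      # in-edges of vertex interval j, sorted by source vertex, so the
      # edges whose sources fall in any interval form ONE contiguous
      # "window" that can be read sequentially.
      P = 3                                   # number of shards / intervals
      INTERVALS = [(0, 2), (3, 5), (6, 8)]    # vertex-id range per interval
      shards = [
          sorted([(3, 0), (6, 1), (7, 2)]),   # (src, dst) in-edges, interval 0
          sorted([(0, 3), (1, 4), (8, 5)]),   # in-edges of interval 1
          sorted([(2, 6), (4, 7), (5, 8)]),   # in-edges of interval 2
      ]

      def window(shard, lo, hi):
          # One sequential read on disk, since the shard is sorted by src.
          return [e for e in shard if lo <= e[0] <= hi]

      reads = 0
      for j, (lo, hi) in enumerate(INTERVALS):
          in_edges = shards[j]                # 1 read: the full "memory shard"
          reads += 1
          out_edges = []
          for k in range(P):
              if k != j:
                  out_edges += window(shards[k], lo, hi)  # P-1 windowed reads
                  reads += 1
          # ... run update() on each vertex in [lo, hi] with its
          # in_edges and out_edges ...
      print(reads, "large reads in one pass = P^2 =", P * P)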

  5. Why GraphChi

  6. Performance Comparison (see the paper for more comparisons). Benchmarks: PageRank (WebGraph), Belief Propagation (U Kang et al.), Matrix Factorization (Alternating Least Squares), Triangle Counting. On a Mac Mini, GraphChi can solve problems as big as existing large-scale systems, with comparable performance. Notes: the comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all the other systems except GraphLab compute synchronously.

  7. Scalability / Input Size [SSD]. Throughput: number of edges processed per second. (Chart: Performance vs. Graph size.) Conclusion: throughput remains roughly constant as graph size increases, so there is no need to worry about running out of memory, or about buying more machines, when your data grows.

  8. GraphChi^2. (Figure: a distributed graph system on 6 machines running Tasks 1–6 vs. single-computer systems, each capable of big tasks, on 12 machines running Tasks 1–12, both over time T.) A distributed system gains (significantly) less than 2x throughput with 2x machines; independent single-machine instances gain exactly 2x throughput with 2x machines.

  9. Applications for GraphChi. Graph Mining: connected components, approximate shortest paths, triangle counting, community detection. SpMV: PageRank, generic recommendations, random walks. Collaborative Filtering (by Danny Bickson): ALS, SGD, Sparse-ALS, SVD, SVD++, Item-CF, and many more. Probabilistic Graphical Models: Belief Propagation.

  10. Easy to Get Started • Java and C++ versions available • No installation, just run • Any machine, SSD or HD • http://graphchi.org • http://code.google.com/p/graphchi • http://code.google.com/p/graphchi-java

  11. What’s New

  12. Extensions. (Figure: Shard(j) divided into Block 1, Block 2, Block 3, …, Block N.) 1. Dynamic edge and vertex values: divide shards into small (4 MB) blocks that can be resized separately. 2. Integration with Hadoop / Pig. 3. Fast neighborhood queries over shards, using sparse indices (a sketch follows). 4. DrunkardMob: random walks (next…).
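
  As a rough illustration of point 3 (an assumed layout, not GraphChi's actual on-disk format): a sparse index stores the position of every K-th edge in a shard sorted by source, so a neighborhood query needs one index lookup plus a short bounded scan instead of a read of the whole shard.

      import bisect

      # Shard sorted by source vertex; (src, dst) pairs. Assumed toy data.
      shard = [(0, 3), (0, 7), (2, 5), (4, 1), (4, 9), (7, 2), (9, 4)]
      K = 2                                   # index every K-th edge
      index = [(shard[i][0], i) for i in range(0, len(shard), K)]

      def out_neighbors(src):
          # Jump to the last index entry strictly before src, then scan.
          keys = [s for s, _ in index]
          i = bisect.bisect_left(keys, src) - 1
          pos = index[i][1] if i >= 0 else 0
          while pos < len(shard) and shard[pos][0] < src:
              pos += 1                        # short scan, bounded by K
          out = []
          while pos < len(shard) and shard[pos][0] == src:
              out.append(shard[pos][1])
              pos += 1
          return out

      print(out_neighbors(4))                 # -> [1, 9]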

  13. Random Walk Simulations. Example: Personalized PageRank. Problem: using the power method to compute it for all vertices would require O(V²) memory. It can instead be approximated by simulating random walks and computing the sample distribution of visited vertices. Other applications: recommender systems, e.g. FolkRank (Hotho 2006) and finding candidates; knowledge-base inference (Lao, Cohen 2009).
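
  For reference, here is a minimal in-memory sketch of that approximation (not GraphChi or DrunkardMob code; the toy graph and parameters are assumptions): simulate many walks with restarts from a source and use the visit counts as the personalized PageRank estimate.

      import random
      from collections import Counter

      # Toy adjacency list; real graphs are too large for memory,
      # which is exactly the problem DrunkardMob addresses.
      graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}

      def personalized_pagerank(source, num_walks=10000, steps=10, reset=0.15):
          # Estimate the PPR vector of `source` from random-walk visits.
          visits = Counter()
          for _ in range(num_walks):
              v = source
              for _ in range(steps):
                  if random.random() < reset or not graph[v]:
                      v = source                   # restart at the source
                  else:
                      v = random.choice(graph[v])  # random out-neighbor
                  visits[v] += 1
          total = sum(visits.values())
          return {v: c / total for v, c in visits.items()}

      print(personalized_pagerank(0))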

  14. Random walk in an in-memory graph. Compute one walk at a time (multiple in parallel, of course):

      parfor walk in walks:
          for i = 1 to numsteps:
              vertex = walk.atVertex()
              walk.takeStep(vertex.randomNeighbor())

  Extremely slow in GraphChi / PSW! Each hop might require loading a new interval.

  15. Random walks in GraphChi: the DrunkardMob algorithm. Reverse the thinking: instead of iterating over walks, iterate over vertices and advance every walk currently at that vertex:

      parfor vertex in graph:
          mywalks = walkManager.getWalksAtVertex(vertex.id)
          foreach walk in mywalks:
              walkManager.addHop(walk, vertex.randomNeighbor())

  Only the current vertex and the source vertex need to be encoded for each walk: a 4-byte integer per walk is sufficient. With 144 GB RAM, one could run 15 billion walks simultaneously (on Java) – recommendations for 15 million users.
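
  One way such an encoding might look (the bit layout here is an assumption for illustration, not necessarily DrunkardMob's exact format): pack an index into a table of source vertices together with a small position field into one 32-bit integer.

      # Illustrative 32-bit walk encoding (assumed layout). The high bits
      # index the walk's source in a source table; the low bits hold a
      # small position/offset field.
      SOURCE_BITS = 26                  # up to 2^26 ≈ 67M distinct sources
      OFFSET_BITS = 32 - SOURCE_BITS    # 6 bits left for the offset

      def encode_walk(source_idx, offset):
          assert source_idx < (1 << SOURCE_BITS) and offset < (1 << OFFSET_BITS)
          return (source_idx << OFFSET_BITS) | offset

      def decode_walk(w):
          return w >> OFFSET_BITS, w & ((1 << OFFSET_BITS) - 1)

      w = encode_walk(123456, 17)
      assert decode_walk(w) == (123456, 17)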

  16. Keeping track of walks. (Figure: for the current execution interval, GraphChi reads and updates the vertex walks table (WalkManager) and reports visits to the walk distribution tracker (DrunkardCompanion), which maintains the top-N visited vertices per source, e.g. Source A and Source B.)

  18. Application: Twitter's Who-to-Follow. Based on the WWW'13 paper by Gupta et al. Step 1: compute the Circle of Trust (CoT) for each user (random walks with DrunkardMob). Step 2: build a bipartite graph of the CoT and the CoT's followees (neighborhood queries over shards). Step 3: compute SALSA on it and pick the top-scored users as recommendations.
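
  To make Step 3 concrete, here is a minimal SALSA-style power iteration on a small bipartite graph (a simplified sketch over assumed toy data, not the implementation from the paper). Hubs are the users in the Circle of Trust; authorities are the accounts they follow.

      # Hub -> list of authorities it follows (assumed toy data).
      follows = {"u1": ["a", "b"], "u2": ["b", "c"], "u3": ["b"]}

      followers = {}                    # authority -> hubs that follow it
      for u, targets in follows.items():
          for t in targets:
              followers.setdefault(t, []).append(u)

      hub = {u: 1.0 / len(follows) for u in follows}
      for _ in range(20):               # a few iterations usually suffice
          # Authority score: each follower contributes its hub score,
          # split evenly over that follower's out-edges.
          auth = {t: sum(hub[u] / len(follows[u]) for u in fs)
                  for t, fs in followers.items()}
          # Hub score: each followed authority contributes its score,
          # split evenly over its in-edges.
          hub = {u: sum(auth[t] / len(followers[t]) for t in ts)
                 for u, ts in follows.items()}

      # Recommend the top-scored authorities.
      print(sorted(auth, key=auth.get, reverse=True))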

  19. Conclusion • GraphChi can run your favorite graph computation on extremely large graphs – on your laptop • Unique features such as random walk simulations and dynamic graphs • Most popular: Collaborative Filtering toolkit (by Danny Bickson)

  20. Thank you! Aapo Kyrölä Ph.D. candidate @ CMU – soon to graduate! http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov
