Running Large Graph Algorithms – Evaluation of Current State-of-the-Art
Andy Yoo, Lawrence Livermore National Laboratory – Google Tech Talk, Feb 2010. Summarized by Todd Hoff, http://highscalability.com, March 2010.
Based on: Andy Yoo and Ian Kaplan. Evaluating use of data flow systems for large graph analysis. In Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, pp. 1-9, New York, NY, 2009. ACM.
Presented by K. Sheldon, April 2011
Related Papers:
• Andy Yoo and Ian Kaplan. Evaluating use of data flow systems for large graph analysis. In Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, pp. 1-9, New York, NY, 2009. ACM.
• Edmond Chow, Keith Henderson, and Andy Yoo. Distributed Breadth-First Search with 2-D Partitioning. Lawrence Livermore National Laboratory, LLNL Technical Report UCRL-CONF-210829.
• Timothy D. R. Hartley, Umit Catalyurek, Fusun Ozguner, Andy Yoo, Scott Kohn, and Keith Henderson. "MSSG: A Framework for Massive Scale Semantic Graphs." Cluster 2006, 2006.
• Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. "A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L." Proc. SC2005, 2005.
Abstract
The talk gives an overview of findings from experiments at Lawrence Livermore National Laboratory (Center for Applied Scientific Computing) on the performance and scalability of graph algorithms on various computing platforms.
Challenges:
• Large data sizes require out-of-core approaches.
• Graphs with 10^9 nodes and edges are increasingly common.
• The volume of intermediate results grows exponentially.
Overview
Strategy: Pick a likely technology, make the example graph search run on the platform, and collect statistics.
• Distributed-memory parallel architecture (IBM BlueGene/L)
• Shared-memory multi-threading architecture (Sun UltraSPARC T2 Niagara)
• Relational database (Netezza)
• Custom software (MSSG)
• Cloud computing (Hadoop Map/Reduce)
• Dataflow system (Data Analytic Supercomputer (DAS))
Distributed-memory parallel architecture (IBM BlueGene/L)
• Massively parallel: 130,000 low-power processors, 32TB total memory.
• MPI message passing between nodes, with fast interconnects.
• All-to-all communication pattern. As the message size increases (the graph gets larger), this communication becomes the impediment. Mitigations: remove redundant vertices, minimize communication, optimize memory management, reduce communication time, and reduce message volume. (A minimal sketch of the level-synchronous pattern follows this slide.)
Conclusions:
• Difficult to program with message passing, and difficult to scale the system.
• All-to-all communication among the processors is the slow point, and optimizing it is expensive.
• Existing tools break on graphs this large, so the platform is not that useful in practice.
• Scale-free graphs may not scale because of their high-degree "hubs."
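A minimal sketch of the level-synchronous, message-passing BFS pattern described above, written with mpi4py for readability. The actual LLNL code used C MPI with 2-D partitioning; vertex ownership by simple modulo hashing here is an illustrative assumption.

```python
# Level-synchronous distributed BFS with an all-to-all frontier
# exchange. Vertex ownership by modulo hashing is an illustrative
# assumption, not the 2-D partitioning of the cited work.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def owner(v):
    return v % size          # rank that stores vertex v

def bfs(local_adj, root):
    """local_adj: {vertex: [neighbors]} for vertices owned by this rank."""
    visited = set()
    frontier = {root} if owner(root) == rank else set()
    while True:
        # Expand the local frontier and bucket discovered vertices by owner.
        outbox = [[] for _ in range(size)]
        for v in frontier:
            visited.add(v)
            for w in local_adj.get(v, ()):
                outbox[owner(w)].append(w)
        # All-to-all exchange: this step dominates as the graph grows.
        inbox = comm.alltoall(outbox)
        frontier = {w for msgs in inbox for w in msgs} - visited
        # Terminate when every rank has an empty frontier.
        if comm.allreduce(len(frontier), op=MPI.SUM) == 0:
            return visited
```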
Shared-memory multi-threading architecture (Sun UltraSPARC T2 Niagara)
• Server-on-a-chip design: power efficient, throughput oriented, 128GB RAM, PCI-Express, 8 cores, 8 threads per core, chip multithreading.
• Goal is to keep threads running all the time; thread context switching is done in hardware.
Conclusions:
• Shared state between threads killed performance because of lock contention (illustrated below).
• Easier to program than the message-passing model.
• Memory, the number of hardware threads, and lock contention were the impediments.
• Asynchronous, lock-free algorithms need to be developed.
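A small illustration of the lock-contention problem, assuming a straightforward threaded BFS against one shared visited set (not the actual Niagara code); every thread serializes on the single lock, which is the bottleneck the talk identifies.

```python
# Threaded BFS over a shared visited set guarded by one coarse lock.
# Every discovery serializes on the lock, which is the contention
# problem the talk blames; an asynchronous, lock-free visited
# structure is the fix it calls for.
import threading

def parallel_bfs(adj, root, num_threads=8):
    visited = {root}
    lock = threading.Lock()
    frontier, next_frontier = [root], []

    def worker(chunk):
        found = []
        for v in chunk:
            for w in adj.get(v, ()):
                with lock:                    # all threads queue up here
                    if w not in visited:
                        visited.add(w)
                        found.append(w)
        with lock:
            next_frontier.extend(found)

    while frontier:
        chunks = [frontier[i::num_threads] for i in range(num_threads)]
        threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        frontier, next_frontier = next_frontier, []
    return visited
```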
Relational Database (Netezza)
• Relational databases are widely available and easy to use.
• Netezza moves the computation to where the data is stored; an FPGA is used as an accelerator.
• An RDB can store a lot of data: a 300-billion-edge graph with 11 billion nodes, 13.3TB of storage in total, on a 673-node NPS configuration. 80% of the queries returned within 5 minutes using a bidirectional search algorithm.
Conclusions:
• Graph algorithms require a lot of joins (see the sketch below); join performance was not good, so overall performance was not good.
• Easy to program and relatively inexpensive.
• Hard to influence query optimization because the SQL compiler is hidden and its behavior changes between compiler versions.
• Initializing (loading) the data may take hours for large graphs.
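The Netezza schema and queries are not shown in the talk; the sketch below, using SQLite and illustrative table names, shows why a relational BFS is join-heavy: every level of the search is a join of the current frontier against the edge table.

```python
# Why a relational BFS is join-heavy: each level is a join of the
# frontier table against the edge table. Illustrative schema only;
# on a 300-billion-edge table this join cost dominates the algorithm.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edges(src INTEGER, dst INTEGER);
    CREATE TABLE frontier(v INTEGER PRIMARY KEY);
    CREATE TABLE visited(v INTEGER PRIMARY KEY);
    CREATE INDEX idx_edges_src ON edges(src);
""")
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [(1, 2), (2, 3), (3, 4), (1, 4)])
con.execute("INSERT INTO frontier VALUES (1)")
con.execute("INSERT INTO visited VALUES (1)")

level = 0
while True:
    # One BFS level = one frontier-to-edges join.
    rows = con.execute("""
        SELECT DISTINCT e.dst FROM frontier f
        JOIN edges e ON e.src = f.v
        WHERE e.dst NOT IN (SELECT v FROM visited)
    """).fetchall()
    if not rows:
        break
    level += 1
    con.execute("DELETE FROM frontier")
    con.executemany("INSERT INTO frontier VALUES (?)", rows)
    con.executemany("INSERT OR IGNORE INTO visited VALUES (?)", rows)
    print("level", level, "reached", [r[0] for r in rows])
```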
Custom Software (MSSG)
• A streaming graph clustering engine built at LLNL to solve the graph search problem. It can run on any cluster; data is stored on disk (see the out-of-core sketch below).
Lessons learned:
• Well-designed custom software scales to billion-edge graphs, performs well, and is relatively inexpensive compared to the other options.
• Costs: long software development time and lack of generality across other graph-algorithm domains.
• Again, communication and disk writes are the impediments.
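MSSG's internals are not described in detail in the talk; the sketch below only illustrates the out-of-core pattern it relies on: keep the small frontier/visited state in memory and stream the edge list, which is too large for RAM, from disk once per BFS level.

```python
# Out-of-core BFS level: frontier and visited sets live in memory,
# the edge list is streamed sequentially from disk (no random access).
# File format (one "src dst" pair per line) is an assumption.
def disk_bfs_level(edge_file, frontier, visited):
    """One BFS level: scan the edge file once, return newly reached vertices."""
    next_frontier = set()
    with open(edge_file) as f:
        for line in f:                      # sequential disk scan
            src, dst = map(int, line.split())
            if src in frontier and dst not in visited:
                visited.add(dst)
                next_frontier.add(dst)
    return next_frontier
```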
Cloud Computing (Hadoop Map/Reduce)
Map/Reduce model:
• Many map and reduce tasks working independently.
• Data passes between the map and reduce phases via intermediate files.
• Full PubMed graph BFS: 30 million vertices, 500 million edges.
• Cost effective.
• Performed better than Netezza, but was still slow.
• The model is limited to simple parallel applications and is not ideal for complex graph algorithms.
• Poor handling of intermediate results hurts performance: everything is written to disk and read back again (see the sketch below).
• Map/reduce will have to evolve into a dataflow model to increase data parallelism.
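A sketch of one BFS iteration in the standard iterative Map/Reduce formulation (not code from the talk), written as Hadoop Streaming mapper and reducer scripts; the record format is an assumption. Each level rereads and rewrites the entire graph through the filesystem, which is exactly the intermediate-result overhead noted above.

```python
# One BFS iteration as a Hadoop Streaming map/reduce pass. Assumed
# record format: "vertex<TAB>distance<TAB>nbr,nbr,..." with distance
# -1 meaning "not yet reached". One job per BFS level.
import sys
from itertools import groupby

def mapper(stream=sys.stdin, out=sys.stdout):
    for line in stream:
        v, dist, nbrs = line.rstrip("\n").split("\t")
        dist = int(dist)
        print(f"{v}\t{dist}\t{nbrs}", file=out)          # pass graph structure through
        if dist >= 0 and nbrs:                           # reached vertex: relax its edges
            for w in nbrs.split(","):
                print(f"{w}\t{dist + 1}\t", file=out)    # distance candidate

def reducer(stream=sys.stdin, out=sys.stdout):
    records = (line.rstrip("\n").split("\t") for line in stream)
    # Hadoop delivers records sorted by key (vertex id), so groupby works.
    for v, group in groupby(records, key=lambda r: r[0]):
        best, nbrs = -1, ""
        for _, dist, adj in group:
            dist = int(dist)
            if dist >= 0 and (best < 0 or dist < best):
                best = dist                              # keep the shortest distance seen
            if adj:
                nbrs = adj                               # keep the adjacency list
        print(f"{v}\t{best}\t{nbrs}", file=out)
```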
Dataflow System (Data Analytic Supercomputer (DAS))
• DAS is a parallel dataflow engine with highly optimized software. It is expensive.
• More flexible and more complex than Map/Reduce. Data is operated on in parallel, and the data items are independent.
• Tasks are triggered by the availability of data; there is no explicit flow of control (sketched below).
• Enables asynchronous data parallelism: streaming data for handling large intermediate results, pipelined data flow, and flexible user optimization.
• The software is very expensive, but the hardware is off the shelf.
• Memory for in-core processing is the performance impediment.
• To handle graph algorithms there needs to be some form of flow control for the data (tell it to stop here or go there); the lack of it was very limiting.
• Search performance was 5 times better than MSSG, thanks to data parallelism and a highly optimized library.
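DAS is proprietary, so the sketch below only illustrates the dataflow idea, with Python generators standing in for streaming operators: each stage fires as records arrive, and intermediate results are pipelined rather than materialized to disk.

```python
# DAS-style dataflow sketch: operators are stages connected by
# streams, each fires as records arrive, and intermediate results
# are pipelined instead of written out between steps.
def read_edges(path):
    """Stream (src, dst) pairs from a whitespace-separated edge file."""
    with open(path) as f:
        for line in f:
            src, dst = line.split()
            yield int(src), int(dst)

def filter_frontier(edge_stream, frontier):
    for src, dst in edge_stream:      # fires per record, no global barrier
        if src in frontier:
            yield dst

def dedup(vertex_stream, seen):
    for v in vertex_stream:
        if v not in seen:
            seen.add(v)
            yield v

# Wire the operators into a pipeline; with a real engine the stages
# would run in parallel. A tiny in-memory edge stream stands in for
# read_edges("edges.txt") here.
seen = {0}
edges = iter([(0, 1), (0, 2), (1, 3)])
print(list(dedup(filter_frontier(edges, {0}), seen)))   # -> [1, 2]
```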
Conclusions
• No real winner for large graphs.
• For real-time response, BlueGene-type systems are required.
• Hadoop won on cost.
• MSSG and the dataflow system won on performance.
• Andy Yoo thinks none of the systems will scale to very large graphs.
• The future is the dataflow model, because it supports the asynchronous data-parallel model.
• Graph algorithms require a lot of global state and intermediate data; these are the limiting factors. New kinds of algorithms are necessary.
Talk available at: http://www.youtube.com/watch?v=PBLgUBGWcz8
Presented by K. Sheldon, April 2011