220 likes | 398 Views
Engineering Distributed Graph Algorithms in PGAS languages. Guojing Cong, IBM research Joint work with George Almasi and Vijay Saraswat. Programming language from the perspective of a not-so-distant admirer. Mapping graph algorithms onto distributed memory machines has been a challenge.
E N D
Engineering Distributed Graph Algorithms in PGAS languages Guojing Cong, IBM research Joint work with George Almasi and Vijay Saraswat
Programming language from the perspective of a not-so-distant admirer
Mapping graph algorithms onto distributed memory machines has been a challenge • Efficient mapping PRAM algorithm onto SMPs is hard • Mapping onto a cluster of SMPs is even harder • Optimizations are available and shown to improve performance • Can these be somehow automated with help from the language design, compiler and runtime development? • Expectations of the languages • Expressiveness • SPMD, task parallelism (spawn/async), pipeline, future, virtual shared-memory abstraction, work-stealing, data distribution, … • Ease of programming • Efficiency • Mapping high level constructs to run fast on the target machine • SMP • Multi-core, multi-threaded • MPP • Heterogeneous with accelerators • Leverage for tuning
A case study with connected components on a cluster of SMPs with UPC • A connected component of an undirected graph G=(V,E), |V|=n, |E|=m, is a maximal connected subgraph • Connected components algorithm find all such components in G • Sequential algorithms • Breadth-first traversal (BFS) • Depth-first traversal (DFS) • One parallel algorithm -- Shiloach-Vishkin algorithm (SV82) • Edge list as input • Adopts the graft and shortcut approach • Start with n isolated vertices. • Graft vertex v to a neighbor u with (u < v) • Shortcut the connected components into super-vertices and continue on the reduced graph
Example: SV 4 2 4 2 1,4 2,3 1st iter. 1 1 3 3 Input graph graft shortcut 1 2 1 2 2nd iter.
Simple? Yes, performs poorly Sun enterprise E4500 • Memory-intensive, irregular accesses, poor temporal locality
Typical behavior of graph algorithms • CPI construction • LRU stack distance plot • BC – betweeness centrality • BiCC – Biconnected components • MST – Minimum spanning tree
On distributed-memory machines • Random access and indirection make it hard to • implement, e.g, no fast MPI implementation • Optimize, i.e., random access creates problems for both communication and cache performance • The partitioned global address space (PGAS) paradigm • presents a shared-memory abstraction to the programmer for distributed-memory machines. receives a fair amount of attention recently. • allows the programmer to control the data layout and work assignment • improve ease of programming, and also give the programmer leverage to tune for high performance
Implementation in UPC is straightforward UPC implementation Pthread implementation
Communication efficient algorithms • Proposed to address the “bottleneck of processor-to-processor communication” • Goodrich [96] presented a communication-efficient sorting algorithmon weak-CREWBSP that runs in O(log n/ log(h + 1)) communication rounds and O((n log n)/p) local computation time, for h = Θ(n/p) • Adler et. al. [98] presented a communication-optimal MST algorithm • Dehne et al. [02] designed an efficient list ranking algorithm for coarse-grained multicomputers (CGM) and BSP that takes O(log p) communication rounds with O(n/p) local computation • Common approach • simulating several (e.g., O(log p) or O(log log p) ) steps of the PRAM algorithms to reduce the input size so that it fits in the memory of a single node • A “sequential” algorithm is then invoked to process the reduced input of size O(n/p) • finally the result is broadcast to all processors for computing the final solution • Question • How well do communication efficient algorithms work on practice? • How fast can optimized shared-memory based algorithms run? Cache performace vs. communication performance • Can these optimizations be automated through necessary language/compiler support
Locality-central optimization • Improve locality behavior of the algorithm • The key performance issues are communication and cache performance • Determined by locality • Many prior cache-friendly results, but no tangible practical evidence • Fine-grain parallelism makes it hard to optimize for temporal locality • Focus on spatial locality • To take advantage of large cache lines, hardware prefetching, software prefetching
Scheduling of the memory accesses in a parallel loop Typical loop in CC Generic loop
Mapping to the distributed environments • All remote accesses are consecutive in our scheduling • If the runtime provides remote prefetching or coalescing, then communication efficiency can be improved • If not, coalescing can be easily done at the program level as shown on right
Applying the approach to single-node for cache-friendly design • Apply as many levels of recursions as necessary • Simulate the recursions with virtual threads • Assuming a large-enough, one level, fully associative cache Original execution time Optimized execution time
Graph-specific optimization • Compact edge list • the size of the list determines the number of elements to request from remote nodes • edges within components no longer contribute to the merging of connected components, and can be filtered out • Avoid communication hotspot • Grafting in CC shoots a pointer from a vertex with larger numbering to one with smaller numbering. • Thread thr0 owns vertex 0, and may quickly become a communication hotspot • Avoid querying thr0 about D[0]
UPC specific optimization • Avoid runtime cost on local data • After optimization, all direct access to the shared arrays are local • Yet the compiler is not able to recognize • With UPC, we use private pointer arithmetics for • Avoid intrinsics • It is costly to invoke compiler intrinsics to determine the target thread id • Computing target thread ids is done for every iteration. • we compute these ids directly instead of invoking the intrinsics. • Noticing that the target ids do not change across iteration, we compute them once and store them in a global buffer.
So, how helpful is UPC • Straightforward mapping of shared-memory algorithm is easy • quick prototyping • Quick profiling • Incremental optimization (10 versions for CC) • All other optimizations are manual • Many of them can be automated, though • UPC is not flexible enough to expose the hierarchy of nodes and processors to the programmer
Conclusion and future work • We show that with appropriate optimizations, shared-memory graph algorithms can be mapped to the PGAS environment with high performance. • On inputs that fit in the main memory on one node, our implementation achieves good speedups over the best SMP implementation and the best sequential implementation. • Our results suggest that effective use of processors and caches can bring better performance than simply reducing the communication rounds • Automating these optimizations is our future work