370 likes | 575 Views
Multithreaded Algorithms for Graph Coloring. Alex Pothen Purdue University CSCAPES Institute www.cs.purdue.edu/homes/apothen/ Assefaw Gebremedhin , Mahantesh Halappanavar (PNNL), John Feo (PNNL), Umit Catalyurek (Ohio State) CSC’11 Workshop May 2011. References.
E N D
Multithreaded Algorithms for Graph Coloring Alex Pothen Purdue University CSCAPES Institute www.cs.purdue.edu/homes/apothen/ AssefawGebremedhin, MahanteshHalappanavar (PNNL), John Feo (PNNL), UmitCatalyurek (Ohio State) CSC’11 Workshop May 2011
References • Multithreaded algorithms for graph coloring. Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, 40 pp., Submitted to Parallel Computing. • New multithreaded ordering and coloring algorithms formulticore architectures. Patwary, Gebremedhin, Pothen, 12pp., EuroPar 2011. • Graph coloring for derivative computation and beyond: Algorithms, software and analysis. Gebremedhin, Nguyen, Pothen, and Patwary, 32 pp., Submitted to TOMS. • Distributed Memory Parallel Algorithms for Matching and Coloring. Catalyurek, Dobrian, Gebremedhin, Halappanavar, and Pothen, 10pp., IPDPS Workshop PCO, 2011.
Algorithm Thread Granularity, Synchronization Concurrency Performance Architecture Graph Latency Tolerance
Outline • The many-core and multi-threaded world • Intel Nehalem • Sun Niagara • Cray XMT • A case study on multithreaded graph coloring • An Iterative and Speculative Coloring Algorithm • A Dataflow algorithm • RMAT graphs: ER, G, and B • Experimental results • Conclusions
Architectural Features • B = max back degree • over entire seq. • B+1 colors suffice • to color G. O(|E|)-time implementations possible for all four
Multithreaded: Iterative Greedy Algorithm Forbidden Colors v v V ci
Multi-threaded: Data Flow Algorithm ci Forbidden Colors v v V
RMAT Graphs • R-MAT: Recursive MATrix method • Experiments • RMAT-ER (0.25, 0.25, 0.25, 0.25) • RMAT-G (0.45, 0.15, 0.15, 0.25) • RMAT-B (0.55, 0.15, 0.15, 0.15) • Chakrabarti, D. and Faloutsos, C. 2006. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38, 1.
Nehalem: Strong Scaling (Niagara) RMAT-ER RMAT-G RMAT-B
Cray XMT: Strong and Weak Scaling Iter-G Iter-B DF-G DF-B
Comparing Three Platforms a) ER c) Good e) Bad
No. Colors in Parallel Algorithms b) G a) ER c) B
Computing SL Orderings in Parallel: RMAT-G graphs (Nehalem) SL Ordering Relaxed SL Ordering
Our contributions: Multithreaded Coloring • Massive multithreading • Can tolerate memory latency for graphs/sparse matrices • Dataflow algorithms easier to implement than distributed memory versions • Thread concurrency ameliorates lack of caches, and lower clock speeds • Thread parallelism can be exploited at fine grain if supported by lightweight synchronization • Graph structure critically influences performance • Many-core machines • Developed an iterative algorithm for greedy coloring (distance-1 and -2) and ordering algorithms that port to different machines • Simultaneous multithreading can hide latency (X threads on 1 core vs. 1 thread on X cores) • Decomposition into tasks at a finer grain than distributed-memory version, andrelax synchronization to enhance concurrency • Will form nodes of Peta- and Exa-scale machines, so single node performance studies are needed
Time Figure from Robert Golla, Sun • Memory access times determine performance • By issuing multiple threads, mask memory latency if a ready thread is available when a functional unit becomes free • Interleaved vs. Simultaneous multithreading (IMT or SMT)
Multi-core: Sun Niagara 2 • Two 8-core sockets, • 8 hw threads per core • 1.2 GHz processors linked by 8 x 9 crossbar to L2 cache banks • Simultaneous multithreading • Two threads from a core can be issued in a cycle • Shallow pipeline
Multicore: Intel Nehalem • Two quad-core sockets, 2.5 GHz • Two hyperthreads per core support SMT • Off chip-data latency 106 cycles • Advanced architectural features: • Cache coherence protocol to reduce traffic, loop-stream detection, improved branch prediction, • out-of-order execution
Massive Multithreading: Cray XMT • Latency tolerance via massive multi-threading • Context switch between threads in a single clock cycle • Global address space, hashed to memory banks to reduce hot-spots • No cache or local memory, average latency 600 cycles • Memory request doesn’t stall processor • Other threads work while the request is fulfilled • Light-weight, word-level synchr. (full/empty bits) • Notes: • 500 MHz clock • 128 Hardware thread streams/proc., • Interleaved multithreading
Multithreaded Algorithms for Graph Coloring • We developed two kinds of multithreaded algorithms for graph coloring: • An iterative, coarse-grained method for generic shared-memory architectures • A dataflow algorithm designed for massively multithreaded architectures with hardware support for fine-grain synchronization, such as the Cray XMT • Benchmarked the algorithms on three systems: • Cray XMT, Sun Niagara 2 andIntel Nehalem • Excellent speedup observed on all three platforms
Greedy coloring algorithms • Distance-k, star, and acyclic coloring are NP-hard • Approximating coloring to within O(n1-e) is NP-hard for any e>0 GREEDY(G=(V,E)) Order the vertices in V for i = 1 to |V| do Determine colors forbidden to vi Assign vi the smallest permissible color end-for • A greedy heuristic usually gives a near-optimal solution • The key is to find good orderingsfor coloring, and many have been developed Ref: Gebremedhin, Tarafdar, Manne, Pothen, SIAM J. Sci. Compt. 29:1042--1072, 2007.
Distance-1Coloring, Greedy Alg. a v a v
Many-core greedy coloring • Given a graph, parallelize greedy coloring on many-core machines such that Speedup is attained, and Number of colors is roughly same as in serial • Difficult task since greedy is inherently sequential, computation small relative to communication, and data accesses are irregular • D1 coloring: Approaches based on Luby’s parallel algorithm for maximal independent set had limited success • Gebremedhin and Manne (2000) developed a parallel greedy coloring algorithm on shared memory machines • Uses speculative coloring to enhance concurrency, randomized partitioning to reduce conflicts, and serial conflict resolution • Number of conflicts bounded, so this approach yields an effective algorithm • Extended to distance-2 coloring by G, M and P (2002) • We adapt this approach to implement the greedy algorithm for many-core computing
Parallel Coloring: Speculation w a v w a v
Experimental results Dataflow Iterative Cray XMT: RMAT-G with 224, …, 227 vertices and 134M, …, 1B edges
Experimental results Niagara 2 Iterative Perf. With doubling threads on a core = Doubling cores!
Experimental results RMAT-G with 224 = 16M vertices and 134M edges RMAT-B, 224 vertices,134M edges All Platforms
Iterative Greedy Coloring: Multithreaded Algorithm Adj(v), color(w), forbidden(v): d(v) reads each forbidden(v): d(v) writes Adj(v), color(w): d(v) reads each
Experimental results RMAT-G with 224 = 16M vertices and 134M edges All Platforms
Future Plans: Multithreaded Coloring • Massive multithreading • Microbechmarking to understand where the cycles go: thread management, data accesses, synchronization, instruction scheduling, function unit limitations… • Develop a performance model of the computation • Experiment with other graph classes • Consider new algorithmic paradigms • Many-core machines • Four items as above • Ordering for coloring: Archetype of a problem for computing a sequential ordering in a parallel environment (MostofaPatwary and AssefawGebremedhin) • Extend to nodes of Peta-scale machines, so single node performance is enhanced, and complete our work on the Blue Gene and the Cray XT5
Thanks • Rob Bisseling, Erik Boman,ÜmitÇatalürek, Karen Devine, Florin Dobrian, John Feo, AssefawGebremedhin,MahanteshHalappanavar, Bruce Hendrickson, Paul Hovland, Gary Kumfert, Fredrik Manne, AliPınar, Sivan Toledo, Jean Utke
Further readingwww.cscapes.org • Gebremedhin and Manne, Scalable parallel graph coloring algorithms,Concurrency: Practice and Experience, 12: 1131-1146, 2000. • Gebremedhin, Manne and Pothen, Parallel distance-k coloring algorithms for numerical optimization, Lecture Notes in Computer Science, 2400: 912-921, 2002. • Bozdag, Gebremedhin, Manne, Boman and Catalyurek. A framework for scalable greedy coloring on distributed-memory parallel computers. J. Parallel Distrib. Comput. 68(4):515-535, 2008. • Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, Multi-threaded algorithms for graph coloring, Preprint, Aug. 2010.