General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Benchmarks
George C. Caragea, Fuat Keceli, Alexandros Tzannes, Uzi Vishkin

XMT: An Easy-to-Program Many-Core

XMT: Motivation and Background
• Many-cores are coming, but 40 years of parallel computing have never produced a successful general-purpose parallel computer: easy to program, good speedups, scalable up and down.
• IF you could program it, you got great speedups. XMT: fix the IF.
• XMT is designed from the ground up to address this for on-chip parallelism; tested HW & SW prototypes.
• XMT builds on PRAM algorithmics, the only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base.

XMT Programming Model
• At each step, provide all instructions that can execute concurrently (i.e., that do not depend on each other).
• PRAM/XMT abstraction: all such instructions execute immediately ("uniform cost").
• PRAM-like programming uses reduced synchrony.
• Main construct: the spawn-join block, which can start any number of virtual threads at once.
• Virtual threads advance at their own speed, not in lockstep.
• Prefix-sum (ps): similar to an atomic fetch-and-add.

Ease of Programming
• A necessary condition for the success of a general-purpose platform; already in von Neumann's 1947 specs.
• Indications that XMT is easy to program:
  • XMT is based on a rich algorithmic theory (PRAM).
  • Ease of teaching as a benchmark: parallel programming was successfully taught to middle-school and high-school students and up; evaluated by education experts (SIGCSE 2010), XMT came out superior to MPI, OpenMP and CUDA.
  • A programmer's workflow exists for deriving efficient programs from PRAM algorithms.
  • DARPA HPCS productivity study: XMT development time was half that of MPI.

XMTC Programming Language
• C with simple SPMD extensions.
• spawn: start any number of virtual threads.
• $: unique thread ID.
• ps/psm: atomic prefix sum, with an efficient hardware implementation.

XMTC Example: Array Compaction
• Non-zero elements of A are copied into B; order is not necessarily preserved.

  int A[N], B[N];
  int base = 0;
  spawn(0, N-1) {
      int inc = 1;
      if (A[$] != 0) {
          ps(inc, base);
          B[inc] = A[$];
      }
  }

• After atomically executing ps(inc, base): base = base + inc, and inc receives the original value of base.
• Elements are therefore copied into unique locations in B.
• A plain-C sketch of the same pattern appears after the architecture comparison below.

Paraleap: XMT PRAM-on-Chip Silicon
• Built an FPGA prototype, announced at SPAA'07.
• Built from 3 FPGA chips: 2 Virtex-4 LX200 and 1 Virtex-4 FX100.

Tesla vs. XMT: Comparison of Architectures

Tested Configurations: GTX280 vs. XMT-1024
• Configurations must have equivalent area constraints (576 mm² in 65 nm).
• One cannot simply set the number of functional units and the amount of memory to the same values.
• The area estimate of the envisioned XMT chip is based on the 64-TCU XMT ASIC prototype (designed in 90 nm IBM technology).
• In each category, the more area-intensive side is emphasized.
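For readers without an XMT toolchain, here is a minimal plain-C sketch of the same compaction pattern, intended only to illustrate the ps semantics described above (ps behaves like an atomic fetch-and-add that hands each thread the previous value of base). The OpenMP pragma and C11 atomics are stand-ins for XMT's spawn and hardware prefix-sum, not part of XMTC.

  #include <stdatomic.h>
  #include <stdio.h>

  #define N 8

  int main(void) {
      int A[N] = {3, 0, 7, 0, 0, 4, 9, 0};
      int B[N];
      atomic_int base = 0;

      /* Each iteration plays the role of one XMT virtual thread. */
      #pragma omp parallel for
      for (int i = 0; i < N; i++) {
          if (A[i] != 0) {
              /* Like ps(inc, base) with inc == 1: atomically add 1 to base
                 and receive the old value of base as a unique index into B. */
              int inc = atomic_fetch_add(&base, 1);
              B[inc] = A[i];
          }
      }

      /* Prints the non-zero elements of A, in some order. */
      for (int i = 0; i < base; i++)
          printf("%d ", B[i]);
      printf("\n");
      return 0;
  }

Compile with, e.g., cc -fopenmp; without OpenMP the pragma is ignored and the loop runs serially, which still demonstrates the fetch-and-add behavior.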
Experimental Platform
• XMTSim: the cycle-accurate XMT simulator.
• Timing is modeled after the 64-TCU FPGA prototype.
• Highly configurable: can simulate any configuration.
• Modular design enables architectural exploration.
• Part of the XMT software release: http://www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html

Experimental Evaluation: Benchmarks and Performance Comparison
• With the 1024-TCU XMT configuration:
  • 6.05x average speedup on irregular applications
  • 2.07x average slowdown on regular applications
• With the 512-TCU XMT configuration:
  • 4.57x average speedup on irregular applications
  • 3.06x average slowdown on regular applications
• Case study: BFS on a low-parallelism dataset:
  • 73.4x speedup over the Rodinia implementation
  • 6.89x speedup over the UIUC implementation
  • 110.6x speedup when using only 64 TCUs (lower latencies in the smaller design)

Conclusion and Future Work
• SPAA'09: 10x over an Intel Core Duo with the same silicon area.
• Current work:
  • XMT outperforms the GPU on all irregular workloads.
  • XMT does not fall significantly behind on regular workloads.
  • There is no need to pay a high performance penalty for ease of programming.
• A promising candidate for the pervasive platform of the future: a highly parallel general-purpose CPU coupled with a parallel GPU.
• Future work: power/energy comparison of XMT and GPU.