Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus State-of-the-Art Processor
George C. Caragea – University of Maryland
A. Beliz Saybasili – LCB Branch, NHLBI, NIH
Xingzhi Wen – NVIDIA Corporation
Uzi Vishkin – University of Maryland
www.umiacs.umd.edu/users/vishkin/XMT
Hardware prototypes of PRAM-On-Chip
• 64-core, 75 MHz FPGA prototype [SPAA’07, Computing Frontiers’08]. Original explicit multi-threaded (XMT) architecture [SPAA’98] (Cray started to use the name “XMT” ~7 years later)
• Interconnection network for 128 cores: 9mm × 5mm, IBM 90nm process, 400 MHz prototype [HotInterconnects’07]
• Same design as the 64-core FPGA: 10mm × 10mm, IBM 90nm process, 150 MHz prototype
• The design scales to 1000+ cores on-chip
Objective of current paper: meaningful comparison of
• Our FPGA design, with
• A state-of-the-art (Intel) processor
XMT: A PRAM-On-Chip Vision
• Manycores are coming. But 40 years of parallel computing: never a successful general-purpose parallel computer (easy to program, good speedups, up&down scalable). IF you could program it: great speedups. XMT: fix the IF
• XMT: designed from the ground up to address that for on-chip parallelism, unlike matching current HW (as in some other SPAA papers)
• Tested HW & SW prototypes
• Builds on PRAM algorithmics, the only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base
• This paper: ~10X relative to Intel Core 2 Duo
• If there is time: really serious about ease of programming
Objective for programmer’s model
What could I do in parallel at each step, assuming unlimited hardware?
[Figure: serial paradigm vs. natural (parallel) paradigm. Serial: one operation per time unit, so Time = Work. Parallel: any number of operations per time unit, so Time << Work, where Work = total #ops.]
• Emerging platforms: not sure, but the analysis should be work-depth. But why not design for your analysis? (like serial)
• XMT: design for work-depth. Unique among manycores.
- 1 operation now. Any #ops next time unit.
- Competitive on nesting. (To be published.)
- No need to program for locality.
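For reference, the work-depth bound this slide alludes to is standard PRAM theory (Brent’s scheduling principle); this aside is ours, not a claim from the paper:

```latex
% Work-depth (WD) analysis: W = total number of operations (work),
% D = depth, the longest chain of dependent operations (time with
% unlimited hardware). Brent's scheduling principle: on p processors,
T_p \le \frac{W}{p} + D
% so as p grows, the running time approaches D, i.e., Time << Work
% whenever D << W -- the "natural paradigm" in the figure above.
```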
Programmer’s Model: Engineering Workflow
• Arbitrary CRCW work-depth algorithm. Reason about correctness & complexity in the synchronous model
• SPMD reduced synchrony
• Threads advance at their own speed, not in lockstep
• Main construct: spawn-join block. Note: can start any number of threads at once
• Prefix-sum (ps). Independence of order semantics (IOS)
• Establish correctness & complexity by relating to the WD analysis
• Circumvents “the problem with threads”, e.g., [Lee]
• Tune (compiler or expert programmer): (i) length of sequence of round trips to memory, (ii) QRQW, (iii) WD [VCL07]
• Trial & error contrast: similar start; while insufficient inter-thread bandwidth do { rethink algorithm to take better advantage of cache }
[Diagram: program as alternating serial sections and spawn-join parallel sections]
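A minimal XMTC sketch of the spawn-join and prefix-sum (ps) constructs above, shown on array compaction; this is our illustration based on the public XMTC tutorial (the exact syntax and all names here are assumptions, not code from the paper):

```c
// XMTC sketch (assumed syntax): compact the nonzeros of A[0..N-1] into B.
// $ denotes the current virtual thread's ID inside a spawn block.
#define N 1024
int A[N], B[N];
psBaseReg count;            // base register for the atomic prefix-sum

int main(void) {
    count = 0;
    spawn(0, N - 1) {       // start N virtual threads at once
        int inc = 1;
        if (A[$] != 0) {
            ps(inc, count); // atomic: inc <- old count; count += 1
            B[inc] = A[$];  // IOS: any completion order yields a valid B
        }
    }                       // implicit join: all threads have finished here
    return 0;
}
```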
XMT Architecture Overview
• One serial core: the master thread control unit (MTCU)
• Parallel cores (TCUs) grouped in clusters
• Global memory space evenly partitioned into cache banks using hashing
• No local caches at the TCUs; avoids expensive cache-coherence hardware
[Block diagram: MTCU + hardware scheduler/prefix-sum unit; clusters 1…C of TCUs; parallel interconnection network; shared memory (L1 cache) banks 1…M; DRAM channels 1…D]
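To make the bank-hashing bullet concrete, here is an illustrative address-to-bank mapping; the slides do not specify the actual XMT hash function, so everything below (line size, bank count, mixing step) is assumed:

```c
// Illustrative only: spread cache lines across M banks so that regular
// access patterns (e.g., fixed strides) do not all hit the same bank.
#include <stdint.h>

#define LINE_BITS 6                       /* 64-byte cache lines (assumed) */
#define NUM_BANKS 64                      /* M = 64 banks (assumed) */

static inline unsigned bank_of(uint64_t addr) {
    uint64_t line = addr >> LINE_BITS;    /* cache-line index */
    line ^= line >> 17;                   /* cheap bit mixing */
    return (unsigned)(line % NUM_BANKS);  /* bank holding this line */
}
```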
Paraleap: XMT PRAM-on-chip silicon
• Built FPGA prototype, announced at SPAA’07
• Built using 3 FPGA chips: 2 Virtex-4 LX200, 1 Virtex-4 FX100
• With no prior design experience, X. Wen completed the synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years
• X. Wen is one person: the basic simplicity of the XMT architecture means faster time to market and lower implementation cost
Benchmarks
• Sparse Matrix – Vector Multiplication (SpMV)
• Matrix stored in Compressed Sparse Row (CSR) format
• Serial version: iterate through the rows
• Parallel version: one thread per row (see the sketch after this list)
• 1-D FFT
• Fixed-point arithmetic implementation
• Serial version: Radix-2 Cooley-Tukey algorithm
• Parallel version: each stage of the serial algorithm parallelized
• Quicksort
• Serial version: standard textbook implementation
• Parallel version: two phases
• Phase 1: for large sub-arrays, parallelize the partitioning operation using the atomic prefix-sum
• Phase 2: process all partitions in parallel, each using the serial partitioning algorithm
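A minimal XMTC sketch of the one-thread-per-row SpMV (our illustration; the array names, sizes, and integer element type are assumptions, not the paper’s code):

```c
// SpMV in CSR format, one XMTC virtual thread per row (assumed syntax).
// values[]/cols[] hold the nonzeros; row i occupies rowptr[i]..rowptr[i+1]-1.
#define NROWS 1024                     /* square matrix assumed */
#define NNZ   8192
int rowptr[NROWS + 1], cols[NNZ];
int values[NNZ], x[NROWS], y[NROWS];

int main(void) {
    spawn(0, NROWS - 1) {              // one virtual thread per matrix row
        int sum = 0;
        for (int j = rowptr[$]; j < rowptr[$ + 1]; j++)
            sum += values[j] * x[cols[j]];
        y[$] = sum;                    // rows independent: no synchronization
    }
    return 0;
}
```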
Experimental Platforms
For a meaningful comparison: compare cycle counts
Input Datasets
• Large dataset: represents realistic input sizes
• Recommended by an Intel engineer for the comparison
• Gives the Intel Core 2 an advantage because of its larger cache
• Small dataset
• Fits in both the Paraleap and the Intel Core 2 caches
• Provides the fairest comparison for the current XMT generation
Clock-Cycle Speedup
• Computed as: speedup = #ClockCycles for Core 2 / #ClockCycles for Paraleap
• Paraleap outperforms the Intel Core 2 on all benchmarks
• Lower speedups on the Large dataset because of Paraleap’s smaller cache
• Will not be an issue for future implementations of XMT
• Silicon area of the 64-TCU XMT is roughly the same as one core of the Intel Core 2 Duo
• No reason for the clock frequency of XMT to fall behind
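A hypothetical worked instance of this metric (the numbers below are invented for illustration and are not measured results from the paper):

```latex
% Cycle-count speedup with made-up numbers: if Core 2 spends
% C_{c2} = 5\times10^{9} cycles on a benchmark and Paraleap spends
% C_{xmt} = 5\times10^{8} cycles, then
\text{speedup} = \frac{C_{c2}}{C_{xmt}}
               = \frac{5\times10^{9}}{5\times10^{8}} = 10\times
% Counting cycles rather than wall-clock time deliberately factors out
% the clock gap between the 75 MHz FPGA prototype and a GHz-class CPU.
```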
Conclusion
• XMT provides a viable answer to the biggest challenges for the field:
• Ease of programming
• Scalability (up & down)
• Preliminary evaluation shows good results for the XMT architecture versus a state-of-the-art Intel Core 2 platform
• An ICPP’08 paper compares XMT with GPUs
Software release
Allows you to use your own computer to program in an XMT environment and experiment with it, including:
• A cycle-accurate simulator of the XMT machine
• A compiler from XMTC to that machine
Also provided: extensive material for teaching or self-study of parallelism, including
• Tutorial + manual for XMTC (150 pages)
• Class notes on parallel algorithms (100 pages)
• Video recording of the 9/15/07 high-school tutorial (300 minutes)
• Video recordings of the graduate Parallel Algorithms lectures (30+ hours)
www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html
Next Major Objective
Industry-grade chip. Requires 10X in funding.
Ease of Programming
• Benchmark: can any CS major program your manycore? You cannot really avoid this question. Teachability demonstrated so far:
- To a freshman class with 11 non-CS students. Some programming assignments: merge-sort, integer-sort & sample-sort
• Other teachers:
- A magnet HS teacher. Downloaded the simulator, assignments, and class notes from the XMT page. Self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, and analyze: ability to anticipate performance (as in serial). Can do not just embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. Look up the keynote at CS4HS’09@CMU + the interview with the teacher
- High school & middle school students (some 10 years old) from underrepresented groups, taught by a HS math teacher