
Merrimac Supercomputing with Streams

Mattan Erez, Concurrent VLSI Architecture Group, Computer Systems Lab, Stanford University. September 9, 2003.


Presentation Transcript


  1. Merrimac Supercomputing with Streams. Mattan Erez, Concurrent VLSI Architecture Group, Computer Systems Lab, Stanford University. September 9, 2003

  2. Streaming Scientific Computation Exploits the Capabilities of VLSI • Modern VLSI technology makes arithmetic cheap • 100s of GFLOPS/chip today, TFLOPS in 2010 • But bandwidth is expensive • Streams change the ratio of arithmetic to bandwidth • By exposing producer-consumer locality • Not exploited by caches – no reuse, no spatial locality • Streams also expose parallelism • To keep 100s of FPUs per processor busy • High-radix networks reduce the cost of bandwidth where it is needed • Simplifies programming Merrimac

  3. Computation is Inexpensive and Plentiful • nVidia GeForce4: ~120 GFLOPS, ~1.2 Tops/s • Velio VC3003: 1 Tb/s I/O bandwidth • DRAM: < $0.20/MB Merrimac

  4. To Exploit VLSI Technology We Need: • Parallelism • To keep 100s of ALUs per chip (thousands per board, millions per system) busy • Latency tolerance • To cover 500-cycle remote memory access time • Locality • To match 20 Tb/s of ALU bandwidth to ~100 Gb/s of chip bandwidth • Moore’s Law • Growth of transistors, not performance Arithmetic is cheap, global bandwidth is expensive: local << global, on-chip << off-chip << global system Merrimac
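
A back-of-the-envelope check of the locality requirement, using only the two bandwidth figures on this slide (the reuse factor below is simply their ratio, not a number from the talk):

    #include <stdio.h>

    /* Rough arithmetic-intensity estimate from the figures on this slide:
     * ~20 Tb/s of ALU operand bandwidth vs ~100 Gb/s of off-chip bandwidth.
     * Every bit fetched from off-chip must therefore be reused on-chip
     * roughly 200 times for the FPUs to stay busy. */
    int main(void) {
        double alu_bw_bps  = 20e12;   /* ~20 Tb/s aggregate ALU bandwidth */
        double chip_bw_bps = 100e9;   /* ~100 Gb/s off-chip bandwidth     */
        printf("required on-chip reuse per off-chip bit: ~%.0fx\n",
               alu_bw_bps / chip_bw_bps);
        return 0;
    }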

  5. Stream Architecture Makes Communication Explicit – Exploits Parallelism and Locality Merrimac

  6. Stream Programming Exposes Parallelism and Locality • Locality • Producer-consumer within a kernel (local register file) • Producer-consumer across kernels (on-chip SRF) • Stream locality (spatial) • Parallelism • Data Level Parallelism across stream elements • Task Parallelism across kernels Merrimac
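
A rough scalar-C analogy of producer-consumer locality (illustrative only; a Brook program expresses the same idea with kernels and streams, and the function names here are made up): when two element-wise stages are chained, the intermediate value can stay in a register or on-chip buffer instead of making a round trip through memory.

    #include <stddef.h>

    /* Unfused version: the intermediate array 'tmp' makes a round trip
     * through memory between the producer and the consumer stages. */
    void two_pass(const float *in, float *tmp, float *out, size_t n) {
        for (size_t i = 0; i < n; i++) tmp[i] = in[i] * in[i];   /* producer */
        for (size_t i = 0; i < n; i++) out[i] = tmp[i] + 1.0f;   /* consumer */
    }

    /* Fused (stream-style) version: the producer's result is consumed
     * immediately, so the intermediate stays in a register -- the same
     * effect the LRF/SRF hierarchy gives producer-consumer streams. */
    void fused(const float *in, float *out, size_t n) {
        for (size_t i = 0; i < n; i++) {
            float t = in[i] * in[i];   /* producer result stays local */
            out[i]  = t + 1.0f;        /* consumer uses it directly   */
        }
    }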

  7. Streamed Applications Exploit Producer-Consumer Locality to Exploit BW Hierarchy Stream program matches application to BW Hierarchy 1:1:6:100 Merrimac

  8. A streaming supercomputer exploits the arithmetic density of VLSI to realize an efficiency of $6/GFLOPS and $3/M-GUPS • Capability AND Capacity • A PFLOPS machine with only 8,192 nodes (or fewer) • A TFLOPS workstation for $20,000 (parts cost)

  9. Merrimac - a Streaming Supercomputer Merrimac

  10. Merrimac Node Merrimac

  11. Merrimac Processor • 90 nm tech (1 V) • ASIC technology • 1 GHz (37 FO4) • 128 GFLOPS • Inter-cluster switch between clusters • 127.5 mm2 (small, ~12 mm x 10 mm) • Stanford Imagine is 16 mm x 16 mm • MIT Raw is 18 mm x 18 mm • 32 Watts (P4 = 75 W) Merrimac

  12. Merrimac Processor • 90 nm tech (1 V) • ASIC technology • 1 GHz (37 FO4) • 128 GOPS • Inter-cluster switch between clusters • 127.5 mm2 (small, ~12 mm x 10 mm) • Stanford Imagine is 16 mm x 16 mm • MIT Raw is 18 mm x 18 mm • 32 Watts (P4 = 75 W) [Die floorplan: 16 compute clusters, two Mips64 20kc scalar processors, microcontroller, eight cache banks, address generation and reorder buffers, forward ECC, memory (RDRAM) interfaces, and network interface; die edge 12.5 mm] Merrimac

  13. High-Radix Routers Enable Economical Global Memory • Flat memory bandwidth within a 16-node board • 4:1 concentration within a 32-node backplane, 8:1 across a 32-backplane system • Routers with bandwidth B = 640 Gb/s route messages of length L = 128 b • Requires a high radix to exploit this bandwidth Merrimac

  14. Bandwidth Hierarchy Enabled by Streams Streaming Network Merrimac

  15. Rough Per-Node Budget Preliminary numbers, parts cost only, no I/O included Merrimac

  16. Gentle Slope for Porting Applications to Merrimac (like OpenMP)

  17. Brook and Brooktran – Stream Extensions with Multiple Dimensions and Irregular Grids • Stream code intermixed with “scalar” code • C: Brook; Fortran: Brooktran • Easy migration - only port the time-consuming kernels • Locality and Parallelism • Producer-consumer locality • No global memory references within a kernel • No retained state in kernels (reductions allowed) • Stream elements are processed independently • Stream operators for defining neighbors • Gather/Scatter and GatherOp/ScatterOp Merrimac
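
A plain-C sketch of the kernel rules listed above (this is not Brook syntax; the names are illustrative): a kernel body is a pure element-wise function with no global memory references and no state carried between elements, so every element can be processed independently, and neighbor data arrives through an explicit gather via an index stream.

    #include <stddef.h>

    /* Element-wise "kernel" body: a pure function of its inputs, with no
     * global memory references and no retained state between elements. */
    static inline float interact(float a, float b) {
        return a * b + a;
    }

    /* Applying the kernel over a stream: because elements are independent,
     * the iterations can be spread across clusters in any order (data
     * parallelism).  Neighbor values are supplied through an index stream
     * (a gather), not by pointer-chasing inside the kernel body. */
    void run_kernel(const float *s, const int *neighbor_idx,
                    float *out, size_t n) {
        for (size_t i = 0; i < n; i++) {
            float nb = s[neighbor_idx[i]];   /* gathered neighbor element */
            out[i] = interact(s[i], nb);
        }
    }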

  18. Stream Compiler Achieves Near-Optimum Kernel Performance [Kernel schedule visualization; a single iteration of the software pipeline is shown] ComputeCellInteriorKernel from the StreamFEM application Merrimac

  19. Stream Compiler Reduces Bandwidth Demand Compared to Caching StreamFEM application Merrimac

  20. Capable Hardware Makes Software Easier • Nearly flat global bandwidth (10:1) • Supports non-local applications • Eliminates need to partition and place application • Fine grain remote access • Single word gather/scatter • Eliminates need to restructure application for spatial locality • Leaves the programmer to focus on • Writing a correct program • Expressing parallelism Merrimac

  21. Several scientific applications have been demonstrated on a stream processor simulator They all exhibit stream locality and achieve a high fraction of peak performance

  22. Scientific Streamed Applications • StreamFLO is a streaming version of FLO82 [Jameson] for the solution of the inviscid flow around an airfoil • Uses a cell-centered finite volume formulation with multi-grid acceleration to solve the 2D Euler equations • StreamFEM is an implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) (Tim Barth, NASA) • 2D or 3D triangulated domains • Increasingly complex PDEs • Scalar advection (1 PDE), Euler (4 PDEs), Magnetohydrodynamics (6 PDEs) • Piecewise polynomial functions • Constant (1 dof), Linear (3 dof), Quadratic (6 dof), Cubic (10 dof) • StreamMD is a molecular dynamics simulation • Box of water molecules • Electrostatic and Van der Waals interactions • Gridded to accelerate the approximate force calculation Merrimac

  23. Scientific Applications Stream Well 1. Simulated on a machine with 64 GFLOPS peak performance 2. The low numbers are a result of many divide and square-root operations Scientific stream applications match the BW hierarchy: >93% local, <1.5% off-chip memory, ~50% of peak performance Merrimac

  24. Conclusions • The problem is bandwidth – arithmetic is cheap • Forget GFLOPS, forget traditional P:M:C ratios – balance by incremental return • Streams expose and exploit locality and concurrency • Makes communication explicit • Keeps most (>90%) of data operations local (>1 TB/s, 10 pJ) • Increases arithmetic intensity (arithmetic rate / bandwidth) • Enables compiler optimization at a larger scale than scalar processing • $15/GFLOPS sustained on ‘local’ parts of applications • A capable network provides high global bandwidth • 20 GBytes/sec on board (16 processors) • 2.5 GBytes/sec globally (16K processors) • 25% of machine cost • $4/M-GUPS on ‘non-local’ parts of applications • Must be scalable to be economically feasible • Scalable – a 2 TFLOPS board for $40K up to a 32-backplane 2 PFLOPS system for $40M Merrimac

  25. Differences from Imagine • System • Designed to be used in systems with many nodes • Network • Shared Memory • I/O • Memory • Stream cache • Scatter/Gather-Op • Integrated host processor • SRF • Indexable SRF • Clusters • Uniform fused MULTIPLY-ADD units • Support for iterative operations Merrimac

  26. Why Scatter/Gather-Op • ScatterOp: histograms • GatherOp: allocation (distributing stream elements) [Diagram illustrating the two operations] Merrimac

  27. Scatter-Add • Brook : streamScatterOp(s, s2(sref), STREAM_SCATTER_FLOATADD); • StreamC : streamCopy(s, s2(sref), memop_fp32_add); Merrimac
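
A scalar C sketch of what the scatter-add calls above compute (the hardware performs the combining atomically at the memory system; this loop only states the semantics, and the names are illustrative):

    #include <stddef.h>

    /* Semantics of a FLOATADD scatter-op: each element of the source stream
     * is *added into* the destination location selected by the index stream,
     * instead of overwriting it.  Duplicate indices therefore accumulate
     * rather than collide. */
    void scatter_add(const float *src, const int *idx, float *dst, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[idx[i]] += src[i];
    }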

  28. Histogram • With ScatterAdd • streamCopy(sAcc, sHisto(sImage), add); • add: memop_int32_add • sAcc: values to accumulate, here a stream of all ones • sImage: brightness of the image (for size x size pixels) • sHisto: histogram stream • O(N) • Without ScatterAdd • Use a scratchpad (privatization) • Take elements within a subset of buckets • Size of subset = size of scratchpad • Create a histogram of that range • Iterate until all buckets are covered • O(NB) – bucket-sort algorithm • Merge sort & segmented scan • O(N log N) – general sort algorithm • O(N) – segmented scan Merrimac
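
A C sketch of the "without ScatterAdd" privatization scheme described above (the scratchpad size and names are illustrative): only a scratchpad-sized range of buckets is accumulated per pass, so the pixel stream is re-read once per range, which is the multi-pass cost the slide contrasts with the single-pass O(N) scatter-add version.

    #include <stddef.h>
    #include <string.h>

    #define SCRATCH 256  /* illustrative scratchpad capacity, in buckets */

    /* Histogram without scatter-add: privatize a SCRATCH-sized range of
     * buckets, make a full pass over the image counting only pixels that
     * fall in that range, then move on to the next range. */
    void histogram_privatized(const int *pixels, size_t n,
                              unsigned *histo, int num_buckets) {
        for (int base = 0; base < num_buckets; base += SCRATCH) {
            unsigned scratch[SCRATCH];
            memset(scratch, 0, sizeof scratch);
            int hi = base + SCRATCH < num_buckets ? base + SCRATCH : num_buckets;
            for (size_t i = 0; i < n; i++)          /* full pass per range */
                if (pixels[i] >= base && pixels[i] < hi)
                    scratch[pixels[i] - base]++;
            for (int b = base; b < hi; b++)
                histo[b] = scratch[b - base];
        }
    }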

  29. Comparison Merrimac

  30. Comparison (cont.) Merrimac

  31. Fetch-Op • StreamC : streamCopy(s, s2(sref), ret, memop_fp32_add); Merrimac
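
A C sketch of the fetch-op semantics implied by the call above (the ret stream receives the pre-update values; names are illustrative): like scatter-add, but each update also returns the value that was in memory before the add.

    #include <stddef.h>

    /* Fetch-and-add semantics: add src[i] into dst[idx[i]] and return the
     * old value through ret[i].  When several elements target the same
     * location, each sees a distinct old value, so the operation hands out
     * unique offsets -- the basis of the allocation use of GatherOp on
     * slide 26. */
    void fetch_add(const float *src, const int *idx,
                   float *dst, float *ret, size_t n) {
        for (size_t i = 0; i < n; i++) {
            ret[i] = dst[idx[i]];      /* value before the update */
            dst[idx[i]] += src[i];     /* then accumulate         */
        }
    }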

  32. Indexable SRF Architecture • 128 K-word capacity, 512 GB/s BW at 1 GHz • Access modes • Sequential stream access • one 64-word access per cycle • In-lane indexed access • up to 64 single-word accesses per cycle • Overhead of indexed access • < 15% area overhead for in-lane indexing • < 25% area overhead for cross-lane indexing [Diagram: SRF bank n, split into sub-banks 0-3, paired with compute cluster n] Merrimac
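
A toy C model of the access-mode distinction (lane count and lane size are illustrative, not the real SRF parameters beyond what the slide states): in-lane indexing lets each cluster compute a data-dependent offset, but only within its own lane, which is why its area overhead is lower than cross-lane indexing.

    #include <stddef.h>

    #define NUM_LANES  16    /* illustrative: one SRF lane per compute cluster */
    #define LANE_WORDS 512   /* illustrative lane capacity in words            */

    /* Toy model of the SRF as independent per-cluster lanes. */
    typedef struct { float lane[NUM_LANES][LANE_WORDS]; } srf_t;

    /* In-lane indexed read: cluster c supplies its own offset, restricted to
     * lane c, so all NUM_LANES accesses can proceed in parallel without a
     * cross-lane interconnect. */
    void in_lane_indexed_read(const srf_t *srf,
                              const int offset[NUM_LANES],
                              float out[NUM_LANES]) {
        for (int c = 0; c < NUM_LANES; c++)
            out[c] = srf->lane[c][offset[c] % LANE_WORDS];
    }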


  34. Memory Bandwidth Reduction • FFT 2D: 256x256 2D FFT • Rijndael: Rijndael encryption • Sort: 4096-element merge sort • Filter: 5x5 convolution filter over 256x256 image • IG_*: Parameterized irregular grid synthetic benchmark • Cache (sequential SRF backed by cache) provided for comparison • Indexed SRF captures intra-stream reuse and statically analyzable inter-stream reuse • FFT, Rijndael, IG_* • Cache captures inter-stream reuse in irregular access patterns as well • IG_* Merrimac

  35. Execution Time Reduction • FFT, Rijndael, IG_SML, IG_DMS benefit from memory traffic reductions • IG_DMS and IG_DCS, and to a lesser extent IG_SML and IG_SCL, benefit from longer strip sizes • Sort and Filter benefit from efficient data-dependent accesses in the SRF Merrimac

  36. Iterative Operations • Use iterative operations (MUL and ADD) to implement Divide, Invsqrt, and other elementary functions • Bare-bones solution • Use the indexable SRF to store 256 lookup values for each required elementary function • Use normal MADDs (and the other standard ALU operations) to refine the looked-up seed to full precision Merrimac
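
A C sketch of the bare-bones scheme for one such function, the reciprocal (the 256-entry table size comes from the slide; the seed construction and the two-iteration count are illustrative assumptions): look up a coarse seed, then refine it with ordinary multiply-adds using the Newton-Raphson recurrence x <- x*(2 - a*x), which roughly doubles the number of accurate bits per step.

    #include <math.h>
    #include <stdio.h>

    #define TBL_BITS 8                 /* 256-entry seed table, as on the slide */
    #define TBL_SIZE (1 << TBL_BITS)

    static float recip_tbl[TBL_SIZE];  /* seed reciprocals for mantissas in [0.5,1) */

    /* Fill the lookup table once: entry i approximates 1/m for the slice of
     * the mantissa range it covers. */
    static void init_recip_table(void) {
        for (int i = 0; i < TBL_SIZE; i++) {
            float m = 0.5f + (i + 0.5f) / (2.0f * TBL_SIZE);
            recip_tbl[i] = 1.0f / m;
        }
    }

    /* Reciprocal via table seed + multiply-add refinement (assumes a > 0):
     * fetch an ~8-bit-accurate seed, then run Newton-Raphson iterations
     * x <- x*(2 - a*x), which use only MUL/ADD (MADD) operations. */
    static float stream_recip(float a) {
        int exp;
        float m = frexpf(a, &exp);                 /* a = m * 2^exp, m in [0.5,1) */
        int idx = (int)((m - 0.5f) * 2.0f * TBL_SIZE);
        if (idx >= TBL_SIZE) idx = TBL_SIZE - 1;
        float x = ldexpf(recip_tbl[idx], -exp);    /* seed for 1/a */
        x = x * (2.0f - a * x);                    /* refinement step 1 */
        x = x * (2.0f - a * x);                    /* refinement step 2 */
        return x;
    }

    int main(void) {
        init_recip_table();
        printf("1/3.5 ~ %.7f\n", stream_recip(3.5f));
        return 0;
    }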

  37. Iterative Operations • Fancy Solution • If better performance for elementary functions is required • Custom ALU based on Albert Liddicoat’s thesis that can speed up division / elementary functions • Extra complexity, HW not used all the time Merrimac
