Tomorrow’s Computing Engines February 3, 1998 Symposium on High-Performance Computer Architecture

William J. Dally, Computer Systems Laboratory, Stanford University, billd@csl.stanford.edu

Focus on Tomorrow, not Yesterday • Generals tend to always fight the last war • Computer architects tend to always design the last computer • old programs • old technology assumptions

Some Previous “Wars” (1/3) • Reliable Router (1994) • Torus Routing Chip (1985) • MARS Router (1984) • Network Design Frame (1988)

Some Previous “Wars” (2/3) • MDP Chip • J-Machine • Cray T3D • MAP Chip

Some Previous “Wars” (3/3)

Tomorrow’s Computing Engines • Driven by tomorrow’s applications - media • Constrained by tomorrow’s technology

90% of Desktop Cycles will Be Spent on ‘Media’ Applications by 2000 • Quote from Scott Kirkpatrick of IBM (talk abstract) • Media applications include • video encode/decode • polygon & image-based graphics • audio processing - compression, music, speech - recognition/synthesis • modulation/demodulation at audio and video rates • These applications involve stream processing • So do • radar processing: SAR, STAP, MTI ...

Typical Media Kernel: Image Warp and Composite • Read 10,000 pixels from memory • Perform 100 16-bit integer operations on each pixel • Test each pixel • Write 3,000 result pixels that pass to memory • Little reuse of data fetched from memory • each pixel used once • Little interaction between pixels • very insensitive to operation latency • Challenge is to maximize bandwidth
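
The kernel figures above imply a high ratio of arithmetic to memory traffic, which a few lines of arithmetic make concrete. This is only a back-of-the-envelope sketch: the counts come from the slide, while the variable names and the choice to count each pixel read or written as one memory reference are assumptions.

```python
# Counts taken from the slide; names are illustrative only.
pixels_read = 10_000      # pixels fetched from memory
ops_per_pixel = 100       # 16-bit integer operations per pixel
pixels_written = 3_000    # result pixels that pass the test

# Assumption: each pixel read or written counts as one memory reference.
total_ops = pixels_read * ops_per_pixel
memory_refs = pixels_read + pixels_written
ops_per_ref = total_ops / memory_refs

print(f"{total_ops} operations, {memory_refs} memory references")
print(f"about {ops_per_ref:.1f} operations per memory reference")
```

With roughly 77 operations per memory reference and no reuse, the kernel can only keep many ALUs busy if the data stays close to them once it is fetched.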

Telepresence: A Driving Application • Acquire 2D Images → Extract Depth (3D Images) → Segmentation → Model Extraction → Compression → Channel → Decompression → Rendering → Display 3D Scene • Most kernels: • Latency insensitive • High ratio of arithmetic to memory references

Tomorrow’s Technology is Wire Limited • Lots of devices • A little faster • Slow wires

Technology scaling makes communication the scarce resource
                              1997            2007
Feature size                  0.35µm          0.10µm
DRAM                          64Mb            4Gb
64b FP processors per chip    16              1K
Clock                         400MHz          2.5GHz
Chip edge                     18mm            32mm
Wiring tracks                 12,000          90,000
Time to cross the chip        1 clock         20 clocks

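On-chip wires are getting slower (scale factor $s$; multipliers shown for $s = 0.5$)
• Feature size: $x_2 = s\,x_1$ (0.5x)
• Wire resistance: $R_2 = R_1/s^2$ (4x)
• Wire capacitance: $C_2 = C_1$ (1x)
• Wire delay: $t_{w2} = R_2 C_2 y^2 = t_{w1}/s^2$ (4x)
• Wire delay in gate delays: $t_{w2}/t_{g2} = t_{w1}/(t_{g1} s^3)$ (8x)
• Signal velocity: $v = 0.5\,(t_g R C)^{-1/2}$ (m/s), so $v_2 = v_1 s^{1/2}$ (0.7x)
• Distance per gate delay: $v\,t_g = 0.5\,(t_g/RC)^{1/2}$ (m/gate), so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)
• [Figure: wires of width $x_1$ and $x_2$ and length $y$, with $t_w = RCy^2$; a wire broken into segments of delay $RCy^2$ by gates of delay $t_g$]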
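
The multipliers on this slide can be checked numerically. The sketch below assumes a scale factor of s = 0.5 (inferred from the "0.5x" entry) and gate delay scaling as t_g2 = s t_g1, which is what the slide's 8x ratio of wire delay to gate delay implies. All quantities are normalized to 1 in the old technology, and the variable names are illustrative.

```python
# Normalized check of the wire-scaling multipliers (old technology = 1.0).
s = 0.5                    # assumed scale factor, from the slide's "0.5x"

x = s                      # feature size:          x2 = s * x1
R = 1 / s**2               # wire resistance:       R2 = R1 / s^2
C = 1.0                    # wire capacitance:      C2 = C1
t_w = R * C                # wire delay, fixed y:   tw2 = R2 * C2 * y^2
t_g = s                    # gate delay (assumed):  tg2 = s * tg1

wire_in_gate_delays = t_w / t_g          # tw2/tg2 relative to tw1/tg1
velocity = (t_g * R * C) ** -0.5         # v is proportional to (tg R C)^(-1/2)
reach_per_gate = (t_g / (R * C)) ** 0.5  # v*tg is proportional to (tg/RC)^(1/2)

print(f"R: {R:.0f}x  t_w: {t_w:.0f}x  t_w/t_g: {wire_in_gate_delays:.0f}x")
print(f"v: {velocity:.2f}x  v*t_g: {reach_per_gate:.2f}x")
```

The results (4x, 4x, 8x, about 0.71x, and about 0.35x) agree with the slide's rounded figures.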

Bandwidth and Latency of Modern VLSI [Figure: bandwidth and latency plotted against size on logarithmic axes, with the chip boundary marked]

Architecture for Locality: Exploit high on-chip bandwidth [Figure: Off-chip RAM, Vector Reg File, Switch, and 104 32-bit ALUs, with bandwidths of 2GB/s (pin bandwidth), 50GB/s, and 500GB/s]

Tomorrow’s Computing Engines • Aimed at media processing • stream based • latency tolerant • low-precision • little reuse • lots of conditionals • Use the large number of devices available on future chips • Make efficient use of scarce communication resources • bandwidth hierarchy • no centralized resources • Approach the performance of a special-purpose processor

Why do Special-Purpose Processors Perform Well? • Lots (100s) of ALUs • Fed by dedicated wires/memories

Care and Feeding of ALUs • ‘Feeding’ Structure Dwarfs ALU • [Figure: an ALU supplied by an instruction path (Instr. Cache, IP, IR; labeled Instruction Bandwidth) and a data path (Regs; labeled Data Bandwidth)]

Three Key Problems • Instruction bandwidth • Data bandwidth • Conditional execution

A Bandwidth Hierarchy • SDRAM and Streaming Memory (1.6GB/s) → Vector Register File (50GB/s) → ALU Clusters (500GB/s), 13 ALUs per cluster • Solves data bandwidth problem • Matched to bandwidth curve of technology
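
To see how steep this hierarchy is, the bandwidth figures on the slide can be compared level by level. The sketch assumes the three figures belong to SDRAM (1.6GB/s), the vector register file (50GB/s), and the ALU clusters (500GB/s), in that order; the flattened slide does not state the pairing explicitly, so that mapping is a reading rather than a confirmed fact.

```python
# Bandwidths in GB/s from the slide; the level-to-figure mapping is assumed.
levels = [
    ("SDRAM", 1.6),
    ("Vector Register File", 50.0),
    ("ALU clusters", 500.0),
]

# Ratio between each level and the one below it.
for (lower, lower_bw), (upper, upper_bw) in zip(levels, levels[1:]):
    print(f"{upper} vs {lower}: {upper_bw / lower_bw:.2f}x")

# End-to-end ratio: ALU-level bandwidth per unit of off-chip bandwidth.
overall = levels[-1][1] / levels[0][1]
print(f"ALU clusters vs SDRAM: {overall:.1f}x")
```

Under these assumptions the ALUs can consume roughly 300 times more bandwidth than the pins deliver, so a kernel keeps them busy only if most operand traffic stays in the upper levels.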

A Streaming Memory System [Figure labels: Address Generator (x2), IX, D, Crossbar, Reorder Queue (x2), SDRAM Bank (x2)]

Streaming Memory Performance • Exploit latency insensitivity for improved bandwidth • 1.75:1 Performance improvement from relatively short reorder queue
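
The idea of trading latency for bandwidth with a reorder queue can be illustrated with a toy model. This is not the memory controller described in the talk: the single-bank model, the hit and miss costs, the "oldest request to the open row first" policy, and the function name `service_cycles` are all assumptions chosen for illustration, and the sketch ignores starvation.

```python
from collections import deque

def service_cycles(rows, queue_depth, hit_cost=1, miss_cost=6):
    """Cycles to serve a stream of row accesses on one SDRAM bank.

    Hypothetical model: an access to the currently open row costs
    hit_cost cycles, any other access costs miss_cost cycles. The
    reorder queue holds up to queue_depth pending requests and serves
    the oldest one that hits the open row, else the oldest overall.
    """
    pending = deque(rows)
    window = []
    open_row = None
    cycles = 0
    while pending or window:
        # Refill the reorder queue from the incoming request stream.
        while pending and len(window) < queue_depth:
            window.append(pending.popleft())
        # Prefer the oldest request to the open row; fall back to the oldest.
        pick = next((i for i, r in enumerate(window) if r == open_row), 0)
        row = window.pop(pick)
        cycles += hit_cost if row == open_row else miss_cost
        open_row = row
    return cycles

# Two interleaved streams that alternate between rows 0 and 1.
requests = [0, 1, 0, 1, 0, 1, 0, 1]
in_order = service_cycles(requests, queue_depth=1)
reordered = service_cycles(requests, queue_depth=4)
print(in_order, reordered)  # 48 18
```

A queue depth of 1 is plain in-order service. The gain here depends entirely on the made-up costs and access pattern, so it should not be compared with the 1.75:1 figure on the slide.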

Compound Vector Operations: 1 Instruction does lots of work • Memory Instructions: LD Vd Vx • Compound Vector Instruction: 1 CV Inst (50b): Op V0 V1 V2 V3 V4 V5 V6 V7 • uInst (300b) x 20 uInst/Op x 1000 el/vec = 6 x 10^6 b • [Figure labels: Control Store, uIP, Mem, AG, VRF; microinstruction fields Op Ra Rb (x3)]
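
The instruction-bandwidth arithmetic on this slide can be reproduced directly. The figures are the slide's; the variable names, and the reading that one 50-bit compound vector instruction stands in for the full microinstruction stream, are assumptions.

```python
# Figures from the slide; names are illustrative only.
cv_inst_bits = 50            # one compound vector instruction
uinst_bits = 300             # one microinstruction
uinsts_per_op = 20           # microinstructions per compound operation
elements_per_vector = 1000   # vector length

# Microinstruction bits issued on behalf of a single CV instruction.
expanded_bits = uinst_bits * uinsts_per_op * elements_per_vector

# Assumed reading: ratio of control bits issued to instruction bits fetched.
amplification = expanded_bits / cv_inst_bits

print(f"{expanded_bits:.0e} b of microinstructions per CV instruction")
print(f"{amplification:,.0f}x more control bits than instruction bits")
```

The product matches the slide's 6 x 10^6 b, or 120,000 control bits for each instruction bit fetched.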

Scheduling by Simulated Annealing • List scheduling assumes global communication • does poorly when communication exposed • View scheduling as a CAD problem (place and route) • generate naïve ‘feasible’ schedule • iteratively improve schedule by moving operations • [Figure: schedule of ready ops placed on ALUs over time]
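
The approach outlined above (start from a naive feasible schedule, then improve it by moving operations) can be sketched in a few dozen lines. This is a minimal illustration, not the scheduler from the talk: the one-cycle operation latency, the one-cycle penalty for passing a value between ALUs, the energy function, the cooling schedule, and all names are assumptions.

```python
import math
import random

def feasible(sched, deps, comm_delay=1):
    """True if no ALU slot is reused and every op starts after its inputs.

    Assumes each op takes one cycle and that a value crossing between
    ALUs costs comm_delay extra cycles (the 'exposed' communication).
    """
    if len(set(sched.values())) != len(sched):
        return False  # two ops share an (alu, cycle) slot
    for op, preds in deps.items():
        alu, cycle = sched[op]
        for p in preds:
            p_alu, p_cycle = sched[p]
            ready = p_cycle + 1 + (comm_delay if p_alu != alu else 0)
            if cycle < ready:
                return False
    return True

def makespan(sched):
    return max(cycle for _, cycle in sched.values()) + 1

def energy(sched):
    # Hypothetical energy: schedule length plus the sum of start cycles.
    return makespan(sched) + sum(cycle for _, cycle in sched.values())

def naive_schedule(deps):
    # deps is assumed to list ops in topological order, so running them
    # one per cycle on ALU 0 is always feasible.
    return {op: (0, i) for i, op in enumerate(deps)}

def anneal(deps, n_alus, steps=5000, temp=5.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    ops = list(deps)
    sched = naive_schedule(deps)
    cur_e = energy(sched)
    best, best_e = dict(sched), cur_e
    for _ in range(steps):
        op = rng.choice(ops)
        old = sched[op]
        # Move one operation to a random ALU and cycle.
        sched[op] = (rng.randrange(n_alus), rng.randrange(len(ops)))
        accept = False
        if feasible(sched, deps):
            new_e = energy(sched)
            # Metropolis rule: always take improvements, sometimes take losses.
            accept = new_e <= cur_e or rng.random() < math.exp((cur_e - new_e) / temp)
        if accept:
            cur_e = new_e
            if cur_e < best_e:
                best, best_e = dict(sched), cur_e
        else:
            sched[op] = old
        temp *= cooling
    return best

# Small dependence graph: each op maps to the ops whose results it needs.
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["a"], "e": ["c", "d"], "f": ["b"]}
best = anneal(deps, n_alus=2)
print(makespan(naive_schedule(deps)), "->", makespan(best))
```

Rejecting infeasible moves keeps every intermediate schedule legal, which is the simplest choice for a sketch; a production scheduler might instead allow violations and penalize them in the energy function.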

Typical Annealing Schedule [Figure: annealing trace annotated with the values 166 and 13 and the note "Energy function changed"]

Conventional Approaches to Data-Dependent Conditional Execution • Data-Dependent Branch • test x>0 (Y/N) after A, then B, C or J, K • Speculative Loss D x W ~1000 on a wrong guess ("Whoops") • Predication • y=(x>0), then B, C if y and J, K if ~y • Exponentially Decreasing Duty Factor

Zero-Cost Conditionals • Most Approaches to Conditional Operations are Costly • Branching control flow - dead issue slots on mispredicted branches • Predication (SIMD select, masked vectors) - large fraction of execution ‘opportunities’ go idle • Conditional Vectors • append an element to an output stream depending on a case variable • [Figure: a Result Stream and a Case Stream {0,1} feed Output Stream 0 and Output Stream 1]
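
The conditional-vector idea can be sketched in software as a routing step: each element of a result stream is appended to one of two output streams according to the matching element of a case stream. This only illustrates the semantics, not the hardware mechanism, and the function name `conditional_split` and the sample data are assumptions.

```python
def conditional_split(result_stream, case_stream):
    """Append each result to output stream 0 or 1 based on its case bit."""
    out0, out1 = [], []
    for value, case in zip(result_stream, case_stream):
        (out1 if case else out0).append(value)
    return out0, out1

# Example: separate values by the outcome of the test x > 0.
results = [5, -3, 8, -1, 2]
cases = [1 if x > 0 else 0 for x in results]
out0, out1 = conditional_split(results, cases)
print(out0, out1)  # [-3, -1] [5, 8, 2]
```

Each output stream is dense, so later kernels can process it at full rate with no masked or idle elements.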

Application Sketch - Polygon Rendering [Figure: a triangle with vertices V1, V2, V3 passes through four record types: Vertex (X Y RGB UV) → Span (Y X1 X2 RGB1 DRGB UV1 DUV) → Pixel (X Y RGB UV) → Textured Pixel (X Y RGB)]

Status • Working simulator of Imagine • Simple kernels running on simulator • FFT • Applications being developed • Depth extraction, video compression, polygon rendering, image-based graphics • Circuit/Layout studies underway

Acknowledgements • Students/Staff: Don Alpert (Intel), Chris Buehler (MIT), J.P. Grossman (MIT), Brad Johanson, Ujval Kapasi, Brucek Khailany, Abelardo Lopez-Lagunas, Peter Mattson, John Owens, Scott Rixner • Helpful Suggestions: Henry Fuchs (UNC), Pat Hanrahan, Tom Knight (MIT), Marc Levoy, Leonard McMillan (MIT), John Poulton (UNC)

Conclusion • Work toward tomorrow’s computing engines • Targeted toward media processing • streams of low-precision samples • little reuse • latency tolerant • Matched to the capabilities of communication-limited technology • explicit bandwidth hierarchy • explicit communication between units • communication exposed • Insight not numbers
