300 likes | 427 Views
Tomorrow’s Computing Engines February 3, 1998 Symposium on High-Performance Computer Architecture. William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu. Focus on Tomorrow, not Yesterday. General’s tend to always fight the last war
E N D
Tomorrow’s Computing EnginesFebruary 3, 1998Symposium on High-Performance Computer Architecture William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu Tomorrow's Computing Engines
Focus on Tomorrow, not Yesterday General’s tend to always fight the last war Computer architects tend to always design the last computer old programs old technology assumptions Tomorrow's Computing Engines
Some Previous “Wars” (1/3) Reliable Router 1994 Torus Routing Chip 1985 MARS Router 1984 Network Design Frame 1988
Some Previous “Wars” (2/3) MDP Chip J-Machine Cray T3D MAP Chip
Tomorrow’s Computing Engines • Driven by tomorrow’s applications - media • Constrained by tomorrow’s technology Tomorrow's Computing Engines
90% of Desktop Cycles will Be Spent on ‘Media’ Applications by 2000 • Quote from Scott Kirkpatric of IBM (talk abstract) • Media applications include • video encode/decode • polygon & image-based graphics • audio processing - compression, music, speech - recognition/synthesis • modulation/demodulation at audio and video rates • These applications involve stream processing • So do • radar processing: SAR, STAP, MTI ...
Typical Media KernelImage Warp and Composite • Read 10,000 pixels from memory • Perform 100 16-bit integer operations on each pixel • Test each pixel • Write 3,000 result pixels that pass to memory • Little reuse of data fetched from memory • each pixel used once • Little interaction between pixels • very insensitive to operation latency • Challenge is to maximize bandwidth Tomorrow's Computing Engines
Telepresence: A Driving Application Acquire 2D Images Extract Depth (3D Images) Segmentation Model Extraction Compression Channel Decompression Rendering Display 3D Scene Most kernels: Latency insensitive High ratio of arithmetic to memory references Tomorrow's Computing Engines
Tomorrow’s Technology is Wire Limited • Lots of devices • A little faster • Slow wires Tomorrow's Computing Engines
Technology scaling makes communication the scarce resource 1997 2007 0.35mm 64Mb DRAM 16 64b FP Proc 400MHz 0.10mm 4Gb DRAM 1K 64b FP Proc 2.5GHz P 18mm 12,000 tracks 1 clock 32mm 90,000 tracks 20 clocks
On-chip wires are getting slower x2 = s x1 0.5x R2 = R1/s2 4x C2 = C1 1x tw2 = R2C2y2 = tw1/s2 4x tw2/tg2= tw1/(tg1s3) 8x v = 0.5(tgRC)-1/2 (m/s) v2 = v1s1/2 0.7x vtg = 0.5(tg/RC)1/2 (m/gate) v2tg2 = v1tg1s3/2 0.35x y y x1 x2 tw = RCy2 RCy2 RCy2 tg tg tg
Bandwidth and Latency of Modern VLSI 103 1 Bandwidth 100 0.01 Bandwidth Latency 10 10-4 Latency 1 10-6 1 10 100 103 104 105 Size Chip Boundary Tomorrow's Computing Engines
Architecture for LocalityExploit high on-chip bandwidth Pin-Bandwidth, 2GB/s Off-chip RAM Vector Reg File 104 32-bit ALUs Switch 50GB/s 500GB/s
Aimed at media processing stream based latency tolerant low-precision little reuse lots of conditionals Use the large number of devices available on future chips Make efficient use of scarce communication resources bandwidth hierarchy no centralized resources Approach the performance of a special-purpose processor Tomorrow’s Computing Engines Tomorrow's Computing Engines
Why do Special-Purpose Processors Perform Well? Lots (100s) of ALUs Fed by dedicated wires/memories Tomorrow's Computing Engines
Care and Feeding of ALUs Instr. Cache IP Instruction Bandwidth IR Data Bandwidth Regs ‘Feeding’ Structure Dwarfs ALU Tomorrow's Computing Engines
Three Key Problems • Instruction bandwidth • Data bandwidth • Conditional execution Tomorrow's Computing Engines
A Bandwidth Hierarchy 13 ALUs per cluster SDRAM ALU Cluster ALU Cluster SDRAM Streaming Memory Vector Register File SDRAM 500GB/s SDRAM ALU Cluster 1.6GB/s 50GB/s • Solves data bandwidth problem • Matched to bandwidth curve of technology Tomorrow's Computing Engines
A Streaming Memory System Reorder Queue SDRAM Bank IX Address Generator D Crossbar Address Generator Reorder Queue SDRAM Bank
Streaming Memory Performance • Exploit latency insensitivity for improved bandwidth • 1.75:1 Performance improvement from relatively short reorder queue Tomorrow's Computing Engines
Compound Vector Operations1 Instruction does lots of work Memory Instructions Compound Vector Instruction 1 CV Inst (50b) LD Vd Vx Op V0 V1 V2 V3 V4 V5 V6 V7 uInst (300b) x 20uInst/Op x 1000el/vec ------------------ 6 x 106 b Control Store uIP Mem AG VRF Op Ra Rb Op Ra Rb Op Ra Rb Tomorrow's Computing Engines
List scheduling assumes global communication does poorly when communication exposed View scheduling as a CAD problem (place and route) generate naïve ‘feasible’ schedule iteratively improve schedule by moving operations. Scheduling by Simulated Annealing ALUs Ready Ops Time Tomorrow's Computing Engines
Typical Annealing Schedule 166 Energy function changed 13 Tomorrow's Computing Engines
Conventional Approaches to Data-Dependent Conditional Execution A A A x>0 y=(x>0) Y N x>0 Y B Speculative Loss D x W ~1000 if y Exponentially Decreasing Duty Factor B B J J if ~y C C K C if y Whoops Data-Dependent Branch J K if ~y K Tomorrow's Computing Engines
Zero-Cost Conditionals • Most Approaches to Conditional Operations are Costly • Branching control flow - dead issue slots on mispredicted branches • Predication (SIMD select, masked vectors) - large fraction of execution ‘opportunities’ go idle. • Conditional Vectors • append an element to an output stream depending on a case variable. Output Stream 0 0 Result Stream Output Stream 1 1 Case Stream {0,1} Tomorrow's Computing Engines
Application Sketch - Polygon Rendering V3 Vertex V1 V2 V3 X Y RGB UV V2 V1 Y X1 X2 RGB1 DRGB UV1 DUV Y Span X1 X2 X Y RGB UV Pixel Y UV RGB X Textured Pixel X Y RGB Tomorrow's Computing Engines
Status • Working simulator of Imagine • Simple kernels running on simulator • FFT • Applications being developed • Depth extraction, video compression, polygon rendering, image-based graphics • Circuit/Layout studies underway
Students/Staff Don Alpert (Intel) Chris Buehler (MIT) J.P Grossman (MIT) Brad Johanson Ujval Kapasi Brucek Khailany Abelardo Lopez-Lagunas Peter Mattson John Owens Scott Rixner Helpful Suggestions Henry Fuchs (UNC) Pat Hanrahan Tom Knight (MIT) Marc Levoy Leonard McMillan (MIT) John Poulton (UNC) Acknowledgements Tomorrow's Computing Engines
Conclusion • Work toward tomorrow’s computing engines • Targeted toward media processing • streams of low-precision samples • little reuse • latency tolerant • Matched to the capabilities of communication-limited technology • explicit bandwidth hierarchy • explicit communication between units • communication exposed • Insight not numbers