
Tomorrow’s Computing Engines February 3, 1998 Symposium on High-Performance Computer Architecture

William J. Dally, Computer Systems Laboratory, Stanford University, billd@csl.stanford.edu


Presentation Transcript


  1. Tomorrow’s Computing Engines. February 3, 1998. Symposium on High-Performance Computer Architecture. William J. Dally, Computer Systems Laboratory, Stanford University, billd@csl.stanford.edu

  2. Focus on Tomorrow, not Yesterday • Generals tend to always fight the last war • Computer architects tend to always design the last computer: old programs, old technology assumptions

  3. Some Previous “Wars” (1/3) • Reliable Router, 1994 • Torus Routing Chip, 1985 • MARS Router, 1984 • Network Design Frame, 1988

  4. Some Previous “Wars” (2/3) • MDP Chip • J-Machine • Cray T3D • MAP Chip

  5. Some Previous “Wars” (3/3)

  6. Tomorrow’s Computing Engines • Driven by tomorrow’s applications - media • Constrained by tomorrow’s technology

  7. 90% of Desktop Cycles will Be Spent on ‘Media’ Applications by 2000 • Quote from Scott Kirkpatrick of IBM (talk abstract) • Media applications include: video encode/decode; polygon & image-based graphics; audio processing (compression, music, speech recognition/synthesis); modulation/demodulation at audio and video rates • These applications involve stream processing • So do: radar processing: SAR, STAP, MTI ...

  8. Typical Media Kernel: Image Warp and Composite • Read 10,000 pixels from memory • Perform 100 16-bit integer operations on each pixel • Test each pixel • Write the 3,000 result pixels that pass the test to memory • Little reuse of data fetched from memory: each pixel used once • Little interaction between pixels: very insensitive to operation latency • Challenge is to maximize bandwidth
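The numbers on this slide imply a high ratio of arithmetic to memory traffic, which can be checked with a few lines of arithmetic. The sketch below uses only the pixel and operation counts from the slide; the function name and the choice to count one memory reference per pixel read or written are illustrative assumptions.

```python
# Counts taken from the slide; one memory reference per pixel is an assumption.
PIXELS_READ = 10_000
OPS_PER_PIXEL = 100
PIXELS_WRITTEN = 3_000


def arithmetic_intensity(reads, ops_per_element, writes):
    """Return arithmetic operations per memory reference for a stream kernel."""
    total_ops = reads * ops_per_element
    memory_refs = reads + writes
    return total_ops / memory_refs


intensity = arithmetic_intensity(PIXELS_READ, OPS_PER_PIXEL, PIXELS_WRITTEN)
print(f"{intensity:.1f} operations per memory reference")
```

Under these assumptions the kernel performs roughly 77 operations for every pixel moved to or from memory, so the ALUs rather than the memory system should be the busy resource if data can be kept flowing.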

  9. Telepresence: A Driving Application • Pipeline: Acquire 2D Images → Extract Depth (3D Images) → Segmentation → Model Extraction → Compression → Channel → Decompression → Rendering → Display 3D Scene • Most kernels: latency insensitive, high ratio of arithmetic to memory references

  10. Tomorrow’s Technology is Wire Limited • Lots of devices • A little faster • Slow wires

  11. Technology scaling makes communication the scarce resource
      • 1997: 0.35 µm, 64Mb DRAM, 16 64b FP processors, 400 MHz; 18 mm chip, 12,000 tracks, 1 clock
      • 2007: 0.10 µm, 4Gb DRAM, 1K 64b FP processors, 2.5 GHz; 32 mm chip, 90,000 tracks, 20 clocks

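  12. On-chip wires are getting slower (subscript 1 = old process, 2 = scaled process; multipliers are for s = 0.5)
      • Feature size: $x_2 = s\,x_1$ (0.5x)
      • Wire resistance per unit length: $R_2 = R_1 / s^2$ (4x)
      • Wire capacitance per unit length: $C_2 = C_1$ (1x)
      • Delay of a wire of fixed length y: $t_{w2} = R_2 C_2 y^2 = t_{w1} / s^2$ (4x)
      • Wire delay in gate delays: $t_{w2} / t_{g2} = t_{w1} / (t_{g1} s^3)$ (8x), which implies gate delay scales as $t_{g2} = s\,t_{g1}$
      • Signal velocity: $v = 0.5\,(t_g R C)^{-1/2}$ (m/s), so $v_2 = v_1 s^{1/2}$ (0.7x)
      • Distance per gate delay: $v\,t_g = 0.5\,(t_g / RC)^{1/2}$ (m/gate), so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)
      [Figure: wire segments of length y, each with delay $t_w = RCy^2$, separated by gates with delay $t_g$; feature sizes $x_1$ and $x_2$]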
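The multipliers on this slide can be reproduced numerically. The sketch below plugs s = 0.5 into the slide's scaling relations; the function and key names are illustrative, and the gate-delay scaling (gate delay proportional to s) is an assumption inferred from the slide's 8x figure rather than stated on it.

```python
def scale_wires(s):
    """Return new/old ratios when feature size shrinks by a factor s (x2 = s * x1)."""
    r = 1 / s**2   # resistance per unit length: R2 = R1 / s^2
    c = 1.0        # capacitance per unit length: C2 = C1
    t_wire = r * c  # fixed-length wire delay, t_w = R * C * y^2 with y unchanged
    t_gate = s      # ASSUMPTION: gate delay scales linearly with s
    return {
        "R": r,
        "C": c,
        "t_wire": t_wire,
        "t_wire_over_t_gate": t_wire / t_gate,
        "velocity": (t_gate * r * c) ** -0.5,         # v ~ (t_g R C)^(-1/2)
        "reach_per_gate": (t_gate / (r * c)) ** 0.5,  # v * t_g ~ (t_g / RC)^(1/2)
    }


factors = scale_wires(0.5)
for name, ratio in factors.items():
    print(f"{name}: {ratio:.2f}x")
```

The constant 0.5 in the velocity formulas cancels when taking ratios, which is why it does not appear in the code.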

  13. Bandwidth and Latency of Modern VLSI [Figure: bandwidth and latency plotted against size on logarithmic axes, with the chip boundary marked]

  14. Architecture for Locality: Exploit high on-chip bandwidth [Figure labels: off-chip RAM, pin bandwidth 2GB/s; vector register file; switch; 104 32-bit ALUs; 50GB/s; 500GB/s]

  15. Tomorrow’s Computing Engines • Aimed at media processing: stream based, latency tolerant, low-precision, little reuse, lots of conditionals • Use the large number of devices available on future chips • Make efficient use of scarce communication resources: bandwidth hierarchy, no centralized resources • Approach the performance of a special-purpose processor

  16. Why do Special-Purpose Processors Perform Well? • Lots (100s) of ALUs • Fed by dedicated wires/memories

  17. Care and Feeding of ALUs • ‘Feeding’ structure dwarfs ALU [Figure labels: instruction cache, IP, IR (instruction bandwidth); registers (data bandwidth)]

  18. Three Key Problems • Instruction bandwidth • Data bandwidth • Conditional execution

  19. A Bandwidth Hierarchy • Solves data bandwidth problem • Matched to bandwidth curve of technology [Figure labels: SDRAM (x4), streaming memory, vector register file, ALU clusters (13 ALUs per cluster); bandwidths 1.6GB/s, 50GB/s, 500GB/s]
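A quick calculation shows how steep this hierarchy is. The sketch below assumes the three bandwidth figures attach to the levels in the order memory (1.6GB/s), vector register file (50GB/s), ALU clusters (500GB/s); the flattened figure does not state this mapping, so treat it as an assumption. All names in the code are illustrative.

```python
# ASSUMED mapping of the slide's bandwidth labels to hierarchy levels (GB/s).
LEVELS = [
    ("SDRAM", 1.6),
    ("vector register file", 50.0),
    ("ALU clusters", 500.0),
]


def step_ratios(levels):
    """Return (inner level name, bandwidth gain over the next outer level) pairs."""
    return [
        (inner_name, inner_bw / outer_bw)
        for (_, outer_bw), (inner_name, inner_bw) in zip(levels, levels[1:])
    ]


ratios = step_ratios(LEVELS)
end_to_end = LEVELS[-1][1] / LEVELS[0][1]
for name, gain in ratios:
    print(f"{name}: {gain:.2f}x the bandwidth of the level outside it")
print(f"end to end: {end_to_end:.1f}x")
```

Under that mapping, the innermost level moves about 312 times more data per second than memory does, so a kernel has to do most of its work on data that stays inside the clusters.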

  20. A Streaming Memory System [Figure labels: address generators, IX, D, crossbar, reorder queues, SDRAM banks]

  21. Streaming Memory Performance • Exploit latency insensitivity for improved bandwidth • 1.75:1 performance improvement from a relatively short reorder queue
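The idea of trading latency for bandwidth with a reorder queue can be illustrated with a toy model. The sketch below is not the talk's memory system and does not reproduce the 1.75:1 figure; it assumes a single SDRAM bank with one open row, counts a row activation whenever the serviced request targets a different row, and lets the controller pick any request from a short queue. All names are hypothetical.

```python
from collections import deque


def row_activations(requests, queue_depth):
    """Count row activations for one bank; queue_depth=1 means in-order service.

    `requests` is a sequence of row numbers in arrival order.
    """
    pending = deque(requests)
    window = []       # requests currently visible to the scheduler
    open_row = None
    activations = 0
    while pending or window:
        # Refill the reorder queue from the arrival stream.
        while pending and len(window) < queue_depth:
            window.append(pending.popleft())
        # Prefer a queued request that hits the open row; otherwise take the oldest.
        choice = next((row for row in window if row == open_row), None)
        if choice is None:
            choice = window[0]
            open_row = choice
            activations += 1
        window.remove(choice)
    return activations


alternating = [0, 1, 0, 1, 0, 1, 0, 1]
in_order = row_activations(alternating, queue_depth=1)
reordered = row_activations(alternating, queue_depth=4)
print(in_order, reordered)
```

Individual requests may wait longer in the reordered case, which is acceptable only because the stream kernels described earlier are insensitive to latency.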

  22. Compound Vector Operations: 1 instruction does lots of work • 1 CV Inst (50b): LD Vd Vx Op V0 V1 V2 V3 V4 V5 V6 V7 • uInst (300b) x 20 uInst/Op x 1000 el/vec = 6 x 10^6 b [Figure labels: memory instructions, compound vector instruction, control store, uIP, Mem, AG, VRF, Op Ra Rb]
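The instruction-bandwidth arithmetic on this slide can be checked directly. The sketch below uses only the slide's figures (a 50-bit compound vector instruction, 300-bit microinstructions, 20 microinstructions per operation, 1000 elements per vector); the variable names and the "amplification" ratio are illustrative additions.

```python
# Figures from the slide.
CV_INSTRUCTION_BITS = 50
MICROINSTRUCTION_BITS = 300
MICROINSTRUCTIONS_PER_OP = 20
ELEMENTS_PER_VECTOR = 1000

# Bits of microinstruction issued locally for one compound vector instruction.
control_bits = MICROINSTRUCTION_BITS * MICROINSTRUCTIONS_PER_OP * ELEMENTS_PER_VECTOR

# How many control bits each fetched instruction bit expands into.
amplification = control_bits / CV_INSTRUCTION_BITS

print(f"{control_bits:.0e} control bits per compound vector instruction")
print(f"{amplification:,.0f}x expansion of instruction bandwidth")
```

On these numbers, each 50-bit instruction fetched over the scarce global path expands into about 6 x 10^6 bits of control delivered from the local control store.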

  23. Scheduling by Simulated Annealing • List scheduling assumes global communication; does poorly when communication is exposed • View scheduling as a CAD problem (place and route): generate a naïve ‘feasible’ schedule, then iteratively improve the schedule by moving operations [Figure labels: ALUs, ready ops, time]
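The place-and-route view of scheduling can be sketched in a few dozen lines. This is a minimal illustration, not the scheduler from the talk: it assumes every operation takes one cycle, that a result consumed on a different ALU arrives `comm_delay` cycles later, that the energy is the sum of start cycles, and that cooling is geometric. All function names and parameters are hypothetical.

```python
import math
import random


def ready_time(producer_slot, alu, comm_delay):
    """Earliest cycle a consumer on `alu` may start after the producer's slot."""
    cycle, producer_alu = producer_slot
    return cycle + 1 + (comm_delay if producer_alu != alu else 0)


def is_feasible(op, cand, slots, deps, users, comm_delay):
    """Check that moving `op` to `cand` = (cycle, alu) keeps the schedule legal."""
    cycle, alu = cand
    if any(s == cand for i, s in enumerate(slots) if i != op):
        return False  # that ALU is already busy in that cycle
    if any(cycle < ready_time(slots[p], alu, comm_delay) for p in deps[op]):
        return False  # an input would not have arrived yet
    return all(slots[u][0] >= ready_time(cand, slots[u][1], comm_delay)
               for u in users[op])


def makespan(slots):
    return max(cycle for cycle, _ in slots) + 1


def anneal_schedule(deps, n_alus, steps=20000, comm_delay=1, seed=0):
    """deps[i] lists the ops that op i consumes; ops are in topological order."""
    rng = random.Random(seed)
    n = len(deps)
    users = [[u for u in range(n) if op in deps[u]] for op in range(n)]
    slots = [(i, 0) for i in range(n)]  # naive feasible start: one op per cycle on ALU 0
    best = list(slots)
    for step in range(steps):
        temp = 2.0 * (0.01 / 2.0) ** (step / steps)  # geometric cooling from 2.0 toward 0.01
        op = rng.randrange(n)
        old_cycle = slots[op][0]
        cand = (max(0, old_cycle + rng.randint(-3, 3)), rng.randrange(n_alus))
        if not is_feasible(op, cand, slots, deps, users, comm_delay):
            continue
        delta = cand[0] - old_cycle  # change in energy (sum of start cycles)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            slots[op] = cand
            if makespan(slots) < makespan(best):
                best = list(slots)  # remember the shortest schedule seen
    return best


# Hypothetical 7-op dependence graph: two small trees feeding a final op.
example_deps = [[], [], [0, 1], [], [], [3, 4], [2, 5]]
schedule = anneal_schedule(example_deps, n_alus=2)
print(makespan(schedule), schedule)
```

Because the communication delay is part of the feasibility check, the annealer sees the cost of placing a producer and its consumer on different ALUs, which is the property the slide says list scheduling lacks.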

  24. Typical Annealing Schedule [Figure: annealing plot; labels include 166, 13, and “Energy function changed”]

  25. Conventional Approaches to Data-Dependent Conditional Execution [Figure: a data-dependent branch on x>0 choosing between blocks (labels: “Speculative Loss D x W ~1000”, “Whoops”), alongside predicated execution with y=(x>0) and operations guarded by “if y” / “if ~y” (label: “Exponentially Decreasing Duty Factor”)]

  26. Zero-Cost Conditionals • Most approaches to conditional operations are costly • Branching control flow: dead issue slots on mispredicted branches • Predication (SIMD select, masked vectors): large fraction of execution ‘opportunities’ go idle • Conditional vectors: append an element to an output stream depending on a case variable [Figure: a result stream and a case stream {0,1} feeding output stream 0 and output stream 1]
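The conditional-vector operation described on this slide amounts to routing each result to one of several output streams according to its case value. The sketch below shows that behaviour in software, assuming case values are small integers indexing the output streams; the function name and sample data are illustrative, and nothing here models the hardware cost.

```python
def conditional_split(results, cases, n_outputs=2):
    """Append each result to the output stream selected by its case value."""
    outputs = [[] for _ in range(n_outputs)]
    for value, case in zip(results, cases):
        outputs[case].append(value)
    return outputs


result_stream = [10, 11, 12, 13, 14]
case_stream = [0, 1, 1, 0, 1]
stream0, stream1 = conditional_split(result_stream, case_stream)
print(stream0, stream1)
```

Each output stream is dense, so later kernels operate only on elements that took their path, with no idle slots for the elements that did not.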

  27. Application Sketch - Polygon Rendering • Vertex (V1, V2, V3): X, Y, RGB, UV • Span: Y, X1, X2, RGB1, DRGB, UV1, DUV • Pixel: X, Y, RGB, UV • Textured Pixel: X, Y, RGB

  28. Status • Working simulator of Imagine • Simple kernels running on simulator • FFT • Applications being developed • Depth extraction, video compression, polygon rendering, image-based graphics • Circuit/Layout studies underway

  29. Acknowledgements • Students/Staff: Don Alpert (Intel), Chris Buehler (MIT), J.P. Grossman (MIT), Brad Johanson, Ujval Kapasi, Brucek Khailany, Abelardo Lopez-Lagunas, Peter Mattson, John Owens, Scott Rixner • Helpful Suggestions: Henry Fuchs (UNC), Pat Hanrahan, Tom Knight (MIT), Marc Levoy, Leonard McMillan (MIT), John Poulton (UNC)

  30. Conclusion • Work toward tomorrow’s computing engines • Targeted toward media processing • streams of low-precision samples • little reuse • latency tolerant • Matched to the capabilities of communication-limited technology • explicit bandwidth hierarchy • explicit communication between units • communication exposed • Insight not numbers
