
Roadmap and Change: How Much and How Fast




Presentation Transcript


  1. Presentation to 2004 Workshop on Extreme Supercomputing Panel: Roadmap and Change: How Much and How Fast. Thomas Sterling, California Institute of Technology and NASA Jet Propulsion Laboratory. October 12, 2004. Thomas Sterling - Caltech & JPL

  2. 29 Years Ago Today

  3. Linpack Zettaflops in 2032. [Chart: projected Linpack performance of the TOP500 list (SUM, N=1, and N=500 curves), extrapolated from ~100 Mflops in 1993 through 1 Zettaflops around 2032 and on toward ~10 Zflops by the early 2040s. Courtesy of Thomas Sterling.]
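The 2032 crossing in the Linpack projection can be reproduced with a back-of-envelope extrapolation. The growth factor and 1993 baseline below are assumptions chosen to match the chart, not figures stated in the talk:

```python
import math

# Assumed sustained annual growth of aggregate TOP500 Linpack performance
# (~1.7x/year, roughly the historical list trend) and the approximate
# June 1993 aggregate (SUM) performance of ~1.12 Tflops.
GROWTH_PER_YEAR = 1.7
BASE_YEAR, BASE_FLOPS = 1993, 1.12e12
TARGET = 1e21  # 1 Zettaflops

# Solve GROWTH^years = TARGET / BASE for years.
years_needed = math.log(TARGET / BASE_FLOPS) / math.log(GROWTH_PER_YEAR)
print(f"1 Zflops reached around {BASE_YEAR + years_needed:.0f}")
```

At this assumed rate the trend line crosses 1 Zettaflops in roughly 2032, matching the chart's extrapolation.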

  4. [image-only slide]

  5. The Way We Were: 1974
  • IBM 370: the market mainstream
    • Approx. 1 Mflops
  • DEC PDP-11: the geek's delight
  • Seymour Cray started work on the Cray-1
    • Approx. 100 Mflops
  • 2nd-generation microprocessors
    • e.g. the Intel 8008
  • Core memory
    • 1103 1Kx1 DRAM chips
  • Punch cards, paper tape, teletypes, Selectrics

  6. What Will Be Different
  • Moore's Law will have flatlined
  • Nano-scale, atomic-level devices
    • Assuming we solve the lithography problem
  • Local clock rates ~100 GHz
    • Fastest today is > 700 GHz
  • Local actions strongly preferred over global actions
  • Non-conventional technologies may be employed
    • Optical
    • Quantum dots
    • Rapid Single Flux Quantum (RSFQ) gates

  7. What We Will Need
  • 1 nano-watt per Megaflops
    • The energy received from Tau Ceti (per m2)
  • Approximately 1 square meter of ALUs for 1 Zettaflops
  • 10 billion execution sites
    • > 10 billion-way parallelism
  • Including memory and communications: 2000 m2
    • 3-D packaging: (4 m)^3
  • Global latency of ~10,000 cycles
    • Including average latency => 1 trillion-way parallelism
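These requirements follow from simple arithmetic on the stated targets. The sketch below reproduces them; the ~100-cycle average latency used in the parallelism estimate is my assumption (most traffic being local), since the slide only implies it:

```python
# Back-of-envelope arithmetic behind the slide's Zettaflops requirements.
ZFLOPS = 1e21                        # 1 Zettaflops target
NW_PER_MFLOPS = 1e-9 / 1e6           # 1 nano-watt per Megaflops, in W per flop/s
power_watts = ZFLOPS * NW_PER_MFLOPS # ~1e6 W: a 1 MW machine at this efficiency

SITES = 10e9                         # 10 billion execution sites
flops_per_site = ZFLOPS / SITES      # ~1e11: 100 Gflops per site

CLOCK_HZ = 100e9                     # assumed ~100 GHz local clock
LATENCY_CYCLES = 10_000              # global latency from the slide
latency_s = LATENCY_CYCLES / CLOCK_HZ  # ~100 ns across the machine

# To stay busy while remote accesses are in flight, each site (at ~1 op/cycle)
# needs roughly as many concurrent operations as the AVERAGE access latency.
AVG_LATENCY_CYCLES = 100             # assumption: most traffic stays local
total_parallelism = SITES * AVG_LATENCY_CYCLES  # ~1e12: "trillion-way"
print(power_watts, flops_per_site, latency_s, total_parallelism)
```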

  8. Parcel Simulation Latency-Hiding Experiment. [Diagram: a process-driven node (local memory, ALU, blocking remote memory requests) contrasted with a parcel-driven node (local memory, ALU, input and output parcels); a control experiment and a test experiment, each with nodes connected by a flat network.]

  9. Latency Hiding with Parcels with respect to System Diameter (in cycles)

  10. Latency Hiding with Parcels: Idle Time with respect to Degree of Parallelism
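The trend these experiment slides report can be illustrated with a toy analytic stand-in (my sketch, not the original simulation): if each parcel carries g cycles of useful work and a remote round trip costs L cycles, idle time falls as the number of parcels in flight grows, vanishing once there are enough parcels to cover the system diameter:

```python
# Toy model of parcel-based latency hiding: a node issues remote requests of
# latency L cycles; each parcel carries g cycles of useful work; with p
# parcels in flight, work overlaps communication.
def idle_fraction(p, g, L):
    """Fraction of cycles the node sits idle waiting on remote requests."""
    busy = min(1.0, p * g / (g + L))  # useful work available per round trip
    return 1.0 - busy

# Idle time shrinks with the degree of parallelism p and reaches zero
# once p >= (g + L) / g.
for p in (1, 10, 100, 1000):
    print(p, round(idle_fraction(p, g=10, L=10_000), 3))
```

With a 10,000-cycle diameter and 10-cycle work grains, on the order of a thousand concurrent parcels per node are needed, consistent with the massive parallelism figures earlier in the talk.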

  11. Architecture Innovation
  • Extreme memory bandwidth
  • Active latency hiding
  • Extreme parallelism
  • Message-driven split-transaction computation (parcels)
  • PIM
    • e.g. Kogge, Draper, Sterling, …
    • Very high memory bandwidth
    • Lower memory latency (on chip)
    • Higher execution parallelism (banks and row-wide)
  • Streaming
    • Dally, Keckler, …
    • Very high functional parallelism
    • Low latency (between functional units)
    • Higher execution parallelism (high ALU density)

  12. Continuum Computer Architecture
  • Merges state, logic, and communication in a single building block
  • Parcel-driven computation
  • Fine-grain split-transaction computing
    • Move data through vectors of instructions in store
    • Move instruction streams through vectors of data
    • Gather-scatter as an intrinsic
  • Very efficient futures for producer/multi-consumer computing
  • Combines strengths of PIM and streaming
  • All-register architecture (fully associative)
  • Functional units within a cycle of their neighbors
  • Extreme parallelism
  • Intrinsic latency hiding
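The parcel-driven, split-transaction style can be sketched in miniature. The code below is a hypothetical illustration with invented names, not the CCA design itself: rather than a thread blocking on a remote load, a "load" parcel travels to the node owning the data, runs a small action there, and a "reply" parcel carries the result back to fire the continuation:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Parcel:
    dest: int        # node that owns the target datum (or the requester)
    action: str      # what to do on arrival ("load" or "reply")
    payload: dict = field(default_factory=dict)

class Node:
    def __init__(self, nid, memory):
        self.nid, self.memory, self.results = nid, memory, []

    def handle(self, p, network):
        if p.action == "load":          # run the action where the data lives
            value = self.memory[p.payload["addr"]]
            network.append(Parcel(p.payload["home"], "reply", {"value": value}))
        elif p.action == "reply":       # continuation fires back at the home node
            self.results.append(p.payload["value"] * 2)

# Two nodes; node 0 asks node 1 for address 5, doubling whatever comes back.
nodes = [Node(0, {}), Node(1, {5: 21})]
network = deque([Parcel(1, "load", {"addr": 5, "home": 0})])
while network:                          # parcel-driven event loop
    p = network.popleft()
    nodes[p.dest].handle(p, network)
print(nodes[0].results)                 # → [42]
```

The requester never blocks: the transaction is split into two one-way parcels, which is what lets a node keep many such transactions in flight to hide latency.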

  13. [Diagram: a Continuum Computer Architecture building block: associative memory, an instruction register, and control feeding a dense array of ALUs.]

  14. Conclusions
  • Zettaflops at nano-scale technology is possible
    • Size requirements are tolerable, but packaging is a challenge
    • The latency challenge does not sink the idea
  • Major obstacles
    • Power
    • Latency
    • Parallelism
    • Reliability
    • Programming
  • Architecture can address many of these
  • Continuum Computer Architecture
    • Combines advantages of PIM and streaming
    • Strong candidate for a future Zettaflops computer

  15. [image-only slide]
