Topic 8: Advances in Parallel Computer Architectures \course\cpeg323-05F\Topic-final-323.ppt
Reading List • Slides: Topic8x
Why Study Parallel Architecture? • Role of a computer architect: • To design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost. • Parallelism: • Provides an alternative to a faster clock for performance • Applies at all levels of system design • Is a fascinating perspective from which to view architecture • Is increasingly central in information processing
Inevitability of Parallel Computing • Application demands • Technology trends • Architecture trends • Economics
Application Trends • Demand for cycles fuels advances in hardware, and vice versa • Range of performance demands • Goal of applications in using parallel machines: Speedup • Productivity requirement
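The speedup goal can be made concrete as Speedup(p) = T(1)/T(p). A minimal sketch using Amdahl's fixed-workload law; the 5% serial fraction is an illustrative assumption, not a figure from the slides:

```python
def amdahl_speedup(p, serial_fraction):
    """Speedup on p processors when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# Even a small serial fraction caps the achievable speedup near 1/serial_fraction:
for p in (4, 16, 64, 1024):
    print(p, round(amdahl_speedup(p, 0.05), 1))
```

With a 5% serial fraction, no processor count can push speedup past 20x, which is why the later slides emphasize both parallelism and locality rather than raw processor counts alone.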
Summary of Application Trends • Transition to parallel computing has occurred for scientific and engineering computing • Rapid progress is underway in commercial computing • Desktops also run multithreaded programs, which are a lot like parallel programs • Demand for improving throughput on sequential workloads • Demand for productivity
Technology: A Closer Look [Figure: processor, cache ($), interconnect] • Basic advance is decreasing feature size (λ) • Clock rate improves roughly proportional to improvement in λ • Number of transistors improves like λ² (or faster) • Performance > 100x per decade; clock rate 10x, rest transistor count • How to use more transistors? • Parallelism in processing • Locality in data access • Both need resources, so tradeoff
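The λ relationships above can be sketched numerically. A quick illustration for one process generation, assuming the classic ~0.7x linear shrink per generation (the 0.7 factor is a common rule of thumb, not a number from this slide):

```python
def scaling_from_feature_size(shrink_factor):
    """Given a linear feature-size shrink (< 1), return the rough improvements
    the slide describes: clock ~ 1/lambda, transistor count ~ 1/lambda^2."""
    clock_improvement = 1.0 / shrink_factor
    transistor_improvement = 1.0 / shrink_factor ** 2
    return clock_improvement, transistor_improvement

clock_x, xtors_x = scaling_from_feature_size(0.7)  # one process generation
print(round(clock_x, 2), round(xtors_x, 2))        # ~1.43x clock, ~2x transistors
```

This is why transistor count compounds faster than clock rate: the area effect is quadratic in the linear shrink.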
Clock Frequency Growth Rate • 30% per year
Transistor Count Growth Rate • 1 billion transistors on a chip in the early 2000s • Transistor count grows much faster than clock rate • ~40% per year, an order of magnitude more contribution over two decades
Similar Story for Storage • Divergence between memory capacity and speed more pronounced • Larger memories are slower • Need deeper cache hierarchies • Parallelism and locality within memory systems • Disks too: parallel disks plus caching
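The locality point above can be illustrated with a toy cache model. The sketch below counts misses in a small direct-mapped cache for a unit-stride scan versus a large-stride scan; the cache geometry and strides are illustrative parameters, not from the slides:

```python
def misses(addresses, num_lines=64, line_size=8):
    """Count misses in a toy direct-mapped cache model (illustrative only)."""
    cache = [None] * num_lines  # which block each line currently holds
    miss_count = 0
    for a in addresses:
        block = a // line_size        # cache block containing address a
        idx = block % num_lines       # direct-mapped placement
        if cache[idx] != block:
            cache[idx] = block        # fill on miss
            miss_count += 1
    return miss_count

n = 4096
sequential = list(range(n))                  # unit stride: high spatial locality
strided = [(i * 512) % n for i in range(n)]  # large stride: every access conflicts
print(misses(sequential), misses(strided))
```

The sequential scan misses only once per line (512 misses for 4096 accesses), while the strided scan maps every access to the same line and misses every time, which is the behavior deeper cache hierarchies are designed to exploit or avoid.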
Moore’s Law and Headcount • Along with the number of transistors, the effort and headcount required to design a microprocessor has grown exponentially
Architectural Trends • Architecture: performance and capability • Tradeoff between parallelism and locality • Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect • Understanding microprocessor architectural trends • Four generations of architectural history: tube, transistor, IC, VLSI
Technology Progress Overview • Processor speed improvement: 2x per year (since 1985); 100x in the last decade • DRAM memory capacity: 2x in 2 years (since 1996); 64x in the last decade • Disk capacity: 2x per year (since 1997); 250x in the last decade
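These rule-of-thumb rates compound quickly. A quick sketch of how an annual improvement factor accumulates over a decade:

```python
def decade_factor(annual_factor, years=10):
    """Cumulative improvement from compounding a fixed annual factor."""
    return annual_factor ** years

print(decade_factor(2.0))            # doubling every year -> 1024x per decade
print(round(decade_factor(1.3), 1))  # 30%/year (clock) -> ~13.8x per decade
```

The gap between these curves is the gap the slides keep returning to: capacity-like quantities (transistors, disk) compound much faster than latency-like ones (clock, memory speed).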
Classes of Parallel Architecture for High-Performance Computers (Courtesy of Thomas Sterling) • Parallel Vector Processors (PVP) • NEC Earth Simulator, SX-6 • Cray-1, 2, XMP, YMP, C90, T90, X1 • Fujitsu 5000 series • Massively Parallel Processors (MPP) • Intel Touchstone Delta & Paragon • TMC CM-5 • IBM SP-2 & 3, Blue Gene/L • Cray T3D, T3E, Red Storm/Strider • Distributed Shared Memory (DSM) • SGI Origin • HP Superdome • Single Instruction stream, Multiple Data streams (SIMD) • Goodyear MPP, MasPar 1 & 2, TMC CM-2 • Commodity Clusters • Beowulf-class PC/Linux clusters • Constellations • HP Compaq SC, Linux NetworX MCR
What have we learned in the last two decades? Building a “good” general-purpose parallel machine is very hard! Proof by contradiction: so many companies went bankrupt in the past decade!
A Growth-Factor of a Billion in Performance in a Single Lifetime (Courtesy of Thomas Sterling) [Timeline figure: performance scale from One OPS (10^0) through KiloOPS (10^3), MegaOPS (10^6), GigaOPS (10^9), TeraOPS (10^12), to PetaOPS (10^15); machines in chronological order: 1823 Babbage Difference Engine, 1943 Harvard Mark 1, 1949 Edsac, 1951 Univac 1, 1959 IBM 7094, 1964 CDC 6600, 1976 Cray 1, 1982 Cray XMP, 1988 Cray YMP, 1991 Intel Delta, 1996 T3E, 1997 ASCI Red, 2001 Earth Simulator, 2003 Cray X1]
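A billionfold gain over a single lifetime implies a remarkably steady exponential. A back-of-envelope check, treating the span as roughly 60 years (an approximation of the timeline above):

```python
def implied_annual_growth(total_factor, years):
    """Steady annual factor needed to sustain total_factor over `years`."""
    return total_factor ** (1.0 / years)

rate = implied_annual_growth(1e9, 60)
print(round(rate, 2))  # ~1.41, i.e. roughly 41% improvement per year, sustained for decades
```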
Applications Demands (Courtesy of Erik P. DeBenedictis 2004) [Figure: projected application demands vs. system performance, 2000–2020 (no schedule provided by source for some entries): Plasma Fusion Simulation [Jardin 03] at 1 Zettaflops; Full Global Climate [Malone 03] at 100 Exaflops; Geodata Earth Station Range [NASA 02] at 10 Exaflops; “Compute as fast as the engineer can think” [NASA 99] at 1 Exaflops; protein folding at 100 Petaflops; simulation of large biomolecular structures (ms scale) at 10 Petaflops; simulation of medium biomolecular structures (µs scale) at 1 Petaflops [SCaLeS 03]; simulation of more complex biomolecular structures [HEC04]; system milestones at 50 TFLOPS, 250 TFLOPS, and 1 PFLOPS]
References:
[Jardin 03] S.C. Jardin, “Plasma Science Contribution to the SCaLeS Report,” Princeton Plasma Physics Laboratory, PPPL-3879 UC-70, available on the Internet.
[Malone 03] Robert C. Malone, John B. Drake, Philip W. Jones, Douglas A. Rotman, “High-End Computing in Climate Modeling,” contribution to the SCaLeS report.
[NASA 99] R. T. Biedron, P. Mehrotra, M. L. Nelson, F. S. Preston, J. J. Rehder, J. L. Rogers, D. H. Rudy, J. Sobieski, and O. O. Storaasli, “Compute as Fast as the Engineers Can Think!” NASA/TM-1999-209715, available on the Internet.
[NASA 02] NASA Goddard Space Flight Center, “Advanced Weather Prediction Technologies: NASA’s Contribution to the Operational Agencies,” available on the Internet.
[SCaLeS 03] Workshop on the Science Case for Large-scale Simulation, June 24–25, proceedings at http://www.pnl.gov/scales/.
[DeBenedictis 04] Erik P. DeBenedictis, “Matching Supercomputing to Progress in Science,” July 2004. Presentation at Lawrence Berkeley National Laboratory, also published as Sandia National Laboratories report SAND2004-3333P; Sandia technical reports are available via the technical library at http://www.sandia.gov.
[HEC04] Federal Plan for High-End Computing, May 2004.
Multi-core Technology Is Becoming Mainstream • IBM: POWER, CELL; AMD: Opteron; Intel, RMI, ClearSpeed • Unprecedented peak performance • Significantly reduces hardware cost, with much lower power consumption and heat • Greatly expands the spectrum of application domains “It is likely that 2005 will be viewed as the year that parallelism came to the masses, with multiple vendors shipping dual/multi-core platforms into the mainstream consumer and enterprise markets.” - Intel Fellow Justin Rattner, IEEE PACT keynote speech (Sept 19, 2005)
IBM Power5 Multicore Chip • Technology: 130nm lithography, Cu, SOI • Dual processor core • 8-way superscalar • Simultaneous multithreaded (SMT) core • Up to 2 virtual processors per real processor • 24% area growth per core for SMT • Natural extension to POWER4 design Courtesy of “Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor” by Ron Kalla, Balaram Sinharoy, and Joel Tendler of IBM Systems Group
Quad AMD Opteron [Figure: block diagram: four AMD Opteron™ 940 mPGA processors, each with 8 GB DRAM on a 200–333 MHz 9-byte registered DDR channel; AMD-8111™ I/O hub (VGA/PCI graphics, SSL encryption, TCP/IP offload engine, legacy PCI, FLASH, LPC SIO, SPI 3.0 management interface, 100BaseT management LAN, USB 1.0/2.0, AC97, UDMA133, 10/100 Ethernet); modular array ASIC with 10/100 PHY and GMII to OC-12 or 802.3 GigE NIC]
ARM MPCore Architecture (Courtesy of linuxdevice.com)
ClearSpeed CSX600 • 250 MHz clock • 96 high-performance processing elements • 576 Kbytes PE memory • 128 Kbytes on-chip scratchpad memory • 25,000 MIPS • 50 GFLOPS single or double precision • 3.2 Gbytes/s external memory bandwidth • 96 Gbytes/s internal memory bandwidth • 2 x 4 Gbytes/s chip-to-chip bandwidth Courtesy of CSX600 Overview on http://www.clearspeed.com/
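The quoted 50 GFLOPS is roughly consistent with the PE count and clock. A sanity check assuming each PE retires 2 floating-point operations per cycle (e.g. a fused multiply-add; that per-PE rate is an assumption, not stated in the overview excerpt):

```python
def peak_gflops(num_pes, clock_ghz, flops_per_pe_per_cycle):
    """Peak throughput = processing elements x clock (GHz) x flops per PE per cycle."""
    return num_pes * clock_ghz * flops_per_pe_per_cycle

print(peak_gflops(96, 0.25, 2))  # 48.0, close to the quoted ~50 GFLOPS
```

This kind of peak-rate arithmetic is worth doing for any accelerator datasheet: it reveals how much of the headline number depends on every PE issuing its widest operation every cycle.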
A Case Study -- The IBM Cyclops-64 Architecture (Architect: Monty Denneau) [Figure: system hierarchy: “Processor”: 1 Gflop/s, 64 KB SRAM (thread units, FPU, I-cache); “Chip”: 80 Gflop/s, 1 GB memory, intra-chip network, communication ports for 3D-mesh inter-chip network, external memory, I/O; “Board”: 320 Gflop/s, 4 GB memory; “Rack”: 15.4 Tflop/s, 192 GB memory; “System”: 1.1 Pflop/s, 13.5 TB memory; bisection BW: 4 TB/s]
Data Points of a 1 Petaflop C64 Machine • Cyclops chip: 533 MHz, 5.1 MB SRAM, 1-2 GB DRAM • Disk space: 300 GB/node • Total system power: 2 MW (chilled-water cooling) • Size: 20’ x 48’ • Mean time to failure: 2 weeks • Cost: $20 million?
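The per-level figures on the case-study slide imply the machine's composition. A back-of-envelope sketch (pure arithmetic on the slide's numbers, so the derived counts are approximate):

```python
# Figures taken from the Cyclops-64 hierarchy slide above
chip_gflops   = 80.0    # "Chip": 80 Gflop/s
board_gflops  = 320.0   # "Board": 320 Gflop/s
rack_tflops   = 15.4    # "Rack": 15.4 Tflop/s
system_pflops = 1.1     # "System": 1.1 Pflop/s

chips_per_board = board_gflops / chip_gflops          # 4 chips per board
boards_per_rack = rack_tflops * 1e3 / board_gflops    # ~48 boards per rack
racks_in_system = system_pflops * 1e3 / rack_tflops   # ~71 racks
print(chips_per_board, round(boards_per_rack), round(racks_in_system))
```

Dividing adjacent levels like this is a quick consistency check on any hierarchical machine description; here the levels divide almost evenly, as expected.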
A Cyclops-64 Rack
C-64 Chip Architecture • On-chip bisection BW = 0.38 TB/s; total BW to 6 neighbours = 48 GB/s
Mrs.Clops
Summary