Trends in Supercomputing for the Next Five Years Katherine Yelick http://www.cs.berkeley.edu/~yelick/cs267 Based in part on lectures by Horst Simon and David Bailey
Five Computing Trends • Continued rapid processor performance growth following Moore’s law • Open software model (Linux) will become standard • Network bandwidth will grow at an even faster rate than Moore’s Law • Aggregation, centralization, colocation • Commodity products everywhere
Overview • High-end machines in general • Processors • Interconnects • Systems software • Programming models • Look at the 3 Japanese HPCs • Examine the TOP500
History of High Performance Computers
[figure] Aggregate systems performance, single-CPU performance, and CPU frequencies from 1980 to 2010, on log scales (1 MFLOPS to 1 PFLOPS; 10 MHz to 10 GHz). Aggregate performance, driven by increasing parallelism, climbs from the X-MP, CRAY-2, Y-MP8, S-810/20, S-820/80, VP-200/VP-400, and SX-2 in the 1980s, through the CM-5, T3D/T3E, Paragon, SX-3/SX-3R, S-3800, C90, T90, VP2600, VPP500/700/800/5000, SR2201, SR8000, NWT/166, and ASCI Red/Blue/Blue Mountain/White/Q, up to the SX-4/SX-5/SX-6 and the Earth Simulator.
Analysis of TOP500 Data • Annual performance growth is about a factor of 1.82 • Two factors contribute almost equally to the annual total performance growth • Processor count grows per year on average by a factor of 1.30, and processor performance grows by a factor of 1.40, compared to 1.58 for Moore's Law • Efficiency relative to hardware peak is declining. Strohmaier, Dongarra, Meuer, and Simon, Parallel Computing 25, 1999, pp. 1517-1544.
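As a quick check on the slide's arithmetic, the two per-year factors compose to the observed total growth:

```latex
% Annual TOP500 total-performance growth decomposed into its two factors
\[
\underbrace{1.30}_{\text{processor count}} \times
\underbrace{1.40}_{\text{per-processor performance}}
\;=\; 1.82 \text{ per year},
\qquad \text{vs. } 1.58 \text{ per year for Moore's Law.}
\]
```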
Performance Extrapolation
[figure] TOP500 performance trends extrapolated into the future, with the projected performance of a laptop shown for comparison.
Analysis of TOP500 Extrapolation Based on the extrapolation from these fits we predict: • First 100 TFlop/s system by 2005 • About 1–2 years later than the ASCI path forward plans • No system smaller than 1 TFlop/s should be able to make the TOP500 • First Petaflop system available around 2009 • Rapid changes in the technologies used in HPC systems make a projection of the architecture/technology difficult • Continue to expect rapid cycles of re-definition
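A minimal sketch of the trend arithmetic behind these dates, assuming only the 1.82x/year total-performance growth from the previous slide and taking the predicted 2005 arrival of a 100 TFlop/s system as the starting point:

```python
import math

ANNUAL_GROWTH = 1.82  # total TOP500 performance growth per year (previous slide)

def years_to_reach(start_flops, target_flops, growth=ANNUAL_GROWTH):
    """Years for performance to grow from start_flops to target_flops."""
    return math.log(target_flops / start_flops) / math.log(growth)

# If the first 100 TFlop/s system arrives around 2005, the same trend
# puts the first PFlop/s system roughly four years later:
t = years_to_reach(100e12, 1e15)
print(f"100 TFlop/s -> 1 PFlop/s in ~{t:.1f} years, i.e. around {2005 + t:.0f}")
```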
What About Efficiency? • Talking about Linpack • What should the efficiency of a machine on the TOP500 be? Percent of peak for Linpack: > 90%? > 80%? > 70%? > 60%? … • Remember this is O(n³) ops on O(n²) data • Mostly matrix multiply
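A small sketch of what these percentages mean, using the standard Linpack operation count of about 2/3·n³ + 2·n² flops for an n×n system (the Rmax/Rpeak values below are made up for illustration):

```python
def linpack_flops(n):
    """Approximate flop count for solving an n x n dense system (Linpack)."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def efficiency(rmax, rpeak):
    """Fraction of theoretical peak sustained on Linpack (Rmax / Rpeak)."""
    return rmax / rpeak

# Illustrative (made-up) numbers: 6.5 Tflop/s sustained on a 10 Tflop/s peak machine.
print(f"efficiency: {efficiency(6.5e12, 10e12):.0%}")

# O(n^3) ops on O(n^2) data: flops per matrix element grow linearly with n,
# which is why Linpack (mostly matrix multiply) flatters most machines.
n = 100_000
print(f"flops per matrix element at n = {n}: {linpack_flops(n) / n**2:,.0f}")
```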
Efficiency Is Declining Over Time • Analysis of the top 100 machines in 1994 and 2004 • Shows the number of machines in the top 100 that achieve a given efficiency on the Linpack benchmark • In 1994, 40 machines had > 90% efficiency • In 2004, 50 have < 50% efficiency
[figure] Performance by rank for the top machines: Earth Simulator (ES), ASCI Q, VT-Apple, NCSA, PNNL, LANL Lightning, LLNL MCR, ASCI White, NERSC, LLNL.
Architecture/Systems Continuum (from loosely coupled to tightly coupled)
• Commodity processor with commodity interconnect (loosely coupled)
  • Clusters: Pentium, Itanium, Opteron, Alpha, PowerPC with GigE, Infiniband, Myrinet, Quadrics, or SCI
  • NEC TX7 • HP Alpha • Bull NovaScale 5160
• Commodity processor with custom interconnect
  • SGI Altix (Intel Itanium 2) • Cray Red Storm (AMD Opteron) • IBM Regatta
• Custom processor with custom interconnect (tightly coupled)
  • Cray X1 • NEC SX-7 • IBM Blue Gene/L (commodity IBM PowerPC core)
• Note: commodity here means not designed solely for HPC
Vibrant Field for High Performance Computers • Cray X1 • SGI Altix • IBM Regatta • Sun • HP • Bull • Fujitsu PrimePower • Hitachi SR11000 • NEC SX-7 • Apple • Coming soon: Cray Red Storm, Cray BlackWidow, NEC SX-8, IBM Blue Gene/L
Off-the-Shelf Processors • AMD Opteron: 2 GHz, 4 Gflop/s peak • HP Alpha EV68: 1.25 GHz, 2.5 Gflop/s peak • IBM PowerPC: 2 GHz, 8 Gflop/s peak • Intel Itanium 2: 1.5 GHz, 6 Gflop/s peak • Intel Pentium Xeon, Pentium EM64T: 3.2 GHz, 6.4 Gflop/s peak • MIPS R16000: 700 MHz, 1.4 Gflop/s peak • Sun UltraSPARC IV: 1.2 GHz, 2.4 Gflop/s peak
Itanium 2 Processor • Floating point loads bypass the level 1 cache • Bus is 128 bits wide and operates at 400 MHz, for 6.4 GB/s • 4 flops/cycle • 1.5 GHz Itanium 2 • Linpack numbers (theoretical peak 6 Gflop/s): 100×100: 1.7 Gflop/s; 1000×1000: 5.4 Gflop/s
Pentium 4 IA32 • Processor of choice for clusters • 1 flop/cycle, 2 with SSE2 • Intel Xeon 3.2 GHz • 400/533 MHz bus, 64 bits wide (3.2/4.2 GB/s) • Linpack numbers (peak 6.4 Gflop/s): 100×100: 1.7 Gflop/s; 1000×1000: 3.1 Gflop/s • Coming soon: “Pentium 4 EM64T” • 64-bit • 800 MHz bus, 64 bits wide • 3.6 GHz, 2 MB L2 cache • Peak 7.2 Gflop/s using SSE2
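The peak figures on these processor slides are just clock rate times floating-point operations per cycle; a small sketch reproducing them (flops-per-cycle values as quoted or implied by the slides):

```python
def peak_gflops(clock_ghz, flops_per_cycle):
    """Theoretical peak in Gflop/s."""
    return clock_ghz * flops_per_cycle

processors = {
    # name: (clock in GHz, flops per cycle)
    "Intel Itanium 2":        (1.5, 4),  # -> 6.0 Gflop/s
    "Intel Xeon (SSE2)":      (3.2, 2),  # -> 6.4 Gflop/s
    "Pentium 4 EM64T (SSE2)": (3.6, 2),  # -> 7.2 Gflop/s
    "AMD Opteron":            (2.0, 2),  # -> 4.0 Gflop/s
    "MIPS R16000":            (0.7, 2),  # -> 1.4 Gflop/s
}

for name, (ghz, fpc) in processors.items():
    print(f"{name:24s} {peak_gflops(ghz, fpc):4.1f} Gflop/s peak")
```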
High Bandwidth vs. Commodity Systems • High bandwidth systems have traditionally been vector computers • Designed for scientific problems • Capability computing • Commodity processors are designed for web servers and the home PC market (we should be thankful that the manufacturers keep 64-bit floating point) • Used for cluster-based computers, leveraging their price point • Scientific computing needs are different • Require a better balance between data movement and floating point operations, which results in greater efficiency
Commodity Interconnects • Gig Ethernet • Myrinet • Infiniband • QsNet • SCI
[diagrams] Clos, fat tree, and torus switch topologies
Commodity Interconnects • Price/performance drives the commodity market • Bandwidth more than latency

           Switch topology   $/node NIC   $/node switch   $/node total   Latency (us) / MPI BW (MB/s)   MB/s per $
GigE       Bus               $50          $50             $100           30 / 100                       1.0
SCI        Torus             $1,600       $0              $1,600         5 / 300                        0.2
QsNetII    Fat tree          $1,200       $1,700          $2,900         3 / 880                        0.3
Myrinet    Clos              $700         $400            $1,100         6.5 / 240                      0.2
IB 4x      Fat tree          $1,000       $400            $1,400         6 / 820                        0.6
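The last column of the table is just MPI bandwidth divided by total cost per node; a sketch of that calculation using the table's numbers:

```python
# (total $ per node = NIC + switch, MPI bandwidth in MB/s), from the table above
interconnects = {
    "GigE":    (50 + 50,     100),
    "SCI":     (1600 + 0,    300),
    "QsNetII": (1200 + 1700, 880),
    "Myrinet": (700 + 400,   240),
    "IB 4x":   (1000 + 400,  820),
}

for name, (cost, bw) in interconnects.items():
    print(f"{name:8s} {bw / cost:.1f} MB/s per dollar")
```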
Interconnects Used: Efficiency for Linpack (min/max/average across systems)

                Largest node count   Min    Max    Average
GigE            1024                 17%    63%    37%
SCI             120                  64%    64%    64%
QsNetII         2000                 68%    78%    74%
Myrinet         1250                 36%    79%    59%
Infiniband 4x   1100                 58%    69%    64%
Proprietary     9632                 45%    98%    68%
[figure] Top machines: Earth Simulator (ES), ASCI Q, VT-Apple, NCSA, PNNL, LANL Lightning, LLNL MCR, ASCI White, NERSC, LLNL.
Cray X1: Parallel Vector Architecture • 12.8 Gflop/s vector processors • 4-processor nodes sharing up to 64 GB of memory • Single system image up to 4096 processors • 64 CPUs / 800 Gflop/s per liquid-cooled (LC) cabinet
HW Resources Visible to Software
[figure] Comparison of Vector IRAM and a Pentium III, showing which hardware resources are visible to software versus transparent to software
• Software (applications/compiler/OS) can control main memory, registers, and execution datapaths
Special Purpose: GRAPE-6 • The 6th generation of the GRAPE (Gravity Pipe) project • Gravity (N-body) calculation for many particles, at 31 Gflop/s per chip • 32 chips/board: 0.99 Tflop/s per board • The full 64-board system installed at the University of Tokyo: 63 Tflop/s • On each board, all particle data are loaded into SRAM; each target particle is injected into the pipeline and its acceleration is computed • No software! • Gordon Bell Prize at SC for a number of years (Prof. Makino, U. Tokyo)
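The aggregate figures follow directly from the per-chip rate; a quick check using only the counts quoted above:

```python
gflops_per_chip = 31   # gravity pipeline rate per GRAPE-6 chip
chips_per_board = 32
boards = 64            # full system at the University of Tokyo

board_tflops = gflops_per_chip * chips_per_board / 1000.0
print(f"per board:   {board_tflops:.2f} Tflop/s")           # ~0.99 Tflop/s
print(f"full system: {board_tflops * boards:.0f} Tflop/s")  # ~63 Tflop/s
```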
Characteristics of Blue Gene/L • Machine peak speed: 180 Teraflop/s • Total memory: 16 Terabytes • Footprint: 2,500 sq. ft. • Total power: 1.2 MW • Number of nodes: 65,536 • Power dissipation/CPU: 7 W • MPI latency: 5 microseconds
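Dividing the table's totals by the node count shows how modest each Blue Gene/L node is. A quick sketch derived only from the numbers above (per-node power here is the system total spread over all nodes, so it includes memory, interconnect, and other overheads beyond the 7 W per CPU):

```python
nodes       = 65_536
peak_tflops = 180       # machine peak, Tflop/s
memory_tb   = 16        # total memory, TB
power_watts = 1.2e6     # total power, W

print(f"peak per node:   {peak_tflops * 1e3 / nodes:.2f} Gflop/s")
print(f"memory per node: {memory_tb * 1e6 / nodes:.0f} MB")
print(f"power per node:  {power_watts / nodes:.1f} W (system-wide average)")
```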
Building Blue Gene/L Image from LLNL
Sony PlayStation2 • Emotion Engine: 6 Gflop/s peak • Superscalar MIPS 300 MHz core + vector coprocessor + graphics/DRAM • About $200; 529M sold • 8 KB D-cache; 32 MB of memory, not expandable (the OS lives here as well) • 32-bit floating point; not IEEE compliant • 2.4 GB/s to memory (0.38 B/flop) • Potential 20 floating point ops/cycle: FPU w/ FMAC+FDIV • VPU1 w/ 4 FMAC+FDIV • VPU2 w/ 4 FMAC+FDIV • EFU w/ FMAC+FDIV • See the PS2 cluster project at UIUC • What about PS3?
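The balance problem raised on the next slide can be made concrete as bytes of memory bandwidth per peak flop. A sketch using numbers already quoted in these slides (the Emotion Engine peak is taken here as roughly 6.2 Gflop/s, which is close to the 0.38 B/flop figure above):

```python
def bytes_per_flop(mem_bw_gb_s, peak_gflops):
    """Memory bytes available per peak floating-point operation."""
    return mem_bw_gb_s / peak_gflops

# PlayStation2 Emotion Engine: 2.4 GB/s to memory, ~6.2 Gflop/s peak
print(f"PS2 Emotion Engine: {bytes_per_flop(2.4, 6.2):.2f} B/flop")
# Intel Xeon 3.2 GHz: ~4.2 GB/s bus, 6.4 Gflop/s peak (earlier slide)
print(f"Intel Xeon 3.2 GHz: {bytes_per_flop(4.2, 6.4):.2f} B/flop")
```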
High-Performance Chips for Embedded Applications • The driving market is gaming (PCs and game consoles), which is the main motivation for almost all the technology developments • Demonstrate that arithmetic is quite cheap • Not clear that they do much for scientific computing • Today there are three big problems with these apparently non-standard "off-the-shelf" chips: • Most of these chips have very limited memory bandwidth and little if any support for inter-node communication • Integer-only, or only 32-bit floating point • No software support to map scientific applications to these processors • Poor memory capacity for program storage • Developing "custom" software is much more expensive than developing custom hardware
Choosing the Right Option • Good hardware options are available • There is a large national investment in scientific software that is dedicated to current massively parallel hardware architectures • Scientific Discovery through Advanced Computing (SciDAC) initiative in DOE • Accelerated Strategic Computing Initiative (ASCI) in DOE • Supercomputing centers of the National Science Foundation (NCSA, NPACI, Pittsburgh) • Cluster computing in universities and labs • There is a software cost for each hardware option, but the problem can be solved
HPCS Program
Timeline and funding: Phase I through '03 (1 yr, $3M/yr per vendor: Cray, HP, IBM, SGI, Sun); Phase II through '06 (3 yr, $18M/yr: Cray, IBM, Sun); Phase III through '10 (4 yr, $50M/yr: vendors ??)
• Phase I: Concept Study • critical technology assessments • revolutionary HPCS concept solutions • new productivity metrics • requirements, scalable benchmark strategies and metrics
• Phase II: Research & Development • Develop and evaluate groundbreaking technologies that can contribute to DARPA's productivity objectives • design reviews, risk reduction prototypes and demonstrations that contribute to a preliminary design • challenges and promising solutions identified during the concept study will be explored, developed, and simulated/prototyped
• Phase III: Full-Scale Development & Manufacturing • Pilot systems, Serial 001 in 2010
Options for New Architectures
[table] Options compared by software impact, cost, timeliness, and risk factors
Is MPI the Right Programming Model? • The programming model has not changed in 10 years • What's wrong with MPI? • Not bad for regular applications • Bulk-synchronous code with balanced load • What about fine-grained programs? • Pack/unpack • What about fine-grained asynchronous programs? • Pack/unpack, prepost, check-for-done • No explicit notion of distributed data structures
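To make the pack/unpack complaint concrete, here is a hedged mpi4py sketch (not from the original slides) of exchanging one non-contiguous column of a 2-D array with a neighbor: the programmer copies it into a contiguous buffer, does a bulk-synchronous exchange, and unpacks by hand.

```python
# Hedged sketch (mpi4py assumed available); run with e.g. mpirun -n 4 python ghost.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
right, left = (rank + 1) % size, (rank - 1) % size

grid = np.random.rand(100, 100)
ghost = np.empty(100)                  # space for the incoming column

# Pack: the last column is strided, so copy it into a contiguous send buffer.
send_buf = np.ascontiguousarray(grid[:, -1])

# Bulk-synchronous exchange: every rank sends right and receives from the left.
comm.Sendrecv(send_buf, dest=right, recvbuf=ghost, source=left)

# Unpack: the application decides where the received ghost column actually lives.
left_ghost_column = ghost
```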
Global Address Space Languages • Static parallelism (like MPI) in all 3 languages (UPC, CAF, Titanium) • Globally shared address space is partitioned • References (pointers) are either local or global (meaning possibly remote) • Distributed arrays and pointer-based structures
[figure] The global address space spans processors p0 … pn: object heaps are shared and reachable through global pointers (g:), while program stacks and local pointers (l:) are private to each processor
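A toy Python illustration of the partitioned global address space idea (this is not UPC, CAF, or Titanium, there is no real communication, and the names are invented for illustration): every processor can index the whole array, but each element has an owner, so a reference is either a cheap local access or a possibly remote one.

```python
class PartitionedArray:
    """Toy block-distributed array with a single global index space."""
    def __init__(self, n, nprocs, my_rank):
        self.block = (n + nprocs - 1) // nprocs    # elements owned per rank
        self.my_rank = my_rank
        self.local = [0.0] * self.block            # only my partition lives here

    def owner(self, i):
        return i // self.block

    def read(self, i):
        if self.owner(i) == self.my_rank:
            return self.local[i % self.block]       # local reference: plain load
        return self._remote_get(self.owner(i), i)   # global reference: possibly remote

    def _remote_get(self, rank, i):
        # Stand-in for a one-sided remote read in a real PGAS runtime.
        raise NotImplementedError("remote access not modeled in this toy sketch")
```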
What’s Wrong with PGAS Languages? • Flat parallelism model • Machines are not flat: vectors, streams, SIMD, VLIW, FPGAs, PIMs, SMPs, nodes, … • No support for dynamic load balancing • They virtualize details of memory structure, but there is no virtualization of processor space • No fault tolerance • SPMD model is not a good fit • Little understanding of scientific problems • CAF and Titanium have multi-D arrays and numeric debugging • The base languages are not that great • Nevertheless, they are the right next step
To Virtualize or Not to Virtualize
• Why virtualize: portability • fault tolerance • machine variability • application-level load imbalance
• Why not virtualize: deep memory hierarchies • expensive system overhead • performance for problems that match the hardware
• If we spend all our time on these problems, we'll always be in a niche
NERSC’s Strategy Until 2010: Oakland Scientific Facility • New machine room: 20,000 ft², with the option to expand to 40,000 ft² • Includes ~50 offices and a 6 megawatt electrical supply • It’s a deal: $1.40/ft² when Oakland rents are >$2.50/ft² and rising!
Power and cooling are major costs of ownership of modern supercomputers; the Oakland facility's electrical supply is expandable to 6 megawatts.
Metropolis Center at LANL – home of the 30 Tflop/s Q machine
Strategic Computing Complex at LANL • 303,000 gross sq. ft. • 43,500 sq. ft. unobstructed computer room • Q consumes approximately half of this space • 1 Powerwall Theater (6×4 stereo = 24 screens) • 4 Collaboration rooms (3×2 stereo = 6 screens) • 2 secure, 2 open (1 of each initially) • 2 Immersive rooms • Design Simulation Laboratories (200 classified, 100 unclassified) • 200-seat auditorium
For the Next Decade, the Most Powerful Supercomputers Will Increase in Size • Power and cooling are also increasingly problematic, but there are limiting forces in those areas • Increased power density and RF leakage power will limit clock frequency and the amount of logic [Shekhar Borkar, Intel] • So linear extrapolation of operating temperatures to rocket-nozzle values by 2010 is likely to be wrong
[images] An earlier machine room became this one, and machine rooms will get bigger
“I used to think computer architecture was about how to organize gates and chips – not about building computer rooms” Thomas Sterling, Salishan, 2001
Processor Trends (summary) • The Earth Simulator is a singular event • It may become a turning point for supercomputing technology in the US • Return to vectors is unlikely, but more vigorous investment in alternate technology is likely • Independent of architecture choice we will stay on Moore’s Law curve
Five Computing Trends for the Next Five Years • Continued rapid processor performance growth following Moore’s law • Open software model (Linux) will become standard • Network bandwidth will grow at an even faster rate than Moore’s Law • Aggregation, centralization, colocation • Commodity products everywhere