SC’99: The 14th Mannheim Supercomputing Conference June 10, 1999 “looking 10 years ahead”

SC’99: The 14th Mannheim Supercomputing Conference, June 10, 1999: “looking 10 years ahead”
Gordon Bell, Microsoft
http://www.research.microsoft.com/users/gbell

What a difference 25 years and spending >10x makes!
ESRDC c2002: 40 Tflops, 5120 processors, 640 computers
LLNL center c1978: 150 Mflops, 7600 & Cray1

Talk plan
• We are at a new beginning… many views: installations, parallelism, machine intros(t), timeline, cost to get results, and scalabilities
• SCI c1985, the beginning: 1K processors (MPP); ASCI c1998, new beginning: 10K processors
• Why I traded places with Greg Papadopoulos re. clusters and SMPs
• Questions that users & architects will resolve
• New structures: Beowulf and NT equivalent, Condor, COW, Legion, Globus, Grid …

Comments from LLNL Program manager
• Lessons learned with “Full-System Mode”
• It is harder than you think
• It takes longer than you think
• It requires more people than you can believe
• Just as in the very beginning of computing, leading edge users are building their own computers.

Are we at a new beginning?
“Now, this is not the end. It is not even the beginning of the end, but it is, perhaps, the end of the beginning.” W. Churchill, 11/10/1942, as quoted at the 1999 Salishan HPC Conference
“You should not focus NSF CS Research on parallelism. I can barely write a correct sequential program.” Don Knuth, 1987 (to GBell)
“Parallel processing is impossible for people to create well, much less debug.” Ken Thompson, 1987
“I’ll give $100 to anyone who can run a program on more than 100 processors.” Alan Karp (198x?)
“I’ll give a $2,500 prize for parallelism every year.” Gordon Bell (1987)

Yes… we are at a new beginning!
Based on clustered computing: single jobs, composed of 1000s of quasi-independent programs running in parallel on 1000s of processors.
Processors (or computers) of all types are distributed (and inter-connected) in every fashion, from a collection using a single shared memory to globally dispersed computers.

U. S. Tax Dollars At Work
Intel/Sandia: 9000 Pentium Pro
LLNL/IBM SP2: 3x(488x8) PowerPC
LANL/Cray: 6144 P in 48x128 DSM clusters
How many processors does your center have?

High performance architectures timeline, 1950-2000
[Figure: timeline chart. Recoverable labels:]
Technology: Vtubes, Trans., MSI (mini), Micro, RISC, nMicr; “IBM PC”
Sequential programming (single execution stream, e.g. Fortran): spans the whole period
Processor overlap, lookahead; “killer micros”
Cray era: 6600, 7600, Cray1, X, Y, C, T; Func, Pipe, Vector, then SMP
SMP: mainframes, then “multis”, then DSM?? (Mmax, KSR, DASH, SGI)
SIMD; Vector; Parallelization
THE NEW BEGINNING: parallel programs, aka cluster computing
Multicomputers (MPP era): Ncube, Intel, IBM; MPP if n>1000
Clusters: Tandem, VAX, IBM, UNIX
Local: NOW, Beowulf. Global: networks, n>10,000, Grid

Computer types
[Figure: machine classes arranged by connectivity: WAN/LAN, SAN, DSM, SM]
Labels: networked supers… GRID; VPP uni; NEC mP; NEC super; Cray X…T (all mPv); clusters (micros, vector); Legion; Condor; Beowulf; NT clusters; T3E; SP2 (mP); NOW; SGI DSM clusters & SGI DSM; mainframes; multis; WSs; PCs

Technical computer types
[Figure: the same connectivity chart (WAN/LAN, SAN, DSM, SM), divided into two regions]
Old world: one program stream
New world: clustered computing (multiple program streams)
Labels: networked supers… GRID; VPP uni; NEC mP; T series; NEC super; Cray X…T (all mPv); micros; vector; Legion; Condor; Beowulf; SP2 (mP); NOW; SGI DSM clusters & SGI DSM; mainframes; multis; WSs; PCs

Technical computer types
[Figure: the same connectivity chart (WAN/LAN, SAN, DSM, SM), annotated with programming approaches]
Approaches: vectorize; parallelize; MPI, Linda, PVM, ???; distributed computing
Labels: networked supers… GRID; VPP uni; NEC mP; T series; NEC super; Cray X…T (all mPv); micros; vector; Legion; Condor; Beowulf; SP2 (mP); NOW; SGI DSM clusters & SGI DSM; mainframes; multis; WSs; PCs

Technical computer types: pick of 4 nodes, 2-3 interconnects
[Figure: chart of interconnects (SAN, DSM, SMP) against node types (micros, vector, plain old PCs)]
Labels: Fujitsu, Hitachi, NEC; NEC super; Cray ???; Fujitsu; Hitachi; IBM; ?PC?; SGI cluster; Beow/NT; SGI DSM; T3; HP?; HP; IBM; Intel; SUN

Bell Prize and Future Peak Tflops (t)
[Figure: peak Tflops over time; labeled points: XMP, NCube, CM2, NEC, Petaflops study target]

SCI c1983 (Strategic Computing Initiative)
Funded by DARPA in the early 80s and aimed at a Teraflops!
Era of State computers and many efforts to build high speed computers… led to HPCC: Thinking Machines, Intel supers, Cray T3 series

Humble beginning: “Killer” Micro? In 1981…did you predict this would be the basis of supers?

SCI (c1980s): Strategic Computing Initiative funded ATT/Columbia (Non Von), BBN Labs, Bell Labs/Columbia (DADO), CMU Warp (GE & Honeywell), CMU (Production Systems), Cedar (U. of IL), Encore, ESL, GE (like connection machine), Georgia Tech, Hughes (dataflow), IBM (RP3), MIT/Harris, MIT/Motorola (Dataflow), MIT Lincoln Labs, Princeton (MMMP), Schlumberger (FAIM-1), SDC/Burroughs, SRI (Eazyflow), University of Texas, Thinking Machines (Connection Machine)

Those who gave their lives in the search for parallelism: Alliant, American Supercomputer, Ametek, AMT, Astronautics, BBN Supercomputer, Biin, CDC, Chen Systems, CHOPP, Cogent, Convex (now HP), Culler, Cray Computers, Cydrome, Denelcor, Elexsi, ETA, E & S Supercomputers, Flexible, Floating Point Systems, Gould/SEL, IPM, Key, KSR, MasPar, Multiflow, Myrias, Ncube, Pixar, Prisma, SAXPY, SCS, SDSA, Supertek (now Cray), Suprenum, Stardent (Ardent+Stellar), Supercomputer Systems Inc., Synapse, Thinking Machines, Vitec, Vitesse, Wavetracer.

What can we learn from this?
• SCI: ARPA-funded product development failed. No successes. Intel prospered.
• ASCI: DOE-funded product purchases create competition
• First efforts in startups… all failed.
  • Too much competition (with each other)
  • Too little time to establish themselves
  • Too little market. No apps to support them
  • Too little cash
• Supercomputing is for the large & rich
• … or is it? Beowulf, shrink-wrap clusters; NOW, Condor, Legion, Grid, etc.

2010 ground rules:The component specs

2010 component characteristics: 100x improvement @60% growth
Chip density          500 Mt
Bytes/chip            8 GB
On-chip clock         2.5 GHz
Inter-system clock    0.5
Disk                  1 TB
Fiber speed (1 ch)    10 Gbps
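The "100x improvement @60% growth" heading is a compound-growth claim that can be checked with a few lines of arithmetic. The sketch below assumes annual compounding; the function names are illustrative and not from the talk.

```python
import math


def years_to_reach(factor: float, annual_growth: float) -> float:
    """Years of compound growth needed to improve by `factor`."""
    return math.log(factor) / math.log(1.0 + annual_growth)


def improvement_after(years: float, annual_growth: float) -> float:
    """Improvement factor after `years` of compound growth."""
    return (1.0 + annual_growth) ** years


if __name__ == "__main__":
    # 60% per year is the growth rate quoted on the slide.
    print(f"{years_to_reach(100, 0.60):.1f} years to reach 100x")
    print(f"{improvement_after(10, 0.60):.0f}x after 10 years")
```

At 60% per year a 100x improvement takes a little under ten years, which is roughly the 1999-to-2010 horizon the slide uses.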

Computer ops/sec x word length / $

Processor Limit: DRAM Gap
[Figure: performance (log scale, 1 to 1000) vs. year, 1980-2000. CPU curve: µProc 60%/yr (“Moore’s Law”). DRAM curve: 7%/yr. Processor-memory performance gap grows 50%/year.]
• Alpha 21264 full cache miss in instructions executed: 180 ns/1.7 ns = 108 clks x 4, or 432 instructions
• Caches in Pentium Pro: 64% area, 88% transistors
• *Taken from Patterson-Keeton talk to SigMod
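Both numbers on this slide follow from simple ratios. The sketch below reproduces them; the function names are illustrative, and the 600 MHz clock is an assumption made here to explain the slide's 108 clocks (the rounded 1.7 ns cycle gives about 106).

```python
def gap_growth(cpu_growth: float, dram_growth: float) -> float:
    """Annual growth of the processor-memory gap, as a fraction."""
    return (1.0 + cpu_growth) / (1.0 + dram_growth) - 1.0


def miss_cost(miss_ns: float, cycle_ns: float, issue_width: int):
    """Return (clocks lost, instruction slots lost) for one full cache miss."""
    clocks = miss_ns / cycle_ns
    return clocks, clocks * issue_width


if __name__ == "__main__":
    # 60%/yr processor growth vs. 7%/yr DRAM growth, as on the slide.
    print(f"gap grows {gap_growth(0.60, 0.07):.1%} per year")

    # Assumed 600 MHz clock (cycle = 1000/600 ns), 4 instructions per clock.
    clocks, slots = miss_cost(180.0, 1000.0 / 600.0, 4)
    print(f"{clocks:.0f} clocks, {slots:.0f} instruction slots per miss")
```

The ratio 1.60/1.07 comes to just under 1.5, which is where the "grows 50% / year" label comes from.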

Gordon B. & Greg P.: Trading places, or why I switched from SMPs to clusters
“Miles’ Law: where you stand depends on where you sit.”
1993
GB: SMP and DSM inevitability after 30 years of belief in/building mPs
GP: multicomputers a la CM5
2000+
GB: commodity clusters, improved log(p)
GP: SMPs => DSM

GB with NT, Compaq, HP cluster

AOL Server Farm

WHY BEOWULFS?
• best price performance
• rapid response to technology trends
• no single-point vendor
• just-in-place configuration
• scalable
• leverages large software development investment
• mature, robust, accessible
• user empowerment
• meets low expectations created by MPPs
from Thomas Sterling

IT'S THE COST, STUPID
• $28 per sustained MFLOPS
• $11 per peak MFLOPS
from Thomas Sterling
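The two prices imply a sustained-to-peak efficiency, since the system cost cancels out. This is a minimal sketch assuming both figures describe the same machine; the function name is illustrative.

```python
def sustained_fraction(cost_per_peak: float, cost_per_sustained: float) -> float:
    """Sustained/peak ratio implied by two price-performance figures.

    If one system costs C dollars, then C/peak = cost_per_peak and
    C/sustained = cost_per_sustained, so sustained/peak is their ratio.
    """
    return cost_per_peak / cost_per_sustained


if __name__ == "__main__":
    # $11 per peak MFLOPS and $28 per sustained MFLOPS, from the slide.
    print(f"implied efficiency: {sustained_fraction(11, 28):.0%}")
```

Under that assumption the quoted prices correspond to sustaining a bit under 40% of peak.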

Why did I trade places, i.e. switch to clustered computing?
• Economics: commodity components give a 10-100x advantage in price performance
  • Backplane-connected processors (incl. DSMs) vs board-connected processors
• Difficulty of making large SMPs (and DSMs)
  • Single system image… clearly needs more work
• SMPs (and DSMs) are NOT scalable in:
  • size. All have very lumpy memory access patterns
  • reliability. Redundancy and fault tolerance are required.
  • cross-generation. Every 3-5 years start over.
  • spatial. Put your computers in multiple locations.
• Clusters are the only structure that scales!

Technical users have alternatives (making the market size too small)
• PCs work fine for smaller problems
• “Do it yourself clusters”, e.g. Beowulf, work!
• MPI, PVM, Linda: programming models don’t exploit shared memory… are they the lowest common denominator (lcd)?
• ISVs have to use the lcd to survive
• SMPs are expensive. Parallelization is limited.
• Clusters are required for scalabilities or for apps requiring extraordinary performance ...so DSM only adds to the already complex parallelization problem
• Non-U.S. users buy SMP vectors for capacity for legacy apps, until cluster-ready apps arrive

C1999 Clusters of computers. It’s MPP when processors/cluster >1000

Who              ΣP.pap   ΣP     P.pap   ΣP.pap/C   Σp/C   ΣMp/C   ΣM.s
                 T.fps    #.K    G.fps   G.fps      #      GB      TB
LLNL (IBM)       3.9      5.9    .66     5.3        8      2.5     62
LANL (SGI)       3.1      6.1    .5      64         128    32      76
Sandia (Intel)   2.7      9.1    .3      .6         2      -
Beowulf                          0.5     2.0        4
Fujitsu          1.2      .13    9.6     9.6        1      4-16
NEC              4.0      .5     8       128        16     128
ESRDC            40       5.12   8       64         8      16
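The columns of this table are related by simple arithmetic: peak per processor is total peak divided by processor count, and peak per computer is peak per processor times processors per computer. The sketch below recomputes both for the rows that give all three inputs. The column reading used here (total Tflops, processors in thousands, processors per computer) is an interpretation of the flattened slide, and the names are illustrative.

```python
# (total peak in Tflops, processors in thousands, processors per computer)
# Values copied from the slide; the column assignment is an assumption.
SITES = {
    "LLNL": (3.9, 5.9, 8),
    "LANL": (3.1, 6.1, 128),
    "Sandia": (2.7, 9.1, 2),
    "NEC": (4.0, 0.5, 16),
    "ESRDC": (40.0, 5.12, 8),
}


def per_processor_gflops(site: str) -> float:
    """Peak Gflops per processor: Tflops / K-processors gives Gflops."""
    tflops, kprocs, _ = SITES[site]
    return tflops / kprocs


def per_computer_gflops(site: str) -> float:
    """Peak Gflops per computer (node)."""
    _, _, procs_per_computer = SITES[site]
    return per_processor_gflops(site) * procs_per_computer


if __name__ == "__main__":
    for name in SITES:
        print(f"{name:7s} {per_processor_gflops(name):6.2f} Gflops/proc "
              f"{per_computer_gflops(name):7.1f} Gflops/computer")
```

The recomputed values agree with the slide's rounded P.pap and ΣP.pap/C entries to within rounding, which supports this reading of the columns.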

Commercial users don’t need them
• Highest growth is & will be web servers delivering pages, audio, and video
• Apps are inherently, embarrassingly parallel
• Databases and TP are parallelized and transparent
• A single SMP handles traditional apps
• Clusters are required for reliability, scalabilities

Questions for builders & users
Can we count on Moore’s Law continuation?
Vector vs scalar using commodity chips?
Clustered computing vs traditional SMPv?
Can MPP apps be written for scalable //ism?
Cost: How much time and money for apps?
Benefit/need: In time & cost of execution?
When will DSM occur or be pervasive?
Commodity, proprietary, or net interconnections?
VendorIX (or Linux) vs NT? Shrink-wrap supers?
When will computer science research & teach //ism?
Did the Web divert follow-through efforts and funding?
What’s the prognosis for gov’t leadership, funding?

The Physical Processor
• commodity, aka Intel micros
  • Does VLIW work better as a micro than it did as a mini at Cydrome & Multiflow?
• vector processor… abandoned or reborn?
• multiple processors per chip or
  • multi-threading
• FPGA chip-based special processors or other higher volume processors

What Is The Processor Architecture?
Clearly polarized as US vs Japan: VECTORS OR VECTORS
[Figure: two lineages of processor architecture]
Comp. Sci. view: MISC >> CISC; language directed; RISC; super-scalar; MTA; extra-long instruction word
Super computer view: RISC; VCISC (vectors); multiple pipes

Weather model performance

40 Tflops Earth Simulator R&D Center c2002

Mercury & Sky Computers - & $
• Rugged system with 10 modules ~ $100K; $1K/#
• Scalable to several K processors; ~1-10 Gflops/ft³
• 10 9U boards * 4 PPC750’s » 440 SPECfp95 in 1 ft³ (18.5 * 8 * 10.75”) … 256 Gflops/$3M
• Sky 384 Signal Processor, #20 on ‘Top 500’, $3M
[Images: Mercury VME Platinum System; Sky PPC daughtercard]

Russian Elbrus E2K
Who: E2K vs. Merced
Clock (GHz): 1.2 vs. 0.8
Spec i/fp: 135/350 vs. 45/70
Size, mm² (.18u): 126 vs. 300
Power: 35 vs. 60
PAP (Gflops): 10.2
Pin B/W (GB/s): 1.9
Cache (KB): 64/256
System ship: Q4 2001

Computer (P-Mp) system alternatives
• Node size: most cost-effective SMPs
  • Now 1-2 on a single board, evolving to 4-8
  • Evolves based on n processors per chip
• Continued use of single bus SMP “multi” with enhancements for perf. & reliability
• Large, backplane bus based SMPs provide a single system image for small systems, but are not cost or space efficient for use as a cluster component
• SMPs evolving to weak coherency DSMs

“Petaflops by 2010”
DOE Accelerated Strategic Computing Initiative (ASCI)

1994 Petaflops Workshop: c2007-2014. Clusters of clusters. Something for everyone

SMP             Clusters            Active Memory
400 P           4-40K P             400K P
1 Tflops*       10-100 Gflops       1 Gflops
400 TB SRAM     400 TB DRAM         0.8 TB embed
250 Kchips      60K-100K chips      4K chips
1 ps/result     10-100 ps/result

*100 x 10 Gflops threads
100,000 1 Tbyte discs => 100 Petabytes. 10 failures/day

Petaflops Disks: Just compute it at the source
• 100,000 1 Tbyte discs => 100 Petabytes
• 8 Gbytes of memory per chip
• 10 Gflops of processing per chip
• NT, Linux, or whatever O/S
• 10 Gbps network interface
• Result: 1.0 petaflops at the disks
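The totals on this slide are straightforward multiplications. The sketch below reproduces them, assuming one processing chip per disc and decimal units (1 PB = 1000 TB); both assumptions and all names are introduced here for illustration.

```python
DISCS = 100_000          # discs, from the slide
TB_PER_DISC = 1          # Tbytes per disc, from the slide
GFLOPS_PER_CHIP = 10     # Gflops per chip, from the slide
GB_PER_CHIP = 8          # Gbytes of memory per chip, from the slide
CHIPS_PER_DISC = 1       # assumption: one processing chip at each disc


def total_storage_pb() -> float:
    """Aggregate disc capacity in Petabytes."""
    return DISCS * TB_PER_DISC / 1000


def total_pflops() -> float:
    """Aggregate processing rate in Petaflops (1 Pflops = 1e6 Gflops)."""
    return DISCS * CHIPS_PER_DISC * GFLOPS_PER_CHIP / 1e6


def total_memory_tb() -> float:
    """Aggregate chip memory in Terabytes."""
    return DISCS * CHIPS_PER_DISC * GB_PER_CHIP / 1000


if __name__ == "__main__":
    print(total_storage_pb(), "PB of disc")
    print(total_pflops(), "Pflops at the discs")
    print(total_memory_tb(), "TB of chip memory")
```

The aggregate memory figure is a derived number, not one stated on the slide.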

HT-MT

HT-MT…
Mechanical: cooling and signals
Chips: design tools, fabrication
Chips: memory, PIM
Architecture: MTA on steroids
Storage material

Global clusters… a goal, challenge, possibility?
“Our vision ... is a system of millions of hosts… in a loose confederation. Users will have the illusion of a very powerful desktop computer through which they can manipulate objects.”
Grimshaw, Wulf, et al., “Legion”, CACM, Jan. 1997

Utilize in situ workstations!
• NOW (Berkeley) set sort record, decrypting
• Grid, Globus, Condor and other projects
• Need “standard” interface and programming model for clusters using “commodity” platforms & fast switches
• Giga- and tera-bit links and switches allow geo-distributed systems
• Each PC in a computational environment should have an additional 1GB/9GB!

In 2010 every organization will have its own petaflops supercomputer!
• 10,000 nodes in 1999, or 10x over 1987
• Assume 100K nodes in 2010
• 10 Gflops/10 GBy/1,000 GB nodes for low end c2010 PCs
• Communication is the first problem… use the network that will be >10 Gbps
• Programming is still the major barrier
• Will any problems or apps fit it?
• Will any apps exploit it?
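The petaflops claim and the node-count trend can be checked with a short calculation. The sketch below assumes compound annual growth and takes the slide's node counts at face value; the function names are illustrative.

```python
def annual_growth(factor: float, years: float) -> float:
    """Compound annual growth rate that yields `factor` over `years`."""
    return factor ** (1.0 / years) - 1.0


def aggregate_pflops(nodes: int, gflops_per_node: float) -> float:
    """Aggregate peak in Petaflops (1 Pflops = 1e6 Gflops)."""
    return nodes * gflops_per_node / 1e6


if __name__ == "__main__":
    # 10x more nodes from 1987 to 1999, and an assumed 10x more by 2010.
    print(f"1987-1999: {annual_growth(10, 1999 - 1987):.1%} per year")
    print(f"1999-2010: {annual_growth(10, 2010 - 1999):.1%} per year")
    # 100K nodes of 10 Gflops each, as assumed on the slide.
    print(aggregate_pflops(100_000, 10), "Pflops")
```

Both decades need node counts to grow a little over 20% per year, so the 2010 assumption extrapolates the 1987-1999 trend rather than accelerating it.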

The Grid: Blueprint for a New Computing Infrastructure
Ian Foster, Carl Kesselman (Eds), Morgan Kaufmann, 1999
• Published July 1998; ISBN 1-55860-475-8
• 22 chapters by expert authors including: Andrew Chien, Jack Dongarra, Tom DeFanti, Andrew Grimshaw, Roch Guerin, Ken Kennedy, Paul Messina, Cliff Neuman, Jon Postel, Larry Smarr, Rick Stevens, Charlie Catlett, John Toole, and many others
“A source book for the history of the future” -- Vint Cerf
http://www.mkp.com/grids
