Can Commodity Linux Clusters Scale to Petaflops? P. Beckman
Refining The Question
• A petaflop is 1E+15 floating-point operations per second as reported by the Top500 (not theoretical peak)
• What is commodity?
  • Beowulf Classic: the Computer Shopper catalog. Not a great definition, but the intent is right.
  • How many suppliers thrive?
  • How bad would it be if you could no longer easily buy a part or get the support you want?
  • What can I fix or modify myself?
• HW: usually x86 servers; SW: Linux
Refining The Question (part 2)
• What is a cluster?
  • A "box" can be sold separately, and usually is
  • Each box runs a complete OS and can run standalone
  • Capacity is expanded by adding another box
  • Is the SP2 a cluster? Is the Earth Simulator a cluster?
• A Petaflop-scale commodity cluster: a large collection of interconnected boxes running Linux and achieving a petaflop on the Top500
Is It Possible?
• Maybe not "if", but "when"
• Could we do it now? If not, how soon?
Home-grown Clusters
• June 1997: the Berkeley Sun (Solaris) cluster is the first cluster to make the Top500
• June 1998: the first Linux cluster debuts on the Top500
• By June 2002, 10% of all machines on the Top500 are Linux clusters!
• However, Linux represents only 7.5% of the aggregate performance on the Top500
• Two possible conclusions…
[Chart] The Expansion of Linux in the Top500 (highlighting the Berkeley NOW Solaris cluster)
[Chart] How Efficient Are Linux Clusters? (Cplant; Aramco, 2048, Ethernet)
[Chart] Ouch! (scaling could be a problem) (Earth Simulator)
[Chart] Delivered Performance/CPU
[Chart] How Far Behind is Commodity? (Earth Simulator: 7 GF/CPU)
[Chart] Are Linux Clusters Keeping Up? Comparing Apples to Oranges (really, money spent)
[Chart] Compared To What? Excelling at mediocrity
Observations?
• Clusters are the most popular HPC platform
• Linux is expanding fastest for HPC, but mostly in the mid and lower tiers of the Top500. Why?
• CPU efficiency is mediocre
  • A 1K-node PIII cluster is only 67% efficient
• CPU cost effectiveness?
  • NCSA Itanium: 2.11 GF/CPU
  • P4 Xeon: 2.05 GF/CPU
  • AMD: 2.0 GF/CPU (3 years behind the Earth Simulator?)
Petaflops Now?
• Assume 2 GF/CPU Linpack and no loss for scaling (obviously wrong)
• A petaflop Linux cluster would then require about 500,000 processors
• Al Geist: "The next generation of Peta-scale computers are being designed with 50,000 to 100,000 processors"
• Ignoring power, wiring plans, and interconnection networks, how big is it? (disagreeing with Thomas)
Commodity is shrinking
• Example board: VIA C3 800 MHz CPU, 100/133 MHz FSB, 1 GB RAM, PCI-card form factor, 17 cm (6.7 in)
• Special "blades" not required
• New form factors can achieve approx. 528 nodes per rack (sans management… argh!)
• Each rack needs ~12 ft² of floor space (including room to move the rack)
• 500K CPUs require about 11.3K ft² (see the back-of-envelope sketch below)
  • Not the Nimitz, but simply one former dot-com office space in the Bay Area
• Cost? At $3K/node, that is $1.5B (one big black plane)
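The floor-space and cost figures follow from simple arithmetic. Here is a minimal back-of-envelope sketch in C, using the talk's stated assumptions (2 GF/CPU Linpack, 528 nodes per rack, ~12 ft² per rack, $3K per node); the constants are assumptions from the slides, not measurements:

```c
/* Back-of-envelope sizing for a petaflop commodity cluster,
 * using the talk's assumed figures (not measured values).      */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double petaflop    = 1e15;                 /* flop/s               */
    const double gf_per_cpu  = 2e9;                  /* Linpack flop/s / CPU */
    const double nodes       = petaflop / gf_per_cpu; /* = 500,000           */

    double racks     = ceil(nodes / 528.0);          /* ~947 racks           */
    double floor_ft2 = racks * 12.0;                 /* ~11.4K sq ft         */
    double cost      = nodes * 3000.0;               /* $1.5B                */

    printf("nodes: %.0f  racks: %.0f  floor: %.0f ft^2  cost: $%.2fB\n",
           nodes, racks, floor_ft2, cost / 1e9);
    return 0;
}
```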
But Would It Work? No.
• Buying the hardware is not the bottleneck; the system software won't scale
• Software is so hard and costly that spending lots and lots more money won't show immediate effects
Silly examples (part 1/2)
• mpicc myprog.c; mpirun a.out
• For a 10 MB executable over 100BT Ethernet, it would take 8.3 days to start your job (rough arithmetic sketched below)
  • Or: $6.8 million of machine lifetime (5 years)
• With 2 Gbit Myrinet, it would take about 7 hrs
  • Or: $240K of machine lifetime (5 years)
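A minimal sketch of the job-launch arithmetic, assuming (hypothetically) that a single head node pushes the 10 MB binary to each of 500,000 nodes in sequence at peak wire speed. At peak this gives roughly 4.6 days and 5.6 hours; the slide's larger figures (8.3 days, ~7 hrs) presumably assume lower effective throughput, but the orders of magnitude agree. The $1.5B / 5-year machine lifetime is the assumption from the previous slide:

```c
/* Serial job-launch cost sketch: one head node copies the executable
 * to every node, one at a time, at the given bandwidth.              */
#include <stdio.h>

static void launch_cost(const char *net, double bytes_per_sec) {
    const double nodes     = 500e3;                   /* processors needed   */
    const double exe_bytes = 10e6;                    /* 10 MB executable    */
    const double machine   = 1.5e9;                   /* $1.5B capital cost  */
    const double lifetime  = 5.0 * 365.25 * 86400.0;  /* 5 years, in seconds */

    double t = nodes * exe_bytes / bytes_per_sec;     /* total launch time   */
    printf("%-12s %6.2f days  ($%.2fM of machine lifetime)\n",
           net, t / 86400.0, machine * t / lifetime / 1e6);
}

int main(void) {
    launch_cost("100BT",   100e6 / 8);   /* ~4.6 days at peak wire speed     */
    launch_cost("Myrinet", 2e9   / 8);   /* ~0.23 days (~5.6 hours) at peak  */
    return 0;
}
```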
Silly examples (part 2/2)
• mpicc hello_world.c; mpirun a.out
• Probably 1 million socket connections would be required; Linux scales to a couple thousand
• A recv() where the "master" node collects 1000 floating-point numbers from each node would require 3.8 GB of RAM (see the gather sketch below)
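A hedged sketch of the second example: the slide only says recv(), so expressing "the master collects 1000 floating-point numbers from each node" as an MPI_Gather is an interpretation, and the names and counts are illustrative. Assuming 8-byte doubles and 500,000 ranks, the root's receive buffer alone is about 500,000 × 1000 × 8 bytes ≈ 3.7 GiB:

```c
/* Sketch: root collects 1000 doubles from every rank with MPI_Gather.    */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define VALUES_PER_NODE 1000

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local[VALUES_PER_NODE] = {0};   /* each node's contribution      */
    double *all = NULL;

    if (rank == 0) {
        /* With nprocs = 500,000 this buffer alone is
         * 500,000 * 1000 * 8 bytes, roughly 3.7 GiB on the master node.    */
        all = malloc((size_t)nprocs * VALUES_PER_NODE * sizeof(double));
    }

    MPI_Gather(local, VALUES_PER_NODE, MPI_DOUBLE,
               all,   VALUES_PER_NODE, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("gathered %.2f GiB on the master\n",
               (double)nprocs * VALUES_PER_NODE * sizeof(double) / (1u << 30));
        free(all);
    }
    MPI_Finalize();
    return 0;
}
```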
Cluster Sizing Rule of Thumb
• System software (Linux, MPI, filesystems, etc.) scales from 64 nodes to at most 2048 nodes for most HPC applications
  • Max socket connections
  • Direct-access message tag lists & buffers
  • NFS / storage-system clients
  • Debugging
  • Etc.
• It is probably hard to rewrite MPI and all Linux system software for O(100,000)-node clusters
Eliminating Flat Could Help, But It Must Be Nearly Transparent
• A lesson from the IP address-space crisis about 5 years ago:
  • Nearly every workstation at every institution had a real, globally routable IP address
  • IP space became difficult to find, and there were fears of running out
  • Router tables were growing too big
  • NAT (Network Address Translation, a.k.a. IP masquerading) came to the rescue
  • Large institutions now use a few global IP addresses instead of whole class B networks
Maybe A Similar Technique Could Apply To Large Clusters & The Grid
• Currently, firewalls and NATing make Grid computing nearly impossible
• To scale Grids and clusters to 100K nodes, we may want something like Grid/Cluster NAT
• Hypothesis: nearly transparent Grid/MPI NAT translation may let system software scale to 100K nodes… Flat is bad.
Waiting For Commodity Linux Cluster Petaflops
• If system software is most likely to work at about 1000 nodes, then to reach a petaflop each node must deliver a teraflop (Linpack)
• Going from 2 GF nodes now to 1 TF nodes will probably take 12 to 15 years (projection sketched below)
• Large SMPs could shave some time
• Conclusion:
  • Petaflop Linux cluster: 10-15 years "naturally"
  • Solve the scalable system software issues, and we can reduce that time by buying more nodes (32K nodes in 6-7 years)
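A rough projection behind the "12 to 15 years" and "6-7 years" estimates, assuming node Linpack performance doubles every 18 months (a Moore's-law-style assumption not stated explicitly on the slide) starting from today's 2 GF nodes:

```c
/* Years until a node reaches a target Linpack rate, assuming an
 * 18-month doubling time (an assumption, not a figure from the talk). */
#include <stdio.h>
#include <math.h>

static double years_to_reach(double target_gf, double now_gf) {
    return 1.5 * log2(target_gf / now_gf);   /* 1.5 years per doubling */
}

int main(void) {
    /* 1000-node cluster: each node must deliver 1 TF Linpack.         */
    printf("1K nodes @ 1 TF/node   : %.1f years\n", years_to_reach(1000.0, 2.0));
    /* 32K-node cluster: each node needs ~30.5 GF (1 PF / 32768).      */
    printf("32K nodes @ ~30 GF/node: %.1f years\n",
           years_to_reach(1.0e6 / 32768.0, 2.0));
    return 0;
}
```

With these assumptions the first case works out to about 13.4 years and the second to about 5.9 years, consistent with the ranges quoted on the slide.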