
NERSC and Blue Planet


  1. NERSC and Blue Planet
  William T.C. Kramer, NERSC/LBNL
  May 28, 2003, NERSC User Group Meeting, Chicago IL

  2. What is Blue Planet
  • "Blue Planet" is a "science-driven" design process to develop systems that are simultaneously more effective for science and sustainable and cost-effective for vendors.
  • The white paper uses IBM as an example of what can be done with this process.
    • The process can be applied to a number of vendors.
  • Blue Planet is a new concept for a sustainable computer architecture that is more effective for science and engineering applications.
  • It is also a specific implementation, leveraging the IBM roadmap, that better balances scientific processing needs and commercial viability.
  • Described as an "ultrascale" system on the order of the Earth Simulator.
  • See http://www.nersc.gov/news/blueplanet.html and http://www.nersc.gov/news/ArchDevProposal.5.01.pdf

  3. Signposts of Change in HPC
  • In early 2002 there were several signposts signaling a fundamental change in HPC in the US. For NERSC they were:
    • Poor benchmark results for the NERSC workload in our latest procurement (March 2002)
    • Impressive early performance results from the Earth Simulator system (April 2002)
    • Increasing indications that commodity clusters would not be as easy, scalable, or cost-effective as first projected
    • Lack of progress in computer architecture research, evident at the Petaflops Workshop (WIMPS, February 2002)
  • Detailed evaluation of current and future processors and systems showed that:
    • System designers did not truly understand current and future scientific applications
    • Design target codes did not reflect current and future methods

  4. The Conclusion
  • The community has pursued the COTS approach to its logical extreme:
    • The commodity building block was the microprocessor, but is now the entire server (SMP).
    • Communications and memory bandwidth are not scaling with peak processor power.
    • Near-football-field-size computers consume megawatts of electricity.
  • The DOE Office of Science is the natural leader to address this and is making it a priority.
  • This is happening against the backdrop of:
    • Decreasing interest in HPC by some U.S. vendors
    • Further consolidation/reduction of the number of vendors
    • Reduced profitability and reduced technology investments
    • New ways to define "commodity"
  • So, we are in the middle of a fundamental change in the basic premises of the HPC market in the U.S.

  5. The Divergence Problem
  • The requirements of high performance computing for science and engineering and the requirements of the commercial market are diverging.
  • The commercial clusters-of-COTS-SMPs approach is no longer sufficient to provide the highest level of performance:
    • Lack of memory bandwidth (see the sketch below)
    • High interconnect latency
    • Lack of interconnect bandwidth
    • Lack of high performance parallel I/O
    • High cost of ownership for large-scale systems
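  A minimal sketch of the kind of kernel behind the memory-bandwidth complaint, assuming a STREAM-style triad (the array names and size are illustrative, not from the slides). Each iteration does two flops but moves 24 bytes, so delivered GB/s, not peak GF/s, bounds the loop:

      /* STREAM-style triad: bandwidth-bound, not compute-bound.      */
      /* Each iteration does 2 flops but moves 24 bytes (read b and   */
      /* c, write a), so sustained memory bandwidth sets the ceiling. */
      #define N (1 << 22)
      static double a[N], b[N], c[N];

      void triad(double scalar)
      {
          for (long i = 0; i < N; i++)
              a[i] = b[i] + scalar * c[i];
      }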

  6. The Divergence Problem (continued)
  • In response, NERSC, ANL, and IBM developed a Science-Driven Computer Architecture proposal.
  • It includes a new architecture co-defined with IBM, called Blue Planet.
  • "Creating Science-Driven Computer Architecture: A New Path to Scientific Leadership" - http://www.nersc.gov/news/ArchDevProposal.5.01.pdf
  • The process is being expanded with other vendors.

  7. Overall Goals
  • Restore American leadership in "capability" scientific computing by 2005.
  • Define a sustainable path for efficient scientific computing.
  • Focus on achieving high sustained performance rather than peak.
  • The first step in a long-term strategy:
    • Petaflop peak by the end of the decade with 40% sustained
  • An initial system with 2x the sustained performance of the ES at 50% of the cost:
    • On at least a modest number of strategic large applications
    • Sustained performance of 30-40% on key benchmarks
    • Needs to be 4x the peak performance of the current ES (assuming Moore's-law performance scaling): 160 TF peak performance (see the arithmetic below)
  • Phased delivery plan with the final system available 2H05-1Q06.
    • Assumes a significant funding profile can be developed
  • Low risk (build off the existing roadmap to the extent possible).
  • The full strategy has multiple solutions proposed: ANL with Blue Gene/L and ORNL with the Cray X1.
  • Proposed as a cooperative development effort between NERSC and IBM.
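  The 160 TF target follows from the Earth Simulator comparison. A sketch of the arithmetic; the 40.96 TF/s ES peak is the commonly cited figure and is an assumption here, not stated on the slide:

      \begin{align*}
      4 \times 40.96\ \mathrm{TF/s}\ (\text{ES peak}) &\approx 164\ \mathrm{TF/s} \;\Rightarrow\; 160\ \mathrm{TF}\ \text{peak target} \\
      0.30\text{--}0.40 \times 160\ \mathrm{TF/s} &= 48\text{--}64\ \mathrm{TF/s}\ \text{sustained}
      \end{align*}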

  8. Approach
  • Study applications critical to the DOE Office of Science and others. For example:
    • Materials science
    • Combustion simulation and adaptive methods
    • Computational astrophysics
    • Nanoscience (new drugs and also new microchip technologies)
    • Biochemistry and biosciences (protein folding/interactions)
    • Climate modeling
    • High energy physics (particle accelerators and astrophysics)
    • Multigrid eigensolvers and linear algebra methods
  • Identify the key bottlenecks found in these critical applications.
  • Outline a high-level approach to address the challenges.
  • Hold follow-up meetings for a detailed drill-down by IBM computer scientists and application scientists at NERSC.
  • Iterate on the proposed solution.

  9. Other Ideas for Consideration
  • Finely tuned libraries for FFT, FMA, and matrix ops (ESSL, PESSL)
  • Hardware acceleration engines for performance-critical ops, especially when software tuning is inhibited by other constraints
  • "MPI Lite": avoid some of the performance-inhibiting semantics that are seldom used
    • Trade-offs would be ordering rules and perhaps repeatability of results (see the sketch below)
  • Hardware acceleration engines for MPI (in processor and adapter)
  • The Unified Parallel C programming model and other new languages
  • Microkernel OS on compute nodes
  • Other OS enhancements for HPC
    • E.g., no paging of well-behaved applications
    • Better hooks for daemon control
  • Advanced cooling to address floor space
    • New CPUs' clock rates are limited by packaging and cooling, not chip technology
  • Compiler technologies for ViVA, VMX, etc.
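  A minimal sketch of the repeatability trade-off mentioned above, assuming a hand-rolled reduction (the pattern is illustrative; the slide does not define MPI Lite's exact semantics). Rank 0 accumulates partial sums in arrival order, and since floating-point addition is not associative the total can differ from run to run. Standard MPI already permits this across different senders; an MPI Lite that relaxed ordering rules further would widen the effect in exchange for a shorter protocol path:

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          double partial = 1.0 / (rank + 1);   /* stand-in partial result */

          if (rank == 0) {
              double total = partial, incoming;
              for (int i = 1; i < size; i++) {
                  /* Arrival order, not rank order: the sum depends on */
                  /* the network, so the result is not repeatable.     */
                  MPI_Recv(&incoming, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                  total += incoming;
              }
              printf("total = %.17g\n", total);
          } else {
              MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
          }
          MPI_Finalize();
          return 0;
      }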

  10. New Class of Computer Architectures for Science
  • Sustained, cooperative development of new computer architectures, engaging the scientists with the developers.
  • A focus on sustained performance of scientific applications, not on peak performance.
  • Addressing the key bottlenecks of bandwidth and latency for memory and processor interconnection.
  • A new investment in the computer science research and scientific research communities.
  • Cost matters:
    • If effective scientific supercomputing is only available at high cost, it will have impact on only a small part of the scientific community.
    • So we need to leverage the resources of mainstream IT companies like IBM, HP, and Intel.
    • And our national science policy should motivate them to participate durably.

  11. Full IBM Blue Planet System Components
  • New IH++ wide node: 8 CPUs per node
  • POWER5 GS processor, 2.5 GHz, single-core MCM
  • 2048-node system (8 nodes per frame)
  • 16K processors @ 10 GF per CPU = 160 TF peak (see the arithmetic below)
  • Virtual Vector Architecture (ViVA)
  • Federation switch, 3-stage topology
    • 8 GB/s unidirectional communication bandwidth per server
  • 40-50 TF sustained on 2-3 selected applications
  • 256 TB of memory = 16 GB per CPU
    • May be reduced to 128 TB of memory if it can sustain full memory bandwidth
  • 2.5 PB of disk in the I/O system (approximately 48 I/O nodes)
  • Approximately 600 frames: 256 compute racks, 250 disk racks, 160 switch racks
  • 12,000-15,000 sq ft; 5-7 MW of power
  • Scientists will focus on application optimization
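  As a quick consistency check, the headline numbers all follow from the bullets above (binary units for memory; nothing here is new information, only arithmetic):

      \begin{align*}
      \text{CPUs}   &= 2048\ \text{nodes} \times 8 = 16{,}384 \\
      \text{Peak}   &= 16{,}384 \times 10\ \mathrm{GF/s} = 163.84\ \mathrm{TF/s} \approx 160\ \mathrm{TF} \\
      \text{Memory} &= 16{,}384 \times 16\ \mathrm{GB} = 2^{18}\ \mathrm{GB} = 256\ \mathrm{TB} \\
      \text{Racks}  &= 2048\ \text{nodes} \div 8\ \text{nodes/frame} = 256\ \text{compute racks}
      \end{align*}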

  12. Blue Planet: A Conceptual View
  • Increasing memory bandwidth, single core: eight single-core CPUs are matched to the memory address-bus limits so that each sees full memory bandwidth.
  • Increasing switch bandwidth, 8-way nodes: decreased switch latency while increasing span.
  • Enabling a vector programming model inside each SMP node (see the sketch below).
  • Sustained performance on science applications at a sustainable cost and development model.
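  One way to read "a vector programming model inside each SMP node" is that wide data-parallel loops are striped across the node's eight CPUs as if they were lanes of a single virtual vector unit. A minimal sketch, using OpenMP purely for illustration; the slides do not name a specific programming interface:

      /* One logical vector operation, y = y + a*x, spread across the */
      /* node's CPUs.  The OpenMP pragma is an illustrative stand-in  */
      /* for whatever interface ViVA would actually expose.           */
      void vaxpy(long n, double a, const double *x, double *y)
      {
          #pragma omp parallel for
          for (long i = 0; i < n; i++)
              y[i] += a * x[i];
      }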

  13. Issues Under Study
  • Issues covered:
    • Memory bandwidth, especially for small scattered accesses (see the sketch below)
    • MPI communication latency: protocol path-length overhead hurts performance (UPC, for example, is very sensitive to this overhead)
    • MPI collective communication performance
    • MPI I/O performance
    • MPI scaling to a large number of tasks
  • New issues:
    • Microkernel option
    • ViVA follow-up with the compiler team
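  The "small scattered accesses" concern is about indexed (gather/scatter) loops: each load may touch a different cache line, so delivered bandwidth collapses relative to the contiguous triad sketch earlier. A minimal illustration (names and types are hypothetical):

      #include <stddef.h>

      /* Indexed gather: irregular accesses defeat prefetching, so   */
      /* the loop is bound by memory latency per element rather than */
      /* streaming bandwidth.                                        */
      double gather_sum(size_t n, const double *src, const size_t *idx)
      {
          double sum = 0.0;
          for (size_t i = 0; i < n; i++)
              sum += src[idx[i]];
          return sum;
      }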

  14. Progress Already
  • Additional changes to the POWER5 CPU
  • Scaling the switch larger than first planned
  • Software performance changes
  • Close to committing to a smaller, more memory-intensive node

  15. Progress and Status
  • LLNL/ASCI has become very interested in Blue Planet.
  • Roughly eight meetings with IBM and LLNL to develop the ideas and narrow down design choices:
    • CPU design
    • Node design
    • Switch/interconnect
    • Software
    • System level
    • Libraries
    • Special devices
    • Performance modeling

  16. Node Discussion
  • Large SMP vs. small SMP: impacts switch scaling and hence cost
  • Partitioned vs. not partitioned: impacts memory latencies and switch scaling
  • MCM-based vs. DCM-based nodes: impacts cost and performance
  • Memory latency: performance sensitivity (see the sketch below)
  • Memory bandwidth: STREAM performance sensitivity
  • Cost
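  Memory-latency sensitivity is classically measured with a pointer chase, in which each load depends on the previous one, so no prefetch or overlap can hide the round trip to memory (unlike the bandwidth-bound STREAM case). A minimal sketch, assuming `next` has already been initialized to a random cycle over the table (setup omitted):

      #include <stddef.h>

      /* Serialized, dependent loads: the time per iteration of this */
      /* loop approximates the memory latency under discussion.      */
      size_t chase(const size_t *next, size_t start, long steps)
      {
          size_t p = start;
          while (steps-- > 0)
              p = next[p];
          return p;
      }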

  17. Interconnect Discussion
  • Importance of bisection bandwidth:
    • Collectives vs. point-to-point
    • Application classification
  • Fat tree (multi-stage networks) vs. 3D-mesh designs: is a 3D mesh or hypercube-based approach cost-effective? (See the comparison below.)
  • Sensitivity to communication patterns: need to model the communication patterns against the corresponding bottlenecks
  • What to prioritize: global bandwidth versus local bandwidth
  • InfiniBand switch costs: depend on the scale of the system
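  The fat-tree vs. 3D-mesh trade largely comes down to how bisection bandwidth scales with node count P. These are standard results, not from the slide: a full fat tree keeps bisection proportional to P, while a 3D torus of P = k^3 nodes with per-link bandwidth b grows only as P^{2/3}, so the mesh is cheaper to scale but bisection-hungry communication patterns pay for it:

      B_{\text{fat tree}} = \frac{P}{2}\,b, \qquad
      B_{\text{3D torus}} = 2k^2 b = 2P^{2/3} b

  For P = 2048 nodes this works out to 1024 b versus roughly 322 b.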

  18. Progress Already (continued)
  • System software:
    • Next on the list for consideration
    • Currently IBM is most interested in a limited, rather than full-blown, system software stack
    • Studying microkernels and minimal OSs
  • Modeling:
    • IBM Research, NCAR, SDSC, PERC, NERSC, LLNL
    • Developing tools and methods

  19. Other Progress
  • Other sites are very interested: over 30 sites have asked IBM for a briefing, and a number have asked to participate in discussions.
  • Blue Planet is the basis for several other activities:
    • Much of the basis for the recent white paper for the HECRTF
    • Continued discussions with DOE/SC
  • Having in-depth discussions with SGI, Intel, Cray, and HP; SGI has some interesting plans based on DARPA HPCS.
  • Considering holding a workshop on Blue Planet.

  20. Summary
  • Think of Blue Planet as a new process as well as a single instantiation of a computer architecture.
  • Waiting for vendors to produce a product, and then evaluating and purchasing, will only increase the divergence:
    • Products are already designed to a different design point.
    • Ideas for new, sustainable architectures are sparse.
    • Commodity clustering has more than reached its limits of effectiveness.
  • We need a modified approach to improve this situation for capability science.
  • Lou Gerstner's book was titled "Who Says Elephants Can't Dance?" Blue Planet is a collaborative effort by some users and some parts of IBM, Cray, SGI, and other vendors to do some fancy dancing for scientific computing.
