Cray XT3, designed with high-performance processors and efficient system management, offers a balanced and reliable solution for a wide range of scientific applications.
Cray XT3
John Levesque, Director, Cray's Supercomputing Center of Excellence
Cray Red Storm • Post SGI, Cray's MPP program was re-established through the Red Storm development contract with Sandia • Key system characteristics: • Massively parallel system: 10,000 AMD 2 GHz processors • High bandwidth mesh-based custom interconnect • High performance I/O subsystem • Fault tolerant • Full system delivered in 2004 • Designed to quadruple in size to 200 Tflops • "We expect to get substantially more real work done, at a lower overall cost, on a highly balanced system like Red Storm than on a large-scale cluster." (Bill Camp, Sandia Director of Computers, Computation, Information and Mathematics)
Current Cray XT3 Installs (over 400 Cabinets sold) • Sandia (131 Cabinets, 4 Cabinets, 31 additional Cabinets currently being installed) • ORNL (56 Cabinets, 1 Cabinet, 68 Hood Cabinets on Order) • CSCS (12 Cabinets, 2 Single Cabinets, 6 additional Cabinets being installed) • PSC (22 Cabinets, 1 Cabinet) • SS Customers (Total of 11 Cabinets) • U of Tokyo (1 Cabinet) • JAIST (4 Cabinets) • ERDC (44 Cabinets, 1 Cabinet) • U of Western Australia (3 Cabinets) • AHPCRC (1 Cabinet, 5 Cabinets being installed) • AWE (42 Cabinets, 1 Cabinet)
Leadership Class Computing • Cray-ORNL Selected by DOE for National Leadership Computing Facility (NLCF) • Goal: Build the most powerful capability supercomputer in the world • Petaflop capability by 2008 • 100-500 TF sustained performance on challenging scientific applications • Steps: • 6.5 TF (512p) X1 • 18 TF (1,024p) X1E • 25 TF XT3 • 100 TF Capability System planned • Focused on capability computing • Available across government, academia, and industry • Including biology, climate, fusion, materials, nanotech, chemistry • Open scientific research
Cray XT3 System
Recipe for a good MPP • Select Best Microprocessor • Surround it with a balanced or "bandwidth rich" environment • "Scale" the System • Eliminate Operating System Interference (OS Jitter) • Design in Reliability and Resiliency • Provide Scalable System Management • Provide Scalable I/O • Provide Scalable Programming and Performance Tools • System Service Life (provide an upgrade path)
Select the Best Processor • We still believe this is the AMD Opteron • Cray performed an extensive microprocessor evaluation between Intel and AMD during the summer of 2005 • AMD was selected as the microprocessor partner for the next generation MPP • AMD's current 90 nm processors compare well in benchmarks with Intel's new 65 nm Woodcrest (Linpack is an exception that is addressed in the quad-core timeframe)
Recipe for a good MPP • Select Best Microprocessor • Surround it with a balanced or "bandwidth rich" environment • "Scale" the System • Eliminate Operating System Interference (OS Jitter) • Design in Reliability and Resiliency • Provide Scalable System Management • Provide Scalable I/O • Provide Scalable Programming and Performance Tools • System Service Life (provide an upgrade path)
Cray XT3/Hood Processing Element: Measured Performance [node diagram with measured sustained bandwidths of 2-4 GB/sec, 5 to 8.5 GB/sec at ~50 ns latency, and 6.5 GB/sec; six network links, each >3 GB/s x 2 (7.6 GB/sec peak per link)]
Bandwidth Rich Environment: Measured Local Memory Balance [bar chart of Memory/Computation Balance (B/F), scale 0.00 to 2.00, comparing IBM pSeries 690, Intel Xeon, IBM p5 595, SGI Altix 3700, IBM Blue Gene, Cray XT3 2.6 GHz, Cray XT3 2.6 GHz DC, and Cray Hood 2.6 GHz DC with 667 MHz DDR2]
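The balance metric in the chart above is bytes of sustained memory bandwidth per flop. As a hedged illustration of how such a number can be measured (this is not the benchmark behind the chart, and the array size is arbitrary), a STREAM-style triad loop in C yields a sustained bandwidth figure that can be divided by the socket's peak Gflop/s to give a bytes-per-flop balance:

/* Minimal STREAM-triad-style sketch for estimating sustained local memory
   bandwidth; dividing the result by peak Gflop/s gives a B/F balance.
   Illustrative only, not the benchmark used for the chart above. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)    /* ~16M doubles per array, far larger than any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];              /* triad: 2 flops, 24 bytes moved */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec    = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gbytes = 24.0 * N / 1e9;            /* 3 arrays x 8 bytes per element */
    printf("triad bandwidth: %.2f GB/s (check %.1f)\n", gbytes / sec, a[N - 1]);

    free(a); free(b); free(c);
    return 0;
}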
Providing a Bandwidth Rich Environment: Measured Network Balance (bytes/flop) • Network bandwidth is the maximum bidirectional data exchange rate between two nodes using MPI • sc = single core, dc = dual core
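Since the slide defines network bandwidth as the maximum bidirectional exchange rate between two nodes using MPI, here is a minimal sketch of such a measurement, assuming a generic MPI library and made-up message size and repetition count (this is not Cray's benchmark code):

/* Hedged sketch of a two-node bidirectional bandwidth test using MPI.
   Both ranks send and receive simultaneously (Irecv/Isend), so the
   reported figure is a bidirectional exchange rate. Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* 4 MB messages */
#define REPS      100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;              /* run with exactly 2 ranks, one per node */

    char *sendbuf = malloc(MSG_BYTES);
    char *recvbuf = malloc(MSG_BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        MPI_Request req[2];
        MPI_Irecv(recvbuf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }
    double sec = MPI_Wtime() - t0;

    if (rank == 0) {
        double bytes = 2.0 * (double)MSG_BYTES * REPS;   /* sent plus received */
        printf("bidirectional bandwidth: %.2f GB/s\n", bytes / sec / 1e9);
    }

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}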
Recipe for a good MPP • Select Best Microprocessor • Surround it with a balanced or "bandwidth rich" environment • "Scale" the System • Eliminate Operating System Interference (OS Jitter) • Design in Reliability and Resiliency • Provide Scalable System Management • Provide Scalable I/O • Provide Scalable Programming and Performance Tools • System Service Life (provide an upgrade path)
Scalable Software Architecture: UNICOS/lc ("Primum non nocere") • Microkernel on Compute PEs, full-featured Linux on Service PEs • Specialized Linux nodes: Compute PE, Login PE, Network PE, System PE, I/O PE (Compute Partition / Service Partition) • Contiguous memory layout used on compute processors to streamline communications • Service PEs specialize by function • Software architecture eliminates OS "jitter" • Software architecture enables reproducible run times • Large machines boot in under 30 minutes, including the filesystem • Job launch time is a couple of seconds on 1000s of PEs
Scalable Software Architecture: Why it Matters for Capability Computing [chart: NPB MG result, standard Linux vs. microkernel] Results of a study by Ron Brightwell, Sandia National Laboratories, comparing a lightweight kernel vs. Linux on the ASCI Red system
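One way to see the interference the microkernel is designed to eliminate is to time the same fixed amount of work many times on a node and look at the spread: with OS jitter, occasional slow iterations appear, and at scale the slowest processor gates every synchronization. A hedged sketch of such a probe (not a Cray- or Sandia-supplied tool; trial counts and workload are arbitrary):

/* Hedged sketch of a fixed-work timing probe: repeated identical work
   should take nearly identical time on a jitter-free compute node.
   Large outliers indicate OS interference. Illustrative only. */
#include <stdio.h>
#include <time.h>

#define TRIALS 1000
#define WORK   2000000

int main(void)
{
    double min = 1e30, max = 0.0;
    volatile double x = 1.0;

    for (int t = 0; t < TRIALS; t++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < WORK; i++)
            x = x * 1.0000001 + 0.0000001;   /* fixed floating-point work */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (sec < min) min = sec;
        if (sec > max) max = sec;
    }

    /* On a quiet microkernel node max/min stays close to 1; on a busy
       full-OS node the ratio grows, and the slowest rank gates the job. */
    printf("min %.6f s  max %.6f s  ratio %.3f\n", min, max, max / min);
    return 0;
}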
Cray XT3: Reliable Packaging at Scale [compute blade diagram: blade control processor, embedded HyperTransport links, blade backplane connector (>100 GB/sec), 4 DIMM slots with Chipkill, redundant VRMs, four Cray SeaStar™ chips; currently supports 1, 2, 3, 4, 5, 6, or 8 GB of memory per socket]
Cray XT3 Service and I/O Blade [blade diagram: blade control processor, four Cray SeaStar™ chips, 2 PCI-X slots, AMD 8131 PCI-X bridge]
Cray XT3 Compute Cabinet • Cabinets are 1 floor tile wide • Cold air is pulled from the floor space • Room can be kept at a comfortable temperature [cabinet diagram: power distribution unit, power supply rack, enclosed compute modules, 24-module cage assembly, industrial variable-speed blower; pre-prototype cabinet shown]
Cray XT3 Reliability Features • Simple, microkernel-based software design • Redundant power supplies and voltage regulator modules (VRMs) • Small number of moving parts • Limited surface-mount components • All RAID devices connected with dual paths to survive controller failure • SeaStar engineered to provide a reliable interconnect
Lustre File System • Cray systems use Lustre for high-performance, parallel I/O • XT3, XD1, and Hood now; Black Widow and Baker in the future • Focus on stability, scalability, performance, and resiliency • Lustre is "open source" software, supplied by Cluster File Systems, Inc. (CFS) • Cray integrates Lustre into the Cray software stack • Cray supports Lustre through field service, SPS, and R&D • CFS supports Cray with bug fixes and new features; some parts of Lustre are essentially custom to Cray, such as Catamount support • Lustre began to stabilize on the XT3 last fall • Stability and performance have continued to improve since then
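As a hedged illustration of the parallel I/O pattern Lustre is built to serve, the sketch below has every MPI rank write its own contiguous block of one shared file through MPI-IO; the file path and block size are invented for the example and are not Cray defaults:

/* Hedged sketch: each MPI rank writes one contiguous block of a single
   shared file via MPI-IO, the usual pattern for a parallel file system
   such as Lustre. File name and sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK_DOUBLES (1 << 20)   /* 1M doubles (8 MB) per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *block = malloc(BLOCK_DOUBLES * sizeof *block);
    for (int i = 0; i < BLOCK_DOUBLES; i++)
        block[i] = rank + i * 1e-6;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/example.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the writes proceed in parallel
       across the file system's storage targets. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK_DOUBLES * sizeof(double);
    MPI_File_write_at_all(fh, offset, block, BLOCK_DOUBLES, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(block);
    MPI_Finalize();
    return 0;
}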
Cray-CFS Partnership • Quarterly Business Reviews • Program review and business discussions • Biweekly architecture calls • Discuss technical requirements and future plans • Weekly technical calls • Discuss and track progress on bugs and direct CFS efforts • Engineering interaction with CFS developers • Daily communication through bugzilla, email, IRC... • Other meetings when appropriate • Peter Braam attended Red Storm Quarterly Review in April • Test team summit (Cray and CFS test teams) met in May • Cray at Lustre Users Group; CFS at Cray Users Group • Cray developers at Lustre Internals training class in June
Programming Environment • Portland Group (PGI) compilers • Pathscale Compilers • GNU Compilers • High Performance MPICH 2.0 • Shmem Library • AMD Math Libraries (ACML 2.6) • CrayPat & Apprentice2 performance tools • Etnus TotalView debugger available
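To show how one of these components is used, here is a minimal SHMEM example in C; it is hedged in that the header path and calls follow the classic SHMEM interface and are not quoted from Cray documentation. Each PE performs a one-sided put of a value into its neighbor's symmetric array.

/* Hedged SHMEM sketch (header path and calls assumed from the classic
   SHMEM interface): each PE writes one value into the symmetric array
   of the next PE, then all PEs synchronize. */
#include <mpp/shmem.h>
#include <stdio.h>

static long received[1];          /* symmetric: same address on every PE */

int main(void)
{
    start_pes(0);                 /* initialize SHMEM */
    int me   = _my_pe();
    int npes = _num_pes();
    int next = (me + 1) % npes;

    long value = (long)me;
    shmem_long_put(received, &value, 1, next);   /* one-sided put to neighbor */
    shmem_barrier_all();                         /* complete all puts */

    printf("PE %d received %ld\n", me, received[0]);
    return 0;
}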
The Cray Tools Strategy • Must be easy to use (or they simply won't be used) • Automatic program instrumentation • no source code or makefile modification needed • Integrated performance tools solution • Design strategy based on the three main steps normally used for application optimization and tuning: • Debug application • TotalView is our debugger of choice • Single Processor Optimization • Make the single-PE code run fast • Parallel Processing and I/O Optimization • Communication / barriers / etc.
Cray Apprentice2 • Call graph profile • Function overview • Communication & I/O activity view • Load balance views • Source code mapping • Pair-wise communication view • Time line views
Recipe for a good MPP • Select Best Microprocessor • Surround it with a balanced or "bandwidth rich" environment • "Scale" the System • Eliminate Operating System Interference (OS Jitter) • Design in Reliability and Resiliency • Provide Scalable System Management • Provide Scalable I/O • Provide Scalable Programming and Performance Tools • System Service Life (provide an upgrade path) • If you do a good job with all of these things, then you will have created a powerful tool for scientists and engineers…
Performance
Cray Systems: Then and Now • Cray Y-MP, 8 CPUs (1990) • Cray T90, 16 CPUs (1994) • Cray T3E, 1024 CPUs (1996) • Cray X1E, 256 CPUs (2005) • Cray XT3, 4096 / 8192 CPUs (2005)
Performance: Can codes really scale to 1000s of processors? [chart of relative performance: Cray Y-MP 8 CPUs (1990), Cray T90 16 CPUs (1994), Cray T3E 1024 CPUs (1996), Cray X1E 256 CPUs (2005), Cray XT3 4096+ CPUs (2005)] *Actual XT3 run. We estimate 32,000 Y-MP units of performance on a 4096 dual-core Hood system.
PTRANS (Global Bandwidth) [chart: PTRANS GB/sec, scale 0 to 1000, for Cray XT3 5208 PEs, Cray XT3 1100 PEs, SGI Altix 2024 PEs, NEC SX-8 576 PEs, IBM Power 5 10240 PEs, and IBM Blue Gene 131072 / 65536 PEs] Sandia has seen 1.8 TB/sec on 10K processors
WRF Performance – Large Grid: 2.7 Tflops on 4608 processors
Spectral Element Atmospheric Model (SEAM), Mark Taylor, Sandia • Goal: Demonstrate 10 km capability for atmospheric model on massively parallel supercomputers • Result: • Run on 10,000 processors of Red Storm for 36 hours • Simulated 20 days • Over one billion grid points in the simulation • Sustained 5.3 Tflops on 10,000 processors
Asteroid Explosion: Sandia National Laboratories • The Golevka asteroid has been an object of interest since NASA discovered that its course had changed in 2003 • Code: CTH • Simulation: 10 megaton explosion at the center of the asteroid • Simulation size: 1 billion cells for 0.5 seconds • Resources: 7200 processors for 12 hours
MPP Dyna • Version: mpp970, 5434a • Test case: car2car, downloaded from the topcrunch.org website • Total number of elements: 2.4 million • Simulation time: 120 ms • Version mpp971 will also handle implicit methods [chart with x-axis "Number of Opterons": 64, 128, 256, 256 (dc)]
NAS Parallel BT Performance: ~$20 Million Cray MPP Systems [log-scale chart, 1 Gflop to 1 Pflop, 1990 to 2020, showing Cray T3D, Cray T3E, Cray XT3, and Cray XT3 dual core] • $200M will buy a sustained "BT Pflop" in 2011 • Expect 1 Pflop sustained for $20M around 2017
Cray MPP Futures
The Cray Roadmap: Following the Adaptive Supercomputing Vision [roadmap chart, 2006 to 2009, spanning Specialized HPC Systems, Purpose-Built HPC Systems, and HPC Optimized Systems: Cray X1E, Cray XT3, Cray XD1, Cray XT3 Dual-Core, Hood Dual-Core, Hood Multi-Core, BlackWidow (2007), Eldorado (2007), Baker; Phase I: Rainier, multiple processor types with an integrated user environment; Phase II: Cascade, fully integrated system]
Packaging: Hood Module [module diagram: AM2 Opteron, initially dual core and upgradeable to multi-core; DDR2 memory, 10.6 GB/sec per socket; SeaStar2, 4 GB/sec MPI bandwidth per Opteron; node VRM (redundant for reliability); blade control computer; new air baffle designed to accommodate future processors]
AMD Multi Core Technology • 4 Cores per die • Each core capable of four 64-bit results per clock vs. two today • In order to leverage this “accelerator”, the code must make use of SSE2 instructions • This basically means you need to vectorize your code and block for cache AMD Proprietary Information – NDA Required
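Following the slide's advice to vectorize and block for cache, below is a hedged sketch of what that looks like for a matrix multiply; the matrix and tile sizes are arbitrary illustrations, not tuned values. The unit-stride inner loop is what lets a compiler (with suitable flags) emit packed SSE2 instructions, and the tiling keeps each working set in cache.

/* Hedged sketch of cache blocking plus a vectorizable inner loop.
   BLOCK and N are illustrative, not tuned for any particular Opteron. */
#include <stdio.h>
#include <stdlib.h>

#define N     1024
#define BLOCK 64

static void matmul_blocked(double (*A)[N], double (*B)[N], double (*C)[N])
{
    /* C must be zero-initialized by the caller. */
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                /* work on BLOCK x BLOCK tiles that fit in cache */
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + BLOCK; j++)
                            C[i][j] += aik * B[k][j];   /* unit stride: vectorizable */
                    }
}

int main(void)
{
    double (*A)[N] = calloc(N, sizeof *A);
    double (*B)[N] = calloc(N, sizeof *B);
    double (*C)[N] = calloc(N, sizeof *C);
    if (!A || !B || !C) return 1;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    matmul_blocked(A, B, C);
    printf("C[0][0] = %.1f\n", C[0][0]);   /* expect 2.0 * N */

    free(A); free(B); free(C);
    return 0;
}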
4-16 Cabinet Configurations • Example: 12 Cabinets • 1116 Compute PEs • 12.5 Tflops with 2.8 Ghz Dual core • Upgradeable to ~45T • 12 x 12 x 8 Torus
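The 12.5 Tflops figure above is consistent with a simple peak estimate, assuming (an assumption, not stated on the slide) two double-precision flops per core per clock:

$1116\ \text{PEs} \times 2\ \text{cores} \times 2\ \text{flops/clock} \times 2.8\,\text{GHz} \approx 12.5\ \text{Tflops peak}$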
16-32 Cabinet Configurations • Example: 24 cabinets • 2224 Compute PEs • 26 Tflops based on 2.8 GHz dual core • Upgradeable to ~90 Tflops • 12 x 12 x 16 Torus