This deck covers the expansion of the warm-water-cooled Apollo 8000 system "Prometheus" at Cyfronet, which provides unified access to software, compute, and storage resources, along with the challenges and benefits for users. The project focuses on supporting distributed computing infrastructures, with particular emphasis on grid technologies and cloud computing. It presents the system's specifications, storage, the benefits of liquid cooling, and the system software used, compares application performance between Prometheus and Zeus, and outlines future plans for the project's development.
Further expansion of HPE's largest warm-water-cooled Apollo 8000 system: "Prometheus" at Cyfronet • Patryk Lasoń, Marek Magryś
ACC Cyfronet AGH-UST • established in 1973 • part of AGH University of Science and Technology in Krakow, Poland • provides free computing resources for scientific institutions • centre of competence in HPC and Grid Computing • IT service management expertise (ITIL, ISO 20k) • member of the PIONIER consortium • operator of the Krakow MAN • home for supercomputers
PL-Grid infrastructure • Polish national IT infrastructure supporting e-Science • based upon the resources of the most powerful academic resource centres • compatible and interoperable with the European Grid • offering grid and cloud computing paradigms • coordinated by Cyfronet • Benefits for users • unified infrastructure built from 5 separate compute centres • unified access to software, compute and storage resources • non-trivial quality of service • Challenges • unified monitoring, accounting, security • creating an environment of cooperation rather than competition • Federation – the key to success
PLGrid Core project • Competence Centre in the Field of Distributed Computing Grid Infrastructures • Duration: 01.01.2014 – 30.11.2015 • Project Coordinator: Academic Computer Centre CYFRONET AGH • The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.
ZEUS 374 TFLOPS #269 on Top500
New building 5 MW, UPS + diesel
Prometheus – Phase 1 • Installed in Q2 2015 • HP Apollo 8000 • 13 m2, 15 racks (3 CDU, 12 compute) • 1.65 PFLOPS • 1728 nodes, Intel Haswell E5-2680v3 • 41472 cores, 13824 per island • 216 TB DDR4 RAM • N+1/N+N redundancy
Prometheus – Phase 2 • Installed in Q4 2015 • 4th island • 432 regular nodes (2 CPUs, 128 GB RAM) • 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL) • 2.4 PFLOPS total performance (Rpeak) • 2140 TFLOPS in CPUs • 256 TFLOPS in GPUs • 2232 nodes, 53568 CPU cores, 279 TB RAM • <850 kW power (including cooling)
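The figures above imply a notable energy efficiency; a minimal back-of-the-envelope sketch, using the slide's Rpeak and power-envelope numbers (an upper bound, since real efficiency depends on achieved Rmax and actual load):

```python
# Energy efficiency of Prometheus Phase 2 from the slide's numbers.
# Rpeak-based, so this is an optimistic bound, not a measured value.
rpeak_tflops = 2140 + 256    # CPU + GPU peak, ~2.4 PFLOPS total
power_kw = 850               # upper bound, including cooling

gflops_per_watt = rpeak_tflops * 1000 / (power_kw * 1000)
print(f"{gflops_per_watt:.2f} GFLOPS/W")  # ≈ 2.82 GFLOPS/W at Rpeak
```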
Prometheus storage • Diskless compute nodes • Separate procurement for storage • Lustre on top of DDN hardware • Two filesystems: • Scratch: 120 GB/s, 5 PB usable space • Archive: 60 GB/s, 5 PB usable space • HSM-ready • NFS for home directories and software
Prometheus: IB fabric • Core IB switches • Service island: service nodes, storage nodes • Compute islands 1–3: 576 CPU nodes each • Compute island 4: 432 CPU nodes + 72 GPU nodes
Why liquid cooling? • Water: up to 1000x more efficient heat exchange than air • Less energy needed to move the coolant • Hardware (CPUs, DIMMs) can handle ~80 °C • Challenge: cool 100% of the HW with liquid • network switches • PSUs
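The efficiency claim follows from Q = ṁ·c_p·ΔT: for the same heat load and temperature rise, water's far higher volumetric heat capacity means vastly less coolant volume to move. A minimal sketch, where the 850 kW load comes from the slides but the 10 K coolant temperature rise is an illustrative assumption:

```python
# Flow needed to carry a given heat load: m_dot = Q / (c_p * dT).
heat_load_w = 850_000        # W, whole-system figure from the slides
delta_t = 10.0               # K, assumed coolant temperature rise

cp_water = 4186.0            # J/(kg*K)
cp_air = 1005.0              # J/(kg*K)

m_dot_water = heat_load_w / (cp_water * delta_t)   # kg/s
m_dot_air = heat_load_w / (cp_air * delta_t)       # kg/s

# Mass flows are comparable, but air is ~800x less dense
# (~1.2 vs ~1000 kg/m^3), so the *volume* of air to move is huge.
vol_water = m_dot_water / 1000.0   # m^3/s
vol_air = m_dot_air / 1.2          # m^3/s
print(f"water: {vol_water*1000:.1f} L/s, air: {vol_air:.0f} m^3/s")
```

Moving tens of litres of water per second takes far less pump energy than moving tens of cubic metres of air per second with fans, which is the point of the slide.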
What about MTBF? • The less movement the better • pumps • fans • HDDs • Example • pump MTBF: 50 000 h • fan MTBF: 50 000 h • 2300-node system MTBF: ~5 h
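The ~5 h figure follows from treating every pump and fan as a series component: with independent, exponentially distributed failures, failure rates add, so system MTBF is the component MTBF divided by the component count. A hedged sketch, where the per-node part count is an assumption chosen to illustrate the arithmetic:

```python
# Series-system MTBF under independent exponential failures.
component_mtbf_h = 50_000    # pump or fan MTBF from the slide
nodes = 2300
parts_per_node = 4           # assumed: e.g. 1 pump + 3 fans per node

total_parts = nodes * parts_per_node
system_mtbf_h = component_mtbf_h / total_parts
print(f"{system_mtbf_h:.1f} h")  # ≈ 5.4 h, same order as the slide's ~5 h
```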
Why Apollo 8000? • Most energy efficient • The only solution with 100% warm-water cooling • Highest density • Lowest TCO
Even more Apollo • Focuses also on the '1' in PUE! • Power distribution • Fewer fans • Detailed monitoring • 'energy to solution' • Dry node maintenance • Fewer cables • Prefabricated piping • Simplified management
System software • CentOS 7 • Boot to RAM over IB, image distribution with HTTP • Whole machine boots up in 10 minutes with just 1 boot server • Hostname/IP generator based on a MAC collector • Data automatically collected from APM and iLO • Graphical monitoring of power, temperature and network traffic • SNMP data source • GUI allows easy problem location • Now synced with SLURM • Spectacular iLO LED blinking system developed for the official launch
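The hostname/IP generator idea above can be sketched as follows. This is an illustrative toy, not Cyfronet's actual tool: the naming scheme, subnet, and function names are all assumptions; the real generator feeds on MACs collected from APM and iLO.

```python
# Toy sketch of a MAC-driven hostname/IP generator: assign stable
# names and addresses to nodes in collection order.
import ipaddress

def assign_nodes(macs, base_net="10.1.0.0/16", prefix="p"):
    """Map each collected MAC to a (hostname, IP) pair."""
    hosts = ipaddress.ip_network(base_net).hosts()
    table = {}
    for i, mac in enumerate(macs, start=1):
        table[mac.lower()] = (f"{prefix}{i:04d}", str(next(hosts)))
    return table

# Example with made-up MACs standing in for the APM/iLO-collected ones:
nodes = assign_nodes(["AA:BB:CC:00:00:01", "AA:BB:CC:00:00:02"])
print(nodes["aa:bb:cc:00:00:01"])  # ('p0001', '10.1.0.1')
```

Keying the table on the MAC makes the assignment reproducible across reboots, which matters for diskless nodes that boot to RAM and have no local state.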
Real application performance • Prometheus vs. Zeus: in theory a 4x difference core to core • Storage system (scratch) 10x faster • More time to focus on the most popular codes • COSMOS++ – 4.4x • Quantum Espresso – 5.6x • ADF – 6x • Widely used QC code with a name derived from a famous mathematician – 2x
Future plans • Continue to move users from the previous system • Add a few large-memory nodes • Further improvements of the monitoring tools • Detailed energy and temperature monitoring • Energy-aware scheduling • Collect the annual energy and PUE figures