This deck covers the expansion of the warm-water-cooled Apollo 8000 system "Prometheus" at Cyfronet, which provides unified access to software, compute, and storage resources, along with the challenges and benefits for users. The project focuses on supporting distributed computing infrastructures, with particular emphasis on grid technologies and cloud computing. It presents the system's specifications, storage, the benefits of liquid cooling, and the system software used, compares application performance between Prometheus and Zeus, and outlines future plans for the project's development.
Further expansion of HPE's largest warm-water-cooled Apollo 8000 system: "Prometheus" at Cyfronet • Patryk Lasoń, Marek Magryś
ACC Cyfronet AGH-UST • established in 1973 • part of AGH University of Science and Technology in Krakow, Poland • provides free computing resources for scientific institutions • centre of competence in HPC and Grid Computing • IT service management expertise (ITIL, ISO 20k) • member of the PIONIER consortium • operator of the Krakow MAN • home for supercomputers
PL-Grid infrastructure • Polish national IT infrastructure supporting e-Science • based upon the resources of the most powerful academic resource centres • compatible and interoperable with the European Grid • offering grid and cloud computing paradigms • coordinated by Cyfronet • Benefits for users • unified infrastructure built from 5 separate compute centres • unified access to software, compute and storage resources • non-trivial quality of service • Challenges • unified monitoring, accounting, security • creating an environment of cooperation rather than competition • Federation – the key to success
PLGrid Core project • Competence Centre in the Field of Distributed Computing Grid Infrastructures • Duration: 01.01.2014 – 30.11.2015 • Project Coordinator: Academic Computer Centre CYFRONET AGH • The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.
ZEUS 374 TFLOPS #269 on Top500
New building 5 MW, UPS + diesel
Prometheus – Phase 1 • Installed in Q2 2015 • HP Apollo 8000 • 13 m2, 15 racks (3 CDU, 12 compute) • 1.65 PFLOPS • 1728 nodes, Intel Haswell E5-2680v3 • 41472 cores, 13824 per island • 216 TB DDR4 RAM • N+1/N+N redundancy
Prometheus – Phase 2 • Installed in Q4 2015 • 4th island • 432 regular nodes (2 CPUs, 128 GB RAM) • 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL) • 2.4 PFLOPS total performance (Rpeak) • 2140 TFLOPS in CPUs • 256 TFLOPS in GPUs • 2232 nodes, 53568 CPU cores, 279 TB RAM • <850 kW power (including cooling)
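The figures above imply a notable energy efficiency; a minimal back-of-the-envelope sketch, using the slide's Rpeak and power-envelope numbers (an upper bound, since real efficiency depends on achieved Rmax and actual load):

```python
# Energy efficiency of Prometheus Phase 2 from the slide's numbers.
# Rpeak-based, so this is an optimistic bound, not a measured value.
rpeak_tflops = 2140 + 256    # CPU + GPU peak, ~2.4 PFLOPS total
power_kw = 850               # upper bound, including cooling

gflops_per_watt = rpeak_tflops * 1000 / (power_kw * 1000)
print(f"{gflops_per_watt:.2f} GFLOPS/W")  # ≈ 2.82 GFLOPS/W at Rpeak
```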
Prometheus storage • Diskless compute nodes • Separate procurement for storage • Lustre on top of DDN hardware • Two filesystems: • Scratch: 120 GB/s, 5 PB usable space • Archive: 60 GB/s, 5 PB usable space • HSM-ready • NFS for home directories and software
Prometheus: IB fabric • Core IB switches • Service island: service nodes, storage nodes • Compute islands 1–3: 576 CPU nodes each • Compute island 4: 432 CPU nodes + 72 GPU nodes
Why liquid cooling? • Water: up to 1000x more efficient heat exchange than air • Less energy needed to move the coolant • Hardware (CPUs, DIMMs) can handle ~80 °C • Challenge: cool 100% of the HW with liquid • network switches • PSUs
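The efficiency claim follows from Q = ṁ·c_p·ΔT: for the same heat load and temperature rise, water's far higher volumetric heat capacity means vastly less coolant volume to move. A minimal sketch, where the 850 kW load comes from the slides but the 10 K coolant temperature rise is an illustrative assumption:

```python
# Flow needed to carry a given heat load: m_dot = Q / (c_p * dT).
heat_load_w = 850_000        # W, whole-system figure from the slides
delta_t = 10.0               # K, assumed coolant temperature rise

cp_water = 4186.0            # J/(kg*K)
cp_air = 1005.0              # J/(kg*K)

m_dot_water = heat_load_w / (cp_water * delta_t)   # kg/s
m_dot_air = heat_load_w / (cp_air * delta_t)       # kg/s

# Mass flows are comparable, but air is ~800x less dense
# (~1.2 vs ~1000 kg/m^3), so the *volume* of air to move is huge.
vol_water = m_dot_water / 1000.0   # m^3/s
vol_air = m_dot_air / 1.2          # m^3/s
print(f"water: {vol_water*1000:.1f} L/s, air: {vol_air:.0f} m^3/s")
```

Moving tens of litres of water per second takes far less pump energy than moving tens of cubic metres of air per second with fans, which is the point of the slide.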
What about MTBF? • The less movement the better • pumps • fans • HDDs • Example • pump MTBF: 50 000 h • fan MTBF: 50 000 h • 2300-node system MTBF: ~5 h
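The ~5 h figure follows from treating every pump and fan as a series component: with independent, exponentially distributed failures, failure rates add, so system MTBF is the component MTBF divided by the component count. A hedged sketch, where the per-node part count is an assumption chosen to illustrate the arithmetic:

```python
# Series-system MTBF under independent exponential failures.
component_mtbf_h = 50_000    # pump or fan MTBF from the slide
nodes = 2300
parts_per_node = 4           # assumed: e.g. 1 pump + 3 fans per node

total_parts = nodes * parts_per_node
system_mtbf_h = component_mtbf_h / total_parts
print(f"{system_mtbf_h:.1f} h")  # ≈ 5.4 h, same order as the slide's ~5 h
```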
Why Apollo 8000? • Most energy efficient • The only solution with 100% warm-water cooling • Highest density • Lowest TCO
Even more Apollo • Focuses also on the '1' in PUE! • Power distribution • Fewer fans • Detailed monitoring • 'energy to solution' • Dry node maintenance • Fewer cables • Prefabricated piping • Simplified management
System software • CentOS 7 • Boot to RAM over IB, image distribution with HTTP • Whole machine boots up in 10 minutes with just 1 boot server • Hostname/IP generator based on a MAC collector • Data automatically collected from APM and iLO • Graphical monitoring of power, temperature and network traffic • SNMP data source • GUI allows easy problem location • Now synced with SLURM • Spectacular iLO LED blinking system developed for the official launch
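The hostname/IP generator idea above can be sketched as follows. This is an illustrative toy, not Cyfronet's actual tool: the naming scheme, subnet, and function names are all assumptions; the real generator feeds on MACs collected from APM and iLO.

```python
# Toy sketch of a MAC-driven hostname/IP generator: assign stable
# names and addresses to nodes in collection order.
import ipaddress

def assign_nodes(macs, base_net="10.1.0.0/16", prefix="p"):
    """Map each collected MAC to a (hostname, IP) pair."""
    hosts = ipaddress.ip_network(base_net).hosts()
    table = {}
    for i, mac in enumerate(macs, start=1):
        table[mac.lower()] = (f"{prefix}{i:04d}", str(next(hosts)))
    return table

# Example with made-up MACs standing in for the APM/iLO-collected ones:
nodes = assign_nodes(["AA:BB:CC:00:00:01", "AA:BB:CC:00:00:02"])
print(nodes["aa:bb:cc:00:00:01"])  # ('p0001', '10.1.0.1')
```

Keying the table on the MAC makes the assignment reproducible across reboots, which matters for diskless nodes that boot to RAM and have no local state.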
Real application performance • Prometheus vs. Zeus: in theory a 4x difference core to core • Storage system (scratch) 10x faster • More time to focus on the most popular codes • COSMOS++ – 4.4x • Quantum Espresso – 5.6x • ADF – 6x • Widely used QC code with a name derived from a famous mathematician – 2x
Future plans • Continue to move users from the previous system • Add a few large-memory nodes • Further improvements of the monitoring tools • Detailed energy and temperature monitoring • Energy-aware scheduling • Collect the annual energy and PUE figures