Established in 1973, ACC Cyfronet AGH-UST in Krakow, PL, offers free computing resources to scientific institutions, focusing on HPC and grid computing. A member of PIONIER and operator of the Krakow MAN, Cyfronet leads the PL-Grid Consortium, which aims to expand the computing resources available to researchers. The national IT infrastructure supports e-Science and is compatible with the European Grid. Users benefit from unified access to resources but face challenges such as monitoring and security. The PLGrid Core project, funded by the EC, strengthens distributed computing infrastructures; its services cover data access, cloud computing and workflow management. The facility operates advanced HPC systems, including Zeus, with over 1300 servers and high-core-count nodes. To meet future needs, an upgrade is essential, with energy-efficient options such as direct liquid cooling. The system's topology ensures efficient service delivery. Harmonizing hardware performance, energy efficiency and cooling methods is vital for sustainable computing.
Towards energy efficient HPC: HP Apollo 8000 at Cyfronet (Part I)
Patryk Lasoń, Marek Magryś
ACC Cyfronet AGH-UST
• established in 1973
• part of AGH University of Science and Technology in Krakow, PL
• provides free computing resources for scientific institutions
• centre of competence in HPC and Grid Computing
• IT service management expertise (ITIL, ISO 20k)
• member of PIONIER
• operator of Krakow MAN
• home for Zeus
PL-Grid Consortium
• Consortium creation – January 2007
  • a response to requirements from Polish scientists
  • due to ongoing Grid activities in Europe (EGEE, EGI_DS)
• Aim: significant extension of the amount of computing resources provided to the scientific community (start of the PL-Grid Programme)
• Development based on:
  • projects funded by the European Regional Development Fund as part of the Innovative Economy Programme
  • close international collaboration (EGI, …)
  • previous projects (FP5, FP6, FP7, EDA…)
• National network infrastructure available: PIONIER national project
• Computing resources: Top500 list
• Polish scientific communities: ~75% of highly rated Polish publications come from 5 communities
PL-Grid Consortium members: 5 Polish High Performance Computing centres, representing the communities, coordinated by ACC Cyfronet AGH
PL-Grid infrastructure
• Polish national IT infrastructure supporting e-Science
  • based upon resources of the most powerful academic resource centres
  • compatible and interoperable with the European Grid
  • offering grid and cloud computing paradigms
  • coordinated by Cyfronet
• Benefits for users
  • one infrastructure instead of 5 separate compute centres
  • unified access to software, compute and storage resources
  • non-trivial quality of service
• Challenges
  • unified monitoring, accounting, security
  • creating an environment of cooperation rather than competition
• Federation – the key to success
PLGrid Core project – Competence Centre in the Field of Distributed Computing Grid Infrastructures
• Budget: 104 949 901.16 PLN in total, including EC funding of 89 207 415.99 PLN
• Duration: 01.01.2014 – 30.11.2015
• Project Coordinator: Academic Computer Centre CYFRONET AGH
The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.
PLGrid Core project – services
• Basic infrastructure services
  • Uniform access to distributed data
  • PaaS Cloud for scientists
  • Applications maintenance environment of the MapReduce type
• End-user services
  • Technologies and environments implementing the Open Science paradigm
  • Computing environment for interactive processing of scientific data
  • Platform for development and execution of large-scale applications organized in a workflow
  • Automatic selection of scientific literature
  • Environment supporting data farming mass computations
HPC at Cyfronet – timeline of systems deployed between 2007 and 2013: Mars, Baribal, Zeus, Panda, Platon U3, Zeus vSMP, Zeus FPGA, Zeus GPU
Zeus: 374 TFLOPS, #176 on the Top500 list, #1 in Poland
Zeus
• over 1300 servers
  • HP BL2x220c blades
  • HP BL685c fat nodes (64 cores, 256 GB)
  • HP BL490c vSMP nodes (up to 768 cores, 6 TB)
  • HP SL390s GPGPU (2x, 8x) nodes
• Infiniband QDR (Mellanox + Qlogic)
• >3 PB of disk storage (Lustre + GPFS)
• Scientific Linux 6, Torque/Moab
Zeus – statistics
• 2400 registered users
• >2000 jobs running simultaneously
• >22 000 jobs per day
• 96 000 000 computing hours in 2013
• jobs lasting from minutes to weeks
• jobs from 1 core to 4000 cores
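To put the 2013 figure in perspective, here is a minimal back-of-envelope sketch (Python) that converts the yearly total of computing hours into the average number of cores busy at any moment; the only assumed constant is the ~8760 wall-clock hours in a year.

```python
# Back-of-envelope: average number of cores busy on Zeus in 2013,
# derived only from the yearly total of computing (core) hours above.

core_hours_2013 = 96_000_000      # computing hours reported for 2013
hours_per_year = 365 * 24         # ~8760 wall-clock hours in a year

avg_busy_cores = core_hours_2013 / hours_per_year
print(f"Average cores busy around the clock: ~{avg_busy_cores:,.0f}")
# -> roughly 11 000 cores occupied continuously, consistent with
#    >2000 simultaneous jobs of 1 to 4000 cores each.
```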
Cooling – hot/cold aisle diagram: racks draw 20°C air from the cold aisle and exhaust ~40°C air into the hot aisles on either side.
Why upgrade?
• Jobs growing
• Users hate queuing
• New users, new requirements
• Technology moving forward
• Power bill staying the same
Requirements
• Petascale system
• Lowest TCO
• Energy efficient
• Dense
• Good MTBF
• Hardware:
  • core count
  • memory size
  • network topology
  • storage
Direct Liquid Cooling!
• Up to 1000x more efficient heat exchange than air
• Less energy needed to move the coolant
• Hardware can handle it
  • CPUs ~70°C
  • memory ~80°C
• Hard to cool 100% of HW with liquid
  • network switches
  • PSUs
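As a rough illustration of why so much less energy is needed to move a liquid coolant, the sketch below compares how much heat a given volume of water and of air can carry per degree of temperature rise. The property values are textbook approximations at room conditions, not figures from the slides, and the comparison is about heat-carrying capacity rather than the heat-exchange efficiency quoted above.

```python
# Rough comparison of volumetric heat capacity: water vs. air.
# Property values are textbook approximations (~25 degC), assumed for illustration.

water_cp = 4186.0      # J/(kg*K), specific heat of water
water_rho = 998.0      # kg/m^3, density of water
air_cp = 1005.0        # J/(kg*K), specific heat of air
air_rho = 1.2          # kg/m^3, density of air at room conditions

water_vol_cap = water_cp * water_rho   # J/(m^3*K)
air_vol_cap = air_cp * air_rho         # J/(m^3*K)

print(f"Water carries ~{water_vol_cap / air_vol_cap:,.0f}x more heat per unit volume")
# -> a few thousand times more, which is why far smaller coolant flows
#    (and far less fan/pump energy) suffice with direct liquid cooling.
```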
MTBF
• The less movement the better
  • fewer pumps
  • fewer fans
  • fewer HDDs
• Example (see the sketch below)
  • pump MTBF: 50 000 hrs
  • fan MTBF: 50 000 hrs
  • 1800-node system MTBF: 7 hrs
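The 7-hour figure follows from treating every pump and fan as a serial point of failure: for n identical, independent components the expected time to the first failure is roughly MTBF/n. A minimal sketch of that arithmetic; the four moving parts per node is an illustrative assumption, not a number from the slides.

```python
# System MTBF when any single moving part failing counts as a system failure:
# for n identical, independent components, time to first failure ~ MTBF / n.

component_mtbf_hrs = 50_000      # pump or fan MTBF from the slide
nodes = 1_800
moving_parts_per_node = 4        # assumption for illustration (fans/pumps per node)

n_components = nodes * moving_parts_per_node
system_mtbf_hrs = component_mtbf_hrs / n_components
print(f"Expected time to first failure: ~{system_mtbf_hrs:.1f} hours")
# -> ~7 hours; removing pumps and fans from the nodes directly removes
#    terms from this sum of failure rates.
```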
The topology
• Service isle: core IB switches, service nodes, storage nodes
• Computing isles: three isles of 576 computing nodes each
It should count
• Max job size ~10k cores
• Fastest CPUs, but compatible with old codes
  • Two sockets are enough
  • CPUs, not accelerators
• Newest memory
  • and more than before
• Fast interconnect
  • still Infiniband
  • but no need for a full CBB fat tree
The hard part
• Public institution, public tender
• Strict requirements
  • 1.65 PFLOPS, max. 1728 servers
  • 128 GB DDR4 per node
  • warm water cooling, no pumps inside nodes
  • Infiniband topology
  • compute + cooling, dry-cooler only
• Criteria: price, power, space
And the winner is…
• HP Apollo 8000
  • Most energy efficient
  • The only solution with 100% warm water cooling
  • Least floor space needed
  • Lowest TCO
Even more Apollo
• Focuses also on the ‘1’ in PUE!
  • Power distribution
  • Fewer fans
  • Detailed monitoring
    • ‘energy to solution’
• Safer maintenance
• Fewer cables
• Prefabricated piping
• Simplified management
System configuration
• 1.65 PFLOPS (within the first 30 of the current Top500)
• 1728 nodes, Intel Haswell E5-2680v3
• 41472 cores, 13824 per island
• 216 TB DDR4 RAM
• PUE ~1.05, 680 kW total power
• 15 racks, 12.99 m²
• System ready for non-disruptive upgrade
• Scientific Linux 6 or 7
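The headline numbers are mutually consistent, as the sketch below rederives them from the per-node configuration. The CPU parameters used (12 cores per socket, 2.5 GHz base clock, 16 double-precision FLOP/cycle with AVX2 FMA) are standard E5-2680v3 figures assumed here, not stated on the slide.

```python
# Consistency check of the Prometheus configuration figures.
# CPU parameters are standard Haswell E5-2680v3 values, assumed for this sketch.

nodes = 1728
sockets_per_node = 2
cores_per_socket = 12
base_clock_hz = 2.5e9
dp_flop_per_cycle = 16            # AVX2: 2 FMA units x 4 doubles x 2 ops
mem_per_node_gb = 128

cores = nodes * sockets_per_node * cores_per_socket
peak_pflops = cores * base_clock_hz * dp_flop_per_cycle / 1e15
memory_tb = nodes * mem_per_node_gb / 1024
nodes_per_island = 13824 // (sockets_per_node * cores_per_socket)

print(cores)                   # 41472 cores, as on the slide
print(round(peak_pflops, 2))   # ~1.66 PFLOPS peak, matching the 1.65 PFLOPS requirement
print(memory_tb)               # 216 (TiB, quoted as 216 TB)
print(nodes_per_island)        # 576 nodes per island, matching the topology slide
```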
Prometheus
• Created humans
• Gave fire to the people
• Accelerated innovation
• Defeated Zeus
Deployment plan
• Contract signed on 20.10.2014
• Installation of the primary loop started on 12.11.2014
• First delivery (service island) expected on 24.11.2014
• Apollo piping should arrive before Christmas
• Main delivery in January
• Installation and acceptance in February
• Production work from Q2 2015
Future plans
• Benchmarking and Top500 submission
• Evaluation of Scientific Linux 7
• Moving users from the previous system
• Tuning of applications
• Energy-aware scheduling
• First experience presented at HP-CAST 24
More information
• www.cyfronet.krakow.pl/en
• www.plgrid.pl/en