Presentation Transcript


  1. Deploying HPC Linux Clusters (the LLNL way!)
  Robin Goldstone
  Presented to ScicomP/SP-XXL, August 10, 2004
  UCRL-PRES-206330
  This work was performed under the auspices of the U.S. Department of Energy by University of California, Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.

  2. Background

  3. HPC Linux Cluster Strategy
  Purpose of HPC Linux clusters:
  • Create low-cost alternatives to vendor-integrated solutions: this motivates vendors to cut costs and gives the labs a level of independence from vendor solutions.
  • Provide affordable capacity solutions today for the program, so capability systems can be used for capability.
  • Provide a path for next-generation capability, if technology evolves appropriately for HPC.
  LLNL strategy:
  • Design software and hardware for ease of manageability.
  • Leverage the open source software model: augment a base Linux distro with in-house development expertise and vendor partnerships.
  • Develop best-of-breed software, including robust, scalable cluster management tools and an efficient, scalable resource manager that maximizes utilization of resources.
  • Employ a multi-tiered software support model: system administrators, on-site developers, vendor partners, open source community.
  • Build clusters from (mostly) commodity components; use a "self-maintenance" model for hardware repair.
  • Provide users with a feature-rich environment, including a parallel environment (MPI, OpenMP), development tools, and a parallel file system.
  • Deliver world-class HPC systems to our users.

  4. Production Computing Resources (projected to 6/30/04)

  5. Livermore Model
  • Term coined during the vector-to-MPP transition (~1992, Meiko CS/2).
  • Assures code teams that the porting effort will not have to be repeated for every MPP platform.
  • Common application programming environment on all LC clusters: allows complex scientific apps to be ported across generations of hardware architectures and different operating systems with minimal disruption.
  • MPI for parallelism between nodes; OpenMP for SMP parallelism (a minimal sketch follows this slide).
  • Parallel filesystem, POSIX facilities.
  • Compilers/debuggers: not necessarily the same ones, but best of breed wherever possible.
  HPC Linux clusters must implement the Livermore Model!
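Not part of the original slide: a minimal sketch of the programming environment the Livermore Model describes, with MPI for parallelism between nodes and OpenMP for SMP parallelism within a node. The build line in the comment is illustrative, not an LC-specific recipe.

```c
/* Illustrative hybrid MPI + OpenMP "hello" in the spirit of the
 * Livermore Model: MPI between nodes, OpenMP within an SMP node.
 * Example build (not from the slides): mpicc -fopenmp hello.c -o hello
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        /* Each MPI rank spawns OpenMP threads for SMP parallelism. */
        printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```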

  6. Decision-making process
  Principal Investigator for HPC Platforms:
  • "Visionary" who tracks HPC trends and innovations.
  • Understands scientific apps and customer requirements.
  • Very deep and broad understanding of computer architecture.
  IO Testbed:
  • Evaluate new technologies: nodes, interconnects, storage systems.
  • Performance characteristics, manageability aspects, MTBF, etc.
  • Explore integration with existing infrastructure; develop required support in the Linux OS stack.
  RFP process:
  • In response to a specific programmatic need for a new computing resource: determine the best architecture, define the requirements, solicit bids.
  • Best-value procurements, or sole source where required.
  • Bidders are typically required to execute a benchmark suite on the proposed hardware and submit results (see the sketch after this slide).
  Integration -> acceptance -> deployment:
  • Integration requires close coordination with LLNL staff since our OS stack is used. Often the integrator is not familiar with all of the HPC components (interconnect, etc.), so LLNL must act as the "prime contractor" to coordinate the integration of cluster nodes, interconnect and storage systems.
  • Acceptance testing is critical! Does the system function/perform as expected?
  • For LLNL's large Linux clusters, time from contract award to production status has been ~8 months.
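The actual LLNL benchmark suite is not shown in the slides; as a hedged illustration of the kind of interconnect test such a suite typically includes, here is a minimal MPI ping-pong latency/bandwidth probe between ranks 0 and 1.

```c
/* Minimal MPI ping-pong sketch: illustrative of the kind of interconnect
 * test a procurement benchmark suite might include; this is NOT the
 * actual LLNL benchmark suite.  Run with at least 2 MPI ranks.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    enum { NBYTES = 1 << 20, ITERS = 100 };
    int rank;
    char *buf = malloc(NBYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg round trip: %g us, bandwidth: %g MB/s\n",
               (t1 - t0) / ITERS * 1e6,
               2.0 * NBYTES * ITERS / (t1 - t0) / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}
```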

  7. Software Environment

  8. CHAOS Overview
  What is CHAOS? Clustered High Availability Operating System: the internal LC HPC Linux distro, currently derived from Red Hat Enterprise Linux, with a scalability target of 4K nodes.
  What is added?
  • Kernel modifications for HPC requirements
  • Cluster system administration tools
  • MPI environment
  • Parallel filesystem (Lustre)
  • Resource manager (SLURM)
  The CHAOS "framework":
  • Configuration management
  • Release discipline
  • Integration testing
  • Bug tracking
  Limitations:
  • Assumes LC services structure and LC sys admin culture
  • Tools supported by DEG are not included in the CHAOS distro
  • Only supports LC production hardware
  • Assumes cluster architecture

  9. CHAOS Target Environment

  10. CHAOS Content
  Kernel (based on the Red Hat kernel):
  • Add: QsNet modules plus coproc, ptrack, misc core kernel changes
  • Update: drivers for e1000, qla2300, nvidia, digi, etc.
  • Add: Lustre modules plus VFS intents, read-only device support, zero-copy sendpage/recvpackets, etc.
  • Add: increase kernel stack size, reliable panic on overflow
  • Update: netdump for cluster scalability
  • Update: IA-64 netdump, machine check architecture fixes, trace unaligned access traps, spinlock debugging, etc.
  • Update: ECC module support for i860, E75XX
  • Add: implement totalview ptrace semantics
  • Add: p4therm
  • Update: various NFS fixes
  Cluster admin/monitoring tools:
  • pdsh, YACI, ConMan/PowerMan, HM, Genders, MUNGE/mrsh
  • Firmware update tools, lmsensors, CMOS config, FreeIPMI, Ganglia/whatsup
  Also included:
  • SLURM resource manager
  • Lustre parallel filesystem
  • QsNet MPI environment

  11. Releases
  CHAOS 2.0 (GA: June 2004):
  • Red Hat Enterprise Linux 3, 2.4.21 kernel with NPTL
  • IA64
  • QsNet II
  • IPMI 1.5
  • MCORE -> Netdump
  • Lustre fully integrated
  CHAOS 3.0 (GA: ~Feb 2005):
  • Red Hat Enterprise Linux 4, 2.6 kernel
  • X86_64
  • OpenIB
  • IPMI 2.0

  12. CHAOS Software Life Cycle
  • Target is a six-month release cycle.
  • Loosely synchronized with Red Hat releases, but also dependent on availability of key third-party tools (e.g. Intel compilers, Totalview debugger).
  • Also driven by the requirement to support new hardware:
  • Fall 2001: PCR (ia32, i860 chipset, RDRAM memory)
  • Fall 2002: MCR (E7500 chipset, federated Quadrics switch, LinuxBIOS)
  • Late 2002: ALC (Serverworks GC-LE chipset, IBM service processor)
  • Late 2003: Thunder (IA64, Intel 8870, IPMI)
  • Automated test framework (TET): subset of SWL apps, regression tests.
  • Staged rollout: small -> large, unclassified -> classified
  • testbed systems: dev, mdev, adev, tdev
  • small production systems: pengra, ilx, PVC, Gviz
  • large production systems: PCR, Lilac, MCR, ALC, Thunder

  13. CHAOS Software Support Model
  • System administrators perform first-level problem determination.
  • If a software defect is suspected, the problem is reported to the CHAOS development team.
  • Bugs are entered and tracked in the CHAOS GNATS database.
  • Fixes are pursued as appropriate: in the open source community, through Red Hat, through vendor partners (Quadrics, CFS, etc.), or locally by CHAOS developers.
  • The bug fix is incorporated into the CHAOS CVS source tree, and new RPM(s) are built and tested on testbed systems.
  • Changes are rolled out in the next CHAOS release, or sooner depending on severity.

  14. SLURM in a Nutshell
  Users submit work; the external scheduler manages the queue; SLURM allocates nodes, then starts and manages the jobs.
  [Diagram: four jobs (Job 1-4) spread across eight nodes (Node 0-7).]

  15. SLURM Architecture

  16. SLURM Job Startup: /bin/hostname, 2 tasks per node
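The original slide is a launch-sequence diagram. As an illustrative sketch only, the program below is a slightly richer stand-in for /bin/hostname that an srun-launched task could run: it prints the node's hostname together with the task rank SLURM exports in the environment. The environment variable names (SLURM_PROCID, SLURM_NNODES) are taken from SLURM's documented task environment and should be treated as an assumption for this era of SLURM.

```c
/* Minimal sketch (not from the slides): prints the node's hostname plus
 * the task rank and node count that SLURM exports to each launched task.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char host[256] = "unknown";
    gethostname(host, sizeof(host));

    const char *procid = getenv("SLURM_PROCID");   /* task rank within the job */
    const char *nnodes = getenv("SLURM_NNODES");   /* number of allocated nodes */

    printf("host=%s task=%s of job spanning %s node(s)\n",
           host,
           procid ? procid : "?",
           nnodes ? nnodes : "?");
    return 0;
}
```

Launched with 2 tasks per node on a 4-node allocation, this would print 8 lines, two per hostname.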

  17. SLURM Plans • Support more configurations • LLNL: Kerberos, AIX, IBM Federation switch, Blue Gene/L • Collaborators: Myrinet, InfiniBand, LAMMPI, MPICH, Maui… • Job checkpoint/restart • Job preempt/resume • Manage consumable resources both system-wide (e.g. licenses) and within shared nodes (memory, disk space, etc.)

  18. LLNL Interests in Lustre
  • Lustre Lite: a cluster file system for:
  • a single Linux cluster: MCR, ALC, LILAC
  • multiple clusters that share a file system: MCR+PVC
  • a "Lustre Farm" file system
  • DOE TriLabs PathForward project:
  • full-functionality Lustre, 3-year contract
  • scaling to 10,000 client nodes, bells, whistles, more "commercial" features
  • platform for an "enterprise-wide" shared file system
  • Storage industry "buy-in."

  19. Clusters Using Lustre

  20. The InterGalactic Filesystem
  [Diagram; cluster labels include Thunder.]

  21. Lustre Issues and Priorities • Reliability!!!! • first and foremost • Manageability of “Lustre Farm” configurations: • OST “pool” notion • management tools (config, control, monitor, fault-determination) • Security of “Lustre Farm” TCP-connected operations. • Better/quicker fault determination in multi-networked environment. • Better recovery/operational robustness in the face of failures. • Improved performance.

  22. Development Environment
  • Compilers: Intel compilers (ifort, icc) through v8; PGI compilers (pgcc, pgCC, pgf90, pghpf) through 5.1; gcc 2.96, 3.2.3, 3.3.2; glibc 2.2.5, 2.3.2
  • Libraries: Intel MKL, currently 6.1; other libs (ScaLAPACK, pact, etc.) are the responsibility of the apps teams
  • Interpreters: Java (j2re1.4.1_03), Python 2.2
  • Debuggers: Totalview 6 (port with ASC enhancements); gdb, ddd, idb, pgdb; ddt (under evaluation)
  • Memory debuggers: Valgrind; dmalloc, efence, Purify for gcc
  • Profiling: mpiP; gprof, pgprof; Vampir/Guideview; PAPI; in-house HWmon tool layered on perfctr; TAU (experimental); VTune coming (not yet available for RHEL 3)
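Not in the original slide: a minimal sketch of the kind of per-region timing that the profiling tools listed above report automatically, using portable MPI_Wtime calls around a stand-in compute kernel and a collective. The comments describe common gprof and mpiP usage as general practice, not LC-specific configuration.

```c
/* Minimal timing sketch: illustrates the sort of per-region numbers that
 * tools such as mpiP or gprof report automatically.
 * gprof: compile with -pg, then run gprof on the resulting gmon.out.
 * mpiP: relink the unmodified MPI application against the mpiP library.
 */
#include <mpi.h>
#include <stdio.h>

static double work(long n)
{
    /* Stand-in compute kernel. */
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)(i * i);
    return s;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    double s = work(50 * 1000 * 1000L);
    double t_compute = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    double global = 0.0;
    MPI_Allreduce(&s, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t_mpi = MPI_Wtime() - t0;

    printf("rank %d: compute %.3f s, MPI %.6f s, sum %.6f\n",
           rank, t_compute, t_mpi, global);

    MPI_Finalize();
    return 0;
}
```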

  23. CSM vs. CHAOS

  Function                   CSM method        CHAOS method
  Cluster install/update     NIM               YACI
  Cluster security           CtSec             munge
  Topology services          RSCT (hats)       ganglia
  Remote command execution   dsh               pdsh
  Node groups                nodegrp           genders
  Configuration management   CFM               genderized rdist
  Node monitoring            SMC               SNMP
  Cluster monitoring         WebSM             HM
  Console management         HMC+conserver     conman
  Remote power management    HMC+rpower        powerman
  Resource management        LoadLeveler       SLURM

  24. Linux Development Staffing
  • 2 FTE kernel
  • 3 FTE cluster tools + misc
  • 2 FTE SLURM
  • 4 FTE Lustre
  • 1 FTE on-site Red Hat analyst

  25. Hardware Environment

  26. Typical Linux Cluster Architecture
  [Diagram: 1,112 P4 compute nodes on an 1152-port QsNet Elan3 interconnect; 2 MDS and 8 gateway (GW) nodes; 16 OSTs reached via a GbEnet federated switch; 6 login nodes with 4 Gb-Enet; 2 management nodes; 100BaseT control and management networks; serial port concentrators (SPC) and remote power controllers (RPC).]

  27. Hardware component choices
  • Motherboard/chipset: is the chipset specification open?
  • Processors: influenced by kernel/distro support, availability of compilers, tools…
  • Memory: ECC mandatory; chipkill = $$
  • Hard drives: IDE or SCSI, RAID, mirroring, diskless?
  • Node form factor: influenced primarily by the number of PCI/AGP slots needed. What about blades?
  • Remote power management: external plug control vs. service processor
  • Remote console management: terminal servers vs. Serial Over LAN (IPMI)
  Even slight variations in hardware configuration can create a significant support burden!

  28. Integration Considerations
  • Power and cooling
  • Racking, cabling and labeling
  • BIOS/CMOS settings
  • Burn-in
  • Acceptance testing
  • Software deliverables: can the integrator contribute Open Source tools for the hardware they are providing?

  29. Hardware support strategy
  • Self-maintenance vs. vendor maintenance
  • Overhead of vendor maintenance: turnaround time; access to classified facilities; convincing the vendor something is really broken when it passes diags
  • Overhead of self-maintenance: need appropriately skilled staff to perform repairs; inventory management; RMA coordination
  • Hot spare cluster: include it in the cluster purchase!
  • Spare parts cache: how big? What happens when you run out of spare parts?

  30. Conclusions

  31. Lessons Learned (SW)
  • A "roll your own" software stack like CHAOS takes significant effort and in-house expertise. Is this overkill for your needs?
  • Consider alternatives like OSCAR or NPACI Rocks for small (< 128 node) clusters with common interconnects (Myrinet or GigE). Use LLNL cluster tools to augment or replace deficient components of these stacks as needed.
  • Avoid proprietary vendor cluster stacks, which will tie you to one vendor's hardware. A vendor-neutral Open Source software stack enables a common system management toolset across multiple platforms.
  • Even with a vendor-supported software stack, a local resource with kernel/systems expertise is a valuable asset to assist in problem isolation and rapid deployment of bug fixes.

  32. Lessons Learned (HW)
  • Buy hardware from vendors who understand Linux and can guarantee compatibility: BIOS pre-configured for serial console, or LinuxBIOS; Linux-based tools (preferably Open Source) for hardware management functions; Linux device drivers for all hardware components.
  • Who will do the integration? The devil is in the details: a rack is NOT just a rack, a rail is not just a rail; all cat5 cables are not created equal; labeling is critical.
  • Hardware self-maintenance can be a significant burden: node repair, parts cache maintenance, RMA processing.
  PC hardware in a rack != HPC Linux cluster

  33. LCRM/SLURM Delivers Excellent Capacity/Capability Mix LCRM/SLURM scheduling environment allows optimal use of resources. Over 40% of the cycles go to jobs requiring over 50% of the machine

  34. Do Our Users Like It? Yes They Do!
  MCR is more heavily used than ASCI White!
