Lattice QCD Clusters
Amitoj Singh
Fermi National Accelerator Laboratory
Introduction
• The LQCD Clusters
• Cluster monitoring and response
• Cluster job
  • types
  • submission, scheduling and allocation
  • execution
• Wish List
• Questions and Answers
pion and qcd cluster
[Photos: pion cluster front; qcd cluster back; pion cluster back]
kaon cluster
[Photos: kaon cluster front; kaon cluster back; kaon head-nodes and Infiniband spine]
Cluster monitoring
• Worker node
  • Nannies monitor critical components and processes such as:
    • health (CPU/system temperature, CPU/system fan speeds)
    • batch queue clients (PBS mom) *
    • disk space
    • NFS mount points
    • high-speed interconnects
  • Except for items marked *, nannies report any anomalies by email. For items marked *, a corrective action is defined.
  • A corrective action needs to be well defined, with sufficient decision paths to fully automate error diagnosis and recovery.
  • Users are sophisticated enough to report any performance-related issues.
• Head-node
  • A nanny monitors critical processes such as:
    • mrtg graph plotting scripts *
    • automated scripts to generate cluster status pages *
    • batch queue server (PBS server)
    • NFS server *
  • Except for items marked *, the nanny restarts processes that have exited abnormally.
  • All unhealthy nodes are shown blinking on the cluster status pages. Cluster administrators can then analyze the mrtg plots to isolate the problem.
• Network fabric
  • For the high-speed network interconnects:
    • Nannies monitor and plot the health of critical components (switch blade temperature, chassis fan speeds) on the 128-port Myrinet spine switch. No automated corrective action has been defined for anomalies that may occur.
    • Cluster administrators can run Infiniband cluster administration tools to locate bad Infiniband cables, failing spine or leaf switch ports, and failing Infiniband HCAs. The Infiniband hardware has been reliable.
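The split between "report only" checks and checks with a defined corrective action can be sketched as below. This is a minimal illustration, not the nanny scripts used on the clusters: the `Check` class, `run_nanny`, the `report` callback (standing in for email), and the simulated `state` dictionary are all hypothetical names. Reporting a failure that survives the corrective action is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    """One monitored component (hypothetical structure)."""
    name: str
    is_healthy: Callable[[], bool]
    # Optional automated recovery; None means "report only".
    corrective_action: Optional[Callable[[], None]] = None


def run_nanny(checks: List[Check], report: Callable[[str], None]) -> List[str]:
    """Run every check once and return the names still unhealthy."""
    unhealthy = []
    for check in checks:
        if check.is_healthy():
            continue
        if check.corrective_action is not None:
            check.corrective_action()
            if check.is_healthy():
                continue  # recovered automatically, nothing to report
        # Assumption: anything not recovered is reported (e.g. by email).
        report(f"anomaly detected: {check.name}")
        unhealthy.append(check.name)
    return unhealthy


# Simulated node state: the PBS mom has died and the disk is full.
state = {"pbs_mom": False, "disk": False}


def restart_mom() -> None:
    state["pbs_mom"] = True  # stand-in for restarting the batch client


messages: List[str] = []
checks = [
    Check("pbs_mom", lambda: state["pbs_mom"], restart_mom),
    Check("disk_space", lambda: state["disk"]),
]
still_bad = run_nanny(checks, messages.append)
```

In this run the batch client is recovered silently, while the disk problem, which has no corrective action, is only reported.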
Cluster job types
• A large fraction of the jobs that are run on the LQCD clusters are limited by:
  • Memory bandwidth
  • Network bandwidth
[Figures: memory-bandwidth-bound job; network-bandwidth-bound job]
Cluster job execution
• Open PBS (Torque) and the Maui scheduler schedule jobs using the "FIFO" algorithm as follows:
  • Jobs are queued in order of submission.
  • Maui runs the highest (oldest) jobs in the queue in order, except that it will not start a job if any of the following is true:
    (a) the job would put the number of running jobs for a particular user over the limit
    (b) the job would put the total number of nodes used by a particular user over the limit
    (c) the job specifies resources that cannot be fulfilled (e.g. a specific set of nodes requested by the user)
  • If a job is ineligible for any of the above reasons, Maui runs the next eligible job.
  • Under certain conditions, Maui may run the next eligible job if only limit (c) holds. This is called backfilling. Maui looks at the state of the queue and the running jobs and, based on the requested and used wall-clock times, predicts when the job blocked by (c) will be able to run. If jobs lower in the queue can run without delaying the start time of the job blocked by (c), Maui runs those jobs.
• Once a job is ready to run, a set of nodes is allocated to the job exclusively for the requested wall time. Almost all jobs run on the LQCD clusters are MPI jobs. Users can either refer explicitly to the PBS_NODEFILE environment variable or rely on the mpirun launch script, into which it is coded.
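The FIFO rules and the backfill test described above can be sketched as a single scheduling pass. This is a simplified model, not Maui's implementation: `Job`, `predict_start`, and `schedule_pass` are hypothetical names, condition (c) is reduced to "not enough free nodes right now", only the first job blocked by (c) is protected, and the prediction uses requested wall-clock times only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    name: str
    user: str
    nodes: int       # number of nodes requested
    walltime: float  # requested wall-clock time, in hours


def predict_start(job: Job, active: List[Tuple[Job, float]],
                  free: int) -> Optional[Tuple[float, int]]:
    """Return (hours until `job` can start, spare nodes at that time),
    or None if the job never fits on the cluster."""
    avail = free
    for other, remaining in sorted(active, key=lambda pair: pair[1]):
        avail += other.nodes
        if avail >= job.nodes:
            return remaining, avail - job.nodes
    return None


def schedule_pass(queue: List[Job], running: List[Tuple[Job, float]],
                  total_nodes: int, max_jobs: int, max_nodes: int) -> List[str]:
    """One pass over the queue (oldest first); returns names of jobs started."""
    free = total_nodes - sum(job.nodes for job, _ in running)
    active = list(running)  # (job, hours remaining)
    started: List[str] = []
    reservation = None      # (start time, spare nodes) of the job blocked by (c)

    for job in queue:
        mine = [j for j, _ in active if j.user == job.user]
        if len(mine) + 1 > max_jobs:
            continue  # (a) per-user running-job limit
        if sum(j.nodes for j in mine) + job.nodes > max_nodes:
            continue  # (b) per-user node limit
        fits_now = job.nodes <= free
        if reservation is None:
            if not fits_now:
                # (c) blocked: predict when it can start and protect that time
                reservation = predict_start(job, active, free)
                continue
        else:
            start_time, spare = reservation
            ends_in_time = job.walltime <= start_time
            # Backfill only if the blocked job's start is not delayed.
            if not fits_now or not (ends_in_time or job.nodes <= spare):
                continue
            if not ends_in_time:
                reservation = (start_time, spare - job.nodes)
        free -= job.nodes
        active.append((job, job.walltime))
        started.append(job.name)
    return started


# 8-node cluster; alice already holds 4 nodes for 2 more hours.
running = [(Job("r1", "alice", 4, 6.0), 2.0)]
queue = [
    Job("q0", "alice", 2, 1.0),  # skipped: alice is at her job limit (a)
    Job("q1", "bob", 8, 10.0),   # blocked by (c); predicted start in 2 hours
    Job("q2", "carol", 2, 5.0),  # would delay q1, so it is not backfilled
    Job("q3", "dave", 4, 1.0),   # fits now and finishes before q1 starts
]
started = schedule_pass(queue, running, total_nodes=8, max_jobs=1, max_nodes=8)
```

Only `q3` starts in this pass: it fits in the four idle nodes and its requested wall time ends before the predicted start of `q1`.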
Cluster job execution (cont’d)
• Typical user jobs use 8, 16 or 32 nodes and run for a maximum wall time of 24 hours.
• A user nanny job running on the head-node executes job streams. Each job stream is a PBS job which:
  • on the job head-node (MPI node 0), copies a lattice (problem) stored in dCache to the local scratch disk.
  • divides the lattice into as many sub-lattices as there are nodes and copies each sub-lattice to the local scratch disk of its node.
  • launches an MPI process on each node, which computes its sub-lattice.
  • has the main process (MPI process 0) gather the results from each node onto the job head-node (MPI node 0) and copy the output into dCache.
  • marks checkpoints at regular intervals for error recovery.
• Output from one job stream is the input lattice for the next job stream.
• If a job stream fails, the nanny job restarts the stream from the most recently saved checkpoint.
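The nanny job's chaining and restart behaviour can be sketched as follows. This is an in-process illustration only: on the clusters each stream is a PBS job, whereas here a stream is a plain Python callable. `StreamFailure`, `run_streams`, the integer checkpoint, and the `max_restarts` cap are all assumptions of this sketch and do not appear in the original setup.

```python
from typing import Callable, List, Optional


class StreamFailure(Exception):
    """Raised by a failed stream; carries the last checkpoint it saved."""

    def __init__(self, checkpoint: Optional[int]):
        super().__init__(f"stream failed, last checkpoint: {checkpoint}")
        self.checkpoint = checkpoint


# A stream takes (input lattice, checkpoint to resume from) and returns
# its output lattice. Lattices are plain strings here for illustration.
Stream = Callable[[str, Optional[int]], str]


def run_streams(streams: List[Stream], lattice: str, max_restarts: int = 3) -> str:
    """Run streams in order, feeding each output into the next stream."""
    for stream in streams:
        checkpoint: Optional[int] = None
        for _attempt in range(max_restarts + 1):
            try:
                lattice = stream(lattice, checkpoint)
                break
            except StreamFailure as failure:
                # Resume from the most recent checkpoint, if one was saved.
                if failure.checkpoint is not None:
                    checkpoint = failure.checkpoint
        else:
            raise RuntimeError("stream exceeded max_restarts")
    return lattice


calls: List[Optional[int]] = []


def flaky(lattice: str, resume: Optional[int]) -> str:
    calls.append(resume)
    if resume is None:
        raise StreamFailure(checkpoint=40)  # fails after saving step 40
    return lattice + "->A"


def steady(lattice: str, resume: Optional[int]) -> str:
    calls.append(resume)
    return lattice + "->B"


result = run_streams([flaky, steady], "lattice0")
```

Here the first stream fails once, is restarted from checkpoint 40, and its output then becomes the input of the second stream.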
Wish List
• Missing link between the monitoring process and the scheduler. The scheduler could do better by being node and network aware.
• Ability to monitor factors that are critical to application performance (e.g. thermal instabilities can cause throttling of CPU speed, which ultimately affects performance).
• Very few automated corrective actions are defined for the components and processes that are currently being monitored.
• Ability to use current health data to predict node failures, rather than just updating mrtg plots.