Changhua Li, National Astronomical Observatory of China
Short Report on the laohu GPU cluster usage at NAOC
Introduction of Laohu
The Laohu GPU cluster was built in 2009; its single precision peak performance is 160 TFLOPs.
Total cost: 6 million RMB in 2009 (4/1 Min. of Finance ZDYZ2008-2-A06/NAOC).
Hardware configuration: 85 nodes + InfiniBand + 140 TB storage.
Node hardware: Lenovo R740, 2 Xeon E5520 CPUs, 24 GB memory, 500 GB disk, 2 NVIDIA C1060 GPU cards.
Laohu upgrade
C1060: 240 cores, 4 GB memory, 933 GFlops (single precision).
In Sep. 2013 we bought 59 K20 GPU cards for 59 nodes, spending 1.18 million RMB.
The new Laohu configuration is therefore 59 hosts with one K20 GPU card and 26 hosts with 3 C1060 GPU cards. In theory the single precision peak performance is 280 TFLOPS (roughly 59 × 3.52 TFLOPS from the K20s plus 26 × 3 × 0.933 TFLOPS from the C1060s).
Laohu management system: LSF
Platform LSF (Load Sharing Facility) is a suite of distributed resource management products that:
• connects computers into a cluster (or "grid");
• monitors the load of the systems;
• distributes, schedules and balances workload;
• controls access and load by policies;
• analyzes the workload;
• provides a High Performance Computing (HPC) environment.
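A minimal sketch of the standard LSF client commands for inspecting the cluster state (these are generic LSF commands, not Laohu-specific tools):

lsid          # cluster name and LSF version
bhosts        # job-slot usage of every node
lsload        # CPU and memory load of every node
bjobs -u all  # all running and pending jobs on the cluster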
Laohu queues for GPU jobs
GPU queues:
• gpu_16: K20 hosts, max cores: 16, min cores: 4, total cores limit: 32
• gpu_8: K20 hosts, max cores: 8, min cores: 2, total cores limit: 24
• gpu_k20_test: K20 hosts, only 2 cores per job, total cores limit: 3
• gpu_c1060: C1060 hosts, max cores: 30, min cores: 2, total cores limit: 66
• gpu_c1060_test: C1060 hosts, only 3 cores per job, total cores limit: 9
Laohu queues for CPU jobs
CPU queues:
• cpu_32: 25-32 nodes with 7/5 CPU cores per node (192 cores) per job; up to two such jobs may run. Maximum running time: 1 week.
• cpu_large: 8-22 nodes with 7/5 CPU cores per node (total: 48 cores); multiple jobs allowed. Maximum running time: 1 week.
• cpu_small: 2-8 nodes with 7/5 CPU cores per node per single job; jobs may run until 8 nodes / 48 CPU cores are filled. Maximum running time: 1 week.
• cpu_test: 1-5 nodes with 7/5 CPU cores per node (total: 30 cores); jobs may run until 5 nodes / 30 CPU cores are filled. Maximum running time: 3 hours.
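The limits quoted above can also be read directly from LSF; a short sketch using the standard bqueues command (cpu_32 is just one example queue):

bqueues             # list all queues with status and job counts
bqueues -l cpu_32   # detailed limits (run limit, job/slot limits) of one queue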
CPU job submission script
Sample 1: cpujob.lsf

#!/bin/sh
#BSUB -q cpu_32                              # job queue, modify according to user
#BSUB -a openmpi-qlc
#BSUB -R 'select[type==any] span[ptile=6]'   # resource requirement of host
#BSUB -o out.test                            # output file
#BSUB -n 132                                 # the maximum number of CPU cores
mpirun.lsf --mca btl openib,self Gadget2wy WJL.PARAM   # modify for the user's program

Submit with: bsub < cpujob.lsf
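After submission the job can be followed with standard LSF commands (a sketch; <jobid> is the ID that bsub prints):

bsub < cpujob.lsf   # submit; prints the assigned <jobid>
bjobs <jobid>       # job state (PEND / RUN / DONE / EXIT)
bpeek <jobid>       # peek at the job's output while it is running
bkill <jobid>       # cancel the job if necessary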
GPU job submission script
Sample 2: gpujob.lsf

#!/bin/sh
#BSUB -q gpu_32                   # job queue
#BSUB -a openmpi-qlc
#BSUB -R 'select[type==any]'      # resource requirement of host
#BSUB -o out.test                 # output file
#BSUB -e out.err                  # error file
#BSUB -n 20                       # the maximum number of CPU cores
mpirun.lsf --prefix "/usr/mpi/gcc/openmpi-1.3.2-qlc" -x "LD_LIBRARY_PATH=/export/cuda/lib:/usr/mpi/gcc/openmpi-1.3.2-qlc/lib64" ./phi-GRAPE.exe   # modify for the user's program

Submit with: bsub < gpujob.lsf
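Before long runs it can help to confirm that a GPU is visible on the assigned node; a minimal sketch using an interactive job on the test queue listed earlier (nvidia-smi is NVIDIA's standard status tool, and the -n value follows the gpu_k20_test per-job limit):

bsub -I -q gpu_k20_test -n 2 nvidia-smi   # run nvidia-smi interactively on a K20 node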
LAOHU Monitoring: http://laohu.bao.ac.cn
LAOHU Software
CUDA 4.0 / CUDA 5.0
OpenMPI / Intel MPI, etc.
GCC 4.1 / GCC 4.5
Intel Compiler
Math libraries: BLAS, GSL, CFITSIO, FFT, ...
Gnuplot, PGPLOT
Gadget
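A minimal sketch of building a combined CUDA + MPI code against this stack (the /export/cuda path is taken from the GPU sample script above; the source file names are placeholders):

nvcc -O2 -c gpu_kernels.cu -o gpu_kernels.o        # compile the CUDA part
mpicc -O2 -c main.c -o main.o                      # compile the MPI host part
mpicc main.o gpu_kernels.o -L/export/cuda/lib -lcudart -o myprog   # link against the CUDA runtime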
LAOHU CPU utilization ratio, 2012 (average: 74%)
LAOHU CPU utilization ratio, 2013 (average: 64%)
LAOHU Application List
1. N-body simulations (NBODY6++, phiGPU, galactic nuclei, star clusters)
2. N-body simulations (Gadget2, galactic dynamics)
3. Correlator (test only)
4. Gravitational microlensing
5. Local spiral formation through major mergers
6. Dark energy survey
7. TREND: Monte Carlo simulation of extreme-high-energy Extensive Air Showers (EAS)
8. Parallelization of the Herschel Interactive Processing Environment
9. HII region and PDR modeling based on the CLOUDY code
10. Reconstructing the primordial power spectrum and the dark energy equation of state
...
LAOHU Achievements
Berczik, P., Nitadori, K., Zhong, S., Spurzem, R., Hamada, T., Wang, X. W., Berentzen, I., Veles, A., Ge, W.: High Performance Massively Parallel Direct N-body Simulations on Large GPU Clusters. Proceedings of the International Conference on High Performance Computing.
Amaro-Seoane, P., Miller, M. C., Kennedy, G. F.: Tidal disruptions of separated binaries in galactic nuclei. Monthly Notices of the Royal Astronomical Society.
Just, A., Yurin, D., Makukov, M., Berczik, P., Omarov, C., Spurzem, R., Vilkoviskij, E. Y.: Enhanced Accretion Rates of Stars on Supermassive Black Holes by Star-Disk Interactions in Galactic Nuclei. The Astrophysical Journal.
Taani, A., Naso, L., Wei, Y., Zhang, C., Zhao, Y.: Modeling the spatial distribution of neutron stars in the Galaxy. Astrophysics and Space Science.
Olczak, C., Spurzem, R., Henning, T., Kaczmarek, T., Pfalzner, S., Harfst, S., Portegies Zwart, S.: Dynamics in Young Star Clusters: From Planets to Massive Stars. Advances in Computational Astrophysics: Methods, Tools, and Outcome.
Spurzem, R., Berczik, P., Zhong, S., Nitadori, K., Hamada, T., Berentzen, I., Veles, A.: Supermassive Black Hole Binaries in High Performance Massively Parallel Direct N-body Simulations on Large GPU Clusters. Advances in Computational Astrophysics: Methods, Tools, and Outcome.
Khan, F. M., Preto, M., Berczik, P., Berentzen, I., Just, A., Spurzem, R.: Mergers of Unequal-mass Galaxies: Supermassive Black Hole Binary Evolution and Structure of Merger Remnants. The Astrophysical Journal.
Li, S., Liu, F. K., Berczik, P., Chen, X., Spurzem, R.: Interaction of Recoiling Supermassive Black Holes with Stars in Galactic Nuclei. The Astrophysical Journal.
...
Full list: http://silkroad.bao.ac.cn/web/index.php/research/publications
Thanks!
Email: lich@nao.cas.cn