Running flexible, robust and scalable grid application: Hybrid QM/MD Simulation
Hiroshi Takemiya, Yusuke Tanimura and Yoshio Tanaka
Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology, Japan
Goals of the experiment
• To clarify the functions needed to execute large-scale grid applications, which require many computing resources for a long time
  • 1,000 ~ 10,000 CPUs
  • 1 month ~ 1 year
• Three requirements
  • Scalability: managing a large number of resources effectively
  • Robustness: fault detection and fault recovery
  • Flexibility: dynamic resource switching, since we cannot assume that all resources remain available throughout the experiment
Difficulty in satisfying these requirements
• Existing grid programming models have difficulty satisfying all three requirements
• GridRPC
  • Dynamic configuration: no co-allocation needed, so computing resources can be switched easily and dynamically
  • Good fault tolerance (detection): when one remote executable fails, the client can retry or use another remote executable (see the sketch below)
  • Hard to manage a large number of servers: the client becomes a bottleneck
• Grid-enabled MPI
  • Flexible communication: possible to avoid communication bottlenecks
  • Static configuration: co-allocation is needed and the number of processes cannot be changed during execution
  • Poor fault tolerance: a single process fault causes all processes to fail; fault-tolerant MPI is still in the research phase
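To make the fault-tolerance bullet concrete, here is a minimal retry sketch (not the authors' code) using the standard GridRPC C API that Ninf-G implements; the server names and the remote entry name "qm/qm_force" are placeholders.

```c
#include <stdio.h>
#include "grpc.h"

#define NSERVERS 2

/* Candidate servers; placeholder names, not actual testbed hosts. */
static const char *servers[NSERVERS] = { "clusterA", "clusterB" };

int main(int argc, char *argv[])
{
    grpc_function_handle_t handle;
    static double in[256], out[256];   /* placeholder argument buffers */
    int i, done = 0;

    grpc_initialize(argv[1]);          /* client configuration file */

    /* A fault on one remote executable only produces an error return here,
       so the client simply moves on to the next candidate server. */
    for (i = 0; i < NSERVERS && !done; i++) {
        if (grpc_function_handle_init(&handle, (char *)servers[i],
                                      "qm/qm_force") != GRPC_NO_ERROR)
            continue;                  /* this server could not be reached */
        if (grpc_call(&handle, in, out) == GRPC_NO_ERROR)
            done = 1;                  /* the RPC succeeded */
        grpc_function_handle_destruct(&handle);
    }

    if (!done) fprintf(stderr, "all candidate servers failed\n");
    grpc_finalize();
    return done ? 0 : 1;
}
```

Because the failure is reported to the client rather than aborting it, the retry policy stays entirely on the client side; this is the property the comparison with grid-enabled MPI relies on.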
Gridifying applications using GridRPC and MPI
• Combining GridRPC and MPI (a client-side sketch follows below)
  • GridRPC
    • Allocates server (MPI) programs dynamically
    • Supports loose communication between a client and servers
    • The client manages only tens to hundreds of server programs
  • MPI
    • Supports scalable execution of a parallelized server program
• Suitable for gridifying applications consisting of loosely coupled parallel programs
  • Multi-disciplinary simulations
  • Hybrid QM/MD simulation
[Diagram: a client invoking several MPI programs through GridRPC]
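A minimal sketch of the client side of this model, assuming the standard GridRPC C API as provided by Ninf-G; the cluster names, the remote entry name "qm/qm_force" and the buffer sizes are placeholders. Each function handle dynamically allocates one MPI server program, and the asynchronous calls let all servers run concurrently while the client only exchanges input and output data with them.

```c
#include "grpc.h"

#define NQM    5      /* number of QM (MPI) server programs; placeholder */
#define MAXBUF 256    /* per-region data buffer size; placeholder        */

/* Placeholder cluster names, not the actual testbed hosts. */
static const char *clusters[NQM] = { "c0", "c1", "c2", "c3", "c4" };

int main(int argc, char *argv[])
{
    grpc_function_handle_t h[NQM];
    grpc_sessionid_t sid[NQM];
    static double in[NQM][MAXBUF], out[NQM][MAXBUF];
    int i;

    grpc_initialize(argv[1]);          /* Ninf-G client configuration file */

    /* Dynamically allocate one MPI server program per cluster. */
    for (i = 0; i < NQM; i++)
        grpc_function_handle_init(&h[i], (char *)clusters[i], "qm/qm_force");

    /* Asynchronous calls: all servers compute in parallel; the client only
       sends inputs and receives results (loose communication).            */
    for (i = 0; i < NQM; i++)
        grpc_call_async(&h[i], &sid[i], in[i], out[i]);

    grpc_wait_all();                   /* wait for every server to finish */

    for (i = 0; i < NQM; i++)
        grpc_function_handle_destruct(&h[i]);
    grpc_finalize();
    return 0;
}
```

Inside each server the MPI processes communicate tightly as usual; GridRPC only covers the loose client-server exchange, which is why the two models complement each other.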
Related Work
• Scalability: large-scale experiment at SC2004
  • Gridified the QM/MD simulation program based on our approach
  • Executed a simulation using ~1,800 CPUs on 3 clusters
  • Showed that our approach can manage a large number of computing resources
• Robustness: long-run experiment on the PRAGMA testbed
  • Executed a TDDFT program for over a month
  • Showed that Ninf-G can detect server faults and return errors correctly
• This work: an experiment to show the validity of our approach
  • Long-run QM/MD simulation on the PRAGMA testbed
  • Implements a scheduling mechanism as well as a fault-tolerance mechanism
Large-scale experiment at SC2004
• Used 1,793 CPUs in total on 3 clusters
• Succeeded in running the QM/MD program for over 11 hours
• Showed that our approach can manage a large number of resources
• Simulated system: an MD region of 110,000 atoms with QM regions of 69 atoms (including 2H2O+2OH), 68 atoms (including H2O), 56 atoms (including H2O) and 44 atoms (including H2O)
• Resources
  • ASC@AIST (1,281 CPUs)
    • P32 (1,024 CPUs): Opteron (2.0 GHz) 2-way cluster
    • F32 (257 CPUs): Xeon (3.06 GHz) 2-way cluster
  • TCS@PSC (512 CPUs): ES45 Alpha (1.0 GHz) 4-way cluster
[Diagram: allocation across partitions of 512 CPUs (P32), 512 CPUs (P32), 512 CPUs (TCS), 256 CPUs (F32) and 1 CPU (F32)]
Long-run experiment on the PRAGMA testbed
• Purpose
  • Evaluate the quality of Ninf-G2
  • Gain experience in how GridRPC applications can adapt to faults
• Ninf-G stability
  • Number of executions: 43
  • Execution time: 50.4 days in total, 6.8 days maximum, 1.2 days on average
  • Number of RPCs: more than 2,500,000
  • Number of RPC failures: more than 1,600 (an error rate of about 0.064%)
  • Ninf-G detected these failures and returned errors to the application
The present experiment reinforces the validity of our approach: a long-run QM/MD simulation on the PRAGMA testbed, implementing a scheduling mechanism for flexibility as well as a fault-tolerance mechanism for robustness.
Necessity of Large-scale Atomistic Simulation
• Modern material engineering requires detailed knowledge based on microscopic analysis
  • Future electronic devices
  • Micro electro mechanical systems (MEMS)
• Features of the analysis
  • Nano-scale phenomena involving a large number of atoms
  • Sensitive to the environment
  • Very high precision
  • Quantum description of bond breaking (e.g. does stress enhance the possibility of corrosion?)
• These features call for large-scale atomistic simulation
[Figures: stress distribution and deformation process]
Hybrid QM/MD Simulation (1)
• Enables large-scale simulation with quantum accuracy by combining classical MD simulation with QM simulation
• MD simulation
  • Simulates the behavior of atoms in the entire region
  • Based on classical MD using an empirical inter-atomic potential
• QM simulation
  • Corrects the energy calculated by the MD simulation only in the regions of interest (see the energy expression below)
  • Based on density functional theory (DFT)
[Diagram: QM regions, simulated with DFT, embedded in the MD simulation region]
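The slides do not spell out the energy expression, but the additive hybridization referred to on the next slide is conventionally written as follows (a hedged reconstruction, assuming the standard additive QM/MD scheme); the classical energy of each region of interest is replaced by its DFT value, and the forces follow from the gradient:

$$
E_{\mathrm{total}} \;=\; E_{\mathrm{MD}}^{\mathrm{entire\ system}}
\;+\; \sum_{i=1}^{N_{\mathrm{QM}}} \left( E_{\mathrm{QM}}^{\,\mathrm{region}\ i} \;-\; E_{\mathrm{MD}}^{\,\mathrm{region}\ i} \right),
\qquad
\mathbf{F}_a \;=\; -\,\frac{\partial E_{\mathrm{total}}}{\partial \mathbf{r}_a}
$$

Because each correction term depends only on the atoms of its own region, the QM regions can be evaluated independently of one another, which is what makes the simulation a natural fit for GridRPC.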
Hybrid QM/MD Simulation (2)
• Suitable for grid computing
  • Additive hybridization: QM regions can be set at will and calculated independently
  • Computation dominant: MD and QM parts are loosely coupled
    • Communication cost between QM and MD: ~O(N)
    • Computation cost of QM: ~O(N³), very large
    • Computation cost of MD: ~O(N)
  • Many sources of parallelism
    • MD simulation: executed in parallel (tight communication)
    • Each QM simulation: executed in parallel (tight communication)
    • QM simulations: executed independently of each other (no communication)
    • MD and QM simulations: executed in parallel (loosely coupled)
[Diagram: the MD simulation loosely coupled to independent QM simulations (QM1, QM2), each tightly coupled internally]
Modifying the Original Program
• Eliminating the initial set-up routine in the QM program; adding an initialization function instead
• Eliminating the loop structure in the QM program; tailoring the QM simulation as a callable function
• Replacing the MPI routines between the MD and QM parts with Ninf-G function calls (a structural sketch follows below)
[Diagram: per time step, the MD part calculates the MD forces of the QM+MD regions, sends the data of the QM atoms to the QM part, which calculates the QM force of each QM region and returns the QM forces; the MD part then calculates the MD forces of the QM regions and updates the atomic positions and velocities]
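Extending the client sketch above, a hypothetical C-flavored outline of the restructured program might look as follows (the real code is Fortran; the md_*, pack and apply routines are empty stand-ins, and the cluster names, the entry name "qm/qm_force", the buffer sizes and the step count are placeholders). The QM part's own set-up and outer loop are gone; the MD time-step loop drives the remote QM calls.

```c
#include "grpc.h"

#define NQM    5        /* number of QM regions (placeholder)    */
#define NSTEP  10       /* MD time steps for this sketch         */
#define MAXBUF 256      /* per-region buffer size (placeholder)  */

/* Placeholder cluster names, one per QM region. */
static const char *clusters[NQM] = { "c0", "c1", "c2", "c3", "c4" };

/* Empty stand-ins for the MD part, which in the real code is Fortran. */
static void md_initial_setup(void) {}
static void calc_md_forces_all(void) {}               /* MD forces of QM+MD regions  */
static void pack_qm_atoms(int r, double *b) { (void)r; (void)b; }
static void apply_qm_forces(int r, const double *b) { (void)r; (void)b; }
static void calc_md_forces_qm_regions(void) {}        /* MD forces of the QM regions */
static void update_atoms(void) {}                     /* positions and velocities    */

int main(int argc, char *argv[])
{
    grpc_function_handle_t h[NQM];
    grpc_sessionid_t sid[NQM];
    static double in[NQM][MAXBUF], out[NQM][MAXBUF];
    int i, step;

    grpc_initialize(argv[1]);
    md_initial_setup();

    /* The QM program's own initial set-up and loop were removed; it is now
       a single remote function, bound here once per region/cluster.        */
    for (i = 0; i < NQM; i++)
        grpc_function_handle_init(&h[i], (char *)clusters[i], "qm/qm_force");

    for (step = 0; step < NSTEP; step++) {
        calc_md_forces_all();
        for (i = 0; i < NQM; i++) {
            pack_qm_atoms(i, in[i]);                  /* data of QM atoms   */
            grpc_call_async(&h[i], &sid[i], in[i], out[i]);
        }
        grpc_wait_all();                              /* QM forces returned */
        for (i = 0; i < NQM; i++)
            apply_qm_forces(i, out[i]);
        calc_md_forces_qm_regions();
        update_atoms();
    }

    for (i = 0; i < NQM; i++)
        grpc_function_handle_destruct(&h[i]);
    grpc_finalize();
    return 0;
}
```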
Implementation of a scheduling mechanism
• A scheduling layer is inserted between the application layer and the GRPC layer in the client program
  • The application does not need to care about scheduling
• Functions of the layer (a sketch follows below)
  • Dynamic switching of target clusters
    • Checks the availability of each cluster: its available period and maximum execution time
  • Error detection and recovery
    • Detects server errors and time-outs
    • Time-outs prevent the application from waiting too long, e.g. a long wait in the batch queue or a long data transfer
    • On an error or time-out, the layer tries to continue the simulation on another cluster
• Implemented using Ninf-G
[Diagram: client program layers, from top to bottom: QM/MD simulation layer (Fortran), scheduling layer, GRPC layer (Ninf-G system)]
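A hedged sketch of such a scheduling-layer wrapper, assuming the grpc_call_async, grpc_probe, grpc_cancel and grpc_wait calls behave as in the GridRPC specification that Ninf-G implements; the cluster list, the entry name "qm/qm_force", the buffer layout and the time-out value are placeholders, not the values used in the experiment.

```c
#include <time.h>
#include <unistd.h>
#include "grpc.h"

#define NCLUSTER     8       /* candidate clusters (placeholder)       */
#define CALL_TIMEOUT 3600    /* seconds allowed per RPC (placeholder)  */

/* Placeholder cluster names, not the actual testbed hosts. */
static const char *clusters[NCLUSTER] = {
    "c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"
};

/* The application just asks for QM forces; the scheduling layer picks a
 * cluster, times the call out if it waits too long (e.g. in a batch queue
 * or during a slow transfer), and retries on the next candidate cluster. */
int sched_qm_force(double *in, double *out)
{
    int c;
    for (c = 0; c < NCLUSTER; c++) {
        grpc_function_handle_t h;
        grpc_sessionid_t sid;
        time_t start = time(NULL);

        if (grpc_function_handle_init(&h, (char *)clusters[c],
                                      "qm/qm_force") != GRPC_NO_ERROR)
            continue;                           /* cluster not reachable    */
        if (grpc_call_async(&h, &sid, in, out) != GRPC_NO_ERROR) {
            grpc_function_handle_destruct(&h);  /* could not submit the job */
            continue;
        }
        while (grpc_probe(sid) == GRPC_NOT_COMPLETED) {
            if (time(NULL) - start > CALL_TIMEOUT) {
                grpc_cancel(sid);               /* stuck in queue/transfer  */
                break;
            }
            sleep(10);
        }
        if (grpc_wait(sid) == GRPC_NO_ERROR) {  /* collect result or error  */
            grpc_function_handle_destruct(&h);
            return 0;                           /* success on this cluster  */
        }
        grpc_function_handle_destruct(&h);      /* otherwise: next cluster  */
    }
    return -1;                                  /* every cluster failed     */
}
```

Keeping this logic below the application layer is what lets the Fortran QM/MD code remain unaware of which cluster actually served each call.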
Long-run experiment on the PRAGMA testbed
• Goals
  • Continue the simulation as long as possible
  • Check the validity of our programming approach
• Experiment period
  • Started on 18 April
  • Planned to run until the end of May (hopefully)
• Target simulation
  • 5 QM atoms inserted into a box-shaped Si system of 1,728 atoms in total
  • 5 QM regions, each consisting of only 1 atom
[Figures: the entire region, the central region, and the time evolution of the system]
Testbed for the experiment
• 8 clusters at 7 institutes in 5 countries
  • AIST: UME; KU: AMATA; NCHC: ASE; NCSA: TGC; SDSC: Rocks-52 and Rocks-47; SINICA: PRAGMA; UNAM: Malicia
  • Porting is under way for 5 more clusters (BII, CNIC, KISTI, TITECH and USM)
• Using 2 CPUs for each QM simulation
• Changing the target cluster every 2 hours
Porting the application
• 5 steps to port our application
  • (1) Check accessibility using ssh
  • (2) Execute a sequential program using globus-job-run
  • (3) Execute an MPI program using globus-job-run
  • (4) Execute a Ninf-ied program
  • (5) Execute our application
• Troubles
  • The jobmanager-sge had bugs in executing MPI programs; a fixed version was released by AIST
  • An inappropriate MPI library was specified in some jobmanagers
    • LAM/MPI does not support execution through Globus
    • MPICH-G is not available due to the certificate problem
    • We recommended using the MPICH library
[Diagram: client, front-end and back-end nodes, showing GRAM, PBS/SGE, mpirun, and limited vs. full certificates]
Executing the application
• Expiration of certificates
  • We had to keep track of many kinds of Globus-related certificates: user certs, host certs, CA certs, CRLs, …
  • Globus error messages are unhelpful (e.g. "check host and port")
• Poor I/O performance
  • Programs compiled with the Intel Fortran compiler took a long time for I/O: 2 hours to output several MB of data!
    • Worked around by specifying buffered I/O
  • Using an NFS file system is another cause of poor I/O performance
• Remaining processes
  • Server processes sometimes remained on the back-end nodes even after the job had been deleted from the batch queue
  • SCMS web is very convenient for finding such remaining processes
Preliminary result of the experiment
• Succeeded in calculating ~10,000 time steps during 2 weeks
• Number of GRPCs executed: 47,593
• Number of failures/time-outs: 524
  • Most of them (~80%) occurred in the connection phase, due to connection failures, batch system downtime, or queueing time-outs (the queueing time-out is set to ~60 sec)
  • Other failures include:
    • Exceeding the maximum execution time (2 hours)
    • Exceeding the maximum execution time per time step (5 min)
    • Exceeding the maximum CPU time specified by the cluster (900 sec)
Execution Profile: Scheduling
[Chart: example of exceeding the maximum execution time, with annotated intervals of ~60 sec and ~80 sec]
Execution Profile: Error Recovery
• Examples of error recovery
  • Batch system fault
  • Queueing time-out
  • Execution time-out
[Chart: execution timeline annotated with these recovered errors]