NWChem software development with Global Arrays Edoardo Aprà High Performance Computational Chemistry Environmental Molecular Sciences Laboratory Pacific Northwest National Laboratory Richland, WA 99352
Outline
• Global Arrays Toolkit overview
• NWChem overview
• Performance
• Usage of Global Arrays in NWChem
Global Arrays programming interface
[Figure: physically distributed data presented to the programmer as a single, shared data structure]
• Shared-memory programming in the context of distributed dense arrays
• combines the better features of shared memory and message passing
• explicit control of data distribution & locality, as in MPI codes
• shared-memory ease of use
• focus on the hierarchical memory of modern computers (NUMA)
• for highest performance, requires porting to each individual platform
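To make the shared-memory-plus-locality model concrete, here is a minimal Fortran sketch in the style of the GA sample program shown later in the talk (it is not taken from the slides): a distributed array is created collectively, then each process reads and updates remote patches with one-sided calls, without cooperation from the owning process. The array size, patch bounds and MA stack/heap sizes are illustrative assumptions.

c     minimal GA one-sided example (sketch; sizes are illustrative)
      program ga_onesided
      implicit none
#include "mafdecls.fh"
#include "global.fh"
      integer ierr, me, g_a, ilo, ihi, jlo, jhi
      double precision buf(100,100), one
      logical status
      call mpi_init(ierr)                          ! message-passing init
      call ga_initialize()                         ! start Global Arrays
      status = ma_init(MT_DBL, 100000, 100000)     ! local memory allocator
      me = ga_nodeid()
c     collective: create a 1000x1000 double array distributed over all procs
      status = ga_create(MT_DBL, 1000, 1000, 'A', 0, 0, g_a)
      call ga_zero(g_a)
c     one-sided: fetch a patch that may live in another process's memory
      ilo = 1
      ihi = 100
      jlo = 1
      jhi = 100
      call ga_get(g_a, ilo, ihi, jlo, jhi, buf, 100)
c     one-sided: atomically accumulate the patch back (A += 1.0*buf)
      one = 1.0d0
      call ga_acc(g_a, ilo, ihi, jlo, jhi, buf, 100, one)
      call ga_sync()                               ! collective barrier
      status = ga_destroy(g_a)
      call ga_terminate()
      call mpi_finalize(ierr)
      end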
Global Arrays Overview
• A substantial capability extension to the message-passing model
• Compatible with MPI and standard parallel libraries
  • ScaLAPACK, SUMMA, PeIGS, PETSc, CUMULVS
  • BLAS interface
• Forms part of a bigger system for NUMA programming
  • extensions for computational grid environments: Mirrored Arrays
  • natural parallel I/O extensions: Disk Resident Arrays
• Originated in a High Performance Computational Chemistry project
• Used by 5 major chemistry codes, financial futures forecasting, astrophysics, computer graphics
Global Array Model of Computations
[Figure: each process copies data from the shared object into local memory with 1-sided communication, computes/updates in local memory, and copies the result back to the shared object with 1-sided communication]
ARMCI: first portable 1-sided communication library
• Core communication capabilities of GA were generalized, extended, and made standalone
• Approach
  • simple progress rules
  • less restrictive than MPI-2 (includes 1-sided communication)
  • low-level
  • high-performance
  • optimized for noncontiguous data transfers (array sections, scatter/gather)
  • implemented using whatever mechanisms work best on a given platform:
    • active messages, threads, sockets, shared memory, remote memory copy
• Used by GA as its run-time system and contributed to other projects
  • Adlib (U. Syracuse), Padre/Overture (LLNL), DDI (Ames)
Intelligent Protocols in ARMCI
On an IBM SP with SMP nodes, ARMCI exploits
• cluster locality information
• Active Messages
• remote memory copy
• shared memory within an SMP node
• threads
[Figure: a put between SMP nodes goes through LAPI and the adapters across the switch, while a put within an SMP node is a memcpy into a logically partitioned shared memory segment shared with the target process]
Advanced networks
• Exciting research and development opportunities
• High performance: <10 µs latency, 100-200 MB/s bandwidth
• Low cost
• Flexible
• Finally, the traditionally closed interfaces are opening
  • protocols: GM (Myrinet), VIA (Giganet), Elan (Quadrics)
  • offer a lot of capabilities and performance to support not only MPI but also more advanced models
• H/W support for 1-sided communication
  • NIC support
  • more of the high-level protocols pushed down to h/w
  • opportunities for optimization, e.g. fast collective operations
High Performance Parallel I/O Models
• Disk Resident Arrays: array objects on disk (RI-SCF, RI-MP2)
• Shared Files: "shared memory on disk" (MRCI)
• Exclusive Access Files: private files per processor (semidirect SCF and MP2)
• A system of hints allows performance tuning to match application characteristics
[Figure: layered design — the application layer sits on the ELIO device library and Distant I/O (one-sided communication to disk) as the portability layer, above the filesystem layer (filesystems A, B, C) and the hardware]
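As one concrete illustration of these models, below is a hedged Fortran sketch of how a module might stage a global array through a Disk Resident Array. It follows the DRA interface distributed with GA (dra_init, dra_create, dra_write/dra_read, dra_wait), but the file name, sizes and resource limits are illustrative assumptions and the exact argument lists should be checked against the DRA documentation.

c     sketch: dump a global array to a Disk Resident Array and read it back
c     (sizes, limits and the file name are illustrative)
      subroutine dump_to_dra(g_a, n)
      implicit none
#include "global.fh"
#include "mafdecls.fh"
#include "dra.fh"
      integer g_a, n
      integer d_a, req, rc
c     initialize DRA: max # of arrays, max array size, total disk, memory
      rc = dra_init(4, 1.0d8, 1.0d9, 1.0d7)
c     create an n x n double-precision array object on disk
      rc = dra_create(MT_DBL, n, n, 'fock', 'fock.dra', DRA_RW, n, n, d_a)
c     asynchronous write of the whole global array, then wait for completion
      rc = dra_write(g_a, d_a, req)
      rc = dra_wait(req)
c     ... later: read it back into the same (or another) global array
      rc = dra_read(g_a, d_a, req)
      rc = dra_wait(req)
      rc = dra_close(d_a)
      rc = dra_terminate()
      end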
Forms of Parallel I/O in Chemistry Apps
• independent I/O to a shared file
• collective I/O to a shared file
• independent I/O to private files
GA Interface to PeIGS 3.0
(Solution of real symmetric generalized and standard eigensystem problems)
• Unique features not available elsewhere
  • Inverse iteration using Dhillon-Fann-Parlett's parallel algorithm (fastest uniprocessor performance and good parallel scaling)
  • Guaranteed orthonormal eigenvectors in the presence of large clusters of degenerate eigenvalues
  • Packed storage
  • Smaller scratch space requirements
[Figure: time for the Householder, eigenvector, and backtransform steps and the total, versus number of processors (up to 128), for a full eigensolution of a 966x966 matrix]
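The sample program later in the talk calls the standard eigensolver; for the generalized problem A x = λ S x that PeIGS also handles, the GA interface is a single collective call. The sketch below (not from the slides) uses the ga_diag wrapper over PeIGS; the array handles, names and the way the matrices are filled are illustrative assumptions.

c     sketch: generalized real-symmetric eigenproblem A x = lambda S x
c     via the GA interface to PeIGS (names and fill step are illustrative)
      subroutine solve_geneig(n, evals)
      implicit none
#include "mafdecls.fh"
#include "global.fh"
      integer n
      double precision evals(n)
      integer g_a, g_s, g_v
      logical status
      status = ga_create(MT_DBL, n, n, 'A', 0, 0, g_a)
      status = ga_create(MT_DBL, n, n, 'S', 0, 0, g_s)
      status = ga_create(MT_DBL, n, n, 'V', 0, 0, g_v)
c     ... fill g_a (e.g. Fock) and g_s (overlap) collectively or one-sided ...
      call ga_diag(g_a, g_s, g_v, evals)   ! eigenvectors in g_v, values in evals
      status = ga_destroy(g_a)
      status = ga_destroy(g_s)
      status = ga_destroy(g_v)
      end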
Why NWChem Was Developed
• Developed as part of the construction of the Environmental Molecular Sciences Laboratory (EMSL)
• Envisioned to be used as an integrated component in solving DOE's Grand Challenge environmental restoration problems
• Designed and developed to be a highly efficient and portable MPP computational chemistry package
• Provides computational chemistry solutions that are scalable with respect to chemical system size as well as MPP hardware size
NWChem Software Development Cycle
[Figure: development cycle linking the chemistry activities — computational chemistry models, new theory development, theory (method development), algorithm development, and prototyping (implementation) — to the requirements definition and analysis, preliminary design, detailed design, implementation, and acceptance test phases]
• Research including work in:
  • theory innovation
  • O(N) reduction
  • improved MPP algorithms
  • etc.
NWChem Architecture
[Figure: layered architecture]
• Generic Tasks: Energy, structure, ...; Optimize, Dynamics, ...
• Molecular Calculation Modules: SCF energy, gradient, ...; DFT energy, gradient, ...; MD, NMR, Solvation, ...
• Molecular Modeling Toolkit: Basis Set Object, Geometry Object, Integral API, PeIGS, ...
• Molecular Software Development Toolkit: Run-time database, Parallel IO, Memory Allocator, Global Arrays
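Of the Molecular Software Development Toolkit components listed above, the Memory Allocator (MA) is the one every module touches for local scratch. The sketch below (not from the slides) shows the stack-style MA idiom; the buffer name, its length and the subroutine are illustrative assumptions, while ma_push_get/ma_pop_stack and the dbl_mb array come from the standard MA Fortran interface.

c     sketch: stack allocation of local scratch through the Memory Allocator
c     (buffer name and length are illustrative)
      subroutine use_ma(nbf)
      implicit none
#include "mafdecls.fh"
      integer nbf
      integer l_scr, k_scr, i       ! MA handle and index into dbl_mb
      logical status
c     push an nbf*nbf double-precision scratch block onto the MA stack
      status = ma_push_get(MT_DBL, nbf*nbf, 'scratch', l_scr, k_scr)
      if (.not. status) stop 'use_ma: MA allocation failed'
c     the block is addressed through the dbl_mb array at offset k_scr
      do i = 0, nbf*nbf - 1
         dbl_mb(k_scr + i) = 0.0d0
      end do
c     ... use the block as local scratch ...
      status = ma_pop_stack(l_scr)  ! release in LIFO order
      end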
Program Modules
[Figure: the input and a run-time database (holding e.g. basis, theory, energy, status, geometry, filenames, operation) connect modules such as SCF, DFT, MP2, CCSD, nwArgos, Stepper and Driver]
• All major functions exist as independent modules
• Modules communicate only through the database and files
• No shared common blocks
• The only argument to a module is the database
• Modules have well-defined actions
• Modules can call other modules
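A minimal sketch of what "the only argument to a module is the database" looks like in practice is shown below. It is not from the slides: the module name and the database keys ('mymodule:thresh', 'mymodule:energy') are hypothetical, while rtdb_get/rtdb_put follow the NWChem run-time database Fortran interface.

c     sketch: a module talking to the run-time database only
c     (module name and database keys are hypothetical examples)
      logical function mymodule(rtdb)
      implicit none
#include "mafdecls.fh"
#include "rtdb.fh"
      integer rtdb                       ! the only argument: the database handle
      double precision thresh, energy
      logical status
c     read a parameter the input parser (or another module) stored earlier
      if (.not. rtdb_get(rtdb, 'mymodule:thresh', MT_DBL, 1, thresh))
     &     thresh = 1.0d-6               ! fall back to a default
c     ... do the work (create GAs, call the integral API, etc.) ...
      energy = 0.0d0
c     publish the result so that other modules (e.g. the driver) can read it
      status = rtdb_put(rtdb, 'mymodule:energy', MT_DBL, 1, energy)
      mymodule = status
      end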
NWChem is supported on at least these platforms, and is readily ported to essentially all sequential and parallel computers:
• IBM SP
• IBM workstations
• CRAY T3
• SGI SMP systems
• Fujitsu VX/VPP
• SUN and other homogeneous workstation networks
• x86-based workstations running Linux, including laptops
• x86-based workstations running NT or Win98
• Tru64 and Linux Alpha servers (including the SC series)
MD Benchmarks
[Figure: speedup versus number of nodes, compared with linear scaling, on CRAY T3E-900, IBM SP2 and iPSC/860 (CHARMM, Brooks et al.) for three systems]
• Myoglobin in water: 10,914 atoms, Rc = 1.6 nm
• Dichloroethane-water interface: 100,369 atoms, Rc = 1.8 nm
• Octanol: 216,000 atoms, Rc = 2.4 nm
Myoglobin
[Figure: speedup versus number of nodes (up to 128) on CRAY T3E-900, IBM SP2 and iPSC/860 (CHARMM, Brooks et al.), compared with linear scaling]
Wall-clock times and scaling obtained for a 1000-step NWChem classical molecular dynamics simulation of myoglobin in water, using the AMBER force field and a 1.6 nm cutoff radius. The system size is 10,914 atoms.
SCF: Parallel Performance
• Hardware: IBM Poughkeepsie vendor rack, 160 MHz nodes, 512 MB/node, 3 GB disk/node, 15 MB/sec/node sustained read bandwidth
• Problem: 105 atoms, 1343 functions, 362 electrons, Dunning augmented cc-pVDZ basis set
[Figure: CPU and wall-clock speedup and total disk space (GB) versus number of processors (up to 240), with wall-clock times annotated at 37 hours and 5.7 hours]
MP2 Gradient Parallel Scaling
• Problem: KC8O4H16; C/O/H: aug-cc-pVDZ, K: Ahlrichs pVDZ; 458 functions, 114 electrons
• Hardware: 120 MHz P2SC, 128 MB memory/node, 2 GB disk/node, TB3 switch; Global Arrays over TCGMSG+LAPI
[Figure: wall time (log scale) versus number of processors (50-200) for the total and for the components (forward transformation, make-t, Lai, back-transformation, non-separable, CPHF, separable, Fock), with total wall times annotated at 2.9 hours and 0.98 hours]
DFT Benchmarks
• Si8O7H18: 347 basis functions, LDA wavefunction
• Si28O67H30: 1687 basis functions, LDA wavefunction
GA Operations
• Primitive operations invoked collectively by all processes:
  • create a distributed array, controlling alignment and distribution
  • destroy an array
  • synchronize
• Primitive operations in MIMD style:
  • fetch, store and accumulate into a rectangular patch of a global array
  • gather and scatter
  • atomic read-and-increment (see the sketch below)
• BLAS-like data-parallel operations on sections or entire global arrays:
  • vector operations including dot product, scale, add
  • matrix operations including symmetrize, transpose, multiplication
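The atomic read-and-increment is what GA-based chemistry codes typically use for dynamic load balancing: a 1x1 global array serves as a shared task counter. The sketch below is not from the slides; the subroutine, counter name and task loop are illustrative assumptions around the ga_read_inc call.

c     sketch: dynamic load balancing with an atomic shared counter
c     (the task body and counts are illustrative)
      subroutine dynamic_tasks(ntasks)
      implicit none
#include "mafdecls.fh"
#include "global.fh"
      integer ntasks
      integer g_cnt, itask
      logical status
c     a 1x1 integer global array used as a shared counter
      status = ga_create(MT_INT, 1, 1, 'counter', 1, 1, g_cnt)
      call ga_zero(g_cnt)
      call ga_sync()
c     each process atomically grabs the next task index until none are left
      itask = ga_read_inc(g_cnt, 1, 1, 1)
      do while (itask .lt. ntasks)
c        ... process task number itask (e.g. one block of integrals) ...
         itask = ga_read_inc(g_cnt, 1, 1, 1)
      end do
      call ga_sync()
      status = ga_destroy(g_cnt)
      end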
GA sample program
      status = ga_create(mt_dbl,n,n,'matrix',n,0,g_a)
      status = ga_create(mt_dbl,n,n,'matrix',n,0,g_b)
      status = ga_create(mt_dbl,n,n,'matrix',n,0,g_c)
      ...
      call ga_dgemm('N','N',n,n,n,1d0,g_a,g_b,0d0,g_c)
      call ga_print(g_c)
      call ga_diag_std(g_c,g_a,evals)
      status = ga_destroy(g_a)
The SCF Algorithm in a Nutshell
[Figure: the density matrix and Fock matrix expressions at the heart of the SCF iteration]
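For reference, the density and Fock matrices the SCF iteration is built around are, in standard closed-shell notation (a reconstruction in LaTeX, not necessarily the exact form on the original slide):

D_{\mu\nu} = 2 \sum_{i}^{\mathrm{occ}} C_{\mu i} C_{\nu i}

F_{\mu\nu} = h_{\mu\nu} + \sum_{\lambda\sigma} D_{\lambda\sigma} \left[ (\mu\nu|\lambda\sigma) - \tfrac{1}{2} (\mu\lambda|\nu\sigma) \right]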
Acknowledgments
• Jarek Nieplocha
• The NWChem development team
NWChem Developers
HPCC Developers
• Staff: Dr. Edo Apra, Dr. Eric Bylaska, Dr. Michel Dupuis, Dr. George Fann, Dr. Robert Harrison, Dr. Rick Kendall, Dr. Jeff Nichols, Dr. T. P. Straatsma, Dr. Theresa Windus
• Research Fellows: Dr. Ken Dyall, Prof. Eric Glendenning, Dr. Benny Johnson, Prof. Joop van Lenthe, Dr. Krzysztof Wolinski
• Post-Doctoral Fellows: Dr. James Anchell, Dr. Dave Bernholdt, Dr. Piotr Borowski, Dr. Terry Clark, Dr. Holger Dachsel, Dr. Miles Deegan, Dr. Bert de Jong, Dr. Herbert Fruchtl, Dr. Ramzi Kutteh, Dr. Xiping Long, Dr. Baoqi Meng, Dr. Gianni Sandrone, Dr. Mark Stave, Dr. Hugh Taylor, Dr. Adrian Wong, Dr. Zhiyong Zhang
Non-HPCC Developers
• Staff: Dr. Dave Elwood, Dr. Maciej Gutowski, Dr. Anthony Hess, Dr. John Jaffe, Dr. Rik Littlefield, Dr. Jarek Nieplocha, Dr. Matt Rosing, Dr. Greg Thomas
• Post-Doctoral Fellows: Dr. Zijing Lin, Dr. Rika Kobayashi, Dr. Jialin Ju
• Graduate Students: Dr. Daryl Clerc