ACESIII
ACES Q. C.

Outline
• Design philosophy
• Implementation
• Results
• Conclusions

Collaborators
• Mr. Mark Ponton, ACES Q. C. (SIP/SIAL/Compiler)
• Dr. Norbert Flocke, QTP (Integral package)
• Dr. Erik Deumens, QTP (Architect)
• Dr. Ajith Perera, QTP
• Dr. H. Lei, ACES Q. C. (Compiler)
• Dr. Anthony Yau, HPTi
• Dr. Rodney Bartlett, QTP
Traditional Design [diagram labels: code]
ACESIII Design [diagram labels: control, compute, communication, disk I/O; code; hardware]
ACESIII design [diagram labels]
• High level: problem concepts, algorithms; SIAL (Super Instruction Assembly Language)
• Low level: performance, communication, data structures, input/output; SIP (Super Instruction Processor, xaces3)
• Input, output
SIAL (Super Instruction Assembly Language)
Key features
• Index segmentation
• Data blocking
• Task isolation
Advantages
• Flexibility
• Tunability: fast optimization
• New methods implemented in reduced time
• Portable
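The index segmentation and data blocking listed above can be illustrated with a minimal Python sketch. This is not SIAL or ACESIII code: the function names `segment_index` and `block_ranges` and the segment size are hypothetical, chosen only to show how an index range can be cut into segments and how a two-index array then decomposes into blocks that could each be handled as an isolated task.

```python
from itertools import product

def segment_index(n, seg_size):
    """Split the index range 0..n-1 into contiguous (start, stop) segments."""
    return [(start, min(start + seg_size, n)) for start in range(0, n, seg_size)]

def block_ranges(n_rows, n_cols, seg_size):
    """Enumerate the blocks of a two-index array as pairs of index segments."""
    return list(product(segment_index(n_rows, seg_size),
                        segment_index(n_cols, seg_size)))

# Hypothetical sizes: 10 indices per dimension, segments of at most 4.
segments = segment_index(10, 4)
blocks = block_ranges(10, 10, 4)
print(segments)      # the last segment is shorter than the others
print(len(blocks))   # one block per pair of segments
```

Changing only the segment size changes the block granularity without touching the algorithm, which is one way to read the "tunability" point above.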
Implemented

Method                      References
SCF                         RHF, UHF
MBPT(2) gradient            RHF, UHF, ROHF
CCSD gradient               RHF, UHF
CCSD(T)                     RHF, UHF
MBPT(2) Hessian             RHF, UHF, ROHF
EOM CCSD (Tomasz Kus)       RHF, UHF
CCSD(T)
• SCF: Easy if you have a good integrals package
• Transformation: Hard but small cost
• CCSD: Hard as highly nonlinear
• CCSD(T): Trivial!!!
At least that is the common wisdom.
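(T) Strategy
• Occupied orbitals split into segments o1, o2, o3, o4
• Each segment gives a partial energy: E1, E2, E3, E4
• The partial energies sum to E(TOTAL)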
Advantages of DUAL layer parallelism
• Less data replication and fewer I/O bottlenecks
• Trivial restart capability
• Better turnaround due to queuing
• Since more processors are used, the effective (T) time is comparable to the CCSD time, making the CCSD step as important as, or more important than, the (T) step!
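A minimal Python sketch of the dual-layer idea: the occupied range is split into segments, each segment is treated as an independent job that returns a partial energy, and the partial energies are summed. This is not the ACESIII implementation. `partial_energy` is a toy stand-in for the real (T) work, and `total_energy` and its `finished` argument are hypothetical names used to show why restart is trivial: jobs that already finished are simply reused.

```python
def partial_energy(segment):
    """Toy stand-in for one independent job over a segment of occupied indices."""
    return sum(-1.0 / (i + 1) ** 2 for i in segment)

def total_energy(n_occ, n_jobs, finished=None):
    """Sum partial energies over occupied segments, reusing finished jobs."""
    finished = {} if finished is None else finished
    size = -(-n_occ // n_jobs)  # ceiling division: indices per job
    for job in range(n_jobs):
        if job not in finished:  # restart: only missing jobs are computed
            segment = range(job * size, min((job + 1) * size, n_occ))
            finished[job] = partial_energy(segment)
    return sum(finished.values()), finished

# Hypothetical run: 46 occupied indices split over 4 jobs.
energy, results = total_energy(46, 4)

# Simulated restart: pretend only jobs 0 and 1 survived a crash.
survivors = {job: results[job] for job in (0, 1)}
restarted, _ = total_energy(46, 4, finished=survivors)
print(abs(restarted - energy) < 1e-12)
```

Because each job depends only on its own segment, the jobs can also be submitted separately to a batch queue, which is the turnaround advantage listed above.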
CCSD(T)

Luciferin (C11H8O3S2N2)
• RHF
• C1 symmetry
• Basis = aug-cc-pvdz (498bf)
• Ncorrocc = 46

Sucrose (C12H22O11)
• RHF
• C1 symmetry
• Basis = 6-311G** (546bf)
• Ncorrocc = 68
[Figures: scaling plots; axis ticks at 32, 256 and 512 processors]
DMMP+OH
• H10C3O4P, C1
• 208 basis functions
• 75 electrons
• Number of processors = 64
• Time CCSD = 69 minutes (3.8 min/iter)
• Time (T) = 111 minutes ***
*** 7 dual jobs
Systematic set of Benchmarks
• Why? To remove confusion over technological versus algorithmic advances.
• Allow users to make informed choices.
• Provide a set of calculations to evaluate each program so that strengths and weaknesses become evident.
• Remove ambiguity in the literature.
ArN Cluster Benchmarks (Performance)

Specifications (Mine!)
• N = 6
• UHF
• C1 symmetry
• Basis = aug-cc-pvtz (300bf)
• Ncorrocc = 54
• R = 5 bohr

Methods
• MBPT(2) gradient
• CCSD gradient
• CCSD(T) (core dropped)
• MBPT(2) Hessian (RHF)
[Figures: ArN cluster benchmark scaling plots; axis ticks at 32 and 256 processors]
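MBPT(2) Hessian
• Perturbations p, q: $\frac{d}{dq}\left[\frac{d}{dp}\left(\frac{V_{ab}^{ij} V_{ab}^{ij}}{D_{ab}^{ij}}\right)\right]$
• Term types: $\frac{dV}{dp}\cdot\frac{dV}{dq}$ and $V\cdot\frac{d^2V}{dp\,dq}$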
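A short worked step shows where the two term types come from. As a simplifying assumption made here only for illustration (the slide does not state it), the denominator D is treated as independent of the perturbations p and q, and the orbital indices are dropped. Applying the product rule twice then gives:

```latex
\frac{d}{dp}\left(\frac{V\,V}{D}\right) = \frac{2}{D}\,V\,\frac{dV}{dp},
\qquad
\frac{d}{dq}\left[\frac{d}{dp}\left(\frac{V\,V}{D}\right)\right]
  = \frac{2}{D}\left(\frac{dV}{dp}\cdot\frac{dV}{dq} + V\cdot\frac{d^{2}V}{dp\,dq}\right)
```

The first term needs only first derivatives of V with respect to each perturbation, while the second needs the mixed second derivative for every pair (p, q).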
Details of calculation
• Number of basis functions = 300
• Number of correlated occupied = 54
• Number of Hessian elements = 324/2
• Number of processors = 128
• RHF reference
Results
• Terms: V*d2V/dpdq, dV/dp, dV/dq, dV/dp*dV/dq
• Timings: T = 381 minutes; 155 sec / pert p; 330 sec / pert q; 16 sec / element
Observations
• Ideally suited for dual layer parallelization, with the 'dual' layer running over the perturbations.
• The dual layer strategy is not optimal in operation count, since some computation must be repeated, but it has many advantages: restart capability, real time of calculation, queuing, data storage.
Conclusions
• ACESIII provides an ideal parallel environment in which to implement computationally intense methods.
• MBPT(2) gradient achieved over 90% scaling until the work was exhausted.
• CCSD achieved better than ideal scaling up to 512 processors (32 as reference), indicating that an optimal range of processors exists for each computation.
• CCSD(T) perturbative triples can be computed quite effectively using a dual layer parallelization strategy, so that in practice (T) and CCSD take comparable time to compute.
• CCSD gradients (Ar6) exhibit ideal scaling from 32 to 256 processors.
Conclusions
• MBPT(2) Hessians (and others also) benefit from dual layer parallelism, but care must be taken to segment the work optimally.
• A set of benchmark calculations would be very valuable to the quantum chemistry community to remove ambiguities among various programs.
• ACESIII has been successfully ported to the following systems: IBM SP4 and SP5, ALTIX, Linux cluster, Opteron cluster. It is also available on many DOD machines.
• ACESIII benefits from 'many' processors, indicating potential in the massively parallel regime.
• The flexibility offered by the ACESIII environment allows for rapid tuning and implementation of codes.
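The scaling statements in the conclusions (for example "better than ideal scaling" with 32 processors as reference) can be made concrete with a small Python sketch. The function name `scaling_efficiency` is hypothetical, and the timings below are invented for illustration; they are not measured ACESIII results.

```python
def scaling_efficiency(t_ref, p_ref, t, p):
    """Parallel efficiency relative to a reference run; 1.0 means ideal scaling."""
    ideal_time = t_ref * p_ref / p  # time expected if work divided perfectly
    return ideal_time / t

# Hypothetical timings in minutes (NOT benchmark data), 32 processors as reference.
ideal = scaling_efficiency(t_ref=80.0, p_ref=32, t=10.0, p=256)
better = scaling_efficiency(t_ref=80.0, p_ref=32, t=8.0, p=256)
print(ideal)   # equals 1.0: the run is exactly as fast as ideal scaling predicts
print(better)  # above 1.0: "better than ideal" relative to the reference
```

An efficiency above 1.0 says the reference run was itself below its best, which fits the remark that an optimal range of processors exists for each computation.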