Workshop on Parallelization of Coupled-Cluster Methods
Panel 1: Parallel efficiency. An incomplete list of thoughts
Bert de Jong
High Performance Software Development
Molecular Science Computing Facility
Overall hardware issues
• Computer power per node has increased
  • Single-CPU performance gains have flattened out (but you never know!)
  • Multiple cores together tax the other hardware resources in a node
• Bandwidth and latency for the other major hardware resources are far behind, affecting the flops we actually use
  • Memory: very difficult to feed the CPU, and multiple cores further reduce the bandwidth available to each (see the back-of-the-envelope sketch below)
  • Network: data access is considerably slower than memory; the speed of light is our enemy
  • Disk input/output: the slowest of them all; disks spin only so fast
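To illustrate why feeding the CPU is hard, here is a back-of-the-envelope arithmetic-intensity estimate in Python. All of the peak-flop, bandwidth, and core-count numbers are assumed values chosen only to show the reasoning, not measurements of any particular machine.

```python
# Back-of-the-envelope roofline-style estimate: how many flops must a kernel
# perform per byte of memory traffic before the CPU, rather than memory
# bandwidth, becomes the bottleneck?  All numbers are assumed, for illustration.

peak_gflops_per_core = 6.0      # assumed peak of one core, GFlop/s
mem_bandwidth_gbs    = 6.4      # assumed memory bandwidth of the node, GB/s
cores_per_node       = 8        # assumed number of cores sharing that bandwidth

# Arithmetic intensity (flops per byte) needed to keep one core busy when it
# has the whole memory bus to itself, and when it shares it with its siblings.
ai_single = peak_gflops_per_core / mem_bandwidth_gbs
ai_shared = peak_gflops_per_core / (mem_bandwidth_gbs / cores_per_node)

print(f"flops/byte needed, 1 core on the bus      : {ai_single:.1f}")
print(f"flops/byte needed, {cores_per_node} cores sharing the bus: {ai_shared:.1f}")

# A streaming kernel does O(1) flop per 8-byte double, so it is bandwidth-bound
# either way; only kernels with high data reuse (e.g. DGEMM, which does O(n)
# flops per element loaded) can approach peak.
```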
Dealing with memory
• Amounts of data needed in coupled cluster can be huge
  • Amplitudes: too large to store on a single node (except for T1); shared memory would be good, but will shared memory of 100s of terabytes be feasible and accessible?
  • Integrals: recompute vs. store (on disk or in memory); can we avoid access to memory when recomputing?
• Coupled cluster has one advantage: it can easily be formulated as matrix multiplication (see the sketch below)
  • Can be very efficient: DGEMM on EMSL's 1.5 GHz Itanium-2 system reached over 95% of peak efficiency
  • As long as we can get all the needed data in memory!
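To make the "coupled cluster as matrix multiplication" point concrete, here is a minimal NumPy sketch that casts one representative ladder-type contraction, r_ab^ij = sum_cd <ab|cd> t_cd^ij, as a single matrix multiply (a DGEMM in a real code). The dimensions and random tensors are placeholders, not data from any actual calculation.

```python
import numpy as np

# Illustrative dimensions only: nv virtual and no occupied orbitals.
nv, no = 20, 8

# Placeholder data standing in for the <ab|cd> integral block and the doubles
# amplitudes t_{cd}^{ij}; a real code would fetch these from (distributed)
# memory or recompute the integrals on the fly.
v_abcd = np.random.rand(nv, nv, nv, nv)
t_cdij = np.random.rand(nv, nv, no, no)

# Term: r_{ab}^{ij} = sum_{cd} <ab|cd> t_{cd}^{ij}
# Flatten (ab) and (cd) into compound indices so the contraction becomes one
# big matrix-matrix multiply, which BLAS DGEMM can execute near peak.
V = v_abcd.reshape(nv * nv, nv * nv)        # rows (ab), columns (cd)
T = t_cdij.reshape(nv * nv, no * no)        # rows (cd), columns (ij)
r_dgemm = (V @ T).reshape(nv, nv, no, no)   # back to r_{ab}^{ij}

# Cross-check against a direct einsum of the same term.
r_ref = np.einsum('abcd,cdij->abij', v_abcd, t_cdij)
assert np.allclose(r_dgemm, r_ref)
```

The same reshaping trick applies to the other contractions in the CC equations; the practical question raised on the slide is whether the V and T blocks fit in memory at once.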
Dealing with networks
• With 10s of terabytes of data on distributed-memory systems, getting data from remote nodes is inevitable
• That can be fine, as long as you can hide the communication behind computation
  • Fetch data while computing = one-sided communication (see the prefetch sketch below)
  • NWChem uses Global Arrays to accomplish this
• Issues are
  • Low bandwidth and high latency relative to increasing node speed
  • Non-uniform networks: cabling a full fat tree can be cost prohibitive, and the network topology has an effect
  • Fault resiliency of the network
  • Multiple cores need to compete for a limited number of busses
  • Data contention increases with increasing node count
• Data locality, data locality, data locality
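NWChem realizes this get-ahead pattern through Global Arrays; as a rough illustration only, the sketch below shows the same idea with plain MPI one-sided communication via mpi4py. The tile size, data layout, and compute kernel are placeholder assumptions, and a real Global Arrays code would look different.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()

TILE = 512 * 512                    # placeholder tile size (doubles)
TILES_PER_RANK = 4                  # placeholder number of tiles owned per rank
NTILES = TILES_PER_RANK * nproc

# Each rank exposes its locally owned tiles through an RMA window.
local = np.random.rand(TILES_PER_RANK * TILE)
win = MPI.Win.Create(local, disp_unit=local.itemsize, comm=comm)
win.Lock_all()                      # passive-target epoch covering all Gets

buffers = [np.empty(TILE), np.empty(TILE)]   # double buffering

def start_fetch(buf, tile_id):
    """Start a one-sided Get of one remote tile (completes at the next flush)."""
    owner, offset = tile_id % nproc, tile_id // nproc
    win.Get(buf, owner, target=(offset * TILE, TILE, MPI.DOUBLE))

def compute(buf):
    """Placeholder for the real work, e.g. a DGEMM on the fetched tile."""
    return buf.sum()

work = list(range(rank, NTILES, nproc))      # this rank's share of the tiles
start_fetch(buffers[0], work[0])
win.Flush_all()                              # first tile must arrive before we start
for n, tile_id in enumerate(work):
    cur, nxt = buffers[n % 2], buffers[(n + 1) % 2]
    if n + 1 < len(work):
        start_fetch(nxt, work[n + 1])        # begin fetching the next tile ...
    compute(cur)                             # ... while computing on the current one
    win.Flush_all()                          # the prefetched tile is now complete

win.Unlock_all()
win.Free()
```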
Dealing with spinning disks
• Using local disk
  • Will only contain data needed by its own node
  • Can be fast enough if you put a large number of spindles behind it
  • And, again, if you can hide the I/O behind computation (prefetch; see the sketch below)
  • With 100,000s of disks, the chance of failure becomes significant, so fault tolerance of the computation becomes an issue
• Using globally shared disk
  • Crucial when going to very large systems
  • Allows for large files shared by large numbers of nodes; Lustre file systems of petabytes are possible
  • Speed is limited by the number of access points (hosts): a large number of reads and writes must be handled by a small number of hosts, creating lock and access contention
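One way to hide local-disk latency behind computation is to prefetch the next block on a background I/O thread while the current block is being processed. The sketch below uses Python's standard concurrent.futures for that overlap; the file layout, block size, and process() step are illustrative assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1_000_000  # placeholder block size, in doubles

def read_block(path, index):
    """Read one block of doubles from a local scratch file (placeholder layout)."""
    with open(path, 'rb') as f:
        f.seek(index * BLOCK * 8)
        return np.fromfile(f, dtype=np.float64, count=BLOCK)

def process(block):
    """Placeholder for the real work done on each block."""
    return block.sum()

def process_file(path, nblocks):
    """Overlap disk reads with computation using one background I/O thread."""
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as io:
        future = io.submit(read_block, path, 0)          # prefetch the first block
        for i in range(nblocks):
            block = future.result()                      # wait for the prefetched block
            if i + 1 < nblocks:
                future = io.submit(read_block, path, i + 1)  # prefetch the next ...
            total += process(block)                      # ... while computing on this one
    return total
```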
What about beyond 1 petaflop?
• Possibly 100,000s of multicore nodes
  • How does one create a fat enough network between that many nodes?
• Possibly 32, 64, 128 or more cores per node
  • All cores simply cannot do the same thing anymore: not enough memory bandwidth, not enough network bandwidth
  • Heterogeneous computing within a node (CPU+GPU)
  • Designate nodes for certain tasks: communication; memory access, put and get; recomputing integrals, hopefully using cache only; DGEMM operations
• Task scheduling will become an issue (see the dynamic-scheduling sketch below)
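Dynamic task scheduling is one likely answer at this scale. A common ingredient is a shared "next task" counter that ranks increment atomically, similar in spirit to the NXTVAL counter in Global Arrays; the sketch below implements such a counter with MPI one-sided fetch-and-op via mpi4py. The task body and the task count are placeholders.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

NTASKS = 1000                      # placeholder number of work units

# Rank 0 hosts a single shared counter; other ranks attach a zero-size window.
counter = np.zeros(1, dtype='i8') if rank == 0 else np.empty(0, dtype='i8')
win = MPI.Win.Create(counter, comm=comm)

def next_task():
    """Atomically fetch-and-increment the global task counter hosted on rank 0."""
    incr = np.ones(1, dtype='i8')
    result = np.empty(1, dtype='i8')
    win.Lock(0)
    win.Fetch_and_op(incr, result, 0, 0, MPI.SUM)
    win.Unlock(0)
    return int(result[0])

def run_task(task_id):
    """Placeholder: recompute an integral block, run a DGEMM tile, etc."""
    pass

task = next_task()
while task < NTASKS:               # each rank pulls new work as soon as it is idle
    run_task(task)
    task = next_task()

win.Free()
```

Because idle ranks simply grab the next counter value, load imbalance between cheap and expensive tasks evens out automatically, at the cost of contention on the single counter host.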
WR Wiley Environmental Molecular Sciences Laboratory
A national scientific user facility integrating experimental and computational resources for discovery and technological innovation