Performance Characteristics of a Cosmology Package on Leading HPC Architectures
Leonid Oliker http://crd.lbl.gov/~oliker
Julian Borrill, Jonathan Carter
Lawrence Berkeley National Laboratory
Overview
• Superscalar cache-based architectures dominate the HPC market
• Leading architectures are commodity-based SMPs due to generality and perceived cost effectiveness
• The growing gap between peak & sustained performance is well known in scientific computing
• Modern parallel vector systems may bridge this gap for many important applications
• In April 2002, the Earth Simulator (ES) became operational:
  Peak ES performance > all DOE and DOD systems combined
  Demonstrated high sustained performance on demanding scientific apps
• Conducting an evaluation study of scientific applications on modern vector systems
• 09/2003: MOU between ES and NERSC was completed
  First visit to ES center: Dec 2003; second visit: Oct 2004 (no remote access)
  First international team to conduct a performance evaluation study at ES
• Examining the best mapping between demanding applications and leading HPC systems - one size does not fit all
Vector Paradigm
• High memory bandwidth
  Allows systems to effectively feed the ALUs (high byte-to-flop ratio)
• Flexible memory addressing modes
  Support fine-grained strided and irregular data access
• Vector registers
  Hide memory latency via deep pipelining of memory loads/stores
• Vector ISA
  A single instruction specifies a large number of identical operations
• Vector architectures allow for:
  Reduced control complexity
  Efficient utilization of a large number of computational resources
  Potential for automatic discovery of parallelism
• However: most effective only if sufficient regularity is discoverable in the program structure
  Performance suffers even if a small % of the code is non-vectorizable (Amdahl's Law), as the worked example below shows
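A quick worked example of that last point (a sketch only, not a measurement from this study; the vectorizable fraction f and the vector/scalar speed ratio V are illustrative numbers):

\[
S \;=\; \frac{1}{(1-f) + f/V},\qquad
f = 0.95,\; V = 10 \;\Rightarrow\; S = \frac{1}{0.05 + 0.095} \approx 6.9
\]

Even with 95% of the work vectorized, a 10X-faster vector unit delivers under 7X overall speedup.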
Architectural Comparison
• Custom vector architectures have
  High memory bandwidth relative to peak
  Superior interconnect: latency, point-to-point, and bisection bandwidth
• Another key balance point is I/O performance:
  Seaborg I/O: 16 GPFS servers, each w/ 32 GB main memory (for caching & metadata)
  I/O uses the switch fabric, sharing bandwidth with message-passing traffic
  ES I/O: Each group of 16 nodes has a pool of RAID disks attached via a Fibre Channel switch (each node has a separate filesystem)
Previous ES visit
• Tremendous potential of vector architectures: 4 codes running faster than ever before
• Vector systems allow resolution not possible with scalar architectures (regardless of # procs)
• Opportunity to perform scientific runs at unprecedented scale
• Evaluation codes contain sufficient regularity in computation for high vector performance
• However, none of the tested codes contained significant I/O requirements
The Cosmic Microwave Background
• The CMB is a snapshot of the Universe when it first became neutral, 400,000 years after the Big Bang.
• After the Big Bang, the expansion of space cooled the Universe sufficiently for charged electrons and protons to combine into neutral atoms.
• Cosmic - primordial photons filling all of space.
• Microwave - redshifted by the expansion of the Universe from 3000 K to 3 K.
• Background - coming from "behind" all astrophysical sources.
CMB Science
• The CMB is a unique probe of the very early Universe.
• Tiny fluctuations in its temperature & polarization encode
  - the fundamental parameters of cosmology: Universe geometry, expansion rate, number of neutrino species, ionization history, dark matter, cosmological constant
  - ultra-high energy physics beyond the Standard Model
CMB Data Analysis
• CMB analysis moves
  from the time domain - observations - O(10^12)
  to the pixel domain - maps - O(10^8)
  to the multipole domain - power spectra - O(10^4)
  calculating the compressed data and their reduced error bars at each step.
MADCAP: Performance
• Porting: ScaLAPACK plus a rewrite of the Legendre polynomial recursion, so that large batches are computed in the inner loop
• Original ES visit: only partially ported due to the code's requirement of a global file system
  Could not meet minimum parallelization and vectorization thresholds for ES
• All systems sustain a relatively low % of peak considering MADCAP's BLAS3 operations
• Detailed analysis presented at HiPC 2004
• Further work performed for MADbench to reduce I/O, remove system calls, and remove global file system requirements
• New results collected during the recent ES visit in October 2004
IPM Overview
Integrated Performance Monitoring
• portable, lightweight, scalable profiling
• fast hash method
• profiles MPI topology
• profiles code regions
• open source

Sample output:
###############################################
# IPMv0.7 :: csnode041  256 tasks  ES/ESOS
# madbench.x (completed)  10/27/04/14:45:56
#
#            <mpi>     <user>    <wall>  (sec)
#            171.67    352.16    393.80
# ...
###############################################
# W
#            <mpi>     <user>    <wall>  (sec)
#            36.40     198.00    198.36
#
# call          [time]      %mpi   %wall
# MPI_Reduce    2.395e+01   65.8   6.1
# MPI_Recv      9.625e+00   26.4   2.4
# MPI_Send      2.708e+00   7.4    0.7
# MPI_Testall   7.310e-02   0.2    0.0
# MPI_Isend     2.597e-02   0.1    0.0
###############################################

Region markup in the source code (a minimal self-contained example follows below):
...
MPI_Pcontrol(1, "W");
...code...
MPI_Pcontrol(-1, "W");
...
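For context, a minimal self-contained sketch of the region markup shown above. This is only an illustration of the generic MPI_Pcontrol convention that IPM intercepts; the w_step routine is a placeholder, not MADbench code.

#include <mpi.h>

static void w_step(void) { /* placeholder for the computation being profiled */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Pcontrol(1, "W");     /* open the named region: IPM starts timing "W" */
    w_step();                 /* ...code... */
    MPI_Pcontrol(-1, "W");    /* close the region: IPM attributes MPI/wall time to "W" */

    MPI_Finalize();
    return 0;
}

No recompilation of the application against a profiling API is needed; IPM sits in the MPI profiling layer and picks up the MPI_Pcontrol calls at link/run time.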
MADbench
• Is a lightweight version of the MADCAP maximum likelihood CMB power spectrum estimation code.
• Retains the operational complexity & integrated system requirements of the full science code.
• Has three basic steps: dSdC, invD & W.
• Out-of-core calculation: holds approx 3 of the 50 matrices in memory.
• Is used for
  - computer & file-system procurements.
  - realistic scientific code benchmarking and optimization.
  - architectural comparisons.
dSdC
• This step generates a set of Nb dense, symmetric Np x Np signal correlation derivative matrices dSdCb by Legendre polynomial recursion (see the sketch below).
• Each matrix is block-cyclic distributed over the 2D processor array with blocksize B.
• As each matrix is calculated, each processor writes its subset of the matrix elements to a unique file.
• No inter-processor communication is required.
• Flops: O(Np^2)    Disk: 8 Nb Np^2 bytes (primarily writing)
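A minimal sketch of the batched Legendre recursion idea behind dSdC, in the spirit of the vectorization rewrite mentioned earlier: the recursion over multipole l is the short outer loop, while the long unit-stride inner loop over sample points vectorizes well. Array names, the flat layout of P, and the absence of the block-cyclic distribution and file writes are simplifications; this is not MADCAP's actual code.

#include <stddef.h>

/* Compute P_0..P_lmax at n points x[] (e.g. cosines of pixel-pair separation
   angles); row l of P is stored at P + l*n.  Assumes lmax >= 1 and P holds
   (lmax+1)*n doubles. */
void legendre_batch(int lmax, size_t n, const double *x, double *P)
{
    for (size_t i = 0; i < n; i++) {
        P[i]     = 1.0;      /* P_0(x) = 1 */
        P[n + i] = x[i];     /* P_1(x) = x */
    }
    for (int l = 1; l < lmax; l++) {
        const double *pl   = P + (size_t)l * n;
        const double *plm1 = P + (size_t)(l - 1) * n;
        double       *plp1 = P + (size_t)(l + 1) * n;
        double a = (2.0 * l + 1.0) / (l + 1.0);
        double b = (double)l / (l + 1.0);
        /* long, unit-stride inner loop: the vectorizable part */
        for (size_t i = 0; i < n; i++)
            plp1[i] = a * x[i] * pl[i] - b * plm1[i];   /* (l+1)P_{l+1} = (2l+1)xP_l - lP_{l-1} */
    }
}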
invD
• This step generates the data correlation matrix D and inverts it (a simplified sketch of the accumulation follows below).
• The dSdCb matrices are read from disk one at a time and progressively accumulated to build the signal correlation matrix S.
• A diagonal white noise correlation matrix N is added to S to give the data correlation matrix D, which is inverted using ScaLAPACK to give D^-1.
• Each processor writes its subset of the D^-1 matrix elements to a unique file.
• Flops: O(Np^3)    Disk: 8 Nb Np^2 bytes (primarily reading)
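A simplified, single-process sketch of the accumulation step described above. It is illustrative only: the read_dSdC callback, the dense in-memory D, and the serial loops stand in for the block-cyclic, out-of-core, ScaLAPACK-based version in the real code.

#include <stddef.h>

/* Build D = sum_b dSdC_b + N, where N is diagonal white noise.
   read_dSdC is a hypothetical reader that fills buf with the Np x Np
   elements of matrix b previously written to disk by the dSdC step. */
void build_D(int Nb, int Np, double noise,
             void (*read_dSdC)(int b, double *buf),
             double *D, double *buf)
{
    size_t nelem = (size_t)Np * (size_t)Np;

    for (size_t k = 0; k < nelem; k++)
        D[k] = 0.0;

    for (int b = 0; b < Nb; b++) {
        read_dSdC(b, buf);                  /* one matrix in memory at a time */
        for (size_t k = 0; k < nelem; k++)
            D[k] += buf[k];                 /* S += dSdC_b */
    }

    for (int i = 0; i < Np; i++)
        D[(size_t)i * Np + i] += noise;     /* D = S + N (diagonal white noise) */

    /* D would then be inverted (with ScaLAPACK in the real code) to form D^-1. */
}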
W
• This step multiplies each dSdCb matrix by D^-1 to form Wb and derives a Newton-Raphson iterative step from this.
• Since they are independent, these matrix multiplications can be carried out gang-parallel across Ng gangs of processors (see the communicator-splitting sketch below).
• Each dSdCb matrix is read in by all processors and then redistributed to the target gang.
• When all gangs have been given a matrix, they all perform their multiplications simultaneously.
• Flops: O(Np^3)    Disk: 8 Nb Np^2 bytes (primarily reading)
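A sketch of one way to set up the Ng gangs with MPI. The variable names are mine, and the actual MADbench gang setup, ScaLAPACK context creation, and matrix redistribution are not shown; this only illustrates the communicator split that makes gang-parallel multiplies possible.

#include <mpi.h>

/* Split MPI_COMM_WORLD into Ng equal gangs; each gang then multiplies its
   assigned dSdC_b by D^-1 independently.  Assumes the total process count
   is a multiple of Ng. */
MPI_Comm make_gang_comm(int Ng, int *gang_id)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int gang_size = size / Ng;
    *gang_id = rank / gang_size;        /* color: which gang this rank joins */

    MPI_Comm gang;
    MPI_Comm_split(MPI_COMM_WORLD, *gang_id, rank, &gang);
    return gang;
}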
Parameters
• Np - number of pixels (matrix size).
• Nb - number of bins (matrix count).
• Ng - number of gangs of processors.
• B - ScaLAPACK blocksize.
• MODIO - I/O concurrency control (only 1 in MODIO processors do I/O simultaneously).
• Running on P processors requires:
  - 3 x 8 x Np^2 bytes of memory per gang
  - Nb x 8 x Np^2 bytes & Nb x P inodes of disk
  - Nb a multiple of Ng to load-balance the gangs.
• B & MODIO are architecture-specific optimizations.
(A worked sizing example follows below.)
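A worked example of the sizing rules above, using illustrative values rather than a configuration from this study; with 8-byte double-precision elements:

\[
N_p = 10^4,\; N_b = 16:\qquad
\underbrace{3 \times 8\,N_p^2}_{\text{memory per gang}} = 2.4\ \mathrm{GB},\qquad
\underbrace{N_b \times 8\,N_p^2}_{\text{disk}} = 12.8\ \mathrm{GB},\qquad
N_b \times P \ \text{inodes}.
\]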
dSdC performance
• ES shows constant I/O performance (independent disks)
• Significantly faster computation (30X) due to high memory bandwidth
• Overall only 2.6X faster than Power3 due to I/O overhead
• Power3 has faster write I/O until GPFS contention at P=1024
invD performance
• I/O remains relatively constant, while MPI overhead and computation grow
• Seaborg I/O reads faster than ES
• Overall ES is only 2.3X faster
W performance
• Multi-gang runs significantly reduce MPI overhead (4.8X on ES, 3.3X on Seaborg)
• MPI and CALC grow with the number of processors
• I/O is a trivial part of the W calculation
• Overall ES is 7X faster
Performance overview
• Overall ES is 5.6X faster & sustains a slightly higher % of peak compared w/ Seaborg at P=1024
• At P=256 Seaborg shows a higher % of peak, due to its ratio of I/O to peak flop performance
• Although the I/O cost remains relatively high, both systems achieve over 50% of peak
Overview
• The new version of MADbench successfully reduced I/O overhead and removed global file system requirements
• Allowed ES runs on up to 1024 processors, achieving over 50% of peak
  Compared with only 23% of peak on 64 processors from the first visit
• Results show that I/O has a greater effect on ES than on Seaborg - due to the ratio between I/O performance and peak ALU speed
• Demonstrated IPM's ability to measure MPI overhead on a variety of architectures without the need to recompile, at trivial runtime overhead (1-2%)
• Continuing the study of the complex interplay between architecture, interconnect, and I/O
• Currently performing experiments on Columbia and Phoenix
• MADbench and IPM are being prepared for public distribution
• Future CMB analysis will require sparse methods due to the size of the data sets - potentially at odds with vector architectures