Tools for High Performance Scientific Computing • Kathy Yelick, U.C. Berkeley • http://www.cs.berkeley.edu/~yelick/
HPC Problems and Approaches • Parallel machines are too hard to program • Users “left behind” with each new major generation • Efficiency is too low • Even after a large programming effort • Single digit efficiency numbers are common • Approach • Titanium: A modern (Java-based) language that provides performance transparency • Sparsity: Self-tuning scientific kernels • IRAM: Integrated processor-in-memory
Titanium: A Global Address Space Language Based on Java • Faculty • Susan Graham • Paul Hilfinger • Katherine Yelick • Alex Aiken • LBNL collaborators • Phillip Colella • Peter McCorquodale • Mike Welcome • Students • Dan Bonachea • Szu-Huey Chang • Carrie Fei • Ben Liblit • Robert Lin • Geoff Pike • Jimmy Su • Ellen Tsai • Mike Welcome (LBNL) • Siu Man Yau http://titanium.cs.berkeley.edu/
Global Address Space Programming • Intermediate point between message passing and shared memory • Program consists of a collection of processes. • Fixed at program startup time, like MPI • Local and shared data, as in shared memory model • But, shared data is partitioned over local processes • Remote data stays remote on distributed memory machines • Processes communicate by reads/writes to shared variables • Note: These are not data-parallel languages • Examples are UPC, Titanium, CAF, Split-C • E.g., http://upc.nersc.gov
Titanium Overview Object-oriented language based on Java with: • Scalable parallelism • SPMD model with global address space • Multidimensional arrays • points and index sets as first-class values • Immutable classes • user-definable non-reference types for performance • Operator overloading • by demand from our user community • Semi-automated memory management • uses memory regions for high performance
SciMark Benchmark • Numerical benchmark for Java, C/C++ • Five kernels: • FFT (complex, 1D) • Successive Over-Relaxation (SOR) • Monte Carlo integration (MC) • Sparse matrix multiply • Dense LU factorization • Results are reported in Mflops • Download and run on your machine from: • http://math.nist.gov/scimark2 • C and Java sources also provided Roldan Pozo, NIST, http://math.nist.gov/~Rpozo
SciMark: Java vs. C (Sun UltraSPARC 60) * Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O; SunOS 5.7 Roldan Pozo, NIST, http://math.nist.gov/~Rpozo
Can we do better without the JVM? • Pure Java with a JVM (and JIT) • Within 2x of C and sometimes better • OK for many users, even those using high end machines • Depends on quality of both compilers • We can try to do better using a traditional compilation model • E.g., Titanium compiler at Berkeley • Compiles Java extension to C • Does not optimize Java arrays or for loops (prototype)
Language Support for Performance • Multidimensional arrays • Contiguous storage • Support for sub-array operations without copying • Support for small objects • E.g., complex numbers • Called “immutables” in Titanium • Sometimes called “value” classes • Unordered loop construct • Programmer specifies that iterations are independent • Eliminates the need for dependence analysis – a short-term solution? Also used by vectorizing compilers.
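To make the first two bullets concrete, here is a plain-Java sketch; Titanium's own keywords (such as immutable and its unordered loop construct) are only described in the comments, since their exact syntax is not shown on this slide:

    // Plain-Java sketch of the features above. In Titanium, an 'immutable class'
    // would make Complex an unboxed value type (no per-object header or pointer
    // chasing), and the unordered loop construct would let the programmer assert
    // that iterations are independent instead of relying on dependence analysis.
    final class Complex {
        final double re, im;                          // small, fixed-size state
        Complex(double re, double im) { this.re = re; this.im = im; }
        Complex plus(Complex o) { return new Complex(re + o.re, im + o.im); }
        Complex times(Complex o) {                    // written c1 * c2 with operator
            return new Complex(re * o.re - im * o.im, // overloading in Titanium
                               re * o.im + im * o.re);
        }
    }

    class AxpyExample {
        // Each iteration touches only element i, so the loop is order-independent;
        // in Titanium this fact could be stated directly with the unordered loop.
        static void axpy(double a, double[] x, double[] y) {
            for (int i = 0; i < x.length; i++) {
                y[i] = a * x[i] + y[i];
            }
        }
    }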
Optimizing Parallel Code • Compiler writers would like to move code around • The hardware folks also want to build hardware that dynamically reorders operations • When is reordering correct? • Because the programs are parallel, there are more restrictions, not fewer • The reason: we must preserve the semantics of what other processors may observe
Sequential Consistency • Given a set of executions from n processors, each processor's execution defines a total order Pi • The program order is the partial order given by the union of these Pi's • The overall execution is sequentially consistent if there exists a correct total order that is consistent with the program order • (Figure: a set of reads and writes from two processors — write x = 1, write y = 3, read y → 0, read z → 2, read x → 1, read y → 3; when these are serialized, the read and write semantics must be preserved)
Use of Memory Fences • Memory fences can turn a weak memory model into sequential consistency under proper synchronization: • Add a read fence to the acquire-lock operation • Add a write fence to the release-lock operation • In general, a language can have a stronger model than the machine it runs on if the compiler is clever • The language may also have a weaker model, if the compiler does any optimizations
Compiler Analysis Overview • When compiling sequential programs, compute dependencies: reordering
    x = expr1; y = expr2;
into
    y = expr2; x = expr1;
is valid (roughly) if y does not appear in expr1 and x does not appear in expr2 • When compiling parallel code, we also need to consider accesses by other processors. Example (initially flag = data = 0):
    Proc A:  data = 1;              Proc B:  while (flag == 0);
             flag = 1;                       ... = ...data...;
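For concreteness, a runnable Java rendering of the Proc A / Proc B example; the volatile qualifier stands in for the fences discussed above, and removing it would again permit the problematic reorderings:

    // Java version of the Proc A / Proc B example. Marking the fields 'volatile'
    // forbids the reorderings discussed above (it inserts the necessary memory
    // fences); without it, the reader may legally observe flag == 1 but data == 0.
    class FlagData {
        static volatile int data = 0;
        static volatile int flag = 0;

        static void writer() {          // Proc A
            data = 1;
            flag = 1;
        }

        static void reader() {          // Proc B
            while (flag == 0) { /* spin */ }
            System.out.println("data = " + data);   // prints 1 when fences are respected
        }

        public static void main(String[] args) throws InterruptedException {
            Thread b = new Thread(FlagData::reader);
            b.start();
            writer();
            b.join();
        }
    }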
Cycle Detection • Processors define a “program order” on accesses from the same thread; P is the union of these total orders • The memory system defines an “access order” on accesses to the same variable; A is the set of access-order edges (read/write and write/write pairs) • A violation of sequential consistency is a cycle in P ∪ A [Shasha & Snir] • (Figure: in the flag/data example above, the accesses write data, write flag, read flag, read data form such a cycle)
Cycle Analysis Intuition • The definition is based on the execution model, which lets you answer the question: was this execution sequentially consistent? • Intuition: • Time cannot flow backwards • Need to be able to construct a total order • Examples (all variables initially 0; figure): with one processor writing data = 1 and then flag = 1, a second processor that reads data → 1 or reads flag → 0 can be serialized, but one that reads flag → 1 and then data → 0 cannot — no total order exists
Cycle Detection Generalization • Generalizes to arbitrary numbers of variables and processors • Cycles may be arbitrarily long, but it is sufficient to consider only minimal cycles with 1 or 2 consecutive stops per processor • Can simplify the analysis by assuming all processors run a copy of the same code • (Figure: example of a minimal cycle over writes and reads of x and y across processors)
Static Analysis for Cycle Detection • Approximate P by the control flow graph • Approximate A by undirected “conflict” edges • A bidirectional edge connects accesses to the same variable in which at least one is a write • The analysis is still correct if the conflict edge set is a superset of the real one • Let the “delay set” D be all edges from P that are part of a minimal cycle • The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules for serial code) • (Figure: a control flow fragment over reads and writes of x, y, and z with its conflict edges)
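A small sketch of how such a delay set might be computed (illustrative only, not the Split-C or Titanium implementation): represent P as directed edges and conflicts as undirected edges, then mark a P edge (u, v) as a delay edge when v can reach u in the mixed graph, i.e., the edge lies on some cycle in P ∪ A. The restriction to minimal cycles is ignored here, which only makes the computed delay set larger (conservative but still correct).

    import java.util.*;

    // Conservative delay-set sketch: a program-order edge u -> v must be preserved
    // if it lies on a cycle in P (program order) union A (conflict edges).
    class DelaySet {
        static Set<int[]> delaySet(int n, List<int[]> progEdges, List<int[]> conflictEdges) {
            // Build the mixed graph: P edges are directed, conflict edges go both ways.
            List<List<Integer>> adj = new ArrayList<>();
            for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
            for (int[] e : progEdges) adj.get(e[0]).add(e[1]);
            for (int[] e : conflictEdges) { adj.get(e[0]).add(e[1]); adj.get(e[1]).add(e[0]); }

            Set<int[]> delays = new LinkedHashSet<>();
            for (int[] e : progEdges) {
                if (reaches(adj, e[1], e[0])) delays.add(e);   // a path v ... u closes a cycle
            }
            return delays;
        }

        static boolean reaches(List<List<Integer>> adj, int from, int to) {
            boolean[] seen = new boolean[adj.size()];
            Deque<Integer> stack = new ArrayDeque<>(List.of(from));
            while (!stack.isEmpty()) {
                int u = stack.pop();
                if (u == to) return true;
                if (seen[u]) continue;
                seen[u] = true;
                for (int w : adj.get(u)) if (!seen[w]) stack.push(w);
            }
            return false;
        }

        public static void main(String[] args) {
            // Accesses: 0 = write data (A), 1 = write flag (A), 2 = read flag (B), 3 = read data (B)
            List<int[]> P = List.of(new int[]{0, 1}, new int[]{2, 3});
            List<int[]> A = List.of(new int[]{0, 3}, new int[]{1, 2});   // same-variable conflicts
            for (int[] e : delaySet(4, P, A))
                System.out.println("delay edge: " + e[0] + " -> " + e[1]);  // both P edges are delays
        }
    }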
Cycle Detection in Practice • Cycle detection was implemented in a prototype version of the Split-C and Titanium compilers. • Split-C version used many simplifying assumptions. • Titanium version had too many conflict edges. • What is needed to make it practical? • Finding possibly-concurrent program blocks • Use SPMD model rather than threads to simplify • Or apply data race detection work for Java threads • Compute conflict edges • Need good alias analysis • Reduce size by separating shared/private variables • Synchronization analysis
Communication Optimizations • Data on an old machine, the UCB NOW, using a simple subset of C • (Figure: results reported as normalized time)
Global Address Space • To run shared memory programs on distributed memory hardware, we replace references (pointers) by global ones: • May point to remote data • Useful in building large, complex data structures • Easy to port shared-memory programs (functionality is correct) • Uniform programming model across machines • Especially true for cluster of SMPs • Usual implementation • Each reference contains: • Processor id (or process id on cluster of SMPs) • And a memory address on that processor
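A hedged sketch of the “usual implementation” above (all names here are illustrative, not Titanium's or any communication layer's real API): a global reference pairs a processor id with an address, and each dereference checks whether the target is local before falling back to communication.

    // Illustrative sketch of a global reference: (processor id, local address).
    // remoteGet is a placeholder for whatever communication layer does the
    // one-sided read; it is not a real API call.
    class GlobalRef {
        final int proc;        // owning processor (or process on a cluster of SMPs)
        final long addr;       // address within that processor's memory

        GlobalRef(int proc, long addr) { this.proc = proc; this.addr = addr; }

        double read(int myProc, double[] localHeap) {
            if (proc == myProc) {
                return localHeap[(int) addr];     // local: just a (checked) load
            } else {
                return remoteGet(proc, addr);     // remote: becomes a message of some kind
            }
        }

        static double remoteGet(int proc, long addr) {
            throw new UnsupportedOperationException("placeholder for the communication layer");
        }
    }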
Use of Global / Local • Global pointers are more expensive than local ones • When data is remote, a dereference turns into a remote read or write, i.e., a message of some kind • When the data is not remote, there is still overhead: • space (processor number + memory address) • dereference time (check to see if local) • Conclusion: not all references should be global — use normal references when possible • Titanium adds a “local” qualifier to the language
Local Pointer Analysis • Compiler can infer locals using Local Qualification Inference • Data structures must be well partitioned
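A toy sketch of the inference idea (not the actual Local Qualification Inference algorithm, which must handle fields, method calls, and aliasing): assume every pointer may be local, then repeatedly demote variables that can receive a global value until a fixed point is reached.

    import java.util.*;

    // Toy fixed-point sketch of local qualification inference: variables start as
    // candidates for 'local' and are demoted to 'global' whenever an assignment
    // may give them a global value.
    class LocalInference {
        static Set<String> inferLocals(Set<String> vars,
                                       Set<String> knownGlobal,       // e.g., results of 'broadcast'
                                       List<String[]> assignments) {  // pairs {lhs, rhs}
            Set<String> global = new HashSet<>(knownGlobal);
            boolean changed = true;
            while (changed) {                       // iterate to a fixed point
                changed = false;
                for (String[] a : assignments) {
                    if (global.contains(a[1]) && global.add(a[0])) changed = true;
                }
            }
            Set<String> locals = new HashSet<>(vars);
            locals.removeAll(global);
            return locals;                          // never tainted by a global value
        }

        public static void main(String[] args) {
            Set<String> vars = Set.of("lv", "gv", "p");
            Set<String> globals = Set.of("gv");                 // gv = broadcast lv from 0;
            List<String[]> assigns = new ArrayList<>();
            assigns.add(new String[]{"p", "gv"});               // p = gv;
            System.out.println(inferLocals(vars, globals, assigns));   // prints [lv]
        }
    }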
Region-Based Memory Management • Processes allocate locally • References can be passed to other processes • (Figure: each process has a LOCAL HEAP; local pointers (lv) stay within a process, global pointers (gv) may point into another process's heap) • Example (Titanium):
    class C { int val; ... }
    C gv;           // global pointer
    C local lv;     // local pointer
    if (thisProc() == 0) {
        lv = new C();
    }
    gv = broadcast lv from 0;
    gv.val = ...;
    ... = gv.val;
Parallel Applications • Genome application • Heart simulation • AMR elliptic and hyperbolic solvers • Scalable Poisson solver for infinite domains • Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join
Heart Simulation • Problem: compute blood flow in the heart • Modeled as an elastic structure in an incompressible fluid. • The “immersed boundary method” [Peskin and McQueen]. • 20 years of development in model • Many other applications: blood clotting, inner ear, paper making, embryo growth, and more • Can be used for design of prosthetics • Artificial heart valves • Cochlear implants
AMR Gas Dynamics • Developed by McCorquodale and Colella • 2D Example (3D supported) • Mach-10 shock on solid surface at oblique angle • Future: Self-gravitating gas dynamics package
Benchmarks for GAS Languages • EEL – end-to-end latency, or time spent sending a short message between two processes • BW – large-message network bandwidth • Parameters of the LogP model: • L – “latency”, or time spent on the network • During this time, the processor can be doing other work • o – “overhead”, or processor busy time on the sending or receiving side • During this time, the processor cannot be doing other work • We distinguish between “send” and “recv” overhead • g – “gap”, the rate at which messages can be pushed onto the network • P – the number of processors • This work was done with the UPC group at LBL
LogP: Overhead & Latency • Non-overlapping overhead (P0: osend, network: L, P1: orecv in sequence): EEL = osend + L + orecv • Send and recv overhead can overlap the network latency: EEL = f(osend, L, orecv)
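Worked example under assumed values (all in microseconds, purely illustrative): with osend = 2, L = 5, and orecv = 2, the non-overlapping case gives EEL = 2 + 5 + 2 = 9 µs. If the overheads can be overlapped with the network latency, one plausible form of f is max(osend, L, orecv), giving EEL = 5 µs; the max() form is an assumption for illustration only — the slide states only that EEL becomes some function of the three terms.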
Benchmarks • Designed to measure the network parameters • Also provide: gap as function of queue depth • Measured for “best case” in general • Implemented once in MPI • For portability and comparison to target specific layer • Implemented again in target specific communication layer: • LAPI • ELAN • GM • SHMEM • VIPL
Send Overhead Over Time • Overhead has not improved significantly; T3D was best • Lack of integration; lack of attention in software
Summary • Global address space languages offer alternative to MPI for large machines • Easier to use: shared data structures • Recover users left behind on shared memory? • Performance tuning still possible • Implementation • Small compiler effort given lightweight communication • Portable communication layer: GASNet • Difficulty with small message performance on IBM SP platform
Future Plans • Merge communication layer with UPC • “Unified Parallel C” has broad vendor support • Uses the same execution model as Titanium • Push vendors to expose low-overhead communication • Automated communication overlap • Analysis and refinement of cache optimizations • Additional support for unstructured grids • Conjugate gradient and particle methods are motivations • Better uniprocessor optimizations, possibly new arrays
Sparsity: Self-Tuning Scientific Kernels http://www.cs.berkeley.edu/~yelick/sparsity • Faculty • James Demmel • Katherine Yelick • Graduate Students • Rich Vuduc • Eun-Jin Im • Undergraduates • Shoaib Kamil • Rajesh Nishtala • Benjamin Lee • Hyun-Jin Moon • Atilla Gyulassy • Tuyet-Linh Phan
Context: High-Performance Libraries • Application performance dominated by a few computational kernels • Today: Kernels hand-tuned by vendor or user • Performance tuning challenges • Performance is a complicated function of kernel, architecture, compiler, and workload • Tedious and time-consuming • Successful automated approaches • Dense linear algebra: PHiPAC/ATLAS • Signal processing: FFTW/SPIRAL/UHFFT
Tuning pays off – ATLAS • Extends applicability of PHiPAC • Incorporated in Matlab (with the rest of LAPACK)
Tuning Sparse Matrix Kernels • Performance tuning issues in sparse linear algebra • Indirect, irregular memory references • High bandwidth requirements, poor instruction mix • Performance depends on architecture, kernel, and matrix • How to select data structures, implementations? at run-time? • Typical performance: < 10% machine peak • Our approach to automatic tuning: for each kernel, • Identify and generate a space of implementations • Search the space to find the fastest one (models, experiments)
Sparsity System Organization • Optimizations depend on machine and matrix structure • Choosing an optimization is expensive • (Diagram: the Sparsity machine profiler produces a Machine Profile offline; the Sparsity optimizer combines that profile with a representative matrix and the maximum number of vectors, and emits a data structure definition, generated code, and a matrix conversion routine)
Sparse Kernels and Optimizations • Kernels • Sparse matrix-vector multiply (SpMV): y = A*x • Sparse triangular solve (SpTS): x = T^-1*b • y = A^T*A*x, y = A*A^T*x • Powers (y = A^k*x), sparse triple product (R*A*R^T), … • Optimization (implementation) space • A has special structure (e.g., symmetric, banded, …) • Register blocking • Cache blocking • Multiple dense vectors (x) • Hybrid data structures (e.g., splitting, switch-to-dense, …) • Matrix reordering
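As a baseline for the optimizations that follow, here is a compressed sparse row (CSR) version of the central kernel y = A*x, written as a plain Java sketch (Sparsity itself generates tuned low-level code; this is only to fix ideas):

    // Baseline sparse matrix-vector multiply, y = A*x, with A in compressed sparse
    // row (CSR) format: rowPtr has n+1 entries, and colInd/val hold the nonzeros of
    // each row contiguously. Note the indirect access x[colInd[k]], the source of
    // the irregular memory references discussed above.
    class CsrSpmv {
        static void spmv(int n, int[] rowPtr, int[] colInd, double[] val,
                         double[] x, double[] y) {
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
                    sum += val[k] * x[colInd[k]];
                }
                y[i] = sum;
            }
        }
    }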
Register Blocking Optimization • Identify small dense blocks of nonzeros • Fill in extra zeros to complete blocks • Use an optimized multiplication code for the particular block size • (Figure: example of a 2x2 register-blocked matrix) • Improves register reuse, lowers indexing overhead • Filling in zeros increases storage and computation
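A sketch of the 2x2 case in the same Java style: the matrix is stored in block compressed sparse row (BCSR) form, each stored block is a dense 2x2 tile (with explicit zeros filling partial blocks), and the inner multiply is unrolled so the block entries and the two x values can be kept in registers. Array names and layout here are illustrative assumptions, not Sparsity's generated code.

    // 2x2 register-blocked SpMV (BCSR): blockRowPtr/blockColInd index 2x2 blocks,
    // and blockVal stores each block's 4 entries contiguously (row-major), with
    // explicit zeros completing partial blocks. Matrix dimensions are assumed to
    // be even (handled by padding). The unrolled body reuses x0/x1 and keeps the
    // partial sums y0/y1 in registers, cutting index overhead per nonzero.
    class BcsrSpmv2x2 {
        static void spmv(int nBlockRows, int[] blockRowPtr, int[] blockColInd,
                         double[] blockVal, double[] x, double[] y) {
            for (int ib = 0; ib < nBlockRows; ib++) {
                double y0 = 0.0, y1 = 0.0;
                for (int b = blockRowPtr[ib]; b < blockRowPtr[ib + 1]; b++) {
                    int j = 2 * blockColInd[b];        // column of the block's left edge
                    double x0 = x[j], x1 = x[j + 1];
                    int v = 4 * b;                     // start of this block's values
                    y0 += blockVal[v]     * x0 + blockVal[v + 1] * x1;
                    y1 += blockVal[v + 2] * x0 + blockVal[v + 3] * x1;
                }
                y[2 * ib]     = y0;
                y[2 * ib + 1] = y1;
            }
        }
    }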
Register Blocking Performance Model • Estimate the performance of register blocking from two quantities: • Estimated raw performance (machine-dependent): Mflop/s of a dense matrix stored in sparse r×c blocked format • Estimated overhead (matrix-dependent): the fill required to complete r×c blocks • Choose the r×c that maximizes (estimated raw performance) / (estimated overhead) • Use sampling to further reduce time; row and column dimensions are computed separately
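A sketch of the selection heuristic implied by this model (names and array layout are assumptions for illustration): combine the offline machine profile with the matrix's estimated fill ratios and keep the block size with the highest predicted rate.

    // Heuristic block-size selection: denseMflops[r][c] is the offline machine
    // profile (dense matrix in sparse r-by-c blocked format), fillRatio[r][c] is
    // the estimated (true nonzeros + filled zeros) / true nonzeros for this
    // matrix. Predicted performance is profile / fill; pick the best (r, c).
    // Both arrays are assumed to be indexed 1..maxB (allocated with size maxB+1).
    class BlockSizeChooser {
        static int[] choose(double[][] denseMflops, double[][] fillRatio, int maxB) {
            int bestR = 1, bestC = 1;
            double best = 0.0;
            for (int r = 1; r <= maxB; r++) {
                for (int c = 1; c <= maxB; c++) {
                    double predicted = denseMflops[r][c] / fillRatio[r][c];
                    if (predicted > best) { best = predicted; bestR = r; bestC = c; }
                }
            }
            return new int[]{bestR, bestC};
        }
    }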
Machine Profiles Computed Offline • Register blocking performance for a dense matrix in sparse format, measured on four machines: 333 MHz Sun Ultra 2i, 500 MHz Intel Pentium III, 375 MHz IBM Power3, 800 MHz Intel Itanium • (Figure: one performance profile per machine over block sizes r×c)
Register Blocked SpMV Performance: Ultra 2i (See upcoming SC’02 paper for a detailed analysis.)
Register Blocked SpMV Performance: Power3 Additional low-level performance tuning is likely to help on the Power3.