Explore the use of open-source compilers and tools for scalable global address space computing, including languages like UPC and Titanium. Learn about the programming models, benchmarks, and plans for these languages.
Open-source compilers and tools for scalable global address space computing: UPC and Titanium • Kathy Yelick • University of California, Berkeley, and Lawrence Berkeley National Laboratory
Outline • Global Address Languages in General • UPC • Language overview • Berkeley UPC compiler status and microbenchmarks • Application benchmarks and plans • Titanium • Language overview • Berkeley Titanium compiler status • Application benchmarks and plans
Global Address Space Languages • Explicitly-parallel programming model with SPMD parallelism • Fixed at program start-up, typically 1 thread per processor • Global address space model of memory • Allows programmer to directly represent distributed data structures • Address space is logically partitioned • Local vs. remote memory (two-level hierarchy) • Programmer control over performance critical decisions • Data layout and communication • Performance transparency and tunability are goals • Initial implementation can use fine-grained shared memory • Suitable for current and future architectures • Either shared memory or lightweight messaging is key • Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)
Global Address Space • The languages share the global address space abstraction • Shared memory is partitioned among processors • Remote memory may stay remote: no automatic caching implied • One-sided communication through reads/writes of shared variables • Both individual and bulk memory copies • Differ on details • Some models have a separate private memory area • Generality of distributed arrays and how they are constructed • [Figure: the global address space partitioned into shared segments X[0], X[1], …, X[P], one per thread, each thread also holding a private area and pointers (ptr:) that may refer into shared space]
UPC Programming Model Features • SPMD parallelism • fixed number of images during execution • images operate asynchronously • Several kinds of array distributions • double a[n]: a private n-element array on each processor • shared double a[n]: an n-element shared array with cyclic mapping • shared [4] double a[n]: a block-cyclic array with 4-element blocks • shared [0] double *a = (shared [0] double *) upc_alloc(n*sizeof(double)): a shared array whose elements are all local to the allocating thread • Pointers for irregular data structures • shared double *sp: a pointer to shared data • double *lp: a pointer to private data (a small declaration sketch follows below)
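To make these declarations concrete, the following small UPC fragment (written for this page, not taken from the talk) pulls the examples above together; N and the variable names are illustrative:

  #include <upc.h>                      /* UPC keywords and library routines */

  #define N 1024

  double priv[N];                       /* a private N-element array on every thread        */
  shared double cyc[N*THREADS];         /* shared array, default cyclic element layout      */
  shared [4] double blk[N*THREADS];     /* shared array, block-cyclic with 4-element blocks */

  void layout_example(void) {
      int i;
      /* a shared array whose elements all have affinity to the calling thread */
      shared [0] double *loc = (shared [0] double *) upc_alloc(N * sizeof(double));
      loc[0] = 0.0;                     /* local shared access */

      shared double *sp = &cyc[0];      /* pointer to shared data  */
      double *lp = priv;                /* pointer to private data */
      *lp = *sp;                        /* fine-grained read through a pointer-to-shared */

      /* each thread updates only the elements it has affinity to */
      upc_forall (i = 0; i < N*THREADS; i++; &cyc[i])
          cyc[i] = MYTHREAD;
  }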
UPC Programming Model Features • Global synchronization • upc_barrier: traditional barrier • upc_notify/upc_wait: split-phase global synchronization • Pair-wise synchronization • upc_lock/upc_unlock: traditional locks • Memory consistency model with two types of accesses • Strict: must be performed immediately and atomically; typically a blocking round-trip message if remote • Relaxed: must still preserve dependencies, but other processors may observe these accesses out of order • Parallel I/O • Based on ideas in MPI I/O • Specification for UPC by Thakur, El-Ghazawi, et al. (a usage sketch follows below)
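A brief illustrative sketch of how these synchronization and consistency features are used together; the flag hand-off pattern below is a common idiom written for this page, not code from the talk:

  #include <upc.h>
  #include <upc_relaxed.h>              /* relaxed is the default model for this file */

  shared double data[THREADS];          /* relaxed accesses: may be reordered by the system */
  strict shared int ready;              /* strict accesses: ordered and effectively atomic  */
  upc_lock_t *lock;                     /* pair-wise synchronization                        */

  int main(void) {
      lock = upc_all_lock_alloc();      /* collective: all threads get the same lock */

      data[MYTHREAD] = 1.0 * MYTHREAD;  /* relaxed write to shared data */
      if (MYTHREAD == 0)
          ready = 1;                    /* strict write: earlier shared writes by this
                                           thread complete before it can be observed  */

      upc_barrier;                      /* traditional global barrier */

      upc_notify;                       /* split-phase barrier: overlap independent work */
      /* ... purely local work here ... */
      upc_wait;

      upc_lock(lock);                   /* critical section guarded by a traditional lock */
      /* ... update a shared counter, append to a shared list, etc. ... */
      upc_unlock(lock);

      return 0;
  }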
Berkeley UPC Compiler • Compiler based on Open64 • Recently merged Rice sources • Multiple front-ends, including gcc • Intermediate form called WHIRL • Current focus on C backend • IA64 possible in future • UPC Runtime • Pointer representation • Shared/distributed memory • Communication in GASNet • Portable • Language-independent • [Diagram: UPC source → higher WHIRL → optimizing transformations → lower WHIRL → C + runtime, or assembly (IA64, MIPS, …) + runtime]
Design for Portability & Performance • UPC to C translator: • Translates UPC to C, inserting runtime calls for parallel features (a lowering sketch follows below) • UPC runtime: • Allocates shared data; implements pointers-to-shared • GASNet: • A uniform interface for low-level communication primitives • Portability: • C is our intermediate language • GASNet is itself layered, with a small core as the essential part • High performance: • Native C compiler optimizes serial code • Translator can perform communication optimizations • GASNet can access the network directly
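As a rough illustration of this layering, a shared store such as a[i] = x might be lowered along the following lines; every sketch_* name is a hypothetical placeholder invented for this page, not the actual Berkeley UPC runtime interface:

  /* Hypothetical sketch of what the translator might emit for "a[i] = x;"
     when a is a pointer-to-shared.                                        */
  typedef struct { unsigned long addr; unsigned thread; unsigned phase; } sketch_ptr_t;

  extern int          sketch_is_local(sketch_ptr_t p);             /* affinity test                */
  extern void        *sketch_to_local(sketch_ptr_t p);             /* convert to a local address   */
  extern sketch_ptr_t sketch_index(sketch_ptr_t p, int i);         /* pointer-to-shared arithmetic */
  extern void         sketch_put_double(sketch_ptr_t p, double v); /* one-sided put via GASNet     */

  void translated_store(sketch_ptr_t a, int i, double x) {
      sketch_ptr_t elem = sketch_index(a, i);     /* runtime handles thread/phase updates */
      if (sketch_is_local(elem))
          *(double *) sketch_to_local(elem) = x;  /* local affinity: plain C store, left to
                                                     the native C compiler to optimize      */
      else
          sketch_put_double(elem, x);             /* remote affinity: GASNet communication  */
  }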
Berkeley UPC Compiler Status • UPC extensions added to the front-end • Code generation complete • Some issues related to code quality remain (hints to backend compilers) • GASNet communication layer • Running on Quadrics/Elan, IBM/LAPI, Myrinet/GM, and MPI • Optimized for small non-blocking messages and compiled code • Next step: strided and indexed put/get, leveraging ARMCI work • UPC runtime layer • Developed and tested on all GASNet implementations • Supports multiple pointer representations • Next step: direct shared memory support • Release scheduled for later this month • A glitch related to include files and usability remains to be ironed out
Pointer-to-Shared Representation • A UPC pointer-to-shared carries three fields: address, thread, and phase • UPC has three different kinds of pointers: block-cyclic, cyclic, and indefinite (always local) • A pointer needs a “phase” to keep track of where it is within a block • Source of overhead for updating and dereferencing • Consumes space in the pointer • Our runtime has special cases for: • Phaseless (cyclic and indefinite) pointers – skip the phase update • Indefinite pointers – skip the thread id update • Pointer size/representation is easily reconfigured • 64 bits on small machines, 128 on large; word or struct (an illustrative layout follows below)
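One way to picture those fields (an illustrative layout only, not necessarily the exact Berkeley UPC encoding):

  #include <stdint.h>

  /* A 128-bit struct form such as might be used on large machines: */
  typedef struct {
      uint64_t addr;    /* virtual address within the owning thread's shared segment */
      uint32_t thread;  /* owning thread id                                           */
      uint32_t phase;   /* position within the current block (block-cyclic only)      */
  } pshared_sketch_t;

  /* Phaseless pointers (cyclic and indefinite) never touch the phase field,
     and indefinite pointers never touch the thread field, so the runtime can
     special-case both.  On small machines the same information can be packed
     into a single 64-bit word instead of a struct.                            */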
Preliminary Performance • Testbed • Compaq AlphaServer with the Quadrics GASNet conduit • Compaq C compiler for the translated C code • Microbenchmarks • Measure the cost of UPC language features and constructs • Shared pointer arithmetic, barrier, allocation, etc. • Vector addition: no remote communication • NAS Parallel Benchmarks • EP: no communication • IS: large bulk memory operations • MG: bulk memputs • CG: fine-grained vs. bulk memputs
Performance of Shared Pointer Arithmetic • Phaseless pointers are an important optimization • Indefinite pointers are almost as fast as regular C pointers • General block-cyclic pointers are about 7x slower for addition • Competitive with the HP compiler, which generates native code • Both compilers have known opportunities for improvement
Cost of Shared Memory Access • Local shared accesses somewhat slower than private ones • HP has improved local performance in newer version • Remote accesses worse than local, as expected • Runtime/GASNet layering for portability is not a problem
NAS PB: EP • EP = Embarrassingly Parallel has no communication • Serial performance via C code generation is not a problem
NAS PB: IS • IS = Integer Sort is dominated by Bulk Communication • GASNet bulk communication adds no measurable overhead
NAS PB: MG • MG = Multigrid involves medium bulk copies • “Berkeley” reveals a slight serial performance degradation due to casts • Berkeley-C uses the original C code for the inner loops
Scaling MG on the T3E • Scalability of the language shown here for the T3E compiler • Direct shared memory support is probably needed to be competitive on most current machines
Mesh Generation in UPC • Parallel Mesh Generation in UPC • 2D Delaunay triangulation • Based on Triangle software by Shewchuk (UCB) • Parallel version from NERSC uses dynamic load balancing, software caching, and parallel sorting
UPC Interactions • UPC consortium • Tarek El-Ghazawi is coordinator: semi-annual meetings, ~daily e-mail • Revised UPC Language Specification (IDA,GWU,…) • UPC Collectives (MTU) • UPC I/O Specifications (GWU, ANL-PModels) • Other Implementations • HP (Alpha cluster and C+MPI compiler (with MTU)) • MTU (C+MPI Compiler based on HP compiler, memory model) • Cray (X1 implementation) • Intrepid (SGI implementation based on gcc) • Etnus (debugging) • UPC Book: T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick • Goal is proofs by SC03 • HPC HPCS Effort • Recent interest from Sandia
Titanium • Based on Java, a cleaner C++ • classes, automatic memory management, etc. • compiled to C and then native binary (no JVM) • Same parallelism model as UPC and CAF • SPMD with a global address space • Dynamic Java threads are not supported • Optimizing compiler • static (compile-time) optimizer, not a JIT • communication and memory optimizations • synchronization analysis (e.g. static barrier analysis) • cache and other uniprocessor optimizations
Summary of Features Added to Java • Scalable parallelism (Java threads replaced) • Immutable (“value”) classes • Multidimensional arrays with iterators • Checked Synchronization • Operator overloading • Templates • Zone-based memory management (regions) • Libraries for collective communication, distributed arrays, bulk I/O
Immutable Classes in Titanium • For small objects, it is sometimes preferable to • avoid the level of indirection • pass by value (copy the entire object) • especially when immutable – fields are never modified • Example: immutable class Complex { Complex() { real = 0; imag = 0; } ... } Complex c1 = new Complex(7.1, 4.3); c1 = c1.add(c1); • Addresses performance and programmability • Similar to structs in C (not C++ classes) in terms of performance • Adds support for complex types
Multidimensional Arrays • Arrays in Java are objects • Array bounds are checked • Multidimensional arrays are arrays-of-arrays • Safe and general, but potentially slow • A new kind of multidimensional array is added in Titanium • Sub-arrays are supported (interior, boundary, etc.) • Indexed by Points (tuples of ints) • Combined with unordered iteration to enable optimizations: foreach (p in A.domain()) { A[p] ... } • “A” can be multidimensional, an interior region, etc.
Communication • Titanium has explicit global communication: • Broadcast, reduction, etc. • Primarily used to set up distributed data structures • Most communication is implicit through the shared address space • Dereferencing a global reference, g.x, can generate communication • Arrays have copy operations, which generate bulk communication: A1.copy(A2) • Automatically computes the intersection of A1 and A2’s index set or domain
Distributed Data Structures • Building distributed arrays: Particle [1d] single [1d] allParticle = new Particle [0:Ti.numProcs()-1][1d]; Particle [1d] myParticle = new Particle [0:myParticleCount-1]; allParticle.exchange(myParticle); • Now each processor has an array of pointers, one to each processor's chunk of particles • [Figure: exchange performs an all-to-all broadcast of the per-processor particle arrays among P0, P1, P2]
Titanium Compiler Status • The Titanium compiler runs on almost any machine • Requires a C compiler (and a decent C++ compiler to build the translator) • Pthreads for shared memory • A communication layer for distributed memory (or hybrid) • Recently moved to live on GASNet: obtained GM and Elan support and an improved LAPI implementation • Leverages other PModels work for maintenance • Recent language extensions • Indexed array copy (scatter/gather style) • Non-blocking array copy under development • Compiler optimizations • Cache optimizations and loop optimizations • Communication optimizations for overlap, pipelining, and scatter/gather under development
Applications in Titanium • Several benchmarks • Fluid solvers with Adaptive Mesh Refinement (AMR) • Conjugate Gradient • 3D Multigrid • Unstructured mesh kernel: EM3D • Dense linear algebra: LU, MatMul • Tree-structured n-body code • Finite element benchmark • Genetics: micro-array selection • SciMark serial benchmarks • Larger applications • Heart simulation • Ocean modeling with AMR (in progress)
Serial Performance (Pure Java) • Several optimizations in Titanium compiler (tc) over the past year • These codes are all written in pure Java without performance extensions
AMR for Ocean Modeling • Ocean modeling [Wen, Colella] • Requires embedded boundaries to model the ocean floor and coastline • Results in irregular data structures and array accesses • Starting with an AMR solver for ocean flow this year • Compiler and language support for this irregular problem is under design • [Graphic from Titanium AMR gas dynamics, McCorquodale and Colella]
Heart Simulation • Immersed Boundary Method [Peskin/MacQueen] • Fibers (e.g., heart muscles) are modeled by lists of fiber points • The fluid space is modeled by a regular lattice • Irregular fiber lists need to interact with the regular fluid lattice • Trade-off between load balancing the fibers and minimizing communication • Memory- and communication-intensive • Random array access is the key performance problem • Developed compiler optimizations to improve its performance • Application effort funded by NSF/NPACI
Parallel Performance and Scalability • Poisson solver using the “Method of Local Corrections” [Balls, Colella] • Communication < 5%; scaled speedup nearly ideal (flat) • [Plots: IBM SP and Cray T3E]
Titanium Interactions • GASNet interactions • The GASNet work is common to UPC and Titanium • Joint effort between U.C. Berkeley and LBNL (the UPC project is primarily at LBNL; Titanium is at U.C. Berkeley) • Collaboration with Nieplocha on the communication runtime • Application collaborators • Charles Peskin and Dave McQueen at the Courant Institute • Phil Colella and Tong Wen at LBNL • Scott Baden and Greg Balls at UCSD • Involved in the Sun HPCS effort • Participation in Global Address Space tutorials
The End • http://upc.nersc.gov • http://titanium.cs.berkeley.edu/
NAS PB: CG • CG = Conjugate Gradient can be written naturally with fine-grained communication in the sparse matrix-vector product • Worked well on the T3E (and hopefully will on the X1) • For other machines, a bulk version is required
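For a flavor of the difference, here is an illustrative UPC sketch (not the NAS CG code itself) of the two ways the sparse matrix-vector product can touch the shared source vector; CHUNK and the function names are invented for the example:

  #include <upc.h>

  #define CHUNK 1000
  shared [CHUNK] double xs[CHUNK*THREADS];   /* source vector: one contiguous chunk per thread */

  /* Fine-grained version: every reference to xs may become a small remote read. */
  double dot_fine(int nnz, const int *col, const double *val) {
      double t = 0.0;
      for (int j = 0; j < nnz; j++)
          t += val[j] * xs[col[j]];            /* potentially one message per element */
      return t;
  }

  /* Bulk version: fetch one owner's whole chunk with a single upc_memget,
     then index the private copy (assumes, for simplicity, that this row's
     columns all fall inside that owner's chunk).                           */
  double dot_bulk(int owner, int nnz, const int *col, const double *val) {
      double xloc[CHUNK];
      upc_memget(xloc, &xs[owner*CHUNK], CHUNK * sizeof(double));  /* one bulk transfer */
      double t = 0.0;
      for (int j = 0; j < nnz; j++)
          t += val[j] * xloc[col[j] - owner*CHUNK];
      return t;
  }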