MPJ Express and Lessons from Java Gadget Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk
Contents • MPJ Express overview • The Gadget 2 code • Making a Java version of Gadget
Acknowledgements • Work done in collaboration with Mark Baker and Aamir Shafi. • Most of the work was done by the latter!
Java for High Performance Computing? • Various people have argued that Java provides a good programming platform for large scale computation (e.g. Java Grande, 1997 - ~2003). • Parallel computing has a long history of novel language concepts to support parallelism, multithreading etc. • Java seems to incorporate at least some of these ideas, and has the benefit that it is a “mainstream” language. • Java encourages better software engineering and is more portable than, say, Fortran or C. • There is a large body of supporting software for Java.
Performance • Modern JVMs compile byte code to native machine code as the Java program is executed. • Typical JVMs implement many of the optimizations used by compilers for “traditional” programming languages. • Adaptive compilation may even allow optimizations that are impractical for static compilers, because run-time information is available. • Quality of the generated code is comparable to many C or Fortran compilers. • But Java has safety features that may limit performance: • difficulty of managing memory explicitly, exact exception processing, …
MPI Background • MPI was introduced in 1994 as a standard message passing API for parallel scientific computing: • Language bindings for C, C++, and Fortran. • Original programming model was SPMD (Single Program Multiple Data) ― P copies of a single program run on a fixed number, P, of processors. • Only interaction is by message exchange ― no sharing of memory between processors (c.f. CSP). • Introduced idea of communicator as basic communication abstraction (c.f. channels). A communicator is shared by a group of processors. • Supports library development and collective operations.
Communicators • [Diagram: a communicator C is shared by Proc 0, Proc 1 and Proc 2; Proc 0 calls C.Send(…, 2) and Proc 2 calls the matching C.Recv(…, 0)]
Collective Communications • [Diagram: a communicator C shared by Proc 0, Proc 1 and Proc 2; all three processes call C.Bcast(…, 0), with process 0 as the root – a minimal code sketch follows]
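To make the broadcast pattern concrete, here is a minimal sketch in the mpiJava/MPJ style used later in this talk. The Bcast signature assumed here – Bcast(buffer, offset, count, datatype, root) – follows the mpiJava binding; treat the details as an assumption rather than a definitive MPJ Express listing.

import mpi.*;

class BcastDemo {
  static public void main(String[] args) {
    MPI.Init(args) ;
    int myrank = MPI.COMM_WORLD.Rank() ;

    int[] params = new int[4] ;
    if(myrank == 0) {
      // The root fills in the data to be shared with every process.
      params[0] = 42 ; params[1] = 7 ; params[2] = 1 ; params[3] = 99 ;
    }

    // Every process in the communicator makes the same call; afterwards
    // all of them hold the root's copy of params.
    MPI.COMM_WORLD.Bcast(params, 0, params.length, MPI.INT, 0) ;

    System.out.println("rank " + myrank + " got params[0] = " + params[0]) ;
    MPI.Finalize() ;
  }
}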
MPJ Background • Java Grande Message Passing Workgroup defined Java bindings for MPI in 1998. • Previous efforts follow two approaches: • Pure Java approaches: • Remote Method Invocation (RMI), • Java Sockets, • Java Native Interface (JNI) approaches: • mpiJava – Java wrapper to platform native MPI.
Minimal mpiJava Program

import mpi.*;

class Hello {
  static public void main(String[] args) {
    MPI.Init(args) ;
    int myrank = MPI.COMM_WORLD.Rank() ;
    if(myrank == 0) {
      char[] message = "Hello, there".toCharArray() ;
      MPI.COMM_WORLD.Send(message, 0, message.length, MPI.CHAR, 1, 99) ;
    } else {
      char[] message = new char [20] ;
      MPI.COMM_WORLD.Recv(message, 0, 20, MPI.CHAR, 0, 99) ;
      System.out.println("received:" + new String(message) + ":") ;
    }
    MPI.Finalize() ;
  }
}
MPJ Express Goals • Modular design: unify earlier approaches by pluggable communication devices. • Support high-level functionality of MPI-1.2 in pure Java code, as far as possible: • Point to point communications, • Collective communications, • Groups, communicators, and contexts, • Derived datatypes – buffering API. • Thread safety ― support Java threads for parallelism within multicore nodes. • Open source distribution: http://mpj-express.org
MPJ Demos • There were a few existing demos for MPJ, inherited from the earlier mpiJava project, e.g. • A 2d Fluid Flow example. • Some Monte Carlo simulations of 2d spin systems from condensed matter physics. • These have nice GUIs, but aren’t very representative of real codes used by computational scientists.
Gadget-2 • Gadget-2 is a free-software, production code for cosmological N-body (and hydrodynamic) computations. http://www.mpa-garching.mpg.de/gadget • Written by Volker Springel, of the Max Planck Institute for Astrophysics, Garching. • It is written in the C language – already parallelized using MPI. • Versions have been used in various research papers in the astrophysics literature, including the Millennium Simulation.
Millennium Simulation • See the paper Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution by Springel et al. on the Gadget home page. • Follows evolution of 10¹⁰ dark matter “particles” from the early Universe (z = 127) to the current day. • Performed on 512 nodes of an IBM p690. • Used 1 TB of distributed memory. • 350,000 CPU hours – 28 days elapsed time. • Floating point performance around 0.2 TFLOPS. • Around 20 TB of total data generated.
Dynamics in Gadget • Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity. • Plus some hydrodynamic forces, but these don’t affect dominant Dark Matter. • Co-moving coordinates take account of expansion of Universe according to General Relativity, but otherwise basically classical mechanics. • Classical N-body problem.
Gadget Main Loop • This is a slightly simplified view of the Gadget code:

  … Initialize …
  while (not done) {
    move_particles() ;               // update positions
    domain_Decomposition() ;
    compute_accelerations() ;
    advance_and_find_timesteps() ;   // update velocities
  }

• Most of the interesting work happens in compute_accelerations() (and domain_Decomposition() – see later).
Computing Forces • The function compute_accelerations() must compute the forces experienced by all particles. • In particular it must compute the gravitational force. • Because gravity is a long-range force, every particle affects every other. Naively, the total cost is O(N²) – see the direct-sum sketch after this slide. • With N ≈ 10¹⁰, this is quite infeasible. • Need some kind of approximation. • Intuitive approach: when computing the force on particle i, group together particles in selected regions distant from i, and treat each group as a single particle located at its centre of mass.
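As a point of reference, the naive force calculation mentioned above looks something like the following sketch: plain Newtonian gravity with a double loop over all particle pairs, which is what makes the cost O(N²). The flat pos/mass/acc arrays and the softening length eps are illustrative choices for the sketch, not Gadget's actual data layout.

// Illustrative O(N²) direct summation of gravitational accelerations.
// pos, mass and acc are hypothetical flat arrays (x, y, z per particle);
// eps is a softening length, included to avoid division by zero.
static void directSum(double[] pos, double[] mass, double[] acc,
                      int n, double G, double eps) {
  for (int i = 0; i < n; i++) {
    double ax = 0, ay = 0, az = 0;
    for (int j = 0; j < n; j++) {
      if (j == i) continue;
      double dx = pos[3*j]   - pos[3*i];
      double dy = pos[3*j+1] - pos[3*i+1];
      double dz = pos[3*j+2] - pos[3*i+2];
      double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
      double invR3 = 1.0 / (r2 * Math.sqrt(r2));
      ax += G * mass[j] * dx * invR3;
      ay += G * mass[j] * dy * invR3;
      az += G * mass[j] * dz * invR3;
    }
    acc[3*i] = ax; acc[3*i+1] = ay; acc[3*i+2] = az;
  }
}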
Barnes Hut Tree • First divide cubical region of 3d space into 2³ = 8 regions, halving each dimension. • For every sub-region that contains any particles, divide again recursively to make an oct-tree, until “leaf” regions have at most one particle.
Two Dimensional Example • Picture borrowed from www.cs.berkeley.edu/~demmel/cs267/lecture26/lecture26.html
Barnes Hut Force Computation • To compute the force on a particle i, traverse the tree starting from the root (sketched after this slide): • if a node n is “distant from” i, just add the contribution to the force on i from the centre of mass of n – no need to visit the children of n; • if a node n is “close to” i, visit the children of n and recurse. • Hinges on the definition of distant from/close to. • Basic idea is that a node representing some region of space is distant from a particle i if the angle it subtends is smaller than a threshold opening angle. [Diagram: node n subtends angle θ at particle i]
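A sequential sketch of this walk is below. TreeNode, its fields, and the opening test size/dist < θ are illustrative names and a simplified criterion, not the actual Gadget data structures or its exact cell-opening rule.

// Illustrative Barnes-Hut walk: accumulate the acceleration on a particle
// at ipos by opening any node whose subtended angle (roughly size/dist)
// exceeds the threshold theta.
class TreeNode {
  double   mass;                   // total mass of particles in this region
  double[] com = new double[3];    // centre of mass of the region
  double   size;                   // side length of the cubical region
  TreeNode[] children;             // up to 8 children; null for a leaf
}

static void treeForce(TreeNode n, double[] ipos, double[] acc,
                      double G, double theta) {
  if (n == null || n.mass == 0) return;
  double dx = n.com[0] - ipos[0];
  double dy = n.com[1] - ipos[1];
  double dz = n.com[2] - ipos[2];
  double dist = Math.sqrt(dx*dx + dy*dy + dz*dz);
  if (dist == 0) return;           // skip the particle's own contribution
  if (n.children == null || n.size / dist < theta) {
    // Leaf, or node is "distant": treat it as a single point mass.
    double invR3 = 1.0 / (dist * dist * dist);
    acc[0] += G * n.mass * dx * invR3;
    acc[1] += G * n.mass * dy * invR3;
    acc[2] += G * n.mass * dz * invR3;
  } else {
    // Node is "close": open it and recurse into the children.
    for (TreeNode child : n.children)
      treeForce(child, ipos, acc, G, theta);
  }
}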
Complexity • On average, number of nodes “opened” to compute force on i is O(log N), as opposed to visiting O(N) particles in naïve algorithm. • A huge win when N ≈ 10¹⁰.
Domain Decomposition • Need to divide space and/or the particle set into domains, where each domain is handled by a single processor. • Problem is that we can’t just divide space evenly, because some regions will have many more particles than others – poor load balancing. • Conversely, can’t just divide particles evenly, because particles move throughout space, and we want to keep physically close particles on the same processor as far as practical – communication problem.
Peano-Hilbert Curve • Warren and Salmon originally suggested using a space-filling curve: • Picture borrowed from http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf
Peano-Hilbert Key • Gadget applies the recursion 20 times, logically dividing space into up to 2²⁰ × 2²⁰ × 2²⁰ cells on the Peano-Hilbert curve. • Then can label each cell by its location along the Peano-Hilbert curve – 2⁶⁰ possible locations comfortably fit into a 64-bit word.
Decomposition based on P-H Curve • Because the number of cells >> the number of particles, segments of the linear Peano-Hilbert curve are sparsely populated. • Sort particles by their Peano-Hilbert key, then divide evenly into P domains (see the sketch after this slide). • Intuitively – stretch out the P-H curve with particles dotted along it; segment it into P parts where each part has the same number of particles. • Characteristics of this decomposition: • Good load balancing. • Domains are simply connected and quite “compact” in real space, because particles that are close along the P-H curve are close in real space (the converse is often but not always true). • Domains have a relatively simple mapping to BH oct-tree nodes.
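The essential step, once every particle has a key, is just a sort followed by an even split. The sketch below is sequential and uses illustrative names (key, domainOf); Gadget itself performs an approximate, distributed version of the same idea.

// Illustrative domain decomposition: sort particle indices by their
// Peano-Hilbert key, then cut the ordered list into P equal contiguous
// segments, one per processor.
static int[] decompose(long[] key, int P) {
  int n = key.length;
  Integer[] order = new Integer[n];
  for (int i = 0; i < n; i++) order[i] = i;

  // Sort particle indices by position along the Peano-Hilbert curve.
  java.util.Arrays.sort(order, (a, b) -> Long.compare(key[a], key[b]));

  // Assign equal-length segments of the sorted curve to the P processors.
  int[] domainOf = new int[n];
  for (int r = 0; r < n; r++) {
    int i = order[r];                        // particle at position r on the curve
    domainOf[i] = (int) ((long) r * P / n);  // processor owning that segment
  }
  return domainOf;
}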
Distributed Representation of Tree • Every processor holds a copy of the root node, and copies of all child nodes down to the point where all the particles of a node are held on a single remote processor. Such remotely held nodes are called pseudo-particles. • To compute the force on a single local target particle, traverse the tree from the root as usual, and accumulate force contributions from locally held particles. • Build an export list containing the target particle and the hosts of any pseudo-particles encountered in the walk.
Communication • After the computation for all locally held target particles, the export list is processed: each processor sends its list of local target particles to every host that owns pseudo-particle nodes needed for those particles (a sketch of this exchange follows below). • All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles. • These contributions are returned to the original processor, and added into the accelerations for the target particles.
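This exchange can be pictured as the pairwise pattern sketched below. The mpiJava-style Sendrecv call and the flattened double[] buffers are assumptions made for the sketch – the real code exchanges structured particle records and pairs the processors more carefully for performance.

// Illustrative pairwise exchange of force contributions between processors.
// exportTo[p] holds what this rank computed for particles owned by rank p;
// the returned array collects what every other rank computed for us.
static double[][] exchangeContributions(double[][] exportTo) throws Exception {
  int size = MPI.COMM_WORLD.Size() ;
  int rank = MPI.COMM_WORLD.Rank() ;
  double[][] importFrom = new double[size][] ;

  for (int p = 0; p < size; p++) {
    if (p == rank) { importFrom[p] = exportTo[p] ; continue ; }

    // First exchange the buffer lengths ...
    int[] sendCount = { exportTo[p].length } ;
    int[] recvCount = new int[1] ;
    MPI.COMM_WORLD.Sendrecv(sendCount, 0, 1, MPI.INT, p, 0,
                            recvCount, 0, 1, MPI.INT, p, 0) ;

    // ... then the contributions themselves.
    importFrom[p] = new double[recvCount[0]] ;
    MPI.COMM_WORLD.Sendrecv(exportTo[p], 0, sendCount[0], MPI.DOUBLE, p, 1,
                            importFrom[p], 0, recvCount[0], MPI.DOUBLE, p, 1) ;
  }
  return importFrom ;
}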
Other Communications • Notably: • The Domain Decomposition itself requires (in principle) a distributed sort of the particle list. • Gadget only approximates this sort, but it still involves fairly intricate communication. • In Gadget, all communication goes through the standard MPI library; it makes fairly extensive use of the collective communication routines.
Timeline • MPJ Express released 2005. • Realized we needed a realistic exemplar code: both as a demo, and to drive further improvements of our software. • First thoughts that an N-body simulation code would make sense – Summer 2005. • Learned about Gadget in December that year. • Effort to make a Java version of Gadget started seriously early February 2006 – expecting many months work. • Non-trivial simulations running in March.
Gadget Code Statistics • Public distribution is around 17,000 lines of C, distributed over about 30 source files. • Dependencies on: • MPI library for parallelization. • GNU Scientific Library (but only a handful of functions). • FFTW – library for parallel Fourier transforms. • Comes with a set of initial conditions files that we can use to test our code, including “colliding galaxies” and “cluster formation”.
Translation of Gadget Code • Manually translated, but in the first cut we deliberately kept the Java code as similar to the C as possible. • e.g. nearly all of Java Gadget is currently implemented as static methods on one enormous class called Main. • Small supporting classes correspond to the structs of the C code (see the sketch after this slide). • Currently not supporting a number of features, including: • PMTree algorithm (an optimization using Fourier Transforms on a mesh to compute the long-range part of the gravitational force), • Multiple Initial Conditions files, • COMPUTE_POTENTIAL_ENERGY, ISOTHERM_EQS, FLEXSTEPS, OUTPUTPOTENTIAL, FORCETEST, MAKEGLASS, PSEUDOSYMMETRIC, …
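For example, a C struct like Gadget's per-particle record maps onto a small Java class in the obvious way. The field list below is abbreviated and illustrative, not the full particle_data struct:

/* C original (abbreviated):
 *   struct particle_data {
 *     float Pos[3];
 *     float Vel[3];
 *     float Mass;
 *   };
 */

// Straightforward Java counterpart used in the first cut of the translation.
class ParticleData {
  float[] pos = new float[3] ;
  float[] vel = new float[3] ;
  float   mass ;
}

// The C array  struct particle_data *P  then becomes, for example:
//   ParticleData[] P = new ParticleData[numPart] ;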
Handling Dependencies • Replace MPI calls with MPJ calls (!) • Use of FFTW is avoided by disabling the PMTree algorithm (but this may slow down some simulations a lot). • The few GNU Scientific Library functions that were needed (numeric quadrature, etc.) were hand-translated to Java – a sketch of the kind of routine involved follows.
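As an indication of the scale of that effort, a hand-translated quadrature routine might look like the sketch below (composite Simpson's rule). This is not the GSL algorithm Gadget actually calls – the GSL routines are adaptive and more sophisticated – just the kind of small, self-contained numerical routine involved.

// Illustrative hand-translated quadrature: composite Simpson's rule.
interface Integrand { double eval(double x) ; }

static double simpson(Integrand f, double a, double b, int n) {
  if (n % 2 != 0) n++ ;                  // Simpson's rule needs an even count
  double h = (b - a) / n ;
  double sum = f.eval(a) + f.eval(b) ;
  for (int i = 1; i < n; i++)
    sum += f.eval(a + i * h) * (i % 2 == 0 ? 2.0 : 4.0) ;
  return sum * h / 3.0 ;
}

// e.g.  double r = simpson(x -> Math.exp(-x * x), 0.0, 1.0, 100) ;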
Test Cases • Restrictions aside, we successfully ran the Colliding Galaxies and Cluster Formation example simulations on varying numbers of processors. • These use pure Dark Matter – the hydrodynamics code is not yet tested. • Results are indistinguishable from running the original Gadget.
Optimizing Java Gadget • The initial Java version of Gadget was about 3 times slower than C. • Optimizations: • Replace communication of Object arrays (corresponding to C structs) with communication of their component primitive elements (floats, ints, etc). • This still wasn’t enough – the final version actually expunges large Object arrays from the program, flattening all struct arrays into arrays of primitive types (see the next slide). • Extended the MPJ Express API to manipulate Java ByteBuffers directly.
Flattening Object Arrays • To preserve memory locality, we decided to “flatten” large arrays of small objects (ParticleData) – replace an array of objects with an array of doubles containing the objects’ component fields, for example (see the sketch after this slide). • This makes the code less readable and maintainable than the original C code – clearly not acceptable. But we wanted speed. • Need a much better approach here. One possibility is some kind of macro pre-processing (it would be better if the Java language had structs…).
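Concretely, the flattening replaces the array of small objects with parallel primitive arrays (or one interleaved array). The names below are illustrative, not the actual Java Gadget field names:

// Before: an array of small objects – flexible, but each element is a
// separate heap object, so there is no guarantee of memory locality.
//   ParticleData[] P = new ParticleData[numPart] ;
//   double x = P[i].pos[0] ;

// After: the fields are flattened into primitive arrays, so e.g. all the
// positions sit contiguously in one double[].
static double[] pos ;    // 3 * numPart entries: x, y, z for each particle
static double[] vel ;    // 3 * numPart entries
static double[] mass ;   // numPart entries

static double posX(int i) { return pos[3 * i] ; }      // was P[i].pos[0]
static double posY(int i) { return pos[3 * i + 1] ; }  // was P[i].pos[1]
static double posZ(int i) { return pos[3 * i + 2] ; }  // was P[i].pos[2]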
Benchmarks • Graphs taken from Aamir Shafi’s thesis – you will find many more results there. • Some benchmarks are on up to 8 processors of Starbug cluster in Portsmouth Distributed Systems Group. • Other benchmarks are on up to 32 processors from NW-GRID cluster at Daresbury Laboratory.
Conclusions • It is possible to do large-scale numerical simulations in Java. • Speed is OK. • Memory consumption is still an issue (Java Gadget needs perhaps three times more memory than C Gadget). • Made various improvements to the MPJ Express implementation and API along the way. • As things stand, the Java Gadget code is not very maintainable – it is C-like, and then further obscured by optimization. • Needs more work to learn how to write maintainable Java code that is also fast – e.g. how to preserve memory locality in arrays of objects.