450 likes | 604 Views
MPJ meets Gadget. A Java Code for Cosmological Simulations?. Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK March 24 2006 dbc@ecs.soton.ac.uk. Contents. Java for Scientific Computing - background MPJ Express overview The Gadget 2 code
E N D
MPJ meets Gadget A Java Code for Cosmological Simulations? Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK March 24 2006 dbc@ecs.soton.ac.uk
Contents • Java for Scientific Computing - background • MPJ Express overview • The Gadget 2 code • Making a Java version of Gadget
Acknowledgements • Work done in collaboration with Mark Baker and Aamir Shafi. • Most work done by latter.
Java History • The Java language grabbed public attention in 1995, with the release of the HotJava experimental Web browser, and its later incorporation into the Netscape browser. • Very suddenly, Java became one of the most important programming languages in the industry. • Within a year or so of its release, some people were suggesting that Java might be good for high performance scientific computing. • A workshop on Java for Science and Engineering Computation was held at Syracuse University in late 1996 – a precursor of the subsequent Java Grande activities.
Java for High Performance Computing? • Various people (Java Grande, 1997-2003, RIP) have argued that in general Java provides a better programming platform than precursors. • In the parallel computing world there has been a long history of novel language concepts to support parallelism, multithreading etc. • Java seems to incorporate at least some of these ideas, and it has the benefit that it is a “mainstream” language. • In general Java encourages better software engineering and is more portable than, say, Fortran or C. • There is a huge body of supporting software for Java.
Performance • To enable safe execution of code on multiple platforms, Java programs are translated to portable byte code instructions for the Java Virtual Machine (JVM). • Modern JVMs perform compilation from byte code to native machine code on the fly, as the Java program is executed. • Typical JVMs implement many of the most important kinds of optimizations used by compilers for “traditional” programming languages. • Adaptive compilation may even allow optimizations that are impractical for static compilers, because run-time information is available. • Quality of compilation comparable to many C or Fortran compilers. • But Java has safety features that may limit performance. • Difficulty of managing memory explicitly, exact exception processing, …
MPJ Background • MPI was introduced in June 1994 as a standard message passing API for parallel scientific computing: • Language bindings for C, C++, and Fortran. • Java Grande Message Passing Workgroup defined Java bindings in 1998. • Previous efforts follow two approaches: • Pure Java approaches: • Remote Method Invocation (RMI), • Java Sockets, • Java Native Interface (JNI) approaches: • mpiJava – Java wrapper to platform native MPI.
MPJ Goals • Modular design: unify earlier approaches by pluggable communication devices. • Support high-level functionality of MPI-1.2 in pure Java code, as far as possible: • Point to point communications, • Collective communications, • Groups, communicators, and contexts, • Derived datatypes – buffering API. • Open source distribution: http://dsg.port.ac.uk/projects/mpj
MPJ Demos • There are a few existing demos for MPJ, inherited from the mpiJava project, e.g. • A 2d Fluid Flow example. • Some Monte Carlo simulations of 2d spin systems from condensed matter physics. • These have nice GUIs, but aren’t very representative of real codes used by computational scientists.
Gadget-2 • Gadget-2 is a free-software, production code for cosmological N-body (and hydrodynamic) computations. http://www.mpa-garching.mpg.de/gadget • Written by Volker Springel, of the Max Plank Institute for Astrophysics, Garching. • It is written in the C language – already parallelized using MPI. • Versions have been used in various research papers in astrophysics literature, including the Millennium Simulation.
Millennium Simulation • See the paper Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution by Springel et al on Gadget home page. • Follows evolution of 1010 dark matter “particles” from early Universe (z = 127) to current day. • Performed on 512 nodes of IBM p690. • Used 1TB of distributed memory. • 350,000 CPU hours – 28 days elapsed time. • Floating point performance around 0.2 TFLOPS. • Around 20Tb total data generated.
Dynamics in Gadget • Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity. • Plus some hydrodynamic forces, but these don’t affect dominant Dark Matter. • Co-moving coordinates take account of expansion of Universe according to General Relativity, but otherwise basically classical mechanics. • Classical N-body problem.
Gadget Main Loop • This is a slightly simplified view of the Gadget code: … Initialize … while (not done) { move_particles() ; // update positions domain_Decomposition() ; compute_accelerations() ; advance_and_find_timesteps() ; // update velocities } • Most of the interesting work happens in compute_accelerations() (and domain_Decomposition() – see later).
Computing Forces • The function compute_accelerations() must compute forces experienced by all particles. • In particular must compute gravitational force. • Because gravity is a long range force, every particle affects every other. Naively, total cost is O(N2). • with N ≈ 1010, this is quite infeasible. • Need some kind of approximation. • Intuitive approach: when computing force on particle i, group together particles in selected regions distant from i, and treat groups as single particles, located at their centres of mass.
Barnes Hut Tree • First divide cubical region of 3d space into 23 = 8 regions, halving each dimension. • For every sub-region that contains any particles, divide again recursively to make an octtree, until “leaf” regions have at most one particle.
Two Dimensional Example • Picture borrowed from www.cs.berkeley.edu/~demmel/cs267/lecture26/lecture26.html
Barnes Hut Force Computation • To compute the force on a particle i, traverse tree starting from root: • if a node n is “distant from” i, just add contribution to force on i from centre of mass of n– no need to visit children of n; • if node n is “close to” i, visit children of n and recurse. • Hinges on definition of distant from/close to. • Basic idea is that a node representing some region of space is distant from a particle i if the angle it subtends is smaller than a threshold opening angle: n θ i
Complexity • On average, number of nodes “opened” to compute force on i is O(log N), as opposed to visiting O(N) particles in naïve algorithm. • A huge win when N ≈ 1010.
Domain Decomposition • Need to divide space and/or particle set into domains, where each domain is handled by a single processor. • Problem is that we can’t just divide space evenly, because some regions will have many more particle than others – poor load balancing. • Conversely, can’t just divide particles evenly, because particles move throughout space, and want to maintain physically close particles on the same processor, as far as practical – communication problem.
Peano-Hilbert Curve • Warren and Salmon originally suggested using a space-filling curve: • Picture borrowed from http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf
Peano-Hilbert Key • Gadget applies the recursion 20 times, logically dividing space into up to 220 × 220 × 220 cells on the Peano-Hilbert curve. • Then can label each cell by its location along the Peano-Hilbert curve – 260 possible locations comfortably fit into a 64-bit word.
Decomposition based on P-H Curve • Because number of cells> > number of particles, segments of linear Peano-Hilbert curve sparsely populated. • Sort particles by their Peano-Hilbert key, then divide evenly into P domains. • Intuitively – stretch out the P-H curve with particles dotted along it; segment it into P parts where each part has the same number of particles. • Characteristics of this decomposition: • Good load balancing. • Domains simply connected and quite “compact” in real space, because particles that are close along P-H curve are close in real space (converse often but not always true). • Domains have relatively simple mapping to BH octtree nodes.
Distributed Representation of Tree • Every processor hold a copy of the root nodes, and a copy of all child nodes down to the point where all particles in of a node are held on a single remote processor. Remotely held nodes are called pseudo-particles. • To compute the force on a single local target particle, traverse tree from root as usual, and accumulate contributions from locally held particles. • Build an export list containing target particle and hosts of pseudo-particles encountered in walk.
Communication • After local computation for all target particles, process export list and send list of local target particles to all hosts that own pseudo-particle nodes needed for those particles. • All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles. • These contributions are returned to the original processor, and added into the accelerations for the target particles.
Other Communications • Notably: • The Domain Decomposition itself requires (in principle) a distributed sort of the particle list. • Gadget approximates this sort, but still fairly intricate communications. • In Gadget all communication is uses the standard MPI library. It makes fairly extensive use of collective communication routines.
History • MPJ Express released 2005. • Realized we need a realistic exemplar code: both as a demo, and to drive further improvements of our software. • First thoughts that an N-body simulation code would make sense – Summer 2005. • Learned about Gadget in December that year. • Effort to make a Java version of Gadget started seriously early February 2006 – expecting many months work. • Non-trivial simulations running in March.
Gadget Code Statistics • Public distribution is around 17,000 lines of C, distributed over about 30 source files. • Dependencies on: • MPI library for parallelization. • GNU scientific library (but only a handful of functions) • FFTW – library for parallel Fourier transforms. • Comes with a set of initial conditions files that we can use to test our code, including “colliding galaxies” and “cluster formation”.
Translation of Gadget Code • Manually translated, but in first cut, deliberately keep the Java code as similar to C as possible. • e.g. nearly all of Java Gadget is currently implemented as static methods on one enormous class called Main. • Small supporting classes corresponding to structs of C code. • Currently not supporting a number of features, including: • PMTree algorithm (an optimization using Fourier Transforms on a mesh to compute long-range part of gravitational force), • Periodic Boundary Conditions, • Multiple Initial Conditions files, • COMPUTE_POTENTIAL_ENERGY, ISOTHERM_EQS, FLEXSTEPS, OUTPUTPOTENTIAL, FORCETEST, MAKEGLASS, PSEUDOSYMMETRIC, … • Not quite complete in a few other ways (“magic numbers”).
Handling Dependencies • Replace MPI calls with MPJ calls (!) • FFTW is not needed, because we disable the PMTree algorithm. • Few GSL functions that were needed (numeric quadrature, etc) were hand translated to Java.
Test Cases • Restrictions aside, we have successfully run Colliding Galaxies and Cluster Formation example simulations on 2, 4 and 8 processors. • These use pure Dark Matter – hydrodynamics code not yet tested. • Results are indistinguishable from running the original Gadget.
Custom serialization in Java Gadget-2 -- Motivation • Java’s built in object serialization can have detrimental effect on parallel application performance • Experiment • Compare the performance of sending • An array of 1, 1Kbytes, 1Mbytes byte array elements • An array of 1, 1Kbytes, 1Mbytes object array elements, where each object contains exactly one byte • Both tests are communicating same amount of data
Custom serialization in Java Gadget-2 • In the original C Gadget-2, initial conditions are read into an array of C struct called ParticleData and SphParticleData • Particles that need to be exported are copied to a contiguous memory region called CommBuffer • In Java version, ParticleData and SphParticleData are objects arrays • CommBuffer is a contiguous memory region, which is an instance of ByteBuffer class • Each primitive datatype in object arrays are copied to (packed) to CommBuffer by the sender and copied from (unpacked) CommBuffer by the receiver • Helps avoiding the overheads of Java’s object serialization
Initial Benchmarks • Here we are running the simulations defined by the initial conditions files distributed with Gadget. • The numbers of particles are relatively modest, so parallel speedups are less than perfect.
Interpretation • These are first-cut results – comparing the production C Gadget with a mostly unoptimized Java code. • In this version the Java code appears to be roughly a factor of two slower than C. • It is fairly clear that Java memory usage is very inefficient, probably this has an adverse effect on numerical performance (poor caching). • Need to investigate whether Java code can be rewritten to make better use of memory.
Possible enhancement to the MPJ API • Support for sending a basic datatype • For example, if we want to send an integer, it has to be part of or copied to an integer array before communication • Support for sending from and receiving to ByteBuffer • Our buffering API in MPJ Express allows efficient reuse of ByteBuffers • Application developers can get a chunk of memory (ByteBuffer) using buffering API, copy their data, and use it for communication • MPJ could now use this ByteBuffer for socket send method • No additional copying • Can help reduce latency