MPJ Express and Lessons from Java Gadget Bryan Carpenter OMII, University of Southampton Southampton SO17 1BJ, UK February 7 2007 dbc@ecs.soton.ac.uk
Contents • MPJ Express overview • The Gadget 2 code • Making a Java version of Gadget
Acknowledgements • Work done in collaboration with Mark Baker and Aamir Shafi. • Most of the work was done by the latter!
Java for High Performance Computing? • Various people have argued that Java provides a good programming platform for large scale computation (e.g. Java Grande, 1997 - ~2003). • Parallel computing has a long history of novel language concepts to support parallelism, multithreading etc. • Java seems to incorporate at least some of these ideas, and has the benefit that it is a “mainstream” language. • Java encourages better software engineering and is more portable than, say, Fortran or C. • There is a large body of supporting software for Java.
Performance • Modern JVMs compile byte code to native machine code as the Java program is executed. • Typical JVMs implement many of the optimizations used by compilers for “traditional” programming languages. • Adaptive compilation may even allow optimizations that are impractical for static compilers, because run-time information is available. • Quality of the generated code is comparable to many C or Fortran compilers. • But Java has safety features that may limit performance: • difficulty of managing memory explicitly, exact exception processing, …
MPI Background • MPI was introduced in 1994 as a standard message passing API for parallel scientific computing: • Language bindings for C, C++, and Fortran. • Original programming model was SPMD (Single Program Multiple Data) ― P copies of a single program run on a fixed number, P, of processors. • Only interaction is by message exchange ― no sharing of memory between processors (c.f. CSP). • Introduced idea of communicator as basic communication abstraction (c.f. channels). A communicator is shared by a group of processors. • Supports library development and collective operations.
Communicators • [Diagram: a communicator C is shared by Proc 0, Proc 1 and Proc 2; Proc 0 calls C.Send(…, 2) and Proc 2 calls the matching C.Recv(…, 0)]
Collective Communications • [Diagram: a communicator C shared by Proc 0, Proc 1 and Proc 2; all three processes call C.Bcast(…, 0), with process 0 as the root – a minimal code sketch follows]
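To make the broadcast pattern concrete, here is a minimal sketch in the mpiJava/MPJ style used later in this talk. The Bcast signature assumed here – Bcast(buffer, offset, count, datatype, root) – follows the mpiJava binding; treat the details as an assumption rather than a definitive MPJ Express listing.

import mpi.*;

class BcastDemo {
  static public void main(String[] args) {
    MPI.Init(args) ;
    int myrank = MPI.COMM_WORLD.Rank() ;

    int[] params = new int[4] ;
    if(myrank == 0) {
      // The root fills in the data to be shared with every process.
      params[0] = 42 ; params[1] = 7 ; params[2] = 1 ; params[3] = 99 ;
    }

    // Every process in the communicator makes the same call; afterwards
    // all of them hold the root's copy of params.
    MPI.COMM_WORLD.Bcast(params, 0, params.length, MPI.INT, 0) ;

    System.out.println("rank " + myrank + " got params[0] = " + params[0]) ;
    MPI.Finalize() ;
  }
}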
MPJ Background • Java Grande Message Passing Workgroup defined Java bindings for MPI in 1998. • Previous efforts follow two approaches: • Pure Java approaches: • Remote Method Invocation (RMI), • Java Sockets, • Java Native Interface (JNI) approaches: • mpiJava – Java wrapper to platform native MPI.
Minimal mpiJava Program

import mpi.*;

class Hello {
  static public void main(String[] args) {
    MPI.Init(args) ;
    int myrank = MPI.COMM_WORLD.Rank() ;
    if(myrank == 0) {
      char[] message = "Hello, there".toCharArray() ;
      MPI.COMM_WORLD.Send(message, 0, message.length, MPI.CHAR, 1, 99) ;
    } else {
      char[] message = new char [20] ;
      MPI.COMM_WORLD.Recv(message, 0, 20, MPI.CHAR, 0, 99) ;
      System.out.println("received:" + new String(message) + ":") ;
    }
    MPI.Finalize() ;
  }
}
MPJ Express Goals • Modular design: unify earlier approaches by pluggable communication devices. • Support high-level functionality of MPI-1.2 in pure Java code, as far as possible: • Point to point communications, • Collective communications, • Groups, communicators, and contexts, • Derived datatypes – buffering API. • Thread safety ― support Java threads for parallelism within multicore nodes. • Open source distribution: http://mpj-express.org
MPJ Demos • There were a few existing demos for MPJ, inherited from the earlier mpiJava project, e.g. • A 2d Fluid Flow example. • Some Monte Carlo simulations of 2d spin systems from condensed matter physics. • These have nice GUIs, but aren’t very representative of real codes used by computational scientists.
Gadget-2 • Gadget-2 is a free-software, production code for cosmological N-body (and hydrodynamic) computations. http://www.mpa-garching.mpg.de/gadget • Written by Volker Springel, of the Max Planck Institute for Astrophysics, Garching. • It is written in the C language – already parallelized using MPI. • Versions have been used in various research papers in the astrophysics literature, including the Millennium Simulation.
Millennium Simulation • See the paper Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution by Springel et al. on the Gadget home page. • Follows evolution of 10¹⁰ dark matter “particles” from the early Universe (z = 127) to the current day. • Performed on 512 nodes of an IBM p690. • Used 1 TB of distributed memory. • 350,000 CPU hours – 28 days elapsed time. • Floating point performance around 0.2 TFLOPS. • Around 20 TB of total data generated.
Dynamics in Gadget • Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity. • Plus some hydrodynamic forces, but these don’t affect dominant Dark Matter. • Co-moving coordinates take account of expansion of Universe according to General Relativity, but otherwise basically classical mechanics. • Classical N-body problem.
Gadget Main Loop • This is a slightly simplified view of the Gadget code:

  … Initialize …
  while (not done) {
    move_particles() ;               // update positions
    domain_Decomposition() ;
    compute_accelerations() ;
    advance_and_find_timesteps() ;   // update velocities
  }

• Most of the interesting work happens in compute_accelerations() (and domain_Decomposition() – see later).
Computing Forces • The function compute_accelerations() must compute the forces experienced by all particles. • In particular it must compute the gravitational force. • Because gravity is a long-range force, every particle affects every other. Naively, the total cost is O(N²) – see the direct-sum sketch after this slide. • With N ≈ 10¹⁰, this is quite infeasible. • Need some kind of approximation. • Intuitive approach: when computing the force on particle i, group together particles in selected regions distant from i, and treat each group as a single particle located at its centre of mass.
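As a point of reference, the naive force calculation mentioned above looks something like the following sketch: plain Newtonian gravity with a double loop over all particle pairs, which is what makes the cost O(N²). The flat pos/mass/acc arrays and the softening length eps are illustrative choices for the sketch, not Gadget's actual data layout.

// Illustrative O(N²) direct summation of gravitational accelerations.
// pos, mass and acc are hypothetical flat arrays (x, y, z per particle);
// eps is a softening length, included to avoid division by zero.
static void directSum(double[] pos, double[] mass, double[] acc,
                      int n, double G, double eps) {
  for (int i = 0; i < n; i++) {
    double ax = 0, ay = 0, az = 0;
    for (int j = 0; j < n; j++) {
      if (j == i) continue;
      double dx = pos[3*j]   - pos[3*i];
      double dy = pos[3*j+1] - pos[3*i+1];
      double dz = pos[3*j+2] - pos[3*i+2];
      double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
      double invR3 = 1.0 / (r2 * Math.sqrt(r2));
      ax += G * mass[j] * dx * invR3;
      ay += G * mass[j] * dy * invR3;
      az += G * mass[j] * dz * invR3;
    }
    acc[3*i] = ax; acc[3*i+1] = ay; acc[3*i+2] = az;
  }
}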
Barnes Hut Tree • First divide cubical region of 3d space into 2³ = 8 regions, halving each dimension. • For every sub-region that contains any particles, divide again recursively to make an oct-tree, until “leaf” regions have at most one particle.
Two Dimensional Example • Picture borrowed from www.cs.berkeley.edu/~demmel/cs267/lecture26/lecture26.html
Barnes Hut Force Computation • To compute the force on a particle i, traverse the tree starting from the root (sketched after this slide): • if a node n is “distant from” i, just add the contribution to the force on i from the centre of mass of n – no need to visit the children of n; • if a node n is “close to” i, visit the children of n and recurse. • Hinges on the definition of distant from/close to. • Basic idea is that a node representing some region of space is distant from a particle i if the angle it subtends is smaller than a threshold opening angle. [Diagram: node n subtends angle θ at particle i]
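A sequential sketch of this walk is below. TreeNode, its fields, and the opening test size/dist < θ are illustrative names and a simplified criterion, not the actual Gadget data structures or its exact cell-opening rule.

// Illustrative Barnes-Hut walk: accumulate the acceleration on a particle
// at ipos by opening any node whose subtended angle (roughly size/dist)
// exceeds the threshold theta.
class TreeNode {
  double   mass;                   // total mass of particles in this region
  double[] com = new double[3];    // centre of mass of the region
  double   size;                   // side length of the cubical region
  TreeNode[] children;             // up to 8 children; null for a leaf
}

static void treeForce(TreeNode n, double[] ipos, double[] acc,
                      double G, double theta) {
  if (n == null || n.mass == 0) return;
  double dx = n.com[0] - ipos[0];
  double dy = n.com[1] - ipos[1];
  double dz = n.com[2] - ipos[2];
  double dist = Math.sqrt(dx*dx + dy*dy + dz*dz);
  if (dist == 0) return;           // skip the particle's own contribution
  if (n.children == null || n.size / dist < theta) {
    // Leaf, or node is "distant": treat it as a single point mass.
    double invR3 = 1.0 / (dist * dist * dist);
    acc[0] += G * n.mass * dx * invR3;
    acc[1] += G * n.mass * dy * invR3;
    acc[2] += G * n.mass * dz * invR3;
  } else {
    // Node is "close": open it and recurse into the children.
    for (TreeNode child : n.children)
      treeForce(child, ipos, acc, G, theta);
  }
}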
Complexity • On average, number of nodes “opened” to compute force on i is O(log N), as opposed to visiting O(N) particles in naïve algorithm. • A huge win when N ≈ 10¹⁰.
Domain Decomposition • Need to divide space and/or the particle set into domains, where each domain is handled by a single processor. • Problem is that we can’t just divide space evenly, because some regions will have many more particles than others – poor load balancing. • Conversely, can’t just divide particles evenly, because particles move throughout space, and we want to keep physically close particles on the same processor as far as practical – communication problem.
Peano-Hilbert Curve • Warren and Salmon originally suggested using a space-filling curve: • Picture borrowed from http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf
Peano-Hilbert Key • Gadget applies the recursion 20 times, logically dividing space into up to 2²⁰ × 2²⁰ × 2²⁰ cells on the Peano-Hilbert curve. • Then can label each cell by its location along the Peano-Hilbert curve – 2⁶⁰ possible locations comfortably fit into a 64-bit word.
Decomposition based on P-H Curve • Because the number of cells >> the number of particles, segments of the linear Peano-Hilbert curve are sparsely populated. • Sort particles by their Peano-Hilbert key, then divide evenly into P domains (see the sketch after this slide). • Intuitively – stretch out the P-H curve with particles dotted along it; segment it into P parts where each part has the same number of particles. • Characteristics of this decomposition: • Good load balancing. • Domains are simply connected and quite “compact” in real space, because particles that are close along the P-H curve are close in real space (the converse is often but not always true). • Domains have a relatively simple mapping to BH oct-tree nodes.
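The essential step, once every particle has a key, is just a sort followed by an even split. The sketch below is sequential and uses illustrative names (key, domainOf); Gadget itself performs an approximate, distributed version of the same idea.

// Illustrative domain decomposition: sort particle indices by their
// Peano-Hilbert key, then cut the ordered list into P equal contiguous
// segments, one per processor.
static int[] decompose(long[] key, int P) {
  int n = key.length;
  Integer[] order = new Integer[n];
  for (int i = 0; i < n; i++) order[i] = i;

  // Sort particle indices by position along the Peano-Hilbert curve.
  java.util.Arrays.sort(order, (a, b) -> Long.compare(key[a], key[b]));

  // Assign equal-length segments of the sorted curve to the P processors.
  int[] domainOf = new int[n];
  for (int r = 0; r < n; r++) {
    int i = order[r];                        // particle at position r on the curve
    domainOf[i] = (int) ((long) r * P / n);  // processor owning that segment
  }
  return domainOf;
}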
Distributed Representation of Tree • Every processor holds a copy of the root node, and copies of all child nodes down to the point where all the particles of a node are held on a single remote processor. Such remotely held nodes are called pseudo-particles. • To compute the force on a single local target particle, traverse the tree from the root as usual, and accumulate force contributions from locally held particles. • Build an export list containing the target particle and the hosts of any pseudo-particles encountered in the walk.
Communication • After the computation for all locally held target particles, the export list is processed: each processor sends its list of local target particles to every host that owns pseudo-particle nodes needed for those particles (a sketch of this exchange follows below). • All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles. • These contributions are returned to the original processor, and added into the accelerations for the target particles.
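This exchange can be pictured as the pairwise pattern sketched below. The mpiJava-style Sendrecv call and the flattened double[] buffers are assumptions made for the sketch – the real code exchanges structured particle records and pairs the processors more carefully for performance.

// Illustrative pairwise exchange of force contributions between processors.
// exportTo[p] holds what this rank computed for particles owned by rank p;
// the returned array collects what every other rank computed for us.
static double[][] exchangeContributions(double[][] exportTo) throws Exception {
  int size = MPI.COMM_WORLD.Size() ;
  int rank = MPI.COMM_WORLD.Rank() ;
  double[][] importFrom = new double[size][] ;

  for (int p = 0; p < size; p++) {
    if (p == rank) { importFrom[p] = exportTo[p] ; continue ; }

    // First exchange the buffer lengths ...
    int[] sendCount = { exportTo[p].length } ;
    int[] recvCount = new int[1] ;
    MPI.COMM_WORLD.Sendrecv(sendCount, 0, 1, MPI.INT, p, 0,
                            recvCount, 0, 1, MPI.INT, p, 0) ;

    // ... then the contributions themselves.
    importFrom[p] = new double[recvCount[0]] ;
    MPI.COMM_WORLD.Sendrecv(exportTo[p], 0, sendCount[0], MPI.DOUBLE, p, 1,
                            importFrom[p], 0, recvCount[0], MPI.DOUBLE, p, 1) ;
  }
  return importFrom ;
}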
Other Communications • Notably: • The Domain Decomposition itself requires (in principle) a distributed sort of the particle list. • Gadget only approximates this sort, but it still involves fairly intricate communication. • In Gadget, all communication goes through the standard MPI library; it makes fairly extensive use of the collective communication routines.
Timeline • MPJ Express released 2005. • Realized we needed a realistic exemplar code: both as a demo, and to drive further improvements of our software. • First thoughts that an N-body simulation code would make sense – Summer 2005. • Learned about Gadget in December that year. • Effort to make a Java version of Gadget started seriously early February 2006 – expecting many months work. • Non-trivial simulations running in March.
Gadget Code Statistics • Public distribution is around 17,000 lines of C, distributed over about 30 source files. • Dependencies on: • MPI library for parallelization. • GNU Scientific Library (but only a handful of functions). • FFTW – library for parallel Fourier transforms. • Comes with a set of initial conditions files that we can use to test our code, including “colliding galaxies” and “cluster formation”.
Translation of Gadget Code • Manually translated, but in the first cut we deliberately kept the Java code as similar to the C as possible. • e.g. nearly all of Java Gadget is currently implemented as static methods on one enormous class called Main. • Small supporting classes correspond to the structs of the C code (see the sketch after this slide). • Currently not supporting a number of features, including: • PMTree algorithm (an optimization using Fourier Transforms on a mesh to compute the long-range part of the gravitational force), • Multiple Initial Conditions files, • COMPUTE_POTENTIAL_ENERGY, ISOTHERM_EQS, FLEXSTEPS, OUTPUTPOTENTIAL, FORCETEST, MAKEGLASS, PSEUDOSYMMETRIC, …
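For example, a C struct like Gadget's per-particle record maps onto a small Java class in the obvious way. The field list below is abbreviated and illustrative, not the full particle_data struct:

/* C original (abbreviated):
 *   struct particle_data {
 *     float Pos[3];
 *     float Vel[3];
 *     float Mass;
 *   };
 */

// Straightforward Java counterpart used in the first cut of the translation.
class ParticleData {
  float[] pos = new float[3] ;
  float[] vel = new float[3] ;
  float   mass ;
}

// The C array  struct particle_data *P  then becomes, for example:
//   ParticleData[] P = new ParticleData[numPart] ;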
Handling Dependencies • Replace MPI calls with MPJ calls (!) • Use of FFTW is avoided by disabling the PMTree algorithm (but this may slow down some simulations a lot). • The few GNU Scientific Library functions that were needed (numeric quadrature, etc.) were hand-translated to Java – a sketch of the kind of routine involved follows.
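As an indication of the scale of that effort, a hand-translated quadrature routine might look like the sketch below (composite Simpson's rule). This is not the GSL algorithm Gadget actually calls – the GSL routines are adaptive and more sophisticated – just the kind of small, self-contained numerical routine involved.

// Illustrative hand-translated quadrature: composite Simpson's rule.
interface Integrand { double eval(double x) ; }

static double simpson(Integrand f, double a, double b, int n) {
  if (n % 2 != 0) n++ ;                  // Simpson's rule needs an even count
  double h = (b - a) / n ;
  double sum = f.eval(a) + f.eval(b) ;
  for (int i = 1; i < n; i++)
    sum += f.eval(a + i * h) * (i % 2 == 0 ? 2.0 : 4.0) ;
  return sum * h / 3.0 ;
}

// e.g.  double r = simpson(x -> Math.exp(-x * x), 0.0, 1.0, 100) ;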
Test Cases • Restrictions aside, we successfully ran the Colliding Galaxies and Cluster Formation example simulations on varying numbers of processors. • These use pure Dark Matter – the hydrodynamics code is not yet tested. • Results are indistinguishable from running the original Gadget.
Optimizing Java Gadget • The initial Java version of Gadget was about 3 times slower than C. • Optimizations: • Replace communication of Object arrays (corresponding to C structs) with communication of their component primitive elements (floats, ints, etc). • This still wasn’t enough – the final version actually expunges large Object arrays from the program, flattening all struct arrays into arrays of primitive types (see the next slide). • Extended the MPJ Express API to manipulate Java ByteBuffers directly.
Flattening Object Arrays • To preserve memory locality, we decided to “flatten” large arrays of small objects (ParticleData) – replace an array of objects with an array of doubles containing the objects’ component fields, for example (see the sketch after this slide). • This makes the code less readable and maintainable than the original C code – clearly not acceptable. But we wanted speed. • Need a much better approach here. One possibility is some kind of macro pre-processing (it would be better if the Java language had structs…).
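Concretely, the flattening replaces the array of small objects with parallel primitive arrays (or one interleaved array). The names below are illustrative, not the actual Java Gadget field names:

// Before: an array of small objects – flexible, but each element is a
// separate heap object, so there is no guarantee of memory locality.
//   ParticleData[] P = new ParticleData[numPart] ;
//   double x = P[i].pos[0] ;

// After: the fields are flattened into primitive arrays, so e.g. all the
// positions sit contiguously in one double[].
static double[] pos ;    // 3 * numPart entries: x, y, z for each particle
static double[] vel ;    // 3 * numPart entries
static double[] mass ;   // numPart entries

static double posX(int i) { return pos[3 * i] ; }      // was P[i].pos[0]
static double posY(int i) { return pos[3 * i + 1] ; }  // was P[i].pos[1]
static double posZ(int i) { return pos[3 * i + 2] ; }  // was P[i].pos[2]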
Benchmarks • Graphs taken from Aamir Shafi’s thesis – you will find many more results there. • Some benchmarks are on up to 8 processors of Starbug cluster in Portsmouth Distributed Systems Group. • Other benchmarks are on up to 32 processors from NW-GRID cluster at Daresbury Laboratory.
Conclusions • It is possible to do large-scale numerical simulations in Java. • Speed is OK. • Memory consumption is still an issue (Java Gadget needs perhaps three times more memory than C Gadget). • Made various improvements to the MPJ Express implementation and API along the way. • As things stand, the Java Gadget code is not very maintainable – it is C-like, and then further obscured by optimization. • Needs more work to learn how to write maintainable Java code that is also fast – e.g. how to preserve memory locality in arrays of objects.