MPJ Express: An Implementation of Message Passing Interface (MPI) in Java

Aamir Shafi http://mpj-express.org http://acet.rdg.ac.uk/projects/mpj

Writing Parallel Software • Parallel software can be executed on parallel hardware to exploit computational and memory resources • There are two main approaches to writing parallel software: • The first approach is to use messaging libraries (packages) written in existing languages like C, Fortran, and Java: • Message Passing Interface (MPI) • Parallel Virtual Machine (PVM) • The second and more radical approach is to provide new languages: • HPC has a history of novel parallel languages • High Performance Fortran (HPF) • Unified Parallel C (UPC) • This talk covers an implementation of MPI in Java called MPJ Express

Introduction to Java for HPC • Java was released by Sun in 1996: • A mainstream language in the software industry, • Attractive features include: • Portability, • Automatic garbage collection, • Type-safety at compile time and runtime, • Built-in support for multi-threading: • A possible option to provide nested parallelism on multi-core systems, • Performance: • The Java compiler converts source code to byte code, • Modern JVMs use Just-In-Time compilers to translate byte code to native machine code on the fly • But Java has safety features that may limit performance.

Introduction to Java for HPC • Three existing approaches to Java messaging: • Pure Java (Sockets based), • Java Native Interface (JNI), and • Remote Method Invocation (RMI), • mpiJava has been perhaps the most popular Java messaging system • mpiJava (http://www.hpjava.org/mpiJava.html) • MPJ/Ibis (http://www.cs.vu.nl/ibis/mpj.html) • Motivation for a new Java messaging system: • Maintain compatibility with Java threads by providing thread-safety, • Reconcile the conflicting goals of high performance and portability.

Figure: Distributed memory cluster. Eight processes (Proc 0 to Proc 7), each with its own CPU and memory, exchange messages over a LAN (Ethernet, Myrinet, Infiniband, etc.)

Write machines files

Bootstrap MPJ Express runtime

Write Parallel Program

Compile and Execute

Introduction to MPJ Express • MPJ Express is an implementation of a Java messaging system, based on Java bindings: • Will eventually supersede mpiJava. • Aamir Shafi, Bryan Carpenter, and Mark Baker • Thread-safe communication devices using Java NIO and Myrinet: • Maintain compatibility with Java threads, • The buffering layer provides explicit memory management instead of relying on the garbage collector, • Runtime system for portable bootstrapping

James Gosling Says…

Who is using MPJ Express? • First released in September 2005 under LGPL (an open-source licence): • Approximately 1000 users all around the world • Some projects using this software: • Cartablanca is a simulation package that uses Jacobian-Free-Newton-Krylov (JFNK) methods to solve non-linear problems • The project is done at Los Alamos National Lab (LANL) in the US • Researchers at University of Leeds, UK have used this software in Modelling and Simulation in e-Social Science (MoSeS) project • Teaching Purposes: • Parallel Programming using Java (PPJ): • http://www.sc.rwth-aachen.de/Teaching/Labs/PPJ05/ • Parallel Processing SS 2006: • http://tramberend.inform.fh-hannover.de/

MPJ Express Design

Presentation Outline • Implementation Details: • Point-to-point communication • Communicators, groups, and contexts • Process topologies • Derived datatypes • Collective communications • MPJ Express Buffering Layer • Runtime System • Performance Evaluation

Java NIO Device • Uses non-blocking I/O functionality, • Implements two communication protocols: • Eager-send protocol for small messages, • Rendezvous protocol for large messages, • Locks around communication methods result in deadlocks: • In Java, the keyword synchronized ensures that only one thread at a time can execute a synchronized method on a given object, • Example: a process sending a message to itself using synchronous send, • Locks for thread-safety: • Writing messages: • A lock for send-communication-sets, • Locks for destination channels: • One for every destination process, • Obtained one after the other, • Reading messages: • A lock for receive-communication-sets.
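The two ideas on this slide, choosing a protocol by message size and holding one lock per destination channel, can be sketched in plain Java. This is only an illustration: the class, the method names, and the 128 KB switch-over point are assumptions, not values or APIs taken from MPJ Express.

```java
/** Sketch of protocol selection and per-destination send locks (all names are hypothetical). */
final class SendPathSketch {
    enum Protocol { EAGER, RENDEZVOUS }

    // Assumed switch-over point; the slides do not state the real threshold.
    static final int EAGER_LIMIT_BYTES = 128 * 1024;

    // One lock object per destination process, so sends to different
    // destinations do not block each other.
    private final Object[] destinationLocks;

    SendPathSketch(int processes) {
        destinationLocks = new Object[processes];
        for (int i = 0; i < processes; i++) {
            destinationLocks[i] = new Object();
        }
    }

    /** Small messages go eagerly, large ones wait for the receiver (rendezvous). */
    static Protocol choose(int messageBytes) {
        return messageBytes <= EAGER_LIMIT_BYTES ? Protocol.EAGER : Protocol.RENDEZVOUS;
    }

    /** Runs the channel write while holding only the lock of the target destination. */
    Protocol send(int destination, int messageBytes, Runnable writeToChannel) {
        synchronized (destinationLocks[destination]) {
            writeToChannel.run();
        }
        return choose(messageBytes);
    }
}
```

Locking per destination rather than around the whole send method means a thread writing to one process never holds a lock needed by a thread writing to another.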

Standard mode with eager send protocol (small messages)

Standard mode with rendezvous protocol (large messages)

MPJ Express Buffering Layer • MPJ Express requires a buffering layer: • To use Java NIO: • SocketChannels use byte buffers for data transfer, • To use proprietary networks like Myrinet efficiently, • Implement derived datatypes, • Various implementations are possible based on actual storage medium, • Direct or indirect ByteBuffers, • An mpjbuf buffer object consists of: • A static buffer to store primitive datatypes, • A dynamic buffer to store serialized Java objects, • Creating ByteBuffers on the fly is costly: • Memory management is based on Knuth’s buddy algorithm, • Two implementations of memory management.
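As a rough illustration of a buffer with a static part for primitives and a dynamic part for serialized objects, the sketch below uses a direct ByteBuffer and standard Java serialization. The class and method names are hypothetical and do not reproduce the mpjbuf API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Sketch of a two-part message buffer (hypothetical names, not the mpjbuf API). */
final class TwoPartBufferSketch {
    // Static section: fixed-capacity direct ByteBuffer for primitive data.
    final ByteBuffer staticSection;
    // Dynamic section: grows as serialized objects are appended.
    private final ByteArrayOutputStream dynamicSection = new ByteArrayOutputStream();

    TwoPartBufferSketch(int capacityBytes) {
        staticSection = ByteBuffer.allocateDirect(capacityBytes);
    }

    /** Copies count doubles, starting at offset, into the static section. */
    void writeDoubles(double[] src, int offset, int count) {
        for (int i = 0; i < count; i++) {
            staticSection.putDouble(src[offset + i]);
        }
    }

    /** Serializes an object into the dynamic section. */
    void writeObject(Serializable obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(dynamicSection)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int staticBytes() { return staticSection.position(); }

    int dynamicBytes() { return dynamicSection.size(); }
}
```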

MPJ Express Buffering Layer • Frequent creation and destruction of communication buffers hurts performance. • To tackle this, MPJ Express requires a buffering layer: • Provides two implementations of Knuth’s buddy algorithm, • To use Java NIO and proprietary networks: • Direct ByteBuffers, • Implement derived datatypes
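The buddy algorithm mentioned above can be sketched as follows: block sizes are powers of two, a large free block is split in halves until it fits the request, and a freed block is merged with its "buddy" (found by flipping one bit of the offset) whenever that buddy is also free. This is a minimal textbook-style sketch with hypothetical names, not either of the two MPJ Express implementations.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Minimal buddy allocator over a pool of 2^maxOrder bytes (illustrative only). */
final class BuddyAllocatorSketch {
    private final int maxOrder;
    // freeLists.get(k) holds offsets of free blocks of size 2^k.
    private final List<Deque<Integer>> freeLists = new ArrayList<>();

    BuddyAllocatorSketch(int maxOrder) {
        this.maxOrder = maxOrder;
        for (int k = 0; k <= maxOrder; k++) {
            freeLists.add(new ArrayDeque<>());
        }
        freeLists.get(maxOrder).push(0); // the whole pool is one free block
    }

    /** Smallest order k such that 2^k >= bytes (intended for small positive sizes). */
    static int orderFor(int bytes) {
        int order = 0;
        while ((1 << order) < bytes) {
            order++;
        }
        return order;
    }

    /** Returns the offset of a block of 2^order bytes, or -1 if none is available. */
    int allocate(int order) {
        int k = order;
        while (k <= maxOrder && freeLists.get(k).isEmpty()) {
            k++;
        }
        if (k > maxOrder) {
            return -1;
        }
        int offset = freeLists.get(k).pop();
        while (k > order) {
            k--;
            freeLists.get(k).push(offset + (1 << k)); // upper half stays free
        }
        return offset;
    }

    /** Frees a block and merges it with its buddy for as long as the buddy is free. */
    void free(int offset, int order) {
        int k = order;
        while (k < maxOrder) {
            int buddy = offset ^ (1 << k);
            if (!freeLists.get(k).remove(Integer.valueOf(buddy))) {
                break; // buddy is in use, stop merging
            }
            offset = Math.min(offset, buddy);
            k++;
        }
        freeLists.get(k).push(offset);
    }

    int freeBlocks(int order) { return freeLists.get(order).size(); }
}
```

Reusing blocks from such a pool avoids creating a new ByteBuffer for every message, which is the cost the slide refers to.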

Presentation Outline • Implementation Details: • Point-to-point communication • Communicators, groups, and contexts • Process topologies • Derived datatypes • Collective communications • MPJ Express Buffering Layer • Runtime System • Performance Evaluation

Communicators, groups, and contexts • MPI provides a higher level abstraction to create parallel libraries: • Safe communication space • Group scope for collective operations • Process Naming • Communicators + Groups provide: • Process Naming (instead of IP address + ports) • Group scope for collective operations • Contexts: • Safe communication

What is a group? • A data-structure that contains processes • Main functionality: • Keep track of ranks of processes • Explanation of figure • Group A contains eight processes • Group B and C are created from Group A • All group operations are local (no communication with remote processes)

Example of a group operation (Union) • Explanation of union operation • Two processes a and d are in both groups: • Thus, six processes are executing this operation • Each group has its own view of this group operation: • Apply theory of relativity • Re-assigning ranks in new groups: • Process 0 in group A is re-assigned rank 0 in Group C • Process 0 in group B is re-assigned rank 4 in Group C • If a calling process does not make it into the new group, the operation returns MPI.GROUP_EMPTY on that process
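The rank re-assignment in a union can be sketched with ordinary lists: members of the first group keep their order, followed by members of the second group that are not already present. The class name and the use of strings as process names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of rank assignment in a group union; a rank is the position in the list. */
final class GroupUnionSketch {
    /** All members of first in order, then members of second not already included. */
    static List<String> union(List<String> first, List<String> second) {
        List<String> result = new ArrayList<>(first);
        for (String process : second) {
            if (!result.contains(process)) {
                result.add(process);
            }
        }
        return result;
    }

    /** Rank of a process in a group, or -1 if the process is not a member. */
    static int rankOf(List<String> group, String process) {
        return group.indexOf(process);
    }
}
```

With a hypothetical group A = [a, b, c, d] and group B = [e, a, d, f], the union C is [a, b, c, d, e, f]: rank 0 of A stays rank 0 in C, and rank 0 of B (process e) becomes rank 4, matching the numbers on the slide.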

What are communicators? • A data-structure that contains groups (and thus processes) • Why is it useful: • Process naming, ranks are names for application programmers • Easier than IP address + ports • Group communications as well as point-to-point communication • There are two types of communicators: • Intracommunicators: • Communication within a group • Intercommunicators: • Communication between two groups (must be disjoint)

What are contexts? • A unique integer: • An additional tag on the messages • Each communicator has a distinct context that provides a safe communication universe: • A context is agreed upon by all processes when a communicator is built • Intracommunicators have two contexts: • One for point-to-point communications • One for collective communications, • Intercommunicators have two contexts: • Explained in the coming slides
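A small sketch of how a context acts as an extra tag during message matching: a receive only matches a message whose context, source, and tag all agree, so collective traffic never satisfies a point-to-point receive even when source and tag coincide. The classes and the way context ids are numbered here are hypothetical.

```java
/** Sketch of context-based message matching (hypothetical types and numbering). */
final class ContextMatchSketch {
    /** The matching fields carried by a message or a posted receive. */
    static final class Envelope {
        final int context;
        final int source;
        final int tag;

        Envelope(int context, int source, int tag) {
            this.context = context;
            this.source = source;
            this.tag = tag;
        }
    }

    /** An intracommunicator holds two contexts, as described on the slide. */
    static final class Intracomm {
        final int pointToPointContext;
        final int collectiveContext;

        Intracomm(int baseContext) {
            // Assumption: two consecutive integers agreed on when the communicator is built.
            this.pointToPointContext = baseContext;
            this.collectiveContext = baseContext + 1;
        }
    }

    /** A message matches a posted receive only if context, source, and tag are all equal. */
    static boolean matches(Envelope posted, Envelope incoming) {
        return posted.context == incoming.context
                && posted.source == incoming.source
                && posted.tag == incoming.tag;
    }
}
```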

Process topologies • Used to arrange processes in a geometric shape • Virtual topologies have no necessary connection with the physical layout of machines: • It is still possible to make use of the underlying machine architecture • These virtual topologies can be assigned to processes in an Intracommunicator • MPI provides: • Cartesian topology • Graph topology

Cartesian topology: Mapping four processes onto 2x2 topology • Each process is assigned a coordinate: • Rank 0: (0,0) • Rank 1: (1,0) • Rank 2: (0,1) • Rank 3: (1,1) • Uses: • Calculate rank by knowing grid (not globus one!) position • Calculate grid positions from ranks • Easier to locate rank of neighbours • Applications may have communication patterns: • Lots of messaging with immediate neighbours
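The two conversions listed under "Uses" can be written as a pair of small functions. The sketch follows the coordinate listing on this slide, where rank 1 maps to (1,0), so the first coordinate varies fastest; it is not taken from any library, and the names are hypothetical.

```java
/** Sketch of rank/coordinate conversion for a 2D grid, following the slide's listing. */
final class CartesianSketch {
    /** Rank of the process at (x, y) on a grid that is width columns wide. */
    static int rankOf(int x, int y, int width) {
        return y * width + x;
    }

    /** Coordinates {x, y} of the given rank on a grid that is width columns wide. */
    static int[] coordsOf(int rank, int width) {
        return new int[] { rank % width, rank / width };
    }
}
```

On the 2x2 grid, `coordsOf(2, 2)` gives (0,1) and `rankOf(1, 1, 2)` gives 3, as on the slide.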

Periods in Cartesian topology • Axis 1 (y-axis) is periodic: • Processes in top and bottom rows have valid neighbours towards top and bottom respectively • Axis 0 (x-axis) is non-periodic: • Processes in right and left columns have undefined neighbours towards right and left respectively
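The difference between a periodic and a non-periodic axis shows up when a neighbour lookup steps past the edge of the grid. The sketch below is a hypothetical helper, with an assumed UNDEFINED marker standing in for "no neighbour".

```java
/** Sketch of neighbour lookup along one axis of a Cartesian grid (hypothetical helper). */
final class PeriodicShiftSketch {
    // Assumed marker for "no neighbour in this direction".
    static final int UNDEFINED = -1;

    /**
     * Coordinate reached by moving disp steps from coord along an axis of the
     * given size. A periodic axis wraps around; a non-periodic axis returns
     * UNDEFINED once the move leaves the grid.
     */
    static int shift(int coord, int disp, int size, boolean periodic) {
        int target = coord + disp;
        if (periodic) {
            return Math.floorMod(target, size);
        }
        return (target < 0 || target >= size) ? UNDEFINED : target;
    }
}
```

On a 2x2 grid with a periodic y-axis, stepping up from the top row wraps to the bottom row, while stepping left from the left column on the non-periodic x-axis yields no neighbour.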

Derived datatypes • Besides basic datatypes, it is possible to communicate heterogeneous, non-contiguous data. • Contiguous • Indexed • Vector • Struct

Indexed datatype • The elements that form this datatype should be: • Of the same type • At non-contiguous locations • Displacements add flexibility
int DIM = 4;
int[] blklen = new int[DIM], displ = new int[DIM];
for (int i = 0; i < DIM; i++) { blklen[i] = DIM - i; displ[i] = (i * DIM) + i; }
double[] params = new double[DIM * DIM];
double[] rparams = new double[DIM * DIM];
// Arguments: array_of_block_lengths, array_of_displacements, base type (matches the double[] buffers)
Datatype indexed = Datatype.Indexed(blklen, displ, MPI.DOUBLE);
indexed.Commit();
MPI.COMM_WORLD.Send(params, 0, 1, indexed, dst, tag); // 0 is offset, 1 is count
MPI.COMM_WORLD.Recv(rparams, 0, 1, indexed, src, tag);
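To see which elements the block lengths and displacements in this example select, the sketch below gathers them from a flat array in plain Java. It only mimics the selection; it is a hypothetical helper and does not use or reproduce the messaging library.

```java
/** Sketch of the element selection described by block lengths and displacements. */
final class IndexedPackSketch {
    /** Copies each block (blockLengths[b] elements starting at displacements[b]) into one contiguous array. */
    static double[] pack(double[] src, int[] blockLengths, int[] displacements) {
        int total = 0;
        for (int length : blockLengths) {
            total += length;
        }
        double[] packed = new double[total];
        int position = 0;
        for (int b = 0; b < blockLengths.length; b++) {
            System.arraycopy(src, displacements[b], packed, position, blockLengths[b]);
            position += blockLengths[b];
        }
        return packed;
    }
}
```

For a 4x4 matrix stored row by row, block lengths 4, 3, 2, 1 at displacements 0, 5, 10, 15 pick out the upper triangle: each row from its diagonal element to the end, ten elements in total.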

Presentation Outline • Implementation Details: • Point-to-point communication • Communicators, groups, and contexts • Process topologies • Derived datatypes • Collective communications • Runtime System • Thread-safety in MPJ Express • Performance Evaluation

Collective communications • Provided as a convenience for application developers: • Save significant development time • Efficient algorithms may be used • Stable (tested) • Built on top of point-to-point communications, • These operations include: • Broadcast, Barrier, Reduce, Allreduce, Alltoall, Scatter, Gather, Allgather, Scan • Versions that allow displacements between the data

Broadcast, scatter, gather, allgather, alltoall Image from MPI standard doc

Reduce collective operations • MPI.PROD • MPI.SUM • MPI.MIN • MPI.MAX • MPI.LAND • MPI.BAND • MPI.LOR • MPI.BOR • MPI.LXOR • MPI.BXOR • MPI.MINLOC • MPI.MAXLOC
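What a reduction computes can be sketched by combining one contribution per rank with a binary operator. The example below runs in a single process with an array standing in for the per-rank values; the class and method names are hypothetical, and the MINLOC variant returns the lowest rank when several ranks hold the minimum.

```java
import java.util.function.IntBinaryOperator;

/** Sketch of reduction semantics over one int contribution per rank (single-process illustration). */
final class ReduceSketch {
    /** Combines contributions in rank order with the given operator (requires at least one value). */
    static int reduce(int[] contributions, IntBinaryOperator op) {
        int result = contributions[0];
        for (int rank = 1; rank < contributions.length; rank++) {
            result = op.applyAsInt(result, contributions[rank]);
        }
        return result;
    }

    /** MINLOC-style reduction: returns {minimum value, lowest rank holding it}. */
    static int[] minLoc(int[] contributions) {
        int best = 0;
        for (int rank = 1; rank < contributions.length; rank++) {
            if (contributions[rank] < contributions[best]) {
                best = rank;
            }
        }
        return new int[] { contributions[best], best };
    }
}
```

Here `Integer::sum` plays the role of MPI.SUM, `Math::max` of MPI.MAX, and `(a, b) -> a & b` of MPI.BAND.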

Barrier with Tree Algorithm

Execution of barrier with eight processes • Eight processes, thus forming only one group • Each process exchanges an integer 4 times • Overlaps communications well

Intracomm.Bcast( … ) • Sends data from a process to all the other processes • Code from adlib: • A communication library for HPJava • The current implementation is based on n-ary tree: • Limitation: broadcasts only from rank=0 • Generated dynamically • Cost: O( log2(N) ) • MPICH1.2.5 uses linear algorithm: • Cost O(N) • MPICH2 has much improved algorithms • LAM/MPI uses n-ary trees: • Limitation, broadcast from rank=0
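The logarithmic cost of a tree broadcast compared with a linear one can be illustrated with a small simulation. The sketch uses a binomial tree rooted at rank 0, which is an assumption chosen for simplicity; it is not the adlib code or the exact tree shape used by MPJ Express.

```java
/** Sketch comparing a binomial-tree broadcast from rank 0 with a linear broadcast. */
final class TreeBroadcastSketch {
    /**
     * Simulates the broadcast and returns the number of communication rounds.
     * In the round with distance step, every rank below step sends to rank + step.
     */
    static int broadcastRounds(int processes) {
        boolean[] hasData = new boolean[processes];
        hasData[0] = true; // the root starts with the data
        int rounds = 0;
        for (int step = 1; step < processes; step <<= 1) {
            for (int rank = 0; rank < step && rank + step < processes; rank++) {
                hasData[rank + step] = hasData[rank];
            }
            rounds++;
        }
        for (boolean received : hasData) {
            if (!received) {
                throw new IllegalStateException("a rank missed the broadcast");
            }
        }
        return rounds;
    }

    /** A linear broadcast needs one send from the root per remaining process. */
    static int linearSends(int processes) {
        return processes - 1;
    }
}
```

For eight processes the tree finishes in three rounds, while the root of a linear broadcast performs seven sends one after another.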

Broadcasting algorithm, total processes=8, root=0

The Runtime System

Thread-safety in MPI • The MPI 2.0 specification introduced the notion of a thread-compliant MPI implementation, • Four levels of thread-safety: • MPI_THREAD_SINGLE, • MPI_THREAD_FUNNELED, • MPI_THREAD_SERIALIZED, • MPI_THREAD_MULTIPLE, • A blocked thread should not halt the execution of other threads, • “Issues in Developing Thread-Safe MPI Implementation” by Gropp et al.

Latency on Fast Ethernet

Throughput on Fast Ethernet

Latency on Gigabit Ethernet

Throughput on GigE

Choking experience 1

Latency on Myrinet
