MPJ: The second generation 'MPI for Java'
Aamir Shafi
26th April, 2005
Distributed Systems Group
http://dsg.port.ac.uk
People
• Aamir Shafi
• Bryan Carpenter: Open Middleware Infrastructure Institute (OMII)
• Mark Baker
Presentation outline
• Introduction
• Design and implementation of MPJ
• The runtime infrastructure
• Implementation issues
• Conclusion
Introduction
• MPI was introduced in June 1994 as a standard message passing API for parallel scientific computing:
  • Language bindings for C, C++, and Fortran
• The 'Java Grande Message Passing Workgroup' defined Java bindings in 1998
• Previous efforts follow two approaches:
  • JNI approach
  • Pure Java approach:
    • Remote Method Invocation (RMI)
    • Sockets
Introduction: Pure Java approach
• RMI:
  • Meant for client-server applications
• Java Sockets
• Java New I/O (NIO) package (sketched below):
  • Adds non-blocking I/O to the Java language
  • Direct buffers:
    • Allocated in native OS memory, so the JVM can attempt faster I/O
• Communication performance:
  • Comparison of Java NIO and C NetPIPE drivers
  • Java performs similarly to C on Fast Ethernet
  • Admittedly a very naïve comparison
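To make the NIO discussion concrete, here is a minimal sketch of the features niodev builds on: a non-blocking SocketChannel, a Selector, and a direct ByteBuffer allocated in native memory. The host name and port are placeholders, not values from the slides.

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;

    public class NioSketch {
        public static void main(String[] args) throws Exception {
            // Direct buffer: allocated in native OS memory, outside the Java heap
            ByteBuffer buf = ByteBuffer.allocateDirect(8 * 1024);

            // Non-blocking channel registered with a selector
            SocketChannel channel = SocketChannel.open();
            channel.configureBlocking(false);
            channel.connect(new InetSocketAddress("headnode.example.org", 10000));

            Selector selector = Selector.open();
            channel.register(selector, SelectionKey.OP_CONNECT);

            while (selector.select() > 0) {
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isConnectable() && channel.finishConnect()) {
                        key.interestOps(SelectionKey.OP_READ);  // now wait for data
                    } else if (key.isReadable()) {
                        buf.clear();
                        channel.read(buf);  // returns immediately, possibly with 0 bytes
                    }
                }
                selector.selectedKeys().clear();
            }
        }
    }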
• The latency is ~250 microseconds
• After 1 KB, the latency starts increasing due to fragmentation of packets
• NetPIPE is a simple, single-threaded benchmark
• The maximum throughput is ~90 Mbps
• It would be great if MPJ, with all its complexity, could reach ~80 Mbps
Introduction: JNI approach
• The importance of JNI cannot be ignored:
  • Where pure Java falls short, JNI makes things work
• HPC communication hardware continues to advance:
  • Network latency has been reduced to a couple of microseconds
• 'Pure Java' alone looks like an impractical solution:
  • With Myrinet available, no application developer/user would opt for Fast Ethernet
• Cons:
  • Breaks the Java philosophy of 'write once, run anywhere'
Introduction
• For Java messaging:
  • There is no 'one size fits all' approach
• Portability and high performance are often contradictory requirements:
  • Portability: pure Java
  • High performance: JNI
• The choice between portability and high performance is best left to application developers
• The challenging issue is how to manage these contradictory requirements:
  • How do we provide a flexible mechanism that lets applications swap communication protocols?
Presentation outline
• Introduction
• Design and implementation
• The runtime infrastructure
• Implementation issues
• Conclusion
Design
• Aims:
  • Support swapping various communication devices
• Two device levels:
  • The MPJ device level (mpjdev):
    • Separates the native MPI device from all other devices
    • The 'native MPI' device is a special case
    • Makes it possible to cut through and use a native implementation of advanced MPI features
  • The xdev device level (xdev), illustrated in the sketch below:
    • 'gmdev': an xdev based on the GM 2.x comms library
    • 'niodev': an xdev based on the Java NIO API
    • 'smpdev': an xdev based on the Java Threads API
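The slides do not show the xdev API itself, so the following interface is purely illustrative: an assumption about the kind of contract that lets 'niodev', 'gmdev', and 'smpdev' be swapped underneath mpjdev. The method names are mine, not MPJ's.

    import java.nio.ByteBuffer;

    // Hypothetical device contract: method names are assumptions, not the real xdev API.
    public interface Device {
        void init(String[] args);                      // bootstrap this process
        void send(ByteBuffer buf, int dest, int tag);  // blocking send to a peer rank
        void recv(ByteBuffer buf, int src, int tag);   // blocking receive from a peer rank
        void finish();                                 // shut the device down
    }

With such a contract, mpjdev only ever talks to a Device reference, and the concrete class is chosen at startup (for example via the '-xdev niodev' switch shown later).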
Implementation
• Point-to-point communications (see the example below)
• Collective communications
• Groups, communicators, and contexts
• Derived datatypes:
  • Vector, Indexed, Contiguous, and Struct
  • Explicit packing and unpacking
• Process topologies:
  • Cartesian
  • Graph
• Possible to cut through to the native MPI implementation
• As of today, three methods (Dims_create, Cancel, and Wtick) are left unimplemented
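Since MPJ follows the mpiJava API (see the Conclusions slide), a point-to-point round trip would look roughly like the sketch below. This is an assumption based on the mpiJava 1.2 bindings, not code taken from the slides.

    import mpi.*;

    public class RoundTrip {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int[] buf = new int[1];

            if (rank == 0) {
                buf[0] = 42;
                MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 1, 99);  // to rank 1, tag 99
                MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 1, 99);  // echoed back
                System.out.println("rank 0 got back " + buf[0]);
            } else if (rank == 1) {
                MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, 99);
                MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 0, 99);
            }

            MPI.Finalize();
        }
    }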
Presentation outline
• Introduction
• Design and implementation
• The runtime infrastructure
• Implementation issues
• Conclusion
The runtime infrastructure
• All MPI libraries face the task of bootstrapping MPI processes across networked computers:
  • RSH/SSH-based scripts are the most common approach
• The LAM/MPI daemons and runtime system work on UNIX-based OSes:
  • There is no version of LAM for Windows
• MPICH has recently introduced SMPD (Super Multi-Purpose Daemon):
  • According to the docs, it works on Linux and Windows
  • Difficult (if not impossible) to interface with Java
Runtime: MPJDaemon and MPJStarter modules
• The runtime consists of two modules:
  • The daemon that runs on compute nodes (MPJDaemon)
  • The starter module that runs on head nodes (MPJStarter)
• Installing MPJDaemon on compute nodes:
  • RSH/SSH-based scripts can easily install the daemon on UNIX-based OSes:
    • It can also be installed as a service (/etc/init.d)
  • Two files are required to install it as a service on Windows
Runtime: MPJDaemon on UNIX-based OSes
• $MPJ_HOME/bin/mpjdaemon is an rc script that starts and stops the daemon
• Installation as an app:
  • 'cd $MPJ_HOME/bin'
  • './mpjdaemon start'
  • An RSH/SSH script can install it across a whole UNIX cluster
• Installation as a service:
  • 'cp $MPJ_HOME/bin/mpjdaemon /etc/init.d'
  • Add it to the default runlevel:
    • 'rc-update add mpjdaemon default' (Gentoo Linux)
  • '/etc/init.d/mpjdaemon start/stop/status'
Runtime: MPJDaemon on Windows
• 'cd %MPJ_HOME%/bin'
• 'InstallMPJDaemon-NT.bat'
• This batch file installs the daemon as a service
Runtime: MPJDaemon as a service
• Apache Commons Daemon:
  • The source bundle does not even compile
  • The project is no longer active
  • Spent a week trying to make it work on Windows: gave up!
• Java Service Wrapper:
  • Simple and does what it says
  • Available for almost all platforms (anywhere you can run Java)
  • Distributed under the MIT License:
    • Can be redistributed without any restrictions
Runtime: JMX monitoring and management
• JMX offers monitoring and management of Java apps:
  • Start the Java app with the following switch:
    • -Dcom.sun.management.jmxremote
  • Run 'jconsole':
    • It can connect to both remote and local JVMs
• Useful if the application is an MBean:
  • Application attributes can be read and set remotely
• Possibility:
  • MPJDaemon could be operated remotely (see the sketch below)
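As a sketch of the 'MPJDaemon as an MBean' possibility: the interface and its attributes below are hypothetical (the slides do not list the daemon's management attributes), but the registration calls are the standard JMX API.

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Hypothetical management interface; the standard-MBean convention pairs
    // an interface named MPJDaemonMBean with a class named MPJDaemon.
    interface MPJDaemonMBean {
        int getActiveProcesses();   // made-up attribute for illustration
        void shutdown();
    }

    public class MPJDaemon implements MPJDaemonMBean {
        private volatile int activeProcesses = 0;

        public int getActiveProcesses() { return activeProcesses; }
        public void shutdown() { /* stop child processes, release ports, ... */ }

        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            server.registerMBean(new MPJDaemon(), new ObjectName("mpj:type=MPJDaemon"));
            // ... daemon event loop; the attribute is now visible in jconsole
            Thread.sleep(Long.MAX_VALUE);
        }
    }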
Runtime: Dynamic class loading (1)
• The application (parallel program) and the MPJ library are dynamically loaded into the daemon's JVM:
  • No need to copy jar files
  • No shared file system assumption
• MPJStarter starts a lightweight HTTP server (Jetty), which serves the jar file containing the parallel program (see the sketch below)
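The mechanism rests on the fact that a URLClassLoader can load classes over HTTP. A minimal sketch, with a made-up host and port (the slides do not give Jetty's address):

    import java.net.URL;
    import java.net.URLClassLoader;

    // The daemon pulls the jar served by MPJStarter's embedded Jetty,
    // so no shared file system is needed. Host, port, and names are made up.
    public class HttpLoadSketch {
        public static void main(String[] args) throws Exception {
            URL[] urls = { new URL("http://headnode:20000/himpj.jar") };
            ClassLoader loader = new URLClassLoader(urls);
            Class<?> app = loader.loadClass("HiMPJ");
            app.getMethod("main", String[].class)
               .invoke(null, (Object) new String[0]);
        }
    }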
Runtime: Dynamic class loading (2)
• For example, 'HiMPJ.java' is a parallel program (a minimal version is sketched after this slide):
  • It requires mpj.jar to compile and run
  • Bundle it into a jar file whose manifest has a Class-Path attribute pointing to mpj.jar
  • Write the manifest file:
    • Manifest-Version: 1.0
    • Main-Class: HiMPJ
    • Class-Path: mpj.jar
  • 'jar -cfm himpj.jar manifest HiMPJ.class'
  • Copy it to the $MPJ_HOME/lib directory
• Executing MPJStarter:
  • 'cd $MPJ_HOME/bin'
  • 'starter.[sh/bat] 2 himpj.jar ../lib -xdev niodev'
  • JarClassLoader will load himpj.jar and mpj.jar into the daemons' JVMs
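For reference, a minimal guess at what 'HiMPJ.java' might contain, again assuming the mpiJava-style API that MPJ follows:

    import mpi.*;

    public class HiMPJ {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();
            System.out.println("Hi from process " + rank + " of " + size);
            MPI.Finalize();
        }
    }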
Presentation outline
• Introduction
• Design and implementation
• The runtime infrastructure
• Implementation issues
• Conclusion
Issue 1: Shared memory device
• Based on the Java Threads API (see the toy example below):
  • Each thread is an MPI process
  • It communicates with the other threads by sending messages
• All threads run in the same JVM:
  • The parallel program cannot have static variables
  • Static variables within the MPJ library require synchronized access
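A toy illustration of the smpdev idea, not MPJ code: each 'process' is a thread, and a 'message' is a reference handed over through a shared queue.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class SmpSketch {
        public static void main(String[] args) throws Exception {
            // One mailbox standing in for the rank0 -> rank1 channel
            BlockingQueue<int[]> mailbox = new ArrayBlockingQueue<>(1);

            Thread rank0 = new Thread(() -> {
                try {
                    mailbox.put(new int[] { 42 });  // "send" to rank 1
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            Thread rank1 = new Thread(() -> {
                try {
                    int[] msg = mailbox.take();     // "receive" from rank 0
                    System.out.println("rank 1 got " + msg[0]);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            rank0.start(); rank1.start();
            rank0.join(); rank1.join();
        }
    }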
Issue 2: Synchronization problems with threads in smpdev
• Each MPJDaemon is assigned a number of processes to execute:
  • In the case of smpdev, all processes run on the same machine
• MPJDaemon loads the parallel program:
  • 'JarClassLoader.loadClass(parallelProgramName)'
• Once loaded, the program is started as follows:
  • 'JarClassLoader.invokeClass(pClass, args)'
Issue 2: Synchronization problems with threads in smpdev
• For example, MPJStarter requests the MPJDaemons to start 2 processes (threads)
• The MPJDaemon starts two threads, which first load and then start the program
• Processes (threads) started in this way do not share static variables and cannot synchronize
• To share static variables and synchronize on them, the class should be loaded just once and executed N times
• It was implemented this way because niodev requires the exact opposite behaviour: no sharing of static variables (demonstrated below)
• Currently, the user specifies which device should be used:
  • In the case of niodev, the loading is done twice (once per process)
  • In the case of smpdev, the loading is done only once
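The class-loading distinction can be demonstrated in isolation. In the sketch below (jar path and class name are placeholders), two loaders yield two copies of the class, hence two independent sets of statics, which is what niodev wants; one loader shared by all threads yields one copy, which is what smpdev needs.

    import java.net.URL;
    import java.net.URLClassLoader;

    public class LoaderDemo {
        public static void main(String[] args) throws Exception {
            URL[] jar = { new URL("file:himpj.jar") };

            // niodev-style: one loader per process -> independent static variables
            Class<?> a = new URLClassLoader(jar, null).loadClass("HiMPJ");
            Class<?> b = new URLClassLoader(jar, null).loadClass("HiMPJ");
            System.out.println(a == b);  // false: two distinct classes

            // smpdev-style: one loader, N threads -> shared static variables
            ClassLoader shared = new URLClassLoader(jar, null);
            Class<?> c = shared.loadClass("HiMPJ");
            Class<?> d = shared.loadClass("HiMPJ");
            System.out.println(c == d);  // true: the same class object
        }
    }

The null parent keeps the system class path out of the delegation chain, so each URLClassLoader really does define its own copy of the class.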
Issue 3: Cygwin
• If running MPJ on Cygwin:
  • 'chmod o+w $MPJ_HOME/logs'
  • 'chmod a+x $MPJ_HOME/lib/*.dll'
• Open question: is MPJDaemon a Windows service, or a Linux-style service on Cygwin?
(Future) Issue 4: Specifying multiple devices
• Currently, only one device can be specified:
  • Either niodev or smpdev is selected as the primary comms device
• But for SMP clusters, it would be ideal:
  • To use smpdev within an SMP node
  • To use niodev/gmdev for inter-node comms (see the speculative sketch below)
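Purely speculative, but the routing such a scheme implies is simple; this sketch reuses the illustrative Device interface from the Design section, and none of its names come from MPJ itself.

    // Speculative: route intra-node traffic to smpdev, inter-node to niodev/gmdev.
    public class MultiDeviceRouter {
        private final Device smpdev;    // intra-node device
        private final Device niodev;    // inter-node device
        private final boolean[] local;  // local[rank] == true if rank shares this host

        public MultiDeviceRouter(Device smpdev, Device niodev, boolean[] local) {
            this.smpdev = smpdev;
            this.niodev = niodev;
            this.local = local;
        }

        public Device pick(int destRank) {
            return local[destRank] ? smpdev : niodev;
        }
    }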
(Future) Issue 5: Starting MPJ with the native MPI device
• The mpiJava/native MPI device uses 'mpirun' to bootstrap MPI processes:
  • To bring it in line with the other devices, the native MPI device will have to be started by the MPJ runtime infrastructure
Issue 6: Multiple users running MPJDaemons at the same time
• Install the daemons as an app
• Agree on the port numbers
Presentation outline
• Introduction
• Design and implementation
• The runtime infrastructure
• Implementation issues
• Conclusion
Summary
• The key issue for Java messaging is not the pure Java vs. JNI debate:
  • It is providing a flexible mechanism to swap various comm protocols
• MPJ has a pluggable architecture:
  • We are implementing 'niodev', 'gmdev', 'smpdev', and a native MPI device
• The MPJ runtime infrastructure allows bootstrapping MPI processes across various platforms
• MPJDaemons can be installed as native OS services
Conclusions
• We are slowly but surely moving towards the first release of MPJ, the next generation of 'MPI for Java'
• Current status:
  • Unit testing
• MPJ follows the same API as mpiJava:
  • Parallel applications built on top of mpiJava will work with MPJ
  • There are some differences in the API:
    • Bsend and explicit packing/unpacking (see the release docs for more details)
• Arguably the first MPI library for Java that implements the actual messaging layer in pure Java