Platforms for HPJava: Runtime Support for Scalable Programming in Java
Sang Boem Lim
Florida State University
slim@csit.fsu.edu
Contents
• Overview of HPJava
• Library support for HPJava
  • High-level APIs
  • Low-level API
• Applications and performance
• Contributions
• Conclusions
Goals
• Our research is concerned with enabling parallel, high-performance computation, in particular the development of scientific software, in the network-aware programming language Java.
• It addresses issues concerned with the implementation of the runtime environment underlying HPJava:
  • high-level APIs (e.g. Adlib)
  • a low-level API for the underlying communications (e.g. mpjdev)
• Adlib is the first application-level library for the HPspmd model.
• The mpjdev API is an underlying communication library that performs the actual communications.
• Implementations of mpjdev:
  • mpiJava-based implementation
  • multithreaded implementation
  • LAPI implementation
SPMD Parallel Computing
• SIMD – a single control unit dispatches instructions to each processing unit.
  • e.g. Illiac IV, Connection Machine-2, etc.
  • Introduced a new concept: distributed arrays.
• MIMD – each processor is capable of executing a different program, independent of the other processors.
  • asynchronous, flexible, but hard to program
  • e.g. Cosmic Cube, Cray T3D, IBM SP3, etc.
• SPMD – each processor executes the same program asynchronously. Synchronization takes place only when processors need to exchange data.
  • loosely synchronous model (SIMD + MIMD)
• HPF – an extension of Fortran 90 to support the data-parallel programming model on distributed-memory parallel computers.
Motivation
• SPMD (Single Program, Multiple Data) programming has been very successful for parallel computing.
• Many higher-level programming environments and libraries assume the SPMD style as their basic model: ScaLAPACK, DAGH, Kelp, the Global Array Toolkit.
• But the library-based SPMD approach to data-parallel programming lacks the uniformity and elegance of HPF.
• Compared with HPF, creating distributed arrays and accessing their local and remote elements is clumsy and error-prone.
• Because the arrays are managed entirely in libraries, the compiler offers little support and no safety net of compile-time or compiler-generated run-time checking.
• These observations motivate our introduction of the HPspmd model: direct SPMD programming supported by additional syntax for HPF-like distributed arrays.
HPspmd
• Proposed by Fox, Carpenter, and Xiaoming Li around 1998.
• Independent processes executing the same program, sharing elements of distributed arrays described by special syntax.
• Processes operate directly on locally owned elements. Explicit communication is needed in the program to permit access to elements owned by other processes.
• Bindings were envisaged for base languages like Fortran, C, Java, etc.
HPJava – Overview
• An environment for parallel programming.
• Extends Java by adding some predefined classes and some extra syntax for dealing with distributed arrays.
• So far the only implementation of the HPspmd model.
• An HPJava program is translated to a standard Java program that calls the communication libraries and parallel runtime system.
HPJava Example

  Procs p = new Procs2(2, 2);
  on(p) {
    Range x = new ExtBlockRange(M, p.dim(0), 1),
          y = new ExtBlockRange(N, p.dim(1), 1);

    float [[-,-]] a = new float [[x, y]];

    . . . Initialize edge values in 'a' (boundary conditions)

    float [[-,-]] b = new float [[x, y]],
                  r = new float [[x, y]];   // r = residuals

    do {
      Adlib.writeHalo(a);

      overall (i = x for 1 : N - 2)
        overall (j = y for 1 : N - 2) {
          float newA = 0.25 * (a[i - 1, j] + a[i + 1, j] +
                               a[i, j - 1] + a[i, j + 1]);
          r[i, j] = Math.abs(newA - a[i, j]);
          b[i, j] = newA;
        }

      HPspmd.copy(a, b);   // Jacobi relaxation
    } while (Adlib.maxval(r) > EPS);
  }
Processes and Process Grids
• An HPJava program is started concurrently in some set of processes.
• Processes are named through "grid" objects:

  Procs p = new Procs2(2, 3);

• Assumes the program is currently executing on 6 or more processes.
• Execution in a particular process grid is specified by the on construct:

  on(p) { . . . }
Distributed Arrays in HPJava
• There are many differences between distributed arrays and the ordinary arrays of Java. They are a new kind of container type with special syntax.
• Type signatures and constructors use double brackets to emphasize the distinction:

  Procs2 p = new Procs2(2, 3);
  on(p) {
    Range x = new BlockRange(M, p.dim(0));
    Range y = new BlockRange(N, p.dim(1));

    float [[-,-]] a = new float [[x, y]];
    . . .
  }
Figure: a 2-dimensional array with M = N = 8, block-distributed over the 2 × 3 grid p. Rows are divided into blocks over p.dim(0) and columns over p.dim(1), so each process holds a contiguous block of elements (4 × 3 or 4 × 2).
Figure: the Range class hierarchy of HPJava. Range is the base class, with subclasses BlockRange, CyclicRange, IrregRange, CollapsedRange, and Dimension; ExtBlockRange extends BlockRange.
The overall construct
• overall – a distributed parallel loop.
• General form, parameterized by an index triplet:

  overall (i = x for l : u : s) { . . . }

  i = distributed index, l = lower bound, u = upper bound, s = step.

• In general, a subscript used in a distributed array element must be a distributed index in the array's range.
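For example, a minimal sketch (illustrative, following the syntax of the examples elsewhere in this talk) that scales every second element of a one-dimensional distributed array:

  float [[-]] a = new float [[x]];

  overall (i = x for 0 : N - 1 : 2)   // visit even global subscripts only
    a[i] = 2.0f * a[i];

Each process executes the loop body only for the locally held elements whose global subscript matches the triplet.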
Irregular distributed data structures
• Can be described as a distributed array of ordinary Java arrays:

  float [[-]][] a = new float [[x]][];

  overall (i = x for :)
    a[i] = new float [f(i`)];   // row length given by a function of the global index i`

Figure: on two processes, the four rows a[0]..a[3] have sizes 4, 2, 5, and 3.
Library Support for HPJava
Historical Adlib I
• The Adlib library was completed in the Parallel Compiler Runtime Consortium (PCRC) project.
• This version used C++ as its implementation language.
• The initial emphasis was on High Performance Fortran (HPF).
• Initially Adlib was not meant to be a user-level library: it was called by compiler-generated code when the HPF translator processed a user application.
• It was developed on top of portable MPI.
• Used by two experimental HPF translators (SHPF and "PCRC" HPF).
Historical Adlib II
• Initially HPJava used a JNI wrapper interface to the C++ kernel of the PCRC library.
• This implementation had limitations and disadvantages:
  • Most importantly, supporting Java object types in this version was difficult and inefficient.
  • It had performance disadvantages because every call to the C++ Adlib had to go through JNI.
  • It did not provide a set of gather/scatter buffer operations to better support HPC applications.
Collective Communication Library
• The Java version of Adlib is the first library of its kind developed from scratch for application-level use in the HPspmd model.
• It borrows many ideas from the PCRC library, but for this project the high-level library was rewritten in Java.
• It is extended to support Java Object types, to target Java-based communication platforms, and to use Java exception handling, making it "safe" for Java.
• It supports collective operations on distributed arrays described by HPJava syntax.
• The Java version of Adlib is developed on top of mpjdev. The mpjdev API can be implemented portably on network platforms and efficiently on parallel hardware.
Java version of Adlib
• This API is intended for an application-level communication library suitable for HPJava programming.
• There are three main families of collective operations in Adlib:
  • regular collective communications
  • reduction operations
  • irregular communications
• The complete API of the Java Adlib is presented in Appendix A of my dissertation.
Regular Collective Communications I
• remap
  • Copies the values of the elements in the source array to the corresponding elements in the destination array:

  void remap(T [[-]] dst, T [[-]] src);

  • T stands as shorthand for any primitive type or Object type of Java.
  • Destination and source must have the same size and shape, but they can have any, unrelated, distribution formats.
  • Can implement a multicast if the destination has a replicated distribution format. (A usage sketch follows.)
• shift

  void shift(T [[-]] dst, T [[-]] src, int amount, int dimension);

  • Implements a simpler pattern of communication than the general remap: elements are shifted by a fixed amount along one array dimension.
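For instance, a minimal sketch of remap in use, copying a block-distributed array into a cyclically distributed one (the process grid, range names, and array names here are illustrative assumptions, not code from the dissertation):

  Procs1 q = new Procs1(4);
  on(q) {
    Range blk = new BlockRange(N, q.dim(0));
    Range cyc = new CyclicRange(N, q.dim(0));

    float [[-]] src = new float [[blk]];   // block-distributed source
    float [[-]] dst = new float [[cyc]];   // cyclically distributed destination
    . . . initialize 'src'

    Adlib.remap(dst, src);   // same size and shape, unrelated distributions
  }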
Regular Collective Communications II
• writeHalo

  void writeHalo(T [[-]] a);

  • Applied to distributed arrays that have ghost regions; it updates those regions.
• A more general form of writeHalo allows one to specify that only a subset of the available ghost area is to be updated:

  void writeHalo(T [[-]] a, int wlo, int whi, int mode);

  wlo, whi specify the widths of the bands to be updated at the lower and upper ends of the ghost area.
Solution of the Laplace equation using ghost regions

  Range x = new ExtBlockRange(M, p.dim(0), 1);
  Range y = new ExtBlockRange(N, p.dim(1), 1);

  float [[-,-]] a = new float [[x, y]];

  . . . Initialize values in 'a'

  float [[-,-]] b = new float [[x, y]],
                r = new float [[x, y]];

  do {
    Adlib.writeHalo(a);

    overall (i = x for 1 : N - 2)
      overall (j = y for 1 : N - 2) {
        float newA = 0.25 * (a[i - 1, j] + a[i + 1, j] +
                             a[i, j - 1] + a[i, j + 1]);
        r[i, j] = Math.abs(newA - a[i, j]);
        b[i, j] = newA;
      }

    HPspmd.copy(a, b);
  } while (Adlib.maxval(r) > EPS);

Figure: the local segments of 'a' on two processes, each extended with a one-element ghost border holding copies of elements owned by the neighboring process.
Figure: illustration of the effect of executing the writeHalo function, showing the physical segment of the array, the declared ghost region of the array segment, and the ghost area written by writeHalo.
Other features of Adlib
• Provides reduction operations (e.g. maxval() and sum()) and irregular communications (e.g. gather() and scatter()).
• The complete API and implementation issues are described in depth in my dissertation.
Other High-level APIs
• Java Grande Message-Passing Working Group
  • Formed as a subset of the existing Concurrency and Applications working group of the Java Grande Forum.
  • Discusses a common API for MPI-like Java libraries.
  • To avoid confusion with standards published by the original MPI Forum, the API was called MPJ.
  • The java-mpi mailing list has about 195 subscribers.
mpiJava
• Implements a Java API for MPI suggested in late 1997.
• mpiJava is currently implemented as a Java interface to an underlying MPI implementation, such as MPICH or some other native MPI implementation.
• The interface between mpiJava and the underlying MPI implementation is via the Java Native Interface (JNI).
• This software is available from http://www.hpjava.org/mpiJava.html
• Around 1,465 people have downloaded this software.
Low-level API
• One area of research is how to transfer data between the Java program and the network while reducing the overheads of the Java Native Interface.
• This should be done portably on network platforms and efficiently on parallel hardware.
• We developed a low-level Java API for HPC message passing, called mpjdev.
• The mpjdev API is a device-level communication library. It was developed with HPJava in mind, but it is a standalone library and could be used by other systems.
mpjdev I
• Meant for library developers.
• Application-level communication libraries like the Java version of Adlib (or potentially MPJ) can be implemented on top of mpjdev.
• The mpjdev API is small compared to MPI (it only includes point-to-point communications):
  • blocking mode (like MPI_SEND, MPI_RECV)
  • non-blocking mode (like MPI_ISEND, MPI_IRECV)
• The sophisticated data types of MPI are omitted. Instead, mpjdev provides a flexible suite of operations for copying data to and from the buffer (gather- and scatter-style operations); see the sketch after this list.
• Buffer handling has similarities to the JDK 1.4 new I/O (java.nio).
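The following is a hypothetical usage sketch of the mpjdev point-to-point layer. The names Comm and Buffer appear in the slides on the mpiJava-based implementation, but the method names send(), recv(), write(), and read() and their signatures here are my assumptions, not the published mpjdev API:

  // Hypothetical sketch: rank 0 packs a double[] into a buffer and sends it
  // to rank 1 with a blocking send; rank 1 receives and unpacks it.
  void exchange(Comm comm, int me) {
      Buffer buf = new Buffer(1024);          // assumed fixed-capacity buffer
      double[] data = new double[3];
      if (me == 0) {
          data[0] = 1.0; data[1] = 2.0; data[2] = 3.0;
          buf.write(data, 0, data.length);    // gather-style copy into the buffer
          comm.send(buf, 1, 99);              // blocking send: dest rank 1, tag 99
      } else if (me == 1) {
          comm.recv(buf, 0, 99);              // blocking receive: source rank 0, tag 99
          buf.read(data, 0, data.length);     // scatter-style copy out of the buffer
      }
  }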
mpjdev II
• mpjdev can be implemented on top of Java sockets in a portable network implementation, or, on HPC platforms, through a JNI interface to a subset of MPI.
• Currently there are three different implementations:
  • The initial version was targeted to HPC platforms, through a JNI interface to a subset of MPI.
  • For SMPs, and for debugging on a single processor, we implemented a pure-Java, multithreaded version.
  • We also developed a more system-specific mpjdev built on the IBM SP system using LAPI.
• A Java sockets version, which will provide a more portable network implementation, will be added in the future.
Figure: HPJava communication layers. The Java version of Adlib (and other application-level APIs) sits on MPJ and mpjdev; mpjdev in turn is implemented either in pure Java, for SMPs or networks of PCs, or over native MPI, for parallel hardware (e.g. IBM SP3, Sun HPC).
mpiJava-based Implementation
• Assumes the C binding of native method calls to MPI from mpiJava as the basic communication protocol.
• Can be divided into two parts:
  • Java APIs (the Buffer and Comm classes), which are used to call native methods via JNI.
  • C native methods that construct the message vector and perform the communication.
• For elements of Object type, the serialized data are stored into a Java byte[] array (a sketch of this step follows the list):
  • copying into the existing message vector if it has space to hold the serialized data array,
  • or using a separate send if the original message vector is not large enough.
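Serializing an Object element into a byte[] uses standard Java serialization. A self-contained, plain-Java sketch of that step (illustrative of the technique only, not the actual mpjdev native code):

  import java.io.*;

  public class ObjectMessage {
      // Flatten an Object element into a byte[] that can be copied into the
      // message vector (if it fits) or sent separately (if too large).
      static byte[] serialize(Object element) throws IOException {
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
              out.writeObject(element);
          }
          return bytes.toByteArray();
      }

      // The reverse step on the receiving side.
      static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
          try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
              return in.readObject();
          }
      }
  }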
Multithreaded Implementation
• The processes of an HPJava program are mapped to the Java threads of a single JVM.
• This allows us to debug and demonstrate HPJava programs without facing the ordeal of installing MPI or running on a network.
• As a by-product, it also means we can run HPJava programs on shared-memory parallel computers, e.g. high-end UNIX servers.
• Java threads of modern JVMs are usually executed in parallel on these kinds of machines.
LAPI Implementation
• The Low-level Application Programming Interface (LAPI) is a low-level communication interface for the switch of the IBM Scalable POWERparallel (SP) supercomputer.
• This switch provides scalable, high-performance communication between SP nodes.
• LAPI functions can be divided into three characteristic groups:
  • Active message infrastructure: allows programmers to write and install their own set of handlers.
  • Two Remote Memory Copy (RMC) interfaces:
    • PUT operation: copies data from the address space of the origin process into the address space of the target process.
    • GET operation: the opposite of the PUT operation.
• We produced two different implementations of mpjdev using LAPI:
  • active message function (LAPI_Amsend) + GET operation (LAPI_Get)
  • active message function (LAPI_Amsend) only
LAPI Implementation: Active message and GET operation
LAPI Implementation: Active message
Applications and Performance
Environments
• System: IBM SP3 supercomputing system with the AIX 4.3.3 operating system and 42 nodes.
• CPU: each node has four processors (POWER3, 375 MHz) and 2 gigabytes of shared memory.
• Network MPI setting: shared "css0" adapter with User Space (US) communication mode.
• Java VM: IBM's JIT.
• Java compiler: IBM J2RE 1.3.1 with the "-O" option.
• HPF compiler: IBM xlhpf95 with the "-qhot" and "-O3" options.
• Fortran 95 compiler: IBM xlf95 with the "-O5" option.
• HPJava can outperform sequential Java by up to 17 times.
• On 36 processors HPJava achieves about 79% of the performance of HPF.
Multigrid
• The multigrid method is a fast algorithm for the solution of linear and nonlinear problems. It uses a hierarchy of grids, with restrict and interpolate operations between the current grid (fine grid) and restricted grids (coarse grids).
• The general strategy is (a minimal sketch follows this list):
  • make the error smooth by performing a relaxation method;
  • restrict a smoothed version of the error term to a coarse grid, compute a correction term on the coarse grid, then interpolate this correction back to the original fine grid;
  • perform some steps of the relaxation method again to improve the original approximation to the solution.
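A minimal, self-contained sketch of this strategy, as a sequential 1D V-cycle for -u'' = f in plain Java (illustrative only; the grid size, damped-Jacobi smoother, and transfer operators are my assumptions, not the HPJava benchmark code):

  // Sequential 1D multigrid V-cycle for -u'' = f on [0,1], u(0) = u(1) = 0.
  public class MultigridSketch {

      // Damped-Jacobi smoother: u_i <- u_i + w * ((u_{i-1} + u_{i+1} + h^2 f_i)/2 - u_i)
      static void relax(double[] u, double[] f, int sweeps) {
          int n = u.length - 1;
          double h2 = 1.0 / ((double) n * n), w = 2.0 / 3.0;
          for (int s = 0; s < sweeps; s++) {
              double[] old = u.clone();
              for (int i = 1; i < n; i++)
                  u[i] = old[i] + w * (0.5 * (old[i - 1] + old[i + 1] + h2 * f[i]) - old[i]);
          }
      }

      static void vCycle(double[] u, double[] f) {
          int n = u.length - 1;
          double h2 = 1.0 / ((double) n * n);
          if (n == 2) {                              // coarsest grid: one unknown, solve exactly
              u[1] = 0.5 * h2 * f[1];
              return;
          }
          relax(u, f, 3);                            // pre-smoothing on the fine grid
          double[] r = new double[n + 1];
          for (int i = 1; i < n; i++)                // residual r = f - A u
              r[i] = f[i] - (2 * u[i] - u[i - 1] - u[i + 1]) / h2;
          double[] fc = new double[n / 2 + 1];       // restrict residual to the coarse grid
          for (int i = 1; i < n / 2; i++)
              fc[i] = 0.25 * (r[2 * i - 1] + 2 * r[2 * i] + r[2 * i + 1]);
          double[] uc = new double[n / 2 + 1];
          vCycle(uc, fc);                            // coarse-grid correction, recursively
          for (int i = 1; i < n; i++)                // interpolate correction back to the fine grid
              u[i] += (i % 2 == 0) ? uc[i / 2]
                                   : 0.5 * (uc[i / 2] + uc[i / 2 + 1]);
          relax(u, f, 3);                            // post-smoothing
      }

      public static void main(String[] args) {
          int n = 64;                                // n must be a power of two
          double[] u = new double[n + 1], f = new double[n + 1];
          java.util.Arrays.fill(f, 1.0);             // constant forcing, f = 1
          for (int c = 0; c < 10; c++) vCycle(u, f);
          System.out.println("u(0.5) = " + u[n / 2] + "  (exact: 0.125)");
      }
  }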
• Speedup is relatively modest. This seems to be due to the complex pattern of communication in this algorithm.
Speedup of HPJava Benchmarks
HPJava with GUI
• Illustrates how HPJava can be used with a Java graphical user interface.
• The Java multithreaded implementation of mpjdev makes it possible for HPJava to cooperate with the Java AWT.
• For testing and demonstrating the multithreaded version of mpjdev, we implemented a computational fluid dynamics (CFD) code in HPJava.
• Illustrates the usage of Java objects in our communication library.
• You can view this demonstration and its source code at http://www.hpjava.org/demo.html
• Removed the graphical part of the CFD code and did performance tests on the computational part only.
• Changed a 2-dimensional Java Object distributed array into a 3-dimensional double distributed array to eliminate object serialization overhead.
• Used the HPC implementation of the underlying communication library to run the code on an SP.
LAPI mpjdev Performance
• We found that the current version of Java thread synchronization is not implemented efficiently.
• A Java thread takes more than five times as long as a POSIX thread to perform the wait and wake-up function calls:
  • 57.49 microseconds (Java thread) vs. 10.68 microseconds (POSIX thread)
• This result suggests we should look for a new architectural design for mpjdev using LAPI.
• Consider using POSIX threads, by calling into C via JNI, instead of Java threads. (An illustrative microbenchmark sketch follows.)
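A minimal sketch of the kind of microbenchmark behind these numbers: timing a wait()/notify() handoff between two Java threads (illustrative; the repetition count and structure are my assumptions, not the dissertation's actual test harness):

  // Measure the average cost of a wait()/notify() handoff between two threads.
  public class WaitNotifyBench {
      static final Object lock = new Object();
      static volatile boolean ready = false;

      public static void main(String[] args) throws InterruptedException {
          final int reps = 100_000;
          Thread waker = new Thread(() -> {
              for (int i = 0; i < reps; i++) {
                  synchronized (lock) {
                      ready = true;
                      lock.notify();                 // wake the waiting main thread
                  }
                  while (ready) { }                  // spin until the signal is consumed
              }
          });
          long start = System.nanoTime();
          waker.start();
          for (int i = 0; i < reps; i++) {
              synchronized (lock) {
                  while (!ready) lock.wait();        // block until signalled
                  ready = false;                     // consume the signal
              }
          }
          waker.join();
          System.out.printf("avg wait/notify handoff: %.2f us%n",
                            (System.nanoTime() - start) / 1e3 / reps);
      }
  }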
Contributions
Contributions I: HPJava
• My main contribution to HPJava was the development of the runtime communication libraries:
  • The Java version of Adlib was developed as an application-level communication library suitable for data-parallel programming in Java.
  • The mpjdev API was developed as a device-level communication library.
• Other contributions: the type-analyzer for analyzing bytecode class hierarchies, parts of the type-checker, and the pre-translator of HPJava.
• Some applications of HPJava and full test codes for Adlib and mpjdev were also developed.
Contributions II: mpiJava
• My main contribution to the mpiJava project was adding support for direct communication of Java objects via serialization.
• A complete set of mpiJava test cases for Java object types was developed.
• Maintaining mpiJava.