Platforms for HPJava: Runtime Support for Scalable Programming in Java

Sang Boem Lim, Florida State University, slim@csit.fsu.edu

Contents • Overview of HPJava • Library support for HPJava • High-level APIs • Low-level API • Applications and performance • Contributions • Conclusions

Goals • Our research is concerned with enabling parallel, high-performance computation, in particular the development of scientific software in the network-aware programming language Java. • We address issues in implementing the run-time environment underlying HPJava. • High-level APIs (e.g. Adlib) • Low-level API for underlying communications (e.g. mpjdev) • Adlib is the first application-level library for the HPspmd model. • The mpjdev API is an underlying communication library that performs the actual communications. • Implementations of mpjdev: • mpiJava-based implementation • Multithreaded implementation • LAPI implementation

SPMD Parallel Computing • SIMD: A single control unit dispatches instructions to each processing unit. • e.g. Illiac IV, Connection Machine-2, etc. • introduced a new concept, distributed arrays • MIMD: Each processor is capable of executing a different program independently of the other processors. • asynchronous, flexible, but hard to program • e.g. Cosmic Cube, Cray T3D, IBM SP3, etc. • SPMD: Each processor executes the same program asynchronously. Synchronization takes place only when processors need to exchange data. • loosely synchronous model (SIMD+MIMD) • HPF: an extension of Fortran 90 to support the data-parallel programming model on distributed-memory parallel computers.

Motivation • SPMD (Single Program, Multiple Data) programming has been very successful for parallel computing. • Many higher-level programming environments and libraries assume the SPMD style as their basic model—ScaLAPACK, DAGH, Kelp, Global Array Toolkit. • But the library-based SPMD approach to data-parallel programming lacks the uniformity and elegance of HPF. • Compared with HPF, creating distributed arrays and accessing their local and remote elements is clumsy and error-prone. • Because the arrays are managed entirely in libraries, the compiler offers little support and no safety net of compile-time or compiler-generated run-time checking. • These observations motivate our introduction of the HPspmd model—direct SPMD programming supported by additional syntax for HPF-like distributed arrays.

HPspmd • Proposed by Fox, Carpenter, Xiaoming Li around 1998. • Independent processes executing same program, sharing elements of distributed arrays described by special syntax. • Processes operate directly on locally owned elements. Explicit communication needed in program to permit access to elements owned by other processes. • Envisaged bindings for base languages like Fortran, C, Java, etc.

HPJava—Overview • Environment for parallel programming. • Extends Java by adding some predefined classes and some extra syntax for dealing with distributed arrays. • So far the only implementation of HPspmd model. • HPJava program translated to standard Java program which calls communication libraries and parallel runtime system.

HPJava Example
    Procs p = new Procs2(2, 2);
    on(p) {
        Range x = new ExtBlockRange(M, p.dim(0), 1),
              y = new ExtBlockRange(N, p.dim(1), 1);
        float [[-,-]] a = new float [[x, y]];
        . . . // Initialize edge values in 'a' (boundary conditions)
        float [[-,-]] b = new float [[x, y]],
                      r = new float [[x, y]];   // r = residuals
        do {
            Adlib.writeHalo(a);
            overall (i = x for 1 : N - 2)
                overall (j = y for 1 : N - 2) {
                    float newA = 0.25f * (a[i - 1, j] + a[i + 1, j] +
                                          a[i, j - 1] + a[i, j + 1]);
                    r [i, j] = Math.abs(newA - a [i, j]);
                    b [i, j] = newA;
                }
            HPspmd.copy(a, b);   // Jacobi relaxation.
        } while (Adlib.maxval(r) > EPS);
    }
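For comparison, the same Jacobi relaxation can be written as a sequential sketch in plain Java. This is not HPJava and not taken from the slides: the class name `JacobiSketch` and the method `relax` are hypothetical, ordinary `float[][]` arrays stand in for distributed arrays, and only the interior of `b` is copied back so that the boundary values of `a` are preserved.

```java
// Sequential sketch of Jacobi relaxation in plain Java (hypothetical names).
final class JacobiSketch {

    /**
     * Repeats Jacobi sweeps over the interior of grid a until the largest
     * residual is no greater than eps. Returns the number of sweeps done.
     */
    static int relax(float[][] a, float eps) {
        int m = a.length;
        int n = a[0].length;
        float[][] b = new float[m][n];
        int sweeps = 0;
        float maxResidual;
        do {
            maxResidual = 0f;
            for (int i = 1; i <= m - 2; i++) {
                for (int j = 1; j <= n - 2; j++) {
                    float newA = 0.25f * (a[i - 1][j] + a[i + 1][j]
                                        + a[i][j - 1] + a[i][j + 1]);
                    // Plays the role of Adlib.maxval(r) in the HPJava version.
                    maxResidual = Math.max(maxResidual, Math.abs(newA - a[i][j]));
                    b[i][j] = newA;
                }
            }
            // Copy only the interior back, so boundary conditions stay fixed.
            for (int i = 1; i <= m - 2; i++) {
                System.arraycopy(b[i], 1, a[i], 1, n - 2);
            }
            sweeps++;
        } while (maxResidual > eps);
        return sweeps;
    }
}
```

In the HPJava version the two nested loops become `overall` constructs, the maximum becomes a collective reduction, and `Adlib.writeHalo` supplies the neighbour values that this sequential sketch reads directly from the same array.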

Processes and Process Grids • An HPJava program is started concurrently in some set of processes. • Processes named through “grid” objects: Procs p = new Procs2(2, 3); • Assumes program currently executing on 6 or more processes. • Specify execution in a particular process grid by on construct: on(p) { . . . }

Distributed Arrays in HPJava • Many differences between distributed arrays and ordinary arrays of Java. New kind of container type with special syntax. • Type signatures, constructors use double brackets to emphasize distinction:
    Procs2 p = new Procs2(2, 3);
    on(p) {
        Range x = new BlockRange(M, p.dim(0));
        Range y = new BlockRange(N, p.dim(1));
        float [[-,-]] a = new float [[x, y]];
        . . .
    }

2-dimensional array block-distributed over p (M = N = 8) [Figure: the 8 × 8 array a laid out on the 2 × 3 process grid p. Along p.dim(0), coordinate 0 holds rows 0 to 3 and coordinate 1 holds rows 4 to 7. Along p.dim(1), coordinates 0, 1 and 2 hold columns 0 to 2, 3 to 5, and 6 to 7.]
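The layout on this slide can be reproduced with a few lines of index arithmetic. The sketch below is plain Java with hypothetical names (`BlockDistributionSketch`, `blockSize`, `owner`, `low`, `high`); it assumes the block size is the extent divided by the number of processes, rounded up, which matches the 3, 3, 2 column split shown for an extent of 8 over 3 processes. It is not the HPJava `BlockRange` implementation.

```java
// Index arithmetic for a block distribution (illustrative, hypothetical names).
final class BlockDistributionSketch {

    /** Block size: extent divided by process count, rounded up (assumed rule). */
    static int blockSize(int extent, int procs) {
        return (extent + procs - 1) / procs;
    }

    /** Process coordinate that owns the given global index. */
    static int owner(int globalIndex, int extent, int procs) {
        return globalIndex / blockSize(extent, procs);
    }

    /** First global index held at the given process coordinate. */
    static int low(int coord, int extent, int procs) {
        return Math.min(coord * blockSize(extent, procs), extent);
    }

    /** One past the last global index held at the given process coordinate. */
    static int high(int coord, int extent, int procs) {
        return Math.min((coord + 1) * blockSize(extent, procs), extent);
    }

    public static void main(String[] args) {
        int extent = 8;
        for (int procs : new int[] {2, 3}) {
            for (int c = 0; c < procs; c++) {
                System.out.println(procs + " processes, coordinate " + c + ": ["
                        + low(c, extent, procs) + ", " + high(c, extent, procs) + ")");
            }
        }
    }
}
```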

The Range hierarchy of HPJava [Figure: class diagram of the range classes: Range, BlockRange, CyclicRange, ExtBlockRange, IrregRange, CollapsedRange, Dimension.]

The overall construct • overall: a distributed parallel loop • General form parameterized by index triplet: overall (i = x for l : u : s) { . . . } where i = distributed index, l = lower bound, u = upper bound, s = step. • In general, a subscript used in a distributed array element must be a distributed index in the array range.
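One way to picture what an overall loop does on a single process is to intersect the triplet l : u : s with the block of global indices that process holds. The following plain-Java sketch does exactly that. The names (`OverallSketch`, `localIterations`) are hypothetical, the step is assumed positive, the upper bound is treated as inclusive, and the code is not what the HPJava translator actually generates.

```java
import java.util.ArrayList;
import java.util.List;

// Which iterations of a triplet l : u : s fall in one process's block (sketch).
final class OverallSketch {

    /**
     * Returns the global indices of the triplet l : u : s (u inclusive, s > 0)
     * that lie in the locally held block [lo, hi).
     */
    static List<Integer> localIterations(int l, int u, int s, int lo, int hi) {
        List<Integer> result = new ArrayList<>();
        int first = l;
        if (lo > l) {
            // Smallest element of the triplet that is not below lo.
            int steps = (lo - l + s - 1) / s;
            first = l + steps * s;
        }
        for (int g = first; g <= u && g < hi; g += s) {
            result.add(g);
        }
        return result;
    }

    public static void main(String[] args) {
        // Triplet 1 : 6 over an extent of 8 split into blocks [0,3), [3,6), [6,8).
        int[][] blocks = {{0, 3}, {3, 6}, {6, 8}};
        for (int[] b : blocks) {
            System.out.println(localIterations(1, 6, 1, b[0], b[1]));
        }
    }
}
```

Each process executes only its own share of the iterations, so the union over all blocks covers the triplet exactly once.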

Irregular distributed data structures • Can be described as a distributed array of Java arrays.
    float [[-]][] a = new float [[x]][];
    overall (i = x for : )
        a [i] = new float[ f(i`) ];
[Figure: a distributed over processes 0 and 1; elements [0] to [3] hold Java arrays of size 4, 2, 5 and 3.]

Library Support for HPJava

Historical Adlib I • The Adlib library was completed in the Parallel Compiler Runtime Consortium (PCRC) project. • This version used C++ as its implementation language. • Initial emphasis was on High Performance Fortran (HPF). • Initially Adlib was not meant to be a user-level library. It was called by compiler-generated code when an HPF compiler translated a user application. • It was developed on top of portable MPI. • Used by two experimental HPF translators (SHPF and “PCRC” HPF).

Historical Adlib II • Initially HPJava used a JNI wrapper interface to the C++ kernel of the PCRC library. • This implementation had limitations and disadvantages. • Most importantly, supporting Java object types in this version was hard and inefficient. • It had performance disadvantages because all calls to C++ Adlib had to go through JNI. • It did not provide a set of gather/scatter buffer operations to better support HPC applications.

Collective Communication Library • The Java version of Adlib is the first library of its kind developed from scratch for application-level use in the HPspmd model. • Borrows many ideas from the PCRC library, but for this project we rewrote the high-level library in Java. • It is extended to support Java Object types, to target Java-based communication platforms, and to use Java exception handling, making it “safe” for Java. • Supports collective operations on distributed arrays described by HPJava syntax. • The Java version of the Adlib library is developed on top of mpjdev. The mpjdev API can be implemented portably on network platforms and efficiently on parallel hardware.

Java version of Adlib • This API is intended as an application-level communication library suitable for HPJava programming. • There are three main families of collective operation in Adlib: • regular collective communications • reduction operations • irregular communications • The complete API of Java Adlib is presented in Appendix A of my dissertation.

Regular Collective Communications I • remap • Copies the values of the elements in the source array to the corresponding elements in the destination array. void remap (T [[-]] dst, T [[-]] src); • T stands as a shorthand for any primitive type or Object type of Java. • Destination and source must have the same size and shape, but they can have any unrelated distribution formats. • Can implement a multicast if the destination has a replicated distribution format. • shift void shift (T [[-]] dst, T [[-]] src, int amount, int dimension); • implements a simpler pattern of communication than general remap.

Regular Collective Communications II • writeHalo void writeHalo (T [[-]] a); • Applied to distributed arrays that have ghost regions. It updates those regions. • A more general form of writeHalo allows specifying that only a subset of the available ghost area is to be updated. void writeHalo(T [[-]] a, int wlo, int whi, int mode); wlo, whi: specify the widths at the lower and upper ends of the bands to be updated.

Solution of Laplace equation using ghost regions
    Range x = new ExtBlockRange(M, p.dim(0), 1);
    Range y = new ExtBlockRange(N, p.dim(1), 1);
    float [[-,-]] a = new float [[x, y]];
    . . . // Initialize values in 'a'
    float [[-,-]] b = new float [[x, y]],
                  r = new float [[x, y]];
    do {
        Adlib.writeHalo(a);
        overall (i = x for 1 : N - 2)
            overall (j = y for 1 : N - 2) {
                float newA = 0.25f * (a[i - 1, j] + a[i + 1, j] +
                                      a[i, j - 1] + a[i, j + 1]);
                r [i, j] = Math.abs(newA - a [i, j]);
                b [i, j] = newA;
            }
        HPspmd.copy(a, b);
    } while (Adlib.maxval(r) > EPS);
[Figure: the segments of a held by each process of the grid, including ghost cells that duplicate edge elements of the neighbouring segments.]

Illustration of the effect of executing the writeHalo function [Figure labels: physical segment of array; “declared” ghost region of array segment; ghost area written by writeHalo.]
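The effect of a halo update can be imitated in one dimension with ordinary Java arrays. In the sketch below each process's segment is a `float[]` laid out as lower ghost cells, physical elements, then upper ghost cells, and all segments live in one JVM. The names (`HaloSketch`, `writeHalo`) and the layout are assumptions made for illustration; this is not the Adlib implementation, which exchanges messages between processes.

```java
// One-dimensional imitation of a ghost-region update (hypothetical layout).
final class HaloSketch {

    /**
     * segs[p] is laid out as [w lower ghosts | physical elements | w upper ghosts].
     * Each segment's ghosts are overwritten with the neighbouring segment's
     * edge elements. Ghosts at the outer ends of the array are left untouched.
     */
    static void writeHalo(float[][] segs, int w) {
        for (int p = 0; p < segs.length; p++) {
            float[] me = segs[p];
            int physical = me.length - 2 * w;
            if (p > 0) {
                // Last w physical elements of the left neighbour.
                float[] left = segs[p - 1];
                System.arraycopy(left, left.length - 2 * w, me, 0, w);
            }
            if (p < segs.length - 1) {
                // First w physical elements of the right neighbour.
                float[] right = segs[p + 1];
                System.arraycopy(right, w, me, w + physical, w);
            }
        }
    }
}
```

Only physical elements are read and only ghost cells are written, so the order in which segments are processed does not matter.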

Other features of Adlib • Provide reduction operations (e.g. maxval() and sum()) and irregular communications (e.g. gather() and scatter()). • Complete API and implementation issues are described in depth in my dissertation.

Other High-level APIs • Java Grande Message-Passing Working Group • formed as a subset of the existing Concurrency and Applications working group of Java Grande Forum. • Discussion of a common API for MPI-like Java libraries. • To avoid confusion with standards published by the original MPI Forum the API was called MPJ. • java-mpi mailing list has about 195 subscribers.

mpiJava • Implements a Java API for MPI suggested in late ’97. • mpiJava is currently implemented as Java interface to an underlying MPI implementation—such as MPICH or some other native MPI implementation. • The interface between mpiJava and the underlying MPI implementation is via the Java Native Interface (JNI). • This software is available from http://www.hpjava.org/mpiJava.html • Around 1465 people downloaded this software.

Low-level API • One area of research is how to transfer data between the Java program and the network while reducing the overheads of the Java Native Interface. • This should be done portably on network platforms and efficiently on parallel hardware. • We developed a low-level Java API for HPC message passing, called mpjdev. • The mpjdev API is a device-level communication library. This library was developed with HPJava in mind, but it is a standalone library and could be used by other systems.

mpjdev I • Meant for library developers. • Application-level communication libraries like the Java version of Adlib (or potentially MPJ) can be implemented on top of mpjdev. • The API for mpjdev is small compared to MPI (it only includes point-to-point communications). • Blocking mode (like MPI_SEND, MPI_RECV) • Non-blocking mode (like MPI_ISEND, MPI_IRECV) • The sophisticated data types of MPI are omitted. • Provides a flexible suite of operations for copying data to and from the buffer (like gather- and scatter-style operations). • Buffer handling is similar to the JDK 1.4 new I/O.
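Since the slide compares buffer handling to the JDK 1.4 new I/O, the gather- and scatter-style copying can be illustrated with `java.nio.ByteBuffer`. This sketch does not use or reproduce the mpjdev API. The class and method names (`GatherBufferSketch`, `gather`, `scatter`) are hypothetical, and strided access to a `float[]` is just one plausible pattern.

```java
import java.nio.ByteBuffer;

// Gather/scatter of strided array elements through a java.nio buffer (sketch).
final class GatherBufferSketch {

    /** Packs src[offset], src[offset + stride], ... (count values) into a buffer. */
    static ByteBuffer gather(float[] src, int offset, int stride, int count) {
        ByteBuffer buf = ByteBuffer.allocate(count * Float.BYTES);
        for (int k = 0; k < count; k++) {
            buf.putFloat(src[offset + k * stride]);
        }
        buf.flip();   // switch from writing to reading
        return buf;
    }

    /** Unpacks count values into dst[offset], dst[offset + stride], ... */
    static void scatter(ByteBuffer buf, float[] dst, int offset, int stride, int count) {
        for (int k = 0; k < count; k++) {
            dst[offset + k * stride] = buf.getFloat();
        }
    }
}
```

A message-passing layer built this way would send the packed buffer between the two calls. The sender and the receiver are free to use different offsets and strides.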

mpjdev II • mpjdev could be implemented on top of Java sockets in a portable network implementation or, on HPC platforms, through a JNI interface to a subset of MPI. • Currently there are three different implementations. • The initial version was targeted at HPC platforms, through a JNI interface to a subset of MPI. • For SMPs, and for debugging on a single processor, we implemented a pure-Java, multithreaded version. • We also developed a more system-specific mpjdev built on the IBM SP system using LAPI. • A Java sockets version, which will provide a more portable network implementation, will be added in the future.

HPJava communication layers [Figure: layer diagram with the labels: other application-level API; Java version of Adlib; MPJ and mpjdev; pure Java; native MPI; SMPs or networks of PCs; parallel hardware (e.g. IBM SP3, Sun HPC).]

mpiJava-based Implementation • Assumes the C binding of native method calls to MPI from mpiJava as the basic communication protocol. • Can be divided into two parts. • Java APIs (Buffer and Comm classes) which are used to call native methods via JNI. • C native methods that construct the message vector and perform communication. • For elements of Object type, the serialized data are stored in a Java byte[] array, which is sent by • copying it into the existing message vector if that has space to hold the serialized data array, • or using a separate send if the original message vector is not large enough.
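The handling of Object elements described above can be sketched with standard Java serialization. The code below turns an object into a `byte[]` and then applies the size test that decides between copying into the existing message vector and using a separate send. The names (`ObjectPackingSketch`, `fitsInMessage`) and the capacity parameters are hypothetical, and the actual message-vector layout of the implementation is not shown in the slides.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Serializing Object elements into a byte[] for transmission (sketch).
final class ObjectPackingSketch {

    /** Serializes one object into a byte array using standard Java serialization. */
    static byte[] serialize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reconstructs the object on the receiving side. */
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * True if the serialized bytes fit in the space left in a message vector
     * of the given capacity. Otherwise a separate send would be needed.
     */
    static boolean fitsInMessage(byte[] data, int capacity, int used) {
        return data.length <= capacity - used;
    }
}
```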

Multithreaded Implementation • The processes of an HPJava program are mapped to the Java threads of a single JVM. • This allows debugging and demonstrating HPJava programs without facing the ordeal of installing MPI or running on a network. • As a by-product, it also means we can run HPJava programs on shared-memory parallel computers. • e.g. high-end UNIX servers • Java threads of modern JVMs are usually executed in parallel on this kind of machine.
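The idea of mapping SPMD processes to threads of one JVM can be shown with a small ring exchange. In this sketch every thread plays the role of one process, and a `BlockingQueue` per thread stands in for its incoming message channel. All names (`ThreadedSpmdSketch`, `ringExchange`) are hypothetical, and the queue-based mailbox is an assumption made for illustration rather than the design of the multithreaded mpjdev.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// SPMD "processes" as threads of one JVM, exchanging messages in a ring (sketch).
final class ThreadedSpmdSketch {

    /**
     * Starts nprocs threads. Each sends its rank to the next rank in the ring
     * and receives one value from the previous rank. Returns what each received.
     */
    static int[] ringExchange(int nprocs) {
        List<BlockingQueue<Integer>> inbox = new ArrayList<>();
        for (int r = 0; r < nprocs; r++) {
            inbox.add(new LinkedBlockingQueue<>());
        }
        int[] received = new int[nprocs];
        Thread[] procs = new Thread[nprocs];
        for (int r = 0; r < nprocs; r++) {
            final int rank = r;
            procs[r] = new Thread(() -> {
                try {
                    inbox.get((rank + 1) % nprocs).put(rank);   // send to the right neighbour
                    received[rank] = inbox.get(rank).take();    // blocking receive
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            procs[r].start();
        }
        for (Thread t : procs) {
            try {
                t.join();   // also makes the threads' writes visible to the caller
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return received;
    }
}
```

Every thread runs the same code and differs only in its rank, which is the SPMD pattern in miniature.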

LAPI Implementation • The Low-level Application Programming Interface (LAPI) is a low-level communication interface for the switch of the IBM Scalable POWERparallel (SP) supercomputer. • This switch provides scalable high-performance communication between SP nodes. • LAPI functions can be divided into three different characteristic groups. • Active message infrastructure: allows programmers to write and install their own set of handlers. • Two Remote Memory Copy (RMC) interfaces. • PUT operation: copies data from the address space of the origin process into the address space of the target process. • GET operation: opposite of the PUT operation. • We produced two different implementations of mpjdev using LAPI. • Active message function (LAPI_Amsend) + GET operation (LAPI_Get) • Active message function (LAPI_Amsend)

LAPI Implementation: Active message and GET operation

LAPI Implementation: Active message

Applications and Performance

Environments • System: IBM SP3 supercomputing system with AIX 4.3.3 operating system and 42 nodes. • CPU: Each node has four processors (Power3, 375 MHz) and 2 gigabytes of shared memory. • Network MPI Setting: Shared “css0” adapter with User Space (US) communication mode. • Java VM: IBM’s JIT • Java Compiler: IBM J2RE 1.3.1 with “-O” option. • HPF Compiler: IBM xlhpf95 with “-qhot” and “-O3” options. • Fortran 95 Compiler: IBM xlf95 with “-O5” option.

HPJava can out-perform sequential Java by up to 17 times. • On 36 processors HPJava can get about 79% of the performance of HPF.


Multigrid • The multigrid method is a fast algorithm for the solution of linear and nonlinear problems. It uses a hierarchy of grids, with restrict and interpolate operations between the current grid (fine grid) and the restricted grid (coarse grid). • The general strategy is: • Make the error smooth by performing a relaxation method. • Restrict a smoothed version of the error term to a coarse grid, compute a correction term on the coarse grid, then interpolate this correction back to the original fine grid. • Perform some steps of the relaxation method again to improve the original approximation to the solution.
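The restrict and interpolate operations can be illustrated in one dimension. The slides do not say which transfer operators the benchmark uses, so the sketch below assumes two common textbook choices: full-weighting restriction and linear interpolation, on a fine grid of 2m + 1 points and a coarse grid of m + 1 points. The names (`MultigridTransferSketch`, `restrict`, `interpolate`) are hypothetical.

```java
// One-dimensional grid transfer operators for multigrid (assumed operators).
final class MultigridTransferSketch {

    /** Full-weighting restriction: 2m + 1 fine points to m + 1 coarse points. */
    static double[] restrict(double[] fine) {
        int m = (fine.length - 1) / 2;
        double[] coarse = new double[m + 1];
        coarse[0] = fine[0];                  // end points are copied directly
        coarse[m] = fine[fine.length - 1];
        for (int i = 1; i < m; i++) {
            coarse[i] = 0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i] + 0.25 * fine[2 * i + 1];
        }
        return coarse;
    }

    /** Linear interpolation: m + 1 coarse points to 2m + 1 fine points. */
    static double[] interpolate(double[] coarse) {
        int m = coarse.length - 1;
        double[] fine = new double[2 * m + 1];
        for (int i = 0; i < m; i++) {
            fine[2 * i] = coarse[i];                                // coincident point
            fine[2 * i + 1] = 0.5 * (coarse[i] + coarse[i + 1]);    // midpoint average
        }
        fine[2 * m] = coarse[m];
        return fine;
    }
}
```

In a distributed setting both operators read values held by neighbouring processes, which is one reason the communication pattern is more complex than in plain Jacobi relaxation.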

Speedup is relatively modest. This seems to be due to the complex pattern of communication in this algorithm.

Speedup of HPJava Benchmarks

HPJava with GUI • Illustrates how HPJava can be used with a Java graphical user interface. • The Java multithreaded implementation of mpjdev makes it possible for HPJava to cooperate with Java AWT. • To test and demonstrate the multithreaded version of mpjdev, we implemented a computational fluid dynamics (CFD) code using HPJava. • Illustrates usage of Java objects in our communication library. • You can view this demonstration and source code at http://www.hpjava.org/demo.html

We removed the graphical part of the CFD code and ran performance tests on the computational part only. • Changed a 2-dimensional distributed array of Java objects into a 3-dimensional distributed array of doubles to eliminate object serialization overhead. • Used the HPC implementation of the underlying communication to run the code on an SP.

LAPI mpjdev Performance • We found that the current version of Java thread synchronization is not implemented with high performance. • A Java thread takes more than five times as long as a POSIX thread to perform the wait and wake-up calls. • 57.49 microseconds (Java thread) vs. 10.68 microseconds (POSIX) • This result suggests we should look for a new architectural design for mpjdev using LAPI. • Consider using POSIX threads, called from C through JNI, instead of Java threads.

Contributions

Contributions I HPJava • My main contribution to HPJava was developing the runtime communication libraries. • The Java version of Adlib has been developed as an application-level communication library suitable for data-parallel programming in Java. • The mpjdev API has been developed as a device-level communication library. • Other contributions: the Type-Analyzer for analyzing the hierarchy of byte classes, some parts of the type-checker, and the pre-translator of HPJava were developed. • Some applications of HPJava and full test codes for Adlib and mpjdev were developed.

Contributions II mpiJava • My main contribution to the mpiJava project was adding support for direct communication of Java objects via serialization. • A complete set of mpiJava test cases for Java object types was developed. • Maintaining mpiJava.
