“Towards an SSI for HP Java” Francis Lau The University of Hong Kong With contributions from C.L. Wang, Ricky Ma, and W.Z. Zhu
Cluster Coming of Age • HPC • Cluster the de facto standard equipment • Grid? • Clusters • Fortran or C + MPI the norm • 99% on top of bare-bone Linux or the like • Ok if application is embarrassingly parallel and regular ICPP-HPSECA03
Cluster for the Mass Commercial: Data mining, Financial Modeling, Oil Reservoir Simulation, Seismic Data Processing, Vehicle and Aircraft Simulation Government: Nuclear Stockpile Stewardship, Climate and Weather, Satellite Image Processing, Forces Modeling Academic: Fundamental Physics (particles, relativity, cosmology), Biochemistry, Environmental Engineering, Earthquake Prediction • Two modes: • For number crunching in Grande type applications (superman) • As a CPU farm to support high-throughput computing (poor man) ICPP-HPSECA03
Cluster Programming • Auto-parallelization tools have limited success • Parallelization a chore but “have to do it” (or let’s hire someone) • Optimization for performance not many users’ cup of tea • Partitioning and parallelization • Mapping • Remapping (experts?) ICPP-HPSECA03
Amateur Parallel Programming • Common problems • Poor parallelization: few large chunks or many small chunks • Load imbalances: large and small chunks • Meeting the amateurs half-way • They do crude parallelization • System does the rest: mapping/remapping (automatic optimization) • And I/O? ICPP-HPSECA03
Automatic Optimization • “Feed the fat boy with two spoons, and a few slim ones with one spoon” • But load information could be elusive • Need smart runtime supports • Goal is to achieve high performance with good resource utilization and load balancing • Large chunks that are single-threaded a problem ICPP-HPSECA03
The Good “Fat Boys” • Large chunks that span multiple nodes • Must be a program with multiple execution “threads” • Threads can be in different nodes – program expands and shrinks • Threads/programs can roam around – dynamic migration • This encourages fine-grain programming (figure: a program as an “amoeba” of threads spreading across cluster nodes) ICPP-HPSECA03
Mechanism and Policy • Mechanism for migration • Traditional process migration • Thread migration • Redirection of I/O and messages • Object sharing between nodes for threads • Policy for good dynamic load balancing • Message traffic a crucial parameter • Predictive • Towards the “single system image” ideal ICPP-HPSECA03
Single System Image • If user does only crude parallelization and system does the rest … • If processes/threads can roam, and processes expand/shrink … • If I/O (including sockets) can be at any node anytime … • We achieve at least 50% of SSI • The rest is difficult (SSI facets: single entry point, file system, virtual networking, I/O and memory space, process space, management / programming view, …) ICPP-HPSECA03
Bon Java! • Java (for HPC) in good hands • JGF Numerics Working Group, IBM Ninja, … • JGF Concurrency/Applications Working Group (benchmarking, MPI, …) • The workshops • Java has many advantages (vs. Fortran and C/C++) • Performance not an issue any more • Threads as first-class citizens! • JVM can be modified “Java has the greatest potential to deliver an attractive productive programming environment spanning the very broad range of tasks needed by the Grande programmer ” – The Java Grande Forum Charter ICPP-HPSECA03
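A minimal sketch of what “threads as first-class citizens” means in practice (an illustrative example, not taken from the talk): each worker is an ordinary Java object whose run() body the JVM schedules, with no external threading library.

// Minimal illustration: threads are plain language-level objects.
public class ThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            final int id = i;
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    System.out.println("worker " + id + " running");
                }
            });
            workers[i].start();   // the JVM schedules the thread
        }
        for (int i = 0; i < workers.length; i++) {
            workers[i].join();    // wait for every worker to finish
        }
    }
}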
Process vs. Thread Migration • Process migration easier than thread migration • Threads are tightly coupled • They share objects • Two styles to explore • Process, MPI (“distributed computing”) • Thread, shared objects (“parallel computing”) • Or combined • Boils down to messages vs. distributed shared objects ICPP-HPSECA03
Two Projects @ HKU • M-JavaMPI – “M” for “Migration” • Process migration • I/O redirection • Extension to grid • No modification of JVM and MPI • JESSICA – “Java-Enabled Single System Image Computing Architecture” • By modifying JVM • Thread migration, Amoeba mode • Global object space, I/O redirection • JIT mode (Version 2) ICPP-HPSECA03
Design Choices • Bytecode instrumentation • Insert code into programs, manually or via pre-processor • JVM extension • Make thread state accessible from Java program • Non-transparent • Modification of JVM is required • Checkpointing the whole JVM process • Powerful but heavy penalty • Modification of JVM • Runtime support • Totally transparent to the applications • Efficient but very difficult to implement ICPP-HPSECA03
M-JavaMPI • Support transparent Java process migration and provide communication redirection services • Communication using MPI • Implemented as a middleware on top of standard JVM • No modifications of JVM and MPI • Checkpointing the Java process + code insertion by preprocessor ICPP-HPSECA03
System Architecture ICPP-HPSECA03
Preprocessing • Bytecode is modified before passing to JVM for execution • “Restoration functions” are inserted as exception handlers, in the form of encapsulated “try-catch” statements • Re-arrangement of bytecode, and addition of local variables ICPP-HPSECA03
The Layers • Java-MPI API layer • Restorable MPI layer • Provides restorable MPI communications • No modification of MPI library • Migration Layer • Captures and saves the execution state of the migrating process in the source node, and restores the execution state of the migrated process in the destination node • Cooperates with the Restorable MPI layer to reconstruct the communication channels of the parallel application ICPP-HPSECA03
State Capturing and Restoring • Program code: re-used in the destination node • Data: captured and restored by using the object serialization mechanism • Execution context: captured by using JVMDI and restored by inserted exception handlers • Eager (all) strategy: For each frame, local variables, referenced objects, the name of the class and class method, and program counter are saved using object serialization ICPP-HPSECA03
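A minimal sketch of the eager (all) strategy, assuming a hypothetical FrameState record; M-JavaMPI itself extracts these values through JVMDI rather than from application code.

import java.io.Serializable;
import java.util.List;

// Hypothetical per-frame record illustrating the eager (all) strategy:
// everything needed to rebuild one stack frame is made serializable so the
// complete execution state can be shipped to the destination node.
public class FrameState implements Serializable {
    String className;                       // class declaring the method
    String methodName;                      // method executing in this frame
    int programCounter;                     // position saved at the capture point
    Object[] localVariables;                // local variable slots (primitives boxed)
    List<Serializable> referencedObjects;   // objects reachable from this frame
}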
State Capturing using JVMDI

Original class:

public class A {
    int a;
    char b;
    …
}

After preprocessing (restoration code inserted as an exception handler):

public class A {
    try {
        …
    } catch (RestorationException e) {
        a = saved value of local variable a;
        b = saved value of local variable b;
        pc = saved value of program counter when the program is suspended;
        jump to the location where the program is suspended
    }
}

ICPP-HPSECA03
Message Redirection Model • MPI daemon in each node to support message passing between distributed Java processes • IPC between the Java program and the MPI daemon in the same node through shared memory and semaphores (figure: client-server relationship between each Java process and its local MPI daemon) ICPP-HPSECA03
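A sketch of the client side of this redirection model. The real IPC uses shared memory and semaphores; a loopback socket to an assumed daemon port stands in here, and all names are illustrative.

import java.io.DataOutputStream;
import java.net.Socket;

// Illustrative hand-off: instead of calling MPI directly, the Java process
// passes each outgoing message to the MPI daemon on the same node, which
// performs the actual MPI send on its behalf.
public class DaemonSend {
    private static final int DAEMON_PORT = 5000;   // assumed local daemon endpoint

    public static void send(int destRank, int tag, byte[] payload) throws Exception {
        try (Socket s = new Socket("localhost", DAEMON_PORT);
             DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
            out.writeInt(destRank);        // logical rank of the destination process
            out.writeInt(tag);             // MPI message tag
            out.writeInt(payload.length);  // message length
            out.write(payload);            // message body
        }
    }
}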
Process migration steps (figure: source node → destination node) ICPP-HPSECA03
Experiments • PC Cluster • 16-node cluster • 300 MHz Pentium II with 128MB of memory • Linux 2.2.14 with Sun JDK 1.3.0 • 100Mb/s fast Ethernet • All Java programs executed in interpreted mode ICPP-HPSECA03
Bandwidth:PingPong Test • Native MPI: 10.5 MB/s • Direct Java-MPI binding: 9.2 MB/s • Restorable MPI layer: 7.6 MB/s ICPP-HPSECA03
Latency:PingPong Test • Native MPI: 0.2 ms • Direct Java-MPI binding: 0.23 ms • Restorable MPI layer: 0.26 ms ICPP-HPSECA03
Migration Cost: capturing and restoring objects ICPP-HPSECA03
Migration Cost: capturing and restoring frames ICPP-HPSECA03
Application Performance • PI calculation • Recursive ray-tracing • NAS integer sort • Parallel SOR ICPP-HPSECA03
Time spent in calculating PI and ray-tracing with and without the migration layer ICPP-HPSECA03
Execution time of NAS program with different problem sizes (16 nodes)

Problem size (no. of integers) | Without M-JavaMPI (sec): Total / Comp / Comm | With M-JavaMPI (sec): Total / Comp / Comm | Overhead: Total / Comm
Class S: 65536 | 0.023 / 0.009 / 0.014 | 0.026 / 0.009 / 0.017 | 13% / 21%
Class W: 1048576 | 0.393 / 0.182 / 0.212 | 0.424 / 0.182 / 0.242 | 7.8% / 14%
Class A: 8388608 | 3.206 / 1.545 / 1.66 | 3.387 / 1.546 / 1.840 | 5.6% / 11%

No noticeable overhead is introduced in the computation part; the communication part shows an overhead of about 10–20%. ICPP-HPSECA03
Time spent in executing SOR using different numbers of nodes with and without migration layer ICPP-HPSECA03
Cost of Migration Time spent in executing the SOR program on an array of size 256x256 without and with one migration during the execution ICPP-HPSECA03
Cost of Migration • Time spent in migration (in seconds) for different applications

Application | Average migration time (s)
PI | 2
Ray-tracing | 3
NAS | 2
SOR | 3

ICPP-HPSECA03
Dynamic Load Balancing • A simple test • SOR program was executed using six nodes in an unevenly loaded environment with one of the nodes executing a computationally intensive program • Without migration : 319s • With migration: 180s ICPP-HPSECA03
In Progress • M-JavaMPI in JIT mode • Develop system modules for automatic dynamic load balancing • Develop system modules for effective fault-tolerance support ICPP-HPSECA03
Java Virtual Machine • Class Loader • Loads class files (application class files and Java API class files) • Interpreter • Executes bytecode • Runtime Compiler • Converts bytecode to native code (figure: class loader → bytecode → interpreter, or runtime compiler → native code) ICPP-HPSECA03
A Multithreaded Java Program / Threads in JVM

public class ProducerConsumerTest {
    public static void main(String[] args) {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
    }
}

(figure: threads in the JVM – per-thread PC and stack frames in the execution engine, shared method area (code) and heap (data) holding objects, class loader loading class files)

ICPP-HPSECA03
Java Memory Model (how to maintain memory consistency between threads) • A variable is loaded from main memory into a thread’s working memory before use • The variable is modified in T1’s working memory • When T1 performs an unlock, the variable is written back to main memory • When T2 performs a lock, the variable in its working memory is flushed • When T2 then uses the variable, it is loaded from main memory (figure: per-thread working memory vs. main memory holding the object master copy in the heap area; threads T1, T2) ICPP-HPSECA03
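The same rule seen at the source level (an illustrative example): releasing a monitor forces T1’s working copy back to main memory, and acquiring the same monitor makes T2 reload the variable from main memory.

// Illustration of the lock/unlock rule above: the write in increment() is
// flushed to main memory at monitor exit, and read() refetches the variable
// from main memory at monitor entry.
public class Counter {
    private int shared = 0;

    public synchronized void increment() {  // T1: lock, modify working copy, unlock -> write back
        shared++;
    }

    public synchronized int read() {        // T2: lock -> variable reloaded from main memory
        return shared;
    }
}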
Problems in Existing DJVMs • Mostly based on interpreters • Simple but slow • Layered design using a distributed shared memory system (DSM) cannot be tightly coupled with the JVM • JVM runtime information cannot be channeled to the DSM • False sharing if a page-based DSM is employed • Page faults block the whole JVM • Programmer has to specify the thread distribution → lack of transparency • Need to rewrite multithreaded Java applications • No dynamic thread distribution (preemptive thread migration) for load balancing ICPP-HPSECA03
Related Work • Method shipping: IBM cJVM • Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located • Executed in interpreter mode • Load balancing problem: affected by the object distribution • Page shipping: Rice U. Java/DSM, HKU JESSICA • Simple; GOS is supported by a page-based distributed shared memory (e.g., TreadMarks, JUMP, JiaJia) • JVM runtime information can’t be channeled to the DSM • Executed in interpreter mode • Object shipping: Hyperion, Jackal • Leverage an object-based DSM • Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code ICPP-HPSECA03
Distributed Java Virtual Machine (DJVM) JESSICA2: A distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications with a Single System Image illusion to Java threads. (figure: Java threads created in a program mapped across cluster nodes through the Global Object Space, with OS/PC nodes connected by a high-speed network) ICPP-HPSECA03
JESSICA2 Main Features • Transparent Java thread migration • Runtime capturing and restoring of thread execution context • No source code modification; no bytecode instrumentation (preprocessing); no new API introduced • Enables dynamic load balancing on clusters • Operated in Just-In-Time (JIT) compilation mode • Global Object Space • A shared global heap spanning all cluster nodes • Adaptive object home migration protocol • I/O redirection ICPP-HPSECA03
Transparent Thread Migration in JIT Mode • Simple for interpreters (e.g., JESSICA) • The interpreter sits in the bytecode decoding loop, which can be stopped by checking a migration flag • The full state of a thread is available in the interpreter’s data structures • No register allocation • JIT mode execution makes things complex (JESSICA2) • Native code has no clear bytecode boundary • How to deal with machine registers? • How to organize the stack frames (all are in native form now)? • How to make extracted thread states portable and recognizable by the remote JVM? • How to restore the extracted states (rebuild the stack frames) and restart the execution in native form? → Need to modify the JIT compiler to instrument native code ICPP-HPSECA03
An overview of JESSICA2 Java thread migration (figure): (1) the load monitor alerts the thread scheduler on the source node; (2) the source JVM performs stack analysis and stack capturing of the migrating thread; (3) the migration manager ships the captured frames to the destination node, which performs frame parsing and restores execution; (4a) object accesses go through the GOS (heap); (4b) method code is loaded from NFS into the method area. ICPP-HPSECA03
Essential Functions • Migration points selection • At the start of loop, basic block or method • Register context handler • Spill dirty registers at migration point without invalidation so that native code can continue the use of registers • Use register recovering stub at restoring phase • Variable type deduction • Spill type in stacks using compression • Java frames linking • Discover consecutive Java frames ICPP-HPSECA03
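Conceptually, the instrumentation at each migration point is a cheap flag test. The Java-level rendering below is only a sketch: in JESSICA2 the check is emitted as a few native instructions by the modified JIT compiler, and the helper names are hypothetical.

// Java-level rendering of the check inserted at each migration point
// (start of a loop, basic block, or method). Names are illustrative.
public class MigrationCheck {
    static volatile boolean migrationRequested = false;  // set by the thread scheduler

    static void migrationPoint() {
        if (migrationRequested) {
            // spill dirty registers to stack slots, record variable types,
            // then hand the parsed frames to the migration manager
            captureAndMigrate();
        }
    }

    static void captureAndMigrate() { /* capture frames, ship to destination node */ }

    static double work(double[] a) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            migrationPoint();   // check inserted at the start of the loop body
            sum += a[i];
        }
        return sum;
    }
}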
Dynamic Thread State Capturing and Restoring in JESSICA2 (figure): during bytecode translation the JIT compiler selects migration points and instruments the native code – 1. add migration checking (cmp mflag,0 / jz …), 2. add object checking (cmp obj[offset],0 / jz …), 3. add type & register spilling (mov 0x110182, slot …); capturing scans the native thread stack (Java frames and C frames) and uses the global object access path, while restoring performs linking & constant resolution and generates register-recovering code (mov slot1->reg1, mov slot2->reg2, …). ICPP-HPSECA03
How to Maintain Memory Consistency in a Distributed Environment? (figure: threads T1–T8 share one logical heap while running on separate per-node heaps, over OS/PC nodes connected by a high-speed network) ICPP-HPSECA03
Embedded Global Object Space (GOS) • Take advantage of JVM runtime information for optimization (e.g., object types, accessing threads, etc.) • Use a threaded I/O interface inside the JVM for communication to hide latency → non-blocking GOS access • OO-based to reduce false sharing • Home-based, compliant with the JVM Memory Model (“Lazy Release Consistency”) • Master heap (home objects) and cache heap (local and cached objects): reduce object access latency ICPP-HPSECA03
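A sketch of the home-based read path implied above: try the master heap (home objects), then the cache heap, and fetch from the home node only on a miss. The class, object IDs, and fetch call are illustrative assumptions, not the JESSICA2 API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative access path of the embedded GOS: home objects live in the
// master heap, remote objects are cached locally after the first fetch.
public class GlobalObjectSpace {
    private final Map<Long, Object> masterHeap = new ConcurrentHashMap<Long, Object>(); // home objects
    private final Map<Long, Object> cacheHeap  = new ConcurrentHashMap<Long, Object>(); // cached copies

    public Object read(long objectId) {
        Object home = masterHeap.get(objectId);    // object is homed on this node
        if (home != null) return home;
        Object cached = cacheHeap.get(objectId);   // valid cached copy
        if (cached != null) return cached;
        Object fetched = fetchFromHome(objectId);  // remote fetch on a cache miss
        cacheHeap.put(objectId, fetched);
        return fetched;
    }

    private Object fetchFromHome(long objectId) {
        throw new UnsupportedOperationException("network fetch omitted in this sketch");
    }
}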
Object Cache ICPP-HPSECA03
Adaptive object home migration • Definition • “home” of an object = the JVM that holds the master copy of an object • Problems • cache objects need to be flushed and re-fetched from the home whenever synchronization happens • Adaptive object home migration • if # of accesses from a thread dominates the total # of accesses to an object, the object home will be migrated to the node where the thread is running ICPP-HPSECA03
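A sketch of this dominance heuristic, assuming per-node access counters and a fixed threshold; the bookkeeping and the threshold value are illustrative, not JESSICA2’s actual parameters.

// Sketch of adaptive object home migration: if one node accounts for most of
// the accesses to an object, that node becomes the object's new home.
public class HomeMigrationPolicy {
    private static final double DOMINANCE = 0.8;   // assumed dominance threshold

    public int chooseHome(int currentHome, long[] accessesPerNode) {
        long total = 0;
        for (long a : accessesPerNode) total += a;
        if (total == 0) return currentHome;
        for (int node = 0; node < accessesPerNode.length; node++) {
            if ((double) accessesPerNode[node] / total >= DOMINANCE) {
                return node;        // dominant accessor: migrate the home here
            }
        }
        return currentHome;         // no dominant accessor: keep the current home
    }
}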
I/O redirection • Timer • Use the time in the master node as the standard time • Calibrate the time in a worker node when it registers with the master node • File I/O • Use half word of “fd” as the node number • Open file • For read, check local first, then the master node • For write, go to the master node • Read/Write • Go to the node specified by the node number in fd • Network I/O • Connectionless send: do it locally • Others go to the master (see the sketch below) ICPP-HPSECA03
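A sketch of the fd encoding described above, assuming the upper half word carries the node number and the lower half the node-local descriptor; the exact bit layout in the real system may differ.

// Illustrative virtual file descriptor: node number in the upper 16 bits,
// node-local descriptor in the lower 16 bits.
public class VirtualFd {
    static int encode(int nodeId, int localFd) {
        return (nodeId << 16) | (localFd & 0xFFFF);
    }
    static int nodeOf(int fd)  { return fd >>> 16; }      // node that owns the file
    static int localOf(int fd) { return fd & 0xFFFF; }    // descriptor on that node
}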