JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support
Wenzhang Zhu, Cho-Li Wang, Francis Lau
The Systems Research Group, Department of Computer Science and Information Systems, The University of Hong Kong
HKU JESSICA Project
• JESSICA: "Java-Enabled Single-System-Image Computing Architecture".
• Project started in 1996; first version (JESSICA1) released in 1999.
• A middleware that runs on top of the standard UNIX/Linux operating system to support parallel execution of multithreaded Java applications on a cluster of computers.
• JESSICA hides the physical boundaries between machines and makes the cluster appear as a single computer to applications: a single-system image (SSI).
• Special feature: preemptive thread migration, which allows a thread to move freely between machines.
• Part of the RGC's Area of Excellence project, 1999-2002.
JESSICA Team Members
• Supervisors:
  • Dr. Francis C.M. Lau
  • Dr. Cho-Li Wang
• Research students:
  • Ph.D.: Wenzhang Zhu (Thread Migration)
  • Ph.D.: WeiJian Fang (Global Heap)
  • M.Phil.: Zoe Ching Han Yu (Distributed Garbage Collection)
  • Ph.D.: Benny W. L. Cheung (Software Distributed Shared Memory)
• Graduated: Matchy Ma (JESSICA1)
Outline
• Introduction to cluster computing
• Motivations
• Related work
• JESSICA2 features
• Performance analysis
• Conclusion and future work
What's a cluster?
• A cluster is a type of parallel or distributed processing system consisting of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource (IEEE TFCC).
• My definition: an HPC system that integrates mainstream commodity components to solve large-scale problems; low-cost, self-made, yet powerful.
Cluster Computer Architecture (bottom-up):
• High-speed LAN (Fast/Gigabit Ethernet, SCI, Myrinet) connecting the nodes
• Nodes, each running its own OS
• Single System Image infrastructure and availability infrastructure
• Programming environment (Java, C, MPI, HPF, DSM), plus management, monitoring, and job scheduling
• Cluster applications (web, storage, computing, rendering, financing, ...)
Single System Image (SSI)?
• JESSICA Project: Java-Enabled Single-System-Image Computing Architecture.
• A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
• Ultimate goal of SSI: make the cluster appear like a single machine to the user, to applications, and to the network. This covers a Single Entry Point, Single File System, Single Virtual Networking, Single I/O and Memory Space, Single Process Space, Single Management / Programming View, and more.
Top 500 computers by "classification" (June 2002) (Source: http://www.top500.org/)

| Classification | Count | Share  | Rmax [GF/s] | Rpeak [GF/s] | Processors |
|----------------|-------|--------|-------------|--------------|------------|
| MPP            | 224   | 44.8 % | 104899.21   | 168829.00    | 111104     |
| Constellations | 187   | 37.4 % | 40246.50    | 59038.00     | 33828      |
| Cluster        | 80    | 16 %   | 37596.16    | 69774.00     | 50181      |
| SMP            | 9     | 1.8 %  | 39208.10    | 44875.00     | 6056       |
| Total          | 500   | 100 %  | 221949.97   | 342516.00    | 201169     |

MPP = Massively Parallel Processor; Constellation = e.g., a cluster of HPCs; Cluster = a cluster of PCs; SMP = Symmetric Multiprocessor.
• About the TOP500 list:
  • The 500 most powerful computer systems installed in the world.
  • Compiled twice a year since June 1993.
  • Ranked by performance on the LINPACK benchmark.
#1 Supercomputer: NEC's Earth Simulator
• Built by NEC: 640 processor nodes, each consisting of 8 vector processors (5,120 processors in total), 40 TFlop/s peak, and 10 TB memory. (Source: NEC)
• Linpack: 35.86 Tflop/s (Tera FLOPS = 10^12 floating-point operations per second; roughly 450 x Pentium 4 PCs).
• Interconnect: single-stage crossbar (1,800 miles of cable), 83,000 copper cables, 16 GB/s cross-section bandwidth.
• Area of computer = 4 tennis courts, 3 floors.
Other Supercomputers in the TOP500
• #2, #3: ASCI Q
  • 7.7 TF/s Linpack performance.
  • Los Alamos National Laboratory, U.S.
  • HP AlphaServer SC (375 x 32-way multiprocessors, 11,968 processors in total), 12 terabytes of memory and 600 terabytes of disk storage.
• #4: IBM ASCI White (U.S.)
  • 8,192 copper microprocessors (IBM SP POWER3), 6 terabytes (TB) of memory, and more than 160 TB of IBM disk storage capacity; Linpack: 7.22 Tflops. Located at Lawrence Livermore National Laboratory.
  • 512 nodes, each a 16-way symmetric multiprocessor. Covers an area the size of two basketball courts, weighs 106 tons, uses 2,000 miles of copper wiring. Cost: US$110 million.
TOP500 November 2002 List
• Two new PC clusters made the TOP 10:
  • #5 is a Linux NetworX/Quadrics cluster at Lawrence Livermore National Laboratory.
  • #8 is an HPTi/Myrinet cluster at the Forecast Systems Laboratory at NOAA.
• A total of 55 Intel-based and 8 AMD-based PC clusters are in the TOP500.
• The number of clusters in the TOP500 grew again, to a total of 93 systems.
Poor Man's Cluster
• HKU Ostrich Cluster
  • 32 x 733 MHz Pentium III PCs, 384 MB memory.
  • Hierarchical Ethernet-based network: four 24-port Fast Ethernet switches + one 8-port Gigabit Ethernet backbone switch.
Rich Man's Cluster
• Computational Plant (C-Plant cluster)
  • 1,536 Compaq DS10L 1U servers (466 MHz Alpha 21264 (EV6) microprocessor, 256 MB ECC SDRAM).
  • Each node contains a 64-bit, 33 MHz Myrinet network interface card (1.28 Gb/s) connected to a 64-port Mesh64 switch.
  • 48 cabinets, each containing 32 nodes (48 x 32 = 1,536).
The HKU Gideon 300 Cluster (operating since mid-October 2002)
• 300 PCs (2.0 GHz Pentium 4, 512 MB DDR memory, 40 GB disk, Linux OS) connected by a 312-port Foundry FastIron 1500 (Fast Ethernet) switch.
• Linpack performance: 355 Gflops; #175 in the TOP500 (November 2002 list).
JESSICA2: Introduction
• Research goal: high-performance Java computing using clusters.
• Why Java?
  • The dominant language for server-side programming: more than 2 million Java developers [CNETAsia: 06/2002].
  • Platform independent: "compile once, run anywhere".
  • Code mobility (i.e., dynamic class loading) and data mobility (i.e., object serialization).
  • Built-in multithreading support at the language level (parallel programming using MPI, PVM, RMI, RPC, HPF, or DSM is difficult).
• Why clusters?
  • Large-scale server-side applications need high-performance multithreaded programming support.
  • A cluster provides a scalable hardware platform for true parallel execution.
Java Virtual Machine
• Class loader: loads application and Java API class files.
• Interpreter: executes bytecode directly.
• Runtime (JIT) compiler: converts bytecode to native code.
(Figure: class files flow through the class loader into the execution engine, which either interprets the bytecode or runs JIT-compiled native code.)
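To make the class-loader stage concrete, here is a minimal, hypothetical example (not JESSICA2 code) of the dynamic class loading that gives Java the "code mobility" cited above; the class name Hello is an assumption:

```java
// Minimal illustration of dynamic class loading through the standard API.
// The class name "Hello" is hypothetical; any class on the classpath works.
public class DynamicLoadDemo {
    public static void main(String[] args) throws Exception {
        // The class loader locates, verifies, and links the class file at runtime.
        Class<?> clazz = Class.forName("Hello");
        // Reflectively invoke Hello.main(new String[0]).
        clazz.getMethod("main", String[].class)
             .invoke(null, (Object) new String[0]);
    }
}
```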
A Multithreaded Java Program

```java
public class ProducerConsumerTest {
    public static void main(String[] args) {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
    }
}
```

(Figure: threads in the JVM. Each thread has its own PC and stack of frames; the class loader brings class files into the shared method area (code), and objects live in the shared heap (data).)
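The CubbyHole, Producer, and Consumer classes are not shown on the slide; the following is a minimal sketch of the one-slot buffer from the classic producer/consumer pattern the example follows (field and method names are assumptions):

```java
// Sketch of the one-slot buffer the example relies on; the slide does not
// show the actual class, so this follows the classic tutorial pattern.
class CubbyHole {
    private int contents;
    private boolean available = false;

    public synchronized int get() {
        while (!available) {
            try { wait(); } catch (InterruptedException ignored) { }
        }
        available = false;
        notifyAll();            // wake a waiting producer
        return contents;
    }

    public synchronized void put(int value) {
        while (available) {
            try { wait(); } catch (InterruptedException ignored) { }
        }
        contents = value;
        available = true;
        notifyAll();            // wake a waiting consumer
    }
}
```

The Producer thread repeatedly calls put() and the Consumer repeatedly calls get(); the synchronized methods and wait/notifyAll provide the lock/unlock events the Java memory model acts on, as described next.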
Java Memory Model (how memory consistency is maintained between threads in a JVM)
• Each thread (T1, T2) has a per-thread working memory; the master copy of every variable lives in the main-memory heap.
• A variable is loaded from main memory into working memory before use, and may then be modified in, e.g., T1's working memory.
• When T1 performs an unlock, the variable is written back to main memory.
• When T2 performs a lock, the (possibly stale) variable in T2's working memory is flushed.
• When T2 next uses the variable, it is re-loaded from main memory.
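In application code, these rules are exactly what a synchronized block relies on; a minimal sketch in ordinary Java (not JESSICA2-specific):

```java
// Sketch: the lock/unlock rules above are what make this exchange safe.
class SharedCounter {
    private int value = 0;          // master copy lives in main memory

    public synchronized void increment() {
        // lock: stale working copy flushed, value re-loaded from main memory
        value++;
        // unlock (on return): updated value written back to main memory
    }

    public synchronized int get() {
        return value;               // re-loaded from main memory after lock
    }
}
```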
Distributed Java Virtual Machine (DJVM)
• JESSICA2: a distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications, with a Single System Image illusion to Java threads.
(Figure: the Java threads created in a program run over a Global Object Space spanning the OSes of the PCs, connected by a high-speed network.)
Problems in Existing DJVMs
• Mostly based on interpreters: simple but slow.
• A layered design using a distributed shared memory (DSM) system cannot be tightly coupled with the JVM:
  • JVM runtime information cannot be channeled to the DSM.
  • False sharing if a page-based DSM is employed.
  • Page faults block the whole JVM.
• Programmer-specified thread distribution lacks transparency:
  • Multithreaded Java applications need to be rewritten.
  • No dynamic thread distribution (preemptive thread migration) for load balancing.
Related Work
• Method shipping: IBM cJVM
  • Like remote method invocation (RMI): when accessing object fields, a proxy redirects the flow of execution to the node where the object's master copy is located.
  • Executed in interpreter mode.
  • Load-balancing problem: affected by the object distribution.
• Page shipping: Rice U. Java/DSM, HKU JESSICA
  • Simple: the GOS is supported by a page-based distributed shared memory (e.g., TreadMarks, JUMP, JiaJia).
  • JVM runtime information cannot be channeled to the DSM.
  • Executed in interpreter mode.
• Object shipping: Hyperion, Jackal
  • Leverage an object-based DSM.
  • Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code.
JESSICA2 Main Features
• Transparent Java thread migration
  • Runtime capturing and restoring of thread execution context.
  • No source code modification, no bytecode instrumenting (preprocessing), no new API introduced.
  • Enables dynamic load balancing on clusters.
• Operates in Just-in-Time (JIT) compilation mode.
• Global Object Space (GOS)
  • A shared global heap spanning all cluster nodes.
  • Adaptive object home migration protocol.
• I/O redirection
JESSICA2 Architecture
Input: Java bytecode or source code, e.g.:

```java
public class ProducerConsumerTest {
    public static void main(String[] args) {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
    }
}
```
Transparent Thread Migration in JIT Mode
• Simple for interpreters (e.g., JESSICA):
  • The interpreter sits in the bytecode-decoding loop, which can be stopped by checking a migration flag.
  • The full state of a thread is available in the interpreter's data structures.
  • No register allocation.
• JIT-mode execution makes things complex (JESSICA2):
  • Native code has no clear bytecode boundaries.
  • How to deal with machine registers?
  • How to organize the stack frames (all are now in native form)?
  • How to make the extracted thread states portable and recognizable by the remote JVM?
  • How to restore the extracted states (rebuild the stack frames) and restart execution in native form?
  • The JIT compiler must be modified to instrument the native code it emits (see the sketch below).
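Conceptually, the modified JIT compiler emits the equivalent of the following check at each migration point. This is a Java-level sketch only: the real check is emitted as native code, and names such as migrationRequested and captureAndMigrate are illustrative assumptions:

```java
// Java-level sketch of the check the modified JIT emits (as native code)
// at migration points such as loop heads. All names are illustrative.
public class MigrationCheckSketch {
    static volatile boolean migrationRequested = false; // set by the scheduler

    static void computeLoop() {
        for (int i = 0; i < 1_000_000; i++) {
            // JESSICA2 inserts the equivalent of this test at the loop head:
            if (migrationRequested) {
                // Spill registers, capture the Java frames into a portable
                // form, and ship them to the destination node (sketched).
                captureAndMigrate();
                return;             // this thread continues remotely
            }
            // ... the original loop body runs at full native speed ...
        }
    }

    static void captureAndMigrate() { /* placeholder for frame capture */ }
}
```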
An Overview of JESSICA2 Java Thread Migration
(1) The load monitor alerts the thread scheduler in the JVM on the source node.
(2) Source node: stack analysis and stack capturing extract the thread's frames (PC, method area references, GOS heap objects).
(3) The captured frames are shipped to the migration manager on the destination node, which performs frame parsing and restores execution.
(4a) Object accesses from the migrated thread go through the GOS (heap).
(4b) Method code is loaded on the destination node, e.g., from NFS.
What are those functions?
• Migration-point selection
  • Delayed to the head of a loop's basic block or to a method entry.
• Register context handler
  • Spills dirty registers at the migration point without invalidation, so the native code can keep using the registers.
  • Uses a register-recovering stub in the restoring phase.
• Variable type deduction
  • Spills variable types into the stack, using compression.
• Java frame linking
  • Discovers consecutive Java frames.
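The captured state must be portable and recognizable by the remote JVM; below is a hedged sketch of what such a machine-independent thread context could contain (all class and field names here are assumptions for illustration, not JESSICA2's actual data structures):

```java
import java.io.Serializable;
import java.util.List;

// Hypothetical portable thread context produced by stack capturing.
// Field names are assumptions, not JESSICA2's on-the-wire format.
class PortableFrame implements Serializable {
    String className;          // re-resolved on the destination node
    String methodSignature;
    int bytecodePC;            // bytecode-level PC, not a native address
    Object[] localVariables;   // typed values recovered via type deduction
    Object[] operandStack;
}

class PortableThreadContext implements Serializable {
    List<PortableFrame> frames;  // consecutive Java frames, linked in order
}
```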
Dynamic Thread State Capturing and Restoring in JESSICA2
• Compilation pipeline: bytecode verifier -> migration point selection -> bytecode translation (intermediate code) -> register allocation -> code generation (native code) -> linking and constant resolution.
• The translation is instrumented to: (1) add migration checking (e.g., "cmp mflag,0; jz ..."), (2) add object checking (e.g., "cmp obj[offset],0; jz ..."), and (3) add type and register spilling into stack slots (e.g., "mov 0x110182, slot").
• Capturing: the native thread stack (a mix of Java and C frames) is scanned, and the Java frames are extracted via global object access.
• Restoring: generated register-recovering code (e.g., "mov slot1->reg1; mov slot2->reg2") rebuilds the register state from the spilled slots at the migration point.
How to Maintain Memory Consistency in a Distributed Environment?
(Figure: threads T1..T8 run over per-node heaps on PCs connected by a high-speed network; the separate heaps must behave as one consistent memory.)
Embedded Global Object Space (GOS)
• Main features:
  • Takes advantage of JVM runtime information for optimization (e.g., object types, accessing threads).
  • Uses a threaded I/O interface inside the JVM for communication, hiding latency: non-blocking GOS access.
  • Object-based, to reduce false sharing.
  • Home-based, compliant with the JVM memory model ("lazy release consistency").
  • Master heap (home objects) and cache heap (local and cached objects) reduce object-access latency.
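A minimal sketch of what a home-based cache check on each object access might look like, echoing the "cmp obj[offset],0" object check the JIT emits; the names (valid, fetchFromHome) are illustrative assumptions:

```java
// Sketch of home-based cached object access; all names are illustrative
// assumptions, not JESSICA2's internal API.
class CachedObject {
    volatile boolean valid = false;  // invalidated at lock (acquire) time
    final int homeNode;              // JVM holding the master copy
    private Object state;            // the cached field data

    CachedObject(int homeNode) { this.homeNode = homeNode; }

    Object read() {
        if (!valid) {                // the JIT emits this check inline,
            state = fetchFromHome(); // e.g. "cmp obj[offset],0; jz ..."
            valid = true;            // stays valid until the next acquire
        }
        return state;
    }

    private Object fetchFromHome() {
        // Placeholder: threaded, non-blocking fetch from homeNode.
        return new Object();
    }
}
```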
Adaptive Object Home Migration
• Definition: the "home" of an object is the JVM that holds the object's master copy.
• Problem: cached copies need to be flushed and re-fetched from the home whenever synchronization happens.
• Adaptive object home migration: if the number of accesses from one thread dominates the total number of accesses to an object, the object's home is migrated to the node where that thread is running (a sketch of such a heuristic follows).
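A hedged sketch of such a dominance heuristic, with per-node access counters; the counter layout and the 90% threshold are assumptions for illustration, not JESSICA2's actual values:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an adaptive home-migration decision. The counters and the
// dominance threshold are illustrative assumptions.
class HomeMigrationPolicy {
    static final double DOMINANCE = 0.9;  // assumed threshold

    private int totalAccesses = 0;
    private final Map<Integer, Integer> accessesByNode = new HashMap<>();

    // Called (conceptually) each time the home services an access.
    void recordAccess(int nodeId) {
        totalAccesses++;
        accessesByNode.merge(nodeId, 1, Integer::sum);
    }

    // If one node dominates the accesses, its id becomes the new home.
    int chooseHome(int currentHome) {
        for (Map.Entry<Integer, Integer> e : accessesByNode.entrySet()) {
            if (e.getValue() >= DOMINANCE * totalAccesses) {
                return e.getKey();
            }
        }
        return currentHome;
    }
}
```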
I/O Redirection
• Timer
  • The time on the master node is used as the standard time.
  • Worker nodes calibrate their time when they register with the master node.
• File I/O
  • Half of the fd word encodes the node number (see the sketch below).
  • Open: for read, check locally first, then the master node; for write, go to the master node.
  • Read/write: go to the node specified by the node number in the fd.
• Network I/O
  • Connectionless send: performed locally.
  • Everything else goes to the master node.
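A minimal sketch of the half-word fd encoding described above; the 16/16 bit split and the helper names are assumptions:

```java
// Sketch: pack a node number into the upper half-word of a file descriptor.
// The 16/16 bit split and these helper names are illustrative assumptions.
final class RedirectedFd {
    static int pack(int nodeId, int localFd) {
        return (nodeId << 16) | (localFd & 0xFFFF);
    }
    static int nodeOf(int fd)    { return fd >>> 16; }
    static int localFdOf(int fd) { return fd & 0xFFFF; }
}
```

A subsequent read or write on such an fd can then be routed to nodeOf(fd) and performed there on localFdOf(fd).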
Experimental Setting
• Modified Kaffe open-source JVM, version 1.0.6.
• Linux PC clusters:
  • Pentium II PCs at 540 MHz (Linux 2.2.1 kernel), connected by Fast Ethernet.
  • HKU Gideon 300 cluster (for the ray-tracing experiment).
Parallel Ray Tracing on JESSICA2 (running on a 64-node Gideon 300 cluster)
• Linux 2.4.18-3 kernel (Red Hat 7.3)
• 64 nodes: 108 seconds
• 1 node: 3430 seconds (~1 hour)
• Speedup = 4402/108 = 40.75
Conclusions
• Transparent Java thread migration in a JIT compiler enables high-performance execution of multithreaded Java applications on clusters while keeping the merits of Java:
  • JVM approach => dynamic class loading.
  • Just-in-Time compilation for speed.
• An embedded GOS layer can exploit JVM runtime information to reduce communication overhead.
Thanks • HKU SRG: • http://www.srg.csis.hku.hk/ • JESSICA2 Webpage: • http://www.csis.hku.hk/~clwang/projects/JESSICA2.html