Cluster Computing with Java Threads. Philip J. Hatcher University of New Hampshire Philip.Hatcher@unh.edu. Collaborators. UNH/Hyperion Mark MacBeth and Keith McGuigan ENS-Lyon/DSM-PM2 Gabriel Antoniu, Luc Bougé and Raymond Namyst. Focus. Use Java “as is” for high-performance computing
Cluster Computing with Java Threads Philip J. Hatcher University of New Hampshire Philip.Hatcher@unh.edu
Collaborators • UNH/Hyperion • Mark MacBeth and Keith McGuigan • ENS-Lyon/DSM-PM2 • Gabriel Antoniu, Luc Bougé and Raymond Namyst
Focus • Use Java “as is” for high-performance computing • support computationally intensive applications • utilize parallel computing hardware
Outline • Our Vision • Java Threads • The PM2 Run-time Environment • Hyperion: Java Threads on Clusters • Evaluation • Related Work • Conclusions
Why Java? • Soon to be ubiquitous! • use of Java is growing very rapidly • Designed for portability: • develop programs on your desktop • run programs on a distant cluster
Why Java? • Explicitly parallel! • includes a threaded programming model • Relaxed memory model • consistency model aids an implementation on distributed-memory parallel computers
Unique Opportunity • Use Java to bring parallelism to the “masses” • Let’s not miss it! • But, programmers will not accept syntax or model changes
Open Question • Parallelism via Java access to distributed-computing techniques? • e.g. RMI (remote method invocation) • Or, parallelism via Java threads?
That is, ... • Does a user prefer to view a cluster as a collection of distinct machines? • Or, does a user prefer to view a cluster as a “black box” that will simply run Java code faster?
Climb out of the box! • Use Java threads “as is” to program clusters of computers. • Program for the threaded Java virtual machine. • Allow the implementation to handle the details of executing in a cluster.
Java Threads • Threads are objects. • The class java/lang/Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.
java/lang/Thread methods • Thread() - constructor for thread object. • start() - start the thread executing. • run() - method invoked by ‘start’. • stop(), suspend(), resume(), join(), yield(). • setPriority().
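As an illustration of the methods listed on this slide, here is a minimal sketch of the start()/run()/join() life cycle. The class names `ThreadBasics` and `Summer` and the summing workload are hypothetical, chosen only for the example; they are not part of Hyperion.

```java
final class ThreadBasics {
    /** Hypothetical worker: sums 1..n in its own thread. */
    static final class Summer extends Thread {
        private final int n;
        long sum; // read by the parent only after join()

        Summer(int n) { this.n = n; }

        @Override
        public void run() { // invoked by start() in the new thread
            long s = 0;
            for (int i = 1; i <= n; i++) s += i;
            sum = s;
        }
    }

    /** Starts one thread per entry of sizes and collects the results. */
    static long[] runAll(int[] sizes) {
        Summer[] workers = new Summer[sizes.length];
        for (int i = 0; i < sizes.length; i++) {
            workers[i] = new Summer(sizes[i]);
            workers[i].start(); // begin executing run() concurrently
        }
        long[] results = new long[sizes.length];
        try {
            for (int i = 0; i < sizes.length; i++) {
                workers[i].join(); // wait for the worker to finish
                results[i] = workers[i].sum;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }
}
```

Nothing in the code names a machine or a node, which is the property the "as is" approach relies on.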
Java Synchronization • Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it. • Monitors utilize locks. • There is a lock associated with each object.
synchronized keyword • synchronized ( Exp ) Block • public class Q { synchronized void put(…) { … }}
java/lang/Object methods • wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released. • notify() - an arbitrary thread in the wait set of this object is awakened and then competes again for the object's lock. • notifyAll() - all threads in the wait set are awakened.
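The monitor methods above are typically combined as in the following sketch of a bounded queue. The class name `Q` and its `put` method echo the earlier slide, but the body, the `get` method, the capacity, and the `transferSum` helper are assumptions made for illustration.

```java
final class Q {
    private final int[] items;
    private int head;  // index of the oldest element
    private int count; // number of stored elements

    Q(int capacity) { items = new int[capacity]; }

    /** Blocks while the queue is full. */
    synchronized void put(int v) throws InterruptedException {
        while (count == items.length) wait(); // releases the lock while waiting
        items[(head + count) % items.length] = v;
        count++;
        notifyAll(); // wake any consumers waiting in get()
    }

    /** Blocks while the queue is empty. */
    synchronized int get() throws InterruptedException {
        while (count == 0) wait();
        int v = items[head];
        head = (head + 1) % items.length;
        count--;
        notifyAll(); // wake any producers waiting in put()
        return v;
    }

    /** Hypothetical demo: a producer thread sends 1..n, the caller sums them. */
    static int transferSum(int n) {
        Q q = new Q(4);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) q.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < n; i++) sum += q.get();
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }
}
```

The wait() calls sit inside while loops because a thread awakened by notifyAll() must recheck its condition after it reacquires the lock.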
Shared-Memory Model • Java threads execute in a virtual shared memory. • All threads are able to access all objects. • But threads may not access each other’s stacks.
Java Memory Consistency • A variant of release consistency. • Threads can keep locally cached copies of objects. • Consistency is provided by requiring that: • a thread's object cache be flushed upon entry to a monitor. • local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.
PM2: A Distributed, Multithreaded Runtime Environment • Thread library: Marcel • User-level • Supports SMP • POSIX-like • Preemptive thread migration • Communication library: Madeleine • Portable: BIP, SISCI/SCI, MPI, TCP, PVM • Efficient
DSM-PM2: Architecture • Diagram components: • DSM-PM2: DSM protocol policy, DSM protocol lib, DSM page manager, DSM comm • PM2: Madeleine comms, Marcel threads • DSM comm: • send page request • send page • send invalidate request • … • DSM page manager: • set/get page owner • set/get page access • add/remove to/from copyset • ...
DSM-PM2: Performance • SCI cluster has 450 MHz Pentium II nodes • Myrinet cluster has 200 MHz Pentium Pro nodes
Hyperion • Executes threaded Java programs on clusters. • Built on top of PM2 and DSM-PM2. • Provides both portability and efficiency
Reversing the Bytecode Stream • Conventionally, users “pull” bytecode to their machines for local execution. • Our vision: • users develop their high-performance Java programs using the Java toolset on their desktop. • they then “push” the resulting bytecode to a Hyperion server for high-performance cycles.
Supporting High Performance • Utilizes a bytecode-to-C translator. • Parallel execution via spreading of Java threads across nodes of the cluster. • Java threads implemented as lightweight threads using PM2 library.
Compiling Java • Hyperion designed for computationally intensive applications, so small overhead of translating bytecode is not important. • Translating to C allows us to leverage the native C compiler and optimizer.
General Hyperion Overview • prog.java → javac (Sun's Java compiler) → prog.class (bytecode) • prog.class → java2c (instruction-wise translation) → prog.[ch] • prog.[ch] + runtime libraries (libs) → gcc -O6 → prog
The Hyperion Run-Time System • Collection of modules to allow “plug-and-play” implementations: • inter-node communication • threads • memory and synchronization • etc
Hyperion Internal Structure • Hyperion: load balancer, native Java API, thread subsystem, memory subsystem, comm. subsystem • PM2 API: pm2_rpc, pm2_thread_create, etc. • PM2: DSM subsystem, thread subsystem, comm. subsystem
Thread and Object Allocation • Currently, threads are allocated to processors in round-robin fashion. • Currently, an object is allocated to the processor that holds the thread that is creating the object. • Currently, DSM-PM2 is used to implement the Java memory model.
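The two placement rules on this slide can be sketched as follows. This is an illustrative model only: the class `RoundRobinPlacement` and its method names are hypothetical, and nodes are represented by plain integers rather than by anything from the PM2 API.

```java
final class RoundRobinPlacement {
    private final int nodeCount;
    private int next; // node that will receive the next new thread

    RoundRobinPlacement(int nodeCount) {
        if (nodeCount <= 0) throw new IllegalArgumentException("nodeCount must be positive");
        this.nodeCount = nodeCount;
    }

    /** Rule 1: new threads are handed to nodes 0, 1, ..., n-1, 0, 1, ... */
    synchronized int nodeForNewThread() {
        int node = next;
        next = (next + 1) % nodeCount;
        return node;
    }

    /** Rule 2: a new object lives on the node of the thread that creates it. */
    static int nodeForNewObject(int creatingThreadNode) {
        return creatingThreadNode;
    }
}
```

With three nodes, five successive thread creations land on nodes 0, 1, 2, 0, 1.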
Hyperion’s DSM API • loadIntoCache • invalidateCache • updateMainMemory • get • put
DSM Implementation • Node-level caches. • Page-based and home-based protocol. • Log mods made to remote objects. • Use explicit in-line checks in get/put. • Each node allocates objects from a different range of the virtual address space.
Details • Objects are aligned on 64-byte boundaries. • An object reference is the address of the base of the object. • The bottom 6 bits of the ref can be used to store the node number of the object’s home.
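Because 64 = 2^6, a 64-byte-aligned address always has six zero low-order bits, which is what leaves room for a node number between 0 and 63. The sketch below shows that arithmetic on a Java long standing in for a reference; the class `ObjectRef` and its method names are hypothetical and do not come from Hyperion's source.

```java
final class ObjectRef {
    static final long ALIGNMENT = 64;            // objects start on 64-byte boundaries
    static final long NODE_MASK = ALIGNMENT - 1; // 0x3F: the bottom 6 bits

    /** Packs a home-node number into the unused low bits of an aligned address. */
    static long encode(long baseAddress, int homeNode) {
        if ((baseAddress & NODE_MASK) != 0)
            throw new IllegalArgumentException("address is not 64-byte aligned");
        if (homeNode < 0 || homeNode > NODE_MASK)
            throw new IllegalArgumentException("node number must fit in 6 bits");
        return baseAddress | homeNode;
    }

    /** Extracts the home-node number from the bottom 6 bits. */
    static int homeNode(long ref) {
        return (int) (ref & NODE_MASK);
    }

    /** Recovers the aligned base address by clearing the bottom 6 bits. */
    static long address(long ref) {
        return ref & ~NODE_MASK;
    }
}
```

For example, `encode(0x1000, 5)` yields `0x1005`, from which `homeNode` recovers 5 and `address` recovers `0x1000`.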
More details • loadIntoCache checks the 6 bits to see if an object is remote. • If so, and if not already locally cached, DSM-PM2 is used to load the page(s) containing the object. • When a remote object is cached, a bit is turned on in its header.
Yet more details • The put primitive checks the header bit to see if a modification should be logged. • updateMainMemory sends the logged changes to the home node.
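The interplay of the primitives described on the last few slides can be modeled in a few lines. The slides give only the primitive names, so every signature below is an assumption. This toy version also treats every object as remote, replaces pages with whole-object copies, and omits the header bit and the home-node check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of one node's object cache; not Hyperion's implementation. */
final class ToyNodeCache {
    /** Stand-in for the home nodes' memory: object id -> fields. */
    private final Map<Integer, int[]> mainMemory;
    private final Map<Integer, int[]> cache = new HashMap<>();
    /** Logged modifications, each stored as {objectId, fieldIndex, value}. */
    private final List<int[]> log = new ArrayList<>();

    ToyNodeCache(Map<Integer, int[]> mainMemory) { this.mainMemory = mainMemory; }

    /** Fetches a private copy of the object unless one is already cached. */
    void loadIntoCache(int id) {
        if (!cache.containsKey(id)) cache.put(id, mainMemory.get(id).clone());
    }

    /** Reads a field from the cached copy. */
    int get(int id, int field) { return cache.get(id)[field]; }

    /** Writes the cached copy and logs the change for the home node. */
    void put(int id, int field, int value) {
        cache.get(id)[field] = value;
        log.add(new int[] {id, field, value});
    }

    /** Sends the logged changes home, as on monitor exit. */
    void updateMainMemory() {
        for (int[] e : log) mainMemory.get(e[0])[e[1]] = e[2];
        log.clear();
    }

    /** Drops all cached copies, as on monitor entry. */
    void invalidateCache() { cache.clear(); }
}
```

In this model a put is invisible to other nodes until updateMainMemory runs, and a node sees another node's update only after invalidateCache forces a fresh loadIntoCache.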
Evaluation • Minimal-cost map-coloring application. • Branch-and-bound algorithm. • 64 threads, each with its own priority queue. • Current best solution is shared. • Problem size: 29 eastern-most states of USA with 4 colors of differing costs.
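A shared "current best solution" in a branch-and-bound search is commonly kept behind a monitor, as in the sketch below. The class `SharedBest` and its methods are hypothetical; the slides do not show how the map-coloring program actually shares its bound.

```java
final class SharedBest {
    private int bestCost = Integer.MAX_VALUE; // no solution found yet

    /** Records cost if it improves on the best so far; returns true if it did. */
    synchronized boolean offer(int cost) {
        if (cost < bestCost) {
            bestCost = cost;
            return true;
        }
        return false;
    }

    /** Returns the best cost seen so far. */
    synchronized int best() {
        return bestCost;
    }

    /** A partial solution can be pruned once its cost lower bound cannot win. */
    boolean canPrune(int lowerBound) {
        return lowerBound >= best();
    }
}
```

Each worker thread would call `canPrune` before expanding an entry from its own priority queue and `offer` whenever it completes a coloring. Under the memory model described earlier, the synchronized methods are also what make one thread's improved bound visible to the others.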
Experimental Setting • Two Linux 2.2 clusters: • eight 200 MHz Pentium Pro processors connected by Myrinet switch and using MPI over BIP. • four 450 MHz Pentium II processors connected by a SCI network and using SISCI. • gcc 2.7.2.3 with -O6
Baseline Performance • Compared serial Java to serial C for map-coloring application. • Each program has single queue, single thread.
Serial Java versus Serial C • Java v2: DSM checks disabled • Java v3: DSM and array-bound checks disabled • Executing on a single 450 MHz Pentium II
Inline checks are expensive! • Genericity of DSM-PM2 allows an alternative implementation. • Use page-fault detection rather than inline check to detect non-local object.
Using Page Faults: details • An object reference is the address of the base of the object. • loadIntoCache does nothing. • DSM-PM2 is used to handle page faults generated by the get/put primitives.
More details • When an object is allocated, its address is appended to a list attached to the page that contains its header. • When a page is loaded on a remote node, the list is used to turn on the header bit for all object headers on the page. • The put primitive uses the header bit in the same manner as inline-check version.
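The per-page bookkeeping described on this slide might look like the sketch below. It models a single copy of each header, assumes a 4096-byte page size, and uses hypothetical names throughout (`PageObjectLists`, `Header`, `onPageLoadedRemotely`); none of these come from Hyperion or DSM-PM2.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PageObjectLists {
    static final long PAGE_SIZE = 4096; // assumed page size

    /** Minimal object header: its address plus the "cached remotely" bit. */
    static final class Header {
        final long address;
        boolean cachedRemotely;
        Header(long address) { this.address = address; }
    }

    /** Page number -> headers of the objects whose header lies on that page. */
    private final Map<Long, List<Header>> byPage = new HashMap<>();

    /** On allocation, append the object to the list of its header's page. */
    Header allocate(long address) {
        Header h = new Header(address);
        long page = address / PAGE_SIZE;
        List<Header> headers = byPage.get(page);
        if (headers == null) {
            headers = new ArrayList<>();
            byPage.put(page, headers);
        }
        headers.add(h);
        return h;
    }

    /** When a page arrives on a remote node, set the bit on every header on it. */
    void onPageLoadedRemotely(long page) {
        List<Header> headers = byPage.get(page);
        if (headers == null) return; // no object headers on this page
        for (Header h : headers) h.cachedRemotely = true;
    }
}
```

The list is needed because a page fault delivers a whole page at once, so every object on it becomes cached without any per-object call to loadIntoCache.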
Inline Check versus Page Fault • IC has higher overhead for accessing objects (either local or locally cached). • PF has higher overhead (signal handling and memory protection) for loading a page into the cache.
IC versus PF: serial map-coloring • Java XX v2: DSM checks disabled • Java XX v3: DSM and array-bound checks disabled • Executing on a single 450 MHz Pentium II
IC versus PF: parallel map-coloring • Executing on the 450 MHz/SCI cluster
Related Work • Java/MPI: cluster nodes are explicit • Java/RMI: ditto • Remote objects via RMI: nearly transparent • e.g. JavaParty, Do! • Distributed interpreters • e.g. Java/DSM, MultiJav, cJVM
Conclusions • Approach is clean: Java “as is” • Approach is promising • good parallelizability for map-coloring • need better scalar compilation • e.g. array bound-check removal • need further parallel application studies • are thread/object placement heuristics sufficient for programmers to write efficient programs?