On the Conformance of a Cluster Implementation of the Java Memory Model
Philip J. Hatcher, University of New Hampshire, Philip.Hatcher@unh.edu
Collaborators • ENS-Lyon • Gabriel Antoniu, Luc Bougé, Raymond Namyst
Cluster Environment • collection of machines connected by network • commercial distributed-memory parallel computer • “build-your-own” parallel computer
Cluster Implementation of Java • Single JVM running on cluster. • Nodes of cluster are transparent. • Multithreaded applications exploit multiple processors of cluster.
Examples • Java/DSM (Rice Univ. - Houston) • transparent heterogeneous computing • cJVM (IBM Research - Haifa) • scalable Java server • Jackal (Vrije Univ. – Amsterdam) • high-performance computing
Hyperion • Cluster implementation of Java developed at the Univ. of New Hampshire. • Currently built on top of the PM2 distributed, multithreaded runtime environment from ENS-Lyon.
Motivation • Use Java “as is” for high-performance computing • support computationally intensive applications • utilize parallel computing hardware
Why Java? • Explicitly parallel! • includes a threaded programming model • Relaxed memory model • consistency model aids an implementation on distributed-memory parallel computers
Java Threads • Threads are objects. • The class java/lang/Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.
java/lang/Thread methods • Thread() - constructor for thread object. • start() - start the thread executing. • run() - method invoked by ‘start’. • stop(), suspend(), resume(), join(), yield(). • setPriority().
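For reference, a minimal sketch of this API in use (the class name HelloThread and the printed message are ours, not from the talk):

    // A thread object is constructed around a Runnable, started, and joined.
    public class HelloThread {
        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    System.out.println("running in " + Thread.currentThread().getName());
                }
            });
            worker.setPriority(Thread.NORM_PRIORITY);
            worker.start();   // invokes run() in the new thread
            worker.join();    // wait for the worker to finish
        }
    }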
Java Synchronization • Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it. • Monitors utilize locks. • There is a lock associated with each object.
synchronized keyword • statement form: synchronized ( Exp ) Block • method form: public class Q { synchronized void put(…) { … } }
java/lang/Object methods • wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object, and the lock is released. • notify() - an arbitrary thread in the object's wait set is awakened and then competes again for the object's lock. • notifyAll() - all waiting threads are awakened.
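A minimal sketch (our own example, not from the talk) showing how these methods combine with synchronized in the usual wait-loop idiom:

    // One-slot buffer: put() and take() block using the object's wait set.
    public class OneSlotBuffer {
        private Object item;            // null means the slot is empty

        public synchronized void put(Object x) throws InterruptedException {
            while (item != null) {
                wait();                 // releases the lock; reacquires it when awakened
            }
            item = x;
            notifyAll();                // wake threads waiting in take()
        }

        public synchronized Object take() throws InterruptedException {
            while (item == null) {
                wait();                 // releases the lock; reacquires it when awakened
            }
            Object x = item;
            item = null;
            notifyAll();                // wake threads waiting in put()
            return x;
        }
    }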
Shared-Memory Model • Java threads execute in a virtual shared memory. • All threads are able to access all objects. • But threads may not access each other’s stacks.
Java Memory Consistency • A variant of release consistency. • Threads can keep locally cached copies of objects. • Consistency is provided by requiring that: • a thread's object cache be flushed upon entry to a monitor. • local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.
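A small sketch (ours) of what these rules mean for the programmer: the reader below is assured of seeing the writer's update only because both methods are synchronized, so the writer's monitor exit transmits its cached modification to main memory and the reader's monitor entry discards any stale cached copy.

    public class SharedCounter {
        private int value;

        // Monitor exit transmits the locally cached modification to main memory.
        public synchronized void increment() {
            value = value + 1;
        }

        // Monitor entry flushes the thread's object cache, so this use
        // observes the most recently stored value.
        public synchronized int get() {
            return value;
        }
    }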
General Hyperion Overview • prog.java → (javac, Sun's Java compiler) → prog.class (bytecode) → (java2c, instruction-wise translation) → prog.[ch] → (gcc -O6, linked with the runtime libraries) → prog
The Hyperion Run-Time System • Collection of modules to allow “plug-and-play” implementations: • inter-node communication • threads • memory and synchronization • etc
Thread and Object Allocation • Currently, threads are allocated to processors in round-robin fashion. • Currently, an object is allocated to the processor that holds the thread that is creating the object. • Currently, DSM-PM2 is used to implement the Java memory model.
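A sketch of these placement policies (names are ours; Hyperion's actual implementation is C code layered on PM2):

    // Round-robin thread placement; objects stay on the creating thread's node.
    public class Placement {
        private final int numNodes;
        private int nextNode = 0;

        public Placement(int numNodes) { this.numNodes = numNodes; }

        // New Java thread: pick the next node in turn.
        public synchronized int placeThread() {
            int node = nextNode;
            nextNode = (nextNode + 1) % numNodes;
            return node;
        }

        // New object: keep it on the node of the thread that creates it.
        public int placeObject(int creatingThreadNode) {
            return creatingThreadNode;
        }
    }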
Hyperion Internal Structure • Hyperion layer: load balancer, native Java API, thread subsystem, memory subsystem, communication subsystem • built on the PM2 API: pm2_rpc, pm2_thread_create, etc. • PM2 layer: DSM subsystem, thread subsystem, communication subsystem
PM2: A Distributed, Multithreaded Runtime Environment • Thread library: Marcel - user-level, SMP support, POSIX-like, preemptive thread migration • Communication library: Madeleine - portable (BIP, SISCI/SCI, MPI, TCP, PVM) and efficient
DSM-PM2: Architecture • layered components: DSM protocol policy / DSM protocol lib / DSM page manager / DSM comm, built over PM2 (Madeleine communications, Marcel threads) • DSM comm: send page request, send page, send invalidate request, … • DSM page manager: set/get page owner, set/get page access, add/remove to/from copyset, …
Hyperion’s DSM API • loadIntoCache • invalidateCache • updateMainMemory • get • put
DSM Implementation • Node-level caches. • Page-based and home-based protocol. • Use page faults to detect remote objects. • Log modifications made to remote objects. • Each node allocates objects from a different range of the virtual address space.
Using Page Faults: details • An object reference is the address of the base of the object. • loadIntoCache does nothing. • DSM-PM2 is used to handle page faults generated by the get/put primitives.
More details • When an object is allocated, its address is appended to a list attached to the page that contains its header. • When a page is loaded on a remote node, the list is used to turn on the header bit for all object headers on the page. • The put primitive checks the header bit to see if a modification should be logged. • updateMainMemory sends the logged changes to the home node.
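A Java-flavored sketch (ours; the real code is C in the Hyperion runtime) of the write path just described: put logs a modification only when the object's header bit marks it as remotely homed, and updateMainMemory ships the log to the home node.

    interface ObjectHeader { boolean isRemotelyHomed(); }                    // hypothetical
    interface HomeNodeChannel { void sendDiff(long address, long length); }  // hypothetical

    public class DsmWriteLog {
        // Per-node log of (address, length) entries for modified remote data.
        private final java.util.List<long[]> modifications = new java.util.ArrayList<long[]>();

        public void put(ObjectHeader header, long address, long length) {
            if (header.isRemotelyHomed()) {
                modifications.add(new long[] { address, length });
            }
            // ... then perform the actual store into the locally cached page ...
        }

        public void updateMainMemory(HomeNodeChannel channel) {
            for (long[] mod : modifications) {
                channel.sendDiff(mod[0], mod[1]);
            }
            modifications.clear();
        }
    }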
Benchmarking • Two Linux 2.2 clusters: • twelve 200 MHz Pentium Pro processors connected by a Myrinet switch, using BIP. • six 450 MHz Pentium II processors connected by an SCI network, using SISCI. • gcc 2.7.2.3 with -O6
Correctness • Does the Hyperion approach fully support the Java Memory Model? • Hyperion follows the operational specification of the JMM with two exceptions: • page granularity • node-level caches
Java Memory Model: the actors • Threads: lock, unlock, load, store, use, assign • Main Memory: lock, unlock, read, write
Entry to Monitor • thread actions: lock, load, use • paired main-memory actions: lock, read
Exit from Monitor • thread actions: assign, store, unlock • paired main-memory actions: write, unlock
Serializability • Main memory actions are serializable for a single variable or lock.
Page Granularity • Hyperion fetches remote memory locations with page granularity. • always OK to perform load/read actions “early”.
Node-level Caches • All threads on a node share a cache. • Cache contains values being accessed whose homes are on other nodes. • Values whose homes are on this node are directly accessed. • as if every use is immediately preceded by a load/read and every assign is immediately followed by a store/write.
Node-level Caches • If one thread invalidates the cache, the cache is invalidated for all threads. • always OK to do “extra” load/read actions. • If one thread updates main memory, all threads do. • always OK to do “early” store/write actions.
Node-level Caches • If one thread performs a load/read, then all threads see the result. • as if all threads perform the load/read. • If one thread performs an assign, then other threads see the result before the subsequent store/write. • hmmm...
Implementation Traces • load/read across cluster implemented by request-send message pair. • store/write across cluster implemented by transmit-receive message pair.
An Example • Home(x): send x (value 7) to Node 0; send x (7) to Node 1; receive x (17) from Node 0; receive x (19) from Node 1 • Node 0: request x; T0: use x (7); T1: assign x = 17; T2: use x (17); transmit x (17) • Node 1: request x; T3: use x (7); T4: assign x = 19; T5: use x (19); transmit x (19)
Serialization • Main Memory: read x (value 7) for T0; read x (7) for T3; write x (17) for T1; read x (17) for T2; write x (19) for T4; read x (19) for T5
Algorithm • The model-satisfying serialization can always be constructed from the node-level traces of the Hyperion actions. • Merge node traces, controlled by the pairing of request-send and transmit-receive actions. • Correct, if memory actions at the home node are serializable.
Therefore: • Node-level caches conform to the JMM. • Simplify implementation. • Reduce memory consumption. • Facilitate pre-fetching. • However, “invalidate one, invalidate all”.
Implementation Difficulties • Node-level concurrency makes the implementation tricky. (duh!) • concurrent page fetches • cache invalidate while page fetch pending? • thread reading page as it is being installed
Java API Additions? • These would be desirable: • barrier synchronizations • data reductions • query cluster configuration • i.e. number of nodes • Is this cheating? • no longer “as is” Java?
API implementation • Careful implementation of the API extensions can lessen the potential cost of "invalidate one, invalidate all". • The implementation of a barrier should only invalidate the local cache once all threads have reached the barrier, as sketched below.
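A sketch of the node-local half of such a barrier (ours; cross-node coordination is omitted and the cache-invalidation hook is passed in as a Runnable):

    // The node-level cache is invalidated exactly once per barrier episode,
    // by the last thread to arrive, instead of once per thread.
    public class ClusterBarrier {
        private final int parties;
        private final Runnable invalidateNodeCache;   // e.g. wraps the DSM invalidate primitive
        private int waiting = 0;
        private int generation = 0;

        public ClusterBarrier(int parties, Runnable invalidateNodeCache) {
            this.parties = parties;
            this.invalidateNodeCache = invalidateNodeCache;
        }

        public synchronized void await() throws InterruptedException {
            int myGeneration = generation;
            waiting++;
            if (waiting == parties) {
                // In a full cluster barrier this would run only after every
                // node has signaled arrival; here we show the local policy.
                invalidateNodeCache.run();
                waiting = 0;
                generation++;
                notifyAll();
            } else {
                while (generation == myGeneration) {
                    wait();
                }
            }
        }
    }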
Level of Transparency • Consider the current Hyperion thread/object allocation strategies: • not mandated by the Java Language Spec • might be superseded by a smarter run-time lib • but, still good guidelines for the programmer? • i.e. if I didn't create the object, it might be expensive to access. • not unreasonable to expect the user to be aware that there might be an extended memory hierarchy.
Final Thoughts • Java is an extremely attractive vehicle for programming parallel computers. • Can acceptable performance be obtained? • Scalar execution • DSM implementation • Application locality • Scalability