On the Conformance of a Cluster Implementation of the Java Memory Model
Philip J. Hatcher, University of New Hampshire, Philip.Hatcher@unh.edu
Collaborators • ENS-Lyon • Gabriel Antoniu, Luc Bougé, Raymond Namyst
Cluster Environment • collection of machines connected by network • commercial distributed-memory parallel computer • “build-your-own” parallel computer
Cluster Implementation of Java • Single JVM running on cluster. • Nodes of cluster are transparent. • Multithreaded applications exploit multiple processors of cluster.
Examples • Java/DSM (Rice Univ. - Houston) • transparent heterogeneous computing • cJVM (IBM Research - Haifa) • scalable Java server • Jackal (Vrije Univ. – Amsterdam) • high-performance computing
Hyperion • Cluster implementation of Java developed at the Univ. of New Hampshire. • Currently built on top of the PM2 distributed, multithreaded runtime environment from ENS-Lyon.
Motivation • Use Java “as is” for high-performance computing • support computationally intensive applications • utilize parallel computing hardware
Why Java? • Explicitly parallel! • includes a threaded programming model • Relaxed memory model • consistency model aids an implementation on distributed-memory parallel computers
Java Threads • Threads are objects. • The class java/lang/Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.
java/lang/Thread methods • Thread() - constructor for thread object. • start() - start the thread executing. • run() - method invoked by ‘start’. • stop(), suspend(), resume(), join(), yield(). • setPriority().
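For reference, a minimal sketch of this API in use (the class name HelloThread and the printed message are ours, not from the talk):

    // A thread object is constructed around a Runnable, started, and joined.
    public class HelloThread {
        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    System.out.println("running in " + Thread.currentThread().getName());
                }
            });
            worker.setPriority(Thread.NORM_PRIORITY);
            worker.start();   // invokes run() in the new thread
            worker.join();    // wait for the worker to finish
        }
    }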
Java Synchronization • Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it. • Monitors utilize locks. • There is a lock associated with each object.
synchronized keyword • statement form: synchronized ( Exp ) Block • method form: public class Q { synchronized void put(…) { … } }
java/lang/Object methods • wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object, and the lock is released. • notify() - an arbitrary thread in the object's wait set is awakened and then competes again for the object's lock. • notifyAll() - all waiting threads are awakened.
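A minimal sketch (our own example, not from the talk) showing how these methods combine with synchronized in the usual wait-loop idiom:

    // One-slot buffer: put() and take() block using the object's wait set.
    public class OneSlotBuffer {
        private Object item;            // null means the slot is empty

        public synchronized void put(Object x) throws InterruptedException {
            while (item != null) {
                wait();                 // releases the lock; reacquires it when awakened
            }
            item = x;
            notifyAll();                // wake threads waiting in take()
        }

        public synchronized Object take() throws InterruptedException {
            while (item == null) {
                wait();                 // releases the lock; reacquires it when awakened
            }
            Object x = item;
            item = null;
            notifyAll();                // wake threads waiting in put()
            return x;
        }
    }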
Shared-Memory Model • Java threads execute in a virtual shared memory. • All threads are able to access all objects. • But threads may not access each other’s stacks.
Java Memory Consistency • A variant of release consistency. • Threads can keep locally cached copies of objects. • Consistency is provided by requiring that: • a thread's object cache be flushed upon entry to a monitor. • local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.
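A small sketch (ours) of what these rules mean for the programmer: the reader below is assured of seeing the writer's update only because both methods are synchronized, so the writer's monitor exit transmits its cached modification to main memory and the reader's monitor entry discards any stale cached copy.

    public class SharedCounter {
        private int value;

        // Monitor exit transmits the locally cached modification to main memory.
        public synchronized void increment() {
            value = value + 1;
        }

        // Monitor entry flushes the thread's object cache, so this use
        // observes the most recently stored value.
        public synchronized int get() {
            return value;
        }
    }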
General Hyperion Overview • prog.java → (javac, Sun's Java compiler) → prog.class (bytecode) → (java2c, instruction-wise translation) → prog.[ch] → (gcc -O6, linked with the runtime libraries) → prog
The Hyperion Run-Time System • Collection of modules to allow “plug-and-play” implementations: • inter-node communication • threads • memory and synchronization • etc
Thread and Object Allocation • Currently, threads are allocated to processors in round-robin fashion. • Currently, an object is allocated to the processor that holds the thread that is creating the object. • Currently, DSM-PM2 is used to implement the Java memory model.
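A sketch of these placement policies (names are ours; Hyperion's actual implementation is C code layered on PM2):

    // Round-robin thread placement; objects stay on the creating thread's node.
    public class Placement {
        private final int numNodes;
        private int nextNode = 0;

        public Placement(int numNodes) { this.numNodes = numNodes; }

        // New Java thread: pick the next node in turn.
        public synchronized int placeThread() {
            int node = nextNode;
            nextNode = (nextNode + 1) % numNodes;
            return node;
        }

        // New object: keep it on the node of the thread that creates it.
        public int placeObject(int creatingThreadNode) {
            return creatingThreadNode;
        }
    }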
Hyperion Internal Structure • Hyperion layer: load balancer, native Java API, thread subsystem, memory subsystem, communication subsystem • built on the PM2 API: pm2_rpc, pm2_thread_create, etc. • PM2 layer: DSM subsystem, thread subsystem, communication subsystem
PM2: A Distributed, Multithreaded Runtime Environment • Thread library: Marcel - user-level, SMP support, POSIX-like, preemptive thread migration • Communication library: Madeleine - portable (BIP, SISCI/SCI, MPI, TCP, PVM) and efficient
DSM-PM2: Architecture • layered components: DSM protocol policy / DSM protocol lib / DSM page manager / DSM comm, built over PM2 (Madeleine communications, Marcel threads) • DSM comm: send page request, send page, send invalidate request, … • DSM page manager: set/get page owner, set/get page access, add/remove to/from copyset, …
Hyperion’s DSM API • loadIntoCache • invalidateCache • updateMainMemory • get • put
DSM Implementation • Node-level caches. • Page-based and home-based protocol. • Use page faults to detect remote objects. • Log modifications made to remote objects. • Each node allocates objects from a different range of the virtual address space.
Using Page Faults: details • An object reference is the address of the base of the object. • loadIntoCache does nothing. • DSM-PM2 is used to handle page faults generated by the get/put primitives.
More details • When an object is allocated, its address is appended to a list attached to the page that contains its header. • When a page is loaded on a remote node, the list is used to turn on the header bit for all object headers on the page. • The put primitive checks the header bit to see if a modification should be logged. • updateMainMemory sends the logged changes to the home node.
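A Java-flavored sketch (ours; the real code is C in the Hyperion runtime) of the write path just described: put logs a modification only when the object's header bit marks it as remotely homed, and updateMainMemory ships the log to the home node.

    interface ObjectHeader { boolean isRemotelyHomed(); }                    // hypothetical
    interface HomeNodeChannel { void sendDiff(long address, long length); }  // hypothetical

    public class DsmWriteLog {
        // Per-node log of (address, length) entries for modified remote data.
        private final java.util.List<long[]> modifications = new java.util.ArrayList<long[]>();

        public void put(ObjectHeader header, long address, long length) {
            if (header.isRemotelyHomed()) {
                modifications.add(new long[] { address, length });
            }
            // ... then perform the actual store into the locally cached page ...
        }

        public void updateMainMemory(HomeNodeChannel channel) {
            for (long[] mod : modifications) {
                channel.sendDiff(mod[0], mod[1]);
            }
            modifications.clear();
        }
    }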
Benchmarking • Two Linux 2.2 clusters: • twelve 200 MHz Pentium Pro processors connected by a Myrinet switch, using BIP. • six 450 MHz Pentium II processors connected by an SCI network, using SISCI. • gcc 2.7.2.3 with -O6
Correctness • Does the Hyperion approach fully support the Java Memory Model? • Hyperion follows the operational specification of the JMM with two exceptions: • page granularity • node-level caches
Java Memory Model: the actors • Threads: lock, unlock, load, store, use, assign • Main Memory: lock, unlock, read, write
Entry to Monitor • thread actions: lock, load, use • paired main-memory actions: lock, read
Exit from Monitor • thread actions: assign, store, unlock • paired main-memory actions: write, unlock
Serializability • Main memory actions are serializable for a single variable or lock.
Page Granularity • Hyperion fetches remote memory locations with page granularity. • always OK to perform load/read actions “early”.
Node-level Caches • All threads on a node share a cache. • Cache contains values being accessed whose homes are on other nodes. • Values whose homes are on this node are directly accessed. • as if every use is immediately preceded by a load/read and every assign is immediately followed by a store/write.
Node-level Caches • If one thread invalidates the cache, the cache is invalidated for all threads. • always OK to do “extra” load/read actions. • If one thread updates main memory, all threads do. • always OK to do “early” store/write actions.
Node-level Caches • If one thread performs a load/read, then all threads see the result. • as if all threads perform the load/read. • If one thread performs an assign, then other threads see the result before the subsequent store/write. • hmmm...
Implementation Traces • load/read across cluster implemented by request-send message pair. • store/write across cluster implemented by transmit-receive message pair.
An Example • Home(x): send x (value 7) to Node 0; send x (7) to Node 1; receive x (17) from Node 0; receive x (19) from Node 1 • Node 0: request x; T0: use x (7); T1: assign x = 17; T2: use x (17); transmit x (17) • Node 1: request x; T3: use x (7); T4: assign x = 19; T5: use x (19); transmit x (19)
Serialization • Main Memory: read x (value 7) for T0; read x (7) for T3; write x (17) for T1; read x (17) for T2; write x (19) for T4; read x (19) for T5
Algorithm • The model-satisfying serialization can always be constructed from the node-level traces of the Hyperion actions. • Merge node traces, controlled by the pairing of request-send and transmit-receive actions. • Correct, if memory actions at the home node are serializable.
Therefore: • Node-level caches conform to the JMM. • Simplify implementation. • Reduce memory consumption. • Facilitate pre-fetching. • However, “invalidate one, invalidate all”.
Implementation Difficulties • Node-level concurrency makes the implementation tricky. (duh!) • concurrent page fetches • cache invalidate while page fetch pending? • thread reading page as it is being installed
Java API Additions? • These would be desirable: • barrier synchronizations • data reductions • query cluster configuration • i.e. number of nodes • Is this cheating? • no longer “as is” Java?
API implementation • Careful implementation of the API extensions can lessen the potential cost of "invalidate one, invalidate all". • The implementation of a barrier should only invalidate the local cache once all threads have reached the barrier, as sketched below.
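A sketch of the node-local half of such a barrier (ours; cross-node coordination is omitted and the cache-invalidation hook is passed in as a Runnable):

    // The node-level cache is invalidated exactly once per barrier episode,
    // by the last thread to arrive, instead of once per thread.
    public class ClusterBarrier {
        private final int parties;
        private final Runnable invalidateNodeCache;   // e.g. wraps the DSM invalidate primitive
        private int waiting = 0;
        private int generation = 0;

        public ClusterBarrier(int parties, Runnable invalidateNodeCache) {
            this.parties = parties;
            this.invalidateNodeCache = invalidateNodeCache;
        }

        public synchronized void await() throws InterruptedException {
            int myGeneration = generation;
            waiting++;
            if (waiting == parties) {
                // In a full cluster barrier this would run only after every
                // node has signaled arrival; here we show the local policy.
                invalidateNodeCache.run();
                waiting = 0;
                generation++;
                notifyAll();
            } else {
                while (generation == myGeneration) {
                    wait();
                }
            }
        }
    }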
Level of Transparency • Consider the current Hyperion thread/object allocation strategies: • not mandated by the Java Language Spec • might be superseded by a smarter run-time lib • but, still good guidelines for the programmer? • i.e. if I didn't create the object, it might be expensive to access. • not unreasonable to expect the user to be aware that there might be an extended memory hierarchy.
Final Thoughts • Java is an extremely attractive vehicle for programming parallel computers. • Can acceptable performance be obtained? • Scalar execution • DSM implementation • Application locality • Scalability