On the Conformance of a Cluster Implementation of the Java Memory Model



Presentation Transcript


  1. On the Conformance of a Cluster Implementation of the Java Memory Model Philip J. Hatcher University of New Hampshire Philip.Hatcher@unh.edu

  2. Collaborators • ENS-Lyon • Gabriel Antoniu, Luc Bougé, Raymond Namyst

  3. Cluster Environment • collection of machines connected by network • commercial distributed-memory parallel computer • “build-your-own” parallel computer

  4. Cluster Implementation of Java • Single JVM running on cluster. • Nodes of cluster are transparent. • Multithreaded applications exploit multiple processors of cluster.

  5. Examples • Java/DSM (Rice Univ. - Houston) • transparent heterogeneous computing • cJVM (IBM Research - Haifa) • scalable Java server • Jackal (Vrije Univ. – Amsterdam) • high-performance computing

  6. Hyperion • Cluster implementation of Java developed at the Univ. of New Hampshire. • Currently built on top of the PM2 distributed, multithreaded runtime environment from ENS-Lyon.

  7. Motivation • Use Java “as is” for high-performance computing • support computationally intensive applications • utilize parallel computing hardware

  8. Why Java? • Explicitly parallel! • includes a threaded programming model • Relaxed memory model • consistency model aids an implementation on distributed-memory parallel computers

  9. Java Threads • Threads are objects. • The class java/lang/Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.

  10. java/lang/Thread methods • Thread() - constructor for thread object. • start() - start the thread executing. • run() - method invoked by ‘start’. • stop(), suspend(), resume(), join(), yield(). • setPriority().
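
As an illustration of the methods listed above, here is a minimal sketch in which each worker is a Thread object, started with start() and awaited with join(). The class name `PartialSum` and the helper `parallelSum` are hypothetical and are not taken from Hyperion or its benchmarks.

```java
// Minimal sketch: threads are objects; start() causes run() to execute in a new thread.
class PartialSum extends Thread {
    private final int from, to;   // half-open range [from, to)
    long result;                  // read by the parent only after join()

    PartialSum(int from, int to) { this.from = from; this.to = to; }

    @Override
    public void run() {
        long s = 0;
        for (int i = from; i < to; i++) s += i;
        result = s;
    }

    // Hypothetical helper: sums 0..n-1 using the given number of worker threads.
    static long parallelSum(int n, int workers) {
        PartialSum[] ts = new PartialSum[workers];
        int chunk = (n + workers - 1) / workers;
        for (int w = 0; w < workers; w++) {
            int lo = Math.min(n, w * chunk);
            int hi = Math.min(n, lo + chunk);
            ts[w] = new PartialSum(lo, hi);
            ts[w].start();
        }
        long total = 0;
        for (PartialSum t : ts) {
            try {
                t.join();   // join() makes the worker's write to result visible here
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            total += t.result;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1000, 4));   // prints 499500
    }
}
```

On a cluster JVM such as the one described in these slides, the same unmodified program could have its workers placed on different nodes; nothing in the source mentions the cluster.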

  11. Java Synchronization • Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it. • Monitors utilize locks. • There is a lock associated with each object.

  12. synchronized keyword • synchronized ( Exp ) Block • public class Q { synchronized void put(…) { … }}
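
The two forms on this slide can be sketched side by side. The `Counter` class and its `hammer` helper below are hypothetical names introduced only for illustration: the statement form locks an explicit object, while the method form locks `this`.

```java
// Minimal sketch of the two uses of the synchronized keyword.
class Counter {
    private final Object lock = new Object();
    private int viaBlock = 0;
    private int viaMethod = 0;

    void incBlock() {
        synchronized (lock) {      // statement form: synchronized ( Exp ) Block
            viaBlock++;
        }
    }

    synchronized void incMethod() { // method form: acquires the lock of this object
        viaMethod++;
    }

    // Hypothetical driver: several threads increment both counters concurrently.
    static int[] hammer(int threads, int perThread) {
        Counter c = new Counter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    c.incBlock();
                    c.incMethod();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return new int[] { c.viaBlock, c.viaMethod };
    }

    public static void main(String[] args) {
        int[] r = hammer(4, 10000);
        System.out.println(r[0] + " " + r[1]);   // prints 40000 40000
    }
}
```

Without the synchronized regions the increments could interleave and lose updates; with them, each counter ends at exactly threads × perThread.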

  13. java/lang/Object methods • wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released. • notify() - an arbitrary thread in the wait set of this object is awakened and then competes again to get the lock for the object. • notifyAll() - all waiting threads are awakened.
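
A minimal sketch of how wait() and notifyAll() are typically combined: a one-element buffer in which the producer waits while the slot is full and the consumer waits while it is empty. The class name `Slot` and the helper `transfer` are hypothetical; they are not the `Q` class of the previous slide.

```java
// Minimal sketch: a one-slot buffer guarded by the object's monitor.
class Slot {
    private Integer value;   // null means the slot is empty

    synchronized void put(int v) throws InterruptedException {
        while (value != null) wait();   // wait() releases the lock while blocked
        value = v;
        notifyAll();                    // wake any thread blocked in get()
    }

    synchronized int get() throws InterruptedException {
        while (value == null) wait();
        int v = value;
        value = null;
        notifyAll();                    // wake any thread blocked in put()
        return v;
    }

    // Hypothetical driver: a producer sends 1..n, the caller sums what it receives.
    static long transfer(int n) {
        Slot q = new Slot();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) q.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        long sum = 0;
        try {
            for (int i = 0; i < n; i++) sum += q.get();
            producer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(transfer(100));   // prints 5050
    }
}
```

The wait() calls sit inside while loops because an awakened thread must compete for the lock again and may find the condition changed by the time it runs.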

  14. Shared-Memory Model • Java threads execute in a virtual shared memory. • All threads are able to access all objects. • But threads may not access each other’s stacks.

  15. Java Memory Consistency • A variant of release consistency. • Threads can keep locally cached copies of objects. • Consistency is provided by requiring that: • a thread's object cache be flushed upon entry to a monitor. • local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.
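
The two rules above can be made concrete with a toy, single-threaded simulation. Everything here is an assumption made for illustration: `ToyCache`, its method names, and the use of a map as "central memory" are hypothetical and do not reflect how any JVM or Hyperion is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of one thread's cache in front of a shared central memory.
// Not thread-safe: the demo interleaves the "threads" by hand.
class ToyCache {
    private final Map<String, Integer> main;                     // central memory
    private final Map<String, Integer> cached = new HashMap<>(); // local copies
    private final Map<String, Integer> dirty = new HashMap<>();  // local modifications

    ToyCache(Map<String, Integer> main) { this.main = main; }

    int use(String var) {
        // Fetch from central memory only on a cache miss.
        if (!cached.containsKey(var)) cached.put(var, main.getOrDefault(var, 0));
        return cached.get(var);
    }

    void assign(String var, int v) {
        cached.put(var, v);
        dirty.put(var, v);
    }

    void enterMonitor() {
        // Choice made by this sketch: write back pending changes so they are
        // not lost, then drop all cached copies (rule 1: flush on entry).
        exitMonitor();
        cached.clear();
    }

    void exitMonitor() {
        main.putAll(dirty);   // rule 2: transmit modifications on exit
        dirty.clear();
    }

    // Returns what thread b observes for x: {first, stale, fresh}.
    static int[] demo() {
        Map<String, Integer> main = new HashMap<>();
        main.put("x", 7);
        ToyCache a = new ToyCache(main);
        ToyCache b = new ToyCache(main);
        int first = b.use("x");          // 7, now cached by b
        a.enterMonitor();
        a.assign("x", 17);
        a.exitMonitor();                 // central memory now holds 17
        int stale = b.use("x");          // still 7: b has not entered a monitor
        b.enterMonitor();
        int fresh = b.use("x");          // 17: cache was flushed on entry
        b.exitMonitor();
        return new int[] { first, stale, fresh };
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);   // prints 7 7 17
    }
}
```

The stale read in the middle is what the relaxed model permits, and it is what lets a distributed implementation avoid communicating on every access.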

  16. General Hyperion Overview • prog.java → javac (Sun's Java compiler) → prog.class (bytecode) • prog.class → java2c (instruction-wise translation) → prog.[ch] • prog.[ch] + runtime libraries (libs) → gcc -O6 → prog

  17. The Hyperion Run-Time System • Collection of modules to allow “plug-and-play” implementations: • inter-node communication • threads • memory and synchronization • etc

  18. Thread and Object Allocation • Currently, threads are allocated to processors in round-robin fashion. • Currently, an object is allocated to the processor that holds the thread that is creating the object. • Currently, DSM-PM2 is used to implement the Java memory model.
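
The first two policies on this slide amount to very little code. The sketch below is only an illustration under assumed names (`Placement`, `placeThread`, `placeObject`); it is not Hyperion's allocator.

```java
// Minimal sketch of the placement policies: round-robin threads, creator-local objects.
class Placement {
    private final int nodes;
    private int nextNode = 0;

    Placement(int nodes) {
        if (nodes <= 0) throw new IllegalArgumentException("need at least one node");
        this.nodes = nodes;
    }

    // Each new thread goes to the next node in round-robin order.
    synchronized int placeThread() {
        int chosen = nextNode;
        nextNode = (nextNode + 1) % nodes;
        return chosen;
    }

    // An object is homed on the node of the thread that creates it.
    static int placeObject(int creatingThreadNode) {
        return creatingThreadNode;
    }

    public static void main(String[] args) {
        Placement p = new Placement(3);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5; i++) sb.append(p.placeThread()).append(' ');
        System.out.println(sb.toString().trim());   // prints 0 1 2 0 1
    }
}
```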

  19. Hyperion Internal Structure • Hyperion layer: Load balancer, Native Java API, Thread subsystem, Memory subsystem, Comm. subsystem • PM2 API: pm2_rpc, pm2_thread_create, etc. • PM2 layer: DSM subsystem, Thread subsystem, Comm. subsystem

  20. PM2: A Distributed, Multithreaded Runtime Environment • Thread library: Marcel • User-level • Supports SMP • POSIX-like • Preemptive thread migration • Communication library: Madeleine • Portable: BIP, SISCI/SCI, MPI, TCP, PVM • Efficient

  21. DSM-PM2: Architecture • DSM-PM2 layers: DSM Protocol Policy, DSM Protocol lib, DSM Page Manager, DSM Comm • PM2 layers: Madeleine Comms, Marcel Threads • DSM comm: • send page request • send page • send invalidate request • … • DSM page manager: • set/get page owner • set/get page access • add/remove to/from copyset • ...

  22. Hyperion’s DSM API • loadIntoCache • invalidateCache • updateMainMemory • get • put

  23. DSM Implementation • Node-level caches. • Page-based and home-based protocol. • Use page faults to detect remote objects. • Log modifications made to remote objects. • Each node allocates objects from a different range of the virtual address space.

  24. Using Page Faults: details • An object reference is the address of the base of the object. • loadIntoCache does nothing. • DSM-PM2 is used to handle page faults generated by the get/put primitives.

  25. More details • When an object is allocated, its address is appended to a list attached to the page that contains its header. • When a page is loaded on a remote node, the list is used to turn on the header bit for all object headers on the page. • The put primitive checks the header bit to see if a modification should be logged. • updateMainMemory sends the logged changes to the home node.
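
The interplay of the header bit, put, and updateMainMemory can be sketched as follows. This is a simplified model and not Hyperion's code (which is generated C on top of DSM-PM2): the `DsmSketch` class, the `fetch` method standing in for a page being loaded, and the representation of objects as int arrays are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of one node's view: cached copies carry a header bit,
// and put() logs modifications to them for later transmission.
class DsmSketch {
    static final class Obj {
        final int[] fields;
        boolean headerBit;   // on when this is a cached copy on a non-home node
        Obj home;            // for a cached copy, the object at the home node
        Obj(int nFields) { fields = new int[nFields]; }
    }

    private static final class Change {
        final Obj home; final int field; final int value;
        Change(Obj home, int field, int value) {
            this.home = home; this.field = field; this.value = value;
        }
    }

    private final List<Change> log = new ArrayList<>();

    // Stand-in for a page arriving on a remote node: copy the object
    // and turn on its header bit.
    Obj fetch(Obj homeObj) {
        Obj copy = new Obj(homeObj.fields.length);
        System.arraycopy(homeObj.fields, 0, copy.fields, 0, homeObj.fields.length);
        copy.headerBit = true;
        copy.home = homeObj;
        return copy;
    }

    int get(Obj o, int field) { return o.fields[field]; }

    // put checks the header bit to decide whether the modification is logged.
    void put(Obj o, int field, int value) {
        o.fields[field] = value;
        if (o.headerBit) log.add(new Change(o.home, field, value));
    }

    // Sends the logged changes to the home copies; returns how many were sent.
    int updateMainMemory() {
        int sent = log.size();
        for (Change c : log) c.home.fields[c.field] = c.value;
        log.clear();
        return sent;
    }

    public static void main(String[] args) {
        DsmSketch node = new DsmSketch();
        Obj home = new Obj(2);
        Obj copy = node.fetch(home);
        node.put(copy, 0, 42);
        System.out.println(home.fields[0]);            // prints 0 (not yet propagated)
        System.out.println(node.updateMainMemory());   // prints 1
        System.out.println(home.fields[0]);            // prints 42
    }
}
```

Writes to an object on its home node skip the log entirely, which matches the slide's point that only modifications to remote objects are logged.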

  26. Benchmarking • Two Linux 2.2 clusters: • twelve 200 MHz Pentium Pro processors connected by a Myrinet switch and using BIP. • six 450 MHz Pentium II processors connected by an SCI network and using SISCI. • gcc 2.7.2.3 with -O6

  27. Pi (50M intervals)

  28. Jacobi (1024x1024)

  29. Traveling Salesperson (17 cities)

  30. All-pairs Shortest Path (2K nodes)

  31. Barnes-Hut (16K bodies)

  32. Correctness • Does the Hyperion approach fully support the Java Memory Model? • Hyperion follows the operational specification of the JMM with two exceptions: • page granularity • node-level caches

  33. Java Memory Model: the actors • Threads: lock, unlock, load, store, use, assign • Main Memory: lock, unlock, read, write

  34. Entry to Monitor • Thread actions: lock, load, use • Main memory actions: lock, read

  35. Exit from Monitor • Thread actions: assign, store, unlock • Main memory actions: write, unlock

  36. Serializability • Main memory actions are serializable for a single variable or lock.

  37. Page Granularity • Hyperion fetches remote memory locations with page granularity. • always OK to perform load/read actions “early”.

  38. Node-level Caches • All threads on a node share a cache. • Cache contains values being accessed whose homes are on other nodes. • Values whose homes are on this node are directly accessed. • as if every use is immediately preceded by a load/read and every assign is immediately followed by a store/write.

  39. Node-level Caches • If one thread invalidates the cache, the cache is invalidated for all threads. • always OK to do “extra” load/read actions. • If one thread updates main memory, all threads do. • always OK to do “early” store/write actions.

  40. Node-level Caches • If one thread performs a load/read, then all threads see the result. • as if all threads perform the load/read. • If one thread performs an assign, then other threads see the result before the subsequent store/write. • hmmm...

  41. Implementation Traces • load/read across cluster implemented by request-send message pair. • store/write across cluster implemented by transmit-receive message pair.

  42. An Example • Home(x): send x (value 7) to Node 0; send x (7) to Node 1; receive x (17) from Node 0; receive x (19) from Node 1 • Node 0: request x; T0: use x (7); T1: assign x = 17; T2: use x (17); transmit x (17) • Node 1: request x; T3: use x (7); T4: assign x = 19; T5: use x (19); transmit x (19)

  43. Serialization • Main Memory: read x (value 7) for T0; read x (7) for T3; write x (17) for T1; read x (17) for T2; write x (19) for T4; read x (19) for T5

  44. Algorithm • The model-satisfying serialization can always be constructed from the node-level traces of the Hyperion actions. • Merge node traces, controlled by the pairing of request-send and transmit-receive actions. • Correct, if memory actions at the home node are serializable.
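
The sketch below does not implement the merge itself. It only checks the property a merged order must have for a single variable: every read returns the value of the most recent preceding write (or the initial value). The class `SerializationCheck` and its methods are hypothetical names; the sample order is the one given on the Serialization slide.

```java
import java.util.Arrays;
import java.util.List;

// Checks that a proposed main-memory order for one variable is a valid serialization.
class SerializationCheck {
    static final class Action {
        final boolean isWrite;
        final int value;
        final String thread;
        private Action(boolean isWrite, int value, String thread) {
            this.isWrite = isWrite; this.value = value; this.thread = thread;
        }
        static Action read(int value, String thread) { return new Action(false, value, thread); }
        static Action write(int value, String thread) { return new Action(true, value, thread); }
    }

    static boolean isConsistent(int initial, List<Action> order) {
        int current = initial;
        for (Action a : order) {
            if (a.isWrite) {
                current = a.value;
            } else if (a.value != current) {
                return false;   // a read saw something other than the latest write
            }
        }
        return true;
    }

    // The serialization from the slide, for x with initial value 7.
    static List<Action> slideOrder() {
        return Arrays.asList(
            Action.read(7, "T0"), Action.read(7, "T3"),
            Action.write(17, "T1"), Action.read(17, "T2"),
            Action.write(19, "T4"), Action.read(19, "T5"));
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(7, slideOrder()));   // prints true
    }
}
```

Placing both writes before T2's read would fail the check, since T2 would then have to read 19 rather than 17.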

  45. Therefore: • Node-level caches conform to the JMM. • Simplify implementation. • Reduce memory consumption. • Facilitate pre-fetching. • However, “invalidate one, invalidate all”.

  46. Implementation Difficulties • Node-level concurrency makes the implementation tricky. (duh!) • concurrent page fetches • cache invalidate while page fetch pending? • thread reading page as it is being installed

  47. Java API Additions? • These would be desirable: • barrier synchronizations • data reductions • query cluster configuration • i.e. number of nodes • Is this cheating? • no longer “as is” Java?

  48. API implementation • Careful implementation of API extensions can lessen potential cost of “invalidate one, invalidate all”. • Implementation of barrier should only invalidate local cache once all threads have reached barrier.
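
One way to read the barrier guideline above is sketched here for the threads of a single node: the last thread to arrive performs the cache invalidation exactly once, then releases the others. The class `InvalidatingBarrier` and the `invalidateLocalCache` hook are hypothetical; the slides do not show Hyperion's barrier API, and a real cluster barrier would also have to coordinate across nodes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch: a reusable barrier that invalidates once per barrier episode.
class InvalidatingBarrier {
    private final int parties;
    private final Runnable invalidateLocalCache;   // hypothetical hook into the runtime
    private int arrived = 0;
    private long generation = 0;

    InvalidatingBarrier(int parties, Runnable invalidateLocalCache) {
        this.parties = parties;
        this.invalidateLocalCache = invalidateLocalCache;
    }

    synchronized void await() throws InterruptedException {
        long myGeneration = generation;
        arrived++;
        if (arrived == parties) {
            invalidateLocalCache.run();   // once, after all threads have arrived
            arrived = 0;
            generation++;
            notifyAll();
            return;
        }
        while (myGeneration == generation) wait();
    }

    // Hypothetical driver: returns how many invalidations occurred.
    static int demo(int threads, int rounds) {
        AtomicInteger invalidations = new AtomicInteger();
        InvalidatingBarrier barrier =
            new InvalidatingBarrier(threads, invalidations::incrementAndGet);
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                try {
                    for (int r = 0; r < rounds; r++) barrier.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return invalidations.get();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 5));   // prints 5
    }
}
```

Had each thread invalidated on its own arrival, four threads and five rounds would cost twenty invalidations of the shared node-level cache instead of five.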

  49. Level of Transparency • Consider the current Hyperion thread/object allocation strategies: • not mandated by Java Language Spec • might be superseded by smarter run-time lib • but, still good guidelines for programmer? • i.e. if I didn’t create the object, it might be expensive to access. • not unreasonable to expect user to be aware that there might be an extended memory hierarchy.

  50. Final Thoughts • Java is an extremely attractive vehicle for programming parallel computers. • Can acceptable performance be obtained? • Scalar execution • DSM implementation • Application locality • Scalability
