Bridging the Gap Between Distributed Shared Memory and Message Passing Holger Karl Department of Computer Science Courant Institute of Mathematical Sciences New York University Partially supported by DARPA/Rome laboratory and Deutsche Forschungsgemeinschaft
Overview • Motivation • Related approaches • Charlotte • Annotating Charlotte • Some experiments • Summary
Motivation • Scenario • Distributed computing using the World Wide Web • Use Java to overcome heterogeneity, security concerns, and program-installation problems • Idea: volunteer computing • Applets can participate in distributed applications • Challenges • Use only standard Web browsers/Java virtual machines • Consider the Web's and Java's idiosyncrasies • Inherent faultiness: machines come and go • The host-of-origin policy imposes a star-like communication topology • Provide a simple programming interface, e.g., DSM • No hardware support (e.g., page protection) available • Do not forget efficiency
Charlotte - Programs

public class Matrix extends Droutine {
  // define routine
  public void drun (int n, int Id) {
    // do computation
  }
  ...
  public void run () {
    ...
    parBegin();
    addRoutine(this, Size);
    parEnd();
    ...
  }
}

• Alternating sequential and parallel steps • Sequential steps are executed by a manager application • Parallel steps are executed by worker applets • Routines are defined in parallel steps • Routines are methods of objects derived from the Charlotte class Droutine
Charlotte - Memory Semantics • Memory is partitioned into private and shared parts • Shared memory with CRCW-common semantics • Implemented at the object level • Charlotte provides classes for every primitive type: Dint for int, Dfloat for float, etc. • The manager holds the master copy of memory • Shared objects have get() and set() methods • get() brings invalid data from the manager if necessary; the cost is amortized by bringing "pages" of shared data from the manager • set() marks data as modified, to be flushed to the manager at the end of the parallel step • Updates from different routines are incorporated atomically at the end of a parallel step • Allows eager scheduling for fault tolerance • Routines can be executed multiple times without violating exactly-once semantics
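A minimal sketch of what such a shared wrapper object might look like. The class name Dint and the get()/set() interface come from the slides; the field names, the isModified() helper, and the manager-fetch placeholder are assumptions made for illustration:

```java
// Hypothetical sketch of a Charlotte-style shared integer. Only the
// get()/set() interface and the valid/modified bookkeeping follow the
// slides; everything else here is an assumption.
public class Dint {
    private int value;
    private boolean valid;    // do we hold a current copy from the manager?
    private boolean modified; // must this be flushed at the end of the step?

    public Dint(int initial) {
        this.value = initial;
        this.valid = true;
    }

    /** Read the value, fetching from the manager first if our copy is stale. */
    public int get() {
        if (!valid) {
            fetchFromManager(); // would also bring a whole "page" of neighbors
        }
        return value;
    }

    /** Write locally and mark the object for flushing to the manager. */
    public void set(int v) {
        value = v;
        valid = true;
        modified = true;
    }

    public boolean isModified() { return modified; }

    private void fetchFromManager() {
        // Placeholder: in Charlotte this would be a network request
        // to the manager application.
        valid = true;
    }
}
```

Note the per-access costs the slides complain about: every read pays the validity check (and possibly a round trip), and every write pays the status update.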
Charlotte - Matrix Multiplication

public class Matrix extends Droutine {
  // define routine
  public void drun (int n, int i) {
    for (int j = 0; j < Size; j++) {
      sum = 0;
      for (int k = 0; k < Size; k++)
        sum += A[i][k].get() * B[k][j].get();
      C[i][j].set(sum);
    }
  }
  ...
}
Charlotte - Pros and Cons • Pros • Simple programming model • Well-defined DSM semantics • Adaptive parallelism • Fault-tolerant with respect to worker crashes • Cons • Efficiency problems • Shared data is accessed via method invocations • Loading data from the manager incurs latency • Choosing a good page size is difficult • Sending modified data back to the manager requires inspection of status flags • Reason: the runtime system does not know which data is read or modified by routines • Make this information explicit in Charlotte programs!
Annotations

public class Matrix extends Droutine {
  public void drun (int n, int i) { ... }
  public Locations dloc_read (int n, int i) {
    Locations loc = new Locations();
    loc.add(B);
    loc.add(A[i]);
    return loc;
  }
  ...
}

• Specify read and write locations for routines • Use methods of class Droutine • dloc_read() • dloc_write() • First step: correctness-preserving annotations • Only send read data at the beginning of a routine
Relying on Annotations • Second step: rely on annotations • Data transfer between manager and worker happens only according to annotations • No if-statement in get() necessary • No status update in set() necessary • No latencies for requesting data updates • But wrong annotations now lead to wrong results • Use Uint (unchecked) instead of Dint, etc. • Identical interface • Remaining overhead: method invocation for data access • Charlotte's distributed types (Dint) and unchecked distributed types (Uint) can be freely mixed • Simple to switch back and forth
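An unchecked counterpart might then look like this; the class name Uint is from the slides, while the trivial body is an assumed sketch:

```java
// Hypothetical unchecked counterpart of Dint: identical interface, but
// get()/set() do no validity checks and no status updates, because the
// annotated read/write sets tell the runtime exactly what to transfer.
public class Uint {
    private int value;

    public Uint(int initial) { this.value = initial; }

    public int get() { return value; }    // no "fetch if invalid" branch

    public void set(int v) { value = v; } // no modified-flag bookkeeping
}
```

Because the interface is identical to Dint's, switching a field between checked and unchecked sharing only changes its declared type; no call sites need to be touched.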
Sharing Primitive Types • Third step: use primitive types (int) instead of objects (Uint) • Possible since get() and set() are trivial for Uint • Possible since annotations describe data movement completely • Avoids method-invocation overhead • But: a different interface • No get()/set(), so syntactic changes are needed • Call-by-reference vs. call-by-value semantics differ • Particularly suited to arrays of distributed types • Freely mix Dint, Uint, and shared int
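With primitive sharing, the matrix routine's inner loop reads and writes plain Java arrays. The sketch below is a standalone, sequential illustration of that loop (the names A, B, C follow the earlier matrix example; the distributed machinery is omitted):

```java
// Sketch: with shared primitive types, a routine's body operates on plain
// Java arrays; the annotated read/write sets drive all data movement, so
// no per-element method calls remain.
public class PrimitiveMatmul {
    static int[][] multiply(int[][] A, int[][] B) {
        int size = A.length;
        int[][] C = new int[size][size];
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++) {
                int sum = 0;
                for (int k = 0; k < size; k++)
                    sum += A[i][k] * B[k][j]; // no get() call per element
                C[i][j] = sum;                // no set() call either
            }
        return C;
    }
}
```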
Additional Optimizations • Manager keeps track of workers' valid data sets • Use routines' read sets to select the routine to give to a worker • Choose the routine with the minimal amount of missing data • Example: 3 routines with read sets 160-210, 20-80, 90-150; 2 workers (Worker 1 valid: 1-100, Worker 2 valid: 150-200) • Data can be cached at workers between parallel steps • Requires data to be declared unmodified at the beginning of a parallel step
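The routine-selection idea can be sketched as follows. Representing valid sets and read sets as single index ranges is an assumption made for brevity, as are all names in the sketch:

```java
// Hypothetical sketch of the scheduling idea from the slides: give a worker
// the routine whose read set has the least data missing from the worker's
// valid set. Sets are modeled as inclusive index ranges [lo, hi].
public class RoutineScheduler {

    /** Amount of the read range [rLo, rHi] not covered by the valid range [vLo, vHi]. */
    static int missing(int rLo, int rHi, int vLo, int vHi) {
        int overlap = Math.max(0, Math.min(rHi, vHi) - Math.max(rLo, vLo) + 1);
        return (rHi - rLo + 1) - overlap;
    }

    /** Index of the routine with minimal missing data for a worker's valid range. */
    static int pick(int[][] readSets, int vLo, int vHi) {
        int best = 0;
        int bestMissing = Integer.MAX_VALUE;
        for (int i = 0; i < readSets.length; i++) {
            int m = missing(readSets[i][0], readSets[i][1], vLo, vHi);
            if (m < bestMissing) {
                bestMissing = m;
                best = i;
            }
        }
        return best;
    }
}
```

With the slides' example (Worker 1 valid 1-100; routines reading 160-210, 20-80, 90-150), the second routine is fully covered by Worker 1's valid data and would be picked, while Worker 2 (valid 150-200) is best matched by the first.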
Discussion • Gradual incorporation of semantic knowledge into Charlotte programs • Large flexibility to mix different levels of annotation semantics in one program • Easy to switch back and forth among these levels • Close to message passing behavior • Charlotte’s nice properties are preserved (fault tolerance, adaptive parallelism) • Is it efficient?
Annotating DSM Code • Munin • Annotates data with expected access patterns; the system chooses appropriate consistency protocols • Aurora • Object-based system in C++; different consistency models are dynamically selected at runtime • CRL, Cid • Library of C functions; explicit mapping of shared memory into local memory • Jade • Annotations used to extract parallelism from sequential code • All lack the flexibility to mix different guarantee levels
Experiments - Setup • Multiplication of 200x200 integer matrices • PentiumPro 200 at NYU with FastEthernet (100 Mbit/sec) • Pentium 90 at Humboldt University Berlin • Kaffe Virtual Machine Version 0.92 (Java JIT compiler) under Linux • Sequential runtimes: 8.1 sec. (P90), 2.3 sec. (PPro200) • Ping between NYU and HUB: typically 130 msec.
Experiments • Multiply two 200x200 integer matrices • Runtimes for • Standard Charlotte (Dint) • Standard Charlotte plus correctness-preserving annotations (Dint+A) • Unchecked classes (Uint) • Shared primitive types (int) • A message passing implementation (mes.pas.) • Shared primitive types competitive with message passing • Reasonable absolute speedups (compared to sequential runtime)
Experiments (contd.) • Ratios between various optimizations • Annotations/Presending give a factor of about three • Getting rid of objects is another factor of two • Altogether: up to nine times faster than standard Charlotte • And better scalability
Experiments (contd.) • Manager at NYU, worker at HU Berlin • Similar behavior, though with a slightly smaller improvement
Experiments (contd.) • Colocation and Caching • Example: multiply matrix A by two different matrices B1 and B2 • Small impact in a LAN environment • Up to 25% improvement for high-latency connections • Colocation results in smaller standard deviation of runtimes
Future Work • Investigating more elaborate examples • Data with both regular and irregular access patterns • Compiler-generated annotations • New programming interface • No nested parallelism • parBegin()/parEnd() makes some annotations awkward • Overlapping communication and computation • Multiple worker threads in one virtual machine • Applying annotations to Calypso (a page-based DSM system) • Down the road: use annotations for QoS
Summary • Annotations describe the read and write sets of parallel routines • Correctness-preserving • Correctness-sensitive • Or directly sharing primitive types • Possible to mix these different levels in a single program and to switch back and forth between them • Great flexibility for the programmer • Maintains Charlotte's advantages such as adaptive parallelism and fault tolerance • Performance competitive with message-passing programs