Bridging the Gap Between Distributed Shared Memory and Message Passing Holger Karl Department of Computer Science Courant Institute of Mathematical Sciences New York University Partially supported by DARPA/Rome laboratory and Deutsche Forschungsgemeinschaft
Overview • Motivation • Related approaches • Charlotte • Annotating Charlotte • Some experiments • Summary
Motivation • Scenario • Distributed computing using the World Wide Web • Use Java to overcome heterogeneity, security concerns, and program-installation problems • Idea: volunteer computing • Applets can participate in distributed applications • Challenges • Use only standard Web browsers/Java virtual machines • Consider the Web's and Java's idiosyncrasies • Inherent faultiness: machines come and go • The host-of-origin policy imposes a star-like communication topology • Provide a simple programming interface, e.g., DSM • No hardware support (e.g., page protection) available • Do not forget efficiency
Charlotte - Programs

public class Matrix extends Droutine {
  // define routine
  public void drun (int n, int Id) {
    // do computation
  }
  ...
  public void run () {
    ...
    parBegin();
    addRoutine(this, Size);
    parEnd();
    ...
  }
}

• Alternating sequential and parallel steps • Sequential steps are executed by a manager application • Parallel steps are executed by worker applets • Routines are defined in parallel steps • Routines are methods of objects derived from the Charlotte class Droutine
Charlotte - Memory Semantics • Memory is partitioned into private and shared parts • Shared memory with CRCW-common semantics • Implemented at the object level • Charlotte provides classes for every primitive type: Dint for int, Dfloat for float, etc. • The manager holds the master copy of memory • Shared objects have get() and set() methods • get() brings invalid data from the manager if necessary; the cost is amortized by bringing "pages" of shared data from the manager • set() marks data as modified, to be flushed to the manager at the end of the parallel step • Updates from different routines are incorporated atomically at the end of a parallel step • Allows eager scheduling for fault tolerance • Routines can be executed multiple times without violating exactly-once semantics
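A minimal sketch of what such a shared wrapper object might look like. The class name Dint and the get()/set() interface come from the slides; the field names, the isModified() helper, and the manager-fetch placeholder are assumptions made for illustration:

```java
// Hypothetical sketch of a Charlotte-style shared integer. Only the
// get()/set() interface and the valid/modified bookkeeping follow the
// slides; everything else here is an assumption.
public class Dint {
    private int value;
    private boolean valid;    // do we hold a current copy from the manager?
    private boolean modified; // must this be flushed at the end of the step?

    public Dint(int initial) {
        this.value = initial;
        this.valid = true;
    }

    /** Read the value, fetching from the manager first if our copy is stale. */
    public int get() {
        if (!valid) {
            fetchFromManager(); // would also bring a whole "page" of neighbors
        }
        return value;
    }

    /** Write locally and mark the object for flushing to the manager. */
    public void set(int v) {
        value = v;
        valid = true;
        modified = true;
    }

    public boolean isModified() { return modified; }

    private void fetchFromManager() {
        // Placeholder: in Charlotte this would be a network request
        // to the manager application.
        valid = true;
    }
}
```

Note the per-access costs the slides complain about: every read pays the validity check (and possibly a round trip), and every write pays the status update.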
Charlotte - Matrix Multiplication

public class Matrix extends Droutine {
  // define routine
  public void drun (int n, int i) {
    for (int j = 0; j < Size; j++) {
      sum = 0;
      for (int k = 0; k < Size; k++)
        sum += A[i][k].get() * B[k][j].get();
      C[i][j].set(sum);
    }
  }
  ...
}
Charlotte - Pros and Cons • Pros • Simple programming model • Well-defined DSM semantics • Adaptive parallelism • Fault-tolerant with respect to worker crashes • Cons • Efficiency problems • Shared data is accessed via method invocations • Loading data from the manager incurs latency • Choosing a good page size is difficult • Sending modified data back to the manager requires inspection of status flags • Reason: the runtime system does not know which data is read or modified by routines • Make this information explicit in Charlotte programs!
Annotations

public class Matrix extends Droutine {
  public void drun (int n, int i) { ... }
  public Locations dloc_read (int n, int i) {
    Locations loc = new Locations();
    loc.add(B);
    loc.add(A[i]);
    return loc;
  }
  ...
}

• Specify read and write locations for routines • Use methods of class Droutine • dloc_read() • dloc_write() • First step: correctness-preserving annotations • Only send read data at the beginning of a routine
Relying on Annotations • Second step: rely on annotations • Data transfer between manager and worker happens only according to annotations • No if-statement in get() necessary • No status update in set() necessary • No latencies for requesting data updates • But wrong annotations now lead to wrong results • Use Uint (unchecked) instead of Dint, etc. • Identical interface • Remaining overhead: method invocation for data access • Charlotte's distributed types (Dint) and unchecked distributed types (Uint) can be freely mixed • Simple to switch back and forth
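An unchecked counterpart might then look like this; the class name Uint is from the slides, while the trivial body is an assumed sketch:

```java
// Hypothetical unchecked counterpart of Dint: identical interface, but
// get()/set() do no validity checks and no status updates, because the
// annotated read/write sets tell the runtime exactly what to transfer.
public class Uint {
    private int value;

    public Uint(int initial) { this.value = initial; }

    public int get() { return value; }    // no "fetch if invalid" branch

    public void set(int v) { value = v; } // no modified-flag bookkeeping
}
```

Because the interface is identical to Dint's, switching a field between checked and unchecked sharing only changes its declared type; no call sites need to be touched.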
Sharing Primitive Types • Third step: use primitive types (int) instead of objects (Uint) • Possible since get() and set() are trivial for Uint • Possible since annotations describe data movement completely • Avoids method-invocation overhead • But: a different interface • No get()/set(), so syntactic changes are needed • Call-by-reference vs. call-by-value semantics differ • Particularly suited to arrays of distributed types • Freely mix Dint, Uint, and shared int
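With primitive sharing, the matrix routine's inner loop reads and writes plain Java arrays. The sketch below is a standalone, sequential illustration of that loop (the names A, B, C follow the earlier matrix example; the distributed machinery is omitted):

```java
// Sketch: with shared primitive types, a routine's body operates on plain
// Java arrays; the annotated read/write sets drive all data movement, so
// no per-element method calls remain.
public class PrimitiveMatmul {
    static int[][] multiply(int[][] A, int[][] B) {
        int size = A.length;
        int[][] C = new int[size][size];
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++) {
                int sum = 0;
                for (int k = 0; k < size; k++)
                    sum += A[i][k] * B[k][j]; // no get() call per element
                C[i][j] = sum;                // no set() call either
            }
        return C;
    }
}
```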
Additional Optimizations • Manager keeps track of workers' valid data sets • Use routines' read sets to select the routine to give to a worker • Choose the routine with the minimal amount of missing data • Example: 3 routines with read sets 160-210, 20-80, 90-150; 2 workers (Worker 1 valid: 1-100, Worker 2 valid: 150-200) • Data can be cached at workers between parallel steps • Requires data to be declared unmodified at the beginning of a parallel step
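The routine-selection idea can be sketched as follows. Representing valid sets and read sets as single index ranges is an assumption made for brevity, as are all names in the sketch:

```java
// Hypothetical sketch of the scheduling idea from the slides: give a worker
// the routine whose read set has the least data missing from the worker's
// valid set. Sets are modeled as inclusive index ranges [lo, hi].
public class RoutineScheduler {

    /** Amount of the read range [rLo, rHi] not covered by the valid range [vLo, vHi]. */
    static int missing(int rLo, int rHi, int vLo, int vHi) {
        int overlap = Math.max(0, Math.min(rHi, vHi) - Math.max(rLo, vLo) + 1);
        return (rHi - rLo + 1) - overlap;
    }

    /** Index of the routine with minimal missing data for a worker's valid range. */
    static int pick(int[][] readSets, int vLo, int vHi) {
        int best = 0;
        int bestMissing = Integer.MAX_VALUE;
        for (int i = 0; i < readSets.length; i++) {
            int m = missing(readSets[i][0], readSets[i][1], vLo, vHi);
            if (m < bestMissing) {
                bestMissing = m;
                best = i;
            }
        }
        return best;
    }
}
```

With the slides' example (Worker 1 valid 1-100; routines reading 160-210, 20-80, 90-150), the second routine is fully covered by Worker 1's valid data and would be picked, while Worker 2 (valid 150-200) is best matched by the first.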
Discussion • Gradual incorporation of semantic knowledge into Charlotte programs • Large flexibility to mix different levels of annotation semantics in one program • Easy to switch back and forth among these levels • Close to message passing behavior • Charlotte’s nice properties are preserved (fault tolerance, adaptive parallelism) • Is it efficient?
Annotating DSM Code • Munin • Annotates data with expected access patterns; the system chooses appropriate consistency protocols • Aurora • Object-based system in C++; different consistency models are dynamically selected at runtime • CRL, Cid • Library of C functions; explicit mapping of shared memory into local memory • Jade • Annotations used to extract parallelism from sequential code • All lack the flexibility to mix different guarantee levels
Experiments - Setup • Multiplication of 200x200 integer matrices • PentiumPro 200 at NYU with FastEthernet (100 Mbit/sec) • Pentium 90 at Humboldt University Berlin • Kaffe Virtual Machine Version 0.92 (Java JIT compiler) under Linux • Sequential runtimes: 8.1 sec. (P90), 2.3 sec. (PPro200) • Ping between NYU and HUB: typically 130 msec.
Experiments • Multiply two 200x200 integer matrices • Runtimes for • Standard Charlotte (Dint) • Standard Charlotte plus correctness-preserving annotations (Dint+A) • Unchecked classes (Uint) • Shared primitive types (int) • A message passing implementation (mes.pas.) • Shared primitive types competitive with message passing • Reasonable absolute speedups (compared to sequential runtime)
Experiments (contd.) • Ratios between various optimizations • Annotations/Presending give a factor of about three • Getting rid of objects is another factor of two • Altogether: up to nine times faster than standard Charlotte • And better scalability
Experiments (contd.) • Manager at NYU, worker at HU Berlin • Similar behavior, though with a slightly smaller improvement
Experiments (contd.) • Colocation and Caching • Example: multiply matrix A by two different matrices B1 and B2 • Small impact in a LAN environment • Up to 25% improvement for high-latency connections • Colocation results in smaller standard deviation of runtimes
Future Work • Investigating more elaborate examples • Data with both regular and irregular access patterns • Compiler-generated annotations • New programming interface • No nested parallelism • parBegin()/parEnd() makes some annotations awkward • Overlapping communication and computation • Multiple worker threads in one virtual machine • Applying annotations to Calypso (a page-based DSM system) • Down the road: use annotations for QoS
Summary • Annotations describe the read and write sets of parallel routines • Correctness-preserving • Correctness-sensitive • Or directly sharing primitive types • Possible to mix these different levels in a single program and to switch back and forth between them • Great flexibility for the programmer • Maintains Charlotte's advantages such as adaptive parallelism and fault tolerance • Performance competitive with message-passing programs