420 likes | 436 Views
Determinate Imperative Programming. Vijay Saraswat, Radha Jagadeesan, Armando Solar-Lezama, Christoph von Praun November, 2006 IBM Research. This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004. Problem:
E N D
Determinate Imperative Programming Vijay Saraswat, Radha Jagadeesan, Armando Solar-Lezama, Christoph von Praun November, 2006 IBM Research This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004
Problem: Many concurrent imperative programs are determinate. Determinacy is not apparent from the syntax. Design a language in which all programs are guaranteed determinate. Basic idea A variable is the stream of values written to it by a thread. Many examples Semantics Implementation Future work Outline
Recent Publications "Concurrent Clustered Programming", V. Saraswat, R. Jagadeesan. CONCUR conference, August 2005. "X10: An Object-Oriented Approach to Non-Uniform Cluster Computing", P. Charles, C. Donawa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, V. Sarkar. OOPSLA Onwards! conference, October 2005. “A Theory of Memory Models”, V Saraswat, R Jagadeesan, M. Michael, C. von Praun, to appear PPoPP 2007. “Experiences with an SMP Implementation for X10 based on the Java Concurrency Utilities Rajkishore Barik, Vincent Cave, Christopher Donawa, Allan Kielstra,Igor Peshansky, Vivek Sarkar. Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006. "X10: an Experimental Language for High Productivity Programming of Scalable Systems", K. Ebcioglu, V. Sarkar, V. Saraswat. P-PHEC workshop, February 2005. Tutorials TiC 2006, PACT 2006, OOPSLA06 X10 Core Team Rajkishore Barik Vincent Cave Chris Donawa Allan Kielstra Igor Peshansky Christoph von Praun Vijay Saraswat Vivek Sarkar Tong Wen X10 Tools Philippe Charles Julian Dolby Robert Fuhrer Frank Tip Mandana Vaziri Emeritus Kemal Ebcioglu Christian Grothoff Research colleagues R. Bodik, G. Gao, R. Jagadeesan, J. Palsberg, R. Rabbah, J. Vitek Several others at IBM Acknowledgments
PEs, L1 $ PEs, L1 $ PEs, PEs, PEs, PEs, PEs, L1 $ PEs, L1 $ . . . . . . . . . . . . SMP Node SMP Node . . . . . . L2 Cache L2 Cache . . . Interconnect Memory Memory A new era of mainstream parallel processing The Challenge Parallelism scaling replaces frequency scaling as foundation for increased performance Profound impact on future software Multi-core chips Heterogeneous Parallelism Cluster Parallelism Our response: Use X10 as a new language for parallel hardware that builds on existing tools, compilers, runtimes, virtual machines and libraries
Workload Apps Servers System 64 Racks, 64x32x32 131,072 processors Rack 32 Node Cards 2048 processors Mode Card 16 compute cards (16 compute, 0-2 IO cards) 64 processors Network 360 TF/s 32 TB 1.3M Watts 5.6 TF/s 512 GB 20 KWatts 1 m2 footprint Compute Card 2 chips, 1x2x1 4 processors 180 GF/s 16 GB Chip 2 processors 11.2 GF/s 1.0 GB Server Trends: Concurrency, Distribution, Heterogeneity at all levels HPC Scale Out Shared Administrative Domain Appliance Commercial Scale Out Blade Multi-Core Chip
The X10 Programming Model Place = collection of resident activities & objects Storage classes • Immutable Data • PGAS • Local Heap • Remote Heap • Activity Local Locality Rule Any access to a mutable datum must be performed by a local activity remote data accesses can be performed by creating remote activities Ordering Constraints (Memory Model) Locally Synchronous: Guaranteed coherence for local heap Sequential consistency Globally Asynchronous: No ordering of inter-place activities use explicit synchronization for coherence Few concepts, done right.
X10 v0.41 Cheat sheet DataType: ClassName | InterfaceName | ArrayType nullable DataType futureDataType Kind : value | reference Stm: async [ ( Place ) ] [clocked ClockList ] Stm when( SimpleExpr ) Stm finish Stm next; c.resume() c.drop() for( i : Region ) Stm foreach( i : Region ) Stm ateach( I : Distribution ) Stm Expr: ArrayExpr ClassModifier : Kind MethodModifier: atomic x10.lang has the following classes (among others) point, range, region, distribution, clock, array Some of these are supported by special syntax. Forthcoming support: closures, generics, dependent types, place types, implicit syntax, array literals.
X10 v0.41 Cheat sheet: Array support Region: Expr : Expr -- 1-D region [ Range, …, Range ] -- Multidimensional Region Region && Region -- Intersection Region || Region -- Union Region – Region -- Set difference BuiltinRegion Dist: Region -> Place -- Constant distribution Distribution | Place -- Restriction Distribution | Region -- Restriction Distribution || Distribution -- Union Distribution – Distribution -- Set difference Distribution.overlay ( Distribution ) BuiltinDistribution ArrayExpr: new ArrayType ( Formal ) { Stm } Distribution Expr-- Lifting ArrayExpr [ Region ] -- Section ArrayExpr | Distribution-- Restriction ArrayExpr || ArrayExpr-- Union ArrayExpr.overlay(ArrayExpr) -- Update ArrayExpr. scan([fun [, ArgList] ) ArrayExpr. reduce([fun [, ArgList] ) ArrayExpr.lift([fun [, ArgList] ) ArrayType: Type [Kind] [ ] Type [Kind] [ region(N) ] Type [Kind] [ Region ] Type [Kind] [ Distribution ] Language supports type safety, memory safety, place safety, clock safety.
Memory Model Please see: http://www.saraswat.org/rao.html • X10 v 0.41 specifies sequential consistency per place. • Not workable. • We are considering a weaker memory model. • Built on the notion of atomic: identify a stepas the basic building block. • A step is a partial write function. • Use links for non hb-reads. • A process is a pomset of steps closed under certain transformations: • Composition • Decomposition • Augmentation • Linking • Propagation • There may be opportunity for a weak notion of atomic: decouple atomicity from ordering. Correctly synchronized programs behave as SC. Correctly synchronized programs= programs whose SC executions have no races.
async (P) S Creates a new child activity at place P, that executes statement S Returns immediately Smay reference finalvariables in enclosing blocks Activities cannot be named Activity cannot be aborted or cancelled async Stmt ::= asyncPlaceExpSingleListopt Stmt cf Cilk’s spawn // global dist. array final double a[D] = …; final int k = …; async ( a.distribution[99] ) { // executed at A[99]’s // place atomic a[99] = k; } Memory model: hb edge between stm before async and start of async.
finish S Execute S, but wait until all (transitively) spawned asyncs have terminated. Rooted exception model Trap all exceptions thrown by spawned activities. Throw an (aggregate) exception if any spawned async terminates abruptly. implicit finish at main activity finish is useful for expressing “synchronous” operations on (local or) remote data. finish Stmt ::= finish Stmt cf Cilk’s sync finish ateach(point [i]:A) A[i] = i; finish async (A.distribution [j]) A[j] = 2; // all A[i]=i will complete // before A[j]=2; Memory model: hb edge between last stm of each async and stm after finish S.
foreach (point p: R) S Creates |R| async statements in parallel at current place. Termination of all (recursively created) activities can be ensured with finish. finish foreach is a convenient way to achieve master-slave fork/join parallelism (OpenMP programming model) foreach foreach ( FormalParam: Expr ) Stmt for (point p: R) async { S } foreach (point p:R) S
Atomic blocks are conceptually executed in a single step while other activities are suspended: isolation and atomicity. An atomic block ... must be nonblocking must not create concurrent activities (sequential) must not access remote data (local) atomic Stmt ::= atomicStatement MethodModifier ::= atomic // target defined in lexically // enclosing scope. atomic boolean CAS(Object old, Object new) { if (target.equals(old)) { target = new; return true; } return false; } // push data onto concurrent // list-stackNode node = new Node(data);atomic { node.next = head; head = node; } Memory model: end of tx hb start of next tx in the same place.
Clocks: Motivation • Activity coordination using finish and force() is accomplished by checking for activity termination • However, there are many cases in which a producer-consumer relationship exists among the activities, and a “barrier”-like coordination is needed without waiting for activity termination • The activities involved may be in the same place or in different places Phase 0 Phase 1 . . . . . . Activity 0 Activity 1 Activity 2
clock c = clock.factory.clock(); Allocate a clock, register current activity with it. Phase 0 of c starts. async(…) clocked (c1,c2,…) S ateach(…) clocked (c1,c2,…) S foreach(…) clocked (c1,c2,…) S Create async activities registered on clocks c1, c2, … c.resume(); Nonblocking operation that signals completion of work by current activity for this phase of clock c next; Barrier --- suspend until all clocks that the current activity is registered with can advance. c.resume() is first performed for each such clock, if needed. Next can be viewed like a “finish” of all computations under way in the current phase of the clock Clocks (1/2)
c.drop(); Unregister with c. A terminating activity will implicitly drop all clocks that it is registered on. c.registered() Return true iff current activity is registered on clock c c.dropped()returns the opposite of c.registered() Activity is deregistered from a clock when it terminates. Static semantics An activity may operate only on those clocks it is registered with. In finish S,Smay not contain any (top-level) clocked asyncs. Dynamic semantics A clock c can advance only when all its registered activities have executed c.resume(). An activity may not pass-on clocks on which it is not live to sub-activities. ClockUseException An activity may not transmit a clock into the scope of a finish. ClockUseException Clocks (2/2) Memory model: hb edge between next stm of all registered activities on c, and the subsequent stm in each activity. No explicit operation to register a clock.
Example (TutClock1.x10) finish async { final clock c = clock.factory.clock(); foreach (point[i]: [1:N]) clocked (c) { while ( true ) { int old_A_i = A[i]; int new_A_i = Math.min(A[i],B[i]); if ( i > 1 ) new_A_i = Math.min(new_A_i,B[i-1]); if ( i < N ) new_A_i = Math.min(new_A_i,B[i+1]); A[i] = new_A_i; next; int old_B_i = B[i]; int new_B_i = Math.min(B[i],A[i]); if ( i > 1 ) new_B_i = Math.min(new_B_i,A[i-1]); if ( i < N ) new_B_i = Math.min(new_B_i,A[i+1]); B[i] = new_B_i; next; if ( old_A_i == new_A_i && old_B_i == new_B_i ) break; } // while } // foreach } // finish async parent transmits clock to child exiting from while loop terminates activity for iteration i, and automatically deregisters activity from clock
Permit variables to be marked as clocked final, e.g. clocked(c) final double[.] a = … In each phase of the clock, the variable is immutable. Writes for such a variable are performed on a shadow copy. Main copy of variable updated with value in shadow copy when clock moves to the next phase. Clocked final variables cannot introduce non-determinism. Assuming multiple writers write the same value in each phase. Clocked final variables
Clocked final example: Array relaxation G elements are assigned to at most once in each phase of clock c. Each activity is registered on c. clocked (c) final int [0:M-1,0:N-1] G = …; finish foreach (int i,j in [1:M-1,1:N-1]) clocked (c) { for (int p in [0:TimeStep-1]) { G[i,j] = omega/4*(G[i-1,j]+G[i+1,j]+G[i,j-1]+G[i,j+1])+(1-omega)*G[i,j]; next; } } Read current value of cell. Value written into ghost copy of G[i,j] Wait for clock to advance. Write visible (only) when clock advances.
Variables Variable=Value in a Box Read: fetch current value Write: change value Stability condition: Value does not change unless a write is performed Very powerful Permit repeated many-writer, many-reader communication through arbitrary reference graphs Mutability in the presence of sharing Permits different variables to change at different rates. Asynchrony introduces indeterminacy May write out either 0 or 1. Bugs due to races are very difficult to debug. Imperative Programming Revisited int x = 0; async x=1; print(x); Reader-reader, reader-writer, writer-writer conflicts.
NAS parallel benchmarks Conjugate gradient Multigrid LU factorization Stencil computations Jacobi, SOR Molecular dynamics Detecting stable properties Clocks! Short circuit technique Single producer, multiple copying consumers Kahn networks, StreamIt Pipelining Graph algorithms Connected components Determinate programming design patterns Parallelism for performance/scaling, not control.
Reactive computing: arrival-order indeterminism “Races in the world” E.g. Bank accounts Resource contention: any of several possible outcomes is acceptable Mutual exclusion Load balancing Shared work list Algorithm may permit any one of many possible solutions One solution for N-queens Some minimal spanning tree Some Delauney triangulation Determinate programming anti-patterns But the program may still contain determinate concurrent components.
Asynchronous Kahn networks Nodes can be thought of as (continuous) functions over streams. Pop/peek Push Node-local state may mutate arbitrarily Concurrent Constraint Programming Tell constraints Ask if a constraint is true Subsumes Kahn networks (dataflow). Subsumes (det) concurrent logic programming, lazy functional programming Determinate Concurrent Imperative frameworks Do not support arbitrary mutable variables.
Safe Asynchrony (Steele 1991) Parent may communicate with children. Children may communicate with parent. Siblings may communicate with each other only through commutative, associative writes (“commuting writes”). Determinate Concurrent Imperative Frameworks Good: int x=0; finish foreach (int i in 1:N) { x += i; } print(x); // N*(N+1)/2 Bad: int x=0; finish foreach (int i in 1:N) { x += i; async print(x); } Useful but limited. Does not permit dataflow synch.
Determinate X10 DataType: ClassName | InterfaceName | ArrayType nullable DataType futureDataType localDataType det DataType indet DataType Kind : value | reference Stm: async [ ( Place ) ] [clocked ClockList ] Stm when( SimpleExpr ) Stm finish Stm next; c.resume() c.drop() for( i : Region ) Stm foreach( i : Region ) Stm ateach( I : Distribution ) Stm Expr: ArrayExpr ClassModifier : Kind MethodModifier: atomic Constructs not available Constructs added.
Instances of value classes always considered local. A mutable object is local only if it is marked as local when created: new local T(…) A value of a local type can be assigned only into a variable of local type, or a field of a local object. Variables of a local type are not visible to contained asyncs. Local objects of terminated activity become local objects of parent. An async spawned in a finish may assign a value of a local type to a local variable of the parent activity. A value may be cast to local T; the cast may fail. E.g. local T x = (local T) this; Invariant: Each activity owns the local objects and local variables it creates. Only local objects can reference local objects. local variables Ownership type system used to maintain locality.
A det location is represented in memory as a stream (indexed sequence) of immutable values. Each activity maintains an index i + clean/dirty bit for every det location. Initially i=1, v[0] contains initial value. Read: If clean, block until v[i] is written and return v[i++] else return v[i-1]. Mark as clean. Write: Write into v[i++]. Mark as dirty. Note: index updated only as a result of activity’s operations. World Map = Collection of indices for an activity. Index transmission rules. async: Activity initialized with current world map of parent activity. finish: world map of activity is lubbed with world map of finished activities. (clean lub dirty = dirty) det locations The clock of clocked final is made implicit.
Can be recovered as det locations + a mutable shared index (“current”). All activities read and update location through current. Therefore stream representation is not necessary, only the “current” value need be kept. An activity’s world map does not need to contain index for indet locations. Indet locations
det example: Array relaxation det int [0:M-1,0:N-1] G = …; finish foreach (local int i,j in [1:M-1,1:N-1]) { for (local int p in [0:TimeStep-1]) { G[i,j] = omega/4*(G[i-1,j]+G[i+1,j]+G[i,j-1]+G[i,j+1])+(1-omega)*G[i,j]; } } All clock manipulations are implicit.
Some simple examples det int x=0; finish { async { int r1 = x; int r2 = x; println(r1); println(r2); } async {x=1;x=2;} } 0 1 Convention: A type not marked det is assumed marked local. Only one result – independent of the scheduler!
Some simple examples det int x=0; finish { async {int r1 = x; int r2 = x; println(r1); println(r2);} async {x=1;} async {x=1; int r3 = x; async {x=2;}} } println(x); 0 1 2 All programs are determinate.
Some StreamIt examples Det X10 StreamIt 0 1 … void -> void pipeline Minimal { add IntSource; add IntPrinter; } void ->int filter IntSource { int x; init {x=0;} work push 1 { push(x++);} } int->void filter IntPrinter { work pop 1 { print(pop());} } det int x=0; async while (true) x++; async while (true) println(x); The communication is through assignment to x, so the same result is obtained with: 0 1 … det int x=0; async while (true) ++x; async while (true) println(x); Each shared variable is a multi-reader, multi-writer stream.
Some StreamIt examples: fibonacci det int x=1, y=1; async while (true) y=x; async while (true) x+=y; Activity 1 Activity 2 Can express any recursive, asynchronous Kahn network.
StreamIt examples: Moving Average void->void pipeline MovingAverage { add intSource(); add Averager(10); add IntPrinter(); } int->int filter Average(int n) { work pop 1 push 1 peek n { int sum=0; for (int i=0; i < n; i++) sum += peek(i); push(sum/n); pop(); } } det int y=0; det int x=0; async while (true) x++; async while (true) { int sum=x; for (int i in 1:N-1) sum += peek(x, i); y = sum/N; } • peek(x, i) reads the i’th future value, without popping it. Blocks if necessary.
Canon matrix multiplication void canon (det double[N,N] c, det double[N,N] a, det double[N,N] b) { finish foreach (int i,j in [0:N-1,0:N-1]) { a[i,j] = a[i,(j+1) % N]; b[i,j] = b[(i+j)%N, j]; } for (int k in [0:N-1]) finish foreach (int i,j in [0:N-1,0:N-1]) { c[i,j] = c[i+j] + a[i,j]*b[i,j]; a[i,j] = a[i,(j+1)%N]; b[i,j] = b[(i+1)%N, j]; } } The natural sequential program works (for finish foreach).
Each activity’s world map increases monotonically with time. Use garbage collection to erase past unreachable values. Programs with no sibling communication may be executed in buffers with unit windows. Considering permitting user to specify bounds on variables (cf push/pop specifications in StreamIt). This will force writes to become blocking as well. Implementation Scheduling strategy affects size of buffers, not result.
Formalization MJ/CF Very straightforward additions to field read/write. Paper contains details. Paper contains ideas on detecting deadlock (stabilities) at runtime and recovering from them. Programmability being investigated. Devise static type system to establish deadlock-freedom. Implementation. Leverage connection with StreamIt, and static scheduling. Coarser granularity for indices. Use same clock for many variables. Permits “coordinated” changes to multiple variables. Introduce fusion operation (x -> y) to support CCP. Future work
StreamIt examples: Bandpass filter float->float pipeline BandPassFilter(float rate, float low, float high, int taps) { add BPFCore(rate, low, high, taps); add Subtracter();} float ->float splitjoin BPFCore (float rate, float low, float high, int taps) { split duplicate; add LowPass(rate, low, taps, 0); add LowPass(rate, high, taps, 0); join roundrobin;} float->float filter Subtracter { Work pop 2 push 1 { push(peek(1)-peek(0)); pop(); pop();}} float bandPassFilter(float rate, float low, float high, int taps, int in) { int tmp=in; det int in1=tmp, in2=tmp; async while (true) in1=in; async while (true) in2=in; det int o1 = lowPass(rate, low, taps, 0, in1), o2 = lowPass(rate, high, taps, 0, in2); det int o = o1-o2; async while(true) o = o1-o2; return o; } Functions return streams.
Histogram • Permit “commuting” writes to be performed simultaneously in the same phase. • Phase is completed when all activities that can write have written. <int N> [1:N][] histogram([1:N][] A) { final int[] B = new int [1:N]; finish foreach(int i in A) B[A[i]]++; return B; } B’s phase is not yet complete. A subsequent read will complete it.
Cilk programs with races int x; cilk void foo() { x = x +1; } cilk int main() { x=0; spawn foo(); spawn foo(); sync; printf(“x is \%d\n”, x); return 0; } Determinate: Will always print 1 in CF. CF smoothly combines Cilk and StreamIt.