1. 1 The X10 Programming Language
Vijay Saraswat
IBM TJ Watson Research Center
August 2007
2. 2 A new era of mainstream parallel processing
Contacts: Vivek Sarkar/Watson/IBM, Vijay Saraswat/Watson/IBM
Parallelism scaling replaces frequency scaling as the foundation for increased performance. Parallelism scaling can be observed at three important levels of the hardware stack:
Multi-core parallelism
Heterogeneous parallelism (as in the Cell processor)
Cluster parallelism, as in Blue Gene or in commodity scale-out clusters
The move towards parallelism as the primary driver of system performance will have a profound impact on software, because all software will need to be enabled to exploit parallelism. Some areas of commercial software (e.g. transaction systems) are already prepared for this trend thanks to past investments in SMP enablement. However, much other software is predominantly single-threaded and has been riding the frequency-scaling curve predicted by Moore’s Law for the last two decades. The goal of the X10 project in IBM Research is to respond to this future software crisis by establishing new foundations for programming models, languages, tools, compilers, runtimes, virtual machines, and libraries for parallel hardware.
3. 3 MPI
Library for message-passing
Standardized by the MPI Forum (academics, industry) in the mid-1990s.
Widely available with vendor-supported implementations.
By far the most widely used infrastructure in HPC for parallel computing.
… But very low-level
Explicit, static management of handshakes is cumbersome and error-prone.
Explicit management of distribution is cumbersome and error-prone (cf. MG, HPL).
Not suitable for fine-grained concurrency, adaptive computation, uneven task lengths
Code that is not Bulk Synchronous Parallel
Performance challenges from network support for one-sided memory access and from multicore.
4. 4 Java
Java 1.1 had support for multi-threading.
Program can create multiple threads.
Up to a limit.
Monitor-based concurrency control
wait, notify.
Monolithic heap – no support for distribution.
Cumbersome memory model.
Lock-based concurrency control … no support for lock-free algorithms.
No support for fine-grained concurrency.
No support for closures, value types.
Poor support for arrays.
5. 5 The X10 Programming Model
6. 6 async
async (P) S
Creates a new child activity at place P that executes statement S
Returns immediately
S may reference final variables in enclosing blocks
Activities cannot be named
Activities cannot be aborted or cancelled
7. 7 finish
finish S
Execute S, but wait until all (transitively) spawned asyncs have terminated.
Rooted exception model
Trap all exceptions thrown by spawned activities.
Throw an (aggregate) exception if any spawned async terminates abruptly.
Implicit finish at main activity
finish is useful for expressing “synchronous” operations on (local or) remote data.
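The async and finish rules above can be approximated in plain Java. The following is a minimal sketch, not X10's implementation: `FinishScope`, `demo`, and the use of a thread pool are assumptions made for illustration, and places are not modelled at all. A counter tracks every activity spawned inside the scope, including those spawned by other activities, and failures are collected and rethrown as one aggregate exception.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical Java stand-in for X10's finish/async; places are not modelled. */
final class FinishScope {
    private final ExecutorService pool;
    private final List<Throwable> failures = new CopyOnWriteArrayList<>();
    private int pending = 0; // activities spawned in this scope that have not terminated

    private FinishScope(ExecutorService pool) { this.pool = pool; }

    /** async S: start S as a child activity and return immediately. */
    void async(Runnable body) {
        synchronized (this) { pending++; } // count the child before it can run
        pool.execute(() -> {
            try {
                body.run();
            } catch (Throwable t) {
                failures.add(t); // trapped at the enclosing finish
            } finally {
                synchronized (this) {
                    if (--pending == 0) notifyAll();
                }
            }
        });
    }

    /** finish S: run S, wait for all transitively spawned asyncs, then rethrow failures. */
    static void finish(ExecutorService pool, Consumer<FinishScope> body) {
        FinishScope scope = new FinishScope(pool);
        try {
            body.accept(scope);
        } catch (Throwable t) {
            scope.failures.add(t);
        }
        scope.awaitAll();
        if (!scope.failures.isEmpty()) {
            RuntimeException aggregate = new RuntimeException("activities terminated abruptly");
            scope.failures.forEach(aggregate::addSuppressed);
            throw aggregate;
        }
    }

    private synchronized void awaitAll() {
        boolean interrupted = false;
        while (pending > 0) {
            try { wait(); } catch (InterruptedException e) { interrupted = true; }
        }
        if (interrupted) Thread.currentThread().interrupt();
    }

    /** Four asyncs each add n and spawn a nested async that adds 10 * n. */
    static int demo() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger sum = new AtomicInteger();
        try {
            finish(pool, scope -> {
                for (int i = 1; i <= 4; i++) {
                    final int n = i; // an async body may only capture final variables
                    scope.async(() -> {
                        sum.addAndGet(n);
                        scope.async(() -> sum.addAndGet(10 * n));
                    });
                }
            });
        } finally {
            pool.shutdown();
        }
        return sum.get(); // 110: finish has waited for the nested asyncs too
    }
}
```

Because a child is counted before it starts and a parent is still counted while it spawns, the counter cannot reach zero while any transitively spawned activity is outstanding.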
8. 8 atomic, when
Atomic blocks are
Executed in a single step, conceptually, while other activities are suspended.
An atomic block may not
Block
Access remote data.
Create activities.
Contain a conditional block.
Essentially, the body is a bounded, sequential, non-blocking activity
Hence it executes in a single place.
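One simple way to picture the "single step" reading is a lock per place. The sketch below is an assumption-laden illustration, not how an X10 runtime must work: `PlaceMonitor` and `demo` are hypothetical names, and `when` is modelled here as a conditional atomic block that suspends until its condition holds and then runs its body in the same step.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for one place: a single lock makes each block one step. */
final class PlaceMonitor {
    private final Object lock = new Object();

    /** atomic S: run S as one step relative to every other atomic/when block here. */
    void atomic(Runnable body) {
        synchronized (lock) {
            body.run();
            lock.notifyAll(); // state may have changed, so waiting when-blocks re-check
        }
    }

    /** when (c) S: suspend until c holds, then run S in the same atomic step. */
    void when(BooleanSupplier condition, Runnable body) throws InterruptedException {
        synchronized (lock) {
            while (!condition.getAsBoolean()) {
                lock.wait();
            }
            body.run();
            lock.notifyAll();
        }
    }

    /** One activity waits for cell[0] to become non-zero, then doubles it into cell[1]. */
    static int demo() {
        PlaceMonitor place = new PlaceMonitor();
        int[] cell = {0, 0};
        Thread waiter = new Thread(() -> {
            try {
                place.when(() -> cell[0] != 0, () -> cell[1] = 2 * cell[0]);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        waiter.start();
        place.atomic(() -> cell[0] = 21);
        try {
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cell[1]; // 42
    }
}
```

The body passed to `atomic` must itself respect the restrictions listed above (no blocking, no remote access, no spawning); the sketch does not enforce them.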
9. 9 X10 v1.01 Cheat sheet
10. 10 X10 v1.01 Cheat sheet: Array support
11. 11 An example of finish and async: dfs spanning tree
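The slide's own code was not preserved in this transcript. As a rough idea of the shape such an example can take, here is a Java sketch using the fork/join framework in place of finish and async. The class name `DfsSpanningTree`, the adjacency-array graph format, and the pool size are all assumptions. Each vertex is claimed with a compare-and-set, and the claiming neighbour becomes its parent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicIntegerArray;

final class DfsSpanningTree {
    /** Returns parent[v] for each vertex; parent[root] == root, unreachable vertices get -1. */
    static int[] span(int[][] adj, int root) {
        AtomicIntegerArray parent = new AtomicIntegerArray(adj.length);
        for (int v = 0; v < adj.length; v++) parent.set(v, -1);
        parent.set(root, root);
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            pool.invoke(new Visit(adj, parent, root)); // plays the role of the outer finish
        } finally {
            pool.shutdown();
        }
        int[] result = new int[adj.length];
        for (int v = 0; v < adj.length; v++) result[v] = parent.get(v);
        return result;
    }

    private static final class Visit extends RecursiveAction {
        private final int[][] adj;
        private final AtomicIntegerArray parent;
        private final int u;

        Visit(int[][] adj, AtomicIntegerArray parent, int u) {
            this.adj = adj;
            this.parent = parent;
            this.u = u;
        }

        @Override
        protected void compute() {
            List<Visit> children = new ArrayList<>();
            for (int v : adj[u]) {
                // Claim v atomically; exactly one neighbour wins and becomes its parent.
                if (parent.compareAndSet(v, -1, u)) {
                    children.add(new Visit(adj, parent, v));
                }
            }
            invokeAll(children); // one child task per claimed vertex, joined before returning
        }
    }
}
```

Which neighbour wins a claim depends on scheduling, so the tree can differ between runs, but every reachable vertex ends up with exactly one parent. Note that `invokeAll` joins children at every level, which is stricter than a single outer finish around unjoined asyncs.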
12. 12 An example of clocking: bfs spanning tree
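The slide's code is likewise missing here. A level-synchronous breadth-first search is one natural fit for clocks, and the sketch below illustrates that idea in Java under stated assumptions: a `CyclicBarrier` stands in for a clock, `await()` for advancing it, and `ClockedBfs`, `parents`, and the round-robin vertex partition are hypothetical choices, not taken from the original.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicIntegerArray;

final class ClockedBfs {
    /** Returns a BFS parent for every vertex; parent[root] == root, unreachable vertices get -1. */
    static int[] parents(int[][] adj, int root, int workers) {
        int n = adj.length;
        int[] parent = new int[n];
        Arrays.fill(parent, -1);
        parent[root] = root;
        AtomicIntegerArray level = new AtomicIntegerArray(n);
        for (int v = 0; v < n; v++) level.set(v, -1);
        level.set(root, 0);

        AtomicBoolean grew = new AtomicBoolean(false); // did this phase discover new vertices?
        boolean[] done = {false};                      // written only by the barrier action
        CyclicBarrier clock =
            new CyclicBarrier(workers, () -> done[0] = !grew.getAndSet(false));

        Thread[] pool = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int me = w;
            pool[w] = new Thread(() -> {
                try {
                    for (int d = 0; !done[0]; d++) {
                        for (int u = me; u < n; u += workers) { // this worker's vertices
                            if (level.get(u) != d) continue;    // not on the current frontier
                            for (int v : adj[u]) {
                                if (level.compareAndSet(v, -1, d + 1)) { // claim v for level d+1
                                    parent[v] = u;
                                    grew.set(true);
                                }
                            }
                        }
                        clock.await(); // all workers finish level d before any starts d+1
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[w].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return parent;
    }
}
```

The barrier action runs once per phase, before any worker is released, so every worker reads the same `done` value and they all stop after the first phase that discovers nothing new.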
13. 13 Types
14. 14 Dependent types
Class or interface that is a function of values.
Programmer specifies properties of a type – public final instance fields.
Programmer may specify refinement types as predicates on properties
T(v1,…,vn: c)
all instances of T with the values fi==vi satisfying c.
c is a boolean expression over predefined predicates.
15. 15 Place types
Every X10 reference inherits the property (place loc) from X10RefClass.
Constraints can be placed on this property, e.g.
loc == here
loc == x.loc
No constraints implies the data can be anywhere.
Place types are checked by place-shifting operators (async, future).
16. 16 Region and distribution types
17. 17 Work–stealing for fine grained scheduling
18. 18 CWS extensions being investigated
Decouple call-stack from deque
⇒ strict series/parallel graphs. (cf. dfs/bfs)
Global Termination Detection
Detect when all workers are stealing, none is executing work-items, and there are no messages in flight.
Global Quiescence Detection
Do this repeatedly (for a global clock).
⇒ Need two deques/worker.
Permit use of area-specific data-structures
E.g. for reduction.
19. 19 Work-stealing with network traffic
Use polling mode.
On network call, worker may discover incoming asyncs and move them to its deque.
When idle, steal from other workers or poll.
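The worker loop described here can be sketched in Java as follows. This is a simplified illustration under several assumptions: a shared in-memory queue (`inbox`) stands in for the network, a single shared counter replaces real global termination detection, an incoming item is executed directly instead of first being moved onto the worker's deque, and `StealingSketch`, `Range`, and `sumBelow` are made-up names. Each worker takes its newest local item first, then polls for incoming work, then tries to steal the oldest item from another worker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class StealingSketch {
    /** Work item: sum the integers in [lo, hi). */
    private static final class Range {
        final int lo, hi;
        Range(int lo, int hi) { this.lo = lo; this.hi = hi; }
    }

    /** Sums 0 .. n-1 by recursively splitting ranges across stealing workers. */
    static long sumBelow(int n, int workers) {
        List<ConcurrentLinkedDeque<Range>> deques = new ArrayList<>();
        for (int w = 0; w < workers; w++) deques.add(new ConcurrentLinkedDeque<>());
        ConcurrentLinkedQueue<Range> inbox = new ConcurrentLinkedQueue<>(); // "network" stand-in
        AtomicInteger outstanding = new AtomicInteger(1); // items queued or being executed
        AtomicLong total = new AtomicLong();
        inbox.add(new Range(0, n));

        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int me = w;
            Thread t = new Thread(() -> {
                ConcurrentLinkedDeque<Range> mine = deques.get(me);
                while (outstanding.get() > 0) {
                    Range r = mine.pollLast();                // 1. newest local item
                    if (r == null) r = inbox.poll();          // 2. poll for incoming work
                    for (int k = 1; r == null && k < workers; k++) {
                        r = deques.get((me + k) % workers).pollFirst(); // 3. steal oldest item
                    }
                    if (r == null) { Thread.yield(); continue; }
                    if (r.hi - r.lo <= 1_000) {
                        long s = 0;
                        for (int i = r.lo; i < r.hi; i++) s += i;
                        total.addAndGet(s);
                        outstanding.decrementAndGet();        // leaf finished
                    } else {
                        int mid = r.lo + (r.hi - r.lo) / 2;
                        outstanding.incrementAndGet();        // one item becomes two
                        mine.addLast(new Range(r.lo, mid));
                        mine.addLast(new Range(mid, r.hi));
                    }
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return total.get();
    }
}
```

The counter is incremented before the two halves are pushed and while the parent item is still counted, so it reaches zero only after the last leaf has been summed. A distributed runtime cannot rely on one shared counter, which is why termination detection needs separate treatment.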
20. 20 Workstealing via Dekker
21. 21 Deadlock freedom
22. 22 Deadlock freedom
Where is this useful?
Whenever the synchronization pattern of a program is independent of the data read by the program
True for a large majority of HPC codes.
(Usually not true of reactive programs.)
More general, data-dependent type systems for deadlock-freedom?
Central theorem of X10:
Arbitrary programs with async, atomic, finish, clocks are deadlock-free.
Key intuition:
atomic is deadlock-free.
finish has a tree-like structure.
clocks are made to satisfy conditions which ensure tree-like structure.
Hence no cycles in wait-for graph.
23. 23 Determinacy in X10
24. 24 Imperative Programming Revisited
Variables
Variable=Value in a Box
Read: fetch current value
Write: change value
Stability condition: Value does not change unless a write is performed
Very powerful
Permits repeated many-writer, many-reader communication through arbitrary reference graphs
Mutability in the presence of sharing
Permits different variables to change at different rates.
Asynchrony introduces indeterminacy
May write out either 0 or 1.
Bugs due to races are very difficult to debug.
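The snippet that "may write out either 0 or 1" is not preserved in this transcript. A minimal Java program with that behaviour might look like the sketch below, where `RaceSketch` and `observeOnce` are hypothetical names and an `AtomicInteger` plays the role of the box so that the Java program itself stays well defined.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RaceSketch {
    /** Starts a writer and reads concurrently; the result depends on scheduling. */
    static int observeOnce() {
        AtomicInteger box = new AtomicInteger(0);     // the "value in a box"
        Thread writer = new Thread(() -> box.set(1)); // asynchronous write
        writer.start();
        int seen = box.get();                         // read races with the write
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;                                  // either 0 or 1
    }
}
```

Nothing orders the read relative to the write, so repeated calls can return different values on the same machine.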
25. 25 Determinate Imperative Programming
Key idea
Sequence of assignments to a det variable is viewed as a stream.
Each activity carries an index for each det variable.
Index is increased on every read and write.
Ensure through type-system that at each index exactly one writer can write.
⇒ No races!
Any (recursive) asynchronous Kahn network can be represented thus.
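One possible reading of this scheme, sketched in Java: the det variable is a stream of single-assignment slots, and each activity holds its own cursor into that stream. Everything here is an assumption made for illustration (`DetVar`, `Cursor`, `demo`). In particular, the "exactly one writer per index" rule is checked at run time, whereas the slide says the type system ensures it statically.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical det variable: a stream of slots, each written at most once. */
final class DetVar {
    private final ConcurrentHashMap<Integer, CompletableFuture<Integer>> slots =
        new ConcurrentHashMap<>();

    private CompletableFuture<Integer> slot(int index) {
        return slots.computeIfAbsent(index, i -> new CompletableFuture<>());
    }

    /** An activity's private position in the stream; each read or write advances it. */
    final class Cursor {
        private int index = 0;

        void write(int value) {
            int at = index++;
            if (!slot(at).complete(value)) { // run-time stand-in for the static check
                throw new IllegalStateException("second writer at index " + at);
            }
        }

        int read() {
            return slot(index++).join(); // blocks until this index has been written
        }
    }

    Cursor cursor() { return new Cursor(); }

    /** A producer writes 1, 2, 3; the consumer sums the stream in index order. */
    static int demo() {
        DetVar x = new DetVar();
        Thread producer = new Thread(() -> {
            Cursor w = x.cursor();
            for (int v = 1; v <= 3; v++) w.write(v);
        });
        producer.start();
        Cursor r = x.cursor();
        int sum = 0;
        for (int i = 0; i < 3; i++) sum += r.read();
        return sum; // 6, regardless of how the two threads interleave
    }
}
```

A reader never sees a slot change after it has been written, so the values it observes depend only on the indices, not on timing. This is the behaviour of a channel in a Kahn network.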
26. 26 Current Status
27. 27 X10DT: Enhancing productivity
Code editing
Refactoring
Code visualization
Data visualization
Debugging
Static performance analysis
28. 28 Operational X10 implementation (since 02/2005)
X10 Compiler (06/2007)
29. 29 X10Flash
Distributed runtime
In C/C++
On top of messaging library (GASNet, ARMCI, LAPI)
Targeted for high-performance clusters of SMPs.
X10lib
Runtime also made available as a standalone library.
Supporting global address space, places, asyncs, clocks, futures etc.
Performance goal
To be competitive with MPI
Release schedule
Internal demonstration 12/07
External release 2008
30. 30 Conclusion
New era for concurrency research.
Concurrency is now mainstream – affects millions of developers.
Practical research focused on concurrent language design, analysis techniques, type systems, compiler development, application support … can have a major impact.
31. 31 Acknowledgments
Recent Publications
"Concurrent Clustered Programming", V. Saraswat, R. Jagadeesan. CONCUR conference, August 2005.
"X10: An Object-Oriented Approach to Non-Uniform Cluster Computing", P. Charles, C. Donawa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, V. Sarkar. OOPSLA Onwards! conference, October 2005.
"A Theory of Memory Models", V. Saraswat, R. Jagadeesan, M. Michael, C. von Praun. To appear, PPoPP 2007.
"Experiences with an SMP Implementation for X10 based on the Java Concurrency Utilities", Rajkishore Barik, Vincent Cave, Christopher Donawa, Allan Kielstra, Igor Peshansky, Vivek Sarkar. Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006.
"X10: an Experimental Language for High Productivity Programming of Scalable Systems", K. Ebcioglu, V. Sarkar, V. Saraswat. P-PHEC workshop, February 2005.
Tutorials
TiC 2006, PACT 2006, OOPSLA 2006
X10 Core Team
Rajkishore Barik
Vincent Cave
Chris Donawa
Allan Kielstra
Igor Peshansky
Christoph von Praun
Vijay Saraswat
Vivek Sarkar
Tong Wen
X10 Tools
Philippe Charles
Julian Dolby
Robert Fuhrer
Frank Tip
Mandana Vaziri
Emeritus
Kemal Ebcioglu
Christian Grothoff
Research colleagues
R. Bodik, G. Gao, R. Jagadeesan, J. Palsberg, R. Rabbah, J. Vitek
Several others at IBM
32. 32 Global RandomAccess
33. 33 Global Random Access benchmark
34. 34 Theory of Memory Models
35. 35 Background: Why memory models?
Current architectures optimize for single-thread execution.
Sequential Consistency (SC) is not consistent with these optimizations.
So weaker memory model desired.
Fundamental Property (FP): Programs whose SC executions have no races should have only SC executions.
Who should have responsibility for race-freedom?
Implementation / Programmer
Programmer may know a lot about the computation.
Idea: make semantics as weak as possible, while preserving FP.
Need precise semantics!
36. 36 Test Case 7: Single-thread reordering
(r1=z,r2=x;y=r2 | r3=y;z=r3;x=1)
r1,r2,y=z,x,x | r3,z,x=y,y,1 -- CO,CO
r1,r2,y=z,x,x | r3,z=y,y | x=1 -- DE
x=1; r1,r2,y=z,x,x | r3,z=y,y -- AU
x,r1,r2,y=1,z,1,1 | r3,z=y,y -- CO
x,r1,r2=1,z,1 | y=1 | r3,z=y,y -- DE
x,r1,r2=1,z,1 | y=1 ; r3,z=y,y -- AU
x,r1,r2=1,z,1 | y,r3,z=1,1,1 -- CO
y,r3,z=1,1,1; x,r1,r2=1,z,1 -- AU
y,r3,z,x,r1,r2=1,1,1,1,1,1 -- CO
The last step is a process with a single step.
So: yes, the outcome r1 = r2 = r3 = 1 is permitted.
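For contrast, it can help to see what sequential consistency alone allows for this program. The sketch below enumerates every interleaving of the two threads and collects the resulting register values. It assumes x, y, and z all start at 0, which the slide does not state, and `TestCase7` and `scOutcomes` are hypothetical names.

```java
import java.util.Set;
import java.util.TreeSet;

final class TestCase7 {
    /** Collects "r1,r2,r3" for every sequentially consistent interleaving. */
    static Set<String> scOutcomes() {
        Set<String> out = new TreeSet<>();
        explore(0, 0, new int[6], out); // state layout: x, y, z, r1, r2, r3 (all 0)
        return out;
    }

    private static void explore(int pc1, int pc2, int[] s, Set<String> out) {
        if (pc1 == 3 && pc2 == 3) {
            out.add(s[3] + "," + s[4] + "," + s[5]);
            return;
        }
        if (pc1 < 3) { // next step of thread 1: r1=z; r2=x; y=r2
            int[] t = s.clone();
            switch (pc1) {
                case 0: t[3] = t[2]; break; // r1 = z
                case 1: t[4] = t[0]; break; // r2 = x
                default: t[1] = t[4];       // y = r2
            }
            explore(pc1 + 1, pc2, t, out);
        }
        if (pc2 < 3) { // next step of thread 2: r3=y; z=r3; x=1
            int[] t = s.clone();
            switch (pc2) {
                case 0: t[5] = t[1]; break; // r3 = y
                case 1: t[2] = t[5]; break; // z = r3
                default: t[0] = 1;          // x = 1
            }
            explore(pc1, pc2 + 1, t, out);
        }
    }
}
```

Under these assumptions the enumeration yields only "0,0,0" and "0,1,0". The all-ones outcome never appears, so reaching it requires the reordering transformations applied in the derivation above.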
37. 37 Technical overview
Model sequential execution through steps
Finite functions over partial stores.
Model concurrent execution through a pomset of steps
Partial order reflects the “happens before” order.
Memory Model = Transformations
Combine steps
Break apart steps
Simplify steps
Add hb edges
Add “links”
Develop a formal calculus to establish all possible behaviors.
All Causality Cases can be dealt with in a simple fashion.
Model synchronization constructs
raw and shared variables
async, finish
atomic, isolated
… more constructs?