A Presentation to CGCSC 2006: ParalleX: Towards a New Parallel Execution Model for Scalable Programming and Architectures. Thomas Sterling, Louisiana State University / California Institute of Technology / Oak Ridge National Laboratory. September 11, 2006
Parallel Computing Challenges • Ease of programming • Representation of application algorithms • Control of resource management • Efficiency of operation • Overhead, latency, contention (bandwidth), starvation • Scalability – strict and weak • Tracking technology advances • Relative costs & sizes, balance of feeds & speeds, physical limits • Exposure & Exploitation of parallelism • Reduce execution time • Increase throughput • Hide latencies and overlap overheads • Symbolic computing, e.g. directed graphs
Observations on Historical Trends • Multiple Technology Generations • 5 generations of main memory • 6 generations of processor logic • Moore’s Law • DRAM • 4X every 3 years • Top-500 • Stretchable Linpack benchmark • ~100X per decade • Traverses multiple architecture classes • Architecture innovation tracks technology opportunity • Exploits technology strengths while mitigating weaknesses • Adopts alternative models of computation to govern execution • Technology has progressed > 2 orders of magnitude (density) while execution model has not changed in 15 years
Attributes of an Execution Model • Specifies referents, their interrelationships, and actions that can be performed on them • Defines semantics of state objects, functions, parallel flow control, and distributed interactions • Leaves unbound policies of implementation technology, structure, and mechanism • Is NOT a programming model, architecture, or virtual machine • Enables reasoning about the decision chain for resource scheduling • Conceptual framework for considering design decisions for programming languages, compilers, runtime, OS, and hardware architectures
Towards A New Model of Parallel Computation • Serve as a discipline to govern future scalable system architectures, programming methods, and runtime/OS • Address Dominant Challenges to Enable High Efficiency • Latency • Overhead • Starvation • Resource contention • Programmability • Originated with ANL DOE Advanced Programming project • Driver for SNL/UNM DOE Fast-OS Config-OS project • Provide alternative model to conventional CSP paradigm • Target for composability methodology and lightweight kernels • Test for dynamic adaptive runtime components • Explore system architecture and support mechanisms under NSF • Potentially divisive, disruptive, and ignored • Borrows from a lot of good stuff
Prior Projects that Influenced ParalleX • Beowulf • NASA, GSFC & JPL • Continuum Computing Architecture • DARPA, Caltech • HTMT • NSF/DARPA/NSA/NASA, Caltech & JPL • DIVA • DARPA, USC ISI • Gilgamesh • NASA, JPL • Percolation • NSF, U. of Delaware • Advanced Programming Models • DOE, ANL • Cascade • DARPA, Cray • Config-OS • DOE, SNL & UNM
Latency Hiding with Parcels: Idle Time with respect to Degree of Parallelism
LSU Research Strategy • Devise a new model of computation • Liberates current and future generation technologies • Empowers computer architecture beyond incrementalism • Specification with intermediate form syntax • Develop early reference implementations for near term application experiments and architecture studies • Derive resource/action decision chain strategy within context of new model to establish responsibilities of every level from application to hardware • Explore implications for programming models and compilers, runtime and operating systems, and parallel architecture design • Apply to conventional systems for near term improvements in non-disruptive market context • Parcel/threads FPGA accelerator
ParalleX Model’s Major Features • Specifies semantics, not implementation policies • Does not assume particular architecture • Permits different hardware/software implementations • Lends itself to innovative optimization and heterogeneity • Not a language • Provides an intermediate form (PXI) • Intrinsically latency hiding • Highly asynchronous • Exposes wide variety of forms of parallelism • Degrees of granularity, including fine grain • Different types of parallel action • Separation of virtual logical space from physical execution and storage space • Towards low power, multi-core, PIM, fault tolerance
A New Synthesis of Selected Concepts • Split-phase transaction • Telephone switching and database transaction processing • Dennis/Arvind with dataflow • Yelick/Culler with Split-C • Message-driven • Dennis using tokens in dataflow • Dally in his J-machine • Hewitt with Actors model • The whole object-oriented thing • Multi-threaded • Smith with HEP & MTA • Unix Pthreads
More Synthesis • Distributed Shared Memory • Scott with T3D • BBN with Butterfly machines • Futures • Hewitt with Actors • Halstead’s Multilisp • Arvind’s I-structures • Machines like • MTA • J-machine • Percolation • Sterling & Gao in HTMT • Babb’s coarse-grain dataflow • Salmon and Warren in their N-body tree code
Yet more Synthesis • Lightweight control objects • Dennis’s dataflow templates • Struct processing from MIND • One-sided communication • Gropp et al. with MPI-2 • Carlson with UPC • In-memory synchronization • Smith & Callahan with empty/full bits • Copy semantics • Variation on Gao’s location consistency model
Localities • A “locality” is a contiguous physical domain • Guarantees compound atomic operations on local state • Manages intra-locality latencies • Exposes diverse temporal locality attributes • Divides the world into synchronous and asynchronous • System comprises a set of mutually exclusive, collectively exhaustive localities • A first class object • An attribute of other objects • Heterogeneous • Specific inalienable properties
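To make the locality notion concrete, here is a minimal, purely illustrative C++ sketch (not part of the ParalleX specification) of a locality as a first-class descriptor whose compound operations on local state execute as a unit; the Locality type, its fields, and the mutex-based guard are all assumptions introduced for illustration.

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <mutex>

    struct LocalState { double value = 0.0; std::uint64_t updates = 0; };

    class Locality {
        std::uint32_t id;
        LocalState    state;
        std::mutex    guard;   // stand-in for whatever mechanism enforces local atomicity
    public:
        explicit Locality(std::uint32_t i) : id(i) {}
        // Compound atomic operation on local state: the whole functor runs as one unit.
        void atomically(const std::function<void(LocalState&)>& op) {
            std::lock_guard<std::mutex> lock(guard);
            op(state);
        }
        std::uint32_t name() const { return id; }
    };

    int main() {
        Locality loc(0);
        loc.atomically([](LocalState& s) { s.value += 1.0; ++s.updates; });
        std::cout << "locality " << loc.name() << " updated\n";
    }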
Split Phase Transactions • A transaction is a set of interdependent actions on exchanged values • Transactions are divided between successive phases • All actions of a transaction phase are relatively local • Assigned to a given execution element • Operations perform on local state for low latency • Phases are divided at stages of remote access or service request • Thus, phasing is asynchronous at the split points
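The split-phase idea can be sketched in C++ as follows; remote_read() and its synchronous stub are hypothetical stand-ins for the runtime, introduced only to show phase 1 doing local work, handing phase 2 to a continuation, and never blocking across the remote access.

    #include <cstdint>
    #include <functional>
    #include <iostream>

    struct RemoteAddress { std::uint64_t id; };

    // Stub standing in for the runtime: in a real system the continuation would
    // run only when the reply parcel arrives from another locality.
    void remote_read(RemoteAddress src, std::function<void(double)> continuation) {
        continuation(42.0 + static_cast<double>(src.id));   // simulated reply
    }

    // Phase 1 performs only local operations, then splits: it hands phase 2 to
    // the runtime as a continuation and returns without blocking on the remote access.
    void transaction(RemoteAddress src, double local_operand) {
        double partial = local_operand * 2.0;                // phase 1: local work
        remote_read(src, [partial](double remote_value) {    // split point
            std::cout << "phase 2 result: " << partial + remote_value << "\n";
        });
    }

    int main() { transaction(RemoteAddress{7}, 1.5); }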
Multi-Grain Multithreading • Threads are collections of related operations that perform on locally shared data • A thread is a continuation combined with a local environment • Modifies local named data state and temporaries • Updates intra thread and inter thread control state • Does not assume sequential execution • Other flow control for intra-thread operations possible • Thread can realize transaction phase • Thread does not assume dedicated execution resources • Thread is first class object identified in global name space • Thread is ephemeral
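A hedged sketch of the “thread = continuation plus local environment” view follows; PxThread, its fields, and run_once() are invented names, and a real scheduler would manage many such ephemeral threads rather than one.

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    // First-class, ephemeral thread object: a continuation paired with the local
    // environment it modifies; it owns no dedicated execution resources.
    struct PxThread {
        std::uint64_t global_id;                               // named in the global space
        std::unordered_map<std::string, double> environment;   // locally shared data
        std::function<void(PxThread&)> continuation;           // remaining work
    };

    // Run the thread's continuation once; when it finishes (or splits), the
    // thread object itself can be discarded.
    void run_once(PxThread& t) { t.continuation(t); }

    int main() {
        PxThread t{1, {{"x", 2.0}}, [](PxThread& self) {
            self.environment["x"] *= 3.0;                      // update local state
            std::cout << "x = " << self.environment["x"] << "\n";
        }};
        run_once(t);
    }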
Parcels • Enables message-driven computation • Messages that specify function to be performed on a named element • Moves work and data between objects in different localities • Parcels are not first-class objects • Exists in the world of “parcel sets” • First-class objects • Transfer between parcel sets is atomic, invariant, and unobservable • Major semantic content • Destination object • Action to be performed on targeted object • Operands for function to be performed • Continuation specifier
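The four items of semantic content listed above map naturally onto a small record; the following C++ sketch is illustrative only, with field types chosen arbitrarily, and the ParcelSet shown as a plain queue rather than a faithful model of atomic, unobservable transfer.

    #include <cstdint>
    #include <deque>
    #include <string>
    #include <utility>
    #include <vector>

    struct Parcel {
        std::uint64_t destination;      // global name of the targeted object
        std::string   action;           // function to perform on that object
        std::vector<double> operands;   // operands for the function
        std::uint64_t continuation;     // continuation specifier: what to resume afterwards
    };

    // Parcels themselves are not first class; they exist inside parcel sets,
    // which are.  Moving a parcel between sets moves the whole value at once.
    struct ParcelSet {
        std::deque<Parcel> pending;
        void enqueue(Parcel p) { pending.push_back(std::move(p)); }
    };

    int main() {
        ParcelSet inbox;
        inbox.enqueue(Parcel{42, "increment", {1.0}, 7});
        return static_cast<int>(inbox.pending.size()) - 1;    // 0 on success
    }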
Percolation Pre-Staging • An important latency hiding and scheduling technique • Overhead functions are not necessarily done optimally by high speed processors • Moves data and task specification to local temporary storage of an execution element by external means • Minimum overhead at execution site • Almost no remote accesses • Cycle: dispatch/prestage/execute/commit/control update • High speed execution element operates on work queue • Processors are dumb, memory is smart • Good for accelerators, functional elements, precious resources
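A rough sketch of the dispatch/prestage/execute/commit cycle, under the assumption that an external staging agent copies both the task specification and its data into a local work queue so the fast execution element performs no remote accesses; all names here are hypothetical.

    #include <iostream>
    #include <queue>
    #include <utility>
    #include <vector>

    struct StagedTask {
        std::vector<double> data;                        // already copied to local temporary storage
        double (*kernel)(const std::vector<double>&);    // task specification
    };

    std::queue<StagedTask> work_queue;                   // the fast execution element only reads this

    // Done by external means (the "smart memory" side): stage data and task together.
    void prestage(std::vector<double> remote_data,
                  double (*kernel)(const std::vector<double>&)) {
        work_queue.push(StagedTask{std::move(remote_data), kernel});
    }

    // "Dumb" high-speed execution element: minimal overhead, no remote accesses.
    void execute_loop() {
        while (!work_queue.empty()) {
            StagedTask t = std::move(work_queue.front());
            work_queue.pop();
            std::cout << "commit: " << t.kernel(t.data) << "\n";   // commit the result
        }
    }

    double sum(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s;
    }

    int main() { prestage({1, 2, 3}, sum); execute_loop(); }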
Fine-grain event driven synchronization: breaking the barrier • A number of forms of synchronization are incorporated into the semantics • Message-driven remote thread instantiation • Lightweight objects • Data flow • Futures • In-memory synchronization • Control state is in the name space of the machine • Producer-consumer in memory • e.g., empty/full bits • Local mutual exclusion protection • Synchronization mechanisms as well as state are presumed to be intrinsic to memory • Directed trees and graphs • Low cost traversal
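Empty/full-bit style producer-consumer synchronization can be approximated in ordinary C++ as below; in the model the mechanism is presumed intrinsic to memory, so the mutex/condition-variable cell here is only a software stand-in for illustration.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>

    class FullEmptyCell {
        std::mutex m;
        std::condition_variable cv;
        double value = 0.0;
        bool full = false;
    public:
        void write(double v) {                       // waits until the cell is empty, then fills it
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return !full; });
            value = v; full = true;
            cv.notify_all();
        }
        double read() {                              // waits until the cell is full, then empties it
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return full; });
            full = false;
            cv.notify_all();
            return value;
        }
    };

    int main() {
        FullEmptyCell cell;
        std::thread producer([&] { cell.write(3.14); });
        std::cout << "consumed: " << cell.read() << "\n";
        producer.join();
    }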
Global name space • User variables • Synchronization variables and objects • Threads as first-class objects • Moves virtual named elements in physical space • Parcel sets • Process • First class object • Specifies a broad task • Defines a distributed environment • Spans multiple localities • Need not be contiguous
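One way to picture the separation of virtual names from physical placement is a directory from global identifiers to (locality, local address) pairs, so a named element can migrate without its name changing; this sketch is an assumption for illustration, not the ParalleX mechanism itself.

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    struct Placement { std::uint32_t locality; std::uint64_t local_address; };

    class GlobalNameSpace {
        std::unordered_map<std::uint64_t, Placement> directory;    // global id -> current placement
    public:
        void bind(std::uint64_t id, Placement p)    { directory[id] = p; }
        void migrate(std::uint64_t id, Placement p) { directory[id] = p; }   // the name stays the same
        Placement resolve(std::uint64_t id) const   { return directory.at(id); }
    };

    int main() {
        GlobalNameSpace gns;
        gns.bind(42, {0, 0x1000});
        gns.migrate(42, {3, 0x2000});                // object moved; global name 42 unchanged
        std::cout << "object 42 now on locality " << gns.resolve(42).locality << "\n";
    }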
Beyond current scope • Policies not specified • Execution order • Language and language syntax • What’s special about hardware • Runtime vs. OS responsibilities • Load balancing • What’s missing • Affinity, colocation • Fault intrinsics • Meta threads • I/O • Many details
PXIF – ParalleX Intermediate Form • Not a programming language • Provides command line-like hooks to relate to and control all elements and actions of ParalleX execution • Lists of actions and operands • ‘(<action> <target object> <function operands>)’ • Some special forms (sadly) • Create, delete, and move objects • Invoke, terminate, migrate actions • Syntaxless syntax • Currently uses a prefix notation but is amenable to other forms as long as there is a one-to-one isomorphism • Note that MPI has more than one syntax • In-work • Not finished • Subject to change • But great progress
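Since PXIF is described here only as prefix-notation lists of actions and operands, the following C++ sketch shows one plausible internal representation of such ‘(<action> <target object> <function operands>)’ forms; the node types and the hand-built example are purely hypothetical and do not reflect the actual PXIF implementation.

    #include <iostream>
    #include <memory>
    #include <string>
    #include <variant>
    #include <vector>

    struct Form;                                               // a list: an action followed by its operands
    using Atom = std::variant<std::string, double>;            // symbols and numeric operands
    using Expr = std::variant<Atom, std::shared_ptr<Form>>;    // an element is an atom or a nested form
    struct Form { std::vector<Expr> items; };

    int main() {
        // Hand-built stand-in for a form such as (invoke <target> <operand>).
        Form invoke;
        invoke.items = { Expr{Atom{std::string{"invoke"}}},
                         Expr{Atom{std::string{"some-target"}}},
                         Expr{Atom{2.0}} };
        std::cout << "form with " << invoke.items.size() << " elements\n";
    }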
Reference Implementation • Goal • Validation of semantics and PXI formulation • Correctness • Completeness • Early testbed for experimentation and algorithm development • Executable reference for future PXI implementations by external collaborators • Strategy • Facilitates development of PXIF syntax specification • Employ rapid prototyping software development environment • Incremental design • Replace existing functions with PXI-specific modules • Refinement of ParalleX concepts and PXIF formalism
Accomplishments and Status • A previous toy version of some parts of early ParalleX concepts (predated actual PXIF) • Running PXIF interpreter (version 0.4) • Distributed version 0.3.2 to NASA ARC & UNM • Based on open source Common Lisp environments • CLISP & SBCL • Compiled interpreter • Virtual machine for execution of serialized code • ~75% of PXIF implemented • Object representation of system, localities, threads, data, text • Futures, templates, continuations • Data types: arrays, structs, scalars • Automatic compilation of methods • Some small application kernels running on multithreaded PXVM • Collects runtime statistics • e.g., #threads, #instr, histogramming • Rudimentary PXIF debugger
PXIF from Sources to Execution (diagram): PXIF sources pass through a high-level translator, a static compiler, and a dynamic compiler into an internal representation held in a pool of text objects; thread schedulers dispatch those text objects onto multiple multi-threaded VMs.
Conclusions • Undertaking an exploratory study of alternative execution models • Influenced by early architecture studies • Benefits from previous projects • Working toward • Specification • Reference implementation • Cost quantification • Realization on conventional distributed systems • FPGA-based accelerator • Architecture refinement • Programming model and language