
Presentation Transcript


  1. Topic 3 -- II: System Software Fundamentals: Multithreaded Execution Models, Virtual Machines and Memory Models Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor Electrical & Computer Engineering University of Delaware ggao@capsl.udel.edu CPEG421-2001-F-Topic-3-II

  2. Outline • An introduction to parallel program execution models • Coarse-grain vs. fine-grain multithreading • Evolution of fine-grain multithreaded program execution models • Memory and synchronization models • Fine-grain multithreaded execution and virtual machine models for peta-scale computing: a case study on HTMT/EARTH CPEG421-2001-F-Topic-3-II

  3. Terminology Clarification • Parallel Model of Computation • Parallel Models for Algorithm Designers • Parallel Models for System Designers • Parallel Programming Models • Parallel Execution Models • Parallel Architecture Models CPEG421-2001-F-Topic-3-II

  4. System Characterization Questions: Q1: What characteristics of a computational system are required … Q2: The diversity of existing and potential multi-core architectures… Response: R1: An important characteristic of such a compiler is that it is built, at both the chip level and the system level, on a program execution model that at least includes a specification and an API Gao, ECCD Workshop, Washington D.C., Nov. 2007 CPEG421-2001-F-Topic-3-II

  5. What Does Program Execution Model (PXM) Mean ? • The notion of PXM The program execution model (PXM) is the basic low-level abstraction of the underlying system architecture upon which our programming model, compilation strategy, runtime system, and other software components are developed. • The PXM (and its API) serves as an interface between the architecture and the software. CPEG421-2001-F-Topic-3-II

  6. Program Execution Model (PXM) – Cont’d Unlike an instruction set architecture (ISA) specification, which usually focuses on lower level details (such as instruction encoding and organization of registers for a specific processor), the PXM refers to machine organization at a higher level, for a whole class of high-end machines, as viewed by the users Gao et al., 2000 CPEG421-2001-F-Topic-3-II

  7. What is your “Favorite” Program Execution Model? CPEG421-2001-F-Topic-3-II

  8. A Generic MIMD Architecture [figure: nodes, each containing processor(s) with caches and a memory system plus a communication assist (network interface and communication controller), connected by a scalable interconnection network (full-featured interconnect, packet-switching fabric)] Key: scalable network. Objective: make efficient use of scarce communication resources, providing high-bandwidth, low-latency communication between nodes at minimum cost and energy. CPEG421-2001-F-Topic-3-II

  9. Programming Models for Multi-Processor Systems • Message Passing Model: multiple address spaces; communication can only be achieved through explicit “messages” • Shared Memory Model: a single memory address space is accessible to all processors; communication is achieved through memory [figures: processors with local memories exchanging messages vs. processors accessing a global memory] CPEG421-2001-F-Topic-3-II
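
As a concrete illustration of the message-passing model, the sketch below uses standard MPI calls; the program itself is my own example and is not part of the original slides. In the shared-memory model the same exchange would simply be a store and a load on a common address, with ordering and races handled by the synchronization and memory models discussed on the following slides.

```c
/* Message-passing sketch (illustrative, not from the slides): two ranks with
 * separate address spaces; the only way to share a value is an explicit
 * send/receive pair.  Compile with mpicc, run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;                                   /* lives only in rank 0's memory   */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);        /* communication is the message    */
    }

    MPI_Finalize();
    return 0;
}
```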

  10. Comparison Shared Memory: global shared address space; easy to program (?); no explicit message passing (communication through memory put/get operations); requires synchronization (memory consistency models, cache models); scalability is a challenge. Message Passing: less contention; highly scalable; simplified synchronization (a message combines synchronization and communication); but this does not mean it is highly programmable; load balancing, deadlocks, and the overhead of small messages remain issues. CPEG421-2001-F-Topic-3-II

  11. What is a Shared Memory Execution Model? The Thread Virtual Machine: an execution model comprising • a thread model: a set of rules for creating, destroying and managing threads • a synchronization model: a set of mechanisms to protect against data races • a memory model: dictates the ordering of memory operations CPEG421-2001-F-Topic-3-II
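
A minimal sketch of the three sub-models, assuming POSIX threads (my own illustration, not code from the course): thread creation and joining exercise the thread model, the mutex is the synchronization model, and the lock also provides the ordering guarantees that the memory model must define.

```c
/* Compile with: cc -pthread threads.c */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                     /* shared state */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);           /* synchronization model: no data race      */
        counter++;                           /* memory model: the lock orders the updates */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL); /* thread model: create ...                 */
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);                  /* ... and manage/destroy threads           */
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);      /* 200000 */
    return 0;
}
```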

  12. Essential Aspects in User-Level Shared Memory Support? • Shared address space support and management • Access control and management • Memory consistency model (MCM) • Cache management mechanism CPEG421-2001-F-Topic-3-II
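
To make the memory consistency model bullet concrete, the sketch below uses C11 atomics; this is an illustration I have chosen for the transcript, not a mechanism named on the slide. Without the release/acquire pair, a consumer running on a weakly ordered machine could legally observe ready == 1 while still reading a stale data.

```c
/* Release/acquire hand-off with C11 <stdatomic.h>.  Compile with: cc -pthread mcm.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int data = 0;
atomic_int ready = 0;

static void *producer(void *arg) {
    data = 42;                                                /* plain write            */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* publishes 'data'       */
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                     /* spin until published   */
    printf("data = %d\n", data);                              /* guaranteed to print 42 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```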

  13. Grand Challenge Problems • How to build a shared-memory multiprocessor that is scalable both within a multi-core/many-core chip and across a system with many chips? • How to program and optimize application programs? Our view: one major obstacle to solving these problems is the memory coherence assumption in today’s hardware-centric memory consistency models. CPEG421-2001-F-Topic-3-II

  14. A Parallel Execution Model [figure: the components of a parallel execution model: an application programming interface (API), a thread model, a synchronization model, a memory model, and the execution / architecture model] CPEG421-2001-F-Topic-3-II

  15. A Parallel Execution Model – Our Model [figure: the same components, with dataflow origins: a fine-grained multithreaded model, a fine-grained synchronization model, and a memory adaptive / aware model, under an application programming interface (API) and above the execution / architecture model] CPEG421-2001-F-Topic-3-II

  16. Comment on OS impact? • Should the compiler be OS-aware too? If so, how? • Or other alternatives? Compiler-controlled runtimes, or compiler-aware kernels, etc. • Example: software pipelining … Gao, ECCD Workshop, Washington D.C., Nov. 2007 CPEG421-2001-F-Topic-3-II

  17. Outline • An introduction to multithreaded program execution models • Coarse-grain vs. fine-grain parallel execution models – a historical overview • Fine-grain multithreaded program execution models • Memory and synchronization models • Fine-grain multithreaded execution and virtual machine models for extreme-scale machines: a case study on HTMT/EARTH CPEG421-2001-F-Topic-3-II

  18. Coarse-Grain Execution Models • The Single Instruction Multiple Data (SIMD) model: a pipelined vector unit or an array of processors driven by one instruction stream • The Single Program Multiple Data (SPMD) model: each processor runs its own copy of the same program • The data parallel model: the same task applied by each processor to its section of a partitioned data structure CPEG421-2001-F-Topic-3-II
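
As a hedged illustration of the SPMD / data-parallel style (OpenMP is my choice of vehicle here; the slide does not name a particular system), every worker executes the same program text on its own slice of the data:

```c
/* Data-parallel / SPMD sketch: one program, many workers, each applying the
 * same operation to its share of the array.  Compile with: cc -fopenmp spmd.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = i;

    #pragma omp parallel for          /* same "program", different data slices */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```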

  19. Data Parallel Model Limitations • Difficult to write unstructured programs • Convenient only for problems with regular, structured parallelism • Limited composability! [figure: execution alternates between global compute phases and communication phases] This is an inherent limitation of coarse-grain multithreading. CPEG421-2001-F-Topic-3-II
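
A minimal sketch (my own illustration, assuming a barrier-based SPMD runtime such as OpenMP) of the rigid compute / communicate phase structure that limits composability: no worker may start the next phase until the slowest worker has finished the current one.

```c
/* Alternating compute / communicate phases in lock-step.  Compile: cc -fopenmp phases.c */
#include <omp.h>
#include <stdio.h>

#define N 1024
#define STEPS 10

int main(void) {
    static double local[N], shared[N];

    #pragma omp parallel
    for (int step = 0; step < STEPS; step++) {
        #pragma omp for                    /* compute phase                               */
        for (int i = 0; i < N; i++)
            local[i] = shared[i] * 0.5 + 1.0;

        #pragma omp for                    /* "communication" phase: publish the results  */
        for (int i = 0; i < N; i++)
            shared[i] = local[i];
        /* each 'omp for' ends in an implicit barrier: all workers wait for the slowest   */
    }

    printf("shared[0] = %f\n", shared[0]);
    return 0;
}
```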

  20.-23. Dataflow Model of Computation [animation across four slides: a dataflow graph whose operand arcs, labeled a-e, feed two + nodes and a * node; input tokens 1, 3 and 4, 3 arrive, the + nodes fire producing 4 and 7, and the * node then fires producing 28] CPEG421-2001-F-Topic-3-II

  24. Dataflow Model of Computation [the same graph with a fresh set of input tokens entering while the result 28 is still being produced] Dataflow Software Pipelining CPEG421-2001-F-Topic-3-II
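
A minimal sketch (my own illustration, not code from the course) of the firing rule driving the animation above: a node fires as soon as all of its operand tokens have arrived, so the two additions may fire in either order before the multiply.

```c
/* Tiny dataflow interpreter: a node fires when both of its operands have arrived. */
#include <stdio.h>

typedef struct {
    char op;          /* '+' or '*'                                        */
    int  operand[2];
    int  arrived;     /* how many operand tokens have arrived              */
    int  dest;        /* node that receives this node's result (-1: output) */
    int  dest_slot;   /* which operand slot of the destination             */
} Node;

/* Deliver a token; fire the node once both operands are present. */
static void send_token(Node *nodes, int id, int slot, int value) {
    if (id < 0) { printf("result = %d\n", value); return; }
    Node *n = &nodes[id];
    n->operand[slot] = value;
    if (++n->arrived == 2) {
        int r = (n->op == '+') ? n->operand[0] + n->operand[1]
                               : n->operand[0] * n->operand[1];
        n->arrived = 0;                     /* ready for the next wave of tokens */
        send_token(nodes, n->dest, n->dest_slot, r);
    }
}

int main(void) {
    /* graph: node0 = a+b, node1 = c+d, node2 = node0 * node1 */
    Node nodes[3] = {
        { '+', {0, 0}, 0, 2, 0 },
        { '+', {0, 0}, 0, 2, 1 },
        { '*', {0, 0}, 0, -1, 0 },
    };
    /* tokens may arrive in any order; each node fires only when enabled */
    send_token(nodes, 0, 0, 1);   /* a = 1                                            */
    send_token(nodes, 1, 0, 4);   /* c = 4                                            */
    send_token(nodes, 0, 1, 3);   /* b = 3 -> node0 fires, produces 4                 */
    send_token(nodes, 1, 1, 3);   /* d = 3 -> node1 fires (7) -> node2 fires (28)     */
    return 0;
}
```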

  25. Outline • An introduction to multithreaded program execution models • Coarse-grain vs. fine-grain parallel execution models – a historical overview • Fine-grain multithreaded program execution models • Memory and synchronization models • Fine-grain multithreaded execution and virtual machine models for peta-scale machines: a case study on HTMT/EARTH CPEG421-2001-F-Topic-3-II

  26. Coarse-Grain vs. Fine-Grain Multithreading [two figures: a CPU/memory pair whose thread unit (executor locus) hosts a single thread, vs. one whose thread unit serves a pool of threads] Coarse-grain threads: the “family home” model. Fine-grain non-preemptive threads: the “hotel” model. [Gao: invited talk at Fran Allen’s Retirement Workshop, 07/2002] CPEG421-2001-F-Topic-3-II

  27. Evolution of Multithreaded Execution and Architecture Models [timeline chart] Non-dataflow based: CDC 6600 (1964), Flynn’s processor (1969), CHoPP ’77, CHoPP ’87, HEP (B. Smith, 1978), Tera (B. Smith, 1990-), MASA (Halstead, 1986), Alewife (Agarwal, 1989-96), Eldorado, CASCADE, J-Machine (Dally, 1988-93), M-Machine (Dally, 1994-98), Cosmic Cube (Seitz, 1985); others: Multiscalar (1994), SMT (1995), etc. Dataflow-model inspired: Static Dataflow (Dennis, 1972), LAU (Syre, 1976), MIT TTDA (Arvind, 1980), Manchester (Gurd & Watson, 1982), SIGMA-1 (Shimada, 1988), Monsoon (Papadopoulos & Culler, 1988), P-RISC (Nikhil & Arvind, 1989), Iannucci’s (1988-92), TAM (Culler, 1990), *T/Start-NG (MIT/Motorola, 1991-), EM-5/4/X RWC-1 (1992-97), Cilk (Leiserson), Argument-Fetching Dataflow (Dennis & Gao, 1987-88), MDFA (Gao, 1989-93), MTA (Hum, Theobald & Gao, 1994), EARTH (PACT ’95, ISCA ’96, Theobald ’99), CARE, Marquez ’04. CPEG421-2001-F-Topic-3-II

  28. The von Neumann-Type Processing [figure: source code (begin … for i = 1 … endfor … end) is compiled into a sequential machine representation, which is loaded onto the processor (CPU)] CPEG421-2001-F-Topic-3-II

  29. A Multithreaded Architecture [figure: one processing element (PE), connected to the other PEs] CPEG421-2001-F-Topic-3-II

  30. The McGill Dataflow Architecture Model (MDFA) CPEG421-2001-F-Topic-3-II

  31. Argument-Flow vs. Argument-Fetching [two figures contrasting the principles: under the argument-flow principle, node n1 stores its result into each of its successors n2 and n3; under the argument-fetching principle, n1 stores its result once and n2 and n3 fetch the operand themselves when they are signaled] CPEG421-2001-F-Topic-3-II

  32. A Dataflow Program Tuple Program Tuple = { P-Code, S-Code } [figure: the P-code, held by the instruction processing unit (IPU), contains the instructions N1: x = a + b; N2: y = c - d; N3: z = x * y; the S-code, held by the instruction scheduling unit (ISU), records for each instruction its enable count and the list of instructions it signals] CPEG421-2001-F-Topic-3-II

  33. The McGill Dataflow Architecture Model [figure: a Pipelined Instruction Processing Unit (PIPU) and a Dataflow Instruction Scheduling Unit (DISU) with an enable memory and controller; the DISU sends “fire” signals to the PIPU, and the PIPU returns “done” signals for signal processing] CPEG421-2001-F-Topic-3-II

  34. The McGill Dataflow Architecture Model (important features) [figure: the DISU holds the pools of enabled and waiting instructions, taking the place of a program counter (PC)] The pipeline can be kept fully utilized provided that the program has sufficient parallelism. CPEG421-2001-F-Topic-3-II

  35. The Scheduling (Enable) Memory [figure: inside the DISU, a controller processes incoming “done” signals; each instruction’s entry holds its signal count and signal list, and instructions move from the waiting pool to the enabled pool (to be fired) once their counts are satisfied] CPEG421-2001-F-Topic-3-II
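
A software sketch of the count/signal mechanism, using the three-instruction program of slide 32. The actual counts and encodings in the MDFA figures are not legible in this transcript, so the numbers and data layout below are illustrative only.

```c
/* Argument-fetching dataflow sketch: P-code = instructions that read/write
 * named memory locations; S-code = per-instruction enable count + signal list.
 * An instruction fires when its count reaches zero; when it completes
 * ("done"), it sends signals (not data) to its successors. */
#include <stdio.h>

enum { N1, N2, N3, NUM_INSNS };

/* S-code entry: enable count and the instructions to signal on completion */
typedef struct { int count; int nsucc; int succ[2]; } SCode;

static int a = 1, b = 3, c = 10, d = 3;      /* operands live in memory */
static int x, y, z;

static SCode scode[NUM_INSNS] = {
    [N1] = { .count = 0, .nsucc = 1, .succ = { N3 } },  /* x = a + b, signals N3        */
    [N2] = { .count = 0, .nsucc = 1, .succ = { N3 } },  /* y = c - d, signals N3        */
    [N3] = { .count = 2, .nsucc = 0 },                  /* z = x * y, waits for 2 signals */
};

static void execute(int id) {                /* P-code: fetch operands, compute, store */
    switch (id) {
    case N1: x = a + b; break;
    case N2: y = c - d; break;
    case N3: z = x * y; break;
    }
}

int main(void) {
    int enabled[NUM_INSNS], head = 0, tail = 0;
    for (int i = 0; i < NUM_INSNS; i++)      /* initially enabled: count == 0 */
        if (scode[i].count == 0) enabled[tail++] = i;

    while (head < tail) {                    /* DISU loop: fire, then process "done" signals */
        int id = enabled[head++];
        execute(id);
        for (int s = 0; s < scode[id].nsucc; s++) {
            int succ = scode[id].succ[s];
            if (--scode[succ].count == 0)    /* all signals received: enable the successor */
                enabled[tail++] = succ;
        }
    }
    printf("z = %d\n", z);                   /* (1 + 3) * (10 - 3) = 28 */
    return 0;
}
```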

  36. Advantages of the McGill Dataflow Architecture Model • Eliminate unnecessary token copying and transmission overhead • Instruction scheduling is separated from the main datapath of the processor (e.g. asynchronous, decoupled) CPEG421-2001-F-Topic-3-II

  37. Von Neumann Threads as Macro Dataflow Nodes [figure: instructions 1, 2, 3, …, k packed into one node] A sequence of instructions is “packed” into a macro dataflow node; synchronization is done at the macro-node level. CPEG421-2001-F-Topic-3-II

  38. Hybrid Evaluation: “Von Neumann Style” Instruction Execution on the McGill Dataflow Architecture • Group a sequence of dataflow instructions into a “thread” (a macro dataflow node). • Data-driven synchronization among threads. • Von Neumann style sequencing within a thread. Advantage: preserves the parallelism among threads while avoiding unnecessary fine-grain synchronization between instructions within a sequential thread. CPEG421-2001-F-Topic-3-II
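
A hedged software sketch of the hybrid scheme, reusing the count/signal style of the earlier sketch: the synchronization counts are now kept per thread (macro node), and inside a fired thread the instructions simply run sequentially. The thread bodies and counts are illustrative and are not taken from the slides.

```c
/* Hybrid evaluation sketch: data-driven synchronization *among* threads,
 * ordinary sequential (von Neumann) execution *within* each thread. */
#include <stdio.h>

enum { T_PRODUCE, T_CONSUME, NUM_THREADS };

typedef struct {
    int  count;                /* remaining sync signals before the thread may fire */
    void (*body)(void);        /* sequential code of the macro node                 */
    int  nsucc, succ[1];       /* threads to signal when this thread finishes       */
} Thread;

static int x, y, z;

static void produce(void) {    /* several instructions, no per-instruction sync */
    x = 1 + 3;
    y = 10 - 3;
}
static void consume(void) {
    z = x * y;
    printf("z = %d\n", z);     /* 28 */
}

static Thread threads[NUM_THREADS] = {
    [T_PRODUCE] = { .count = 0, .body = produce, .nsucc = 1, .succ = { T_CONSUME } },
    [T_CONSUME] = { .count = 1, .body = consume, .nsucc = 0 },
};

int main(void) {
    int enabled[NUM_THREADS], head = 0, tail = 0;
    for (int i = 0; i < NUM_THREADS; i++)
        if (threads[i].count == 0) enabled[tail++] = i;

    while (head < tail) {                       /* fire enabled threads               */
        Thread *t = &threads[enabled[head++]];
        t->body();                              /* run the whole thread sequentially  */
        for (int s = 0; s < t->nsucc; s++)      /* then signal dependent threads      */
            if (--threads[t->succ[s]].count == 0)
                enabled[tail++] = t->succ[s];
    }
    return 0;
}
```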

  39. What Do We Get? • A hybrid architecture model without sacrificing the advantage of fine-grain parallelism! (latency-hiding, pipelining support) CPEG421-2001-F-Topic-3-II

  40. A Realization of the Hybrid Evaluation [figure: instructions 1, 2, …, k of a thread carry a “von Neumann bit”; when it is set, the PIPU takes a shortcut directly to the next instruction in the thread instead of returning through the DISU’s fire/done signaling] CPEG421-2001-F-Topic-3-II
