Scalable Multiprocessors

Reading: Dubois/Annavaram/Stenström, Chapter 5.5-5.6 (COMA architectures could be a paper topic) and Chapter 6
• What is a scalable design? (7.1)
• Realizing programming models (7.2)
• Scalable communication architectures (SCAs)
  • Message-based SCAs (7.3-7.5)
  • Shared-memory-based SCAs (7.6)

PCOD: Scalable Parallelism (ICs) Per Stenström (c) 2008, Sally A. McKee (c) 2011
Scalability Goals (P is the number of processors)
• Bandwidth: scales linearly with P
• Latency: short and independent of P
• Cost: low fixed cost, then scales linearly with P

Example: a bus-based multiprocessor
• Bandwidth: constant
• Latency: short and constant
• Cost: high for the infrastructure, then linear
Organizational Issues

[Figure: dance-hall memory organization vs. distributed memory organization]

• Network composed of switches for performance and cost
• Many concurrent transactions allowed
• Distributed memory can bring down bandwidth demands

Bandwidth scaling:
• no global arbitration and ordering
• broadcast bandwidth is fixed and expensive
Scaling Issues

Latency scaling:
• T(n) = Overhead + Channel Time + Routing Delay
• Channel Time is a function of bandwidth (and of message size)
• Routing Delay is a function of the number of hops in the network

Cost scaling:
• Cost(p,m) = Fixed Cost + Incremental Cost(p,m)
• A design is cost-effective if speedup(p,m) > costup(p,m), where costup is cost relative to a single-processor system
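The two formulas above can be put into a small executable sketch. The function and parameter names below are mine, not the book's, and the units (microseconds, bytes) are only illustrative assumptions:

```python
def latency(n_bytes, overhead_us, bandwidth_bytes_per_us, hops, delay_per_hop_us):
    """T(n) = Overhead + Channel Time + Routing Delay."""
    channel_time = n_bytes / bandwidth_bytes_per_us  # channel time: message size over bandwidth
    routing_delay = hops * delay_per_hop_us          # proportional to hops through the network
    return overhead_us + channel_time + routing_delay

def cost_effective(speedup, fixed_cost, incremental_cost, single_node_cost):
    """A design is cost-effective if speedup(p,m) > costup(p,m),
    where costup = Cost(p,m) / Cost(1,m)."""
    costup = (fixed_cost + incremental_cost) / single_node_cost
    return speedup > costup
```

For example, a 1000-byte message over a 100 bytes/us channel with 10 us overhead and 4 hops of 0.5 us each gives T(n) = 22 us; a machine with speedup 8 whose cost is 4x the single-node cost is cost-effective.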
Physical Scaling
• Chip-, board-, and system-level partitioning has a big impact on scaling
• However, there is little consensus on the best partitioning
Network Transaction Primitives

Primitives to implement the programming model on a scalable machine
• One-way transfer between source and destination
• Resembles a bus transaction, but much richer in variety

Examples:
• A message-send transaction
• A write transaction in a SAS machine
Bus vs. Network Transactions (by design issue)
• Protection: V->P address translation (bus) vs. done at multiple points (network)
• Format: fixed vs. flexible
• Output buffering: simple vs. support for flexible formats
• Media arbitration: global vs. distributed
• Destination name & routing: direct vs. via several switches
• Input buffering: one source vs. several sources
• Action: response vs. rich diversity
• Completion detection: simple vs. response transaction
• Transaction ordering: global order vs. no global order
SAS Transactions

Issues:
• Fixed- or variable-size transfers
• Deadlock avoidance and handling full input buffers
Sequential Consistency

Issues:
• Writes need acknowledgments to signal completion
• SC may cause extreme waiting times
Message Passing

Multiple flavors of synchronization semantics
• Blocking versus non-blocking
  • Blocking send/recv returns when the operation completes
  • Non-blocking returns immediately (a probe function tests completion)
• Synchronous
  • Send completes after the matching receive has executed
  • Receive completes after the data transfer from the matching send completes
• Asynchronous ("buffered" in MPI terminology)
  • Send completes as soon as the send buffer may be reused
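The blocking/non-blocking distinction can be sketched with a toy in-process channel. This is a deliberately simplified model (no tags, matching rules, or real network), and the class and method names are my own invention, not MPI's:

```python
import queue
import threading

class Channel:
    """Toy channel illustrating blocking vs. non-blocking send semantics."""
    def __init__(self, capacity=2):
        self.buf = queue.Queue(maxsize=capacity)   # bounded "network" buffer

    def send_blocking(self, msg):
        self.buf.put(msg)          # returns only once the message is buffered

    def send_nonblocking(self, msg):
        done = threading.Event()   # handle the caller can probe for completion
        def worker():
            self.buf.put(msg)      # may stall here if the buffer is full...
            done.set()             # ...but the caller returned immediately
        threading.Thread(target=worker, daemon=True).start()
        return done                # probe with done.is_set(), wait with done.wait()

    def recv(self):
        return self.buf.get()
```

A caller can overlap computation with the transfer and later test the handle, which is the essential contract of a non-blocking send.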
Synchronous MP Protocol

Alternative: keep the match table at the sender, enabling a two-phase, receiver-initiated protocol
Asynchronous Optimistic MP Protocol

Issues:
• Copying overhead at the receiver, from a temporary buffer to user space
• Huge buffer space needed at the receiver to cope with the worst case
Asynchronous Robust MP Protocol

Note: after the handshake, the send and receive buffer addresses are known, so the data transfer can be performed with little overhead
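The robust (rendezvous) idea can be sketched as a three-phase exchange. This is a minimal in-memory model, not the book's protocol verbatim; `Node`, `robust_send`, and the tag-based matching are illustrative assumptions:

```python
class Node:
    """Toy endpoint: 'posted' maps a message tag to a user receive buffer."""
    def __init__(self):
        self.posted = {}   # tag -> user buffer (a list), filled in when recv is posted
        self.inbox = []    # small envelopes waiting for a matching receive

def robust_send(receiver, tag, data):
    """Three-phase rendezvous sketch:
    1. a request-to-send carries only the envelope, so no large temp buffer is needed;
    2. once the matching receive is posted, the destination buffer is known;
    3. the bulk data then moves directly into user space with little overhead."""
    if tag not in receiver.posted:
        receiver.inbox.append(("RTS", tag, len(data)))  # phase 1: envelope only
        return False                                    # no receive yet; sender retries later
    buf = receiver.posted[tag]                          # phase 2: buffer address known
    buf.extend(data)                                    # phase 3: direct transfer, no temp copy
    return True
```

Compared with the optimistic protocol above, the extra handshake round trip buys away both the receiver-side copy and the worst-case buffering.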
Active Messages
• User-level analog of network transactions
• Transfer a data packet and invoke a handler to extract it from the network and integrate it with the ongoing computation

[Figure: request/reply pattern, with a request handler at the destination and a reply handler at the source]
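A minimal sketch of the dispatch idea: each packet names the user-level handler that drains it from the "network" on arrival. The registry, decorator, and `deliver` function are my own scaffolding, not part of any particular active-message system:

```python
handlers = {}

def handler(name):
    """Register a function as the handler for packets carrying this name."""
    def register(fn):
        handlers[name] = fn
        return fn
    return register

result = []

@handler("accumulate")
def accumulate(payload):
    # Request handler: runs immediately on arrival, integrating the payload
    # into the ongoing computation instead of being scheduled as a task.
    result.append(payload)

def deliver(packet):
    name, payload = packet
    handlers[name](payload)   # dispatch on the handler named in the packet

deliver(("accumulate", 42))
```

The key property is that the handler runs promptly to empty the network, rather than leaving the packet buffered for a later, heavier-weight receive.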
Challenges Common to SAS and MP
• Input buffer overflow: how to signal that buffer space is exhausted
  Solutions:
  • ACK at the protocol level
  • back-pressure flow control
  • special ACK path, or drop packets (requires a time-out)
• Fetch deadlock (revisited): a request often generates a response, which can form dependence cycles in the network
  Solutions:
  • two logically independent request/response networks
  • NACK requests at the receiver to free space
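The back-pressure idea above is often realized as credit-based flow control; here is a hedged sketch (class and method names are mine) in which the sender holds credits equal to the receiver's free input-buffer slots:

```python
class CreditLink:
    """Back-pressure sketch: a send consumes one credit, and the receiver's
    acknowledgment returns it, so the input buffer can never overflow."""
    def __init__(self, buffer_slots=4):
        self.credits = buffer_slots   # one credit per free receiver buffer slot
        self.in_flight = []

    def try_send(self, msg):
        if self.credits == 0:
            return False              # buffer would overflow: stall at the source
        self.credits -= 1
        self.in_flight.append(msg)
        return True

    def receiver_ack(self):
        self.in_flight.pop(0)         # receiver drained a slot...
        self.credits += 1             # ...and the credit flows back to the sender
```

Because the sender stalls locally instead of injecting into a full network, no packets need to be dropped and no time-out machinery is required on this path.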
Spectrum of Designs
• None: physical bit stream — blind, physical DMA (nCUBE, iPSC, ...)
• User/System:
  • user-level port (CM-5, *T)
  • user-level handler (J-Machine, Monsoon, ...)
• Remote virtual address: processing, translation (Paragon, Meiko CS-2)
• Global physical address: processor + memory controller (RP3, BBN, T3D)
• Cache-to-cache: cache controller (Dash, KSR, Flash)

Down the list: increasing HW support, specialization, intrusiveness, and (perhaps) performance
MP Architectures

[Figure: node architecture — processor (P) and memory (M) attached to a scalable network through a communication assist (CA); input processing: checks, translation, buffering, action; output processing: checks, translation, formatting, scheduling]

Design tradeoff: how much processing happens in the CA vs. in P, and how much interpretation of the network transaction
• Physical DMA (7.3)
• User-level access (7.4)
• Dedicated message processing (7.5)
Physical DMA (Examples: nCUBE/2, IBM SP1)
• The node processor packages messages in user/system mode
• DMA is used to copy between the network and system buffers
• Problem: there is no way to distinguish user from system messages, so the node processor must be involved, which adds much overhead
User-Level Access (Example: CM-5)
• Network interface mapped into the user address space
• Communication assist does protection checks, translation, etc.
• No intervention by the kernel except for interrupts
Dedicated Message Processing

[Figure: each node holds a processor (P), a message processor (MP), memory, and a network interface (NI); user-level and system-level work communicate through shared memory]

The MP:
• interprets messages
• supports message operations
• off-loads P by presenting a clean message abstraction

Issues:
• P and MP communicate via shared memory, which generates coherence traffic
• The MP can become a bottleneck because of all the concurrent actions
Shared Physical Address Space

[Figure: each node's pseudo memory and pseudo processor connect its memory (M) and processor (P) to the scalable network]

• Remote reads/writes are performed by pseudo processors
• Cache coherence issues are treated in Chapter 8
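The pseudo-processor idea can be sketched as address interleaving across nodes: a load or store whose address falls outside local memory becomes a network transaction executed at the home node. The class below is a purely illustrative model (names, node count, and the high-bits-select-node mapping are my assumptions):

```python
class PseudoProcessor:
    """Sketch of a shared physical address space: the high-order address bits
    select the home node, and remote accesses are carried out there on the
    requester's behalf, as by the slide's pseudo memory / pseudo processor pair."""
    def __init__(self, node_memories, words_per_node):
        self.nodes = node_memories    # node id -> that node's local memory (a list)
        self.n = words_per_node

    def store(self, addr, value):
        node, offset = divmod(addr, self.n)   # split address into (home node, offset)
        self.nodes[node][offset] = value      # remote write executed at the home node

    def load(self, addr):
        node, offset = divmod(addr, self.n)
        return self.nodes[node][offset]       # remote read returns the home node's copy
```

The point of the sketch is that the programming model stays ordinary loads and stores; only the address decode decides whether a network transaction is generated.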