Scalable Multiprocessors

Reading: Dubois/Annavaram/Stenström, Chapter 5.5-5.6 (COMA architectures could be a paper topic) and Chapter 6
• What is a scalable design? (7.1)
• Realizing programming models (7.2)
• Scalable communication architectures (SCAs)
  • Message-based SCAs (7.3-7.5)
  • Shared-memory-based SCAs (7.6)

PCOD: Scalable Parallelism (ICs) Per Stenström (c) 2008, Sally A. McKee (c) 2011
Scalability Goals (P is the number of processors)
• Bandwidth: scales linearly with P
• Latency: short and independent of P
• Cost: low fixed cost, then scales linearly with P

Example: a bus-based multiprocessor
• Bandwidth: constant
• Latency: short and constant
• Cost: high for the infrastructure, then linear
Organizational Issues

[Figure: dance-hall memory organization vs. distributed memory organization]

• Network composed of switches for performance and cost
• Many concurrent transactions allowed
• Distributed memory can bring down bandwidth demands

Bandwidth scaling:
• no global arbitration and ordering
• broadcast bandwidth is fixed and expensive
Scaling Issues

Latency scaling:
• T(n) = Overhead + Channel Time + Routing Delay
• Channel Time is a function of bandwidth (and of message size)
• Routing Delay is a function of the number of hops in the network

Cost scaling:
• Cost(p,m) = Fixed Cost + Incremental Cost(p,m)
• A design is cost-effective if speedup(p,m) > costup(p,m), where costup is cost relative to a single-processor system
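The two formulas above can be put into a small executable sketch. The function and parameter names below are mine, not the book's, and the units (microseconds, bytes) are only illustrative assumptions:

```python
def latency(n_bytes, overhead_us, bandwidth_bytes_per_us, hops, delay_per_hop_us):
    """T(n) = Overhead + Channel Time + Routing Delay."""
    channel_time = n_bytes / bandwidth_bytes_per_us  # channel time: message size over bandwidth
    routing_delay = hops * delay_per_hop_us          # proportional to hops through the network
    return overhead_us + channel_time + routing_delay

def cost_effective(speedup, fixed_cost, incremental_cost, single_node_cost):
    """A design is cost-effective if speedup(p,m) > costup(p,m),
    where costup = Cost(p,m) / Cost(1,m)."""
    costup = (fixed_cost + incremental_cost) / single_node_cost
    return speedup > costup
```

For example, a 1000-byte message over a 100 bytes/us channel with 10 us overhead and 4 hops of 0.5 us each gives T(n) = 22 us; a machine with speedup 8 whose cost is 4x the single-node cost is cost-effective.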
Physical Scaling
• Chip-, board-, and system-level partitioning has a big impact on scaling
• However, there is little consensus on the best partitioning
Network Transaction Primitives

Primitives to implement the programming model on a scalable machine
• One-way transfer between source and destination
• Resembles a bus transaction, but much richer in variety

Examples:
• A message-send transaction
• A write transaction in a SAS machine
Bus vs. Network Transactions (by design issue)
• Protection: V->P address translation (bus) vs. done at multiple points (network)
• Format: fixed vs. flexible
• Output buffering: simple vs. support for flexible formats
• Media arbitration: global vs. distributed
• Destination name & routing: direct vs. via several switches
• Input buffering: one source vs. several sources
• Action: response vs. rich diversity
• Completion detection: simple vs. response transaction
• Transaction ordering: global order vs. no global order
SAS Transactions

Issues:
• Fixed- or variable-size transfers
• Deadlock avoidance and handling full input buffers
Sequential Consistency

Issues:
• Writes need acknowledgments to signal completion
• SC may cause extreme waiting times
Message Passing

Multiple flavors of synchronization semantics
• Blocking versus non-blocking
  • Blocking send/recv returns when the operation completes
  • Non-blocking returns immediately (a probe function tests completion)
• Synchronous
  • Send completes after the matching receive has executed
  • Receive completes after the data transfer from the matching send completes
• Asynchronous ("buffered" in MPI terminology)
  • Send completes as soon as the send buffer may be reused
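The blocking/non-blocking distinction can be sketched with a toy in-process channel. This is a deliberately simplified model (no tags, matching rules, or real network), and the class and method names are my own invention, not MPI's:

```python
import queue
import threading

class Channel:
    """Toy channel illustrating blocking vs. non-blocking send semantics."""
    def __init__(self, capacity=2):
        self.buf = queue.Queue(maxsize=capacity)   # bounded "network" buffer

    def send_blocking(self, msg):
        self.buf.put(msg)          # returns only once the message is buffered

    def send_nonblocking(self, msg):
        done = threading.Event()   # handle the caller can probe for completion
        def worker():
            self.buf.put(msg)      # may stall here if the buffer is full...
            done.set()             # ...but the caller returned immediately
        threading.Thread(target=worker, daemon=True).start()
        return done                # probe with done.is_set(), wait with done.wait()

    def recv(self):
        return self.buf.get()
```

A caller can overlap computation with the transfer and later test the handle, which is the essential contract of a non-blocking send.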
Synchronous MP Protocol

Alternative: keep the match table at the sender, enabling a two-phase, receiver-initiated protocol
Asynchronous Optimistic MP Protocol

Issues:
• Copying overhead at the receiver, from a temporary buffer to user space
• Huge buffer space needed at the receiver to cope with the worst case
Asynchronous Robust MP Protocol

Note: after the handshake, the send and receive buffer addresses are known, so the data transfer can be performed with little overhead
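The robust (rendezvous) idea can be sketched as a three-phase exchange. This is a minimal in-memory model, not the book's protocol verbatim; `Node`, `robust_send`, and the tag-based matching are illustrative assumptions:

```python
class Node:
    """Toy endpoint: 'posted' maps a message tag to a user receive buffer."""
    def __init__(self):
        self.posted = {}   # tag -> user buffer (a list), filled in when recv is posted
        self.inbox = []    # small envelopes waiting for a matching receive

def robust_send(receiver, tag, data):
    """Three-phase rendezvous sketch:
    1. a request-to-send carries only the envelope, so no large temp buffer is needed;
    2. once the matching receive is posted, the destination buffer is known;
    3. the bulk data then moves directly into user space with little overhead."""
    if tag not in receiver.posted:
        receiver.inbox.append(("RTS", tag, len(data)))  # phase 1: envelope only
        return False                                    # no receive yet; sender retries later
    buf = receiver.posted[tag]                          # phase 2: buffer address known
    buf.extend(data)                                    # phase 3: direct transfer, no temp copy
    return True
```

Compared with the optimistic protocol above, the extra handshake round trip buys away both the receiver-side copy and the worst-case buffering.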
Active Messages
• User-level analog of network transactions
• Transfer a data packet and invoke a handler to extract it from the network and integrate it with the ongoing computation

[Figure: request/reply pattern, with a request handler at the destination and a reply handler at the source]
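A minimal sketch of the dispatch idea: each packet names the user-level handler that drains it from the "network" on arrival. The registry, decorator, and `deliver` function are my own scaffolding, not part of any particular active-message system:

```python
handlers = {}

def handler(name):
    """Register a function as the handler for packets carrying this name."""
    def register(fn):
        handlers[name] = fn
        return fn
    return register

result = []

@handler("accumulate")
def accumulate(payload):
    # Request handler: runs immediately on arrival, integrating the payload
    # into the ongoing computation instead of being scheduled as a task.
    result.append(payload)

def deliver(packet):
    name, payload = packet
    handlers[name](payload)   # dispatch on the handler named in the packet

deliver(("accumulate", 42))
```

The key property is that the handler runs promptly to empty the network, rather than leaving the packet buffered for a later, heavier-weight receive.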
Challenges Common to SAS and MP
• Input buffer overflow: how to signal that buffer space is exhausted
  Solutions:
  • ACK at the protocol level
  • back-pressure flow control
  • special ACK path, or drop packets (requires a time-out)
• Fetch deadlock (revisited): a request often generates a response, which can form dependence cycles in the network
  Solutions:
  • two logically independent request/response networks
  • NACK requests at the receiver to free space
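The back-pressure idea above is often realized as credit-based flow control; here is a hedged sketch (class and method names are mine) in which the sender holds credits equal to the receiver's free input-buffer slots:

```python
class CreditLink:
    """Back-pressure sketch: a send consumes one credit, and the receiver's
    acknowledgment returns it, so the input buffer can never overflow."""
    def __init__(self, buffer_slots=4):
        self.credits = buffer_slots   # one credit per free receiver buffer slot
        self.in_flight = []

    def try_send(self, msg):
        if self.credits == 0:
            return False              # buffer would overflow: stall at the source
        self.credits -= 1
        self.in_flight.append(msg)
        return True

    def receiver_ack(self):
        self.in_flight.pop(0)         # receiver drained a slot...
        self.credits += 1             # ...and the credit flows back to the sender
```

Because the sender stalls locally instead of injecting into a full network, no packets need to be dropped and no time-out machinery is required on this path.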
Spectrum of Designs
• None: physical bit stream — blind, physical DMA (nCUBE, iPSC, ...)
• User/System:
  • user-level port (CM-5, *T)
  • user-level handler (J-Machine, Monsoon, ...)
• Remote virtual address: processing, translation (Paragon, Meiko CS-2)
• Global physical address: processor + memory controller (RP3, BBN, T3D)
• Cache-to-cache: cache controller (Dash, KSR, Flash)

Down the list: increasing HW support, specialization, intrusiveness, and (perhaps) performance
MP Architectures

[Figure: node architecture — processor (P) and memory (M) attached to a scalable network through a communication assist (CA); input processing: checks, translation, buffering, action; output processing: checks, translation, formatting, scheduling]

Design tradeoff: how much processing happens in the CA vs. in P, and how much interpretation of the network transaction
• Physical DMA (7.3)
• User-level access (7.4)
• Dedicated message processing (7.5)
Physical DMA (Examples: nCUBE/2, IBM SP1)
• The node processor packages messages in user/system mode
• DMA is used to copy between the network and system buffers
• Problem: there is no way to distinguish user from system messages, so the node processor must be involved, which adds much overhead
User-Level Access (Example: CM-5)
• Network interface mapped into the user address space
• Communication assist does protection checks, translation, etc.
• No intervention by the kernel except for interrupts
Dedicated Message Processing

[Figure: each node holds a processor (P), a message processor (MP), memory, and a network interface (NI); user-level and system-level work communicate through shared memory]

The MP:
• interprets messages
• supports message operations
• off-loads P by presenting a clean message abstraction

Issues:
• P and MP communicate via shared memory, which generates coherence traffic
• The MP can become a bottleneck because of all the concurrent actions
Shared Physical Address Space

[Figure: each node's pseudo memory and pseudo processor connect its memory (M) and processor (P) to the scalable network]

• Remote reads/writes are performed by pseudo processors
• Cache coherence issues are treated in Chapter 8
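The pseudo-processor idea can be sketched as address interleaving across nodes: a load or store whose address falls outside local memory becomes a network transaction executed at the home node. The class below is a purely illustrative model (names, node count, and the high-bits-select-node mapping are my assumptions):

```python
class PseudoProcessor:
    """Sketch of a shared physical address space: the high-order address bits
    select the home node, and remote accesses are carried out there on the
    requester's behalf, as by the slide's pseudo memory / pseudo processor pair."""
    def __init__(self, node_memories, words_per_node):
        self.nodes = node_memories    # node id -> that node's local memory (a list)
        self.n = words_per_node

    def store(self, addr, value):
        node, offset = divmod(addr, self.n)   # split address into (home node, offset)
        self.nodes[node][offset] = value      # remote write executed at the home node

    def load(self, addr):
        node, offset = divmod(addr, self.n)
        return self.nodes[node][offset]       # remote read returns the home node's copy
```

The point of the sketch is that the programming model stays ordinary loads and stores; only the address decode decides whether a network transaction is generated.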