
Scalable Multiprocessors




Presentation Transcript


1. Scalable Multiprocessors
Read Dubois/Annavaram/Stenström Chapter 5.5–5.6 (COMA architectures could be a paper topic) and Chapter 6.
• What is a scalable design? (7.1)
• Realizing programming models (7.2)
• Scalable communication architectures (SCAs)
  • Message-based SCAs (7.3–7.5)
  • Shared-memory-based SCAs (7.6)

2. Scalability Goals (P = number of processors)
• Bandwidth: scales linearly with P
• Latency: short and independent of P
• Cost: low fixed cost, then scales linearly with P
Example: a bus-based multiprocessor
• Bandwidth: constant
• Latency: short and constant
• Cost: high for the infrastructure, then linear

3. Organizational Issues
(Figures: dance-hall memory organization vs. distributed memory organization)
• Network composed of switches for performance and cost
• Many concurrent transactions allowed
• Distributed memory can bring down bandwidth demands
Bandwidth scaling:
• no global arbitration and ordering
• broadcast bandwidth is fixed and expensive

4. Scaling Issues
Latency scaling:
• T(n) = Overhead + Channel Time + Routing Delay
• Channel Time is a function of bandwidth
• Routing Delay is a function of the number of hops in the network
Cost scaling:
• Cost(p,m) = Fixed Cost + Incremental Cost(p,m)
• a design is cost-effective if speedup(p,m) > costup(p,m)
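To make the two models concrete, here is a small C sketch that plugs numbers into T(n) and the speedup-vs-costup test. Every constant (overhead, bandwidth, per-hop delay, costs, the near-linear 0.9·p speedup) is a made-up assumption for illustration, not a measurement:

```c
#include <stdio.h>

/* Illustrative latency model: T(n) = overhead + channel time + routing delay,
 * with channel time = n / bandwidth and routing delay = hops * per-hop delay.
 * All constants are example values, not measured data. */
static double latency_us(double n_bytes, double hops) {
    const double overhead_us     = 1.0;    /* fixed software/assist overhead */
    const double bw_bytes_per_us = 100.0;  /* channel bandwidth */
    const double per_hop_us      = 0.05;   /* switch delay per hop */
    return overhead_us + n_bytes / bw_bytes_per_us + hops * per_hop_us;
}

/* Cost model: Cost(p) = fixed cost + incremental cost * p.
 * The design is cost-effective while speedup(p) > costup(p). */
int main(void) {
    const double fixed_cost = 50.0, cost_per_node = 10.0;
    double cost1 = fixed_cost + cost_per_node;          /* 1-node cost */
    for (int p = 2; p <= 64; p *= 2) {
        double costup  = (fixed_cost + cost_per_node * p) / cost1;
        double speedup = 0.9 * p;   /* assumed near-linear speedup */
        printf("p=%2d costup=%5.2f speedup=%5.2f %s\n", p, costup, speedup,
               speedup > costup ? "cost-effective" : "not cost-effective");
    }
    printf("T(1 KB, 4 hops) = %.2f us\n", latency_us(1024.0, 4.0));
    return 0;
}
```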

5. Physical Scaling
• Chip-, board-, and system-level partitioning has a big impact on scaling
• However, there is little consensus on the best partitioning

6. Network Transaction Primitives
Primitives to implement the programming model on a scalable machine:
• one-way transfer between source and destination
• resembles a bus transaction, but much richer in variety
Examples:
• a message send transaction
• a write transaction in a SAS (shared address space) machine
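As a rough illustration of what such a primitive carries, here is a hypothetical C packet layout. Every field name and width is an assumption for illustration, not any real machine's wire format:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of a network transaction: a one-way transfer whose header tells
 * the destination how to interpret and act on the payload. */
typedef enum {
    TXN_MSG_SEND,       /* message-passing send */
    TXN_SAS_WRITE,      /* remote write in a shared-address-space machine */
    TXN_SAS_READ_REQ,   /* remote read request */
    TXN_SAS_READ_REPLY  /* response carrying the data back */
} txn_type_t;

typedef struct {
    uint16_t   dest_node;    /* destination naming & routing */
    uint16_t   src_node;     /* needed for replies and ordering */
    txn_type_t type;         /* tells the destination what action to take */
    uint32_t   len;          /* payload length (fixed or variable) */
    uint64_t   addr_or_tag;  /* SAS: remote address; MP: match tag */
    uint8_t    payload[];    /* data follows the header */
} net_txn_t;

int main(void) {
    printf("header size: %zu bytes\n", sizeof(net_txn_t));
    return 0;
}
```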

7. Bus vs. Network Transactions
Design issue: bus transactions vs. network transactions
• Protection: V→P address translation vs. done at multiple points
• Format: fixed vs. flexible
• Output buffering: simple vs. must support flexible formats
• Media arbitration: global vs. distributed
• Destination name & routing: direct vs. via several switches
• Input buffering: one source vs. several sources
• Action: response vs. rich diversity
• Completion detection: simple vs. response transaction
• Transaction ordering: global order vs. no global order

8. SAS Transactions
Issues:
• fixed or variable-size transfers
• deadlock avoidance when input buffers fill

9. Sequential Consistency
Issues:
• writes need acknowledgments to signal completion
• SC may cause extreme waiting times
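As a rough illustration of the waiting problem, here is a self-contained C sketch in which every remote write must be acknowledged before the next access may issue; `remote_write` and `wait_for_acks` are invented stand-ins for the communication assist, not a real API:

```c
#include <stdio.h>

/* Under SC on a network-based machine, the processor may not issue its
 * next memory operation until the previous write is globally complete. */
static int acks_pending = 0;

static void remote_write(int node, unsigned long addr, long val) {
    printf("write node=%d addr=%#lx val=%ld\n", node, addr, val);
    acks_pending++;                 /* transaction now in flight */
}

static void wait_for_acks(void) {  /* stall until all writes complete */
    while (acks_pending > 0) {
        /* in hardware: wait for ack transactions from the destinations */
        acks_pending--;             /* stub: pretend an ack arrived */
    }
}

int main(void) {
    remote_write(1, 0x1000, 42);   /* data */
    wait_for_acks();               /* SC: full round trip before next op */
    remote_write(1, 0x1008, 1);    /* flag: now guaranteed ordered */
    wait_for_acks();
    return 0;
}
```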

10. Message Passing
Multiple flavors of synchronization semantics (each shown in the MPI sketch below):
• Blocking versus non-blocking
  • blocking send/recv returns when the operation completes
  • non-blocking returns immediately (a probe function tests completion)
• Synchronous
  • send completes after the matching receive has executed
  • receive completes after the data transfer from the matching send completes
• Asynchronous ("buffered" in MPI terminology)
  • send completes as soon as the send buffer may be reused
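These flavors map directly onto real MPI send modes. A minimal two-rank C program, assuming an MPI installation (run with something like `mpirun -np 2 ./a.out`):

```c
#include <mpi.h>
#include <stdio.h>

/* Rank 0 sends one int to rank 1 in each style; rank 1 receives them. */
int main(int argc, char **argv) {
    int rank, x = 42, y;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* synchronous: returns only after the matching receive has started */
        MPI_Ssend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        /* standard blocking: returns once the send buffer is reusable; the
         * library may buffer it (an explicitly buffered MPI_Bsend also
         * exists, but requires MPI_Buffer_attach) */
        MPI_Send(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
        /* non-blocking: returns immediately... */
        MPI_Isend(&x, 1, MPI_INT, 1, 2, MPI_COMM_WORLD, &req);
        /* ...completion is tested (MPI_Test, the "probe function"
         * analogue) or forced with MPI_Wait */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        for (int tag = 0; tag < 3; tag++) {
            /* blocking receive: returns after the data has arrived */
            MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("tag %d: got %d\n", tag, y);
        }
    }
    MPI_Finalize();
    return 0;
}
```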

11. Synchronous MP Protocol
Alternative: keep the match table at the sender, enabling a two-phase receive-initiated protocol.

12. Asynchronous Optimistic MP Protocol
Issues:
• copying overhead at the receiver, from temp buffer to user space
• huge buffer space needed at the receiver to cope with the worst case

13. Asynchronous Robust MP Protocol
Note: after the handshake, both the send and receive buffer addresses are known, so the data transfer can be performed with little overhead (see the sketch below).
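A single-process C simulation of the three phases; the "network" here is just local memory and all names are illustrative, but it shows why the final transfer needs no temporary buffer or extra copy:

```c
#include <stdio.h>
#include <string.h>

static char recv_user_buf[64];          /* receiver's posted user buffer */

/* Phase 1: sender announces the message (envelope only, no data).
 * Phase 2: receiver matches the envelope against its posted receive and
 *          replies with the address of the user buffer.
 * Phase 3: sender moves data straight into that buffer. */
int main(void) {
    const char *send_buf = "hello";     /* sender's user data */

    /* phase 1: request-to-send carries only the envelope (tag, size) */
    int tag = 7, len = (int)strlen(send_buf) + 1;
    printf("sender  : REQ_TO_SEND tag=%d len=%d\n", tag, len);

    /* phase 2: receiver has a matching recv posted; reply with buffer */
    char *dst = recv_user_buf;
    printf("receiver: READY_TO_RECV buf=%p\n", (void *)dst);

    /* phase 3: both addresses known -> direct, low-overhead transfer,
     * no temp buffer at the receiver and no extra copy */
    memcpy(dst, send_buf, (size_t)len);
    printf("receiver: got \"%s\"\n", recv_user_buf);
    return 0;
}
```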

14. Active Messages
• user-level analog of network transactions
• transfer a data packet and invoke a handler to extract it from the network and integrate it with the on-going computation
(Figure: request handler at the destination, reply handler back at the source)
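A toy, self-contained C version of the dispatch idea: the packet names its handler (here, an index into a handler table), and delivery simply runs that handler on the payload. The handler table, packet layout, and `deliver` stub are all assumptions for illustration:

```c
#include <stdio.h>

typedef void (*am_handler_t)(int src, const int *data, int n);

static long sum;  /* the "on-going computation" the handler updates */

static void add_handler(int src, const int *data, int n) {
    for (int i = 0; i < n; i++) sum += data[i];  /* integrate the data */
    printf("request from node %d handled, sum=%ld\n", src, sum);
}

static am_handler_t handlers[] = { add_handler };  /* index 0 */

typedef struct {
    int handler_ix;   /* which handler to invoke on arrival */
    int src, n;
    int data[4];
} am_packet_t;

/* Delivery stub: on a real machine the network interface hands the packet
 * to the user-level handler directly, with no kernel involvement. */
static void deliver(const am_packet_t *p) {
    handlers[p->handler_ix](p->src, p->data, p->n);
}

int main(void) {
    am_packet_t p = { .handler_ix = 0, .src = 3, .n = 4,
                      .data = {1, 2, 3, 4} };
    deliver(&p);   /* a reply handler would run analogously at the sender */
    return 0;
}
```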

15. Challenges Common to SAS and MP
• Input buffer overflow: how to signal that buffer space is exhausted
  Solutions:
  • ACK at the protocol level
  • back-pressure flow control
  • special ACK path, or drop packets (requires time-out)
• Fetch deadlock (revisited): a request often generates a response, which can form dependence cycles in the network
  Solutions:
  • two logically independent request/response networks
  • NACK requests at the receiver to free space
(Both remedies are sketched below.)
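Both remedies fit in a few lines of C: responses get their own (logically independent) queue, and a request is NACKed when that queue cannot absorb its response. Queue size and message encoding are illustrative assumptions:

```c
#include <stdio.h>

#define QSIZE 4

typedef struct { int buf[QSIZE], head, tail, count; } queue_t;

static int  q_full(queue_t *q) { return q->count == QSIZE; }
static void q_push(queue_t *q, int v) {
    q->buf[q->tail] = v; q->tail = (q->tail + 1) % QSIZE; q->count++;
}
static int q_pop(queue_t *q) {
    int v = q->buf[q->head]; q->head = (q->head + 1) % QSIZE; q->count--;
    return v;
}

static queue_t resp_q;   /* responses travel separately from requests */

/* Rule: a request may generate a response; a response generates nothing,
 * so the response queue can always drain and no cycle forms. */
static void handle_request(int req) {
    if (q_full(&resp_q)) {           /* no room for the response: */
        printf("NACK req %d (sender must retry)\n", req);
        return;                      /* NACK instead of blocking input */
    }
    q_push(&resp_q, req + 100);      /* response for this request */
}

int main(void) {
    for (int i = 0; i < 6; i++)
        handle_request(i);           /* the 5th and 6th get NACKed */
    while (resp_q.count)
        printf("response %d delivered\n", q_pop(&resp_q));
    return 0;
}
```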

16. Spectrum of Designs
• None: physical bit stream
  • blind, physical DMA (nCUBE, iPSC, …)
• User/System
  • user-level port (CM-5, *T)
  • user-level handler (J-Machine, Monsoon, …)
• Remote virtual address
  • processing, translation (Paragon, Meiko CS-2)
• Global physical address
  • processor + memory controller (RP3, BBN, T3D)
• Cache-to-cache
  • cache controller (Dash, KSR, Flash)
Down the list: increasing HW support, specialization, intrusiveness, and (?) performance

17. MP Architectures
(Figure: node architecture — each node's processor (P) and memory (M) attach to the scalable network through a communication assist (CA). Input processing: checks, translation, buffering, action. Output processing: checks, translation, formatting, scheduling.)
Design tradeoff: how much processing in the CA vs. in P, and how much interpretation of the network transaction?
• Physical DMA (7.3)
• User-level access (7.4)
• Dedicated message processing (7.5)

18. Physical DMA
Example: nCUBE/2, IBM SP1
• Node processor packages messages in user/system mode
• DMA is used to copy between the network and system buffers
• Problem: no way to distinguish user from system messages, so the node processor must be involved in every transfer, which adds much overhead

19. User-Level Access
Example: CM-5
• Network interface mapped into the user address space
• Communication assist does protection checks, translation, etc.
• No kernel intervention except for interrupts

20. Dedicated Message Processing
(Figure: each node pairs a compute processor (P) with a message processor (MP), sharing memory (M) and a network interface (NI); user and system communication are kept separate.)
The MP:
• interprets messages
• supports message operations
• off-loads P by providing a clean message abstraction
Issues:
• P and MP communicate via shared memory: coherence traffic
• the MP can become a bottleneck due to all the concurrent actions

21. Shared Physical Address Space
(Figure: each node's pseudo-memory and pseudo-processor modules connect M and P to the scalable network.)
• Remote reads/writes are performed by pseudo processors
• Cache coherence issues are treated in Chapter 8
