
M-Machine and Grids Parallel Computer Architectures



  1. M-Machine and Grids Parallel Computer Architectures Navendu Jain

  2. Readings • The M-Machine multicomputer, Fillo et al., MICRO 1995 • Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor, Keckler et al., MICRO 1998 • A design space evaluation of grid processor architectures, Nagarajan et al., MICRO 2001

  3. Outline • The M-Machine Multicomputer • Thread Level Parallelism on M-Machine • Grid Processor Architectures • Review and Discussion

  4. The M-Machine Multicomputer

  5. Design Motivation • Achieve higher throughput from memory resources • Increase chip area devoted to processing • Arithmetic-to-bandwidth ratio of 12 operations/word • Minimize global communication (favor local synchronization) • Faster execution of fixed-size problems • Easier programmability of parallel computers • Incremental approach

  6. Architecture • A bidirectional 3-D mesh network of multithreaded processing nodes • Each chip comprises a multi-ALU processor (MAP) and 128 KB of on-chip synchronous DRAM • A user-accessible message-passing system (SEND) • Single global virtual address space • Target clock 100 MHz (control logic 40 MHz)

  7. Multi-ALU Processor (MAP) • A MAP chip comprises: • Three 64-bit 3-issue clusters • 2-way interleaved on-chip cache • A memory switch • A cluster switch • External memory interface • On-chip network interfaces and routers

  8. A MAP Cluster • 64-bit, three-issue pipelined processor • 2 integer ALUs • 1 floating-point ALU • Register files • 4 KB instruction cache • A MAP instruction carries 1, 2, or 3 operations (see the sketch below)
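A rough C sketch of the 1-to-3-operation instruction format is below; the field names and widths are illustrative assumptions for exposition, not the actual MAP encoding:

    /* Illustrative model of a 3-wide MAP instruction word.
       Field names and layout are assumed, not the real encoding. */
    #include <stdint.h>

    typedef struct {
        uint8_t opcode;  /* operation selector */
        uint8_t dst;     /* destination register */
        uint8_t src1;    /* first source register */
        uint8_t src2;    /* second source register */
    } map_op;

    typedef struct {
        map_op  slot[3]; /* int ALU 0, int ALU 1, FP ALU */
        uint8_t nops;    /* 1, 2, or 3 operations issue together */
    } map_instr;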

  9. MAP Chip Die (18 mm per side, 5M transistors)

  10. Exploiting Parallelism on M-Machine

  11. Threads • Exploit ILP both within and across the clusters • Horizontal Threads (H-Threads) • Instruction-level parallelism • Each executes on a single MAP cluster • 3-wide instruction stream • Communication/synchronization through messages, registers, and memory • Up to 6 H-Threads can be interleaved dynamically on a cycle-by-cycle basis

  12. Threads (contd.) • Vertical Threads (V-Threads) • Thread-level parallelism (a standard process) • Contains up to 4 H-Threads (one per cluster; see the sketch below) • Flexibility of scheduling (compiler/run-time) • Communication/synchronization through registers • At most 6 resident V-Threads • 4 user slots, 1 event slot, 1 exception slot
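As a software analogy (not the M-Machine's actual interface), a V-Thread behaves like a process whose work is split across up to four cluster-resident H-Threads. The POSIX-threads sketch below models that split; pthread_create stands in for the hardware's hfork, and all names here are hypothetical:

    /* Analogy only: a V-Thread as a container spawning up to
       4 cluster-local H-Threads. pthreads stand in for the
       hardware's fast hfork; names are hypothetical. */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_CLUSTERS 4   /* one H-Thread per MAP cluster */

    static void *h_thread_body(void *arg) {
        long cluster = (long)arg;
        /* each H-Thread runs a 3-wide instruction stream on its cluster */
        printf("H-Thread on cluster %ld\n", cluster);
        return NULL;
    }

    int main(void) {                      /* plays the role of the V-Thread */
        pthread_t h[NUM_CLUSTERS];
        for (long c = 0; c < NUM_CLUSTERS; c++)
            pthread_create(&h[c], NULL, h_thread_body, (void *)c);
        for (long c = 0; c < NUM_CLUSTERS; c++)
            pthread_join(h[c], NULL);     /* software join; the MAP syncs via registers */
        return 0;
    }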

  13. Concurrency Model: Three Levels of Parallelism • Instruction-level parallelism (~1 instruction) • VLIW, superscalar processors • Issues: control flow, data dependences, scalability • Thread-level parallelism (~1000 instructions) • Chip multiprocessors • Issues: limited coarse-grain TLP; individual cores non-optimal • Fine-grain parallelism (~50–1000 instructions)

  14. Mapping Program ↔ Architecture Granularity

  15. Fine-grain Overheads • Thread creation (11 cycles – hfork) • Communication • Register-to-register reads/writes • Message passing / on-chip cache • Synchronization • Blocking on a register (full/empty bit; emulated in the sketch below) • Barrier (cbar instruction) • Memory (sync bit)
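The full/empty-bit discipline can be emulated in software to make the idea concrete. In the C11 sketch below an atomic flag stands in for the hardware bit; where the MAP would block the consuming H-Thread for free, the emulation spins. All names are illustrative:

    /* Software emulation of full/empty-bit synchronization.
       Hardware stalls a reader of an empty register; here an
       atomic flag plus spinning stands in for that. */
    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct {
        _Atomic int full;   /* 0 = empty, 1 = full */
        int64_t     value;
    } fe_reg;

    static void fe_write(fe_reg *r, int64_t v) {
        r->value = v;
        atomic_store_explicit(&r->full, 1, memory_order_release);
    }

    static int64_t fe_read(fe_reg *r) {
        while (!atomic_load_explicit(&r->full, memory_order_acquire))
            ;                               /* hardware would stall, not spin */
        atomic_store_explicit(&r->full, 0, memory_order_relaxed); /* consume: mark empty */
        return r->value;
    }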

  16. Grid Processor Architecture

  17. Design Motivation • Continued scaling of the clock rate • Scalability of the processing core • Higher ILP and instruction throughput (IPC) • Mitigate global wire-delay overheads • Closer coupling of architecture and compiler

  18. Architecture • An interconnected 2-D network of ALU arrays • Each node has an instruction buffer (IB) and an execution unit • A single control thread maps instructions to nodes • Block-atomic execution model (see the sketch below) • Blocks of statically scheduled instructions are mapped to the grid • Dynamic execution in data-flow order • Temporary values forwarded directly to consumer ALUs • Critical path scheduled along the shortest physical path
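A toy interpreter makes the block-atomic, data-flow-order execution concrete: each instruction in a mapped block fires once all of its operands have arrived, then forwards its result point-to-point to its consumer. Every structure below is an illustrative assumption (single consumer per instruction), not the GPA microarchitecture:

    /* Toy data-flow firing loop for one mapped block. */
    #include <stdbool.h>
    #include <stdio.h>

    #define BLOCK 4

    typedef struct {
        int  needed;    /* operands this instruction waits for */
        int  arrived;   /* operands received so far */
        int  consumer;  /* index of consuming instruction, -1 = none */
        bool done;
    } node;

    int main(void) {
        /* a tiny dependence chain: i0 and i1 feed i2, i2 feeds i3 */
        node n[BLOCK] = {
            {0, 0, 2, false}, {0, 0, 2, false},
            {2, 0, 3, false}, {1, 0, -1, false},
        };
        bool progress = true;
        while (progress) {                 /* fire in data-flow order */
            progress = false;
            for (int i = 0; i < BLOCK; i++) {
                if (!n[i].done && n[i].arrived == n[i].needed) {
                    n[i].done = true;      /* execute the operation */
                    printf("fired i%d\n", i);
                    if (n[i].consumer >= 0)
                        n[n[i].consumer].arrived++;  /* forward result */
                    progress = true;
                }
            }
        }
        return 0;
    }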

  19. GPA Architecture

  20. Example: Block-Atomic Mapping

  21. Implementation • Instruction fetch and map • Predicated hyper-blocks, move instructions • Execution and control logic • Operand routing – max 3 destinations, split instructions • Hyper-block control • Predication (execute-all approach), cmove instructions (see the sketch below) • Block commit • Block stitching
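The execute-all predication approach has a direct C analogue: both arms of a branch are computed unconditionally and a conditional move selects the committed value. This is a generic illustration, not GPA ISA syntax:

    /* Execute-all predication: compute both arms, then select
       with a conditional move (the ternary typically compiles
       to a cmov-style select). Generic illustration only. */
    #include <stdio.h>

    int hyperblock(int a, int b, int p) {
        int t = a + b;        /* "then" arm, executed unconditionally */
        int f = a - b;        /* "else" arm, executed unconditionally */
        return p ? t : f;     /* cmove selects the committed value */
    }

    int main(void) {
        printf("%d %d\n", hyperblock(5, 3, 1), hyperblock(5, 3, 0));
        return 0;
    }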

  22. Review and Discussion

  23. Key Ideas: Convergence • Microprocessor as a number of superscalar processors communicating/synchronizing via registers – low overheads • Exploiting granularities from ILP to TLP • Dependences mapped to a grid of ALUs • Replication reduces design/verification effort • Point-to-point communication • Exposing architecture partitioning and the flow of operations to the compiler • Avoid wire and routing delays and memory-wall problems

  24. Ideas: Divergence
  M-Machine • On-chip cache and register-based mechanisms [delays] • Broadcasting and point-to-point communication
  GPA • Register set → grid chaining [scalability] • Point-to-point communication
  TERA • Fine-grain threads – memory communication/synchronization (full/empty bits) • No support for single-threaded code

  25. Drawbacks (Unresolved Issues)
  M-Machine • Scalability • Clock speeds • Memory synchronization (use of hfork)
  Grid Processor Arch. • Data caches far from ALUs • Delays incurred between dependent operations due to network routers and wires • Complex frame management and block stitching • Explicit compiler dependence

  26. Challenges/Future Directions • Architectural support to extract TLP • Parallelizing compiler technology • How many cores/threads? • Number of threads vs. memory latency and wire delays [Flynn] • Inter-thread communication • Grid height == 8 (IPC 5–6) [GPA, Peter] • Optimization = f(communication, delays, memory costs)

  27. Challenges (contd.) • On-the-fly data-dependence detection (RAW/WAR) • TLP/ILP balance – M-Machine multicomputer

  28. Thanks
