M-Machine and Grids: Parallel Computer Architectures. Navendu Jain
Readings • The M-Machine multicomputer. Fillo et al., MICRO 1995 • Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor. Keckler et al., MICRO 1998 • A design space evaluation of grid processor architectures. Nagarajan et al., MICRO 2001
Outline • The M-Machine Multicomputer • Thread Level Parallelism on M-Machine • Grid Processor Architectures • Review and Discussion
Design Motivation • Achieve higher throughput from memory resources • Increase the chip area devoted to processors • Arithmetic-to-bandwidth ratio of 12 operations/word • Minimize global communication (local synchronization) • Faster execution of fixed-size problems • Easier programmability of parallel computers • Incremental approach
Architecture • A bidirectional 3-D mesh network of multithreaded processing nodes • Each chip comprises a multi-ALU processor (MAP) and 128KB of on-chip synchronous DRAM • A user-accessible message-passing system (SEND) • Single global virtual address space • Target clock 100 MHz (control logic 40 MHz)
Multi-ALU Processor (MAP) • A MAP chip comprises: • Three 64-bit 3-issue clusters • 2-way interleaved on-chip cache • A memory switch • A cluster switch • External memory interface • On-chip network interfaces and routers
A MAP Cluster • 64-bit three-issue pipelined processor • 2 integer ALUs • 1 floating-point ALU • Register files • 4KB instruction cache • A MAP instruction holds 1, 2, or 3 operations
Threads • Exploit ILP both within and across the clusters • Horizontal Threads (H-Threads) • Instruction-level parallelism • Each executes on a single MAP cluster • 3-wide instruction stream • Communication/synchronization through messages/registers/memory • Up to 6 H-Threads can be interleaved dynamically on a cycle-by-cycle basis
Threads (contd.) • Vertical Threads (V-Threads) • Thread-level parallelism (a standard process) • Contains up to 4 H-Threads (one per cluster) • Flexibility of scheduling (compiler/run-time) • Communication/synchronization through registers • At most 6 resident V-Threads • 4 user slots, 1 event slot, 1 exception slot
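As a software analogy (not the hardware mechanism itself), the cycle-by-cycle interleaving of H-Threads on a MAP cluster can be sketched with Python generators: each generator yields one issued instruction per "cycle", and a round-robin scheduler rotates among the ready threads. The names `h_thread` and `interleave` are illustrative only.

```python
from collections import deque

def h_thread(name, n_ops):
    """An H-Thread modeled as a generator: each yield is one issued instruction."""
    for i in range(n_ops):
        yield f"{name}:op{i}"

def interleave(threads):
    """Round-robin, cycle-by-cycle interleaving of ready threads
    (at most 6 H-Threads share a MAP cluster)."""
    assert len(threads) <= 6
    queue = deque(threads)
    trace = []
    while queue:
        t = queue.popleft()
        try:
            trace.append(next(t))
            queue.append(t)   # thread still has work: rotate to the back
        except StopIteration:
            pass              # thread finished: drop it from the schedule
    return trace

trace = interleave([h_thread("A", 2), h_thread("B", 3)])
# Issue slots alternate while both threads are ready:
# A:op0, B:op0, A:op1, B:op1, B:op2
```

When one thread stalls or finishes, the remaining threads absorb its issue slots, which is the point of the dynamic interleaving on the MAP.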
Concurrency Model Three Levels of Parallelism • Instruction Level Parallelism (~1 instruction) • VLIW, superscalar processors • Issues: control flow, data dependency, scalability • Thread Level Parallelism (~1000 instructions) • Chip multiprocessors • Issues: limited coarse TLP, inner cores non-optimal • Fine-grain Parallelism (~50–1000 instructions)
Mapping Program Granularity to the Architecture
Fine-grain overheads • Thread creation (11 cycles – hfork) • Communication • Register-to-register reads/writes • Message passing / on-chip cache • Synchronization • Blocking on a register (full/empty bit) • Barrier (cbar instruction) • Memory (sync bit)
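The full/empty-bit synchronization above can be emulated in software, as a rough sketch of the semantics rather than the hardware: a read of an empty register blocks until a write fills it. `SyncCell` is a hypothetical name for this illustration.

```python
import threading

class SyncCell:
    """Software emulation of a register/memory word with a full/empty bit:
    a read blocks until a write has set the bit, as with M-Machine
    blocking register reads."""
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write(self, value):
        """Store the value and mark the cell full, waking blocked readers."""
        with self._cond:
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read(self):
        """Block while the cell is empty, then return its value."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            return self._value

cell = SyncCell()
producer = threading.Thread(target=cell.write, args=(42,))
producer.start()
result = cell.read()   # consumer blocks until the producer's write lands
producer.join()
```

In hardware this costs no software polling; the sketch only mirrors the producer/consumer ordering the full/empty bit enforces.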
Design Motivation • Continued scaling of the clock rate • Scalability of the processing core • Higher ILP – instruction throughput (IPC) • Mitigate global wire delay overheads • Closer coupling of architecture and compiler
Architecture • An interconnected 2-D network of ALU arrays • Each node has an instruction buffer (IB) and an execution unit • A single control thread maps instructions to nodes • Block-Atomic Execution Model • Maps blocks of statically scheduled instructions • Dynamic execution in data-flow order • Forwards temporary values to the consumer ALUs • Critical path scheduled along the shortest physical path
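The data-flow-order execution of a mapped block can be sketched as follows, under the simplifying assumptions that every instruction has two operands and routing is free; `run_block` and the instruction encoding are illustrative, not the GPA ISA.

```python
import operator

# Dataflow-order firing of a statically mapped instruction block:
# an instruction fires once all its operands have arrived, then
# forwards its result directly to its consumers (no shared register file).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def run_block(instrs, inputs):
    """instrs: {name: (op, src_a, src_b)}; inputs: externally supplied operands."""
    values = dict(inputs)      # operands that have "arrived" so far
    pending = dict(instrs)
    while pending:
        fired = [n for n, (op, a, b) in pending.items()
                 if a in values and b in values]   # both operands present
        assert fired, "deadlock: cyclic dependences in the block"
        for n in fired:
            op, a, b = pending.pop(n)
            values[n] = OPS[op](values[a], values[b])  # forward to consumers
    return values

# (a + b) * (c - d), scheduled as three instructions on the grid
out = run_block(
    {"t1": ("add", "a", "b"), "t2": ("sub", "c", "d"), "t3": ("mul", "t1", "t2")},
    {"a": 2, "b": 3, "c": 10, "d": 4},
)
# out["t3"] == 30; t1 and t2 fire in the same "wave", t3 after both
```

The independent instructions t1 and t2 become ready together, which is where the grid's ILP comes from; the critical path (here t1/t2 then t3) is what the compiler schedules along the shortest physical route.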
Implementation • Instruction fetch and map • Predicated hyper-blocks, move instructions • Execution – control logic • Operand routing – max. 3 destinations, split instructions • Hyper-block control • Predication (execute-all approach), cmove instructions • Block commit • Block stitching
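The execute-all predication style can be sketched as: both sides of a branch are computed unconditionally inside the hyper-block, and a conditional move (cmove) selects which value is committed. `cmove` and `abs_predicated` are illustrative names for this sketch.

```python
def cmove(pred, val_true, val_false):
    """Conditional move: both operands were already computed; the predicate
    only selects which one is committed (no branch in the schedule)."""
    return val_true if pred else val_false

def abs_predicated(x):
    # Execute-all: both candidate results are produced unconditionally...
    pos, neg = x, -x
    # ...and the predicate picks the live value at commit time.
    return cmove(x >= 0, pos, neg)
```

This trades extra ALU work for the removal of control-flow joins inside the block, keeping the hyper-block a single schedulable unit.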
Key Ideas: Convergence • Microprocessor – a number of superscalar processors communicating/synchronizing via registers – low overheads • Exploiting ILP through TLP granularities • Dependences mapped onto a grid of ALUs • Replication reduces design/verification effort • Point-to-point communication • Exposing the architecture's partitioning and flow of operations to the compiler • Avoids wire and routing delays and the memory-wall problem
Ideas: Divergence • M-Machine • On-chip cache and register-based mechanisms [delays] • Broadcasting and point-to-point communication • GPA • Register-set grid: chaining [scalability] • Point-to-point communication • TERA • Fine-grain threads – memory communication/synchronization (full/empty bits) • No support for single-threaded code
Drawbacks (Unresolved Issues) • M-Machine • Scalability: clock speeds • Memory synchronization (use of hfork) • Grid Processor Arch. • Data caches far from the ALUs • Delays between dependent operations due to network routers and wires • Complex frame management and block stitching • Explicit dependence on the compiler
Challenges/Future Directions • Architectural support to extract TLP • Parallelizing compiler technology • How many cores/threads? • No. of threads vs. memory latency, wire delays [Flynn] • Inter-thread communication • Grid height of 8 (IPC 5–6) [GPA, Peter] • Optimization = f(communication, delays, memory costs)
Challenges (contd.) • On-the-fly data-dependence detection (RAW/WAR) • TLP/ILP balance – the M-Machine multicomputer