M-Machine and Grids: Parallel Computer Architectures. Navendu Jain
Readings • The M-Machine multicomputer. Fillo et al., MICRO 1995 • Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor. Keckler et al., MICRO 1998 • A design space evaluation of grid processor architectures. Nagarajan et al., MICRO 2001
Outline • The M-Machine Multicomputer • Thread Level Parallelism on M-Machine • Grid Processor Architectures • Review and Discussion
Design Motivation • Achieve higher throughput from memory resources • Increase the chip area devoted to processors • Arithmetic-to-bandwidth ratio of 12 operations/word • Minimize global communication (local synchronization) • Faster execution of fixed-size problems • Easier programmability of parallel computers • Incremental approach
Architecture • A bidirectional 3-D mesh network of multithreaded processing nodes • Each chip comprises a multi-ALU processor (MAP) and 128KB of on-chip synchronous DRAM • A user-accessible message-passing system (SEND) • Single global virtual address space • Target clock 100 MHz (control logic 40 MHz)
Multi-ALU Processor (MAP) • A MAP chip comprises: • Three 64-bit 3-issue clusters • 2-way interleaved on-chip cache • A memory switch • A cluster switch • External memory interface • On-chip network interfaces and routers
A MAP Cluster • 64-bit three-issue pipelined processor • 2 integer ALUs • 1 floating-point ALU • Register files • 4KB instruction cache • A MAP instruction holds 1, 2, or 3 operations
Threads • Exploit ILP both within and across the clusters • Horizontal Threads (H-Threads) • Instruction-level parallelism • Each executes on a single MAP cluster • 3-wide instruction stream • Communication/synchronization through messages/registers/memory • Up to 6 H-Threads can be interleaved dynamically on a cycle-by-cycle basis
Threads (contd.) • Vertical Threads (V-Threads) • Thread-level parallelism (a standard process) • Contains up to 4 H-Threads (one per cluster) • Flexibility of scheduling (compiler/run-time) • Communication/synchronization through registers • At most 6 resident V-Threads • 4 user slots, 1 event slot, 1 exception slot
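As a software analogy (not the hardware mechanism itself), the cycle-by-cycle interleaving of H-Threads on a MAP cluster can be sketched with Python generators: each generator yields one issued instruction per "cycle", and a round-robin scheduler rotates among the ready threads. The names `h_thread` and `interleave` are illustrative only.

```python
from collections import deque

def h_thread(name, n_ops):
    """An H-Thread modeled as a generator: each yield is one issued instruction."""
    for i in range(n_ops):
        yield f"{name}:op{i}"

def interleave(threads):
    """Round-robin, cycle-by-cycle interleaving of ready threads
    (at most 6 H-Threads share a MAP cluster)."""
    assert len(threads) <= 6
    queue = deque(threads)
    trace = []
    while queue:
        t = queue.popleft()
        try:
            trace.append(next(t))
            queue.append(t)   # thread still has work: rotate to the back
        except StopIteration:
            pass              # thread finished: drop it from the schedule
    return trace

trace = interleave([h_thread("A", 2), h_thread("B", 3)])
# Issue slots alternate while both threads are ready:
# A:op0, B:op0, A:op1, B:op1, B:op2
```

When one thread stalls or finishes, the remaining threads absorb its issue slots, which is the point of the dynamic interleaving on the MAP.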
Concurrency Model Three Levels of Parallelism • Instruction Level Parallelism (~1 instruction) • VLIW, superscalar processors • Issues: control flow, data dependency, scalability • Thread Level Parallelism (~1000 instructions) • Chip multiprocessors • Issues: limited coarse TLP, inner cores non-optimal • Fine-grain Parallelism (~50–1000 instructions)
Mapping Program Granularity to the Architecture
Fine-grain overheads • Thread creation (11 cycles – hfork) • Communication • Register-to-register reads/writes • Message passing / on-chip cache • Synchronization • Blocking on a register (full/empty bit) • Barrier (cbar instruction) • Memory (sync bit)
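The full/empty-bit synchronization above can be emulated in software, as a rough sketch of the semantics rather than the hardware: a read of an empty register blocks until a write fills it. `SyncCell` is a hypothetical name for this illustration.

```python
import threading

class SyncCell:
    """Software emulation of a register/memory word with a full/empty bit:
    a read blocks until a write has set the bit, as with M-Machine
    blocking register reads."""
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write(self, value):
        """Store the value and mark the cell full, waking blocked readers."""
        with self._cond:
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read(self):
        """Block while the cell is empty, then return its value."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            return self._value

cell = SyncCell()
producer = threading.Thread(target=cell.write, args=(42,))
producer.start()
result = cell.read()   # consumer blocks until the producer's write lands
producer.join()
```

In hardware this costs no software polling; the sketch only mirrors the producer/consumer ordering the full/empty bit enforces.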
Design Motivation • Continued scaling of the clock rate • Scalability of the processing core • Higher ILP – instruction throughput (IPC) • Mitigate global wire delay overheads • Closer coupling of architecture and compiler
Architecture • An interconnected 2-D network of ALU arrays • Each node has an instruction buffer (IB) and an execution unit • A single control thread maps instructions to nodes • Block-Atomic Execution Model • Maps blocks of statically scheduled instructions • Dynamic execution in data-flow order • Forwards temporary values to the consumer ALUs • Critical path scheduled along the shortest physical path
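The data-flow-order execution of a mapped block can be sketched as follows, under the simplifying assumptions that every instruction has two operands and routing is free; `run_block` and the instruction encoding are illustrative, not the GPA ISA.

```python
import operator

# Dataflow-order firing of a statically mapped instruction block:
# an instruction fires once all its operands have arrived, then
# forwards its result directly to its consumers (no shared register file).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def run_block(instrs, inputs):
    """instrs: {name: (op, src_a, src_b)}; inputs: externally supplied operands."""
    values = dict(inputs)      # operands that have "arrived" so far
    pending = dict(instrs)
    while pending:
        fired = [n for n, (op, a, b) in pending.items()
                 if a in values and b in values]   # both operands present
        assert fired, "deadlock: cyclic dependences in the block"
        for n in fired:
            op, a, b = pending.pop(n)
            values[n] = OPS[op](values[a], values[b])  # forward to consumers
    return values

# (a + b) * (c - d), scheduled as three instructions on the grid
out = run_block(
    {"t1": ("add", "a", "b"), "t2": ("sub", "c", "d"), "t3": ("mul", "t1", "t2")},
    {"a": 2, "b": 3, "c": 10, "d": 4},
)
# out["t3"] == 30; t1 and t2 fire in the same "wave", t3 after both
```

The independent instructions t1 and t2 become ready together, which is where the grid's ILP comes from; the critical path (here t1/t2 then t3) is what the compiler schedules along the shortest physical route.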
Implementation • Instruction fetch and map • Predicated hyper-blocks, move instructions • Execution – control logic • Operand routing – max. 3 destinations, split instructions • Hyper-block control • Predication (execute-all approach), cmove instructions • Block commit • Block stitching
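The execute-all predication style can be sketched as: both sides of a branch are computed unconditionally inside the hyper-block, and a conditional move (cmove) selects which value is committed. `cmove` and `abs_predicated` are illustrative names for this sketch.

```python
def cmove(pred, val_true, val_false):
    """Conditional move: both operands were already computed; the predicate
    only selects which one is committed (no branch in the schedule)."""
    return val_true if pred else val_false

def abs_predicated(x):
    # Execute-all: both candidate results are produced unconditionally...
    pos, neg = x, -x
    # ...and the predicate picks the live value at commit time.
    return cmove(x >= 0, pos, neg)
```

This trades extra ALU work for the removal of control-flow joins inside the block, keeping the hyper-block a single schedulable unit.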
Key Ideas: Convergence • Microprocessor – a number of superscalar processors communicating/synchronizing via registers – low overheads • Exploiting ILP through TLP granularities • Dependences mapped onto a grid of ALUs • Replication reduces design/verification effort • Point-to-point communication • Exposing the architecture's partitioning and flow of operations to the compiler • Avoids wire and routing delays and the memory-wall problem
Ideas: Divergence • M-Machine • On-chip cache and register-based mechanisms [delays] • Broadcasting and point-to-point communication • GPA • Register-set grid: chaining [scalability] • Point-to-point communication • TERA • Fine-grain threads – memory communication/synchronization (full/empty bits) • No support for single-threaded code
Drawbacks (Unresolved Issues) • M-Machine • Scalability: clock speeds • Memory synchronization (use of hfork) • Grid Processor Arch. • Data caches far from the ALUs • Delays between dependent operations due to network routers and wires • Complex frame management and block stitching • Explicit dependence on the compiler
Challenges/Future Directions • Architectural support to extract TLP • Parallelizing compiler technology • How many cores/threads? • No. of threads vs. memory latency, wire delays [Flynn] • Inter-thread communication • Grid height of 8 (IPC 5–6) [GPA, Peter] • Optimization = f(communication, delays, memory costs)
Challenges (contd.) • On-the-fly data-dependence detection (RAW/WAR) • TLP/ILP balance – the M-Machine multicomputer