A Design Space Evaluation of Grid Processor Architectures

A Design Space Evaluation of Grid Processor Architectures Ramadass Nagarajan, Karthikeyan Sankaralingam, Doug Burger, Stephen W. Keckler The University of Texas at Austin Presented by: A. Emre ARPACI CMPE / BOUN

Outline Introduction Grid Processor Architectures (GPAs) Organization of a GPA Architecture of a Computation Node Execution Model on GPAs A GPA Baseline, GPA-1 Advantages & Disadvantages of GPAs Experimental Results & Comparisons Conclusions

Introduction • Processor performance has improved at a rate of 50% per year over the past two decades • Previously, performance improvements were due to wider data paths, h/w support for memory mgmt., and tighter integration • Nowadays, the bulk of performance growth is due mainly to faster clock rates – this comes from • Technology scaling • Deeper pipelines • This growth will soon end, as deeper pipelines reach limits on the number of gates per pipeline stage • Moreover, increasing wire resistance makes achieving higher ILP in conventional architectures more difficult • Hence, further performance improvements must come from higher levels of instruction and thread-level parallelism

Introduction cont. • Two approaches for extracting ILP: conventional superscalar and VLIW machines • Superscalar machines detect parallelism at run time; however, they have a limited instruction issue window • VLIW machines detect parallelism at compile time; however, they perform instruction scheduling statically • GPAs are a new class of architectures designed to address these technology challenges

Grid Processor Architectures (GPAs) • A GPA is a hybrid between conventional VLIW and superscalar architectures • GPAs are designed to enable faster clock rates and higher ILP than conventional archs, even as devices shrink and wire delays increase • GPAs schedule instructions statically onto computation nodes and then execute them dynamically in dataflow order

GPAs Overview • The GPA core is a 2-d array of nodes, each containing a small instruction buffer and one execution unit • These nodes are connected using a dedicated communication n/w for passing operands and data • They are controlled by a single control thread that maps large blocks of instructions to the nodes • A compiler is used to detect parallelism and statically schedule instructions onto the grid, such that the topology of the dataflow graph matches the mapping • Instructions are issued dynamically – order determined by the availability of input operands

Organization of GPA • Consists of a 2-D array of fine-grained computation nodes connected by a dedicated communication network • [Figure: GPA organization; labels: Banked Register Files, Instruction Caches, Data Caches]

Organization of GPA cont. • Instructions are delivered by instruction cache banks on the left side of the 2-D array • The block sequencer and block termination control determine which instruction groups to map to the grid and when each group has been completed and can be committed • Instruction group inputs are fetched from the register file banks and injected from the top of the grid • Operands are passed from producer to consumer instructions through a lightweight network, shown as a mesh augmented with diagonal channels • Memory accesses are routed to the primary cache banks located on the right side of the grid through a separate network

Architecture of a Computation Node • Performs the functions of execution, temporary storage, and data forwarding

Architecture of a Computation Node cont. • ALU performs the actual execution • Buffers store instructions and their input operands • Instruction wake-up unit matches instructions and issues them • Router forwards values to destinations

The Block-Atomic Execution Model • Treats groups of instructions as an atomic unit for fetching, mapping onto the execution resources, and committing • The execution substrate is a collection of ALUs • Instruction Groups • Instructions are placed in groups by the compiler • A group has no internal transfers of control • Taken branches and the last instruction in a group transfer control to the next group • Data used and produced by a group are of three types: • Group outputs – written to register file when the group commits • Group temporaries – forwarded directly from producer to consumer • Group inputs – values produced by preceding groups • Group Execution • Compiler statically assigns each instr to one of the ALUs
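The three value classes on this slide can be derived mechanically from a group's instruction list. The following is a minimal sketch, not the authors' compiler pass: the `(dest, sources)` instruction format, the `classify_values` name, and the `live_out` set (values still needed by later groups) are all assumptions made for illustration.

```python
def classify_values(group, live_out):
    """Split the values touched by one group into inputs, temporaries, outputs.

    group    -- list of (dest, sources) pairs in program order (hypothetical format)
    live_out -- names that later groups still read (assumed known to the compiler)
    """
    written = set()
    inputs = set()
    for dest, sources in group:
        # A source not yet written inside the group must come from an earlier group.
        inputs.update(s for s in sources if s not in written)
        written.add(dest)
    outputs = written & set(live_out)   # committed to the register file
    temporaries = written - outputs     # forwarded producer-to-consumer only
    return inputs, temporaries, outputs


# r4 = ((r1 op r2) op r3) op (r1 op r2); only r4 is needed after the group.
group = [("t1", ["r1", "r2"]), ("t2", ["t1", "r3"]), ("r4", ["t2", "t1"])]
inputs, temporaries, outputs = classify_values(group, live_out={"r4"})
```

Here `t1` and `t2` never touch the register file, which is the point of group temporaries: only `r4` is written at commit.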

The Block-Atomic Execution Model cont. • Execution of a group proceeds as follows: • The group is fetched and mapped onto the ALUs • “move” instructions read inputs and forward the values to ALUs • When all operands arrive at an ALU, the instr is executed (data-flow) • Destinations (ALU names) are explicitly encoded into instructions, so group temporaries can be directly sent point-to-point • When all instrs in a group have completed, the group is committed – group outputs are written to register file, and required memory updates are done
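The execution steps listed on this slide can be sketched as a small functional simulator. This is an illustration under stated assumptions, not the GPA-1 hardware: every instruction is assumed to take exactly two operands, timing and routing are ignored, and the names `Instr`, `run_group`, `OPS`, and the `("reg", name)` target form are hypothetical.

```python
from dataclasses import dataclass, field
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


@dataclass
class Instr:
    op: str        # key into OPS
    targets: list  # (consumer name, operand slot) or ("reg", register name)
    operands: dict = field(default_factory=dict)  # slot -> value, filled on arrival


def run_group(instrs, moves, regfile):
    """Execute one group in dataflow order, then commit its outputs atomically."""
    outputs = {}  # buffered until every instruction has completed
    ready = []

    def deliver(target, value):
        name, slot = target
        if name == "reg":  # group output: held back until commit
            outputs[slot] = value
            return
        instr = instrs[name]
        instr.operands[slot] = value
        if len(instr.operands) == 2:  # assumption: all instructions are binary
            ready.append(name)        # both operands present, instruction wakes up

    # "move" instructions read group inputs and inject them into the grid.
    for reg, targets in moves.items():
        for target in targets:
            deliver(target, regfile[reg])

    fired = []
    while ready:
        name = ready.pop(0)
        instr = instrs[name]
        result = OPS[instr.op](instr.operands[0], instr.operands[1])
        fired.append(name)
        for target in instr.targets:  # destinations are encoded in the instruction
            deliver(target, result)

    if len(fired) != len(instrs):
        raise RuntimeError("group did not complete; nothing is committed")
    regfile.update(outputs)  # commit: group outputs reach the register file
    return fired


# r3 = (r1 + r2) * (r1 - r2)
instrs = {
    "i0": Instr("add", [("i2", 0)]),
    "i1": Instr("sub", [("i2", 1)]),
    "i2": Instr("mul", [("reg", "r3")]),
}
moves = {"r1": [("i0", 0), ("i1", 0)], "r2": [("i0", 1), ("i1", 1)]}
regs = {"r1": 5, "r2": 3}
order = run_group(instrs, moves, regs)
```

The firing order falls out of operand arrival rather than program order, and `r3` becomes visible only after all three instructions have fired.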

The mapping and execution of a group • First, place critical path to minimize communication delays • Then place less critical paths to maximize the ILP
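The two placement steps above can be sketched as a greedy pass over the dataflow graph. This is a simplified illustration, not the scheduler used in the paper: it assumes one instruction per node, uses Manhattan distance as a stand-in for communication delay, and breaks ties in row-major order. The `place` function and its `deps` format are hypothetical.

```python
from itertools import product


def place(deps, rows, cols):
    """Greedy placement. deps maps each instruction to the list of its producers."""
    if len(deps) > rows * cols:
        raise ValueError("more instructions than nodes")

    consumers = {i: [] for i in deps}
    for i, srcs in deps.items():
        for s in srcs:
            consumers[s].append(i)

    def longest(i, edges, memo):
        # Length of the longest chain that starts at i and follows `edges`.
        if i not in memo:
            memo[i] = 1 + max((longest(n, edges, memo) for n in edges[i]), default=0)
        return memo[i]

    depth, height = {}, {}
    for i in deps:
        longest(i, deps, depth)        # longest chain back to a group input
        longest(i, consumers, height)  # longest chain forward to a group output
    crit = {i: depth[i] + height[i] - 1 for i in deps}  # longest path through i

    # Most critical instructions first; along a path, producers before consumers.
    order = sorted(deps, key=lambda i: (-crit[i], depth[i]))

    free = list(product(range(rows), range(cols)))
    placement = {}
    for i in order:
        placed = [placement[n] for n in deps[i] + consumers[i] if n in placement]

        def cost(node):
            # Total hop distance to already placed producers and consumers.
            return sum(abs(node[0] - r) + abs(node[1] - c) for r, c in placed)

        best = min(free, key=cost)
        free.remove(best)
        placement[i] = best
    return placement


# Critical chain a -> b -> c, plus a short side input d -> c, on a 2x2 grid.
deps = {"a": [], "b": ["a"], "c": ["b", "d"], "d": []}
placement = place(deps, rows=2, cols=2)
```

In this example the chain `a`, `b`, `c` lands on mutually adjacent nodes first, and `d` then takes the remaining node next to its consumer.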

A GPA Baseline, GPA-1 • Each group of mapped instrs consists of one predicated hyperblock – single point of entry and multiple exits but no internal transfer of control • After a hyperblock is mapped, branch and target predictors predict the succeeding hyperblock • Supports three consumers per instruction. If there are more than 3 consumers, a “split” instr can be inserted • Four kinds of delays inhibit back-to-back execution of instructions in consecutive cycles: • Routing delays • Transmission/wire delays • Instr wakeup delay • Delays induced by contention for wires/ports at nodes • They found that two I/O ports at each node are enough • Router and wire delays are the most important factors
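The three-consumer limit and the “split” instruction can be illustrated with a small fanout-tree builder. This is a sketch under assumptions: the slide does not say how splits are arranged, so the queue-based grouping below is just one reasonable choice, and the `add_splits` function and the `<producer>.split<n>` naming are hypothetical.

```python
from collections import deque


def add_splits(producer, consumers, max_targets=3):
    """Return {instruction: [targets]} with at most max_targets targets each."""
    if max_targets < 2:
        raise ValueError("a split needs at least two targets to reduce fanout")
    pending = deque(consumers)
    tree = {}
    n = 0
    while len(pending) > max_targets:
        split = f"{producer}.split{n}"  # hypothetical naming scheme
        n += 1
        # Bundle the oldest targets under a new split, then queue the split itself.
        tree[split] = [pending.popleft() for _ in range(max_targets)]
        pending.append(split)
    tree[producer] = list(pending)
    return tree


tree = add_splits("p", [f"c{i}" for i in range(7)])
# tree["p"] == ["c6", "p.split0", "p.split1"]
```

Each split adds at least one extra hop for the consumers behind it, so a real scheduler would presumably keep the most critical consumers as direct targets of the producer.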

GPA-1 cont. • Hyperblock control • Predication • Uses an execute-all approach – both predicate paths execute, but only one path delivers the result • Only the leaf instrs in the DFG are predicated • “cmove” instrs are used to implement predication • Early exits • A branch from the middle of a hyperblock is called an early exit • Branch instrs should be executed in serial order • Every branch instr is predicated on the complement of the condition for the immediately preceding branch • Block commit • Logic for detecting when all values have been produced • A block cannot be committed until all instructions have completed • Block stitching • Concurrent execution of multiple hyperblocks • Memory access order is maintained by traditional load/store queues
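The execute-all predication and the early-exit chaining described on this slide can be sketched in a few lines. This is an interpretation, not the GPA-1 logic: it assumes the complemented condition is propagated down the branch chain so that a branch can fire only if no earlier branch fired, and the names `cmove` and `exit_predicates` are used here purely for illustration.

```python
def cmove(pred, if_true, if_false):
    """Conditional move: both inputs are already computed, one is delivered."""
    return if_true if pred else if_false


def exit_predicates(conds):
    """Chain branch conditions so that at most one exit branch fires.

    conds -- branch conditions in program order (hypothetical input format)
    """
    preds = []
    enabled = True  # the first branch is unconditionally enabled
    for cond in conds:
        preds.append(enabled and cond)
        enabled = enabled and not cond  # complement feeds the next branch
    return preds


# Execute-all: both sides of "if x > 5" run; cmove picks the surviving value.
x = 7
doubled = x * 2
halved = x // 2
result = cmove(x > 5, doubled, halved)

# Second and third branch conditions both hold, but only the earlier one exits.
preds = exit_predicates([False, True, True])
```

With these inputs `result` is 14 and `preds` is `[False, True, False]`, which matches the requirement that branches resolve in serial order even though their conditions are evaluated in dataflow fashion.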

Advantages of GPAs • Eliminate frequently accessed centralized structures such as register files and the instruction issue window; enhances scalability • Convert the conventional broadcast network into a point-to-point network; reduces growing global wire and delay overheads • The physical layout of the ALUs is exposed to the instruction scheduler, so wire and communication delays can be used to help the scheduler minimize the critical path • Large instruction groups are mapped onto nodes as a single unit of computation; reduces many per-instruction decode overheads

Disadvantages of GPAs • Force data caches to be far away from many of the nodes; incur delays between dependent operations due to network router and wires • The complexity of block stitching is significant and may interfere with the goal of fast clock rates

Experimental Results & Comparisons • An evaluation of GPA-1 performance across a set of nine applications • Scheduler assigns instructions to nodes using a greedy critical path scheduling strategy • Uses large groups consisting of 16-172 useful instructions • Performance comparison to conventional 8-way issue superscalar and VLIW architectures

Experimental Results & Comparisons cont. • GPA-1 performs best with dct benchmark, showing 10.2 IPC with perfect and 8.5 IPC with realistic

Experimental Results & Comparisons • An evaluation of GPA-1 performance across a set of ten applications • Uses large groups consisting of 14-119 useful instructions • Performance comparison to the 6-way Alpha 21264 conventional out-of-order superscalar architecture

Experimental Results & Comparisons cont. • GPA-1 achieves 1.1-1.4x higher IPC compared to the Alpha 21264

Comments on Experimental Results • Block stitching provides roughly a factor of 2 speedup – ability to map multiple blocks speculatively is critical • Routing delay: largest determinant of GPA performance • Wire delay affects performance more than the router delay • Three point-to-point paths per node are enough • Performance improvement tapers off beyond 8x8 grid • GPA exploits between 10% and 40% of the available ILP in each benchmark

Conclusions • GPAs, which are designed to enable continued scaling of both clock rates and instruction throughput, are introduced • Experimental results across a set of benchmarks show that GPAs have competitive, and in most benchmarks higher, performance compared to conventional architectures • GPAs expose many challenges. Researchers are actively working to advance the architecture of the first GPA, GPA-1, to alleviate these challenges

Thanks... Questions?
