A Design Space Evaluation of Grid Processor Architectures • Ramadass Nagarajan, Karthikeyan Sankaralingam, Doug Burger, Stephen W. Keckler • The University of Texas at Austin • Presented by: A. Emre ARPACI, CMPE / BOUN
Outline • Introduction • Grid Processor Architectures (GPAs) • Organization of a GPA • Architecture of a Computation Node • Execution Model on GPAs • A GPA Baseline, GPA-1 • Advantages & Disadvantages of GPAs • Experimental Results & Comparisons • Conclusions
Introduction • Processor performance has improved at a rate of 50% per year over the past two decades • Previously, performance improvements came from wider data paths, hardware support for memory management, and tighter integration • Nowadays, the bulk of performance growth comes from faster clock rates, which result from • Technology scaling • Deeper pipelines • This growth will soon end, as deeper pipelines reach limits on the number of gates per pipeline stage • Moreover, increasing wire resistance makes achieving higher ILP in conventional architectures more difficult • Hence, further performance improvements must come from higher levels of instruction-level and thread-level parallelism
Introduction cont. • Two approaches for extracting ILP: conventional superscalar and VLIW machines • Superscalar machines detect parallelism at run time; however, they have a limited instruction issue window • VLIW machines detect parallelism at compile time; however, they perform instruction scheduling statically • GPAs are a new class of architectures designed to address these technology challenges
Grid Processor Architectures (GPAs) • A GPA is a hybrid between conventional VLIW and superscalar architectures • GPAs are designed to enable faster clock rates and higher ILP than conventional architectures, even as devices shrink and wire delays increase • GPAs schedule instructions statically onto computation nodes and then execute them dynamically in dataflow order
GPAs Overview • The GPA core is a 2-D array of nodes, each containing a small instruction buffer and one execution unit • These nodes are connected by a dedicated communication network for passing operands and data • They are controlled by a single control thread that maps large blocks of instructions to the nodes • A compiler detects parallelism and statically schedules instructions onto the grid, such that the mapping matches the topology of the dataflow graph • Instructions are issued dynamically, in an order determined by the availability of input operands
Organization of GPA • Consists of a 2-D array of fine-grained computation nodes connected by a dedicated communication network • [Figure labels: banked register files, instruction caches, data caches]
Organization of GPA cont. • Instructions are delivered by instruction cache banks on the left side of the 2-D array • The block sequencer and block termination control determine which instruction groups to map to the grid and when each group has been completed and can be committed • Instruction group inputs are fetched from the register file banks and injected from the top of the grid • Operands are passed from producer to consumer instructions through a lightweight network, shown as a mesh augmented with diagonal channels • Memory accesses are routed to the primary cache banks located on the right side of the grid through a separate network
Architecture of a Computation Node • Performs the functions of execution, temporary storage, and data forwarding
Architecture of a Computation Node cont. • ALU performs the actual execution • Buffers store instructions and their input operands • Instruction wake-up unit matches instructions and issues them • Router forwards values to destinations
The Block-Atomic Execution Model • Treats groups of instructions as an atomic unit for fetching, mapping onto the execution resources, and committing • The execution substrate is a collection of ALUs • Instruction Groups • Instructions are placed in groups by the compiler • A group has no internal transfers of control • Taken branches and the last instruction in a group transfer control to the next group • Data produced and consumed by a group are of three types: • Group outputs: written to the register file when the group commits • Group temporaries: forwarded directly from producer to consumer • Group inputs: values produced by preceding groups • Group Execution • The compiler statically assigns each instr to one of the ALUs
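The three kinds of group data can be illustrated with a small sketch. It assumes a group is given as a list of hypothetical `(dest, sources)` pairs in program order and that the set of values needed by later groups (`live_out`) is already known; neither representation comes from the slides.

```python
def classify_values(group, live_out):
    """Split the names touched by one instruction group into three sets.

    group:    list of (dest, sources) pairs in program order (assumed format).
    live_out: names read by later groups (assumed to come from a liveness pass).
    """
    produced = set()
    inputs = set()
    for dest, sources in group:
        for src in sources:
            if src not in produced:
                inputs.add(src)       # read before any in-group producer: group input
        produced.add(dest)
    outputs = produced & set(live_out)   # written to the register file at commit
    temporaries = produced - outputs     # only forwarded producer-to-consumer
    return inputs, temporaries, outputs


# Hypothetical three-instruction group.
group = [("t1", ["r1", "r2"]), ("t2", ["t1", "r3"]), ("r4", ["t2", "t1"])]
inputs, temporaries, outputs = classify_values(group, live_out={"r4"})
print(sorted(inputs), sorted(temporaries), sorted(outputs))
```

In this simplification a value that is both live-out and consumed inside the group is counted only as an output, even though it would also be forwarded directly.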
The Block-Atomic Execution Model cont. • Execution of a group proceeds as follows: • The group is fetched and mapped onto the ALUs • “move” instructions read inputs and forward the values to ALUs • When all operands arrive at an ALU, the instr is executed (data-flow) • Destinations (ALU names) are explicitly encoded into instructions, so group temporaries can be sent directly point-to-point • When all instrs in a group have completed, the group is committed: group outputs are written to the register file, and required memory updates are done
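The steps above can be sketched as a tiny dataflow interpreter. This is only an illustration under assumed encodings: the instruction table, the `("alu", consumer, slot)` and `("reg", name)` target tuples, and the `run_group` helper are all hypothetical, and timing, routing, and memory updates are ignored.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_group(instrs, group_inputs):
    """Execute one group in dataflow order and commit it atomically.

    instrs:       name -> (op, operand_count, targets)   (assumed encoding)
    group_inputs: list of (value, targets), standing in for "move" instrs.
    A target is ("alu", consumer, slot) or ("reg", register_name).
    Each operand slot is assumed to be written exactly once.
    """
    operands = {name: {} for name in instrs}
    pending_outputs = {}          # buffered until the whole group completes
    ready, order = [], []

    def deliver(value, targets):
        for target in targets:
            if target[0] == "reg":
                pending_outputs[target[1]] = value
            else:
                _, consumer, slot = target
                operands[consumer][slot] = value
                if len(operands[consumer]) == instrs[consumer][1]:
                    ready.append(consumer)   # all operands arrived: wake up

    for value, targets in group_inputs:
        deliver(value, targets)
    while ready:
        name = ready.pop(0)
        op, count, targets = instrs[name]
        result = OPS[op](*(operands[name][i] for i in range(count)))
        order.append(name)
        deliver(result, targets)

    committed = len(order) == len(instrs)    # commit only if every instr ran
    return (pending_outputs if committed else None), order


# Hypothetical group computing (a + b) * (a - b) with a = 5, b = 3.
instrs = {
    "i0": ("add", 2, [("alu", "i2", 0)]),
    "i1": ("sub", 2, [("alu", "i2", 1)]),
    "i2": ("mul", 2, [("reg", "r3")]),
}
group_inputs = [
    (5, [("alu", "i0", 0), ("alu", "i1", 0)]),
    (3, [("alu", "i0", 1), ("alu", "i1", 1)]),
]
registers, order = run_group(instrs, group_inputs)
print(registers, order)
```

Because each instruction names its consumers explicitly, no shared register file or broadcast is needed for the temporaries; only `r3` reaches the register file, and only at commit.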
The mapping and execution of a group • First, place critical path to minimize communication delays • Then place less critical paths to maximize the ILP
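A minimal sketch of this two-phase placement, not the scheduler used in the evaluation: it assumes one instruction per node, measures communication cost as Manhattan distance, and lays the critical path down the first column. The helper names (`depths`, `critical_chain`, `place`) and the example graph are hypothetical.

```python
def depths(deps):
    """Length of the longest dependence chain ending at each instruction."""
    depth = {}

    def visit(n):
        if n not in depth:
            depth[n] = 1 + max((visit(p) for p in deps[n]), default=0)
        return depth[n]

    for n in deps:
        visit(n)
    return depth


def critical_chain(deps, depth):
    """Walk back from the deepest instruction through its deepest producers."""
    node = max(deps, key=depth.get)
    chain = [node]
    while deps[node]:
        node = max(deps[node], key=depth.get)
        chain.append(node)
    return chain[::-1]


def place(deps, rows, cols):
    """deps: instr -> list of producers (a DAG). Returns instr -> (row, col)."""
    assert len(deps) <= rows * cols, "assumes one instruction per node"
    depth = depths(deps)
    # Column-major order, so consecutive critical instrs sit one hop apart.
    free = [(r, c) for c in range(cols) for r in range(rows)]
    placement = {}
    for n in critical_chain(deps, depth):          # phase 1: critical path
        placement[n] = free.pop(0)
    for n in sorted((m for m in deps if m not in placement), key=depth.get):
        def cost(pos):                             # phase 2: stay near producers
            return sum(abs(pos[0] - placement[p][0]) + abs(pos[1] - placement[p][1])
                       for p in deps[n])
        best = min(free, key=cost)
        free.remove(best)
        placement[n] = best
    return placement


deps = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "x": ["a"], "y": ["x", "c"]}
placement = place(deps, rows=4, cols=2)
print(placement)
```

If the critical chain is longer than one column, this sketch wraps to the top of the next column, which breaks adjacency; a real scheduler would need a smarter layout.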
A GPA Baseline, GPA-1 • Each group of mapped instrs consists of one predicated hyperblock: a single point of entry and multiple exits, but no internal transfer of control • After a hyperblock is mapped, branch and target predictors predict the succeeding hyperblock • Supports three consumers per instruction. If there are more than 3 consumers, a “split” instr can be inserted • Four kinds of delays inhibit back-to-back execution of instructions in consecutive cycles: • Routing delays • Transmission/wire delays • Instr wakeup delay • Delays induced by contention for wires/ports at nodes • They found that two I/O ports at each node are enough • Router and wire delays are the most important factors
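The three-consumer limit can be illustrated with a sketch that inserts “split” instrs until no instruction names more than three targets. The slides only say that a split can be inserted; the roughly balanced tree built here, the `fan_out` helper, and the `splitN` names are assumptions made for illustration.

```python
from itertools import count


def fan_out(producer, consumers, max_targets=3):
    """Return instr -> list of targets, with at most max_targets per instr.

    When a value has too many consumers, hypothetical "split" instrs are
    inserted to forward it; the tree shape is an assumption of this sketch.
    """
    targets = {}
    ids = count()

    def build(source, dests):
        if len(dests) <= max_targets:
            targets[source] = list(dests)
            return
        size = -(-len(dests) // max_targets)        # ceiling division
        out = []
        for i in range(0, len(dests), size):
            chunk = dests[i:i + size]
            if len(chunk) == 1:
                out.append(chunk[0])                # lone consumer: send directly
            else:
                split = f"split{next(ids)}"
                out.append(split)
                build(split, chunk)                 # split forwards to this chunk
        targets[source] = out

    build(producer, list(consumers))
    return targets


tree = fan_out("p", [f"c{i}" for i in range(7)])
for instr, dests in tree.items():
    print(instr, "->", dests)
```

Each split adds at least one extra hop of routing and wakeup delay on its path, so a scheduler would presumably keep consumers on the critical path as direct targets.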
GPA-1 cont. • Hyperblock control • Predication • Uses an execute-all approach: both predicate paths execute, but only one path delivers the result • Only the leaf instrs in the DFG are predicated • “cmove” instrs are used to implement predication • Early exits • A branch from the middle of a hyperblock is called an early exit • Branch instrs should be executed in serial order • Every branch instr is predicated on the complement of the condition for the immediately preceding branch • Block commit • Logic for detecting when all values have been produced • A block cannot be committed until all instructions have completed • Block stitching • Concurrent execution of multiple hyperblocks • Memory access is maintained by traditional load/store queues
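The early-exit rule can be sketched as a predicate chain: each branch is enabled only if no earlier branch was taken, so at most one exit fires. The function names and block labels below are hypothetical.

```python
def branch_predicates(conditions):
    """Predicate for each branch, in program order.

    A branch is enabled only if it is reached, i.e. every preceding branch
    condition was false (the chained complement described on the slide).
    """
    predicates = []
    enabled = True
    for cond in conditions:
        predicates.append(enabled)
        enabled = enabled and not cond
    return predicates


def taken_exit(conditions, targets, fallthrough):
    """Return the single hyperblock exit that fires."""
    fired = [t for c, p, t in zip(conditions, branch_predicates(conditions), targets)
             if p and c]
    assert len(fired) <= 1        # the predicate chain serializes the branches
    return fired[0] if fired else fallthrough


# Hypothetical hyperblock with three early exits; the second one is taken.
print(branch_predicates([False, True, True]))
print(taken_exit([False, True, True], ["B1", "B2", "B3"], fallthrough="B4"))
```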
Advantages of GPAs • Eliminate frequently accessed centralized structures such as register files and the instruction issue window; enhances scalability • Convert the conventional broadcast network into a point-to-point network; reduces growing global wire and delay overheads • The physical layout of the ALUs is exposed to the instruction scheduler, so wire and communication delays can be used to help the scheduler minimize the critical path • Large instruction groups are mapped onto nodes as a single unit of computation; reduces many per-instruction decode overheads
Disadvantages of GPAs • Force data caches to be far away from many of the nodes; incur delays between dependent operations due to network router and wires • The complexity of block stitching is significant and may interfere with the goal of fast clock rates
Experimental Results & Comparisons • An evaluation of GPA-1 performance across a set of nine applications • Scheduler assigns instructions to nodes using a greedy critical path scheduling strategy • Uses large groups consisting of 16-172 useful instructions • Performance comparison to conventional 8-way-issue superscalar and VLIW architectures
Experimental Results & Comparisons cont. • GPA-1 performs best with the dct benchmark, showing 10.2 IPC with perfect and 8.5 IPC with realistic
Experimental Results & Comparisons • An evaluation of GPA-1 performance across a set of ten applications • Uses large groups consisting of 14-119 useful instructions • Performance comparison to the 6-way Alpha 21264 conventional out-of-order superscalar architecture
Experimental Results & Comparisons cont. • GPA-1 achieves 1.1-1.4x higher IPC compared to the Alpha 21264
Comments on Experimental Results • Block stitching provides roughly a factor of 2 speedup; the ability to map multiple blocks speculatively is critical • Routing delay is the largest determinant of GPA performance • Wire delay affects performance more than the router delay • Three point-to-point paths per node are enough • Performance improvement tapers off beyond an 8x8 grid • GPA exploits between 10% and 40% of the available ILP in each benchmark
Conclusions • GPAs, which are designed to enable continued scaling of both clock rates and instruction throughput, are introduced • Experimental results across a set of benchmarks show that GPAs perform competitively with conventional architectures, and better on most benchmarks • GPAs expose many challenges. Researchers are actively working to advance the architecture of the first GPA, GPA-1, to alleviate these challenges
Thanks ... Questions?