Reducing Issue Logic Complexity in Superscalar Microprocessors Survey Project CprE 585 – Advanced Computer Architecture David Lastine Ganesh Subramanian
Introduction • The ultimate goal of any computer architect – designing a fast machine • Approaches • Increasing the clock rate (help from VLSI) • Increasing bus width • Increasing pipeline depth • Superscalar architectures • Tradeoffs between hardware complexity and clock speed • Given a particular technology, the more complex the hardware, the lower the achievable clock rate
A New Paradigm • Retain the effective functionality of complex superscalar processors • Target the bottleneck in present-day microprocessors • Instruction scheduling is the throughput limiter • Need to handle register renaming, the issue window and the wakeup/select logic effectively • Increase the clock rate • Rethink circuit design methodologies • Modify architectural design strategies • Having the cake and eating it too? • Aim at reducing power consumption as well
Approaches to Handling Issue Logic Complexity • Performance = IPC * Clock Frequency • Pipelining the scheduling logic reduces IPC • Non-pipelined scheduling logic reduces the clock rate • Architectural solutions • Non-pipelined scheduling with dependence-queue-based issue logic – complexity effective [1] • Pipelined scheduling with speculative wakeup [2] • Generic speedup and power conservation using tag elimination [3]
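The performance equation makes the tradeoff concrete. Below is a minimal sketch of the arithmetic; the IPC and clock figures are illustrative assumptions, not numbers reported by the surveyed papers.

```python
# Illustrative comparison of the IPC-vs-clock-rate tradeoff.
# The numbers are hypothetical; the surveyed papers report their own figures.

def performance(ipc, clock_ghz):
    """Instructions per second = IPC * clock frequency."""
    return ipc * clock_ghz * 1e9

non_pipelined = performance(ipc=2.0, clock_ghz=1.0)   # atomic wakeup+select limits the clock
pipelined     = performance(ipc=1.8, clock_ghz=1.25)  # small IPC loss, faster clock

print(f"non-pipelined scheduler: {non_pipelined:.2e} instr/s")
print(f"pipelined scheduler:     {pipelined:.2e} instr/s")
# The 10% IPC loss is more than recovered here because the clock improves by 25%.
```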
Baseline Superscalar Model • The rename and wakeup/select stages of the generic superscalar pipeline model are the ones to target • Consider VLSI effects when deciding which design component to redesign
Analyzing Baseline Implementations • Physical layout of microprocessor circuits optimized for speed • Use of dynamic logic for bottleneck circuits • Manual sizing of transistors in the critical path • Logic optimizations such as two-level decomposition • Components analyzed • Register rename logic • Wakeup logic / issue window • Selection logic • Bypass logic
Register Rename Logic • RAM vs. CAM schemes • Focus on RAM due to scalability • Decreasing feature sizes scale down logic delays but not wire delays • Delay grows quadratically with issue width in principle, but is effectively linear over the design space considered • Wordline and bitline delays need to be handled in the future
Wakeup Logic • CAM-based implementation is preferred • Tag drive times are quadratic functions of both window size and issue width • Tag match times are quadratic functions of issue width only • All delays are effectively linear for the design space considered • Broadcast operation delays need to be handled in the future
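To make the CAM-style wakeup concrete, here is a minimal behavioural sketch, assuming a two-operand entry format; the `IssueEntry` structure and its field names are illustrative, not taken from the papers. Each entry compares every broadcast result tag against its not-yet-ready source tags and becomes ready once both operands are available.

```python
# Behavioural sketch of CAM-style wakeup: each broadcast destination tag is
# compared against both source tags of every issue-window entry.
# Structure and field names are illustrative only.

from dataclasses import dataclass

@dataclass
class IssueEntry:
    src_tags: list            # physical register tags of the two source operands
    ready: list               # per-operand ready bits

    def is_ready(self):
        return all(self.ready)

def wakeup(window, broadcast_tags):
    """Mark operands ready when their producing tag is broadcast."""
    for entry in window:
        for i, tag in enumerate(entry.src_tags):
            if not entry.ready[i] and tag in broadcast_tags:
                entry.ready[i] = True      # comparator match sets the ready bit

# Example: one entry waiting only on tag 5, another waiting on tags 5 and 7.
window = [IssueEntry([5, 3], [False, True]),
          IssueEntry([5, 7], [False, False])]
wakeup(window, broadcast_tags={5})
print([e.is_ready() for e in window])      # [True, False]
```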
Selection Logic • Tree of arbiters • Requests from ready instructions propagate through the tree, and functional-unit grants propagate back to the issue window • Requires a selection policy (oldest first / leftmost first) • Delay is proportional to the logarithm of the window size • All delays considered are logic delays
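A minimal sketch of the selection step, assuming a simple leftmost-first policy. The sequential scan below only models the policy; the actual hardware is a log-depth tree of arbiters, which is where the logarithmic delay comes from.

```python
# Illustrative leftmost-first selection: grant at most `issue_width` ready
# entries per cycle. The real circuit is a log-depth arbiter tree; this
# sequential scan only models the selection policy, not its delay.

def select(ready_bits, issue_width):
    grants = []
    for idx, ready in enumerate(ready_bits):
        if ready:
            grants.append(idx)
            if len(grants) == issue_width:
                break
    return grants

# Entries 1, 3, and 4 request issue; only two grants are available this cycle.
print(select([False, True, False, True, True], issue_width=2))   # [1, 3]
```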
Bypass Logic • Number of bypass paths depends on pipeline depth (linearly) and issue width (quadratically) • Composed of operand muxes and buffer drivers • Delays grow quadratically with the length of the result wires, and hence with issue width • Grows in significance relative to logic delays as feature size shrinks, since wire delays do not scale down
Complexity-Effective Microarchitecture Design Premises • Retain the benefits of complex issue schemes but enable faster clocking • Design assumption: wakeup+select and data bypassing must not be pipelined, since they must remain atomic if dependent instructions are to execute in consecutive cycles
Dependence-Based Microarchitecture • Replace the issue window with FIFOs, each queue holding a chain of dependent instructions • Steer instructions to the appropriate FIFO in the rename stage using heuristics (see the sketch below) • 'SRC_FIFO' and reservation tables handle dependences and wakeup • IPC drops, but the clock rate increases enough to give a net faster implementation
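A minimal sketch of FIFO steering, using a simplified form of the heuristic from [1]: append an instruction to the FIFO whose tail produces one of its sources, otherwise use an empty FIFO, otherwise stall. The data structures and names here are illustrative only.

```python
# Simplified dependence-based steering: send an instruction to the FIFO whose
# tail instruction produces one of its source registers, so each FIFO holds a
# chain of dependent instructions. Data structures are illustrative only.

def steer(instr_srcs, instr_dest, fifos, producer_fifo):
    """
    instr_srcs    : source register names of the new instruction
    instr_dest    : destination register name
    fifos         : list of lists, each a FIFO of destination registers
    producer_fifo : map from register name -> index of the FIFO holding its producer
    Returns the chosen FIFO index, or None to signal a dispatch stall.
    """
    # 1. Prefer a FIFO whose *tail* produces one of our sources.
    for src in instr_srcs:
        f = producer_fifo.get(src)
        if f is not None and fifos[f] and fifos[f][-1] == src:
            fifos[f].append(instr_dest)
            producer_fifo[instr_dest] = f
            return f
    # 2. Otherwise use any empty FIFO.
    for f, q in enumerate(fifos):
        if not q:
            q.append(instr_dest)
            producer_fifo[instr_dest] = f
            return f
    # 3. No suitable FIFO: stall dispatch this cycle.
    return None

fifos = [[], []]
producers = {}
print(steer(["r1"], "r2", fifos, producers))   # r1 has no in-flight producer -> empty FIFO 0
print(steer(["r2"], "r3", fifos, producers))   # chains behind r2 in FIFO 0
print(fifos)                                   # [['r2', 'r3'], []]
```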
Clustering Dependence-Based Microarchitectures • Reduce bypass delays by reducing the length of bypass paths • Minimize inter-cluster communication, which otherwise incurs an extra-cycle penalty • Clustered microarchitecture types • Single window, execution-driven steering • Two windows, dispatch-driven steering – best performer • Two windows, random steering
Pipelining Dynamic Instruction Scheduling Logic • Wakeup+select was kept atomic in the previous implementation • Increase performance by pipelining it, while still allowing dependent instructions to execute in consecutive cycles • Speculate on the wakeup by predicting readiness from both parent and grandparent instructions • Integrated into the Tomasulo approach
Wakeup Logic Details • Tags are broadcast as soon as an instruction begins execution • The latency between broadcast and execution completion is encoded alongside the tag (as shown in the paper's figure) • The match bit acts as a sticky bit that enables the delay countdown • Need not always be correct, due to unexpected stalls • Select logic remains as in the previous work
Pipelining Rename Logic • A child instruction assumes its parent will broadcast its tag in the next cycle if the grandparent instructions have broadcast their tags • Speculative wakeup on receiving the grandparent tags allows selection in the next cycle • Speculative because selection of the parent for execution is not guaranteed • Requires modifications to the rename map and the dependence-analysis logic
Wakeup and Select Logic • A wakeup request is sent after examining the ready bits from the parents' and grandparents' tags • The grandparent information for a multi-cycle parent can be ignored • In addition to the speculative readiness signalled by the request line, a confirm line is activated when all parents are ready • False selections involve non-confirmed requests • Problematic only when truly ready instructions are displaced (see the sketch below)
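A minimal sketch of the request/confirm idea under simplifying assumptions (single-cycle parents, illustrative structure and field names): an entry raises a speculative request once its grandparents' tags have been seen, and a confirm only once the parents' own tags have been seen.

```python
# Sketch of speculative wakeup with request/confirm lines. An entry may
# request selection as soon as its grandparents' tags have been broadcast
# (speculating that the parents issue back-to-back); the confirm line is
# raised only once the parents' own tags have been broadcast.
# Field names are illustrative, and all producers are assumed single-cycle.

from dataclasses import dataclass, field

@dataclass
class Entry:
    parent_tags: set
    grandparent_tags: set
    seen: set = field(default_factory=set)    # tags broadcast so far

    def observe(self, broadcast_tags):
        self.seen |= set(broadcast_tags)

    def request(self):
        # Speculative readiness: all grandparents have broadcast.
        return self.grandparent_tags <= self.seen

    def confirm(self):
        # Non-speculative readiness: all parents have broadcast.
        return self.parent_tags <= self.seen

e = Entry(parent_tags={10}, grandparent_tags={4, 6})
e.observe({4, 6})
print(e.request(), e.confirm())   # True False  -> speculative request only
e.observe({10})
print(e.request(), e.confirm())   # True True   -> selection is now safe
```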
Implementation & Experimentation Details • A cycle-accurate, execution-driven simulator for the Alpha ISA is used • Baseline – conventional 2-cycle scheduling pipeline • Budget / Deluxe – speculative-wakeup scheduling configurations • Ideal – 1-cycle scheduling pipeline • Factors such as issue width and reservation-station depth are considered • Significant reduction in the critical path with minor IPC impact • Enables higher clock frequencies, deeper pipelines and larger instruction windows for better performance
Paradigm Shift • So far we've added hardware to improve performance • However, the issue window can also be improved by removing hardware
Current Situation of Issue Windows • Content-addressable memory (CAM) latency dominates instruction window latency • The load capacitance of the CAM is a major limiting factor for speed • Parasitic capacitance also wastes power • Issue logic consumes a large share of the power budget • 16% for the Pentium Pro • 18% for the Alpha 21264
Unnecessary Circuitry • Observation: reservation stations compare broadcast tags against both operands; often this is unnecessary • Only 25% to 35% of architectural instructions have two operands • Simulation of SPEC2000 programs shows that only 10% to 20% of instructions need two comparators at runtime
Simulation • Used SimpleScalar • Varied instruction window size: 16, 64, 256 entries • Load/store queue of half the window size
Removing Extra Comparators • Specialize the reservation stations • The number of comparators per station varies from two down to zero • Stall if no station with enough comparators is available • Remove further comparators by speculating on which operand arrives last ('last-tag' prediction, sketched below) • Needs a predictor • Incurs a misprediction penalty
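A minimal sketch of allocating instructions to specialized reservation stations; the station counts and the "smallest sufficient class first" policy below are illustrative assumptions, not the paper's exact configuration. An instruction that still waits on N source tags needs a station with at least N comparators, or dispatch stalls.

```python
# Sketch of allocation to specialized reservation stations. An instruction
# that still waits on N source tags needs a station with >= N comparators;
# we prefer the smallest sufficient class to keep 2-comparator stations free.
# Station counts are illustrative only.

free_stations = {0: 8, 1: 16, 2: 8}   # comparators-per-station -> free entries

def allocate(num_unready_tags):
    """Return the station class used, or None to signal a dispatch stall."""
    for comparators in sorted(free_stations):
        if comparators >= num_unready_tags and free_stations[comparators] > 0:
            free_stations[comparators] -= 1
            return comparators
    return None   # no suitable station: stall

print(allocate(0))   # ready operands only -> 0-comparator station
print(allocate(2))   # two outstanding tags -> 2-comparator station
print(allocate(1))   # one outstanding tag  -> 1-comparator station
```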
Predictor • The paper discusses a gshare predictor • It is based on a branch predictor [4] not covered in class • The idea starts by noting that two good indexes for selecting binary predictors are • The branch address • The global history • If both are good, XORing them together should produce an index embodying more information than either alone
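For reference, here is a minimal sketch of gshare-style indexing in its original branch-prediction setting: the table is indexed by XORing low-order PC bits with the global history, and each entry is a 2-bit saturating counter. The table size and counter width are illustrative assumptions.

```python
# Minimal gshare-style predictor sketch: the prediction table is indexed by
# XORing low-order PC bits with the global history register, and each entry
# is a 2-bit saturating counter. Sizes are illustrative assumptions.

class Gshare:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # 2-bit counters, weakly "not taken"
        self.history = 0                       # global history register

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # predict taken if counter >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask

# Train on an always-taken branch: once the history saturates, the same
# table entry is used every time and predictions become correct.
p = Gshare(index_bits=4)
pc = 0x400100
hits = 0
for _ in range(100):
    hits += (p.predict(pc) == True)
    p.update(pc, taken=True)
print(f"accuracy on an always-taken branch: {hits}%")
```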
Predictor II • gshare prediction accuracy for various sizes of the prediction table
Misprediction • The Alpha has a scoreboard of valid registers called RDY • Check whether all operands are available in the register read stage; if not, flush the pipeline in the same fashion as a latency misprediction • RDY must be expanded so that its number of read ports matches the issue width
IPC Losses • Two-comparator reservation stations can be exhausted, causing stalls for SPEC2000 benchmarks such as swim • Adding last-tag prediction improves swim performance but causes 1–3% losses for benchmarks such as crafty and gcc due to mispredictions
Simulation • Configurations are listed as the number of two-tag / one-tag / zero-tag reservation stations • The last-tag predictor is used only in configurations with no two-tag reservation stations
Benefits of Comparator Removal • In most cases the clock rate can be 25–45% faster, since • The tag bus no longer has to reach all reservation stations • Removing comparators removes load capacitance • Energy saved from the capacitance removal is 30–60% • Power savings do not track the energy savings directly, since the clock rate can now be increased
References • [1] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith, "Complexity-Effective Superscalar Processors" • [2] J. Stark, M. D. Brown, and Y. N. Patt, "On Pipelining Dynamic Instruction Scheduling Logic" • [3] Dan Ernst and Todd Austin, "Efficient Dynamic Scheduling Through Tag Elimination" • [4] Scott McFarling, "Combining Branch Predictors"