Reducing Issue Logic Complexity in Superscalar Microprocessors Survey Project CprE 585 – Advanced Computer Architecture David Lastine Ganesh Subramanian
Introduction • The ultimate goal of any computer architect – designing a fast machine • Approaches • Increasing the clock rate (help from VLSI) • Increasing bus width • Increasing pipeline depth • Superscalar architectures • Tradeoffs between hardware complexity and clock speed • Given a particular technology, the more complex the hardware, the lower the achievable clock rate
A New Paradigm • Retain the effective functionality of complex superscalar processors • Target the bottleneck in present-day microprocessors • Instruction scheduling is the throughput limiter • Need to handle register renaming, the issue window and the wakeup/select logic effectively • Increase the clock rate • Rethink circuit design methodologies • Modify architectural design strategies • Having the cake and eating it too? • Aim at reducing power consumption as well
Approaches to Handling Issue Logic Complexity • Performance = IPC * Clock Frequency • Pipelining the scheduling logic reduces IPC • Non-pipelined scheduling logic reduces the clock rate • Architectural solutions • Non-pipelined scheduling with dependence-queue-based issue logic – complexity effective [1] • Pipelined scheduling with speculative wakeup [2] • Generic speedup and power conservation using tag elimination [3]
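The performance equation makes the tradeoff concrete. Below is a minimal sketch of the arithmetic; the IPC and clock figures are illustrative assumptions, not numbers reported by the surveyed papers.

```python
# Illustrative comparison of the IPC-vs-clock-rate tradeoff.
# The numbers are hypothetical; the surveyed papers report their own figures.

def performance(ipc, clock_ghz):
    """Instructions per second = IPC * clock frequency."""
    return ipc * clock_ghz * 1e9

non_pipelined = performance(ipc=2.0, clock_ghz=1.0)   # atomic wakeup+select limits the clock
pipelined     = performance(ipc=1.8, clock_ghz=1.25)  # small IPC loss, faster clock

print(f"non-pipelined scheduler: {non_pipelined:.2e} instr/s")
print(f"pipelined scheduler:     {pipelined:.2e} instr/s")
# The 10% IPC loss is more than recovered here because the clock improves by 25%.
```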
Baseline Superscalar Model • The rename and wakeup/select stages of the generic superscalar pipeline model are the ones to target • Consider VLSI effects when deciding which design component to redesign
Analyzing Baseline Implementations • Physical layout of microprocessor circuits optimized for speed • Use of dynamic logic for bottleneck circuits • Manual sizing of transistors in the critical path • Logic optimizations such as two-level decomposition • Components analyzed • Register rename logic • Wakeup logic / issue window • Selection logic • Bypass logic
Register Rename Logic • RAM vs. CAM schemes • Focus on RAM due to scalability • Decreasing feature sizes scale down logic delays but not wire delays • Delay grows quadratically with issue width in principle, but is effectively linear over the design space considered • Wordline and bitline delays need to be handled in the future
Wakeup Logic • CAM-based implementation is preferred • Tag drive times are quadratic functions of both window size and issue width • Tag match times are quadratic functions of issue width only • All delays are effectively linear for the design space considered • Broadcast operation delays need to be handled in the future
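To make the CAM-style wakeup concrete, here is a minimal behavioural sketch, assuming a two-operand entry format; the `IssueEntry` structure and its field names are illustrative, not taken from the papers. Each entry compares every broadcast result tag against its not-yet-ready source tags and becomes ready once both operands are available.

```python
# Behavioural sketch of CAM-style wakeup: each broadcast destination tag is
# compared against both source tags of every issue-window entry.
# Structure and field names are illustrative only.

from dataclasses import dataclass

@dataclass
class IssueEntry:
    src_tags: list            # physical register tags of the two source operands
    ready: list               # per-operand ready bits

    def is_ready(self):
        return all(self.ready)

def wakeup(window, broadcast_tags):
    """Mark operands ready when their producing tag is broadcast."""
    for entry in window:
        for i, tag in enumerate(entry.src_tags):
            if not entry.ready[i] and tag in broadcast_tags:
                entry.ready[i] = True      # comparator match sets the ready bit

# Example: one entry waiting only on tag 5, another waiting on tags 5 and 7.
window = [IssueEntry([5, 3], [False, True]),
          IssueEntry([5, 7], [False, False])]
wakeup(window, broadcast_tags={5})
print([e.is_ready() for e in window])      # [True, False]
```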
Selection Logic • Tree of arbiters • Requests from ready instructions propagate through the tree, and functional-unit grants propagate back to the issue window • Requires a selection policy (oldest first / leftmost first) • Delay is proportional to the logarithm of the window size • All delays considered are logic delays
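A minimal sketch of the selection step, assuming a simple leftmost-first policy. The sequential scan below only models the policy; the actual hardware is a log-depth tree of arbiters, which is where the logarithmic delay comes from.

```python
# Illustrative leftmost-first selection: grant at most `issue_width` ready
# entries per cycle. The real circuit is a log-depth arbiter tree; this
# sequential scan only models the selection policy, not its delay.

def select(ready_bits, issue_width):
    grants = []
    for idx, ready in enumerate(ready_bits):
        if ready:
            grants.append(idx)
            if len(grants) == issue_width:
                break
    return grants

# Entries 1, 3, and 4 request issue; only two grants are available this cycle.
print(select([False, True, False, True, True], issue_width=2))   # [1, 3]
```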
Bypass Logic • Number of bypass paths depends on pipeline depth (linearly) and issue width (quadratically) • Composed of operand muxes and buffer drivers • Delays grow quadratically with the length of the result wires, and hence with issue width • Grows in significance relative to logic delays as feature size shrinks, since wire delays do not scale down
Complexity-Effective Microarchitecture Design Premises • Retain the benefits of complex issue schemes but enable faster clocking • Design assumption: wakeup+select and data bypassing must not be pipelined, since they must remain atomic if dependent instructions are to execute in consecutive cycles
Dependence-Based Microarchitecture • Replace the issue window with FIFOs, each queue holding a chain of dependent instructions • Steer instructions to the appropriate FIFO in the rename stage using heuristics (see the sketch below) • 'SRC_FIFO' and reservation tables handle dependences and wakeup • IPC drops, but the clock rate increases enough to give a net faster implementation
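A minimal sketch of FIFO steering, using a simplified form of the heuristic from [1]: append an instruction to the FIFO whose tail produces one of its sources, otherwise use an empty FIFO, otherwise stall. The data structures and names here are illustrative only.

```python
# Simplified dependence-based steering: send an instruction to the FIFO whose
# tail instruction produces one of its source registers, so each FIFO holds a
# chain of dependent instructions. Data structures are illustrative only.

def steer(instr_srcs, instr_dest, fifos, producer_fifo):
    """
    instr_srcs    : source register names of the new instruction
    instr_dest    : destination register name
    fifos         : list of lists, each a FIFO of destination registers
    producer_fifo : map from register name -> index of the FIFO holding its producer
    Returns the chosen FIFO index, or None to signal a dispatch stall.
    """
    # 1. Prefer a FIFO whose *tail* produces one of our sources.
    for src in instr_srcs:
        f = producer_fifo.get(src)
        if f is not None and fifos[f] and fifos[f][-1] == src:
            fifos[f].append(instr_dest)
            producer_fifo[instr_dest] = f
            return f
    # 2. Otherwise use any empty FIFO.
    for f, q in enumerate(fifos):
        if not q:
            q.append(instr_dest)
            producer_fifo[instr_dest] = f
            return f
    # 3. No suitable FIFO: stall dispatch this cycle.
    return None

fifos = [[], []]
producers = {}
print(steer(["r1"], "r2", fifos, producers))   # r1 has no in-flight producer -> empty FIFO 0
print(steer(["r2"], "r3", fifos, producers))   # chains behind r2 in FIFO 0
print(fifos)                                   # [['r2', 'r3'], []]
```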
Clustering Dependence-Based Microarchitectures • Reduce bypass delays by reducing the length of bypass paths • Minimize inter-cluster communication, which otherwise incurs an extra-cycle penalty • Clustered microarchitecture types • Single window, execution-driven steering • Two windows, dispatch-driven steering – best performer • Two windows, random steering
Pipelining Dynamic Instruction Scheduling Logic • Wakeup+select was kept atomic in the previous implementation • Increase performance by pipelining it, while still allowing dependent instructions to execute in consecutive cycles • Speculate on the wakeup by predicting readiness from both parent and grandparent instructions • Integrated into the Tomasulo approach
Wakeup Logic Details • Tags are broadcast as soon as an instruction begins execution • The latency between broadcast and execution completion is encoded alongside the tag (as shown in the paper's figure) • The match bit acts as a sticky bit that enables the delay countdown • Need not always be correct, due to unexpected stalls • Select logic remains as in the previous work
Pipelining Rename Logic • A child instruction assumes its parent will broadcast its tag in the next cycle if the grandparent instructions have broadcast their tags • Speculative wakeup on receiving the grandparent tags allows selection in the next cycle • Speculative because selection of the parent for execution is not guaranteed • Requires modifications to the rename map and the dependence-analysis logic
Wakeup and Select Logic • A wakeup request is sent after examining the ready bits from the parents' and grandparents' tags • The grandparent information for a multi-cycle parent can be ignored • In addition to the speculative readiness signalled by the request line, a confirm line is activated when all parents are ready • False selections involve non-confirmed requests • Problematic only when truly ready instructions are displaced (see the sketch below)
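A minimal sketch of the request/confirm idea under simplifying assumptions (single-cycle parents, illustrative structure and field names): an entry raises a speculative request once its grandparents' tags have been seen, and a confirm only once the parents' own tags have been seen.

```python
# Sketch of speculative wakeup with request/confirm lines. An entry may
# request selection as soon as its grandparents' tags have been broadcast
# (speculating that the parents issue back-to-back); the confirm line is
# raised only once the parents' own tags have been broadcast.
# Field names are illustrative, and all producers are assumed single-cycle.

from dataclasses import dataclass, field

@dataclass
class Entry:
    parent_tags: set
    grandparent_tags: set
    seen: set = field(default_factory=set)    # tags broadcast so far

    def observe(self, broadcast_tags):
        self.seen |= set(broadcast_tags)

    def request(self):
        # Speculative readiness: all grandparents have broadcast.
        return self.grandparent_tags <= self.seen

    def confirm(self):
        # Non-speculative readiness: all parents have broadcast.
        return self.parent_tags <= self.seen

e = Entry(parent_tags={10}, grandparent_tags={4, 6})
e.observe({4, 6})
print(e.request(), e.confirm())   # True False  -> speculative request only
e.observe({10})
print(e.request(), e.confirm())   # True True   -> selection is now safe
```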
Implementation & Experimentation Details • A cycle-accurate, execution-driven simulator for the Alpha ISA is used • Baseline – conventional 2-cycle scheduling pipeline • Budget / Deluxe – speculative-wakeup scheduling configurations • Ideal – 1-cycle scheduling pipeline • Factors such as issue width and reservation-station depth are considered • Significant reduction in the critical path with minor IPC impact • Enables higher clock frequencies, deeper pipelines and larger instruction windows for better performance
Paradigm Shift • So far we've added hardware to improve performance • However, the issue window can also be improved by removing hardware
Current Situation of Issue Windows • Content-addressable memory (CAM) latency dominates instruction window latency • The load capacitance of the CAM is a major limiting factor for speed • Parasitic capacitance also wastes power • Issue logic consumes a large share of the power budget • 16% for the Pentium Pro • 18% for the Alpha 21264
Unnecessary Circuitry • Observation: reservation stations compare broadcast tags against both operands; often this is unnecessary • Only 25% to 35% of architectural instructions have two operands • Simulation of SPEC2000 programs shows that only 10% to 20% of instructions need two comparators at runtime
Simulation • Used SimpleScalar • Varied instruction window size: 16, 64, 256 entries • Load/store queue of half the window size
Removing Extra Comparators • Specialize the reservation stations • The number of comparators per station varies from two down to zero • Stall if no station with enough comparators is available • Remove further comparators by speculating on which operand arrives last ('last-tag' prediction, sketched below) • Needs a predictor • Incurs a misprediction penalty
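A minimal sketch of allocating instructions to specialized reservation stations; the station counts and the "smallest sufficient class first" policy below are illustrative assumptions, not the paper's exact configuration. An instruction that still waits on N source tags needs a station with at least N comparators, or dispatch stalls.

```python
# Sketch of allocation to specialized reservation stations. An instruction
# that still waits on N source tags needs a station with >= N comparators;
# we prefer the smallest sufficient class to keep 2-comparator stations free.
# Station counts are illustrative only.

free_stations = {0: 8, 1: 16, 2: 8}   # comparators-per-station -> free entries

def allocate(num_unready_tags):
    """Return the station class used, or None to signal a dispatch stall."""
    for comparators in sorted(free_stations):
        if comparators >= num_unready_tags and free_stations[comparators] > 0:
            free_stations[comparators] -= 1
            return comparators
    return None   # no suitable station: stall

print(allocate(0))   # ready operands only -> 0-comparator station
print(allocate(2))   # two outstanding tags -> 2-comparator station
print(allocate(1))   # one outstanding tag  -> 1-comparator station
```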
Predictor • The paper discusses a gshare predictor • It is based on a branch predictor [4] not covered in class • The idea starts by noting that two good indexes for selecting binary predictors are • The branch address • The global history • If both are good, XORing them together should produce an index embodying more information than either alone
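For reference, here is a minimal sketch of gshare-style indexing in its original branch-prediction setting: the table is indexed by XORing low-order PC bits with the global history, and each entry is a 2-bit saturating counter. The table size and counter width are illustrative assumptions.

```python
# Minimal gshare-style predictor sketch: the prediction table is indexed by
# XORing low-order PC bits with the global history register, and each entry
# is a 2-bit saturating counter. Sizes are illustrative assumptions.

class Gshare:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # 2-bit counters, weakly "not taken"
        self.history = 0                       # global history register

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # predict taken if counter >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask

# Train on an always-taken branch: once the history saturates, the same
# table entry is used every time and predictions become correct.
p = Gshare(index_bits=4)
pc = 0x400100
hits = 0
for _ in range(100):
    hits += (p.predict(pc) == True)
    p.update(pc, taken=True)
print(f"accuracy on an always-taken branch: {hits}%")
```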
Predictor II • gshare prediction accuracy for various sizes of the prediction table
Misprediction • The Alpha has a scoreboard of valid registers called RDY • Check whether all operands are available in the register read stage; if not, flush the pipeline in the same fashion as a latency misprediction • RDY must be expanded so that its number of read ports matches the issue width
IPC Losses • Two-comparator reservation stations can be exhausted, causing stalls for SPEC2000 benchmarks such as swim • Adding last-tag prediction improves swim performance but causes 1–3% losses for benchmarks such as crafty and gcc due to mispredictions
Simulation • Configurations are listed as the number of two-tag / one-tag / zero-tag reservation stations • The last-tag predictor is used only in configurations with no two-tag reservation stations
Benefits of Comparator Removal • In most cases the clock rate can be 25–45% faster, since • The tag bus no longer has to reach all reservation stations • Removing comparators removes load capacitance • Energy saved from the capacitance removal is 30–60% • Power savings do not track the energy savings directly, since the clock rate can now be increased
References • [1] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith, "Complexity-Effective Superscalar Processors" • [2] J. Stark, M. D. Brown, and Y. N. Patt, "On Pipelining Dynamic Instruction Scheduling Logic" • [3] Dan Ernst and Todd Austin, "Efficient Dynamic Scheduling Through Tag Elimination" • [4] Scott McFarling, "Combining Branch Predictors"