Defining Wakeup Width for Efficient Dynamic Scheduling • A. Aggarwal, O. Ergin – Binghamton University • M. Franklin – University of Maryland • Presented by: Deniz Balkan
Dynamic Scheduler • Workings of a dynamic scheduler • Wake up dependent instructions • Select instructions from a pool of ready instructions • Together, these two operations form a critical path • Adding even a single cycle to this critical path degrades performance
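The wakeup and select steps above can be illustrated with a toy, cycle-by-cycle model. This is a minimal sketch under assumptions not taken from the slides: a single issue queue, one-cycle latency for every instruction, no FU types, and oldest-first selection. `Entry`, `wakeup`, `select`, and `run` are hypothetical names. Dependent instruction `b` issues in the cycle right after `a` only because wakeup and select both complete within one cycle, which is why this loop is the critical path.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(eq=False)  # identity equality, so list.remove() removes this exact entry
class Entry:
    """One issue-queue entry in a hypothetical toy model."""
    name: str
    dest: Optional[str]                          # result tag, or None (e.g. store, branch)
    srcs: Set[str] = field(default_factory=set)  # source tags not yet broadcast
    age: int = 0                                 # smaller value = older instruction


def wakeup(queue: List[Entry], broadcast: Set[str]) -> None:
    # Wakeup: each entry compares its outstanding source tags with the broadcast tags.
    for e in queue:
        e.srcs -= broadcast


def select(queue: List[Entry], issue_width: int) -> List[Entry]:
    # Select: grant up to issue_width ready entries, oldest first.
    ready = sorted((e for e in queue if not e.srcs), key=lambda e: e.age)
    return ready[:issue_width]


def run(queue: List[Entry], issue_width: int) -> List[List[str]]:
    """Issue everything in the queue; return the names issued in each cycle."""
    schedule: List[List[str]] = []
    broadcast: Set[str] = set()
    while queue:
        wakeup(queue, broadcast)
        chosen = select(queue, issue_width)
        if not chosen and not broadcast:
            raise RuntimeError("deadlock: an operand is never produced")
        for e in chosen:
            queue.remove(e)
        schedule.append([e.name for e in chosen])
        # Tags of this cycle's issued instructions wake up dependents next cycle.
        broadcast = {e.dest for e in chosen if e.dest is not None}
    return schedule


q = [
    Entry("a", dest="r1", age=0),
    Entry("b", dest="r2", srcs={"r1"}, age=1),
    Entry("c", dest="r3", srcs={"r1", "r2"}, age=2),
    Entry("d", dest="r4", age=3),
]
print(run(q, issue_width=2))  # [['a', 'd'], ['b'], ['c']]
```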
Implications of a Large Dynamic Scheduler • A large dynamic scheduler can potentially exploit more ILP • Larger issue queue • Larger issue width • Implications • Longer wire delays in driving register tags • Longer wire delays in driving tag comparison results • Longer select logic latency • Overall: increased scheduler latency, resulting in a lower clock speed
Contributions of this paper • Wakeup width definition: the effective number of results used for instruction wakeup each cycle • Usually equal to the issue width • Reduced wakeup width (RWW) dynamic scheduler • Issue width remains the same • Reduces instruction wakeup latency, energy consumption, and area • Less than 2% reduction in IPC
Program Behavior Study • Not all instructions produce a result • Branch and store instructions make up about 30% of instructions • The full issue width of the processor is not used in every cycle • The average number of tags generated per cycle is considerably less than the processor issue width
Tags generated in a cycle • To generate more tags per cycle, a fetch, issue, and commit width of 12 was used • Almost 50% of cycles generate either 0 or 1 tag, even with this large issue width • About 80% of cycles generate 3 or fewer tags
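The per-cycle tag statistics above amount to a histogram over an issue trace. A minimal sketch, assuming a hypothetical trace format in which each cycle lists the issued instructions and whether each one writes a register; the toy trace is made up for illustration and is not the paper's data.

```python
from collections import Counter
from typing import List


def tag_histogram(trace: List[List[bool]]) -> Counter:
    """Hypothetical trace format: trace[c] lists the instructions issued in
    cycle c; each element is True if that instruction writes a register
    (and so generates a wakeup tag)."""
    return Counter(sum(cycle) for cycle in trace)


def fraction_at_most(hist: Counter, k: int) -> float:
    """Fraction of cycles that generated k or fewer tags."""
    total = sum(hist.values())
    return sum(n for tags, n in hist.items() if tags <= k) / total


toy = [[True, False], [], [True, True, True], [False], [True]]  # made-up trace
h = tag_histogram(toy)
print(sorted(h.items()))       # [(0, 2), (1, 2), (3, 1)]
print(fraction_at_most(h, 1))  # 0.8
```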
Useful tags • Not all generated tags are immediately useful • Branch mispredictions cause tags to be generated along the wrong path • Some tags are not immediately required: their dependent instructions are not yet in the issue queue, or are still waiting for other operands • The average number of useful tags per cycle is therefore even lower than the average number of tags generated per cycle
Useful tags • Only about 50-60% of instructions produce a tag that is immediately required
Reduced Wakeup Width Dynamic Scheduler • Wakeup width reduced while keeping the full issue width • Some tags may have to wait before waking up their dependent instructions • Performance impact is not expected to be high • Cycles with fewer tags soon follow, and waiting tags can use the spare wakeup slots • Delaying tags that are not immediately useful may have no performance impact
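The argument that waiting tags can use spare wakeup slots in later, quieter cycles can be checked with a small model. This sketch assumes, unlike the paired tag-latch hardware described in the implementation slides, a single shared FIFO of waiting tags drained oldest first; `broadcast_schedule` is a hypothetical name. On the bursty toy input, halving the wakeup width delays only one tag by one cycle, because the burst is followed by an empty cycle.

```python
from collections import deque
from typing import Deque, List, Tuple


def broadcast_schedule(tags_per_cycle: List[List[str]],
                       wakeup_width: int) -> Tuple[List[List[str]], int]:
    """Broadcast at most `wakeup_width` tags per cycle; extra tags wait.

    Returns the tags broadcast in each cycle and the total cycles of tag delay.
    Simplification (assumption): one shared FIFO of waiting tags, oldest first.
    """
    if wakeup_width < 1:
        raise ValueError("wakeup_width must be at least 1")
    waiting: Deque[Tuple[str, int]] = deque()  # (tag, cycle it was generated)
    sent_per_cycle: List[List[str]] = []
    total_delay = 0
    cycle = 0
    while cycle < len(tags_per_cycle) or waiting:
        if cycle < len(tags_per_cycle):
            waiting.extend((tag, cycle) for tag in tags_per_cycle[cycle])
        sent = []
        for _ in range(min(wakeup_width, len(waiting))):
            tag, born = waiting.popleft()
            total_delay += cycle - born
            sent.append(tag)
        sent_per_cycle.append(sent)
        cycle += 1
    return sent_per_cycle, total_delay


burst = [["t0", "t1", "t2"], [], ["t3"]]
print(broadcast_schedule(burst, wakeup_width=3))  # ([['t0', 't1', 't2'], [], ['t3']], 0)
print(broadcast_schedule(burst, wakeup_width=2))  # ([['t0', 't1'], ['t2'], ['t3']], 1)
```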
Hardware Implementation – Conventional DS • Select logic decides which instruction executes on which FU • Register tags of issued instructions are placed in tag latches • Enable signals are controlled to enable the drivers that drive the tags across the instruction window
Hardware Implementation – RWW DS • Wakeup width reduced to half the issue width • Two tag latches/FUs share common tag lines • If both tag latches hold tags, only one of them is driven; the other remains in its tag latch • To prevent overwriting, a 1-bit indicator latch is used to control the selection process
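The shared tag-line arrangement above can be modeled for one pair of FUs. In this sketch, `FUGroup` and its methods are hypothetical names, and the priority rule is an assumption since the slides do not say which latch is driven first: each latch stores its tag with an arrival order, the indicator bit is simply "latch occupied", and the older waiting tag is driven first so that neither latch can be starved.

```python
import itertools
from typing import List, Optional, Tuple


class FUGroup:
    """Two FUs whose tag latches share one set of tag lines (hypothetical model)."""

    def __init__(self) -> None:
        # Each latch holds (arrival sequence number, tag), or None when empty.
        self.latch: List[Optional[Tuple[int, str]]] = [None, None]
        self._seq = itertools.count()

    def indicator(self, fu: int) -> bool:
        # 1-bit indicator: True while this FU's latch still holds an undriven tag.
        return self.latch[fu] is not None

    def write(self, fu: int, tag: str) -> None:
        # Selection logic must not issue a tag producer here while the bit is set.
        if self.indicator(fu):
            raise RuntimeError(f"FU{fu}: would overwrite a waiting tag")
        self.latch[fu] = (next(self._seq), tag)

    def drive(self) -> Optional[str]:
        # Only one tag fits on the shared tag lines per cycle; the other stays latched.
        full = [i for i in (0, 1) if self.latch[i] is not None]
        if not full:
            return None
        i = min(full, key=lambda j: self.latch[j][0])  # assumption: older tag wins
        tag = self.latch[i][1]
        self.latch[i] = None
        return tag


g = FUGroup()
g.write(0, "p5")
g.write(1, "p9")
print(g.drive(), g.indicator(1))        # p5 True
g.write(0, "p7")
print(g.drive(), g.drive(), g.drive())  # p9 p7 None
```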
FU arbiter • Decides which instruction executes on the FU • Conventional arbiter, giving priority to the oldest instruction (req0): Grant1 = NOT req0 AND req1 AND enable • Arbiter with the RWW dynamic scheduler, where “a” is the value of the indicator latch for the arbiter: Grant1 = NOT req0 AND NOT a AND req1 AND enable
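The grant logic generalizes to a fixed-priority arbiter over any number of requests. This is a sketch assuming `reqs[0]` is the oldest request and that a set indicator latch `a` blocks every grant on this FU, as in the basic RWW scheme where issue slots go unused on FUs with waiting tags. For two requests, the second output is Grant1 = NOT req0 AND NOT a AND req1 AND enable; `arbiter` is a hypothetical name.

```python
from typing import List


def arbiter(reqs: List[bool], enable: bool, a: bool = False) -> List[bool]:
    """Fixed-priority arbiter for one FU; reqs[0] is the oldest request.

    Assumption: `a` models the RWW indicator latch; while it is set,
    nothing is granted, so the waiting tag in this FU's latch is not overwritten.
    """
    grants: List[bool] = []
    older_req = False  # has any older entry requested?
    for req in reqs:
        grants.append(req and not older_req and enable and not a)
        older_req = older_req or req
    return grants


print(arbiter([True, True], enable=True))           # [True, False]
print(arbiter([False, True], enable=True))          # [False, True]
print(arbiter([False, True], enable=True, a=True))  # [False, False]
```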
Experimental Setup • Simulator based on SimpleScalar used to collect performance statistics • Delay, energy, and area estimated from actual VLSI layouts using SPICE, in a 0.18-micron, 6-metal-layer CMOS process (TSMC) • Dynamic scheduler size: 128-entry issue queue, 6-way issue width
Performance Results • Compared to the I6W6 (issue width 6, wakeup width 6) configuration • I6W3 has 15% lower wakeup logic latency • IPC impact is about 5% for I6W3 • Higher for high-IPC FP benchmarks • I6W3 performs significantly better than I3W3, which has the same wakeup logic latency as I6W3
IPC of FP benchmarks with RWW • Reasons for the IPC impact • Instructions delayed by waiting tags • Issue slots wasted because of waiting tags
Reasons for the IPC impact • Delayed register tags have more impact than issue slot wastage • As the wakeup width is reduced, the impact of delayed register tags increases dramatically
Area and Energy Results • Activation statistics obtained through simulation, and energy consumption values taken from our detailed layouts • I6W3 reduces wakeup logic energy consumption by 10% • Area of the CAM cells (the tag part of the instruction window) is reduced by about 30% for I6W3
Reduced Issue Slots Wastage (RWIS) • Issue slots are wasted because no instructions are issued to FUs that already hold waiting tags • Instructions classified into • Tag-producing instructions • Non-tag-producing instructions • Non-tag-producing instructions can still be issued to FUs with waiting tags without overwriting the tag value • A type bit included with each instruction controls issue
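The RWIS issue rule can be expressed as a per-FU check on the indicator bit and the instruction's type bit. A minimal sketch, assuming identical FUs and a greedy, oldest-first assignment (neither is specified on the slide); `can_issue` and `assign` are hypothetical names. With RWIS off, the FU holding a waiting tag stays idle; with it on, the store fills that slot.

```python
from typing import Dict, List, Tuple


def can_issue(indicator_set: bool, produces_tag: bool, rwis: bool = True) -> bool:
    # Free tag latch: any instruction may issue to this FU.
    if not indicator_set:
        return True
    # Waiting tag: with RWIS, only a non-tag-producing instruction may issue,
    # since it never writes (and so never overwrites) the tag latch.
    return rwis and not produces_tag


def assign(ready: List[Tuple[str, bool]], indicators: List[bool],
           rwis: bool = True) -> Dict[int, str]:
    """ready: (name, produces_tag) pairs, oldest first.
    indicators: per-FU indicator bits (True = tag waiting in the latch).
    Assumption: identical FUs, each taking the oldest eligible instruction.
    Returns {fu_index: instruction name}."""
    pending = list(ready)
    issued: Dict[int, str] = {}
    for fu, ind in enumerate(indicators):
        for k, (name, produces_tag) in enumerate(pending):
            if can_issue(ind, produces_tag, rwis):
                issued[fu] = name
                del pending[k]
                break
    return issued


ready = [("add", True), ("st", False), ("br", False)]
print(assign(ready, [True, False], rwis=False))  # {1: 'add'}
print(assign(ready, [True, False], rwis=True))   # {0: 'st', 1: 'add'}
```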
Reduced Tag Delays (RTD) • Register tags are delayed when multiple tag-producing instructions are issued to FUs sharing the same tag lines (an FU-group) • RTD limits the number of tag-producing instructions issued to an FU-group • The waiting tags from the previous cycle are used to set this limit • Non-tag-producing instructions can still be issued to FUs whose indicator bits are set
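The RTD rule can be sketched as a per-group budget on tag-producing instructions. The slides do not give the exact formula, so this is one plausible reading, labeled as an assumption: the group drives `tag_lines` tags per cycle, the previous cycle's waiting tags are read from the indicator bits, and the budget keeps the number of tags left waiting at or below `max_waiting` (the k in RTD-k). Applying k per FU-group rather than across the whole scheduler is also an assumption; `rtd_issue` is a hypothetical name.

```python
from typing import List, Tuple


def rtd_issue(ready: List[Tuple[str, bool]], indicators: List[bool],
              tag_lines: int, max_waiting: int) -> List[Tuple[int, str]]:
    """Issue to one FU-group under a hypothetical RTD-k rule (k = max_waiting).

    Assumption: k is applied per FU-group.
    ready: (name, produces_tag) pairs, oldest first.
    indicators: per-FU indicator bits from the previous cycle (True = tag waiting).
    tag_lines: number of tags the group can drive per cycle.
    """
    waiting_prev = sum(indicators)
    # Tags left waiting after this cycle = max(0, waiting_prev + new - tag_lines);
    # cap the new tag producers so that this never exceeds max_waiting.
    budget = max(0, tag_lines + max_waiting - waiting_prev)
    pending = list(ready)
    issued: List[Tuple[int, str]] = []
    for fu, ind in enumerate(indicators):
        for k, (name, produces_tag) in enumerate(pending):
            if produces_tag and (ind or budget == 0):
                continue  # would overwrite a latch or exceed the RTD budget
            if produces_tag:
                budget -= 1
            issued.append((fu, name))
            del pending[k]
            break
    return issued


ready = [("mul", True), ("add", True), ("st", False)]
print(rtd_issue(ready, [False, False], tag_lines=1, max_waiting=0))  # [(0, 'mul'), (1, 'st')]
print(rtd_issue(ready, [False, False], tag_lines=1, max_waiting=1))  # [(0, 'mul'), (1, 'add')]
```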
Enhanced Performance • RTD-1 (with a maximum of 1 waiting tag) is the most effective • RWIS reduces issue slot wastage; RTD also reduces waiting register tags • RTD-2 delays more instructions than RTD-1 because of waiting register tags
Conclusions • Larger dynamic schedulers can exploit more ILP, thus increasing performance • However, a larger dynamic scheduler has a longer scheduler latency • The reduced wakeup width (RWW) dynamic scheduler exploits the observation that the number of useful tags generated per cycle is significantly smaller than the issue width • Significant reduction in wakeup logic latency and in dynamic scheduler area and energy consumption, with minimal IPC impact