180 likes | 265 Views
Schedulers and Instruction Windows for High Performance Superscalars. Kshitiz Malik Viral Mehta. Instruction Dispatch: Rename, and send to Scheduler Instruction Issue: Send from Scheduler to Execution Unit
E N D
Schedulers and Instruction Windows for High Performance Superscalars Kshitiz Malik Viral Mehta
Instruction Dispatch: Rename, and send to Scheduler Instruction Issue: Send from Scheduler to Execution Unit Instruction window: All instructions renamed but not retired (the ROB, for most machines) Scheduler: All instructions renamed but not issued Scheduler Ex Ex Ex ROB ROB HEAD Terminology… Renamer Later, in Prog. Order What's in a name? That which we call a RAT, by any other word would smell just as bad. -with apologies to a certain William Shakespeare
ROB Cache miss Before miss Dep on cache miss Indep instrn Completed Instr Latency Hiding in OOO Processors Miss Discovered • An OOO processor can do independent work in the shadow of a cache miss • The longer the miss penalty, the larger the scheduler and instruction window need to be Later, in Prog. Order ROB
The need for large windows • ROB: • For ‘no machine stall on cache miss’, large ROB needed. • 500 cycle miss penalty, 4 wide frontend: 2000 entry ROB required! • Scheduler: • Large scheduler needed to accommodate miss-dependent instructions • Percentage of miss-dependent instructions not very high • Even with only 1/10 miss-dependent instructions, need 200 entry scheduler (less than 32 on current machines) ROB Miss about to be serviced “Scrub (your windows) off every once in awhile, or the light won't come in” Alan Alda
Performance: Large ROB • Pentium-4 like machine • iw = instruction window = ROB • Mem Latency ~ 400 cycles • All other physical resources ‘large enough’ Take home point: Larger ROB’s, all the way to 2k entries improve performance Figure Credit: Akkary et. al, CheckpointProcessing and Recovery: Towards Scalable Large Instruction Window Processors, Micro 2003
Performance: Large Scheduler • Pentium-4 like machine • iw = instruction window • sw = scheduling window • Mem latency ~ 400 cycles • All other physical resources ‘large enough’ Take home point: 64 entry scheduler NOT good, 256 entry scheduler good enough Figure Credit: Akkary et. al, CheckpointProcessing and Recovery: Towards Scalable Large Instruction Window Processors, Micro 2003
Implementation Issues: Large Windows Schedulers • Target: 256 entry • Large Schedulers very hard to build • Wakeup-Select needs long wires and comparators in each scheduler slot • Scheduler on critical path • Cannot be pipelined • State-of-the-art: Pentium-4 has 2 small, separate schedulers, 12-14 entries each ROB • Target: 2k entry ROB • Large ROBs need Large Register Files • # Phys Regs ≈ (num ROB entries) + (num archRegs) • Register File: Size and Access Time are inversely proportional • Bulk retirement problem • Eg: 2000 entry ROB, full after cache miss. Need to retire instructions in bulk (another cache miss) • State-of-the-art: 126 entry ROB in Pentium-4
Smart Solutions: Large ‘effective’ windows • Cycle time limitations imply large schedulers and large ROBs will be hard to build • Solution: build ‘effectively’ large windows: • Identify the reasons why large structures improve performance • Emulate the performance-enhancing behaviour of large windows using smarter, smaller windows • Smarter Scheduler: WIB (Waiting Instruction Buffer) • Smarter ROB: CPR (Checkpoint Processing and Repair) • Smarter ROB and Scheduler: • Runahead Execution • CFP (Continual Flow Pipelines) “You can have your cake and eat it too!”
Smarter ROB: CPR • “Checkpoint Processing and Recovery” • Replace the ROB entirely by checkpointing, make checkpoints on low-confidence branches • Replace ROB-based physical register reclamation method by reference counting: • On rename, phys_reg_count ++ • On execute, phys_reg_count -- • 8 checkpoints found optimal. Checkpoints also reclaimed using reference counting. • 8 checkpoints show performance better than 320 entry ROB • Still needs a large scheduler
CPR Performance Figure Credit: Akkary et. al, CheckpointProcessing and Recovery: Towards Scalable Large Instruction Window Processors, Micro 2003
Smarter Scheduler: WIB • WIB = Waiting Instruction Buffer • Miss-dependent instructions removed from scheduler and put in a separate queue, called WIB (process invoked on L2 miss) • WIB can be very large, since its only a queue (no wakeup-select, FIFO structure) • When cache miss is serviced, WIB is emptied into the scheduler • Supports removing instructions dependent on multiple cache misses • Still needs a large ROB, and a large Register File
WIB Performance Figure Credit: Lebeck et. al, A Large, Fast Instruction Window for Tolerating Cache Misses, ISCA 2001
Runahead Execution: Improving MLP • MLP = memory level parallelism: how many cache misses are being serviced concurrently (when at least one miss is being serviced) • Contend that in the shadow of a cache miss: • Executing instructions is NOT important • Discovering more cache misses IS important • Create checkpoint on L2 cache miss, go into ‘runahead mode’: • Execute miss-independent instructions only (INV bit in regsiter file to track dependencies) • Miss-dependent instructions pseudo-executed and retired immediately, so no ROB block • Miss-independent instructions execute and trigger cache misses. • When miss is serviced, processor state restored from checkpoint, all instructions re-executed • Execution results in runahead-mode are thrown away
Runahead Execution: Improving MLP (II) • Impressive performance improvement • Very few hardware changes to the pipeline Figure Credit:Mutlu et. al, Runahead Execution: An Effective Alternative to Large Instruction Windows, Micro 2003
Complete Solution: CFP • CFP = Continual Flow Pipelines • Combines ideas from WIB and CPR: • Use checkpointing, so no ROB • Remove miss-dependent instructions from scheduler, store in separate queue • Save physical registers: miss-dependent instructions free their destination registers when put in waiting queue • Complete Solution: miss-independent instructions are actually executed, and cache misses are also discovered • Performance improvement very attractive, in keeping with completeness of solution. • Needs radical changes to current superscalar pipelines
CFP Performance Figure Credit:Srinivasan et. al, Continual Flow Pipelines, ASPLOS 2004
Conclusion • Need large schedulers and ROBs to tolerate cache misses • Physically building large structures not possible, therefore, need to build smarter windows • Large ROBs possible using checkpointing • Large schedulers possible by removing miss-dependent instrs. from scheduler • A combination of above two solutions shows excellent performance • MLP school claims actually executing miss-independent instructions not important, only discovering cache misses is important. Therefore, Runahead execution makes sense.
References • [1] H. Akkary, R. Rajwar, S. Srinivasan, CheckpointProcessing and Recovery: Towards Scalable Large Instruction Window Processors, Proceedings of the 36th annual IEEE International Symposium on Microarchitecture, 2003 • [2] S. Srinivasan, R. Rajwar, H. Akkary, M. Upton, Continual Flow Pipelines, Proceedings of the 11th International Conference on Architectural support for programming languages and operating systems, 2004 • [3] O. Mutlu, J. Stark, C. Wilkerson, Y. Patt, Runahead Execution: An Effective Alternative to Large Instruction Windows, Proceedings of the 36th annual IEEE International Symposium on Microarchitecture, 2003 • [4] R. Balasubramoniun, S Dwarkadas, D. Albonesi, Dynamically Allocating Processor Resources between Nearby and Distant ILP, Proceedings of the 28th International Symposium on Computer Architecture, 2001 • [5] O. Mutlu, H. Kim, J. Stark, Y. N. Patt. On Reusing the Results of Pre-Executed Instructions in a Runahead Execution Processor, Computer Architecture Letters, Volume 4, Jan. 2005 • [6] J. Dundas and T. Mudge, Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss, 11th ACM International Conference on Supercomputing, 1997 • [7] A.Lebeck, J. Koppanalil, T. Li J. Patwardhan, E. Rotenberg A Large, Fast Instruction Window for Tolerating Cache Misses, Proceedings of the 28th International Symposium on Computer Architecture, 2001 • [8] Y. Chou, B. Fahs, S. Abraham, Microarchitecture optimizations for exploiting Memory-level parallelism, Proceedings of the 31st International Symposium on Computer Architecture, 2004 • [9] S. Palacharla, N. Jouppi and J. Smith, Complexity-Effective Superscalar Processors, Proceedings of the 24th International Symposium on Computer Architecture, 1997 • [10] Edward Brekelbaum, Jeff Rupley II, Chris Wilkerson, Bryan Black, Hierarchical Scheduling Windows, Proceedings of the 35th annual IEEE International Symposium on Microarchitecture, 2002 • [11] Dan Ernst, Andrew Hamel, Todd Austin, Cyclone: A Broadcast-Free Dynamic Instruction Scheduler with Selective Replay, Proceedings of the 30th International Symposium on Computer Architecture, 2003 • [12] T. Karkhanis and J. Smith, "A Day in the Life of a Cache Miss", Proceeding of the 2nd Annual Workshop on Memory Performance Issues (WMPI 2002), 2002 • [13] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. • [14] W. W. Hwu and Y. N. Patt, Checkpoint repair for out-of-order execution machines, Proceedings of the 14th International Symposium on Computer Architecture, 1987 • [15] A. Sodani and G. Sohi, Dynamic Instruction Reuse, Proceedings of the International Symposium on Computer Architecture, 1997