“Flea-flicker” Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense
Ronald Barnes, George Mason University
Shane Ryoo and Wen-mei Hwu, University of Illinois Urbana-Champaign
Dr. Ronald D. Barnes Department of Electrical and Computer Engineering

Dynamic scheduling approach:
• Tolerating memory latency and finding ILP at runtime comes at heavy cost
• Aggressive out-of-order execution is incompatible with overriding power and power-density concerns
  • Alpha 21264: 18% of chip power, as much as integer + FP execution
  • POWER4: 10% of core power, scheduler has the highest power density
• Power concerns are steering development towards efficiency rather than a wide instruction window (Pentium M)
In-order approach:
• Rely on compiler-planned execution
• Compiler techniques (e.g. prefetching) do not solve the problem of unanticipated memory latency

Compiler Expressed Parallelism
• Compiler can find a significant number of instructions for parallel execution on a 6-issue processor
• Dynamic stalls (of which cache misses are the most important [Sias04]) drastically reduce observed performance
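To make the cost of a dynamic stall concrete, here is a minimal sketch of a stall-on-use in-order pipeline. It is simplified to single issue (the slides assume a 6-issue machine), and the latencies (2 cycles for a cache hit, 200 for a miss), register names, and the function `inorder_finish_cycle` are illustrative assumptions, not figures from the talk.

```python
def inorder_finish_cycle(program):
    """Return the cycle at which the last result becomes available.

    program: list of (dest, srcs, latency) tuples in program order.
    Assumed model: one instruction issues per cycle, strictly in order,
    and an instruction stalls until all of its sources are ready.
    """
    ready = {}   # register -> cycle at which its value is available
    cycle = 0    # earliest cycle at which the next instruction may issue
    for dest, srcs, latency in program:
        # Stall-on-use: wait for the slowest source operand.
        issue = max([cycle] + [ready.get(s, 0) for s in srcs])
        ready[dest] = issue + latency
        cycle = issue + 1  # younger instructions cannot pass this one
    return max(ready.values()) if ready else 0


def example(load_latency):
    return [
        ("r1", ("r0",), load_latency),  # load
        ("r2", ("r1",), 1),             # consumer of the load
        ("r3", ("r4", "r5"), 1),        # independent of the load
    ]

hit_cycles = inorder_finish_cycle(example(2))      # 4
miss_cycles = inorder_finish_cycle(example(200))   # 202
```

In this toy model the independent third instruction is held up behind the stalled consumer, so a single miss stretches the schedule from 4 cycles to 202 even though useful work was available.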

In-order runahead performance

Benefits of multipass approach

Key Multipass Contributions
• Advance restart allows processing of newly woken instructions
  • Initial implementation relies on compiler-controlled restart
  • No expensive, fine-grain wakeup mechanism is needed
• Re-use makes results of independent instructions persistent
  • Improves efficiency (no re-computation)
  • Hides long-latency operations
• Instruction regrouping allows schedule-height reduction without reordering instructions
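The re-use and regrouping ideas above can be sketched with a toy model. This is only an illustration under stated assumptions, not the mechanism described in the paper: instructions are `(dest, srcs)` pairs, a cache miss marks its destination register invalid, the helper names `advance_pass` and `issue_groups` are hypothetical, and only true (read-after-write) dependences within a group are modeled.

```python
ISSUE_WIDTH = 6  # the slides assume a 6-issue processor


def advance_pass(program, missed_loads):
    """Split instruction indices into (reusable, deferred).

    program: list of (dest, srcs) in program order.
    missed_loads: set of indices of loads assumed to miss in the cache.
    Instructions that depend on a miss are deferred; the rest execute
    and their results are kept for later reuse.
    """
    invalid = set()  # registers that do not hold a valid value yet
    reusable, deferred = [], []
    for i, (dest, srcs) in enumerate(program):
        if i in missed_loads or any(s in invalid for s in srcs):
            invalid.add(dest)       # result unknown: mark destination invalid
            deferred.append(i)
        else:
            invalid.discard(dest)   # a valid write replaces an invalid value
            reusable.append(i)
    return reusable, deferred


def issue_groups(program, done=()):
    """Pack instructions into issue groups without reordering them.

    A new group starts when the current one is full or when an
    instruction reads a register produced earlier in the same group.
    Producers listed in `done` already have results, so they do not
    force a group break.
    """
    done = set(done)
    groups, current, written = [], [], set()
    for i, (dest, srcs) in enumerate(program):
        depends = any(s in written for s in srcs)
        if current and (len(current) == ISSUE_WIDTH or depends):
            groups.append(current)
            current, written = [], set()
        current.append(i)
        if i in done:
            written.discard(dest)   # value already available
        else:
            written.add(dest)
    if current:
        groups.append(current)
    return groups


program = [
    ("r1", ("r0",)),        # 0: load that misses
    ("r2", ("r1",)),        # 1: depends on the miss
    ("r3", ("r4",)),        # 2: independent
    ("r5", ("r3",)),        # 3: independent chain
    ("r6", ("r2", "r5")),   # 4: depends on the miss
]
reusable, deferred = advance_pass(program, {0})
original = issue_groups(program)                # [[0], [1, 2], [3], [4]]
regrouped = issue_groups(program, reusable)     # [[0], [1, 2, 3], [4]]
```

In this example instructions 2 and 3 keep their results from the advance pass, so only 0, 1 and 4 remain to be completed, and the schedule shrinks from four issue groups to three while program order is preserved. A runahead scheme that discards pre-execution results would have to recompute all five.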

Implementation cost of Multipass
• Speculative memory state is discussed in the paper

Experimental configuration
• Benchmarks compiled with the IMPACT C compiler using control-flow profiling and interprocedural alias analysis
• Simulator augmented with power models of array structures

Comparison with Out-of-Order Execution

Overheads of Out-of-Order Execution
• Register renaming hardware to overcome output and anti-dependencies
• Complex scheduling table to issue instructions as dependencies are met
• Increase in pipeline length
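As a reminder of what the renaming hardware does, the following is a minimal software sketch of register renaming. The `rename` function, the `p<N>` physical register names, and the unlimited supply of physical registers are assumptions for illustration; real hardware uses a finite free list and reclaims registers at retirement.

```python
def rename(program, arch_regs):
    """Rename destinations so every write gets a fresh physical register.

    program: list of (dest, srcs) using architectural register names.
    arch_regs: all architectural registers the program may touch.
    Returns the program rewritten over physical registers.
    """
    # Initial mapping: one physical register per architectural register.
    mapping = {r: f"p{i}" for i, r in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for dest, srcs in program:
        # Read the current mappings before updating the destination,
        # so an instruction like r1 = r1 + 1 sees the old value.
        new_srcs = tuple(mapping[s] for s in srcs)
        mapping[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((mapping[dest], new_srcs))
    return renamed


program = [
    ("r1", ("r2", "r3")),  # reads r2
    ("r2", ("r4",)),       # anti-dependence (WAR) on r2 with the line above
    ("r1", ("r5",)),       # output dependence (WAW) on r1 with the first line
]
renamed = rename(program, ["r1", "r2", "r3", "r4", "r5"])
# renamed == [("p5", ("p1", "p2")), ("p6", ("p3",)), ("p7", ("p4",))]
```

After renaming, every instruction writes a distinct physical register, so the output and anti-dependencies in the original sequence no longer constrain issue order. Only true dependences remain, which is what the scheduling table then has to track.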

Power Ratio Comparison
• Sequential, in-order access gives multipass structures their advantage

Related approaches
• In-order runahead [Dundas97]; runahead to extend the out-of-order window [Mutlu03]
  • Checkpoint and repair run-ahead execution
  • All “pre-execution” results are thrown away
• Subordinate microthreads [Chappel99]; speculative precomputation [Collins01]
  • Helper threads initiate memory accesses early
• Two-pass pipelining [Barnes03]
  • In-order advance execution on a separate, tightly-coupled pipeline

Conclusions
• Multipass execution provides a microarchitecture that tolerates cache-miss latency
• Advance restart facilitates the execution of independent, newly ready instructions
  • Initial implementation uses compiler direction
• Instruction regrouping achieves significant speedup by increasing “rally” mode throughput
• Future work
  • Microarchitectural mechanism for controlling advance restart
  • Examination of tradeoffs between continuing (perhaps with prediction) and restarting advance execution
  • Partial reuse of results