10 likes | 104 Views
RAT- ARF. RAT- PRF. ROB- data. ROB- tag. ARF. PRF. PL. RS. BP. BACKGROUND & MOTIVATION. DESIGN SPACE. METHODOLOGY. MACHINE TRADEOFFS. DETAILS OF DESIGN. CONCLUSION. Delay. 2.03. 2.07. 1.93. 2.12. 3.21. 2.25. 2.01. 1.98. 2.46. Timing and Power
E N D
RAT- ARF RAT- PRF ROB- data ROB- tag ARF PRF PL RS BP BACKGROUND & MOTIVATION DESIGN SPACE METHODOLOGY MACHINE TRADEOFFS DETAILS OF DESIGN CONCLUSION Delay 2.03 2.07 1.93 2.12 3.21 2.25 2.01 1.98 2.46 • Timing and Power • Implemented in Verilog, synthesized using Design Compiler, placed&routed using Astro • Used LSI Logic gflxp 0.11 micron CMOS standard cell library • Microarchitectural • Simplescalar/Alpha 3.0, speculative scheduling/squashing replay, load-store reordering • Used SMARTS sampling with reference input sets • 4-wide, 128 ROB, 32 ARF / 96 PRF, 32 LQ, 24 SQ, 32 sched, 64KB DL1, 16KB IL1, 2MB L2 • RAT • Provided renaming to resolve RAW and to eliminate WAR and WAW • Contained destination mapping, ready bits, and retire bits (only for ARF) • Built using RAM and comparators • ROB • ROB-tag used to store information needed while instructions are in the window • ROB-data used to store speculative results before being copied to ARF • Register File • RAM-based structure to store execution results • Payload RAM • Used to store input operands before instructions are executed • Consisted of CAM structure for tags amd RAM structure for the data • Reservation Stations • Consisted of CAM and select logic • Bypass Networks • Two-level bypass network to bypass data for back-to-back execution • Consisted of comparators and muxes • NoPL -- added one pipeline stage between schedule and execute • Increased load misscheduling penalty and branch resolution loop • Decreased IPC • ARF – harder to do out-of-order branch misprediction resolution • On PRF, easily done by checkpointing RAT: MIPS R10K, Alpha 21264, Power4,5 • On ARF, RAT could point to ROB or ARF, a potential of stale pointer on simple checkpoint • ARF machines used in-order branch resllution: Intel P6, AMD K8, Intel Core • Microprocessors can be categorized into ARF- or PRF, with or without PL • Energy impact • PRF+noPL consumes the least amount of energy • On average, 20% less energy than ARF+PL-style machine in the affected structures • Roughly 6-7%saving of total chip energy • IPC impact • Out-of-order branch resolution adds 3% of performace on average • Operand read between issue and execute (PL) decreases IPC by 1-2% • PRF machine also simplify the implementation out-of-order branch resolution • Improved performance with comparable energy • Power as design constraint • More transistor on chip -- more power dissipated to heat • Challenges in cooling and packaging design • Microprocessor technology shifts from performance-focused to power-effective design • Register renaming is used to support operand delivery in OoO machines • Widely used design: ARF and PRF • PRF used by MIPS R10K, Alpha 21264, IBM Power4, IBM Power 5, Intel Pentium 4 • ARF used by PowerPC 604, Intel P6 family, Intel Core family, AMD K8 family • Industry trend • ARF for mobile and power-aware server; PRF for power hungry design • No quantitative analysis has been published • Based on where the speculative results are stored – ARF and PRF • ARF • Kept non-speculative values in a small ARF and speculative values in the ROB • Wrote results to the ROB at writeback stage • Copied results to the ARF as instructions retire • PRF • Kept both speculative and non-speculative in PRF • Wrote results only once (to the PRF) at writeback stage • Updated only rename table (RAT) as instructions retire • Based on where the operands read occur – Payload RAM and no Payload RAM • Payload RAM (PL) • Read operand values at dispatch stage before inserted into the reservation stations • Copied ready operands to payload RAM and got unready operands from bypass logic • No Payload RAM (noPL) • Only checked operands ready status at dispatch stage • Read operands values from necessary structures (PRF, ARF+ROB)) upon issue • Both classifications are orthogonal to each other Area 1.53 0.29 0.38 0.09 1.05 0.06 0.98 0.62 0.24 Rename N/A N/A N/A N/A 2.94 2.69 N/A N/A N/A Queue N/A 0.74 N/A N/A N/A 1.32 0.71 N/A N/A Read 1.21 2.52 2.83 1.42 4.58 2.83 1.68 N/A N/A WB 2.02 N/A 2.02 2.46 0.82 2.16 N/A 0.80 2.81 Retire N/A N/A N/A 2.76 N/A 2.83 0.28 1.56 0.82 Power-Aware Operand Delivery Erika Gunadi and Mikko H. Lipasti University of Wisconsin - Madison ARF+PL ARF+noPL PRF+PL PRF+PL Rename wr RAT re RAT wr RAT re RAT wr RAT re RAT wr RAT re RAT Read re ARF re ROB re PRF Queue wr RS, wr PL wr ROB wr RS, wr PL wr ROB wr RS, wr PL wr ROB wr RS, wr PL wr ROB Sched wr RS wr RS wr RS wr RS Issue re PL re PL re PL re PL Read re ARF re ROB re PRF Exe execute execute execute execute WB wr ROB, wr RAT wr PL wr ROB wr RAT wr PRF, wr RAT wr PL wr PRF wr RAT Retire re ROB, wr ARF wr RAT re ROB, wr ARF wr RAT re ROB re ROB ARF PRF Payload RAM (PL) Intel P6, Intel Core, Intel Pentium M, AMD K8 (Athlon, Opteron) None No Payload RAM (noPL) None Intel Pentium 4, MIPS R10K, Alpha 21264, IBM Power4, Power 5 ARF ROB PRF PLRAM ARF ROB PLRAM PRF ALU ALU ALU ALU ARF with Payload RAM ARF with no Payload RAM PRF with Payload RAM PRF with no Payload RAM