Out-of-Order Commit Processors. Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC). HPCA-10, Madrid, February 14-17, 2004
Motivation I [Figure: IPC (axis 0 to 4) versus number of in-flight instructions (128 to 4096) for Spec FP 2000; legend: L2 Perfect, 100, 500, 1000; plot annotations: 3.5X and 0.30X] Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002
Motivation II – Resources – ROB • Instructions in flight (ROB=2048, Mem 500 cycles) • Often nearly full [Figure: number of in-flight instructions (SpecFP), axis 0 to 2000, with marks at 10%, 25%, 50%, 75% and 90%; values shown: 1168, 1382, 1607, 1868, 1955, 2034] A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003
Motivation III – Resources – FP Queue • State of FP Queues (ROB=2048, Mem 500 cycles) • Long/Short Lat. Inst. • Remove – Reinsert Dependence Chain [Figure: number of instructions in the FP Queue (axis 0 to 600), split into Blocked-Long, Blocked-Short and Ready, over the distribution of in-flight instructions (1, 10, 25, 50, 75, 90, 100); values shown: 1168, 1382, 1607, 1868, 1955] A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003
Outline • Motivation • Out-of-Order Commit • Multicheckpointing ROB • Slow Lane Instruction Queue • Performance Evaluation • Conclusion
Out-of-Order Commit [Figure, three animation steps. Step 1: instruction sequence Ld, I1, I2, Br 1, Ld, I3, I4, St, Br 2, I5, Br 3, I6, with labels Oldest Checkpoint, Checkpoint and New Checkpoint. Step 2: same sequence, with labels Oldest Checkpoint, Checkpoint, New Checkpoint, Gang Commit and To Memory. Step 3: sequence St, I3, I4, St, Br 2, I5, I7, Br 3, I8, with labels Oldest Checkpoint, Checkpoint, Store Buffer, Miss Branch Prediction and Recover from Checkpoint]
Out-of-Order Commit II • Checkpoint Table. Each entry has: • PC of the next instruction • Instruction Counter: counts the number of instructions still alive • Map Table: allows the register file to be recovered • Pointer to the Store Buffer • Mechanism to recover free registers • Future Free: one bit for each physical register • Large Virtual ROB: Tech. Rep. UPC-DAC-2002-39 • Ephemeral Registers: Tech. Rep UPC-DAC-2003.51
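The entry layout listed above can be pictured as a small record. This is only a sketch: the field names, the Python types and the eight-register file are assumptions made for illustration, not details given on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List

NUM_PHYS_REGS = 8  # assumption: tiny register file, for illustration only


@dataclass
class CheckpointEntry:
    """One entry of the checkpoint table (field names are hypothetical)."""
    next_pc: int                 # PC of the next instruction
    map_table: Dict[str, int]    # logical -> physical register snapshot
    store_buffer_ptr: int        # pointer into the store buffer
    instr_counter: int = 0       # instructions of this checkpoint still alive
    # Future Free vector: one bit per physical register
    future_free: List[bool] = field(
        default_factory=lambda: [False] * NUM_PHYS_REGS
    )


entry = CheckpointEntry(next_pc=0x400,
                        map_table={"R1": 3, "R2": 5},
                        store_buffer_ptr=0)
print(entry.instr_counter, sum(entry.future_free))
```

A fresh entry starts with a zero counter and an all-clear Future Free vector, which matches the "clear" steps of checkpoint creation.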
Checkpoint Creation • Save the PC • Save the Map Table • Clear the Future Free bits • Clear the Instruction Counter • Get a pointer to the first free entry of the store buffer, and mark this entry in the store buffer.
Instruction Decode • Add 1 to the Instruction Counter of the newest checkpoint • For an instruction R1 ← R2 op R3: • If R1 is mapped to PhyReg_N, set the PhyReg_N bit of the Future Free vector • Map R1 to the new physical register • Associate the instruction with the last created checkpoint
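The decode-time bookkeeping can be sketched as follows. The function name `decode`, the list-based free list and the register numbers are assumptions for illustration; the slide does not say how the free list is organized.

```python
def decode(dest, map_table, future_free, free_list, instr_counter):
    """Rename `dest` for the newest checkpoint and return its new counter.

    Hypothetical helper: all arguments model state of the newest checkpoint.
    """
    old_phys = map_table.get(dest)
    if old_phys is not None:
        # The previous mapping can be freed once this checkpoint is removed.
        future_free[old_phys] = True
    new_phys = free_list.pop(0)   # assumption: take the first free register
    map_table[dest] = new_phys    # map the destination to the new register
    return instr_counter + 1      # one more instruction alive


# R1 <- R2 op R3, with R1 currently mapped to physical register 2
map_table = {"R1": 2, "R2": 0, "R3": 1}
future_free = [False] * 6
free_list = [4, 5]
counter = decode("R1", map_table, future_free, free_list, 0)
print(counter, map_table["R1"], future_free[2], free_list)
```

After the call, R1 points at a fresh register and its old register is only marked in the Future Free vector, not yet returned to the free list.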
Instruction Writeback • Decrement the Instruction Counter of the checkpoint associated with the instruction • If the instruction is a mispredicted branch, recover from the associated checkpoint: • Fetch instructions from the saved PC • Release all entries in the store buffer from the pointed entry onward • Free all registers in the Future Free vector of the entry and of all newer checkpoint entries
Checkpoint Elimination • If the Instruction Counter of a checkpoint is 0 and it is the oldest checkpoint, then: • The checkpoint is removed • The corresponding mark in the store buffer is cleared • The registers marked in the Future Free vector are freed
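A minimal sketch of how writeback and elimination interact: instructions may finish in any order, but a checkpoint is removed only when its counter is 0 and it is the oldest. The class and function names are hypothetical, and the store buffer mark is left out to keep the example short.

```python
from collections import deque


class Checkpoint:
    """Simplified checkpoint: a counter plus a set of Future Free registers."""
    def __init__(self, name, instr_count, future_free):
        self.name = name
        self.counter = instr_count
        self.future_free = set(future_free)


def writeback(cp):
    # An instruction of this checkpoint has finished executing.
    cp.counter -= 1


def retire_checkpoints(table, free_list):
    """Remove checkpoints from the oldest end while their counter is 0."""
    removed = []
    while table and table[0].counter == 0:
        cp = table.popleft()
        free_list.extend(sorted(cp.future_free))  # free marked registers
        removed.append(cp.name)
    return removed


# Oldest checkpoint A has 2 live instructions, younger checkpoint B has 1.
table = deque([Checkpoint("A", 2, {1}), Checkpoint("B", 1, {2, 3})])
free_list = []

writeback(table[1])                            # B completes first
first = retire_checkpoints(table, free_list)   # nothing: A is still busy

writeback(table[0])
writeback(table[0])                            # A completes
second = retire_checkpoints(table, free_list)  # A, then B
print(first, second, free_list)
```

In this sketch B's registers stay allocated until A is gone, even though B finished earlier.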
Outline • Motivation • Out-of-Order Commit • Slow Lane Instruction Queue • Performance Evaluation • Conclusions
Slow Lane Instruction Queue [Figure, three animation steps showing the Pseudo ROB, the Load/Store Queue, the Instruction Queue and the Slow Lane Instruction Queue. A load (LD) and the instructions a and b on its data dependence chain are shown among other instructions (x). The last step is labelled Load End and Begin reinsert]
Slow Lane Instruction Queue II • Very simple buffer: the Slow Lane Instruction Queue (SLIQ) • Each load that misses in L2 has a pointer to an entry in the SLIQ • Pseudo ROB
Slow Lane Instruction Queue III • When an instruction is retired from the Pseudo ROB, its state is checked: • If the instruction is a load miss, the pointer is written • If the instruction depends on a long-latency instruction, it is moved to the SLIQ • When a load that misses in L2 finishes its execution: • The SLIQ is traversed starting from the instruction pointed to by the load, if that entry is older than the current traversal position • The load’s dependent instructions are reinserted into the IQ
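The two SLIQ events can be sketched for the simplest case of a single outstanding load. The dictionary layout, the function names and the `slow_regs` set are assumptions for illustration. With only one load in flight, every SLIQ entry from the load's pointer onward belongs to its dependence chain, so the sketch reinserts that whole tail and ignores the traversal-position check.

```python
def retire_from_pseudo_rob(instr, sliq, load_ptr, slow_regs):
    """Classify an instruction leaving the Pseudo ROB (hypothetical helper)."""
    if instr["l2_miss"]:
        load_ptr[instr["name"]] = len(sliq)   # write the load's SLIQ pointer
        slow_regs.add(instr["dest"])
        return "load-miss"
    if any(src in slow_regs for src in instr["srcs"]):
        sliq.append(instr)                    # depends on a long-latency value
        slow_regs.add(instr["dest"])
        return "sliq"
    return "iq"                               # independent: stays in the IQ


def load_completes(load_name, sliq, load_ptr, iq):
    """Reinsert the load's dependents into the IQ (single-load assumption)."""
    start = load_ptr.pop(load_name)
    woken = sliq[start:]
    del sliq[start:]
    iq.extend(woken)
    return [i["name"] for i in woken]


program = [
    {"name": "LD", "dest": "r1", "srcs": ["r9"], "l2_miss": True},
    {"name": "a", "dest": "r2", "srcs": ["r1"], "l2_miss": False},
    {"name": "x", "dest": "r5", "srcs": ["r6", "r7"], "l2_miss": False},
    {"name": "b", "dest": "r3", "srcs": ["r2"], "l2_miss": False},
]
sliq, load_ptr, slow_regs, iq = [], {}, set(), []
placement = [retire_from_pseudo_rob(i, sliq, load_ptr, slow_regs)
             for i in program]
queued = [i["name"] for i in sliq]
woken = load_completes("LD", sliq, load_ptr, iq)
print(placement, queued, woken)
```

Here a and b leave the issue queue while the load is outstanding and x is unaffected, which mirrors the a, b and x labels in the slide's diagram.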
Performance Evaluation
Processor Configuration (Baseline 4096):
• Fetch/Commit width: 4
• Branch Predictor: 16K-entry Gshare
• Instruction L1: 32 KB, 4-way, 32-byte line, 2 cycles
• Data L1: 32 KB, 4-way, 32-byte line, 2 cycles
• L2: 512 KB, 4-way, 64-byte line, 10 cycles
• Memory Latency: 1000 cycles
• Physical Registers: 4096 entries
• Load/Store Queue: 4096 entries
• Reorder Buffer: 4096 entries
• Integer General Units: 4 (lat/rep 1/1)
• Integer Mult/Div Units: 2 (lat/rep 3/1 and 20/20)
• FP Functional Units: 4 (lat/rep 2/1)
• FP Mult/Div/Sqrt Units: 2 (lat/rep 4/1, 12/12, 24/24)
Performance Evaluation - Some Considerations • We combine both models. • The processor takes checkpoints when instructions are retired from the Pseudo ROB. • Many branches are already resolved by then, so the probability of rolling back to a checkpoint is reduced. • If a mispredicted branch is detected while it is still in the Pseudo ROB, a normal rollback mechanism is used.
Number of Checkpoints and Performance • Baseline: 2048 IQ • SLIQ 2048 entries and 128 IQ • 2048 Physical Registers
Delay in re-insertion from SLIQ • SLIQ: 1024 entries
Towards Affordable Kilo-Instruction Processors • Adding Ephemeral Registers to Out-of-Order Commit Processors • Changing the SLIQ into a list of buckets of instructions J. Martínez et al., “Ephemeral Registers”, Technical Report CSL-TR-2003-1035, 2003.
Putting It All Together [Figure: labels Physical Registers, Virtual Registers and Memory Latency; IQs of 128 entries]
Conclusion • To tolerate increasing memory latencies in floating-point applications, a large number of in-flight instructions must be maintained. The resources must be up-sized. • The resources are underutilized • We present two techniques to reduce the need for resources and we show their effectiveness: • Out-of-Order Commit • Slow Lane Instruction Queue
State of ST Queues (specInt, ROB=2048) • Locality [Figure: number of instructions in the ST Queue (axis 0 to 250), split into Ready, Address Ready, Blocked-Long and Blocked-Short, over the distribution of in-flight instructions (1, 10, 25, 50, 75, 90, 100); values shown: 20, 108, 435, 1004, 1361]
State of Int Queues (specInt, ROB=2048) • Long/Short Lat. Inst. • Remove – Reinsert Dependence Chain [Figure: number of instructions in the Int. Queue (axis 0 to 450), split into Blocked-Long, Blocked-Short and Ready, over the distribution of in-flight instructions (1, 10, 25, 50, 75, 90, 100); values shown: 20, 108, 435, 1004, 1361]
State of Registers (Int, ROB=2048) • Early Release • Virtual Registers [Figure: Int. Registers (axis 0 to 1000), split into Dead, Blocked-Long, Blocked-Short and Live, over the number of in-flight instructions (SpecInt) with marks at 10%, 25%, 50%, 75% and 90%; values shown: 20, 108, 435, 1004, 1361, 1756]