
Out-of-Order Commit Processors

Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC). HPCA-10, Madrid, February 14-17, 2004.


Presentation Transcript


  1. Out-of-Order Commit Processors Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC) HPCA-10, Madrid February 14-17th 2004

  2. Motivation I [Figure: IPC (0 to 4) versus number of in-flight instructions (128 to 4096) for Spec FP 2000; legend: L2 Perfect, 100, 500, 1000; annotations: 3.5X, 0.30X] Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002

  3. Motivation II – Resources – ROB. Instructions in flight (ROB=2048, Mem 500 cycles) [Figure: number of in-flight instructions (SpecFP) at the 10%, 25%, 50%, 75% and 90% points; axis labels 1168, 1382, 1607, 1868, 1955, 2034; annotation: Often nearly full] A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

  4. Motivation III – Resources – FP Queue. State of FP Queues (ROB=2048, Mem 500 cycles) [Figure: number of instructions in the FP Queue versus distribution of in-flight instructions (1 to 100; in-flight counts 1168, 1382, 1607, 1868, 1955); categories: Blocked-Long, Blocked-Short, Ready; annotation: Long/Short Lat. Inst.: Remove – Reinsert Dependence Chain] A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

  5. Outline
     • Motivation
     • Out-of-Order Commit
     • Multicheckpointing ROB
     • Slow Lane Instruction Queue
     • Performance Evaluation
     • Conclusion

  6. Out-of-Order Commit [Diagram: instruction stream Ld, I1, I2, Br 1, Ld, I3, I4, St, Br 2, I5, Br 3, I6 with checkpoints placed along it; labels: Oldest Checkpoint, Checkpoint, New Checkpoint]

  7. Out-of-Order Commit [Diagram: the same instruction stream (Ld, I1, I2, Br 1, Ld, I3, I4, St, Br 2, I5, Br 3, I6) as the oldest checkpoint advances; labels: Oldest Checkpoint, Checkpoint, New Checkpoint, Gang Commit, To Memory]

  8. Out-of-Order Commit [Diagram: instructions St, I3, I4, St, Br 2, I5, I7, Br 3, I8 and a Store Buffer; labels: Oldest Checkpoint, Checkpoint, Miss Branch Prediction, Recover from Checkpoint]

  9. Out-of-Order Commit II
     • Checkpoint Table. Each entry has:
       • PC of the next instruction
       • Instruction Counter: counts the number of instructions still alive
       • Map Table: allows the register file to be recovered
       • Pointer to the Store Buffer
       • Mechanism to recover free registers
         • Future Free: one bit for each physical register
     • Large Virtual ROB: Tech. Rep. UPC-DAC-2002-39
     • Ephemeral Registers: Tech. Rep. UPC-DAC-2003.51
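The checkpoint table entry described above could be modelled roughly as follows. This is only an illustrative sketch: the field names, the `CheckpointEntry` class and the tiny `NUM_PHYS_REGS` value are assumptions made for the example, not names taken from the presentation.

```python
from dataclasses import dataclass, field

NUM_PHYS_REGS = 8  # assumed, deliberately tiny register file for illustration


@dataclass
class CheckpointEntry:
    next_pc: int            # PC of the next instruction
    map_table: dict         # snapshot: logical register -> physical register
    store_buffer_ptr: int   # pointer into the store buffer
    instr_counter: int = 0  # instructions of this checkpoint still alive
    # Future-free vector: one bit per physical register
    future_free: list = field(default_factory=lambda: [False] * NUM_PHYS_REGS)


entry = CheckpointEntry(next_pc=0x400, map_table={"R1": 3}, store_buffer_ptr=0)
```

Each entry owns its own future-free vector (hence the `default_factory`), so marking a register in one checkpoint never leaks into another.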

  10. Checkpoint Creation
     • Save the PC
     • Save the Map Table
     • Clear the Future Free bits
     • Clear the Instruction Counter
     • Get a pointer to the first free entry of the store buffer, and mark this entry in the store buffer.
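The creation steps above can be sketched in a few lines. The representation is assumed for illustration: a checkpoint is a plain dictionary, the store buffer is reduced to a tail index (`store_tail`) plus a list of mark bits (`store_marks`), and `create_checkpoint` is a hypothetical helper name.

```python
def create_checkpoint(pc, map_table, store_tail, store_marks, num_phys_regs):
    """Build a checkpoint entry following the five steps listed above."""
    store_marks[store_tail] = True                # mark the first free store-buffer entry
    return {
        "pc": pc,                                 # save the PC
        "map_table": dict(map_table),             # save (copy) the map table
        "future_free": [False] * num_phys_regs,   # clear the future-free bits
        "instr_counter": 0,                       # clear the instruction counter
        "store_ptr": store_tail,                  # pointer to the marked entry
    }


rename_map = {"R1": 3, "R2": 7}
marks = [False] * 4
cp = create_checkpoint(0x1000, rename_map, 2, marks, 16)
rename_map["R1"] = 9  # later renaming must not alter the saved snapshot
```

Copying the map table matters here: the snapshot has to stay frozen while the live map keeps changing as later instructions are renamed.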

  11. Instruction Decode
     • Add 1 to the Instruction Counter of the newest checkpoint
     • For R1 ← R2 op R3:
       • If R1 is mapped to PhyReg_N, set the PhyReg_N bit of the future free vector
       • Map R1 to the new physical register
     • Associate the instruction with the last created checkpoint
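A minimal sketch of the decode-time bookkeeping described above, assuming a dictionary-based rename map, a simple list of free physical registers, and a checkpoint held as a dictionary. The function name `decode` and all field names are illustrative assumptions.

```python
def decode(dest, map_table, free_list, newest_cp):
    """Rename one instruction's destination and tie it to the newest checkpoint."""
    newest_cp["instr_counter"] += 1               # one more live instruction
    old_phys = map_table.get(dest)
    if old_phys is not None:
        # The previous mapping becomes freeable once this checkpoint goes away
        newest_cp["future_free"][old_phys] = True
    new_phys = free_list.pop(0)                   # allocate a new physical register
    map_table[dest] = new_phys
    return {"dest": dest, "phys": new_phys, "checkpoint": newest_cp}


cp = {"instr_counter": 0, "future_free": [False] * 8}
rename_map = {"R1": 2}
free_regs = [5, 6]
instr = decode("R1", rename_map, free_regs, cp)   # e.g. R1 <- R2 op R3
```

In this example R1 was mapped to physical register 2, so bit 2 of the future-free vector is set and R1 is remapped to register 5.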

  12. Instruction Writeback
     • Decrement the Instruction Counter of the checkpoint associated with the instruction
     • If the instruction is a mispredicted branch, recover from the associated checkpoint:
       • Fetch instructions from the saved PC
       • Release all entries in the store buffer from the pointed entry onward
       • Free all registers in the future free vector of the entry and of all newer checkpoint entries
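The writeback and recovery steps above might look like this in a toy model. It follows the slide's wording literally and leaves out everything the slide does not mention (squashing younger instructions, restoring the map table, discarding newer checkpoints). The store buffer is assumed to be a plain list, checkpoints are dictionaries ordered oldest first, and `writeback` is a hypothetical name.

```python
def writeback(instr, checkpoints, store_buffer, free_list):
    """Return the PC to refetch from on a misprediction, otherwise None."""
    cp = instr["checkpoint"]
    cp["instr_counter"] -= 1                      # this instruction is done
    if not instr.get("mispredicted_branch", False):
        return None
    idx = next(i for i, c in enumerate(checkpoints) if c is cp)
    del store_buffer[cp["store_ptr"]:]            # release stores from the pointed entry on
    for c in checkpoints[idx:]:                   # this checkpoint and all newer ones
        for reg, flagged in enumerate(c["future_free"]):
            if flagged:
                free_list.append(reg)             # free the register
                c["future_free"][reg] = False
    return cp["pc"]                               # fetch resumes at the saved PC


cp0 = {"pc": 0x100, "instr_counter": 3, "store_ptr": 1,
       "future_free": [False, False, True, False, False]}
cp1 = {"pc": 0x200, "instr_counter": 1, "store_ptr": 2,
       "future_free": [False, False, False, False, True]}
stores = ["s0", "s1", "s2"]
free_regs = []
branch = {"checkpoint": cp0, "mispredicted_branch": True}
refetch_pc = writeback(branch, [cp0, cp1], stores, free_regs)
```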

  13. Checkpoint Elimination
     • If the Instruction Counter is 0 and the checkpoint is the oldest one, then:
       • The checkpoint is removed
       • The corresponding mark in the store buffer is cleared
       • The registers marked in the Future Free vector are freed
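The elimination rule above can be sketched as a single check on the head of a queue of checkpoints. The deque of dictionaries, the `store_marks` list and the function name `try_retire_oldest` are assumptions made for this illustration.

```python
from collections import deque


def try_retire_oldest(checkpoints, store_marks, free_list):
    """Remove the oldest checkpoint if none of its instructions is still alive."""
    if not checkpoints or checkpoints[0]["instr_counter"] != 0:
        return False
    oldest = checkpoints.popleft()                # the checkpoint is removed
    store_marks[oldest["store_ptr"]] = False      # clear its store-buffer mark
    free_list.extend(                             # free the future-free registers
        reg for reg, flagged in enumerate(oldest["future_free"]) if flagged
    )
    return True


cps = deque([
    {"instr_counter": 0, "store_ptr": 0, "future_free": [False, True, False]},
    {"instr_counter": 2, "store_ptr": 1, "future_free": [True, False, False]},
])
marks = [True, True]
free_regs = []
first = try_retire_oldest(cps, marks, free_regs)   # oldest has no live instructions
second = try_retire_oldest(cps, marks, free_regs)  # next one still has two
```

Only the head of the queue is ever examined, which is what keeps commit in checkpoint order even though the instructions inside a checkpoint finish out of order.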

  14. Outline
     • Motivation
     • Out-of-Order Commit
     • Slow Lane Instruction Queue
     • Performance Evaluation
     • Conclusions

  15. Slow Lane Instruction Queue [Diagram: Pseudo ROB, Load/Store Queue, Instruction Queue and Slow Lane Instruction Queue; a load (LD) and data-dependent entries labelled a, b and x]

  16. Slow Lane Instruction Queue [Diagram, next animation step: same structures with the entries a, b and x in new positions]

  17. Slow Lane Instruction Queue [Diagram, later animation step: same structures; labels: Load End, Begin reinsert]

  18. Slow Lane Instruction Queue II
     • Very simple buffer – Slow Lane Instruction Queue (SLIQ)
     • Each load that misses in L2 has a pointer to an entry in the SLIQ
     • Pseudo ROB

  19. Slow Lane Instruction Queue III
     • When an instruction is retired from the Pseudo ROB, its state is examined:
       • If the instruction is a load miss, the pointer is written
       • If the instruction depends on a long-latency instruction, it is moved to the SLIQ
     • When a load that misses in L2 finishes its execution:
       • The SLIQ is traversed from the instruction pointed to by the load, if this point is older than the current traversal position.
       • The load's dependent instructions are reinserted into the IQ
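The two SLIQ events above can be illustrated with a small software model. It is a simplified sketch under several assumptions: the SLIQ is a Python list with `None` tombstones for removed entries, dependence on a missing load is tracked in a `roots` dictionary, and the "current traversal position" condition is not modelled. The names `retire` and `load_finished` are hypothetical.

```python
def retire(instr, sliq, roots, load_ptr):
    """Pseudo-ROB retirement; returns True if the instruction moves to the SLIQ."""
    if instr.get("l2_miss"):
        load_ptr[instr["id"]] = len(sliq)         # pointer written: next SLIQ slot
        roots[instr["id"]] = {instr["id"]}
        return False
    pending = set()                               # missing loads this instruction waits on
    for src in instr["srcs"]:
        pending |= roots.get(src, set())
    if pending:
        roots[instr["id"]] = pending
        sliq.append(instr)                        # depends on a long-latency instruction
        return True
    return False


def load_finished(load_id, sliq, roots, load_ptr, iq):
    """Walk the SLIQ from the load's pointer and reinsert its dependents."""
    for i in range(load_ptr.pop(load_id), len(sliq)):
        instr = sliq[i]
        if instr is not None and load_id in roots[instr["id"]]:
            iq.append(instr)                      # reinsert into the instruction queue
            sliq[i] = None
            del roots[instr["id"]]
    roots.pop(load_id, None)


sliq, roots, load_ptr, iq = [], {}, {}, []
retire({"id": "ld1", "srcs": [], "l2_miss": True}, sliq, roots, load_ptr)
moved_a = retire({"id": "a", "srcs": ["ld1"]}, sliq, roots, load_ptr)
moved_b = retire({"id": "b", "srcs": []}, sliq, roots, load_ptr)
moved_c = retire({"id": "c", "srcs": ["a"]}, sliq, roots, load_ptr)
load_finished("ld1", sliq, roots, load_ptr, iq)
```

Here `a` and `c` form a dependence chain hanging off the missing load, so both wait in the SLIQ and return to the IQ in program order when the load completes, while the independent instruction `b` never leaves the fast path.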

  20. Performance Evaluation
     • Processor Configuration (Baseline 4096):
       • Fetch/Commit width: 4
       • Branch predictor: 16K-entry gshare
       • Instruction L1: 32 KB, 4-way, 32-byte lines, 2 cycles
       • Data L1: 32 KB, 4-way, 32-byte lines, 2 cycles
       • L2: 512 KB, 4-way, 64-byte lines, 10 cycles
       • Memory latency: 1000 cycles
       • Physical registers: 4096 entries
       • Load/Store Queue: 4096 entries
       • Reorder Buffer: 4096 entries
       • Integer general units: 4 (lat/rep 1/1)
       • Integer mult/div units: 2 (lat/rep 3/1 and 20/20)
       • FP functional units: 4 (lat/rep 2/1)
       • FP mult/div/sqrt units: 2 (lat/rep 4/1, 12/12, 24/24)

  21. Performance Evaluation - Some Considerations
     • We mix both models.
     • The processor takes the checkpoints when the instructions are retired from the pseudo ROB.
     • Many branches are resolved by this time, so the probability of rolling back to the checkpoint is reduced.
     • If a mispredicted branch is detected in the pseudo ROB, a normal rollback mechanism is used.

  22. IPC – Different Configurations

  23. Number of Checkpoints and Performance. Baseline: 2048 IQ. SLIQ: 2048 entries and 128 IQ. 2048 Physical Registers

  24. In-Flight Instructions

  25. Delay in re-insertion from SLIQ (SLIQ: 1024 entries)

  26. Towards affordable Kilo-Instruction Processor
     • Adding Ephemeral Registers to the Out-of-Order Commit Processors
     • Change in the SLIQ to a list of buckets of instructions
     J. Martínez et al., “Ephemeral Registers”, Technical Report CSL-TR-2003-1035, 2003.

  27. Putting It All Together [Figure; labels: Physical Registers, Virtual Registers, Memory Latency; IQs of 128 entries]

  28. Conclusion
     • To tolerate increasing memory latencies in floating-point applications, a large number of in-flight instructions must be maintained, so the resources must be up-sized.
     • The resources are underutilized
     • We present two techniques to reduce the need for resources and show their effectiveness:
       • Out-of-Order Commit
       • Slow Lane Instruction Queue

  29. Thank you very much 

  30. State of ST Queues (specInt, ROB=2048) [Figure: number of instructions in the ST Queue versus distribution of in-flight instructions (1 to 100; in-flight counts 20, 108, 435, 1004, 1361); categories: Ready, Address Ready, Blocked-Long, Blocked-Short; annotation: Locality]

  31. State of Int Queues (specInt, ROB=2048) [Figure: number of instructions in the Int. Queue versus distribution of in-flight instructions (1 to 100; in-flight counts 20, 108, 435, 1004, 1361); categories: Blocked-Long, Blocked-Short, Ready; annotation: Long/Short Lat. Inst.: Remove – Reinsert Dependence Chain]

  32. State of Registers (Int, ROB=2048) [Figure: number of Int. registers versus number of in-flight instructions (SpecInt) at the 10%, 25%, 50%, 75% and 90% points; in-flight counts 20, 108, 435, 1004, 1361, 1756; categories: Dead, Blocked-Long, Blocked-Short, Live; annotations: Early Release, Virtual Registers]
