Hardware-based Devirtualization (VPC Prediction)


Presentation Transcript


  1. Hardware-based Devirtualization (VPC Prediction)
     Hyesoon Kim, Jose A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, Robert Cohn

  2. Outline • Background and Motivation • VPC (Virtual Program Counter) Prediction • Results • Conclusion

  3. Direct vs. Indirect Branch
     [Figure: a conditional (direct) branch at address A, "br.cond TARGET", goes to TARG if taken (T) or to A+1 if not taken (N). An indirect branch at address A, "R1 = MEM[R2]; branch R1", can go to any one of several targets.]
     Indirect branches are costly for processor performance
     • Much more difficult to predict than conditional (direct) branches: multiple target addresses
     • An indirect branch predictor requires a large structure

  4. Source Code Examples
     • Switch structures
     • Virtual function calls
     Source code:
         Shape *s = …;
         a = s->area();   // virtual function call
     Static assembly code:
         R1 = MEM[R2]     // function address lookup
         call R1          // a register-indirect call
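
The two sources of indirect branches named on this slide can be sketched in C++ as follows. This is only an illustration: the `Rectangle` and `Triangle` classes and the `total_area` and `apply_op` functions are hypothetical names, not taken from the talk, and whether a given `switch` becomes an indirect jump depends on the compiler.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical class hierarchy, for illustration only.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;  // resolved at run time
};

struct Rectangle : Shape {
    double w, h;
    Rectangle(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};

struct Triangle : Shape {
    double base, height;
    Triangle(double b, double h_) : base(b), height(h_) {}
    double area() const override { return 0.5 * base * height; }
};

// Each s->area() is a virtual call. In typical implementations the callee
// address is loaded from memory and reached through a register-indirect
// call, so the target depends on the dynamic type of *s.
double total_area(const std::vector<std::unique_ptr<Shape>>& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) {
        assert(s != nullptr);
        sum += s->area();
    }
    return sum;
}

// A dense switch is often (not always) compiled into a jump table,
// that is, an indirect jump whose target depends on the value of op.
int apply_op(int op, int x) {
    switch (op) {
        case 0: return x + 1;
        case 1: return x - 1;
        case 2: return x * 2;
        case 3: return x / 2;
        case 4: return -x;
        default: return x;
    }
}
```

A loop over a vector that mixes `Rectangle` and `Triangle` objects makes the single call site in `total_area` jump to different targets from one iteration to the next, which is the case an indirect branch predictor has to handle.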

  5. Indirect Branch Mispredictions Data from Intel Core Duo processor

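  6. Branch Predictor
     [Figure: block diagram of a conventional branch predictor. Components: a direction predictor indexed by a hash of the PC and the GHR (..1001010); a Branch Target Buffer (BTB) indexed by the PC (example address 0x0800), holding TARG1; a separate indirect branch predictor, holding TARG2. The predicted target is chosen among PC+1, TARG1 and TARG2 depending on the taken/not-taken prediction (T) and on whether the branch is direct or indirect.]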

  7. Outline • Background and Motivation • VPC (Virtual Program Counter) Prediction • Results • Conclusion

  8. VPC Prediction: Basic Idea
     • Key idea: Treat an indirect branch as multiple “virtual” conditional branches
         • Only for prediction purposes
         • Use the conditional branch predictor

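  9. VPC Branch Predictor
     [Figure: block diagram of the VPC branch predictor. The direction predictor is indexed by a hash of the PC (example address 0x0800) and the GHR (..1001010). The Branch Target Buffer is accessed with the virtual PCs VPC1 and VPC2 and holds TARG1 and TARG2. The predicted target comes from the BTB; there is no separate indirect branch predictor.]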

  10. VPC Prediction: Basic Idea
      • Key idea: Treat an indirect branch as multiple “virtual” conditional branches
          • Only for prediction purposes
          • Use the conditional branch predictor
      • Benefits:
          • No separate complex structure
          • Can be applied on top of any conditional branch prediction algorithm
          • Improving the conditional branch prediction algorithm will also improve indirect branch prediction accuracy

  11. Inspiration: Static Devirtualization
      Source code:
          Shape *s = …;
          a = s->area();                    // an indirect call
      Optimized source code:
          Shape *s = …;
          if (s->type == Rectangle)         // a conditional branch at PC: X
              a = Rectangle::area();
          else if (s->type == Circle)       // a conditional branch at PC: Y
              a = Circle::area();
          else
              a = s->area();                // an indirect call at PC: Z
      References: Smalltalk (’84), Calder and Grunwald (’94), Garrett et al. (’94), Ishizaki et al. (’00)
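
The optimized source on this slide is pseudocode. A compilable sketch of the same guarded-call pattern is given below, assuming a hypothetical `Kind` tag stored in each object (the slide writes this as `s->type`) and a hypothetical `Triangle` class standing in for "any other type". It is one way a compiler or programmer might express the transformation, not the method of any of the cited works.

```cpp
#include <cassert>

// Hypothetical type tag mirroring the slide's s->type field.
enum class Kind { Rectangle, Circle, Other };

struct Shape {
    Kind type;
    explicit Shape(Kind k) : type(k) {}
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Rectangle : Shape {
    double w, h;
    Rectangle(double w_, double h_) : Shape(Kind::Rectangle), w(w_), h(h_) {}
    double area() const override { return w * h; }
};

struct Circle : Shape {
    double r;
    explicit Circle(double r_) : Shape(Kind::Circle), r(r_) {}
    double area() const override { return 3.141592653589793 * r * r; }
};

struct Triangle : Shape {
    double base, height;
    Triangle(double b, double h_) : Shape(Kind::Other), base(b), height(h_) {}
    double area() const override { return 0.5 * base * height; }
};

// Statically devirtualized version of "a = s->area()".
// The qualified calls (Rectangle::area, Circle::area) bypass virtual
// dispatch, so the common cases become conditional branches plus direct
// calls; only the fallback remains an indirect call.
double area_devirtualized(const Shape* s) {
    assert(s != nullptr);
    if (s->type == Kind::Rectangle)   // conditional branch (PC: X on the slide)
        return static_cast<const Rectangle*>(s)->Rectangle::area();
    else if (s->type == Kind::Circle) // conditional branch (PC: Y on the slide)
        return static_cast<const Circle*>(s)->Circle::area();
    return s->area();                 // indirect call (PC: Z on the slide)
}
```

The choice of which types to test for is fixed when the code is written or compiled, which is the lack of run-time adaptivity that the talk later lists as a limitation of the static approach.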

  12. VPC Prediction
      Source code:
          Shape *s = …;
          a = s->area();              // an indirect call
      Static assembly code:
          R1 = MEM[R2]
          call R1                     // PC: L
      Dynamic virtual branches (for prediction purposes):
          conditional jump TARGET1    // virtual PC = L
          conditional jump TARGET2    // virtual PC = L XOR HASHVAL[1]
          conditional jump TARGET3    // virtual PC = L XOR HASHVAL[2]
          conditional jump TARGET4    // virtual PC = L XOR HASHVAL[3]

  13. Virtual PC Address Generation
      • Use the original PC address and the iteration counter value
      [Figure: the iteration counter value selects an entry of a hash value table, which is combined with the original PC address.]
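
A minimal sketch of virtual PC generation, following the "L XOR HASHVAL[i]" notation of slide 12, is shown below. The table size `kMaxIter` and the constants in `kHashVal` are placeholders chosen for this example; the slides do not give the actual values. Entry 0 is zero so that the first virtual branch uses the original PC, as on slide 12.

```cpp
#include <cassert>
#include <cstdint>

// Assumed maximum number of virtual branches per indirect branch.
constexpr int kMaxIter = 4;

// Placeholder hash value table (hypothetical constants).
// kHashVal[0] == 0 keeps the first virtual PC equal to the real PC.
constexpr std::uint64_t kHashVal[kMaxIter] = {
    0x0, 0x9E3779B9, 0x7F4A7C15, 0x85EBCA6B};

// Virtual PC of the iter-th virtual branch (iter counts from 0)
// for the indirect branch located at address pc.
std::uint64_t virtual_pc(std::uint64_t pc, int iter) {
    assert(iter >= 0 && iter < kMaxIter);
    return pc ^ kHashVal[iter];
}
```

Because each iteration uses a different table entry, the virtual branches of one indirect branch map to different predictor and BTB entries, so each can hold its own target and direction history.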

  14. VPC Prediction Process-I
      Real instruction:
          call R1             // PC: L
      Virtual instructions:
          cond. jump TARG1    // VPC: L
          cond. jump TARG2    // VPC: VL2
          cond. jump TARG3    // VPC: VL3
          cond. jump TARG4    // VPC: VL4
      [Figure: the direction predictor is accessed with PC = L and GHR = 1111, and the BTB returns TARG1. The direction predictor says not taken, so prediction moves to the next iteration.]

  15. VPC Prediction Process-II
      (Same real and virtual instructions as on slide 14.)
      [Figure: the direction predictor is accessed with VPC = VL2 and VGHR = 1110, and the BTB returns TARG2. The direction predictor says not taken, so prediction moves to the next iteration.]

  16. VPC Prediction Process-III
      (Same real and virtual instructions as on slide 14.)
      [Figure: the direction predictor is accessed with VPC = VL3 and VGHR = 1100, and the BTB returns TARG3. The direction predictor says taken, so the predicted target is TARG3.]

  17. VPC Prediction Algorithm
      • Access the conditional branch predictor and the BTB with VPCA and VGHR
      • Compute VPCA and VGHR for the next iteration
          • VPCA = PC XOR HASHVAL[iter]
          • VGHR = VGHR << 1
      • Predicted not taken: Move to the next iteration
      • Predicted taken: Use the target in the BTB as the target of the indirect branch
      • Give up and stall if
          • Iteration count > MAX_ITER or BTB miss
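
The prediction loop above can be sketched in software as follows. This is a toy model under stated assumptions, not the hardware design from the talk: the direction predictor is a hash map of booleans indexed gshare-style by `VPCA XOR VGHR`, the BTB is a hash map from VPCA to target, and `kMaxIter`, `kHashVal`, `VpcTables`, `Prediction`, `vpca`, `dir_index` and `predict` are hypothetical names and values.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

constexpr int kMaxIter = 4;  // MAX_ITER on the slide; the value 4 is assumed
constexpr std::uint64_t kHashVal[kMaxIter] = {  // placeholder constants
    0x0, 0x9E3779B9, 0x7F4A7C15, 0x85EBCA6B};

// VPCA = PC XOR HASHVAL[iter]; iteration 0 uses the real PC.
std::uint64_t vpca(std::uint64_t pc, int iter) {
    assert(iter >= 0 && iter < kMaxIter);
    return pc ^ kHashVal[iter];
}

// Toy gshare-style index into the direction predictor (an assumption).
std::uint64_t dir_index(std::uint64_t a, std::uint64_t vghr) { return a ^ vghr; }

struct VpcTables {
    std::unordered_map<std::uint64_t, bool> direction;    // true = taken
    std::unordered_map<std::uint64_t, std::uint64_t> btb; // VPCA -> target
};

struct Prediction {
    bool valid;            // false: give up and stall
    std::uint64_t target;  // meaningful only when valid
    int iterations;        // number of iterations performed
};

Prediction predict(const VpcTables& t, std::uint64_t pc, std::uint64_t ghr) {
    std::uint64_t vghr = ghr;
    for (int iter = 0; iter < kMaxIter; ++iter) {
        const std::uint64_t a = vpca(pc, iter);
        const auto entry = t.btb.find(a);
        if (entry == t.btb.end()) return {false, 0, iter + 1};  // BTB miss
        const auto dir = t.direction.find(dir_index(a, vghr));
        if (dir != t.direction.end() && dir->second)
            return {true, entry->second, iter + 1};  // predicted taken
        vghr <<= 1;  // predicted not taken: shift in a 0 and continue
    }
    return {false, 0, kMaxIter};  // iteration limit reached
}
```

With BTB entries for the first three virtual branches and a taken entry for the third, `predict` returns the third target after three iterations, mirroring the walk-through on slides 14 to 16 (GHR 1111, then VGHR 1110, then 1100).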

  18. VPC Training Algorithm
      • An iterative process, performed when an indirect branch is retired (not on the critical path)
      • Update the conditional branch predictor
          • Virtual branch has the correct target: Taken
          • Virtual branch has a wrong target: Not-taken
      • Update the replacement policy bits of the correct target in the BTB
      • If no virtual branch has the correct target, insert the correct target into the BTB
          • Conditional branch predictor: taken
          • Replace the least frequently used (LFU) target
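
One possible reading of the training steps above, as a toy software model. Everything here is an assumption made for illustration: the direction predictor is a map of booleans that is simply overwritten, the BTB keeps a per-entry use counter as its replacement state, an empty slot is preferred over evicting an existing target, and the names `VpcTables`, `BtbEntry`, `vpca`, `dir_index` and `train` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

constexpr int kMaxIter = 4;  // assumed iteration limit
constexpr std::uint64_t kHashVal[kMaxIter] = {  // placeholder constants
    0x0, 0x9E3779B9, 0x7F4A7C15, 0x85EBCA6B};

std::uint64_t vpca(std::uint64_t pc, int iter) {
    assert(iter >= 0 && iter < kMaxIter);
    return pc ^ kHashVal[iter];
}

std::uint64_t dir_index(std::uint64_t a, std::uint64_t vghr) { return a ^ vghr; }

struct BtbEntry {
    std::uint64_t target;
    unsigned uses;  // toy LFU counter (replacement policy bits)
};

struct VpcTables {
    std::unordered_map<std::uint64_t, bool> direction;  // true = taken
    std::unordered_map<std::uint64_t, BtbEntry> btb;    // VPCA -> entry
};

// Called when the indirect branch at pc retires with its resolved target.
void train(VpcTables& t, std::uint64_t pc, std::uint64_t ghr,
           std::uint64_t correct) {
    std::uint64_t vghr = ghr;
    int victim = -1;           // iteration whose slot will be (re)filled
    unsigned fewest = 0;       // use count of the current victim
    std::uint64_t victim_vghr = ghr;
    for (int iter = 0; iter < kMaxIter; ++iter, vghr <<= 1) {
        const std::uint64_t a = vpca(pc, iter);
        const auto entry = t.btb.find(a);
        if (entry != t.btb.end() && entry->second.target == correct) {
            t.direction[dir_index(a, vghr)] = true;  // correct target: taken
            ++entry->second.uses;                    // update replacement state
            return;
        }
        unsigned uses = 0;  // an empty slot counts as never used
        if (entry != t.btb.end()) {
            t.direction[dir_index(a, vghr)] = false;  // wrong target: not taken
            uses = entry->second.uses;
        }
        if (victim < 0 || uses < fewest) {
            victim = iter;
            fewest = uses;
            victim_vghr = vghr;
        }
    }
    // No virtual branch held the correct target: insert it, replacing the
    // least frequently used slot, and train that virtual branch as taken.
    assert(victim >= 0);
    const std::uint64_t a = vpca(pc, victim);
    t.btb[a] = BtbEntry{correct, 1};
    t.direction[dir_index(a, victim_vghr)] = true;
}
```

Training the same branch first with one target and then with a second one leaves the first target in the slot of virtual branch 0 (now trained not taken for that history) and places the second target in the slot of virtual branch 1.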

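  19. Hardware Cost and Complexity
      [Figure: block diagram. Components: GHR and VGHR as history inputs to the branch direction predictor (BP); PC and VPCA as address inputs; a hash function and an iteration counter that produce VPCA; the BTB. Signals: Taken/Not Taken, Direct/Indirect, Target Address, Predict?]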

  20. Outline • Background and Motivation • VPC Prediction • Results • Conclusion

  21. Simulation Methodology
      • Pin-based x86 simulator
      • Processor configuration
          • 4K-entry BTB
          • 64KB perceptron conditional branch predictor
          • Minimum 30-cycle branch misprediction penalty
          • 8-wide, 512-entry instruction window
          • Less aggressive processor (in the paper)
          • Gshare, O-GEHL conditional branch predictors
      • Indirect branch intensive benchmarks
          • 5 SPEC CPU2000, 5 SPEC CPU2006, 2 other C++
          • IBM server benchmarks (OLTP) (in the paper)

  22. VPC MPKI

  23. VPC Performance

  24. Different Direction Predictors
      [Figure: results for direction predictors with conditional branch accuracy of 98%, 98.3% and 99%.]
      Improving conditional branch prediction accuracy also improves indirect branch prediction accuracy!

  25. VPC vs. Static Devirtualization
      • Advantages of static devirtualization
          • Enables other compiler optimizations (function inlining)
          • Can reduce the number of mispredictions
      • Disadvantages/limitations of static devirtualization
          • Not all indirect branches can be statically devirtualized
          • Requires extensive static analysis/profiling
          • Lack of adaptivity to run-time input set and phase behavior
      • VPC prediction can be used with statically devirtualized binaries
          • 10% improvement on top of static devirtualization

  26. Outline • Background and Motivation • VPC Prediction • Results • Conclusion

  27. Conclusion
      • VPC dynamically converts indirect branches into multiple conditional branches; uses the existing conditional branch prediction hardware
      • VPC prediction reduces the branch misprediction penalty without significant extra hardware storage
          • Baseline: 26% IPC improvement
          • O-GEHL: 31% IPC improvement
      • VPC can be an enabler encouraging programmers to use object-oriented programming styles

  28. Thank you! Questions?

  29. VPC vs. Cascaded IBP

  30. VPC vs. Other Indirect BP
      • TTC: Chang et al. (’96)
      • Cascaded: Driesen and Hölzle (’98)

  31. Iterative prediction
      • It does not hurt performance significantly (see results)
      • Why? Most predictions complete within a few iterations (see results)

  32. VPC Hit Iteration Counter

  33. Can the BTB be pipelined?
      • Yes
      • The next iteration of VPC prediction can be started without knowing the outcome of the previous iteration in the pipeline
      • Consecutive VPC prediction iterations can simply be pipelined
      • If an iteration turns out not to be needed, its prediction is simply discarded

  34. Is 4K-entry BTB too large? • Pentium 4 has a 4K-entry BTB • IBM Z series (z990) has an 8K-entry BTB • AMD Athlon and Hammer have 2K-entry BTBs

  35. BTB Size Effects

  36. VPC Prediction Accuracy

  37. Target Distribution

  38. VPC vs. Tagged Target Cache

  39. VPC Prediction Delay Effects

  40. VPC with O-GEHL BP

  41. VPC with a Less Aggressive Processor

  42. Server Benchmarks

  43. Server Benchmarks (VPC vs. TTC)

  44. VPC Prediction vs. Compiler-Based Devirtualization (With TTC)

  45. Conditional Br. Prediction Effects
      VPC prediction reduces the accuracy of conditional branch direction prediction, but only slightly!

  46. Indirect Branch Mispredictions

  47. VPC Prediction with Static Devirtualization
      • VPC prediction can be used with statically devirtualized binaries
      • Not all indirect branches could be devirtualized

  48. VPC Training: Correct Prediction
      Retirement of the real instruction: call R1 // PC: L
      Known: correctly predicted, predicted iter = 3
      Update the BTB replacement counter

  49. VPC Training: Misprediction
      Retirement of the real instruction: call R1 // PC: L
      Known: mispredicted, correct target address
      Update the BTB replacement counter

  50. VPC Training: Misprediction
      Retirement of the real instruction: call R1 // PC: L
      Known: mispredicted, correct target address
      No target
