Explore the benefits of utilizing hardware-based devirtualization in VPC prediction algorithms to improve branch prediction accuracy and processor performance. Learn the key idea and process behind VPC prediction and its impact on indirect branching. Source code examples and simulation methodology included.
Hardware-based Devirtualization (VPC Prediction) Hyesoon Kim, Jose A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, Robert Cohn
Outline • Background and Motivation • VPC (Virtual Program Counter) Prediction • Results • Conclusion
Direct vs. Indirect Branch [Figure: a conditional (direct) branch at address A (br.cond TARGET) goes to A+1 if not taken (N) or to TARGET if taken (T); an indirect branch at address A (R1 = MEM[R2]; branch R1) can go to any of several target addresses.] Indirect branches are costly for processor performance • Much more difficult to predict than conditional (direct) branches: multiple target addresses • Indirect branch predictor requires a large structure
Source Code Examples • Switch structures • Virtual function calls Source code: Shape *s = …; a = s->area(); // virtual function call Static assembly code: R1 = MEM[R2] // function address lookup call R1 // a register-indirect call
Indirect Branch Mispredictions Data from Intel Core Duo processor
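Branch Predictor [Figure: the PC and the global history register (GHR, e.g. ..1001010) are hashed to index the direction predictor; the PC address (e.g. 0x0800) indexes the Branch Target Buffer (BTB). For a direct branch, the direction predictor selects between the BTB target (TARG1) when taken and PC+1 otherwise. For an indirect branch, a separate indirect branch predictor supplies the predicted target (TARG2).]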
VPC Prediction: Basic Idea • Key idea: Treat an indirect branch as multiple “virtual” conditional branches • Only for prediction purposes • Use the conditional branch predictor
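VPC Branch Predictor [Figure: the same direction predictor and Branch Target Buffer as in the baseline, with no separate indirect branch predictor. The GHR (e.g. ..1001010) is hashed with the PC to index the direction predictor; the BTB is accessed with the PC address (e.g. 0x0800) and then with the virtual PCs VPC1, VPC2, which return TARG1, TARG2; the predicted target is the BTB target of the virtual branch that is predicted taken.]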
VPC Prediction: Basic Idea • Key idea: Treat an indirect branch as multiple “virtual” conditional branches • Only for prediction purposes • Use the conditional branch predictor • Benefits: • No separate complex structure • Can be applied with any conditional branch prediction algorithm • Improvements to the conditional branch prediction algorithm • Will also improve the indirect branch prediction accuracy
Inspiration: Static Devirtualization Source code: Shape *s = …; a = s->area(); // an indirect call Optimized source code: Shape *s = …; if (s->type == Rectangle) // a conditional branch at PC: X a = Rectangle::area(); else if (s->type == Circle) // a conditional branch at PC: Y a = Circle::area(); else a = s->area(); // an indirect call at PC: Z Smalltalk (’84), Calder and Grunwald (’94), Garrett et al. (’94), Ishizaki et al. (’00)
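The guarded call sequence on this slide can be written as compilable C++. The following is a minimal sketch, not compiler output: the `Kind` tag, the `Triangle` class, and the function name `area_devirtualized` are assumptions added for illustration, and the explicit `type` field stands in for whatever class test a real compiler would emit.

```cpp
#include <cassert>

// Hypothetical type tag; the slide compares s->type against class names.
enum class Kind { Rectangle, Circle, Other };

struct Shape {
    explicit Shape(Kind k) : type(k) {}
    virtual ~Shape() = default;
    virtual double area() const = 0;
    Kind type;  // tag tested by the guards below
};

struct Rectangle : Shape {
    Rectangle(double w, double h) : Shape(Kind::Rectangle), w(w), h(h) {}
    double area() const override { return w * h; }
    double w, h;
};

struct Circle : Shape {
    explicit Circle(double r) : Shape(Kind::Circle), r(r) {}
    double area() const override { return 3.141592653589793 * r * r; }
    double r;
};

// A class the guards do not cover, so it takes the fallback path.
struct Triangle : Shape {
    Triangle(double b, double h) : Shape(Kind::Other), b(b), h(h) {}
    double area() const override { return 0.5 * b * h; }
    double b, h;
};

// Devirtualized call site: two conditional branches guard direct calls
// (qualified names bypass virtual dispatch); the last arm keeps the
// original indirect call for every other receiver type.
double area_devirtualized(const Shape* s) {
    assert(s != nullptr);
    if (s->type == Kind::Rectangle)
        return static_cast<const Rectangle*>(s)->Rectangle::area();
    else if (s->type == Kind::Circle)
        return static_cast<const Circle*>(s)->Circle::area();
    else
        return s->area();  // indirect call
}
```

Calling `area_devirtualized` on a `Rectangle` or `Circle` takes one of the direct-call arms, while a `Triangle` falls through to the virtual call, which is the case static devirtualization cannot remove.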
VPC Prediction Source code: Shape *s = …; a = s->area(); // an indirect call Static assembly code: R1 = MEM[R2] call R1 // PC: L Dynamic virtual branches (for prediction purposes): conditional jump TARGET1 // virtual PC = L conditional jump TARGET2 // virtual PC = L XOR HASHVAL[1] conditional jump TARGET3 // virtual PC = L XOR HASHVAL[2] conditional jump TARGET4 // virtual PC = L XOR HASHVAL[3]
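Virtual PC Address Generation • Use the original PC address and the iteration counter value [Figure: the iteration counter value indexes a hash value table; the selected hash value is combined with the PC to form the virtual PC address.]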
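A minimal sketch of virtual PC address generation, assuming a small table of fixed constants indexed by the iteration counter. The table contents, `kMaxIter`, and the function name `virtual_pc` are illustrative assumptions; the slides only state that the original PC is combined with a hash value selected by the iteration counter, and that the first virtual branch uses the original PC.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical hash value table. The constants are made up for this sketch.
constexpr std::size_t kMaxIter = 4;
constexpr std::array<std::uint64_t, kMaxIter> kHashVal = {
    0x0,  // unused: iteration 0 uses the original PC unchanged
    0x9E3779B97F4A7C15,
    0xC2B2AE3D27D4EB4F,
    0x165667B19E3779F9,
};

// Virtual PC for a given iteration:
//   iteration 0 -> PC
//   iteration i -> PC XOR HASHVAL[i]
std::uint64_t virtual_pc(std::uint64_t pc, std::size_t iter) {
    assert(iter < kMaxIter);
    return iter == 0 ? pc : (pc ^ kHashVal[iter]);
}
```

Because XOR is its own inverse, distinct hash values give distinct virtual PCs for the same branch, so each virtual branch maps to its own predictor and BTB entry.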
VPC Prediction Process-I • Real instruction: call R1 // PC: L • Virtual instructions: cond. jump TARG1 // VPC: L cond. jump TARG2 // VPC: VL2 cond. jump TARG3 // VPC: VL3 cond. jump TARG4 // VPC: VL4 • Iteration 1: the direction predictor is accessed with PC = L and GHR = 1111; the BTB returns TARG1 • Prediction: not taken, so move to the next iteration
VPC Prediction Process-II • Real instruction: call R1 // PC: L • Virtual instructions: cond. jump TARG1 // VPC: L cond. jump TARG2 // VPC: VL2 cond. jump TARG3 // VPC: VL3 cond. jump TARG4 // VPC: VL4 • Iteration 2: the direction predictor is accessed with VPC = VL2 and VGHR = 1110; the BTB returns TARG2 • Prediction: not taken, so move to the next iteration
VPC Prediction Process-III • Real instruction: call R1 // PC: L • Virtual instructions: cond. jump TARG1 // VPC: L cond. jump TARG2 // VPC: VL2 cond. jump TARG3 // VPC: VL3 cond. jump TARG4 // VPC: VL4 • Iteration 3: the direction predictor is accessed with VPC = VL3 and VGHR = 1100; the BTB returns TARG3 • Prediction: taken, so Predicted Target = TARG3
VPC Prediction Algorithm • Access the conditional branch predictor and the BTB with VPCA and VGHR • Compute VPCA and VGHR for the next iteration • VPCA = PC XOR HASHVAL[iter] • VGHR = VGHR << 1 • Predicted not taken: Move to the next iteration • Predicted taken: Use the target in the BTB as the target of the indirect branch • Give up and stall if • Iteration count > MAX_ITER or BTB miss
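The prediction loop above can be sketched in software as follows. This is a toy model under stated assumptions, not the hardware design: `ToyPredictor` (a gshare-style table of 2-bit counters plus a map standing in for the BTB), the hash constants, `kMaxIter`, and the function name `vpc_predict` are all hypothetical. Only the loop structure (access with VPCA and VGHR, XOR with a hash value, shift the history, stop on taken, give up on a BTB miss or iteration limit) comes from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;

constexpr std::size_t kMaxIter = 4;                               // assumed MAX_ITER
constexpr Addr kHashVal[kMaxIter] = {0, 0x9E37, 0x7F4A, 0xC2B2};  // made-up values

// Toy stand-ins for the existing conditional predictor and BTB.
struct ToyPredictor {
    std::unordered_map<Addr, Addr> btb;  // VPCA -> stored target
    // 2-bit saturating counters, initialised to weakly not-taken (1).
    std::vector<std::uint8_t> counters = std::vector<std::uint8_t>(1024, 1);

    std::size_t index(Addr vpca, std::uint32_t vghr) const {
        assert(!counters.empty());
        return (vpca ^ vghr) % counters.size();  // gshare-style index
    }
    bool predict_taken(Addr vpca, std::uint32_t vghr) const {
        return counters[index(vpca, vghr)] >= 2;
    }
};

// Returns the predicted target, or std::nullopt when the predictor gives up
// (BTB miss or iteration limit) and the front end would stall.
std::optional<Addr> vpc_predict(const ToyPredictor& p, Addr pc, std::uint32_t ghr) {
    Addr vpca = pc;            // the first virtual branch uses the real PC
    std::uint32_t vghr = ghr;  // and the real global history
    for (std::size_t iter = 0; iter < kMaxIter; ++iter) {
        auto entry = p.btb.find(vpca);
        if (entry == p.btb.end()) return std::nullopt;  // BTB miss: stall
        if (p.predict_taken(vpca, vghr)) return entry->second;
        // Predicted not taken: set up the next virtual branch.
        if (iter + 1 < kMaxIter) vpca = pc ^ kHashVal[iter + 1];
        vghr <<= 1;  // shift in a not-taken (0) outcome
    }
    return std::nullopt;  // no virtual branch was predicted taken: stall
}
```

With three BTB entries installed for one branch and only the third virtual branch's counter set to taken, `vpc_predict` walks two not-taken iterations and returns the third target, mirroring the three-step walkthrough in the preceding slides.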
VPC Training Algorithm • An iterative process when an indirect branch is retired (not on the critical path) • Update the conditional branch predictor • Virtual branch has the correct target: Taken • Virtual branch has a wrong target: Not-taken • Update the replacement policy bits of the correct target in the BTB • If the correct target is not in the BTB, insert it • Conditional branch predictor: taken • Replace the least frequently used (LFU) target
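The training steps above might look like the following in a software model. This is a sketch under assumptions, not the paper's exact update logic: `ToyPredictor`, `BtbEntry` with its `uses` counter as the LFU state, the hash constants, and the function name `vpc_train` are hypothetical, and the choice to prefer an empty BTB slot over evicting an existing target is a simplification made here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;

constexpr std::size_t kMaxIter = 4;                               // assumed MAX_ITER
constexpr Addr kHashVal[kMaxIter] = {0, 0x9E37, 0x7F4A, 0xC2B2};  // made-up values

struct BtbEntry {
    Addr target = 0;
    unsigned uses = 0;  // hypothetical LFU counter (replacement policy bits)
};

struct ToyPredictor {
    std::unordered_map<Addr, BtbEntry> btb;  // VPCA -> entry
    std::vector<std::uint8_t> counters = std::vector<std::uint8_t>(1024, 1);

    std::size_t index(Addr vpca, std::uint32_t vghr) const {
        assert(!counters.empty());
        return (vpca ^ vghr) % counters.size();
    }
    void update(Addr vpca, std::uint32_t vghr, bool taken) {
        std::uint8_t& c = counters[index(vpca, vghr)];  // 2-bit saturating counter
        if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
    }
};

// Called when an indirect branch retires with its resolved target.
void vpc_train(ToyPredictor& p, Addr pc, std::uint32_t ghr, Addr correct_target) {
    std::uint32_t vghr = ghr;
    std::size_t victim = 0;  // iteration whose slot is empty or least used
    unsigned victim_uses = ~0u;
    std::uint32_t victim_vghr = ghr;
    for (std::size_t iter = 0; iter < kMaxIter; ++iter, vghr <<= 1) {
        const Addr vpca = pc ^ kHashVal[iter];  // kHashVal[0] == 0, so iter 0 uses PC
        auto it = p.btb.find(vpca);
        if (it != p.btb.end() && it->second.target == correct_target) {
            p.update(vpca, vghr, true);  // correct target: train taken
            ++it->second.uses;           // update replacement state
            return;
        }
        if (it != p.btb.end()) p.update(vpca, vghr, false);  // wrong target: not-taken
        const unsigned uses = (it == p.btb.end()) ? 0 : it->second.uses;
        if (uses < victim_uses) { victim = iter; victim_uses = uses; victim_vghr = vghr; }
    }
    // Correct target not found: insert it, replacing the least used slot,
    // and train that virtual branch as taken.
    const Addr vpca = pc ^ kHashVal[victim];
    BtbEntry& e = p.btb[vpca];
    e.target = correct_target;
    e.uses = 1;
    p.update(vpca, victim_vghr, true);
}
```

Training the same branch with two different targets in turn places them at consecutive virtual PCs, so a later prediction pass can find either one by iterating.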
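Hardware Cost and Complexity [Figure: block diagram. Multiplexers select GHR or VGHR and PC or VPCA as inputs to the branch direction predictor (BP) and the BTB; VPCA is produced by a hash function driven by an iteration counter. The BP supplies Taken/Not Taken, the BTB supplies the Direct/Indirect bit and the Target Address, and these determine whether a prediction is made (Predict?).]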
Simulation Methodology • Pin-based x86 simulator • Processor configuration • 4K-entry BTB • 64KB perceptron conditional branch predictor • Minimum 30-cycle branch misprediction penalty • 8-wide, 512-entry instruction window • Less aggressive processor (in the paper) • Gshare, O-GEHL conditional branch predictors • Indirect branch intensive benchmarks • 5 SPEC CPU2000, 5 SPEC CPU2006, 2 other C++ • IBM server benchmarks (OLTP) (in the paper)
Different Direction Predictors [Chart: results for direction predictors with conditional branch accuracies of 98%, 98.3%, and 99%.] Improving conditional branch prediction accuracy also improves indirect branch prediction accuracy!
VPC vs. Static Devirtualization • Advantages of static devirtualization • Enables other compiler optimizations (function inlining) • Can reduce the number of mispredictions • Disadvantages/limitations of static devirtualization • Not all indirect branches can be statically devirtualized • Requires extensive static analysis/profiling • Lack of adaptivity to run-time input set and phase behavior • VPC prediction can be used with statically devirtualized binaries • 10% improvement on top of static devirtualization
Conclusion • VPC dynamically converts indirect branches into multiple conditional branches and uses the existing conditional branch prediction hardware • VPC prediction reduces the branch misprediction penalty without significant extra hardware storage • Baseline: 26% IPC improvement • O-GEHL: 31% IPC improvement • VPC can be an enabler encouraging programmers to use object-oriented programming styles
Thank you! Questions?
VPC vs. Other Indirect BP • TTC: Chang et al. (’96) • Cascaded: Driesen and Hölzle (’98)
Iterative Prediction • It doesn’t hurt performance significantly • Results • Why? • Most predictions are made within a few iterations • Results
Can the BTB be pipelined? • Yes • The next iteration of VPC prediction can be started without knowing the outcome of the previous iteration in the pipeline • Consecutive VPC prediction iterations can simply be pipelined • If an iteration turns out not to be needed, simply discard its prediction
Is 4K-entry BTB too large? • Pentium 4 has a 4K-entry BTB • IBM Z series (z990) has an 8K-entry BTB • AMD Athlon and Hammer have 2K-entry BTBs
VPC Prediction vs. Compiler-Based Devirtualization (With TTC)
Conditional Br. Prediction Effects VPC prediction reduces the accuracy of conditional branch direction prediction, but not by much!
VPC Prediction with Static Devirtualization • VPC prediction can be used with statically devirtualized binaries • Not all indirect branches can be devirtualized
VPC Training: Correct Prediction Retirement: Real Instruction call R1 // PC: L Known: Correctly predicted, predicted iter = 3 Update the BTB replacement counter
VPC Training: Misprediction Retirement: Real Instruction call R1 // PC: L Known: Mispredicted, correct target address Update the BTB replacement counter
VPC Training: Misprediction Retirement: Real Instruction call R1 // PC: L Known: Mispredicted, correct target address No Target