Hardware-based Devirtualization (VPC Prediction) Hyesoon Kim, Jose A. Joao, Onur Mutlu++, Chang Joo Lee, Yale N. Patt, Robert Cohn*
Outline • Background and Motivation • VPC (Virtual Program Counter) Prediction • Results • Conclusion
Direct vs. Indirect Branch [Figure: a conditional (direct) branch at address A, br.cond TARGET, goes to A+1 if not taken (N) or to TARGET if taken (T); an indirect branch at address A, R1 = MEM[R2]; branch R1, can go to any of several targets.] Indirect branches are costly for processor performance • Much more difficult to predict than conditional (direct) branches: multiple target addresses • An indirect branch predictor requires a large structure
Source Code Examples • Switch structures • Virtual function calls
Source code:
    Shape *s = …;
    a = s->area();   // virtual function call
Static assembly code:
    R1 = MEM[R2]     // function address lookup
    call R1          // a register-indirect call
Indirect Branch Mispredictions Data from Intel Core Duo processor
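Branch Predictor [Figure: the PC and the global history register (GHR, ..1001010) are hashed to index the direction predictor; the PC (address 0x0800) also looks up the Branch Target Buffer (BTB) and a separate indirect branch predictor. The predicted target is chosen among PC+1, the BTB target (TARG1), and the indirect branch predictor's target (TARG2), depending on the taken prediction (T) and on whether the branch is direct or indirect.]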
Outline • Background and Motivation • VPC (Virtual Program Counter) Prediction • Results • Conclusion
VPC Prediction: Basic Idea • Key idea: Treat an indirect branch as multiple “virtual” conditional branches • Only for prediction purposes • Use the conditional branch predictor
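VPC Branch Predictor [Figure: the same predictor without a separate indirect branch predictor. For the indirect branch at address 0x0800, the virtual PCs (VPC1, VPC2) are hashed with the GHR (..1001010) to index the direction predictor and are used to look up the Branch Target Buffer, whose entries (TARG1, TARG2) supply the predicted target.]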
VPC Prediction: Basic Idea • Key idea: Treat an indirect branch as multiple “virtual” conditional branches • Only for prediction purposes • Use the conditional branch predictor • Benefits: • No separate complex structure • Can be applied with any conditional branch prediction algorithm • Improving the conditional branch prediction algorithm will also improve indirect branch prediction accuracy
Inspiration: Static Devirtualization
Source code:
    Shape *s = …;
    a = s->area();                   // an indirect call
Optimized source code:
    Shape *s = …;
    if (s->type == Rectangle)        // a conditional branch at PC: X
        a = Rectangle::area();
    else if (s->type == Circle)      // a conditional branch at PC: Y
        a = Circle::area();
    else
        a = s->area();               // an indirect call at PC: Z
Smalltalk (’84), Calder and Grunwald (’94), Garrett et al. (’94), Ishizaki et al. (’00)
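The guarded-call transformation on this slide can be sketched as compilable C++. This is only an illustration under assumptions: the `Kind` tag, the `Triangle` class, the constructors, and the function name `area_devirtualized` are hypothetical and do not come from the slides.

```cpp
#include <cassert>

// Hypothetical type tag; the slide only shows `s->type`.
enum class Kind { Rectangle, Circle, Other };

struct Shape {
    Kind type;  // tag tested by the guards below
    explicit Shape(Kind k) : type(k) {}
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Rectangle : Shape {
    double w, h;
    Rectangle(double w_, double h_) : Shape(Kind::Rectangle), w(w_), h(h_) {}
    double area() const override { return w * h; }
};

struct Circle : Shape {
    double r;
    explicit Circle(double r_) : Shape(Kind::Circle), r(r_) {}
    double area() const override { return 3.14159 * r * r; }
};

// A class the guards do not cover, so it takes the fallback path.
struct Triangle : Shape {
    double b, h;
    Triangle(double b_, double h_) : Shape(Kind::Other), b(b_), h(h_) {}
    double area() const override { return 0.5 * b * h; }
};

// Each guard is a conditional branch; a qualified call such as
// Rectangle::area() bypasses virtual dispatch, so it is a direct call.
double area_devirtualized(const Shape *s) {
    if (s->type == Kind::Rectangle)
        return static_cast<const Rectangle *>(s)->Rectangle::area();
    else if (s->type == Kind::Circle)
        return static_cast<const Circle *>(s)->Circle::area();
    else
        return s->area();  // remaining indirect call
}
```

Calling `area_devirtualized(&shape)` returns the same value as `shape.area()` for every class; only the way the callee is reached differs.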
VPC Prediction
Source code:
    Shape *s = …;
    a = s->area();   // an indirect call
Static assembly code:
    R1 = MEM[R2]
    call R1          // PC: L
Dynamic virtual branches (for prediction purposes):
    conditional jump TARGET1   // virtual PC = L
    conditional jump TARGET2   // virtual PC = L XOR HASHVAL[1]
    conditional jump TARGET3   // virtual PC = L XOR HASHVAL[2]
    conditional jump TARGET4   // virtual PC = L XOR HASHVAL[3]
Virtual PC Address Generation • Use the original PC address and the iteration counter value [Figure: the iteration counter value indexes a hash value table.]
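A minimal sketch of the address generation, assuming the XOR form given on the other slides (virtual PC = L XOR HASHVAL[i]). The table contents, the limit of four entries, and the name `virtual_pc` are assumptions for illustration; the slides do not give the actual hash values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t MAX_ITER = 4;  // assumed number of virtual branches

// Assumed table contents: arbitrary fixed constants. Entry 0 is zero so
// that the first virtual PC equals the real PC, as on the slides.
const std::uint32_t HASHVAL[MAX_ITER] = {
    0x00000000u, 0x9E3779B9u, 0x7F4A7C15u, 0x85EBCA6Bu};

// Virtual PC for a 0-based iteration index: PC XOR HASHVAL[iter].
inline std::uint32_t virtual_pc(std::uint32_t pc, std::size_t iter) {
    assert(iter < MAX_ITER);
    return pc ^ HASHVAL[iter];
}
```

Because the table is fixed, the same indirect branch always maps to the same sequence of virtual PCs, which is what lets the conditional branch predictor and the BTB learn per-target behavior.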
VPC Prediction Process-I
Real instruction:
    call R1            // PC: L
Virtual instructions:
    cond. jump TARG1   // VPC: L
    cond. jump TARG2   // VPC: VL2
    cond. jump TARG3   // VPC: VL3
    cond. jump TARG4   // VPC: VL4
Iteration 1: the direction predictor is accessed with PC = L and GHR = 1111 and predicts not taken; the BTB returns TARG1; move to the next iteration.
VPC Prediction Process-II
Real instruction:
    call R1            // PC: L
Virtual instructions:
    cond. jump TARG1   // VPC: L
    cond. jump TARG2   // VPC: VL2
    cond. jump TARG3   // VPC: VL3
    cond. jump TARG4   // VPC: VL4
Iteration 2: the direction predictor is accessed with VPC = VL2 and VGHR = 1110 and predicts not taken; the BTB returns TARG2; move to the next iteration.
VPC Prediction Process-III
Real instruction:
    call R1            // PC: L
Virtual instructions:
    cond. jump TARG1   // VPC: L
    cond. jump TARG2   // VPC: VL2
    cond. jump TARG3   // VPC: VL3
    cond. jump TARG4   // VPC: VL4
Iteration 3: the direction predictor is accessed with VPC = VL3 and VGHR = 1100 and predicts taken; the BTB returns TARG3; Predicted Target = TARG3.
VPC Prediction Algorithm • Access the conditional branch predictor and the BTB with VPCA and VGHR • Compute VPCA and VGHR for the next iteration: VPCA = PC XOR HASHVAL[iter]; VGHR = VGHR << 1 • Predicted not taken: move to the next iteration • Predicted taken: use the target in the BTB as the target of the indirect branch • Give up and stall if the iteration count exceeds MAX_ITER or the BTB misses
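The prediction loop can be sketched in software as follows. This is an illustrative model under assumptions, not the hardware design: `ToyPredictor` stands in for the real direction predictor and BTB using hash maps, untrained entries are assumed to predict not taken, and the `HASHVAL` constants and `MAX_ITER = 4` are made up.

```cpp
#include <cstdint>
#include <unordered_map>

const int MAX_ITER = 4;  // assumed iteration limit
const std::uint32_t HASHVAL[MAX_ITER] = {  // assumed constants
    0x00000000u, 0x9E3779B9u, 0x7F4A7C15u, 0x85EBCA6Bu};

// Toy stand-ins for the conditional branch predictor and the BTB.
struct ToyPredictor {
    std::unordered_map<std::uint32_t, std::uint32_t> btb;  // VPCA -> target
    std::unordered_map<std::uint64_t, bool> direction;     // (VPCA, VGHR) -> taken?

    static std::uint64_t key(std::uint32_t vpca, std::uint32_t vghr) {
        return (static_cast<std::uint64_t>(vpca) << 32) | vghr;
    }
    bool predict_taken(std::uint32_t vpca, std::uint32_t vghr) const {
        auto it = direction.find(key(vpca, vghr));
        return it != direction.end() && it->second;  // untrained: not taken
    }
};

struct Prediction {
    bool stall;            // true when the predictor gives up
    std::uint32_t target;  // meaningful only when stall is false
    int iterations;        // iterations consumed
};

Prediction vpc_predict(const ToyPredictor &p, std::uint32_t pc, std::uint32_t ghr) {
    std::uint32_t vpca = pc;   // first virtual PC is the real PC
    std::uint32_t vghr = ghr;  // first virtual history is the real GHR
    for (int iter = 1; iter <= MAX_ITER; ++iter) {
        auto hit = p.btb.find(vpca);
        if (hit == p.btb.end())
            return {true, 0, iter};              // BTB miss: stall
        if (p.predict_taken(vpca, vghr))
            return {false, hit->second, iter};   // taken: use the BTB target
        if (iter == MAX_ITER)
            break;
        vpca = pc ^ HASHVAL[iter];               // VPCA for the next iteration
        vghr <<= 1;                              // VGHR for the next iteration
    }
    return {true, 0, MAX_ITER};                  // out of iterations: stall
}
```

With BTB entries for the first three virtual PCs and only the third virtual branch trained as taken, `vpc_predict` returns the third target after three iterations, mirroring the three-step walkthrough on the preceding slides. Unlike the 4-bit history drawn there, this sketch keeps a 32-bit VGHR.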
VPC Training Algorithm • An iterative process when an indirect branch is retired (not on the critical path) • Update the conditional branch predictor: taken for the virtual branch with the correct target, not-taken for virtual branches with a wrong target • Update the replacement policy bits of the correct target in the BTB • If the correct target is not in the BTB, insert it: train the conditional branch predictor as taken for it and replace the least frequently used (LFU) target
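The training steps can be sketched in the same toy style. Everything structural here is an assumption for illustration: the hash-map predictor, the per-entry `use_count` standing in for the replacement policy bits, the preference for an empty slot over an LFU victim, and the `HASHVAL` constants. The slides do not specify these details.

```cpp
#include <cstdint>
#include <unordered_map>

const int MAX_ITER = 4;  // assumed iteration limit
const std::uint32_t HASHVAL[MAX_ITER] = {  // assumed constants
    0x00000000u, 0x9E3779B9u, 0x7F4A7C15u, 0x85EBCA6Bu};

struct BtbEntry {
    std::uint32_t target;
    unsigned use_count;  // stand-in for the replacement policy bits
};

struct ToyPredictor {
    std::unordered_map<std::uint32_t, BtbEntry> btb;    // VPCA -> entry
    std::unordered_map<std::uint64_t, bool> direction;  // (VPCA, VGHR) -> taken?

    static std::uint64_t key(std::uint32_t vpca, std::uint32_t vghr) {
        return (static_cast<std::uint64_t>(vpca) << 32) | vghr;
    }
};

// Called when the indirect branch at `pc` retires with a known target.
void vpc_train(ToyPredictor &p, std::uint32_t pc, std::uint32_t ghr,
               std::uint32_t correct_target) {
    std::uint32_t vpca = pc, vghr = ghr;
    std::uint32_t slot_vpca = pc, slot_vghr = ghr;  // where to insert if absent
    bool have_free = false;                         // found an empty slot?
    unsigned least = ~0u;                           // lowest use_count seen
    for (int iter = 1; iter <= MAX_ITER; ++iter) {
        auto hit = p.btb.find(vpca);
        if (hit != p.btb.end() && hit->second.target == correct_target) {
            p.direction[ToyPredictor::key(vpca, vghr)] = true;  // correct: taken
            ++hit->second.use_count;                            // replacement update
            return;
        }
        if (hit != p.btb.end()) {
            p.direction[ToyPredictor::key(vpca, vghr)] = false; // wrong: not-taken
            if (!have_free && hit->second.use_count < least) {
                least = hit->second.use_count;                  // LFU candidate
                slot_vpca = vpca;
                slot_vghr = vghr;
            }
        } else if (!have_free) {
            have_free = true;                                   // first empty slot
            slot_vpca = vpca;
            slot_vghr = vghr;
        }
        if (iter < MAX_ITER) {
            vpca = pc ^ HASHVAL[iter];
            vghr <<= 1;
        }
    }
    // Correct target not found: insert it and train its virtual branch as taken.
    p.btb[slot_vpca] = BtbEntry{correct_target, 1};
    p.direction[ToyPredictor::key(slot_vpca, slot_vghr)] = true;
}
```

Training the same branch with a second target leaves the first target in place, marks its virtual branch not-taken for that history, and places the new target at the next virtual PC. Once all `MAX_ITER` slots are occupied, a further new target overwrites the entry with the lowest `use_count`.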
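Hardware Cost and Complexity [Figure: block diagram. A hash function combines the PC with the iteration counter to form VPCA; VPCA and VGHR (derived from the GHR) access the branch direction predictor (BP) and the BTB, which produce the taken/not-taken prediction, the direct/indirect indication, and the target address.]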
Outline • Background and Motivation • VPC Prediction • Results • Conclusion
Simulation Methodology • Pin-based x86 simulator • Processor configuration: 4K-entry BTB; 64KB perceptron conditional branch predictor; minimum 30-cycle branch misprediction penalty; 8-wide, 512-entry instruction window • Less aggressive processor (in the paper) • Gshare, O-GEHL conditional branch predictors • Indirect branch intensive benchmarks: 5 SPEC CPU2000, 5 SPEC CPU2006, 2 other C++ • IBM server benchmarks (OLTP) (in the paper)
Different Direction Predictors [Chart: results for direction predictors with 98%, 98.3%, and 99% conditional branch accuracy.] Improving conditional branch prediction accuracy also improves indirect branch prediction accuracy!
VPC vs. Static Devirtualization • Advantages of static devirtualization: enables other compiler optimizations (function inlining); can reduce the number of mispredictions • Disadvantages/limitations of static devirtualization: not all indirect branches can be statically devirtualized; requires extensive static analysis/profiling; lacks adaptivity to run-time input set and phase behavior • VPC prediction can be used with statically devirtualized binaries: 10% improvement on top of static devirtualization
Outline • Background and Motivation • VPC Prediction • Results • Conclusion
Conclusion • VPC prediction dynamically converts indirect branches into multiple conditional branches and uses the existing conditional branch prediction hardware • VPC prediction reduces the branch misprediction penalty without significant extra hardware storage • Baseline: 26% IPC improvement • O-GEHL: 31% IPC improvement • VPC prediction can be an enabler encouraging programmers to use object-oriented programming styles
Thank you! Questions?
VPC vs. Other Indirect BP • TTC: Chang et al. (’96) • Cascaded: Driesen and Hölzle (’98)
Iterative prediction • It does not hurt performance significantly [results chart not reproduced] • Why? Most predictions complete within a few iterations [results chart not reproduced]
Can the BTB be pipelined? • Yes • The next VPC iteration can be started without knowing the outcome of the previous iteration in the pipeline • Consecutive VPC prediction iterations can simply be pipelined • If an iteration turns out not to be needed, its prediction is simply discarded
Is 4K-entry BTB too large? • Pentium 4 has a 4K-entry BTB • IBM Z series (z990) has an 8K-entry BTB • AMD Athlon and Hammer have 2K-entry BTBs
VPC Prediction vs. Compiler-Based Devirtualization (With TTC)
Conditional Br. Prediction Effects VPC prediction reduces the accuracy of conditional branch direction prediction, but only slightly!
VPC Prediction with Static Devirtualization • VPC prediction can be used with statically devirtualized binaries • Not all indirect branches can be devirtualized
VPC Training: Correct Prediction • Retirement of the real instruction: call R1 // PC: L • Known: correctly predicted, predicted iter = 3 • Update the BTB replacement counter
VPC Training: Misprediction Retirement: Real Instruction call R1 // PC: L Known: Mispredicted, correct target address Update the BTB replacement counter
VPC Training: Misprediction Retirement: Real Instruction call R1 // PC: L Known: Mispredicted, correct target address No Target