Microbenchmarks and Mechanisms for Reverse Engineering of Branch Predictor Structures • Vladimir Uzelac and Aleksandar Milenković • LaCASA Laboratory, Electrical and Computer Engineering Department, The University of Alabama in Huntsville • {uzelacv | milenka}@ece.uah.edu
Outline • Motivation and Goals • Reverse Engineering Flow • Predictors Details Deconstruction • Target Predictors • Branch Target Buffer • Indirect Branch Target Buffer • Outcome Predictors • Loop Predictor • Global/Bimodal Predictors • Conclusion
Motivation • If we know the branch predictor organization, we could … • Implement predictor-aware compiler optimizations • Code alignment to avoid BTB conflicts in critical code sections • Code splitting to replace long correlations with shorter ones • Camino environment [PLDI '05] • Have a “gold standard” for academic research • Design tools for rapid BP design space exploration and verification • But details are rarely publicly disclosed • In spite of hints in software optimization manuals • Develop microbenchmarks and mechanisms for reverse engineering of modern branch predictor units
Goals • Microbenchmarks and mechanisms developed to reverse engineer Pentium M’s branch predictor including • Target predictor • BTB and IBTB • Outcome predictor • Loop predictor • Global outcome predictor • Bimodal predictor • Branch predictor parameters • Organization and size of all branch predictor structures • Indexing, allocation, update, replacement policies • Interdependencies between these structures • Validation of our effort through a functional PIN model
Reverse Engineering Flow • Goal: determine a specific branch predictor parameter (e.g., BTB size) • Design benchmark(s) to stress the parameter • Influenced by the type of observable events • Build expectations for relevant event(s) based on back-of-the-envelope analysis • Execute benchmarks and collect events (VTune) • Compare expectations with actual results • Retire findings or modify benchmark • Verify findings using functional PIN model
Branch Target Buffer (BTB) Background: • BTB is a cache structure • Instructions are fetched in 16-byte blocks (Intel) • Can have multiple branches per line • BTB can have multiple hits (same tags) • => Offset field in each entry • => Offset algorithm selects the target among the several offered • Try to find: • Number of BTB entries (NBTB) • Number of sets (NSETS) • Number of ways (NWAYS) • Index, Tag bits • Offset bits and presence of offset algorithm • Bogus branches handling • Replacement policy
Core BTB Test • Use B taken branches at distance D from each other • Code is executed many times to amplify effects on performance counters • Control how these branches are presented to the BTB • To cope with different allocation policies • Here, we execute each branch twice consecutively • Misprediction rate (MPR) as a function of B and D is sufficient to infer the BTB parameters
BTB Capacity Tests • Try to fill the whole BTB using very small distances between branches • Example: 4-way BTB with 512 entries, BTB index = IP[10:4] • NBTB branches can fit for three distances • Branches fill sets consecutively • For larger D, MPR = f(B, D) • Branches jump over sets • For very small D, there are more branches in the line than sets • Mispredictions occur for any D if B > NBTB • MPR = f(B, D, BTB parameters) can be mathematically formalized
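The capacity expectation can be checked with a small software model before running on hardware. The sketch below is an illustration only, not the authors' benchmark: it assumes true LRU replacement, one executed pass per branch (the slides execute each branch twice consecutively), and one entry per distinct branch IP. The function name `btb_mpr` and its parameters are hypothetical; the defaults correspond to the slide's example (4 ways, 512 entries, index = IP[10:4]).

```python
from collections import OrderedDict

def btb_mpr(num_branches, distance, num_sets=128, num_ways=4,
            index_lsb=4, warmup_passes=4, measured_passes=4):
    """Steady-state miss rate of a set-associative BTB model.

    Assumptions (not from the slides): true LRU replacement and one
    BTB entry per distinct branch IP.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    # B taken branches placed at distance D (in bytes) from each other.
    ips = [i * distance for i in range(num_branches)]
    misses = 0
    for pass_no in range(warmup_passes + measured_passes):
        for ip in ips:
            entries = sets[(ip >> index_lsb) % num_sets]
            if ip in entries:
                entries.move_to_end(ip)          # hit: mark most recently used
            else:
                if pass_no >= warmup_passes:     # count only after warm-up
                    misses += 1
                if len(entries) >= num_ways:
                    entries.popitem(last=False)  # evict least recently used
                entries[ip] = True
    return misses / (measured_passes * num_branches)
```

In this model, `btb_mpr(512, d)` is zero only for d = 4, 8, and 16, which matches the "three distances" in the example; for d = 2 or d = 32 every access misses, and 513 branches miss at any distance.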
BTB Set Tests • Try to fill one BTB set by varying distance D • When D > NSET all branches collide in one set • MPR is a function of B only (only 4 branches can fit) • Helps find NWAYS and Index MSB • When D > NSET, change D’ between the last two branches to find Index LSB • The D’ for which MPR disappears determines Index LSB • When D exceeds the Tag MSB distance, false hits occur • Only two branches produce MPR
BTB Findings • Number of BTB entries: 2048 • Number of sets: 512 • Number of ways: 4 • Index = IP[12:4], Tag = IP[21:13], Offset = IP[3:0] • Offset algorithm: When there are multiple hits, selects the target with the lowest offset yet no smaller than the current IP • Bogus branches handling: Evict whole set • Replacement policy: Tree-based pseudo-LRU
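The reported field layout and offset rule can be written down directly. This is a minimal sketch: the bit ranges come from the findings above, while the function names `btb_fields` and `select_target` and the `(offset, target)` entry representation are assumptions made for illustration.

```python
def btb_fields(ip):
    """Split a branch IP into (index, tag, offset) per the reported layout."""
    offset = ip & 0xF            # IP[3:0]: position inside the 16-byte line
    index = (ip >> 4) & 0x1FF    # IP[12:4]: 9 bits select one of 512 sets
    tag = (ip >> 13) & 0x1FF     # IP[21:13]: 9-bit partial tag
    return index, tag, offset

def select_target(hits, fetch_offset):
    """Pick a target when several entries hit in the same set.

    hits: list of (offset, target) pairs whose tags matched (hypothetical
    representation). Returns the target with the lowest offset that is
    not smaller than the current fetch offset, or None if there is none.
    """
    candidates = [h for h in hits if h[0] >= fetch_offset]
    if not candidates:
        return None
    return min(candidates, key=lambda h: h[0])[1]
```

Because the tag stops at bit 21, two branches whose IPs differ only in bit 22 or above produce identical fields, which is the false-hit situation the set tests look for.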
Indirect Branch Target Buffer (IBTB) Background: • Target predictor indexed by program-path information • Try to find: • Which branch parts affect the PIR during update? • How is PIR updated? • Which branch IP bits affect the hash access function? • What is the hash access function? • What are the Index and Tag fields? • What is the IBTB organization?
Path Information Register: Background • PIR is a (shift) register, updated with program branches • Different ways to incorporate a newly encountered branch: • Shift and Add (add to lowest PIR bits) • Shift and Add with interleave (better indexing) • Shift and XOR
PIR Organization Test • PIR is the same prior to both Target1 and Target2 • Branches are at a large distance from each other (> 2^q) • P1.SB1 and P2.SB1 differ in one bit, k = log2(D) • If bit k affects the PIR, there are no collisions, and vice versa • H block: H branches that affect the PIR • For large H, P1.SB1 and P2.SB1 are shifted out of the PIR • Analyzing MPR = f(H, D) answers the following: • PIR history depth • Which branch address/target bits affect the PIR • PIR update mechanism details (XOR or Add…) • P1.SB1 and P2.SB1 are replaced with different types of branches • Both address and target bits are tested in this way
IBTB Access Hash Function Test • Find which PIR and branch IP bits are XORed in the IBTB access hash function • Previously we found XOR • Reuse previous test • A difference at bit k between P1.SB1 and P2.SB1 keeps the targets from colliding • Use two Spies at distance DIP = 2^l • If bits l and k are XORed in the hash function, the difference in PIR values is cancelled
IBTB Organization Test • Employ N indirect branch targets to fill the IBTB in different ways • By using N different PIR values • SB1…SBN create N different PIRs for each iSpy target • SB1…SBN are at distance D = 2^k from each other • MPR = f(D, N) is sufficient to find the IBTB organization • Similar to the BTB
IBTB Predictor Findings • Which branch parts affect the PIR during update? • 15 IP bits from conditional branch IP • Combined 15 bits from indirect branch target and IP • How is PIR updated? • Shifted left by two bits prior to update (XOR) • Which branch IP bits affect the hash access function? • 15 bits, IP[18:4] • What is the hash access function? • XOR • What are the Index and Tag fields? • Index = HASH[13:6], Tag = IP[14,5:0] • What is the IBTB organization? • A direct-mapped cache with 256 entries
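The shift-and-XOR update and the XOR access hash can be sketched as follows. Several details are assumptions rather than findings: the PIR is taken to be 15 bits wide, a conditional branch is assumed to contribute IP[18:4] (the slides only say "15 IP bits"), and the indirect-branch update, which mixes target and IP bits, is not modelled. All names are hypothetical.

```python
PIR_BITS = 15                       # assumed PIR width
PIR_MASK = (1 << PIR_BITS) - 1

def ip_bits(ip):
    """IP[18:4]: the 15 IP bits used by the access hash (assumed for updates too)."""
    return (ip >> 4) & PIR_MASK

def pir_update(pir, cond_branch_ip):
    """Shift the PIR left by two bits, then XOR in the branch bits."""
    return ((pir << 2) ^ ip_bits(cond_branch_ip)) & PIR_MASK

def ibtb_index(pir, indirect_ip):
    """Index = HASH[13:6] of PIR XOR IP[18:4]; 8 bits select 1 of 256 entries."""
    hashed = pir ^ ip_bits(indirect_ip)
    return (hashed >> 6) & 0xFF
```

Two properties of this model mirror the tests above: a one-bit difference at PIR bit k is cancelled when the two spies differ at IP bit k + 4, so both paths reach the same entry, and with a two-bit shift per update a branch leaves a 15-bit register after eight further updates.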
Loop Predictor • What do we know? • Each entry has two counters • Counter MAX_VAL stores the loop branch maximum count value • Counter CURR_VAL stores the loop branch current iteration • Assumptions: • Loop BP is an IP indexed cache • Try to find: • Counters’ length • Size and organization of the loop branch predictor buffer (Loop BPB) • Allocation policy (when a branch becomes a candidate for a loop branch) • Training policy – how new loop branch MAX_VAL is set
Loop Counters Size Test • Test: “spy” loop (LSpy) has loop modulo L • MPR exists if L > MAX_VAL counter length • Results: Maximum predictable L is 64 (6-bit counters)
Loop BPB Capacity and Set Tests • Similar to the BTB Capacity/Set tests • Employ B loops at the distance D from each other • MPR is a function of B, D and Loop BPB parameters similarly as for the BTB
Loop Predictor Findings • Counters’ length: 6 bits • Size and organization of the loop branch predictor buffer • Two-way cache with 128 entries • Index = IP[9:4], Tag = IP[15:10] • Allocation policy: Branch allocated on first opposite outcome • Training policy: Set MAX_VAL during 2nd loop iteration
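The counter-size result can be reproduced with a toy model of a single loop-predictor entry. This is a sketch under stated assumptions, not the hardware policy: "loop modulo L" is read as L - 1 taken outcomes followed by one not-taken outcome, an untrained branch is predicted taken, and MAX_VAL is learned at the first loop exit rather than by the exact allocation and training sequence listed above. The function name is hypothetical.

```python
def loop_mispredictions(loop_modulo, runs=20, warmup_runs=2, counter_bits=6):
    """Count steady-state mispredictions of one loop branch."""
    counter_max = (1 << counter_bits) - 1   # 63 for 6-bit counters
    max_val = None                          # None: entry not trained
    misses = 0
    for run in range(runs):
        curr_val = 0                        # CURR_VAL restarts with each loop run
        for i in range(loop_modulo):
            taken = i < loop_modulo - 1     # last iteration exits the loop
            if max_val is None:
                predicted_taken = True      # assumed fallback prediction
            else:
                predicted_taken = curr_val < max_val
            if run >= warmup_runs and predicted_taken != taken:
                misses += 1
            if taken:
                curr_val += 1
            else:
                # Train only if the trip count fits in the counter.
                max_val = curr_val if curr_val <= counter_max else None
    return misses
```

With 6-bit counters the model predicts loops up to L = 64 without error and mispredicts the exit of every longer loop, which is the signature the LSpy test looks for.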
Global and Bimodal Predictor • What do we know? • All branches predicted dynamically • At least one predictor not tagged • Assumptions: • Cascade organization • Bimodal predictor is not tagged • Global predictor can correct Bimodal • Global is path indexed (BHR register) • Try to find: • Organization of Global Predictor • Indexing to Global predictor (BHR and hashing function details) • Bimodal predictor details • Size only (not tagged) • Indexing bits (IP indexed)
BHR Organization Test • Similar to PIR Organization test • iSpy with two targets is replaced with a conditional branch (cSpy) with two outcomes • MPR = f(D, H) is sufficient to find the BHR organization • Results: • BHR is affected in the same way as the PIR • BHR and PIR are the same register
Global Predictor Organization Test • Similar to IBTB Organization test • N different paths to cSpyN (always not taken) • PIR values depend on distance D • cSpyN allocated to up to N different entries • Similar to IBTB, MPR = f(D, N) is sufficient to determine the predictor organization • Eliminate correct prediction from Bimodal predictor: • cSpyT distance from cSpyN is large, so both target the same Bimodal entry • Path occurrence pattern: T*PT, PN1, T*PT, PN2, …, T*PT, PNN, … • Eliminate correct prediction from Loop Predictor if needed
Bimodal Predictor Organization Test • Reuse the previous test • Make contentions in Global predictor • Change distance between cSpyT and cSpyN to try predicting branches with the Bimodal predictor • DG = 2^k • No contentions in Bimodal Predictor if bit k is used for Bimodal Index
Global and Bimodal Predictor Findings • Global: • 4-way cache structure with 2048 entries • Accessed with the hash function - PIR XORed with conditional branch IP • 9 bits used as the index, 6 bits as the tag • Bimodal: • A table with 4096 bimodal counters • Indexed with IP [11:0]
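The bimodal finding explains why the distance between cSpyT and cSpyN matters. The sketch below models an untagged table of 4096 counters indexed with IP[11:0]; the 2-bit saturating counters and their initial value are assumptions (the slides give only the table size and index bits), and the class name is hypothetical.

```python
class BimodalModel:
    """Untagged table of saturating counters indexed by IP[11:0]."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [1] * entries   # assumed start: weakly not taken

    def index(self, ip):
        return ip & (self.entries - 1)  # IP[11:0] for 4096 entries

    def predict(self, ip):
        return self.counters[self.index(ip)] >= 2   # True means taken

    def update(self, ip, taken):
        i = self.index(ip)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
```

In this model, two branches at distance 2^k share a counter whenever k >= 12, so an always-taken cSpyT trains the entry that an always-not-taken cSpyN then reads; for k < 12 they use separate counters and the contention disappears.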
Limitations and Verification • Generalization of the reverse engineering flow is difficult • Different branch prediction organizations • Implementation of microbenchmarks is a challenging task • Balance observability of certain parameters against isolation of different parameters that share the same event • Some prior knowledge of the targeted predictor is needed • E.g., prediction in cache lines (AMD K8) • Tests must cover a large design space • Verification • Using PIN model: achieved more than 95% accuracy
Conclusion • Microbenchmarks and mechanisms for reverse engineering of path- or IP-indexed predictor structures • Demonstrated on Pentium M • BTB, IBTB, Loop, Global/Bimodal