Microbenchmarks and Mechanisms for Reverse Engineering of Branch Predictor Structures

Vladimir Uzelac and Aleksandar Milenković, LaCASA Laboratory, Electrical and Computer Engineering Department, The University of Alabama in Huntsville, {uzelacv | milenka}@ece.uah.edu

Outline • Motivation and Goals • Reverse Engineering Flow • Predictors Details Deconstruction • Target Predictors • Branch Target Buffer • Indirect Branch Target Buffer • Outcome Predictors • Loop Predictor • Global/Bimodal Predictors • Conclusion

Motivation • If we know the branch predictor organization, we could … • Implement predictor-aware compiler optimizations • Code alignment to avoid BTB conflicts in critical code sections • Code splitting to replace long correlations with shorter ones • Camino environment [PLDI '05] • Have a “gold standard” for academic research • Design tools for rapid BP design space exploration and verification • But details are rarely publicly disclosed • In spite of hints in software optimization manuals • => Develop microbenchmarks and mechanisms for reverse engineering of modern branch predictor units

Goals • Microbenchmarks and mechanisms developed to reverse engineer Pentium M’s branch predictor including • Target predictor • BTB and IBTB • Outcome predictor • Loop predictor • Global outcome predictor • Bimodal predictor • Branch predictor parameters • Organization and size of all branch predictor structures • Indexing, allocation, update, replacement policies • Interdependencies between these structures • Validation of our effort through a functional PIN model

Reverse Engineering Flow • Goal: determine a specific branch predictor parameter (e.g., BTB size) • Design benchmark(s) to stress the parameter • Influenced by the type of observable events • Build expectations for relevant event(s) based on back-of-the-envelope analysis • Execute benchmarks and collect events (VTune) • Compare expectations with actual results • Retire findings or modify the benchmark • Verify findings using a functional PIN model

Branch Target Buffer (BTB) • Background: • BTB is a cache structure • Instructions are fetched in 16-byte blocks (Intel) • A line can hold multiple branches • BTB can have multiple hits (same tags) • => Offset field in each entry • => Offset algorithm selects the target among the several offered • Try to find: • Number of BTB entries (NBTB) • Number of sets (NSETS) • Number of ways (NWAYS) • Index, Tag bits • Offset bits and presence of offset algorithm • Bogus branches handling • Replacement policy

Core BTB Test • Use B taken branches at the distance D from each other • Code is executed many times to amplify effects on performance counters • Control how these branches are presented to the BTB • To cope with different allocation policies • Here, we execute each branch twice consecutively • Misprediction rate (MPR) as a function of B and D is sufficient to determine the BTB parameters

BTB Capacity Tests • Try to fill the whole BTB using very small distances between branches • Example: 4-way BTB with 512 entries, BTB index = IP[10:4] • NBTB branches can fit for three distances • Branches fill sets consecutively • For larger D, MPR = f(B, D) • Branches jump over sets • For very small D, there are more branches in the line than ways • Mispredictions exist for any D if B > NBTB • MPR = f(B, D, BTB parameters) can be mathematically formalized
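The capacity behaviour in the example above can be reproduced with a small simulation. The Python sketch below is not the authors' microbenchmark: it models the example 4-way, 512-entry BTB with index IP[10:4], and it assumes true LRU replacement, one lookup per branch per pass, and entries keyed by the full branch IP. The function name `btb_miss_rate` and all of its parameters are hypothetical.

```python
from collections import OrderedDict


def btb_miss_rate(num_branches, distance, num_sets=128, num_ways=4,
                  index_lsb=4, passes=20, base_ip=0x10000):
    """Fraction of BTB lookups that miss for B taken branches spaced D bytes apart.

    Assumptions (not from the slides): true LRU replacement, entries keyed by
    the full branch IP, and a warm-up pass that is excluded from the count.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    ips = [base_ip + i * distance for i in range(num_branches)]
    misses = 0
    lookups = 0
    for p in range(passes):
        for ip in ips:
            entries = sets[(ip >> index_lsb) % num_sets]
            hit = ip in entries
            if hit:
                entries.move_to_end(ip)          # mark as most recently used
            else:
                if len(entries) >= num_ways:
                    entries.popitem(last=False)  # evict least recently used
                entries[ip] = True
            if p > 0:                            # skip the warm-up pass
                lookups += 1
                misses += not hit
    return misses / lookups


if __name__ == "__main__":
    for d in (2, 4, 8, 16, 32):
        print(d, btb_miss_rate(512, d))
```

Under these assumptions, `btb_miss_rate(512, D)` is zero for D = 4, 8 and 16 and rises to 1.0 for D = 2 (eight branches per line, four ways) and D = 32 (every other set skipped), which matches the three fitting distances named in the example.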

BTB Set Tests • Try to fill one BTB set while varying distance D • When D > NSET all branches collide in one set • MPR is a function of B only (only 4 branches can fit) • Helps find NWAYS and Index MSB • When D > NSET, change the distance D’ between the last two branches to find Index LSB • The D’ for which mispredictions disappear determines Index LSB • When D exceeds the distance covered by the Tag MSB, false hits occur • Only two branches produce mispredictions

BTB Findings • Number of BTB entries: 2048 • Number of sets: 512 • Number of ways: 4 • Index = IP[12:4], Tag = IP[21:13], Offset = IP[3:0] • Offset algorithm: on multiple hits, selects the target with the lowest offset that is not smaller than the current IP offset • Bogus branches handling: evict the whole set • Replacement policy: tree-based pseudo-LRU
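As an illustration of the reported field layout and offset algorithm, the Python sketch below splits an IP into index, tag and offset and picks a target among multiple hits. The helper names `btb_fields` and `select_target` are hypothetical, and representing the hits as `(offset, target)` pairs is an assumption made for the example.

```python
def btb_fields(ip):
    """Split an IP into the reported BTB fields: (index, tag, offset)."""
    offset = ip & 0xF             # Offset = IP[3:0]
    index = (ip >> 4) & 0x1FF     # Index  = IP[12:4], 512 sets
    tag = (ip >> 13) & 0x1FF      # Tag    = IP[21:13]
    return index, tag, offset


def select_target(hits, fetch_offset):
    """Pick the hit with the lowest offset that is not below fetch_offset.

    hits is a list of (offset, target) pairs for entries whose tags matched
    (a hypothetical representation). Returns None when no entry qualifies.
    """
    candidates = [h for h in hits if h[0] >= fetch_offset]
    if not candidates:
        return None
    return min(candidates, key=lambda h: h[0])[1]


if __name__ == "__main__":
    print(btb_fields(0x12345))
    print(select_target([(2, 0x100), (9, 0x200)], fetch_offset=4))
```

Because the tag stops at bit 21, two branches whose addresses differ only in bit 22 or above produce identical fields, which is the false-hit situation the set tests look for.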

Indirect Branch Target Buffer (IBTB) • Background: • Target predictor indexed by program-path information • Try to find: • Which branch parts affect the PIR during update? • How is PIR updated? • Which branch IP bits affect the hash access function? • What is the hash access function? • What are the Index and Tag fields? • What is the IBTB organization?

Path Information Register: Background • PIR is a (shift) register – updated with program branches • Different ways to fold a newly encountered branch into the PIR: • Shift and Add (add to lowest PIR bits) • Shift and Add with interleave (better indexing) • Shift and XOR

PIR Organization Test • PIR is the same prior to both Target1 and Target2 • Branches are at a large distance from each other (> 2^q) • P1.SB1 and P2.SB1 differ in one bit, k = log2(D) • If bit k affects the PIR there are no collisions, and vice versa • H block: H branches that affect the PIR • For large H, P1.SB1 and P2.SB1 are shifted out of the PIR • Analysis of MPR = f(H, D) gives the following answers • PIR history depth • Which branch address/target bits affect the PIR • PIR update mechanism details (XOR or Add…) • P1.SB1 and P2.SB1 replaced with different types of branches • Both address and target bits tested in this way

IBTB Access Hash Function Test • Find which PIR and branch IP bits are XORed in the IBTB access hash function • Previously we found XOR • Reuse the previous test • A difference between P1.SB1 and P2.SB1 at bit k keeps the targets from colliding • Use two spies at distance DIP = 2^l • If bits l and k are XORed in the hash function, the difference in PIR values is cancelled

IBTB Organization Test • Employ N indirect branch targets to fill the IBTB in different ways • By using N different PIR values • SB1…SBN create N different PIRs for each iSpy target • SB1…SBN are at distance D = 2^k from each other • MPR = f(D, N) is sufficient to find the IBTB organization • Similar to the BTB tests

IBTB Predictor Findings • Which branch parts affect the PIR during update? • 15 IP bits from the conditional branch IP • Combined 15 bits from the indirect branch target and IP • How is PIR updated? • Shifted left by two bits prior to update (XOR) • Which branch IP bits affect the hash access function? • 15 bits, IP[18:4] • What is the hash access function? • XOR • What are the Index and Tag fields? • Index = HASH[13:6], Tag = IP[14,5:0] • What is the IBTB organization? • A direct-mapped cache with 256 entries
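The reported PIR update and IBTB index computation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the hardware's exact logic: the PIR is assumed to be 15 bits wide, the bits folded in on an update are passed by the caller, and the function names `ip_bits`, `update_pir` and `ibtb_index` are hypothetical. The tag field is left out.

```python
PIR_BITS = 15                      # assumed PIR width, not stated on the slides
PIR_MASK = (1 << PIR_BITS) - 1


def ip_bits(ip):
    """Return the 15 bits IP[18:4] that take part in the hash."""
    return (ip >> 4) & PIR_MASK


def update_pir(pir, branch_bits):
    """Shift the PIR left by two bits, then XOR in 15 bits from the branch."""
    return ((pir << 2) ^ branch_bits) & PIR_MASK


def ibtb_index(pir, ip):
    """Index into the 256-entry direct-mapped IBTB: HASH[13:6]."""
    hash_value = pir ^ ip_bits(ip)
    return (hash_value >> 6) & 0xFF


if __name__ == "__main__":
    pir = 0
    for branch_ip in (0x401000, 0x401230, 0x4025F0):
        pir = update_pir(pir, ip_bits(branch_ip))
    print(hex(pir), ibtb_index(pir, 0x403000))
```

With a 15-bit register and a 2-bit shift, a branch's contribution is gone after eight updates, and flipping a PIR bit together with the IP bit that lands on the same hash position leaves the index unchanged, which is the cancellation the hash function test relies on.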

Loop Predictor • What do we know? • Each entry has two counters • Counter MAX_VAL stores the loop branch maximum count value • Counter CURR_VAL stores the loop branch current iteration • Assumptions: • Loop BP is an IP indexed cache • Try to find: • Counters’ length • Size and organization of the loop branch predictor buffer (Loop BPB) • Allocation policy (when a branch becomes a candidate for a loop branch) • Training policy – how new loop branch MAX_VAL is set

Loop Counters Size Test • Test: “spy” loop (LSpy) has loop modulo L • Mispredictions exist if L exceeds the range of the MAX_VAL counter • Results: Maximum predictable L is 64 (6-bit counters)

Loop BPB Capacity and Set Tests • Similar to the BTB Capacity/Set tests • Employ B loops at the distance D from each other • MPR is a function of B, D and Loop BPB parameters similarly as for the BTB

Loop Predictor Findings • Counters’ length: 6 bits • Size and organization of the loop branch predictor buffer • Two-way cache with 128 entries • Index = IP[9:4], Tag = IP[15:10] • Allocation policy: branch allocated on first opposite outcome • Training policy: set MAX_VAL during the 2nd loop iteration

Global and Bimodal Predictor • What do we know? • All branches predicted dynamically • At least one predictor not tagged • Assumptions: • Cascade organization • Bimodal predictor is not tagged • Global predictor can correct Bimodal • Global is path indexed (BHR register) • Try to find: • Organization of Global Predictor • Indexing to Global predictor (BHR and hashing function details) • Bimodal predictor details • Size only (not tagged) • Indexing bits (IP indexed)

BHR Organization Test • Similar to PIR Organization test • iSpy with two targets replaced with the conditional branch (cSpy) with two outcomes • MPR = f(D, H) is sufficient to find the BHR organization • Results: • BHR is affected in the same way as the PIR • BHR and PIR are the same register

Global Predictor Organization Test • Similar to IBTB Organization test • N different paths to cSpyN (always not taken) • PIR values depend on distance D • cSpyN allocated to up to N different entries • Similar to IBTB, MPR = f(D, N) is sufficient to determine the predictor organization • Eliminate correct prediction from the Bimodal predictor: • cSpyT distance from cSpyN is large, so both target the same Bimodal entry • Path occurrence pattern: T*PT, PN1, T*PT, PN2, …, T*PT, PNN, … • Eliminate correct prediction from the Loop Predictor if needed

Bimodal Predictor Organization Test • Reuse the previous test • Create contention in the Global predictor • Change the distance between cSpyT and cSpyN to try predicting branches with the Bimodal predictor • DG = 2^k • No contention in the Bimodal predictor if bit k is used for the Bimodal index

Global and Bimodal Predictor Findings • Global: • 4-way cache structure with 2048 entries • Accessed with the hash function: PIR XORed with conditional branch IP • 9 bits used as the index, 6 bits as the tag • Bimodal: • A table with 4096 bimodal counters • Indexed with IP[11:0]
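The untagged bimodal table can be modelled directly from the reported size and index bits. In the Python sketch below, the 2-bit saturating counters and their initial "weakly not taken" state are assumptions (the slides only say "bimodal counters"), and the class name `BimodalTable` is hypothetical. The global predictor is not modelled because the slides do not say which hash bits form its index and tag.

```python
class BimodalTable:
    """Untagged table of saturating counters indexed with IP[11:0]."""

    def __init__(self, entries=4096):
        # Assumed 2-bit counters: 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * entries

    def index(self, ip):
        return ip % len(self.counters)     # IP[11:0] for 4096 entries

    def predict(self, ip):
        return self.counters[self.index(ip)] >= 2

    def update(self, ip, taken):
        i = self.index(ip)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


if __name__ == "__main__":
    table = BimodalTable()
    for _ in range(3):
        table.update(0x401234, True)
    # A branch 4096 bytes away shares the same counter because there is no tag.
    print(table.predict(0x401234), table.predict(0x401234 + 0x1000))
```

Since the table has no tags, two branches whose addresses differ only above bit 11 share one counter, which is the contention the bimodal organization test creates on purpose by varying the distance between cSpyT and cSpyN.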

Limitations and Verification • Generalization of the reverse engineering flow is difficult • Different branch prediction organizations • Implementation of microbenchmarks is a challenging task • Must balance observability of a parameter against isolation from other parameters that share the same event • Some prior knowledge of the targeted predictor is needed • E.g., prediction in cache lines (AMD K8) • Tests must cover a large design space • Verification • Using the PIN model, achieved more than 95% accuracy

Conclusion • Microbenchmarks and mechanisms for reverse engineering of path- or IP-indexed predictor structures • Demonstrated on Pentium M • BTB, IBTB, Loop, Global/Bimodal
