Branch-mispredict Level Parallelism for Control Independence
Kshitiz Malik, Mayank Agarwal, Sam Stone, Kevin Woley, Matthew Frank
Implicitly Parallel Architectures Group
University of Illinois at Urbana-Champaign

Summary
• Mispredicted branches: a major bottleneck to instruction-level parallelism (ILP)
• BLP: an application property that can help
• Control-independence architectures can exploit BLP
• Current policies are ill-suited to BLP
• BLP-targeted policies lead to dramatic improvements
Implicitly Parallel Architectures Group, UIUC

Outline
• The Branch Prediction Wall
• Application BLP to Scale the Wall
• Architectures to Exploit BLP
• Maximizing BLP

The Branch Prediction Wall
• Successful branch prediction => high ILP
• Parallel and out-of-order (OOO) execution across branches
• Mispredicted branches are catastrophic
• All instructions beyond a mispredict are squashed
• Mispredicted branches are fetched and executed serially
[Figure: fetch timeline for mispredicted branches B1 and B2. Labels: B1 Resolved, B2 Resolved, Squashed. Legend: Useful Fetch, Mispredicted Branch, Wasted Fetch.]

Pushing the Wall Farther Out
• Better branch predictors
• Multipath execution
• Predication
• Parallel resolution of independent mispredicts
• Exploit the existence of BLP in the application

Branch-mispredict Level Parallelism (BLP)
• Resolve multiple mispredicted branches in parallel
• Overlap the penalty of individual mispredicts
• Higher rate of mispredict resolution
• Increased performance
[Figure: fetch timelines for mispredicted branches B1 and B2 in two cases, Superscalar and BLP. Label: Overlap. Legend: Useful fetch, Mispredicted Branch, Wasted fetch.]

Conditions for BLP to Exist
• For parallel resolution, mispredicts need to be
• Control-independent (CI)
• Data-independent (DI)
• A and B don’t lead to BLP: B is control-dependent (CD) on A
• A and E don’t lead to BLP: E is data-dependent on A, although control-independent (CIDD)
• A and G can lead to BLP: control- and data-independent (CIDI)
[Figure: control-flow and data-flow graph. Nodes: j++, A (mispredicted branch), B, C, D: i++, E: if (i), F, G: if (j<10). Region labels: CD, CIDD, CIDI. Edge types: Control Flow, Data Flow.]

CIDD Branches → No BLP
[Figure: the same control-flow and data-flow graph (nodes j++, A, B, C, D: i++, E: if (i), F, G: if (j<10)) beside a fetch timeline labelled B, A, G, E. Label: No Overlap. Legend: Useful Fetch, Mispredicted Branch, Wasted Fetch.]

CIDI Branches → BLP
[Figure: the same control-flow and data-flow graph (nodes j++, A, B, C, D: i++, E: if (i), F, G: if (j<10)) beside a fetch timeline labelled B, A, G, E. Label: Overlap. Legend: Useful fetch, Mispredicted Branch, Wasted Fetch.]
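
The three cases on the "Conditions for BLP to Exist" slide can be written as a small classifier. This is only an illustrative sketch: the `Relation` enum, the `classify` and `can_overlap` helpers, and the encoding of dependences as `(later, earlier)` pairs are assumptions made for this example, not something the slides define. The dependence sets are copied from the slide's bullets for branches A, B, E and G.

```python
from enum import Enum


class Relation(Enum):
    CD = "control-dependent"
    CIDD = "control-independent, data-dependent"
    CIDI = "control-independent, data-independent"


def classify(later, earlier, control_deps, data_deps):
    """Classify how branch `later` relates to the earlier mispredict `earlier`.

    control_deps and data_deps are sets of (later, earlier) pairs
    (a hypothetical encoding chosen for this sketch).
    """
    if (later, earlier) in control_deps:
        return Relation.CD
    if (later, earlier) in data_deps:
        return Relation.CIDD
    return Relation.CIDI


def can_overlap(later, earlier, control_deps, data_deps):
    """Only CIDI pairs can be resolved in parallel, i.e. contribute to BLP."""
    return classify(later, earlier, control_deps, data_deps) is Relation.CIDI


# Relations stated on the slide: B is control-dependent on A,
# E is data-dependent on A, G is independent of A in both senses.
control_deps = {("B", "A")}
data_deps = {("E", "A")}

for branch in ("B", "E", "G"):
    relation = classify(branch, "A", control_deps, data_deps)
    print(branch, relation.name, can_overlap(branch, "A", control_deps, data_deps))
```

Checking control dependence first mirrors the slide's ordering: a control-dependent branch is ruled out before data dependence is even considered.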

How to Measure BLP
• Average number of simultaneous independent mispredicts
• Counts unresolved branches (fetched but not completed)
• Measured only when at least one mispredict is unresolved
• Only branches that eventually retire are counted
• Superscalar: BLP is exactly 1
• Case 2: BLP is more than 1
[Figure: timelines for mispredicts A and B, with the overlapped case annotated BLP = 1.5. Labels: Overlap, Mispredict Penalty, Mispredict Fetched, Correct path fetch.]
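
The definition above can be turned into a short calculation. The sketch below assumes each mispredicted branch is described by a hypothetical `(fetch_cycle, resolve_cycle)` half-open interval, and that the caller has already kept only independent mispredicts that eventually retire. The function name `measure_blp` and the cycle numbers are made up for illustration and are not measurements from the talk.

```python
def measure_blp(intervals):
    """Average number of unresolved mispredicts, over cycles with at least one.

    intervals: list of (fetch_cycle, resolve_cycle) pairs; a branch counts as
    unresolved for fetch_cycle <= cycle < resolve_cycle.
    """
    if not intervals:
        return 0.0
    first = min(fetch for fetch, _ in intervals)
    last = max(resolve for _, resolve in intervals)

    busy_cycles = 0        # cycles with at least one unresolved mispredict
    outstanding_total = 0  # sum of unresolved mispredicts over those cycles
    for cycle in range(first, last):
        outstanding = sum(1 for fetch, resolve in intervals
                          if fetch <= cycle < resolve)
        if outstanding > 0:
            busy_cycles += 1
            outstanding_total += outstanding
    return outstanding_total / busy_cycles if busy_cycles else 0.0


# Serial resolution, as on a superscalar: the second mispredict is only
# fetched after the first resolves.
serial = [(0, 10), (10, 20)]

# Overlapped resolution: the second mispredict is fetched while the first
# is still unresolved.
overlapped = [(0, 12), (4, 16)]

print(measure_blp(serial))      # 1.0
print(measure_blp(overlapped))  # 1.5
```

In the overlapped case, two mispredicts are outstanding for 8 of the 16 busy cycles and one for the other 8, giving (8 × 2 + 8 × 1) / 16 = 1.5.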

Control Independence Architectures
• Start (spawn) a task at a control-independent (CI) point
• Execute it concurrently with the spawner task
• Can potentially exploit BLP

Control-Independent Spawning
• E is control-independent of branch B
• Spawn E as a new task at B
• Execute concurrently
• Data-dependent instructions are delayed until the previous thread completes
[Figure: control-flow graph with blocks A, B, C, D, E, F. Labels: Spawner, Spawnee.]
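
The slides do not say how the control-independent point is located. One common choice, used here purely as an assumption, is the immediate post-dominator of the branch: the first block that every path from the branch must reach. The sketch below also assumes a control-flow graph shaped like the figure (A → B, B → C or D, both rejoining at E, then F); the edge list and function names are hypothetical.

```python
def post_dominators(successors, exit_node):
    """Iteratively compute the set of post-dominators of every node."""
    nodes = set(successors)
    pdom = {node: set(nodes) for node in nodes}
    pdom[exit_node] = {exit_node}

    changed = True
    while changed:
        changed = False
        for node in nodes - {exit_node}:
            succ_sets = [pdom[s] for s in successors[node]]
            # A node is post-dominated by itself plus whatever
            # post-dominates all of its successors.
            common = set.intersection(*succ_sets) if succ_sets else set()
            updated = {node} | common
            if updated != pdom[node]:
                pdom[node] = updated
                changed = True
    return pdom


def spawn_point(branch, successors, exit_node):
    """Return the immediate post-dominator of `branch`, or None."""
    pdom = post_dominators(successors, exit_node)
    strict = pdom[branch] - {branch}
    if not strict:
        return None
    # The closest strict post-dominator is the one that is itself
    # post-dominated by the most nodes.
    return max(strict, key=lambda node: len(pdom[node]))


# Assumed shape of the slide's control-flow graph.
cfg = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": ["E"],
    "D": ["E"],
    "E": ["F"],
    "F": [],
}

print(spawn_point("B", cfg, "F"))  # E
```

Under these assumptions the spawn point for branch B is E, matching the slide: whichever of C or D executes, control reaches E.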

Control-Independent Spawning for BLP
[Figure: execution timelines for blocks A to F across CPU1, CPU2 and CPU3. Labels: Task Spawn, Spawned Task, Resolve Mispredict, Reconnect. Legend: Useful fetch, Mispredicted Branch, Wasted Fetch.]

Targeting BLP in CI Architectures
• CI architectures can exploit BLP
• But conventional policies are ill-suited to BLP
• Two policies target BLP and improve performance:
• Spawn selection
• Smarter data-dependence handling

Where to Look for BLP?
• Example: get_heap_head() from VPR Route
• ~30% of instructions
• ~40% of mispredicts
• Limited BLP in the innermost loop
• Might need to look farther out for CIDI branches
[Figure: control-flow and data-flow graph of get_heap_head(). Nodes: heap_head = ..; if (...); inner loop with P: if (heap[i+1] < heap[i]), Q: if (heap[i] > val), swap (i, from), Z: i++; if (...); R: ret_val = heap_head; S: if(ret_val.flag = 0); T: if(ret_val.flag = 0). Region labels: CIDD, CIDI. Legend: Low-Confidence Branch, Control Flow, Data Flow.]

Spawn Selection for BLP
• High-BLP spawn R → high performance
• Naïve policies will select Q
[Figure: the get_heap_head() control-flow and data-flow graph from the "Where to look for BLP?" slide. Nodes: inner-loop branches P and Q, Z: i++ (CIDD), R: ret_val = heap_head, and branches S and T (CIDI). Legend: Low-Confidence Branch, Control Flow, Data Flow.]
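
One way to picture a BLP-targeted spawn policy is to score each candidate spawn point by how many hard-to-predict, data-independent branches its task would expose. The sketch below is only an illustration under that assumption; the slides do not give the actual selection algorithm. The `Branch` record, the scoring function, and the encoding of the get_heap_head() example (Q marked data-dependent, S and T marked low-confidence and data-independent) are hypothetical readings of the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    name: str
    low_confidence: bool      # likely to mispredict
    depends_on_spawner: bool  # True means CIDD, False means CIDI


def blp_score(task_branches):
    """Count low-confidence branches that could resolve in parallel (CIDI)."""
    return sum(1 for b in task_branches
               if b.low_confidence and not b.depends_on_spawner)


def select_spawn_point(candidates):
    """Pick the candidate whose spawned task exposes the most CIDI mispredicts."""
    return max(candidates, key=lambda point: blp_score(candidates[point]))


# Hypothetical encoding of the two candidate spawn points in the figure.
candidates = {
    # Spawning at Q stays in the inner loop; Q depends on i (CIDD).
    "Q": [Branch("Q", low_confidence=True, depends_on_spawner=True)],
    # Spawning at R reaches S and T, which the figure labels CIDI.
    "R": [Branch("S", low_confidence=True, depends_on_spawner=False),
          Branch("T", low_confidence=True, depends_on_spawner=False)],
}

best = select_spawn_point(candidates)
print(best, {point: blp_score(bs) for point, bs in candidates.items()})
```

With this encoding the policy prefers R over Q, which is the outcome the slide describes for a high-BLP spawn.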

Exploiting Choice in Spawn Selection
• BLP is a good indicator of performance
• Effective strategy for task selection

Balanced Dependence Handling
• Impacts exploitable BLP
• Conservative handling => CIDI mispredicts marked as CIDD
• Blind speculation => wastage from misspeculation
• Balanced approach
• Adapts dynamically
• Large improvements in BLP
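
The slides do not describe the mechanism behind the balanced approach. As a rough illustration of "adapts dynamically", the sketch below uses a per-instruction saturating counter, a standard hardware-predictor idiom swapped in here as an assumption. It speculates past a possible dependence while speculation keeps succeeding and falls back to waiting after a violation. The class name, thresholds and update rule are all hypothetical.

```python
class BalancedDependencePredictor:
    """Saturating-counter sketch: speculate only where it has been paying off."""

    def __init__(self, threshold=2, max_count=3):
        self.threshold = threshold
        self.max_count = max_count
        self.counters = {}  # instruction address -> confidence counter

    def should_speculate(self, pc):
        # Unseen instructions start optimistic (treated as independent).
        return self.counters.get(pc, self.max_count) >= self.threshold

    def update(self, pc, dependence_observed):
        count = self.counters.get(pc, self.max_count)
        if dependence_observed:
            # Penalize quickly: misspeculation wastes work.
            count = max(0, count - 2)
        else:
            # Recover slowly toward speculating again.
            count = min(self.max_count, count + 1)
        self.counters[pc] = count


predictor = BalancedDependencePredictor()
pc = 0x40  # hypothetical instruction address

print(predictor.should_speculate(pc))   # True: start optimistic
predictor.update(pc, dependence_observed=True)
print(predictor.should_speculate(pc))   # False: wait after a violation
predictor.update(pc, dependence_observed=False)
print(predictor.should_speculate(pc))   # True: confidence has recovered
```

The asymmetric update (drop by two, recover by one) is one way to sit between the two extremes on the slide: it avoids treating every possible dependence as real, but stops speculating quickly where dependences actually occur.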

BLP → Performance
• 4-core setup
• Baseline: 4-wide OOO core
• Aggressive branch predictor
• Aggressive backend
• Exploiting BLP is an effective heuristic for improving performance

Conclusions
• Branch prediction is a major bottleneck to ILP
• Applications possess significant amounts of “Branch-Mispredict Level Parallelism (BLP)”
• Control-independence architectures can exploit BLP
• But current policies are ill-suited to BLP
• Two techniques dramatically increase exploited BLP
• Spawn selection and balanced dependence handling
• High BLP exploited => high performance

Thank You
Questions?