Predicate-Aware Scheduling: A Technique for Reducing Resource Constraints

Mikhail Smelyanskiy, Scott Mahlke, Edward Davidson (Department of EECS, University of Michigan)
Hsien-Hsin (Sean) Lee (School of ECE, Georgia Institute of Technology)

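Motivation
• Predication eliminates branch instructions
  • but increases resource requirements
• Predicate-aware scheduling oversubscribes resources
  • reduces resource requirements
  • reduces schedule length
[Figure: control-flow graph; block A ends with "br cond", with T and F successor edges; remaining block labels: B, C, D]
Predicated schedule with B and C sharing a cycle:
    0: A
    1: p1,p2=pred_def(cond)
    2: B if p1    C if p2
    3: E
Predicated schedule with B and C in separate cycles:
    0: A
    1: p1,p2=pred_def(cond)
    2: B if p1
    3: C if p2
    4: E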

Potential for Disjoint Operations
• Combining disjoint operations reduces the dynamic operation count by 13%

Outline
• Motivation
• Resource Pressure Problem in Predicated Code
• PRAVO: PRedicate-Aware VLIW Processor
• Predicate-aware Scheduling
• Performance Results
• Conclusion and Future Work

Modulo Scheduling Example
Source Code
    for (i = 0; i < im_size; i++) {
        if (q_im[i] >= 1)
            res[i] = q_im[i] * bin_size - correction;
        else if (q_im[i] <= -1)
            res[i] = q_im[i] * bin_size + correction;
        else
            res[i] = bin_size + correction;
    }
Predicated Code
    op1:  t1 = load(i1, q_im)               if T
    op2:  p1,p2 = pred_def(t1 ≥ 1)          if T
    op3:  t2 = multsub(t1, tbs, tcor)       if p1
    op4:  store(i1, res, t2)                if p1
    op5:  p3,p4 = pred_def(t1 ≤ -1)         if p2
    op6:  t2 = multadd(t1, tbs, tcor)       if p3
    op7:  store(i1, res, t2)                if p3
    op8:  t2 = add(tbs, tcor)               if p4
    op9:  store(i1++, res, t2)              if p4
    op10: if (i++ < im_size) goto op1       if T
• Three control paths: PT, PFT, PFF
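The correspondence between the source loop and the predicated code can be checked with a small Python sketch that runs both on the same input. This is an illustration under stated assumptions, not the authors' tooling: `multsub(a, b, c)` is taken to mean `a*b - c`, `multadd(a, b, c)` to mean `a*b + c`, a `pred_def` under a false guard is assumed to clear both of its destination predicates, and the second source branch is taken to add the correction, matching op6. The function names are hypothetical.

```python
def source_loop(q_im, bin_size, correction):
    """Reference version: the branching source loop."""
    res = [0] * len(q_im)
    for i in range(len(q_im)):
        if q_im[i] >= 1:
            res[i] = q_im[i] * bin_size - correction
        elif q_im[i] <= -1:
            # Assumption: '+' here, to agree with op6 (multadd).
            res[i] = q_im[i] * bin_size + correction
        else:
            res[i] = bin_size + correction
    return res


def pred_def(cond, guard):
    """Assumed semantics: a false guard clears both output predicates."""
    if not guard:
        return False, False
    return cond, not cond


def predicated_loop(q_im, bin_size, correction):
    """Branch-free body: every op is issued, only ops with a true guard commit."""
    res = [0] * len(q_im)
    for i in range(len(q_im)):
        t1 = q_im[i]                             # op1, guard T
        p1, p2 = pred_def(t1 >= 1, True)         # op2, guard T
        t2 = None
        if p1:
            t2 = t1 * bin_size - correction      # op3: multsub
        if p1:
            res[i] = t2                          # op4
        p3, p4 = pred_def(t1 <= -1, p2)          # op5, guard p2
        if p3:
            t2 = t1 * bin_size + correction      # op6: multadd
        if p3:
            res[i] = t2                          # op7
        if p4:
            t2 = bin_size + correction           # op8: add
        if p4:
            res[i] = t2                          # op9
    return res
```

Exactly one of p1, p3, p4 is true in each iteration, which is why the three stores never conflict and why operations guarded by different members of that set are candidates for sharing a resource.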

Traditional Modulo Schedule (Rau 94)
[Figure: modulo schedule with II=5]

Two Predicate-Aware Modulo Schedules
• Resource oversubscription can produce more efficient schedules (if the colored operations can share an entry)
• A larger Fetch Width (FW) allows more oversubscription and a faster schedule

Baseline Architecture Model
[Figure: pipeline with stages labeled FETCH, DECODE, DISPATCH, REGISTER READ, PRED READ & EXECUTE, WRITE BACK, plus the Predicate Register File; stages are marked as must-use or may-use resources]
• Predicate Register File is only accessed in EXECUTE stage
• Resources from FETCH to EXECUTE are unconditionally reserved

Predicate-aware Architecture (PRAVO)
[Figure: pipeline with stages labeled FETCH, DECODE, PRED READ & DISPATCH, REGISTER READ, EXECUTE, WRITE BACK, plus the Predicate Register File (PRF); stages are marked as must-use or may-use resources]
• PRF is accessed early, in the DISPATCH stage
  • increases predicate defining operation latency

Predicate-aware Architecture (PRAVO)
[Figure: pipeline with stages labeled FETCH, DECODE, PRED READ & DISPATCH, REGISTER READ, EXECUTE, WRITE BACK, plus the Predicate Register File (PRF); stages are marked as must-use or may-use resources]
• DECODE and DISPATCH are reversed

Three Main Changes to Conventional Scheduler
• Predicate defining operation edge latency adjustment
• ResMII computation
• Predicate-Aware Reservation Table
[Figure: scheduler flow chart with boxes Build DDG, Compute ResMII / RecMII, Cyclic Scheduler, Acyclic Scheduler, Reservation Tables]

Data Dependence Graph Latency Adjustment
[Figure: the same data dependence graph in three variants (Original, Brute force, Selective); nodes: p1,p2=pred_def, +1 if p1, ld if p2, +2 if p1, +3 if p2, +4 if p2; edges are labeled with latencies]

Computation of Resource-Constrained Lower Bound
• Predicate-aware ResMII computation
  • “first-fit” combining
  • Fetch Width (FW) resource constraint
[Figure: the operations p1,p2=pred_def, +1 if p1, ld if p2, +2 if p1, +3 if p2, +4 if p2 assigned to resource columns; Original (columns A, M, FW): ResMII=5; Predicate-Aware (columns Amay, Mmay, FWmust): ResMII=3]
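One way to read the "first-fit" combining step is as bin packing over each resource class, where an operation may join an existing slot only if it is disjoint from every operation already there. The Python sketch below follows that reading. The machine parameters (one ALU "A", one memory unit "M", fetch width 2) are assumptions chosen for illustration and are not stated on the slide, and the function names are hypothetical. Fetch is treated as a must-use resource, following the FWmust label, so every operation counts against it. With these assumptions the sketch yields the ResMII values 5 and 3 shown for the example.

```python
from math import ceil

# Hypothetical machine, not given on the slide: 1 ALU, 1 memory unit, fetch width 2.
UNITS = {"A": 1, "M": 1}
FETCH_WIDTH = 2

# (name, resource class, guarding predicate); "T" means always executed.
OPS = [
    ("p1,p2=pred_def", "A", "T"),
    ("+1 if p1", "A", "p1"),
    ("ld if p2", "M", "p2"),
    ("+2 if p1", "A", "p1"),
    ("+3 if p2", "A", "p2"),
    ("+4 if p2", "A", "p2"),
]

# Predicate pairs known to be disjoint (complementary outputs of one pred_def).
DISJOINT = {frozenset(("p1", "p2"))}


def disjoint(pa, pb):
    """True only for pairs listed in DISJOINT; a predicate is never disjoint from itself."""
    return frozenset((pa, pb)) in DISJOINT


def first_fit_slots(ops, combine):
    """Count slots per resource class; with combine=True, disjoint ops share a slot."""
    slots = {}  # resource class -> list of slots, each a list of guard predicates
    for _name, res, pred in ops:
        bins = slots.setdefault(res, [])
        if combine:
            for b in bins:
                if all(disjoint(pred, other) for other in b):
                    b.append(pred)  # first slot that fits
                    break
            else:
                bins.append([pred])
        else:
            bins.append([pred])
    return {res: len(bins) for res, bins in slots.items()}


def res_mii(ops, combine):
    """Resource-constrained lower bound on II under the assumed machine."""
    slots = first_fit_slots(ops, combine)
    unit_bound = max(ceil(n / UNITS[res]) for res, n in slots.items())
    fetch_bound = ceil(len(ops) / FETCH_WIDTH)  # every op must still be fetched
    return max(unit_bound, fetch_bound)
```

Calling `res_mii(OPS, combine=False)` models the original computation and `res_mii(OPS, combine=True)` the predicate-aware one. First-fit is a greedy heuristic, so the slot count depends on the order in which operations are visited.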

Reservation Table (similar to [Warter 92])
• One operation per RT entry
• Multiple disjoint operations per RT entry
  • Check disjointness (using PQS [Johnson96])
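The difference between the two kinds of reservation table entry can be sketched in Python as follows. This is a minimal model under assumptions: the table is indexed by (cycle mod II, resource), each entry holds a list of operations, and a plain set of disjoint predicate pairs stands in for the PQS disjointness query. The class and method names are hypothetical and do not come from the slides or from any real scheduler.

```python
class PredicateAwareRT:
    """Modulo reservation table whose entries may hold several disjoint operations."""

    def __init__(self, ii, resources, disjoint_pairs):
        self.ii = ii
        # One (initially empty) entry per (row, resource).
        self.table = {(row, r): [] for row in range(ii) for r in resources}
        # Stand-in for a PQS query: explicit set of disjoint predicate pairs.
        self.disjoint_pairs = {frozenset(p) for p in disjoint_pairs}

    def is_disjoint(self, pa, pb):
        return frozenset((pa, pb)) in self.disjoint_pairs

    def can_place(self, cycle, resource, pred):
        """An op fits if it is disjoint from every op already in the entry."""
        entry = self.table[(cycle % self.ii, resource)]
        return all(self.is_disjoint(pred, other) for _, other in entry)

    def place(self, op, cycle, resource, pred):
        """Reserve the entry for op; return False if the entry is not shareable."""
        if not self.can_place(cycle, resource, pred):
            return False
        self.table[(cycle % self.ii, resource)].append((op, pred))
        return True


# Example: with II=3, cycles 1 and 4 map to the same row.
rt = PredicateAwareRT(ii=3, resources=["A", "M"], disjoint_pairs=[("p1", "p2")])
rt.place("+1 if p1", 1, "A", "p1")   # empty entry, accepted
rt.place("+3 if p2", 4, "A", "p2")   # shares the entry: p1 and p2 are disjoint
rt.place("+2 if p1", 1, "A", "p1")   # rejected: same predicate as "+1 if p1"
```

A conventional table corresponds to the special case where `disjoint_pairs` is empty, so every entry holds at most one operation.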

Performance Results
• Compare the performance of baseline and predicate-aware scheduling
• Compiler Support
  • Trimaran and ELCOR [Trimaran99]
• MediaBench [Lee97] benchmark suite was evaluated
• Processor Models (BA: base, PA: predicate-aware)

Predicate-aware Speedup over Baseline (PA42 vs. BA42)
• Speedup is only due to improvable PA regions
• Speedup decreases for higher latency and wider machines
[Figure: speedup chart, including an "average" entry]

Average Speedup Breakdown
• Only 68% of regions are PA scheduled
• PA is more effective in modulo scheduled loops

Speedup Analysis
[Figure: two bar charts, "Predicate-Aware Acyclic Region" and "Predicate-Aware Cyclic Region"; y-axis: Cycles, 0 to 30; Cases 1 to 6 over the configurations 4-wide cmpplat=2, 4-wide cmpplat=3 and 6-wide cmpplat=2; legend: PA Potential, Base Sched. Length, PA Sched. Length, PA Critical Path Length, PA Resource Bound]

Summary and Future Work
• Summary: Predicate-aware Scheduling
  • reduces resource constraints in predicated code
  • is supported by PRAVO architecture
  • is effective in cyclic regions (16% speedup on 4-wide PRAVO)
• Future work
  • More resource sharing can be achieved by combining probabilistically disjoint operations

Q&A and Suggestions

Modulo Scheduling Using PART

Speedup Analysis
[Figure: two bar charts, "Predicate-Aware Acyclic Region" and "Predicate-Aware Cyclic Region"; y-axis: Cycles, 0 to 30; Cases 1 to 6 over the configurations 4-wide cmpplat=1, 4-wide cmpplat=2 and 6-wide cmpplat=1; legend: PA Potential, Base Sched. Length, PA Sched. Length, PA Critical Path Length, PA Resource Bound]
