Probabilistic Predicate-Aware Modulo Scheduling
Mikhail Smelyanskiy¹, Scott Mahlke, Edward Davidson
Department of EECS, University of Michigan
¹ Currently with the System Technology Lab at Intel Corporation
Introduction to Deterministic Predicate-aware Scheduling (DPAS) [Smelyanskiy03]
• Predication eliminates branch instructions
  • but increases resource requirements
• Predicate-aware scheduling oversubscribes resources
  • reduces resource requirements
  • reduces schedule length
[Figure: control-flow graph with blocks A, B, C and D; block A ends in "br cond" with F and T successor edges]
Motivation for Probabilistic Predicate-aware Scheduling (PPAS)
• DPAS can only combine A5 with A2, A3 and A4
• What about combining
  • A2 with A3?
  • A3 with A4?
  • A2 with A6?
• PPAS allows much more aggressive sharing than DPAS, but can result in delay due to resource conflicts
[Figure: example code region with operations A1, A2, A3, A4, A5, A6, M1, M2 and br]
Characteristics of Predicated Code
• 52% of time is spent in cyclic regions
• Cyclic PPAS might eliminate up to 38% of all dynamic operations from cyclic regions
Outline
• Motivation
• Resource Pressure Problem in Predicated Code
• Probabilistic Predicate Aware Architecture
• Probabilistic Predicate-aware Modulo Scheduling
• Performance Results
• Conclusions
Modulo Scheduling Example
• This control path is taken 30% of the time
• Assumed machine:
  • 1 ALU, 1 MEMORY and 1 BRANCH units
[Figure: dependence graph of the example loop with operations +1, p1=cmpp, +2 if p1, +3 if p1, st and br; the predicated path is labeled freq=0.3]
Traditional Modulo Schedule (Rau 94)
[Figure: modulo schedule of the example loop, II=5]
Probabilistic Predicate-Aware Modulo Scheduling
[Figure: dependence graph of the example loop (+1, p1=cmpp, +2 if p1, +3 if p1, st, br; freq=0.3 labels) shown with three schedules: II = 4, II = 4 and II = 3.18]
Baseline Architecture Model
• Predicate Register File is only accessed in EXECUTE stage
• Resources from FETCH to EXECUTE are unconditionally reserved
[Figure: pipeline with stages FETCH, DECODE, DISPATCH, REGISTER READ, PRED READ & EXECUTE and WRITE BACK, plus the Predicate Register File; legend: Must-use Resources, May-use Resources]
Extended Predicate-Aware Architecture
• Conflict Detection and Recovery Latency (CDRL) can be 0 or 1 cycles
[Figure: pipeline with stages FETCH, DECODE, PRED READ & DISPATCH, REGISTER READ, EXECUTE and WRITE BACK, plus the Predicate Register File (PRF) and a Resource Conflict Detection and Recovery Unit (conflict detection, conflict recovery, stall signals); legend: Must-use Resources, May-use Resources]
Expected Delay Model
• Example (assume 3 operations, one FU and CDRL=1)
• ev is execution vector
• delay_cycles(ev) = CDRL + dispatch_cycles(ev) - 1
• P(ev) is probability of occurrence of ev
• P(ev) is computed using disjointness and implication, and assuming independence otherwise
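The expected delay can be illustrated with a small Python sketch that enumerates every execution vector for the operations sharing one slot. It is only an illustration under stated assumptions: operations are treated as independent (no disjointness or implication handling), dispatch_cycles(ev) is taken to be the number of executing operations divided by the number of FUs, rounded up, and an execution vector that fits in one dispatch cycle is assumed to cause no delay. The function names and parameters are hypothetical, not taken from the slides.

```python
from itertools import product
from math import ceil


def delay_cycles(ev, num_fus=1, cdrl=1):
    """Delay for one execution vector (tuple of 0/1 flags).

    Assumption: no conflict (and no delay) when all executing
    operations fit in a single dispatch cycle.
    """
    active = sum(ev)
    dispatch_cycles = ceil(active / num_fus)
    if dispatch_cycles <= 1:
        return 0
    return cdrl + dispatch_cycles - 1


def expected_delay(exec_probs, num_fus=1, cdrl=1):
    """Sum of P(ev) * delay_cycles(ev) over all execution vectors.

    exec_probs[i] is the probability that operation i executes;
    operations are assumed independent in this sketch.
    """
    total = 0.0
    for ev in product((0, 1), repeat=len(exec_probs)):
        p_ev = 1.0
        for bit, p in zip(ev, exec_probs):
            p_ev *= p if bit else (1.0 - p)
        total += p_ev * delay_cycles(ev, num_fus, cdrl)
    return total
```

With two operations that each execute 30% of the time on one FU and CDRL=1, only the vector in which both execute conflicts, so the sketch yields 2 × 0.3 × 0.3 = 0.18 cycles.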
Modulo Scheduling using Expected Delay Model (scheduling operation +3 if p1)
[Figure: dependence graph of the example loop with operations +1, p1=cmpp, +2 if p1, +3 if p1, st and br; freq=0.3 labels]

SRT (resource columns: A may, MEM may, BR may; CDRL = 1)
Time  Time mod II  Operations              Expected Delay due to Conflicts
0     0            +1                      0
1     1                                    0
2     2            p1=cmpp, br             0
3     0                                    0
4     1            +2 if p1                0
5     2            +3 if p1 (candidate)    Pconf(p1=pred, +3) = 2 × 1.0 × 0.3 = 0.60
6     0            +3 if p1 (candidate)    Pconf(+1, +3) = 2 × 1.0 × 0.3 = 0.60
7     1            +3 if p1 (candidate)    Pconf(+2, +3) = 2 × 0.3 × 0.3 = 0.18
8     2            st                      0

MRT
Row  Operations            Expected Delay due to Conflicts
0    +1                    0
1    +2 if p1, +3 if p1    0.18
2    p1=cmpp, st, br       0
Total expected delay due to conflicts: 0.18
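The slot choice in this example can be reproduced with a short Python sketch. It follows the arithmetic on the slide (delay of 2 cycles times the product of the two execution probabilities) and assumes II = 3, as implied by the three MRT rows. The dictionary layout and the helper name are hypothetical and only serve the illustration.

```python
CDRL = 1
II = 3  # assumption: three MRT rows, as in the example


def pair_conflict_delay(p_a, p_b, cdrl=CDRL):
    """Expected delay when two operations share one unit.

    Both executing needs 2 dispatch cycles, so the delay is
    cdrl + 2 - 1; probabilities are multiplied as on the slide.
    """
    return (cdrl + 2 - 1) * p_a * p_b


# ALU operation already placed in each MRT row, with its execution probability.
alu_rows = {0: ("+1", 1.0), 1: ("+2 if p1", 0.3), 2: ("p1=cmpp", 1.0)}

p_new = 0.3  # execution probability of "+3 if p1"
candidate_times = (5, 6, 7)

# Expected conflict delay for each candidate issue time.
costs = {
    t: pair_conflict_delay(alu_rows[t % II][1], p_new)
    for t in candidate_times
}

best_time = min(costs, key=costs.get)
ii_expected = II + costs[best_time]
```

Here `best_time` is 7, the row shared with "+2 if p1", and `ii_expected` comes out at about 3.18, matching the total expected delay of 0.18 in the MRT.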
Modulo Scheduling using Expected Delay Model (Finding Expected Initiation Interval, IIexp)
• Use binary search to find IIexp
  • upper bound = [expression missing in source]
  • lower bound = [expression missing in source]
  • start with [missing] and increase till [missing] or schedule found
  • IIexp of schedule found becomes new upper bound
  • [missing] becomes new lower bound if no schedule found
• More than one way to achieve the same IIexp (e.g. 3.2)
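The bound-update rule in these bullets can be sketched as a binary search over a real-valued target. This is a minimal illustration only: the initial bounds and the start-up phase are not recoverable from the slide, so `lower`, `upper`, the tolerance `tol` and the `try_schedule` callback are all hypothetical. The callback is assumed to return the IIexp of a schedule it found within the target, or None if it found none.

```python
def find_ii_exp(try_schedule, lower, upper, tol=0.01):
    """Binary search for a small expected initiation interval.

    try_schedule(target) -> IIexp of a schedule with IIexp <= target,
    or None when no schedule is found (hypothetical interface).
    """
    best = None
    while upper - lower > tol:
        target = (lower + upper) / 2
        achieved = try_schedule(target)
        if achieved is not None:
            # IIexp of the schedule found becomes the new upper bound.
            best = achieved
            upper = min(achieved, target)
        else:
            # The failed target becomes the new lower bound.
            lower = target
    return best


def toy_scheduler(target, feasible=(3.18, 4.0, 5.0)):
    """Stand-in scheduler: returns the largest feasible IIexp <= target."""
    fitting = [ii for ii in feasible if ii <= target]
    return max(fitting) if fitting else None


result = find_ii_exp(toy_scheduler, lower=3.0, upper=5.0)
```

With the toy scheduler the search narrows from 4.0 to 3.18 and then stops once no smaller target succeeds. A real modulo scheduler would replace `toy_scheduler`.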
Performance Results
• Compare the performance of baseline (BASE), deterministic (DPAS) and probabilistic (PPAS) predicate-aware modulo scheduling
• Compiler Support
  • Trimaran and ELCOR [Trimaran99]
• Mediabench [Lee97] benchmark suite was evaluated
• Processor Models (BA = base, PA = predicate-aware)
[Table: 4-wide and 6-wide processor configurations]
Cyclic PPAS Speedup over BASE (4-wide machine)
• 4-wide cyclic PPAS with CDRL=0 is 20% better than BASE and 10% better than cyclic DPAS
• Increasing CDRL degrades performance
Various Scheduling Measurements (4-wide machine, CDRL = 0)

       IIcompile  IIruntime  Absolute Error  Epilogue Size  # Rotating Registers
BASE   27.6       27.6       0.0%            1.2            14.8
DPAS   23.5       23.5       0.0%            2.2            18.4
PPAS   21.0       20.8       1.6%            4.7            29.9

• Cyclic PPAS reduces II by 32% compared with BASE and by 12% compared with cyclic DPAS
• Expected delay model accurately predicts delay due to conflicts
• Predicate-aware scheduling increases the epilogue size and requires more rotating registers than BASE
Overall Speedup over BASE with Cyclic PPAS
• Only 52% of regions are scheduled with cyclic PPAS
• Overall, 4-wide cyclic PPAS is 10% better than BASE and 6-wide cyclic PPAS is 4% better than BASE
Summary of PPAS
• PPAS significantly reduces resource requirements in predicated cyclic code but can cause conflicts
  • compiler maximizes sharing in view of expected conflicts
  • PPAS architecture detects and recovers from conflicts
• PPAS improves performance by
• For further discussion, see http://www.eecs.umich.edu/~msmelyan/publications.html
Mikhail Smelyanskiy. Hardware/Software Mechanisms for Increasing Resource Utilization on VLIW/EPIC Processors. Ph.D. Dissertation, University of Michigan, 2004
Resource Conflict Detection and Recovery Unit
• Design alternatives to dispatch conflicting operations
• Conflict Detection and Recovery Latency (CDRL)
Cyclic PPAS Speedup for Training and Reference Input Sets (4-wide, CDRL=1)