Probabilistic Predicate-Aware Modulo Scheduling

Mikhail Smelyanskiy¹, Scott Mahlke, Edward Davidson. Department of EECS, University of Michigan. ¹Currently with the System Technology Lab at Intel Corporation.


Presentation Transcript


  1. Probabilistic Predicate-Aware Modulo Scheduling • Mikhail Smelyanskiy¹, Scott Mahlke, Edward Davidson • Department of EECS, University of Michigan • ¹Currently with the System Technology Lab at Intel Corporation

  2. Introduction to Deterministic Predicate-aware Scheduling (DPAS) [Smelyanskiy03] • Predication eliminates branch instructions but increases resource requirements • Predicate-aware scheduling oversubscribes resources, which reduces resource requirements and schedule length [Figure: control-flow graph with blocks A, B, C and D; block A ends in "br cond" with F and T edges]

  3. Motivation for Probabilistic Predicate-aware Scheduling (PPAS) • DPAS can only combine A5 with A2, A3 and A4 • What about combining A2 with A3? A3 with A4? A2 with A6? • PPAS allows much more aggressive sharing than DPAS but can result in delays due to resource conflicts [Figure: predicated region containing operations A1 to A6, M1, M2 and br]

  4. Characteristics of Predicated Code • 52% of time is spent in cyclic regions • Cyclic PPAS might eliminate up to 38% of all dynamic operations from cyclic regions

  5. Outline • Motivation • Resource Pressure Problem in Predicated Code • Probabilistic Predicate Aware Architecture • Probabilistic Predicate-aware Modulo Scheduling • Performance Results • Conclusions

  6. Modulo Scheduling Example • This control path is taken 30% of the time • Assumed machine: 1 ALU, 1 MEMORY and 1 BRANCH unit [Figure: loop body with operations +1, p1=cmpp, +2 if p1, +3 if p1, st and br; the edges into the predicated operations are labeled freq=0.3]

  7. Traditional Modulo Schedule (Rau 94) [Figure: modulo schedule of the example loop with II=5]

  8. Probabilistic Predicate-Aware Modulo Scheduling [Figure: the example loop (+1, p1=cmpp, +2 if p1, +3 if p1, st, br; freq=0.3 on the predicated path) shown with schedules labeled II = 4, II = 4 and II = 3.18]

  9. Baseline Architecture Model • Predicate Register File is accessed only in the EXECUTE stage • Resources from FETCH to EXECUTE are unconditionally reserved [Figure: pipeline with stages FETCH, DECODE, DISPATCH, REGISTER READ, PRED READ & EXECUTE and WRITE BACK, plus the Predicate Register File; legend: must-use resources, may-use resources]

  10. Extended Predicate-Aware Architecture • Conflict Detection and Recovery Latency (CDRL) can be 0 or 1 cycles [Figure: pipeline with stages FETCH, DECODE, PRED READ & DISPATCH, REGISTER READ, EXECUTE and WRITE BACK, plus the Predicate Register File (PRF) and a Resource Conflict Detection and Recovery Unit that performs conflict detection and conflict recovery and issues stall signals; legend: must-use resources, may-use resources]

  11. Expected Delay Model • Example (assume 3 operations, one FU and CDRL=1) • ev is an execution vector • delay_cycles(ev) = CDRL + dispatch_cycles(ev) - 1 • P(ev) is the probability of occurrence of ev • P(ev) is computed using disjointness and implication between predicates, and assuming independence otherwise
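The expected delay model on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It rests on assumptions the transcript does not state: `dispatch_cycles(ev)` is taken to be the number of operations that actually execute divided by the number of functional units (rounded up), an execution vector with no oversubscription is taken to cost zero delay, and every operation is treated as independent (the disjointness and implication cases mentioned on the slide are left out). The function names are hypothetical.

```python
from itertools import product
from math import ceil


def delay_cycles(ev, cdrl, num_units=1):
    """Delay for one execution vector (assumed interpretation of the slide).

    ev is a tuple of 0/1 flags, one per operation sharing the resource.
    """
    active = sum(ev)
    dispatch_cycles = ceil(active / num_units)
    if dispatch_cycles <= 1:
        # Assumption: no conflict, so no detection/recovery penalty.
        return 0
    return cdrl + dispatch_cycles - 1


def expected_delay(exec_probs, cdrl, num_units=1):
    """Sum P(ev) * delay_cycles(ev) over all execution vectors.

    Assumes the operations execute independently of one another.
    """
    total = 0.0
    for ev in product((0, 1), repeat=len(exec_probs)):
        p_ev = 1.0
        for bit, p in zip(ev, exec_probs):
            p_ev *= p if bit else (1.0 - p)
        total += p_ev * delay_cycles(ev, cdrl, num_units)
    return total


if __name__ == "__main__":
    # Two operations that each execute 30% of the time, one FU, CDRL = 1.
    print(expected_delay([0.3, 0.3], cdrl=1))  # approximately 0.18
```

With two operations that each execute with probability 0.3, only the vector in which both execute causes a conflict, so the expected delay is 0.09 times a delay of 2 cycles.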

  12. Modulo Scheduling using Expected Delay Model (scheduling operation +3 if p1) • Expected delay due to conflicts is evaluated with CDRL = 1
      [Figure: the example loop with operations +1, p1=cmpp, +2 if p1, +3 if p1, st and br; freq=0.3 on the predicated path]
      SRT (already placed): time 0: +1 • time 2: p1=cmpp, br • time 4: +2 if p1 • time 8: st
      Candidate times for +3 if p1 (time, slot = time mod 3, expected delay):
        time 5, slot 2: Pconf(p1=pred, +3): 2 × 1.0 × 0.3 = 0.60
        time 6, slot 0: Pconf(+1, +3): 2 × 1.0 × 0.3 = 0.60
        time 7, slot 1: Pconf(+2, +3): 2 × 0.3 × 0.3 = 0.18
      MRT (resulting): slot 0: +1 (delay 0) • slot 1: +2 if p1, +3 if p1 (delay 0.18) • slot 2: p1=cmpp, st, br (delay 0)
      Total expected delay due to conflicts: 0.18
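The slot choice on this slide can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the scheduler itself: the execution probabilities (1.0 for unpredicated operations, 0.3 for operations guarded by p1) are read off the slide's arithmetic, each candidate slot is assumed to hold exactly one ALU operation already, the two operations are treated as independent, and the names `EXEC_PROB`, `ALU_MRT`, `pair_expected_delay` and `best_slot` are hypothetical.

```python
CDRL = 1

# Probability that each ALU operation actually executes (from the slide).
EXEC_PROB = {
    "+1": 1.0,
    "p1=cmpp": 1.0,
    "+2 if p1": 0.3,
    "+3 if p1": 0.3,
}

# ALU column of the modulo reservation table before placing "+3 if p1".
ALU_MRT = {0: "+1", 1: "+2 if p1", 2: "p1=cmpp"}


def pair_expected_delay(occupant, candidate, cdrl=CDRL):
    """Expected delay when two operations share one unit in the same slot.

    If both execute, they need 2 dispatch cycles, so the delay is
    cdrl + 2 - 1; independence of the two operations is assumed.
    """
    delay_if_conflict = cdrl + 2 - 1
    return delay_if_conflict * EXEC_PROB[occupant] * EXEC_PROB[candidate]


def best_slot(candidate, mrt):
    """Return the slot with the smallest expected delay, plus all costs."""
    costs = {
        slot: pair_expected_delay(occupant, candidate)
        for slot, occupant in mrt.items()
    }
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    slot, costs = best_slot("+3 if p1", ALU_MRT)
    print(slot, costs)
```

Under these assumptions the sketch picks slot 1, next to +2 if p1, at an expected cost of about 0.18 cycles. Adding that to a three-slot table appears consistent with the II = 3.18 shown on slide 8.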

  13. Modulo Scheduling using Expected Delay Model (Finding Expected Initiation Interval, IIexp) • Use binary search to find • upper bound = • lower bound = • start with and increase till or sched. found • of schedule found becomes new upper bound • becomes new lower bound if no schedule found • More than one way to achieve the same (e.g. 3.2)
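The bound-update rule on this slide can be sketched as a bisection loop in Python. The transcript does not preserve the actual bound formulas or the inner "start with ... and increase" step, so this sketch takes the initial bounds as parameters and hides the scheduler behind a hypothetical callback `try_schedule(target)`, assumed to return the expected II of a schedule that fits within `target`, or `None` if none is found. The tolerance `tol` and the assumption that feasibility is monotonic in the target are also additions for illustration.

```python
def find_ii_exp(try_schedule, lower, upper, tol=0.01):
    """Binary search for the smallest achievable expected II.

    try_schedule(target) is a hypothetical scheduler hook: it returns the
    expected II of a schedule no larger than target, or None on failure.
    Returns the best expected II found, or None if no attempt succeeded.
    """
    best = None
    while upper - lower > tol:
        target = (lower + upper) / 2.0
        achieved = try_schedule(target)
        if achieved is not None:
            # The II of the schedule found becomes the new upper bound.
            best = achieved
            upper = min(achieved, target)
        else:
            # No schedule found: the attempted target becomes the lower bound.
            lower = target
    return best


if __name__ == "__main__":
    # Toy scheduler: any target of at least 3.18 yields a 3.18 schedule.
    def toy_scheduler(target):
        return 3.18 if target >= 3.18 else None

    print(find_ii_exp(toy_scheduler, lower=3.0, upper=5.0))
```

Clamping the new upper bound to `min(achieved, target)` guarantees the interval shrinks even if the callback returns a schedule slightly above the requested target.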

  14. Performance Results • Compare the performance of baseline (BASE), deterministic (DPAS) and probabilistic (PPAS) predicate-aware modulo scheduling • Compiler Support • Trimaran and ELCOR [Trimaran99] • Mediabench [Lee97] benchmark suite was evaluated • Processor Models (BA – base, PA – predicate-aware) 4-wide 6-wide

  15. Cyclic PPAS Speedup over BASE (4-wide machine) • 4-wide cyclic PPAS with CDRL=0 is 20% better than BASE and 10% better than cyclic DPAS • Increasing CDRL degrades performance

  16. Various Scheduling Measurements (4-wide machine, CDRL = 0)
              IIcompile  IIruntime  Absolute Error  Epilogue Size  # Rotating Registers
      BASE    27.6       27.6       0.0%            1.2            14.8
      DPAS    23.5       23.5       0.0%            2.2            18.4
      PPAS    21.0       20.8       1.6%            4.7            29.9
      • Cyclic PPAS reduces II by 32% compared with BASE and by 12% compared with cyclic DPAS • The expected delay model accurately predicts the delay due to conflicts • Predicate-aware scheduling increases the epilogue size and requires more rotating registers than BASE

  17. Overall Speedup over BASE with Cyclic PPAS • Only 52% of regions are scheduled with cyclic PPAS • Overall 4-wide cyclic PPAS is 10% better than base and 6-wide cyclic PPAS is 4% better than base

  18. Summary of PPAS • PPAS significantly reduces resource requirements in predicated cyclic code but can cause conflicts • The compiler maximizes sharing in view of expected conflicts • The PPAS architecture detects and recovers from conflicts • PPAS improves performance by • For further discussion, see http://www.eecs.umich.edu/~msmelyan/publications.html and Mikhail Smelyanskiy, Hardware/Software Mechanisms for Increasing Resource Utilization on VLIW/EPIC Processors, Ph.D. Dissertation, University of Michigan, 2004

  19. Questions?

  20. Backup Foils

  21. Resource Conflict Detection and Recovery Unit • Design alternatives to dispatch conflicting operations • Conflict Detection and Recovery Latency (CDRL)

  22. Cyclic PPAS Speedup for Training and Reference Input Sets (4-wide, CDRL=1)
