Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan

Introduction
• Emerging applications have high performance, cost, and energy demands
• H.264, wireless, software radio, signal processing
• 10-100 Gops required
• 200 mW power budget
• Applications dominated by tight loops processing large amounts of streaming data
[Figure: mobile device with 20 GB HD, 3.5G (HSDPA), WiMax, stereo headset, TV out, memory card, and PC / Mac connections [ARM 2005]]

Loop Accelerators
• Order-of-magnitude performance and efficiency wins
• Viterbi: 100x speedup vs. ARM9
• Automated C → gates solution
• Correct by construction
• Close designer productivity gap
• Achieve short time-to-market

Loop Accelerator Template
• Hardware realization of modulo scheduled loop
• Parameterized execution resources, storage, connectivity

Loop Accelerator Design Flow
Input: C code (.c), performance (throughput)
1. FU Alloc: produces the abstract arch (FUs)
2. Modulo Schedule: produces the scheduled ops
3. Build Datapath: produces the concrete arch
4. Instantiate Arch: produces Verilog (.v), control signals
5. Synthesize: produces the loop accelerator

Modulo Scheduling and Datapath Derivation
• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule
Source code:
    r1 = Mem[r2]
    r3 = r1 + 12
[Figure: schedule with LOAD on FU1 at time 1 and ADD on FU2 at time 4; derived datapath with MEM, an adder, and the constant 12]
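
A minimal sketch of how register and interconnect requirements might be read off a schedule, using the two-op example above. The function name `derive_datapath`, the 2-cycle load latency, and the choice to attach storage to the producing FU are assumptions made for illustration, not details taken from the slides.

```python
# Schedule from the slide: op -> (FU, issue time).
schedule = {"LOAD": ("FU1", 1), "ADD": ("FU2", 4)}
latency = {"LOAD": 2, "ADD": 1}   # assumed latencies, not from the slides
edges = [("LOAD", "ADD")]         # r1 flows from the load to the add


def derive_datapath(schedule, latency, edges):
    """Return the FU-to-FU wires and the register depth needed per producing FU."""
    wires, depth = set(), {}
    for producer, consumer in edges:
        p_fu, p_time = schedule[producer]
        c_fu, c_time = schedule[consumer]
        wires.add((p_fu, c_fu))  # a value moves between these two FUs
        # Cycles the value must be held between being produced and consumed.
        lifetime = c_time - (p_time + latency[producer])
        depth[p_fu] = max(depth.get(p_fu, 0), lifetime)
    return wires, depth


wires, depth = derive_datapath(schedule, latency, edges)
```

Under these assumptions the load result is ready at time 3 and consumed at time 4, so FU1 needs one register entry and one wire to FU2.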

Cost Sensitive Scheduling
• Traditional scheduling is hardware unaware
• Intelligent scheduling needed to reduce hardware cost
• Different scheduling alternatives not equal
[Figure: alternative schedules placing ops +1, +2, LD1, LD2 on FU1, FU2, FU3 over times 0-2]

Scheduling to Reduce Cost
• Hardware cost is function of final schedule
• Increased hardware sharing = reduced cost
• Reusing hardware is “free”
• Traditional metrics (register pressure) not sufficient
[Figure caption: No additional cost for longer lifetime]

Initial Approach: Greedy
• Standard iterative modulo scheduler, augmented with hardware cost model
• Choose alternative which increases cost the least
Hardware cost = FU cost + Storage cost + Wire cost
    while unscheduled ops remain {
        get valid alternatives for op
        for each alternative {
            get hardware cost
        }
        schedule op using min-cost alternative
        update hardware cost model
    }
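
The greedy loop above can be sketched in runnable form as follows. This is a toy under stated assumptions: the initiation interval `II`, the `FU_COST` values, and the helper names are invented for illustration; dependences, latencies, storage and wire cost, and the backtracking of a real iterative modulo scheduler are all left out.

```python
II = 2                                        # assumed initiation interval
FU_COST = {"FU1": 10, "FU2": 10, "FU3": 10}   # made-up unit costs


def alternatives(placed):
    """Yield every (fu, slot) whose modulo reservation slot is still free."""
    taken = {(fu, t % II) for fu, t in placed.values()}
    for fu in FU_COST:
        for t in range(II):
            if (fu, t) not in taken:
                yield (fu, t)


def hardware_cost(placed):
    """Toy cost model: pay once for each FU that holds at least one op."""
    return sum(FU_COST[fu] for fu in {fu for fu, _ in placed.values()})


def greedy_schedule(ops):
    """Place ops one at a time, always taking the cheapest valid alternative."""
    placed = {}
    for op in ops:
        placed[op] = min(
            alternatives(placed),
            key=lambda alt: hardware_cost({**placed, op: alt}),
        )
    return placed


schedule = greedy_schedule(["a", "b", "c"])
```

With this cost model the first two ops share FU1 and the third is forced onto a second FU. Each choice looks only at the op being placed, which is the local scope that limits the greedy approach.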

Results – Greedy Scheduling
• Local scope → local minima
• Much more cost savings possible
• 5% average cost savings
[Chart: cost breakdown by FU, Storage, MUX]

Optimal Modulo Scheduling
• Optimal modulo scheduling extends [Eichenberger ’97]
Storage cost = Σ width_i × depth_i
FU cost = Σ cost(FU_i)
[Figure: loop with ops +1, +2, LD3, +4, -5; search space in which each op (Op1, Op2, Op3, …) is assigned a (FU #, time) pair such as (1,0), (1,1), (2,0), (2,1), (3,0), (3,1)]
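
As a hedged illustration of the two cost terms, the sketch below evaluates them on made-up inputs. The register widths and depths, the per-FU costs, and the function names are hypothetical and only show how the sums are formed.

```python
def storage_cost(register_files):
    """Sum of width_i * depth_i over all register files."""
    return sum(width * depth for width, depth in register_files)


def fu_cost(fus, unit_cost):
    """Sum of cost(FU_i) over all allocated FUs."""
    return sum(unit_cost[kind] for kind in fus)


# Hypothetical design: (width in bits, depth in entries) per register file.
registers = [(32, 2), (32, 1), (16, 4)]
unit_cost = {"add": 100, "load": 300}   # made-up relative costs
fus = ["add", "add", "load"]

total = storage_cost(registers) + fu_cost(fus, unit_cost)
```

An exact scheduler would minimize an objective of this shape over all legal (FU #, time) assignments instead of evaluating one fixed design.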

Results – Optimal Scheduling
• 27% average cost savings
[Chart: cost breakdown by FU, Storage, MUX]

Problem Decomposition
• Exact solutions are not practical
• (#FU × II × stages)^#ops possible schedules
• 20 lines of C code → 100 hours
• Excessive runtimes even for modest-size loops
• Decompose into more manageable sub-problems
• Partitioned scheduling
• Time-space decomposition
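
To give a feel for how quickly the schedule count grows, here is a small sketch that evaluates the expression above. The function name and the parameter values are illustrative only.

```python
def num_schedules(num_fus, ii, stages, num_ops):
    """Each op may take any (FU, slot, stage) combination independently."""
    return (num_fus * ii * stages) ** num_ops


tiny = num_schedules(num_fus=2, ii=2, stages=1, num_ops=3)     # 4**3 = 64
modest = num_schedules(num_fus=4, ii=4, stages=3, num_ops=20)  # 48**20
```

Growing from 3 ops to 20 ops moves the count from 64 to a 34-digit number, which is why exact search is limited to very small loops.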

Partitioned Scheduling
• Partition the operations into small groups
• Schedule groups of operations sequentially
• Account for hardware contribution of previously scheduled groups
• Backtrack if infeasible state reached
[Figure: ops 1-5 split into groups, each group passed in turn to the optimal modulo scheduler]
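
The sequential flow above can be sketched as follows. A brute-force search stands in for the optimal modulo scheduler, and the cost model, `II`, `FU_COST`, and all function names are assumptions for illustration; dependences are ignored and backtracking is only signalled, not implemented.

```python
from itertools import product

II = 2                                        # assumed initiation interval
FU_COST = {"FU1": 10, "FU2": 10, "FU3": 10}   # made-up unit costs
SLOTS = [(fu, t) for fu in FU_COST for t in range(II)]


def hardware_cost(placed):
    """Toy cost model: pay once for each FU that holds at least one op."""
    return sum(FU_COST[fu] for fu in {fu for fu, _ in placed.values()})


def schedule_group(group, placed):
    """Exhaustively place one group, given the ops that are already placed."""
    taken = set(placed.values())
    best = None
    for combo in product(SLOTS, repeat=len(group)):
        if len(set(combo)) < len(combo) or taken & set(combo):
            continue  # two ops would collide on the same (FU, slot)
        trial = {**placed, **dict(zip(group, combo))}
        if best is None or hardware_cost(trial) < hardware_cost(best):
            best = trial
    return best  # None signals an infeasible state


def partitioned_schedule(groups):
    """Schedule groups in order, carrying earlier placements forward."""
    placed = {}
    for group in groups:
        placed = schedule_group(group, placed)
        if placed is None:
            raise RuntimeError("infeasible: a full scheduler would backtrack here")
    return placed


result = partitioned_schedule([["a", "b"], ["c", "d"], ["e"]])
```

Each call searches only the slots for its own group, so the work grows with group size, not with the whole loop, at the price of never revisiting earlier groups.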

Operation Partitioning
• Traditional partitioning: minimize edge cuts
• Does not necessarily lead to good cost
• Goal: maximize hardware sharing opportunities within a group
[Figure: example partitions grouping +, LD, <<, and * operations]

Results – Partitioned Scheduling
• 8% average cost savings
• With large number of partitions, similar to greedy
[Chart: cost breakdown by FU, Storage, MUX]

Partition Size for Sharp
• Improve cost by considering more ops at a time

Time-Space Decomposition
• Reduce scheduling complexity
• View all operations together
• Optimize for register depth during time assignment, register width and FU cost during space assignment
[Figure: two orderings for ops 1-5 on FU1, FU2, FU3. "Time, space" fixes each op's time slot first and then its FU; "Space, time" fixes each op's FU first and then its time slot.]
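
A minimal sketch of the "time, space" ordering, assuming unit latencies and simple heuristics in place of the optimizing formulation: earliest-slot placement stands in for minimizing register depth, and grouping same-kind ops on one FU stands in for minimizing FU cost. The op names, dependences, and function names are hypothetical.

```python
def assign_times(ops, preds, num_fus, ii):
    """Phase 1: earliest time after all predecessors whose modulo slot has a
    free FU. `ops` must be in topological order; unit latency is assumed."""
    assert len(ops) <= num_fus * ii, "not enough (FU, slot) pairs"
    time, load = {}, [0] * ii
    for op in ops:
        t = max((time[p] + 1 for p in preds.get(op, [])), default=0)
        while load[t % ii] >= num_fus:
            t += 1
        time[op] = t
        load[t % ii] += 1
    return time


def assign_space(time, kind, num_fus, ii):
    """Phase 2: with times fixed, pick an FU per op, preferring an FU that
    already executes the same kind of op, then an unused FU."""
    fu_kinds = [set() for _ in range(num_fus)]
    busy, fu_of = set(), {}
    for op in sorted(time, key=time.get):
        slot = time[op] % ii
        free = [f for f in range(num_fus) if (f, slot) not in busy]
        f = min(free, key=lambda f: (kind[op] not in fu_kinds[f], bool(fu_kinds[f]), f))
        fu_of[op] = f
        fu_kinds[f].add(kind[op])
        busy.add((f, slot))
    return fu_of


ops = ["add1", "add2", "ld3", "add4", "ld5"]
kind = {"add1": "add", "add2": "add", "ld3": "ld", "add4": "add", "ld5": "ld"}
preds = {"add4": ["add1", "add2"], "ld5": ["ld3"]}

time = assign_times(ops, preds, num_fus=3, ii=2)
fu_of = assign_space(time, kind, num_fus=3, ii=2)
```

Both phases see all five ops at once, yet each phase decides only one dimension. In this toy run no FU ends up executing both adds and loads.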

Results – Time-Space Scheduling
• Time, space: 19% average cost savings
• Space, time: 20% average cost savings
[Chart: cost breakdown by FU, Storage, MUX]

Real Cost Savings
• Viterbi, naïve scheduler: 0.66 mm²
• Viterbi, space-time decomposed scheduler: 0.37 mm²
• 43.2% overall area savings

Conclusion
• Automated C → loop accelerator synthesis system
• Modulo scheduler must be cost aware
• Decomposition methods make problem tractable
• 20% average cost savings with space-time decomposition
• Importance of global view of all operations
• Individual savings up to 43%
• Compile times of 1 minute to 30 minutes

Questions?
• For more information: http://cccp.eecs.umich.edu
