210 likes | 346 Views
Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability. Kevin Fan, Hyunchul Park, Manjunath Kudlur, Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan April 8, 2008. Introduction.
E N D
Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability Kevin Fan, Hyunchul Park,Manjunath Kudlur, Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan April 8, 2008 1
Introduction • Emerging applications have high performance, cost, energy demands • H.264, wireless, software radio, signal processing • 10-100 Gops required • 200 mW power budget • Applications dominated by tight loops processing large amounts of streaming data iPhone board 2
LD +/- * Loop Accelerators C Code Hardware Loop 3
Hardware Implementations • Customization gets order-of-magnitude performance and efficiency wins • Viterbi: 100x speedup vs. ARM9 FPGAs General PurposeProcessors DSPs CGRAs Flexibility Multifunction Loop Accelerators Loop Accelerators, ASICs Efficiency, Performance 4
What About Programmability? • Software changes – bug fixes, evolving standards • dct_8x8() from H.264 reference implementation for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) { i=pos_scan[coeff_ctr][0]; j=pos_scan[coeff_ctr][1]; run++; ilev=0; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { MCcoeff = MC(coeff_ctr); runs[MCcoeff]++; } m7 = &curr_res[block_y + j][block_x]; level = iabs (m7[i]); if (img->AdaptiveRounding) { fadjust8x8[j][block_x+i] = 0; } if (level != 0) { nonzero = TRUE; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { *coeff_cost += MAX_VALUE; img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]); img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff]; ++scan_pos; runs[MCcoeff]=-1; } else { *coeff_cost += MAX_VALUE; ACLevel[scan_pos ] = isignab(level,m7[i]); ACRun [scan_pos++] = run; run=-1; // reset zero level counter } level = isignab(level, m7[i]); ilev = level; } } for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) { i=pos_scan[coeff_ctr][0]; j=pos_scan[coeff_ctr][1]; run++; ilev=0; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { MCcoeff = MC(coeff_ctr); runs[MCcoeff]++; } m7 = &curr_res[block_y + j][block_x]; level = iabs (m7[i]); if (img->AdaptiveRounding) { fadjust8x8[j][block_x+i] = 0; } if (level != 0) { nonzero = TRUE; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { *coeff_cost += MAX_VALUE; img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]); img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff]; ++scan_pos; runs[MCcoeff]=-1; } else { *coeff_cost += MAX_VALUE; ACLevel[scan_pos ] = isignab(level,m7[i]); ACRun [scan_pos++] = run; run=-1; // reset zero level counter } level = isignab(level, m7[i]); ilev = level; } } for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) { i=pos_scan[coeff_ctr][0]; j=pos_scan[coeff_ctr][1]; run++; ilev=0; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { MCcoeff = MC(coeff_ctr); runs[MCcoeff]++; } m7 = &curr_res[block_y + j][block_x]; level = iabs (m7[i]); if (img->AdaptiveRounding) { fadjust8x8[j][block_x+i] = 0; } if (level != 0) { nonzero = TRUE; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { *coeff_cost += MAX_VALUE; img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]); img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff]; ++scan_pos; runs[MCcoeff]=-1; } else { *coeff_cost += MAX_VALUE; ACLevel[scan_pos ] = isignab(level,m7[i]); ACRun [scan_pos++] = run; run=-1; // reset zero level counter } level = isignab(level, m7[i]); ilev = level; } } Version 13.0 Version 13.1 Version 13.2 5
Programmable Loop Accelerator • Reusable hardware → reduced NRE costs • Generalize accelerator without losing efficiency FPGAs General PurposeProcessors DSPs CGRAs Programmable Loop Accelerators Flexibility Multifunction Loop Accelerators Loop Accelerators, ASICs Efficiency, Performance 6
Loop 1 Flexible Accelerators • Generalize accelerator architecture • Map new loops to existing hardware SynthesisSystem Hardware Compiler Loop 2 7
Loop Accelerator Architecture CRF Point-to-point Connections … … … … … … FSM Local Mem BR + & MEM Controlsignals • Hardware realization of modulo scheduled loop • Parameterized execution resources, storage, connectivity 8
Programmable Accelerator Architecture • Generalize architectural features that limit programmability CRF Literals Point-to-point Connections Bus … … … … … … Control Memory Local Mem BR +/- &/| MEM Controlsignals RR RR RR RR • ~50% area overhead vs. non-programmable accelerator 9
Mapping Loops onto Hardware Processor Accelerator FUs Storage Connectivity ALU ALU LD +/- * CRF 8 16 8 10
+3 +2 +5 +4 +3 +2 +5 +4 Scheduling Example MEM ADDER1 ADDER2 LD1 Time +2 +3 LD1 II=2 +4 +5 +2 +3 LD1 +4 +5 ? +2 +3 LD1 +4 11
Modulo Scheduling for LAs • Large search space, few solutions • Op-centric approaches unable to find solutions • Satisfiability Modulo Theory (SMT) formulation to solve linear and SAT constraints simultaneously Loop Move Insertion SMT Scheduling Register Allocation Control Signals Machine description Increment II 12
SMT Formulation • Boolean variables Xi,f,t are true if operation i is scheduled on FU f at time slot t. • Integer variables Si represent stage of operation i. i lat(i) dist(i,j) sched_time(j) sched_time(i) + lat(i) – dist(i,j) II j ( Xi,fi,ti Xj,fj,tj ) ( ) Sj II + tj Si II + ti + lat(i) – dist(i,j) II • More details in paper 13
FU FU Hardware Loop Loop Loop Loop Loop Loop Measuring Programmability • How well can different loops be mapped onto the same hardware? • Performance matters – how much does II increase? • Need set of loops with different degrees of similarity ? 14
Graph Perturbation • Synthetically generated graphs • More perturbations → less similar to original graph • Iteratively apply random transformations: 15
Results – Perturbed Graphs Average II increase Base II 4 8 7 2 4 4 4 4 6 9 MPEG4 Signal processing Image Math 16
Conclusion • Increase flexibility of customized hardware without sacrificing performance, efficiency • Successfully map loops to heterogeneous hardware • Compile times of 5 minutes – 1 hour • Software changing faster than hardware → patchable ASIC 18
Questions? 19