Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability

Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability Kevin Fan, Hyunchul Park,Manjunath Kudlur, Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan April 8, 2008 1

Introduction • Emerging applications have high performance, cost, energy demands • H.264, wireless, software radio, signal processing • 10-100 Gops required • 200 mW power budget • Applications dominated by tight loops processing large amounts of streaming data iPhone board 2

LD +/- * Loop Accelerators C Code Hardware Loop 3

Hardware Implementations • Customization gets order-of-magnitude performance and efficiency wins • Viterbi: 100x speedup vs. ARM9 FPGAs General PurposeProcessors DSPs CGRAs Flexibility Multifunction Loop Accelerators Loop Accelerators, ASICs Efficiency, Performance 4

What About Programmability? • Software changes – bug fixes, evolving standards • dct_8x8() from H.264 reference implementation for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) { i=pos_scan[coeff_ctr][0]; j=pos_scan[coeff_ctr][1]; run++; ilev=0; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { MCcoeff = MC(coeff_ctr); runs[MCcoeff]++; } m7 = &curr_res[block_y + j][block_x]; level = iabs (m7[i]); if (img->AdaptiveRounding) { fadjust8x8[j][block_x+i] = 0; } if (level != 0) { nonzero = TRUE; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { *coeff_cost += MAX_VALUE; img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]); img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff]; ++scan_pos; runs[MCcoeff]=-1; } else { *coeff_cost += MAX_VALUE; ACLevel[scan_pos ] = isignab(level,m7[i]); ACRun [scan_pos++] = run; run=-1; // reset zero level counter } level = isignab(level, m7[i]); ilev = level; } } for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) { i=pos_scan[coeff_ctr][0]; j=pos_scan[coeff_ctr][1]; run++; ilev=0; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { MCcoeff = MC(coeff_ctr); runs[MCcoeff]++; } m7 = &curr_res[block_y + j][block_x]; level = iabs (m7[i]); if (img->AdaptiveRounding) { fadjust8x8[j][block_x+i] = 0; } if (level != 0) { nonzero = TRUE; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { *coeff_cost += MAX_VALUE; img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]); img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff]; ++scan_pos; runs[MCcoeff]=-1; } else { *coeff_cost += MAX_VALUE; ACLevel[scan_pos ] = isignab(level,m7[i]); ACRun [scan_pos++] = run; run=-1; // reset zero level counter } level = isignab(level, m7[i]); ilev = level; } } for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) { i=pos_scan[coeff_ctr][0]; j=pos_scan[coeff_ctr][1]; run++; ilev=0; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { MCcoeff = MC(coeff_ctr); runs[MCcoeff]++; } m7 = &curr_res[block_y + j][block_x]; level = iabs (m7[i]); if (img->AdaptiveRounding) { fadjust8x8[j][block_x+i] = 0; } if (level != 0) { nonzero = TRUE; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { *coeff_cost += MAX_VALUE; img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]); img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff]; ++scan_pos; runs[MCcoeff]=-1; } else { *coeff_cost += MAX_VALUE; ACLevel[scan_pos ] = isignab(level,m7[i]); ACRun [scan_pos++] = run; run=-1; // reset zero level counter } level = isignab(level, m7[i]); ilev = level; } } Version 13.0 Version 13.1 Version 13.2 5

Programmable Loop Accelerator • Reusable hardware → reduced NRE costs • Generalize accelerator without losing efficiency FPGAs General PurposeProcessors DSPs CGRAs Programmable Loop Accelerators Flexibility Multifunction Loop Accelerators Loop Accelerators, ASICs Efficiency, Performance 6

Loop 1 Flexible Accelerators • Generalize accelerator architecture • Map new loops to existing hardware SynthesisSystem Hardware Compiler Loop 2 7

Loop Accelerator Architecture CRF Point-to-point Connections … … … … … … FSM Local Mem BR + & MEM Controlsignals • Hardware realization of modulo scheduled loop • Parameterized execution resources, storage, connectivity 8

Programmable Accelerator Architecture • Generalize architectural features that limit programmability CRF Literals Point-to-point Connections Bus … … … … … … Control Memory Local Mem BR +/- &/| MEM Controlsignals RR RR RR RR • ~50% area overhead vs. non-programmable accelerator 9

Mapping Loops onto Hardware Processor Accelerator FUs Storage Connectivity ALU ALU LD +/- * CRF 8 16 8 10

+3 +2 +5 +4 +3 +2 +5 +4 Scheduling Example MEM ADDER1 ADDER2 LD1 Time +2 +3 LD1 II=2 +4 +5 +2 +3 LD1 +4 +5 ? +2 +3 LD1 +4 11

Modulo Scheduling for LAs • Large search space, few solutions • Op-centric approaches unable to find solutions • Satisfiability Modulo Theory (SMT) formulation to solve linear and SAT constraints simultaneously Loop Move Insertion SMT Scheduling Register Allocation Control Signals Machine description Increment II 12

SMT Formulation • Boolean variables Xi,f,t are true if operation i is scheduled on FU f at time slot t. • Integer variables Si represent stage of operation i. i lat(i) dist(i,j) sched_time(j)  sched_time(i) + lat(i) – dist(i,j)  II j ( Xi,fi,ti Xj,fj,tj )  ( ) Sj II + tj  Si  II + ti + lat(i) – dist(i,j)  II • More details in paper 13

FU FU Hardware Loop Loop Loop Loop Loop Loop Measuring Programmability • How well can different loops be mapped onto the same hardware? • Performance matters – how much does II increase? • Need set of loops with different degrees of similarity ? 14

Graph Perturbation • Synthetically generated graphs • More perturbations → less similar to original graph • Iteratively apply random transformations: 15

Results – Perturbed Graphs Average II increase Base II 4 8 7 2 4 4 4 4 6 9 MPEG4 Signal processing Image Math 16

Results – Restricted Datapath 17

Conclusion • Increase flexibility of customized hardware without sacrificing performance, efficiency • Successfully map loops to heterogeneous hardware • Compile times of 5 minutes – 1 hour • Software changing faster than hardware → patchable ASIC 18

Questions? 19

Results – Cross Compilation 21

Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability

Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability

Presentation Transcript

COMP541 Datapaths I

Wooden Barn Door hardware and accessories for Customized loo

RTL and Datapaths

EECS 583 – Class 14 Modulo Scheduling Reloaded

Modulo Scheduling

Pipelining (Dynamic Scheduling Through Hardware Schemes)

Modulo Scheduling

Probabilistic Predicate-Aware Modulo Scheduling

Datapaths

Hardware Support for Modulo Scheduling

Datapaths

Iterative Modulo Scheduling

Chapter 10-Datapaths

Reusability

Does Self Scheduling for Nurses Increase Job Satisfaction?

EECS 583 – Class 13 Modulo Scheduling

Bundled-Data Datapaths

HSA Reusability Issues

EECS 583 – Class 14 Modulo Scheduling Reloaded

Does Scheduling Management Software Customized Services for Your Business

Customized Door Hardware Solutions for Businesses