A Code Refinement Methodology for Performance-Improved Synthesis from C

Greg Stitt, Frank Vahid*, Walid Najjar
Department of Computer Science and Engineering
University of California, Riverside
* Also with the Center for Embedded Computer Systems, UC Irvine
This research is supported in part by the National Science Foundation and the Semiconductor Research Corporation

Introduction
• Previous work: In-depth hw/sw partitioning study of H.264 decoder
• Collaboration with Freescale
[Figure: H.264 functions motionComp(), filterLuma(), filterChroma(), deblocking(), ... -> Select Critical Region -> critical regions motionComp() and deblocking() go to Synthesis (FPGA); remaining code goes to the Compiler (uP)]

Introduction (continued)
• Obtained 2.5x speedup
• Large gap between ideal and actual speedup

Introduction (continued)
• Noticed coding constructs/practices limited hw speed
• Identified problematic coding constructs
• Developed simple coding guidelines
  • Dozens of lines of code
  • Minutes per guideline
• Refined critical regions using guidelines
[Figure: motionComp(), filterLuma(), filterChroma(), deblocking(), ... -> Apply Guidelines -> motionComp'(), filterLuma(), filterChroma(), deblocking'(), ... -> Hw/Sw Partitioning]

Introduction (continued)
• Simple guidelines increased speedup to 6.5x
• Can simple coding guidelines show similar improvements on other applications?

Coding Guidelines
• Analyzed dozens of benchmarks
• Identified common problems related to synthesis
• Developed 10 guidelines to fix problems
  • Conversion to Constants (CC)
  • Conversion to Explicit Data Flow (CEDF)
  • Conversion to Fixed Point (CF)
  • Conversion to Explicit Memory Accesses (CEMA)
  • Constant Input Enumeration (CIE)
  • Loop Rerolling (LR)
  • Conversion to Explicit Control Flow (CECF)
  • Function Specialization (FS)
  • Algorithmic Specialization (AS)
  • Pass-By-Value Return (PVR)
• Although some are well known, analysis shows they are rarely applied
• Automation unlikely or impossible in many cases

Fast Refinement
• Profiling results: only a few performance-critical regions
• Apply guidelines to only the critical regions
• Several dozen lines of code provide most of the performance improvement
• Refining takes minutes/hours
[Figure: sample application with functions Idct(), Memset(), FIR(), Sort(), Search(), ReadInput(), WriteOutput(), Matrix(), Brev(), Compress(), Quantize(), ...]

Conversion to Constants (CC)
• Problem: Arrays of constants commonly not specified as constants
  • Initialized at runtime
• Guideline: Use constant wrapper function
  • Specifies array constant for all future functions
• Automation
  • Difficult, requires global def-use/alias analysis

Original code:

    int coef[100];
    void initCoef() {
        // initialize coef
    }
    void fir() {
        // fir filter using coef
    }
    void f() {
        initCoef();
        // other code
        fir();
    }

Refined code:

    int coef[100];
    void initCoef() {
        // initialize coef
    }
    void fir(const int array[100]) {
        // fir filter using const array
    }
    void constWrapper(const int array[100]) {
        prefetchArray(array);
        // misc code . . .
        fir(array);
    }
    void f() {
        initCoef();
        constWrapper(coef);
    }

• Array can't change, so prefetching won't violate dependencies
• Can also enable constant folding
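The constant-wrapper idea above can be made concrete with a small compilable sketch. The tap count, the coefficient values, the `samples` parameter, and the snake_case names are assumptions added for illustration; the slide elides the filter body and the wrapper's other code.

```c
#include <stddef.h>

#define NUM_TAPS 4   /* assumed size; the slide uses 100 */

/* Filled in once at startup and never written again. */
static int coef[NUM_TAPS];

void init_coef(void) {
    for (size_t i = 0; i < NUM_TAPS; i++)
        coef[i] = (int)i + 1;   /* hypothetical values: 1, 2, 3, 4 */
}

/* const parameter: fir() promises not to write the taps. */
int fir(const int taps[NUM_TAPS], const int samples[NUM_TAPS]) {
    int acc = 0;
    for (size_t i = 0; i < NUM_TAPS; i++)
        acc += taps[i] * samples[i];
    return acc;
}

/* Wrapper: every function reached from here sees the array as read-only,
   which is what would let a synthesis tool prefetch it safely. */
int const_wrapper(const int taps[NUM_TAPS], const int samples[NUM_TAPS]) {
    return fir(taps, samples);
}
```

A caller would run `init_coef()` once and then call `const_wrapper(coef, samples)` in place of the original global-reading `fir()`.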

Conversion to Explicit Data Flow (CEDF)
• Problem: Global variables make determination of parallelism difficult
  • Requires global def-use/alias analysis
• Guideline: Replace globals with extra parameters
  • Makes data flow explicit
  • Simpler analysis may expose parallelism
• Automation
  • Has been proposed [Lee01]
  • But difficult because of aliases

Original code:

    int array[100];
    void a() {
        for (int i = 0; i < 100; i++)
            array[i] = . . . . .
    }
    void b() {
        for (int i = 0; i < 100; i++)
            array[i] = array[i] + f(i);
    }
    int c() {
        int temp = 0;
        for (int i = 0; i < 100; i++)
            temp += array[i];
        return temp;
    }
    void d() {
        for (. . . . . ) {
            a(); b(); c();
        }
    }

• a(), b(), c() must execute sequentially because of global array dependencies

Refined code:

    void a(int array[100]) {
        for (int i = 0; i < 100; i++)
            array[i] = . . . . .
    }
    void b(int array1[100], int array2[100]) {
        for (int i = 0; i < 100; i++)
            array2[i] = array1[i] + f(i);
    }
    int c(int array[100]) {
        int temp = 0;
        for (int i = 0; i < 100; i++)
            temp += array[i];
        return temp;
    }
    void d() {
        int array1[100], array2[100];
        for (. . . . . ) {
            a(array1); b(array1, array2); c(array2);
        }
    }

• a() and c() can execute in parallel after 1st iteration
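As a compilable sketch of the refined form, the version below passes every array as a parameter so that each function's inputs and outputs are visible in its signature. The array size, the body of `f()`, the values written by `a()`, and the iteration count and return value of `d()` are all placeholders, since the slide elides them.

```c
#define N 8   /* assumed size; the slide uses 100 */

/* Hypothetical per-element term standing in for f(i) on the slide. */
static int f(int i) { return 2 * i; }

/* Producer: writes only its output parameter. */
void a(int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = i;             /* placeholder for the elided expression */
}

/* Transformer: reads in[], writes out[]; no hidden global state. */
void b(const int in[N], int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = in[i] + f(i);
}

/* Consumer: reduces its input to a sum. */
int c(const int in[N]) {
    int temp = 0;
    for (int i = 0; i < N; i++)
        temp += in[i];
    return temp;
}

/* Driver: a() writes array1 while c() reads array2, so the two touch
   disjoint storage and could overlap across iterations. */
int d(int iterations) {
    int array1[N], array2[N];
    int total = 0;
    for (int k = 0; k < iterations; k++) {
        a(array1);
        b(array1, array2);
        total += c(array2);
    }
    return total;
}
```

Splitting the single global into `array1` and `array2` is what removes the dependency between `a()` and `c()`; on a processor the code still runs sequentially and produces the same sums.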

Constant Input Enumeration (CIE)
• Problem: Function parameters may limit parallelism
• Guideline: Create enum for possible values
  • Synthesis can create specialized functions
• Automation
  • In some cases, def-use analysis may identify all inputs
  • In general, difficult due to aliases

Original code:

    void f(int a, int b) {
        . . . .
        for (i = 0; i < a; i++) {
            for (j = 0; j < b; j++) {
                c[i][j] = i + j;
            }
        }
    }

• Bounds not known, hard to unroll
• One iteration at a time

Refined code:

    enum PRM { VAL1=2, VAL2=4 };
    void f(enum PRM a, enum PRM b) {
        . . . .
        for (i = 0; i < a; i++) {
            for (j = 0; j < b; j++) {
                c[i][j] = i + j;
            }
        }
    }

• Specialized versions: f(2,2), f(2,4), f(4,2), f(4,4)
• Iterations can be parallelized in each version
[Figure: the original computes c[i][j] one iteration at a time; each specialized version computes c[0][0], c[0][1], c[0][2], ... with parallel adders]
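To illustrate what a specialized version might look like, the sketch below pairs the enum-based function with one hand-written specialization for the (2,2) case. The name `f_2_2`, the helper `sum_c`, and the 4x4 size of `c` are assumptions for illustration; the slides do not show the code a synthesis tool would actually generate.

```c
#define MAX_DIM 4

enum PRM { VAL1 = 2, VAL2 = 4 };

static int c[MAX_DIM][MAX_DIM];

/* General version: each bound is known to be either 2 or 4. */
void f(enum PRM a, enum PRM b) {
    for (int i = 0; i < (int)a; i++)
        for (int j = 0; j < (int)b; j++)
            c[i][j] = i + j;
}

/* Hypothetical specialization for f(VAL1, VAL1): with both bounds fixed,
   the loop nest unrolls into four independent stores. */
void f_2_2(void) {
    c[0][0] = 0; c[0][1] = 1;
    c[1][0] = 1; c[1][1] = 2;
}

/* Helper for checking results: sums the top-left rows x cols corner of c. */
int sum_c(int rows, int cols) {
    int sum = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            sum += c[i][j];
    return sum;
}
```

Because the enum admits only two values per parameter, at most four such specializations are needed, one per combination of bounds.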

Conversion to Explicit Control Flow (CECF)
• Problem: Function pointers may prevent static control flow analysis
• Guideline: Replace function pointer with if-else, static calls
  • Makes possible targets explicit
• Automation
  • In general, is impossible
  • Equivalent to halting problem

Original code:

    void f( int (*fp) (int) ) {
        . . . . .
        for (i = 0; i < 10; i++) {
            a[i] = fp(i);
        }
    }

• Synthesis unlikely to determine possible targets of function pointer

Refined code:

    enum Target { FUNC1, FUNC2, FUNC3 };
    void f( enum Target fp ) {
        . . . . .
        for (i = 0; i < 10; i++) {
            if (fp == FUNC1) a[i] = f1(i);
            else if (fp == FUNC2) a[i] = f2(i);
            else a[i] = f3(i);
        }
    }

[Figure: with the function pointer, the synthesized hardware is unknown ("?"); with explicit control flow, f1(i), f2(i), and f3(i) feed a 3x1 multiplexer selected by fp to produce a[i]]
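A compilable version of the refined form is sketched below. The bodies of `f1`, `f2`, and `f3`, the function name `apply`, and passing `a` as a parameter are placeholders chosen for illustration; the slide does not define them.

```c
enum Target { FUNC1, FUNC2, FUNC3 };

/* Hypothetical target functions; the slide leaves their bodies unspecified. */
static int f1(int i) { return i + 1; }
static int f2(int i) { return i * 2; }
static int f3(int i) { return i * i; }

/* The enum selects among statically known calls, so every possible
   target is visible at the call site without pointer analysis. */
void apply(enum Target fp, int a[10]) {
    for (int i = 0; i < 10; i++) {
        if (fp == FUNC1)      a[i] = f1(i);
        else if (fp == FUNC2) a[i] = f2(i);
        else                  a[i] = f3(i);
    }
}
```

Callers that previously passed `&f2` would pass `FUNC2` instead; the set of legal targets is then fixed by the enum definition.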

Algorithmic Specialization (AS)
• Algorithms targeting sw may not be fast in hw
  • Sequential vs. parallel
• C code generally uses sw algorithms
• Guideline: Specialize critical functions with hw algorithms
• Automation
  • Requires higher level specification
  • Intrinsics

Software algorithm (binary search):

    int search(int a[], int k, int l, int r) {
        while (l <= r) {
            int mid = (l+r)/2;
            if (k > a[mid]) l = mid+1;
            else if (k < a[mid]) r = mid-1;
            else return mid;
        }
        return -1;
    }

Hardware algorithm (linear search with constant bound):

    int search(int a[], int k, const int s) {
        for (int i = 0; i < s; i++) {
            if (a[i] == k) return i;
        }
        return -1;
    }

• Can be parallelized in hardware
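The two search styles can be compared side by side in the sketch below. The names `search_sw` and `search_hw`, the fixed `SIZE`, and the exit-free form of the linear scan are assumptions for illustration: removing the early return makes every comparison independent, which is the property a synthesis tool would need to evaluate them in parallel.

```c
#define SIZE 8   /* assumed fixed array length */

/* Software-oriented: binary search over a sorted array. Each step
   depends on the previous comparison, so it is inherently sequential. */
int search_sw(const int a[], int k, int l, int r) {
    while (l <= r) {
        int mid = l + (r - l) / 2;   /* avoids overflow of l + r */
        if (k > a[mid])      l = mid + 1;
        else if (k < a[mid]) r = mid - 1;
        else return mid;
    }
    return -1;
}

/* Hardware-oriented: fixed-bound linear scan with no early exit.
   Scanning downward makes the lowest matching index the final value. */
int search_hw(const int a[SIZE], int k) {
    int found = -1;
    for (int i = SIZE - 1; i >= 0; i--)
        if (a[i] == k)
            found = i;
    return found;
}
```

On a sorted array with distinct elements both functions return the same index. In software the linear scan does far more work per lookup, which is the kind of overhead the methodology weighs before applying this guideline.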

Pass-By-Value Return (PVR)
• Problem: Array parameters cannot be prefetched due to potential aliases
  • Designer may know aliases don't exist
• Guideline: Use pass-by-value-return
• Automation
  • Requires global alias analysis

Original code:

    void f(int *a, int *b, int array[16]) {
        … // unrelated computation
        g(array);
        … // unrelated computation
    }
    int g(int array[16]) {
        // computation done on array
    }

• Can't prefetch array for g(), may be aliased

Refined code:

    void f(int *a, int *b, int array[16]) {
        int localArray[16];
        memcpy(localArray, array, 16*sizeof(int));
        … // misc computation
        g(localArray);
        … // misc computation
        memcpy(array, localArray, 16*sizeof(int));
    }
    int g(int array[16]) {
        // computation done on array
    }

• Local array can't be aliased, can prefetch
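A minimal runnable sketch of pass-by-value-return follows. The body of `g()`, the updates to `*a` and `*b`, and the `int` return type of `f()` are placeholders for the computation the slide elides.

```c
#include <string.h>

#define LEN 16

/* Stand-in for g(): doubles each element in place and returns the sum. */
int g(int array[LEN]) {
    int sum = 0;
    for (int i = 0; i < LEN; i++) {
        array[i] *= 2;
        sum += array[i];
    }
    return sum;
}

int f(int *a, int *b, int array[LEN]) {
    int local_array[LEN];

    /* Copy in: local_array is a fresh object that a and b cannot alias. */
    memcpy(local_array, array, sizeof local_array);

    *a += 1;                       /* placeholder for unrelated computation */
    int result = g(local_array);
    *b += 1;                       /* placeholder for unrelated computation */

    /* Copy out: make g()'s updates visible to the caller. */
    memcpy(array, local_array, sizeof local_array);
    return result;
}
```

The two `memcpy` calls are the software cost of this guideline, which is why the methodology applies it only when the copying overhead is acceptable.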

Why Synthesis From C?
• Why not use HDL?
• HDL may yield better results
• C is mainstream language
• Acceptable performance in many cases
• Learning HDL is large overhead
• Approaches are orthogonal
• This work focuses on improving mainstream
• Guidelines common for HDL
• Can also be applied to algorithmic HDL

Software Overhead
• Refined regions may not be partitioned to hardware
  • Partitioner may select non-refined regions
  • OS may select software or hardware implementation
    • Based on state of FPGA
• Coding guidelines have potential software overhead
• Problem: Refined code mapped to software
[Figure: motionComp'(), filterLuma(), filterChroma(), deblocking'(), ... -> Hw/Sw Partitioning -> motionComp'() and deblocking'() on the uP; filterLuma() and filterChroma() on the FPGA]

Refinement Methodology
• Considerations
  • Reduce software overhead
  • Reduce refinement time
• Methodology
  • Profile
  • Iterative-improvement
    • Determine critical region
    • Apply all except PVR/AS
      • Minimal overhead
    • Apply PVR if overhead acceptable
    • Apply AS if known algorithm and overhead acceptable
[Flowchart: Profile -> Determine Critical Region -> Apply CC, CF, CEMA, CIE, CEDF, CECF, FS, LR -> "Is overhead of copying array acceptable?" (yes: Apply PVR) -> "Does suitable hw algorithm exist and have acceptable sw performance?" (yes: Apply AS) -> Repeat until performance acceptable]

Experimental Setup
• Benchmark suite
  • MediaBench, Powerstone
• Manually applied guidelines
  • 1-2 hours
  • 23 additional lines/benchmark, on average
• Target Architecture
  • Xilinx Virtex II FPGA with ARM9 uP
• Hardware/software partitioning
  • Selects critical regions for hardware
• Synthesis
  • High-level synthesis tool
    • ~30,000 lines of C code
    • Outputs register-transfer level (RTL) VHDL
  • RTL Synthesis using Xilinx ISE
• Compilation
  • Gcc with -O1 optimizations
[Figure: tool flow: Benchmarks -> Manual Refinement -> Refined Code -> Hw/Sw Partitioning -> Sw: Compilation; Hw: Synthesis -> Bitfile; target platform: ARM9 + Virtex II]

Speedups from Guidelines
• No guidelines: speedup 2x
• Conversion to Constants: speedup 3.6x, total time 5 minutes
• Explicit Dataflow + Algorithmic Specialization: speedup 16.4x, total time 15 minutes

Speedups from Guidelines
• No guidelines: speedup 8.6x
• Conversion to Constants: speedup 14.4x, total time 10 minutes
• Input Enumeration: speedup 16.7x, total time 20 minutes
• Algorithmic Specialization: speedup 19x, time 30 minutes, sw overhead 6000%

Speedups from Guidelines
• Original code
  • Speedups range from 1x (no speedup) to 573x
  • Average: 2.6x (excludes brev)
• Refined code with guidelines
  • Average: 8.4x (excludes brev)
  • 3.5x average improvement compared to original code

Speedups from Guidelines
• Guidelines move speedups closer to ideal
  • Almost identical for mpeg2, fir
• Several examples still far from ideal
  • May imply new guidelines needed

Guideline SW Overhead/Improvement
• Average Sw performance overhead: -15.7% (improvement)
  • -1.1% excluding brev
  • 3 examples improved
• Average Sw size overhead (lines of C code)
  • 8.4% excluding brev
[Figure: per-benchmark software overhead/improvement chart; labels: Overhead, Improvement]

Summary
• Simple coding guidelines significantly improve synthesis from C
  • 3.5x speedup compared to Hw/Sw synthesized from unrefined code
• Major rewrites may not be necessary
  • Between 1-2 hours
• Refinement Methodology
  • Reduces software size/performance overhead
  • In some cases, improvement
• Future Work
  • Test on commercial synthesis tools
  • New guidelines for different domains
