Algorithmic Skeletons for Stream Programming in Embedded Heterogeneous Parallel Image Processing Applications. IPDPS 2006. Wouter Caarls, Pieter Jonker, Henk Corporaal. Quantitative Imaging Group, Department of Imaging Science and Technology.
Overview • Stream programming • Writing stream kernels • Algorithmic skeletons • Writing algorithmic skeletons • Skeleton merging • Results • Conclusion & Future work
Stream Programming • FIFO-connected kernels processing series of data elements • Well suited to signal processing applications • Explicit communication and task decomposition • Ideal for distributed-memory systems • Each data element processed (mostly) independently • Ideal for data-parallel systems such as SIMDs
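The FIFO-connected model above can be sketched in plain C. This is only an illustration under stated assumptions: the ring buffer `Fifo`, its capacity, the two kernels `scale_kernel` and `offset_kernel`, and the driver `run_pipeline` are hypothetical names that do not appear in the slides, and a real system would run the kernels on separate processors rather than interleave them in one loop.

```c
#include <stddef.h>

#define FIFO_CAPACITY 8  /* hypothetical buffer size */

/* Ring buffer that connects a producer kernel to a consumer kernel. */
typedef struct {
    int data[FIFO_CAPACITY];
    size_t head, tail, count;
} Fifo;

/* Returns 1 on success, 0 if the FIFO is full. */
static int fifo_push(Fifo *f, int v) {
    if (f->count == FIFO_CAPACITY) return 0;
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAPACITY;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty. */
static int fifo_pop(Fifo *f, int *v) {
    if (f->count == 0) return 0;
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return 1;
}

/* Each kernel handles one element and needs no other element. */
static int scale_kernel(int x)  { return 2 * x; }
static int offset_kernel(int x) { return x + 1; }

/* Runs in -> scale_kernel -> FIFO -> offset_kernel -> out.
   Producer and consumer are interleaved so the FIFO never overflows. */
static void run_pipeline(const int *in, int *out, size_t n) {
    Fifo f = {0};
    size_t produced = 0, consumed = 0;
    while (consumed < n) {
        int v;
        /* Producer: push while input remains and the FIFO has room. */
        while (produced < n && fifo_push(&f, scale_kernel(in[produced])))
            produced++;
        /* Consumer: drain everything currently in the FIFO. */
        while (fifo_pop(&f, &v))
            out[consumed++] = offset_kernel(v);
    }
}
```

All communication between the two kernels goes through the FIFO, which is why this model maps naturally onto distributed-memory systems.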
Kernel Examples from Image Processing (in order of increasing generality and architectural requirements) • Pixel processing (color space conversion) • Perfect match • Local neighborhood processing (convolution) • Requires 2D access • Recursive neighborhood processing (distance transform) • Regular data dependencies • Stack processing (region growing) • Irregular data dependencies
Writing Kernels • The language for writing kernels should be restricted • To allow efficient compilation to constrained architectures • But also general • So many different algorithms can be specified • Solution: a different language for each type of kernel • User selects the most restricted language that supports his kernel • Retargetability • Efficiency • Ease-of-use
Algorithmic skeletons* as kernel languages • An algorithmic skeleton captures a pattern of computation • It is conceptually a higher-order function that repeatedly calls a kernel function with certain parameters • Iteration strategy may be parallel • Kernel parameters restrict dependencies • It provides the environment in which the kernel runs, and can be seen as a very restricted DSL
*M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation, 1989
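The higher-order-function view can be sketched in plain C with a function pointer. This is a sequential illustration, not the PEPCI implementation: the type `PixelKernel` and the function `pixel_to_pixel_op` are hypothetical, and the kernel is modeled on the `binarize` example shown later in the deck.

```c
#include <stddef.h>

/* Kernel type: sees exactly one input pixel and one output pixel. */
typedef void (*PixelKernel)(const int *i, int *o, const void *param);

/* Hypothetical sequential pixel-to-pixel skeleton. It owns the
   iteration and calls the kernel once per element. Because the kernel
   cannot reach neighboring pixels, the loop could be split across
   processing elements without changing the result. */
static void pixel_to_pixel_op(PixelKernel kernel, const int *in, int *out,
                              size_t n, const void *param) {
    for (size_t k = 0; k < n; k++)
        kernel(&in[k], &out[k], param);
}

/* Kernel: output is 1 where the input exceeds the threshold. */
static void binarize(const int *i, int *o, const void *param) {
    const int *threshold = param;
    *o = (*i > *threshold);
}
```

A call such as `pixel_to_pixel_op(binarize, in, out, n, &threshold)` then plays the role of the skeleton instantiation; the restricted kernel signature is what limits the dependencies.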
Kernel definition:
    NeighborhoodToPixelOp()
    Average(in stream float i[-1..1][-1..1], out stream float *o)
    {
      int ky, kx;
      float acc=0;
      for (ky=-1; ky <=1; ky++)
        for (kx=-1; kx <=1; kx++)
          acc += i[ky][kx];
      *o = acc/9;
    }
Skeleton: sequential neighborhood skeleton
Resulting operation:
    void Average(float **i, float **o)
    {
      for (int y=1; y < HEIGHT-1; y++)
        for (int x=1; x < WIDTH-1; x++) {
          float acc=0;
          acc += i[y-1][x-1]; acc += i[y-1][x ]; acc += i[y-1][x+1];
          acc += i[y ][x-1]; acc += i[y ][x ]; acc += i[y ][x+1];
          acc += i[y+1][x-1]; acc += i[y+1][x ]; acc += i[y+1][x+1];
          o[y][x] = acc/9;
        }
    }
Skeleton tasks • Implement structure • Outer loop, border handling, buffering, parallel implementation • Just write C code • Transform kernel • Stream access, translation to target language • Term rewriting • How to combine in a single language? • Partial evaluation
Term rewriting (1)
Input:
    *o = acc/9;
Rewrite rule (applied topdown to all nodes):
    replace(`o`, `&o[y][x]`);
Output (after *&o[y][x] simplifies to o[y][x]):
    o[y][x] = acc/9;
Term rewriting (2) Using Stratego*
Input:
    acc += i[ky][kx];
Rewrite rule (applied topdown to all nodes):
    RelativeToAbsolute: |[ i[~e1][~e2] ]| -> |[ i[y + ~e1][x + ~e2] ]|
Output:
    acc += i[y+ky][x+kx];
*E. Visser. Stratego: A language for program transformation based on rewriting strategies, 2001
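To make the topdown application of the rule concrete, here is a hand-written C sketch of the same rewrite on a tiny expression tree. It is not Stratego and not PEPCI: the `Node` type, the helpers `mk`, `id`, `topdown` and `print_expr`, and the driver `rewrite_example` are hypothetical, and nodes are never freed to keep the sketch short.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { ID, ADD, ARRAY_INDEX } Kind;

typedef struct Node {
    Kind kind;
    const char *name;           /* ID only */
    struct Node *left, *right;  /* ADD: operands; ARRAY_INDEX: array, index */
} Node;

static Node *mk(Kind kind, const char *name, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    if (n == NULL) abort();
    n->kind = kind; n->name = name; n->left = l; n->right = r;
    return n;
}

static Node *id(const char *name) { return mk(ID, name, NULL, NULL); }

/* Rule: s[e1][e2] -> s[y + e1][x + e2] for the stream named s.
   Leaves the node untouched when the pattern does not match. */
static void relative_to_absolute(Node *n, const char *s) {
    if (n->kind != ARRAY_INDEX) return;
    Node *inner = n->left;
    if (inner->kind != ARRAY_INDEX) return;
    if (inner->left->kind != ID || strcmp(inner->left->name, s) != 0) return;
    inner->right = mk(ADD, NULL, id("y"), inner->right);
    n->right = mk(ADD, NULL, id("x"), n->right);
}

/* Topdown traversal: try the rule at this node, then visit the children. */
static void topdown(Node *n, const char *s) {
    if (n == NULL) return;
    relative_to_absolute(n, s);
    topdown(n->left, s);
    topdown(n->right, s);
}

/* Appends the C text of an expression to buf (caller sizes the buffer). */
static void print_expr(const Node *n, char *buf) {
    switch (n->kind) {
    case ID:
        strcat(buf, n->name);
        break;
    case ADD:
        print_expr(n->left, buf); strcat(buf, " + "); print_expr(n->right, buf);
        break;
    case ARRAY_INDEX:
        print_expr(n->left, buf); strcat(buf, "[");
        print_expr(n->right, buf); strcat(buf, "]");
        break;
    }
}

/* Builds i[ky][kx], rewrites it, and writes the result text into buf
   (64 bytes is enough for this example). */
static void rewrite_example(char *buf) {
    Node *access = mk(ARRAY_INDEX, NULL,
                      mk(ARRAY_INDEX, NULL, id("i"), id("ky")), id("kx"));
    topdown(access, "i");
    buf[0] = '\0';
    print_expr(access, buf);
}
```

After the rule fires, the inner node has the shape `i[...]` with a plain identifier as its array, so the pattern no longer matches and the traversal terminates.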
PEPCI (1) Rule composition and code generation in C
Rule definition:
    stratego RelativeToAbsolute(code i, code body)
    {
      main = <topdown(RelativeToAbsolute')>(body)
      RelativeToAbsolute': |[ ~i[~e1][~e2] ]| -> |[ ~i[y + ~e1][x + ~e2] ]|
    }
Rule composition:
    for (a=0; a < arguments; a++)
      if (args[a].type == ARG_STREAM_IN)
        body = RelativeToAbsolute(args[a].id, body);
      else if (args[a].type == ARG_STREAM_OUT)
        body = DerefToArrayIndex(args[a].id, body);
Code generation:
    for (y=1; y < HEIGHT-1; y++)
      for (x=1; x < WIDTH-1; x++)
        @body;
PEPCI (2) Combining rule composition and code generation
• How to distinguish rule composition from code generation?
    for (a=0; a < arguments; a++)
      body = DerefToArrayIndex(args[a].id, body);
    for (x=0; x < stride; x++)
      @body;
• Partial evaluation: evaluate only the parts of the program that are known, and output the rest
• arguments, DerefToArrayIndex, args[a].id and body are known -> evaluate
• stride is unknown -> output
PEPCI (3) Partial evaluation by interpretation
Input:
    double n, x=1;
    int ii, iterations=3;
    scanf("%lf", &n);
    for (ii=0; ii < iterations; ii++)
      x = (x + n/x)/2;
    printf("sqrt(%f) = %f\n", n, x);
Output:
    double n;
    double x;
    int ii;
    int iterations;
    x = 1;
    iterations = 3;
    scanf("%lf", &n);
    ii = 0;
    x = (1 + n/1)/2;
    ii = 1;
    x = (x + n/x)/2;
    ii = 2;
    x = (x + n/x)/2;
    ii = 3;
    printf("sqrt(%f) = %f\n", n, x);
Symbol table (successive states, ? = unknown):
    double n   double x   int ii   int iterations
    ?          1
    ?          1          ?        3
    ?          1          0        3
    ?          ?          0        3
    ?          ?          1        3
    ?          ?          2        3
    ?          ?          3        3
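The effect of this specialization can be checked in ordinary C by writing the general program and the residual program as two functions. The function names `sqrt_general` and `sqrt_specialized` are hypothetical, and `n` is passed as a parameter instead of being read with `scanf` so the sketch stays self-contained.

```c
/* General program: the iteration count is a run-time value. */
static double sqrt_general(double n, int iterations) {
    double x = 1;
    for (int ii = 0; ii < iterations; ii++)
        x = (x + n / x) / 2;
    return x;
}

/* Residual program for iterations == 3, mirroring the output above:
   the loop and its counter are evaluated away, and the first update
   uses the known value x == 1. Only n remains unknown. */
static double sqrt_specialized(double n) {
    double x;
    x = (1 + n / 1) / 2;
    x = (x + n / x) / 2;
    x = (x + n / x) / 2;
    return x;
}
```

For `n = 2` both functions perform the same three Newton steps (1.5, then 17/12, then 577/408), so they agree up to floating-point rounding.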
Kernelization overheads • Kernelizing an application impacts performance • Mapping • Scheduling • Buffer management • Lost ILP • Merge kernels • Extract static kernel sequences • Statically schedule at compile-time • Replace sequence with merged kernel
Skeleton merging • Skeletons are completely general functions • Cannot be properly analyzed or reasoned about • Restrict skeleton generality by using metaskeletons • Skeletons using the same metaskeleton can be merged • Merged operation still uses the original metaskeleton, and can be recursively merged
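The payoff of merging can be illustrated for the simplest case, two pixel-to-pixel kernels that share one iteration pattern. This is a hand-written sketch of the before and after, not the metaskeleton mechanism itself; the kernels `gain` and `threshold` and both driver functions are hypothetical.

```c
#include <stddef.h>

static int gain(int v)      { return 2 * v; }
static int threshold(int v) { return v > 100; }

/* Unmerged: two passes, with the intermediate stream stored in a
   caller-supplied buffer tmp of n elements. */
static void run_separate(const int *in, int *tmp, int *out, size_t n) {
    for (size_t k = 0; k < n; k++) tmp[k] = gain(in[k]);
    for (size_t k = 0; k < n; k++) out[k] = threshold(tmp[k]);
}

/* Merged: one pass over the data. The intermediate value lives in a
   local variable, so no buffer and no second scheduling step are needed. */
static void run_merged(const int *in, int *out, size_t n) {
    for (size_t k = 0; k < n; k++) {
        int t = gain(in[k]);
        out[k] = threshold(t);
    }
}
```

The merged loop also gives the compiler both kernel bodies in a single basic block, which is where some of the lost ILP can be recovered.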
Example • Philips Inca+ smart camera • 640x480 sensor • XeTaL 16MHz, 320-way SIMD • TriMedia 180MHz, 5-issue VLIW • Ball detection • Filtering, Segmentation, Hough transform
Results • [Chart not reproduced; overhead categories shown: Buffers, Scheduling, ILP] • ILP not fully recovered
Conclusion • Stream programming is a natural fit for running image processing applications on distributed-memory systems • Algorithmic Skeletons efficiently exploit data parallelism, by allowing the user to select the most restricted skeleton that supports his kernel • Extensible (new skeletons) • Retargetable (new skeleton implementations) • PEPCI effectively combines the necessities of efficiently implementing algorithmic skeletons • Term rewriting (by embedding Stratego) • Partial evaluation (to automatically separate rule composition and code generation)
Future Work • Better merging of kernels • Merge more efficiently • Merge different metaskeletons • Implement on a more general architecture • Implement more demanding applications • And more involved skeletons
Partial evaluation (2) Free optimizations • Loop unrolling • If the conditions are known, and the body isn’t • Function inlining • Aggressive constant folding • Including external “pure” functions
Kernel translation • SIMD processors are not programmed in C, but in parallel derivatives • Skeleton should translate kernel to target language • Extend PEPCI with C derivative syntax • Though only minimally interpreted
Example: local neighborhood operation in XTC
Kernel:
    NeighbourhoodToPixelOp()
    sobelx(in stream unsigned char i[-1..1][-1..1], out stream int *o)
    {
      int x, y, temp;
      temp = 0;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x=x+2)
          temp = temp + x*i[y][x];
      *o = temp;
    }
Generated XTC code:
    static lmem _in2;
    static lmem _in1;
    {
      lmem temp;
      temp = (0)+((-1)*(_in2[-1 .. 0]));
      temp = (temp)+((1)*(_in2[1 .. 2]));
      temp = (temp)+((-1)*(_in1[-1 .. 0]));
      temp = (temp)+((1)*(_in1[1 .. 2]));
      temp = (temp)+((-1)*(larg0[-1 .. 0]));
      temp = (temp)+((1)*(larg0[1 .. 2]));
      larg1 = temp;
    }
    _in2 = _in1;
    _in1 = larg0;
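For comparison with the line-based XTC code, a sequential C version of the same kernel might look as follows. This is a reference sketch and not the output of any skeleton: the function name `sobelx_reference` and the image size are hypothetical, and border pixels are simply left untouched because the slides do not state the border policy for this example.

```c
#define WIDTH  5   /* hypothetical image size */
#define HEIGHT 4

/* Applies the sobelx kernel to every interior pixel: for each of the
   three rows of the neighborhood, the left neighbor is subtracted and
   the right neighbor is added. */
static void sobelx_reference(unsigned char in[HEIGHT][WIDTH],
                             int out[HEIGHT][WIDTH]) {
    for (int y = 1; y < HEIGHT - 1; y++)
        for (int x = 1; x < WIDTH - 1; x++) {
            int temp = 0;
            for (int ky = -1; ky < 2; ky++)
                for (int kx = -1; kx < 2; kx += 2)
                    temp += kx * in[y + ky][x + kx];
            out[y][x] = temp;
        }
}
```

On a horizontal ramp where each pixel equals 10 times its column index, every interior output is 3 * (10 * (x + 1) - 10 * (x - 1)) = 60.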
Stream program
    int main(int argc, char **argv)
    {
      STREAM a, b, c;
      int maxval, dummy, maxc;
      scInit(argc, argv);
      while (1) {
        capture(&a);
        interpolate(&a, &a);
        sobelx(&a, &b);
        sobely(&a, &c);
        magnitude(&b, &c, &a);
        direction(&b, &c, &b);
        mask(&b, &a, &a, scint(128));
        hough(&a, &a);
        display(&a);
        imgMax(&a, scint(0), &maxval, scint(0), &dummy, scint(0), &maxc);
        _block(&maxc, &maxval);
        printf("Ball found at %d with strength %d\n", maxc, maxval);
      }
      return scExit();
    }
Programming with algorithmic skeletons (1)
    PixelToPixelOp()
    binarize(in stream int *i, out stream int *o, in int *threshold)
    {
      *o = (*i > *threshold);
    }

    NeighbourhoodToPixelOp()
    average(in stream int i[-1..1][-1..1], out stream int *o)
    {
      int x, y;
      *o = 0;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x++)
          *o += i[y][x];
      *o /= 9;
    }
Programming with algorithmic skeletons (2)
    StackOp(in stream int *init)
    propagate(in stream int *i[-1..1][-1..1], out stream int *o)
    {
      int x, y;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x++)
          if (i[y][x] && !*o) {
            *o = 1;
            push(y, x);
          }
    }

    AssocPixelReductionOp()
    max(in stream int *i, out int *res)
    {
      if (*i > *res)
        *res = *i;
    }
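The slides do not show what a stack skeleton expands to, so the following is only one plausible sequential reading of `propagate`, written by hand in C. It assumes 8-connected region growing from a single seed, with the output flag checked at the neighbor position; the names `region_grow` and `Pos` and the image size are hypothetical.

```c
#define W 5   /* hypothetical image size */
#define H 5

typedef struct { int y, x; } Pos;

/* Grows a region from the seed (sy, sx) through 8-connected nonzero
   pixels of in, marking them in out (which must start zeroed).
   Returns the number of marked pixels. */
static int region_grow(int in[H][W], int out[H][W], int sy, int sx) {
    Pos stack[H * W];   /* each pixel is pushed at most once */
    int top = 0, marked = 0;

    if (!in[sy][sx]) return 0;
    out[sy][sx] = 1;
    marked++;
    stack[top++] = (Pos){sy, sx};

    while (top > 0) {
        Pos p = stack[--top];
        for (int dy = -1; dy < 2; dy++)
            for (int dx = -1; dx < 2; dx++) {
                int y = p.y + dy, x = p.x + dx;
                if (y < 0 || y >= H || x < 0 || x >= W) continue;
                /* Same test as the kernel: input set, output not yet set. */
                if (in[y][x] && !out[y][x]) {
                    out[y][x] = 1;
                    marked++;
                    stack[top++] = (Pos){y, x};
                }
            }
    }
    return marked;
}
```

The order in which pixels are visited depends on the data, which is the irregular dependency pattern that sets stack processing apart from the neighborhood skeletons.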
Algorithmic Skeletons [Diagram not reproduced; surviving labels: <=t, >t, +, =]
Term rewriting (1) From code to abstract syntax tree
Code:
    acc += i[ky][kx];
Tree:
    Stat
      AssignPlus
        Id "acc"
        ArrayIndex
          ArrayIndex
            Id "i"
            Id "ky"
          Id "kx"
Term:
    Stat(AssignPlus(Id("acc"),ArrayIndex(ArrayIndex(Id("i"),Id("ky")),Id("kx"))))