This paper discusses the use of algorithmic skeletons and asynchronous RPC in embedded image processing for efficient parallelism and programmability. The implementation and results are presented, along with conclusions and future work.
Skeletons and Asynchronous RPC for Embedded Data- and Task Parallel Image Processing
IAPR Conference on Machine Vision Applications
Wouter Caarls, Pieter Jonker, Henk Corporaal
Quantitative Imaging Group, Department of Imaging Science & Technology
Overview
• Introduction & Motivation
• Approach
  • Algorithmic skeletons
  • Asynchronous RPC
• Implementation
  • Run-time system
  • Prototype architecture
• Results
• Conclusions & Future work
Introduction
• Efficient hardware implies parallelism and heterogeneity
• Efficient programmability implies custom, application-dependent hardware
• Finding the best hardware configuration for an application requires hardware-independent software
• For wide acceptance, programming should be easy
SmartCam: Integrating efficient, user-programmable image processing hardware within the camera or sensor itself.
[Figure: Philips CFT IV Inca+, with a 320 x 10-bit SIMD processor and a 5-issue VLIW processor]
Our approach
• Locality of reference → Data parallelism → Algorithmic skeletons
  • Avoid parallel bookkeeping
  • Hide hardware implementation
• Heterogeneity → Task parallelism → Asynchronous RPC
  • Familiar
  • Hide processor configuration
Algorithmic Skeletons: Separating structure from computation
[Figure: diagram with threshold (<=t, >t) and addition (+, =) operations]
Algorithmic Skeletons: Implicit parallel programming
• Choice of skeleton implies a set of constraints (dependencies)
• System is free to choose, as long as the constraints are not violated
  • Distribution
  • Scanning order
• Consistent library interface facilitates between-skeleton dependency analysis
  • No side effects
  • Well-defined inputs and outputs
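As an illustration of separating structure from computation, the following is a minimal sketch of a pixel-wise skeleton in C. The names pixel_op, pixel_kernel and binarize, and their signatures, are assumptions made for this sketch and are not the library interface presented in the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel type: computes one output pixel from one input pixel. */
typedef uint8_t (*pixel_kernel)(uint8_t in, int param);

/* Skeleton: owns the structure (which pixels are visited, in what order).
 * Because the kernel has no side effects, this loop could be distributed
 * over processing elements or reordered without changing the result. */
void pixel_op(pixel_kernel kernel, const uint8_t *in, uint8_t *out,
              size_t count, int param)
{
    for (size_t i = 0; i < count; i++)
        out[i] = kernel(in[i], param);
}

/* Computation: supplied by the programmer, unaware of any parallelism. */
uint8_t binarize(uint8_t in, int threshold)
{
    return in > threshold ? 255 : 0;
}
```

A call such as pixel_op(binarize, in, out, count, 50) states only what is computed per pixel; the sequential loop here stands in for whatever distribution and scanning order the system would choose.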
Algorithmic Skeletons: Disadvantages
• Inability to parallelize algorithms that cannot be expressed using one of the skeletons in the library
• Inability to specify certain algorithmic optimizations
• Inability to specify architecture-dependent optimizations
• Solution: Allow programmers to add their own (application-specific or architecture-specific) skeletons to the library
Remote procedure call
[Figure: the control processor calls Function1(…) and Function2(…), which execute on a coprocessor]
• Just like a function call
• Computes the function on a different processor
• All data goes through the calling processor
• Synchronous: stub returns when remote function is done; data is available immediately
• Asynchronous: stub returns immediately; data is available later
Futures
[Figure: the control processor calls Function1(&a) and Function2(&a), which execute on a coprocessor]
• Function returns reference to future result
• Reference can be used in other RPC calls
• Using the reference outside an RPC call requires an (implicit) block
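The behaviour of asynchronous stubs and futures can be sketched in plain C. This is a single-threaded simulation under stated assumptions: the "coprocessor" is a FIFO of pending calls that only runs when the caller blocks, and the names future_int, rpc_async, block, produce and add_one are hypothetical, not the interface from the talk.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16  /* assumed capacity of the simulated call queue */

/* A future: a value plus a flag telling whether it has been produced yet. */
typedef struct {
    int value;
    int ready;
} future_int;

/* Hypothetical remote function: reads an optional input future, fills out. */
typedef void (*remote_fn)(const future_int *in, future_int *out);

typedef struct {
    remote_fn fn;
    const future_int *in;
    future_int *out;
} pending_call;

static pending_call queue[MAX_PENDING];
static size_t q_head, q_tail;

/* Asynchronous stub: records the call and returns immediately.
 * The input may itself be a future that is not ready yet. */
void rpc_async(remote_fn fn, const future_int *in, future_int *out)
{
    assert(q_tail < MAX_PENDING);
    out->ready = 0;
    queue[q_tail++] = (pending_call){fn, in, out};
}

/* Block: run pending calls in issue order until the future is ready. */
int block(future_int *f)
{
    while (!f->ready && q_head < q_tail) {
        pending_call c = queue[q_head++];
        assert(c.in == NULL || c.in->ready);  /* FIFO order satisfies dependencies */
        c.fn(c.in, c.out);
        c.out->ready = 1;
    }
    assert(f->ready);
    return f->value;
}

/* Two example remote functions. */
void produce(const future_int *in, future_int *out) { (void)in; out->value = 20; }
void add_one(const future_int *in, future_int *out) { out->value = in->value + 1; }
```

Issuing rpc_async(produce, NULL, &a) followed by rpc_async(add_one, &a, &b) passes the future a to a second call without waiting for it; only block(&b), which reads the value on the calling side, forces both calls to complete. In a real system the calls would execute concurrently on coprocessors rather than lazily on the caller.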
Parallelism through RPC
[Figure: two diagrams of the control processor, coprocessor 1 and coprocessor 2 executing Function1(…) and Function2(…)]
• RPC is not intrinsically parallel
• Synchronous RPC calling parallel function: data parallelism
• Asynchronous RPC calling (parallel) function: task parallelism
Optimizing communications
[Figure: control processor, coprocessor 1 and coprocessor 2 executing Function1(…) and Function2(…)]
• Real-time image processing requires vast amounts of bandwidth
• Scatter-gather creates a bottleneck at the control processor
• Allow peer-to-peer communications between remote functions
Optimizing memory usage
[Figure: control processor, coprocessor 1 and coprocessor 2 executing Function1(…) and Function2(…)]
• In embedded applications, memory is scarce
• Normal task parallelism requires a frame store per concurrent operation
• Pipelining removes the need for frame stores
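The memory argument for pipelining can be illustrated with a small C sketch. Two stages are chained row by row, so the intermediate result needs one row of storage instead of a full frame. The stage functions, MAX_WIDTH and the row granularity are assumptions for this sketch; in the system described, the stages would run on different processors connected by FIFOs rather than interleaved in one loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WIDTH 640  /* assumed upper bound on the row length */

/* First stage (hypothetical): halve the intensity of one row. */
static void halve_row(const uint8_t *in, uint8_t *out, size_t width)
{
    for (size_t x = 0; x < width; x++)
        out[x] = in[x] / 2;
}

/* Second stage (hypothetical): threshold one row. */
static void binarize_row(const uint8_t *in, uint8_t *out, size_t width,
                         uint8_t threshold)
{
    for (size_t x = 0; x < width; x++)
        out[x] = in[x] > threshold ? 255 : 0;
}

/* Pipelined execution: each row flows through both stages before the
 * next row is touched, so no intermediate frame is ever stored. */
void pipelined(const uint8_t *in, uint8_t *out, size_t width, size_t height,
               uint8_t threshold)
{
    uint8_t row[MAX_WIDTH];  /* one row of intermediate storage, not a frame */

    assert(width <= MAX_WIDTH);
    for (size_t y = 0; y < height; y++) {
        halve_row(in + y * width, row, width);
        binarize_row(row, out + y * width, width, threshold);
    }
}
```

The non-pipelined alternative would run halve_row over the whole image into a width x height buffer before starting the second stage, which is the frame store per concurrent operation that the slide refers to.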
Example: Object following
[Figure: task graph of GetImage, {x, y, n}={0,0,0}, gauss_5x5, binarize, gravity, x=x/n; y=y/n and SetMotorSpeed across the Sensor, SIMD and GP processors]
/* Object following */
while (1) {
  GetImage(in);
  IsoWindowOp(WDW(5), gauss_5x5, in, filtered);
  IsoPixelOp(binarize, filtered, segmented, 50);
  {x, y, n} = {0, 0, 0};
  AnisoPixelReductionOp(gravity, add, segmented, &{x, y, n}, 3*sizeof(int));
  block(&{x, y, n});
  x=x/n; y=y/n;
  SetMotorSpeed(WIDTH/2-x, HEIGHT/2-y);
}
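The reduction step of the example can be sketched as ordinary C. This is an assumed reading of what a position-aware ("anisotropic") reduction with a gravity kernel and an add combiner computes; the types, signatures and the per-row partial results are hypothetical and chosen only to show how partial sums from independent image parts can be combined.

```c
#include <stddef.h>
#include <stdint.h>

/* Accumulator for the centre of gravity: coordinate sums and pixel count. */
typedef struct { long x, y, n; } gravity_acc;

/* Hypothetical kernel type: sees the pixel value and its position. */
typedef void (*aniso_kernel)(uint8_t pixel, size_t px, size_t py,
                             gravity_acc *acc);
/* Hypothetical combiner type: merges one partial result into another. */
typedef void (*combine_fn)(gravity_acc *into, const gravity_acc *from);

/* Reduction skeleton: each row yields a partial result, and the partial
 * results are merged with the combiner. Rows are independent, so they
 * could be handled by different processing elements. */
void aniso_pixel_reduction(aniso_kernel kernel, combine_fn combine,
                           const uint8_t *img, size_t width, size_t height,
                           gravity_acc *result)
{
    for (size_t py = 0; py < height; py++) {
        gravity_acc partial = {0, 0, 0};
        for (size_t px = 0; px < width; px++)
            kernel(img[py * width + px], px, py, &partial);
        combine(result, &partial);
    }
}

/* Kernel: accumulate the coordinates of every object (non-zero) pixel. */
void gravity(uint8_t pixel, size_t px, size_t py, gravity_acc *acc)
{
    if (pixel) {
        acc->x += (long)px;
        acc->y += (long)py;
        acc->n++;
    }
}

/* Combiner: field-wise addition of two partial results. */
void add(gravity_acc *into, const gravity_acc *from)
{
    into->x += from->x;
    into->y += from->y;
    into->n += from->n;
}
```

Dividing the accumulated x and y by n gives the object position, as in the loop above. When no object pixel is found, n is zero, so a caller would want to check n before dividing.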
Run-time system implementation
Function calls: Read(&a); Process(&a, &b);
• Frontend
  • Marshalling
  • Resolving futures
• Mapper
  • Find processor (user/static/dynamic)
• Dispatcher
  • Set up FIFOs, channels
  • Dispatch operations
• Gatherer
  • Await completion
  • Signal blocks
Results
Double thresholding edge detection
Conclusion
The proposed programming model
• Is easy to use
  • Skeletons hide data parallel bookkeeping
  • RPC hides task parallel implementation
• Is architecture independent
  • A skeleton can be implemented for different architectures
  • RPC can map to heterogeneous system
• Is optimized for embedded usage
  • Peer-to-peer communication: no scatter/gather bottleneck
  • Pipelined: no frame stores
Future work
• Skeletons
  • Skeleton Definition Language
  • Skeleton merging
• Mapping
  • Memory
  • Scalar dependencies
• Evaluation
  • New prototype architecture
  • Dynamic, complex application