Skeletons and Asynchronous RPC for Embedded Data- and Task Parallel Image Processing

IAPR Conference on Machine Vision Applications
Wouter Caarls, Pieter Jonker, Henk Corporaal
Quantitative Imaging Group, Department of Imaging Science & Technology

Overview
• Introduction & Motivation
• Approach
  • Algorithmic skeletons
  • Asynchronous RPC
• Implementation
  • Run-time system
  • Prototype architecture
• Results
• Conclusions & Future work

Introduction
• Efficient hardware implies parallelism and heterogeneity
• Efficient programmability implies custom, application-dependent hardware
• Finding the best hardware configuration for an application requires hardware-independent software
• For wide acceptance, programming should be easy

SmartCam: integrating efficient, user-programmable image processing hardware within the camera or sensor itself.

[Figure labels: Philips CFT IV Inca+; 320 x 10-bit SIMD; 5-issue VLIW]

Our approach
• Locality of reference, data parallelism: algorithmic skeletons
  • Avoid parallel bookkeeping
  • Hide hardware implementation
• Heterogeneity, task parallelism: asynchronous RPC
  • Familiar
  • Hide processor configuration

Algorithmic Skeletons
Separating structure from computation

[Figure: diagrams built from threshold tests (<=t, >t) and additions (+ =)]

Algorithmic Skeletons
Implicit parallel programming
• Choice of skeleton implies a set of constraints (dependencies)
• System is free as long as the constraints are not violated
  • Distribution
  • Scanning order
• Consistent library interface facilitates between-skeleton dependency analysis
  • No side effects
  • Well-defined inputs and outputs
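A minimal sketch of the idea in plain C, assuming a pixel-to-pixel skeleton. The names `pixel_kernel` and `pixel_op` are hypothetical and only loosely modelled on the `IsoPixelOp(binarize, …)` call in the example slide; this is not the authors' library. The skeleton owns the loop structure, and the kernel holds the computation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel type: computes one output pixel from one input pixel. */
typedef uint8_t (*pixel_kernel)(uint8_t pixel, int param);

/* Skeleton (structure): owns the iteration. Because the kernel has no side
 * effects, a real implementation would be free to distribute this loop or
 * to scan the pixels in any order. Here it simply runs sequentially. */
static void pixel_op(pixel_kernel kernel, const uint8_t *in, uint8_t *out,
                     size_t n_pixels, int param)
{
    assert(kernel != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n_pixels; ++i)
        out[i] = kernel(in[i], param);
}

/* Kernel (computation): written by the application programmer. */
static uint8_t binarize(uint8_t pixel, int threshold)
{
    return pixel > threshold ? 255 : 0;
}
```

Calling `pixel_op(binarize, in, out, n, 50)` thresholds `n` pixels at 50. Swapping in another kernel changes the computation without touching the loop.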

Algorithmic Skeletons
Disadvantages
• Inability to parallelize algorithms that cannot be expressed using one of the skeletons in the library
• Inability to specify certain algorithmic optimizations
• Inability to specify architecture-dependent optimizations
• Solution: allow programmers to add their own (application-specific or architecture-specific) skeletons to the library

Remote procedure call
• Just like a function call
• Computes the function on a different processor
• All data goes through the calling processor
• Synchronous: stub returns when the remote function is done; data is available immediately
• Asynchronous: stub returns immediately; data is available later

[Diagram: the control processor calls Function1(…) and Function2(…); each runs on the coprocessor]

Futures
• Function returns a reference to the future result
• Reference can be used in other RPC calls
• Using the reference outside an RPC call requires an (implicit) block

[Diagram: the control processor calls Function1(&a) and Function2(&a); each runs on the coprocessor]
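The behaviour of futures can be illustrated with a small single-threaded simulation in C. Everything here is an assumption made for illustration: `future_int`, `rpc_async`, and the queue are hypothetical, the coprocessor is simulated by deferring the call on the same processor, and only `block` borrows its name from the example slide.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical future: a value plus a flag saying whether it exists yet. */
typedef struct {
    int ready;   /* 0 until the producing call has run */
    int value;
} future_int;

typedef void (*remote_fn)(const future_int *arg, future_int *result);

typedef struct {
    remote_fn fn;
    const future_int *arg;
    future_int *result;
} pending_call;

static pending_call queue[MAX_PENDING];
static size_t head, tail;

/* Asynchronous stub: records the call and returns immediately.
 * The result is only a reference to a value that does not exist yet. */
static void rpc_async(remote_fn fn, const future_int *arg, future_int *result)
{
    assert(tail < MAX_PENDING);
    result->ready = 0;
    queue[tail++] = (pending_call){fn, arg, result};
}

/* Block: run queued calls in order until the requested future is ready. */
static int block(future_int *f)
{
    while (!f->ready && head < tail) {
        pending_call c = queue[head++];
        /* In-order execution means any queued producer has already run. */
        assert(c.arg == NULL || c.arg->ready);
        c.fn(c.arg, c.result);
        c.result->ready = 1;
    }
    assert(f->ready);
    return f->value;
}

/* Two stand-in remote functions. */
static void produce(const future_int *arg, future_int *result)
{
    (void)arg;
    result->value = 20;
}

static void add_one(const future_int *arg, future_int *result)
{
    result->value = arg->value + 1;
}
```

After `rpc_async(produce, NULL, &a)` and `rpc_async(add_one, &a, &b)`, neither result exists yet, but `&a` has already been passed to the second call. Only `block(&b)` forces both calls to complete.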

Parallelism through RPC
• RPC is not intrinsically parallel
• Synchronous RPC calling a parallel function: data parallelism
• Asynchronous RPC calling a (parallel) function: task parallelism

[Diagrams: a control processor calling Function1(…) and Function2(…) on Coprocessor 1 and Coprocessor 2, shown for both cases]

Optimizing communications
• Real-time image processing requires vast amounts of bandwidth
• Scatter-gather creates a bottleneck at the control processor
• Allow peer-to-peer communications between remote functions

[Diagram: the control processor calls Function1(…) on Coprocessor 1 and Function2(…) on Coprocessor 2]

Optimizing memory usage
• In embedded applications, memory is scarce
• Normal task parallelism requires a frame store per concurrent operation
• Pipelining

[Diagram: the control processor calls Function1(…) on Coprocessor 1 and Function2(…) on Coprocessor 2]
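The memory saving from pipelining can be sketched as follows, assuming two line-based stages. The stage functions and image size are hypothetical, and the stages are interleaved on one processor rather than run concurrently. The point is only that the intermediate result occupies one line instead of a whole frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { WIDTH = 4, HEIGHT = 3 };

/* Stage 1 (hypothetical): halve the brightness of one image line. */
static void stage_dim(const uint8_t *in, uint8_t *out)
{
    for (size_t x = 0; x < WIDTH; ++x)
        out[x] = in[x] / 2;
}

/* Stage 2 (hypothetical): count the pixels of one line above a threshold. */
static unsigned stage_count(const uint8_t *line, uint8_t threshold)
{
    unsigned n = 0;
    for (size_t x = 0; x < WIDTH; ++x)
        if (line[x] > threshold)
            ++n;
    return n;
}

/* Pipelined execution: stage 2 consumes each line as soon as stage 1 has
 * produced it, so the intermediate image is never stored in full. */
static unsigned run_pipelined(const uint8_t *image, uint8_t threshold)
{
    uint8_t line[WIDTH];   /* WIDTH bytes instead of WIDTH * HEIGHT */
    unsigned total = 0;

    assert(image != NULL);
    for (size_t y = 0; y < HEIGHT; ++y) {
        stage_dim(image + y * WIDTH, line);
        total += stage_count(line, threshold);
    }
    return total;
}
```

Without pipelining, stage 1 would write a full `WIDTH * HEIGHT` intermediate frame before stage 2 could start. A window operation would need a few lines of buffering rather than one, but still far less than a frame.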

Example
Object following

    /* Object following */
    while (1) {
      GetImage(in);
      IsoWindowOp(WDW(5), gauss_5x5, in, filtered);
      IsoPixelOp(binarize, filtered, segmented, 50);
      {x, y, n} = {0, 0, 0};
      AnisoPixelReductionOp(gravity, add, segmented, &{x, y, n}, 3*sizeof(int));
      block(&{x, y, n});  /* wait until the reduction result is available */
      x=x/n; y=y/n;
      SetMotorSpeed(WIDTH/2-x, HEIGHT/2-y);
    }

[Diagram: the operations GetImage, {x, y, n}={0,0,0}, gauss_5x5, binarize, gravity, x=x/n; y=y/n and SetMotorSpeed mapped onto the Sensor, SIMD and GP processors]
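The `gravity` reduction in this example could look roughly like the plain C sketch below. The kernel names `gravity` and `add` come from the slide, but their signatures, the `gravity_acc` type, the `pixel_reduction` function and the two-way row split are all assumptions, and the partial reductions run sequentially here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulator matching the {x, y, n} triple of the example. */
typedef struct { int x, y, n; } gravity_acc;

/* Reduction kernel (assumed signature): fold one pixel into the accumulator. */
static void gravity(uint8_t pixel, int x, int y, gravity_acc *acc)
{
    if (pixel) {
        acc->x += x;
        acc->y += y;
        acc->n += 1;
    }
}

/* Combine kernel (assumed signature): merge two partial results. */
static gravity_acc add(gravity_acc a, gravity_acc b)
{
    return (gravity_acc){a.x + b.x, a.y + b.y, a.n + b.n};
}

/* Reduction skeleton sketch: the rows are split into two parts as if they
 * lived on two processors, each part is reduced on its own, and the
 * partial results are then combined with add(). */
static gravity_acc pixel_reduction(const uint8_t *img, int width, int height)
{
    gravity_acc part[2] = {{0, 0, 0}, {0, 0, 0}};
    int split = height / 2;

    assert(img != NULL && width > 0 && height > 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            gravity(img[y * width + x], x, y, &part[y < split ? 0 : 1]);
    return add(part[0], part[1]);
}
```

Dividing the summed coordinates by `n` gives the centre of gravity, as in `x=x/n; y=y/n;` above. A caller may want to check `n != 0` first, since an image with no object pixels leaves the count at zero.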

Run-time system implementation
Function calls, e.g. Read(&a); Process(&a, &b); pass through:
• Frontend
  • Marshalling
  • Resolving futures
• Mapper
  • Find processor (user/static/dynamic)
• Dispatcher
  • Set up FIFOs, channels
  • Dispatch operations
• Gatherer
  • Await completion
  • Signal blocks

Prototype architecture

Results Double thresholding edge detection

Conclusion
The proposed programming model
• Is easy to use
  • Skeletons hide data parallel bookkeeping
  • RPC hides task parallel implementation
• Is architecture independent
  • A skeleton can be implemented for different architectures
  • RPC can map to a heterogeneous system
• Is optimized for embedded usage
  • Peer-to-peer communication: no scatter/gather bottleneck
  • Pipelined: no frame stores

Future work
• Skeletons
  • Skeleton Definition Language
  • Skeleton merging
• Mapping
  • Memory
  • Scalar dependencies
• Evaluation
  • New prototype architecture
  • Dynamic, complex application
