Mapping Irregular Algorithms in a Custom Computing Image Processing Framework. Frédéric Planque, Ivan C. Kraljic, Yvon Savaria. MiroTech Microsystems Inc., 395 Ste-Croix, suite 202, St-Laurent, Qc H4N 2L3, Canada. research@mirotech.com, www.mirotech.com
Contents • RTIP Framework • Basic paradigms • Application development • Operator library • Connected Components Labeling • Algorithm • Implementation • Image Warping • Affine inverse transformation • Application: rotation • Conclusion
Real-time image processing (RTIP) • Processing power • Billions of instructions per second • Bandwidth • External: 10-100 MBytes/s; internal: 100-1000 MBytes/s • Embedded memories • Frame, line and pixel delays • Informal specification • Experimentation and heuristics • Adaptive behavior • Changing environment and scenario
A real-time image processing framework: Foundations • Execution model: Hardwired dataflow • An operation is “fired” as soon as all its operands are available (J.B. Dennis, Data flow supercomputers, Computer (13), 1980). • Hardwired dataflow: hardware operators statically connected by physical links • On-the-fly processing of incoming data (raster-scan data flows) • Programming model: Multiple Instructions Single Data (MISD) • All the operations are executed for a single pixel in one clock cycle (pipelined) • Functional parallelism
Hardwired dataflow paradigm • Dataflow graph: operations (Median, Edge, Sub) linked by data dependencies • Operator graph: physical operators (Median, Edge, Sub) linked by physical connections (J. Sérot, G. Quénot, B. Zavidovique, Functional Programming on a dataflow architecture, Machine Vision and Applications, 7(1), 1993.)
MISD paradigm • If all operators have a throughput of one pixel/clock cycle, any cascade will have the same throughput • On-the-fly performance guaranteed • Execution time is constant (as long as there is enough hardware) • State-of-the-art million-gate reconfigurable device • Latency is affected
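The throughput and latency claims above can be illustrated with a small timing model. This is only a sketch: the `Operator` class, the `cascade_timing` helper and the latency values are hypothetical and not taken from the RTIP framework. It assumes every operator accepts one pixel per clock, so that cascading operators adds latency but does not lower throughput.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    latency: int  # pipeline depth in clock cycles (hypothetical value)


def cascade_timing(operators, n_pixels):
    """Return (latency, last_output_cycle) for a cascade fed one pixel per clock.

    Pixel i enters the cascade at cycle i and leaves at cycle i + latency.
    """
    latency = sum(op.latency for op in operators)
    last_output_cycle = latency + n_pixels - 1
    return latency, last_output_cycle


single = [Operator("median", 4)]
cascade = [Operator("median", 4), Operator("edge", 3), Operator("sub", 1)]
n_pixels = 512 * 512

print(cascade_timing(single, n_pixels))
print(cascade_timing(cascade, n_pixels))
```

In this model the three-operator cascade finishes only a few cycles after the single operator: the extra cost is the added latency, while the rate stays at one pixel per clock.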
Application development • Library of configurable image processing operators • Convolvers, filters, edge detectors... • Better suited than register-transfer level for real applications • An application is decomposed into a cascade of library operators • One operation requires one operator • Physical operators are connected according to the static schedule of operations in the dataflow graph • Physical cascading of operators can be automated • Leverages reconfigurable computing • Application on demand • State-of-the-art 1M+ gate FPGAs
Framework features • Uniform • Encapsulated operators with uniform interfaces for data and control • Modular • Data-driven control local to each operator • Stand-alone library operators • Adaptive • Video format, image size and resolution • Open • Support for user-specific proprietary operators • No support for resource sharing • One-to-one mapping between operation and operator
RTIP operator library • Linear • 3x3/5x5/7x7/9x9/11x11 asymmetric convolutions • 3x3/5x5 Kirsch, Sobel, Laplacian, Prewitt filters • 3x3 sharpening, smoothing, mean, variance • Non-linear • 3x3/5x5 median, minimum, maximum, gradient • Noise filtering • Morphological • 3x3 erosion, dilation, closing, opening • Binary • 3x3 erosion, dilation, closing, opening, pruning, skeletonization • Other • Maximum tracking, motion detection, histogram, distance map
Connected components labeling • Grouping operation (map pixels to blobs) • All connected “foreground” pixels are given the same label (= one blob) • Algorithms • Iterative • Two-pass • Applications • Blob analysis, machine vision, target tracking...
Two-pass algorithm • Input binary image • Image of temporary labels, with equivalence table {3 <=> 2, 4 <=> 3} • Output labeled image
CCL: first pass • Left-to-right, top-to-bottom label propagation with a 4-connectivity L-type mask (top neighbor P(x, y-1), left neighbor P(x-1, y), current pixel P(x, y)) • If the current pixel P(x, y) is in the foreground: • If neither the top neighbor P(x, y-1) nor the left neighbor P(x-1, y) is labeled, create a new label and assign it to the current pixel. • If the current pixel has only one labeled neighbor, give it the same label. • If the current pixel has two neighbors with the same label, give it that label. • If the current pixel has two neighbors with different labels, assign the minimum of the two labels to the current pixel and register in an equivalence table that the two labels are equivalent.
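The first-pass rules above can be sketched in software as follows. This is a minimal illustration, not the hardware operator: the function name `first_pass`, the list-of-rows image representation and the use of 0 as "no label" are assumptions made for the example.

```python
def first_pass(image):
    """Label a binary image (list of rows of 0/1) with temporary labels.

    Returns (labels, equivalences): labels uses 0 for background, and
    equivalences is a set of (smaller, larger) pairs of equivalent labels.
    """
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    equivalences = set()
    next_label = 1
    for y in range(height):          # top to bottom
        for x in range(width):       # left to right
            if not image[y][x]:
                continue             # background pixel: no label
            top = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if top == 0 and left == 0:
                labels[y][x] = next_label      # no labeled neighbor: new label
                next_label += 1
            elif top == 0 or left == 0:
                labels[y][x] = top or left     # exactly one labeled neighbor
            else:
                labels[y][x] = min(top, left)  # two labeled neighbors
                if top != left:
                    equivalences.add((min(top, left), max(top, left)))
    return labels, equivalences


image = [
    [1, 0, 1],
    [1, 1, 1],
]
labels, equivalences = first_pass(image)
print(labels)
print(equivalences)
```

In this small image the two top-row pixels receive different temporary labels, and the bottom row joins them, so one equivalence pair is recorded.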
CCL: Equivalence resolution & 2nd pass • Determine equivalence classes from all pairs of equivalent labels • Assign a unique label to each equivalence class • Rescan image of temporary labels, and assign the final unique label to each temporary label
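The resolution and second-pass steps can be sketched as below, using the equivalence table {3 <=> 2, 4 <=> 3} from the two-pass example. The function names, the dictionary-based tables and the explicit-stack depth-first search are assumptions made for illustration; they stand in for whatever the hardware implementation does.

```python
def resolve_equivalences(num_labels, equivalences):
    """Map each temporary label 1..num_labels to a final label.

    Equivalence classes are found by a depth-first search over the
    pairs of equivalent labels; each class gets one final label.
    """
    neighbors = {label: [] for label in range(1, num_labels + 1)}
    for a, b in equivalences:
        neighbors[a].append(b)
        neighbors[b].append(a)
    final = {}
    next_final = 1
    for start in range(1, num_labels + 1):
        if start in final:
            continue                  # already part of a resolved class
        final[start] = next_final
        stack = [start]
        while stack:                  # depth-first traversal of one class
            label = stack.pop()
            for other in neighbors[label]:
                if other not in final:
                    final[other] = next_final
                    stack.append(other)
        next_final += 1
    return final


def second_pass(labels, final):
    """Rescan the image of temporary labels and apply the final labels."""
    return [[final[v] if v else 0 for v in row] for row in labels]


final = resolve_equivalences(4, {(3, 2), (4, 3)})
print(final)
print(second_pass([[1, 0, 2], [0, 4, 3]], final))
```

With this table, temporary labels 2, 3 and 4 collapse into one final label and label 1 stays separate, giving two blobs.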
Equivalence resolution: Implementation • Content-addressable memory (CAM) • All labels equivalent to one label can be found in one cycle: an MxN CAM returns all addresses that contain “Label” in one cycle • + Fast equivalence resolution, O(n) • - High memory consumption (Xilinx Virtex: one 4k block RAM implements a 16x8 CAM; a Virtex 1000 can hold one 512x8 CAM) • Depth-first search with RAM • + Low memory consumption • - Slow equivalence resolution, O(n²)
CCL architecture • First pass • Generates image with temporary labels • Stores equivalent labels in the equivalence table • Frame delay • Delays the image of temporary labels • Equivalence resolution • Depth-first search of equivalent labels • Second pass • Remaps temporary labels into final unique labels • Figure: block diagram of First pass, Frame delay, Equivalence resolution and Second pass, with RTIP framework compatible I/O and custom I/O
Parallelism for on-the-fly processing • A switch splits the image stream into even and odd images • Odd labeler: images 1, 3, 5... • Even labeler: images 2, 4, 6... • A mux merges the two outputs into the labeled images stream • Figure: timing diagram over images 1 to 4; each labeler runs 1st pass, equivalence resolution and 2nd pass on its own images (odd labeler: images 1 and 3; even labeler: images 2 and 4)
Limitations • Worst case for equivalence resolution: Nequ x (Nequ - 1), where Nequ is the max. number of equivalences • Worst case for 1st/2nd pass: X x Y (image size) • On-the-fly processing fails if Nequ x (Nequ - 1) > X x Y • Figure: two timelines for image 1 (1st pass, then equivalence resolution while images 2 and 3 arrive); on-the-fly processing is correct when equivalence resolution, Nequ x (Nequ - 1), fits within X x Y, and fails when it exceeds X x Y
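The on-the-fly condition above reduces to a one-line comparison. The sketch below only restates the slide's inequality; the function names are hypothetical, and both quantities are compared as plain counts as on the slide.

```python
def resolution_worst_case(n_equ):
    """Worst case for equivalence resolution: Nequ x (Nequ - 1)."""
    return n_equ * (n_equ - 1)


def on_the_fly_ok(n_equ, width, height):
    """On-the-fly processing holds while the worst case does not exceed X x Y."""
    return resolution_worst_case(n_equ) <= width * height


print(resolution_worst_case(512))
print(on_the_fly_ok(512, 512, 511))
print(on_the_fly_ok(512, 256, 256))
```

With 512 equivalences the worst case equals 512 x 511, so an image of exactly that size is the smallest one for which the condition holds in this model.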
CCL: status and future work • Status: • 512 x 1024 image size maximum • 254 temporary and final labels maximum (254 blobs) • Label 255 used for removing blobs touching image borders (optional) • 512 equivalences between temporary labels maximum after first pass • On-the-fly processing for images 512 x 511 and up • Add-on compatible cores: blob area, blob centroid, blob bounding box… • Future work: • Larger images • More labels (temporary, final and equivalences) • Optimize equivalence resolution for faster processing
Image warping • Geometric transformation • Input image pixel coordinates [u, v]; warped image pixel coord. [x, y]: [x, y] = [f(u, v), g(u, v)] (forward mapping) or [u, v] = [h(x, y), k(x, y)] (inverse mapping) • Affine transformation:
Inverse mapping: Architecture • Nearest-neighbor interpolation • Transformation matrix stored in dynamically reconfigurable LUT • Duplicate operator for on-the-fly processing • Same architecture as connected component labeler • Higher-order interpolation (bilinear, bicubic) • Inverse mapping may have reduced performance • Better suited to a forward mapping architecture (in development) • Figure: block diagram of [u,v] coordinates, Affine transformation, Frame buffer and Warped pixel, with RTIP framework compatible I/O and custom I/O
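Inverse mapping with nearest-neighbor interpolation can be sketched in software as follows. This is an illustration under stated assumptions, not the hardware architecture: the function name, the six-coefficient form u = a*x + b*y + c, v = d*x + e*y + f, and the fill value for pixels that map outside the input are all choices made for the example. The input image plays the role of the frame buffer.

```python
import math


def warp_inverse_nearest(src, coeffs, fill=0):
    """Warp src (list of rows) by inverse affine mapping, nearest neighbor.

    coeffs = (a, b, c, d, e, f) with u = a*x + b*y + c and v = d*x + e*y + f,
    where [x, y] is the output pixel and [u, v] the input pixel it reads.
    """
    height, width = len(src), len(src[0])
    a, b, c, d, e, f = coeffs
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Round the computed input coordinates to the nearest pixel.
            u = math.floor(a * x + b * y + c + 0.5)
            v = math.floor(d * x + e * y + f + 0.5)
            if 0 <= u < width and 0 <= v < height:
                out[y][x] = src[v][u]
    return out


src = [[1, 2, 3], [4, 5, 6]]
shift_right = (1, 0, -1, 0, 1, 0)  # output (x, y) reads input (x - 1, y)
print(warp_inverse_nearest(src, shift_right))
```

Because every output pixel is computed exactly once, the output has no holes, which is the usual reason for preferring inverse mapping with simple interpolation.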
Application: Rotation • Affine transformation Inverse Mapping
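For rotation, the inverse-mapping coefficients can be derived from the angle and the rotation center. The sketch below uses one common convention and is an assumption, since the slide's own matrices are not reproduced here: the forward rotation is x - cx = cos(t)(u - cx) - sin(t)(v - cy), y - cy = sin(t)(u - cx) + cos(t)(v - cy), and the function returns the coefficients of its inverse in the form u = a*x + b*y + c, v = d*x + e*y + f. The name `rotation_inverse_coeffs` is hypothetical.

```python
import math


def rotation_inverse_coeffs(theta, cx=0.0, cy=0.0):
    """Inverse-mapping coefficients (a, b, c, d, e, f) for a rotation.

    theta is the forward rotation angle in radians about (cx, cy).
    The result maps an output pixel [x, y] back to the input pixel [u, v].
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    a, b = cos_t, sin_t
    d, e = -sin_t, cos_t
    c = cx - cos_t * cx - sin_t * cy
    f = cy + sin_t * cx - cos_t * cy
    return a, b, c, d, e, f


def apply_inverse(coeffs, x, y):
    """Return the input coordinates [u, v] read by output pixel [x, y]."""
    a, b, c, d, e, f = coeffs
    return a * x + b * y + c, d * x + e * y + f


coeffs = rotation_inverse_coeffs(math.pi / 2)
print(apply_inverse(coeffs, 0.0, 1.0))
```

Under this convention a 90 degree rotation about the origin sends input [1, 0] to output [0, 1], so the inverse mapping of [0, 1] returns approximately [1, 0]. The sign of the sine terms flips if the image y axis is taken to point the other way.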
Application: Blob analysis • Transparent cascading thanks to RTIP framework • 50 Mpixels/s throughput • Figure: operator cascade with RTIP framework compatible I/O: Rotation (camera angle correction, Angle input), Binarization (Threshold input), Conn. comp. labeling, and blob features (Area, Centroid, Bound. box)
Conclusion • Hardwired dataflow now feasible on single-chip reconfigurable devices • Operator library • Adaptive framework (frame rate, image size, pixel resolution) • 70+ operators and growing • Irregular algorithms (labeling, warping) feasible • Fast application development time thanks to modularity • Vision system on a chip • 20 to 90 basic RTIP operators on today’s state-of-the-art FPGA • Hundreds of instructions per pixel (3x3 convolution: 9 instructions/pixel) • 50 to 100 Mpixels per second throughput • Billions of instructions per second
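The instruction-rate figures above can be checked with simple arithmetic. The sketch assumes, for illustration only, that every operator counts as 9 instructions per pixel (the slide's figure for a 3x3 convolution); real operators differ, and the function name is hypothetical.

```python
def instruction_rate(n_operators, instructions_per_operator, pixels_per_second):
    """Return (instructions per pixel, instructions per second) for a cascade."""
    per_pixel = n_operators * instructions_per_operator
    return per_pixel, per_pixel * pixels_per_second


low = instruction_rate(20, 9, 50_000_000)    # 20 operators at 50 Mpixels/s
high = instruction_rate(90, 9, 100_000_000)  # 90 operators at 100 Mpixels/s
print(low)
print(high)
```

Under this assumption both ends of the range give hundreds of instructions per pixel and several billion instructions per second, in line with the orders of magnitude quoted on the slide.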