August 28, 2013

How to Accelerate OpenCV Applications with the Zynq-7000 All Programmable SoC using Vivado HLS Video Libraries

OpenCV Overview
• Open Source Computer Vision (OpenCV) is widely used to develop computer vision applications
• Library of 2500+ optimized video functions
• Optimized for desktop processors and GPUs
• Tens of thousands of users
• Runs out of the box on ARM processors in Zynq
• However, HD processing with OpenCV is often limited by external memory
• Memory bandwidth is a bottleneck for performance
• Memory accesses limit power efficiency
• Zynq All Programmable SoCs are a great way of implementing embedded computer vision applications
• High performance and low power

Real-Time Computer Vision Applications
Computer Vision Application | Real-time Analytics Function
Advanced Driver Assist for Safety | Lane or pedestrian detection
Surveillance for Security | Friend vs. foe recognition
Machine Vision for Quality | High-velocity object detection
Medical Imaging for Non-invasive Surgery | Tumor detection

Real-time Video Analytics Processing
• Pixel-based image processing and feature extraction (480p, 720p, 1080p, 4Kx2K): 100s of ops/pixel; 8 MP x 100 ops/frame = 100s of Gops
• Frame-based feature processing and decision making (features F1, F2, F3, ...): 10000s of ops/feature; 1000s of features/sec = Mops

Heterogeneous Implementation of Real-time Video Analytics
• Hardware Domain (FPGA): pixel-based image processing and feature extraction (480p, 720p, 1080p, 4Kx2K): 100s of ops/pixel; 8 MP x 100 ops/frame = 100s of Gops
• Software Domain (ARM): frame-based feature processing and decision making (features F1, F2, F3, ...): 10000s of ops/feature; 1000s of features/sec = Mops

Xilinx Real-time Image Analytics Implementation: Zynq All Programmable SoC
• Pixel-based image processing and feature extraction (480p, 720p, 1080p, 4Kx2K): 100s of ops/pixel; 8 MP x 100 ops/frame = 100s of Gops
• Frame-based feature processing and decision making (features F1, F2, F3, ...): 10000s of ops/feature; 1000s of features/sec = Mops
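The workload figures on these slides can be checked with a few multiplications. The sketch below is only a back-of-the-envelope aid: the function names are hypothetical, "4Kx2K" is assumed to mean 3840x2160, and the 60 fps frame rate is an assumption borrowed from the real-time target quoted elsewhere in the deck.

```cpp
#include <cstdint>

// Pixel-level workload in operations per second:
// width * height pixels per frame, ops_per_pixel operations each, fps frames per second.
constexpr std::uint64_t pixel_ops_per_second(std::uint64_t width, std::uint64_t height,
                                             std::uint64_t ops_per_pixel, std::uint64_t fps) {
    return width * height * ops_per_pixel * fps;
}

// Feature-level workload in operations per second.
constexpr std::uint64_t feature_ops_per_second(std::uint64_t ops_per_feature,
                                               std::uint64_t features_per_second) {
    return ops_per_feature * features_per_second;
}
```

Under these assumptions, pixel_ops_per_second(3840, 2160, 100, 60) is about 50 Gops/s, so a few hundred ops per pixel lands in the "100s of Gops" range, while feature_ops_per_second(10000, 1000) is 10 Mops/s. That gap of several orders of magnitude is what motivates putting the pixel stage in the FPGA fabric and the feature stage on the ARM cores.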

Vivado: Productivity gains for OpenCV functions
• C simulation of HD video algorithm: ~1 fps
• RTL simulation of HD video: 1 frame per hour
• Real-time FPGA implementation: up to 60 fps

Accelerating OpenCV Applications
[Diagram labels: Vivado HLS; pixel processing interfaces and basic functions for analytics; frame-level processing library for PS. Applications: Driver Assist, Broadcast Monitor, HD Surveillance, Cinema Projection, Video Conferencing, Digital Signage, Studio Cinema Camera, Consumer Displays, Office-class MFP, Machine Vision, Medical Displays]

Zynq Video TRD architecture
• Video access to external memory using 64-bit High Performance ports
• Control register access using 32-bit General Purpose ports
• Video streams implemented using AXI4-Stream
[Block diagram labels: DDR3 External Memory, DDR3, S_AXI_HP 64 bit, S_AXI_GP 32 bit, SD Card, Processing System, DDR Memory Controller, Hardened Peripherals, Dual Core Cortex-A9, AXI Interconnect, AXI4 Stream IP Core, AXI VDMA, Xylon Display Controller, Video Input, HDMI, HDMI, HLS-generated pipeline]

IP-Centric Design Flow: Accelerated IP Generation and Integration
[Diagram labels: C-based IP Creation (C, C++ or SystemC; C libraries: floating-point math.h, fixed point, video); VHDL or Verilog plus SW drivers; System Generator for DSP; User Preferred System Integration Environment (Vivado IP Integrator, Vivado RTL Integration); IP Subsystem (Xilinx IP, 3rd Party IP, User IP)]

Using OpenCV in FPGA designs
[Diagram labels: Pure OpenCV Application, Integrated OpenCV Application, Accelerated OpenCV Application. Legend: OpenCV Reference, Synthesizable Block, Synthesized Block]

Pure OpenCV Application
[Block diagram with data-flow steps numbered 1 to 5. Labels: DDR3 External Memory, DDR3, Processing System, DDR Memory Controller, SD Card, Hardened Peripherals, Dual Core Cortex-A9, AXI Interconnect, AXI VDMA, Xylon Display Controller, Video Input, HDMI, HDMI, HLS-generated pipeline]

Integrated OpenCV Application
[Same block diagram, data-flow steps numbered 1 to 5]

OpenCV Reference / Software Execution
[Same block diagram, data-flow steps numbered 1 to 5]

OpenCV Reference / In-system Test
[Same block diagram, data-flow steps numbered 1 and 2]

Accelerated OpenCV Application
[Same block diagram, data-flow steps numbered 1 and 2]

OpenCV design flow
• Develop OpenCV application on desktop
• Run OpenCV application on ARM cores without modification
• Abstract FPGA portion using I/O functions
• Replace OpenCV function calls with synthesizable code
• Run HLS to generate FPGA accelerator
• Replace call to synthesizable code with call to FPGA accelerator

Partitioned OpenCV Application • Synchronization Synthesizable

OpenCV Design Tradeoffs
• OpenCV-based image processing is built around memory frame buffers
• Poor access locality -> small caches perform poorly
• Complex architectures for performance -> higher power
• Likely ‘good enough’ for many applications
• Low resolution or frame rate
• Processing of features or regions of interest in a larger image
• Streaming architectures give high performance and low power
• Chaining image processing functions reduces external memory accesses
• Video-optimized line buffers and window buffers are simpler than processor caches
• Can be implemented with streaming optimizations in HLS
• Requires conversion of code to be synthesizable
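To make the line buffer and window buffer idea concrete, here is a minimal sketch in plain C++ (not HLS library code). It computes a 3x3 minimum filter, which is what an erosion with a 3x3 rectangular element computes, while reading every input pixel exactly once in raster order. The function name and the choice to leave border pixels at zero are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Streaming 3x3 minimum filter over a raster-ordered image.
// Two line buffers hold the previous two rows; a 3x3 window slides along the row.
// Border pixels of the output are left at 0 for simplicity.
std::vector<std::uint8_t> erode3x3_stream(const std::vector<std::uint8_t>& in,
                                          int rows, int cols) {
    std::vector<std::uint8_t> out(in.size(), 0);
    std::vector<std::uint8_t> line0(static_cast<std::size_t>(cols), 0);  // row r-2
    std::vector<std::uint8_t> line1(static_cast<std::size_t>(cols), 0);  // row r-1
    std::uint8_t win[3][3] = {};

    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            const std::size_t col = static_cast<std::size_t>(c);
            // The only read of this input pixel.
            const std::uint8_t px = in[static_cast<std::size_t>(r) * cols + col];

            // Shift the window one column to the left.
            for (int i = 0; i < 3; ++i) {
                win[i][0] = win[i][1];
                win[i][1] = win[i][2];
            }
            // Fill the new right-hand column from the line buffers and the new pixel.
            win[0][2] = line0[col];
            win[1][2] = line1[col];
            win[2][2] = px;

            // Age the line buffers for this column.
            line0[col] = line1[col];
            line1[col] = px;

            // Once two full rows and two columns have been seen, the window is
            // valid and centered on pixel (r-1, c-1).
            if (r >= 2 && c >= 2) {
                std::uint8_t m = 255;
                for (int i = 0; i < 3; ++i)
                    for (int j = 0; j < 3; ++j)
                        m = std::min(m, win[i][j]);
                out[static_cast<std::size_t>(r - 1) * cols + (col - 1)] = m;
            }
        }
    }
    return out;
}
```

The on-chip storage here is two rows plus nine pixels, independent of the image height, which is the property that lets a chain of such filters run without a frame buffer in external memory.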

HLS Video Libraries
• OpenCV functions are not directly synthesizable with HLS
• Dynamic memory allocation
• Floating point
• Assumes images are modified in external memory
• The HLS video library is intended to replace many basic OpenCV functions
• Similar interfaces and algorithms to OpenCV
• Focus on image processing functions implemented in FPGA fabric
• Includes FPGA-specific optimizations
• Fixed point operations instead of floating point
• On-chip line buffers and window buffers
• Not necessarily bit-accurate
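As an illustration of trading floating point for fixed point, the sketch below scales an 8-bit pixel using an unsigned Q8.8 scale factor. It is not the library's implementation of hls::Scale: the function name, the Q8.8 format, and the rounding and saturation choices are all assumptions, and in HLS code one would more likely use the tool's arbitrary-precision fixed-point types.

```cpp
#include <cstdint>

// Multiply an 8-bit pixel by a scale factor stored in unsigned Q8.8 fixed point
// (real value = scale_q8_8 / 256), add an integer shift, and saturate to [0, 255].
// The product is rounded by adding half an LSB (128) before dropping 8 fraction bits.
inline std::uint8_t scale_pixel_q8_8(std::uint8_t px, std::uint16_t scale_q8_8, int shift) {
    std::int32_t v = (static_cast<std::int32_t>(px) * scale_q8_8 + 128) >> 8;
    v += shift;
    if (v < 0) v = 0;
    if (v > 255) v = 255;
    return static_cast<std::uint8_t>(v);
}
```

A scale of 2.0 is written as 512 and a scale of 1.5 as 384. Because the rounding and saturation rules are a design choice, results can differ by an LSB from a floating-point reference, which is one reason a fixed-point replacement may not be bit-accurate.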

Xilinx HLS Video Library 2013.2
For function signatures and descriptions, see the HLS user guide, UG902.

Video Library Functions
• C++ code contained in the hls namespace: #include "hls_video.h"
• Similar interface, equivalent behavior with OpenCV, e.g.
    OpenCV library:    cvScale(src, dst, scale, shift);
    HLS video library: hls::Scale<...>(src, dst, scale, shift);
• Some constructor arguments have corresponding or replacement template parameters, e.g.
    OpenCV library:    cv::Mat mat(rows, cols, CV_8UC3);
    HLS video library: hls::Mat<ROWS, COLS, HLS_8UC3> mat(rows, cols);
• ROWS and COLS specify the maximum size of an image processed
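The split between template parameters and constructor arguments can be sketched in plain C++. SketchMat below is a hypothetical stand-in, not the hls::Mat definition: the template parameters fix the largest image the hardware must be sized for at compile time, while the constructor arguments give the actual size of the image being processed at run time.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an image type with a compile-time maximum size
// and a run-time actual size.
template <int MAX_ROWS, int MAX_COLS>
struct SketchMat {
    int rows;
    int cols;

    SketchMat(int r, int c) : rows(r), cols(c) {
        // The run-time size must fit inside the compile-time maximum.
        assert(r > 0 && r <= MAX_ROWS);
        assert(c > 0 && c <= MAX_COLS);
    }

    // Storage such as a line buffer is sized from the maximum, not the actual width.
    static constexpr std::size_t line_buffer_pixels() {
        return static_cast<std::size_t>(MAX_COLS);
    }
};
```

For example, a SketchMat<1080, 1920> constructed with (720, 1280) processes a 720p frame while still reserving a 1920-pixel line buffer.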

Video Library Core Structures

Limitations
• Must replace OpenCV calls with video library functions
• Frame buffer access not supported through pointers: use VDMA and AXI Stream adapter functions
• Random access not supported: data read more than once must be duplicated, see hls::Duplicate()
• In-place update not supported, e.g. cvRectangle(img, point1, point2)
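The "read more than once must be duplicated" rule follows from stream semantics: reading a value consumes it. The sketch below models a stream with std::queue and shows what a duplicate step does. It is a software model under that assumption, with a hypothetical function name, and is not the hls::Duplicate() implementation.

```cpp
#include <queue>

// Copy every element of a consume-once stream into two output streams,
// so that two downstream consumers can each read the data once.
template <typename T>
void duplicate_stream(std::queue<T>& src, std::queue<T>& dst0, std::queue<T>& dst1) {
    while (!src.empty()) {
        T v = src.front();
        src.pop();      // the source element is gone after this read
        dst0.push(v);
        dst1.push(v);
    }
}
```

After the call the source stream is empty and each output holds the full pixel sequence in the original order.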

OpenCV Code
• One image input, one image output
• Processed by chain of functions sequentially
test_opencv.cpp:
    …
    IplImage* src = cvLoadImage("test_1080p.bmp");
    IplImage* dst = cvCreateImage(cvGetSize(src), src->depth, src->nChannels);
    cvSobel(src, dst, 1, 0);
    cvSubS(dst, cvScalar(100,100,100), src);
    cvScale(src, dst, 2, 0);
    cvErode(dst, src);
    cvDilate(src, dst);
    cvSaveImage("result_1080p.bmp", dst);
    cvReleaseImage(&src);
    cvReleaseImage(&dst);
    …

Integrated OpenCV Application
System provides pointer to frame buffers
Synthesizable code can also be run on ARM
img_filters.c:
    void img_process(ZNQ_S32 *rgb_data_in, ZNQ_S32 *rgb_data_out, int height, int width, int stride, int flag_OpenCV) {
        // constructing OpenCV interface
        IplImage* src_dma = cvCreateImageHeader(cvSize(width, height), IPL_DEPTH_8U, 4);
        IplImage* dst_dma = cvCreateImageHeader(cvSize(width, height), IPL_DEPTH_8U, 4);
        src_dma->imageData = (char*)rgb_data_in;
        dst_dma->imageData = (char*)rgb_data_out;
        src_dma->widthStep = 4 * stride;
        dst_dma->widthStep = 4 * stride;
        if (flag_OpenCV) {
            opencv_image_filter(src_dma, dst_dma);
        } else {
            sw_image_filter(src_dma, dst_dma);
        }
        cvReleaseImageHeader(&src_dma);
        cvReleaseImageHeader(&dst_dma);
    }

Accelerated with Vivado HLS video library
Top level function extracted for HW acceleration
top.h:
    #include "hls_video.h"  // header file of HLS video library
    #include "hls_opencv.h" // header file of OpenCV I/O
    // typedef video library core structures
    typedef hls::stream<ap_axiu<32,1,1,1> > AXI_STREAM;
    typedef hls::Scalar<3, uchar> RGB_PIXEL;
    typedef hls::Mat<1080,1920,HLS_8UC3> RGB_IMAGE;
    void image_filter(AXI_STREAM& src_axi, AXI_STREAM& dst_axi, int rows, int cols);
test.cpp:
    #include "top.h"
    …
    IplImage* src = cvLoadImage("test_1080p.bmp");
    IplImage* dst = cvCreateImage(cvGetSize(src), src->depth, src->nChannels);
    AXI_STREAM src_axi, dst_axi;
    IplImage2AXIvideo(src, src_axi);
    image_filter(src_axi, dst_axi, src->height, src->width);
    AXIvideo2IplImage(dst_axi, dst);
    cvSaveImage("result_1080p.bmp", dst);
    cvReleaseImage(&src);
    cvReleaseImage(&dst);

Accelerated with Vivado HLS video library
• HW synthesizable block for FPGA acceleration
• Consists of video library functions and interfaces
• Replace OpenCV function with similar function in hls namespace
top.cpp:
    void image_filter(AXI_STREAM& input, AXI_STREAM& output, int rows, int cols) {
        // Create AXI streaming interfaces for the core
    #pragma HLS RESOURCE variable=input core=AXIS metadata="-bus_bundle INPUT_STREAM"
    #pragma HLS RESOURCE variable=output core=AXIS metadata="-bus_bundle OUTPUT_STREAM"
    #pragma HLS RESOURCE variable=rows core=AXI_SLAVE metadata="-bus_bundle CONTROL_BUS"
    #pragma HLS RESOURCE variable=cols core=AXI_SLAVE metadata="-bus_bundle CONTROL_BUS"
    #pragma HLS RESOURCE variable=return core=AXI_SLAVE metadata="-bus_bundle CONTROL_BUS"
    #pragma HLS INTERFACE ap_stable port=rows
    #pragma HLS INTERFACE ap_stable port=cols
        RGB_IMAGE img_0(rows, cols), img_1(rows, cols), img_2(rows, cols);
        RGB_IMAGE img_3(rows, cols), img_4(rows, cols), img_5(rows, cols);
        RGB_PIXEL pix(50, 50, 50);
    #pragma HLS dataflow
        hls::AXIvideo2Mat(input, img_0);
        hls::Sobel<1,0,3>(img_0, img_1);
        hls::SubS(img_1, pix, img_2);
        hls::Scale(img_2, img_3, 2, 0);
        hls::Erode(img_3, img_4);
        hls::Dilate(img_4, img_5);
        hls::Mat2AXIvideo(img_5, output);
    }

Using Linux Userspace API
Modify device tree to include register map
Call from userspace after mmap()
Device tree entry:
    FILTER@0x400D0000 {
        compatible = "xlnx,generic-hls";
        reg = <0x400d0000 0xffff>;
        interrupts = <0x0 0x37 0x4>;
        interrupt-parent = <0x1>;
    };
Userspace code:
    XImage_filter xsfilter;
    int fd_uio = 0;
    if ((fd_uio = open("/dev/uio0", O_RDWR)) < 0) {
        printf("UIO: Cannot open device node\n");
    }
    xsfilter.Control_bus_BaseAddress = (u32)mmap(NULL, XSOBEL_FILTER_CONTROL_BUS_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd_uio, 0);
    xsfilter.IsReady = XIL_COMPONENT_IS_READY;
    // init the configuration for image filter
    XImage_filter_SetRows(&xsfilter, sobel_configuration.height);
    XImage_filter_SetCols(&xsfilter, sobel_configuration.width);
    XImage_filter_EnableAutoRestart(&xsfilter);
    XImage_filter_Start(&xsfilter);

HLS Directives for Video Processing
• Assign ‘input’ to be an AXI4 stream named “INPUT_STREAM”
    #pragma HLS RESOURCE variable=input core=AXIS metadata="-bus_bundle INPUT_STREAM"
• Assign control interface to an AXI4-Lite interface
    #pragma HLS RESOURCE variable=return core=AXI_SLAVE metadata="-bus_bundle CONTROL_BUS"
• Assign ‘rows’ to be accessible through the AXI4-Lite interface
    #pragma HLS RESOURCE variable=rows core=AXI_SLAVE metadata="-bus_bundle CONTROL_BUS"
• Declare that ‘rows’ will not be changed during the execution of the function
    #pragma HLS INTERFACE ap_stable port=rows
• Enable streaming dataflow optimizations
    #pragma HLS dataflow

A more complex OpenCV example: fast-corners
• This code is not ‘streaming’ and must be rewritten
• Random access and in-place operation on ‘dst’
opencv_top.cpp:
    void opencv_image_filter(IplImage* img, IplImage* dst) {
        IplImage* gray = cvCreateImage(cvSize(img->width, img->height), 8, 1);
        cvCvtColor(img, gray, CV_BGR2GRAY);
        std::vector<cv::KeyPoint> keypoints;
        cv::Mat gray_mat(gray, 0);
        cv::FAST(gray_mat, keypoints, 20, true);
        int rect = 2;
        cvCopy(img, dst);
        for (int i = 0; i < keypoints.size(); i++) {
            cvRectangle(dst,
                        cvPoint(keypoints[i].pt.x, keypoints[i].pt.y),
                        cvPoint(keypoints[i].pt.x + rect, keypoints[i].pt.y + rect),
                        cvScalar(255,0,0), 1);
        }
        cvReleaseImage(&gray);
    }

A more complex OpenCV example: fast-corners
• This code is ‘streaming’
• Note that function correspondence is not 1:1!
opencv_top.cpp:
    void opencv_image_filter(IplImage* src, IplImage* dst) {
        IplImage* gray = cvCreateImage(cvGetSize(src), 8, 1);
        IplImage* mask = cvCreateImage(cvGetSize(src), 8, 1);
        IplImage* dmask = cvCreateImage(cvGetSize(src), 8, 1);
        std::vector<cv::KeyPoint> keypoints;
        cv::Mat gray_mat(gray, 0);
        cvCvtColor(src, gray, CV_BGR2GRAY);
        cv::FAST(gray_mat, keypoints, 20, true);
        GenMask(mask, keypoints);
        cvDilate(mask, dmask);
        cvCopy(src, dst);
        PrintMask(dst, dmask, cvScalar(255,0,0));
        cvReleaseImage(&mask);
        cvReleaseImage(&dmask);
        cvReleaseImage(&gray);
    }
Callouts on the slide: hls::FASTX, hls::PaintMask

A more complex OpenCV example: fast-corners
• Synthesizable code
• Note ‘#pragma HLS stream’
top.cpp:
    hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC3> _src(rows,cols);
    hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC3> _dst(rows,cols);
    hls::AXIvideo2Mat(input, _src);
    hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC3> src0(rows,cols);
    hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC3> src1(rows,cols);
    #pragma HLS stream depth=20000 variable=src1.data_stream
    hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC1> mask(rows,cols);
    hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC1> dmask(rows,cols);
    hls::Scalar<3,unsigned char> color(255,0,0);
    hls::Duplicate(_src,src0,src1);
    hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC1> gray(rows,cols);
    hls::CvtColor<HLS_BGR2GRAY>(src0,gray);
    hls::FASTX(gray,mask,20,true);
    hls::Dilate(mask,dmask);
    hls::PaintMask(src1,dmask,_dst,color);
    hls::Mat2AXIvideo(_dst, output);

Streams and Reconvergent paths
• hls::Mat conceptually represents a whole image, but is implemented as a stream of pixels
• Fast-corners contains a reconvergent path
• The stream of pixels for src1 must include enough buffering to match the delay through FASTX and Dilate (approximately 10 video lines * 1920 pixels)
hls_video_core.h:
    template<int ROWS, int COLS, int T>
    class Mat {
    public:
        HLS_SIZE_T rows, cols;
        hls::stream<HLS_TNAME(T)> data_stream[HLS_MAT_CN(T)];
    };
Diagram labels: CvtColor, FASTX, Dilate, PaintMask, src1
    #pragma HLS stream depth=20000 variable=src1.data_stream
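The depth=20000 figure can be sanity-checked against the "10 video lines * 1920 pixels" estimate. The helper names below are hypothetical, and the 10-line delay is the slide's approximation rather than a measured value.

```cpp
// Depth, in pixels, that the bypass branch (src1) must buffer while the other
// branch is delayed by delay_lines full video lines.
constexpr int bypass_fifo_depth(int delay_lines, int pixels_per_line) {
    return delay_lines * pixels_per_line;
}

// True when a configured stream depth covers the estimated delay.
constexpr bool depth_is_sufficient(int configured_depth, int delay_lines,
                                   int pixels_per_line) {
    return configured_depth >= bypass_fifo_depth(delay_lines, pixels_per_line);
}
```

With 10 lines of 1920 pixels the estimate is 19200 pixels, so a depth of 20000 leaves a small margin. If the bypass FIFO were shallower than the delay, the duplicate stage would stall on a full src1 before the FASTX branch produced output, and the pipeline could deadlock.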

Performance Analysis
• AXI Performance Monitor collects statistics on memory bandwidth
• see /mnt/AXI_PerfMon.log
• Video + fast corners
• 1920*1080*60*32 = ~4 Gb/s per stream
• HP0: Read 4.01 Gb/s, Write 4.01 Gb/s, Total 8.03 Gb/s
• HP2: Read 4.01 Gb/s, Write 4.01 Gb/s, Total 8.03 Gb/s
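The per-stream figure is the product of frame width, frame height, frame rate, and bits per pixel. A small sketch of that arithmetic, with a hypothetical function name:

```cpp
#include <cstdint>

// Raw bandwidth of one uncompressed video stream, in bits per second.
constexpr std::uint64_t stream_bits_per_second(std::uint64_t width, std::uint64_t height,
                                               std::uint64_t fps,
                                               std::uint64_t bits_per_pixel) {
    return width * height * fps * bits_per_pixel;
}
```

For 1080p at 60 frames per second with 32 bits per pixel this gives 3,981,312,000 bits per second, about 3.98 Gb/s, which is close to the 4.01 Gb/s read and write rates reported on each HP port.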

Power Analysis
• Voltage and current can be read from the digital power regulators on the ZC702 board
• Custom, real-time HD video processing in 2-3 Watts total system power
• FASTX is less than 200 mW incremental power

HLS and Zynq accelerate OpenCV apps
• OpenCV functions enable fast prototyping of computer vision algorithms
• Computer vision applications are inherently heterogeneous and require a mix of HW and SW implementation
• Vivado HLS video library accelerates mapping of OpenCV functions to FPGA programmable fabric
• Zynq offers a power-optimized integrated solution with high-performance programmable logic and embedded ARM

Additional OpenCV Collateral at Xilinx.com
• Download XAPP1167 from Xilinx.com
• http://www.xilinx.com/hls
• QuickTake: Leveraging OpenCV and High-Level Synthesis with Vivado
• http://www.xilinx.com/getlicense
