
High Level Synthesis of Stereo Matching: Productivity, Performance, and Software Constraints



Presentation Transcript


  1. High Level Synthesis of Stereo Matching: Productivity, Performance, and Software Constraints. Kyle Rupnow, Yun Liang, Yinan Li, Dongbo Min, Minh Do, Deming Chen. Advanced Digital Sciences Center, University of Illinois at Urbana-Champaign

  2. Big Picture • There are many HLS tools … • But there is little unbiased study of their capabilities, limitations, and performance, or of how to achieve that performance • … and little study of applying HLS tools to real-world software • For this study, we concentrate on one state-of-the-art tool, AutoPilot

  3. High Level Synthesis • Input languages: C/C++, SystemC, Haskell, CUDA, OpenCL, .NET languages, specialized languages

  4. High Level Synthesis Promises • Better productivity • Languages easier to learn • Algorithms easier to describe • Code concise, less susceptible to error • Faster design cycle, with similar design quality • Automate complex, tedious transformations

  5. High Level Synthesis Process • Code transformations • Dead code elimination, strength reduction … • Loop optimization, pipelining, memory optimization … • Scheduling • Resource binding • Architectural synthesis • Register transfer level code generation

  6. High Level Synthesis History • 1st Generation • Poor input languages, poor quality results • 2nd Generation • Wrong target user base • Poor input, quality of results • Dataflow vs. control, validation • Insights into the failures of 1st & 2nd Gen led to critical advances for the following generation • 3rd Generation • Starting to achieve success … • But new insights necessary to move to 4th Gen * G. Martin and G. Smith, “High-Level Synthesis: Past, Present, and Future,” IEEE Design & Test of Computers, vol. 26, no. 4, pp. 18-25, Aug. 2009.

  7. HLS Tools CatapultC Cynthesizer Feldspar DIME-C Esterel ImpulseC FCUDA PRET-C OpenRCL/MARC SpecC ROCC RapidMind/Ct Liquid Metal eXcite C2H AutoPilot Trident Synphony Accelerator NISC Signal SPARK LegUp Lava

  8. Study Motivation • Despite a wealth of success stories … little independent study • Berkeley Design Technology Inc. offers an HLS certification program • 2 small, unavailable signal processing applications • Vendors optimize the designs themselves, then send them to BDTI for verification • This study concentrates on AutoPilot, one of the recognized state-of-the-art HLS tools, and evaluates it in terms of: • Quality of results with typical software • Design effort • Software constraints • Quality & usefulness of synthesis features • Usability • Performance gap vs. manual design

  9. Typical Software • Why do we need to test with “typical software”? • Much more legacy SW than specially written SW • Far more SW experience than HW expertise • Software is widely available and needs acceleration, but its authors often don’t have HW expertise • Transformation and annotation are necessary • BUT … hopefully not a complete re-write • 5 implementations of stereo matching • BFAS, CSBP, SO, SO w/o occlusion, CLM

  10. AutoPilot • C/C++, SystemC input • Function-by-function synthesis • Each function: private datapath & FSM-based control • Automatic: • Standard compiler transformations, bit-width reduction, HW-specific transformations in some contexts (e.g. unroll, balance) • AutoPilot is conservative with automatic transformations to ensure correctness and flexibility • Manual: • AP reduced bit-width datatypes • Pragmas & synthesis directives

  11. Our Optimization Process • Baseline compatibility • Minimum changes for compatibility • Structure changes • Combine loops, computation partition • Bit-width reduction and BRAM storage • Optimize storage, datapath resource use • Pipelining and loop optimization • Optimize throughput per computation item • Parallelization via resource duplication • Use more resources, compute multiple items

  12. Our Optimization Process • Baseline compatibility • Minimum changes for compatibility • Structure changes • Combine loops, computation partition • Bit-width reduction and BRAM storage • Optimize storage, datapath resource use • Pipelining and loop optimization • Optimize throughput per computation item • Parallelization via resource duplication • Use more resources, compute multiple items • (Steps 2-3: resource use optimization; steps 4-5: latency/throughput optimization)

  13. Our Optimization Process • Baseline compatibility • Minimum changes for compatibility • Structure changes • Combine loops, computation partition • Bit-width reduction and BRAM storage • Optimize storage, datapath resource use • Pipelining and loop optimization • Optimize throughput per computation item • Parallelization via resource duplication • Use more resources, compute multiple items Optimization may open opportunity for iteration… For simplicity we present a single path

  14. 1. Baseline – Software Constraints • Eliminate use of libraries • Standard template library (STL), OpenCV • Convert dynamic memory allocation to static • Eliminate memory re-allocation • Convert memset/memcpy calls to loops • Eliminate use of arbitrary pointers • No pointer passing, pointer arithmetic • No run-time indirection (e.g. linked lists) • Produce compatible, but unoptimized, source
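The static-allocation and memset/memcpy constraints above can be sketched in a few lines of C++. The buffer dimensions and function names here are illustrative, not taken from the paper's benchmarks:

```cpp
// Fixed compile-time dimensions: HLS tools need static sizes, so a
// malloc'd W*H buffer becomes a static-sized array. (Illustrative values.)
const int W = 64;
const int H = 48;

// Replaces memcpy(dst, src, W * H): an explicit loop that an HLS tool
// can unroll and pipeline.
void copy_frame(const unsigned char src[W * H], unsigned char dst[W * H]) {
  for (int i = 0; i < W * H; i++)
    dst[i] = src[i];
}

// Replaces memset(dst, 0, W * H).
void clear_frame(unsigned char dst[W * H]) {
  for (int i = 0; i < W * H; i++)
    dst[i] = 0;
}
```

The loop bodies are functionally identical to the library calls they replace; the point is only that the copy is now expressed in synthesizable C.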

  15. 2. Code Restructuring • Divide computation into independent units • Function/loop merging • Interchange nested loops • Share internal buffers • Manual … some have AutoPilot pragmas • But, manual can be more flexible at this point • And multiple may be required simultaneously • Expose potential for design space exploration
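The loop-merging step above can be sketched as follows. This is an illustrative example with invented names, not code from the stereo matching benchmarks:

```cpp
const int N = 256;  // illustrative array size

// Before restructuring, a scale pass and an offset pass each looped over
// the whole array, forcing the intermediate result through a buffer.
// Merged into one loop, the intermediate value stays in a register and
// one loop's control overhead disappears.
void scale_then_offset_merged(const int in[N], int out[N]) {
  for (int i = 0; i < N; i++) {
    int t = in[i] * 3;  // formerly loop 1
    out[i] = t + 7;     // formerly loop 2
  }
}
```

Merging like this also shrinks internal buffer requirements, which is why the slide pairs it with buffer sharing.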

  16. 3. Bit-width Reduction and BRAM • Bit-width reduction can be automatic • But array pragmas override auto… • Array pragmas used to reorganize arrays, improve BRAM use efficiency • Reduce resource use, improve bandwidth • Allow flexibility for pipelining & parallelism
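A plain-C++ analogue of the bit-width reduction idea (AutoPilot itself uses ap_uint&lt;N&gt;/ap_fixed&lt;W,I&gt; for arbitrary widths; the standard uint8_t type stands in here, and the array size and value range are illustrative):

```cpp
#include <cstdint>

const int N = 128;  // illustrative array size

// If disparity values are known to fit in 8 bits, storing them as
// uint8_t instead of 32-bit int cuts BRAM usage by 4x and lets more
// elements pack into each memory word, improving effective bandwidth.
int sum_disparities(const uint8_t disp[N]) {
  int sum = 0;
  for (int i = 0; i < N; i++)
    sum += disp[i];
  return sum;
}
```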

  17. 3. Bit-width/array Example

      ap_fixed<20,3> m_bI_F[101] = {…};
      ap_fixed<20,3> m_bI1_F[101] = {…};
      ap_fixed<20,3> m_bI2_F[101] = {…};
      ap_fixed<20,3> m_bI3_F[101] = {…};
      #pragma AP array_map instance=m_bI variable=m_bI_F,m_bI1_F,m_bI2_F,m_bI3_F vertical
      RecursiveGaussian_3D(…, m_bI_F[NumOfI-1], m_bI1_F[NumOfI-1],
                           m_bI2_F[NumOfI-1], m_bI3_F[NumOfI-1]);

  18. 4. Pipelining & Loop Optimization • After steps 1-3: reduced resources • Unroll loops • Increase computation to loop overhead ratio • Small loops may be completely unrolled • Pipeline • Improve throughput, utilization of compute units • Target initiation interval of 1

  19. 4. Complete unroll example

      VoteDpr = 0;
      count = pDprCount[0];
      for (d = 1; d < DprRange; d++) {
      #pragma AP unroll complete
      #pragma AP expression_balance
        if (pDprCount[d] > count) {
          count = pDprCount[d];
          VoteDpr = d;
        }
      }

  • Initially, DprRange iterations… • But a simple inner loop  completely unroll • expression_balance converts the sequential dependence into a tree-based maximum search
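What expression_balance does to the unrolled maximum search can be shown by hand for four elements. This sketch is illustrative only (the pragma performs the restructuring automatically; the function name is invented):

```cpp
// A sequential argmax carries a dependence through every iteration
// (comparison depth N-1). Balancing restructures it into a comparison
// tree of depth log2(N): pairs first, then winners.
int argmax4(const int c[4]) {
  int i01 = (c[1] > c[0]) ? 1 : 0;       // level 1: pair (0,1)
  int i23 = (c[3] > c[2]) ? 3 : 2;       // level 1: pair (2,3)
  return (c[i23] > c[i01]) ? i23 : i01;  // level 2: compare winners
}
```

In hardware the two level-1 comparators run in the same cycle, which is why the balanced form shortens the critical path.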

  20. 5. Parallelization (resource duplication) • Now, each computation pipeline is optimized… • Use extra resources to duplicate pipelines • If we can compute multiple data items in parallel • AutoPilot implicitly determines parallelism • No parallel pragma to explicitly denote • Programmer can help expose parallelism • Because AutoPilot is conservative, may need to organize execution to make parallelism explicit

  21. 5. Parallelism example

      #define FS_UNROLL 2
      ap_fixed<22,9> DispCost[FS_UNROLL][W*H];
      ap_fixed<22,9> CostL[FS_UNROLL][W*H];
      #pragma AP array partition complete dim=1 variable=DispCost,CostL
      for (int k = Min; k <= Max; k += FS_UNROLL) {
        FL_00: for (int l = 0; l < FS_UNROLL; l++) {
      #pragma AP unroll complete
          SubSampling_Merge(…, DispCost[l], …, k+l);
          CostAgg(…, DispCost[l], CostL[l], …);
          Cost_WTA(…, CostL[l], …, k+l);
        }
      }

  22. Experiments • 5 implementations of stereo matching • BFAS, CSBP, SO, SO w/o occlusion, CLM • For each application, examine synthesis quality after each of the 5 optimization steps • Track design effort, transformation effectiveness, software constraints (as well as performance and area)

  23. Stereo Matching • Two spatially separated color cameras • Distance in pixels between the same object in the images infers depth • Complex algorithms to match pixels • Global • Match all pixels at the same time • Local • Match pixels in local region first

  24. Stereo Matching Results • Baseline not shown ( >> 100% resources ) • More complex benchmarks emphasize the need for resource optimization • Parallelization is the most important optimization for these benchmarks

  25. Productivity • Embedded kernels – 2 weeks design effort • Small kernels  easy to apply transformations • Easy to analyze tradeoffs • Only a subset of pragmas applicable • Stereo matching – 2.5–5 weeks per algorithm vs. 4–6 months for manual FPGA designs

  26. Usability • AutoPilot has powerful optimizations … • But applies them conservatively • Some of the most powerful optimizations apply only in a few situations • array_stream + dataflow • Needs: • Improved unrolling/pipelining/parallelization • Robustness for the dataflow transformation • BRAM port duplication/bandwidth optimization • Automatic tradeoff analysis

  27. Performance Gap • Speedups between 4x and 126x over SW • Still up to a 40x gap vs. manual HW designs • Both pipelining and resource duplication can be important … depending on the application • Needs: • Detect memory-level dependences • Memory access reordering • Improved partitioning, streaming & pipelining • Automatic temporary buffers

  28. Interested in Studying HLS? ADSC is hiring for PhD-level HLS research. See me for more details.
