High Level Synthesis of Stereo Matching: Productivity, Performance, and Software Constraints Kyle Rupnow, Yun Liang, Yinan Li, Dongbo Min, Minh Do, Deming Chen Advanced Digital Sciences Center University of Illinois at Urbana-Champaign
Big Picture • There are many HLS tools … • But little unbiased study of their capabilities, limitations, and performance, or of how to achieve that performance • … and little study of applying HLS tools to real-world software • For this study, we concentrate on one state-of-the-art tool, AutoPilot
High Level Synthesis • Input languages: C/C++, SystemC, Haskell, CUDA, OpenCL, .NET languages, and specialized languages
High Level Synthesis Promises • Better productivity • Languages easier to learn • Algorithms easier to describe • Code concise, less susceptible to error • Faster design cycle, with similar design quality • Automate complex, tedious transformations
High Level Synthesis Process • Code transformations • Dead code elimination, strength reduction … • Loop optimization, pipelining, memory optimization … • Scheduling • Resource binding • Architectural synthesis • Register transfer level code generation
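As a small illustration of one such automatic transformation, strength reduction replaces an expensive operation with a cheaper equivalent (a sketch of ours, not taken from any HLS tool's documentation):

    // Before: a multiply by a power of two.
    int scale(int x) {
        return x * 8;   // would map to a hardware multiplier
    }

    // After strength reduction: the multiply becomes a shift,
    // which in hardware is just wiring.
    int scale_reduced(int x) {
        return x << 3;  // same result, no multiplier needed
    }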
High Level Synthesis History • 1st Generation • Poor input languages, poor quality results • 2nd Generation • Wrong target user base • Poor input, quality of results • Dataflow vs. control, validation • Insights into the failures of 1st & 2nd Gen led to critical advances for the following generation • 3rd Generation • Starting to achieve success … • But new insights necessary to move to 4th Gen * G. Martin and G. Smith, “High-Level Synthesis: Past, Present, and Future,” IEEE Design & Test of Computers, vol. 26, no. 4, pp. 18-25, Aug. 2009.
HLS Tools CatapultC Cynthesizer Feldspar DIME-C Esterel ImpulseC FCUDA PRET-C OpenRCL/MARC SpecC ROCC RapidMind/Ct Liquid Metal eXcite C2H AutoPilot Trident Synphony Accelerator NISC Signal SPARK LegUp Lava
Study Motivation • Despite a wealth of success stories… little independent study • Berkeley Design Technology Inc. offers an HLS Certification program • 2 small, unavailable signal processing applications • Vendors optimize the design themselves, then send it to BDTI for verification • This study concentrates on AutoPilot, one of the recognized state-of-the-art HLS tools, and evaluates it in terms of: • Quality of results with typical software • Design effort • Software constraints • Quality & usefulness of synthesis features • Usability • Performance gap vs. manual design
Typical Software • Why do we need to test with “typical software”? • Much more legacy SW than specially written SW • Many more engineers with SW experience than with HW expertise • Software is widely available and needs acceleration, but its authors often don’t have HW expertise • Transformation and annotation are necessary • BUT … hopefully not a complete re-write • 5 implementations of stereo matching • BFAS, CSBP, SO, SO w/o occlusion, CLM
AutoPilot • C/C++, SystemC input • Function-by-function synthesis • Each function: private datapath & FSM-based control • Automatic: • Standard compiler optimizations, bit-width reduction, HW-specific transformations in some contexts (e.g. unroll, expression balance) • AutoPilot is conservative with automatic transformations to ensure correctness, flexibility • Manual: • AP reduced bit-width datatypes • Pragmas & synthesis directives
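The manual bit-width reduction uses AutoPilot's arbitrary-precision datatypes, as in this small sketch of ours (the variable names are illustrative; the header names match the later Vivado HLS releases and are an assumption here):

    #include "ap_int.h"    // arbitrary-precision integers (header name assumed)
    #include "ap_fixed.h"  // arbitrary-precision fixed point (header name assumed)

    ap_uint<10> col;               // 10 bits suffice for a column index < 1024
    ap_fixed<20,3> coeff = 0.125;  // 20-bit fixed point with 3 integer bits,
                                   // the same shape as the array example below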
Our Optimization Process • 1. Baseline compatibility • Minimum changes for compatibility • 2. Structure changes • Combine loops, computation partition • 3. Bit-width reduction and BRAM storage • Optimize storage, datapath resource use • 4. Pipelining and loop optimization • Optimize throughput per computation item • 5. Parallelization via resource duplication • Use more resources, compute multiple items • Steps 1–3 optimize resource use; steps 4–5 optimize latency/throughput • Optimization may open opportunities for iteration… for simplicity we present a single path
1. Baseline – Software Constraints • Eliminate use of libraries • Standard template library (STL), OpenCV • Convert dynamic memory allocation to static • Eliminate memory re-allocation • Convert memset/memcpy calls to loops • Eliminate use of arbitrary pointers • No pointer passing, pointer arithmetic • No run-time indirection (e.g. linked lists) • Produces compatible but un-optimized source
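A minimal before/after sketch of these baseline changes (our own illustration; W, H, and the buffer name are hypothetical, not taken from the benchmarks):

    // Before: dynamic allocation plus memcpy, which AutoPilot cannot synthesize.
    //   float *cost = (float *)malloc(W * H * sizeof(float));
    //   memcpy(cost, init, W * H * sizeof(float));

    // After: a statically sized array and an explicit copy loop.
    #define W 320
    #define H 240
    float cost[W * H];

    void init_cost(const float init[W * H]) {
        for (int i = 0; i < W * H; i++) {
            cost[i] = init[i];  // memcpy rewritten as a synthesizable loop
        }
    }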
2. Code Restructuring • Divide computation into independent units • Function/loop merging • Interchange nested loops • Share internal buffers • Manual … some have AutoPilot pragmas • But, manual can be more flexible at this point • And multiple may be required simultaneously • Expose potential for design space exploration
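As one concrete illustration, loop merging fuses two passes over the same data into a single loop, cutting loop overhead and keeping the intermediate value in a register instead of a shared buffer (a sketch of ours with stand-in kernels, not code from the benchmarks):

    #define N 1024

    static int f(int x) { return x + 1; }  // stand-in for the first pass's kernel
    static int g(int x) { return x * 2; }  // stand-in for the second pass's kernel

    // Before merging: for(i) a[i] = f(in[i]);  then  for(i) out[i] = g(a[i]);
    // After merging: one loop, one set of loop control, no intermediate array.
    void merged(const int in[N], int out[N]) {
        for (int i = 0; i < N; i++) {
            int t = f(in[i]);  // body of the first original loop
            out[i] = g(t);     // body of the second original loop, fused in
        }
    }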
3. Bit-width Reduction and BRAM • Bit-width reduction can be automatic • But array pragmas override auto… • Array pragmas used to reorganize arrays, improve BRAM use efficiency • Reduce resource use, improve bandwidth • Allow flexibility for pipelining & parallelism
3. Bit-width/array Example

    ap_fixed<20,3> m_bI_F[101]  = {…};
    ap_fixed<20,3> m_bI1_F[101] = {…};
    ap_fixed<20,3> m_bI2_F[101] = {…};
    ap_fixed<20,3> m_bI3_F[101] = {…};
    #pragma AP array_map instance=m_bI variable=m_bI_F,m_bI1_F,m_bI2_F,m_bI3_F vertical

    RecursiveGaussian_3D(…, m_bI_F[NumOfI-1], m_bI1_F[NumOfI-1], m_bI2_F[NumOfI-1], m_bI3_F[NumOfI-1]);

• array_map combines the four 20-bit arrays vertically into one BRAM instance (m_bI), improving BRAM use efficiency
4. Pipelining & Loop Optimization • After steps 1-3: reduced resources • Unroll loops • Increase computation to loop overhead ratio • Small loops may be completely unrolled • Pipeline • Improve throughput, utilization of compute units • Target initiation interval of 1
4. Complete unroll example

    VoteDpr = 0;
    count = pDprCount[0];
    for (d = 1; d < DprRange; d++) {
    #pragma AP unroll complete
    #pragma AP expression_balance
        if (pDprCount[d] > count) {
            count = pDprCount[d];
            VoteDpr = d;
        }
    }

• Initially, DprRange sequential iterations… • But this is a simple inner loop, so it can be completely unrolled • expression_balance converts the sequential dependence into a tree-based maximum search
5. Parallelization (resource duplication) • Now, each computation pipeline is optimized… • Use extra resources to duplicate pipelines • If we can compute multiple data items in parallel • AutoPilot determines parallelism implicitly • There is no parallel pragma to denote it explicitly • The programmer can help expose parallelism • Because AutoPilot is conservative, execution may need to be organized to make parallelism explicit
5. Parallelism example

    #define FS_UNROLL 2
    ap_fixed<22,9> DispCost[FS_UNROLL][W*H];
    ap_fixed<22,9> CostL[FS_UNROLL][W*H];
    #pragma AP array partition complete dim=1 variable=DispCost,CostL

    for (int k = Min; k <= Max; k += FS_UNROLL) {
        FL_00: for (int l = 0; l < FS_UNROLL; l++) {
    #pragma AP unroll complete
            SubSampling_Merge(…, DispCost[l], …, k+l);
            CostAgg(…, DispCost[l], CostL[l], …);
            Cost_WTA(…, CostL[l], …, k+l);
        }
    }
Experiments • 5 implementations of stereo matching • BFAS, CSBP, SO, SO w/o occlusion, CLM • For each application, examine synthesis quality after each of the 5 optimization steps • Track design effort, transformation effectiveness, software constraints (as well as performance and area)
Stereo Matching • Two spatially separated color cameras • The disparity (distance in pixels) between the same object in the two images determines its depth • Complex algorithms to match pixels • Global: match all pixels at the same time • Local: match pixels in a local region first
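The underlying geometry is the standard pinhole-stereo relation (general background, not specific to these benchmarks): with focal length f in pixels, camera baseline B, and disparity d in pixels, depth is Z = f·B/d. A minimal sketch:

    // Standard stereo depth relation: Z = f * B / d.
    // f: focal length in pixels, B: baseline between the cameras,
    // d: disparity in pixels of the same object in the two images.
    float depth_from_disparity(float f, float B, float d) {
        return (d > 0.0f) ? (f * B) / d : 0.0f;  // guard against zero disparity
    }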
Stereo Matching Results • Baseline not shown (>> 100% of resources) • The more complex benchmarks emphasize the need for resource optimization • Parallelization is the most important optimization for these benchmarks
Productivity • Embedded kernels – 2 weeks design effort • Small kernels make it easy to apply transformations • Easy to analyze tradeoffs • Only a subset of pragmas applicable • Stereo matching – 2.5–5 weeks per algorithm • vs. 4–6 months for manual FPGA designs
Usability • AutoPilot has powerful optimizations… • But applies them conservatively • Some of the most powerful optimizations apply only in a few situations • array_stream + dataflow (see the sketch below) • Needs: • Improved unrolling/pipelining/parallelization • Robustness of the dataflow transformation • BRAM port duplication/bandwidth optimization • Automatic tradeoff analysis
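A best-effort sketch of the array_stream + dataflow combination (the stage functions are hypothetical, and the exact pragma attribute syntax is our assumption in the deck's "#pragma AP" style; check it against the AutoPilot manual):

    #define N 1024

    void produce(const int in[N], int tmp[N]) {
        for (int i = 0; i < N; i++) tmp[i] = in[i] + 1;  // hypothetical stage 1
    }
    void consume(const int tmp[N], int out[N]) {
        for (int i = 0; i < N; i++) out[i] = tmp[i] * 2;  // hypothetical stage 2
    }

    void top(const int in[N], int out[N]) {
    #pragma AP dataflow
        int tmp[N];
    #pragma AP array_stream variable=tmp
        produce(in, tmp);   // writes tmp strictly in order
        consume(tmp, out);  // reads tmp in order, so it can start before produce finishes
    }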
Performance Gap • Speedups between 4x and 126x over SW • But still up to a 40x gap versus manual HW designs • Both pipelining and resource duplication can be important… depending on the application • Needs: • Detect memory-level dependences • Memory access reordering • Improved partitioning, streaming & pipelining • Automatic temporary buffers
Interested in Studying HLS? ADSC is hiring for PhD-level HLS research. See me for more details.