A Flexible Interconnection Structure for Reconfigurable FPGA Dataflow Applications
Gianluca Durelli, Alessandro A. Nacci, Riccardo Cattaneo, Christian Pilato, Donatella Sciuto and Marco Domenico Santambrogio
Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria, Milano, IT
[durelli, nacci, rcattaneo, pilato, sciuto]@elet.polimi.it, marco.santambrogio@polimi.it
20th Reconfigurable Architectures Workshop, May 20-21, 2013, Boston, USA
Rationale
• Strive for performance in compute-intensive applications
• Reconfigurable HW is well suited for certain classes of applications
  • Multimedia, computational biology, physical simulation
• FPGAs are used in HPC systems
  • High maintenance costs
  • Need to share resources among users
• Need to dynamically share and reuse components on the FPGA among different users
Outline
• Goals
• State of the Art
• Proposed Solution
• Design and Evaluation
• Case Study
• Conclusions and Future Work
Goals
• Design an interconnection able to:
  • Create different pipelines by reusing the components available on the FPGA
  • Share resources between different applications
  • Avoid introducing stalls in the pipeline
• Target: FPGAs in an HPC scenario
State of the Art
Existing interconnection solutions introduce unexpected delays in the computation and cannot guarantee performance when the device is shared among different users:
• Bus interconnection
  • Congestion problems
  • Does not scale
• Network on Chip
  • Possible congestion problems
  • Good scalability
Proposed Solution
• Switch-based interconnection
  • Core inputs connected to interconnection outputs
  • Core outputs connected to interconnection inputs
• Fully pipelined point-to-point communication
  • Data are read/written only when all inputs are available
• Configured by setting, for each input and output channel:
  • Switching configuration: multiplexer selection to route the information
  • The clock cycle from which the channel is active
  • How much data has to be read/written through the channel
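As a rough, hypothetical illustration of these three per-channel parameters (plain Java; the class and field names are assumptions, not the actual implementation):

```java
// Hypothetical sketch of the per-channel configuration described above.
// Field names are illustrative assumptions; the real design is generated
// through the HLS flow and may differ.
public class ChannelConfig {
    private final int  muxSelect;        // multiplexer selection: index of the
                                         // interconnection input routed to this channel
    private final int  activationCycle;  // clock cycle from which the channel is active
    private final long dataCount;        // amount of data to read/write through the channel

    public ChannelConfig(int muxSelect, int activationCycle, long dataCount) {
        this.muxSelect = muxSelect;
        this.activationCycle = activationCycle;
        this.dataCount = dataCount;
    }

    public int  getMuxSelect()       { return muxSelect; }
    public int  getActivationCycle() { return activationCycle; }
    public long getDataCount()       { return dataCount; }
}
```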
Proposed Solution
• Suited for dataflow/pipelined applications
• Parameters can be extracted from a high-level description of the application and of the pipeline structure:
  • Possibility to automate the parameter extraction and the interconnection design (see the sketch below)
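As a sketch of how such automation could look, the snippet below derives channel configurations for a linear pipeline from an ordered list of core identifiers. It reuses the hypothetical ChannelConfig class above; the mapping rule and the names are assumptions, not the authors' tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive interconnection parameters for a linear pipeline.
// Input 0 is assumed to be the stream coming from the host; each core's input
// channel is fed by the previous stage's output.
public class PipelineMapper {
    public static List<ChannelConfig> mapLinearPipeline(int[] coreIds, long dataCount) {
        List<ChannelConfig> configs = new ArrayList<>();
        int producer = 0;  // assumed host input stream
        for (int core : coreIds) {
            // Route the producer to this core's input channel, active from cycle 0,
            // transferring dataCount items (fully pipelined, so no stalls are inserted).
            configs.add(new ChannelConfig(producer, 0, dataCount));
            producer = core;  // this core's output feeds the next stage
        }
        return configs;
    }
}
```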
Implementation
• Solution implemented with HLS:
  • HLS is well suited for dataflow/stencil loop synthesis
  • Simplifies HW development
  • Generation of compatible interfaces
• Maxeler Technologies:
  • HPC dataflow computing exploiting FPGAs
  • Proprietary HLS starting from a Java-like description:
    • The proposed interconnection solution is easily described in Java (see the behavioural sketch below)
• MaxWorkstation 3A:
  • Intel i7 quad-core
  • Xilinx Virtex-6 XC6VSX475T
  • PCIe communication:
    • Maximum of 8 channels/streams
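Purely as a behavioural illustration of the multiplexer-based switching (plain Java, not the vendor's MaxJ API), again reusing the hypothetical ChannelConfig sketch:

```java
// Behavioural model of the switch: each interconnection output forwards the value
// of the input channel selected by its configuration once the activation cycle
// has been reached. This is an illustrative sketch, not the generated hardware.
public class SwitchModel {
    private final ChannelConfig[] outputConfigs;  // one configuration per output channel

    public SwitchModel(ChannelConfig[] outputConfigs) {
        this.outputConfigs = outputConfigs;
    }

    // inputs[i] holds the value currently driven on interconnection input i.
    public long[] route(long[] inputs, int currentCycle) {
        long[] outputs = new long[outputConfigs.length];
        for (int o = 0; o < outputConfigs.length; o++) {
            ChannelConfig cfg = outputConfigs[o];
            if (currentCycle >= cfg.getActivationCycle()) {
                outputs[o] = inputs[cfg.getMuxSelect()];  // multiplexer selection
            }
        }
        return outputs;
    }
}
```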
Evaluation: Area Occupation
• Area increment (10-30%) due to the increase in switching logic
• The interconnection consumes up to 6% of the FPGA:
  • A lot of space remains for the user cores
Evaluation: Frequency
• Tested with pass-through cores to evaluate the maximum working frequency of the interconnection (300 MHz)
• In a real-life application (brain network, with cores working at 200 MHz) the interconnection does not affect the critical path
Case Study
• Application:
  • Image processing pipeline (up to 4 stages):
    • Gray scale (GS), Gaussian blur (GB), and edge detection (ED) filters, and their combinations (see the example below)
• Tested architectures: (A)-(D) [figures]
• Experiments:
  • Single execution of an N-stage pipeline
  • Batch execution of a workload of 100 random applications
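A hypothetical usage example for the case-study pipeline GS -> GB -> ED, built on the sketches above (core identifiers and frame size are illustrative assumptions):

```java
import java.util.List;

// Illustrative only: configure a 3-stage pipeline GS -> GB -> ED using the
// hypothetical helper classes sketched earlier.
public class CaseStudyExample {
    public static void main(String[] args) {
        final int GS = 1, GB = 2, ED = 3;        // assumed core identifiers
        long pixelsPerFrame = 1920L * 1080L;     // assumed frame size
        List<ChannelConfig> cfg =
                PipelineMapper.mapLinearPipeline(new int[] { GS, GB, ED }, pixelsPerFrame);
        System.out.println("Configured " + cfg.size() + " input channels");
    }
}
```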
Case Study: Single execution
[Figures: single-execution results for the tested architectures (A)-(D)]
Case Study: Batch execution
• The proposed solution (D) does not introduce overhead in the overall execution time w.r.t. the other two architectures
• At low system load:
  • Up to 30% reduction in the overall workload execution time
Case Study: Batch execution
• At low system load (1-2 applications):
  • The proposed solution (D) does not introduce delays in the execution of a single application of the workload
• At higher system loads (more than 2 applications):
  • 10%-30% reduction in the execution time of a single application
Conclusions and Future Work
• Conclusions:
  • Design of an interconnection to support HW resource sharing in a multi-application scenario
  • Solution suited for dataflow/pipelined systems
  • Possibility to realize different pipeline configurations at run-time
• Future work:
  • Design of a mapping/reconfiguration strategy to allocate user cores and configure new core instances at run-time