Re-configurable Parallel Stream Processor with self-assembling and self-restorable micro-architecture
Lev Kirischian, Irina Terterian, Pil Woo Chun and Vadim Geurkov
Embedded and Re-configurable Systems Lab, RYERSON University, CANADA
Example of Multi-task Data-Flow workload where each task can run in different modes
[Figure: task timeline, Tasks versus Time. Task 1: Mode 1, Mode 2, Mode 3. Task 2: Mode 1, Mode 2. Task 3. Task 4: Mode 1, Mode 3, Mode 4, Mode 7.]
Usual Approach: Conventional Processors with Software-to-Task Optimization (Compilers + OS)
Software-to-task optimization allows using conventional computing platforms with a fixed architecture (Superscalar, VLIW, etc.) coupled with software compilers and an OS.
Limitations of conventional processors
• If tasks are executed on a sequential computing system, the processing time often cannot meet the specification requirements
• If tasks are executed on a parallel computing system with a fixed architecture, the cost-effectiveness of the system strongly depends on the task algorithm or data structure
Alternative Approach: Application Specific Processors (ASP) with Static Hardware-to-Task Optimization
An ASP allows reaching the required cost-performance parameters because the ASP architecture is optimized for the data-flow graph of the task and the task data structure.
Limitations of Application Specific Processors
• Decrease of performance if the task algorithm or data structure changes
• Limited possibility for further modernization
• High cost for multi-task or multi-mode custom computing systems
Proposed Approach: Reconfigurable Processor with Dynamic Architecture-to-Task Optimization
A high-performance computing system for multi-task data-flow applications should contain two major components:
1. A Dynamically Re-configurable Computing Platform based on partially reconfigurable FPGA devices, to provide the maximum possible hardware flexibility.
2. A Library of Application Specific Virtual Processors (ASVP): configuration bit-streams that program the on-chip Application Specific Processor circuitry for the period of time while the Application (Task) is active.
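One way to picture the second component is as a lookup table from (task, mode) to a configuration bit-stream. The sketch below is only an illustration in Python: the class names, the fields, and the bit-stream file names are assumptions made for this example and do not come from the presented system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asvp:
    """Hypothetical record for one Application Specific Virtual Processor."""
    task: str
    mode: int
    bitstream_file: str   # configuration bit-stream (file name is made up)
    clb_columns: int      # FPGA columns the ASVP occupies (assumed unit)


class AsvpLibrary:
    """Maps (task, mode) to the ASVP that implements it."""

    def __init__(self):
        self._entries = {}

    def add(self, asvp):
        self._entries[(asvp.task, asvp.mode)] = asvp

    def lookup(self, task, mode):
        try:
            return self._entries[(task, mode)]
        except KeyError:
            raise LookupError(f"no ASVP for task {task!r} in mode {mode}") from None


library = AsvpLibrary()
library.add(Asvp("task1", 1, "task1_mode1.bit", clb_columns=3))
library.add(Asvp("task1", 2, "task1_mode2.bit", clb_columns=2))
library.add(Asvp("task2", 1, "task2_mode1.bit", clb_columns=4))
```

When a task is activated or switches mode, `library.lookup(task, mode)` would return the entry whose bit-stream is downloaded to the platform.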
Architecture of Partially Reconfigurable FPGA devices (Xilinx “Virtex” Family)
[Figure: Configuration Data Files feed the Internal Configuration SRAM. Labels shown: In, Out, I/O Frame, CLBs Frame #1 ... #i ... #N each with Block RAM, I/O Frame, Internal (Virtual) BUS.]
CLB (Configurable Logic Block): the uniform logic element of a Frame, the smallest individually configurable component in the FPGA.
Concept of Application Specific Virtual Processor (ASVP)
• An Application Specific Virtual Processor (ASVP) is a group of logic resources dedicated and optimally configured to reflect the algorithm and data structure of the task.
• An ASVP is presented in the form of a configuration data file (configuration bit-stream) to be downloaded into the FPGA when the task is to be activated
Life-cycle of Application Specific Virtual Processor
1. The ASVP-core is downloaded to the Reconfigurable platform before task activation
2. The ASVP performs the task data processing for as long as necessary, without interruption or time sharing of its dedicated logic resources with any other task
3. After task completion, all resources included in the ASVP can be re-configured for any other task.
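The three life-cycle steps can be sketched as a toy slot allocator. This is a minimal model under stated assumptions: the FPGA is treated as a list of equally sized slots, an ASVP may take any free slots, and the names `ReconfigurablePlatform`, `load`, and `release` are hypothetical. Step 2 (no time sharing) is represented by each slot having exactly one owner while the task is active.

```python
class ReconfigurablePlatform:
    """Toy model of an FPGA divided into equally sized slots (assumed)."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots   # None means the slot is free

    def load(self, asvp_name, slots_needed):
        """Step 1: configure free slots for the ASVP before task activation."""
        free = [i for i, owner in enumerate(self.slots) if owner is None]
        if len(free) < slots_needed:
            raise RuntimeError(f"not enough free slots for {asvp_name}")
        taken = free[:slots_needed]
        for i in taken:
            self.slots[i] = asvp_name     # dedicated to this ASVP only
        return taken

    def release(self, asvp_name):
        """Step 3: after task completion the slots become reusable."""
        self.slots = [None if owner == asvp_name else owner
                      for owner in self.slots]


platform = ReconfigurablePlatform(num_slots=6)
a = platform.load("ASVP1", 3)     # takes slots 0, 1, 2
b = platform.load("ASVP2", 2)     # takes slots 3, 4
platform.release("ASVP1")         # ASVP2 keeps its slots untouched
c = platform.load("ASVP3", 4)     # reuses slots 0, 1, 2 and takes slot 5
```

Releasing ASVP1 never touches the slots owned by ASVP2, which mirrors the requirement that active tasks are not interrupted when another task terminates.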
ASVP Architecture-to-Task Optimization in Partially Reconfigurable FPGA
[Figure: a Data-Flow Graph with XOR and + nodes is mapped onto FPGA Slots 1, 2, 3, ... Each node becomes a Virtual Hardware Component. Labels shown: Data In, Input, Output, Data Out, Internal (Virtual) BUS.]
Virtual Hardware Component & Virtual Bus Interconnection
[Figure: a Virtual Hardware Component, its Boundary, and the Virtual Bus.]
Micro-architecture of Application Specific Virtual Processor (ASVP)
The micro-architecture of an ASVP is based on Virtual Hardware Components interconnected via Virtual Bus lines
Parallel Task Processing on the Dynamically Re-configurable Stream Processor (DRSP)
[Figure: ASVP 1 (for Task 1), ASVP 2 and ASVP 3 run side by side. Labels shown: Data in #1, Data in #2; Data out #1, #2, #3; I/O 1 to I/O 4; FU 1 to FU 4; RIM 1 to RIM 4; Virtual Bus.]
DRSP: System Level Architecture
[Figure: block diagram. Labels shown: Host PC; Data Stream Source; Task Memory holding Task 1: {Afix + Amodes} ... Task h: {Afix + Amodes}; PCI-Bus; PCI-Interface Module; Configuration & Data Bus; RT-HOS; PRCP-base Reconfigurable Functional Unit (Afix i + …); Cache Memory ({Amodes i}); Data Out.]
Architecture of Reconfigurable Computing Module
[Figure: block diagram of the module, with these labels:]
• Input LVDS ports: SPI, 2 x 3.43 Gbit/s (12 bit x 300 MHz)
• Real-Time Hardware Operating System based on XCV50E Virtex FPGA
• LVTTL BUS: 8.12 Gbit/s (64 bit x 133 MHz)
• PCI Interface: 800 Mbit/s
• Reconfigurable Functional Unit [RFM 0111-002]
• Config. Files / Data Cache (4 x 512 KB)
• Output LVDS ports: SPI, 2 x 3.43 Gbit/s (12 bit x 300 MHz)
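The LVDS and LVTTL throughput figures can be checked as width times clock. The sketch below is a worked calculation, not part of the original material. In decimal units the products are 3.6 Gbit/s and 8.512 Gbit/s; the quoted 3.43 and 8.12 are reproduced if 1 "Gbit" is taken as 1000 x 2^20 bits. That unit is an inference from the numbers and is not stated in the slides.

```python
def raw_rate_bits_per_s(width_bits, clock_hz):
    """Raw throughput of a parallel port: one word per clock cycle."""
    return width_bits * clock_hz


def as_slide_gbit(bits_per_s):
    """Assumed unit: 1 'Gbit' = 1000 * 2**20 bits (inferred, not stated)."""
    return bits_per_s / (1000 * 2**20)


lvds = raw_rate_bits_per_s(12, 300e6)    # one 12-bit LVDS port at 300 MHz
bus = raw_rate_bits_per_s(64, 133e6)     # 64-bit LVTTL bus at 133 MHz

print(round(as_slide_gbit(lvds), 2))     # 3.43
print(round(as_slide_gbit(bus), 2))      # 8.12
```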
Reconfigurable Computing Module based on Xilinx “Virtex-E” family of FPGA Devices
Restoration of ASVP using spare CLB-column
If a hardware fault occurs, the damaged Virtual Hardware Component can be relocated to the reserved CLB-column.
[Figure: ASVP components (XOR, XOR, +, +) placed in Columns #1, 2, 3, ... Labels shown: AP i, Input, Output, Communication Field.]
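The relocation idea can be sketched as a change of placement: the component on the faulty column is moved to a spare column, and all other components stay where they are. This is a hypothetical illustration; the function name, the component names, and the one-component-per-column layout are assumptions, and on real hardware the move would mean downloading that component's configuration to the spare column.

```python
def relocate_on_fault(placement, faulty_column, spare_columns):
    """Return a new placement with the component on faulty_column moved
    to the first available spare column. Inputs are left unmodified."""
    if not spare_columns:
        raise RuntimeError("no spare CLB-column left")
    spare = spare_columns[0]
    new_placement = {
        component: (spare if column == faulty_column else column)
        for component, column in placement.items()
    }
    return new_placement, spare_columns[1:]


# Hypothetical ASVP: two XOR components and an adder, one per column.
placement = {"xor_a": 1, "xor_b": 2, "adder": 3}
restored, spares_left = relocate_on_fault(placement, faulty_column=2,
                                          spare_columns=[4])
```

Here `restored` maps `xor_b` to column 4 while `xor_a` and `adder` keep their columns, and no spare column remains.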
When is the proposed technology most beneficial?
• The workload consists of many tasks, where each task can run in different modes.
• Each task requires high-speed data-stream processing
• Task algorithms may be modified within the life cycle of a system
• Active tasks must run in parallel and must not be interrupted when one of the tasks switches its mode or terminates.
• The system can be restored remotely or by itself even if a hardware fault occurs
DRSP Application for Networked Intelligent Manufacturing Systems
High-performance parallel data-stream processing (up to thousands of billions of operations / sec.) of large volumes of data (up to hundreds of gigabits) for:
a) Complex image processing and image recognition,
b) Spectrum analysis and digital signal processing,
c) Data transmission via LAN with data compression / decompression and encryption / decryption,
d) Control of high-performance manufacturing equipment and robotic systems.
Acceleration of Task / Mode Switching
[Figure: plot of Acceleration (axis 0 to 25) versus Number of CLB-slots in Virtual Component (axis 1 to 10).]
The acceleration of task or mode switching, compared with an entire-FPGA-based system, increases as the number of CLB-columns in the ASVP decreases; switching can be more than 20 times faster.
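A simple first-order model gives the same trend. The slides do not state the model behind the plot, so the sketch below rests on two assumptions: configuration time is proportional to the number of CLB-columns written, and the device has 24 columns (a value chosen only for illustration). The function name is hypothetical.

```python
def switching_speedup(total_columns, component_columns):
    """Speed-up of partial over full reconfiguration, assuming the
    configuration time is proportional to the number of columns written."""
    if not 1 <= component_columns <= total_columns:
        raise ValueError("component must fit in the device")
    return total_columns / component_columns


TOTAL_COLUMNS = 24   # assumed device size, chosen only for illustration
for columns in (1, 2, 4, 8):
    print(columns, switching_speedup(TOTAL_COLUMNS, columns))
```

Under these assumptions a one-column component switches 24 times faster than reloading the whole device, and the advantage shrinks as the component grows.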
Minimization of Hardware Resources
Minimization of logic resources in the DRSP approach compared with entire-FPGA-based systems: as the number of tasks and task modes in a workload increases, the cost-effectiveness of the DRSP increases correspondingly.
SUMMARY: DRSP Compared with Conventional CPU, DSP or ASP Platforms
[Table: columns DRSP, Conv. CPU, DSP, ASP; rows Performance, Flexibility, Reliability. Cell entries in extracted order: Much lower than DRSP; Lower than DRSP; Much lower than DRSP; Somewhat higher; None, or very little; Lower than DRSP; Much lower than DRSP; Much lower than DRSP; Lower than DRSP.]