10 likes | 102 Views
Automating Resource Optimisation in Reconfigurable Design. {nx210, cpc10, qj04 , wl} @doc.ic.ac.uk qiangliu@tju.edu.cn. Function Level. Design Flow. Xinyu Niu, Thomas C.P Chau, Qiwei Jin, Wayne Luk and Qiang liu. ABSTRACT. Partition Level. Configuration Level.
E N D
Automating Resource Optimisation in Reconfigurable Design {nx210, cpc10, qj04 , wl}@doc.ic.ac.uk qiangliu@tju.edu.cn Function Level Design Flow Xinyu Niu, Thomas C.P Chau, Qiwei Jin, Wayne Luk and Qiang liu ABSTRACT Partition Level Configuration Level Function properties extracted from algorithm details: resource consumption, bandwidth requirement, memory architectures data dependency New design approach: automatically identify and exploit run-time reconfiguration for optimising resource utilisation Configuration Data Flow Graph: a hierarchical graph structure, for synthesis of reconfigurable designs in three steps Evaluation: barrier option pricing (finance), particle filter (robotics), reverse time migration (oil and gas) Improvement: 1.61 to 2.19 times faster than optimised static FPGA designs, up to 28.8 times faster than optimised CPU reference designs, and 1.55 times faster than optimised GPU designs; up to 29 times more energy efficient than CPU/GPU Generate configurations based on assigned ALAP and ATAP levels Optimise configurations to fully utilise available resources Adopt analysis of resource consumption and bandwidth in the design model Motivating Example Function nodes are assigned: As Late As Possible (ALAP) levels based on interactions between function nodes, As Timely As Possible (ATAP) levels based on data dependency inside a function Function level Results Group Configurations into partitions by mapping configurations into a Configuration Graph Assign design rules to build the search space Generate valid partitions within the pruned search space by recursive search algorithm Scalable design method: Design rules to eliminate redundant search space Design algorithm to only explore valid design space Improvement over static designs: FPGA devices: 4 Xilinx Virtex-6 SX475T FPGAs, hosted in a Maxeler MPC-C500 computing node, running at 100MHz 1.94, 2.19 and 1.61 times faster respectively for barrier option pricing, particle filter, reverse time migration Static design: 34% to 59% of theoretical performance Reconfiguration removing idle functions: 98% of theoretical performance Inefficiency for dynamic design: due to reconfiguration overhead Improvement over optimised CPU and GPU designs: CPU: 24 Intel Xeon X5660 cores running at 2.67 GHz GPU: an NVIDIA Tesla C2070 card running at 1.15 GHz, linearly scaled by 4 for comparison with multi-FPGA designs Up to 27 times faster than CPU, 1.55 times faster than GPU Up to 29 times more energy efficient than CPU and GPU designs Proposed approach applicable to multi-chip environment Scalability: limited by reconfiguration overhead, can overcome by parallel reconfiguration of multiple FPGAs Acknowledgement: This work was supported in part by UK EPSRC, by the European Union Seventh Framework Programme under Grant agreement number 257906, 287804 and 318521, by the HiPEAC NoE, by Maxeler University Programme, and by Xilinx. Exploring original search space Algorithm level Design issues are addressed at multiple levels: Function level: extract function information, assign data dependency levels Configuration level: generate configuration based on designs rules, optimise configurations Partition level: generate partitions recursively, select the optimal run-time solutions Static design: accommodate all functions to accomplish the application Dynamic design: reconfigure design dynamically, implement only active functions Benchmarks: Barrier Option Pricing (BOP) Particle Filter (PF) Reverse-Time Migration (RTM) Exploring pruned search space 4n 3n 2n n 7n/3 8n/3 2n n