ARC: A Performance Analysis Environment for Adaptive Computing Systems
Ranga Vemuri, Jeff Walrath
Digital Design Environments Laboratory
ECECS Department, ML. 30
University of Cincinnati
Cincinnati, Ohio 45221-0030
Phone: (513)-556-4784
Fax: (513)-556-7326
Email: ranga.vemuri@uc.edu
Web: http://www.ececs.uc.edu/~ddel
What need does ARC address?
[Diagram labels: ACS Developers; ACS Compilers; ACS Applications; ACS Design Tools; ACS Operating Systems; ACS Performance Modeling & Analysis]
ACS Performance Analysis
[Figure: a Field Programmable Device goes through Configure, Reconfigure, Reconfigure; plot axes: Power, Time]
ACS Abstractions
• Adaptive Software Layers
• Adaptive Systems (H/w and S/w)
• Reconfigurable Computer
• Reconfigurable Devices
ACS Applications
Adaptive System for Imaging Applications
[Figure: configurations Config 0 through Config 5; processing stages: Forward Discrete Cosine Transform, Quantization, Zig-Zag Transform, Run Length Encoding, Huffman Encoding; loop condition: More Blocks / No More Blocks; plot axes: Power, Time]
ARC System
• ACS Element Performance Models (PDL+)
• ACS Structure (PNF, GUI, VHDL, EDIF, ...)
• ARC
• ACS Application (C++), through the API
• Results Database
• Visualization Tool (Gnuplot/Khoros)
ACS Element Performance Models
[Figure labels: Modules: a, b, c, d; Modes of Module c: c1, c2; Ports: port a, b, c, d; Carriers: W, X, Y]
ACS System Performance Model
[Figure: system composed of modules a, b, c, d, e]
Summary of ARC
• Performance Description Language, PDL+
• ACS Architecture Specifications (GUI, PNF, VHDL, ...)
• ARC Software (Compiler, Composer, Evaluator and Scheduler)
• API for Application Interaction
• Visualization Interface
Illustrative Example
• Simple LUT-style FPGA architecture.
• RC with multiple FPGAs and fixed interconnect.
• ACS with host processor and RC coprocessor.
• Hardware/Software Tasks executing on the ACS.
Typical Performance Related Questions:
1. How does a proposed hardware/software binding of tasks perform?
2. Which member of an FPGA family should be selected to meet desired throughput?
Simple FPGA Architecture
[Figure: four CLBs between FPGA Inputs and FPGA Outputs]
LUT-Based FPGA Cell
[Figure: Inputs feed a Look-Up Table Function Generator holding LUT Bits 0 1 1 0 (LUT bits 0 to k); a Flip Flop and a MUX drive the Output]
Mode: <<[0, 1, 1, 0], 1>>
• Delay Through The CLB
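To make the mode value on this slide concrete, here is a minimal Python sketch of how a look-up table function generator could evaluate its inputs against the LUT bits `[0, 1, 1, 0]`. The function name `lut_output` and the addressing convention (first input is the most significant address bit) are assumptions for illustration; the slide does not state the bit ordering.

```python
from typing import Sequence


def lut_output(lut_bits: Sequence[int], inputs: Sequence[int]) -> int:
    """Return the LUT bit addressed by the inputs.

    Assumption: the first input is the most significant address bit.
    """
    if len(lut_bits) != 2 ** len(inputs):
        raise ValueError("a k-input LUT needs 2**k bits")
    index = 0
    for bit in inputs:
        index = (index << 1) | (bit & 1)
    return lut_bits[index]


# LUT bits taken from the mode shown on the slide: <<[0, 1, 1, 0], 1>>
LUT_BITS = [0, 1, 1, 0]

# Enumerate all input combinations of the 2-input LUT.
truth_table = {(a, b): lut_output(LUT_BITS, (a, b)) for a in (0, 1) for b in (0, 1)}
```

With these four bits the table is the exclusive-OR of the two inputs, and because XOR is symmetric the result does not depend on the assumed bit ordering.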
FPGA Performance Model

module fpga : fpga_mode := []
  modules
    clbs{}: clb;
  ports
    inputs{}: ioport;
    outputs{}: ioport;
  rules
    clbs{x}'mode = mode[x'id];
    clbs{}'trigger = trigger;
  attributes
    primitive id: int;
    time: real;
    qdynamic clock_period: real := 0.0;
  rules
    inputs{}'time = 0;
    time = max_real(foreach x:clb in clbs {x'time});
    clock_period = max_real([time, curr clock_period]);
endmodule;

port ioport
  attributes
    time: real;
endport;

module lut_bit : boolean := 0
  attributes
    primitive id: int;
endmodule;

type
  mode_list[]: boolean;
  clb_mode: record
    fg_mode : mode_list;
    ff_mode : boolean;
  endrecord;
  fpga_mode[]: clb_mode;
endtype;

module function_generator : mode_list := []
  modules
    lut{}: lut_bit;
  ports
    inputs{}: ioport;
    output : ioport;
  attributes
    primitive delay_per_lut_bit: real;
    delay: real;
  rules
    lut{x}'mode = mode[x'id];
    lut{x}'trigger = trigger;
    delay = #lut * delay_per_lut_bit;
    output'time = max_real(foreach x:ioport in inputs {x'time}) + delay;
endmodule;

module flip_flop : boolean := 0
  ports
    input, output: ioport;
  attributes
    primitive delay: real;
    time: real;
  rules
    output'time = 0.0;
    time = input'time + delay;
endmodule;

module multiplexer
  ports
    input1, input2, output: ioport;
  attributes
    primitive delay: real;
    select: boolean;
  rules
    output'time = if select then input1'time + delay
                  else input2'time + delay endif;
endmodule;

module clb : clb_mode := <<[], 0>>
  modules
    fg : function_generator;
    ff : flip_flop;
    mux: multiplexer;
  ports
    inputs{}: ioport;
    output : ioport;
  attributes
    primitive id: int;
    time: real;
  rules
    fg'mode = mode.fg_mode;
    ff'mode = mode.ff_mode;
    ff'trigger = trigger;
    fg'trigger = trigger;
    mux'select = curr ff'mode;
    time = ff'time;
endmodule;
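The timing rules in the PDL+ model can be read as a small calculation. The following Python sketch mirrors them under stated assumptions; it is not ARC and not PDL+ semantics. Assumptions: the number of LUT bits equals the length of `fg_mode`, the function generator output feeds the flip-flop input (the model leaves that wiring to the structure specification), and all delay values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClbMode:
    fg_mode: List[int]  # LUT bits of the function generator
    ff_mode: bool       # flip-flop mode


@dataclass
class ClbParams:
    delay_per_lut_bit: float  # hypothetical value, arbitrary time unit
    ff_delay: float           # hypothetical value, arbitrary time unit


def clb_time(mode: ClbMode, params: ClbParams, input_time: float = 0.0) -> float:
    # delay = #lut * delay_per_lut_bit (assumes #lut == len(fg_mode))
    fg_delay = len(mode.fg_mode) * params.delay_per_lut_bit
    fg_output_time = input_time + fg_delay
    # time = ff'time = input'time + delay (assumes fg output drives ff input)
    return fg_output_time + params.ff_delay


class FpgaModel:
    def __init__(self, params: ClbParams) -> None:
        self.params = params
        self.clock_period = 0.0  # mirrors `clock_period: real := 0.0`

    def evaluate(self, modes: List[ClbMode]) -> float:
        # time = max over all CLB times; FPGA inputs arrive at time 0
        time = max((clb_time(m, self.params) for m in modes), default=0.0)
        # clock_period = max(time, current clock_period)
        self.clock_period = max(time, self.clock_period)
        return time


fpga = FpgaModel(ClbParams(delay_per_lut_bit=0.5, ff_delay=1.0))
t_a = fpga.evaluate([ClbMode([0, 1, 1, 0], True),
                     ClbMode([0, 0, 0, 1, 0, 1, 1, 1], False)])
t_b = fpga.evaluate([ClbMode([0, 1], False)])
```

After the two evaluations, `fpga.clock_period` holds the largest delay seen across both configurations, which is what the running maximum over `curr clock_period` expresses.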
RC Coprocessor
[Figure: four FPGAs, each with a MEMORY, joined by an INTERCONNECT]
• Maximum clock speed for a given board configuration.
• Critical path delay (clock speed) for a given configuration.
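Critical path delay and clock speed are related by frequency = 1 / delay. As an illustration only, the Python sketch below computes the longest-path delay through a directed acyclic graph of board elements and converts it to a clock frequency. The element names, delays, and connectivity are hypothetical and are not taken from any real board; `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set


def critical_path_delay(delays: Dict[str, float], preds: Dict[str, Set[str]]) -> float:
    """Longest-path delay through a DAG; preds maps a node to its predecessors."""
    arrival: Dict[str, float] = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        start = max((arrival[p] for p in preds.get(node, ())), default=0.0)
        arrival[node] = start + delays[node]
    return max(arrival.values(), default=0.0)


def max_clock_mhz(delay_ns: float) -> float:
    # f [MHz] = 1000 / T [ns]
    return 1000.0 / delay_ns


# Hypothetical delays in nanoseconds for one board configuration.
delays_ns = {"fpga1": 10.0, "interconnect": 5.0, "fpga2": 25.0, "memory": 10.0}
preds = {"interconnect": {"fpga1"}, "fpga2": {"interconnect"}, "memory": {"fpga1"}}

delay = critical_path_delay(delays_ns, preds)
clock = max_clock_mhz(delay)
```

Here the longest path runs through `fpga1`, `interconnect`, and `fpga2`, so that path sets the clock for this configuration.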
ACS Architecture
[Figure labels: Processor; Memory; Reconfigurable Co-Processor]
Codesign Application
• Software Task
  • Instruction Sequence
• Hardware Task
  • Configuration Data
  • Time Steps
• Channel
  • Bandwidth
[Figure: task graph with Task1 through Task7, each labeled SW or HW, and channels between tasks]
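One way to picture how a hardware/software binding could be evaluated from these attributes is a simple additive latency estimate. The Python sketch below is an assumption-laden illustration, not ARC's evaluator: it assumes tasks run one after another, software time is instruction count times a fixed instruction time, hardware time is time steps times the clock period, and channel time is data volume divided by bandwidth. Reconfiguration time for the configuration data is not modeled. All class names, field names, and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SoftwareTask:
    name: str
    instructions: int  # length of the instruction sequence


@dataclass
class HardwareTask:
    name: str
    time_steps: int  # schedule length in clock cycles


@dataclass
class Channel:
    src: str
    dst: str
    data_bytes: int
    bandwidth_bytes_per_us: float


Task = Union[SoftwareTask, HardwareTask]


def estimate_latency_us(tasks: List[Task], channels: List[Channel],
                        instr_time_us: float, clock_period_us: float) -> float:
    """Sum task and channel times, assuming strictly sequential execution."""
    total = 0.0
    for task in tasks:
        if isinstance(task, SoftwareTask):
            total += task.instructions * instr_time_us
        else:
            total += task.time_steps * clock_period_us
    for ch in channels:
        total += ch.data_bytes / ch.bandwidth_bytes_per_us
    return total


# Hypothetical binding: Task1 in software, Task2 in hardware.
tasks = [SoftwareTask("Task1", instructions=400), HardwareTask("Task2", time_steps=50)]
channels = [Channel("Task1", "Task2", data_bytes=64, bandwidth_bytes_per_us=16.0)]
latency = estimate_latency_us(tasks, channels, instr_time_us=0.25, clock_period_us=0.5)
```

Comparing this estimate across alternative bindings is one simple way to approach the question of how a proposed hardware/software binding performs.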
ARC Demonstrations
• Small Scale Examples:
  • LUT-based FPGAs
  • Context-switching FPGAs
  • Programmable interconnects.
  • Simple processor and memory models.
  • ACS Software performance models.
• Large Scale Demonstrations:
  • Xilinx 4000 series FPGA performance model.
  • AMS WildForce RC models for use in the SPARCS partitioning and synthesis system.
SPARCS: Synthesis and Partitioning System for RCs
• Specifications: Behavioral-Level Specification, RT-Level Specification, Gate-Level Specification
• Synthesis: High-Level Synthesis (UC), Logic Synthesis (Synopsys), Layout Synthesis (UC/Xilinx), Bitstreams
• Partitioning System: Temporal Partitioning, Spatial Partitioning
• Light-Weight Behavioral/Logic/Layout Synthesis Algorithms
• Architecture Specification
• ARC - ACS Performance Analysis
• AMS WildForce Board
Further Information
Visit http://www.ececs.uc.edu/~ddel/arc.html
Would you like the software? Do you have ideas for demonstrations?
Please contact: ranga.vemuri@uc.edu, 513-556-4784