Architectural Improvement for Field Programmable Counter Array: Enabling Efficient Synthesis of Fast Compressor Trees on FPGA
Alessandro Cevrero (1,2), Panagiotis Athanasopoulos (1,2), Hadi Parandeh-Afshar (2), Philip Brisk (2), Frank K. Gurkaynak (1), Ajay K. Verma (2), Yusuf Leblebici (1), Paolo Ienne (2)
16th ACM/SIGDA International Symposium on FPGAs, Monterey, California, USA, February 26, 2008
Motivation and Contribution
Goal: Improve FPGA performance for arithmetic circuits.
Field Programmable Counter Array (FPCA) [Brisk et al., DAC 2007]:
• Programmable IP core to accelerate compressor trees
• Hybrid FPGA/FPCA device
Contributions:
• Completely new FPCA architecture
• Reduced routing delay
• More flexibility and better mapping
• Simplified integration process
FPGA Commentary
Logic cells with dedicated addition circuitry and fast carry chains
• Support for ternary addition [Altera Stratix II/III, Xilinx Virtex-5]
• Parallel accumulation uses adder trees
• ASIC designers use compressor trees!
Compressor tree synthesis on FPGAs via GPC mapping [Parandeh-Afshar et al., ASP-DAC 2008, DATE 2008]
• Faster than ternary adder trees
IP Cores
• DSP48, BlockRAM, etc. [Xilinx, Altera]
• FP cores [Beauchamp et al., TVLSI 2008]
• Mismatches in bitwidth limit gains [Kuon and Rose, FPGA 2006, TCAD 2007]
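
For readers unfamiliar with the GPC mapping mentioned above, the following is a minimal behavioral sketch of a generalized parallel counter (GPC): a block that takes bits from several adjacent columns and outputs their weighted sum in binary. The function name `gpc` and its list-of-columns interface are assumptions made for illustration; they are not taken from the cited papers.

```python
def gpc(columns):
    """Behavioral model of a generalized parallel counter (hypothetical helper).

    columns[k] is a list of input bits of weight 2**k.
    Returns the weighted sum of all inputs as a list of bits, LSB first.
    """
    total = sum(sum(col) << k for k, col in enumerate(columns))
    # The output width is set by the largest value the inputs can produce.
    max_total = sum(len(col) << k for k, col in enumerate(columns))
    width = max(1, max_total.bit_length())
    return [(total >> j) & 1 for j in range(width)]


# A single-column GPC is an ordinary counter, e.g. a 3:2 counter (full adder).
full_adder_out = gpc([[1, 0, 1]])

# A two-column GPC: three bits of weight 1 and three bits of weight 2.
two_column_out = gpc([[1, 1, 1], [1, 1, 1]])
```

A tree of such blocks reduces many input bits to two rows without propagating carries across the full word width, which is why only one carry-propagate addition is needed at the end.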
Methodology and Solution
• Transform circuit to merge disparate addition and multiplication operations to expose compressor trees [Verma and Ienne, ICCAD 2004]
• Synthesize compressor tree onto FPCA [Brisk et al., DAC 2007]
• Map everything else onto traditional FPGA (standard approach)
• Integrate FPGA+FPCA onto same die (ongoing research at EPFL)
FPCA: programmable compressor tree
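
To illustrate what merging a multiplication and an addition into one compressor tree means, here is a generic sketch that places the partial products of a*b and the bits of c into shared columns, reduces the columns with 3:2 counters, and leaves a single final carry-propagate addition. This is a textbook-style illustration under my own assumptions, not the transformation algorithm of the cited work; the names `bit_heap`, `reduce_to_two_rows`, and `evaluate` are hypothetical.

```python
def bit_heap(a, b, c, width):
    """Columns of bits for a*b + c; column i holds bits of weight 2**i."""
    cols = [[] for _ in range(2 * width)]
    for i in range(width):
        for j in range(width):
            # Partial product bit a_i AND b_j has weight 2**(i + j).
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    for i in range(width):
        cols[i].append((c >> i) & 1)
    return cols


def reduce_to_two_rows(cols):
    """Apply 3:2 counters per column until no column holds more than 2 bits."""
    while max(len(col) for col in cols) > 2:
        nxt = [[] for _ in range(len(cols) + 1)]
        for i, col in enumerate(cols):
            k = 0
            while len(col) - k >= 3:
                s = col[k] + col[k + 1] + col[k + 2]
                nxt[i].append(s & 1)       # sum bit stays in column i
                nxt[i + 1].append(s >> 1)  # carry bit moves to column i + 1
                k += 3
            nxt[i].extend(col[k:])         # leftover bits pass through
        cols = nxt
    return cols


def evaluate(cols):
    """Final carry-propagate addition, modeled as a weighted sum."""
    return sum(sum(col) << i for i, col in enumerate(cols))


reduced = reduce_to_two_rows(bit_heap(13, 11, 7, width=4))
result = evaluate(reduced)  # equals 13 * 11 + 7
```

The point of the merge is that the multiplier's partial products and the extra addend share one reduction tree and one final adder, instead of a multiplier followed by a separate adder.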
Previous Work
Initial FPCA architecture [Brisk et al., DAC 2007]
• Routing network delay
  • Performance bottleneck
• Poor area utilization
  • Many resources unused
  • Large counters implement the functionality of smaller counters
• “Pitch matching” problem
  • FPCA routing channels must align with FPGA routing channels
  • Leads to unnecessarily large counters
Recurring Patterns in Compressor Tree Synthesis
[Figure: counter chain 15 -> 15:4 -> 4 -> 4:3 -> 3 -> 3:2 -> 2 -> CPA]
New FPCA architecture:
• Counter Slice (CSlice)
  • Compress one column at a time
  • Propagate carry bits to neighboring CSlices
• Eliminates FPGA-style routing network
  • No routing delay between counters
  • Pitch matching problem disappears
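
The 15 -> 15:4 -> 4 -> 4:3 -> 3 -> 3:2 -> 2 -> CPA pattern can be checked with a small behavioral model. The sketch below assumes up to 15 input bits per column and assumes that output bit k of the counter in column i feeds column i+k of the next stage; both are my reading of the slide, not a description of the actual CSlice netlist. The function names are hypothetical.

```python
def counter(bits):
    """m:n counter: population count of `bits` as a list of bits, LSB first."""
    total = sum(bits)
    width = max(1, len(bits).bit_length())
    return [(total >> k) & 1 for k in range(width)]


def compress_layer(columns):
    """One counter per column; output bit k of column i lands in column i + k."""
    out = [[] for _ in range(len(columns) + 4)]
    for i, col in enumerate(columns):
        if not col:
            continue
        for k, bit in enumerate(counter(col)):
            out[i + k].append(bit)
    while out and not out[-1]:
        out.pop()  # drop unused high-order columns
    return out


def cslice_chain(columns):
    """Model of the 15:4 -> 4:3 -> 3:2 -> CPA chain over adjacent columns."""
    assert all(len(col) <= 15 for col in columns)
    layer = columns
    for max_height in (4, 3, 2):
        layer = compress_layer(layer)
        # Each stage leaves at most max_height bits per column.
        assert all(len(col) <= max_height for col in layer)
    # The final carry-propagate adder (CPA) is modeled as a weighted sum.
    return sum(sum(col) << i for i, col in enumerate(layer))


# Eight columns, each with 15 input bits set: 15 * (2**8 - 1) = 3825.
total = cslice_chain([[1] * 15 for _ in range(8)])
```

Because every column sees at most four bits after the first stage, three after the second, and two after the third, the same fixed chain of counters works for every column, which is what makes hardwired connections between neighboring slices possible.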
FPCA v2.0: CSlice Architecture
[Figure: four adjacent CSlices producing sum bits S_i, S_i+1, S_i+2, S_i+3; each CSlice chains a 15:4, a 4:3, and a 3:2 counter into a CPA. Figure labels: Area Utilization, Configurable GPC.]
FPCA v2.0 Mapping Heuristic
FPCA synthesis heuristic:
• Map columns of input bits onto FPCA
• Minimize the height of the compressor tree
• Avoid vertical configurations, when possible
[Figure: multi-FPCA configurations, horizontal and vertical; figure label: Routing Delay]
CSlice Synthesis
FPCA synthesis: 90nm Artisan standard cell library
• Rank-3 CSlices used in experiments
• 8 CSlices per FPCA
  • Similar to dimensions of a DSP block in current FPGAs
  • Simplifies integration process
• DFFs store configuration bitstream
• Semi-custom design
  • Standard cells are predominant
[Figure: CSlice v2.0, rank-3, with 16 input bits per CSlice]
FPCA Delay Extraction
[Figure: SUM operations replaced by F* and FPCA blocks between input pins and output pins]
Methodology:
• Each FPCA instance is replaced with an F* instance (same I/O)
• Extract the delay between F* instances
• Combine these delays with the combinational delay extracted for the FPCA
In detail:
• Define a pre-placed soft IP core: F*
  • Same dimensions and I/O as FPCA
• Map onto Stratix II FPGA
• Extract critical path delay
• Replace all sum operations with F*
• Map compressor tree onto FPCA
• Configuration DFF values set to constant values; not optimized
• Measure critical path delay
• For each compressor tree in the circuit
  • Subtract delay of F*
  • Add FPCA delay
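
The final subtract-and-add step amounts to simple arithmetic, sketched below. The function name, its parameters, and all delay values are hypothetical placeholders chosen for illustration; none of them are measurements from the presentation, and the sketch assumes every compressor tree on the critical path has the same F* and FPCA delays.

```python
def estimate_critical_path(path_delay_with_fstar_ns,
                           fstar_delay_ns,
                           fpca_delay_ns,
                           trees_on_path=1):
    """Swap the F* placeholder delay for the FPCA delay on a critical path.

    path_delay_with_fstar_ns: critical path measured on the FPGA with F* cores
    fstar_delay_ns:           delay through one F* soft core
    fpca_delay_ns:            combinational delay extracted for one FPCA
    trees_on_path:            number of compressor trees on the critical path
    """
    return path_delay_with_fstar_ns + trees_on_path * (fpca_delay_ns - fstar_delay_ns)


# Hypothetical numbers: 10.0 ns path, 3.0 ns through F*, 1.5 ns through FPCA.
estimated_ns = estimate_critical_path(10.0, 3.0, 1.5)
```

The F* placeholder exists so that the FPGA tools account for the routing to and from a block with the FPCA's footprint and pins, while the block's internal delay is taken from the standard-cell implementation.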
Experimental Results
Comparison:
• GPC mapping [Parandeh-Afshar et al., ASP-DAC 2008]
• FPCA mapping (6 FPCAs per device)
[Figure: results chart comparing the two mappings; annotated values: 1.60x and 2.40x]
Conclusion
• New FPCA architecture
  • Hardwired connections between counters
  • Counters of multiple sizes organized into CSlices
  • Carry chains between CSlices
• Avg./Max. speedups of 1.60x/2.40x compared to GPC mapping
Future Work
• Add pipeline registers to FPCA
  • Increase latency, increase clock frequency, throughput
• Demonstrator chip taped out in October 2007
  • Returned from the foundry in January 2008; PCBs ready next week
  • Measure power consumption, clock frequency, I/O interface, etc.