Design Methodology for High-Density FPGA Design
• Selecting an Architecture
• High-Density Software Methodology
• Implementation and Integration of Cores

System Level FPGA: Spread Spectrum Frequency Channel Allocation Design
[Block diagram: V400 FPGA containing Transmitter, Channel Interface, PCI, CPU and Software, Channel Manager, Spectral Analysis, and A/D blocks]

Challenges of High-Density FPGA Design
• How to Implement?
• What Architecture?
• Software Access to Architectural Features?
• Verification Strategy?
• Use IP Cores?
[Block diagram: Virtex V400 FPGA containing Transmitter, Channel Interface, PCI, CPU and Software, Channel Manager, Spectral Analysis, and A/D blocks]
Agenda
• Selecting an architecture
  • system level FPGA
  • Smart-IP technology
• High-density FPGA software methodology
  • design flow
  • accessing the architecture-specific features
  • design verification
• Implementation and integration of cores
  • CORE Generator
  • LogiCORE
  • AllianceCORE
  • design services
• Software demo
• Roadmap
System-Level FPGA
• Integrates with software tools?
• High performance I/O standards?
• Million system gates?
• Performance?
  • 100 MHz
• Memory?
  • SRAM, FIFO
• IP friendly?
  • 133 MHz SDRAM, 1 Gbit Ethernet, 66 MHz PCI
[Block diagram: Transmitter, Channel Interface, PCI, CPU and Software, Channel Manager, Spectral Analysis, and A/D blocks]
Xilinx Smart-IP Technology
• Xilinx Smart-IP Technology
  • architectures tailored to cores
  • intelligent software implementation
  • flexible core technology
• Delivers:
  • high predictability
  • high performance
  • high flexibility
Only available from Xilinx

Xilinx Smart-IP Technology
Architecture Tailored to Accept Cores
[Figure: Core1 and Core2 on Xilinx segmented routing, compared with non-segmented routing]
• Advantages
  • Efficient Routing
  • Predictable Timing
  • Low Power

Xilinx Smart-IP Technology
Architecture Tailored to Accept Cores
Distributed Memory: Local RAM available to the Core
• Advantages
  • Portable RAM-based cores
  • 16x improved logic efficiency
  • High-performance cores
Xilinx Smart-IP Technology
Pre-defined Placement & Routing
[Figure labels, in slide order: Fixed Placement & Pre-defined Routing; Fixed Placement; Relative Placement; I/Os; Guarantees Performance; Guarantees I/O & Logic Predictability; Other Logic Does Not Affect the Core; Enhances Performance & Predictability]
Xilinx Smart-IP Technology Delivers Design Predictability
Performance is independent of core placement and number of cores used in the device
[Figure: device with four core labels, each marked 80 MHz]
Avoids the performance loss of non-segmented architectures

Xilinx Smart-IP Technology Delivers Design Predictability
Performance is independent of device size
Avoids the performance loss of non-segmented architectures
System Level FPGA: Virtex Enables
• Integrates with software tools?
• High performance I/O standards?
• Million system gates?
• Performance?
  • 100 MHz
• Memory?
  • SRAM, FIFO
• IP friendly?
  • 133 MHz SDRAM, 1 Gbit Ethernet, 66 MHz PCI
[Block diagram: Virtex V400 FPGA containing Transmitter, Channel Interface, PCI, CPU and Software, Channel Manager, Spectral Analysis, and A/D blocks]
The Value of Xilinx Partnerships
The most comprehensive "Open System" solution
• Early software support for new devices
• New product development maximizing architectural and synthesis capabilities
  • efficient timing constraints integration
  • high performance optimization engines tuned for new Xilinx devices
  • direct optimization & mapping of carry logic, complex I/O, LUTs, CE, arithmetic operators
• Joint definition of next-generation solutions
Design Flow
[Flow diagram with three stages: Design Entry, Design Implementation, and Design Verification. Design Entry: Source Code, Symbol/HDL, Design Reuse (AllianceCORE, LogiCORE), Top Level HDL or Schematic, HDL Editor, Schematic Entry, Synthesis (user design only). Design Implementation: Netlists and Constraints into Place & Route, targeting the Xilinx FPGA. Design Verification: Functional Simulation, Timing Simulation (Sim. Model), Static Timing Analysis]
Software Features (ASIC-Like)
• Minimum-delay reporting
  • hold-time analysis
  • finds hazards in asynchronous logic
  • min delay option "-s min" for TRCE and NGDANNO
• Voltage and temperature prorating
  • can specify a higher voltage than worst case
    • specify 3.3V instead of 3.0V
  • can specify a lower temperature than worst case
    • specify 55°C instead of 85°C
• First SRAM-based device to support temperature & voltage prorating and minimum delays
XC4000XL family supported in A1.5, Virtex to follow
Minimum Delay
System-Level Analysis
• Internally, Xilinx guarantees 0 ns hold times
• Identify board-level hold time violations for synchronous designs
[Timing diagram: flip-flop Inst_A in the FPGA drives Q to the D input of an SDRAM, both clocked by System Clock; labels: Flip-Flop, Hold time 1 ns, SDRAM 1 ns]
• With max tco (for Inst_A) = 5 ns: valid data on Q for worst case delay
• With min tco (for Inst_A) = 2 ns: hold time violation for best case delay, data not latched
Temperature and Voltage Prorating
• Delays based on worst case process
• Adjust temperature and voltage to reflect system operating conditions
• Reduce system cost by targeting a slower speed grade

Parameter [ns]                     Internal Period   Clock-to-Out   Input Setup
XLA-09, V = 3.3V, T = 70°C         9.4               3.9            5.4
System Requirements, 3.3V, 70°C    10.0              4.0            6.0
XLA-08, V = 3.0V, T = 85°C         10.6              4.2            5.8
XLA-08, V = 3.3V, T = 70°C         9.0               3.7            5.2

Lowest Cost Meets Requirements
A1.5: 1 Million Gates in Less Than 5 Hours
Compile Times: Gates Per Hour
• New place & route algorithms
• Abundant & flexible vector-based interconnect
  • 4x routing resource vs XC4000XL
  • fully populated switch matrix
• Buffering of high fanout and long distance interconnects
  • 8 ns across 250K system gates
• Up to 40% smaller interface netlist
[Bar chart: Timing Driven Implementation, gates per hour on a 0 to 200k axis; bars labeled 35k, 50k, and 200k gates/hour; x-axis labels A1.4 XC4000XL and A1.5 XC4000XL]

Faster Compiles with Virtex
"Tough" Customer Designs
[Chart: Compile Time (minutes, 0 to 800) versus Design Suite (designs 1 to 18), comparing Virtex -4 with XC4000XL-09]
Virtex compiles, on average, 28 times faster

Faster Systems with Virtex
"Tough" Customer Designs
• Faster Virtex speeds with silicon-characterized speed files
• Virtex is faster for 84% of the designs
• Designs from ATM, PCI, Networking & ISDN applications
[Chart: Normalized Clock Speed (100% to 200%) versus Design Suite (designs 1 to 19), comparing Virtex -4 with XC4000XL-09]
Accessing Technology-Specific Features
• By inference
  • technology mapping using behavioral constructs that allow code portability
    • operators
    • RAM
• By instantiation
  • use gates in the target technology, making the code technology specific
    • Block RAM
    • CLKDLL
    • special I/Os

Inferring Technology-Specific Features
• Fast arithmetic carry chains
• Wide input muxes, "case vs. priority encoder"
• RTL flexibility for register configurations
• Area-efficient muxes using TBUFs
• Distributed RAM inferencing
• Registered I/O buffer inference
• Timing-driven register IOB mapping
Fast Arithmetic Functions Using Carry Chains
• 180 MHz 32-bit arithmetic/counters
• Small 16-bit adders using 16 LUTs
  • 51 for XC4000XL
• 60 MHz 16x16 multipliers
  • 30% area reduction compared to XC4000XL
  • 160 MHz with pipeline stages
• Operator inferencing from synthesis
• Pipelined multipliers from the CORE Generator tool
[Figure: Virtex logic block with a carry chain running through four LUTs, each followed by a 0/1 carry mux]
    if (!reset) count = 32'b0;
    else count = count + 1'b1;
    Sum = a_in + b_in;
    mult = a_in * b_in;
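The operator fragments on this slide can be placed in a complete module. The following is a minimal sketch, not the presenter's code: the module name `carry_chain_demo`, the port names, and the 16-bit operand width are assumptions chosen for illustration, and whether a given synthesis tool maps these operators onto the carry chain depends on the tool and its settings.

```verilog
// Hypothetical module: names and widths are illustrative, not from the slides.
module carry_chain_demo (
    input  wire        clk,
    input  wire        reset,   // active-low synchronous reset, as in the slide fragment
    input  wire [15:0] a_in,
    input  wire [15:0] b_in,
    output reg  [31:0] count,
    output wire [16:0] sum,     // one extra bit keeps the carry-out
    output wire [31:0] mult
);
    // "+ 1" on a register: a synthesis tool would typically infer a counter here.
    always @(posedge clk) begin
        if (!reset)
            count <= 32'b0;
        else
            count <= count + 1'b1;
    end

    // Plain operators; the tool chooses the adder and multiplier structure.
    assign sum  = a_in + b_in;
    assign mult = a_in * b_in;
endmodule
```

Writing the arithmetic as operators rather than instantiated gates keeps the code portable, which is the point the "By inference" slide makes.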
Priority Encoder "if-then-else"
When to use?
• Assign highest priority to a late arriving critical signal
• Nested "if-then-else" might increase area and delay
• Use "case" statement if possible to describe the same function
    always @(sel or in)
    begin
      if (sel == 3'h0) out = in[0];
      else if (sel == 3'h1) out = in[1];
      else if (sel == 3'h2) out = in[2];
      else if (sel == 3'h3) out = in[3];
      else if (sel == 3'h4) out = in[4];
      else out = in[5];
    end
[Figure: chain of 2:1 muxes over inputs in[0] through in[4], each stage controlled by a select signal S]
Benefits of "Case" Statement
8:1 Mux
• Compact and delay-optimized implementation
• Synthesis maps to MUXF5 and MUXF6 functions
• 8:1 multiplexer is implemented in a single CLB
    // S is a 3-bit select; every data input appears in the sensitivity list
    always @(C or D or E or F or G or H or I or J or S)
    begin
      case (S)
        3'b000 : Z = C;
        3'b001 : Z = D;
        3'b010 : Z = E;
        3'b011 : Z = F;
        3'b100 : Z = G;
        3'b101 : Z = H;
        3'b110 : Z = I;
        default : Z = J;
      endcase
    end
[Figure: 8:1 mux with data inputs C through J, select S, and output Z]
RTL Flexibility for Register Configurations
Positive-edge triggered flip-flop with clock enable, sync reset, and async preset
    always @(posedge clk or posedge preset)
    begin
      if (preset)     q <= 1;      // asynchronous preset (in the sensitivity list)
      else if (reset) q <= 0;      // synchronous reset
      else if (ce)    q <= data;   // clock enable
    end
[Figure: flip-flop with pins preset, data, ce, clk, reset, and output q]
• Register mapping for
  • registers with sync/async set and reset
  • clocks, inverted clocks, and clock enable
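The register variations listed above (sync versus async reset, inverted clock, clock enable) differ only in the sensitivity list and the order of the conditions. The sketch below shows three of them side by side; the module name `reg_styles` and all port names are hypothetical, chosen for illustration only.

```verilog
// Hypothetical module showing three register styles side by side.
module reg_styles (
    input  wire clk,
    input  wire rst,
    input  wire ce,
    input  wire d,
    output reg  q_async,   // asynchronous reset, clock enable
    output reg  q_sync,    // synchronous reset, clock enable
    output reg  q_neg      // falling-edge (inverted) clock
);
    // rst in the sensitivity list makes the reset asynchronous.
    always @(posedge clk or posedge rst)
        if (rst)     q_async <= 1'b0;
        else if (ce) q_async <= d;

    // rst sampled only on the clock edge makes the reset synchronous.
    always @(posedge clk)
        if (rst)     q_sync <= 1'b0;
        else if (ce) q_sync <= d;

    // negedge describes a register on the inverted clock.
    always @(negedge clk)
        q_neg <= d;
endmodule
```

Asserting `rst` without a clock edge clears only `q_async`; `q_sync` waits for the next rising edge.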
Area Efficient Muxes Using TBUFs
• Improve area efficiency by using tri-states
• Each CLB has 2 TBUFs
• Place-and-route can connect tri-states on multiple horizontal Longlines to build wide muxes
Mux described with a "case" statement:
    case (E)
      4'b0001 : Q[7:0] = A[7:0];
      4'b0010 : Q[7:0] = B[7:0];
      4'b0100 : Q[7:0] = C[7:0];
      4'b1000 : Q[7:0] = D[7:0];
    endcase
[Figure: 4:1 mux with inputs A[7:0], B[7:0], C[7:0], D[7:0], select E[3:0], and output Z[7:0]]
Same function described with tri-state drivers:
    // 8'bz drives all eight bits to high impedance
    assign Q[7:0] = E0 ? A[7:0] : 8'bz;
    assign Q[7:0] = E1 ? B[7:0] : 8'bz;
    assign Q[7:0] = E2 ? C[7:0] : 8'bz;
    assign Q[7:0] = E3 ? D[7:0] : 8'bz;
[Figure: four tri-state buffers enabled by E0 through E3, driving A[7:0], B[7:0], C[7:0], and D[7:0] onto the shared bus Z[7:0]]
Distributed RAM Inferencing
System Memory
• Synplify and Leonardo Spectrum can infer distributed RAM
• FPGA Express will support RAM inferencing in the future
    module ramtest(q, addr, d, we, clk);
      output [3:0] q;
      input  [3:0] d;
      input  [2:0] addr;
      input  we;
      input  clk;
      reg [3:0] mem [7:0];
      assign q = mem[addr];
      always @(posedge clk) begin
        if (we) mem[addr] = d;
      end
    endmodule
[Figure: Synplicity result (RAM 8x4) built from RAM16x1S primitives; ports q[3:0], Addr[2:0], D[3:0], clk, we; each RAM16x1S has pins A0, A1, A2, A3, D, WCLK, WE, and Q]
Registered I/O Mapping
System Interfaces
• System timing
  • chip-to-chip performance often limits system speeds
  • registered I/O improves performance
• No need to instantiate IOB register cells
  • implementation tools will pack registers in the IOBs
  • map -pr b
    • b (both input and output)
    • i (input only)
    • o (output only)
  • IOB = TRUE attribute
• Mapping for data and enable ports
[Figure: IOB registers with D, Q, CE, CLK, and S/R pins; two feed an OBUF (data and enable) and one is fed by an IBUF]
Controlling the Inference of Output Registers
• Technology mapping will not duplicate registers
• Critical signal will not be absorbed in the IOB register
    -- Dout drives the OUT[23:0] pins ("out" is a reserved word in VHDL)
    process (Clk)
    begin
      if (Clk'event and Clk = '1') then
        Tri_R <= Tri;
      end if;
    end process;

    process (Tri_R, Data_in)
    begin
      if (Tri_R = '1') then
        Dout <= Data_in;
      else
        Dout <= (others => 'Z');
      end if;
    end process;
[Figure: register TRI to TRI_R clocked by CLK, with fanout = 24, enabling the tri-state buffers between DATA[23:0] and OUT[23:0]]

Controlling the Inference of Output Registers
• Duplicates register on critical path for fanout of 1
• Mapping will absorb register in IOB
    -- Dout drives the OUT[23:0] pins ("out" is a reserved word in VHDL)
    process (Clk)
    begin
      if (Clk'event and Clk = '1') then
        Tri_R1 <= Tri;
        Tri_R2 <= Tri;
      end if;
    end process;

    process (Tri_R1, Data_in)
    begin
      if (Tri_R1 = '1') then
        Dout(23) <= Data_in(23);
      else
        Dout(23) <= 'Z';
      end if;
    end process;

    process (Tri_R2, Data_in)
    begin
      if (Tri_R2 = '1') then
        Dout(22 downto 0) <= Data_in(22 downto 0);
      else
        Dout(22 downto 0) <= (others => 'Z');
      end if;
    end process;
[Figure: register TRI to TRI_R1 with fanout = 1 enabling DATA[23] to OUT[23]; register TRI to TRI_R2 with fanout = 23 enabling DATA[22:0] to OUT[22:0]; both clocked by CLK]
Instantiating Technology-Specific Features
• Block RAM
  • system memory
• CLKDLL
  • minimizes clock skew
• Special I/Os
  • interfacing with standard buses
• LUTs for datapath pipelining
  • add latency with minimal area impact
Block RAM
System Memory
• Instantiate single- and dual-port RAM
• Use the CORE Generator to build RAM and FIFO (Q1 '99)
VHDL
    component RAMB4_S1
      port (WE, EN, RST, CLK : in STD_LOGIC;
            ADDR : in STD_LOGIC_VECTOR(11 downto 0);
            DO   : out STD_LOGIC;
            DI   : in STD_LOGIC_VECTOR(0 downto 0));
    end component;
    begin
      U1 : RAMB4_S1 port map (WE => WE, EN => EN, RST => RST, CLK => CLK,
                              DI => DI, ADDR => ADDR, DO => DO);
Verilog
    RAMB4_S1 U1 (.WE(WE), .EN(EN), .RST(RST), .CLK(CLK),
                 .ADDR(ADDR), .DI(DI), .DO(DO));
[Figure: RAMB4_S1 symbol with signals addr, we, en, rst, clk, di, and do on pins ADDR, WE, EN, RST, CLK, DI, and DO]
CLKDLL: Minimize Clock-to-Out
System Timing
• One use of a CLKDLL is to minimize clock-to-outpad delay
  • removes all delay from external GCLKPAD pin to the registers and RAM
• BUFGDLL is available for instantiation
• Other configurations can be built by instantiating the CLKDLL macro
[Figure: clkin through IBUFG into CLKDLL (pins CLKIN, CLKFB, RST, CLK0, CLK90, CLK180, CLK270, CLK2X, CLKDV, LOCKED), whose output drives a BUFG that produces clk_fb; instance U4]
Verilog
    wire clk_fb;
    BUFGDLL U4 (.I(clkin), .O(clk_fb));
Special I/O Buffers
System Interfaces
• Default I/O buffer is LVTTL (12 mA), available via inference
  • process technology leads to mixed voltage systems
  • high-performance, low-power signal standards emerging
• Instantiate I/O buffers for
  • non-default current drive
  • non-default voltage standard
  • non-default slew
• Accelerated Graphics Port (AGP) bus interface (Pentium II graphics application)
    OBUF_AGP U0 (.I(awire), .O(oport));
• Fast slew rate and 24 mA drive strength
    OBUF_F_24 U1 (.I(awire), .O(oport));
[Figure: output buffers U0 and U1, each driving awire to oport]
LUTs for Datapath Pipelining
• LUT can be used in place of registers to balance pipeline stages
  • area-efficient implementation
• SRL16E can delay an input value up to 16 clock cycles
• Synchronized operands before the next operation
[Figure: datapath with 32-bit operands A[31:0], B[31:0], and C[31:0] passing through blocks F, G, and H (5 cycles, 1 cycle, 8 cycles) to Z. An SRL16E with address 7 (pins D, CE, CLK, A3, A2, A1, A0, Q) balances one path: 32 LUTs replace 256 registers. An SRL16E with address 12 balances another: 32 LUTs replace 416 registers]
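The register counts on this slide follow from the delay each address selects: an address of 7 gives 8 clock cycles (32 bits x 8 = 256 registers) and an address of 12 gives 13 cycles (32 x 13 = 416). A behavioral sketch of such a delay line is shown below. The module name `delay_line` and its ports are assumptions for illustration, and whether a synthesis tool maps this description onto SRL16E primitives depends on the tool; instantiating SRL16E directly is the route the slide describes.

```verilog
// Hypothetical module; behavior mirrors an SRL16E per bit:
// Q is tap A of a 16-deep shift register.
module delay_line #(
    parameter WIDTH = 32
) (
    input  wire             clk,
    input  wire             ce,
    input  wire [3:0]       a,   // tap select: delay is a + 1 clock cycles
    input  wire [WIDTH-1:0] d,
    output wire [WIDTH-1:0] q
);
    genvar i;
    generate
        for (i = 0; i < WIDTH; i = i + 1) begin : bit_lane
            reg [15:0] sr;       // no reset, so the shift register can live in a LUT
            always @(posedge clk)
                if (ce) sr <= {sr[14:0], d[i]};
            assign q[i] = sr[a];
        end
    endgenerate
endmodule
```

With `a` tied to 7, a word presented on `d` appears on `q` eight enabled clock edges later, using one shift register per bit instead of eight flip-flops per bit.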
Design Verification
• Trends
• Stages
• Xilinx solutions

What's Driving the Verification Trends?
[Chart: Cost of Design Error Over Time, rising from 1X through 10X, 100X, and 1000X to 10,000X across the Design Cycle Stages: Functional Simulation, Synthesis, PAR, System Test, End Product]
Functional simulation should eliminate 95% of the bugs
Stages to Verify the Design
VHDL or Verilog → testbench
• Create testbench
• Verifies syntax & functionality
• Majority of design cycle time
• Errors found are inexpensive to fix
Synthesis → Gate-level Functional Simulation
• Checks the synthesis implementation to gates
• Test initialization states
• Analyze "don't care" conditions
Implementation → Gate-level Timing Simulation
• Post-implementation timing simulation
• Test race conditions
• Test setup and hold violations based on operating conditions
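As an illustration of the first stage, here is a minimal self-checking testbench. Everything in it is hypothetical: the design under test `mux8` is a stand-in 8:1 mux written for this sketch, and the stimulus pattern is arbitrary. The same testbench could in principle be rerun against the post-synthesis and post-implementation netlists, which is what makes the up-front effort pay off in the later stages.

```verilog
// Hypothetical design under test: an 8:1 mux written as a "case" statement.
module mux8 (
    input  wire [7:0] in,
    input  wire [2:0] sel,
    output reg        z
);
    always @(in or sel) begin
        case (sel)
            3'b000  : z = in[0];
            3'b001  : z = in[1];
            3'b010  : z = in[2];
            3'b011  : z = in[3];
            3'b100  : z = in[4];
            3'b101  : z = in[5];
            3'b110  : z = in[6];
            default : z = in[7];
        endcase
    end
endmodule

// Self-checking testbench: applies every select value and compares
// the mux output with the expected input bit.
module mux8_tb;
    reg  [7:0] in;
    reg  [2:0] sel;
    wire       z;
    integer    i, errors;

    mux8 dut (.in(in), .sel(sel), .z(z));

    initial begin
        errors = 0;
        in = 8'b1010_0110;              // arbitrary stimulus pattern
        for (i = 0; i < 8; i = i + 1) begin
            sel = i[2:0];
            #1;                         // let the combinational logic settle
            if (z !== in[sel]) begin
                errors = errors + 1;
                $display("FAIL: sel=%0d z=%b expected=%b", sel, z, in[sel]);
            end
        end
        if (errors == 0) $display("PASS: all 8 select values checked");
        $finish;
    end
endmodule
```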
What Does Xilinx Provide?
• Libraries and interfaces for simulation throughout the design flow
  • functional simulation with UNISIM
  • timing simulation with SIMPRIM
• Mixed-mode simulation
  • schematic and HDL
• Minimum-delay analysis
• Voltage and temperature prorating
• Unique VHDL simulation of global set/reset capabilities
[Flow diagram: VHDL or Verilog → simulation with the UNISIM library; Synthesis → simulation with the SIMPRIM library; Implementation → simulation with the SIMPRIM library]

Benefits of the Xilinx FPGA Software Development Methodology
• ASIC-like design flow and features
  • open development system
  • minimum delays and temperature prorating
  • robust verification flow
• Improve designer productivity
  • faster compile times, better performance
• Utilizing device resources
  • technology independence, since most technology features are accessible via inference
  • use techniques to reduce area and increase performance
Implementation and Integration of Cores
• PCI
• PCMCIA
• HDLC
• Reed-Solomon
• MPEG
• T1 Framer
• DRAM Controller
• DMA
• Viterbi Decoder
• FIR Filter
[Figure: three IP blocks labeled A, B, and C]
High-Density FPGA Design Implementation
• Xilinx CORE Generator
  • reduces time to market
  • delivers parameterizable cores
  • optimized using Smart-IP technology
• LogiCORE products
  • licensed and supported by Xilinx
  • highly optimized for Xilinx FPGAs, resulting in the best possible performance, area, and predictability
• AllianceCORE products
  • licensed and supported by Xilinx partners
  • 25 partners provide the industry's widest selection of cores and design expertise
• Design services
  • 3rd party and Xilinx design centers
  • local expertise and services
Xilinx CORE Generator: IP Delivery System
[Block diagram: Virtex V400 FPGA containing Transmitter, Channel Interface, PCI, CPU and Software, Channel Manager, Spectral Analysis, and A/D blocks]

Benefits of Using Xilinx Cores
[Chart: time spent in the Learn, Design, Implement, and Verify (L, D, I, V) phases for three approaches: Design From Scratch; Reference Design, Generic Core; and Complete FPGA Core Solution. Duration labels: 12 Months, 2 Months, 9 Months]
• Pre-verified Designs
• Area & Timing Optimized
• Complete & Flexible Design
• Little Knowledge of Function Required
Benefits of Using Xilinx Cores
"75% of all new designs will have Cores in them" - Designer feedback from IP usage survey
"The high performance of the Xilinx PCI LogiCORE solution combined with the short time to market and flexibility of a programmable FPGA solution, made Xilinx the obvious choice." - Tony Clark, R&D Mgr. - Management Graphics, Inc
"By using 'Design Reuse' as part of our design consulting services, on average we are able to save our customers 18-24 weeks" - Tim Smith of Memec Design Services

CORE Generator Delivery System
Xilinx Smart-IP Technology
• Data sheets
• Parameterized Cores
• CoreLINX: Web Mechanism to Download New Cores
• SystemLINX: Third-Party System Tools Directly Linked With CORE Generator
Free Software & Free Cores Included As Part of The Alliance and Foundation Software Packages