FPGA Tools Training Class Hardware Optimization


Presentation Transcript

  1. FPGA Tools Training Class: Hardware Optimization

  2. In This Chapter, You Will Learn • Design techniques to optimize performance • Logic Techniques • Special Xilinx Hardware Features • Topics apply to both synthesis and schematic users

  3. Outline • CLB Combinatorial Logic • CLB Register Resources • Memory Usage • Input/Output Block Usage • Tips and Guidelines • Summary

  4. Combinatorial Resource Review • How is a 9-input AND gate implemented in a CLB? • The three stages shown below explain the mapping process [Figure: CLB diagram showing FFX and FFY]
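
One way to picture the answer, assuming an XC4000-style CLB with two 4-input function generators (F and G) feeding a 3-input function generator (H): F ANDs four inputs, G ANDs four more, and H ANDs those two results with the ninth input. The sketch below models each function generator as a truth-table lookup. The helper names are illustrative only and do not come from any Xilinx tool.

```python
from itertools import product

def make_lut(func, n_inputs):
    """Build a truth table (list of 0/1 values) for an n-input Boolean function."""
    return [func(*bits) for bits in product((0, 1), repeat=n_inputs)]

def read_lut(table, *bits):
    """Look up one entry; the inputs form the address, first input is the MSB."""
    addr = 0
    for b in bits:
        addr = (addr << 1) | b
    return table[addr]

# Assumed CLB resources: F and G are 4-input LUTs, H is a 3-input LUT.
F_LUT = make_lut(lambda a, b, c, d: a & b & c & d, 4)
G_LUT = make_lut(lambda a, b, c, d: a & b & c & d, 4)
H_LUT = make_lut(lambda f, g, h1: f & g & h1, 3)

def and9(inputs):
    """9-input AND: F covers inputs 0-3, G covers 4-7, H combines F, G, and input 8."""
    f = read_lut(F_LUT, *inputs[0:4])
    g = read_lut(G_LUT, *inputs[4:8])
    return read_lut(H_LUT, f, g, inputs[8])
```

The whole function fits in a single CLB because only one level of routing between LUTs is needed, and that routing is internal to the CLB.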

  5. Three-State MUX • Wide MUXes implemented in LUTs have many levels of logic • The BUFT multiplex function uses SRAMs to decode select signals and internal tri-state buffers • Fewer CLBs are used and routing congestion is decreased • BUFT delay varies with the size of the FPGA • A small 4-to-1 MUX is shown below • Example: BUFT implementation [Figure: 4-to-1 MUX in which a D2_4E decoder on S<1:0> enables four BUFTs that drive D<0> through D<3> onto output O]

  6. BUFT Multiplexers • BUFTs can be used to build large MUXes • Wide MUXes composed of LUTs need multiple levels of logic • Wide MUXes composed of BUFTs use SRAMs to decode select signals and internal tri-state buffers • The MUX should be built across one row of CLBs • Standard library multiplexer macros use Look-Up Tables • Example: the 4-to-1 MUX with enable, M4_1E, is built with CLBs • LogiCORE MUXes with Style = WAND use BUFTs • Xilinx Unified library BUFT components: BUFT, BUFT4, BUFT8, BUFT16 • Synthesis tools: all synthesizers generate a BUFT MUX whenever an IF-THEN type statement drives a high-Z value; otherwise CLB MUXes are generated
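
The decode-then-drive structure of a BUFT MUX can be sketched as a small behavioral model. This is an illustration, not library code: the function names are hypothetical, the buffer enable is modeled as active-high for readability, and a floating line is represented by `None`.

```python
def decode_2_to_4(s1, s0):
    """One-hot decode of the select lines, as a 2-to-4 decoder would produce."""
    sel = (s1 << 1) | s0
    return [1 if i == sel else 0 for i in range(4)]

def buft(data, enable):
    """Tri-state buffer model: drive the data value when enabled, else float (None)."""
    return data if enable else None

def resolve_bus(drivers):
    """Resolve the shared line; at most one buffer may drive it at a time."""
    active = [d for d in drivers if d is not None]
    if len(active) > 1:
        raise ValueError("bus contention: more than one BUFT enabled")
    return active[0] if active else None

def buft_mux4(d, s1, s0):
    """4-to-1 MUX: the decoder enables exactly one of four buffers on a shared line."""
    enables = decode_2_to_4(s1, s0)
    return resolve_bus([buft(d[i], enables[i]) for i in range(4)])
```

Widening this MUX adds buffers on the same line rather than extra logic levels, which is the property the slide contrasts with LUT-based MUXes.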

  7. Carry Logic • Each CLB contains dedicated arithmetic logic for fast carry and borrow signals • Carry logic is associated with the F and G function generators • Carry logic components have a vertical orientation • Needed for speed and utilization • Known as an RPM, or “Relationally Placed Macro” • Examples: ADDx adders; ADSUx adder/subtractors; CCx counters; COMPMCx magnitude comparators [Figure: ADD4 adder with inputs A<3:0> and B<3:0> and outputs Z<3:0>]

  8. Counters • Libraries support a wide variety of fast and efficient counters • Counters offer trade-offs between speed, utilization, and complexity • Example: LogiBLOX counter styles • Binary: slow and large • Johnson: fastest practical counter, uses few Flip-Flops • LFSR: fast and dense, but pseudo-random outputs • One-Hot: useful for generating a series of enables • Carry Chain: high speed and utilization • Synthesis tools select a component based on the design, or the designer can instantiate a component using LogiBLOX
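
As an illustration of one of these styles, the following sketch models a Johnson (twisted-ring) counter: a shift register whose inverted last bit feeds back into the first. It is a behavioral model with hypothetical function names, not a LogiBLOX netlist. With n Flip-Flops it cycles through 2n states and changes only one bit per clock.

```python
def johnson_step(state, width):
    """Shift left by one and feed the inverted MSB back into bit 0."""
    msb = (state >> (width - 1)) & 1
    return ((state << 1) | (msb ^ 1)) & ((1 << width) - 1)

def johnson_sequence(width):
    """Return the full state sequence starting from all zeros."""
    state, seq = 0, []
    while True:
        seq.append(state)
        state = johnson_step(state, width)
        if state == 0:          # back at the starting state: sequence complete
            return seq

# Print the 4-bit sequence as bit strings.
for s in johnson_sequence(4):
    print(format(s, "04b"))
```

The next-state logic is a single inverter regardless of width, which is why this style can run fast.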

  9. Outline • CLB Combinatorial Logic • CLB Register Resources • Memory Usage • Input/Output Block Usage • Tips and Guidelines • Summary

  10. Global Clock Buffers • Global buffers are low-skew, high-drive buffers • Drive low-skew, high-speed long line resources • Drive all Flip-Flops and Latches in the FPGA • Can also be used for high-fanout non-clock signals • Check the device for the number of clocks • To use a global buffer, instantiate the BUFG component • For synthesis: clocks are identified by different means depending on the vendor • Example: Synopsys FPGA Compiler connects clock buffers to all fan-in of clock pins • Control clock buffer insertion with separate commands • Consult the synthesis interface guide or the vendor

  11. CLB Registers • Each register can be configured as a Flip-Flop or Latch • Independent clock polarity • Asynchronous Preset or Clear • Synchronous Set or Reset • Clock Enable • Direct input from CLB input (connections bypass LUTs) [Figure: the two CLB registers, with outputs QX and QY; each has a D input selected from DIN, F, G, or H, an S/R control, a clock (K), and a clock enable (EC)]

  12. Set and Reset Capabilities • CLB Flip-Flop features include Asynchronous Preset/Clear or Synchronous Set/Reset • Synchronous Set/Reset is implemented in the LUT • Asynchronous Clear/Preset has two sources: the dedicated Global Set/Reset (GSR) net and the local Asynchronous Preset/Clear [Figure: CLB in which a LUT implements synchronous set/reset ahead of the D input of an FDC Flip-Flop, which is cleared by the local asynchronous preset/clear or by GSR]

  13. Global Reset (1) • All Flip-Flops are always initialized during power-up • Via the Global Set/Reset network • You can access this network by instantiating the STARTUP primitive • GSR is automatically connected to all CLB Flip-Flops using dedicated routing resources; in general, you do not need to connect STARTUP to the Flip-Flops • GSR, GTS, and Clock can be driven by internal signals or pins • Assert GSR for a global set or reset; GTS controls the tri-state buffers in the IOBs • Saves general-use routing resources for the design [Figure: STARTUP primitive with pins GSR, GTS, CLK, Q2, Q3, Q1Q4, and DoneIn]

  14. Global Reset (2) • Use Global Reset whenever possible • Local asynchronous reset is routed on general-purpose interconnect • Global Set/Reset is routed on dedicated interconnect • Any signal or pin can drive the global set/reset pin • To use the global reset network, the register reset and the STARTUP RST pin must be driven by the same signal • Examples: [Figure: example circuits connecting GSR, STARTUP, and the Flip-Flops, captioned “Good for simulation; extra connections will be trimmed by Design Manager”, “Bad example: general purpose routing is used”, and “Improved example: general purpose routing is not used”]

  15. Flip-Flop Clock Enable (1) • Register output does not change when the clock enable is deasserted • Allows synchronous design • Use instead of gating the clock signal • Clock enable is implemented in two ways: directly inside the Flip-Flop via the dedicated CE pin, or in a Look-Up Table [Figure: CLB register with D, EC, SET, and RESET inputs and QX output]

  16. Clock Enable Example • Use clock enables instead of gating the clock • Use the clock enable when using most or all of the logic inputs • Avoid gating the clock signal directly • Use MUXed data when using only 1-2 logic inputs or for a gated clock enable • Or when two different clock enables must drive Flip-Flops in one CLB [Figure: two Flip-Flops with D, CE, and Q pins, one labeled FDxE]

  17. Minimize the Number of Clocks • Use clock enables to reduce the number of clocks • Example with two clocks: consider using a clock enable instead of a second clock • Useful when CLK2 is much slower than CLK1, or when CLK1 and CLK2 have a definite phase relationship [Figure: FF1 and FF2 driving OUT1, shown first with separate clocks CLK1 and CLK2, then with CLK1 on both Flip-Flops and CLK2 applied to the clock enable (CE)]
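
A cycle-level sketch of the idea, assuming CLK2 is CLK1 divided by a fixed integer: instead of clocking FF2 from CLK2, FF2 stays on CLK1 and loads only on the cycles where its clock enable is asserted. The function name and the `divide_by` parameter are hypothetical.

```python
def run_with_clock_enable(data, divide_by):
    """Simulate FF2 clocked by CLK1 with a clock enable standing in for CLK2.

    `data` holds the D input value on each CLK1 cycle. CE is high for one
    CLK1 cycle out of every `divide_by`, so FF2 updates at the slower rate
    while remaining in the CLK1 clock domain.
    """
    q = 0
    trace = []
    for cycle, d in enumerate(data):
        ce = (cycle % divide_by == 0)
        if ce:
            q = d              # register loads only when enabled
        trace.append(q)        # otherwise it holds its previous value
    return trace

print(run_with_clock_enable([1, 0, 0, 0, 0, 1, 1, 1], divide_by=4))
```

Both Flip-Flops then share one global clock buffer, and the path between them can be analyzed as an ordinary single-clock path.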

  18. Outline • CLB Combinatorial Logic • CLB Register Resources • Memory Usage • Input/Output Block Usage • Tips and Guidelines • Summary

  19. RAM Provides 16X the Storage of Flip-Flops • 32 bits versus 2 bits of storage per CLB • Two 16x1 RAMs or one 32x1 single-port RAM fit in one CLB • One 16x1 dual-port RAM fits in one CLB • A 32x8 shift register built with RAM takes 11 CLBs • Built with Flip-Flops, it takes 128 CLBs for data alone [Figure: a CLB used as RAM (32 bits; address inputs A0 to A4, WE, output O1) beside a CLB used as two Flip-Flops (2 bits; D1/Q1 and D2/Q2, CLK)]
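
The data-storage part of this comparison is simple arithmetic, sketched below using the per-CLB capacities stated on the slide. The RAM figure computed here covers data bits only; the remaining CLBs in the slide's total of 11 are presumably addressing and control logic, which is an assumption and is not modeled.

```python
import math

BITS_PER_CLB_RAM = 32   # from the slide: 32 bits of RAM per CLB
BITS_PER_CLB_FF = 2     # from the slide: two Flip-Flops per CLB

def clbs_for_storage(depth, width, bits_per_clb):
    """Number of CLBs needed to hold depth x width bits of data."""
    return math.ceil(depth * width / bits_per_clb)

ff_clbs = clbs_for_storage(32, 8, BITS_PER_CLB_FF)     # Flip-Flop implementation
ram_clbs = clbs_for_storage(32, 8, BITS_PER_CLB_RAM)   # RAM implementation, data only

print(ff_clbs, ram_clbs, ff_clbs // ram_clbs)
```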

  20. General RAM Guidelines • Fewer than 32 words gives the fastest performance • A 32x1 or 16x2 RAM fits in one CLB • Delays are short (one level of logic) • Data and output MUXes are required to expand depth • Fewer than 256 words recommended per RAM • Exceptions include T1 framers, which use RAMs as a shift register • Width is easily expanded • Connect the address lines to multiple blocks • Recommendation: use less than 1/2 of the maximum memory resources • Maximum memory uses all logic resources of the CLBs

  21. Memory Use • Most synthesis tools can synthesize ROM from behavioral HDL code • RAM memories may be synthesized • Synplicity can synthesize RAMs • Use library primitives and macros for standard-size memories • RAM/ROM16X1S to 32X8S • Use the S suffix for synchronous RAM • Use the D suffix for dual-port RAM • Use LogiBLOX to generate custom-size memories [Figure: RAM16X8S symbol with pins WE, D[7:0], WCLK, A0 to A3, and O[7:0]]

  22. Outline • CLB Combinatorial Logic • CLB Register Resources • Memory Usage • Input/Output Block Usage • Tips and Guidelines • Summary

  23. IOB Block Diagram • Three-state output • Registered input or output • Bi-directional I/O • Output slew rate control • Programmable setup/hold delay [Figure: IOB block diagram showing slew rate control, pull-up and pull-down, output Flip-Flop, pad, input delay, input Flip-Flop or latch, and fast latch]

  24. IOB Flip-Flops and Latches • Synthesis tools and Design Manager can move internal registers into IOBs to meet timing constraints • Flip-Flops and Latches can be used in unbonded IOBs • Use IOB Flip-Flops: when all CLB Flip-Flops are used; to minimize the Flip-Flop-to-pad delay; to minimize skew between outputs • IOBs contain minimal combinatorial logic • IOB Flip-Flops can be used as part of an internal shift register • Do not use IOB Flip-Flops as part of a pipeline • Input library components begin with I. Examples: ILD, IFD16 • Output library components begin with O. Examples: OFD, OFDT16

  25. Output Three-State Control • Instantiation: use the OBUFE and OBUFT components • The OBUFT output is in the high-impedance state when OE is low • Synthesis: If-Then statements driving a Hi-Z value onto an output may be synthesized into an OBUFE or OBUFT • Three-state control is also available via a dedicated global net • Needed for configuration • Also controlled by GTS on the STARTUP primitive [Figure: OBUFE and OBUFT symbols with a truth table: T = 1 gives OUT = Z; T = 0 gives OUT = IN]

  26. Output Combinatorial Logic • Small functions can be built into the IOB • Can be used as a generic two-input function generator or MUX • One input can be driven by the IOB output clock signal • Requires library components beginning with “O” • Examples: OAND, OMUX • The F input pin is faster than the IO pin • Does not apply to all FPGAs [Figure: OAND2 with F and IO inputs driving an OPAD, labeled FAST]

  27. Guidelines for IOB use • Unused IOBs: • Outputs of unused IOBs are automatically disabled • Pull-ups are automatically connected on unused IOBs • Used IOBs: • A PULLUP or PULLDOWN primitive can be connected to used IOBs • Inputs should not be left floating • Add a pull-up to design inputs that may be left floating to reduce power and noise • Output drive • 12 mA Sink current per output on most families • Two adjacent outputs can be tied together to double the drive off chip

  28. Outline • CLB Combinatorial Logic • CLB Register Resources • Memory Usage • Input/Output Block Usage • Tips and Guidelines • Summary

  29. Pipeline for Speed • Use synchronous design • Pipelining improves speed • Consider it wherever latency is not an issue • Use for terminal counts, carry lookahead, etc. • How to estimate the number of logic levels per stage: example for a 100 MHz clock frequency in an XC4013XL-09: clock period = 10 ns; one level = 4.1 ns (tCO + tNET + tSU ≈ 0.9 + 1.2 + 2 ns); delay allowance = 7.9 ns; each added level = 3.2 ns (tPD + tNET ≈ 1.2 + 2 ns); additional levels of logic allowed = 2 CLBs • Why isn’t the SRAM in the CLB included in the delay calculation? [Figure: register-to-register path through a chain of CLBs, annotated tCO, tNET, tPD, tNET, tPD, tNET, tSU]
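
The estimation method can be sketched as a short calculation: subtract the fixed cost of one register-to-register level from the clock period, then divide what remains by the cost of each added level. The function and parameter names are hypothetical, and the delay values are the approximate ones quoted on the slide. Worked from those component delays alone, the allowance comes out to 5.9 ns and one additional level, which is lower than the 7.9 ns and two levels listed on the slide, so treat this as an illustration of the method rather than a check of the device figures.

```python
import math

def additional_logic_levels(clock_mhz, t_co, t_net_first, t_su, t_pd, t_net_added):
    """Estimate how many extra logic levels fit between two registers.

    All delays are in nanoseconds. Returns (period, allowance, levels).
    """
    period = 1000.0 / clock_mhz                  # clock period in ns
    first_level = t_co + t_net_first + t_su      # clock-to-out + net + setup
    allowance = period - first_level             # time left for added levels
    per_level = t_pd + t_net_added               # one more CLB plus its net
    return period, allowance, math.floor(allowance / per_level)

period, allowance, levels = additional_logic_levels(
    clock_mhz=100, t_co=0.9, t_net_first=1.2, t_su=2.0, t_pd=1.2, t_net_added=2.0
)
print(f"period={period:.1f} ns, allowance={allowance:.1f} ns, extra levels={levels}")
```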

  30. Pipeline Example • Break up combinatorial logic into separate stages • Clock frequency increases • Latency also increases: extra cycle(s) are added • Example: frequency can double by adding another stage, but an extra cycle is added [Figure: out = a * b + c computed in a single stage, and again with a register between the multiplier and the adder]
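
A cycle-level sketch of the a * b + c example, assuming the pipeline register is placed between the multiply and the add. The function names are hypothetical. Both versions produce the same results; the pipelined one needs one extra cycle before the first result appears, in exchange for a shorter path per stage.

```python
def combinational(samples):
    """Single stage: multiply and add happen in the same clock cycle."""
    return [a * b + c for a, b, c in samples]

def pipelined(samples):
    """Two stages: a register holds the product (and c) between cycles."""
    stage_reg = None                       # pipeline register between stages
    outputs = []
    for item in list(samples) + [None]:    # one extra cycle flushes the pipeline
        if stage_reg is not None:
            product, c_delayed = stage_reg
            outputs.append(product + c_delayed)   # stage 2: add
        if item is not None:
            a, b, c = item
            stage_reg = (a * b, c)                # stage 1: multiply
        else:
            stage_reg = None
    return outputs

samples = [(1, 2, 3), (4, 5, 6)]
print(combinational(samples), pipelined(samples))
```

Note that c has to be delayed alongside the product so that the adder sees operands from the same sample.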

  31. Keep Related Logic Together (1): No Hierarchy in Combinational Path • The path from Reg A to Reg C is divided between three different block descriptions • Example: optimization is limited because hierarchical boundaries prevent sharing of common terms [Figure: blocks A, B, and C, with Reg A in block A and Reg C in block C]

  32. Keep Related Logic Together (2): Good Example • Related combinational logic drives registers in the same block • No hierarchical boundaries between combinational logic and registers • Allows for improved sequential mapping [Figure: block A containing Reg A, and a single block containing logic B & C together with Reg C]

  33. Register All Block Outputs • Align block boundaries on register outputs • Helps floorplanning • Poor partitioning: sum is not registered and may become a critical path • Good partitioning: sum is registered • Why is performance improved when combinatorial logic drives a register in the same CLB? [Figure: two versions of a block that adds a0 and a1 to produce sum, one with the adder output unregistered and one with sum registered on clk]

  34. Duplicate Registers to Reduce Fanout • Why does fanout reduction improve performance? [Figure: a single en register driving all 24 loads of out[23:0], compared with two duplicated en registers that each drive 12 loads]

  35. Counter Tips (1) • Do not use a binary sequence if it is unnecessary • Consider higher-performance or smaller counter types • Examples: LFSR, pre-scaled, Gray • Use pre-scaling on non-loadable counters to increase speed • LSBs toggle quickly • See Application Notes XAPP001 and XAPP014 [Figure: a fast small counter whose TC output drives the CE input of a large dense counter with slower carry]

  36. Counter Tips (2) • Use Gray code counters if decoding outputs • Glitch-free, because only one bit changes per transition • Consider a Linear Feedback Shift Register for speed when the terminal count is all that is needed • Or when any regular sequence is acceptable (e.g., FIFO) [Figure: 10-bit shift register (SR) with Q0, Q6, and Q9 marked]
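
Both properties can be checked with a short behavioral sketch. The Gray conversion is the standard n XOR (n >> 1). The LFSR is a 10-bit shift register whose new Q0 is the XOR of Q6 and Q9; that tap choice is an assumption read from the labels in the figure, and the function names are hypothetical. With those taps the register steps through all 1023 nonzero states before repeating.

```python
def binary_to_gray(n):
    """Standard binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def lfsr_step(state, width=10, taps=(9, 6)):
    """Shift left by one; the new bit 0 is the XOR of the tapped bits."""
    feedback = 0
    for t in taps:
        feedback ^= (state >> t) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)

def lfsr_period(seed=1, width=10, taps=(9, 6)):
    """Count clock cycles until the LFSR returns to its seed state."""
    state, count = lfsr_step(seed, width, taps), 1
    while state != seed:
        state = lfsr_step(state, width, taps)
        count += 1
    return count

gray = [binary_to_gray(i) for i in range(16)]
print([format(g, "04b") for g in gray])
print(lfsr_period())
```

The all-zeros state is a lock-up state for this XOR form, so the register must be initialized to a nonzero value.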

  37. State Machine Design Tips (1) • Use One-Hot Encoding for small state machines • Shift-register-like structure • One Flip-Flop is assigned to each state • Works well in Xilinx “register-rich” FPGAs • The number of required Flip-Flops may be higher than for other state machines, but the logic to generate each state is less complex • RAMs can be used to encode large state machines [Figure: prototype OHE state machine with Flip-Flops for states Qn, Qn + 1, and Qn + 2; the inputs Qx, Qy, and Qz are composed of state variables from previous states]
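
A minimal behavioral sketch of one-hot encoding, using a hypothetical four-state machine (IDLE, LOAD, RUN, DONE) with hypothetical inputs `start` and `finished`. Each state has its own Flip-Flop, and each next-state equation depends only on a few state bits and inputs, which is what keeps the logic per Flip-Flop shallow.

```python
STATES = ("IDLE", "LOAD", "RUN", "DONE")   # hypothetical state names

def step(q, start, finished):
    """Compute the next one-hot state from the current one.

    q maps each state name to its Flip-Flop value; exactly one entry is 1.
    """
    nxt = {
        "IDLE": (q["IDLE"] and not start) or q["DONE"],
        "LOAD": q["IDLE"] and start,
        "RUN": q["LOAD"] or (q["RUN"] and not finished),
        "DONE": q["RUN"] and finished,
    }
    return {name: int(bool(value)) for name, value in nxt.items()}

def run(inputs):
    """Apply a list of (start, finished) pairs, starting in IDLE."""
    q = {name: int(name == "IDLE") for name in STATES}
    trace = [q]
    for start, finished in inputs:
        q = step(q, start, finished)
        trace.append(q)
    return trace

for q in run([(1, 0), (0, 0), (0, 0), (0, 1), (0, 0)]):
    print(next(name for name in STATES if q[name]))
```

No state decoder is needed: the Flip-Flop for a state is itself the "in this state" signal.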

  38. State Machine Design Tips (2) • Split complex states • In FPGAs, minimize the number of inputs, not the number of Flip-Flops • Use One-Hot encoding for medium to large state machines (greater than 12 states) • Complex states may be improved by breaking them up into additional simpler states [Figure: state diagram in which State A, which transitions to State B on cond1, is split into State A1 and State A2]

  39. State Machine Design Tips (3) • Consider a pipeline: break the state machine into two or more clock cycles • Two clock cycles for a state is better than having to slow the clock for the entire state machine • This basically means breaking up wide input equations using intermediate nodes in the state diagram [Figure: state diagrams with State A, State B, and State C]

  40. Outline • CLB Combinatorial Logic • CLB Register Resources • Memory Usage • Input/Output Block Usage • Tips and Guidelines • Summary

  41. Summary • Use Tri-state buffers for multiplexing • Carry Logic is not the only way to create fast arithmetic functions • Use the GSR net to save routing resources and use global routing resources • Use Clock Enable port on registers to design synchronously and save logic • Best memories are <=32 words • Use LogiBLOX to customize memories • Use IOB registers for modules that do not require logic, such as shift registers • Refer to LogiBLOX or Design Manager Help for more information on LogiBLOX

  42. Questions (1) • What problem may occur in this circuit? • How can the circuit be improved? [Figure: binary counter with outputs Q0, Q1, Q2, and TC, connected to a D Flip-Flop with D, CK, and Q pins]

  43. Questions (2) • What does GSR stand for? • What component sources the GSR net? • When should the GSR net be used? • What component is instantiated to use the Global Clock? • Can the Global Clock be synthesized?

  44. Questions (3) • How many global clocks can be used in an XC4085XL-3? • See the data sheet for the XC4000XL family, available on the web or the AppLINX CD • Why is one-hot encoding a good way to encode a small state machine? • When should IOB registers be used? When should they be avoided?