Learn design techniques to optimize performance with logic and Xilinx hardware features. Topics apply to both synthesis and schematic users. Covers CLB logic, memory usage, I/O block usage, and tips.
In This Chapter, You Will Learn • Design techniques to optimize performance • Logic Techniques • Special Xilinx Hardware Features • Topics apply to both synthesis and schematic users
Outline • CLB Combinatorial Logic • CLB Register Resources • Memory Usage • Input/Output Block Usage • Tips and Guidelines • Summary
Combinatorial Resource Review • How is a 9-input AND gate implemented in a CLB? • Three stages shown below explain the mapping process [Figure: three-stage mapping of the AND gate into a CLB, with flip-flops FFX and FFY]
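One way to picture the 9-input AND mapping is to model the look-up tables as truth tables. The sketch below assumes an XC4000-style CLB with two 4-input function generators (F and G) feeding a 3-input function generator (H); the helper names `make_lut` and `and9` are hypothetical and are not part of any Xilinx tool.

```python
from itertools import product

def make_lut(func, n_inputs):
    """Build a LUT as a truth table: one stored bit per input combination."""
    return {bits: func(*bits) for bits in product((0, 1), repeat=n_inputs)}

# Assumed CLB structure: F and G are 4-input LUTs, H is a 3-input LUT.
F_LUT = make_lut(lambda a, b, c, d: a & b & c & d, 4)
G_LUT = make_lut(lambda a, b, c, d: a & b & c & d, 4)
H_LUT = make_lut(lambda f, g, h1: f & g & h1, 3)

def and9(inputs):
    """9-input AND: F covers inputs 0-3, G covers 4-7, H combines F, G and input 8."""
    if len(inputs) != 9:
        raise ValueError("expected exactly 9 inputs")
    f = F_LUT[tuple(inputs[0:4])]
    g = G_LUT[tuple(inputs[4:8])]
    return H_LUT[(f, g, inputs[8])]
```

Under these assumptions the whole function fits in one modelled CLB, because the ninth input enters directly at H rather than needing another 4-input LUT.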
Three-State MUX • Wide MUXes implemented in LUTs have many levels of logic • The BUFT multiplex function uses SRAMs to decode select signals and internal tri-state buffers • Fewer CLBs are used and routing congestion is decreased • BUFT delay varies with size of FPGA • Example: BUFT implementation of a small 4-to-1 MUX [Figure: D2_4E decoder driven by S<1:0>, four BUFTs carrying D<0> to D<3> onto the shared output O]
BUFT Multiplexers • BUFT can be used to build large MUXes • Wide MUXes composed of LUTs need multiple levels of logic • Wide MUXes composed of BUFTs use SRAMs to decode select signals and internal tri-state buffers • The MUX should be built across one row of CLBs • Standard library Multiplexer macros use Look-Up Tables • Example: 4-to-1 MUX with enable, M4_1E, is built with CLBs • LogiCORE MUXes with Style = WAND use BUFTs • Xilinx Unified library BUFT components • BUFT, BUFT4, BUFT8, BUFT16 • Synthesis tools: a BUFT MUX will be generated in all synthesizers whenever an IF-THEN type statement drives a high-Z. Otherwise CLB MUXes are generated.
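The BUFT multiplexer idea (decode the select lines, then let exactly one tri-state buffer drive a shared line) can be sketched behaviorally. This is a simplified model, not device-accurate: the enable is treated as active-high for readability, high-Z is represented by `None`, and all function names are hypothetical.

```python
def decode_2to4(sel):
    """One-hot decode of a 2-bit select value (the role of the decoder in the figure)."""
    return [1 if sel == i else 0 for i in range(4)]

def buft(data, enable):
    """Tri-state buffer model: drive data when enabled, otherwise high-Z (None)."""
    return data if enable else None

def bus_resolve(drivers):
    """Resolve a shared line; more than one active driver is contention."""
    active = [d for d in drivers if d is not None]
    if len(active) > 1:
        raise ValueError("bus contention: more than one driver enabled")
    return active[0] if active else None

def buft_mux4(d, sel):
    """4-to-1 MUX built from four tri-state buffers on one shared line."""
    enables = decode_2to4(sel)
    return bus_resolve([buft(d[i], enables[i]) for i in range(4)])
```

Widening the MUX in this model only adds buffers and decoder outputs on the same line, which is why the number of logic levels does not grow the way it does for a LUT-based MUX tree.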
Carry Logic • Each CLB contains dedicated arithmetic logic for fast carry and borrow signals • Carry logic is associated with F and G function generators • Carry logic components have a vertical orientation • Needed for speed and utilization • Known as RPM or “Relationally Placed Macro” • Examples: • ADDx adders • ADSUx adder/subtractors • CCx counters • COMPMCx magnitude comparators [Figure: ADD4 adder with inputs A<3:0> and B<3:0> and outputs Z<3:0>, arranged vertically]
Counters • Libraries support a wide variety of fast and efficient counters • Counters offer trade-offs between speed, utilization, and complexity • Example: LogiBLOX counter styles • Binary: slow and large • Johnson: fastest practical counter, uses few Flip-Flops • LFSR: fast and dense, but pseudo-random outputs • One-Hot: useful for generating series of enables • Carry Chain: high speed and utilization • Synthesis tools select a component based on the design, or the designer can instantiate a component using LogiBLOX.
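To make two of the counter styles concrete, the following sketch generates the state sequences of a Johnson (twisted-ring) counter and a one-hot ring counter. It is a behavioral illustration only; the function names are hypothetical and nothing here reflects how LogiBLOX builds these counters.

```python
def johnson_sequence(n_bits):
    """States of an n-bit Johnson counter, starting from all zeros."""
    state = [0] * n_bits
    seen = []
    while tuple(state) not in seen:
        seen.append(tuple(state))
        # Shift by one stage; the inverted last bit feeds the first stage.
        state = [1 - state[-1]] + state[:-1]
    return seen

def one_hot_sequence(n_bits):
    """States of an n-bit one-hot ring counter, starting with the first bit set."""
    state = [1] + [0] * (n_bits - 1)
    seen = []
    while tuple(state) not in seen:
        seen.append(tuple(state))
        # Plain rotation: the single 1 walks through the register.
        state = [state[-1]] + state[:-1]
    return seen
```

In this model an n-bit Johnson counter visits 2n states with only one bit changing per step, while the one-hot counter visits n states and each output can serve directly as an enable.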
Global Clock Buffers • Global Buffers are low-skew, high-drive buffers • Drive low-skew, high-speed long line resources • Drive all Flip-Flops and Latches in FPGA • Can also be used for high-fanout non-clock signals • Check device for number of clocks • To use the global buffer, instantiate the BUFG component • For synthesis: Clocks are identified by different means depending on Vendor • Example: Synopsys FPGA compiler connects clock buffers to all fan-in of clock pins • Control clock buffer insertion with separate commands • Consult Synthesis interface guide or vendor
CLB Registers • Each register can be configured as a Flip-Flop or Latch • Independent clock polarity • Asynchronous Preset or Clear • Synchronous Set or Reset • Clock Enable • Direct input from CLB input (connections bypass LUTs) [Figure: CLB register block diagram showing two registers with outputs QX and QY, each with S/R control, SET and RESET, clock K, and clock enable EC; D inputs come from DIN, F, G, or H]
Set and Reset Capabilities • CLB Flip-Flop features include Asynchronous Preset/Clear or Synchronous Set/Reset • Synchronous Set/Reset is implemented in LUT • Asynchronous Clear/Preset has two sources • Dedicated Global Set/Reset (GSR) net • Local Asynchronous Preset/Clear [Figure: CLB in which a LUT implements the synchronous Set/Reset ahead of the D input, and an FDC flip-flop is cleared by the local asynchronous Preset/Clear or by GSR]
Global Reset (1) • All Flip-Flops are always initialized during power up • Via the Global Set/Reset network • You can access this network by instantiating the STARTUP primitive • GSR is automatically connected to all CLB Flip-Flops using dedicated routing resources; in general you don’t need to connect STARTUP to Flip-Flops • GSR, GTS, and Clock can be driven by internal signals or pins • Assert GSR for global set or reset; GTS controls the tri-state buffers in IOBs • Saves general use routing resources for the design [Figure: STARTUP primitive with pins GSR, GTS, CLK, Q2, Q3, Q1Q4, and DoneIn]
Global Reset (2) • Use Global Reset whenever possible • Local asynchronous reset is routed on general purpose interconnects • Global Set/Reset is routed on dedicated interconnects • Any signal or pin can drive the global set/reset pin • To use the global reset network, the Register Reset and the Startup RST pin must be driven by the same signal • Examples: [Figure captions: Bad example: general purpose routing is used. Improved example: general purpose routing is not used. Reset connected to both Startup GSR and the Flip-Flops: good for simulation; extra connections will be trimmed by Design Manager.]
Flip-Flop Clock Enable (1) • Register output does not change when clock enable is disabled • Allows synchronous design • Use instead of gating the clock signal • Clock enable is implemented in two ways: • Directly inside the flip-flop via dedicated CE pin • In a Look-Up Table [Figure: flip-flop with D, EC, SET, and RESET inputs and output QX]
Clock Enable Example • Use Clock Enables instead of gating the clock • Use clock enable when using most of or all logic inputs • Avoid gating of clock signal directly • Use MUXed data when using only 1-2 logic inputs or for a gated clock enable • Or, when two different clock enables must drive Flip-Flops in one CLB [Figure: FDxE flip-flops with D, CE, and Q pins]
Minimize the Number of Clocks • Use Clock enable to reduce the number of clocks • Example with two clocks [Figure: FF1 and FF2 with CLK1, CLK2, and output OUT1] • Consider using clock enable instead of a clock [Figure: FF1 and FF2 with CLK1, CE, CLK2, and output OUT1] • Useful when: • CLK2 is much slower than CLK1 • Or, CLK1 and CLK2 have a definite phase relationship
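The clock-enable substitution can be modelled behaviorally: the flip-flop sees every edge of the fast clock but only captures data on edges where the enable is high. This is a minimal sketch; the class, the function, and the way the enable is generated (one pulse every `divide_by` cycles) are illustrative assumptions.

```python
class FlipFlopCE:
    """D flip-flop with clock enable: captures D on a clock edge only when CE is 1."""

    def __init__(self):
        self.q = 0

    def clock(self, d, ce):
        """Apply one clock edge and return the new output."""
        if ce:
            self.q = d
        return self.q

def run_with_enable(data, divide_by):
    """Clock one flip-flop on every fast-clock edge, enabled once per divide_by edges."""
    ff = FlipFlopCE()
    trace = []
    for cycle, d in enumerate(data):
        ce = 1 if cycle % divide_by == 0 else 0  # hypothetical enable generator
        trace.append(ff.clock(d, ce))
    return trace
```

The flip-flop behaves as if it were clocked at one quarter of the rate when `divide_by` is 4, yet only one clock net exists in the design.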
RAM Provides 16X the Storage of Flip-Flops • 32 bits versus 2 bits of storage • Two 16x1 RAMs or one 32x1 Single-Port RAM fit in one CLB • One 16x1 Dual-Port RAM fits in one CLB • 32x8 shift register with RAM = 11 CLBs • Using Flip-Flops, takes 128 CLBs for data alone [Figure: a CLB used as RAM (D1, D2, A0 to A4, WE, output O1; 32 bits) beside a CLB used as two Flip-Flops (D1, D2, CLK, outputs Q1 and Q2; 2 bits)]
General RAM Guidelines • Less than 32 words gives fastest performance • 32x1 or 16x2 RAM fits in one CLB • Delays are short (one level of logic) • Data and output MUXes are required to expand depth • Less than 256 words recommended per RAM • Exceptions include T1 Framers, which use RAMS as a shift register • Width easily expanded • Connect the address lines to multiple blocks • Recommendation: Use less than 1/2 of max memory resources • Maximum memory uses all logic resources of CLBs
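The depth and width guidelines can be illustrated with a small behavioral model. Depth expansion needs a decode on the write side and a MUX on the read side, while width expansion simply shares the address lines. The class names are hypothetical, and the model ignores timing and write clocks.

```python
class Ram32x1:
    """Behavioral model of one 32x1 RAM (the amount that fits in one CLB)."""

    def __init__(self):
        self.mem = [0] * 32

    def write(self, addr, bit):
        self.mem[addr] = bit

    def read(self, addr):
        return self.mem[addr]

class Ram64x1:
    """Depth expansion: address bit 5 steers the write and selects the read output."""

    def __init__(self):
        self.banks = [Ram32x1(), Ram32x1()]

    def write(self, addr, bit):
        self.banks[addr >> 5].write(addr & 0x1F, bit)

    def read(self, addr):
        outputs = [bank.read(addr & 0x1F) for bank in self.banks]
        return outputs[addr >> 5]  # output MUX: one extra level of logic

class Ram32xN:
    """Width expansion: N one-bit blocks share the same address lines."""

    def __init__(self, width):
        self.bits = [Ram32x1() for _ in range(width)]

    def write(self, addr, word):
        for i, ram in enumerate(self.bits):
            ram.write(addr, (word >> i) & 1)

    def read(self, addr):
        return sum(ram.read(addr) << i for i, ram in enumerate(self.bits))
```

In the model, only `Ram64x1` needs the extra selection step on reads, which mirrors why deeper memories are slower while wider ones are not.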
Memory Use • Most synthesis tools can synthesize ROM from behavioral HDL code • RAM memories may be synthesized • Synplicity can synthesize RAMs • Use library primitives and macros for standard size memory • RAM/ROM16X1S to 32X8S • Use S suffix for Synchronous RAM • Use D suffix for Dual-Port RAM • Use LogiBLOX to generate custom size memories [Figure: RAM16X8S symbol with D[7:0], A0 to A3, WE, WCLK, and O[7:0]]
IOB Block Diagram • Three-state output • Registered Input or Output • Bi-directional I/O • Output Slew Rate control • Programmable setup/hold delay [Figure: IOB with PAD, output FF, slew rate control, pull-up and pull-down, input delay, input FF or latch, and fast latch]
IOB Flip-Flops and Latches • Synthesis tools and Design Manager can move internal registers into IOBs to meet timing constraints • Flip-Flops and Latches can be used in unbonded IOBs • Use IOB Flip-Flops: • When all CLB Flip-Flops are used • To minimize the Flip-Flop-to-PAD delay • To minimize skew between outputs • IO Blocks contain minimal combinatorial logic • IOB Flip-Flops can be used as part of an internal shift register • Do not use IOB Flip-Flops as part of a pipeline • Input library components begin with I. Examples: ILD, IFD16 • Output components begin with O. Examples: OFD, OFDT16
Output Three-State Control • Instantiation: Use OBUFE and OBUFT components • OBUFT output is in the high impedance state when OE is low • Synthesis: If-Then statements driving a Hi-Z value onto an output may be synthesized into an OBUFE or OBUFT • Three-state control also via a dedicated global net • Needed for configuration • Also controlled by GTS on STARTUP primitive [Figure: OBUFE symbol with T and OE labels; truth table: IN = X, T = 1 gives OUT = Z; T = 0 gives OUT = IN]
Output Combinatorial Logic • Small functions can be built into the IOB • Can be used as a generic two-input function generator or MUX • One input can be driven by IOB output clock signal • Requires library components beginning with “O” • Examples: OAND, OMUX • F input pin is faster than IO pin • Does not apply to all FPGAs [Figure: OAND2 with F and IO inputs driving an OPAD, marked FAST]
Guidelines for IOB use • Unused IOBs: • Outputs of unused IOBs are automatically disabled • Pull-ups are automatically connected on unused IOBs • Used IOBs: • A PULLUP or PULLDOWN primitive can be connected to used IOBs • Inputs should not be left floating • Add a pull-up to design inputs that may be left floating to reduce power and noise • Output drive • 12 mA Sink current per output on most families • Two adjacent outputs can be tied together to double the drive off chip
Pipeline for Speed • Use synchronous design • Pipelining improves speed • Consider wherever latency is not an issue • Use for terminal counts, carry lookahead, etc. • How to estimate the number of logic levels per stage • Example for 100 MHz clock frequency in XC4013XL-09: • Clock period: 10 ns • One level: 4.1 ns (tCO + tNET + tSU ~= .9 + 1.2 + 2 ns) • Delay allowance: 5.9 ns (10 ns - 4.1 ns) • Each added level: 3.2 ns (tPD + tNET ~= 1.2 + 2 ns) • Additional levels of logic allowed: 5.9 / 3.2 ~= 1.8 CLBs (one full level; a second only if actual delays beat these estimates) • Why isn’t the SRAM in the CLB included in the delay calculation? [Figure: register-to-register path through four CLBs, annotated tCO, tNET, tPD, tNET, tPD, tNET, tSU]
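The logic-level estimate can be checked numerically. The sketch below takes the slide's 4.1 ns first-level delay and 3.2 ns per added level at face value for a 100 MHz clock; the function name and its return layout are hypothetical.

```python
import math

def additional_levels(clock_mhz, first_level_ns, per_level_ns):
    """Estimate how many extra logic levels fit in one clock period.

    Returns (period_ns, allowance_ns, exact_ratio, whole_levels).
    """
    period_ns = 1000.0 / clock_mhz                 # 100 MHz gives 10 ns
    allowance_ns = period_ns - first_level_ns      # time left after the first level
    ratio = allowance_ns / per_level_ns            # fractional number of extra levels
    return period_ns, allowance_ns, ratio, math.floor(ratio)

period, allowance, ratio, whole = additional_levels(100, 4.1, 3.2)
print(f"period={period:.1f} ns, allowance={allowance:.1f} ns, "
      f"ratio={ratio:.2f}, whole levels={whole}")
```

With these inputs the allowance is 10 - 4.1 = 5.9 ns, which covers one full added level and most of a second, so the result is sensitive to the net-delay estimates.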
Pipeline Example • Break up combinatorial logic into separate stages • Clock frequency increases • Latency also increases: extra cycle(s) are added • Example: Frequency can double by adding another stage, but an extra cycle is added [Figure: two versions of the datapath out = a * b + c]
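The latency cost of the extra stage can be seen in a cycle-by-cycle model of `out = a * b + c`. This sketch assumes one output register in the unpipelined version and an added register after the multiplier (with `c` delayed to match) in the pipelined version; the function names are hypothetical and `None` marks a register that holds no valid result yet.

```python
def single_stage(samples):
    """Whole expression in one stage: each result appears one clock after its inputs."""
    out_reg = None
    trace = []
    for a, b, c in samples:
        trace.append(out_reg)   # value visible during this cycle
        out_reg = a * b + c     # captured at the clock edge ending the cycle
    trace.append(out_reg)
    return trace

def two_stage(samples):
    """Multiplier and adder in separate stages: results appear two clocks later."""
    prod_reg = c_reg = out_reg = None
    trace = []
    for a, b, c in list(samples) + [(0, 0, 0)]:  # one extra cycle to flush the pipe
        trace.append(out_reg)
        # All registers update on the same edge, using the previous values.
        out_reg = None if prod_reg is None else prod_reg + c_reg
        prod_reg, c_reg = a * b, c
    trace.append(out_reg)
    return trace
```

Both versions produce the same results in the same order; the pipelined one simply delivers them one cycle later, in exchange for a shorter combinatorial path per stage.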
Keep Related Logic Together (1) • No hierarchy in combinational path • Example: the path from Reg A to Reg C is divided between three different block descriptions • Optimization is limited because hierarchical boundaries prevent sharing of common terms [Figure: blocks A, B, and C, with Reg A in block A and Reg C in block C]
Keep Related Logic Together (2) • Good example • Related combinational logic drives registers in the same block • No hierarchical boundaries between combinational logic and registers • Allows for improved sequential mapping [Figure: block A with Reg A, and a merged block B & C with Reg C]
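Register All Block Outputs • Align block boundaries on Register outputs • Helps floorplanning • Poor partitioning: sum is not registered, and may become a critical path • Good partitioning: sum is registered before it leaves the block • Why is performance improved when combinatorial logic drives a register in the same CLB? [Figure: poor partitioning, with registered a0 and a1 feeding an adder whose output sum is unregistered; good partitioning, with a0 + a1 feeding a register that drives sum]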
Duplicate Registers to Reduce Fanout • Why does fanout reduction improve performance? [Figure: one register driving en to [23:0]out has 24 loads; with the register duplicated, each register has 12 loads]
Counter Tips (1) • Do not use binary sequence if unnecessary • Consider higher performance or smaller counter types • Examples: LFSR, Pre-scaled, Gray • Use Pre-Scaling on non-loadable counters to increase speed • LSBs toggle quickly • See Application Notes XAPP001 and XAPP014 [Figure: pre-scaled counter, with a fast small counter whose TC drives the CE of a large dense counter with slower carry]
Counter Tips (2) • Use Gray code counters if decoding outputs • Glitch free, because only one bit changes per transition • Consider Linear Feedback Shift Register for speed when terminal count is all that is needed • Or when any regular sequence is acceptable (e.g., FIFO) [Figure: 10-bit shift register with outputs Q0, Q6, and Q9 marked]
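Both counter tips can be checked with a few lines of code. The sketch below converts binary counts to Gray code and steps a small LFSR. For brevity it uses a 4-bit register with feedback polynomial x^4 + x^3 + 1, which is an assumption chosen for illustration rather than the 10-bit register in the figure; the function names are hypothetical.

```python
def binary_to_gray(n):
    """Standard reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def lfsr4_sequence(seed=0b0001):
    """States of a 4-bit LFSR (taps on the two most significant bits) until it repeats."""
    if seed == 0:
        raise ValueError("the all-zero state never leaves zero")
    state = seed & 0xF
    seen = []
    while state not in seen:
        seen.append(state)
        feedback = ((state >> 3) ^ (state >> 2)) & 1  # XOR of bits 3 and 2
        state = ((state << 1) | feedback) & 0xF       # shift left, insert feedback
    return seen
```

The Gray sequence changes exactly one bit per step, which is what keeps decoded outputs glitch free. The LFSR needs only a single XOR for its next-state logic and still visits all 15 non-zero states, although in a pseudo-random order.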
State Machine Design Tips (1) • Use One-Hot Encoding for small state machines • Shift-register like structure • One Flip-Flop is assigned to each state • Works well in Xilinx “register-rich” FPGAs • Number of required Flip-Flops may be higher than other state machines, but logic to generate state is less complex • RAMs can be used to encode large state machines • Prototype OHE State Machine: Qx, Qy, and Qz are composed of state variables from previous states [Figure: chain of flip-flops with outputs Qn, Qn + 1, and Qn + 2, each D input driven by logic with inputs I1 to In and Qx, Qy, or Qz]
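A one-hot state machine can be sketched with one variable per flip-flop, where each next-state equation combines only the previous state bits and the inputs it needs. The three states (IDLE, RUN, DONE) and the `go` input below are hypothetical, chosen only to show the shift-register-like structure.

```python
def next_state(state, go):
    """One clock step of a one-hot FSM; state is the tuple (idle, run, done)."""
    idle, run, done = state
    next_idle = (idle & (1 - go)) | done  # stay idle without go, or return from DONE
    next_run = idle & go                  # leave IDLE when go is asserted
    next_done = run                       # RUN always advances to DONE
    return (next_idle, next_run, next_done)

def simulate(go_inputs, start=(1, 0, 0)):
    """Apply a sequence of go values and return every state visited."""
    states = [start]
    for go in go_inputs:
        states.append(next_state(states[-1], go))
    return states
```

Each next-state equation here has at most three inputs, so it would fit in a single 4-input LUT, which is the sense in which the state-generation logic is less complex than with a dense binary encoding.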
State Machine Design Tips (2) • Split complex states • Need to minimize number of inputs, not number of Flip-Flops, in FPGAs • Use One-Hot encoding for medium to large state machines (greater than 12 states) • Complex states may be improved by breaking up into additional simpler states [Figure: State A with cond1 transitions to State B, redrawn with State A split into State A1 and State A2]
State Machine Design Tips (3) • Consider a pipeline: break the state machine into two or more clock cycles • Two clock cycles for a state is better than having to slow the clock for the entire state machine • This means breaking up wide input equations using intermediate nodes in the state diagram [Figure: state diagrams with State A, State B, and State C]
Summary • Use Tri-state buffers for multiplexing • Carry Logic is not the only way to create fast arithmetic functions • Use the GSR net to save routing resources and use global routing resources • Use Clock Enable port on registers to design synchronously and save logic • Best memories are <=32 words • Use LogiBLOX to customize memories • Use IOB registers for modules that do not require logic, such as shift registers • Refer to LogiBLOX or Design Manager Help for more information on LogiBLOX
Questions (1) • What problem may occur in this circuit? • How can the circuit be improved? [Figure: binary counter with outputs Q0, Q1, Q2, and TC, connected to a D flip-flop with clock input CK]
Questions (2) • What does GSR stand for? • What component sources the GSR net? • When should the GSR net be used? • What component is instantiated to use the Global Clock? • Can the Global Clock be synthesized?
Questions (3) • How many global clocks can be used in an XC4085XL-3? • See the data sheet for the XC4000XL family, available on the web or the AppLINX CD. • Why is one-hot encoding a good way to encode a small state machine? • When should IOB registers be used? When should they be avoided?