Design strategies for maximizing performance and overcoming challenges in Virtex devices. Learn about power integrity analysis, signal integrity analysis, and high-speed design. Contact Mayo Clinic SPPDG for more information.
Design for Performance Optimization of Virtex Devices Steve Currie – 6/26/2006 Mayo Clinic SPPDG 507-538-5460 currie.steven@mayo.edu
About the Mayo SPPDG (Special Purpose Processor Development Group) • Not generally a “product delivering” organization • Risk reduction efforts, proof-of-concept, prototypes, test vehicles • Evaluate emerging technologies • Push existing technology to the limits • Commonly-known strengths • Power integrity analysis (power delivery design/analysis) • Signal integrity analysis • High-speed design/test • e.g., DC to 80 Gbps logic
Experience with Virtex FPGAs • Done: 10 Gbps in V2 (XC2V6000 -6, BF957) • 16-bit LVDS busses @ 690 Mbps • Soft-SERDES implementation • Multiple clock domains • Doing: 50 Gbps in V4 (FX100 & FX140 in FF1517 package) • SPI-5 (16+1 RocketIO @ 3.125 Gbps) • Dual 50 Gbps “lite” interfaces: 8+8 RocketIO @ 6.25 Gbps • 400+ bit, ~200 MHz SRAM interface • DDR2, QDR2, 84+ Gbps • Using nearly all IO in all banks • Phase-shifted IO reaching the different memory modules • Heavy internal resource utilization (% TBD)
19837 POE Prototype [Photos: two front-panel views and a top-down view]
General Concerns – 1 • % utilization of IO (SSO), core (timing/placement) • Higher speed processing and throughput could require intense IO operation (either “more” or “each-faster”) • Complex core processing at high speed requires extensive pipelining, or perhaps duplicated processing functions – internal timing becomes challenging! • Jitter/clocking • High speed clock sources, fanout (routing, buffering), multiplication, clean-up • Clock recovery circuits • SSO, power delivery • Aggressive decoupling competes with power supply stability • Rules of thumb break down as utilization % increases (a first-pass sketch of the usual rule follows)
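To make that last bullet concrete, below is a minimal sketch of the classic target-impedance rule of thumb for power delivery. Every number in it (supply voltage, ripple budget, power estimate, the 50% transient-current assumption) is a hypothetical placeholder, and this is exactly the kind of estimate that stops being trustworthy as utilization climbs.

```python
# Rough first-pass PDN target-impedance estimate (illustrative only --
# real designs need full package/board/decoupling analysis).
# All numbers are hypothetical placeholders, not Virtex specifications.

vdd = 1.2                    # core supply voltage (V)
ripple = 0.05                # allowed ripple as a fraction of Vdd
p_core = 20.0                # estimated core power (W)

i_avg = p_core / vdd
i_transient = 0.5 * i_avg    # rule of thumb: ~50% of Iavg switches at once

z_target = (vdd * ripple) / i_transient
print(f"Iavg = {i_avg:.1f} A, Itransient = {i_transient:.1f} A")
print(f"Target PDN impedance = {z_target * 1000:.2f} mOhm across the band")
```

At high utilization the 50% switching assumption collapses, which is why the full package/board/decoupling analysis discussed in later slides becomes unavoidable.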
General Concerns – 2 • Power-on reset, initial conditions • Large current spikes coming out of configuration/”reset” • Defining initial conditions of “analog” elements • Large/wide bus termination • Discrete termination wastes less power than active termination, but at the cost of a large footprint (sketched below) • Competes with power delivery system components • Could move to buried resistors, but there lies another set of problems • “Secret” how-tos and inconsistent documentation • Many details of RocketIO operation were [mis]documented across the various documents available • We utilized an existing Titanium Support contract to get the “truth” • 3rd-party IP often needed to push basic capability to acceptable performance • Attempting to saturate Gigabit Ethernet with the Xilinx “included” TCP/IP stack vs. the pay-for option • Appnotes and their boundaries (assumptions/limitations) should be thoroughly understood before being used – don’t expect “cut and paste” simplicity
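As a feel for the footprint-vs-power trade named above, here is a small sketch of the DC power burned in discrete parallel termination; the 50 Ω / 0.4 V numbers are assumptions for illustration, not values from any datasheet.

```python
# Illustrative per-pin cost of discrete parallel termination (assumed values).
r_term = 50.0      # ohms, parallel termination to Vtt (hypothetical)
v_offset = 0.4     # volts of pin offset from Vtt (hypothetical)

i_pin = v_offset / r_term          # DC current through one resistor
p_pin = i_pin ** 2 * r_term        # power burned in that resistor

n_pins = 400                       # e.g., a 400+ bit memory interface
print(f"{p_pin * 1000:.1f} mW/pin -> {p_pin * n_pins:.2f} W total, plus "
      f"{n_pins} resistor footprints competing with decoupling caps")
```

The watt or so of resistor dissipation is modest; the hundreds of footprints fighting the decoupling caps for board area are the real problem.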
Our V2-Specific Challenges • Multiple clock domains inside the part • Location of global clock pins vs DCMs, etc. • Unusable jitter with clock multipliers • Clean-up PLLs off-chip • LVDS busses near their speed limits • Needed soft-SERDES macro and precise clock-to-data alignment • Using a large % of the chip resources complicates timing • Hand-placement often required to make timing • Xilinx Titanium Support provided very valuable in-depth knowledge and, hence, solutions to some problems • Having consultation during the design phase is better than having them debug/patch after the problems exist
Our V4-Specific Challenges – 1 • Core speed didn’t scale up from V2 as other capabilities did • We were hoping for 400 MHz, which appears unlikely • Requires a “dual-parallel” data path at ½-rate inside, which increases the core usage • Package design is good, but power delivery recommendations don’t suit complex designs • Evaluation boards don’t follow these recommendations • SSO is still a problem, and the somewhat cumbersome SSO calculator is critical to making this work (a simplified version of its check is sketched below) • Thorough power-delivery system analysis (HFSS, SIwave) requires knowledge of the package construction (and on-chip/package decoupling) which is difficult to acquire (NDA, etc.) • Crosstalk analysis shows the need for painful routing of memory IO • RocketIO require a significant power filtering network for each transceiver (whether or not each transceiver is used), further complicating an already dense layout
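The SSO calculator referenced above is essentially a per-bank weighted-sum check; the sketch below mimics that check with made-up per-standard limits (use the real limits from the Xilinx spreadsheet for an actual design).

```python
# Simplified stand-in for the per-bank SSO utilization check.
# Limits below are invented placeholders, NOT Xilinx numbers.
sso_limit = {"SSTL18_II": 12, "LVCMOS25_12mA": 10}   # allowed SSOs per bank
bank_usage = {"SSTL18_II": 9, "LVCMOS25_12mA": 4}    # outputs switching together

utilization = sum(n / sso_limit[std] for std, n in bank_usage.items())
print(f"Bank SSO utilization = {utilization:.0%}")
if utilization > 1.0:
    print("Over budget: split the bus across banks, slow edges, or re-plan IO")
```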
Our V4-Specific Challenges – 2 • Power consumption • RocketIO were planned for 10+ Gbps, hence they consume more power than if they had been designed for the current-errata maximum: 3.125 Gbps (and “Step 1” maximum: 6.25 Gbps) • Initial estimates showed 35 Watts per FPGA for our desired capability – now a cooling challenge as well • Power delivery system • No room for discrete termination AND decoupling, thus active termination (even with the power cost) is preferred over the problems with buried resistors (cost/debug) • RocketIO usage requires 8b/10b per the latest errata • Effectively reduces throughput capacity by 20% (worked out below) • Eliminates the SPI-5 and 8-bit, 50 Gbps interfaces • We only have a run-length problem, but 8b/10b is also DC-free: overkill • Could consider a custom encoding scheme, but the 8b/10b is a “free” hard-macro in the RocketIO (fast, no extra resources used) • Limited channel-bonding capabilities • Must do channel bonding in the core for unsupported interface protocols (increased power, core usage)
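The 20% hit referenced above falls straight out of the coding arithmetic; a quick check against the 8-lane “lite” interface:

```python
# Payload arithmetic for 8b/10b on the errata-limited lane rate quoted above.
line_rate_gbps = 6.25        # per-lane "Step 1" maximum
coding_eff = 8 / 10          # 8b/10b carries 8 payload bits per 10 line bits
lanes = 8

payload = lanes * line_rate_gbps * coding_eff
print(f"{lanes} lanes x {line_rate_gbps} Gbps x {coding_eff:.0%} = "
      f"{payload:.0f} Gbps payload (vs. the 50 Gbps raw goal)")
```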
Our V5-Specific Concerns • NDA-protected conversations have made us fond of the V5 roadmap, but there are concerns • Schedule and feature-set reliability • V4 slipped/changed repeatedly… what to expect from V5? • Implied SEE sensitivity with the addition of configuration frame ECC (post-configuration checking) – a 65nm problem?
Problem Summary • Signal Integrity Analysis • Lots of SSO, dense routing, crosstalk (non-LVDS data paths) • RocketIO link analysis • All require I/O SPICE models from Xilinx which must first be validated against hardware • Also require interconnect models (transmission lines) • Power analysis/integrity • Power supply selection must tie in with the decoupling design • Very low impedance power delivery helps with SSO, but is problematic for power supplies (extensive analysis of package, board, decoupling, and supply required) • Internal timing constraints and problems • Need for “hands on” place/route inside FPGAs to get peak performance • Design consultation might be appropriate (we used the Xilinx Titanium Service) • Architecture design for lowest clock jitter • Clock circuitry is different across V2-V2P-V4-V5 • Need “inside” knowledge: design consultation, again • ChipScope is a good internal debugging tool
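For the jitter and link-analysis items above, a back-of-envelope budget often frames the discussion. The sketch below uses the common DJ-plus-14×-RJ total-jitter convention for a 1e-12 BER target; every source value is a hypothetical placeholder, not a measured number.

```python
import math

# Back-of-envelope total-jitter budget at 6.25 Gbps.
# All jitter source values are hypothetical, not measured numbers.
ui_ps = 1e12 / 6.25e9                    # unit interval = 160 ps

dj_ps = [20.0, 15.0, 10.0]               # deterministic: TX, channel/ISI, SSO
rj_rms_ps = [1.0, 1.5]                   # random (rms): ref clock, TX PLL

total_dj = sum(dj_ps)                                 # DJ adds linearly
total_rj = math.sqrt(sum(r * r for r in rj_rms_ps))   # RJ adds as RSS
tj = total_dj + 14 * total_rj            # ~14 x RJrms spans BER 1e-12

print(f"UI = {ui_ps:.0f} ps, TJ = {tj:.1f} ps, "
      f"margin = {ui_ps - tj:.1f} ps before RX penalties")
```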
One Specific Problem: High Speed Bus Clock/Data Alignment • Problem: Multi-bit data bus and clock are captured at the target FPGA with imperfect alignment • A V2 solution: xapp268 • Assumes all clock and data signals that make up a bus arrive “close” in phase, and uses DCM-delay to sample the clock with itself to find the “middle” of the clock for capture alignment • Clever, but isn’t finding the center of the data window • Requires a global clock input and DCM for the xapp to work as intended • A global clock input is needed per bus – not so easy • A more data-centric solution (sketched below) • Measure the goodness of each DLL setting by checking bit error rates on the data bits • Identifies the best clock-to-data alignment based on an “averaged” data window • Uses upstream data generation, local data-compare • More core resources used, but supports very high speeds, large/small busses, poorly matched routing, etc.
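A minimal sketch of that data-centric search loop follows; set_delay and count_errors are hypothetical stand-ins for whatever control path the board provides (DCM phase-shift port, register writes over JTAG, etc.).

```python
# Step the clock-to-data delay, count errors against known upstream data,
# then centre in the widest error-free window (the "averaged" data window).
# set_delay / count_errors are hypothetical hardware-access hooks.

def scan_alignment(set_delay, count_errors, n_taps, dwell_words=1_000_000):
    """Return the delay tap at the centre of the widest error-free window."""
    errors = []
    for tap in range(n_taps):
        set_delay(tap)                            # move clock-to-data phase
        errors.append(count_errors(dwell_words))  # RX data vs. expected data

    best_start = best_len = cur_len = 0
    for tap, e in enumerate(errors):
        cur_len = cur_len + 1 if e == 0 else 0
        if cur_len > best_len:
            best_len = cur_len
            best_start = tap - cur_len + 1
    return best_start + best_len // 2             # middle of the passing window
```

Running this per bus (or per bit, on V4 and later) trades core resources and bring-up time for a capture point centred in the real data eye rather than the clock’s midpoint.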
High Speed Bus Clock to Data Alignment • Problem: Multi-bit data bus and clock are captured at target FPGA with imperfect alignment • V4 solutions • IDELAY and ISERDES • Per-bit clock to data alignment capability and hard SERDES macro • New clock resources compared to V2 (PMCD) • V5 solutions • IDELAY + ODELAY • New/changed clock resources compared to V4
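For scale on the V4 per-bit approach: the IDELAY line offers 64 taps of roughly 78 ps each when IDELAYCTRL is calibrated from a 200 MHz reference, so a quick tap-budget check looks like the sketch below (the 600 Mbps data rate is a hypothetical example).

```python
# Tap budget for per-bit alignment with the V4 IDELAY (64 taps, ~78 ps each
# with a 200 MHz IDELAYCTRL reference). Data rate below is hypothetical.
tap_ps = 1e12 / (200e6 * 64)     # ~78.1 ps per tap
data_rate = 600e6                # bits/s per pin -> ~1.67 ns bit time
bit_time_ps = 1e12 / data_rate

taps_per_bit = bit_time_ps / tap_ps
print(f"{tap_ps:.1f} ps/tap; one bit time spans {taps_per_bit:.1f} taps, "
      f"so a half-bit shift needs ~{taps_per_bit / 2:.0f} taps")
```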
Summary • Rules of thumb don’t cut it • Analysis and design are required to provide the kind/quantity of clean power needed for large, heavily utilized devices at high speed • Signal integrity analysis is required for dense routing and fast signals • Devices change significantly from family to family • Unless you want to be an expert with each, hire design consultation • What was once hard may become easy, but it also means that what once worked might not any longer (design reuse) • Data paths get more complicated with speed • Must manage clock/data alignment • Framing is required to properly align busses • Proper signal-integrity methodology becomes mandatory • Power consumption is significant, but the power must be clean as well • Requires simultaneous analysis of package, board, and other power-delivery system components • RocketIO require an extensive power filter network • Clock architecture • RocketIO: Recovered clocks, dedicated MGTCLK inputs (what frequency is best for the PLLs?) and the problems (e.g., jitter) with each require intimate knowledge of the FPGA architecture • General: must understand on-chip clock resources and their limitations (“geographic” restrictions, jitter requirements OR jitter generated) • Communications protocol implementations are somewhat limited • Hard-macros cater to a select set of protocols • Intrinsic performance limitations make some implementations improbable (e.g., SPI-5)