This investigation compares the Virtex-E and Virtex-6 FPGA families, exploring the feasibility of porting the CP FPGA design to Virtex-6 and determining the resource utilization and timing differences between them.
Virtex-6 Investigations • Motivation • Virtex-E/Virtex-6 Comparison • Porting the CP FPGA to Virtex-6 • Next Steps • Conclusion Ian Brawn
Virtex-6 Investigations • Motivation • Virtex-6 is a 'baseline' technology choice for the Phase 1 upgrade • Phase 2 Level 0 Trigger Processor is also likely to use a comparable technology • Role similar to current Level 1 trigger processor: • Low, fixed latency • Algorithms of comparable form and complexity (i.e., not several orders of magnitude more complex) • Desirable to learn capabilities of Virtex-6 • May also provide some guide to future development of FPGAs • (Virtex-7 devices have been announced but data sheets not yet available)
Virtex-E Resources • Virtex-E family released ~2000 • Used as benchmark for Virtex-6 investigations because it is used widely in the current L1 Calo Trigger • XCV1000E used for CP FPGA (+ CMM Crate & System Merger FPGAs)
Virtex-6 Resources • Virtex-6 family released 2009 • Compared to XCV1000E, XC6VLX75T has ~x2 LUTs, x4 FFs, x10 RAM • Only approximate comparison possible because structure of families is different • e.g., 6-input LUTs in Virtex-6; 4-input LUTs in Virtex-E • Sub-families: LXT devices; SXT devices (extra DSP resources); HXT devices (extra MGT resources)
Multiplication in a Virtex 6 • DSP48E1 slice: • 25-bit pre-adder • 25 x 18 multiplier • 48-bit accumulator • Pattern detect logic • Optional pipelining @ 600 MHz • 3 cycles necessary for pre-add, multiply, accumulate • Manipulate algorithms to implement divisions as multiplications, e.g. Threshold < Esum1/Esum2 becomes Esum2 x Threshold < Esum1 • Possible to have multiply/divide operations in V6 without breaking resource/latency budget • (50-page manual on this component, which I've not fully digested)
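The division-to-multiplication rewrite on the DSP48E1 slide can be checked numerically. The following is a minimal sketch, not the firmware: the function names are hypothetical, the inputs are assumed to be plain integers, and the rewrite is assumed to be used only when Esum2 is strictly positive (multiplying an inequality by a negative or zero value would not preserve it).

```python
from fractions import Fraction


def ratio_cut_with_division(esum1: int, esum2: int, threshold: int) -> bool:
    """Reference form: Threshold < Esum1 / Esum2, evaluated exactly."""
    return threshold < Fraction(esum1, esum2)


def ratio_cut_with_multiply(esum1: int, esum2: int, threshold: int) -> bool:
    """Division-free form: Esum2 x Threshold < Esum1.

    Assumption: Esum2 > 0, so multiplying both sides keeps the inequality.
    """
    if esum2 <= 0:
        raise ValueError("rewrite assumes a strictly positive Esum2")
    return esum2 * threshold < esum1


if __name__ == "__main__":
    # Both forms give the same decision for a positive denominator.
    print(ratio_cut_with_division(10, 3, 3), ratio_cut_with_multiply(10, 3, 3))
```

In hardware the multiply form needs a single multiplier and a comparator, which is why it fits a DSP slice; operand widths would still have to fit the 25 x 18 multiplier quoted on the slide.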
Virtex E / Virtex 6 Speed Comparison • Virtex E • We use them mostly at 40 MHz, in some places at 160 MHz • Data sheet: “synchronous system clock rates up to 240 MHz” • Virtex 6 • Data sheet quotes a maximum frequency of 600 MHz for internal logic • Implies Virtex-6 is x2.5 faster, but only if we assume the same amount of processing can be squeezed into each clock cycle in the two families • (Figure: timing diagram showing clock, data transformations A to D, and set-up time)
Porting CP FPGA design to a Virtex-6 • Main motivation is to understand how much processing we can fit into each Virtex-6 clock cycle • Better latency estimate • Also provide a guide to resource usage on Virtex-6 • Sam has ported CMM-Jet design to Virtex-6 for different reasons: baseline design for CMM++ • Chose CP FPGA design to port • Time-critical design on real-time path • Most complex algorithms • Thank you to Richard for providing the source code • Porting • Upgraded Mentor tools • Re-implemented block RAMs • Re-implemented relationally placed macros (time-critical areas of design with fixed placing) • Minor re-design of clock tree • Virtex-6 components allowed & required simplified design • All straightforward • Caveats • Interested in what I could learn, not in producing a working design • Ignored fine timing constraints on IO • Specific to this design with no wider implications for latency • Probably not the most efficient implementation of the design in Virtex-6
CPFPGA: Virtex-E/6 Resource Utilization
                 Virtex-E            Virtex-6
Device           XCV1000E-6BG560     XC6VLX75T-3FF784
LUTs             62%                 28%
Flip-Flops       27%                 6%
BLOCKRAMs        20%                 4%
External IOBs    46%                 52%
• No. flip-flops in Virtex-6 lower than expected • Haven't yet investigated why • Less logic duplication to meet timing requirements? • RAM not used efficiently in Virtex-6 • IO is the one area where things haven't improved • Will need to rely more heavily on serialised data • Can use GTX for incoming calorimeter data (e.g., from SNAP12 @ 6.24 Gb/s/channel) • Also use GTX for data sharing required by overlapping algorithm windows? (latency considerations)
CPFPGA: Virtex-E/6 Timing Comparison • Ideally, re-time registers in Virtex-6 to optimise for 40 MHz clock and recalculate latency • Lot of work for a purely academic exercise • A quicker exercise, which yields approximately the same information, is to shrink the clock period • No changes were made to the design here; just tightened clock constraints • For fair comparison, also established how fast the design can be run on Virtex-E • Normally run at 40 MHz (160 MHz clock for some logic) • This doesn't mean it can't be run faster
CPFPGA: Virtex-E/6 Timing Results (1) • In Virtex-E, CPFPGA minimum clock period is 19 ns • Remember the caveats: fine timing of IO would be destroyed at this speed • This is just a measure of how fast we can run the internal algorithmic processing • In Virtex-6, CPFPGA minimum clock period is 10 ns • ~x2 as fast as Virtex-E • However, these figures were shown to be dominated by the x4 clock speed logic • Very little processing performed between clock cycles here • Signal latency is dominated by routing delays • Therefore these aren't good measures of comparative speeds for algorithmic operations
CPFPGA: Virtex-E/6 Timing Results (2) • To get a better measurement of speed for algorithmic operations, implemented a vertical slice through the algorithm block • No logic at x4 clock speed • (Not enough IO to implement whole algorithm block, hence vertical slice) • Results • In Virtex-E, CPFPGA minimum clock period is 11 ns • In Virtex-6, CPFPGA minimum clock period is 4 ns • Which is consistent with the ~x2.5 increase in speed estimated from the data sheet • Re-optimising the design for the structure of Virtex-6 would probably provide a further increase in speed, but not by 100% • In 9 years from Virtex-E to Virtex-6, speed has increased by x2.5 • Seems unlikely speed is going to increase by an order of magnitude during the lifetime of this project
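The speed ratios quoted on the timing slides follow directly from the minimum clock periods. A short sketch of that arithmetic is given below; the helper names and dictionary layout are illustrative choices, while the numbers are the ones quoted on the slides (19 ns and 10 ns for the full design, 11 ns and 4 ns for the vertical slice, 240 MHz and 600 MHz from the data sheets).

```python
def fmax_mhz(min_period_ns: float) -> float:
    """Maximum clock frequency in MHz for a given minimum period in ns."""
    return 1000.0 / min_period_ns


def speedup(old_period_ns: float, new_period_ns: float) -> float:
    """How many times faster the new device runs the same logic."""
    return old_period_ns / new_period_ns


# Ratio of the data-sheet maximum frequencies (600 MHz vs 240 MHz).
DATASHEET_RATIO = 600 / 240

# Minimum clock periods in ns, as quoted on the slides.
FULL_DESIGN = {"Virtex-E": 19.0, "Virtex-6": 10.0}
VERTICAL_SLICE = {"Virtex-E": 11.0, "Virtex-6": 4.0}

if __name__ == "__main__":
    for label, periods in (("full design", FULL_DESIGN),
                           ("vertical slice", VERTICAL_SLICE)):
        ratio = speedup(periods["Virtex-E"], periods["Virtex-6"])
        print(f"{label}: x{ratio:.2f}, "
              f"Virtex-6 fmax {fmax_mhz(periods['Virtex-6']):.0f} MHz")
    print(f"data sheet: x{DATASHEET_RATIO:.2f}")
```

The full design gives x1.9 and the vertical slice x2.75, which bracket the x2.5 data-sheet estimate.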
Next steps • Implement potential Phase-2 L0 algorithms to calculate latency • e-μ coincidence veto • Use hit map from e/γ algorithm • Assume μ arrive as list of candidates with PT, & (worst case for concurrent processing) • Use Virtex-6 CAM • Ternary mode • 1–512 wide, 16–4096 deep • e-jet coincidence veto • Invariant mass • Investigate Virtex-6 IO as part of wider investigation into data transport in Phase-2 L0 • (Figure: e-μ veto algorithm, not intended to show true size of CAM; labels: μ list, PT, PT Threshold & Map CAM, e/γ hit map, qualified e/γ hit map; example data in and output)
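The "Ternary mode" mentioned for the CAM means each stored entry can mark some bits as "don't care". A generic software model of such a lookup is sketched below to show the idea; it is not the Xilinx CAM core interface, and the class name, field names, and example bit patterns are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TernaryEntry:
    value: int      # stored bit pattern
    care_mask: int  # 1 = bit must match, 0 = "don't care"


def tcam_lookup(entries: List[TernaryEntry], key: int) -> List[bool]:
    """Return one match flag per entry, like the match lines of a CAM.

    An entry matches when the key agrees with the stored value on every
    bit selected by care_mask. In hardware all entries are compared
    in parallel; here the comparison is simply looped.
    """
    return [((key ^ e.value) & e.care_mask) == 0 for e in entries]


if __name__ == "__main__":
    table = [
        TernaryEntry(value=0b1010, care_mask=0b1111),  # exact match only
        TernaryEntry(value=0b1000, care_mask=0b1100),  # matches 10xx
        TernaryEntry(value=0b0000, care_mask=0b0000),  # matches anything
    ]
    print(tcam_lookup(table, 0b1011))
```

The list of match flags plays the role of a hit map: each candidate presented as a key yields, in a single lookup, the set of stored patterns it is compatible with.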
Conclusion • FPGA architecture has advanced in the decade since we implemented the current L1 Calo Trigger Processor • Particular features such as DSP blocks are of interest • Size of devices has risen by > order of magnitude • But speed has increased more slowly: ~x2.5 • No. of IO pins hasn’t increased at all • High-speed serial IO available, but at latency cost • For the Phase 1 Upgrade and Phase 2 Level 0 processor • More complex algorithms (e.g., involving multiplication) are within our scope • But latency concerns haven’t been eliminated by FPGA progress