This paper discusses the implementation of an external scrubber for the ALICE Inner Tracking System (ITS) Readout Unit, addressing radiation challenges and SEU mitigation in the FPGA design.
TWEPP '19, Santiago de Compostela External scrubber implementation for the ALICE ITS Readout Unit Magnus Rentsch Ersdal magnus.ersdal@uib.no
University of Bergen Inner Tracking System (ITS) Upgrade Inner barrel half-layers ITS upgrade cutaway
University of Bergen Readout Electronics
University of Bergen Readout Unit
University of Bergen Radiation environment • Readout Units sit here (figure label) • Design flux: 1 kHz/cm^2, ~4 orders of magnitude more than normal radiation background • Total Ionizing Dose (TID) and Non-Ionizing Energy Loss (NIEL) are low enough to pose no concern
University of Bergen SEUs and CMOS circuits • Single Event Upsets (SEU) • SEU = charge deposited by an ionizing particle (characterized by its LET, linear energy transfer) changing the state of a node (bitflip) • SEUs in configuration cell SRAM
University of Bergen Radiation challenges • SEUs interrupt operations by: • Upsets in configuration memory in SRAM FPGAs (main concern [1]) • Upsets in flash memory • Upsets in registers / state-machines • Potentially, a disruption of the clock / reset nets can stop all activity on the FPGA • Some space projects utilize anti-fuse devices, not an option in our case. • There is a potential for single event functional interrupts [1] New Developments in Error Detection and Correction Strategies for Critical Applications, Melanie Berg, 2017
University of Bergen Mitigation, generally • In our environment, we can ignore dose effects for our FPGAs because TID will be low enough • Tolerates expected doses • We cannot ignore soft errors • Mitigation techniques are applied to our FPGA designs • Triple Modular Redundancy (TMR) on logic • For protecting against configuration memory SEUs, this is not sufficient [1] [1] New Developments in Error Detection and Correction Strategies for Critical Applications, Melanie Berg, 2017
University of Bergen Readout Unit • Additional system component: a flash-based FPGA, ProASIC3 (PA3), for increased radiation tolerance
University of Bergen SEU mitigation for the main FPGA • In FPGA design: TMR (see poster* by M. Lupi) • Scrubbing: • "Scrubbing is the act of simultaneously writing into FPGA configuration memory as the device’s functional logic area is operating with the intent of correcting configuration memory bit errors." [1] • External scrubber that is radiation tolerant • Flash FPGA configuration memory is rad-tolerant [1] New Developments in Error Detection and Correction Strategies for Critical Applications, Melanie Berg, 2017 *https://indico.cern.ch/event/799025/contributions/3486415/
University of Bergen Requirements for External Scrubber • Initial configuration of Xilinx UltraScale (XKCU - main FPGA) using configuration stored in on-board flash memory • Scrubbing of XKCU configuration memory • Configuration and scrubbing are both operating on the SelectMAP bus • Additional requirements: • Scrubbing and initial configuration must be «fast enough» • Scrubbing cycles should have a significantly higher frequency than the SEU rate, rule of thumb: 10x (Xilinx application note xapp216*) • Worst case SEU rate: ~0.04 SEU/s per Readout Unit (8/s for all 192 RUs) • Radiation tolerant • Efficient control interface • Two I2C interfaces are available in hardware • Efficient upload of files *https://www.xilinx.com/support/documentation/application_notes/xapp216.pdf
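The 10x rule of thumb can be turned into a quick back-of-the-envelope check. The sketch below uses only the rates quoted on the slides; the variable names are illustrative, and it assumes scrubbing cycles run back to back, so that the scrub period equals the 1.7 s cycle duration given under "Key numbers".

```python
# Rates quoted on the slides; names are illustrative, not from the project.
SEU_RATE_PER_RU = 0.04   # worst-case upsets per second, per Readout Unit
NUM_RUS = 192            # Readout Units in the full system
RULE_OF_THUMB = 10       # scrub at least 10x more often than upsets arrive
SCRUB_TIME_S = 1.7       # duration of one scrubbing cycle ("Key numbers")

# Aggregate rate over all Readout Units (the slide rounds this to 8/s).
total_rate = SEU_RATE_PER_RU * NUM_RUS

# Mean time between upsets in a single Readout Unit.
mean_time_between_seus = 1.0 / SEU_RATE_PER_RU

# Longest scrub period that still satisfies the 10x rule.
max_scrub_period = mean_time_between_seus / RULE_OF_THUMB

# Assumption: cycles run back to back, so period == cycle duration.
meets_rule = SCRUB_TIME_S <= max_scrub_period

print(f"total SEU rate:   {total_rate:.2f} /s")
print(f"max scrub period: {max_scrub_period:.1f} s")
print(f"1.7 s cycle meets 10x rule: {meets_rule}")
```

Under these assumptions the 1.7 s cycle fits inside the 2.5 s budget that the rule of thumb allows.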
University of Bergen Flash FPGA Design
University of Bergen Config and Scrubbing
University of Bergen File upload
University of Bergen Control
University of Bergen Key numbers • Initial config : 2s (197 Mb) • Scrubbing : 1.7s (151 Mb) • Writing to flash memory done via scripts • I2C: ~230 kb/s • SWT* (Xilinx FIFO): ~4 Mb/s • Resource utilization • Logic cells: 79% • RAM: 4 of 24 *Single Word Transaction, the slow-control protocol for the main FPGA
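To put the key numbers in perspective, the sketch below derives the implied SelectMAP throughput and rough upload times for a file the size of the initial configuration. It assumes 1 Mb = 10^6 bits, 1 kb = 10^3 bits, and no protocol overhead beyond what the quoted rates already include; the helper names are made up for illustration.

```python
def throughput_mbit_s(size_mbit, seconds):
    """Average throughput in Mb/s for a transfer of size_mbit in seconds."""
    return size_mbit / seconds

def transfer_time_s(size_mbit, rate_mbit_s):
    """Time in seconds to move size_mbit at rate_mbit_s."""
    return size_mbit / rate_mbit_s

# Figures from the slide.
CONFIG_MBIT, CONFIG_S = 197.0, 2.0
SCRUB_MBIT, SCRUB_S = 151.0, 1.7
I2C_MBIT_S = 0.230   # ~230 kb/s
SWT_MBIT_S = 4.0     # ~4 Mb/s

config_rate = throughput_mbit_s(CONFIG_MBIT, CONFIG_S)
scrub_rate = throughput_mbit_s(SCRUB_MBIT, SCRUB_S)
upload_i2c_s = transfer_time_s(CONFIG_MBIT, I2C_MBIT_S)
upload_swt_s = transfer_time_s(CONFIG_MBIT, SWT_MBIT_S)

print(f"initial config: {config_rate:.1f} Mb/s")
print(f"scrubbing:      {scrub_rate:.1f} Mb/s")
print(f"upload via I2C: {upload_i2c_s / 60:.1f} min")
print(f"upload via SWT: {upload_swt_s:.1f} s")
```

With these assumptions, uploading one configuration file takes on the order of a quarter of an hour over I2C and under a minute over SWT.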
University of Bergen SEU mitigation in the PA3 design • Local TMR on registers • Recommended method for flash-based FPGAs [1] • Needs 3x DFFs and some additional logic cells for voting (figure reproduced from [1]) [1] New Developments in Error Detection and Correction Strategies for Critical Applications, Melanie Berg, 2017
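The voting step of local TMR is a bitwise majority function over the three register copies. The following is a behavioural sketch in Python, not the PA3 HDL; the function name is hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)

# Three copies of the same 8-bit register value.
good = 0b1011_0010

# One copy suffers a single bit flip (bit 3); the voter masks it.
upset = good ^ (1 << 3)
voted = majority_vote(good, upset, good)
print(f"voted output: {voted:#010b}")
```

A single upset in any one copy is masked; upsets hitting the same bit in two copies are not, which is why the voted value is normally fed back to refresh all three registers.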
University of Bergen SEU mitigation in the Flash memory • Scenario: writing a faulty configuration bit can theoretically stop the Xilinx FPGA from functioning • 1048/1024-bit Hamming error-correcting codes (ECC), interleaved with the data before loading the flash (Python 3 software) • Implementation of TN2908* • GitLab CI creates and encodes the files on every commit • Single-bit correction, double-bit detection. Behaviour with more than 2 bitflips is undefined. • Device has two distinct chips inside the same package. Writing to both in case of critical error on one. *https://www.micron.com/-/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2908_NAND_hamming_ECC_code.pdf
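The single-correct, double-detect behaviour can be illustrated with an address-parity Hamming scheme of the kind commonly used for NAND flash. This is a sketch under assumptions, not the project's encoder: the function names are hypothetical, the block length must be a power of two, and it yields 20 parity bits for 1024 data bits, whereas the slide's 1048/1024 figure implies 24 check bits whose layout is not given here.

```python
def ecc_parities(bits):
    """Return (odd, even) parity words for a power-of-two-length bit list.

    Bit i of `odd` is the parity of data bits whose address has bit i set;
    bit i of `even` is the parity of those whose address has bit i clear.
    """
    mask = len(bits) - 1
    odd = even = 0
    for addr, bit in enumerate(bits):
        if bit:
            odd ^= addr
            even ^= ~addr & mask
    return odd, even

def check_and_correct(bits, stored):
    """Compare recomputed parities with the stored ones and repair if possible."""
    mask = len(bits) - 1
    odd, even = ecc_parities(bits)
    d_odd, d_even = odd ^ stored[0], even ^ stored[1]
    if d_odd == 0 and d_even == 0:
        return "ok", bits
    if d_odd ^ d_even == mask:
        # Every parity pair disagrees exactly once: single data-bit error,
        # and d_odd is the address of the flipped bit.
        fixed = list(bits)
        fixed[d_odd] ^= 1
        return "corrected", fixed
    if bin(d_odd).count("1") + bin(d_even).count("1") == 1:
        return "ecc_error", bits        # the flip hit a stored parity bit
    return "uncorrectable", bits        # two or more data bits flipped

block = [(i * 7 + 3) % 5 == 0 for i in range(1024)]   # arbitrary test pattern
block = [int(b) for b in block]
stored = ecc_parities(block)

damaged = list(block)
damaged[517] ^= 1                                     # inject one bit flip
status, repaired = check_and_correct(damaged, stored)
print(status, repaired == block)
```

A second flip in the same block leaves the parity pairs in a pattern that matches neither case, so the block is reported as uncorrectable rather than silently mis-corrected.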
University of Bergen SEU mitigation in the Flash memory • Based on irradiation campaigns the SEU cross section in the Flash memory is estimated at: • (0→1): 10^-16 cm^2/bit • (1→0): 10^-21 cm^2/bit • A typical scrubbing file has a 1:20 ratio of ones vs zeros • A typical programming file has a 1:50 ratio of ones vs zeros • given no default values written to BRAM • Zeros are far more likely to flip than ones and the files consist mostly of zeros, so the bits of the files are inverted before writing them to the flash memory Weste, Harris: CMOS VLSI Design, p.127
University of Bergen SEU mitigation in the Flash memory • Three measures have been implemented: • Storing the programming file inverted • Adding Hamming encoding of the bitstream • Storing two copies of all the files in the Flash memory • This gives: P(fatal error) = P(double bitflip in one ECC-encoded block in both copies of the file) • P(fatal error) = 7E-26 during 10h spill
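The benefit of inverting the file can be estimated by weighting the two cross sections with the share of zeros and ones actually stored. The sketch below does this for the 1:20 scrubbing file; the function name is illustrative, and the weighting is the only assumption.

```python
CS_0_TO_1 = 1e-16   # cm^2/bit, stored 0 flipping to 1
CS_1_TO_0 = 1e-21   # cm^2/bit, stored 1 flipping to 0

def effective_cross_section(ones, zeros):
    """Average per-bit cross section for a file with the given ones:zeros mix."""
    total = ones + zeros
    return (zeros * CS_0_TO_1 + ones * CS_1_TO_0) / total

# Scrubbing file as generated: 1 one for every 20 zeros.
as_generated = effective_cross_section(ones=1, zeros=20)

# Same file stored inverted: 20 ones for every zero.
inverted = effective_cross_section(ones=20, zeros=1)

print(f"as generated: {as_generated:.2e} cm^2/bit")
print(f"inverted:     {inverted:.2e} cm^2/bit")
print(f"improvement:  {as_generated / inverted:.1f}x")
```

The inverted value agrees with the combined cross section of 4.76E-18 cm^2/bit quoted on the backup slide, roughly a factor of 20 below the non-inverted file.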
University of Bergen Additional feature for commissioning and design qualification • Fault injection • A tool for tabletop "beam-testing" • To be used for commissioning and design qualification only. • This can be exploited to improve rad tolerance and add design recovery routines.
University of Bergen Fault injection HW top level • Select random number -> count down -> flip bit • 14x faster rate than worst case design SEU rate
University of Bergen PRBS "random" functions • Pseudorandom Binary sequence • Linear Feedback Shift Register (LFSR), 32 bits long • scaled to fit memory layout (4504 pages x 4096 bytes)
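A software model of such a generator might look as follows. The slides give only the register length, so the feedback polynomial (x^32 + x^22 + x^2 + x + 1, a widely used maximal-length choice), the Galois form, and the way the value is split into page, byte and bit are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF
TAPS = 0x80200003   # assumed polynomial x^32 + x^22 + x^2 + x + 1

def lfsr_step(state):
    """Advance a 32-bit Galois LFSR by one step (state must be non-zero)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= TAPS
    return state & MASK32

def to_location(value, pages=4504, page_bytes=4096):
    """Map a 32-bit value to (page, byte, bit); hypothetical scaling."""
    page = ((value >> 16) * pages) >> 16   # scale upper 16 bits into [0, pages)
    byte = value & (page_bytes - 1)        # lower 12 bits; 4096 is a power of two
    bit = (value >> 12) & 0x7              # next 3 bits select the bit in the byte
    return page, byte, bit

state = 1                                  # any non-zero seed works
targets = []
for _ in range(5):
    state = lfsr_step(state)
    targets.append(to_location(state))

for page, byte, bit in targets:
    print(f"page {page:4d}  byte {byte:4d}  bit {bit}")
```

Multiplying and shifting, rather than taking the value modulo 4504, avoids a division and spreads the non-power-of-two page count evenly over the 16-bit input range.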
University of Bergen Status • Design is verified and tested; all mandatory features of the FPGA design are ready. • Work in progress: • Finalize fault injection • Remote programming of ProASIC3 • Thank you
ITS Plenary Meeting 28th Feb - 1st Mar 2018 Probability of fatal error • Combined cross section: • CS1:20 = 4.76E-18 cm^2/bit • Probability of double bitflip in ECC block flash #0: • P(double#0) ≈ (CS1:20 * ECC_size * ECC_blocks)^2 = 1.61E-14 • Probability of double bitflip in same ECC block flash #1: • P(double#1 | double#0) ≈ P(double#0) / ECC_blocks = 6.33E-22 • Combined probability: • P(double#1 ∩ double#0) = P(double#0) * P(double#1 | double#0) = 1E-35 • 7E-26 double bitflips in same ECC block in both flash ICs during 10h run • Important numbers: • ECC block size: 1048 bits • # ECC blocks on Flash: 2.52E+07 • Est. flux Run 3: 1 kHz/cm^2 • Fluence 10h spill: 3.6E+07 cm^-2 • Cross-section (1→0): 1.0E-21 cm^2/bit • Cross-section (0→1): 1.0E-16 cm^2/bit • Ratio 1:0 scrub-file: 1:20
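The intermediate values on this slide can be retraced from the listed inputs. The sketch below evaluates the three formulas exactly as written; because the inputs are rounded, the results agree with the slide only to within a few percent. It does not attempt the final 7E-26 figure, since the step leading to it is not shown.

```python
# Inputs as listed under "Important numbers".
FLUX = 1e3                 # particles / (cm^2 s), estimated for Run 3
RUN_SECONDS = 10 * 3600    # 10 h
CS = 4.76e-18              # cm^2/bit, combined cross section (1:20 file)
ECC_SIZE = 1048            # bits per ECC block
ECC_BLOCKS = 2.52e7        # ECC blocks on the flash

# Fluence accumulated over the 10 h period.
fluence = FLUX * RUN_SECONDS

# Double bitflip in some ECC block of flash #0 (formula as on the slide).
p_double_0 = (CS * ECC_SIZE * ECC_BLOCKS) ** 2

# Double bitflip in the same ECC block of flash #1, given one in flash #0.
p_double_1_given_0 = p_double_0 / ECC_BLOCKS

# Both events together.
p_both = p_double_0 * p_double_1_given_0

print(f"fluence:             {fluence:.1e} cm^-2")
print(f"P(double#0):         {p_double_0:.2e}")
print(f"P(double#1|double#0): {p_double_1_given_0:.2e}")
print(f"P(both):             {p_both:.1e}")
```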
ITS Plenary Meeting 28th Feb - 1st Mar 2018 Resource usage & timing
University of Bergen How random is prbs