RHIC Run 14 Radiation Upsets Kevin Brown C-AD Control Systems
Acknowledgements
• Nick Franco, Bill Eisele
  • Had TLDs put into alcove 11B (help from Dana Beavis & Paul Bergh)
  • Monitored network problems
  • Coordinated with controls HW to add test network switch to 11B alcove
• Controls HW (Charles Theisen, Ralph Schoenfeld)
  • Monitor, maintain, & replace FECs
  • Added network switch and test FEC to 11B alcove
• John Morris
  • Analysis scripts and web data pages for network and FEC reset statistics
• Al Marusic
  • Performed RAM pattern measurements
  • Watches FECs very closely
• Kin Yip (w/ help from Dana Beavis & Angelica Drees)
  • Simulations of radiation field in alcoves
• Peter Ingrassia – operations statistics
RHIC Run 14 Retreat
Outline
• Alcove Radiation Upsets – network switches
• Quick Tour of an Alcove
• Comparison of Upsets for Au-Au runs
• History of Radiation upsets
• Simulations of beam losses near an alcove
• RAM pattern & TLD measurements
• Possible mitigations
Run 14, 100 GeV/n Au-Au: alcove network switch radiation upset locations
[Figure: map of upset locations; color scale from 0 to 54]
255 GeV p-p: alcove network switch radiation upset locations
[Figure: map of upset locations; color scale from 0 to 12]
Impact on Operations
There is an impression that resets are just an annoyance, but they can develop into real downtime.
• For Run 11: 2.8 hours were charged for network failures.
• For Run 14: 13.2 hours were charged.
When a switch is out of communication, we are not in control of part of the accelerator!
Alcoves
• 18 alcoves in RHIC, 3 per arc.
• Two basic layouts – rack use is always the same.
[Figure: alcove layout showing racks 8, 9, and 10]
[Figure: Rack 8]
• Run 14 averaged 1.8 resets/day
• Run 11 averaged 0.5 resets/day
• For 250 GeV p-p, the average was 0.15 resets/day
History – 2004
• Front End Processors
  • Power 3E processors lacked ECC memory
  • Some locations had CPUs with ECC memory
    • 1007C, 1009A, 1009C
  • Began process of replacing select CPUs with MVME2112 processors, which have ECC memory
• VME Chassis (250 in field, 40 in alcoves)
  • 13 VME PS failures in alcoves (38%)
  • 3 VME PS failures in service buildings (<1%)
  • Collaboration: VME chassis rad tested for LHC at CERN (Wiener chassis now in use)
  • Began looking at rad-resistant PSs; found one that performed well (Vicor PS now in use)
V115 (WFGs)
• Tried ECC memory in WFGs (in 2004)
  • To replace 128K x 32 static RAM
  • Plugged into existing memory footprint
  • Initially performed well in tests
  • Didn’t perform as well as expected in alcoves
• After 2004 switched to MRAM
  • Finished MRAM installation by 2011
• In addition
  • Alcove FECs (for PSs) were upgraded to use ECC memory
  • Stopped using RAM disks to save files
  • Code was modified to save 3 copies of data that had to remain in RAM disks (to avoid corruption)
  • RAM disks for FECs that did not need them were removed (i.e., BPMs and QD FECs)
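Keeping three copies of data makes it possible to recover from corruption of any single copy by a bitwise majority vote. The following is a minimal Python sketch of that idea, not the FEC code itself: the function name `majority_vote` and the sample bytes are hypothetical, and the slides do not say how the three copies are actually compared.

```python
def majority_vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Return the bitwise majority of three equal-length copies.

    Each output bit takes the value held by at least two of the copies,
    so a bit flip in any one copy is outvoted by the other two.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("copies must have the same length")
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


# Hypothetical data: three copies, one of which suffers a single-bit upset.
original = b"\x12\x34\x56\x78"
corrupted = bytes([original[0] ^ 0x08]) + original[1:]
recovered = majority_vote(original, corrupted, original)
```

The vote only helps while at most one copy is wrong at any given bit position, so the copies still need to be refreshed periodically before upsets accumulate.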
Simulations of beam losses near an alcove
• Simulations from Kin Yip
• Single loss point near alcove, based on beam loss patterns during operations and optics model for accelerator
• Terminology: “reduction factor” is the ratio of flux with shielding to flux without shielding, expressed in %.
• Therefore a 100% reduction factor means no reduction, and lower values mean more effective shielding.
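The reduction factor defined above can be written as a one-line calculation. This is a small Python sketch; the function name and the flux values in the example are made up for illustration and are not taken from the simulations.

```python
def reduction_factor_percent(flux_with_shielding: float,
                             flux_without_shielding: float) -> float:
    """Ratio of flux with shielding to flux without shielding, in percent."""
    if flux_without_shielding <= 0:
        raise ValueError("flux without shielding must be positive")
    return 100.0 * flux_with_shielding / flux_without_shielding


# Hypothetical fluxes (per cm^2 per ion), chosen only to illustrate the scale.
halved = reduction_factor_percent(1.0e-4, 2.0e-4)      # shielding halves the flux
unchanged = reduction_factor_percent(2.0e-4, 2.0e-4)   # shielding has no effect
```

With this definition a value above 100% means the flux at that detection point is higher with the shielding in place than without it.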
[Figure: scraping loss case. (n+) flux per gold ion at detection points, with labeled fluxes of 1.4×10⁻³ /cm² and 2.6×10⁻⁴ /cm². Reduction factors in %: 102%, 101%, 119%, 91%, 87%, 125%, 88%, 91%.]
[Figure: reduction factors in %: 46%, 75%, 57%, 90%, 76%, 80%.]
Memory Upset Measurements
• Began taking measurements Apr. 20
• Only using RAM disks in alcove PS FECs
• Place a 64-bit pattern repeatedly into 2 Mbytes of memory
  • Creates 262,144 samples per FEC
• Do periodic reads of patterns to look for changes
• Changes are almost always at the single-bit level
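The measurement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern value `PATTERN` is hypothetical (the slides do not give the actual pattern), 2 Mbytes is taken as 2 × 1024 × 1024 bytes, and a `bytearray` stands in for the RAM disk on a FEC.

```python
PATTERN = 0xA5A5A5A5A5A5A5A5      # hypothetical 64-bit test pattern
WORD_BYTES = 8                    # 64 bits
REGION_BYTES = 2 * 1024 * 1024    # 2 Mbytes


def fill(region: bytearray, pattern: int = PATTERN) -> None:
    """Write the pattern repeatedly across the whole region."""
    word = pattern.to_bytes(WORD_BYTES, "little")
    region[:] = word * (len(region) // WORD_BYTES)


def scan(region: bytearray, pattern: int = PATTERN) -> list:
    """Return (word_index, flipped_bit_count) for every word that changed."""
    upsets = []
    for i in range(len(region) // WORD_BYTES):
        start = i * WORD_BYTES
        value = int.from_bytes(region[start:start + WORD_BYTES], "little")
        diff = value ^ pattern
        if diff:
            upsets.append((i, bin(diff).count("1")))
    return upsets


region = bytearray(REGION_BYTES)
fill(region)
n_samples = len(region) // WORD_BYTES

# Simulate a single-bit upset in word 1000, then run one periodic read.
region[1000 * WORD_BYTES] ^= 0x01
upsets = scan(region)
```

Dividing 2 Mbytes by 8 bytes per word gives the 262,144 samples per FEC quoted on the slide. Counting the flipped bits per word is what lets a scan distinguish single-bit changes from multi-bit ones.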
Memory Pattern Upsets: Across Alcove Test FEC
• Test FEC placed in corner of room
• Compared to PS1 & PS2 near door
• After 1 RHIC store
  • 41 errors in PS1
  • 36 errors in PS2
  • 32 errors in TEST
TLD Results
[Figure: TLD results for TLDs 322, 323, 324, and 325, labeled by exposure time (2 wk or 4 wk)]
• TK322 & TK324 installed 4-30-14 & kept in place for 4 weeks
• TK323 & TK325 installed 4-30-14 & kept in place for 2 weeks
Possible Mitigation
• Could try network switches that use ECC memory
  • Costs are high, >$30k/switch
• Could add shielding
  • Kin’s simulations suggest reduction factors are not large
• Could move sensitive equipment to “quiet” alcoves
  • Some new equipment required & labor intensive
• Could move some network switches out of alcoves to service buildings
  • Distances are large (~1400 ft)
  • Technology issues need investigation
    • Use copper cable instead of fiber?
    • Need to use converters, which may be susceptible and may pose operational challenges.
Plans under discussion
• Improve on the data collection to better monitor the “health” of the equipment in the alcoves.
  • Automate & log data from memory pattern scans
  • Log network statistics and consider additional methods to monitor the health of network switches (directly or indirectly)
• Improve ability to deal with switch problems
  • Riding through a problem more gracefully
  • Improve how we recover
  • Recovering automatically
  • Identify equipment impacted and make more robust
• Improve alarming mechanisms to alert operations of switch problems (distinguish more clearly between FEC problems and switch problems)
• Next run explore mitigation strategies further (see previous slide).
Backup/Auxiliary Slides
[Figures: Rack 9 and Rack 10]
Some Terminology
• ECC memory: Error-correcting code memory
  • Detects and corrects most common kinds of internal data corruption.
  • Is immune to single-bit errors
  • Makes use of additional circuitry that checks accuracy of data during I/O operations.
  • E.g., data read from a word of memory will remain uncorrupted even if a single bit in that memory was flipped.
• MRAM: Magneto-resistive random-access memory
  • Non-volatile memory
  • Data is not stored as electrical charge, but by magnetic storage elements (via two ferromagnetic plates), so works on spin, not charge.
• Single Event Upsets, Single Event Latchups, Single Event Gate Rupture, & Single Event Burnouts
  • Radiation striking a sensitive node in a micro-electronic device
  • Most common are upsets, which are not permanent damage
  • Result is at least a single bit change
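To illustrate how an error-correcting code survives a single flipped bit, here is a toy Hamming(7,4) code in Python. It is only a sketch of the principle: it protects 4 data bits with 3 check bits, whereas real ECC memory uses wider codes implemented in hardware, and nothing here reflects the specific scheme used in the FECs or switches.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into a 7-bit Hamming codeword (index 0 unused)."""
    c = [0] * 8
    # Data bits occupy positions 3, 5, 6, 7; check bits occupy 1, 2, 4.
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c


def decode(codeword: list) -> tuple:
    """Correct up to one flipped bit; return (nibble, error_position)."""
    c = list(codeword)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 | (s2 << 1) | (s4 << 2)   # 0 means no error detected
    if syndrome:
        c[syndrome] ^= 1                    # syndrome is the flipped position
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return nibble, syndrome


# Store the value 0b1011, flip the bit at position 6, and read it back.
stored = encode(0b1011)
stored[6] ^= 1
value, position = decode(stored)
```

The syndrome directly gives the position of the flipped bit, which is why a single-bit upset is corrected transparently on read. Two flips in the same codeword would be miscorrected by this toy code.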