RHIC Run 14 Radiation Upsets

Presentation Transcript


  1. RHIC Run 14 Radiation Upsets. Kevin Brown, C-AD Control Systems

  2. Acknowledgements
     • Nick Franco, Bill Eisele
        • Had TLDs put into alcove 11B (help from Dana Beavis & Paul Bergh)
        • Monitored network problems
        • Coordinated with controls HW to add test network switch to 11B alcove
     • Controls HW (Charles Theisen, Ralph Schoenfeld)
        • Monitor, maintain, & replace FECs
        • Added network switch and test FEC to 11B alcove
     • John Morris
        • Analysis scripts and web data pages for network and FEC reset statistics
     • Al Marusic
        • Performed RAM pattern measurements
        • Watches FECs very closely
     • Kin Yip (w/ help from Dana Beavis & Angelica Drees)
        • Simulations of radiation field in alcoves
     • Peter Ingrassia: operations statistics
     RHIC Run 14 Retreat

  3. Outline
     • Alcove Radiation Upsets – network switches
     • Quick Tour of an Alcove
     • Comparison of Upsets for Au-Au runs
     • History of Radiation upsets
     • Simulations of beam losses near an alcove
     • RAM pattern & TLD measurements
     • Possible mitigations

  4. Run 14, 100 GeV/n Au-Au: alcove network switch radiation upset locations
     [Figure: upset locations; scale ticks 0, 11, 21, 32, 43, 54]

  5. 255 GeV p-p: alcove network switch radiation upset locations
     [Figure: upset locations; scale ticks 0, 2, 5, 7, 10, 12]

  6. Impact on Operations
     There is an impression that resets are just an annoyance! But they can develop into real downtime.
     • For Run 11: 2.8 hours were charged for network failures.
     • For Run 14: 13.2 hours were charged.
     When a switch is out of communication, we are not in control of part of the accelerator!

  7. Outline (repeat of slide 3)

  8. Alcoves
     18 alcoves in RHIC, 3 per arc. Two basic layouts; rack use is always the same.
     [Figure: alcove layout; labels 8, 9, 10]

  9. Rack 8

  10. Outline (repeat of slide 3)

  11. Average reset rates
      • Run 14 averaged 1.8 resets/day
      • Run 11 averaged 0.5 resets/day
      • For 250 GeV p-p, the average was 0.15 resets/day
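
To put the three averages above side by side, the short sketch below computes their ratios. Only the three rates come from the slide; the dictionary keys and the `ratio` helper are illustrative names.

```python
# Average reset rates quoted on the slide, in resets per day.
rates = {
    "Run 14": 1.8,
    "Run 11": 0.5,
    "250 GeV p-p": 0.15,
}


def ratio(a: str, b: str) -> float:
    """Return how many times higher the reset rate of `a` is than that of `b`."""
    return rates[a] / rates[b]


if __name__ == "__main__":
    print(f"Run 14 vs Run 11:      {ratio('Run 14', 'Run 11'):.1f}x")
    print(f"Run 14 vs 250 GeV p-p: {ratio('Run 14', '250 GeV p-p'):.1f}x")
```

Run 14 comes out at roughly 3.6 times the Run 11 rate and roughly 12 times the p-p rate.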

  12. Outline (repeat of slide 3)

  13. History – 2004
      • Front End Processors
         • Power 3E processors lacked ECC memory
         • Some locations had CPUs with ECC memory
            • 1007C, 1009A, 1009C
         • Began process of replacing select CPUs with MVME2112 processors, which have ECC memory
      • VME Chassis (250 in field, 40 in alcoves)
         • 13 VME PS failures in alcoves (38%)
         • 3 VME PS failures in service buildings (<1%)
         • Collaboration: VME chassis rad tested for LHC at CERN (Wiener chassis now in use)
         • Began looking at rad-resistant PSs; found one that performed well (Vicor PS now in use)

  14. V115 (WFGs)
      • Tried ECC memory in WFGs (in 2004)
         • To replace 128K x 32 static RAM
         • Plugged into existing memory footprint
         • Initially performed well in tests
         • Didn’t perform as well as expected in alcoves
      • After 2004 switched to MRAM
         • Finished MRAM installation by 2011
      • In addition
         • Alcove FECs (for PSs) were upgraded to use ECC memory
         • Stopped using RAM disks to save files
         • Code was modified to save 3 copies of data that had to remain in RAM disks (to avoid corruption)
         • RAM disks for FECs that did not need them were removed (i.e., BPMs and QD FECs)
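
The slide says three copies of RAM-disk data were kept to avoid corruption, but it does not say how the copies were reconciled. One common approach is a bitwise 2-of-3 majority vote, sketched below. The voting scheme and the function name `majority_vote` are assumptions for illustration, not the actual FEC code.

```python
def majority_vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Bitwise 2-of-3 vote over three equal-length copies of the same data.

    Assumption: this voting scheme is illustrative; the slides only state
    that three copies were saved, not how they were compared.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all three copies must have the same length")
    # A result bit is 1 exactly when at least two of the three copies have a 1.
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


if __name__ == "__main__":
    good = b"RHIC"
    flipped = bytes([good[0] ^ 0x10]) + good[1:]  # one bit upset in one copy
    print(majority_vote(good, flipped, good) == good)
```

The vote recovers the data as long as no bit position is upset in more than one copy.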

  15. Outline (repeat of slide 3)

  16. Simulations of beam losses near an alcove
      • Simulations from Kin Yip
      • Single loss point near alcove, based on beam loss patterns during operations and optics model for accelerator
      • Terminology: “reduction factor” is the ratio of flux with shielding to flux without shielding.
      • Therefore, a 100% reduction factor means no reduction.
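
The reduction-factor definition above can be written out as a one-line calculation. This is a minimal sketch: the function name and the flux values in the usage lines are made up for illustration and are not taken from the simulations.

```python
def reduction_factor_percent(flux_shielded: float, flux_unshielded: float) -> float:
    """Ratio of flux with shielding to flux without shielding, in percent."""
    if flux_unshielded <= 0:
        raise ValueError("unshielded flux must be positive")
    return 100.0 * flux_shielded / flux_unshielded


if __name__ == "__main__":
    # Hypothetical fluxes (arbitrary units), not simulation output.
    print(reduction_factor_percent(0.5, 1.0))  # shielding halves the flux
    print(reduction_factor_percent(1.0, 1.0))  # shielding changes nothing
    print(reduction_factor_percent(1.2, 1.0))  # flux is higher with shielding
```

With this convention, smaller is better: a value below 100% means the shielding lowered the flux at that detection point, 100% means no change, and a value above 100% means the flux there went up.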

  17. [Figure: scraping loss; (n+) flux per gold ion at detection points, with reduction factors in %. Flux labels: 1.4×10⁻³ /cm² and 2.6×10⁻⁴ /cm². Reduction factors shown: 102%, 101%, 119%, 91%, 87%, 125%, 88%, 91%.]

  18. [Figure: reduction factors in %: 46%, 75%, 57%, 90%, 76%, 80%]

  19. Outline (repeat of slide 3)

  20. Memory Upset Measurements
      • Began taking measurements Apr. 20
      • Only using RAM disks in alcove PS FECs
      • Place 64-bit pattern repeatedly into 2 Mbytes of memory
      • Creates 262,144 samples per FEC
      • Do periodic reads of patterns to look for changes
      • Almost always changes are at the single-bit level
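
The measurement described above can be sketched in a few lines: fill 2 Mbytes with a repeated 64-bit word (2 × 1024 × 1024 / 8 = 262,144 words), then re-read the region and count the bits that changed. The pattern value, byte order, and function names below are assumptions; the slides do not give the pattern or the actual scan code.

```python
import struct

PATTERN = 0xAAAAAAAAAAAAAAAA      # hypothetical 64-bit test pattern
REGION_BYTES = 2 * 1024 * 1024    # 2 Mbytes, as on the slide
WORD_BYTES = 8                    # one 64-bit sample
N_WORDS = REGION_BYTES // WORD_BYTES  # 262,144 samples


def fill_region(pattern: int = PATTERN) -> bytearray:
    """Return a 2 Mbyte buffer filled with the repeated 64-bit pattern."""
    return bytearray(struct.pack("<Q", pattern) * N_WORDS)


def scan_region(region: bytearray, pattern: int = PATTERN) -> list[tuple[int, int]]:
    """Return (word_index, flipped_bit_count) for every word that changed."""
    upsets = []
    for index, (word,) in enumerate(struct.iter_unpack("<Q", region)):
        diff = word ^ pattern  # set bits mark positions that flipped
        if diff:
            upsets.append((index, bin(diff).count("1")))
    return upsets


if __name__ == "__main__":
    region = fill_region()
    region[8 * 1000] ^= 0x04      # simulate a single-bit upset in word 1000
    print(scan_region(region))
```

Here an in-process buffer stands in for the RAM disk. Recording the flipped-bit count per word is what lets a scan separate single-bit changes from multi-bit ones.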

  21. RHIC Run 14 Retreat

  22. RHIC Run 14 Retreat

  23. Memory Pattern Upsets: Across Alcove Test FEC
      • Test FEC placed in corner of room
      • Compared to PS1 & PS2 near door
      • After 1 RHIC store
         • 41 errors in PS1
         • 36 errors in PS2
         • 32 errors in TEST

  24. TLD Results
      [Figure: TLDs 322 to 325, each labeled with its exposure time (4 wk or 2 wk)]
      • TK322 & TK324 installed 4-30-14 & kept in place for 4 weeks
      • TK323 & TK325 installed 4-30-14 & kept in place for 2 weeks

  25. Outline (repeat of slide 3)

  26. Possible Mitigation
      • Could try network switches that use ECC memory
         • Costs are high, >$30k/switch
      • Could add shielding
         • Kin’s simulations suggest reduction factors are not large
      • Could move sensitive equipment to “quiet” alcoves
         • Some new equipment required & labor intensive
      • Could move some network switches out of alcoves to service buildings
         • Distances are large (~1400 ft)
         • Technology issues need investigation
            • Use copper cable instead of fiber?
            • Need to use converters, which may be susceptible and may pose operational challenges.

  27. Plans under discussion
      • Improve on the data collection to better monitor the “health” of the equipment in the alcoves.
         • Automate & log data from memory pattern scans
         • Log network statistics and consider additional methods to monitor the health of network switches (directly or indirectly)
      • Improve ability to deal with switch problems
         • Riding through a problem more gracefully
         • Improve how we recover
         • Recovering automatically
         • Identify equipment impacted and make more robust
         • Improve alarming mechanisms to alert operations of switch problems (distinguish more clearly between FEC problems and switch problems)
      • Next run explore mitigation strategies further (see previous slide).
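
One simple way to approach the alarming item above (telling FEC problems apart from switch problems) is to group unreachable FECs by the switch they sit behind. The sketch below is a hypothetical heuristic, not an existing controls tool: the function name, the status labels, and the switch and FEC names are all invented for illustration.

```python
def classify_outage(topology: dict[str, list[str]], unreachable: set[str]) -> dict[str, str]:
    """Label each switch 'ok', 'fec problem', or 'switch suspect'.

    Hypothetical heuristic: if every FEC behind a switch is unreachable,
    suspect the switch; if only some are, suspect the individual FECs.
    """
    status = {}
    for switch, fecs in topology.items():
        down = [fec for fec in fecs if fec in unreachable]
        if not down:
            status[switch] = "ok"
        elif len(down) == len(fecs):
            status[switch] = "switch suspect"
        else:
            status[switch] = "fec problem"
    return status


if __name__ == "__main__":
    # Invented names; the real alcove topology is not given in the slides.
    topology = {"sw-a": ["fec-1", "fec-2"], "sw-b": ["fec-3", "fec-4"]}
    print(classify_outage(topology, {"fec-1", "fec-2", "fec-3"}))
```

A switch with only one FEC behind it cannot be told apart from a failed FEC this way, so a real alarm would likely need a direct health check on the switch as well.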

  28. Backup/Auxiliary Slides

  29. Rack 9, Rack 10

  30. RHIC Run 14 Retreat

  31. Some Terminology
      • ECC memory: Error-correcting code memory
         • Detects and corrects most common kinds of internal data corruption.
         • Is immune to single-bit errors
         • Makes use of additional circuitry that checks accuracy of data during I/O operations.
         • E.g., data read from a word of memory will remain uncorrupted even if a single bit in that memory was flipped.
      • MRAM: Magneto-resistive random-access memory
         • Non-volatile memory
         • Data is not stored as electrical charge, but by magnetic storage elements (via two ferromagnetic plates), so works on spin, not charge.
      • Single Event Upsets, Single Event Latchups, Single Event Gate Rupture, & Single Event Burnouts
         • Radiation striking a sensitive node in a micro-electronic device
         • Most common are upsets, which are not permanent damage
         • Result is at least a single bit change
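
To illustrate how an error-correcting code survives a single flipped bit, here is a toy Hamming(7,4) encoder and decoder. It is a teaching sketch only: real ECC memory commonly protects much wider words (for example 64 data bits) with extra check bits in hardware, and nothing here is meant to describe the code used in the FECs or switches.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (bit i holds position i + 1)."""
    if not 0 <= nibble < 16:
        raise ValueError("nibble must be in range 0..15")
    bits = [0] * 8  # index 0 unused; positions 1..7 as in the classic layout
    # Data bits go to positions 3, 5, 6, 7; parity bits to positions 1, 2, 4.
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    return sum(bits[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_decode(codeword: int) -> int:
    """Correct up to one flipped bit and return the 4 data bits."""
    bits = [0] + [(codeword >> (pos - 1)) & 1 for pos in range(1, 8)]
    # XOR of the positions of all set bits is 0 for a valid codeword;
    # after a single flip it equals the position of the flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome] ^= 1
    return bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)


if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    upset = word ^ (1 << 4)  # simulate a single-event upset in one bit
    print(hamming74_decode(upset) == 0b1011)
```

This plain Hamming code corrects one flipped bit per codeword. Two flips in the same codeword would be miscorrected, which is why multi-bit upsets remain a concern even with ECC.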
