
Machine Availability and System Reliability at RHIC

Machine Availability and System Reliability at RHIC. Fulvia Pilat. WAO-07 Trieste, September 24-28 2007. RHIC performance. Delivered luminosity increased by >2 orders of magnitude in 6 years. Delivered per run to PHENIX. FOM = LP^4. Enhanced Design Parameters.


Presentation Transcript


  1. Machine Availability and System Reliability at RHIC • Fulvia Pilat • WAO-07, Trieste, September 24-28, 2007

  2. RHIC performance • Delivered luminosity increased by >2 orders of magnitude in 6 years • Delivered per run to PHENIX • FOM = LP^4

  3. Enhanced Design Parameters • Calendar time in store affects ability to project performance.

  4. Enhanced Design Parameters (~2009) [Chart annotations: goal exceeded; +10%; 3x]

  5. Enhanced and RHIC-II luminosity • Electron or stochastic cooling

  6. Time at store: trend and goal • Trend • Goal: back to mid-50% in Run-8; 60% time at store in Run-9

  7. Outline • Operation stats, performance • Factors determining time at store: • Machine development (short term investment) • APEX: Accelerator Physics EXperiments program (longer term investment) • Scheduled maintenance (talk by Sampson today) • Machine set-up • Systems downtime and failure • Mode of operation: “pushing the envelope”

  8. RHIC Retreat 2007, July 16-17: Session on Availability and Reliability. Agenda (start time, minutes, speaker, topic): • 11:00 (15) Pilat: Introduction • 11:15 (25) Ingrassia: Operations and uptime • 11:40 (20) Kling: Turn-around time • 12:00 (20) Sampson: Maintenance models, organization • 12:20 (10) Discussion • 2:00 (15) Ahrens: RHIC abort system • 2:15 (15) Zhang, Wu: Pulsed power systems • 2:30 (30) Bruno: Power supplies • 3:00 (30) Sandberg: Electrical systems • 3:30 (30) Zaltsman: RF, RHIC and injectors • 4:30 (15) Oerter: Controls, hardware • 4:45 (15) Morris: Controls, software • 5:00 (15) Reich: Access controls • 5:15 (15) Russo: BPM, IPM, BBQ in operations • 5:30 (15) Tuozzolo: Cryogenic system • 5:45 (15) Mapes: Vacuum systems

  9. [Figure: chart with the 60% goal marked and six points labeled “M”]

  10. Failure Flavors • Charged (threshold for log is 6 minutes or more): failure hours that impact the program, charged to one OR MORE systems during a failure period. Simultaneous failures result in charged hours less than actual hours • Actual, Severe: duration of a failure that impacts the program, often LONGER than the hours charged • Actual, Mild: failure that does not impact the program, e.g. 1 of 10 AGS RF stations trips. Hours recorded but not “charged” • Resets (threshold for log is less than 6 minutes)
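
One possible reading of the charged vs. actual distinction on slide 10 is that overlapping failures are charged once per wall-clock hour, while actual hours add up each failure's own duration. The sketch below illustrates that reading only; the interval format, the function names and the sample log are hypothetical and do not come from the RHIC failure database.

```python
from typing import List, Tuple

# Hypothetical log format: (start_hour, end_hour) for each failure.
Interval = Tuple[float, float]

def actual_hours(failures: List[Interval]) -> float:
    """Sum of every failure's own duration, overlaps counted repeatedly."""
    return sum(end - start for start, end in failures)

def charged_hours(failures: List[Interval]) -> float:
    """Wall-clock hours during which at least one failure was open."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(failures):
        if cur_end is None or start > cur_end:
            # Close the previous merged period and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping failure: extend the current merged period.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

# Two simultaneous failures followed by a separate one (made-up data).
failures = [(0.0, 3.0), (1.0, 2.0), (5.0, 6.5)]
print(actual_hours(failures))   # 3 + 1 + 1.5 = 5.5
print(charged_hours(failures))  # [0, 3] and [5, 6.5] = 4.5
```

Under this assumption the simultaneous pair contributes 4 actual hours but only 3 charged hours, which matches the slide's remark that simultaneous failures give charged hours below actual hours.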

  11. “Top 10” Failures by Group & by Run

  12. Operations Planned Improvements • Multiple failures, often simultaneous: CAS (tech support on shift, 2 now) needs help • Train Siemens Watch for LOTO • Together with MCR Operators they can perform LOTO when CAS is busy • Get Operators into the field • Train Operators to (only) reset “accelerator” power supplies • OC instructed to call in help for CAS when CAS is making a repair AND another system goes down • OC instructed to call in help from two groups with knowledge of the equipment when the cause of a problem is not clear

  13. Outline • Operation stats, performance • Factors determining time at store: • Machine development (short term investment) • APEX: Accelerator Physics EXperiments program (longer term investment) • Scheduled maintenance (talk by Sampson today) • Machine set-up • Systems downtime and failure • Mode of operation: “pushing the envelope”

  14. Turn around time

  15. Outline • Operation stats, performance • Factors determining time at store: • Machine development (short term investment) • APEX: Accelerator Physics EXperiments program (longer term investment) • Scheduled maintenance (talk by Sampson today) • Machine set-up • Systems downtime and failure • Mode of operation: “pushing the envelope”

  16. Input from systems • Maintenance, set-up and turn-around time, and modes of operation all affect availability, but the main factor is system failure. In Retreat presentations please focus on the reliability of your system and think critically about ways to improve it. I would ask each of you to discuss a plan, including timelines and necessary funding, to increase your system reliability. This is an important input towards an integrated plan to improve time at store, to be discussed at the Retreat and implemented thereafter.

  17. After the Retreat: reliability • Review Retreat information on operations, maintenance and systems • Prioritize actions, especially systems improvements for reliability • Analyze aging infrastructure, systems • Use the recently revisited “Trouble Report Committee” as input and advice on system reliability

  18. RHIC PS Performance Stats [Figure panels: Average RHIC PS failure hours/week; MTBF of RHIC due to any PS failure; MTBF of an individual PS failure]

  19. Leading Causes of PS Down Time in Hours

  20. Power Supply System Priorities • Bipolar 150 A and 300 A power supplies, Phase 1 • QPAs (quench protection assemblies) • Main dipole and quadrupole PS • Investigate yellow quad bus ground fault • Improve Dynapower PS cooling • Quench detector cleaning and fan replacements • Air conditioning (for air quality and temperature)

  21. Expected MTBF in Run 8? • Run 5 = 30.79 hours • Run 7 = 14.75 hours • Run 7 with its 3 major problems removed = 40 hours
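
The kind of estimate on slide 21 can be sketched as a simple recomputation of MTBF with selected failure causes removed. This is a minimal illustration, assuming MTBF is taken as operating hours divided by number of failures and that operating hours stay unchanged; the cause names and counts are invented and are not the Run-5 or Run-7 data.

```python
def mtbf(operating_hours: float, n_failures: int) -> float:
    """Mean time between failures: operating hours per failure."""
    if n_failures == 0:
        return float("inf")
    return operating_hours / n_failures

# Hypothetical run: all numbers below are illustrative only.
operating_hours = 1000.0
failures_by_cause = {"cause_a": 20, "cause_b": 15, "cause_c": 5, "other": 10}
major_causes = ["cause_a", "cause_b", "cause_c"]  # the "3 major problems"

baseline = mtbf(operating_hours, sum(failures_by_cause.values()))
remaining = sum(n for cause, n in failures_by_cause.items()
                if cause not in major_causes)
improved = mtbf(operating_hours, remaining)

print(f"baseline MTBF: {baseline:.1f} h")              # 1000 / 50 = 20.0 h
print(f"MTBF without major causes: {improved:.1f} h")  # 1000 / 10 = 100.0 h
```

Holding operating hours fixed is a simplification: fewer failures would also mean more uptime, so a real projection would likely be somewhat higher.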

  22. Electrical Systems * excluding arc flash event

  23. Most Significant Causes of ES Downtime, Run-7 • 4 areas responsible for 90% of downtime in Run-7

  24. Electrical systems: Steps being taken • 18 electricians assigned to C-AD this summer vs. 6 last year • Ongoing thermal inspection of switches • Use of torque wrenches instituted • Better understanding of thermal effects • Replace 1000 P 13.8 kV switches • Replace trip units, 1000 P substation • Replace switchgear in 914 • Maintenance of BMMPS CBs

  25. Electrical systems: Steps being taken (cont'd) • Continuation of arc flash calculations • Connecting RHIC Bard A/C units through isolation transformers • 21 new alcove UPSs • 8-year program to improve electrical infrastructure ($9 million) • Open slot for new power engineer

  26. ES: Top Concerns from Last Year's Retreat • Power dips: 8 in Run-6, 6 in Run-7 • Response to 1006 arc flash: almost done • 1004 B CB problem • AMMPS transformer replacement. Additional Steps to Improve Availability • Increase the number of assigned electricians • Centralize spare parts location • Increase spares inventory • [annotation: this shutdown]

  27. RF system: Performance • Number of systems: Booster 2, AGS 11, RHIC 16 • Charged failure hours: Booster 7, AGS 39, RHIC 65 • Actual failure hours: Severe 216, Mild 272 • Factor affecting the system performance in RHIC RF: beam loading (more than double the total intensity of Run-4). Example: large debunching at rebucketing time, losses and beam dumps. It took time to understand and mitigate the beam loading effects.
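
As a small worked example, the counts on slide 27 can be normalized to charged failure hours per RF system. The numbers are taken from the slide; dividing by the number of systems is an added comparison, not one the presentation makes.

```python
# Values from slide 27.
systems = {"Booster": 2, "AGS": 11, "RHIC": 16}
charged = {"Booster": 7, "AGS": 39, "RHIC": 65}

# Charged failure hours per individual RF system in each machine.
per_system = {name: charged[name] / systems[name] for name in systems}
total_charged = sum(charged.values())

for name, hours in per_system.items():
    print(f"{name}: {hours:.2f} charged hours per system")
print(f"total charged hours: {total_charged}")
```

On this per-system basis the three machines come out close to each other (roughly 3.5 to 4 hours per system), even though RHIC dominates the totals.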

  28. 07 Gold Bunch Merge

  29. RF - IMPROVEMENTS • Complete system upgrade of low level RF in AGS and RHIC (unified hardware and software, modern system, better ring-to-ring synchronization) • Window comparators to provide fast shutdown for storage systems • New beam permit chassis to speed up the response • Low power circulators • New tubes • Ongoing work on window for storage system • Continue development of ferrite tuner for acceleration system

  30. Abort kickers - Failure Modes • Prefires: one module discharges unilaterally; the other four fire in response ASAP; not synchronized with the abort gap • Unconditioned triggers: all five modules discharge together; not synchronized with the abort gap • Spontaneous capacitor discharges: as if a “stop charge” occurred with no associated trigger (stop charge turns off the charging mechanism); damaging if not noticed

  31. RHIC abort kickers pre-fires in Run-7 broken out by ring and by module

  32. Abort kickers: observations, improvements • B2 and B4 use thyratron CX1575C. They will be replaced by CX3575C. • Y5 had 7 pre-fires at the beginning, but stayed clean after 4/4. • Y1 stayed clean during the entire run. • Y5, B2, and B4 had 7 pre-fires each, contributing 70% of total pre-fires. • What may help? • Condition high voltage system at higher voltage than operation level (Engineering control? Routine procedure?) • Keep modulators on • Pre-conditioning before beam operation • Keep operating voltage as low as possible
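
The pre-fire figures on slide 32 imply an approximate run total, which can be checked with a short calculation. The per-module count and the 70% share are from the slide; the total is an inference and is only approximate if the 70% is a rounded value.

```python
prefires_per_module = 7             # Y5, B2 and B4 each (slide 32)
modules = ["Y5", "B2", "B4"]
share_of_total = 0.70               # stated share, possibly rounded

top_three = prefires_per_module * len(modules)
estimated_total = round(top_three / share_of_total)

print(top_three)        # 21 pre-fires from the three modules
print(estimated_total)  # about 30 pre-fires in total
```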

  33. RHIC abort kickers: R&D • Charge up high voltage modulators on command 4 ms before beam abort to avoid pre-fire during long DC hold up • A preliminary study was performed in 2003 • Project cost over $2 million, based on the 2003 budget estimate.

  34. Cryo system: Phase III Upgrade • New gas bearing turbine for energy removal at the cold end of the refrigerator (Run-7). • New high efficiency vertical heat exchanger system at the cold end of refrigerator (Run-7). • Re-configured the cold helium supply to the accelerator rings to eliminate the use of the cold circulators (Run-6). • Modified Cold Box 5 to reduce Helium inventory, improve insulation, and reduce flow restrictions (Run-6). Results: • Saved an additional 1.0 MW of compressor power in Run-6. • Reduced the liquid inventory in the refrigerator. • Additional 1.0 MW achieved during Run-7. • Reduced number of running compressors by 4 FS and 1 SS.

  35. RHIC POWER HISTORY

  36. Cryo Stumbling at the Start of Run-7: HX OBSTRUCTION • Oil contamination in HX-20 from Rotoflow oil bearing expanders • Oil crossover happens during start-up (warm) • + LN2 contamination on HX-20 • Extended 80K operations contaminated GHe in RHIC • During cool-down, 80K GHe returned to the refrigerator • Poorly seated crossover valve (H409M) between CR line and Expander 6 outlet allowed LN2 to collect on HX-20 • = High recooler return pressure, resulting in (too) high magnet temperatures.

  37. [Figure: He flow rate and HX20 DP. Annotations: Blue recooler wave starts; Blue ready; Yellow 45K wave starts; Yellow 4.5K wave starts; Blue 4.5K wave starts; warm-up attempts to clear blockage]

  38. Outline • Operation stats, performance • Factors determining time at store: • Machine development (short term investment) • APEX: Accelerator Physics EXperiments program (longer term investment) • Scheduled maintenance (talk by Sampson today) • Machine set-up • Systems downtime and failure • Mode of operation: “pushing the envelope”

  39. Running for high availability • Example: low energy copper run (Run-5), 2 weeks of physics: choice to limit set-up time and downtime • Machine parameters (almost the same #bunches, 37-41; transmission HE ~95%, LE ~85-92%; same transition set-up) • bunch intensity: HE 41 x 4.5e9, LE 37 x 3.8e9 • beta*: HE 0.85 m, LE 3 m • energy: HE 100 GeV/u, LE 31.2 GeV/u • Reproducibility: minimized tuning time • Minimized time between stores • Longer lumi-lifetime

  40. Cu Run-5 high-energy run • time at store: 52% [Figure annotations: β*=0.85 m (0.89 m); β*=2.6 m; β*=3.0 m; access + equipment failures; power dip + access; access + snowstorm]

  41. Cu Run-5 low energy run time at store: 74%

  42. Cu Run-5 LE (week 2, stores) [Figure annotations: injection; access; Phobos 0 & polarity; beam experiments]

  43. Optimization of performance and availability • Projected performance and run plans must include optimization of the time at store if we want to achieve the 60% goal • Limit the number of new developments during the run preparation • Stop or reduce machine developments during physics running once potential for returns is low • Optimal choice of lattice, beta*, bunch intensity and number of bunches (with parameter evolution during the run, more conservative or more aggressive, based on optimization of delivered luminosity and time at store)
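
The trade-off on slide 43 can be illustrated with a toy model in which delivered luminosity is the average luminosity during stores times the fraction of calendar time spent at store. This is a rough sketch under that assumption; the function name, the two operating modes and all numbers are hypothetical, chosen only to show how a more conservative set-up can deliver more.

```python
def delivered_luminosity(avg_store_lumi: float,
                         time_at_store: float,
                         calendar_hours: float) -> float:
    """Toy model: luminosity integrated over the hours actually at store.

    avg_store_lumi is in arbitrary units per hour; time_at_store is the
    fraction of calendar time spent at store (0 to 1).
    """
    return avg_store_lumi * time_at_store * calendar_hours

calendar_hours = 100.0

# Hypothetical "aggressive" mode: higher luminosity, less time at store.
aggressive = delivered_luminosity(1.0, 0.50, calendar_hours)
# Hypothetical "conservative" mode: 10% lower luminosity, 60% at store.
conservative = delivered_luminosity(0.9, 0.60, calendar_hours)

print(f"aggressive:   {aggressive:.1f}")
print(f"conservative: {conservative:.1f}")
```

In this model the conservative mode wins whenever its relative gain in time at store exceeds its relative loss in store luminosity, which is the balance the slide asks run plans to optimize.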
