Electronics Long Term Operations
What we learned from Electronics Commissioning
G. Rakness, U.C.L.A
Dec 6, 2008
CMS CSC System is Huge
There is a tendency to forget the size of this system.
- ~450,000 channels
- >17,000 electronics boards
- 60 remote VME crates
- ~23,000 skew clear cables with ~1,150,000 shielded conductors
- 1,400 gigabit optical fibers
This system has been cabled and commissioned in less than 11 months!
Turning on the Electronics
PCrate Sequential LV powerup - major improvement, late October (Sytnik)
This ensures that the Proms properly load the FPGAs:
1) Power up DMB/TMB
2) Power up VMECC
3) Power up CCB/MPC
It is essential that DCS monitoring is turned off during the sequence. THERE IS NO AUTOMATIC WAY TO DO THIS IN DCS!
This works well, but there are rare problems.
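The power-up order and the "monitoring off during the sequence" rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the DCS or PCrate code: `CrateControl`, `set_monitoring`, `power_on` and the settle time are hypothetical names and values introduced only for this example.

```python
import time
from contextlib import contextmanager

# Board groups in the order given on the slide.
POWER_UP_ORDER = [("DMB", "TMB"), ("VMECC",), ("CCB", "MPC")]


class CrateControl:
    """Hypothetical stand-in for a peripheral-crate control interface."""

    def __init__(self):
        self.monitoring = True
        self.log = []

    def set_monitoring(self, enabled):
        self.monitoring = enabled
        self.log.append(("monitoring", enabled))

    def power_on(self, board):
        # The slide requires monitoring to be off during the sequence.
        if self.monitoring:
            raise RuntimeError("DCS monitoring must be off during power-up")
        self.log.append(("power_on", board))


@contextmanager
def monitoring_paused(crate):
    """Turn monitoring off and restore its previous state even if a step fails."""
    previous = crate.monitoring
    crate.set_monitoring(False)
    try:
        yield
    finally:
        crate.set_monitoring(previous)


def sequential_power_up(crate, settle_s=0.0):
    """Power the board groups in order, with monitoring paused throughout."""
    with monitoring_paused(crate):
        for group in POWER_UP_ORDER:
            for board in group:
                crate.power_on(board)
            time.sleep(settle_s)  # assumed wait for the Proms to load the FPGAs
```

Wrapping the sequence in a context manager means monitoring is switched back on even when a power-up step raises, which is the part the slide says cannot currently be done automatically.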
Peripheral Crate Power-up Problems
1) Problem: VMECC fails to program
   Solution:
   a) renegotiate gigabit link (shut down switch port via software PCSwitches)
   b) recycle power on slot (this presently takes ~five minutes using the DCS GUI; THIS HAS TO BE AUTOMATED!)
2) Problem: Netgear Gigabit Switch CPU locks out VMECC
   Solution:
   a) recycle switch power supply with new remote AC power switch (ssh)
3) Problem: TMB or DMB fails to program
   Solution:
   a) TTC hard reset (1/2 detector)
   b) CCB hard reset (whole crate)
   c) worst case (rare): power cycle DMB/TMB slot (2 slots)
   There is a run-around problem here: one would like to reset only the problem DMB or TMB.
4) Almost zero Prom programming loss observed
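The a), b), c) solutions listed for a TMB or DMB that fails to program form an escalation ladder: try each action in the listed order and stop at the first one that works. A minimal sketch of that pattern is given here; `recover`, `FakeBoard` and `ladder_for` are hypothetical names for illustration, and the simulated board stands in for real hardware checks.

```python
from functools import partial

# Recovery actions in the order listed on the slide.
LADDER = ["TTC hard reset", "CCB hard reset", "power cycle DMB/TMB slot"]


def recover(board_ok, actions):
    """Try recovery actions in order; return the name of the one that worked.

    board_ok: callable returning True once the board has programmed.
    actions:  list of (name, callable) pairs in escalation order.
    """
    if board_ok():
        return None  # nothing to do
    for name, action in actions:
        action()
        if board_ok():
            return name
    raise RuntimeError("board still failing after all recovery actions")


class FakeBoard:
    """Simulated board that recovers only after one given action."""

    def __init__(self, fixed_by):
        self.fixed_by = fixed_by
        self.programmed = False
        self.tried = []

    def do(self, name):
        self.tried.append(name)
        if name == self.fixed_by:
            self.programmed = True


def ladder_for(board):
    """Bind each action name to the simulated board."""
    return [(name, partial(board.do, name)) for name in LADDER]
```

For example, `recover(lambda: b.programmed, ladder_for(b))` with `b = FakeBoard("CCB hard reset")` tries the TTC hard reset first and returns `"CCB hard reset"`.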
Front End Board Power-up Problems
FEBoard Power
LV Powerup - switch on LV individually through DMB using LVMB
Power-on problems are rare. Almost all are due to the infamous Erased Prom problem.
- CFEBs and ALCTs occasionally lose Prom data on power-up.
- Rare on power-up: typically less than 1 in 458 ALCTs and 2300 CFEBs.
- Prom readback shows roughly equal numbers of Proms with one bit flip (1->0) and with no bit flips relative to the loaded data. (A typical Prom readback has millions of bits.)
- A 1->0 flip suggests charge loss on the gate.
Solution: automatically detect problem Proms and reload firmware. This was successfully implemented in late November.
CCB Initialization - resets TTC signal communications, e.g. hard resets
This has been a bit problematic. Debugging possibly needed?
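The Prom readback check described on this slide amounts to comparing the readback image with the loaded image bit by bit and classifying the flips by direction. The sketch below illustrates that comparison on byte strings; it is not the detection code that was deployed, and `count_bit_flips` and `needs_reload` are names assumed for this example.

```python
def count_bit_flips(loaded: bytes, readback: bytes):
    """Return (ones_to_zeros, zeros_to_ones) between loaded and readback images."""
    if len(loaded) != len(readback):
        raise ValueError("images differ in length")
    one_to_zero = 0
    zero_to_one = 0
    for a, b in zip(loaded, readback):
        # Bits set in the loaded byte but cleared in the readback byte.
        one_to_zero += bin(a & ~b & 0xFF).count("1")
        # Bits cleared in the loaded byte but set in the readback byte.
        zero_to_one += bin(~a & b & 0xFF).count("1")
    return one_to_zero, zero_to_one


def needs_reload(loaded: bytes, readback: bytes) -> bool:
    """Flag a Prom for firmware reload if any bit differs from the loaded data."""
    return count_bit_flips(loaded, readback) != (0, 0)
```

Keeping the two flip directions separate matters here because the slide's charge-loss hypothesis rests on the flips being 1->0 only.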
Problems during Global/Local Data Taking
Global/Local Data Taking
Electronics seems to just work on good boards. We have tested hard reset response (FPGA reload, reset, and Flash memory constant loads) and have never seen a problem.
Rarely, VMECC loses gigabit communications.
Solution:
a) renegotiate gigabit link (shut down switch port via software PCSwitches)
b) recycle power on slot (this presently takes ~five minutes using the DCS GUI; THIS HAS TO BE AUTOMATED!)
Rarely, a DMB or TMB loses VME communications
- data/trigger operation unaffected
- long period with no DCS access
- this is under study; we have no explanation
- only fixed on hard reset for a new run
Problems during Global/Local Data Taking
Failures that Require Board Replacement
- VMECC, DMB, TMB, CCB, and MPC failures are rare. They are easily accessible and are fixed within hours.
- FED DDU and DCC failures are even rarer. They are swapped out within minutes if needed.
- FEBoard failures require access. Boards we discovered with problems last February have still not been replaced.
LVDB Fuses
- Rarely, ALCT and DMB LVDB fuses blow. These are extremely difficult to replace.
- It was found earlier this year that one can blow an LVDB fuse by programming the ALCT with bad firmware. This has been fixed in software and is believed to be impossible now.
- There has been a random, unexplained source of blown fuses over the last six months.
Problems during Global/Local Data Taking
- ~5 ALCT fuses need replacing
- 3 CFEB fuses need replacing
- Two of the ALCT fuses blew on separate chambers on the same night!
We presently have no idea of the source of these failures.
Sudden LV Power Loss on Peripheral Crate
There are electronics problems that can only be explained by sudden short-term power loss to peripheral crates:
- DDU has registered 9 FMM errors instantaneously in one crate
- MPC has been observed to go into power-up mode
These seem to have decreased in frequency since mid-summer.
There is no DCS voltage history available. This would help greatly in debugging/understanding this problem.
Solution: restart run
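One signature mentioned above, many FMM errors appearing at once in a single crate, could be searched for in an error log with a sliding time window. The sketch below is an illustration only: the `(timestamp, crate)` record format, the function name and the one-second window are assumptions, and the default count of 9 is taken from the single observation quoted on the slide rather than from any established threshold.

```python
from collections import defaultdict


def crates_with_error_bursts(errors, min_count=9, window_s=1.0):
    """Return the set of crate ids with a burst of errors.

    errors:    iterable of (timestamp_s, crate_id) records (assumed format).
    min_count: number of errors that counts as a burst.
    window_s:  maximum time span, in seconds, covered by one burst (assumed).
    """
    by_crate = defaultdict(list)
    for timestamp, crate in errors:
        by_crate[crate].append(timestamp)

    flagged = set()
    for crate, times in by_crate.items():
        times.sort()
        start = 0
        for end, t in enumerate(times):
            # Shrink the window from the left until it spans at most window_s.
            while t - times[start] > window_s:
                start += 1
            if end - start + 1 >= min_count:
                flagged.add(crate)
                break
    return flagged
```

A check of this kind only flags candidates; without the DCS voltage history that the slide notes is missing, it cannot confirm that a power glitch was the cause.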
Failed Boards needing Replacement
Other Longterm Board Failures
ME1/1
- Over half of the longterm board problems have occurred on ME1/1 chambers.
- The ME1/1 group has shown data suggesting that nearly all of these are skew clear cable related.
- ME1/1 skew clear cables have a patch panel. Damaged connectors suspected.
- ME1/1 skew clear cables are at the length limit of the technology.
Other Chamber Board Failures
- ~xxx/468 ME1/2,3 ME2, ME3, ME4/1 ALCT boards need replacement
- ~xxx/2300 ME1/2,3 ME2, ME3, ME4/1 CFEB boards need replacement
- some of these are skew clear cable related
Systematic repairs of replaced boards have shown no repeat problems.
We have had few boards with long-term failures to autopsy.
Biggest problems are still on chamber.
FED Crate Problems
- The Monster Event problem showed filtering problems on the DCC and on the global DAQ group's slink mezzanine boards. Through collaboration, the problem was eviscerated on both sides.
- No single-board DDU or DCC problems seen.
- Software thread loading problem solved in September.
- DDUs report problems from other boards. The problems are on the other boards. "Don't kill the messenger."
Online Computer Problems
The online software runs on 16 CPUs.
Known problems:
1) Problem: On power-up, a random number of machines don't boot
   Solution: Recycle power on the machines by hand. Although not optimal, ACPI cards are expensive and are reportedly flaky.
Computer Problems Encountered
2) Problem: Farm machine overheating alarms
   Solution: fans with 3x air volume installed
3) Problem: Farm machine eth_hook drivers have problems after weeks of running
   Solution: patches to the gigabit driver seem to have removed the problem
4) Problem: DCS machine drivers don't work after several days
   Solution: XMAS monitoring seems to have solved the problems
5) Problem: We do not manage the computers
   A motherboard was recently swapped on a farm machine; 9 days and 10s of emails later, an NFS mounting problem leaves the machine still unusable.
   Solution: Eric Cano et al are overworked. This is their problem since we don't have root privileges on USA owned machines ???!?
- 2 spare machines live, configured and connected $$$$
- space for 1 2U machine in USC ???