CSC Hardware Problems and Monitoring
J. Gilmore, 8 Sept. 2009
Last Week of CRAFT
26 August – 2 Sept: Checked local DAQ data from 21 runs
* Focus on raw data quality, not trigger problems
• Several persistent problems in various categories
  • CFEB DAV-LCT mismatch: 2 CSCs
  • CFEB corrupted data bits
  • Persistent bad data bits: 2 CSCs
  • Bad data bits and Timeout errors: 1 CSC
  • Channel link failing: 2 CSCs
  • Bit 14 errors in Overlap events: 6 CSCs
  • Bad L1A number matching: 1 CSC
  • ALCT bad configuration: 1 CSC
  • ALCT bad data bits: 1 CSC
About 8 boards are bad enough to warrant disabling
• Other problems are less frequent or trivial
  • Only 1-2 runs per week, or 1-2 minor errors per run
  • Experts should discuss offline
Last Week of CRAFT (cont.)
• Some “fixable” errors were observed as well
  • 2 CFEBs and 1 ALCT lost firmware
Most problems we see are known from previous running
Ultimately we must operate with a moderate level of Known Problems…
• Some minor problems are relatively rare
• Some problems must await diagnosis or repair
We also need to clearly indicate New Problems to shifters
• How can we do that…?
The Problem of Tracking the Problems
Error diagnosis is hard, even for experts…
• Classifying, recording and long-term monitoring are complicated too
…so let’s break it down a little:
• Trigger problems will show up in the data
• Data/communication problems will show up in the data
Local DQM should be able to detect these cases
• Most error types have a clear signature of symptoms
• We can write algorithms to cover at least 80% of cases
• Of course, some problems will need expert help
• Furthermore, Online DQM uses Local DAQ PCs
  • This gives us prompt results
How can this be implemented?
• First, develop the algorithms in software
  • Systematically identify typical problem categories
  • We can help DQM experts with this
• Next we need to track Known Problems in a DB
  • Online DQM already determines a “severity”
  • Any “intolerable” problem that is not known gets flagged
  • Highlight it graphically in the Online DQM grid... a big red circle?
• Finally, create a mechanism for experts/CEOs to add a newly identified problem to the Known Problem List
  • Perhaps a method similar to the one DCS uses to disable a warning…?
  • We might see a few new problems every month: infrequent use
The Known Problem List should be easily viewable at all times
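The "flag anything intolerable that is not already known" step could look roughly like the sketch below. It is only an illustration: the `Problem` record, the two-level severity scale, the chamber labels, and the use of an in-memory set in place of the DB are all assumptions, not the actual Online DQM interface.

```python
from dataclasses import dataclass

# Hypothetical severity scale; the slides only say that Online DQM
# already determines a "severity" for each problem.
TOLERABLE = 0
INTOLERABLE = 1


@dataclass(frozen=True)
class Problem:
    chamber: str    # illustrative chamber label, e.g. "ME+1/1/10"
    category: str   # problem category, e.g. "CFEB Lost Firmware"
    severity: int   # TOLERABLE or INTOLERABLE


def flag_new_problems(detected, known_problems):
    """Return the intolerable problems that are not on the Known Problem List.

    `known_problems` is a set of (chamber, category) pairs standing in
    for the Known Problems DB (assumption).
    """
    return [
        p for p in detected
        if p.severity >= INTOLERABLE
        and (p.chamber, p.category) not in known_problems
    ]


def add_known_problem(known_problems, problem):
    """Expert action: record a newly identified problem as known."""
    known_problems.add((problem.chamber, problem.category))
```

Only the problems returned by `flag_new_problems` would be highlighted for shifters; once an expert calls `add_known_problem`, the same chamber and category no longer raises a flag on later runs.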
Some implementation details…
• DCS and FMM continue independent operation as usual
  • Environmental issues and SEU cases handled normally
  • Perhaps they could share info in the future…
Hardware error detection, symptoms in DAQ data
• ALCT Lost Firmware
  • usually shows as "ALCT Full FIFO @DMB" alone for every event
• ALCT “Blown Fuse” condition
  • usually shows as "ALCT Not Present" (corrupted header/trailer) with Timeout
• Bad ALCT Config (pulse strip left floating)
  • Hot CFEB/CLCT on an edge strip (cfeb1 or 5)
• CFEB Lost Firmware
  • usually shows as "Bad DAV-LCT" alone for every event
• CFEB Channel Link failure
  • usually has persistent bad CRCs along with bad WC, often Full FIFO
• CFEB bit 14/CRC errors in Overlap events
  • OVLP bit14 is low true, sometimes affects bit 14 of CRC-complement word
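A few of the symptom signatures above can be turned into simple rules, sketched here under stated assumptions. Each event is represented as a set of error-flag strings for one chamber; the flag strings are copied from the slides and may not match the labels used by the real DQM software, and the `persistent` threshold of 0.9 is an arbitrary illustrative choice. Cases that need more information (hot edge strips, Overlap bit 14) are deliberately left out.

```python
# Flag strings are taken from the slides; the real DQM labels may differ
# (assumption).
ALCT_FULL_FIFO = "ALCT Full FIFO @DMB"
ALCT_NOT_PRESENT = "ALCT Not Present"
TIMEOUT = "Timeout"
BAD_DAV_LCT = "Bad DAV-LCT"
BAD_CRC = "Bad CRC"
BAD_WC = "Bad WC"


def fraction_with(events, required):
    """Fraction of events whose flag set contains every flag in `required`."""
    return sum(1 for ev in events if required <= ev) / len(events)


def classify_chamber(events, persistent=0.9):
    """Guess a hardware problem from per-event error flags for one chamber.

    `events` is a list of sets of flag strings, one set per event.
    Returns a category name, or None when no rule matches (expert needed).
    """
    if not events:
        return None
    # "alone for every event": the flag is the only one set, in all events.
    if all(ev == {ALCT_FULL_FIFO} for ev in events):
        return "ALCT Lost Firmware"
    if all(ev == {BAD_DAV_LCT} for ev in events):
        return "CFEB Lost Firmware"
    # "ALCT Not Present" together with Timeout.
    if fraction_with(events, {ALCT_NOT_PRESENT, TIMEOUT}) >= persistent:
        return 'ALCT "Blown Fuse"'
    # Persistent bad CRCs along with bad word count.
    if fraction_with(events, {BAD_CRC, BAD_WC}) >= persistent:
        return "CFEB Channel Link failure"
    return None
```

Returning `None` for anything unmatched keeps the rules conservative: those chambers would be left for expert diagnosis rather than mislabeled.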