
CSC Hardware Problems and Monitoring

CSC Hardware Problems and Monitoring. J. Gilmore 8 Sept. 2009. Last Week of CRAFT. 26 August – 2 Sept : Checked local DAQ data from 21 runs * Focus on raw data quality, not trigger problems Several persistent problems in various categories CFEB DAV-LCT mismatch: 2 CSCs


Presentation Transcript


  1. CSC Hardware Problems and Monitoring
     J. Gilmore, 8 Sept. 2009

  2. Last Week of CRAFT
     26 August – 2 Sept: Checked local DAQ data from 21 runs
     * Focus on raw data quality, not trigger problems
     • Several persistent problems in various categories
       • CFEB DAV-LCT mismatch: 2 CSCs
       • CFEB corrupted data bits
       • Persistent bad data bits: 2 CSCs
       • Bad data bits and Timeout errors: 1 CSC
       • Channel link failing: 2 CSCs
       • Bit 14 errors in Overlap events: 6 CSCs
       • Bad L1A number matching: 1 CSC
       • ALCT bad configuration: 1 CSC
       • ALCT bad data bits: 1 CSC
     → About 8 boards are bad enough to warrant disabling
     • Other problems are less frequent or trivial
       • Only 1-2 runs per week, or 1-2 minor errors per run
       • Experts should discuss offline

  3. Last Week of CRAFT (cont.)
     • Some “fixable” errors were observed as well
       • 2 CFEBs and 1 ALCT lost firmware
     → Most problems we see are known from previous running
     Ultimately we must operate with a moderate level of Known Problems…
     • Some minor problems are relatively rare
     • Some problems must await diagnosis or repair
     We also need to clearly indicate New Problems to shifters
     • How can we do that…?

  4. The Problem of Tracking the Problems
     Error diagnosis is hard, even for experts…
     • Classifying, recording and long-term monitoring is complicated too
     …so let’s break it down a little:
     • Trigger problems will show up in the data
     • Data/communication problems will show up in the data
     → Local DQM should be able to detect these cases
     • Most error types have a clear signature of symptoms
     • We can write algorithms to cover at least 80% of cases
     • Of course, some problems will need expert help
     • Furthermore, Online DQM uses Local DAQ PCs
     → Gives us prompt results

  5. How can this be implemented?
     • First, develop the algorithms in software
       • Systematically identify typical problem categories
       • We can help DQM experts with this
     • Next we need to track Known Problems in a DB
       • Online DQM already determines a “severity”
       • Any “intolerable” problem that is not known gets flagged
       • Highlight it graphically in the Online DQM grid... a big red circle?
     • Finally, create a mechanism for experts/CEOs to add a newly identified problem to the Known Problem List
       • Perhaps a method similar to the one DCS uses to disable a warning…?
       • We might see a few new problems every month: infrequent use
     → The Known Problem List should be easily viewable at all times
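
The flagging step on this slide (an “intolerable” problem that is not on the Known Problem List gets highlighted) could be sketched roughly as below. This is only an illustration: the class and function names, the in-memory set standing in for the DB, the severity strings other than “intolerable”, and the chamber labels are all assumptions, not part of the actual Online DQM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    chamber: str    # chamber label, e.g. "ME+2/1/03" (illustrative only)
    category: str   # problem category assigned by the detection algorithm
    severity: str   # severity string as determined by DQM (values assumed)


class KnownProblemList:
    """In-memory stand-in for the Known Problems DB (hypothetical)."""

    def __init__(self):
        self._known = set()

    def add(self, chamber, category):
        # Experts would call this when a new problem has been identified.
        self._known.add((chamber, category))

    def is_known(self, problem):
        return (problem.chamber, problem.category) in self._known

    def view(self):
        # Sorted listing, so the list is easy to display at any time.
        return sorted(self._known)


def flag_new_problems(detected, known):
    """Return the intolerable problems that are not on the known list."""
    return [
        p for p in detected
        if p.severity == "intolerable" and not known.is_known(p)
    ]


known = KnownProblemList()
known.add("ME+2/1/03", "CFEB lost firmware")

detected = [
    Problem("ME+2/1/03", "CFEB lost firmware", "intolerable"),   # already known
    Problem("ME-1/2/10", "ALCT lost firmware", "intolerable"),   # new
    Problem("ME+3/2/07", "Bad L1A number matching", "minor"),    # tolerable
]
flagged = flag_new_problems(detected, known)
```

In this sketch only the second entry would be handed to the display layer for highlighting; the first is suppressed because it is already on the list, and the third because its severity is below “intolerable”.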

  6. Some implementation details…
     • DCS and FMM continue independent operation as usual
       • Environmental issues and SEU cases handled normally
       • Perhaps they could share info in the future, later…
     Hardware error detection, symptoms in DAQ data
     • ALCT Lost Firmware
       • usually shows as "ALCT Full FIFO @DMB" alone for every event
     • ALCT “Blown Fuse” condition
       • usually shows as "ALCT Not Present" (corrupted header/trailer) with Timeout
     • Bad ALCT Config (pulse strip left floating)
       • Hot CFEB/CLCT on an edge strip (cfeb1 or 5)
     • CFEB Lost Firmware
       • usually shows as "Bad DAV-LCT" alone for every event
     • CFEB Channel Link failure
       • usually has persistent bad CRCs along with bad WC, often Full FIFO
     • CFEB bit 14/CRC errors in Overlap events
       • OVLP bit14 is low true, sometimes affects bit 14 of CRC-complement word
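
Several of the symptom signatures listed above lend themselves to a simple rule table. The sketch below shows one possible way to encode four of them; it is not the DQM implementation. The flag strings are taken from the slide where quoted and otherwise made up ("Timeout", "Bad CRC", "Bad WC", "Full FIFO"), the per-event input format is assumed, and the 50% threshold used for the “usually”/“persistent” cases is an arbitrary placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    problem: str          # hardware problem suggested by the signature
    required: frozenset   # error flags that must all be present in an event
    alone: bool           # True if no other flag may accompany them
    min_fraction: float   # fraction of events that must match (assumed)


SIGNATURES = [
    # "alone for every event" is encoded as alone=True, min_fraction=1.0
    Signature("ALCT lost firmware",
              frozenset({"ALCT Full FIFO @DMB"}), True, 1.0),
    Signature("ALCT blown fuse",
              frozenset({"ALCT Not Present", "Timeout"}), False, 0.5),
    Signature("CFEB lost firmware",
              frozenset({"Bad DAV-LCT"}), True, 1.0),
    # Full FIFO is "often" present, so it is allowed but not required
    Signature("CFEB channel link failure",
              frozenset({"Bad CRC", "Bad WC"}), False, 0.5),
]


def event_matches(sig, flags):
    """Check one event's set of error flags against one signature."""
    if sig.alone:
        return flags == sig.required
    return sig.required <= flags


def classify_chamber(events):
    """events: list of flag sets, one per event, for a single chamber."""
    if not events:
        return []
    found = []
    for sig in SIGNATURES:
        hits = sum(1 for flags in events if event_matches(sig, flags))
        if hits / len(events) >= sig.min_fraction:
            found.append(sig.problem)
    return found


lost_fw = classify_chamber([{"Bad DAV-LCT"}] * 100)
link = classify_chamber([{"Bad CRC", "Bad WC", "Full FIFO"}] * 10)
```

A chamber that matches none of the rules returns an empty list, which corresponds to the cases the slides say will still need expert help.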
