CSC Hardware Problems and Monitoring

J. Gilmore, 8 Sept. 2009

Last Week of CRAFT
26 August – 2 Sept: Checked local DAQ data from 21 runs
• Focus on raw data quality, not trigger problems
• Several persistent problems in various categories
  • CFEB DAV-LCT mismatch: 2 CSCs
  • CFEB corrupted data bits
    • Persistent bad data bits: 2 CSCs
    • Bad data bits and Timeout errors: 1 CSC
    • Channel link failing: 2 CSCs
    • Bit 14 errors in Overlap events: 6 CSCs
  • Bad L1A number matching: 1 CSC
  • ALCT bad configuration: 1 CSC
  • ALCT bad data bits: 1 CSC
→ About 8 boards are bad enough to warrant disabling
• Other problems are less frequent or trivial
  • Only 1-2 runs per week, or 1-2 minor errors per run
  • Experts should discuss offline

Last Week of CRAFT (cont.)
• Some “fixable” errors were observed as well
  • 2 CFEBs and 1 ALCT lost firmware
→ Most problems we see are known from previous running
Ultimately we must operate with a moderate level of Known Problems…
• Some minor problems are relatively rare
• Some problems must await diagnosis or repair
We also need to clearly indicate New Problems to shifters
• How can we do that…?

The Problem of Tracking the Problems
Error diagnosis is hard, even for experts…
• Classifying, recording and long-term monitoring is complicated too
…so let’s break it down a little:
• Trigger problems will show up in the data
• Data/communication problems will show up in the data
→ Local DQM should be able to detect these cases
• Most error types have a clear signature of symptoms
• We can write algorithms to cover at least 80% of cases
• Of course, some problems will need expert help
• Furthermore, Online DQM uses Local DAQ PCs
→ Gives us prompt results

How can this be implemented?
• First, develop the algorithms in software
  • Systematically identify typical problem categories
  • We can help DQM experts with this
• Next we need to track Known Problems in a DB
  • Online DQM already determines a “severity”
  • Any “intolerable” problem that is not known gets flagged
  • Highlight it graphically in the Online DQM grid... a big red circle?
• Finally, create a mechanism for experts/CEOs to add a newly identified problem to the Known Problem List
  • Perhaps a method similar to the one DCS uses to disable a warning…?
  • We might see a few new problems every month: infrequent use
→ The Known Problem List should be easily viewable at all times
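The Known Problem check described above could look roughly like the following. This is only a minimal sketch: the `Problem` record, the severity strings, the chamber labels and the function names are all hypothetical, and an in-memory list stands in for the DB. Nothing here reflects the actual Online DQM code.

```python
from dataclasses import dataclass

# Hypothetical severity labels; the slides only name "intolerable".
TOLERABLE = "tolerable"
INTOLERABLE = "intolerable"


@dataclass(frozen=True)
class Problem:
    chamber: str   # illustrative chamber label, e.g. "ME+1/2/07"
    category: str  # problem category, e.g. "CFEB Lost Firmware"
    severity: str  # severity as assigned by Online DQM (assumed string)


def flag_new_problems(observed, known):
    """Return intolerable problems that are not on the Known Problem List.

    A problem counts as known if its (chamber, category) pair is listed.
    """
    known_keys = {(p.chamber, p.category) for p in known}
    return [
        p for p in observed
        if p.severity == INTOLERABLE
        and (p.chamber, p.category) not in known_keys
    ]


def add_known_problem(known, problem):
    """Expert action: add a newly identified problem to the list (no duplicates)."""
    if problem not in known:
        known.append(problem)
    return known


# Usage: one known problem, two observed intolerable problems.
known_list = [Problem("ME+1/2/07", "CFEB Lost Firmware", INTOLERABLE)]
observed = [
    Problem("ME+1/2/07", "CFEB Lost Firmware", INTOLERABLE),   # already known
    Problem("ME-2/1/03", "ALCT Lost Firmware", INTOLERABLE),   # new: flag it
    Problem("ME-3/2/11", "Bad L1A number matching", TOLERABLE),  # ignored
]
new_problems = flag_new_problems(observed, known_list)
```

Only the entries returned by `flag_new_problems` would be highlighted for the shifter; once an expert adds one with `add_known_problem`, it stops being flagged.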

Some implementation details…
• DCS and FMM continue independent operation as usual
  • Environmental issues and SEU cases handled normally
  • Perhaps they could share info in the future, later…
Hardware error detection, symptoms in DAQ data
• ALCT Lost Firmware
  • usually shows as "ALCT Full FIFO @DMB" alone for every event
• ALCT “Blown Fuse” condition
  • usually shows as "ALCT Not Present" (corrupted header/trailer) with Timeout
• Bad ALCT Config (pulse strip left floating)
  • Hot CFEB/CLCT on an edge strip (cfeb1 or 5)
• CFEB Lost Firmware
  • usually shows as "Bad DAV-LCT" alone for every event
• CFEB Channel Link failure
  • usually has persistent bad CRCs along with bad WC, often Full FIFO
• CFEB bit 14/CRC errors in Overlap events
  • OVLP bit14 is low true, sometimes affects bit 14 of CRC-complement word
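A few of the symptom signatures listed above can be turned into simple rules. The sketch below assumes each event for a chamber is summarized as a set of error-flag strings. The flag names "Timeout", "Bad CRC" and "Bad WC" are illustrative labels, "persistent" is read as "present in every event", and `classify_chamber` is a hypothetical helper, not part of any existing DQM package.

```python
def classify_chamber(events):
    """Guess a hardware problem from per-event error-flag sets.

    events: list of sets of flag strings, one set per event.
    Returns a diagnosis string, or "unclassified" if no rule matches
    (those cases would go to an expert).
    """
    if not events:
        return "no data"

    # "alone for every event": each event carries exactly this one flag.
    if all(flags == {"ALCT Full FIFO @DMB"} for flags in events):
        return "ALCT Lost Firmware"
    if all(flags == {"Bad DAV-LCT"} for flags in events):
        return "CFEB Lost Firmware"

    # "ALCT Not Present" together with Timeout in every event.
    if all({"ALCT Not Present", "Timeout"} <= flags for flags in events):
        return 'ALCT "Blown Fuse" condition'

    # Persistent bad CRCs along with bad WC; Full FIFO may also appear.
    if all({"Bad CRC", "Bad WC"} <= flags for flags in events):
        return "CFEB Channel Link failure"

    return "unclassified"


# Usage: three events from one chamber, all showing only "Bad DAV-LCT".
diagnosis = classify_chamber([{"Bad DAV-LCT"}] * 3)
```

The rules are checked from most to least specific, and anything that does not match falls through to "unclassified", consistent with the expectation that some problems will still need expert help.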
