DDU Functions
• EMU DDU: not just for data handling
• Scan data for evidence of SEUs, determine if Reset needed
  • Data format errors a likely indicator of SEU: needs Hard Reset via FMM
  • Monitor front-end data for event synch loss: needs Sync Reset (FMM)
• Watch for buffer warning signals, avoid Overflows!
  • Set FMM Warning as needed, at half-to-3/4 full (many events!)
  • Beyond ~90% full DDU will set FMM Busy
  • As buffers get near empty, DDU returns to FMM Ready
  • Note that Buffer Overflows will lead to other errors
    • Synch loss, Data corruption, Timeout errors
• Diagnose cause and source of errors
  • Report Reset Request states via VME Interrupt
  • Provide VME registers for diagnostics and monitoring
  • Track which CSCs have set which error types
    • Allows for a discriminated error response in specific cases
      • One occurrence is just a bad event, no Reset
      • Several occurrences could indicate SEU, needs Reset
    • We can apply this for L1A, CRC, DAV, data format errors
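The buffer-driven FMM states listed on this slide can be read as a small state machine with hysteresis. The following is a minimal sketch of that reading, not the DDU firmware logic: the names `FmmState` and `next_fmm_state` are hypothetical, the Warning threshold is set at the lower end of the "half-to-3/4" range, the "near empty" threshold of 10% is an assumption (the slide gives no number), and holding Warning/Busy until the buffer is nearly empty is also an assumption.

```python
from enum import Enum


class FmmState(Enum):
    READY = "Ready"
    WARNING = "Warning"
    BUSY = "Busy"


# Thresholds as fractions of buffer capacity.
WARN_AT = 0.5      # slide: "half-to-3/4 full"; lower end chosen here
BUSY_AT = 0.9      # slide: "beyond ~90% full"
READY_BELOW = 0.1  # slide: "near empty"; the 10% value is an assumption


def next_fmm_state(current: FmmState, fill: float) -> FmmState:
    """Return the FMM state for a buffer fill fraction in [0, 1]."""
    if fill > BUSY_AT:
        return FmmState.BUSY
    if current is FmmState.READY:
        # Leave Ready only once the Warning threshold is reached.
        return FmmState.WARNING if fill >= WARN_AT else FmmState.READY
    # Already in Warning or Busy: assumed to hold until nearly empty.
    if fill < READY_BELOW:
        return FmmState.READY
    return current
```

For example, starting from Ready, fill levels of 0.6, 0.95, 0.4 and 0.05 would step through Warning, Busy, Busy (held) and back to Ready under these assumptions.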
DDU Capabilities
• Resilient against single bit errors
  • Bit errors ought to be rare, but an occasional CRC error should never cause a critical problem
  • Watch out for repetition: can indicate an SEU or hardware problem
• Some bit errors can destroy an entire 16-bit word
  • Fiber data encoded with 8-bit/10-bit protocol
  • Try to continue operation after this occurs, assumed rare
  • DDU Firmware adds “filler” words as needed to the end of the CSC data stream to maintain the 64-bit word boundary
  • Right now sets critical “corrupted data” error, will adjust this
• A “stuck” bit can cause critical problems, may indicate SEU
• Critical “data corruption” errors require a Reset
  • e.g. when DDU cannot detect the ending of the CSC data stream
• Some types of errors may be “single loss” events
  • Automatic self-recovery, no Reset needed
  • Such events set “bad event” signal; e.g. bad CRC
  • Repetition can indicate an SEU or other hardware problem
• FMM Errors must be “approved” by the VME IRQ Handler
  • DDU Error reporting to FMM is disabled until nCSCerrors > nThresh
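The "filler" word arithmetic can be illustrated as follows. This is a sketch only: it assumes the CSC stream is counted in 16-bit words (the word size mentioned on this slide), so four words make up one 64-bit boundary. The value `FILLER_WORD` and the function names are hypothetical placeholders, not the pattern or logic used by the firmware.

```python
from typing import List

WORD_BITS = 16       # assumed CSC data word size
BOUNDARY_BITS = 64   # boundary the DDU maintains
WORDS_PER_BOUNDARY = BOUNDARY_BITS // WORD_BITS  # 4 words

FILLER_WORD = 0xFFFF  # hypothetical filler pattern, for illustration only


def filler_count(n_words: int) -> int:
    """Number of filler words needed to reach the next 64-bit boundary."""
    return (-n_words) % WORDS_PER_BOUNDARY


def pad_to_boundary(words: List[int]) -> List[int]:
    """Return a copy of the stream with filler words appended at the end."""
    return words + [FILLER_WORD] * filler_count(len(words))
```

A stream of 5 words would need 3 filler words to reach 8 words (two 64-bit units); a stream already at a multiple of 4 words needs none.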
Recent Error Experience
• General error categories
  • Configuration failures
    • Caused by communication errors, bad timing or hardware
    • Cause many error symptoms: Timeout, bad DAV, sync loss, buffer overflow, dead or hot channels, format errors, data corruption
  • Format errors
    • Caused by config fail, bad hardware or signal timing/quality
    • Can cause DDU to misidentify a board header/trailer word
    • May show as “missing” board in event
    • May show as bad L1A, CRC or word count
    • A critical format error can cause data corruption
  • Hot/dead channels
    • Caused by config fail, bad hardware or signal timing/quality
    • Can lead to buffer overflows
  • DAV-LCT mismatch
    • Caused by config fail, bad hardware or signal timing/quality
    • This can cause buffer overflows or timeout errors
  • Full FIFO @DMB (buffer overflow)
    • Caused by config fail, bad hardware or signal timing/quality
    • Overflows can cause Synch loss, Data corruption, or Timeout
  • CRC errors
Defining an Error at the DDU
• Setting the “bad event” signal in the DDU trailer
  • Any events with non-perfect data checks will get this
    • Minor bit errors or format problems, SCA Full
  • Exceptions where data payload may not be affected:
    • Clean single-bit error in a header/trailer-word marker
    • Fiber Rx error that may have occurred between events
    • DCM phase-lock-lost that may occur between events
  • To add: 64-bit boundary violation (rather than Hard Reset)
• Requesting a Sync Reset (via VME IRQ, then FMM)
  • L1A mismatch detected twice on one CSC
    • Two different boards in the same event
    • Separate occurrences in two different events
      • Either the same board or different boards
  • Buffer Overflow at DMB or DDU
• Requesting a Hard Reset (via VME IRQ, then FMM)
  • Anything that corrupts the data irreversibly
  • Anything that “looks” like an SEU…e.g. repeated trivial errors
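The three response levels on this slide (bad event, Sync Reset, Hard Reset) can be sketched as a per-CSC error counter. This is an illustrative reading, not the DDU implementation: the class `ErrorClassifier`, the error-kind strings and the limit `SEU_REPEAT_LIMIT` are all hypothetical, and the slide does not say how many repeated trivial errors should count as "looking like" an SEU.

```python
from collections import Counter
from enum import Enum


class Action(Enum):
    BAD_EVENT = "set bad-event signal only"
    SYNC_RESET = "request Sync Reset"
    HARD_RESET = "request Hard Reset"


SEU_REPEAT_LIMIT = 3  # hypothetical: repeats of a trivial error treated as SEU-like


class ErrorClassifier:
    """Counts errors per (CSC, kind) and picks a response level."""

    def __init__(self, seu_repeat_limit: int = SEU_REPEAT_LIMIT):
        self.seu_repeat_limit = seu_repeat_limit
        self.counts = Counter()  # (csc, kind) -> occurrences since last reset

    def record(self, csc: int, kind: str) -> Action:
        self.counts[(csc, kind)] += 1
        n = self.counts[(csc, kind)]
        if kind == "data_corruption":
            # Irreversible corruption always needs a Hard Reset.
            return Action.HARD_RESET
        if kind == "buffer_overflow":
            # Overflow at DMB or DDU requests a Sync Reset.
            return Action.SYNC_RESET
        if kind == "l1a_mismatch":
            # A second L1A mismatch on the same CSC requests a Sync Reset.
            return Action.SYNC_RESET if n >= 2 else Action.BAD_EVENT
        # Any other (trivial) error: bad event, unless it keeps repeating.
        if n >= self.seu_repeat_limit:
            return Action.HARD_RESET
        return Action.BAD_EVENT

    def reset(self) -> None:
        """Clear all counters, as would happen after a Reset."""
        self.counts.clear()
```

Counting per CSC rather than globally mirrors the "track which CSCs have set which error types" idea, so that one noisy chamber does not trigger a reset on behalf of the others.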
Limitations & Considerations
• We do not know how frequently any particular error may occur
  • We may need to modify definitions as we get LHC experience
• Failure modes can be complex
  • Obvious error symptoms may be caused by more subtle problems
  • E.g. we often see a “CFEB problem” which is caused by the corrupted ALCT data that precedes it (bad ALCT headers & 64-bit violations)
  • We will learn more from LHC experience
• We can kill fibers for known, frequent problems
  • But we don’t want to kill everything!
  • At some low rate, we must be allowed to request a Reset
• We may see spontaneous critical problems that repeat
  • For these, we may need to automatically set “Ignore Fiber”
  • This would be temporary, set in real time by DDU logic
  • Only use in case of a repeated Critical Error from a CSC
  • Notification of any action is always registered in the data stream
    • We already send a complete “Live Fiber” list in _every_ event
  • At next Reset, all “Ignore” settings get cleared to normal state
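The proposed automatic “Ignore Fiber” behaviour could look roughly like the sketch below. Everything here is an assumption for illustration: the class `FiberIgnoreLogic`, the default of 15 fibers, the limit of two critical errors before a fiber is ignored, and the choice to drop an ignored fiber from the reported live list. The slide only states that the setting is temporary, triggered by a repeated Critical Error, visible in the data stream, and cleared at the next Reset.

```python
from collections import Counter
from typing import List


class FiberIgnoreLogic:
    """Temporarily ignores a fiber after repeated critical errors."""

    def __init__(self, n_fibers: int = 15, critical_limit: int = 2):
        self.n_fibers = n_fibers              # assumed number of input fibers
        self.critical_limit = critical_limit  # assumed repeat threshold
        self.critical_counts = Counter()
        self.ignored = set()

    def report_critical(self, fiber: int) -> bool:
        """Record a critical error; return True if the fiber is now ignored."""
        self.critical_counts[fiber] += 1
        if self.critical_counts[fiber] >= self.critical_limit:
            self.ignored.add(fiber)
        return fiber in self.ignored

    def live_fibers(self) -> List[int]:
        """Fibers still being read out, as would be listed in every event."""
        return [f for f in range(self.n_fibers) if f not in self.ignored]

    def reset(self) -> None:
        """At the next Reset, all Ignore settings return to normal."""
        self.critical_counts.clear()
        self.ignored.clear()
```

A single critical error leaves the fiber live, so the DDU can still request a Reset at a low rate; only a repeat from the same fiber masks it until the next Reset.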
DDU Error Table I
[1] Error bits resulting in RESET REQUIRED persist until the RESET occurs. Questionable cases (in gold) indicate that a reset is only required for mitigation of recurring errors. TBD: sync/hard reset distinctions.
[2] Found inside an event, i.e. between Beginning-Of-Event (=Header1 signature) and End-Of-Event (=combination Trailer1+Trailer2 signatures), at least one of the following: Extra DMB_Header1, Extra DMB_Header2, Lone Word, Extra TMB/ALCT_Trailer, Extra DMB_Trailer1, DMB_Trailer2.
[3] Missing TMB/ALCT_Trailer word, missing DMB Header word, Wrong First word, or Extra Control words.
DDU Error Table II
DDU Error Table III