CSC Online Error Monitoring with the DDU
J. Gilmore
CSC-DPG #41, July 17, 2008
DDU Overview
[Block diagram labels: FMM output port (sTTS), VME FPGA, Input FPGA, Control FPGA, Input FIFOs, SLINK, GbE FIFO, Mezz Board, GbE/SPY To Local DAQ]
• Functions
  • Merge data from 15 CSCs
  • Perform online data unpacking and status monitoring in real time (CRC, word count, format quality, BXN, L1A number, buffer status, link status)
  • Send CSC status to FMM
• Large Buffer Capacity
  • 2.5 MB buffer
  • Average DDU data volume estimated to be 0.4 kB per L1A at LHC (at a luminosity of 10^34 cm^-2 s^-1)
  • Buffer can hold over 6000 events
• Status info accessed via VME
• 15 Optical Fiber Inputs. Reads a 20-degree slice through an endcap
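The buffer-capacity figures above can be checked with a short calculation. This is only an arithmetic sketch: it assumes decimal units (1 MB = 10^6 bytes, 1 kB = 10^3 bytes), which the slide does not specify, and the function name is hypothetical.

```python
# Arithmetic sketch of the buffer capacity quoted on the slide.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 kB = 10**3 bytes).
BUFFER_BYTES = 2_500_000   # 2.5 MB buffer
EVENT_BYTES = 400          # 0.4 kB average data volume per L1A


def events_in_buffer(buffer_bytes: int = BUFFER_BYTES,
                     event_bytes: int = EVENT_BYTES) -> int:
    """Number of average-size events that fit in the buffer."""
    return buffer_bytes // event_bytes


if __name__ == "__main__":
    print(events_in_buffer())  # 6250, consistent with "over 6000 events"
```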
Data Unpacking in the DDU
• Scan data for evidence of SEUs, determine if Reset is needed
  • Data errors are an indicator for SEU
  • Requires Hard Reset, report it to FMM
• Monitor front-end data for event sync loss
  • Requires Sync Reset, report it to FMM
• Watch for buffer warning signals, avoid Overflows!
  • Set FMM Warning as needed, at half-to-3/4 full (many events!)
  • Beyond ~90% full, DDU will set FMM Busy
  • As buffers get near empty, DDU returns to FMM Ready
  • Note that Buffer Overflows will lead to other errors if not Reset
    • Sync loss, Data corruption, Timeout errors
• Diagnose cause and source of problems
  • Track which CSCs have set which error types
  • Report “Reset Required” states via VME Interrupt
  • Tracking for chronic problems in offline log files
• Provide VME registers for diagnostics and monitoring
• Include status and error information in the DDU Trailer
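The buffer-occupancy behaviour described above (Warning, then Busy, then back to Ready near empty) can be sketched as a small state machine with hysteresis. This is an illustration only, not the DDU firmware logic. The names `FmmState` and `next_state` are hypothetical, the Warning threshold is taken as 0.5 from the "half-to-3/4" range, the "near empty" level of 0.1 is a guess, and it is assumed that Warning and Busy are both held until the buffer is near empty.

```python
from enum import Enum


class FmmState(Enum):
    READY = "Ready"
    WARNING = "Warning"
    BUSY = "Busy"


# Thresholds as fractions of buffer capacity.
WARN_LEVEL = 0.5    # slide: "half-to-3/4 full"; lower edge chosen here
BUSY_LEVEL = 0.9    # slide: "beyond ~90% full"
READY_LEVEL = 0.1   # assumption: the slide only says "near empty"


def next_state(current: FmmState, fill: float) -> FmmState:
    """Return the FMM state for a given buffer fill fraction (0.0 to 1.0)."""
    if fill > BUSY_LEVEL:
        return FmmState.BUSY
    if fill <= READY_LEVEL:
        return FmmState.READY
    if current is FmmState.READY and fill >= WARN_LEVEL:
        return FmmState.WARNING
    # Assumed hysteresis: Warning and Busy persist until near empty.
    return current


if __name__ == "__main__":
    state = FmmState.READY
    for fill in (0.2, 0.6, 0.95, 0.6, 0.05):
        state = next_state(state, fill)
        print(f"fill={fill:.2f} -> {state.value}")
```

Holding the state until the buffer drains avoids rapid toggling of the sTTS signal when occupancy hovers around a threshold. Whether the real DDU steps from Busy back through Warning is not stated in the slides.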
Reported Error Categories I
• Configuration failures
  • Constants loaded on a board are not correct
  • Caused by communication errors, bad timing or hardware
  • Often leads to data errors: Timeout, bad DAV, sync loss, buffer overflow, dead or hot channels, format errors, data corruption
• Format error, Consistency error or Not Present
  • An expected format marker is not detected in the proper position
  • Can cause DDU to misidentify a board header/trailer word
    • May show as “missing” board in event
    • May show as bad L1A, CRC or word count
  • Caused by config fail, bad hardware or signal timing/quality
• Hot/dead channels or Empty/Missing CSC
  • Caused by HV, config fail, bad hardware or signal timing/quality
  • Can lead to buffer overflows
  • Missing CSCs are caused by LV-off or disabled CSCs
• DAV-LCT mismatch
  • A CFEB was triggered but it failed to send data
  • Caused by config fail, bad hardware or signal timing/quality
  • Can lead to buffer overflows or Timeout errors
Reported Error Categories II
• Full FIFO @DMB (ALCT or CFEB buffer overflow)
  • Caused by config fail, bad hardware or signal timing/quality
  • Can cause Sync loss, Data corruption, or Timeout
• L1A Number Mismatch Errors
  • Fundamental sign of sync loss
  • Caused by problem with hardware or signal timing/quality
  • Possibly SEU related
• CRC error: bit error detected in transmission
  • Generally a minor concern, affecting only one event
  • Only serious if it affects multiple Header/Trailer bits
  • May be an indicator of a deeper problem
  • CSC electronics have a CRC at every level to detect bit errors
    • CFEB, ALCT, TMB, DMB and DDU
• Overall severity of an error is hard to predict
  • Cases that appear as “Critical” require a Reset as they usually lead to more errors, but sometimes may be self-correcting
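The CRC check mentioned above can be illustrated with a generic example. The slides do not give the CRC width or polynomial used by the CSC boards, so the sketch below substitutes the standard CRC-32 from Python's `zlib` module as a stand-in. The frame layout (payload followed by a 4-byte checksum) and all function names are hypothetical.

```python
import zlib


def append_crc(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 of the payload (stand-in CRC)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame: bytes) -> bool:
    """Receiver side: recompute the CRC and compare with the received one."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit transmission error at the given bit position."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    frame = append_crc(b"example CSC data words")
    print(crc_ok(frame))                # True: clean transmission
    print(crc_ok(flip_bit(frame, 10)))  # False: single-bit error detected
```

A failed check says that some bit in the frame changed, not which one, which is why a CRC error on its own typically invalidates just the affected event.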
Event Quality Indicators from DDU
• The “Single Error” flag in DDU trailer: Do Not Analyze Event
  • Any events with non-perfect data checks will get this
  • Minor bit errors or format problems, SCA Full
• “Single Warning” if problem might not affect the data payload
  • Clean single-bit error in a header/trailer-word marker
  • Fiber receiver/link error that may have occurred between events
  • DCM phase-lock-loss that may occur between events
• The “Critical Error” Sync Lost case: Data Integrity Failure
  • L1A mismatch detected twice on one CSC
    • Two different boards in the same event
    • Separate occurrences in two different events
  • Buffer Overflow at DMB or DDU
  • Note: offline analysis might not see the loss in data integrity
    • At the full point, a buffer still has many “good” events to read out before the compromised data is observed, and sTTS actions can conceal all this
• The “Critical Error” Hard Reset case: Unpacker Failure Likely
  • Anything that corrupts the data irreversibly
  • Violation of event boundaries, can’t determine end-of-CSC data stream
  • Anything that “looks” like an SEU, e.g. repeated trivial errors
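The "L1A mismatch detected twice on one CSC" rule can be sketched as a per-CSC counter. This is a simplified illustration under stated assumptions, not the DDU implementation. The class `L1aMismatchTracker`, its methods, and the board-name keys are hypothetical, and the counter is assumed to persist until a reset.

```python
from collections import defaultdict
from typing import Dict


class L1aMismatchTracker:
    """Counts L1A mismatches per CSC and flags sync loss at two or more."""

    def __init__(self) -> None:
        self._counts: Dict[int, int] = defaultdict(int)

    def record_event(self, csc_id: int, expected_l1a: int,
                     board_l1as: Dict[str, int]) -> bool:
        """Record one event for one CSC.

        board_l1as maps a board name (e.g. "ALCT", "TMB") to the L1A
        number it reported. Returns True once the CSC is sync-lost.
        """
        mismatches = sum(1 for l1a in board_l1as.values()
                         if l1a != expected_l1a)
        self._counts[csc_id] += mismatches
        return self.sync_lost(csc_id)

    def sync_lost(self, csc_id: int) -> bool:
        # Two mismatches, whether from two boards in one event or from
        # two separate events, are treated as the critical case.
        return self._counts.get(csc_id, 0) >= 2

    def reset(self) -> None:
        """Assumed behaviour: a reset clears all counters."""
        self._counts.clear()


if __name__ == "__main__":
    tracker = L1aMismatchTracker()
    print(tracker.record_event(3, 100, {"ALCT": 100, "TMB": 101}))  # False
    print(tracker.record_event(3, 101, {"ALCT": 102, "TMB": 101}))  # True
```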
Summary
• The DDU performs online CSC error monitoring in real time
• The monitor status is in the DDU Trailer for every event
• The DDU monitoring results are useful for offline data quality checking
• Details of DDU monitoring status can be found here: http://www.physics.ohio-state.edu/~cms/ddu/ddu2_pro.html#tr-1
DDU Error Table I
[1] Error bits resulting in RESET REQUIRED persist until the RESET occurs. Questionable cases (in gold) indicate that a reset is only required for mitigation of recurring errors. TBD: sync/hard reset distinctions.
[2] Found inside an event, i.e. between Beginning-Of-Event (=Header1 signature) and End-Of-Event (=combination Trailer1+Trailer2 signatures), at least one of the following: Extra DMB_Header1, Extra DMB_Header2, Lone Word, Extra TMB/ALCT_Trailer, Extra DMB_Trailer1, DMB_Trailer2.
[3] Missing TMB/ALCT_Trailer word, missing DMB Header word, Wrong First word, or Extra Control words.
DDU Error Table II
DDU Error Table III