Comprehensive Analysis of Bad Runs at the Alberto Oliva Facility

Bad Runs
Alberto Oliva, July 2012

Sample
Sample: B584/pass3
Runs: 22804 (26220 files)
Time span: 20/05/2011 (1305853512) – 17/05/2012 (1337244535), ~ 1 year
Processed runs: 22802
Errors in processing: 2 (1327065737, 1335142636)
B584 has new tools for the study of bad runs, provided with the new production:
• A. Kounine (xDR error handling)
• M. Duranti (DSP errors)
• V. Choutko (DaqEvent storing)
This analysis has been performed on the SEU cluster. Full analysis of all the runs in less than 1 day.
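The numbers in parentheses next to the dates read as Unix timestamps. As a minimal sketch (the helper names `to_utc` and `span_days` are illustrative and not part of the analysis code), they can be checked against the quoted calendar dates like this:

```python
from datetime import datetime, timezone

# Unix timestamps quoted on the slide (seconds since 1970-01-01 UTC).
FIRST_RUN = 1305853512
LAST_RUN = 1337244535


def to_utc(timestamp: int) -> datetime:
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def span_days(start: int, end: int) -> float:
    """Length of the interval [start, end] in days."""
    return (end - start) / 86400.0
```

Formatting `to_utc(FIRST_RUN)` and `to_utc(LAST_RUN)` with `"%d/%m/%Y"` reproduces the two dates on the slide, and `span_days(FIRST_RUN, LAST_RUN)` comes out a little above 363 days, in line with the "~ 1 year" estimate.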

Day-by-day Acquisition Time
[Plot: acquisition time per day, 24h scale, with the following drops annotated]
• Nothing in e-log, maybe missing runs (?) (Aug/2011)
• Power reconfig (14/Nov/2011)
• JMDC Hang (25/Nov/2011)
• TPD Problem (2/Dec/2011)
• ECAL Desync (15-22/Dec/2011)
• RDR LV Alarm (24/Dec/2011)
Commissioning: ~ 2 weeks (TOF HV scans, TTCS test, Trigger tests, …). Ended with trigger finalization, 5/Jun/2011 (1307301440): https://amsvobox04.cern.ch/elog/LEAD/5143
346 working days (after commissioning) for 334 acquisition days: “Operations Efficiency” ~ 96.3%
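The "Operations Efficiency" is the ratio of acquisition days to working days after commissioning. A rough sketch of the arithmetic follows, assuming the working period runs from the trigger-finalization timestamp to the end of the sample (17/05/2012, 1337244535); the function names are illustrative. Because the 334 acquisition days are a rounded figure, the ratio computed here lands slightly above the quoted ~ 96.3%, which was presumably derived from unrounded times.

```python
SECONDS_PER_DAY = 86400

COMMISSIONING_END = 1307301440  # trigger finalization, 5/Jun/2011
LAST_RUN = 1337244535           # end of the sample, 17/May/2012 (assumed end of the working period)
ACQUISITION_DAYS = 334          # rounded value quoted on the slide


def working_days(start: int, end: int) -> float:
    """Days elapsed between two Unix timestamps."""
    return (end - start) / SECONDS_PER_DAY


def operations_efficiency(acquisition_days: float, work_days: float) -> float:
    """Fraction of the working period spent acquiring data."""
    return acquisition_days / work_days


WORK_DAYS = working_days(COMMISSIONING_END, LAST_RUN)
EFFICIENCY = operations_efficiency(ACQUISITION_DAYS, WORK_DAYS)
```

`WORK_DAYS` evaluates to roughly 346.6, matching the 346 working days on the slide, and `EFFICIENCY` falls between 96% and 97%.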

Bad Runs Lists
• DAQ: a list of known bad runs from the general point of view, provided by M. Paniccia
  > Pulser Left ON: 17 runs, ~ 5 hours
• ECAL: a detailed list came from M. Paniccia
  > ECAL crate E0 event mismatch (Dec 2011): 491 runs, ~ 7 days
  > EDR-1-0-A wrong configuration file (Jul 2011): 14 runs, ~ 4 hours
  > EIB RP3 (EDR-1-3 connector 0) power cycle (Sep 2011): 1 run, ~ 3 min
  > EIB RP3 (EDR-1-3 connector 0) wrong trigger settings (Sep 2011): 23 runs, ~ 8 hours
  > ECAL Trigger Test (Nov 2011): 1 run, ~ 1 hour
• Tracker: Jose provided a list of bad runs
  > Bad Calibration (done in SAA or polar regions): 135 runs, ~ 1.8 days
• TRD: a list will be provided soon
• TOF: starting from a list of TOF problems compiled by V. Bindi, no runs were tagged as bad.
• RICH: an estimator of the run quality will be provided. For the moment, no bad runs from RICH.

Run Tag
[Plot labels: ECAL Tags; TRD Refills; TOF scans; JMDC Reboot (default RunTag)]
Cutting on RunTag would cut away all the refill periods, which roughly account for 1179 runs, ~ 17 days. No cut right now: we will use the TRD list.

Event Synchronization
A synchronization error occurs whenever parts of the event come from different triggers. Most (not all) of the synchronization problems are detected during event building.
> A. Kounine provided code to check for this problem offline.
[Plot labels: No ECAL Desync and No Commissioning; ECAL Desync (15-22/Dec/2011); only few desync during run (?)]
Runs with at least one desync error: 635 runs (~ 11 days)
A big fraction comes from the 15-22 Dec 2011 ECAL problem (the cuts overlap). Another big fraction comes from runs with a very low number of desync errors.

Program Memory Error
Corruption of the program memory of a DSP. From time to time (rarely) it causes erratic behavior of the board.
> M. Duranti developed the tools needed to check the problem carefully.
[Diagram labels: Node is OK; Node is KO; Node is OK; DSP test on node → Status is OK; DSP test on node → Status is KO; Node Boot]
Runs with at least one DSP error: 4030 runs, ~ 60 days → Not used. Left as a possibility for an accurate data analysis of the efficiency.

Fraction of Missing Events
Comparison between the last event number and the events on disk. Possible sources of a difference:
> Error on frame transmission (missing, incomplete, corrupted).
> Error on JMDC event transmission (corrupted format, event CRC error).
> Offline production problem.
16587 runs have 0 difference!
[Plot: runs with a large number of missing events; axis marks 0.1%, 1%, 10%, 100%. Why this shape?]
Fraction of missing events > 0.1%: 421 runs (~ 6 days)
We have some runs with more events on disk than expected from the last event number → reprocess duplication (86 runs).
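The comparison described on this slide can be sketched as follows. This is only an illustration under stated assumptions: the `RunCounts` container, its field names, and the `classify` labels are hypothetical, and the missing fraction is taken to be the difference between the last event number and the events on disk, normalised by the last event number. Only the 0.1% threshold comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class RunCounts:
    """Per-run counters (hypothetical layout, not the analysis format)."""
    run: int
    last_event: int      # last event number seen in the run
    events_on_disk: int  # events actually found in the processed files


def missing_fraction(counts: RunCounts) -> float:
    """Fraction of events missing on disk; negative if disk holds extra events."""
    if counts.last_event <= 0:
        return 0.0
    return (counts.last_event - counts.events_on_disk) / counts.last_event


def classify(counts: RunCounts, threshold: float = 1e-3) -> str:
    """Label a run using the 0.1% missing-event threshold from the slide."""
    fraction = missing_fraction(counts)
    if fraction < 0:
        return "duplicated"  # more events on disk than expected
    if fraction > threshold:
        return "bad"
    return "ok"
```

For example, a run with 1000 expected events and 990 on disk has a missing fraction of 1% and is labelled "bad", while one with 1010 events on disk is labelled "duplicated", the case attributed to reprocess duplication.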

Fraction of Events with Error
• The events collected and stored on disk could be affected by “hardware” problems (ROOM error, desync error, …).
• Using A. Kounine's code we can check errors on all the DAQ nodes (not only the upper part).
Typical error rate is of the order of 0.1%
[Plot: runs with a large number of events with errors; axis marks 0.1%, 1%, 10%, 100%]
Fraction of events with errors > 1%: 661 runs (~ 10 days)

Fraction of Events with no Particles
Events with no error may not have an associated reconstructed particle.
> Trigger on an interaction event (accounted for in the acceptance evaluation).
> Bad trigger configuration (not accounted for).
[Plot: axis marks 1%, 10%, 100%; annotation: All coming from 7/Aug/2011 (Pulser On)]
Fraction of events with No Particle < 1%: 22 runs (8 hours)

AMS Zenith Angle
The ISS moves like an “acrobat”, so AMS is not always pointing to the “Zenith”. The code for the angle of the AMS Z axis with respect to the Zenith comes from C. Consolandi.
[Plot labels: movements during run; high angle]
AMS “not vertical”: 74 runs (~ 1 day)
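The angle between the AMS Z axis and the Zenith is the angle between two direction vectors. The sketch below is not C. Consolandi's code: it assumes both directions are already available as 3-vectors in a common frame, and the function names and the 40 degree cut are purely hypothetical placeholders, since the slide does not quote the threshold used.

```python
import math


def angle_deg(z_axis, zenith):
    """Angle in degrees between the AMS Z axis and the zenith direction."""
    dot = sum(a * b for a, b in zip(z_axis, zenith))
    norm = math.sqrt(sum(a * a for a in z_axis)) * math.sqrt(sum(b * b for b in zenith))
    # Clamp to [-1, 1] to protect acos against rounding errors.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def is_not_vertical(angles_deg, max_angle_deg=40.0):
    """True if any sampled angle in the run exceeds the (hypothetical) cut."""
    return any(angle > max_angle_deg for angle in angles_deg)
```

A run sampled at several times would then be tagged as "not vertical" when `is_not_vertical` returns `True` for its list of angles.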

Conclusion
Lists in Vitaly’s format have been created. The TRD bad run list is missing. Probably some runs should be reprocessed. The acquisition time should be decreased by the SAA exposure (around 15%).
