AMOD Report
Doug Benjamin
Duke University
Hourly jobs running during the last week
[Plot: hourly running jobs, axis from 0 to 140 K. Legend: blue = MC simulation, yellow = data processing, red = user analysis, magenta = group production, grey = group analysis]
DDM data flows during the last week
[Plot: DDM data flows, axis labels 800 TB, 10 TB, 0 TB]
Notable activities
• Monday - Recovery from slow T0 export over the weekend to RAL and TRIUMF
  • Both switched over to the backup OPN over the weekend; cause never understood
  • TRIUMF: slower link; RAL: asymmetric link
• Tuesday - SARA T0 export and T1 stage-from-tape issues
• Wednesday - RAL unplanned power cut; CERN LSF job submission slowness
• Thursday - RAL power restored, recovery from the outage; CERN LSF job submission slowness continued
• Friday - CERN LSF job submission slowness
• Saturday - Rain, lots of it (flooding: R1, my office building, SPS; took beam offline)
Other notable events
• ND cloud local storage problems
  • Currently trying to recover 70k files to avoid declaring them lost. Resubmitting most tasks, and Rob subscribed to the missing RAW input files.
• RAL - worked to recover several ATLAS pools affected by the power cut (159 files declared lost)
Bulk reprocessing
• Originally planned to start with period D, then B, then A and then C
• Instead period D started, then periods B, A and C, to keep jobs running in all clouds, but...
• This processing pattern has caused disk space problems at Tier 1 sites
• Stopped early submission of periods A and C; D and B continue
• As of Sunday: period D 98.5% done (before merge), period B 68% done
• Over the weekend, disk space at the Tier 1 sites became an issue
T1 data disk space
• Due to low free disk space, PIC, SARA and FZK were all removed from SANTA CLAUS; now 4 T1 sites are excluded (DE, ES, NL, IT clouds)
• Saturday - Stephane Jezequel triggered cleaning (Victor has been running very slowly recently)
• Situation at FZK and SARA improved
• Monday (12-Nov) - SARA will migrate 60 TB from scratch to data disk
• PIC still an issue as of Sunday night
• Stephane is moving MC datasets away
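The exclusion of sites with low free disk space described above can be illustrated with a minimal sketch. This is not how SANTA CLAUS or any ATLAS tool is implemented; the `SiteDisk` type, the `sites_to_exclude` function, the 10% threshold, the site names and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiteDisk:
    """Disk status of one site (hypothetical structure)."""
    name: str
    total_tb: float
    free_tb: float


def sites_to_exclude(sites, min_free_fraction=0.10):
    """Return the names of sites whose free-space fraction is below the threshold.

    The 10% default is an assumed value, not an ATLAS policy.
    """
    excluded = []
    for site in sites:
        if site.total_tb <= 0:
            raise ValueError(f"total capacity must be positive for {site.name}")
        if site.free_tb / site.total_tb < min_free_fraction:
            excluded.append(site.name)
    return excluded


# Made-up sites and capacities, for illustration only.
example_sites = [
    SiteDisk("SITE_A", total_tb=1000.0, free_tb=40.0),   # 4% free
    SiteDisk("SITE_B", total_tb=2000.0, free_tb=150.0),  # 7.5% free
    SiteDisk("SITE_C", total_tb=3000.0, free_tb=600.0),  # 20% free
]

print(sites_to_exclude(example_sites))
```

A check of this kind, run periodically, is one way to spot sites approaching the limit before they have to be removed from data distribution.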
LSF
• LSF job dispatch speed caused problems all week
[Plots: axis labels 60 K and 6 K]
Conclusions
• Thanks to the experts, sites and shifts (Comp@p1, ADCOS, ADCOS expert)
• Bulk reprocessing proceeding relatively smoothly
• LSF job submission speed causing the Tier 0 team headaches
• DATA disk space at the Tier 1 sites is an issue. It needs to be monitored so as not to affect bulk reprocessing