AMOD Report December 3-9, 2012 Torre Wenaus December 11, 2012
Activities
• Datataking until the 6th, the likely end of 2012 pp physics running
• Bulk reprocessing mostly done
• ~1.3M production jobs (group, MC, validation, reprocessing)
• ~2.5M analysis jobs
• ~610 analysis users
Production & Analysis
• Sustained activity in both production and analysis
• Fluctuating analysis workload: ~6k min (!) to 34k max
Data transfer - source
Data transfer - destination
Data transfer - activity
T0 export tailing off at end of week with end of pp datataking
Reprocessing (yellow) tailing off
Tier 0, Central Services
• Tue pm: T0 LSF: slow LSF job dispatching, ALARM ticket at 23:46. Promptly answered: a reconfig run at 23:00 to fix an issue was slow and reduced responsiveness to job submission. Queues refilled by 00:06. Experts are looking at why the reconfig took so long. Ticket closed. GGUS:89202
• Sat am: CERN-PROD: ALARM: ATLAS web server down; response in 10 min, resolution in ~30 min. Due to a power outage. Closed. GGUS:89334
• Weekend: CERN-PROD: EOS source errors and several periods of EOSATLAS instability in SLS (next slide). GGUS:89328
• During the week, a few cases (besides the alarm ticket) of T0 bsub time spiking to ~6-8 sec for up to ~1 hr
EOSATLAS availability lapses
ADC
• Tue pm: Security ticket to ATLAS VOSupport: ATLAS creating world-writable directories. In the PanDA pilot, one directory creation case (in the job recovery directory) was missed when setting access to 770. Fixed in pre-production code. GGUS:89182
• Tue: Recurrence of the problem of a corrupt dCache library libdcap.so being disseminated by sw installation, resulting in ANALY jobs failing at all sites using dCache. Fixed promptly, with a new check to prevent recurrence.
• Tue: MuonCalibration-17.2.7.4.1 not found at ANALY_MPPMU calibration site; resolved by AleDG/AleDS/Alden. Confusion over the celist source (it is AGIS)
• Bulk ESD lifetime changed from 4 weeks to 3 weeks (Ueda)
• Case of duplicate GUIDs, analysis ongoing
• Thu: MUON_CALIBDISK close to full at INFN-NAPOLI; a deletion run freed sufficient space
• Weekend: SARA DATADISK filled up (next slide)
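The 770 permission fix mentioned in the first ADC bullet can be illustrated with a minimal Python sketch. This is not the PanDA pilot code; the function names `make_group_dir` and `is_world_writable` are hypothetical, and the sketch only assumes a POSIX filesystem. It shows why an explicit chmod is useful: the mode passed to `os.makedirs` is filtered by the process umask and is not applied to a directory that already exists.

```python
import os
import stat


def make_group_dir(path, mode=0o770):
    """Create path (with parents) and force the leaf directory's permission bits.

    Hypothetical helper, not taken from the pilot. Only the final directory
    is chmod'ed; parent directories keep whatever the umask gives them.
    """
    os.makedirs(path, exist_ok=True)
    # Explicit chmod: the mode argument of makedirs is subject to the umask
    # and is ignored for directories that already exist.
    os.chmod(path, mode)
    return path


def is_world_writable(path):
    """Return True if the 'other' write bit is set on path."""
    return bool(stat.S_IMODE(os.stat(path).st_mode) & stat.S_IWOTH)
```

A check like `is_world_writable` could be run over every directory a job creates, so that a missed creation path shows up in testing rather than in a security ticket.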
T1 DATADISK space full
• At SARA on the weekend, a ~monotonic decline of available space over a week reached the end
• Sat pm: Taken out of T0 export at 10 TB free
• DDM auto blacklisting didn’t kick in. When was it supposed to? 1 TB? Very low…
• Mon am: Manually blacklisted
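The threshold question raised on this slide can be made concrete with a small Python sketch. It is not the DDM blacklisting logic: the function names, the action labels, and the decimal-terabyte unit are assumptions for illustration. The 10 TB export floor is the value from the slide, while the 1 TB blacklist floor is only the slide's own guess. The second function extrapolates a roughly monotonic decline linearly, which is one possible way to warn before the floor is reached.

```python
TB = 10**12  # bytes; decimal terabyte is an assumption


def space_action(free_bytes, export_floor=10 * TB, blacklist_floor=1 * TB):
    """Return a hypothetical action label for a storage endpoint's free space.

    export_floor follows the slide (10 TB); blacklist_floor is the slide's
    guess (1 TB), not a confirmed DDM setting.
    """
    if free_bytes < blacklist_floor:
        return "blacklist"
    if free_bytes <= export_floor:
        return "exclude_from_t0_export"
    return "ok"


def days_until_full(samples):
    """Estimate days until zero free space from (day, free_bytes) samples.

    Uses only the first and last samples (a straight-line fit) and requires
    that they fall on different days. Returns None if free space is not
    shrinking.
    """
    (d0, f0), (d1, f1) = samples[0], samples[-1]
    rate = (f0 - f1) / (d1 - d0)  # bytes consumed per day
    if rate <= 0:
        return None
    return f1 / rate
```

Keeping the two floors as parameters makes the gap between them visible: with a 1 TB blacklist floor, a site can sit excluded from T0 export but still accepting other transfers for some time before automatic blacklisting would apply.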
Tier 1 Centers
• Mon am: IN2P3: regular SRM hangups, thought to have been fixed with a dCache patch for the long proxies problem (GGUS:88984), were not actually fixed by it. Recurred Tuesday; the site then put in a cron to detect the need for an SRM restart and perform it. No need to restart the server since. Investigations ongoing. GGUS:89111
• Mon am: RAL: Failures in input file staging, high FTS error rate. Restarting the stager and rebalancing the database solved it. Closed. GGUS:89141
• Tue pm: Taiwan-LCG2: many job failures due to insufficient space on local disk. Site increased the maximum job workdir size in schedconfig. Ticket closed, but the problem recurred Thu am, new ticket. Site reduced job slots on small-disk WNs. Ticket on hold for observation. GGUS:89200, 89253
• Wed am: FZK TAPE T0 export resumed after resolution of last week's ticket. Some timeout failures since, but not persistent. Closed. GGUS:88877
Tier 1 Centers (2)
• Thu pm: SARA: T0 export failures, quick site response and resolution: "we were overloaded with requests from jobs from another cluster. This has been blocked now...", which solved the problem. Closed. GGUS:89289
• Sat am, through weekend: FZK-LCG2: Persistent <8% job failure rate due to timeouts saving files to the local SE, logged on the reopened 2/12 ticket. Mon am update: site canceled some long-standing inactive transfers on the ATLAS write buffer pools. GGUS:89110
• Sat am: Taiwan-LCG2: Missing file needed for production. Affected by disk maintenance, recovered by site. Closed. GGUS:89332
• Sat pm: PIC: failing source transfers. Cured with an SRM restart. Site is checking what caused the SRM failures. GGUS:89338
Other
• GGUS experts were unable to reproduce last week's issue, in which clicking ‘back’ twice after creating a ticket creates another one (observed in Firefox)
• Coming:
  • PIC capacity at ~65% Dec 10-21 to save electricity
  • Several downtimes this week (Dec 10+)
• Sites: please make clear in GOC downtime notices the scope/impact of the downtime
• With regular space issues as well as occasional hardware and other issues, exclusion from T0 export is pretty common. It would be nice to have monitoring of inclusion/exclusion status and a simplified/safer inclusion/exclusion procedure
• Noticed shifters paying attention to a site they shouldn’t need to (UTD-HEP)… how to prevent?
  • https://savannah.cern.ch/support/?133697
Thanks
• Thanks to all shifters and helpful experts!