AMOD Report December 3-9, 2012

AMOD Report December 3-9, 2012. Torre Wenaus, December 11, 2012. Activities: data taking until the 6th, the likely end of 2012 pp physics running; bulk reprocessing mostly done; ~1.3M production jobs (group, MC, validation, reprocessing); ~2.5M analysis jobs; ~610 analysis users.


Presentation Transcript


  1. AMOD Report December 3-9, 2012 Torre Wenaus December 11, 2012

  2. Activities
     • Data taking until the 6th, the likely end of 2012 pp physics running
     • Bulk reprocessing mostly done
     • ~1.3M production jobs (group, MC, validation, reprocessing)
     • ~2.5M analysis jobs
     • ~610 analysis users

  3. Production & Analysis
     • Sustained activity, production and analysis
     • Fluctuating analysis workload: ~6k min (!) to 34k max

  4. Data transfer - source

  5. Data transfer - destination

  6. Data transfer - activity

  7. T0 export tailing off at end of week with end of pp data taking

  8. Reprocessing (yellow) tailing off

  9. Tier 0, Central Services
     • Tue pm: T0 LSF: slow LSF job dispatching, ALARM ticket at 23:46. Promptly answered: a reconfig run at 23:00 to fix an issue was slow and reduced responsiveness to job submission. Queues refilled by 00:06. Experts are looking at why the reconfig took so long. Ticket closed. GGUS:89202
     • Sat am: CERN-PROD: ALARM: ATLAS web server down. Response in 10 min, resolution in ~30 min. Due to power outage. Closed. GGUS:89334
     • Weekend: CERN-PROD: EOS source errors and several periods of EOSATLAS instability in SLS (next slide). GGUS:89328
     • During the week, a few cases (besides the alarm ticket) of T0 bsub time spiking to ~6-8 sec for <~1 hr

  10. EOSATLAS availability lapses

  11. ADC
     • Tue pm: Security ticket to ATLAS VOSupport: ATLAS creating world-writable directories. In the PanDA pilot, one directory creation case (in the job recovery directory) was missed when setting access to 770. Fixed in pre-production code. GGUS:89182
     • Tue: Recurrence of the problem of a corrupt dCache library libdcap.so being disseminated by software installation, resulting in ANALY jobs failing at all sites using dCache. Fixed promptly, with a new check to prevent recurrence.
     • Tue: MuonCalibration-17.2.7.4.1 not found at ANALY_MPPMU calibration site; resolved by AleDG/AleDS/Alden. Confusion over celist source (it is AGIS)
     • Bulk ESD lifetime changed from 4 weeks to 3 weeks (Ueda)
     • Case of duplicate GUIDs, analysis ongoing
     • Thu: MUON_CALIBDISK close to full at INFN-NAPOLI; a deletion run freed sufficient space
     • Weekend: SARA DATADISK filled up (next slide)
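The world-writable directory issue in slide 11 can be illustrated with a minimal Python sketch. This is not the PanDA pilot's code; the function names `make_private_dir` and `is_world_writable` are hypothetical. It shows one way to create a directory and guarantee mode 770 regardless of the process umask, plus a check for the world-writable bit.

```python
import os
import stat


def make_private_dir(path, mode=0o770):
    """Create path if needed and force its permission bits to mode.

    Hypothetical helper for illustration, not taken from the pilot.
    """
    os.makedirs(path, exist_ok=True)
    # The mode passed to mkdir-style calls is filtered by the process umask,
    # so set the bits explicitly. Only the final directory is changed here;
    # any parent directories created along the way keep default permissions.
    os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)


def is_world_writable(path):
    """Return True if 'others' have write permission on path."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)
```

A check like `is_world_writable` could be run over every directory a job creates, which is one way a missed case such as the job recovery directory might be caught.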

  12. T1 DATADISK space full
     • At SARA on the weekend, a week of ~monotonic decline in available space reached its end
     • Sat pm: Taken out of T0 export at 10 TB free
     • DDM auto blacklisting didn’t kick in. When was it supposed to? 1 TB? Very low…
     • Mon am: Manually blacklisted
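The two free-space levels mentioned in slide 12 can be sketched as a simple threshold check. This is only an illustration under stated assumptions: the slide does not describe the actual DDM logic, the 1 TB blacklisting level is the value the slide itself questions, and the function name `space_action` and its return labels are hypothetical.

```python
TB = 10**12  # decimal terabyte, assumed for illustration


def space_action(free_bytes, export_floor=10 * TB, blacklist_floor=1 * TB):
    """Map free space on a storage endpoint to an illustrative action.

    Both thresholds are assumptions: 10 TB is where SARA was taken out of
    T0 export, and 1 TB is the auto-blacklisting level the slide queries.
    """
    if free_bytes <= blacklist_floor:
        return "blacklist"
    if free_bytes <= export_floor:
        return "exclude_from_t0_export"
    return "ok"
```

With these values there is a wide band between 10 TB and 1 TB in which a site is out of T0 export but not yet blacklisted, which matches the gap the slide points out.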

  13. Tier 1 Centers
     • Mon am: IN2P3: the dCache patch for the long proxies problem (GGUS:88984), thought to have fixed the regular SRM hangups, did not actually fix the issue. It recurred Tuesday; the site then put in a cron to detect the need for and perform an SRM restart. No need to restart the server since. Investigations ongoing. GGUS:89111
     • Mon am: RAL: Failures in input file staging, high FTS error rate. Restarting the stager and rebalancing the database solved it. Closed. GGUS:89141
     • Tue pm: Taiwan-LCG2: many job failures due to insufficient space on local disk. Site increased maximum job workdirsize in schedconfig. Ticket closed, but the problem recurred Thu am, new ticket. Site reduced job slots on small-disk WNs. Ticket on hold for observation. GGUS:89200, 89253
     • Wed am: FZK TAPE T0 export resumed after resolution of last week's ticket. Some timeout failures since, but not persistent. Closed. GGUS:88877

  14. Tier 1 Centers (2)
     • Thu pm: SARA: T0 export failures, quick site response and resolution: "we were overloaded with requests from jobs from another cluster. This has been blocked now...", which solved the problem. Closed. GGUS:89289
     • Sat am, through weekend: FZK-LCG2: Persistent <8% job failure rate due to timeouts saving files to local SE, logged on reopened 2/12 ticket. Mon am update: site canceled some long-standing inactive transfers on the ATLAS write buffer pools. GGUS:89110
     • Sat am: Taiwan-LCG2: Missing file needed for production. Affected by disk maintenance, recovered by site. Closed. GGUS:89332
     • Sat pm: PIC: failing source transfers. Cured with SRM restart. Site is checking what caused the SRM failures. GGUS:89338

  15. Other
     • GGUS experts unable to reproduce last week's issue, in which clicking ‘back’ twice after creating a ticket creates another one (observed in Firefox)
     • Coming:
     • PIC capacity at ~65% Dec 10-21 to save electricity
     • Several downtimes this week (Dec 10+)
     • Sites: please make clear in GOC downtime notices the scope/impact of the downtime
     • With regular space issues as well as occasional hardware etc. issues, exclusion from T0 export is pretty common; it would be nice to have monitoring of in/exclusion status and a simplified/safer inclusion/exclusion procedure
     • Noticed shifters paying attention to a site they shouldn’t need to (UTD-HEP)… how to prevent?
     • https://savannah.cern.ch/support/?133697

  16. Thanks
     • Thanks to all shifters and helpful experts!
