Status of the production and news about Nagios
ALICE TF Meeting, 22/07/2010
Summary of last week's production
• Large number of running jobs (expected)
• Pass 1 reconstruction (LHC10d) activities ongoing
• Analysis trains and user analysis tasks ongoing
• 2 MC cycles
  • LHC10d14: pp events, Pythia6, 900 GeV
  • LHC10d15: pp events, Phojet, 900 GeV
  • 7.4 M requested events per cycle
  • Currently 90% completed
• New MC cycles expected during the weekend
Job profile per site
• Activity currently decreasing, mainly at the T0: not an issue, well understood
• Decrease of running jobs at key T1 sites: FZK and CNAF
Job profiles per user
• MC and reconstruction activities
• User analysis activities
Raw data information
• 52 TB of raw data transferred to T0
• T0-T1 raw data transfers: up to 270 MB/s achieved
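To put the two raw data figures side by side, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a transfer rate held constantly at the quoted peak, neither of which is stated on the slide, so the result is only a lower bound on the time needed to export a volume of this size. The function name is illustrative.

```python
# Rough lower bound on the time needed to move 52 TB at the peak T0-T1 rate.
# Assumptions (not from the slides): decimal units and a rate sustained at the peak.

RAW_DATA_TB = 52        # raw data volume quoted on the slide
PEAK_RATE_MB_S = 270    # peak T0-T1 transfer rate quoted on the slide


def transfer_hours(volume_tb: float, rate_mb_s: float) -> float:
    """Return the hours needed to move volume_tb at a constant rate_mb_s."""
    seconds = (volume_tb * 10**12) / (rate_mb_s * 10**6)
    return seconds / 3600


if __name__ == "__main__":
    hours = transfer_hours(RAW_DATA_TB, PEAK_RATE_MB_S)
    print(f"{hours:.1f} h")  # about 53.5 h under the assumptions above
```

Since 270 MB/s is a peak and not an average, the real export time for such a volume would be longer.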
Site news (T0 site)
• CERN
  • Cooling problem at IT this week, affecting the experiment VOBOXES declared with an importance below 50
  • In the case of ALICE: neither the production VOBOXES nor the xrootd redirectors were affected (importance = 90)
  • CAF nodes affected (nodes were switched off). CAF users were warned. The importance of all CAF nodes (nodes and PROOF master) has already been increased to 50
  • CREAM-CE: better performance and stability of the systems this week
  • Transparent upgrade of CASTOR2 (including xrootd) on the 21st of July
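The importance rule described for the cooling incident can be sketched as a simple threshold check. The cutoff of 50 and the values 90 and 50 come from the slide; the function name and the node labels are illustrative, and the comparison is assumed to be strict ("below 50").

```python
# Sketch of the rule above: machines declared with an importance strictly
# below the cutoff were the ones affected by the cooling problem.
# Function name and node labels are hypothetical.

IMPORTANCE_CUTOFF = 50  # from the slide


def is_affected(importance: int, cutoff: int = IMPORTANCE_CUTOFF) -> bool:
    """A node is affected when its importance is strictly below the cutoff."""
    return importance < cutoff


nodes = {
    "production VOBOX": 90,               # from the slide
    "xrootd redirector": 90,              # from the slide
    "CAF node (after the increase)": 50,  # from the slide
}

for name, importance in nodes.items():
    state = "affected" if is_affected(importance) else "not affected"
    print(f"{name}: importance {importance} -> {state}")
```

Under this reading, raising the CAF nodes to 50 is just enough to keep them out of the affected group in a similar incident.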
Site news (T1 sites)
• CNAF
  • Job profile: problems with the information reported by the resource BDII twice this week
  • Up to 11K ALICE agents waiting in the queues
  • SE disk space for ALICE increased: 546 TB available (32% used as of the 21st of July)
  • Today the SE was reporting some problems in ML: SOLVED by Francesco
  • All server services have been restarted
• SARA
  • SE in scheduled downtime this week for a dCache upgrade
  • System back in production, with good performance in ML
• LYON
  • SE still under tuning
  • The configuration is set up and writing to the storage disks works; however, there is a problem with the migration of the disk data to HPSS (rfio issue)
Site news (T2 sites)
• Torino
  • Migrating the CREAM system to the latest version
• Madrid
  • Same operation as in Torino
• Cyfronet
  • Lack of available resources. Waiting for the site admin to increase them
Site summary
• Hardly going over 20K concurrent jobs
  • Cooling problems at FZK last week
  • Info system issues at CNAF
  • High load from another experiment at Lyon
• Several sites have seen a high number of jobs running over 46h this week
  • Pathological jobs: although they finish correctly, their outputs cannot be used
  • Prevention measure: set the CPU time of the ALICE queues to a 24h limit
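The effect of the proposed 24h limit can be illustrated with a minimal sketch. The 24h and 46h figures come from the slide; the `Job` record, the job identifiers, and the sample values are made up for illustration. The sketch also treats running time and CPU time as the same quantity, which is a simplification.

```python
from dataclasses import dataclass

CPU_LIMIT_HOURS = 24      # proposed queue limit (from the slide)
PATHOLOGICAL_HOURS = 46   # running time observed for the pathological jobs (from the slide)


@dataclass
class Job:
    job_id: str        # hypothetical identifier
    cpu_hours: float   # CPU time consumed so far


def over_limit(jobs, limit_hours=CPU_LIMIT_HOURS):
    """Return the jobs whose CPU time exceeds the given limit."""
    return [job for job in jobs if job.cpu_hours > limit_hours]


# Made-up sample, for illustration only.
sample = [Job("a", 3.5), Job("b", 25.0), Job("c", 47.2)]

# Jobs that a 24h queue limit would stop.
print([job.job_id for job in over_limit(sample)])
# Jobs matching the ">46h" pattern reported by the sites.
print([job.job_id for job in over_limit(sample, PATHOLOGICAL_HOURS)])
```

Any job in the second group is also in the first, so a 24h limit stops the pathological jobs well before they reach 46h, at the cost of also stopping legitimate jobs that need between 24h and 46h.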
Nagios news
• Currently publishing 2 sensors in the VALIDATION infrastructure: VOBOXES and CE
• Some of the discrepancies found between the SAM and Nagios results: SOLVED
• SITES SHOULD KNOW:
  • The VOBOXES MUST be published in the GOCDB and the BDII
  • The VOBOXES MUST be pingable from samnag014.cern.ch
    • Standard Nagios test
    • Requested by ALICE about one month ago through this meeting
• Together we will need to define when this infrastructure should be put in production
  • This needs to be announced at the next MB meeting on the 27th
  • This implies the deprecation of SAM
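The site requirements listed above can be summarised as a small checklist. This is only a sketch: the `VoboxStatus` record, its field names, and the example host are hypothetical, and the three flags are assumed to be filled in by hand or by site tooling. It does not query the GOCDB, the BDII, or Nagios.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoboxStatus:
    hostname: str               # hypothetical host name
    in_gocdb: bool              # published in the GOCDB
    in_bdii: bool               # published in the BDII
    pingable_from_nagios: bool  # answers ping from samnag014.cern.ch


def missing_requirements(vobox: VoboxStatus) -> List[str]:
    """List the requirements from the slide that this VOBOX does not yet meet."""
    missing = []
    if not vobox.in_gocdb:
        missing.append("publish in the GOCDB")
    if not vobox.in_bdii:
        missing.append("publish in the BDII")
    if not vobox.pingable_from_nagios:
        missing.append("allow ping from samnag014.cern.ch")
    return missing


# Illustrative entry only; not a real site.
example = VoboxStatus("vobox.example.org", in_gocdb=True, in_bdii=False,
                      pingable_from_nagios=True)
print(missing_requirements(example))
```

An empty list means the VOBOX meets all three conditions named on the slide.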