http://grid.infn.it/gridice
A monitoring tool for a Grid Operation Center
by EGEE-SA1
Sergio Fantinel, INFN LNL/PD
OUTLINE
• Architecture overview
• CMS DC04 Experience
• Next Steps
• Validation
CMS DC04 GridICE basic layout
• Low-level collection*: we use LEMON (formerly FMON) to collect the host-related metrics; we improved the standard metrics with our extensions (e.g. host services info). It is based on sensors on the host side and on a client/server paradigm for the collection (a minimal sensor sketch follows this slide).
• Publishing service*: on a collector node visible from the Internet (by default the LCG SE) a service publishes the info collected by the LEMON server to an EX GRIS (run on port 2136).
• Discovery and high-level collection: on top there is a service that discovers new resources from the BDIIs and accordingly fires queries at the GRISes to acquire the monitoring info for those resources; the info is stored in an RDBMS for historical and analysis purposes.
*needed only to publish extended info
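To make the sensor layer concrete, here is a minimal sketch of a /proc-based host sensor in the spirit of the LEMON sensors described above. It is illustrative only: the metric names and the standalone printing are our assumptions, not LEMON code.

    # Minimal sketch of a /proc-based host sensor (illustrative, not LEMON).

    def read_loadavg():
        # /proc/loadavg looks like: "0.12 0.08 0.01 1/123 4567"
        with open('/proc/loadavg') as f:
            one, five, fifteen = f.read().split()[:3]
        return {'load1': float(one), 'load5': float(five), 'load15': float(fifteen)}

    def read_meminfo():
        # /proc/meminfo lines look like: "MemTotal:  2048000 kB"
        metrics = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, value = line.split(':', 1)
                if key in ('MemTotal', 'MemFree', 'SwapTotal', 'SwapFree'):
                    metrics[key] = int(value.split()[0])  # value in kB
        return metrics

    if __name__ == '__main__':
        # A real agent would ship these to the LEMON server; here we just print.
        print(read_loadavg())
        print(read_meminfo())

A real sensor would also report per-service status (the "host services info" extension above); the client/server transport to the LEMON server is omitted here.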
[Architecture diagram: LEMON monitoring agents on the cluster worker nodes run metric sensors that read the /proc filesystem; the LEMON server on the cluster head node writes the collected output to the farm monitoring archive, from which information providers produce LDIF output for a GRIS (GLUE+ schema) on the monitoring server. The GridICE Data Collection Framework performs a first discovery phase via LDAP queries to the BDII/GIIS information index (GLUE schema), then continuous discovery & collection via LDAP queries to the GRISes, storing the results in the Central Monitoring Database (GridICE schema) that feeds the web interface.]
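As an illustration of the discovery phase in the diagram, the sketch below queries a BDII for computing elements over LDAP. The host name and port are hypothetical and the Python ldap3 library is our choice for illustration; the base DN and attribute names follow the standard MDS/GLUE conventions.

    from ldap3 import Server, Connection

    # Hypothetical information index endpoint and the usual MDS base DN.
    BDII_HOST, BDII_PORT = 'bdii.example.org', 2170
    BASE_DN = 'mds-vo-name=local,o=grid'

    server = Server(BDII_HOST, port=BDII_PORT)
    conn = Connection(server, auto_bind=True)  # anonymous bind, as in MDS

    # First discovery phase: find the CEs known to the information index.
    conn.search(BASE_DN, '(objectClass=GlueCE)',
                attributes=['GlueCEUniqueID', 'GlueCEStateRunningJobs',
                            'GlueCEStateWaitingJobs'])
    for entry in conn.entries:
        print(entry.GlueCEUniqueID,
              entry.GlueCEStateRunningJobs, entry.GlueCEStateWaitingJobs)

The continuous collection step then repeats similar queries directly against each discovered GRIS and writes the answers into the central database.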
Info Sources & metrics
The GridICE server queries two kinds of sources (see the query sketch below):
• Std. GRIS (port 2135) on the CE and SE
• EX GRIS (port 2136) on the GridICE collector node
Basic info:
• number of queues
• jobs running/waiting
• Storage Areas info
Extended info:
• disk partition space
• network adapter activity
• role-based (CE, SE, RB, RLS, WN, …) user-defined services (daemons, agents, …)
• more… (MEM, CPU, swap, context switches, interrupts, reg. open files, sockets, procs, inodes, host power, …)
GRIS status info:
• GRIS service online/offline
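A hedged sketch of such a query against the collector's EX GRIS; the host name is hypothetical, and the same LDAP interface works against a standard GRIS on port 2135.

    from ldap3 import Server, Connection, ALL_ATTRIBUTES

    # Hypothetical collector node; the EX GRIS publishing the extended
    # (GLUE+) info listens on port 2136.
    server = Server('gridice-collector.example.org', port=2136)
    conn = Connection(server, auto_bind=True)  # anonymous bind

    # Dump every entry with all attributes: the standard GLUE info plus
    # the GridICE extensions (disk partitions, network activity, services, ...).
    conn.search('mds-vo-name=local,o=grid', '(objectClass=*)',
                attributes=ALL_ATTRIBUTES)
    for entry in conn.entries:
        print(entry.entry_dn)
        print(entry.entry_attributes_as_dict)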
CMS DC04 experience
• 11 monitored sites from the merged LCG-CMS/CMS BDIIs; 6 sites publish extended information (CE, SE, RB); 3 sites publish complete info
• in total: 42 GRISes (status with 5-minute resolution), 10 RBs, 13 CEs, 8 SEs, 402 WNs (all with extended info)
• Most of the difficulties encountered came from the following facts:
• at the ramp-up of CMS DC04 the monitoring requirements and the environment were not well known
• high utilization of proprietary/non-grid resources
• high latency in people's responses due to DC stress
CMS DC04 experience
The following are the areas where the GridICE team put the major effort during DC04:
• produced instructions to install the GridICE agent on WNs at sites installed with LCG-2, which has no WN monitoring support (manual & LCFGng)
• produced instructions to install the GridICE agent on any host (UI, non-Grid/LCG, …)
• support to users
• resolved a compatibility issue with LEMON preinstalled on hosts (hosts managed by IT/CERN for CMS DC04)
IT/CERN machines integration
• We were in direct contact with CERN IT people to ensure the compatibility of GridICE with the hosts managed by this CERN division, which provided and managed most of the CERN hosts involved in CMS DC04:
• export buffers (Classic SE, SRM, SRB)
• key machines running agents (e.g. lxgate04.cern.ch for CMS DC04)
• Although the compatibility and the integration were proven, the installation never reached the production hosts due to the ending phase of the DC and the lack of time of the people involved.
CMS DC04 experience: notification
We gained experience with the GridICE notification service, a new feature introduced just for CMS DC04, at 3 main sites: LNL, CERN, PIC (a minimal polling sketch follows this slide).
• LNL: helped us in many situations when services crashed (e.g. the sbatchd LSF daemon on the CE & WNs, nfsd on the LCFGng server) or hosts disappeared from the GIS. In some cases GridICE correctly reported hosts as down while the local monitoring (Ganglia) had not caught the anomaly.
• PIC: correctly notified of RB service restarts for maintenance made by PIC people.
• CERN: RB service unavailability.
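Conceptually the notification service reduces to polling and alerting on state transitions. The sketch below only illustrates that idea; the TCP liveness check, mail setup, and host names are assumptions, not the GridICE implementation.

    import smtplib, socket, time
    from email.message import EmailMessage

    def gris_online(host, port=2135, timeout=5):
        # Crude liveness check: can we open a TCP connection to the GRIS port?
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def notify(host):
        # Hypothetical mail setup; the real notification service differs.
        msg = EmailMessage()
        msg['Subject'] = f'GridICE alert: GRIS on {host} is offline'
        msg['From'] = 'gridice@example.org'
        msg['To'] = 'site-admin@example.org'
        msg.set_content(f'The GRIS on {host} stopped answering.')
        with smtplib.SMTP('localhost') as smtp:
            smtp.send_message(msg)

    state = {}
    while True:
        for host in ('ce01.example.org', 'se01.example.org'):
            up = gris_online(host)
            if state.get(host, True) and not up:  # online -> offline transition
                notify(host)
            state[host] = up
        time.sleep(300)  # matches the 5-minute status resolution noted earlier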
Next steps
• Job monitoring per VO: effective per-(VO, queue) job monitoring and per-user (user certificate) job statistics, so as to produce detailed reports of resource utilization and resource availability (see the query sketch below).
• Notification: in the future we expect a flexible system where authorized users will be able to set up, via a GUI, the notifications they would like to receive.
• Analysis: a generic interface for graph generation.
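Since the collected info already lands in an RDBMS, the per-(VO, queue) statistics amount to an aggregate query. The sketch below assumes a hypothetical jobs table and uses SQLite as a stand-in for the production database; none of this is the actual GridICE schema.

    import sqlite3  # stand-in for the production RDBMS

    conn = sqlite3.connect('gridice.db')
    conn.execute("""CREATE TABLE IF NOT EXISTS jobs
                    (vo TEXT, queue TEXT, user_dn TEXT, state TEXT)""")

    # Per-(VO, queue) running/waiting counts over the hypothetical table.
    rows = conn.execute("""
        SELECT vo, queue,
               SUM(state = 'running') AS running,
               SUM(state = 'waiting') AS waiting
        FROM jobs
        GROUP BY vo, queue
        ORDER BY vo, queue
    """).fetchall()
    for vo, queue, running, waiting in rows:
        print(f'{vo}/{queue}: {running} running, {waiting} waiting')

Grouping by user_dn instead would give the per-user (certificate) statistics mentioned above.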
Validation/experiences: LCG-0
First large deployment in the CMS-LCG0 testbed.
Graph and analysis provided by: M. Maggi et al. – INFN Bari CMS group