PEM
The Performance and Exception Monitoring Project
Tim Smith: EFF workshop

Purpose
• Provide means to run stable services
  • Problem diagnosis, corrective actions and alarms; at the detailed level but also at the global level to allow correlations and identify common causes
• Provide level of service measures
  • End-to-end views, user views of services; current status and historical
• Provide uniform access to such information over all services
• Provide resource planning information
  • Long term resource usage and growth statistics, failure rates
• Provide scalable solutions for farms of 1000s of PCs

Scope
• The scope of the PEM Project includes:
  • IT services directly accessed by end users
  • Provision of tools with core service functionality, extensible to other services
  • Documentation of code, usage and plug-in interfaces
  • Consideration of farms remote to the computer centre (experiments)
• But does not include:
  • Definition of service level agreements
  • Network device status/configuration, either inside or outside the computer centre
  • Printers
  • Coding of plug-ins for all IT applications
  • Provision of installation tools

Objectives
• To provide tools in which the alarms and displays are orientated to the overall service provided:
  • User end-to-end views, quality of service views
  • Managerial views of resource usage and evolution
  • Service provider views, and detailed machine views
  • Link the alarms to both the monitoring and corrective actions
• To provide service level metrics
• To provide a uniform monitoring infrastructure
  • Coordinated central repositories + common logging format
  • Averaging and archiving of logged information
  • Correlations between logged information
  • Multiple input routes; extensible monitoring clients
  • Modular tools; demonstrated scalability
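To make the "common logging format" and "averaging of logged information" objectives concrete, here is a minimal sketch. The record layout (`epoch_seconds host metric value`), the function names and the 300-second window are assumptions chosen for illustration; the slides do not define the actual PEM format.

```python
from collections import defaultdict


def parse_record(line):
    """Parse one record in the assumed format 'epoch_seconds host metric value'."""
    ts, host, metric, value = line.split()
    return int(ts), host, metric, float(value)


def average_by_window(lines, window=300):
    """Average each (host, metric) series over fixed windows of `window` seconds.

    Returns a dict keyed by (host, metric, window_start).
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, sample count]
    for line in lines:
        ts, host, metric, value = parse_record(line)
        key = (host, metric, ts - ts % window)
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical sample data: three CPU load readings from one node.
log = [
    "1000 node001 cpu_load 0.50",
    "1100 node001 cpu_load 1.50",
    "1400 node001 cpu_load 2.00",
]
averages = average_by_window(log)
print(averages)
```

Averaging before archiving keeps the central repository small, at the cost of losing short spikes; a real system would likely keep the raw records for a limited time as well.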

Global Metrics
• Honour Service Definitions
  • “Availability of usable 3000 CUs batch”
    • Machines up + FATMEN + LSF
  • “Availability of an interactive facility”
    • ASIS available + low trivial response time
  • “Job turnaround time expectations”
  • “Time to service tape request”
+ Disk/Network bandwidths
+ CPU/Memory utilisations
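As a rough illustration of how a global metric such as the batch availability above might be evaluated, the sketch below combines machine state with the two service dependencies named on the slide. Only the 3000 CU target and the FATMEN and LSF dependencies come from the slide; the `Machine` type, the per-node capacity of 40 CUs and the function name are hypothetical.

```python
from dataclasses import dataclass

REQUIRED_CUS = 3000  # target taken from the service definition on the slide


@dataclass
class Machine:
    name: str
    cus: int   # assumed capacity of this node in CERN Units
    up: bool   # True if the node currently answers its health check


def batch_service_available(machines, fatmen_ok, lsf_ok, required=REQUIRED_CUS):
    """Return (available, usable_cus) for the batch service.

    The service counts as available only if both dependencies respond
    and the machines that are up supply at least `required` CUs.
    """
    usable = sum(m.cus for m in machines if m.up)
    available = bool(fatmen_ok and lsf_ok and usable >= required)
    return available, usable


# Hypothetical farm: 100 nodes of 40 CUs each, every tenth node down.
farm = [Machine(f"node{i:03d}", cus=40, up=(i % 10 != 0)) for i in range(100)]
print(batch_service_available(farm, fatmen_ok=True, lsf_ok=True))
```

The point of the composite check is that the metric follows the service as the user sees it: a farm with every node up still reports the batch service as unavailable if LSF is down.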

Visions of the Future (I)
• 1000s of PCs per cluster
• Living with failures + scalable solutions!
• Assure a service; quorum of machines, NOT full complement
• Quality of Service measures – reflected in the monitoring – Global Metrics
• High level correlations – to assess impact on a service
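The "quorum of machines, not full complement" idea can be sketched as a simple threshold check. The 80% quorum fraction, the function name and the host naming are assumptions for illustration; the slides do not state a threshold.

```python
def quorum_met(states, quorum_fraction=0.8):
    """Return True if enough machines are healthy to assure the service.

    `states` maps hostname -> True if that machine is healthy.
    `quorum_fraction` is an assumed threshold, not a value from the slides.
    """
    if not states:
        return False
    healthy = sum(1 for ok in states.values() if ok)
    return healthy / len(states) >= quorum_fraction


# Hypothetical cluster of 1000 PCs with 150 failed nodes.
cluster = {f"pc{i:04d}": i >= 150 for i in range(1000)}
print(quorum_met(cluster))
```

With a check like this, individual failures in a large cluster raise an alarm only when they threaten the service as a whole, which is what makes monitoring 1000s of PCs manageable.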

Visions of the Future (II)
• Automated installations
  • Bootstrap and checklist
  • Like CERN new arrivals!
• Distributed control
  • Pull new versions
  • Dynamic assignment to experiment
• Configuration management and monitoring intertwined

Milestones (Past and Present)
• Mandate agreed with IT management
• User Requirements Document
• Goal / Question / Metrics study
• Product Survey
• Prototyping – SNMP / JDMK / NetLogger
• Analysis
http://cern.ch/proj-pem

Product Survey
• PIKT for primitives
• bonobo (GNOME) for CORBA components
• JDMK (JMX): Java management tools
• MAT: Monitoring and Admin Tool
• PCP: Performance Co-Pilot
• SNMP
• NetLogger
• SCADA
• Tivoli, Patrol, Unicenter TNG
• Ranger/SLAC, Vamos/DESY, rls/IN2P3/Lyon