PEM

Presentation Transcript


  1. PEM The Performance and Exception Monitoring Project Tim Smith: EFF workshop

  2. Purpose
     • Provide means to run stable services
     • Problem diagnosis, corrective actions and alarms; at the detailed level but also at the global level to allow correlations and identify common causes
     • Provide level of service measures
     • End-to-end views, user views of services; current status and historical
     • Provide uniform access to such information over all services
     • Provide resource planning information
     • Long term resource usage and growth statistics, failure rates
     • Provide scalable solutions for farms of 1000s of PCs

  3. Scope
     • The scope of the PEM Project includes:
     • IT services directly accessed by end users
     • Provision of tools with core service functionality, extensible to other services
     • Provision of documentation of code, usage and plug-in interfaces
     • Consideration of farms remote to the computer centre (exps)
     • But does not include:
     • Definition of Service level agreements
     • Network device status/configuration, either inside or outside the computer centre
     • Printers
     • Coding of plug-ins for all IT applications
     • Provision of installation tools

  4. Objectives
     • To provide tools in which the alarms and displays are orientated to the overall service provided:
     • User end-to-end views, Quality of service views
     • Managerial views of resource usage and evolution
     • Service provider views, and detailed machine views
     • Link the alarms to both the monitoring and corrective actions
     • To provide service level metrics
     • To provide a uniform monitoring infrastructure
     • Coordinated central repositories + Common logging format
     • Averaging and archiving of logged information
     • Correlations between logged information
     • Multiple input routes; extensible monitoring clients
     • Modular tools; demonstrated scalability

  5. Global Metrics
     • Honour Service Definitions
     • “Availability of usable 3000 CUs batch”
     • Machines up + FATMEN + LSF
     • “Availability of an interactive facility”
     • ASIS available + low trivial response time
     • “Job turnaround time expectations”
     • “Time to service tape request” + Disk/Network bandwidths + CPU/Memory utilisations

  6. Visions of the Future (I)
     • 1000s of PCs per cluster
     • Living with failures + scalable solutions!
     • Assure a service; Quorum of machines, NOT full complement
     • Quality of Service measures – reflected in the monitoring – Global Metrics
     • High level correlations – to assess impact on a service

  7. Visions of the Future (II)
     • Automated installations
     • Bootstrap and checklist
     • Like CERN new arrivals!
     • Distributed control
     • Pull new versions
     • Dynamic assignment to experiment
     • Configuration management and Monitoring intertwined

  8. Milestones (Past and Present)
     • Mandate agreed with IT management
     • User Requirements Document
     • Goal / Question / Metrics study
     • Product Survey
     • Prototyping – SNMP / JDMK / NetLogger
     • Analysis
     http://cern.ch/proj-pem

  9. Product Survey
     • PIKT for primitives
     • bonobo (GNOME) for CORBA components
     • JDMK (JMX): Java management tools
     • MAT: Monitoring and Admin Tool
     • PCP: Performance Co-Pilot
     • SNMP
     • NetLogger
     • SCADA
     • Tivoli, Patrol, Unicenter TNG
     • Ranger/SLAC, Vamos/DESY, rls/IN2P3/Lyon
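
Slides 5 and 6 describe a service-level ("global") metric as a combination of lower-level checks (for example "Machines up + FATMEN + LSF") and suggest judging a service by a quorum of machines rather than the full complement. The following is a minimal Python sketch of that idea, not the PEM implementation. The class `ServiceDefinition`, the function `service_available`, the machine names and the 0.9 quorum threshold are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ServiceDefinition:
    """Hypothetical description of one service-level (global) metric."""
    name: str
    required_components: tuple[str, ...]  # e.g. ("FATMEN", "LSF")
    quorum_fraction: float                # assumed share of machines that must be up


def service_available(defn, machines_up, component_up):
    """Return True when a quorum of machines and every required component are up.

    machines_up:  mapping of machine name -> bool (True means the machine is up)
    component_up: mapping of component name -> bool
    """
    if not machines_up:
        return False  # no machines reported, so the service cannot be assured
    up_fraction = sum(machines_up.values()) / len(machines_up)
    components_ok = all(
        component_up.get(c, False) for c in defn.required_components
    )
    return up_fraction >= defn.quorum_fraction and components_ok


# Illustrative data: 95 of 100 machines up, both components healthy.
batch = ServiceDefinition("batch", ("FATMEN", "LSF"), quorum_fraction=0.9)
machines = {f"pc{i:03d}": i >= 5 for i in range(100)}
components = {"FATMEN": True, "LSF": True}
print(service_available(batch, machines, components))
```

With this shape, a handful of failed nodes does not raise a service alarm on its own, while the loss of a required component does, which is one way to read the "living with failures" point on slide 6.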
