Performance and Exception Monitoring (PEM) Project
Tim Smith, CERN/IT
Overview
• Motivation
• Objectives
• Analysis and Design
• Prototyping
• Perspective and Future
Tim Smith: HEPiX @ JLab
Motivation
Diagram labels: Alarm, Recovery action, Monitoring System, Local, Remote, Process killer, Console, Resource planning, Accounting, Security, Inventory
• Independent systems
• No single overview
• Duplicated collection
• Host based: Want Service
• Perceived problems not real
• Scalability
Motivation
Diagram labels: Alarm, Recovery action, Monitoring System, Local, Remote, Console, Resource planning, Accounting, Security, Inventory
• Configuration
• Collection
• Transport
• Repository mgmt
• Display
Objectives
• To provide tools in which the alarms and displays are orientated to the overall service provided:
  • User end-to-end views, Quality of service views
  • Managerial views of resource usage / evolution / failure rates
  • Service provider views, and detailed machine views
• To link the alarms to both the monitoring and corrective actions
• To provide service level metrics
• To provide a uniform monitoring infrastructure
  • Coordinated central repositories + Common logging format
  • Averaging and archiving of logged information
  • Correlations between logged information
  • Multiple input routes; extensible monitoring clients
  • Modular tools; demonstrated scalability
Process
• Analysis
  • User Requirements Document
  • Current Tools survey
    • Enterprise/Cluster mgmt, Pub domain, other labs, building blocks, DAQ, Run Control, Slow Control
  • Goal / Question / Metric formalism
  • System Requirements Document
• Design
  • Interfaces Document
• Prototyping
Goal / Question / Metric
• Ensure quality of Interactive Service
  • Sufficient nodes?
  • Low enough load?
  • Slow to respond to commands?
  • Contactable via network
    • Network daemons alive
    • No nologin
    • Free ptys
    • Connection test from remote node
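The goal, its questions and the metrics that answer them form a small tree, which can be sketched as a data structure. This is only an illustration: the class names (`Goal`, `Question`, `Metric`) are hypothetical, and grouping the last four items of the slide as metrics under "Contactable via network" is an assumption about the slide's layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GqmSketch {
    /** A measurable quantity used to answer a question (hypothetical class). */
    static final class Metric {
        final String name;
        Metric(String name) { this.name = name; }
    }

    /** A question refining a goal, answered by zero or more metrics. */
    static final class Question {
        final String text;
        final List<Metric> metrics;
        Question(String text, Metric... metrics) {
            this.text = text;
            this.metrics = Arrays.asList(metrics);
        }
    }

    /** The top of the tree: what the service provider wants to ensure. */
    static final class Goal {
        final String text;
        final List<Question> questions = new ArrayList<>();
        Goal(String text) { this.text = text; }
        Goal ask(Question q) { questions.add(q); return this; }

        /** Flatten the tree into the metric names a monitoring agent would collect. */
        List<String> metricNames() {
            List<String> names = new ArrayList<>();
            for (Question q : questions) {
                for (Metric m : q.metrics) names.add(m.name);
            }
            return names;
        }
    }

    /** The slide's example; the question-to-metric grouping is assumed. */
    static Goal interactiveService() {
        return new Goal("Ensure quality of Interactive Service")
            .ask(new Question("Sufficient nodes?"))
            .ask(new Question("Low enough load?"))
            .ask(new Question("Slow to respond to commands?"))
            .ask(new Question("Contactable via network?",
                new Metric("network daemons alive"),
                new Metric("no nologin"),
                new Metric("free ptys"),
                new Metric("connection test from remote node")));
    }

    public static void main(String[] args) {
        for (String name : interactiveService().metricNames()) {
            System.out.println(name);
        }
    }
}
```

Questions without metrics stay in the tree on purpose: they mark where the analysis still has to name something measurable.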
PEM Architecture
Diagram: components linked by associations with multiplicities 1 and 1..n. Labels: Monitoring Agent, Monitoring Broker, Measurement Repository, Configuration Repository, Correlation Engine, User Interface, Access Server, Outside PEM
Configuration Repository: Loading the DB
Diagram labels: XML (<TAG> </TAG>), Parser, XML-DBMS, jdbc, RDBMS
• XML Schema: Host, Host type, Metrics, Services
• XML-DBMS: freeware (Tried XSU from Oracle)
• Viewers
• Xerces from Apache
Configuration Repository: Querying the DB
Diagram labels: XML (<TAG> </TAG>), Parser, XML-DBMS, RDBMS, jdbc, XML DB, Configuration Items, Java Objects
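The step from XML configuration items to Java objects can be illustrated with a minimal sketch. It uses the JDK's built-in JAXP DOM parser as a stand-in and involves no database, XML-DBMS or JDBC. The element and attribute names (`host`, `type`, `metric`, `name`) and the `HostConfig` class are hypothetical; the slides name only the concepts Host, Host type, Metrics and Services.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ConfigSketch {
    /** Hypothetical configuration item: one host, its type, and its metrics. */
    static final class HostConfig {
        final String name;
        final String type;
        final List<String> metrics = new ArrayList<>();
        HostConfig(String name, String type) { this.name = name; this.type = type; }
    }

    /** Assumed XML layout, invented for this illustration. */
    static final String SAMPLE =
        "<configuration>"
      + "<host name='node01' type='interactive'>"
      + "<metric name='load'/><metric name='free_ptys'/>"
      + "</host>"
      + "<host name='node02' type='batch'><metric name='load'/></host>"
      + "</configuration>";

    /** Parse the XML text and map each host element to a HostConfig object. */
    static List<HostConfig> parse(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Unparseable configuration", e);
        }
        List<HostConfig> hosts = new ArrayList<>();
        NodeList hostNodes = doc.getElementsByTagName("host");
        for (int i = 0; i < hostNodes.getLength(); i++) {
            Element h = (Element) hostNodes.item(i);
            HostConfig cfg = new HostConfig(h.getAttribute("name"), h.getAttribute("type"));
            NodeList metricNodes = h.getElementsByTagName("metric");
            for (int j = 0; j < metricNodes.getLength(); j++) {
                cfg.metrics.add(((Element) metricNodes.item(j)).getAttribute("name"));
            }
            hosts.add(cfg);
        }
        return hosts;
    }

    public static void main(String[] args) {
        for (HostConfig h : parse(SAMPLE)) {
            System.out.println(h.name + " (" + h.type + "): " + h.metrics);
        }
    }
}
```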
Correlation Engine
• To correlate metrics from the MRS according to configuration in the CRS
• Metric collections: trends + multiple machines
• Samplings: Union for read efficiency from MRS
• Example Java Classes:
  • Correlation coordinator
  • Sampling cache
  • Evaluators
  • Timers
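How a coordinator, a sampling cache and evaluators might fit together can be sketched as follows. This is not the PEM code: every class and method name is hypothetical, the two evaluators (a mean across machines and a rising trend on one machine) are invented to mirror the "trends + multiple machines" bullet, and timers are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CorrelationSketch {
    /** Holds recent samples of one metric per host, read once and shared by all evaluators. */
    static final class SamplingCache {
        private final Map<String, List<Double>> samplesByHost = new LinkedHashMap<>();
        void add(String host, double value) {
            samplesByHost.computeIfAbsent(host, h -> new ArrayList<>()).add(value);
        }
        Map<String, List<Double>> view() { return samplesByHost; }
    }

    /** Decides whether the cached samples amount to an exception. */
    interface Evaluator {
        boolean raisesException(SamplingCache cache);
    }

    /** Multiple machines: fires when the mean of each host's latest sample exceeds a threshold. */
    static final class MeanAcrossHosts implements Evaluator {
        private final double threshold;
        MeanAcrossHosts(double threshold) { this.threshold = threshold; }
        public boolean raisesException(SamplingCache cache) {
            double sum = 0;
            int n = 0;
            for (List<Double> s : cache.view().values()) {
                sum += s.get(s.size() - 1);
                n++;
            }
            return n > 0 && sum / n > threshold;
        }
    }

    /** Trend: fires when one host's cached samples rise strictly from first to last. */
    static final class RisingTrend implements Evaluator {
        private final String host;
        RisingTrend(String host) { this.host = host; }
        public boolean raisesException(SamplingCache cache) {
            List<Double> s = cache.view().get(host);
            if (s == null || s.size() < 2) return false;
            for (int i = 1; i < s.size(); i++) {
                if (s.get(i) <= s.get(i - 1)) return false;
            }
            return true;
        }
    }

    /** Runs every registered evaluator against the same cache and counts the exceptions. */
    static final class Coordinator {
        private final List<Evaluator> evaluators = new ArrayList<>();
        void register(Evaluator e) { evaluators.add(e); }
        int evaluate(SamplingCache cache) {
            int fired = 0;
            for (Evaluator e : evaluators) {
                if (e.raisesException(cache)) fired++;
            }
            return fired;
        }
    }
}
```

Sharing one cache between evaluators reflects the "union for read efficiency" idea: the samples needed by all evaluators are fetched once rather than once per evaluator.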
Events
• Publish / Subscribe: Java RMI
• Interfaces Document
Diagram labels: Monitoring Agent, Monitoring Broker, Measurement Repository, Configuration Repository, Correlation Engine, User Interface, Access Server
Event types on the diagram: metricstream, metricvalue, exception, configuration
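A publish/subscribe contract of this kind can be sketched with RMI-style interfaces. The interface and method names below are hypothetical and are not taken from the PEM Interfaces Document. To keep the sketch self-contained it runs inside a single JVM; making it genuinely remote would additionally require exporting the objects (for example with `UnicastRemoteObject.exportObject`) and binding them in an RMI registry.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class EventSketch {
    /** Hypothetical subscriber callback, declared Remote so it could be exported over RMI. */
    interface MetricListener extends Remote {
        void metricValue(String host, String metric, double value) throws RemoteException;
    }

    /** Hypothetical publisher side, as a broker might offer it to subscribers. */
    interface MetricPublisher extends Remote {
        void subscribe(MetricListener listener) throws RemoteException;
        void publish(String host, String metric, double value) throws RemoteException;
    }

    /** In-process publisher that fans each value out to every subscriber. */
    static final class Broker implements MetricPublisher {
        private final List<MetricListener> listeners = new CopyOnWriteArrayList<>();

        public void subscribe(MetricListener listener) {
            listeners.add(listener);
        }

        public void publish(String host, String metric, double value) {
            for (MetricListener l : listeners) {
                try {
                    l.metricValue(host, metric, value);
                } catch (RemoteException e) {
                    // A subscriber that cannot be reached is dropped.
                    listeners.remove(l);
                }
            }
        }
    }

    /** Example subscriber that records what it receives. */
    static final class RecordingListener implements MetricListener {
        final List<String> received = new ArrayList<>();
        public void metricValue(String host, String metric, double value) {
            received.add(host + ":" + metric + "=" + value);
        }
    }

    public static void main(String[] args) {
        Broker broker = new Broker();
        RecordingListener listener = new RecordingListener();
        broker.subscribe(listener);
        broker.publish("node01", "load", 0.5);
        System.out.println(listener.received);
    }
}
```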
Monitoring Agent/Broker I
• SNMP
  • Extended existing infrastructure
  • Multithreaded broker loading DB
• JMX / JDMK
  • JMX public specification: managed resources
  • Pluggable agents
  • Reported several important bugs
  • Demo at JavaOne conference
    • Linux/NT remote reset
    • Netlogger instrumentation
  • Opened up license negotiations
Monitoring Agent/Broker II
• C: low overhead
• Not yet … DMTF
  • DMI, CIM
Diagram labels: SNMP, Spool, /proc, netlogger, Script, Monitoring Process, Spool Manager, Monitoring Broker
PEM Futures
• Today: CERN CC needs it
  • Prototype for ALICE MDC III in January
• Tomorrow: Tier-0 RC / GRID node need it
  • More complete management solutions
  • Integrate into the Fabric Management WP
  • ‘GRIDification’
• Rapidly evolving technologies
  • Lots of middleware
  • Lots of companies wanting collaboration
  • Still need framework
PEM in Perspective
Diagram labels: Configuration Management, Monitoring, Alarm, Recovery Actions, Inventory, Resource Planning, Security, Application Inst/Update, OS Configuration/Update, OS Installation/Update, Power Mgmt/Remote Reset, Console Mgmt, PC Hardware