The Performance and Exception Monitoring Project Tim Smith IT/PDP
Contents • Requirements • Current systems inadequacies • Views + global metrics • GQM + correlations • Framework • Scalability issues • Project Status • Tools survey • Details from Alessandro… Tim Smith: FNAL workshop
Current systems inadequacies • Independent alarm/monitoring systems • System snapshot requires multiple displays • Independent agents which: monitor local / monitor remote / restart / alarm • Calculate the same information multiple times and use it differently • Host based – no correlations • Hosts complain about the perceived problem, not the real one • Operator only follows precise instructions • Automation! (+ manual Remedy entry) • Separate static config DBs for alarms and machines
Visions of the Future • One tool, many purposes… Views: • End-to-end, user, sysadmin, resource planning • 1000s of PCs per cluster • Living with failures + scalable solutions! • Assure a service; Quorum of machines NOT full complement • High level correlations; impact on a service • Quality of Service measures; Global Metrics
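The "quorum of machines, not full complement" idea can be illustrated with a minimal sketch. The function name `service_assured`, the node names and the quorum value are hypothetical and chosen only for illustration; the slides do not say how PEM evaluates a quorum.

```python
def service_assured(node_up, quorum):
    """Return True if at least `quorum` nodes are up.

    node_up: mapping of node name -> bool (True if the node is up).
    quorum:  minimum number of healthy nodes needed to assure the
             service (an assumed, per-service threshold).
    """
    healthy = sum(1 for up in node_up.values() if up)
    return healthy >= quorum


# Hypothetical cluster with one failed node out of four.
cluster = {"node01": True, "node02": True, "node03": False, "node04": True}

# The service is judged on the quorum, not on every node being up.
print(service_assured(cluster, quorum=3))
```

With this kind of check a single failed node raises no service-level alarm as long as the quorum holds, which is one way of "living with failures" in clusters of thousands of PCs.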
Global Metrics • Honour Service Definitions • “Availability of usable 3000 CUs batch” • Machines up + FATMEN + LSF + lic. Serv. • “Availability of an interactive facility” • ASIS available + low trivial response time • “Job turnaround time expectations” • “Time to service tape request” + Disk/Network bandwidths + CPU/Memory utilisations
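As a rough sketch of how a global metric such as “Availability of usable 3000 CUs batch” might be composed from its ingredients, the following combines machine capacity with the state of the supporting services. The class and function names are hypothetical, reading "lic. Serv." as a licence server is an assumption, and the all-components-must-be-up rule is one plausible interpretation rather than the project's definition.

```python
from dataclasses import dataclass

REQUIRED_CUS = 3000  # threshold taken from the service definition on the slide


@dataclass
class BatchSnapshot:
    """One observation of the batch service (hypothetical structure)."""
    cpu_units_up: int        # total CUs provided by machines currently up
    fatmen_ok: bool          # FATMEN reachable
    lsf_ok: bool             # LSF accepting and dispatching jobs
    licence_server_ok: bool  # assumed meaning of "lic. Serv."


def batch_service_available(s):
    """The service counts as available only if every ingredient is fine."""
    return (s.cpu_units_up >= REQUIRED_CUS
            and s.fatmen_ok and s.lsf_ok and s.licence_server_ok)


def availability(snapshots):
    """Fraction of snapshots in which the batch service was available."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    return sum(batch_service_available(s) for s in snapshots) / len(snapshots)


# Hypothetical history: the third snapshot has LSF down.
history = [
    BatchSnapshot(3200, True, True, True),
    BatchSnapshot(3050, True, True, True),
    BatchSnapshot(3100, True, False, True),
    BatchSnapshot(3000, True, True, True),
]
print(availability(history))
```

The point of the sketch is that the metric is defined on the service as a whole: all machines can be up and the service still unavailable if one supporting component fails.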
Goal / Question / Metric • PDP Services e.g. Monitor quality of Interactive Service • Sufficient nodes? • Low enough load? • Slow to respond to commands? • Contactable via network • Network daemons alive • No nologin • Free ptys
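The Goal / Question / Metric breakdown above can be sketched in code. Here the question "Sufficient nodes?" is answered from the four per-node metrics listed on the slide. Attaching those four metrics to that particular question is an assumption (the flattened slide does not show the mapping), and all names and the threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Per-node metrics named on the slide (field names are hypothetical)."""
    contactable: bool      # contactable via network
    daemons_alive: bool    # network daemons alive
    nologin_present: bool  # True if a nologin file blocks logins
    free_ptys: int         # number of free pseudo-terminals


def node_usable(m):
    """A node is usable for interactive work only if every metric passes."""
    return (m.contactable and m.daemons_alive
            and not m.nologin_present and m.free_ptys > 0)


def sufficient_nodes(nodes, minimum):
    """Question: are there at least `minimum` usable interactive nodes?"""
    return sum(node_usable(m) for m in nodes) >= minimum


# Hypothetical interactive cluster of three nodes.
nodes = [
    NodeMetrics(True, True, False, 12),
    NodeMetrics(True, True, True, 30),   # logins blocked by nologin
    NodeMetrics(True, True, False, 0),   # no free ptys
]
print(sufficient_nodes(nodes, minimum=1))
```

Under these assumptions the goal (quality of the interactive service) is judged through questions, and each question is answered from measurable per-node metrics rather than from individual host alarms.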
Correlations • Examples: • Web server on “SUN cluster” • Interactive Service
Framework Diagram
Scalability • Avoid bottlenecks by allowing for multiplicity of all components • Guiding principle: to avoid the PEM design being constrained by “possible” performance worries
Project Status • Approval as divisional project • Interest in EFF and GRID projects • Documents Produced: • User Requirements • Tools survey • Goal / Question / Metric • Analysis (end April) • Design (end May) • http://cern.ch/proj-pem > Progress > Analysis
Tools Survey • Enterprise / Cluster Management • Tivoli, UnicenterTNG, Patrol, PCP, SCADA, Alinka, SCMS, MosixMON • Public Domain Tools • MAT, GAP, Ranger (SLAC), VAMOS (DESY), rls (IN2P3) • Building blocks • SNMP (Scotty, Advent, MRTG, UCD), JDMK • PIKT, NetLogger, bonobo