
XC 3.0 Systems Administration



  1. XC 3.0 Systems Administration Fred Cassirer January 25, 2006

  2. Agenda • Management Overview • Command broadcast • System services • Configuration and management database • Monitoring the System • Remote console • Firewall • Customization • Network device naming • Troubleshooting

  3. Management Philosophy • Simple is better • Minimize load on application nodes; concentrate management processing on management hubs where possible • Use industry-standard tools and techniques • Provide the necessary “glue”, configuration, and policy for best-in-class cluster management

  4. Components • Database • MySQL server using a schema designed for HP’s HPC environment • Nagios • System monitoring, reporting, management, eventing, and notification • Supermon • Optimized metric aggregation across the entire cluster • Syslog-NG • Next-generation syslog; replaces syslogd functionality and provides tiered aggregation of system logs • Pdsh • Parallel distributed command execution and copy services • Console management • Logs and controls console output

  5. XC Management Stack (diagram) • Monitoring & logging: Nagios, Nagios plug-ins, cmf, syslog, Supermon • Distributed commands: pdsh, pdcp • Installation and configuration: Kickstart, XC Config, SystemImager • XC Database: MySQL • Base: Red Hat compatible distribution • Orange components are open tools configured/adapted for use by XC.

  6. System Operations: Monitoring and logging (diagram) • System files and per-node data flow up through aggregation points (the management hubs) via syslog-ng forwarding and Supermon aggregation to the Nagios management server and the XCDB.

  7. pdsh (1) • Multithreaded remote shell • Used to send a shell command to multiple nodes:
[root@n5 root]# pdsh -a hostname
n5: n5
n4: n4
n3: n3
n1: n1
n2: n2
[root@n5 root]#

  8. pdsh (2) • Options: • -a all nodes • -f # sets the maximum number of simultaneous commands • -w nodelist sets the list of target nodes • -x nodelist excludes a list of nodes • “nodelist” uses standard node list syntax such as: n[1-50,75,80-1044]
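For example, the options can be combined in one invocation (the node names are illustrative):
pdsh -w n[1-50,75] -x n[40-45] -f 32 uptime
This runs uptime on n1-n50 and n75, skips n40-n45, and keeps at most 32 commands in flight at once.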

  9. cexec • A shell script that invokes pdsh to perform commands on a group of hosts • The host group must first be defined with the hostgroup command:
# hostgroup -c group1
# hostgroup -a n1 group1
n1
# hostgroup -a n2 group1
n2, n1
# cexec -r group1 hostname
n1: n1
n2: n2

  10. pdcp • Copy files to multiple nodes: pdcp -a -n `nodename` /etc/passwd /etc/passwd
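A common pattern is pushing an updated file to every node and verifying the result with pdsh (a sketch; the file chosen is illustrative):
pdcp -a /etc/ntp.conf /etc/ntp.conf
pdsh -a md5sum /etc/ntp.conf
The second command prints one checksum per node, so a node that missed the copy stands out immediately.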

  11. Services in XC cluster (1)

  12. Services in XC cluster (2)

  13. Services in XC cluster (3)

  14. Services in XC cluster (4)

  15. Services in XC cluster (5)

  16. Displaying all services • shownode servers
cmf: n[15-16]
compute_engine: n[10-14,16]
dbserver: n16
dhcp: n16
gather_data: n[10-16]
gmmon: n16
hpasm: n[10-16]
hptc-lm: n16
hptc_cluster_fs: n16
hptc_cluster_fs_client: n[10-16]
httpd: n16
imageserver: n16
iptables: n[10-16]
lkcd: n[10-16]
lsf: n16
mpiic: n16
munge: n[10-16]
nagios: n16
nagios_monitor: n[15-16]
nat: n16
network: n[10-16]
nfs_server: n16
nrpe: n[10-16]
nsca: n16
ntp: n16
pdsh: n[10-16]
pwrmgmtserver: n16
slurm: n16
supermond: n[15-16]
swmlogger: n16
syslogng_forward: n[15-16]

  17. Displaying nodes that provide a given service • shownode servers [service]
shownode servers imageserver
n8
shownode servers ntp
n8
shownode servers lvs
n[6-7]

  18. Displaying the services provided by a given node
shownode services n7
mond lsf network lkcd slurm_compute slurm_controller lvs iptables cmf_client hptc_cluster_fs_client syslogng gather_data pdsh nrpe hpasm slurm_launch
shownode services n8 clients
cmf: n[1-7]
hptc_cluster_fs: n[1-7]
nagios: n[1-7]
nat: n[1-7]
ntp: n[1-8]
supermond: n[1-7]
syslogng_forward: n[1-7]

  19. Displaying the service providers for a given node
shownode services n3 servers
cmf: n8
hptc_cluster_fs: n8
nagios: n8
nat: n[8,10]
ntp: n8
supermond: n8
syslogng_forward: n8

  20. Starting/stopping a service • On one node:
service slurm stop
service slurm restart
• On a group of nodes:
pdsh -w n[1-4] service ntpd stop
pdsh -w n[1-4] service ntpd restart
• On all nodes:
pdsh -a service ntpd stop
pdsh -a service ntpd restart

  21. Adding a new service • /opt/hptc/config/roles_services.ini maps roles to services; each role (compute, login, etc.) has a stanza listing its services:
[roles_to_services]
compute = <<EOT
slurm_compute
mond
nrpe
EOT
external = nat
disk_io = <<EOT
nfs_server
EOT
login = lvs
management_hub = <<EOT
supermond
syslogng_forward
EOT

  22. Adding a service to an existing role • cd /opt/hptc/config • cp roles_services.ini roles_services.ini.ORIG • Add the new service inside the role (between <<EOT and EOT; see the sketch below) • Create the appropriate gconfig and nconfig files • reset_db • /opt/hptc/config/sbin/cluster_prep • /opt/hptc/config/sbin/discover --system --verbose • /opt/hptc/config/sbin/cluster_config
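For illustration, the compute stanza after such an edit might look as follows, where mymond is a hypothetical service name invented for this example (its gconfig and nconfig files must exist before cluster_config is run):
[roles_to_services]
compute = <<EOT
slurm_compute
mond
nrpe
mymond   <-- hypothetical new service, inserted between <<EOT and EOT
EOT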

  23. Creating a new role and adding a new service to it • cd /opt/hptc/config • cp roles_services.ini roles_services.ini.ORIG • Insert the new role stanza • Insert the appropriate services list in the stanza • Create the appropriate gconfig and nconfig files • reset_db • /opt/hptc/config/sbin/cluster_prep • /opt/hptc/config/sbin/discover --system --verbose • /opt/hptc/config/sbin/cluster_config

  24. Configuration and management database • Key to the XC configuration • Keeps track of: which node provides which service, which node receives which service, network interface configuration, enabled/disabled nodes, and other things

  25. Database • MySQL server version 4.0.20 • MySQL text-based client • Perl DBI interface for MySQL • HPC value-added management tools • shownode: displays information on node configuration, statistics, services, and status • managedb: assists in three areas • backs up the entire database • archives supermon log table data • dumps the entire database in human-readable form • reset_db: run if you need to run cluster_config or discover again

  26. Database: using mysql directly (1) • Location of the SQL database: /opt/hptc/database/lib
# mysql -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 55969 to server version: 4.0.20-Max
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> show databases;
+--------------+
| Database     |
+--------------+
| cmdb         |   <-- this is the XC Configuration and Management Database
| hptc_archive |
| mysql        |
| qsnet        |
| test         |
+--------------+

  27. Database: using mysql directly (2)
mysql> use cmdb;
mysql> show tables;
+---------------------------------------+
| Tables_in_cmdb
+---------------------------------------+
| CPUInfoLog
| CPUTotalsLog
| CPUTypeLog
| aveNRunLog
| bTimeLog
| hptc_adminInfo
| hptc_archive
| hptc_cmf_port
| hptc_interface
| hptc_interfaceType
| hptc_interfaceUsageType
| hptc_node
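To inspect one of these tables without guessing at its schema, describe it first and then select a few rows; these are read-only statements, and hptc_node is taken from the listing above:
mysql> describe hptc_node;
mysql> select * from hptc_node limit 5;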

  28. Database HPC value added: shownode
shownode --help
USAGE: /opt/hptc/bin/shownode subcommand [options]
subcommand is one of:
all       list all nodes
metrics   show various statistics
roles     show information about node roles
servers   show which nodes provide services to whom
services  show which services are provided by which nodes
clients   show who the clients of the services are
status    show which nodes are up and which are down
enabled   show which nodes are enabled
disabled  show which nodes are disabled
config    show configuration details for nodes and other hardware
/opt/hptc/bin/shownode subcommand --help for more details

  29. Database HPC value added: shownode (cont)
[root@n9 root]# shownode roles
common: n[1-9]
compute: n[1-9]
disk_io: n9
external: n[8-9]
login: n[8-9]
management_hub: n9
nat_client: n[1-7]
node_management: n9
resource_management: n[8-9]

  30. Database HPC value added: shownode (cont)
shownode enabled
n1 n3 n4 n5 n6 n7 n8

  31. Database HPC value added: shownode (cont)
shownode status
n1 ON
n2 ON
n3 ON
n4 ON
n5 ON
n6 ON
n7 ON

  32. Database HPC value added: shownode (cont)
# shownode config --help
USAGE: /opt/hptc/bin/shownode config [options] { nodes [name ...] | cp_ports [name ...] | switches [name ...] | otherports [name ...] | hostgroups [name ...] | roles | sysparams | node_prefix | golden_client | all }
options:
--help          this text
--indent n      indent each level by n spaces
--labelwidth n  print up to n characters of each label
--perl          output in format understood by perl
--admininfo     print extra information useful for system admins
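For example, to show the configuration of a single node with the extra administrative details (subcommand and option taken from the help text above):
shownode config nodes n1 --admininfo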

  33. Database HPC value added: enablenode and disablenode • The operation manipulates the is_enabled bit in the database • Example: • setnode --disable n4 • setnode --disable n[3,4] • Controls cluster-related commands • Currently not used by most services • Affects startsys(8) and stopsys(8) node lists • Disabled nodes will not be affected by these commands • Use shownode(1) to list: “shownode enabled”, “shownode disabled”
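A typical round trip might look like this; note that setnode --enable is assumed here as the counterpart of the --disable form shown above:
setnode --disable n4
shownode disabled        (n4 should now be listed)
setnode --enable n4      (assumed counterpart flag)
shownode enabled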

  34. Database HPC value added: backup/restore cmdb • managedb backup • Creates /opt/hptc/database/dbbackup-YYYYMMDDhhmmss.sql • Example:
date
Tue Jul 26 15:54:54 CEST 2005
managedb backup
ls /opt/hptc/database/*.sql
/opt/hptc/database/dbbackup-20050726155458.sql
• Restore sequence:
ls /opt/hptc/database/*.sql
/opt/hptc/database/dbbackup-20050726155458.sql
managedb restore /opt/hptc/database/dbbackup-20050726155458.sql
or directly with mysql:
mysql -u root -p cmdb < dbbackup-YYYYMMDDhhmmss.sql
• Restoration removes all old data and replaces it with the data from the backup file.
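Backups can be automated with cron; a minimal sketch, assuming managedb is installed in /opt/hptc/bin alongside shownode:
0 2 * * * /opt/hptc/bin/managedb backup
This would write a timestamped dbbackup-*.sql file to /opt/hptc/database every night at 02:00.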

  35. Database HPC value added: archiving metrics data • Removes old data from the metrics tables and stores it in the hptc_archive database • The tables to archive are those in the cmdb database whose names end in “Log”:
mysql> show tables;
….
….
| lastSensorLog
| memInfoLog
| netInfoLog
| pagingLog
| processLog
| sensorLog
| swapInfoLog
| switchLog
| timeLog

  36. Database HPC value added: archiving metrics data • Example 1: archiving metrics data older than four days: managedb archive --archive ./artfile_Tuesday 4d • Example 2: restoring metrics data: managedb restore --archive ./artfile_Tuesday

  37. Database HPC value added: purging metrics data • Metrics data in the cmdb can be purged without archiving • Example: purging data older than 2 weeks: managedb purge 2w
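For routine housekeeping, a weekly archive pass keeps the cmdb lean while preserving history (the archive file name here is illustrative):
managedb archive --archive /var/tmp/metrics_$(date +%Y%m%d) 1w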

  38. Monitoring the system • Health and Metrics • Nagios • Supermon • Syslog-ng • Console • CMF

  39. Management roles • XC management services are part of three XC roles • Management Server • Denotes the node where the nagios master resides • Only *ONE* node may be assigned the management_server role • For V3.0 the management_server role must be assigned to the headnode (nh) • Management Hub • Denotes the nodes that provide management aggregation services • Assign this role to as many nodes as you like; cluster_config will by default assign one management_hub for every 150-200 compute nodes • The Management Server node is also a management hub • Supermon and CMF services are associated with a Management Hub • Common • Every node that is monitored has an XC “nrpe” service defined • Every node that is monitored has an XC “mond” service defined • Console Services • By default, the console service role is assigned identically to Management Hubs; for systems where physical console connectivity is limited, assign this role to the nodes that have direct console access

  40. Management roles (diagram) • Mgmt Server – Nagios master and headnode, 1 per system • Mgmt Hub – Distributed management services such as the nagios monitor, supermon, and the syslog-ng aggregator • Console Services – Service nodes connected to the console network; utilized by the Mgmt Server/Hubs to proxy requests that require console connectivity, such as sensor collection, firmware logs (SEL/IML), and console services (cmf)

  41. Supermon hierarchy (diagram) • Root supermond – connects to all supermonds and manages a set of subnodes • supermon – manages requests for its mond children • mond – runs on every node (even those with supermond), collects metrics via the kernel and the “monhole” (if configured), and reports up to its parent supermon • Nagios/Nagios monitor – runs the check_metrics plug-in periodically to trigger supermon data collection and storage into the database • The “monhole” can be configured to pass arbitrary metric data for aggregation

  42. Nagios • Version 2.0-b6 • No changes to the Nagios engine for Perseus • Value add is in the dynamic configuration and template model used by XC • Easily extendable and configurable by the end user • Dynamically adapts to the cluster configuration based on XC service assignments (roles) • Distributed monitoring supported in V3.0! Supports failover of active plugins from nagios monitors • Graphically monitors nodes, environmentals, processes, power state, slurm, jobs, LSF, syslog, load average, etc. • Monitors ProCurve switches via SNMP; checks for slow Ethernet links (link speed < 1 Gb/s) • Initiates supermon metric gathering

  43. Nagios • Nagios gets all of its data from “plug-ins”: essentially shell/Perl/C programs that • return 0/1/2 for OK, Warning, Critical • produce a status string on stdout • Runs as the user “nagios” • nagiostats command is new for V3.0 • The Nagios master tracks distributed nagios monitors and reports if they fail to provide status • User-customizable threshold values for all nodes/classes via nagios_vars.ini
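A minimal plug-in obeying these conventions might look like the sketch below; it is not one of the XC-supplied plug-ins, and the check and thresholds are illustrative:
#!/bin/sh
# check_tmp_usage - report /tmp usage following Nagios plug-in conventions:
# print one status line on stdout, exit 0 for OK, 1 for Warning, 2 for Critical.
USED=$(df -P /tmp | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$USED" -ge 90 ]; then
    echo "CRITICAL - /tmp is ${USED}% full"; exit 2
elif [ "$USED" -ge 75 ]; then
    echo "WARNING - /tmp is ${USED}% full"; exit 1
else
    echo "OK - /tmp is ${USED}% full"; exit 0
fi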

  44. Nagios (diagram) • Runs as user “nagios” • The Nagios daemon does almost all of its work on management nodes, offloading management from nagios clients • Serves as the scheduler for supermon data collection • LSF failover functionality is invoked by a nagios plug-in • Dynamically creates nagios configuration files based on cluster topology and role assignment • The nagios master daemon on nh receives results forwarded by the Nagios Monitors • Nagios clients run mond for metrics gathering • Root keys are synchronized hourly

  45. Navigation

  46. Nagios Tactical Overview

  47. Hostgroup summary (Roles)

  48. Nagios master services

  49. Management Server and Hub services • Apache HTTPS Server – Apache webserver status • Configuration Monitor – Updates node configuration • LSF Failover Monitor - Watch LSF and failover if needed • Monitor Power – Watch power nodes, report overall status • Nagios Monitor – Watch nagios master and monitor • Resource Monitor – Report squeue info • Root key synchronization – Report on ssh keys • Slurm Monitor – Report sinfo for each node • Supermon Metrics Monitor – Gather predetermined metrics • Syslog Alert Monitor – Watch for patterns in syslogalertrules file and report for each node • Load average - Provide per-node load average information and alerts • Environment - Provide per-node sensor reporting and alerts • Node information - Per node process, user, and disk statistics

  50. Plug-ins that report data for individual nodes or switches • System Event Log - Monitor hardware event log for iLO and IPMI based systems and alert based on patterns in selRules file. • SFS monitoring - Monitor an attached SFS appliance • Procurve switch monitor - Gather switch status and metrics via SNMP
