A Service-Based SLA Model HEPIX -- CERN May 6, 2008 Tony Chan -- BNL
Overview
• Facility operations is a manpower-intensive activity at the RACF.
  • Sub-groups responsible for systems within the facility (tape storage, disk storage, Linux farm, grid computing, network, etc.)
  • Software upgrades
  • Hardware lifecycle management
  • Integrity of facility services
  • User account lifecycle management
  • Cyber-security
• Experience with RHIC operations for the past 9 years.
• Support for ATLAS Tier 1 facility operations.
Experience with RHIC Operations
• 24x7 year-round operations since 2000.
• Facility systems classified into 3 categories: non-essential, essential and critical.
• Response to system failure depends on component classification:
  • Critical components are covered 24x7 year-round. Immediate response is expected from on-call staff.
  • Essential components have built-in redundancy/duplication and are addressed the next business day. Escalated to "critical" if a large number of essential components fail and compromise service availability.
  • Non-essential components are addressed the next business day.
• Staff provides primary coverage during normal business hours.
• Operators contact the on-call person during off-hours and weekends.
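As a rough illustration of the classification logic above, the sketch below maps component categories to response policies and applies the escalation rule for essential components. It is a minimal sketch only: the escalation threshold is an assumed value for illustration, not an actual RACF setting.

```python
# Sketch of the RHIC component-classification policy described above.
# Category names come from the slide; the escalation threshold is an
# assumed illustrative value, not an actual RACF configuration.

RESPONSE_POLICY = {
    "critical":      "page on-call staff immediately, 24x7",
    "essential":     "address next business day (redundant/duplicated)",
    "non-essential": "address next business day",
}

ESSENTIAL_FAILURE_THRESHOLD = 5  # assumed value for illustration


def response_for(category, failed_essential_count=0):
    """Return the response policy, escalating essential components to
    critical when enough of them fail to compromise service availability."""
    if category == "essential" and failed_essential_count >= ESSENTIAL_FAILURE_THRESHOLD:
        category = "critical"
    return RESPONSE_POLICY[category]


if __name__ == "__main__":
    print(response_for("essential", failed_essential_count=6))
```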
Experience with RHIC Operations (cont.)
• Users report problems via the ticket system, pagers and/or phone.
• Monitoring software instrumented with an alarm system.
• Alarm system connected to selected pagers and cell phones.
• Limited alarm escalation procedure (i.e., contact the back-up if the primary is not available) during off-hours and weekends.
• Periodic rotation of the primary and back-up on-call list for each subsystem.
• Automatic response to alarm conditions in certain cases (e.g., shutdown of the Linux Farm cluster in case of cooling failure).
• Facility operations at RHIC have worked well over the past 8 years.
Service Level Agreement
Table 1: Summary of RCF Services and Servers

Service                  Server              Rank   Comments
Network to Ring                              1
Internal Network                             1
External Network                             1      ITD handles
RCF firewall                                 1      ITD handles
HPSS                     rmdsXX              1
AFS Server               rafsXX              1
AFS File systems                             1
NFS Server                                   1
NFS home directories     rmineXX             1
CRS Management           rcrsfm, rcras       1      rcrsfm is 1, rcras is 2
Web server (internet)    www.rhic.bnl.gov    1
Web server (intranet)    www.rcf.bnl.gov     1
NFS data disks           rmineXX             1
Instrumentation                              2
SAMBA                    rsmb00
DNS                      rnisXX              2      Should fail over
NIS                      rnisXX              2      Should fail over
NTP                      rnisXX              2      Should fail over
RCF gateways                                 2      Multiple gateway machines
ADSM backup                                  2
Wincenter                rnts00              2/3
CRS Farm                                     2
LSF                      rlsf00              2
CAS Farm                                     2
rftp                                         2
Oracle                                       2
Objectivity                                  2
MySQL                                        2
Email                                        2/3
Printers                                     3
A New Operational Model for the RACF
• RHIC facility operations is a system-based approach.
• Some systems support more than one service, and some services depend on multiple systems – unclear lines of responsibility.
• A service-based operational approach is better suited for the distributed computing environment in ATLAS.
• Tighter integration of monitoring, alarm mechanism and problem tracking – automate where possible.
• Define a system and service dependency matrix.
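One simple way to express the system/service dependency matrix mentioned above is a mapping from services to the systems they depend on. The sketch below is illustrative only; the service and system names are assumptions, not the actual RACF inventory.

```python
# Illustrative service -> systems dependency matrix (names are examples,
# not the actual RACF inventory).
DEPENDENCIES = {
    "batch":           ["condor", "nfs_home", "network_internal"],
    "storage":         ["disk_pools", "hpss", "network_internal"],
    "grid_gatekeeper": ["network_external", "batch"],
}


def affected_services(failed_system, matrix=DEPENDENCIES):
    """Return the services whose availability depends on a failed system."""
    return [svc for svc, systems in matrix.items() if failed_system in systems]


print(affected_services("network_internal"))  # -> ['batch', 'storage']
```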
Monitoring in the new SLA
• Monitor service and system availability, system performance and facility infrastructure (power, cooling, network).
• Mixture of open-source and RACF-written components:
  • Nagios
  • Infrastructure
  • Condor
  • RT
• Choices guided by desired features: historical logs, ease of integration with other software, support from the open-source community, ease of configuration, etc.
Nagios
• Monitor service availability.
• Host-based daemons configured to use externally-supplied "plugins" to obtain service status.
• Host-based alarm response customized (e-mail notification, system reboot, etc.).
• Connected to the RT ticketing system for alarm logging and escalation.
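Nagios plugins are simply external programs that report status through their exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and a one-line message on stdout. The sketch below is a minimal, hypothetical plugin that checks a daemon's PID file; the path and staleness threshold are assumptions, not part of the RACF configuration.

```python
#!/usr/bin/env python
"""Minimal Nagios-style plugin sketch: checks whether a daemon PID file
exists and is fresh.  The path and threshold are hypothetical examples."""
import os
import sys
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
PID_FILE = "/var/run/exampled.pid"   # assumed path for illustration


def main():
    try:
        age = time.time() - os.path.getmtime(PID_FILE)
    except OSError:
        print("CRITICAL: %s missing" % PID_FILE)
        return CRITICAL
    if age > 3600:                   # assumed staleness threshold (seconds)
        print("WARNING: %s not updated for %d s" % (PID_FILE, int(age)))
        return WARNING
    print("OK: %s present" % PID_FILE)
    return OK


if __name__ == "__main__":
    sys.exit(main())
```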
Infrastructure (Cooling)
• The growth of the RACF has put considerable strain on power and cooling.
• UPS back-up power for RACF equipment.
• Custom RACF-written script to monitor power and cooling issues.
• Alarm logging and escalation through the RT ticketing system.
• Controlled automatic shutdown of the Linux Farm during cooling or power failures.
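The custom power/cooling script itself is not shown in the slides; the sketch below only illustrates the general idea of polling a room-temperature reading and triggering a controlled Linux Farm shutdown. The sensor command, threshold, and shutdown hook are hypothetical placeholders.

```python
#!/usr/bin/env python
"""Sketch of a cooling monitor: poll a room-temperature reading and trigger
a controlled cluster shutdown above a threshold.  The sensor command,
threshold, and shutdown hook are hypothetical placeholders."""
import subprocess
import time

TEMP_THRESHOLD_C = 35.0   # assumed trip point
POLL_INTERVAL_S = 60


def read_room_temperature():
    # Placeholder: in practice this would query the machine-room
    # environmental monitoring hardware.
    out = subprocess.check_output(["read-room-temp"])    # hypothetical command
    return float(out.decode().strip())


def shutdown_linux_farm():
    # Placeholder for the controlled-shutdown procedure (e.g. draining
    # batch jobs before powering nodes off).
    subprocess.call(["farm-shutdown", "--graceful"])      # hypothetical command


while True:
    if read_room_temperature() > TEMP_THRESHOLD_C:
        shutdown_linux_farm()
        break
    time.sleep(POLL_INTERVAL_S)
```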
Infrastructure (Network)
• Use of Cacti to monitor network traffic and performance.
• Can be used at the switch or system level.
• Historical information and logs.
• To be instrumented with alarms and integrated into the alarm logging and escalation.
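Cacti stores its samples in RRD files, so alarm instrumentation can be layered on top by reading recent data points back out, for example with `rrdtool fetch`. The sketch below is a hedged example of that idea; the RRD path and alarm threshold are assumed placeholders.

```python
#!/usr/bin/env python
"""Sketch: pull recent traffic samples from a Cacti RRD file and flag a
threshold breach.  The RRD path and threshold are assumed placeholders."""
import math
import subprocess

RRD_FILE = "/var/lib/cacti/rra/switch_traffic_in_42.rrd"  # hypothetical path
THRESHOLD_BPS = 8e8                                        # assumed alarm level

# Fetch the averaged data points for the last hour.
out = subprocess.check_output(
    ["rrdtool", "fetch", RRD_FILE, "AVERAGE", "-s", "-1h"])

samples = []
for line in out.decode().splitlines():
    if ":" not in line:
        continue                      # skip the DS-name header line
    for field in line.split(":", 1)[1].split():
        try:
            value = float(field)
        except ValueError:
            continue
        if not math.isnan(value):
            samples.append(value)

if samples and max(samples) > THRESHOLD_BPS:
    print("ALARM: traffic above threshold")
```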
Condor
• Condor does not have a native monitoring interface.
• RACF created its own web-based monitoring interface.
• Interface used by staff for performance tuning.
• Connected to RT for alarm logging and escalation.
• Monitoring functions:
  • Throughput
  • Service availability
  • Configuration optimization
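Condor's command-line tools can be scraped for the throughput and availability numbers such a web interface needs. The sketch below parses the summary line of `condor_q` output; it is an assumption about how a collector might look, not the RACF implementation, and assumes the classic summary format "N jobs; A idle, B running, C held".

```python
#!/usr/bin/env python
"""Sketch: scrape Condor queue totals for a simple monitoring page.
Assumes the classic condor_q summary line of the form
'N jobs; A idle, B running, C held'; adjust the regex for other versions."""
import re
import subprocess


def condor_queue_totals():
    out = subprocess.check_output(["condor_q"]).decode()
    totals = {}
    for state in ("idle", "running", "held"):
        match = re.search(r"(\d+)\s+%s" % state, out)
        totals[state] = int(match.group(1)) if match else 0
    return totals


if __name__ == "__main__":
    print(condor_queue_totals())   # e.g. {'idle': 10, 'running': 100, 'held': 3}
```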
RT
• Flexible ticketing system.
• Historical records available.
• Coupled to monitoring software for alarm logging and escalation.
• Integrated in the service-based SLA.
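Alarm logging into RT can be done programmatically over RT's REST 1.0 interface by POSTing a ticket description to `/REST/1.0/ticket/new`. The sketch below is a minimal example of that approach; the RT URL, queue name, and credentials are placeholders, and it is not claimed to be the interface the RACF actually uses.

```python
#!/usr/bin/env python
"""Sketch: open an RT ticket through the REST 1.0 interface.
URL, queue, and credentials are placeholders; the field layout follows the
standard RT REST 1.0 ticket format."""
import urllib.parse
import urllib.request

RT_URL = "https://rt.example.org/REST/1.0/ticket/new"   # placeholder URL


def create_ticket(queue, subject, text, user, password):
    content = "id: ticket/new\nQueue: %s\nSubject: %s\nText: %s\n" % (
        queue, subject, text)
    data = urllib.parse.urlencode(
        {"user": user, "pass": password, "content": content}).encode()
    with urllib.request.urlopen(RT_URL, data) as resp:
        return resp.read().decode()   # RT replies with the new ticket id


# Example call (placeholder credentials):
# print(create_ticket("Test", "condor down on testhost1", "NRPE alarm", "aml", "secret"))
```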
Implementing the new SLA
• Create an Alarm Management Layer (AML) to interface monitoring to RT.
• Alarm conditions configurable via a custom-written rule engine.
• Clearer lines of responsibility for creating, maintaining and responding to alarms.
• AML creates an RT ticket in the appropriate category and keeps track of responses.
• AML escalates an alarm when the RT ticket is not addressed within a (configurable) amount of time.
• Service Coordinators oversee the management of service alarms.
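A hedged sketch of the escalation step: compare the age of an unaddressed ticket against the rule's configurable response time and page the secondary contact when it is exceeded. The field names mirror the example rule file shown later; the paging hook is a hypothetical placeholder, not the AML's actual code.

```python
"""Sketch of an AML-style escalation check.  Rule fields mirror the example
rule file (work_hours_response_time, secondContact); the paging hook is a
hypothetical placeholder."""
import time


def escalate_if_overdue(ticket, rule, now=None):
    """Page the secondary contact when an unresolved ticket has been open
    longer than the rule's response time (given in minutes)."""
    now = now or time.time()
    if ticket["status"] == "resolved":
        return False
    age_minutes = (now - ticket["created"]) / 60.0
    if age_minutes > rule["work_hours_response_time"]:
        send_page(rule["secondContact"], "overdue: RT #%s" % ticket["id"])
        return True
    return False


def send_page(address, message):
    # Placeholder: in practice this would go through the pager/e-mail gateway.
    print("PAGE %s: %s" % (address, message))
```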
What data is logged?
• Host, service, host group, and service group
• Alarm timestamp
• NRPE (Nagios) message content
• Alarm status
• Notification status
• RT ticket information (number, queue, owner, priority, etc.)
• RT ticket status (new, open, resolved)
• Timestamp of latest RT update
• Due date
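The logged fields above map naturally onto a simple record. The sketch below shows one possible layout; the field names follow the list, but the structure and sample values are assumptions for illustration.

```python
# One possible in-memory layout for a logged alarm record.  Field names
# follow the list above; the structure and sample values are assumptions.
ALARM_RECORD_EXAMPLE = {
    "host": "testhost1",
    "service": "condor",
    "hostgroup": "linuxfarm",
    "servicegroup": "batch",
    "alarm_timestamp": "2008-05-06T03:14:00",
    "nrpe_message": "CRITICAL: condor_master not running",
    "alarm_status": "raised",
    "notification_status": "paged firstContact",
    "rt_ticket": {
        "number": 12345, "queue": "Test", "owner": "nobody", "priority": 10,
        "status": "new", "last_update": "2008-05-06T03:15:02",
        "due": "2008-05-06T05:14:00",
    },
}
```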
Example Configuration (rule) File

[linuxfarm-testrule]
host: testhost(\d)                  (regular-expression compatible)
service: condorq, condor
hostgroup: any
queue: Test
after_hours_PageTime: 30
work_hours_PageTime: 60
work_hours_response_time: 120       (when the problem must be resolved by)
after_hours_response_time: 720      (when the problem must be resolved by)
auto_up: 1                          (page people)
down_hosts: 2                       (number of down hosts to count as a real problem)
firstContact: test-person@pager
secondContact: test-person@bnl.gov
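The rule file is INI-like, so a small loader can turn each section into the dictionary a rule engine would consume. The sketch below uses Python's configparser and is an assumption about how such parsing might look; it assumes the parenthetical annotations shown above are explanatory and not present in a real rule file.

```python
"""Sketch: load AML-style rules from an INI file like the example above.
Assumes the parenthetical annotations are explanatory text, not file content."""
import configparser


def load_rules(path):
    cp = configparser.ConfigParser()
    cp.read(path)
    rules = {}
    for section in cp.sections():
        rule = dict(cp.items(section))     # configparser lower-cases keys
        # Convert the numeric fields used by the paging/escalation logic.
        for key in ("after_hours_PageTime", "work_hours_PageTime",
                    "work_hours_response_time", "after_hours_response_time",
                    "auto_up", "down_hosts"):
            if key.lower() in rule:
                rule[key] = int(rule.pop(key.lower()))
        rules[section] = rule
    return rules


# Example: load_rules("aml_rules.ini")["linuxfarm-testrule"]["down_hosts"] -> 2
```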
Summary
• Well-established procedures from RHIC operational experience.
• Need a service-based SLA for the distributed computing environment.
• Create an Alarm Management Layer (AML) to integrate RT with monitoring tools and create clearer lines of responsibility for staff.
• Some features already functional.
• Expect full implementation by late summer 2008.