
A Service-Based SLA Model

Presentation Transcript


  1. A Service-Based SLA Model HEPIX -- CERN May 6, 2008 Tony Chan -- BNL

  2. Overview
  • Facility operations is a manpower-intensive activity at the RACF.
  • Sub-groups responsible for systems within the facility (tape storage, disk storage, Linux farm, grid computing, network, etc.)
  • Software upgrades
  • Hardware lifecycle management
  • Integrity of facility services
  • User account lifecycle management
  • Cyber-security
  • Experience with RHIC operations for the past 9 years.
  • Support for ATLAS Tier 1 facility operations.

  3. Experience with RHIC Operations
  • 24x7 year-round operations since 2000.
  • Facility systems are classified into 3 categories: non-essential, essential and critical.
  • Response to a system failure depends on the component's classification:
    • Critical components are covered 24x7 year-round. Immediate response is expected from on-call staff.
    • Essential components have built-in redundancy/duplication and are addressed the next business day. Escalated to "critical" if a large number of essential components fail and compromise service availability.
    • Non-essential components are addressed the next business day.
  • Staff provides primary coverage during normal business hours.
  • Operators contact the on-call person during off-hours and weekends.

  4. Experience with RHIC Operations (cont.)
  • Users report problems via the ticket system, pagers and/or phone.
  • Monitoring software is instrumented with an alarm system.
  • The alarm system is connected to selected pagers and cell phones.
  • Limited alarm escalation procedure (i.e., contact the back-up if the primary is not available) during off-hours and weekends.
  • Periodic rotation of the primary and back-up on-call list for each subsystem.
  • Automatic response to alarm conditions in certain cases (e.g., shutdown of the Linux Farm cluster in case of cooling failure).
  • Facility operations at RHIC have worked well over the past 8 years.

  5. Service Level Agreement
  Table 1: Summary of RCF Services and Servers

  Service                  Server              Rank   Comments
  Network to Ring                              1
  Internal Network                             1
  External Network                             1      ITD handles
  RCF firewall                                 1      ITD handles
  HPSS                     rmdsXX              1
  AFS Server               rafsXX              1
  AFS File systems                             1
  NFS Server                                   1
  NFS home directories     rmineXX             1
  CRS Management           rcrsfm, rcras       1      rcrsfm is 1, rcras is 2
  Web server (internet)    www.rhic.bnl.gov    1
  Web server (intranet)    www.rcf.bnl.gov     1
  NFS data disks           rmineXX             1
  Instrumentation                              2
  SAMBA                    rsmb00
  DNS                      rnisXX              2      Should fail over
  NIS                      rnisXX              2      Should fail over
  NTP                      rnisXX              2      Should fail over
  RCF gateways                                 2      Multiple gateway machines
  ADSM backup                                  2
  Wincenter                rnts00              2/3
  CRS Farm                                     2
  LSF                      rlsf00              2
  CAS Farm                                     2
  rftp                                         2
  Oracle                                       2
  Objectivity                                  2
  MySQL                                        2
  Email                                        2/3
  Printers                                     3

  6. A New Operational Model for the RACF
  • RHIC facility operations is a system-based approach.
  • Some systems support more than one service, and some services depend on multiple systems, leaving unclear lines of responsibility.
  • A service-based operational approach is better suited for the distributed computing environment in ATLAS.
  • Tighter integration of monitoring, alarm mechanism and problem tracking; automate where possible.
  • Define a system and service dependency matrix.

  7. Service/System Dependency Matrix
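  The matrix itself appears only as a figure on this slide. As a rough illustration, under invented service and system names (not the actual RACF matrix), a minimal sketch of how such a dependency matrix could be represented so that a failed system maps to the services it affects:

```python
# Hypothetical sketch of a service/system dependency matrix.
# Service and system names are illustrative, not the RACF matrix shown in the talk.
DEPENDENCIES = {
    "batch":        {"condor", "nfs_home", "network"},
    "dcache":       {"disk_pools", "pnfs", "network"},
    "tape_storage": {"hpss", "network"},
    "user_login":   {"gateways", "nfs_home", "nis", "network"},
}

def affected_services(failed_system):
    """Return the services that depend on a failed system."""
    return sorted(s for s, deps in DEPENDENCIES.items() if failed_system in deps)

print(affected_services("nfs_home"))  # -> ['batch', 'user_login']
```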

  8. Monitoring in the new SLA
  • Monitor service and system availability, system performance and facility infrastructure (power, cooling, network).
  • Mixture of open-source and RACF-written components:
    • Nagios
    • Infrastructure
    • Condor
    • RT
  • Choices guided by desired features: historical logs, ease of integration with other software, support from the open-source community, ease of configuration, etc.

  9. Nagios
  • Monitor service availability.
  • Host-based daemons configured to use externally-supplied "plugins" to obtain service status.
  • Host-based alarm response customized (e-mail notification, system reboot, etc.).
  • Connected to the RT ticketing system for alarm logging and escalation.
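  The plugins themselves are not shown in the talk. Below is a minimal sketch of a Nagios-style plugin following the standard plugin contract (exit code 0/1/2/3 plus a one-line status message); the probed URL and response-time thresholds are assumptions for illustration, not the RACF plugin.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Nagios plugin: exit 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
plus a one-line status message. The target URL and thresholds are assumptions."""
import sys
import time
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
URL = "http://www.rcf.bnl.gov/"      # assumed target (intranet web server from Table 1)
WARN_SECS, CRIT_SECS = 2.0, 5.0      # assumed response-time thresholds

def main():
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=CRIT_SECS) as resp:
            resp.read(1024)
            code = resp.getcode()
    except Exception as exc:
        print(f"HTTP CRITICAL - {URL} unreachable ({exc})")
        return CRITICAL
    elapsed = time.time() - start
    if code != 200:
        print(f"HTTP CRITICAL - {URL} returned {code}")
        return CRITICAL
    if elapsed > WARN_SECS:
        print(f"HTTP WARNING - {URL} answered in {elapsed:.2f}s")
        return WARNING
    print(f"HTTP OK - {URL} answered in {elapsed:.2f}s")
    return OK

if __name__ == "__main__":
    sys.exit(main())
```

  Nagios runs such a plugin on a schedule and turns the exit code into the service status that feeds the alarm response described above.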

  10. Nagios (cont.)

  11. Infrastructure (Cooling)
  • The growth of the RACF has put considerable strain on power and cooling.
  • UPS back-up power for RACF equipment.
  • Custom RACF-written script to monitor power and cooling issues.
  • Alarm logging and escalation through the RT ticketing system.
  • Controlled automatic shutdown of the Linux Farm during cooling or power failures.
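  The RACF script is not reproduced in the slides. A minimal sketch of the idea, assuming a hypothetical sensor file, threshold, and shutdown helper:

```python
#!/usr/bin/env python3
"""Sketch of a cooling monitor. The sensor path, trip point, and shutdown command
are illustrative assumptions, not the actual RACF script."""
import subprocess
import time

SENSOR_FILE = "/var/run/machine_room_temp"   # hypothetical sensor reading, degrees C
SHUTDOWN_THRESHOLD_C = 35.0                  # assumed trip point
POLL_SECONDS = 60

def read_temperature():
    with open(SENSOR_FILE) as f:
        return float(f.read().strip())

def shutdown_linux_farm():
    # Hypothetical helper that powers down the batch nodes in a controlled order.
    subprocess.run(["/usr/local/sbin/farm-shutdown", "--graceful"], check=False)

def main():
    while True:
        try:
            temp = read_temperature()
        except OSError:
            temp = None                      # unreadable sensor: treat as an alarm condition
        if temp is None or temp >= SHUTDOWN_THRESHOLD_C:
            # In the real system an RT ticket would also be opened via the AML (slide 15).
            shutdown_linux_farm()
            break
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```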

  12. Infrastructure (Network)
  • Use of Cacti to monitor network traffic and performance.
  • Can be used at the switch or system level.
  • Historical information and logs.
  • To be instrumented with alarms and integrated into alarm logging and escalation.

  13. Condor
  • Condor does not have a native monitoring interface.
  • The RACF created its own web-based monitoring interface.
  • The interface is used by staff for performance tuning.
  • Connected to RT for alarm logging and escalation.
  • Monitoring functions:
    • Throughput
    • Service availability
    • Configuration optimization
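  The RACF web interface itself is not shown in the talk. As a rough sketch of collecting pool state without a native monitoring interface, the snippet below tallies slot states from condor_status output; the use of the '-format' option and the derived utilization figure are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Rough sketch of pool monitoring by scraping condor_status output.
Assumes condor_status's '-format' option printing one slot State per line;
this is not the actual RACF web interface."""
import subprocess
from collections import Counter

def slot_states():
    out = subprocess.run(
        ["condor_status", "-format", "%s\n", "State"],
        capture_output=True, text=True, check=True).stdout
    return Counter(line.strip() for line in out.splitlines() if line.strip())

def main():
    states = slot_states()
    total = sum(states.values())
    busy = states.get("Claimed", 0)
    print("slots: %d  claimed: %d  utilization: %.1f%%"
          % (total, busy, 100.0 * busy / total if total else 0.0))
    for state, count in sorted(states.items()):
        print("  %-12s %d" % (state, count))

if __name__ == "__main__":
    main()
```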

  14. RT
  • Flexible ticketing system.
  • Historical records available.
  • Coupled to monitoring software for alarm logging and escalation.
  • Integrated into the service-based SLA.
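  RT can accept new tickets through its mail gateway. A minimal sketch of opening a ticket that way is below; the queue address and SMTP host are assumptions, and the actual RACF integration goes through the Alarm Management Layer described on the next slide.

```python
#!/usr/bin/env python3
"""Minimal sketch of opening an RT ticket through RT's mail gateway.
Queue address, SMTP host, and sender are assumptions for illustration."""
import smtplib
from email.message import EmailMessage

RT_QUEUE_ADDRESS = "linuxfarm-alarms@rt.example.bnl.gov"   # hypothetical queue address
SMTP_HOST = "localhost"                                    # assumed local mail relay

def open_ticket(subject, body, sender="monitoring@racf.example"):
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = RT_QUEUE_ADDRESS
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    open_ticket("ALARM: condor schedd unreachable on testhost3",
                "NRPE check failed; see monitoring logs.")   # illustrative alarm text
```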

  15. Implementing the new SLA
  • Create an Alarm Management Layer (AML) to interface monitoring to RT.
  • Alarm conditions are configurable via a custom-written rule engine.
  • Clearer lines of responsibility for creating, maintaining and responding to alarms.
  • The AML creates an RT ticket in the appropriate category and keeps track of responses.
  • The AML escalates an alarm when its RT ticket is not addressed within a (configurable) amount of time (see the sketch below).
  • Service Coordinators oversee the management of service alarms.
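  The slides describe the AML only at this level. The sketch below, under assumed helper functions (create_rt_ticket, ticket_is_addressed, page are hypothetical stand-ins for the RT and paging integration), illustrates the intended control flow: open a ticket in the rule's queue, page the primary contact, and escalate to the secondary contact if the ticket is not addressed within the configured response time.

```python
#!/usr/bin/env python3
"""Minimal sketch of the Alarm Management Layer escalation loop.
Only the control flow follows the slide; the helper callables are hypothetical."""
import time

def handle_alarm(alarm, rule, create_rt_ticket, ticket_is_addressed, page):
    """alarm: dict from the monitor; rule: matching entry from the rule file (slide 18)."""
    ticket = create_rt_ticket(queue=rule["queue"],
                              subject=f"ALARM {alarm['host']}/{alarm['service']}",
                              body=alarm["message"])
    page(rule["firstContact"], ticket)
    # work_hours_response_time is in minutes in the example rule file.
    deadline = time.time() + 60 * rule["work_hours_response_time"]
    while time.time() < deadline:
        if ticket_is_addressed(ticket):
            return ticket                      # primary responded in time
        time.sleep(60)
    page(rule["secondContact"], ticket)        # escalate: ticket not addressed in time
    return ticket
```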

  16. How It Works

  17. What data is logged?
  • Host, service, host group, and service group
  • Alarm timestamp
  • NRPE (Nagios) message content
  • Alarm status
  • Notification status
  • RT ticket information (number, queue, owner, priority, etc.)
  • RT ticket status (new, open, resolved)
  • Timestamp of the latest RT update
  • Due date

  18. Example Configuration (rule) File
  [linuxfarm-testrule]
  host: testhost(\d)                  (regular-expression compatible)
  service: condorq, condor
  hostgroup: any
  queue: Test
  after_hours_PageTime: 30
  work_hours_PageTime: 60
  work_hours_response_time: 120       (when the problem needs to be resolved by)
  after_hours_response_time: 720      (when the problem needs to be resolved by)
  auto_up: 1                          (page people)
  down_hosts: 2                       (number of down hosts that constitutes a real problem)
  firstContact: test-person@pager
  secondContact: test-person@bnl.gov
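  One possible way to read and match such a rule file, assuming Python's standard configparser, that the parenthetical annotations above are not present in the actual file, and a hypothetical file path; the RACF rule engine itself is not shown in the talk.

```python
#!/usr/bin/env python3
"""Sketch of matching an alarm against an INI-style rule file like the one above.
The file path is hypothetical and the real RACF rule engine is not shown in the talk."""
import configparser
import re

def load_rules(path):
    cp = configparser.ConfigParser()     # note: configparser lowercases option keys
    cp.read(path)
    return {name: dict(cp[name]) for name in cp.sections()}

def find_rule(rules, host, service):
    """Return the first rule whose host regexp and service list match the alarm."""
    for name, rule in rules.items():
        host_ok = re.fullmatch(rule["host"], host) is not None
        service_ok = service in [s.strip() for s in rule["service"].split(",")]
        if host_ok and service_ok:
            return name, rule
    return None, None

if __name__ == "__main__":
    rules = load_rules("sla-rules.conf")            # hypothetical path
    name, rule = find_rule(rules, "testhost3", "condorq")
    if rule:
        print(f"{name}: page {rule['firstcontact']} after "
              f"{rule['work_hours_pagetime']} minutes")
```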

  19. New Response Mechanism

  20. Summary
  • Well-established procedures from RHIC operational experience.
  • Need a service-based SLA for the distributed computing environment.
  • Create an Alarm Management Layer (AML) to integrate RT with monitoring tools and create clearer lines of responsibility for staff.
  • Some features are already functional.
  • Expect full implementation by late summer 2008.
