Implementing a Model for Service Level Management: A Practical Approach to Integrating Performance Tools

Steve Lewis, J.D. Edwards & Company

Topics: • Why manage/monitor your infrastructure? • What tools must be in place? • Managing diverse systems, networks, and applications. • Key design decisions. • Implementation experiences, examples, and lessons.

Why do we need tools? Every IT organization wants to be known for its proactive monitoring and automated Service Level Management.

What is the Cost to Manage? • Hardware, Software, & Maintenance fees. • Facilities – building, cooling, electricity, access control, disaster recovery sites. • People – design, operations, support. But what about . . . • Cost avoidance – no addition to bottom line. • Do these costs offset the cost of not managing? (Under- or Over-utilization, lost productivity, “waste”)

What can we gain? • If you know what resources you have used in the past, you can better plan for the future. • Reactive mode vs. proactive mode: operating from a pager vs. identifying potential problems before they happen. • Quick notification gives a head start to the technical team that repairs the service. • Knowledge base = better history on failures; a training tool for new team members.

How to Move in the Right Direction • Break down the task into sequential steps. • Build Service Level Management step-by-step from the bottom up.

The Layers of Service Level Mgmt • Automated functionality built in layers according to their dependencies.

#1 – Technical Infrastructure • In order for a specific service to be available, all of the technical components must exist: • Network Devices & Communication Links • Server Hardware & Operating Systems • Application Software & Processes • Each device must gather statistics on itself (using SNMP, WMI, syslog, flat files, etc.) • This is where most $$$ and people are allocated!
[Layer diagram: Network, System, and Application Infrastructure]

#2 – Fault Management Tools • A defined SERVICE may not be available if a network, system, or application component experiences a failure or poor performance. • “Root Cause Correlation” identifies the exact point of failure in the event chain.
[Layer diagram, top to bottom: Fault Management Tools; Network, System, and Application Infrastructure]
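One way to picture root cause correlation is as a walk over a dependency map: among all components reporting a failure, keep only those with no failed component upstream. The sketch below assumes a hand-written map with hypothetical component names; it illustrates the idea only and is not how any particular product implements correlation.

```python
# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON = {
    "order-entry-app": ["app-server"],
    "app-server": ["core-switch"],
    "db-server": ["core-switch"],
    "core-switch": [],
}


def upstream(component, depends_on):
    """Return every component that `component` depends on, directly or not."""
    seen = set()
    stack = list(depends_on.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def root_causes(failed, depends_on=DEPENDS_ON):
    """Keep only failed components that have no failed component upstream."""
    failed = set(failed)
    return sorted(c for c in failed if not (upstream(c, depends_on) & failed))
```

With this map, a switch failure that also takes down the application is reported once, against the switch, instead of as several unrelated events.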

#3 – Information Management Tools • This should include tightly integrated tools: • Problem Management • Change Management • Asset Management
[Layer diagram, top to bottom: Information Management Tools; Fault Management Tools; Network, System, and Application Infrastructure]

Problem Management Tools If an infrastructure event is detected by the Fault Management tools, it should be reported to the Problem Management System: • Documenting (trouble ticket & knowledge base) • Tracking (status update & workflow) • Escalating (service response) • Notifying (pager, email, phone, PA system) • Generating reports (mean time between failures)
[Diagram: Problem Mgmt; Change Mgmt; Asset Mgmt]
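The mean time between failures report can be derived from trouble-ticket data. A minimal sketch, assuming the ticket system can export the outage durations recorded during a reporting period (the function name and inputs are illustrative, not taken from any specific tool):

```python
def mtbf_hours(period_hours, outage_hours):
    """Mean time between failures: operating time divided by failure count.

    period_hours -- length of the reporting period
    outage_hours -- one entry per failure, giving its duration in hours
    Returns None when no failures were recorded.
    """
    failures = len(outage_hours)
    if failures == 0:
        return None
    uptime = period_hours - sum(outage_hours)
    return uptime / failures
```

For a 720-hour month with three outages of 2, 4, and 6 hours, this gives (720 - 12) / 3 = 236 hours.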

Change Management Tools Change Management System: • Schedule & approve changes to the infrastructure. • Track routine maintenance tasks. • The Problem Management tool can check with the Change Management tool to distinguish between “Planned Outages” & unexpected faults. • Notification & reporting are handled differently for planned outages.
[Diagram: Problem Mgmt; Change Mgmt; Asset Mgmt]
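The check between the Problem Management and Change Management tools can be sketched as a lookup of approved change windows. The window list, component names, and dates below are hypothetical; a real integration would query the change system rather than a hard-coded list.

```python
from datetime import datetime

# Hypothetical approved change windows: (component, start, end).
CHANGE_WINDOWS = [
    ("db-server", datetime(2003, 5, 3, 22, 0), datetime(2003, 5, 4, 2, 0)),
]


def classify_event(component, when, windows=CHANGE_WINDOWS):
    """Label an outage as planned if it falls inside an approved window."""
    for comp, start, end in windows:
        if comp == component and start <= when <= end:
            return "planned outage"
    return "unexpected fault"
```

The returned label could then drive the different notification and reporting paths for planned outages.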

Asset Management Tools The Asset Management System holds vital information on each technical component: • Vendor & maintenance plan • Serial number & location • Lease expiration & asset owner • Responsible support team by shift, so the appropriate group is notified of an event.
[Diagram: Problem Mgmt; Change Mgmt; Asset Mgmt]
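Routing a notification to the responsible team by shift amounts to a lookup keyed on the asset and the hour of the event. A small sketch, assuming three eight-hour shifts and made-up asset and team names:

```python
# Hypothetical asset records: each shift is (start_hour, end_hour, team).
ASSETS = {
    "core-switch": {
        "shifts": [
            (0, 8, "network-night"),
            (8, 16, "network-day"),
            (16, 24, "network-evening"),
        ],
    },
}


def team_on_call(asset, hour, assets=ASSETS):
    """Return the support team responsible for `asset` at `hour` (0-23)."""
    for start, end, team in assets[asset]["shifts"]:
        if start <= hour < end:
            return team
    raise ValueError(f"no shift covers hour {hour} for {asset}")
```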

#4 – Performance Management Tools • Performance/Capacity Planning statistics. • Resource utilization thresholds that trigger proactive notification when exceeded.
[Layer diagram, top to bottom: Performance Management Tools; Information Management Tools; Fault Management Tools; Network, System, and Application Infrastructure]
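Threshold-based notification can be sketched as a comparison of the latest samples against configured limits. The metric names and limits here are assumptions chosen for illustration:

```python
def check_thresholds(samples, thresholds):
    """Return (metric, value, limit) for every sample above its limit.

    samples    -- latest reading per metric, e.g. {"cpu_pct": 92.0}
    thresholds -- upper limit per metric; metrics without a limit are ignored
    """
    return [
        (metric, value, thresholds[metric])
        for metric, value in samples.items()
        if metric in thresholds and value > thresholds[metric]
    ]
```

Each returned tuple is a candidate exception event to hand to the notification path.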

#5 – Service Level Policies • Technical components grouped into services. • “Customer view” transaction monitoring.
[Layer diagram, top to bottom: Service Level Policies; Performance Management Tools; Information Management Tools; Fault Management Tools; Network, System, and Application Infrastructure]

#5 – Service Level Policies (continued) Two ways to measure a service: • Monitor each component in the “service chain” – BUT how do you synchronize the data from different monitoring tools? • Generate synthetic transactions from an “end user” viewpoint – BUT how do you isolate troublesome components?
[Layer diagram: Service Level Policies]
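One possible answer to the synchronization question is to align every tool's samples to a common time bucket and call the service up in a bucket only when every component in the chain was up. The sketch below assumes each tool can export (epoch seconds, up/down) samples; the five-minute bucket width and feed names are arbitrary choices.

```python
def bucket(ts, width=300):
    """Round an epoch timestamp down to the start of its bucket."""
    return ts - ts % width


def service_up_by_interval(feeds, width=300):
    """Combine per-component samples into one service status per bucket.

    feeds -- {component: [(epoch_seconds, is_up), ...]}
    Only buckets covered by every component are reported.
    """
    if not feeds:
        return {}
    per_component = {}
    for comp, samples in feeds.items():
        states = {}
        for ts, is_up in samples:
            b = bucket(ts, width)
            # A component is up in a bucket only if all its samples were up.
            states[b] = states.get(b, True) and is_up
        per_component[comp] = states
    common = set.intersection(*(set(s) for s in per_component.values()))
    return {
        b: all(states[b] for states in per_component.values())
        for b in sorted(common)
    }
```

Buckets that are missing from any feed are dropped rather than guessed, which keeps gaps in one tool from being counted as uptime or downtime.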

#6 – Service Level Management • Automated reporting of SLA compliance.
[Layer diagram, top to bottom: Service Level Management; Service Level Policies; Performance Management Tools; Information Management Tools; Fault Management Tools; Network, System, and Application Infrastructure]
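Automated SLA compliance reporting reduces, at its simplest, to comparing measured availability against the agreed target. A minimal sketch, assuming downtime has already been totalled per service for the reporting period (field names and the example target are illustrative):

```python
def sla_report(service, period_minutes, downtime_minutes, target_pct):
    """Compare measured availability for a period with the SLA target."""
    availability = 100.0 * (period_minutes - downtime_minutes) / period_minutes
    return {
        "service": service,
        "availability_pct": round(availability, 3),
        "target_pct": target_pct,
        "compliant": availability >= target_pct,
    }
```

For a 30-day month (43,200 minutes), 30 minutes of downtime meets a 99.9% target, while 60 minutes does not.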

#6 – Service Level Management (continued) Service Level Management is not a unique, isolated function. It is the culmination of ALL the functions involved in providing the service. (Rick Sturm)

Difficulty of Service Level Management • Collecting the appropriate metrics. • Automating the correlation of those metrics.
[Diagram labels: Technology View; Customer View]

Design Decision #1 Reality: • The technical infrastructure is relatively dynamic, constantly changing, with little centralized control. Decision: • Choose “Self-Configuring” Tools that detect and adjust to change automatically.

Design Decision #2 Reality: • Cannot afford the intensive administrative overhead required to maintain most tools. Decision: • Choose “Zero-Admin” tools that automate or minimize administrative tasks.

Design Decision #3 Reality: • Extensive software distribution, version control, and cost issues with agent-based tools. Decision: • Choose “Agent-Less” tools for common metrics (collect with SNMP, WMI, syslog).

Design Decision #4 Reality: • Need a consolidated “single-pane-of-glass” view of performance and service level statistics. Decision: • Choose “Web-Based” tools that offer security & customization per user.

Design Decision #5 Decision: • Centralize to provide a single control point for security, event monitoring, administration, and report generation.

Constructing The System (part 1) • Fault Management Layer: HP OpenView NNM • Adjusts to network configuration changes. • Provides up/down status on connected devices. • Does “root cause” correlation for events. • Ability to define metrics for SNMP collection and database storage. • Serves as SNMP trap destination for processing application-level events.

Constructing The System (part 2) • Fault Management Layer: Magnum Technologies: COORDINATOR • Provides “root cause” correlation for events. • Updates its correlation engine when the OpenView topology changes. • Contains an External Command Processor for parsing event messages, automatically opening trouble tickets, and sending notifications.
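The parse, ticket, notify flow described for the External Command Processor can be pictured with a generic handler. Everything in this sketch is an assumption made for illustration: the pipe-delimited message format, the severity names, and the `open_ticket` and `notify` callables are invented and do not represent COORDINATOR's actual message format or interfaces.

```python
def parse_event(line):
    """Split a hypothetical 'SEVERITY | node | text' event message.

    Raises ValueError if the line has fewer than three fields.
    """
    severity, node, text = (part.strip() for part in line.split("|", 2))
    return {"severity": severity.upper(), "node": node, "text": text}


def handle_event(line, open_ticket, notify):
    """Open a ticket and send a notification for serious events only."""
    event = parse_event(line)
    if event["severity"] in ("CRITICAL", "MAJOR"):
        ticket_id = open_ticket(event)  # caller-supplied ticketing hook
        notify(f"[{ticket_id}] {event['node']}: {event['text']}")
        return ticket_id
    return None
```

Passing the ticketing and notification steps in as callables keeps the parsing logic independent of whichever problem management system sits behind it.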

Constructing The System (part 3) • Performance Management Layer: Magnum Technologies: CAPTREND • Contains internal SNMP & WMI polling engines to collect basic performance metrics. • Stores data for ad hoc reporting; generates several canned graphical reports. • Ability to create performance thresholds that generate exception events for notification.

Constructing The System (part 4) • Performance Management Layer: BMC Software: Patrol • Monitors application metrics at a detailed level. • Ability to generate SNMP traps for application events, which are sent to OpenView and COORDINATOR for processing.

Constructing The System (part 5) • Performance Management Layer: Empirix: eMonitor & OneSight • Generates web-based customer-oriented transactions (including HTTPS authentication). • Ability to generate SNMP traps for response time threshold violations, which are sent to OpenView and COORDINATOR for processing.
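The idea behind a response time threshold on a synthetic transaction can be shown without any vendor tooling: time a scripted transaction and flag it when it fails or runs too long. In this sketch the transaction is any callable supplied by the caller, and the function name and result fields are hypothetical; it does not reflect how eMonitor or OneSight work internally.

```python
import time


def timed_transaction(transaction, threshold_seconds, clock=time.perf_counter):
    """Run a scripted transaction and report whether it breached the threshold.

    transaction -- zero-argument callable that performs the user-style request
    clock       -- injectable timer, which makes the function easy to test
    """
    start = clock()
    ok = True
    try:
        transaction()
    except Exception:
        ok = False  # a failed transaction always counts as a violation
    elapsed = clock() - start
    return {
        "ok": ok,
        "elapsed": elapsed,
        "violation": (not ok) or elapsed > threshold_seconds,
    }
```

A result with `violation` set is the point at which a trap or other event would be raised for downstream processing.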

Still-to-be-Accomplished • Integration of tools at the Information Management layer. • Automated reporting from existing agent-based tools at the Performance Management layer. • Tools to correlate technology components and define policies at the Service Level Policy layer.

Lessons Learned • It always costs more MONEY and takes more TIME than expected. • It is always more difficult than expected to INTEGRATE diverse tools. • Key Success Factors: • Management Commitment • Business Process Improvement • Customer Care Strategy • Organizational Flexibility
