System Health Monitoring A best practice framework to reduce disruption by proactive systems health monitoring Clarence T. Moore, Jr. Eli Lilly and Company - Sr. Associate IT
Purpose and Agenda Purpose • Share best practice framework for system health monitoring along with a specific use case for system health monitoring of critical systems at Lilly. Agenda • My Background • Framing • What is System Health Monitoring • Value Proposition • Components of System Health Monitoring • Lilly Approach to System Health Monitoring • Monitoring Technologies • Summary
My Background • Work Experiences • Eli Lilly and Company – Since 2013 • UCB - 2010-2013 • Education • BS in Computer Science, North Carolina A&T State University (2009) • ITIL Foundation
Framing What don’t your customers want?
Framing • The view of the IT organization has shifted. • From back office to critical business partner: • External TPO partnership/collaboration • Customer engagement • Critical business support • IT organizations now have a greater responsibility to ensure that the technologies used to enable our partners are stable, reliable, and available. • To meet this objective, IT organizations should adopt a robust systems health monitoring framework. This framework provides guidance, definition, and clarity related to system health for applications.
Systems Thinking • Input • Real-time system performance data • Process • Robust System Health Monitoring • Feedback • Analyzing and validating data obtained • Adjusting monitoring thresholds • Output • Reducing outages • Minimizing Performance Degradation Source for image: http://www.oyova.com/systems-thinking-for-your-website/
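The feedback step above (analyzing validated data, then adjusting monitoring thresholds) could be sketched roughly as follows. This is only an illustration: the function name `recalibrate_threshold`, the "mean plus k standard deviations" rule, and the default `k = 3` are assumptions, not something the presentation prescribes.

```python
from statistics import mean, pstdev
from typing import Sequence


def recalibrate_threshold(samples: Sequence[float], k: float = 3.0) -> float:
    """Return a new alert threshold from validated performance samples.

    Assumption: the threshold is the baseline mean plus k population
    standard deviations. Other rules (percentiles, fixed SLO values)
    are equally plausible.
    """
    if not samples:
        raise ValueError("need at least one validated sample")
    return mean(samples) + k * pstdev(samples)


# Hypothetical response times (ms) collected after validation.
recent_response_ms = [180.0, 210.0, 195.0, 205.0, 190.0]
new_threshold = recalibrate_threshold(recent_response_ms)
print(f"New response-time alert threshold: {new_threshold:.1f} ms")
```

Re-running a calculation like this on a schedule is one way to keep thresholds aligned with how the system actually behaves, so that alerts neither fire constantly nor miss real degradation.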
Framing What do your customers want? • 24 x 7 System Availability • Proactive Engagement • Immediate Response to Outages • Preventative Maintenance and Risk Mitigation • Minimal business disruption due to IT
System Health Synergies w/ ITIL Source for image: http://www.elitser-me.com/consulting/itil-process-and-procedure-implementations/
System Health Monitoring • System Health is an assessment of the application’s ability to deliver consistent business outcomes in an efficient and effective manner. • It should be a representation of the key dimensions for an application. • System Health Dimensions • Criticality of business processes supported • Consists of activities and processes essential to ensuring continuity of business-critical activities • Necessity for service controls • Associated with metrics that determine whether the application’s service management is in control • Health and reliance on infrastructure components • Represented by the availability, reliability, and performance of the underlying components of the system.
Value proposition • Monitoring improves the availability, reliability and performance of IT systems while reducing the operational and capital costs associated with delivering them. This has a direct impact on the business processes dependent upon them. • Appropriate monitoring should render the following outcomes: • Reduce number of incidents • Reduce duration and frequency of incidents • Reduce cost of monitoring • Reduce operational and capital expenditures • Reduce deviations • Improve stability • Reduce impact of unplanned outages • Proactively identify problems
Key Terms • System Status • Up – a system is said to be up when the system is available and performing according to expectations/design • Typically defined as SLA or SLO • At-Risk – a system is said to be at risk when it is approaching a limiting factor that could negatively impact the system • Degraded – a system is said to be degraded when the system is available but not performing according to expectations/design • Down – a system is said to be down when the system is either unavailable or is performing so poorly that it cannot be used • Availability – a measure of the percentage of time a system is capable of performing a specified function • Reliability – a measure of the number of system-impacting incidents over a specified time period • Threshold – the value of a metric at which an alert is generated or a management action is triggered
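As a rough illustration of how these terms relate, the sketch below maps a few measurements to the four statuses and computes availability as a percentage of time. The metric choices (response time, disk usage) and every threshold value are hypothetical and would normally come from the system's SLA or SLO.

```python
from enum import Enum


class Status(Enum):
    UP = "Up"
    AT_RISK = "At-Risk"
    DEGRADED = "Degraded"
    DOWN = "Down"


def classify(available: bool, response_ms: float, disk_used_pct: float,
             slo_ms: float = 500.0, unusable_ms: float = 5000.0,
             disk_warn_pct: float = 85.0) -> Status:
    """Map raw measurements to a status. All thresholds are assumed values."""
    if not available or response_ms >= unusable_ms:
        return Status.DOWN       # unavailable, or too slow to be usable
    if response_ms > slo_ms:
        return Status.DEGRADED   # available, but not performing to expectations
    if disk_used_pct >= disk_warn_pct:
        return Status.AT_RISK    # performing, but approaching a limiting factor
    return Status.UP


def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Percentage of the period in which the system could perform its function."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


print(classify(True, 200.0, 90.0).value)   # disk usage is above the warning level
print(availability_pct(43200.0, 43.2))     # 30-day month with 43.2 minutes of downtime
```

Checking Down first, then Degraded, then At-Risk means the most severe condition always wins when several thresholds are breached at once.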
System Health Monitoring Scenarios Source for images: www.google.com
Scenario 1 A server in the datacenter fails • What happens next? • Notify the end users. • Since this occurred at the hardware level, that team should be working with all teams that are affected to give updates until the issue is resolved. • Once the issue has been resolved the end users need to be notified that service has been restored.
Scenario 2 Apache Tomcat encounters a fatal error • What happens next? • Notify the end users. • The team responsible for middleware should be working to fix the issue while keeping all affected teams aware of their progress. • Once the issue has been resolved, the end users need to be notified that service has been restored.
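Both scenarios follow the same pattern: notify end users when the outage begins, and notify them again when service is restored. A minimal sketch of that pattern is shown below. The class name `OutageNotifier`, the message wording, and the system name are hypothetical, and `send` stands in for whatever channel (email, chat, status page) an organization actually uses.

```python
from typing import Callable, List


class OutageNotifier:
    """Sends one message when a system goes down and one when it recovers."""

    def __init__(self, system: str, send: Callable[[str], None]) -> None:
        self.system = system
        self.send = send
        self.down = False  # last known state

    def observe(self, is_up: bool) -> None:
        """Record one health-check result and notify only on a state change."""
        if not is_up and not self.down:
            self.down = True
            self.send(f"{self.system}: service disruption detected; "
                      "the support team is working on it.")
        elif is_up and self.down:
            self.down = False
            self.send(f"{self.system}: service has been restored.")


# Hypothetical run: the middleware fails for two checks, then recovers.
messages: List[str] = []
notifier = OutageNotifier("Example web application", messages.append)
for check_result in [True, False, False, True, True]:
    notifier.observe(check_result)
print(messages)
```

Notifying only on state changes avoids flooding end users with a message for every failed health check while the responsible team works the issue.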
Lilly approach to system health monitoring • In 2011, our R&D component prioritized the need for sustainability as a key deliverable in our quest to deliver on an aggressive pipeline. • Subsequently, a Six Sigma project commenced to address sustainability with regard to more real-time system health data related to our critical applications. • This project included a cross-functional team made up of R&D IT, infrastructure, and shared service IT. • The outcome of this project was a more streamlined process to request, fulfill, and enable robust system health monitoring. This team also delivered prioritized dimensions to provide enterprise guidance for implementing system health monitoring. • The team leveraged the existing enterprise monitoring platform, Syntervision’s Oasis, to provide the necessary data visualization, thresholds, and event management associated with our critical systems.
Command Center Concept [Diagram labels: Application Dashboard; Infrastructure support; Database; Command center, 24×7 monitoring; Operations Managers; App. Command center Team, 24×7 monitoring; Threshold breach alerts; (1) Escalation point for command center to reach out; Service Operations]
Command Center Value Proposition • Reduce costs • Enhance cycle time • Reduce business disruption • Improve system availability • Strengthen orchestration across IT • Streamline communication
Key Components of the Command Center Concept • Create robust dashboards • Enables real-time alerts to IT support team(s), thus ensuring proactive resolution of potential downtimes and expedited reporting of disruptions. • Establish physical command center • Monitors a set of systems in a portfolio leveraging dashboard views. • Synchronization between command center teams • Promotes early resolution of service disruptions and potential downtimes. • Timely communication on service disruptions • Engages business partners with up-to-date information to help facilitate action plans during system downtimes.
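One way to picture the alert flow in the command center concept is the routing sketch below: a threshold breach goes to the command center, and the operations managers act as the escalation point. The time-based escalation rule, the 15-minute acknowledgement window, and all names (`Alert`, `route`, the team labels) are assumptions made for illustration. The slides do not specify when escalation happens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    system: str
    metric: str
    value: float
    threshold: float
    minutes_open: float = 0.0   # time since the breach was first raised
    acknowledged: bool = False  # has the command center picked it up?


def route(alert: Alert, ack_window_min: float = 15.0) -> Optional[str]:
    """Decide who should receive the alert (hypothetical routing rule)."""
    if alert.value < alert.threshold:
        return None                   # no breach, nothing to send
    if not alert.acknowledged and alert.minutes_open > ack_window_min:
        return "operations_managers"  # assumed escalation point
    return "command_center"           # default recipient of breach alerts


# Hypothetical breach: CPU at 97% against a 90% threshold, unacknowledged for 20 minutes.
cpu_alert = Alert("Example database", "cpu_pct", 97.0, 90.0, minutes_open=20.0)
print(route(cpu_alert))
```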
Monitoring Technologies • Syntervision’s Oasis • Features • Monitors and measures performance, memory, CPUs, file systems, and transaction response times. • End-user experiences are also measured and monitored for deviations, threshold violations, and errors. • The tool can monitor a wide variety of systems • http://www.syntervision.com • Xymon • Features • Monitors servers, applications, and networks • Collects information on the system’s health, its applications, and the network connectivity between the system and the application • The tool can monitor a wide variety of systems • http://www.xymon.com/xymon/help/about.html
Monitoring Technologies (cont.) • Nagios • Features • Allows problems to be detected and mitigated before they affect end users, reducing downtime and business losses • Allows users to plan and budget for IT upgrades • http://www.nagios.org/about/overview • EventSentry • Features • Monitors core components of the operating system, providing alerts for immediate problems and also collecting information for later analysis, trend prediction, and real-time overview • Provides notification of status changes for services and drivers • Provides alerts when the available disk space is below a certain minimum • The tool is customizable and allows organizations to integrate their own applications • http://www.eventsentry.com/features/system-health-monitoring
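To make the "available disk space below a certain minimum" check concrete, here is a small standalone sketch using only the Python standard library. It does not use or represent the API of any of the tools listed above. The function name `disk_alert` and the 10% default minimum are assumptions.

```python
import shutil
from typing import Optional


def disk_alert(path: str = ".", min_free_pct: float = 10.0) -> Optional[str]:
    """Return an alert message if free space on `path` is below the minimum.

    The 10% default is an assumed value; a real threshold would depend on
    the system being monitored.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_pct = 100.0 * usage.free / usage.total
    if free_pct < min_free_pct:
        return (f"Disk space alert for {path}: {free_pct:.1f}% free, "
                f"below the {min_free_pct:.1f}% minimum")
    return None


message = disk_alert(".")
print(message if message else "Disk space is above the configured minimum.")
```

A dedicated monitoring tool adds scheduling, alert delivery, and historical trending on top of a basic check like this.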
Summary • Monitoring the health of your systems is a significant opportunity for IT. • Creates an environment where business partners can focus more on process and delivery versus technology • Enables IT to shift focus from operations to innovation • Provides sustainability and availability • Creates a more proactive versus reactive organization