Incident Management Revue

Incident Management Revue Strategic Process Planning and Integration Management (SPPIM) Sue Silkey, Thelma Simons and Gail Schaplowsky

Best Practices • Best practices serve as a guide to designing IT management processes that increase overall efficiency, reduce costs and align IT with business needs. • ITIL asks…

How ITIL best practices can help • Faster incident recovery • Fewer unplanned outages • Better communication with users • Information that enables better informed management decisions

Incident Management Goal • Restore normal service operation as quickly as possible and minimize adverse impact on business operations • Basically this means using all available resources to get the user back to a productive state as quickly as possible

Incident Management Benefits • Minimize the disruption and downtime for our users • Maintain a record during the entire Incident life-cycle (this allows any member of the service team to obtain or provide an up-to-date progress report) • Build a knowledgebase of known issues to allow quicker resolution of frequent Incidents

Incident Management How we implemented • Began using the process in July 2006 • Continued regular meetings to review and tweak the process • Process formally adopted in December 2006 Current status • Starting to develop metrics to create management reports (how many incidents, major incidents, etc.)

Definitions • Incident - any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service • Service Request - request for increased functionality for new services, not a failure in the IT infrastructure. • Major Incident – an Incident for which the degree of impact on the User community is extreme, and which requires a response that is above and beyond that given to normal incidents. • Problem - A condition identified by multiple incidents exhibiting common symptoms, or from one single significant incident, indicative of a single error, for which the cause is unknown

Incident Lifecycle

A day in the life…of an Incident Our players • Nervous Nellie – Gail Schaplowsky • Incident/Major Incident – Dave Barnhill • Support Staff – Mike Wright • Major Incident Manager – Sue Silkey • CSC Staff – Bill Farris • Narrator – Thelma Simons We begin on a bright and sunny day…

Case Types • Incident - any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service • Service Request - request for increased functionality for new services, not a failure in the IT infrastructure. • Major Incident – an Incident for which the degree of impact on the User community is extreme, and which requires a response that is above and beyond that given to normal incidents. • Problem - A condition identified by multiple incidents exhibiting common symptoms, or from one single significant incident, indicative of a single error, for which the cause is unknown

Incident Management Goal • Restore normal service operation as quickly as possible and minimize adverse impact on business operations

I+U=P Impact + Urgency = Priority

I+U=P Impact is defined as the number of people affected by a service outage. • Low Impact: One customer affected, where no executive or executive staff are involved. • Medium Impact: Several customers are affected, or an executive or executive staff are involved. • High Impact: Whole organization, complete department or building affected, or revenue/financial systems affected.
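The three impact levels above can be read as a small decision rule. The following is a minimal sketch of that reading, not the team's actual tooling: the function name, parameter names, and the choice to treat "several customers" as "more than one" are assumptions made for illustration.

```python
from enum import Enum


class Impact(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


def classify_impact(customers_affected: int,
                    executive_involved: bool = False,
                    whole_unit_affected: bool = False,
                    financial_system_affected: bool = False) -> Impact:
    """Hypothetical helper that applies the slide's impact definitions."""
    # High: whole organization, complete department or building affected,
    # or revenue/financial systems affected.
    if whole_unit_affected or financial_system_affected:
        return Impact.HIGH
    # Medium: several customers affected, or an executive or executive
    # staff involved. "Several" is assumed here to mean more than one.
    if customers_affected > 1 or executive_involved:
        return Impact.MEDIUM
    # Low: one customer affected and no executive involvement.
    return Impact.LOW
```

For example, `classify_impact(1, executive_involved=True)` would be Medium under this reading, because executive involvement alone raises a single-customer case above Low.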

I+U=P Urgency is defined as the effect of the event on a customer’s ability to work. (This is not to be confused with how urgent the requestor believes the incident to be.) • Low Urgency: Ability not impaired, the customer is requesting extra or additional functions or services (a service request). • Medium Urgency: Abilities are partially impaired, and customers cannot use certain functions or services. • High Urgency: Abilities are completely impaired and customers cannot work.

I+U=P Priority is based on Impact and Urgency. The priority determines how quickly the issue needs to be addressed. • Low Priority: Work to be completed in 4 business days. • Medium Priority: Work to be completed in 2 business days. • High Priority: Work to be completed in 4 hours. • Urgent Priority: Work to be completed in 2 hours.
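The slides give the formula Impact + Urgency = Priority and the completion target for each priority, but not the exact table that combines the two inputs. The sketch below is one possible interpretation: it scores each input from 1 to 3, adds them, and maps the sum to a priority. The score thresholds, the `LEVELS` and `TARGETS` names, and the `priority` function are assumptions for illustration; only the four target times come from the slide.

```python
# Scores assigned to the Low/Medium/High levels (assumed scale).
LEVELS = {"low": 1, "medium": 2, "high": 3}

# Completion targets as stated on the Priority slide.
TARGETS = {
    "Low": "4 business days",
    "Medium": "2 business days",
    "High": "4 hours",
    "Urgent": "2 hours",
}


def priority(impact: str, urgency: str) -> str:
    """Combine impact and urgency into a priority (hypothetical thresholds)."""
    score = LEVELS[impact.lower()] + LEVELS[urgency.lower()]
    if score <= 3:      # Low+Low, Low+Medium
        return "Low"
    if score == 4:      # Medium+Medium, Low+High
        return "Medium"
    if score == 5:      # Medium+High
        return "High"
    return "Urgent"     # High+High


if __name__ == "__main__":
    p = priority("High", "Medium")
    print(p, "->", TARGETS[p])
```

A real deployment would likely replace the additive thresholds with an explicit impact-by-urgency lookup table agreed on by the service team.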

Major Incident I am the highest category of impact for an incident / I result in significant disruption to our business / In short, in matters technical on which we are dependent / I am the very model of an IT Major Incident! (Sung to the tune of the Major-General’s Song from The Pirates of Penzance)

Case Types • Incident: an event which is not part of the standard operation of a service and which causes or may cause an interruption to, or a reduction in the quality of, that service, i.e. some piece of technology that I previously used is not working now. • Major Incident: an Incident for which the degree of impact on the User community is extreme, or where the disruption is excessive and which requires a response that is above and beyond that given to normal incidents.

Major Incident Responsibilities Support Staff Major Incident Checklist • Assign the case to yourself (if not already done) • Updates: Hourly updates should be made to the work log or to the Major Incident Manager (MIM) at the CSC. If you do not make these hourly updates, the MIM or CSC will contact you for an update. • Resolution updates should be called in to the MIM or CSC for verification. Once verified, move the case to Resolved status and complete the information in the solutions tab.

Major Incident Responsibilities Major Incident Manager Checklist • Replicate or substantiate the failure (via monitoring equipment alerts) • Log the case • Consult the Call List (contact support staff, Service Owner, SCC) • Monitor the case a. Check activity log for updates hourly b. If activity log hasn’t been updated for an hour, contact support staff. • Upon “resolution” or moving the case to “Pending – Major Incident Cleared” a. Test that failure is resolved. b. Contact the SCC.
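The hourly monitoring step in the checklist amounts to comparing the time of the last activity-log entry against a one-hour window. Here is a minimal sketch of that check; the function name `needs_follow_up` and the `UPDATE_INTERVAL` constant are hypothetical, and how the last-update timestamp is read from the case system is left out.

```python
from datetime import datetime, timedelta

# The checklist asks for hourly updates to the activity log.
UPDATE_INTERVAL = timedelta(hours=1)


def needs_follow_up(last_update: datetime, now: datetime) -> bool:
    """Return True when the activity log has gone an hour or more without
    an update, meaning the Major Incident Manager should contact support
    staff (hypothetical helper)."""
    return now - last_update >= UPDATE_INTERVAL


if __name__ == "__main__":
    last = datetime(2007, 1, 15, 9, 0)
    print(needs_follow_up(last, datetime(2007, 1, 15, 9, 45)))
    print(needs_follow_up(last, datetime(2007, 1, 15, 10, 5)))
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check easy to test.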

Call List

Tune in next time… • What will happen to Major Incident? • Come back next month to see the continuing saga of Mr. Incident as he wafts his way through Change Management, Problem Management and Configuration Management.

Hope you had fun and… Learned • The difference between Incident and Major Incident • How IM can minimize the disruption and downtime for our users • The importance of maintaining a record during the entire Incident life-cycle • That building a knowledgebase of known issues will allow quicker resolution of frequent Incidents

IM Wrap Up • Where we are • Where we want to be • Metrics to tell us when we arrive • Annual Review • New committee based on reorganization

Upcoming Sessions Future sessions are scheduled on: • Change Management • Problem Management • Configuration Management • Release Management

More information at SPPIM (PSMO) website www.technology.ku.edu/psmo Also in IS/Process Management public folders Questions?
