Deal with Production Issues: Suggestions from ITIL
Problems to solve • Long resolution time • Neglected issues • Issues we lose track of until our users remind us • Recurring issues • Inconsistent response time • Developers are constantly distracted by having to resolve issues
Goal • Manage issues in a consistent manner • Fast resolution • Reduce client impact • Proactively resolve issues before they impact clients
Basic Concepts • Incidents: An incident is any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service • Problems: A problem is a condition, often identified as the cause of multiple incidents that exhibit common symptoms • Known Errors: A known error is a condition identified by successful diagnosis of the root cause of a problem and the subsequent development of a workaround
Relationship of the three • A problem is the root cause of incidents • An incident is the manifestation of an underlying problem • One problem can cause many incidents • A known error is a problem with a known root cause and a known workaround
Manage Incident vs. Manage Problem • Different goals • Incident management focuses on restoring service operation as quickly as possible • Problem management focuses on finding and eliminating the root cause • Different actions • Incident management applies workarounds or temporary fixes to quickly restore service • Problem management issues a change to eliminate the root cause • Incident management is reactive and problem management is proactive • Incident management emphasizes speed and problem management emphasizes quality
Common mistakes • Spending tremendous time and effort finding the root cause before the service level is recovered • Stopping the investigation once an incident is fixed by a workaround • Letting the same incident occur repeatedly without understanding the root cause
Solutions from ITIL • Separate out Incident Management and Problem Management into two independent but related processes • Handle incidents (restore service) as quickly as possible • Proactively and independently work on resolving problems • Wisely manage Known Errors
Incident Management • Always remember the goal is to “Restore service level as quickly as possible” • How to go fast? • Classification • Match known errors and known workarounds • Appropriate escalation • Go fast, but don’t go crazy. Don’t skip: • Record • Prioritize • Follow up
Acceptance And Record • Benefits of recording • Helps diagnose new incidents based on known incidents • Helps Problem Management find the root cause • Makes it easy to determine the impact • Makes it possible to track and control issue resolution • Incident Reporting Channels • User • System Monitor/Alert • IT staff
Incident Record • Unique ID • Basic diagnosis info • Timestamp • Symptoms • User info (name, contact info) • Who’s responsible • Additional information • Screenshots • Logs • Status • New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
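The record fields listed above could be captured in a small data structure. The following is only a minimal sketch: the field names, the `INC-` ID scheme and the example values are assumptions made for illustration, while the status values are the ones listed on the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count
from typing import List


class Status(Enum):
    # Status values taken from the Incident Record slide.
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


_next_id = count(1)  # hypothetical ID scheme: a simple running counter


@dataclass
class Incident:
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who is responsible for the incident
    attachments: List[str] = field(default_factory=list)  # screenshots, logs
    status: Status = Status.NEW
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    incident_id: str = field(default_factory=lambda: f"INC-{next(_next_id):05d}")


# Example values are made up for illustration.
first = Incident("Login page times out", "A. User", "a.user@example.com", "service-desk")
second = Incident("Report export fails", "B. User", "b.user@example.com", "service-desk")
print(first.incident_id, first.status.value)
```

Generating the ID and timestamp at creation time keeps every record uniquely identifiable and ordered, which is what makes later tracking and matching possible.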
Classification • Classify by • Possible cause (application, network, database, business logic, etc.) • Supporting group (application group, database group, infrastructure group, network group, etc.) • Prioritize • Priority = Impact × Urgency • Determine the resolution timeline (resolve within X hours) based on the Service Level Agreement
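The priority formula above can be sketched in a few lines. The 1 to 3 scales and the SLA table below are hypothetical; the slides give neither a scale nor concrete resolution times, so real values would come from the actual Service Level Agreement.

```python
# Hypothetical 1-3 scales; the slides do not define the scale.
IMPACT = {"low": 1, "medium": 2, "high": 3}
URGENCY = {"low": 1, "medium": 2, "high": 3}

# Hypothetical SLA table: (minimum score, resolve-within hours), highest first.
SLA_HOURS = [(9, 2), (6, 4), (3, 8), (1, 24)]


def priority_score(impact: str, urgency: str) -> int:
    """Priority = Impact x Urgency, as stated on the slide."""
    return IMPACT[impact] * URGENCY[urgency]


def resolution_deadline_hours(score: int) -> int:
    """Look up the resolution timeline for a priority score."""
    for min_score, hours in SLA_HOURS:
        if score >= min_score:
            return hours
    raise ValueError(f"score out of range: {score}")


score = priority_score("high", "medium")
print(score, resolution_deadline_hours(score))
```

Keeping the scales and the SLA table as plain data means the calculation stays consistent across incidents and can be adjusted without touching the logic.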
Preliminary Support • Preliminary Response • Acknowledge acceptance of the incident • Collect basic info • Provide basic help to the user • Service Requests • A service request is a standard service such as checking status or resetting a password • Handle service requests through the standard procedure
Match • Match known errors • Known solution • Known workaround • Known resolution procedure • Match existing incidents • Link the new incident with the existing incidents • Increase the impact level of the existing incident • If the existing incident is already being worked on, inform the responsible person/group
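One simple way to match a new incident against known errors is to compare symptom descriptions by word overlap. This is a sketch under assumptions, not the method ITIL prescribes: the `KnownError` fields, the Jaccard similarity measure, the 0.3 threshold and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class KnownError:
    error_id: str
    symptoms: str
    workaround: str


def _tokens(text: str) -> Set[str]:
    # Lowercased words longer than two characters, with punctuation stripped.
    return {w.strip(".,:;!?").lower() for w in text.split() if len(w) > 2}


def match_known_error(
    symptoms: str, known_errors: List[KnownError], threshold: float = 0.3
) -> Optional[KnownError]:
    """Return the known error whose symptoms overlap most, or None if too weak."""
    wanted = _tokens(symptoms)
    best, best_score = None, 0.0
    for ke in known_errors:
        have = _tokens(ke.symptoms)
        union = wanted | have
        score = len(wanted & have) / len(union) if union else 0.0  # Jaccard
        if score > best_score:
            best, best_score = ke, score
    return best if best_score >= threshold else None


# Made-up entries for illustration only.
known = [
    KnownError("KE-1", "Login page times out after password reset", "Clear the session cache"),
    KnownError("KE-2", "Nightly report export fails with empty file", "Rerun the export job manually"),
]
hit = match_known_error("Report export fails for nightly job", known)
print(hit.error_id if hit else "no match")
```

A match gives the support person a known workaround immediately; no match means the incident proceeds to investigation and diagnosis.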
Investigation and Diagnosis • Escalation • Functional escalation (technical escalation): involve more technical experts, teams in other functional groups, or external suppliers • Hierarchical escalation (management escalation): escalate to a higher level of management
Escalation by Priorities (figure: escalation levels) • A (Service Desk) • B (Second Line) • C (Third Line, Supplier) • D (Incident Manager) • E (Division Management) • F (Corporate Management)
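Time-bound escalation along the levels named in the figure could be sketched as follows. The timing table is entirely hypothetical, since the figure's priority-to-time mapping is not recoverable from the slide text; only the level names come from the source.

```python
# Levels A-C (functional) and D-F (hierarchical) as named in the figure.
FUNCTIONAL = ["Service Desk", "Second Line", "Third Line / Supplier"]
HIERARCHICAL = ["Incident Manager", "Division Management", "Corporate Management"]

# Hypothetical: hours allowed at each level before escalating, by priority
# (1 = most urgent). Real values would come from the Service Level Agreement.
HOURS_PER_LEVEL = {1: 1, 2: 4, 3: 8}


def escalation_level(chain, priority, hours_open):
    """Return the level in `chain` that should hold the incident right now."""
    steps = int(hours_open // HOURS_PER_LEVEL[priority])
    # Stay at the last level once the chain is exhausted.
    return chain[min(steps, len(chain) - 1)]


print(escalation_level(FUNCTIONAL, priority=1, hours_open=1.5))
print(escalation_level(HIERARCHICAL, priority=3, hours_open=20))
```

Deriving the level from elapsed time and priority, rather than from individual judgment, is one way to make escalation enforceable and consistent.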
Investigation Activities • Assign a dedicated support person • Collect basic info • Query historical data • Recent releases • Recent changes • Workload trend • Analyze • Again, don’t spend too much time finding the root cause. Find a workaround as soon as possible!
Resolve and recover • Resolution (workaround or permanent fix) • Create a Request For Change (RFC) • Approve the RFC • Implement the change • Record the analysis, the root cause, the workaround and the solution • Leave the incident in Open status when a resolution hasn’t been found
Termination • Contact the user to confirm the incident is resolved • Change the incident status to “Closed” • Update the incident record to reflect the final priority, impact, user and root cause
Track and Monitor • Assign an owner to each incident. Usually it’s the Service Desk person. • Provide feedback to the users after a change • Enforce the escalation based on the priority
Problem Management • Problem Control • Find the root cause of a problem • Turn a problem into a Known Error • Error Control • Control and Monitor the Known Errors until they are appropriately handled • Proactive Problem Management • Resolve problems before they cause any incidents
Identify Problems • Analyze incident trends • Likely to recur • Likely to increase in number • Likely to have a larger impact • Analyze weaknesses in the infrastructure • Availability • Capability • A significant incident (outage)
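Trend analysis can start as simply as counting incidents per classification and flagging the categories that keep recurring. This is a rough sketch: the `(category, incident_id)` input shape, the `min_count` threshold and the sample data are assumptions for illustration.

```python
from collections import Counter
from typing import List, Tuple


def candidate_problems(
    incidents: List[Tuple[str, str]], min_count: int = 3
) -> List[Tuple[str, int]]:
    """Return (category, count) pairs that recur at least `min_count` times.

    `incidents` is a list of (category, incident_id) pairs; the threshold
    is a hypothetical cut-off for "likely to recur".
    """
    counts = Counter(category for category, _ in incidents)
    return [(cat, n) for cat, n in counts.most_common() if n >= min_count]


# Made-up sample data for illustration only.
incidents = [
    ("database", "INC-1"),
    ("network", "INC-2"),
    ("database", "INC-3"),
    ("database", "INC-4"),
    ("application", "INC-5"),
]
print(candidate_problems(incidents))
```

Each flagged category is a candidate for a problem record, where the root cause is then investigated independently of the individual incidents.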
Diagnosis • Recreate the incident in a testing environment • Link the modules with incidents • Review the latest changes • Once the root cause of a problem is found, the problem becomes a Known Error
Temporary Fixes • It’s important to find a temporary fix if the problem causes a significant incident • If the temporary fix involves changes to the infrastructure, a Request For Change must be submitted (later, another RFC may be submitted to fix the root cause) • For urgent problems, the Emergency Change Request process should be initiated
Identify and Record Known Error • Identify • Find the root cause of a problem • Link a problem with a known error • Record • Assign an ID • Symptoms • Root cause • Status • Notification • Notify the incident management team so they can associate new incidents with known errors
Determine the solution • Evaluate based on • Service Level Agreement • Impact and urgency • Cost and benefit • Possible solutions • Temporary fixes • Permanent fixes • No fix (when the cost is greater than the benefit) • Record the decision in the Problem Database
Known Errors from other environments • Known errors from development environment • We may choose to release with some minor known issues • Known errors from suppliers • Usually reported in the release notes • Record, Monitor and Track those known errors • Relate problems with those known errors
PIR (Post Implementation Review) • Normal problems • Confirm all the related incidents are closed • Verify that the problem record is complete (symptoms, root cause and solutions) • Change the problem status to Resolved • Significant problems • What went well? • What went wrong? • How to do better next time? • How to prevent similar issues from happening again?
Track and Monitor • Track the full lifecycle of each known error • Reevaluate impact and urgency. Adjust the priorities accordingly. • Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.
Proactive Problem Management • Focus on the quality of the service and the infrastructure • Analyze operational trends • Detect the potential incidents and prevent them from happening • Find out the weak points of the infrastructure or the overloaded components
Ideas to improve our Production Support process • Idea 1: Create an independent Problem Management Team • Idea 2: Create a Problem Database • Idea 3: Define the Production Support Procedure • Idea 4: Review and revise the procedures for using TeamTrack • Idea 5: Enforce Post Implementation Review • Idea 6: Proactively manage problems • Idea 7 (optional): Acquire Service Desk software to facilitate the process
Create an independent Problem Management Team • Can be a full-time team or a part-time team • Appoint a Problem Management Manager. This must be a different person from the Production Support Manager, because their goals, schedules and requirements are different • Responsible for managing all the production problems (not incidents) for multiple applications • Identify problems • Record problems • Find and evaluate solutions • Track the progress until closure • Work closely with the existing Production Support team
Create a Problem Database • An easy-to-search knowledge database • Includes problems and known errors • Tracks symptoms, root causes, temporary fixes, workarounds, and permanent solutions • Includes all the known errors in DEV and unresolved or deferred defects in QA/RATE environments • Maintained by the Problem Management Team • Used by the Production Support team to match and quickly resolve incidents
Define the Production Support Procedure (Work Instructions) • Create a formal and detailed document. Train Production Support Team to follow the new procedure • Start with ITIL Incident Management Process. Adjust it to our own situation and tools • Clearly define how to calculate priorities • Clearly define the time-bound escalation procedure • Clearly define the monitoring and tracking steps
Review and define the procedure for using TeamTrack • TeamTrack is our existing Incident Tracking system • Review the functions of TeamTrack • Redefine the incident escalation process according to ITIL suggestions • Define the interface between PC Support and IT Production Support Team • Communication channel • Roles and responsibilities • Escalation • Track and Control • Knowledge sharing
Enforce PIR • Contact each user to confirm all the incidents are closed • Make sure the Problem record is complete and useful • Identify issues in the Incident and Problem Management process. Add those to Problem database.
Proactively Manage Problems • Responsibility of the Problem Management Team • Perform the following activities: • Analyze incidents to find trends • Analyze infrastructure to identify possible bottlenecks • Run fail-over and stress tests • Apply a problem solution across multiple related applications • Establish and maintain the Production Monitor System to proactively detect system anomalies • Measure how many problems are proactively identified and resolved
Service Desk Software • Evaluate the existing TeamTrack software and see if it covers our needs • Other popular options • HP OpenView Service Desk • Remedy Strategic Service Suite • CA Unicenter Service Desk