Deal with Production Issues

Deal with Production Issues: Suggestions from ITIL

Problems to solve • Long resolution times • Neglected issues • Issues we lose track of until our users remind us • Recurring issues • Inconsistent response times • Developers are constantly pulled away from their work to resolve issues

Goals • Manage issues in a consistent manner • Resolve issues fast • Reduce client impact • Proactively resolve issues before they impact clients

Basic Concepts • Incident: any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service • Problem: a condition, often identified as the cause of multiple incidents that exhibit common symptoms • Known Error: a condition identified by successful diagnosis of the root cause of a problem and the subsequent development of a workaround

Relationship of the three • A problem is the root cause of its incidents • An incident is the manifestation of an underlying problem • One problem can cause many incidents • A known error is a problem with a known root cause and a known workaround

Manage Incident vs. Manage Problem • Different goals • Incident management focuses on restoring service operation as quickly as possible • Problem management focuses on finding and eliminating the root cause • Different actions • Incident management applies workarounds or temporary fixes to quickly restore service • Problem management issues a change to eliminate the root cause • Incident management is reactive; problem management is proactive • Incident management emphasizes speed; problem management emphasizes quality

Common mistakes • Spending a great deal of time and effort on finding the root cause before the service level is recovered • Stopping the investigation once an incident is fixed by a workaround • Letting the same incident occur repeatedly without understanding the root cause

Solutions from ITIL • Separate out Incident Management and Problem Management into two independent but related processes • Handle incidents (restore service) as quickly as possible • Proactively and independently work on resolving problems • Wisely manage Known Errors

Incident Management • Always remember the goal: “Restore service level as quickly as possible” • How to go fast? • Classification • Matching against known errors and known workarounds • Appropriate escalation • Go fast, but don’t go crazy. Don’t skip these steps: • Record • Prioritize • Follow up

Incident Management Process

Acceptance and Record • Benefits of recording • Helps diagnose new incidents based on known incidents • Helps Problem Management find the root cause • Makes it easy to determine the impact • Makes it possible to track and control the issue resolution • Incident reporting channels • User • System monitor/alert • IT staff

Incident Record • Unique ID • Basic diagnosis info • Timestamp • Symptoms • User info (name, contact info) • Who’s responsible • Additional information • Screenshots • Logs • Status • New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
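The record fields above could be modeled roughly as follows. This is a minimal sketch, not a prescribed schema: the class name `IncidentRecord`, the field names, and the running-counter ID scheme are all assumptions made for illustration. Only the status values come from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count


class Status(Enum):
    # Status values as listed on the slide.
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


# Hypothetical ID scheme: a simple in-memory running counter.
_next_id = count(1)


@dataclass
class IncidentRecord:
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who is responsible for the incident
    incident_id: int = field(default_factory=lambda: next(_next_id))
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    attachments: list[str] = field(default_factory=list)  # screenshots, logs
    status: Status = Status.NEW


# Example: record a new incident, attach a log, and accept it.
incident = IncidentRecord(
    symptoms="Login page returns HTTP 500",
    user_name="A. User",
    user_contact="a.user@example.com",
    owner="service-desk",
)
incident.attachments.append("app-server.log")
incident.status = Status.ACCEPTED
```

In practice the tracking tool would own the ID and timestamp; the point of the sketch is only that every field on the slide has an explicit place in the record.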

Classification • Classify • Possible causes (application, network, database, business logic, etc.) • Supporting group (application group, database group, infrastructure group, network group, etc.) • Prioritize • Priority = Impact × Urgency • Determine the resolution timeline (resolve within X hours) based on the Service Level Agreement
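The priority rule can be sketched in a few lines. The 1 to 3 scoring scale and the hours in `SLA_TARGET_HOURS` are assumptions for illustration only; the real values would come from the Service Level Agreement.

```python
def priority_score(impact: int, urgency: int) -> int:
    """Priority = Impact x Urgency, each scored 1 (low) to 3 (high)."""
    for name, value in (("impact", impact), ("urgency", urgency)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2 or 3, got {value}")
    return impact * urgency


# Hypothetical SLA table: (minimum score, resolve within N hours),
# ordered from the highest score band to the lowest.
SLA_TARGET_HOURS = [(9, 1), (6, 4), (3, 8), (1, 24)]


def resolution_target_hours(score: int) -> int:
    """Return the resolution target for the first band the score reaches."""
    for minimum, hours in SLA_TARGET_HOURS:
        if score >= minimum:
            return hours
    raise ValueError(f"score must be at least 1, got {score}")


# Example: high impact, medium urgency.
score = priority_score(impact=3, urgency=2)
target = resolution_target_hours(score)
```

With these assumed bands, the example incident scores 6 and would have to be resolved within 4 hours.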

Preliminary Support • Preliminary response • Acknowledge acceptance of the incident • Collect basic info • Provide basic help to the user • Service requests • A service request is a standard service such as checking status or resetting a password • Handle service requests through the standard procedure

Match • Match known errors • Known solution • Known workaround • Known resolution procedure • Match existing incidents • Link the new incident to the existing incidents • Increase the impact level of the existing incident • If the existing incident is already being worked on, inform the responsible person or group
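One simple way to match a new incident against known errors is keyword overlap on the reported symptoms. The sketch below assumes each known error carries a hand-maintained keyword set; the `KnownError` class, the sample entries, and the `min_overlap` threshold are hypothetical, and a real Service Desk tool would likely offer its own search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    error_id: str
    keywords: frozenset[str]
    workaround: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into words without punctuation."""
    return {word.strip(".,:;!?()").lower() for word in text.split()} - {""}


def match_known_errors(symptoms, known_errors, min_overlap=2):
    """Return known errors sharing at least min_overlap keywords, best first."""
    words = tokenize(symptoms)
    scored = []
    for err in known_errors:
        overlap = len(words & err.keywords)
        if overlap >= min_overlap:
            scored.append((overlap, err))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [err for _, err in scored]


# Hypothetical known-error entries.
KNOWN_ERRORS = [
    KnownError("KE-1", frozenset({"login", "timeout", "ldap"}),
               "Restart the LDAP connector"),
    KnownError("KE-2", frozenset({"report", "export", "pdf"}),
               "Export the report as CSV instead"),
]

matches = match_known_errors("Login fails with LDAP timeout.", KNOWN_ERRORS)
```

A match gives the Service Desk a workaround to apply immediately; no match means the incident moves on to investigation and diagnosis.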

Investigation and Diagnosis • Escalation • Functional escalation (technical escalation): involve more technical experts, teams in other functional groups, or external suppliers • Hierarchical escalation (management escalation): escalate to a higher level of management

Escalation by Priorities • A (Service Desk) • B (Second Line) • C (Third Line, Supplier) • D (Incident Manager) • E (Division Management) • F (Corporate Management)
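Time-bound escalation by priority could look like the sketch below. The original diagram's layout did not survive in this transcript, so treating A to F as a single linear chain is an assumption, as are the `ESCALATE_AFTER` intervals and the three priority names.

```python
from datetime import timedelta

# Escalation levels as named on the slide; the linear order is assumed.
LEVELS = [
    "A (Service Desk)",
    "B (Second Line)",
    "C (Third Line, Supplier)",
    "D (Incident Manager)",
    "E (Division Management)",
    "F (Corporate Management)",
]

# Hypothetical: how long an incident may stay unresolved at one level
# before it moves to the next, per priority.
ESCALATE_AFTER = {
    "high": timedelta(minutes=30),
    "medium": timedelta(hours=2),
    "low": timedelta(hours=8),
}


def escalation_level(priority: str, unresolved_for: timedelta) -> str:
    """Return the level that should own the incident after the given time."""
    step = ESCALATE_AFTER[priority]
    # Dividing one timedelta by another with // yields a whole number of steps.
    index = min(unresolved_for // step, len(LEVELS) - 1)
    return LEVELS[index]


# Example: a high-priority incident that has been open for 95 minutes.
level = escalation_level("high", timedelta(minutes=95))
```

Under these assumed intervals, the example incident has passed three 30-minute steps and would sit with the Incident Manager.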

Investigation Activities • Assign a dedicated support person • Collect basic info • Query historical data • Recent releases • Recent changes • Workload trend • Analyze • Again, don’t spend too much time on finding the root cause. Find a workaround as soon as possible!

Resolve and Recover • Resolution (workaround or permanent fix) • Create a Request for Change (RFC) • Approve the RFC • Implement the change • Record the analysis, the root cause, the workaround and the solution • Leave the incident in Open status when a resolution hasn’t been found

Termination • Contact the user to confirm the incident is resolved • Change the incident status to “Closed” • Update the incident record to reflect the final priority, impact, user and root cause

Track and Monitor • Assign an owner to each incident. Usually it’s the Service Desk person. • Provide feedback to the users after a change • Enforce the escalation based on the priority

Problem Management • Problem Control • Find the root cause of a problem • Turn a problem into a Known Error • Error Control • Control and Monitor the Known Errors until they are appropriately handled • Proactive Problem Management • Resolve problems before they cause any incidents

Problem Control

Identify Problems • Analyze incident trends • Incidents likely to recur • Incidents likely to become more frequent • Incidents likely to have a larger impact • Analyze weaknesses in the infrastructure • Availability • Capability • A significant incident (outage)
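Trend analysis can start with something as simple as counting recent incidents per category and flagging the categories that keep coming back. In this sketch the input format (a date and a category per incident), the 30-day window, and the threshold of three are assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def recurring_categories(incidents, today, window_days=30, threshold=3):
    """Return (category, count) pairs seen at least `threshold` times recently.

    `incidents` is an iterable of (opened_on, category) tuples.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        category for opened_on, category in incidents if opened_on >= cutoff
    )
    return [(cat, n) for cat, n in counts.most_common() if n >= threshold]


# Hypothetical incident history.
history = [
    (date(2024, 1, 2), "database"),   # outside the 30-day window
    (date(2024, 3, 5), "database"),
    (date(2024, 3, 12), "database"),
    (date(2024, 3, 15), "network"),
    (date(2024, 3, 20), "database"),
]

candidates = recurring_categories(history, today=date(2024, 3, 31))
```

Each flagged category is a candidate problem record: several incidents with common symptoms that probably share a root cause.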

Diagnosis • Recreate incident in testing environment • Link the modules with incidents • Review the latest changes • After the root cause of a problem is found, this problem becomes a Known Error

Temporary Fixes • It’s important to find a temporary fix if the problem causes significant incidents • If the temporary fix involves changes to the infrastructure, a Request for Change must be submitted. (Later, another RFC may be submitted to fix the root cause.) • For urgent problems, the Emergency Change Request process should be initiated.

Error Control

Identify and Record Known Error • Identify • Find the root cause of a problem • Link a problem with a known error • Record • Assign an ID • Symptoms • Root cause • Status • Notification • Notify incident management team. They can associate new incidents with known errors

Determine the solution • Evaluate based on • Service Level Agreement • Impact and Urgency • Cost and benefit • Possible solutions • Temporary fixes • Permanent fixes • No fix (cost is greater than benefits) • Record the decision in Problem Database
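The cost and benefit part of this decision can be illustrated with a small sketch. It deliberately ignores the SLA, impact, and urgency criteria listed above, and the `Option` class, the cost and benefit units, and the "largest net benefit wins" rule are assumptions rather than anything ITIL prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    cost: float     # hypothetical unit, e.g. person-days
    benefit: float  # same unit as cost, e.g. person-days saved


def choose_solution(options):
    """Pick the option with the largest net benefit, or "No fix" if none pays off."""
    worthwhile = [o for o in options if o.benefit > o.cost]
    if not worthwhile:
        return "No fix"
    return max(worthwhile, key=lambda o: o.benefit - o.cost).name


# Example: compare a temporary and a permanent fix.
decision = choose_solution([
    Option("Temporary fix", cost=2, benefit=5),
    Option("Permanent fix", cost=10, benefit=20),
])
```

Whatever rule is used, the outcome and its reasoning belong in the Problem Database so that the decision can be revisited when impact or urgency changes.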

Known Errors from other environments • Known errors from development environment • We may choose to release with some minor known issues • Known errors from suppliers • Usually reported in the release notes • Record, Monitor and Track those known errors • Relate problems with those known errors

PIR (Post Implementation Review) • Normal problems • Confirm all the related incidents are closed • Verify that the problem record is complete (symptoms, root cause and solutions) • Change the problem status to Resolved • Significant problems • What went well? • What went wrong? • How to do better next time? • How to prevent similar issues from happening again?

Track and Monitor • Track the full lifecycle of each known error • Reevaluate impact and urgency. Adjust the priorities accordingly. • Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.

Proactive Problem Management • Focus on the quality of the service and the infrastructure • Analyze operational trends • Detect the potential incidents and prevent them from happening • Find out the weak points of the infrastructure or the overloaded components

Ideas to improve our Production Support process • Idea 1: Create an independent Problem Management Team • Idea 2: Create a Problem Database • Idea 3: Define the Production Support Procedure • Idea 4: Review and revise the procedures for using TeamTrack • Idea 5: Enforce Post Implementation Review • Idea 6: Proactively manage problems • Idea 7 (optional): Acquire Service Desk software to facilitate the process

Create an independent Problem Management Team • Can be a full-time team or a part-time team • Appoint a Problem Management Manager. This must be a different person from the Production Support Manager, because their goals, schedules and requirements are different. • Responsible for managing all the production problems (not incidents) for multiple applications • Identify problems • Record problems • Find and evaluate solutions • Track the progress until closure • Work closely with the existing Production Support team

Create a Problem Database • An easy-to-search knowledge base • Include problems and known errors • Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions • Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments • Maintained by the Problem Management Team • Used by the Production Support team to match incidents and resolve them quickly

Define the Production Support Procedure (Work Instructions) • Create a formal and detailed document. Train Production Support Team to follow the new procedure • Start with ITIL Incident Management Process. Adjust it to our own situation and tools • Clearly define how to calculate priorities • Clearly define the time-bound escalation procedure • Clearly define the monitoring and tracking steps

Review and revise the procedures for using TeamTrack • TeamTrack is our existing incident tracking system • Review the functions of TeamTrack • Redefine the incident escalation process according to ITIL suggestions • Define the interface between PC Support and the IT Production Support Team • Communication channel • Roles and responsibilities • Escalation • Track and control • Knowledge sharing

Enforce PIR • Contact each user to confirm all the incidents are closed • Make sure the Problem record is complete and useful • Identify issues in the Incident and Problem Management process. Add those to Problem database.

Proactively Manage Problems • Responsibility of the Problem Management Team • Perform the following activities: • Analyze incidents to find trends • Analyze the infrastructure to identify possible bottlenecks • Run fail-over and stress tests • Apply a problem solution across multiple related applications • Establish and maintain the Production Monitor System to proactively detect system anomalies • Evaluate how many problems are proactively identified and resolved
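The last activity, evaluating how many problems are found proactively, can be tracked as a simple ratio. The sketch assumes each problem record notes how it was detected; the `detected_by` field and its two values are hypothetical.

```python
def proactive_ratio(problems):
    """Share of problems that were detected before causing an incident.

    `problems` is a list of dicts with a 'detected_by' key whose value is
    either 'proactive' or 'incident'.
    """
    if not problems:
        return 0.0
    proactive = sum(1 for p in problems if p["detected_by"] == "proactive")
    return proactive / len(problems)


# Hypothetical problem records for one quarter.
quarter = [
    {"id": "P-1", "detected_by": "incident"},
    {"id": "P-2", "detected_by": "proactive"},
    {"id": "P-3", "detected_by": "incident"},
    {"id": "P-4", "detected_by": "incident"},
]

ratio = proactive_ratio(quarter)
```

Watching this ratio over time gives the Problem Management Team a rough indicator of whether it is moving from reactive to proactive work.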

Service Desk Software • Evaluate the existing TeamTrack software and see if it covers our needs • Other popular options • HP OpenView Service Desk • Remedy Strategic Service Suite • CA Unicenter Service Desk
