Problem Management Practitioners Forum
Thursday, January 19, 2012
Jon Dowell and Jorge A. Wong
Agenda
• Housekeeping & Introductions
• Define a successful investigation
• Makeup of a successful Problem Manager
• Proactive monitoring of automated alerts for trends/patterns
• Impact of Change Management on PbM
• Feedback & next steps
Housekeeping & Introductions (Jon Dowell)
• Fire & washrooms
• Name, company, & experience
Jon Dowell
• Senior Consultant with KSLD Consulting.
• 15 years of experience solving I.T. mysteries.
• Facilitation and critical thinking during:
  • Major Incidents
  • Problem investigations
  • Project quality assessments prior to go-live
  • Project warranty periods
• Training and mentoring in:
  • Critical thinking
  • Root cause analysis
  • Impact assessments
  • Potential risks associated with requests for change
• KSLD Consulting specializes in I.T. Problem Management and problem solving for today’s busy world.
Jorge Wong
• Over 13 years in IT with Enmax and Accenture:
  • Senior Systems Analyst
  • Applications Support Team Lead
  • Contact Center Technology Team Lead
  • Service Delivery Lead
  • Relationship Manager
  • Problem Manager
• ITIL background
• Focuses on reactive and proactive problem management
• Facilitates and conducts problem investigations using the Cause Mapping analysis method to capture the complete investigation and to:
  • Assess impact and cost
  • Identify root cause(s)
  • Identify the best solution(s) to prevent recurrence
• Reviews and analyzes incident management data to pinpoint the problems that will yield the best results once resolved.
Successful Problem Investigations
First, understand:
• Why do we have problem investigations? An investigation should be conducted to diagnose the root cause of the problem.
• How long should it take? The speed and nature of the investigation will vary with the impact, severity, and urgency of the problem.
• What resources are required? The level of resources and expertise applied to finding a resolution should correspond to the problem's priority and the targeted service levels.
Then, use your problem investigation toolkit. Many problem analysis, diagnosis, and solving techniques are available, and much research has been done in this area.
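The priority that drives investigation speed and resourcing is often derived from impact and urgency. The sketch below is a minimal illustration, assuming a hypothetical 3×3 matrix where both impact and urgency are rated 1 (high) to 3 (low) and severity has already been folded into impact; the name `problem_priority` and the 1 to 5 priority scale are illustrative, not a standard any particular tool uses.

```python
# Hypothetical priority matrix: impact and urgency are each rated
# 1 (high), 2 (medium), or 3 (low). Priority runs from 1 (critical)
# to 5 (planning); a lower number means investigate sooner and with
# more resources.
PRIORITY_MATRIX = {
    (impact, urgency): impact + urgency - 1
    for impact in (1, 2, 3)
    for urgency in (1, 2, 3)
}


def problem_priority(impact: int, urgency: int) -> int:
    """Look up the priority for a problem from its impact and urgency."""
    try:
        return PRIORITY_MATRIX[(impact, urgency)]
    except KeyError:
        raise ValueError("impact and urgency must each be 1, 2, or 3") from None


if __name__ == "__main__":
    # A high-impact but low-urgency problem lands mid-scale (priority 3).
    print(problem_priority(impact=1, urgency=3))
```

Keeping the matrix as a lookup table, rather than calling the formula directly, makes it easy to override individual cells to match your own service level targets.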
Successful Problem Investigations
Some of the most useful and frequently used techniques include:
• Chronological analysis: build a timeline of events.
• Pain Value Analysis: assess what level of pain these problems have caused the organization/business.
• Kepner-Tregoe: deeper-rooted problems.
• Cause Mapping: deeper-rooted problems.
• 5 Whys: cause and effect.
• Brainstorming: gather the relevant people together and brainstorm the problem.
• Ishikawa diagrams: document causes and effects to help identify where something may be going wrong or could be improved.
• Pareto analysis: separate the important potential causes from the more trivial issues.
Use what is appropriate and what you feel comfortable with.
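Of the techniques listed, Pareto analysis is the most mechanical, so it lends itself to a short sketch. This assumes each incident has already been tagged with a probable cause (the cause labels below are made up) and uses the common 80% cutoff; the function name and the cutoff are illustrative choices.

```python
from collections import Counter


def pareto_vital_few(causes, cutoff=0.8):
    """Return (cause, count, cumulative_share) rows, most frequent first,
    stopping once the listed causes account for at least `cutoff` of all
    incidents. `causes` holds one probable-cause label per incident."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        share = running / total
        rows.append((cause, count, round(share, 3)))
        if share >= cutoff:
            break  # the remaining causes are the "trivial many"
    return rows


if __name__ == "__main__":
    # Hypothetical cause labels for 20 incidents.
    incidents = (["disk full"] * 12 + ["expired certificate"] * 5
                 + ["config drift"] * 2 + ["network blip"])
    for cause, count, share in pareto_vital_few(incidents):
        print(f"{cause:<20} {count:>3}  cumulative {share:.0%}")
```

Here two causes cover 85% of the incidents, so they are where investigation effort should go first.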
Successful Problem Investigations
End results:
• Expected and desired outcome realized
• Root cause(s) identified and/or validated
• Corrective measure(s) identified and/or implemented
• Effective use of resources throughout the investigation
Which means increased benefits to the business and the IT organization:
• Decreased downtime
• Increased business satisfaction
• Fewer IT resources spent on incident management
Other benefits:
• Influences future cost avoidance
• CMDB
• Improved IT service quality
• Incident volume reduction
• Permanent solutions
• Improved organizational learning
• Better first-time fix rate at the Service Desk
• Improves existing processes and procedures
• Happy staff, including the Problem Manager!
Root Cause [diagram slides not reproduced]
Kepner-Tregoe offers a method called Incident Mapping that performs a similar analysis.
What are the traits of a Problem Manager?
• Listening
  • Ability to listen
  • Attention to detail… while listening
• Questioning
  • Open questions… to allow the story to flow
  • Closed questions… to confirm facts/details
  • Ability to ask tough questions and not be sidetracked by misdirection
• Leadership
  • Ability to lead a team, resolve conflict, and drive resolution
• Prioritization with a focus on business, not technical, impact
• Strong organization & time management abilities
• Business writing skills
And…
• Understanding of business terminology and concepts
• Understanding of basic technical concepts, architecture, and methodologies
Helpful educational opportunities?
• Dale Carnegie
• Kepner-Tregoe: Problem Solving & Decision Making; Incident Mapping
• ThinkReliability: Cause Mapping
• FranklinCovey: Focus
• General: Business Writing
Proactive monitoring of automated alerts for trends/patterns (Jorge Wong)
Alerts and monitoring, why?
• Identify future problems.
• Prevent problems from happening.
• Manage the technology infrastructure based on business needs.
• Anticipate and meet the needs of the business.
• Effectively manage an increasingly intricate and complex infrastructure.
• Predict and solve problems before they affect the business.
• According to industry analyst reports, IT still discovers about 70% of problems through the service desk.
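One way to predict problems before they affect the business is to fit a trend to a monitored metric and estimate when it will cross an alert threshold. The sketch below fits an ordinary least-squares line to daily disk-usage samples; the linear-growth assumption, the sample data, and the function name `days_until_threshold` are all illustrative, and real usage often grows in steps rather than in a straight line.

```python
def days_until_threshold(samples, threshold):
    """Fit usage = slope * day + intercept by ordinary least squares and
    return how many days after the latest sample the fitted line reaches
    `threshold`. Returns None if usage is flat or falling.

    samples: list of (day, percent_used) pairs, e.g. daily disk readings.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a trend")
    days = [d for d, _ in samples]
    usage = [u for _, u in samples]
    mean_d = sum(days) / len(days)
    mean_u = sum(usage) / len(usage)
    s_dd = sum((d - mean_d) ** 2 for d in days)
    if s_dd == 0:
        raise ValueError("samples must cover more than one distinct day")
    s_du = sum((d - mean_d) * (u - mean_u) for d, u in zip(days, usage))
    slope = s_du / s_dd
    intercept = mean_u - slope * mean_d
    if slope <= 0:
        return None  # not trending toward the threshold
    day_reached = (threshold - intercept) / slope
    # If the fitted line is already past the threshold, report 0 days left.
    return max(0.0, day_reached - max(days))


if __name__ == "__main__":
    # Hypothetical readings: disk usage grows about 2 points per day.
    readings = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
    print(days_until_threshold(readings, threshold=80.0))  # 12.0
```

With these readings the fitted line reaches 80% twelve days after the last sample, which leaves time to act before a threshold alert ever fires.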
Alerts and monitoring, why?
Reactive to proactive:
• End-user experience
• Application performance and availability
• Service level commitments
• Outages
• Cost avoidance
• Resources
• Productivity
• Efficiency
• Capacity
• Predictive analytics
• MTTR (mean time to repair)
• MTBF (mean time between failures)
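MTTR and MTBF can be computed directly from outage records. This sketch assumes one simple definition (MTTR as total downtime divided by the number of outages, MTBF as total uptime in the reporting window divided by the number of outages) and non-overlapping outages; organizations define these metrics slightly differently, so check yours before comparing numbers.

```python
from datetime import datetime, timedelta


def mttr_mtbf(outages, window_start, window_end):
    """Return (MTTR, MTBF) as timedeltas for one reporting window.

    outages: list of (start, end) datetimes, assumed non-overlapping and
    inside the window.
    MTTR = total downtime / number of outages
    MTBF = total uptime in the window / number of outages
    """
    if not outages:
        raise ValueError("no outages in the window; MTTR and MTBF are undefined")
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = (window_end - window_start) - downtime
    return downtime / len(outages), uptime / len(outages)


if __name__ == "__main__":
    # Hypothetical 30-day window with two outages (2 hours and 1 hour).
    outages = [
        (datetime(2012, 1, 5, 10, 0), datetime(2012, 1, 5, 12, 0)),
        (datetime(2012, 1, 20, 8, 0), datetime(2012, 1, 20, 9, 0)),
    ]
    mttr, mtbf = mttr_mtbf(outages, datetime(2012, 1, 1), datetime(2012, 1, 31))
    print("MTTR:", mttr)  # 1:30:00
    print("MTBF:", mtbf)  # 14 days, 22:30:00
```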
Alerts and monitoring, what?
• Demand
• Capacity
• Availability
• KPIs
• Logs
• Services
• Network
• Servers
• User defined
Monitoring and instant alerts:
• Monitor the Windows Event log
• Alert on hardware and software changes
• Alert on specific file changes and protection violations
• Know if disk space is running low on computers
• Monitor computer online/offline status
• Know if a server goes down
• Know when traveling users with notebooks connect
• Alert message and recipient configuration
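A minimal sketch of threshold-based alerting with per-rule recipients, using the 90% CPU and 80% disk example values from this deck. The metric names, host names, and recipient groups are hypothetical; in practice a monitoring tool would supply the readings and deliver the messages.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AlertRule:
    metric: str                  # hypothetical metric name, e.g. "cpu_percent"
    threshold: float             # alert when the reading is at or above this
    recipients: Tuple[str, ...]  # hypothetical recipient groups


def evaluate(readings: Dict[str, Dict[str, float]],
             rules: List[AlertRule]) -> List[Tuple[str, Tuple[str, ...]]]:
    """Check every host's latest readings against every rule and return
    (message, recipients) pairs for each breach, hosts in name order."""
    alerts = []
    for host, metrics in sorted(readings.items()):
        for rule in rules:
            value = metrics.get(rule.metric)
            if value is not None and value >= rule.threshold:
                message = (f"{host}: {rule.metric} at {value:g}% "
                           f"(threshold {rule.threshold:g}%)")
                alerts.append((message, rule.recipients))
    return alerts


if __name__ == "__main__":
    rules = [
        AlertRule("cpu_percent", 90, ("ops-oncall",)),
        AlertRule("disk_percent", 80, ("ops-oncall", "problem-manager")),
    ]
    readings = {
        "db01": {"cpu_percent": 30, "disk_percent": 85},
        "app01": {"cpu_percent": 95, "disk_percent": 40},
    }
    for message, recipients in evaluate(readings, rules):
        print(message, "->", ", ".join(recipients))
```

Routing disk alerts to the Problem Manager as well as on-call staff is one illustrative way to make recurring capacity alerts visible for trend analysis.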
Alerts and monitoring, what?
Proactive approach:
• If a server's utilization exceeds a predefined percentage of its total available capacity, raise an alert! For example, server CPU breaches 90% utilization, or a disk becomes 80% full.
Food for thought:
• What happens when a server goes down? Alarms, alerts, and notifications are triggered all over the place. The application, database, and operating system may all appear to be down, yet this behavior may be due to a single point of failure elsewhere in the network.
• What is the problem?
• What is the impact?
• What are the root cause(s)?
• What are the workaround(s) and resolution(s)?
• Or… should we even be worried about it?
Problem Management categories:
• Reactive
• Proactive
• Predictive intelligence?
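The single-point-of-failure scenario is why alert correlation matters. A simple sketch, assuming you have a map of which components depend on which (for example, from a CMDB): among all components currently alerting, report only those with no alerting upstream dependency. The component names and dependency map are hypothetical, and the output is a starting point for the investigation, not a confirmed root cause.

```python
def probable_root_components(alerting, depends_on):
    """Return the alerting components that have no alerting upstream
    dependency, i.e. the most upstream failures.

    alerting:   set of component names currently raising alerts
    depends_on: {component: [components it directly depends on]},
                assumed to be acyclic
    """
    def upstream(component):
        # Collect every direct and indirect dependency of `component`.
        seen, stack = set(), list(depends_on.get(component, []))
        while stack:
            dep = stack.pop()
            if dep not in seen:
                seen.add(dep)
                stack.extend(depends_on.get(dep, []))
        return seen

    return sorted(c for c in alerting if not (upstream(c) & alerting))


if __name__ == "__main__":
    # Hypothetical map: app and database servers share one network switch.
    depends_on = {
        "billing-app": ["billing-db", "app-server-os"],
        "billing-db": ["db-server-os"],
        "app-server-os": ["core-switch-2"],
        "db-server-os": ["core-switch-2"],
    }
    alerting = {"billing-app", "billing-db", "app-server-os",
                "db-server-os", "core-switch-2"}
    print(probable_root_components(alerting, depends_on))  # ['core-switch-2']
```

Five alerts collapse to one candidate, which answers "What is the problem?" faster than triaging each alert separately.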
Next Steps
Future sessions:
• Problem Management Practitioner Forums 2012: January 19, March 15, and June 7 (9 a.m. to noon), followed by a casual lunch
• Change Management Practitioner Forum 2012: April 12 (9 a.m. to noon) <Tentative>
• Business Analyst World Conference 2012: May 7, 8, & 9
Practitioner Forums 2012:
• Looking for subject ideas, e.g., Configuration Management and Service Level Management
• Looking for thought leaders and interested participants
IT Problem Management
What is a Problem? A cause of one or more Incidents. The cause is not usually known at the time a Problem Record is created.
What is Problem Management? The objective of Problem Management is to resolve the root cause of Incidents and to prevent the recurrence of Incidents related to these errors.
What does a Problem Manager do? The Problem Manager is responsible for managing the lifecycle of all Problems. The Problem Manager researches the root causes of Incidents and thereby ensures that interruptions are permanently eliminated. The role's primary objectives are to prevent Incidents from happening and to minimize the impact of Incidents that cannot be prevented.
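The definitions above translate naturally into a record that links one problem to many incidents and starts without a known root cause. The status values (`open`, `known_error`, `resolved`) and method names are assumptions borrowed from common ITIL usage, not something this deck prescribes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    """A cause of one or more incidents. The root cause is usually unknown
    when the record is created, so it starts as None."""
    problem_id: str
    summary: str
    incident_ids: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None
    workaround: Optional[str] = None
    status: str = "open"  # hypothetical lifecycle: open -> known_error -> resolved

    def link_incident(self, incident_id: str) -> None:
        """Associate another incident with this problem (ignoring duplicates)."""
        if incident_id not in self.incident_ids:
            self.incident_ids.append(incident_id)

    def record_root_cause(self, cause: str, workaround: Optional[str] = None) -> None:
        """Capture the diagnosed root cause and any workaround."""
        self.root_cause = cause
        self.workaround = workaround
        self.status = "known_error"

    def resolve(self) -> None:
        """Close out the problem once a permanent fix is in place."""
        if self.root_cause is None:
            raise ValueError("identify the root cause before resolving the problem")
        self.status = "resolved"


if __name__ == "__main__":
    problem = Problem("PRB-001", "Nightly batch job fails intermittently")  # hypothetical IDs
    for incident in ("INC-101", "INC-117", "INC-101"):
        problem.link_incident(incident)
    problem.record_root_cause("Job overlaps with backup window",
                              workaround="Rerun the job manually after backup")
    problem.resolve()
    print(problem.status, problem.incident_ids)
```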
What is Root Cause Analysis?
A standard process of:
• Identifying a problem
• Containing and analyzing the problem
• Defining the root cause
• Defining and implementing the actions required to eliminate the root cause
• Validating that the corrective action prevented recurrence of the problem
Key questions:
• What happened?
• What were the root causes of the problem?
• What internal options are available to deal with the problem?
• What is the cost of acting upon the available options?
• Which decision options will provide the most cost-effective solution?
[Cycle diagram labels: Identify Problem, Identify Team, Plan Immediate Action, Root Cause, Action Plan, Complete Plan, Follow Up, Validate]
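The last question, which option is most cost-effective, can be framed as a net-benefit comparison. This sketch assumes you can estimate a one-off implementation cost and the annual incident cost each option avoids; the option names and figures are invented, and "net benefit over a fixed horizon" is only one of several reasonable ways to compare options.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Option:
    name: str
    implementation_cost: float  # estimated one-off cost of the fix
    annual_cost_avoided: float  # estimated incident cost avoided per year


def net_benefit(option: Option, years: float) -> float:
    """Cost avoided over the horizon minus the cost of implementing."""
    return option.annual_cost_avoided * years - option.implementation_cost


def most_cost_effective(options: List[Option], years: float = 1.0) -> Option:
    """Return the option with the highest net benefit over `years`."""
    if not options:
        raise ValueError("no options to compare")
    return max(options, key=lambda o: net_benefit(o, years))


if __name__ == "__main__":
    # Hypothetical options and estimates.
    options = [
        Option("Add disk-space monitoring", 2_000, 15_000),
        Option("Replace storage array", 80_000, 40_000),
    ]
    for horizon in (1, 5):
        best = most_cost_effective(options, years=horizon)
        print(f"{horizon}-year horizon: {best.name}")
```

The answer flips between a 1-year and a 5-year horizon, which is why the horizon (and the estimates themselves) should be agreed with the business before comparing options.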
At a high level, a problem investigation looks at:
• What were we doing? (Before the Major Incident or Incident)
• What was the problem?
• Why did it happen?
• What should be done?
• What will we be doing now? (After the Problem Investigation)
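Answering "What were we doing?" usually starts with chronological analysis: merging events from logs, tickets, and change records into one ordered timeline. A minimal sketch, assuming each event is a (timestamp, description) pair; the events and the change ID below are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple


def build_timeline(events: List[Tuple[datetime, str]],
                   incident_start: datetime) -> List[Tuple[str, str, str]]:
    """Sort events chronologically and tag each one 'before' or 'after'
    relative to when the incident started (events at the start time count
    as 'after')."""
    timeline = []
    for timestamp, description in sorted(events, key=lambda e: e[0]):
        phase = "before" if timestamp < incident_start else "after"
        timeline.append((timestamp.isoformat(timespec="minutes"), phase, description))
    return timeline


if __name__ == "__main__":
    # Hypothetical events gathered from monitoring, change records, and the service desk.
    events = [
        (datetime(2012, 1, 10, 2, 40), "Service desk logs first user call"),
        (datetime(2012, 1, 10, 1, 30), "Nightly backup starts"),
        (datetime(2012, 1, 10, 2, 15), "Change CHG-042 applied: storage firmware upgrade"),
    ]
    for when, phase, what in build_timeline(events, datetime(2012, 1, 10, 2, 30)):
        print(when, phase.ljust(6), what)
```

Putting change records on the same timeline as incident events is a simple way to surface the impact of Change Management on problems.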