AAA NCNU INFORMATION TECHNOLOGY IT OPERATIONS Problem Management • Jim Heronime, Manager, ITSM Program • Tanya Friehauf-Dungca, Manager, Problem Management • 2/17/11
Agenda • PM Overview • History • Vision & Mission • Operational Level Agreement (OLA) • Action Items • Trending (Proactive Problem Management) • Facilitated Meetings (MIR & ToE) • KPIs and Metrics • Future Initiatives • Questions? Problem Management Team Members
Problem Management Overview • Main goal of Problem Management: • Detection of the underlying causes of an incident and the subsequent resolution and prevention of the incidents. • Problem Management ensures: • The identification and classification of problems, root cause analysis, and resolution of problems • Problem Management process also includes: • The formulation of recommendations for improvement, maintenance of problem records, and review of the status of corrective actions
History of PM at AAA • Began our formal Problem Management practice in 2008 • Tracked major incidents • Identified root cause for major incidents • Rudimentary MS Access database to store info • Began formal implementation of ITSM in June 2009 • Average rate of root cause found was 55.4% • Mean time to close problems = 6 days • Implemented the current iteration of Problem Management in October 2009. By January 2010: • Average rate of root cause found was 83% • Mean time to close problems = 3 days • We continue to mature our process
Vision and Mission • VISION: • To permanently eliminate problems in our production environment and prevent new problems from occurring • MISSION: • To aggressively identify root cause of problems and drive permanent solutions to stabilize our IT infrastructure • We do this by: • PROCESSES: Ensuring PM processes and procedures are followed by IT support teams • ACTION ITEMS: Managing assigned action items and their timeframes with support teams to drive permanent solutions • ROOT CAUSE: Driving root cause identification within OLA timeframes
OLAs for PM • Be aggressive: 3 business days to identify root cause • Report enables us to track daily progress
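As an illustration of the 3-business-day target, the following is a minimal sketch of how a root cause due date could be computed. It assumes a Monday to Friday work week and ignores holidays; the function names (`add_business_days`, `root_cause_due`, `within_ola`) are hypothetical and are not taken from the team's reporting tool.

```python
from datetime import date, timedelta

# Target from the OLA slide: 3 business days to identify root cause.
OLA_BUSINESS_DAYS = 3


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumption: business days are Monday to Friday; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def root_cause_due(opened: date) -> date:
    """Due date for root cause identification under the OLA."""
    return add_business_days(opened, OLA_BUSINESS_DAYS)


def within_ola(opened: date, root_cause_found: date) -> bool:
    """True if root cause was identified on or before the OLA due date."""
    return root_cause_found <= root_cause_due(opened)
```

For example, a problem opened on Thursday 2/17/11 would be due on Tuesday 2/22/11 under these assumptions, since the weekend does not count toward the three days.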
Action Items • Objective: • Action items are identified and assigned to drive permanent solutions • Types of Action Items: • Root cause identification for every problem created from an incident • Areas of improvement • Documentation • Process improvement & training • Vendor management • Hardware replacement • How are Action Items identified? • Incident management activities • Problem management activities – Root Cause Analysis • Meetings: Daily IT Operations Meeting, Major Incident Review (MIR), or Team of Experts (ToE) • How are they tracked? • Maximo – integrated system with Change, Incident, and Asset
Trend Analysis (Proactive Problem Management) • Objective: • Analyze related incidents for common root causes • Collaboration with Operations Bridge: • Weekly work sessions to identify potential areas of concern • The Problem Management team reviews related incidents to look for common symptoms, causes, or conditions • Commonalities identified by trend analysis? • A Global Problem record is created and assigned to the Service Owner with appropriately assigned action items • Service Owner analysis: • The Service Owner prioritizes their efforts • Determines how to identify root cause • Prioritizes and approves with the business for funding and scheduling
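The review for common symptoms described above can be sketched as a simple grouping step. This is only an illustration: the record fields (`id`, `service`, `symptom`) and the `min_count` threshold are assumptions, not the Maximo schema or the team's actual criteria.

```python
from collections import defaultdict


def find_trends(incidents, min_count=3):
    """Group incident IDs by (service, symptom).

    Returns only the groups with at least `min_count` incidents, as
    candidates for a Global Problem record. Field names are hypothetical.
    """
    groups = defaultdict(list)
    for incident in incidents:
        key = (incident["service"], incident["symptom"])
        groups[key].append(incident["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= min_count}


# Illustrative data only, not real incident records.
sample_incidents = [
    {"id": "INC1", "service": "email", "symptom": "login timeout"},
    {"id": "INC2", "service": "network", "symptom": "packet loss"},
    {"id": "INC3", "service": "email", "symptom": "login timeout"},
]
candidates = find_trends(sample_incidents, min_count=2)
```

Each key in `candidates` is a service and symptom pair that recurred often enough to be worth reviewing with the Service Owner.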
Major Incident Review (MIR) • What is it? • Evaluation of the incident process after a major incident • What's its purpose? • Validate details of the incident record • Review incident handling – identify opportunities • Identify lessons learned – share across the enterprise • Identify action items • When is one required? • Mandated for all Severity 1 incidents • Lower severities by request or as needed • Why does Problem Management facilitate a Major Incident Review? • Unbiased view of events – no call involvement
Team of Experts (ToE) • What is it? • A special team of technical subject matter experts (SMEs) assembled to analyze and resolve critical problems at an accelerated pace to minimize or eliminate exposure. • How long has this process been in place? • This is one of our newest additions – since December 2010 • Why are ToEs initiated? • Teams not collaboratively engaging each other • Need to identify root cause immediately – back to back incidents • Leadership’s request for information and status of critical or chronic problems
ToE (cont.) • ToE Activities • Root cause analysis • Brainstorm solutions and permanent fixes • Assign action items and due dates • Where’s the template? • Currently under construction
KPIs and Metrics • KPIs • Root cause identified within OLA • MIRs conducted for Sev1 Incidents • Operational Metrics • Total Problems by Severity • Problems by Causing Party • Outages by Domain (Applications, Network, Security, Servers, Telecom or Other)
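A minimal sketch of how the "Root cause identified within OLA" KPI and the "Total Problems by Severity" metric could be calculated from a list of problem records. The field names (`severity`, `root_cause_within_ola`) and the sample records are assumptions for illustration, not the team's data or reporting logic.

```python
from collections import Counter


def pct_root_cause_within_ola(problems):
    """Percentage of problems whose root cause was identified within the OLA.

    Returns None when there are no problems, to avoid dividing by zero.
    """
    if not problems:
        return None
    hits = sum(1 for p in problems if p["root_cause_within_ola"])
    return round(100.0 * hits / len(problems), 1)


def problems_by_severity(problems):
    """Count of problems per severity level."""
    return Counter(p["severity"] for p in problems)


# Illustrative records only, not real data.
sample_problems = [
    {"severity": 1, "root_cause_within_ola": True},
    {"severity": 2, "root_cause_within_ola": True},
    {"severity": 2, "root_cause_within_ola": False},
    {"severity": 3, "root_cause_within_ola": True},
    {"severity": 3, "root_cause_within_ola": True},
]
kpi = pct_root_cause_within_ola(sample_problems)
by_severity = problems_by_severity(sample_problems)
```

The resulting percentage can then be compared against the internal baseline, since no industry standard is available.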
KPIs *Baseline determined by internal historical data = 82% *Industry standards non-existent
KPI Details *2010 Average for RC Identified within OLA = 85.7%
Examples of Metrics *Change Freeze AT&T AAA NCNU
Future Initiatives • Workarounds and defects – Known Error Database • Action item validation – quality check on completed actions • ToE template development
Questions? • PROBLEM MANAGEMENT TEAM MEMBERS • Mark Hernandez - IT Service Transition Analyst V • Gessica Briggs-Sullivan – IT Service Transition Analyst III • Andrew Egan - Intern