Setting the Standard for DR. John Pollard – 23 March 2006. PAS 77 – Guide to IT Service Continuity Management. PAS 56 Guide to Business Continuity Management. Business Continuity Management. RISK MANAGEMENT. IT DISASTER RECOVERY. FACILITIES MANAGEMENT. SUPPLY CHAIN MANAGEMENT.
Setting the Standard for DR John Pollard – 23 March 2006 PAS 77 – Guide to IT Service Continuity Management
PAS 56 Guide to Business Continuity Management • Business Continuity Management: • Risk Management • IT Disaster Recovery • Facilities Management • Supply Chain Management • Quality Management • Health & Safety • Knowledge Management • Emergency Management • Security • Crisis Communications & PR * Source: PAS 56:2003 Guide to Business Continuity Management
IT Service Continuity Management … managing an organisation’s ability to continue to provide a pre-determined and agreed level of IT Services to support the minimum business requirements … * Source: ITIL: Best Practice for Service Delivery
Threats • Loss, damage or denial of access to key infrastructure services • Failure or non-performance of third parties • Loss or corruption of key information • Sabotage, extortion or industrial espionage • Infiltration or attack on critical information systems
Scope • Generic framework and guidelines for a continuity programme, including: • Management structure & responsibilities • How to conduct business criticality & risk assessments • How to define and create an IT Service Continuity plan • How to rehearse an IT Service Continuity plan • Solution architectures and design considerations
What is a PAS? * Source: BSI
Status [Timeline figure, Q4 2004 to Q4 2006. Milestones: Group formed, First draft, External review, Expected release. Activities: Contracts / Structure / Content, Edit, Revise]
ITSC Strategy • Define direction and high-level methods to meet IT service level objectives • Agreed at Board level • Needs to consider the 4 stages of a major incident: initial response; service recovery; service delivery (following incident); normal service resumption • Enable rehearsal of a major incident
ITSC Strategy & Plan Business Strategy Threat Analysis Business Criticality IT Service Continuity Strategy IT Architecture IT Service Continuity Plan Rehearsals Costs Processes
Maintaining an ITSC Strategy Monitor
Management Structure • Crisis Management Team (CMT) • Business Continuity Management Team (BCMT) • Incident Management Team (IMT)
Business Criticality & Risk Assessments • Identify business units & processes • Categorise criticality of processes • Identify IT services supporting the business processes • Categorise criticality of IT services • Review • By location • By business unit
Business Criticality Categories • Critical: vital to day-to-day operation • Mandatory: vital to meet statutory requirements • Strategic: important for implementation of long-term strategy • Tactical: important for short/medium-term objectives
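The assessment steps and categories above can be sketched in code. This is a minimal illustration, not something taken from PAS 77: it assumes the four categories are ranked in the order the slide lists them, and that an IT service inherits the most severe category of any business process it supports. The process and service names are hypothetical.

```python
from enum import IntEnum


class Criticality(IntEnum):
    # Assumption: lower value = more severe, following the slide's listing order.
    CRITICAL = 1    # vital to day-to-day operation
    MANDATORY = 2   # vital to meet statutory requirements
    STRATEGIC = 3   # important for long-term strategy
    TACTICAL = 4    # important for short/medium-term objectives


def service_criticality(process_levels, service_to_processes):
    """Give each IT service the most severe category among the processes it supports."""
    return {
        service: min(process_levels[name] for name in processes)
        for service, processes in service_to_processes.items()
    }


# Hypothetical business processes and the IT services that support them.
process_levels = {
    "payments": Criticality.CRITICAL,
    "regulatory_reporting": Criticality.MANDATORY,
    "market_research": Criticality.STRATEGIC,
}
service_to_processes = {
    "core_database": ["payments", "regulatory_reporting"],
    "data_warehouse": ["regulatory_reporting", "market_research"],
    "intranet_wiki": ["market_research"],
}

levels = service_criticality(process_levels, service_to_processes)
for service, level in sorted(levels.items(), key=lambda item: item[1]):
    print(f"{service}: {level.name}")
```

Sorting by category gives a simple recovery priority list; a real assessment would also be reviewed by location and by business unit, as the previous slide notes.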
Risk Assessment Process Learn Lessons
ITSC Plan • Part of wider BCM Plan • Model plan should include: • Initial response • Incident assessment • Roles & responsibilities • Procedures • Rehearsing the plan • Maintaining the plan
Recovery Objectives • Recovery Point Objective (RPO): the point in time to which work is restored, e.g. start of day • Recovery Time Objective (RTO): the time required to recover service
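One way to check a rehearsal or incident against these objectives is sketched below. It assumes the RPO is expressed as the maximum tolerable age of the restored data and the RTO as the maximum tolerable downtime; the function name, timestamps and targets are hypothetical.

```python
from datetime import datetime, timedelta


def check_objectives(incident_at, recovery_point, restored_at, rpo, rto):
    """Compare the achieved data-loss window and downtime with the RPO and RTO."""
    data_loss_window = incident_at - recovery_point  # work done after this point is lost
    downtime = restored_at - incident_at             # time taken to recover service
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: data restored to start of day, service back after 3.5 hours.
result = check_objectives(
    incident_at=datetime(2006, 3, 23, 14, 30),
    recovery_point=datetime(2006, 3, 23, 8, 0),
    restored_at=datetime(2006, 3, 23, 18, 0),
    rpo=timedelta(hours=8),
    rto=timedelta(hours=4),
)
print(result)
```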
IT Architecture – Resilience Considerations • Location & distance between sites • Number of sites • Staff access & proximity • Remote access • Dark site vs. manned site • Staff skill levels • Telecoms connectivity and redundant routing • Automation required • Telephony and email • 3rd party / external links
Rehearsal • A body to control & coordinate • Objectives & success criteria • Rehearsal plan & scripts • Staff briefing • Logs and critique forms • Observers • Post-rehearsal review
Areas to Rehearse • Callout • Walk through reviews • Walk through exercises • Component rehearsals • Integration rehearsals • Relocation rehearsals • Failover rehearsals • Major incident simulations
Site Models • Active / Contingency: cold site • Active / Active: service runs from both sites • Active / Alternate: service can run from either site • Active / Backup: warm standby site • Multi-site and other hybrids
Data Resilience [Diagram layers: App, Tape/backup, Database, Application, Host, Storage Array, SAN]
Replication Modes • Synchronous: increased write latency; typically OK for OLTP; may impact batch processing; requires greater inter-site bandwidth than other options • Snapshot: point-in-time copy; only valid on completion of transfer; minimal/no performance impact • Near real-time: frequent snapshots; minimal performance impact
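The data-loss exposure of each mode can be estimated with a rough sketch. The assumptions are mine, not the deck's: synchronous replication loses no acknowledged writes, and because a snapshot is only valid once its transfer completes, the newest usable copy at the remote site can be as old as one snapshot interval plus one transfer time. The function and mode names are hypothetical.

```python
def worst_case_data_loss_min(mode, snapshot_interval_min=0.0, transfer_min=0.0):
    """Estimate the worst-case age, in minutes, of the newest usable remote copy."""
    if mode == "synchronous":
        # Assumption: a write is acknowledged only after both arrays hold it.
        return 0.0
    if mode in ("snapshot", "near_real_time"):
        # A snapshot is unusable until fully transferred, so just before the next
        # one lands, the valid copy is one interval plus one transfer time old.
        return snapshot_interval_min + transfer_min
    raise ValueError(f"unknown replication mode: {mode}")


# Hypothetical settings: hourly snapshots versus snapshots every 5 minutes.
hourly = worst_case_data_loss_min("snapshot", snapshot_interval_min=60, transfer_min=15)
frequent = worst_case_data_loss_min("near_real_time", snapshot_interval_min=5, transfer_min=1)
print(hourly, frequent)
```

Comparing these figures with the agreed RPO is one way to choose between modes before weighing their latency and bandwidth costs.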
A Holistic Approach • Service Continuity: Technology, People, Processes • Service Continuity is much more than technology
Defining the Standard for DR Part II - Workshop John Pollard – Unisys PAS 77 – Guide to IT Service Continuity Management
Typical Challenges • Tape recovery slow • Manual build is complex • Complex inter-operation between systems • Difficult to define critical and non-critical • Management of failover site • Keeping sites in step • Windows Servers
Synchronous Write Latency [Diagram: Server, Storage Array, Communication link, Storage Array] • Write 1 ≈ 0.5 ms • Write 2 ≈ 0.5 ms • Latency = 2 × Write Time + Transfer Time • For 200 kilometres using Fibre Channel: Latency = 2 × 0.5 + 4.0 = 5.0 ms
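The latency formula on this slide can be written as a small calculation. The per-kilometre transfer figure below is derived from the slide's own example (4.0 ms for 200 km) and assumes transfer time scales linearly with distance, which is a simplification; the function names are hypothetical.

```python
# Derived from the slide's example: 4.0 ms transfer time over 200 km.
# Assumption: transfer time grows linearly with distance.
SLIDE_TRANSFER_MS_PER_KM = 4.0 / 200.0


def transfer_time_ms(distance_km, ms_per_km=SLIDE_TRANSFER_MS_PER_KM):
    """Estimate inter-site transfer time for a given link distance."""
    return distance_km * ms_per_km


def sync_write_latency_ms(write_ms, transfer_ms):
    """Latency = 2 * Write Time + Transfer Time (one write on each storage array)."""
    return 2 * write_ms + transfer_ms


latency_200km = sync_write_latency_ms(0.5, transfer_time_ms(200))
latency_50km = sync_write_latency_ms(0.5, transfer_time_ms(50))
print(f"200 km: {latency_200km:.1f} ms, 50 km: {latency_50km:.1f} ms")
```

Under these assumptions the transfer term dominates at 200 km, which is why site distance appears among the resilience considerations earlier in the deck.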
Site Synchronisation • Major challenge • Cultural change is needed • Critical to successful operation • DR systems • Build at recovery time • Slow / complex recovery • Maintain ready to use • How to validate changes • Live run • System dependent
Windows Servers • Build DR servers at recovery time: lengthy recovery process; prone to errors; complex, requires higher skill level • Maintain DR servers ready to use: HW does not have to be identical; complex SW change and configuration management; how to validate releases • Boot servers from storage array: requires matching HW; SW only installed once; simplifies SW change and configuration management; simplifies failover process / improves recovery
Windows Boot from SAN [Diagram. Production Site: Live Server, Test Server, Storage Array holding Live Data, Live OS, Test Data, Test OS. DR Site: DR Server, Storage Array holding Data, OS]
Virtualisation • Reduced investment: fewer servers dedicated for resilience; expand/replace if long-term outage • Flexibility: allocate/use servers as required • Potentially reduced capacity: depending on system and scale of incident • Configuration may not have been proved
Service Management Identify Affected Areas • Service Desk • Incident Management • Problem Management • Configuration Management • Change Management • Release Management • Testing
Operational Assessment • Understand people and process • Gap analysis
Delivery Approach Discover Model Design Implement Manage • Business Objectives • Current Issues or Problems • Existing/Target Infrastructure • Success Criteria • Vision • Existing Systems, Applications & Services • Physical ‘As-Is’ Model • Logical ‘As-Is’ Model • Data profiling • Security assessment • ‘To-Be’ Logical Model • ‘To-Be’ Physical Model • Project plan • Resource schedule • Develop business case • Implement target environment • Migrate and consolidate applications • Application and middleware integration • Define and implement test strategy • Operational assessment & gap analysis • Implement operational & management processes
Workshop • Determine high-level requirements • Determine Business Drivers • Determine Success Criteria • Overview systems and applications • Identify team members, sponsors, etc. • Agree timelines
Discovery (Servers, Storage, Networking) • Audit and map: hardware, software, services
Data Applications Services Group Systems Analysis
Design • Systems architecture • Operational assessment • Test environment • Project plan and resource schedule • Training requirements
Transition to Future State Operational Management Optimised Architecture Service Continuity Application Selection and Development Standards Data Centre Transformation Network Design Storage Design Training Requirements Systems Design Systems Management Migration Plan Test Environment and Strategy
Implementation • Methodology • Call on best practice • Operational management • Cultural change • Keep people informed