The Challenge of 24/7 Operations: Strategies for Continuous Availability

Are We Ready for “24 by 7”? Dennis Cromwell, Michael Egolf. CUMREC 2001

What We Will Cover • The Challenge of “24 X 7” • Definitions- A common ground • The road to “high availability” • The road to “continuous operations” • The road to “continuous availability” • Business process steps to get there

What We Will NOT Cover • If you are in this session to learn about... • Detailed technical architectures to install, or how to manage High Availability (HA) and Fault Tolerance • “Speeds and feeds”, “slots and watts” • ...then you are in the WRONG session

The Challenge • Global economy, internet, increased reliance on computers, server consolidation... • Demand and increased expectations to make services and applications accessible to users more hours per day, more days per week. • Distance education, internet registration, etc. • Users are not tolerant of unscheduled interruptions in service. • Users are less tolerant of scheduled interruptions in service as well.

Have you ever been asked? • We need “24 x 7”, so we can just buy another server, or this fault-tolerant stuff, right? • We need “24 x 7”, so the system techies can take care of this, right? • We need “24 x 7”, so can’t YOU just change our batch schedule around, or do some special DBA things? • If we just bought the GizmoTech Belchfire Z-9000HA cluster, that would give us “24 X 7”, right? • Why can’t you people in IT do “24 X 7” like everyone else on the internet? It’s an IT problem, right?

The Reality Is... • Achieving “24 x 7” requires a multi-dimensional strategy. • Cannot be bought off the shelf. • Requires substantial levels of cross-organizational people, process planning, discipline, and control (unlike most IT projects). • Is expensive. • Fewer than 20% will achieve it by 2005. (GartnerGroup) 1

Definitions • Reliability • Performance • Schedule of Operations • Availability • High Availability • Continuous Operations • Continuous Availability • Fault tolerance

Reliability • The extent to which the application/service provides the same results on repeated trials. • Provides consistent, correct results. • Not the same as Availability.

Performance • The amount of elapsed time for the service or application to provide the information or result to the end user. • Performance is not an absolute measure; it is relative to the end user. • Performance is acceptable if it allows the end user to be productive in his/her work. Ask; they’ll tell you.

Schedule of Operations (scheduled time) • The negotiated, agreed-to, and published schedule (days, hours per day), that the application or service is to be accessible by the customer. • Not all applications are scheduled to be accessible 24 hours per day, or 7 days per week, nor need they be. • Thought: Should the service be accessible outside the schedule, even though it could be?

Availability • “The percent of the time that the application or service is actually accessible by the customer, within the schedule of operations.” • “The proportion of time that a system can be used for productive work.” • Availability implies reliability and acceptable performance.
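The first definition above can be written as a small calculation. This is a minimal sketch, not the presenters' method: the function name `availability_percent` and its parameters are illustrative assumptions, and it counts only unscheduled downtime that falls inside the schedule of operations.

```python
def availability_percent(scheduled_minutes: float,
                         unscheduled_down_minutes: float) -> float:
    """Percent of scheduled time the service was actually accessible."""
    if scheduled_minutes <= 0:
        raise ValueError("scheduled_minutes must be positive")
    if not 0 <= unscheduled_down_minutes <= scheduled_minutes:
        raise ValueError("downtime must lie within the scheduled time")
    up_minutes = scheduled_minutes - unscheduled_down_minutes
    return 100.0 * up_minutes / scheduled_minutes


# Example schedule: 6 am to 9 pm, 6 days/week, 52 weeks/year.
scheduled = 15 * 6 * 52 * 60          # scheduled minutes per year
print(f"{availability_percent(scheduled, 280.8):.3f}%")
```

Downtime outside the published schedule does not enter the calculation, which is why the schedule has to be agreed on before availability can be measured.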

Availability Chart: Based on 24 X 7 X 52 Schedule
Availability   Unscheduled Down Time per Year
Percent        Minutes (rounded)   Minutes   Hours
90.0%          50,000              52,560    876
99.0%          5,000               5,256     87.6
99.9%          500                 525       8.76
99.99%         50                  52        .87
99.999%        5                   5         .08
99.9999%       .5                  .5        .008
99.99999%      .05                 .05       .0008

Availability Chart: For a schedule of 6 am – 9 pm, 6 days/week
Availability   Unscheduled Down Time per Year
Percent        Minutes   Hours
90%            28,080    468
99%            2,808     46.8
99.9%          280       4.68
99.99%         28        .46
99.999%        2         .046
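The figures in the two charts above follow from one multiplication: scheduled time multiplied by the fraction of time allowed to be down. The sketch below reproduces them; the function name is an assumption for illustration. The 24 X 7 chart's numbers match a 365-day year (8,760 hours), so the sketch takes scheduled hours per year as an input rather than deriving them from weeks.

```python
def allowed_downtime_minutes(scheduled_hours_per_year: float,
                             availability_percent: float) -> float:
    """Unscheduled downtime (minutes/year) permitted by an availability goal."""
    scheduled_minutes = scheduled_hours_per_year * 60
    return scheduled_minutes * (1 - availability_percent / 100)


schedules = {
    "24 x 7 (365 days)": 24 * 365,            # 8,760 hours
    "6 am - 9 pm, 6 days/week": 15 * 6 * 52,  # 4,680 hours
}

for name, hours in schedules.items():
    print(name)
    for pct in (90, 99, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(hours, pct)
        print(f"  {pct}%: {minutes:,.1f} min  ({minutes / 60:,.3f} h)")
```

The same availability percentage allows roughly half as much downtime on the shorter schedule, so a goal only means something once the schedule is fixed.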

High Availability • Gartner Group: “A highly available application provides user access to applications and data a high percentage (e.g. 99 percent or greater) of scheduled time, despite unscheduled events.” • IBM: “High Availability isn’t a specific technology but is instead a balanced solution that addresses the people, process, and technology issues for specific systems.”

High Availability Microsoft: • “One way to understand high availability is to contrast it with fault tolerance. These terms describe two different benchmarks measuring availability. Fault tolerance is defined as 100% availability 100% of the time, regardless of the circumstances. A fault tolerant system is designed to guarantee resource availability. • In contrast, a high-availability system is concerned with maximizing resource availability. A highly available resource is available a very high percentage of the time and may even approach 100% availability, but a small percentage of down time is acceptable and expected. • High availability can be defined as follows: A highly available resource is almost always operational and accessible to clients.”

High Availability • Unmanaged 90.0% • Managed 99.0% • Well Managed 99.9% • Fault-Tolerant 99.99% • High Availability 99.999% • Very High Availability 99.9999% • Ultra-Availability 99.99999% Source: Strategic Research Corp.

Measuring Availability • Most organizations do not currently measure end to end, but the need to do so will increase. • How do you really measure? • What if individual workstations are down? • What if some application components are down? • Weight by number of users? • Who and where are the users anyhow? • What’s good enough? Available to the network? • What’s acceptable? Who says? • User surveys can be a good source. • Thought: Are SLAs in higher-ed good, or a guarantee of failure?
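One possible answer to the “weight by number of users?” question is to count lost user-minutes instead of lost clock minutes. The sketch below is only an illustration of that idea: the `Outage` record, its fields, and the weighting rule are assumptions, not something the presentation prescribes.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float        # duration of the interruption, within the schedule
    users_affected: int   # users who could not reach the service


def user_weighted_availability(scheduled_minutes: float,
                               total_users: int,
                               outages: list[Outage]) -> float:
    """Availability in percent, weighting each outage by the users it hit."""
    possible_user_minutes = scheduled_minutes * total_users
    lost_user_minutes = sum(
        o.minutes * min(o.users_affected, total_users) for o in outages
    )
    return 100.0 * (1 - lost_user_minutes / possible_user_minutes)


# Hypothetical data: one full outage and one that hit half the users.
outages = [Outage(minutes=10, users_affected=100),
           Outage(minutes=20, users_affected=50)]
print(user_weighted_availability(1000, 100, outages))
```

Under this rule a 20-minute outage for half the users counts the same as a 10-minute outage for everyone. Whether that matches what users consider acceptable is the kind of question the slide raises.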

Continuous Operations • Architecting an application and the process components to schedule the application to allow user access during expanded hours, often 24 hours per day, 7 days per week, or near 24/7. • Continuous Operations does NOT imply high availability. • Addresses minimizing planned downtime.

Continuous Availability • The combination of High Availability and Continuous Operations. • If we schedule and enable the application to allow user access 24 x 7 (or near 24 x 7), and also want the application to achieve availability of 99% or greater, then this is continuous availability. • Addresses unplanned and planned downtime. • Probably closest to what is meant by “we need 24 X 7”.
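The three terms differ in which dimension they constrain: schedule, availability, or both. The sketch below encodes that distinction. The 99% threshold comes from the definitions above; the 148 hours/week cutoff for “near 24 x 7” is an assumption chosen for illustration, as are the function and parameter names.

```python
def classify(scheduled_hours_per_week: float,
             availability_percent: float,
             near_24x7_hours: float = 148.0,          # assumed cutoff
             high_availability_percent: float = 99.0) -> str:
    """Label a service by its schedule and its availability within it."""
    continuous_operations = scheduled_hours_per_week >= near_24x7_hours
    high_availability = availability_percent >= high_availability_percent
    if continuous_operations and high_availability:
        return "continuous availability"
    if continuous_operations:
        return "continuous operations"
    if high_availability:
        return "high availability"
    return "neither"


print(classify(92, 99.5))    # limited schedule, rarely down within it
print(classify(168, 97.0))   # scheduled around the clock, but often down
print(classify(160, 99.9))   # both conditions hold
```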

Availability Benchmark. Source: (GartnerGroup) 2
Level            Unplanned/yr   Planned/yr   Availability
Average          175+ hours     250+ hours   98%
Very Good        87 hours       200 hours    99%
Outstanding      43 hours       50 hours     99.5%
Best in Class*   9 hours        12 hours     99.9%
*Fewer than 5% achieve Best in Class
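The benchmark does not say which downtime its percentage column counts. A quick check suggests it reflects unplanned downtime only, measured against a 24 x 365 year. That reading is an inference from the numbers, not a statement from the source. The sketch below computes both readings so they can be compared; names are illustrative, and the “175+” and “250+” entries are treated as their lower bounds.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

# (unplanned hours/yr, planned hours/yr) as listed in the benchmark.
BENCHMARK = {
    "Average": (175, 250),       # source gives "175+" and "250+"
    "Very Good": (87, 200),
    "Outstanding": (43, 50),
    "Best in Class": (9, 12),
}


def availability_from_downtime(down_hours: float,
                               scheduled_hours: float = HOURS_PER_YEAR) -> float:
    """Availability in percent for a given amount of downtime."""
    return 100.0 * (1 - down_hours / scheduled_hours)


for level, (unplanned, planned) in BENCHMARK.items():
    unplanned_only = availability_from_downtime(unplanned)
    combined = availability_from_downtime(unplanned + planned)
    print(f"{level:14s} unplanned only: {unplanned_only:6.2f}%   "
          f"unplanned + planned: {combined:6.2f}%")
```

Comparing the two columns shows how much planned downtime changes the picture, which is the distinction between high availability and continuous availability.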

Fault Tolerance • The entire system and all the resources that are needed for an application to run must be duplicated. Cannot afford any downtime. • Aims to eliminate all planned and unplanned downtime. • As a result of this complete replication, fault-tolerant systems are much more expensive than highly available systems. • E.g., air traffic control, life support systems. • At IU, no strong business case apparent. • Thought: Are there any true business cases for Fault Tolerance in higher-ed?

The Road to High Availability • Simple: Reduce unplanned downtime!

The Road to High Availability • 80% or more of unplanned downtime is the result of People and Processes, NOT hardware or O/S failures…… • Application failures • Software failures, errors in configurations • Scheduling errors • Operator errors • Out of space conditions • Batch prevented OLTP from being available on time • Data corruption • Unexpected or unplanned volumes

The Road to High Availability To address the 80%, invest money/time in: • Staffing, Training • Change management • Problem management • Job scheduling, restart procedures • Intelligent event management, tuning • Application architecture • Function, regression, integration, load testing • Test and time recovery scenarios • Production readiness reviews, standards • Application planning, capacity planning

The Road to High Availability: some technology stuff... • Minimize SPOF (Single Point Of Failure) • Environmental, facilities, network • Web load balancers, redundant dispatchers • RAID: level 5/0/1, mirroring, striping • ECC data protection • On-site spares, hot-swappable parts • “HA” solutions, clustering, auto failover • Database replication, cloning • Oracle Parallel Server (OPS)
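The arithmetic behind removing a single point of failure can be sketched as follows. It rests on strong assumptions that are not in the presentation: the redundant copies fail independently and failover is instantaneous and flawless. Because most unplanned downtime comes from people and processes, real gains are usually smaller than this idealized figure.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` independent components when any one suffices.

    `component_availability` is a fraction between 0 and 1, not a percent.
    """
    if not 0 <= component_availability <= 1:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1 - (1 - component_availability) ** copies


for n in (1, 2, 3):
    print(n, f"{redundant_availability(0.99, n):.6f}")
```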

The Road to Continuous Operations (Expanding the Scheduled Time for accessibility) • Understand the application architecture and constraints. • Understand all application dependencies and interrelationships to needed components. • Reduce batch interference. • Confront the “backup problem”. • Hot backup strategies, cloning, SANs. • Manage other planned changes.

The Road to Continuous Operations Manage the Planned downtime: • Infrastructure and facility work • Hardware changes and upgrades • Operating system level changes • Database changes and releases • Application changes and releases- “release tolerance” a key item • Increased need for infrastructure test environments. To some this is new. • Common “maintenance windows” • Expect increased coordination, staff overhead

The Road to Continuous Availability • Application availability is dependent on design. • Transaction queuing, batch processing • Release tolerance, recovery • Set schedule and availability expectations early. • Have “some” functions up 24 x 7, not all. • Continuous availability costs about 3.5X as much as a standard application. (GartnerGroup) 1

The Common Maintenance Window • Applications are interrelated and integrated with others more than ever. • Shared infrastructure elements are more common. • Managing a maintenance window for each application can be exceedingly complex. • A common maintenance window for infrastructure activity can be beneficial. • Saves negotiating time, sets expectations

Putting it all together • Now that you know some definitions, lots of numbers, and components to address, how do you get started on the road to “24 X 7”? • The following Business Process Steps represent an approach for IU.

Step 1 Define the Problem • A problem well defined is a problem 80% solved. • For each application area, determine what the problem/goal is with the correct user representative(s). • Determine the schedule goal. • Separately, determine the availability goal. • Schedule and availability should be determined and designed in up front, just like any other application functional requirement. It’s more costly to retrofit.

Step 2 Categorize • Categorize the applications into groups. • For Example…. • Business Support Systems • Operational Support Systems • Self Service/E-Commerce • Management Support Systems

“Business Support” System • Mon-Fri: 6:00 a.m. to 10:00 p.m. EST • Sat: 6:00 a.m. to 6:00 p.m. EST • Sun: Normal maintenance window • Batch updates, data refreshes

“Operational Support” Systems • Round-the-clock operations, such as physical plant, security, hospitals • Near 24 x 7 schedule • Occasional Sunday morning maintenance • Monthly cold backups • Batch, backups non-disruptive to users • Accessible about 8700 hours/year • The most extended schedule

“Self Service/ E-Commerce” • Near 24 by 7 schedule • Can tolerate 1-2 hours down per night • Accessible from 148 to 156 hours per week • Batch and backups during 1-2 hours per day

“Management Support” Systems • Systems used by “management” for such activities as reporting, queries. • Same schedule as Business Support Systems
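A category's published schedule can be turned into scheduled hours, which is the denominator for any availability figure. The sketch below encodes the Business Support schedule given above (also used by Management Support). The data structure and function names are illustrative assumptions.

```python
from datetime import time

# Business Support schedule: Mon-Fri 6:00 a.m. to 10:00 p.m., Sat 6:00 a.m. to 6:00 p.m.
# Sunday is the normal maintenance window, so it has no scheduled hours.
BUSINESS_SUPPORT = {
    "Mon": (time(6), time(22)),
    "Tue": (time(6), time(22)),
    "Wed": (time(6), time(22)),
    "Thu": (time(6), time(22)),
    "Fri": (time(6), time(22)),
    "Sat": (time(6), time(18)),
}


def hours_between(start: time, end: time) -> float:
    """Hours from start to end on the same day."""
    return (end.hour + end.minute / 60) - (start.hour + start.minute / 60)


def weekly_hours(schedule: dict) -> float:
    """Total scheduled hours per week for a category."""
    return sum(hours_between(start, end) for start, end in schedule.values())


hours = weekly_hours(BUSINESS_SUPPORT)
print(f"{hours:.0f} scheduled hours/week, {hours / (24 * 7):.1%} of the week")
```

For comparison, the Self Service category is stated as 148 to 156 hours per week, and Operational Support as about 8700 hours per year.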

Step 3 Know the Applications • Understand each application’s architecture, constraints, “release tolerance”, and flexibility to change. • In-house vs. purchased. • Know the application’s dependencies on other applications and components. • Architecture diagrams and data flows are key.
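Dependencies matter because an application is only accessible when everything it depends on is accessible. A rough way to see the effect, assuming the components fail independently (an assumption made here for illustration), is to multiply their individual availabilities. The component list in the example is hypothetical.

```python
import math


def end_to_end_availability(component_availabilities: list[float]) -> float:
    """Availability of a chain in which every component must be up.

    Each value is a fraction between 0 and 1, not a percent.
    """
    return math.prod(component_availabilities)


# Hypothetical chain: network, web server, application server, database.
chain = [0.999, 0.999, 0.999, 0.999]
print(f"{end_to_end_availability(chain):.4%}")
```

Four components that each reach 99.9% give an end-to-end figure below 99.9%, which is one reason to measure availability end to end.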

Sample Architecture Diagram

Step 4 Know the Baseline • What is your current SOP with respect to technology? Procedures? Testing? • What is your current availability? What can you expect with existing budget? • If you haven’t already, at least start measuring something. • Identify root causes of unplanned downtime. • What are infrastructure constraints on expanding schedule?
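Starting to measure can be as simple as keeping an outage log and totaling downtime by root cause. The sketch below assumes a log of (cause, minutes) pairs; the format, the function name, and the sample entries are all hypothetical and exist only to show the tally.

```python
from collections import Counter


def downtime_by_cause(outage_log: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Total unplanned downtime per root cause, largest first."""
    totals: Counter = Counter()
    for cause, minutes in outage_log:
        totals[cause] += minutes
    return totals.most_common()


# Made-up sample entries, for illustration only.
log = [
    ("operator error", 30),
    ("out of space", 45),
    ("hardware failure", 20),
    ("operator error", 25),
]

for cause, minutes in downtime_by_cause(log):
    print(f"{cause:18s} {minutes:5.0f} min")
```

A tally like this shows whether downtime is dominated by people and process causes or by hardware, which guides where to invest.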

Step 5 Know the Costs • What improvements can you make from existing budget? Training, testing, Q/A, etc. • Invest in the right areas for you to expand schedule and availability. • Know costs to expand schedule beyond baseline to meet goals. • Know costs to increase availability beyond baseline to meet goals. • Expect involvement from all areas of IT.

Step 6 The Business Case • Develop a consistent approach to weigh the business benefits vs. the cost. Maintain focus on the business problem/goal. • The “Steering Committee” or business owner(s) of the applications need to determine the business need. • It’s difficult to cost and plan for applications individually- categorizing may help. • Differentiate between “like to have” and true business need. Who pays? • May not be any “quick fix”.

Step 7 Execute The Plan • Have Commitment. • Sr. management commitment • Front-line management commitment • Define the resources, people, budget, etc. • Define ownership. • Develop, document a typical plan, with goals, activities, responsibilities, dates, etc. • Make it part of existing project plans • Manage and adjust. • Measure actual vs. goal.

In Summary: What We Have Covered • The challenge of “24 X 7” • Definitions: standard terminology • Elements of achieving “high availability” • Elements of achieving “continuous operations” • Business Steps as an approach to proceed

Questions and Discussion • This presentation can be accessed (24 x 7?) at: www.indiana.edu/~uis/cumrec • Dennis Cromwell, email: dcromwell@indiana.edu • Mike Egolf, email: megolf@indiana.edu

References
1. GartnerGroup: Building Continuous Availability Into E-Applications, COM-12-1325, 29 September 2000, D. Scott, Y. Natis
2. GartnerGroup: Availability: How Do Your Applications Services Stack Up?, SPA-12-8280, 17 January 2001, D. Scott
3. GartnerGroup: High Availability: A Perspective, DPRO-90193, 29 June 2000, Jane Wright, Ann Katan
4. GartnerGroup: Measuring End-to-End Application Service Availability, DF-13-1114, 19 March 2001, D. Scott
5. GartnerGroup: 24 X 7 E-Commerce Availability, 25H, SYM10, 10/00, Donna Scott
6. IBM: Helping to keep your critical systems up and running, 7/22/99, IBM Global Services
