The Challenge of 24/7 Operations: Strategies for Continuous Availability

Explore the complexities of achieving 24/7 operations, high availability, and continuous availability in the digital age. Learn strategies and business process steps to ensure uninterrupted services for users worldwide.

Presentation Transcript


  1. Are We Ready For “24 by 7”? Dennis Cromwell, Michael Egolf, CUMREC 2001

  2. What We Will Cover • The Challenge of “24 X 7” • Definitions- A common ground • The road to “high availability” • The road to “continuous operations” • The road to “continuous availability” • Business process steps to get there

  3. What we will NOT cover • If you are in this session to learn about…. • Detailed technical architectures to install, and how to manage High Availability, HA, Fault Tolerance--- • To learn about “speeds and feeds”, “slots and watts”……. • Then you are in the WRONG session

  4. The Challenge • Global economy, internet, increased reliance on computers, server consolidation--------- • Demand/increased expectations to make services and applications accessible by users more hours per day, more days per week. • Distance education, internet registration, etc. • Users are not tolerant of unscheduled interruptions in service. • Users are less tolerant of scheduled interruptions in service as well.

  5. Have you ever been asked? • We need “24 x 7”, so we can just buy another server, or this fault-tolerant stuff, right? • We need “24 x 7”, so the system techies can take care of this, right? • We need “24 x 7”, so can’t YOU just change our batch schedule around, or do some special DBA things? • If we just bought the GizmoTech Belchfire Z-9000HA cluster, that would give us “24 X 7”, right? • Why can’t you people in IT do “24 X 7” like everyone else on the internet? It’s an IT problem, right?

  6. The Reality is--- • Achieving “24 x 7” requires a multi-dimensional strategy. • Cannot be bought off the shelf. • Requires substantial levels of cross-organizational people, process planning, discipline, and control (unlike most IT projects). • Is expensive. • Fewer than 20% will achieve it by 2005. (GartnerGroup) 1

  7. Definitions • Reliability • Performance • Schedule of Operations • Availability • High Availability • Continuous Operations • Continuous Availability • Fault tolerance

  8. Reliability • The extent to which the application/service provides the same results on repeated trials. • Provides consistent, correct results. • Not the same as Availability.

  9. Performance • The amount of elapsed time for the service or application to provide the information or result to the end user. • Performance is not an absolute measure, it is relative to the end user. • Performance is acceptable if it allows the end user to be productive in his/her work. Ask- they’ll tell you.

  10. Schedule of Operations (scheduled time) • The negotiated, agreed-to, and published schedule (days, hours per day) during which the application or service is to be accessible by the customer. • Not all applications are scheduled to be accessible 24 hours per day, or 7 days per week, nor need they be. • Thought: Should the service be accessible outside the schedule, even though it could be?

  11. Availability • “The percent of the time that the application or service is actually accessible by the customer, within the schedule of operations.” • “The proportion of time that a system can be used for productive work.” • Availability implies reliability and acceptable performance.
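The first definition can be written as a formula. Here $S$ is the scheduled time and $D$ is the unscheduled downtime that falls within that schedule; both symbols are introduced here for illustration and do not appear in the slides.

```latex
A = \frac{S - D}{S} \times 100\%
```

For example, a schedule of 4,680 hours per year with 46.8 hours of unscheduled downtime gives $A = (4680 - 46.8)/4680 \times 100\% = 99\%$.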

  12. Availability Chart, Based on 24 X 7 X 52 Schedule

      Availability   Unscheduled Down Time per Year
      Percent        Minutes (rounded)   Minutes   Hours
      90.0%          50,000              52,560    876
      99.0%          5,000               5,256     87.6
      99.9%          500                 525       8.76
      99.99%         50                  52        .87
      99.999%        5                   5         .08
      99.9999%       .5                  .5        .008
      99.99999%      .05                 .05       .0008

  13. Availability Chart, For a schedule of 6 am – 9 pm, 6 days/week

      Availability   Unscheduled Down Time per Year
      Percent        Minutes   Hours
      90%            28,080    468
      99%            2,808     46.8
      99.9%          280       4.68
      99.99%         28        .46
      99.999%        2         .046
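The figures in both charts follow from one calculation: scheduled minutes per year multiplied by the fraction of time the service may be unavailable. The sketch below is a minimal illustration of that arithmetic. It assumes a 365-day year for the round-the-clock schedule (which matches the 52,560-minute figure) and 52 weeks for the 6 am to 9 pm schedule; the function and constant names are illustrative and not from the presentation.

```python
def downtime_budget_minutes(scheduled_hours_per_year: float, availability_pct: float) -> float:
    """Return the unscheduled downtime (minutes per year) allowed by an availability target."""
    scheduled_minutes = scheduled_hours_per_year * 60
    return scheduled_minutes * (100.0 - availability_pct) / 100.0


# Assumed schedules: 24 hours x 365 days, and 15 hours x 6 days x 52 weeks.
ROUND_THE_CLOCK_HOURS = 24 * 365   # 8,760 hours
EXTENDED_DAY_HOURS = 15 * 6 * 52   # 4,680 hours

if __name__ == "__main__":
    for pct in (90.0, 99.0, 99.9, 99.99, 99.999):
        full = downtime_budget_minutes(ROUND_THE_CLOCK_HOURS, pct)
        part = downtime_budget_minutes(EXTENDED_DAY_HOURS, pct)
        print(f"{pct:>7}%  24x7: {full:10.1f} min   6am-9pm x 6 days: {part:10.1f} min")
```

Because the budget scales with the schedule, the same availability percentage allows roughly half as much downtime on the shorter schedule as on the round-the-clock one.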

  14. High Availability • Gartner Group: “A highly available application provides user access to applications and data a high percentage (e.g. 99 percent or greater) of scheduled time, despite unscheduled events.” • IBM: “High Availability isn’t a specific technology but is instead a balanced solution that addresses the people, process, and technology issues for specific systems.”

  15. High Availability Microsoft: • “One way to understand high availability is to contrast it with fault tolerance. These terms describe two different benchmarks measuring availability. Fault tolerance is defined as 100% availability 100% of the time, regardless of the circumstances. A fault tolerant system is designed to guarantee resource availability. • In contrast, a high-availability system is concerned with maximizing resource availability. A highly available resource is available a very high percentage of the time and may even approach 100% availability, but a small percentage of down time is acceptable and expected. • High availability can be defined as follows: A highly available resource is almost always operational and accessible to clients.”

  16. High Availability • Unmanaged 90.0% • Managed 99.0% • Well Managed 99.9% • Fault-Tolerant 99.99% • High Availability 99.999% • Very High Availability 99.9999% • Ultra-Availability 99.99999% Source: Strategic Research Corp.

  17. Measuring Availability • Most organizations do not currently measure end to end, but the need to do so will increase. • How do you really measure? • What if individual workstations are down? • What if some application components are down? • Weight by number of users? • Who/where are the users anyhow? • What’s good enough? Available to network? • What’s acceptable? Who says? • User surveys can be a good source. • Thought: Are SLAs in higher-ed good, or a guarantee of failure?
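One possible answer to “Weight by number of users?” is to weight each outage by the number of users it affected. The sketch below shows one hypothetical way to do that. The `Outage` record, its field names, and the sample figures are assumptions made for illustration; the presentation does not prescribe a measurement method.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float        # length of the interruption
    users_affected: int   # users who could not reach the service


def user_weighted_availability(scheduled_minutes: float, total_users: int,
                               outages: list[Outage]) -> float:
    """Percent of scheduled user-minutes during which the service was accessible."""
    scheduled_user_minutes = scheduled_minutes * total_users
    lost_user_minutes = sum(o.minutes * o.users_affected for o in outages)
    return 100.0 * (scheduled_user_minutes - lost_user_minutes) / scheduled_user_minutes


if __name__ == "__main__":
    # Hypothetical month: 30 days of round-the-clock service for 1,000 users.
    outages = [
        Outage(minutes=120, users_affected=1000),  # whole service down
        Outage(minutes=600, users_affected=50),    # one building's network down
    ]
    print(f"{user_weighted_availability(30 * 24 * 60, 1000, outages):.3f}%")
```

Under this weighting a long outage that reaches few users counts for less than a short outage that reaches everyone, which is one way to make “What’s acceptable? Who says?” an explicit decision.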

  18. Continuous Operations • Architecting an application and the process components to schedule the application to allow user access during expanded hours, often 24 hours per day, 7 days per week, or near 24/7. • Continuous Operations does NOT imply high availability. • Addresses minimizing planned downtime.

  19. Continuous Availability • The combination of High Availability and Continuous Operations. • If we schedule and enable the application to allow user access 24 x 7 (or near 24 x 7), and also want the application to achieve availability of 99% or greater, then this is continuous availability. • Addresses unplanned and planned downtime. • Probably closest to what is meant by “we need 24 X 7”.

  20. Availability Benchmark. Source: (GartnerGroup) 2

      Level            Unplanned/yr   Planned/yr   Availability
      Average          175+ hours     250+ hours   98%
      Very Good        87 hours       200 hours    99%
      Outstanding      43 hours       50 hours     99.5%
      Best in Class*   9 hours        12 hours     99.9%

      *Fewer than 5% achieve Best in Class
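The availability percentages in this benchmark appear to count only unplanned downtime. A quick check reproduces the published figures to within rounding, assuming a round-the-clock year of 8,760 hours. That base is an assumption, since the slide does not state it, and the names below are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # assumed base; the slide does not state it

# (level, unplanned hours/yr, planned hours/yr, published availability %)
BENCHMARK = [
    ("Average",       175, 250, 98.0),
    ("Very Good",      87, 200, 99.0),
    ("Outstanding",    43,  50, 99.5),
    ("Best in Class",   9,  12, 99.9),
]


def availability_pct(downtime_hours: float, base_hours: float = HOURS_PER_YEAR) -> float:
    """Percent of base_hours remaining after subtracting downtime_hours."""
    return 100.0 * (base_hours - downtime_hours) / base_hours


if __name__ == "__main__":
    for level, unplanned, planned, published in BENCHMARK:
        unplanned_only = availability_pct(unplanned)
        with_planned = availability_pct(unplanned + planned)
        print(f"{level:<14} published {published:5.1f}%  "
              f"unplanned only {unplanned_only:6.2f}%  "
              f"unplanned + planned {with_planned:6.2f}%")
```

Counting planned downtime as well drops the “Average” row to roughly 95%, which is the gap that continuous operations is meant to close.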

  21. Fault Tolerance • The entire system and all the resources that are needed for an application to run must be duplicated. Cannot afford any downtime. • To eliminate all planned and unplanned downtime. • As a result of this complete replication, fault-tolerant systems are much more expensive than highly available systems. • E.g., air traffic control, life support systems. • At IU, no strong business case apparent. • Thought: Are there any true business cases for Fault Tolerance in higher-ed?

  22. The Road to High Availability • Simple: Reduce unplanned downtime!

  23. The Road to High Availability • 80% or more of unplanned downtime is the result of People and Processes, NOT hardware or O/S failures…… • Application failures • Software failures, errors in configurations • Scheduling errors • Operator errors • Out of space conditions • Batch prevented OLTP from being available on time • Data corruption • Unexpected or unplanned volumes

  24. The Road to High Availability To address the 80%, invest money/time in: • Staffing, Training • Change management • Problem management • Job scheduling, restart procedures • Intelligent event management, tuning • Application architecture • Function, regression, integration, load testing • Test and time recovery scenarios • Production readiness reviews, standards • Application planning, capacity planning

  25. The Road to High Availability: some technology stuff…. • Minimize SPOF- Single Point Of Failure • Environmental, facilities, network • Web load balancers, redundant dispatchers • RAID: level 5/0/1, mirroring, striping • ECC data protection • On site spares, hot swappable parts • “HA” solutions, clustering, auto fail over • Data Base replication, cloning • Oracle Parallel Server- OPS
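Why removing a single point of failure helps can be seen with the standard availability arithmetic for components in series and in parallel. The sketch below assumes component failures are independent, which real systems only approximate. The component availabilities are invented for illustration and do not come from the presentation.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up (fractions 0..1)."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Availability of redundant components where any one being up is enough."""
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    web, db, network = 0.99, 0.99, 0.999  # hypothetical component availabilities
    single = series(web, db, network)
    redundant = series(parallel(web, web), parallel(db, db), network)
    print(f"no redundancy:         {single:.4%}")
    print(f"duplicated web and db: {redundant:.4%}")
```

In this example the unduplicated network link becomes the limiting component once the servers are duplicated, which is why the slide lists environmental, facility, and network items alongside server technology.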

  26. The Road to Continuous Operations (Expanding the Scheduled Time for accessibility) • Understand the application architecture and constraints. • Understand all application dependencies and interrelationships to needed components. • Reduce batch interference. • Confront the “backup problem”. • Hot backup strategies, cloning, SANs. • Manage other planned changes.

  27. The Road to Continuous Operations Manage the Planned downtime: • Infrastructure and facility work • Hardware changes and upgrades • Operating system level changes • Database changes and releases • Application changes and releases- “release tolerance” a key item • Increased need for infrastructure test environments. To some this is new. • Common “maintenance windows” • Expect increased coordination, staff overhead

  28. The Road to Continuous Availability • Application availability is dependent on design. • Transaction queuing, batch processing • Release tolerance, recovery • Set schedule and availability expectations early. • Have ‘some’ functions up 24 x 7, not all. • Continuous availability costs about 3.5X as much as a standard application. (GartnerGroup) 1

  29. The Common Maintenance Window • Applications are interrelated and integrated with others more than ever. • Shared infrastructure elements are more common. • Managing a maintenance window for each application can be exceedingly complex. • A common maintenance window for infrastructure activity can be beneficial. • Saves negotiating time, sets expectations

  30. Putting it all together • Now that you know some definitions, lots of numbers, and components to address, how do you get started on the road to “24 X 7”? • The following Business Process Steps represent an approach for IU.

  31. Step 1 Define the Problem • A problem well defined is a problem 80% solved. • For each application area, determine what the problem/goal is with the correct user representative(s) . • Determine the schedule goal. • Separately, determine the availability goal. • Schedule and availability should be determined and designed in up front, just like any other application functional requirement. It’s more costly to retrofit.

  32. Step 2 Categorize • Categorize the applications into groups. • For Example…. • Business Support Systems • Operational Support Systems • Self Service/E-Commerce • Management Support Systems

  33. “Business Support” System • Mon-Fri: 6:00 a.m. to 10:00 p.m. EST • Sat: 6:00 a.m. to 6:00 p.m. EST • Sun: Normal maintenance window • Batch updates, data refreshes

  34. “Operational Support” Systems • Round-the-clock operations, such as physical plant, security, hospitals • Near 24x 7 schedule • Occasional Sunday morning maintenance • Monthly cold backups • Batch, backups non-disruptive to users • Accessible about 8700 hours/year • The most extended schedule

  35. “Self Service/ E-Commerce” • Near 24 by 7 schedule • Can tolerate 1-2 hours down per night • Accessible from 148 to 156 hours per week • Batch and backups during 1-2 hours per day

  36. “Management Support” Systems • Systems used by “management” for such activities as reporting, queries. • Same schedule as Business Support Systems
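The schedules for these categories can be compared by converting each to scheduled hours per week. The sketch below is a minimal illustration that uses the hours from the slides. The list layout and names are illustrative, and the 52-week year used for annualizing is an assumption.

```python
# Scheduled hours per day, Monday through Sunday, from the "Business Support" slide:
# Mon-Fri 6 am-10 pm (16 h), Sat 6 am-6 pm (12 h), Sun maintenance window (0 h).
BUSINESS_SUPPORT_DAILY_HOURS = [16, 16, 16, 16, 16, 12, 0]

HOURS_PER_WEEK = 24 * 7
WEEKS_PER_YEAR = 52  # assumption used only for annualizing


def weekly_hours(daily_hours: list[float]) -> float:
    """Total scheduled hours in one week."""
    return sum(daily_hours)


def share_of_week(hours: float) -> float:
    """Scheduled hours as a percent of the 168 hours in a week."""
    return 100.0 * hours / HOURS_PER_WEEK


if __name__ == "__main__":
    business = weekly_hours(BUSINESS_SUPPORT_DAILY_HOURS)
    print(f"Business Support: {business} h/week "
          f"({share_of_week(business):.0f}% of the week), "
          f"{business * WEEKS_PER_YEAR} h/year")
    for hours in (148, 156):  # Self Service / E-Commerce range from the slide
        print(f"Self Service: {hours} h/week ({share_of_week(hours):.0f}% of the week)")
```

Seen this way, moving an application from the Business Support schedule to a near 24 by 7 schedule adds well over 50 scheduled hours per week that batch work, backups, and maintenance can no longer use freely.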

  37. Step 3 Know the Applications • Understand each application’s architecture, constraints, “release tolerance”, flexibility to change. • In-House vs. purchased. • Know the application’s dependencies on other applications and components. • Architecture diagrams and data flows are key.

  38. Sample Architecture Diagram

  39. Step 4 Know the Baseline • What is your current SOP with respect to technology? Procedures? Testing? • What is your current availability? What can you expect with existing budget? • If you haven’t already, at least start measuring something. • Identify root causes of unplanned downtime. • What are infrastructure constraints on expanding schedule?

  40. Step 5 Know the Costs • What improvements can you make from existing budget? Training, testing, Q/A, etc. • Invest in the right areas for you to expand schedule and availability. • Know costs to expand schedule beyond baseline to meet goals. • Know costs to increase availability beyond baseline to meet goals. • Expect involvement from all areas of IT.

  41. Step 6 The Business Case • Develop a consistent approach to weigh the business benefits vs. the cost. Maintain focus on the business problem/goal. • The “Steering Committee” or business owner(s) of the applications need to determine the business need. • It’s difficult to cost and plan for applications individually- categorizing may help. • Differentiate between “like to have” and true business need. Who pays? • May not be any “quick fix”.

  42. Step 7 Execute The Plan • Have Commitment. • Sr. management commitment • Front-line management commitment • Define the resources, people, budget, etc. • Define ownership. • Develop, document a typical plan, with goals, activities, responsibilities, dates, etc. • Make it part of existing project plans • Manage and adjust. • Measure actual vs. goal.

  43. In Summary: What we have covered • The challenge of “24 X 7” • Definitions- standard terminology • Elements of achieving “high availability” • Elements of achieving “continuous operations” • Business Steps as an approach to proceed

  44. Questions and Discussion • This presentation can be accessed (24 x 7?) at: www.indiana.edu/~uis/cumrec • Dennis Cromwell email: dcromwell@indiana.edu • Mike Egolf email: megolf@indiana.edu

  45. References • GartnerGroup: Building Continuous Availability Into E-Applications, COM-12-1325, 29 September 2000, D. Scott, Y. Natis • GartnerGroup: Availability: How Do Your Applications Services Stack Up?, SPA-12-8280, 17 January 2001, D. Scott • GartnerGroup: High Availability: A Perspective, DPRO-90193, 29 June 2000, Jane Wright, Ann Katan • GartnerGroup: Measuring End-to-End Application Service Availability, DF-13-1114, 19 March 2001, D. Scott • GartnerGroup: 24 X 7 E-Commerce Availability, 25H, SYM10, 10/00, Donna Scott • IBM: Helping to keep your critical systems up and running, 7/22/99, IBM Global Services
