Five 9s for SANs w/o Breaking the Bank. Presented by Marc Staimer President & CDS (Chief Dragon Slayer) Dragon Slayer Consulting. Agenda. What is Five 9s? How this relates to SANs Reality Check What you should do. What is Five 9s & What does it Really Mean?. Five 9s Generally Defined.
Five 9s for SANs w/o Breaking the Bank • Presented by Marc Staimer, President & CDS (Chief Dragon Slayer), Dragon Slayer Consulting
Agenda • What is Five 9s? • How this relates to SANs • Reality Check • What you should do
Five 9s Generally Defined • 99.999% is another term for “High Availability”
What does “Availability” mean? • Availability is the proportion of time that a system can be used for productive work
Then what does “five 9s” mean? • Scheduled & unscheduled downtime together do not exceed ~5 minutes per year • Perspective: that much annual downtime is • Less time than it takes to drink a cup of coffee • 1/6th the time of the average daily commute
What about Four 9s or less? • Four 9s = ~ an hour of downtime/yr • Three 9s = ~ 9 hours of downtime/yr • Two 9s = ~ 4 days (88 Hours) of downtime/yr
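The downtime figures on these two slides follow from simple arithmetic: allowed downtime is the unavailable fraction of a year. A minimal sketch, assuming a 365-day year (the function name is illustrative, not from the presentation):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(nines: int) -> float:
    """Maximum annual downtime, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. five 9s -> 0.00001
    return MINUTES_PER_YEAR * unavailability


# Print the allowance for two through five 9s, in minutes and in hours.
for n in range(2, 6):
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines: {minutes:9.1f} min/yr = {minutes / 60:6.2f} h/yr")
```

Five 9s works out to roughly 5.3 minutes, four 9s to about 53 minutes, three 9s to about 8.8 hours, and two 9s to about 87.6 hours, which matches the rounded numbers on the slides.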
Can you live with two, three, or four 9s? …it Depends • On the application • The types of outages you can live with • The cost of downtime for those applications • The cost of high availability such as five 9s
Application Availability Dependencies • Mission criticality • Productivity loss from downtime • Alternatives
Outage dependencies • You may be able to live w/two 9s if: • There are 88 separate outages of 1 hour each through the year • It is a different story if it is 1 outage lasting nearly 4 days • This could put a business out of business
Cost of downtime • The cost of app downtime can be prohibitive
Direct costs of downtime, per Gartner Group (industry: average loss/hr.) • Brokerage Operations: $6,450,000 • Credit Card Authorizations: $2,600,000 • E-commerce: $240,000 • Package Shipping Services: $150,250 • Home Shopping Channels: $113,750 • Catalog Sales Center: $90,000 • Airline Reservation Center: $89,500 • Cellular Service Activation: $41,000 • ATM Service Fees: $14,500
Collateral damage of downtime is more, per Gartner Group (company: direct cost; collateral damage) • eBay: > $5,000,000; dramatic mkt cap reduction • ATT: > $10,000,000; ~$40 million in rebates + SLAs • Collateral damage is more serious than temporary loss of business • Collateral damage severity increases as business moves online
Making “availability” five 9s has a cost too • Old rule of thumb, per IMEX Research: • 1st 80% = 20% of cost • Last 20% = 80% of cost
There must be tradeoffs (per IMEX Research)
Finding the crossover point is key • [Chart: Annual Business Downtime Cost and System Cost vs. Percent Available (90%, 99%, 99.90%, 99.99%, 99.999%, 100%), with regions marked Excessive Downtime Costs and Excessive System Costs, and a marker for System Uptime Requirements]
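One way to look for the crossover point is to add the annual system cost to the expected annual downtime cost at each availability level and pick the lowest total. The sketch below is an illustration only: the loss-per-hour inputs are the Gartner figures quoted earlier in the deck, but the system costs per tier are HYPOTHETICAL placeholders, and the function names are not from the presentation.

```python
from typing import Mapping

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def annual_downtime_cost(availability: float, loss_per_hour: float) -> float:
    """Expected yearly loss if the system is down (1 - availability) of the time."""
    return (1.0 - availability) * HOURS_PER_YEAR * loss_per_hour


def cheapest_tier(system_costs: Mapping[float, float], loss_per_hour: float) -> float:
    """Return the availability level with the lowest system + downtime cost."""
    return min(
        system_costs,
        key=lambda a: system_costs[a] + annual_downtime_cost(a, loss_per_hour),
    )


# HYPOTHETICAL annual system cost per availability tier (illustration only).
SYSTEM_COSTS = {0.99: 100_000, 0.999: 300_000, 0.9999: 900_000, 0.99999: 2_700_000}

# Loss-per-hour figures taken from the Gartner slide.
for name, loss in [("ATM Service Fees", 14_500),
                   ("E-commerce", 240_000),
                   ("Brokerage Operations", 6_450_000)]:
    print(f"{name}: cheapest tier is {cheapest_tier(SYSTEM_COSTS, loss):.3%}")
```

With these placeholder system costs, the cheapest tier is three 9s for ATM service fees, four 9s for e-commerce, and five 9s for brokerage operations. The ranking depends entirely on the assumed system costs, so real figures from your own environment are needed before drawing conclusions.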
How: Thorough Environment Knowledge • Systems • Hardware • Software • Data • Productivity • Direct cost of downtime and collateral damage
What about disasters & downtime? Not if, when • There will eventually be a major interruption of your business environment
Test, test, test • Whatever your business continuity plans are • Make sure you can recover your business in the event of a failure • Test, test, test • One end-user claims to back up to tape every month, except he backs up onto the same tape every time, even when the system asks for a new tape
Reasons cited by European Enterprises for invocation of Business Continuity Plans From 1997-2000 • Hardware Failure 60% • Software 16% • Power Outage 7% • Bomb 3% • Fire 3% • Flooding 3% • Environmental 2% • Telecom Failure 1% • Denied access 1% • Miscellaneous 4%
Reasons cited by USA Enterprises for invocation of Business Continuity Plans From 1997-2000 • Regional Event 40% • Hardware Failure 36% • Software 10% • Power Outage 4% • Bomb 2% • Fire 2% • Flooding 2% • Environmental 1% • Telecom Failure 1% • Denied access 1% • Miscellaneous 1%
SANs have become the critical path of “high availability” or five 9s. • When an application server fails • Only the users using that app are affected • When shared storage goes down • Users of the applications using that storage are affected • When the SAN goes down • All users are affected
Complete availability vs. high availability w/reduced capabilities • Five 9s w/no loss of capabilities • Full Bandwidth all the time w/no pr • Five 9s w/reduced capabilities • Reduced Bandwidth • Higher probability of path congestion • Similar to differences between RAID 0,1 & RAID 5
Five 9s SANs with full capabilities • Director class switches • Full bandwidth between initiators & target storage • Even with a failure in the Director or fabric • [Diagram label: FC/9000]
Five 9s SANs with reduced capabilities • Core/edge networking • Oversubscribed B/W • Path failures mean • Auto failover • Reduced B/W • Increased possibilities of congestion
Fabric Comparison or Red Herring? • 96 Port Resilient Core/Edge Fabric, using 16-port switches (Edge Switch, Core Switch) • 128 Port Fault Tolerant Director Fabric, or 128 Port Dual 64 Port Directors
FC/9000 Directors vs. Core/Edge Switches
Directors - five 9s fully capable • Cost ~ $2,500/port • Mask failures • Apps never know it fails • Full B/W even with failures • Simple to set up & manage • Fault tolerant • Network: up to 239 switches/directors • Up to 256 ports/director • Can be Core or Edge switch
Switches - five 9s, w/reduced failure mode capabilities • Cost ~ $1,000/port • Oversubscribed B/W • Congestion statistically unlikely • Failures mean loss of B/W • More difficult to set up/manage • Fault resilient • Network: up to 239 switches • Up to 64 ports/switch • Can be Core or Edge switch
Reality Check • Core/edge & Directors are not mutually exclusive • Models can & should be mixed • Some apps cannot handle fabric disruptions of any kind • Some fabrics can never ever have reduced capacity • Some apps do not have to have full B/W all the time • [Diagram labels: FC/9000, FC/9000]
Fabric Design “five 9s” Factors • The larger the switch/director nodes • The less likely there will be inter-switch/director traffic • The more oversubscribed your fabric can be w/o increased risk • The more important “HA” becomes in the node itself • FSPF has limited failover capabilities • The loss of a path in the fabric (ISL failure) will cause failover • Failover may not be fast enough to avoid SCSI device timeout • Edge device retransmissions or failover must be designed in
The Key is determining where to implement with what & when • Use the same ROE as before • Thorough knowledge of the data & environment • Hardware, software, systems, etc. • Match the type of SAN to the application
What you should do • Educate yourself about your data & environment • Design your SANs to meet the needs of the business • Provide five 9s with full capability for those apps that need it • Provide five 9s with less than full capability for those apps that don’t need it • Making your entire SAN environment completely five 9s w/no loss of capabilities could be cost prohibitive
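The cost argument for mixing models can be checked with the approximate per-port prices quoted in the Directors vs. Core/Edge Switches comparison (~$2,500 and ~$1,000). This is a rough sketch, not a sizing tool: it ASSUMES every port is billed at the flat per-port price and ignores ports consumed by inter-switch links, and the function name and the 32/96 split are hypothetical.

```python
DIRECTOR_COST_PER_PORT = 2_500  # approximate figure from the comparison slide
SWITCH_COST_PER_PORT = 1_000    # approximate figure from the comparison slide


def fabric_cost(director_ports: int, switch_ports: int) -> int:
    """Port cost of a fabric mixing director ports and core/edge switch ports.

    Assumption: flat per-port pricing; ISL ports are not accounted for.
    """
    return (director_ports * DIRECTOR_COST_PER_PORT
            + switch_ports * SWITCH_COST_PER_PORT)


all_directors = fabric_cost(director_ports=128, switch_ports=0)
mixed = fabric_cost(director_ports=32, switch_ports=96)  # hypothetical split

print(f"128 ports, all directors: ${all_directors:,}")
print(f"32 director + 96 switch ports: ${mixed:,}")
print(f"Difference: ${all_directors - mixed:,}")
```

Under these assumptions a 128-port all-director fabric costs about $320,000, while putting only the 32 ports that need full capability on directors brings the total to about $176,000.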
SAN Design Methodology • [Diagram: lifecycle with phases labeled Design, Implementation, and Maint., and steps labeled Data Collection, Data Analysis, Arch Develop, Prototype and Test, Transition, Release to Production, Add/Change/Remove/Mgt/Troubleshoot, and Upgrade/Architectural change]
Other tools you can use • Interactive online high availability interrogator • Helps determine the cost of your downtime • White papers • http://www.available.com
Marc Staimer marcstaimer@earthlink.net 503-579-3763