Learn how commissioning can ensure the availability of data centers and mitigate risks. Discover the cost of different tier levels and the interdependency between IT architecture and facility infrastructure. Understand the predictable nature of data center failures and strategies to manage and optimize fault-tolerant systems.
Current Trends in Data Center COMMISSIONING
RICHARD L SAWYER, Strategist - HP Critical Facilities
ACG, Chicago, April 2013
AGENDA:
• WHAT IS A DATA CENTER?
• DIRTY LITTLE SECRET
• RISK MITIGATION
• LEVERAGING COMMISSIONING
• USING FAILURE TO SUCCEED
What is a Data Center?
[Image: Liebert cooling unit status and alarm panel]
• By NFPA 70: “Critical Operations Data System”
• By Clients: Wherever I process data.
• By Commissioning Agents: A power intensive critical space.
Successful Data Center Operations Start with Commissioning
• Data Centers are designed to a certain availability expectation to meet business goals.
• Whether or not they meet the designed goal depends on the contractor.
• Commissioning is the only way to assure the availability of the design is achieved in practice!
It’s all about availability! Data Centers have specified design features. These are investments to deliver a specified availability…….
The cost is huge: Availability is expensive!
[Chart: Tier II, III, IV build costs ($/sq. ft.) related to power density. Vertical axis: $0 to $5,000 per sq. ft. Horizontal axis: 50 to 300 w/sf. Curves: Tier II, Tier III, Tier IV. Callout: a 20K sf Tier III data center costs $35 million @ 150 w/sf.]
1. Build cost per sq. ft. (or sq. m) rises with power density.
2. As tier level increases, build cost rises.
3. Costs of Tier IV are almost double those of Tier II.
HP data, based on a 40,000 sq. ft. raised-floor data center.
And the IT investment is even larger!
• A 20,000 square foot data center built to 150 watts/square foot can accommodate 800 racks of IT equipment @ 3.75 kW per rack.
• This is 3,000 to 10,000 servers, depending on architecture, form factor and configuration.
• The IT investment in hardware, software and service can amount to 5 to 8 times the data center facility investment.
Can you safely assume the data center investment will work as designed from Day One?
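The rack count on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes the full design density is available to IT racks, which the slide does not state.

```python
# Figures taken from the slide; names are illustrative.
FLOOR_AREA_SQFT = 20_000
POWER_DENSITY_W_PER_SQFT = 150
RACK_POWER_KW = 3.75


def rack_capacity(area_sqft, density_w_per_sqft, rack_kw):
    """Return (critical load in kW, whole racks that load can feed).

    Assumption: every design watt is available to IT racks.
    """
    critical_load_kw = area_sqft * density_w_per_sqft / 1000
    return critical_load_kw, int(critical_load_kw // rack_kw)


load_kw, racks = rack_capacity(
    FLOOR_AREA_SQFT, POWER_DENSITY_W_PER_SQFT, RACK_POWER_KW
)
print(f"Critical load: {load_kw:,.0f} kW, racks supported: {racks}")
```

With the slide's numbers this gives a 3,000 kW critical load and 800 racks, matching the figure quoted above.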
Availability interdependency
Formula: (Availability of IT) x (Availability of FI) = Total End-to-End Availability
End-to-end availability is the product of the availability of the IT Architecture times the availability of the Facility Infrastructure (FI).
(Tier 3 FI x MS Server) = Total availability
99.982% x 99.202% = 99.184%
IT architecture and facility infrastructure are interdependent in meeting the data center goal: the speed of IT recovery is dependent on the speed of facility recovery!
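The product formula is easy to reproduce. This is a minimal sketch of the series model, where every layer must be up for the service to be up. The function and variable names are illustrative, and the two availability figures are the ones quoted on the slide.

```python
def end_to_end_availability(*availabilities):
    """Multiply the availabilities of layers that must all be up (series model)."""
    total = 1.0
    for availability in availabilities:
        total *= availability
    return total


facility = 0.99982  # Tier 3 facility infrastructure (FI), from the slide
it_stack = 0.99202  # IT server architecture, from the slide

total = end_to_end_availability(facility, it_stack)
print(f"End-to-end availability: {total:.3%}")
```

The result rounds to 99.184%, and it is always lower than the weakest layer, which is why facility and IT availability cannot be planned separately.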
Dirty Little Secret: Data Centers Fail
Failure is:
• Expensive
• Inevitable
• Predictable
• Manageable
• Useful
Failure is Inevitable
[Chart: 5 year probability of failure. Source: AFCOM 2007, “Understanding Tier Systems”, Tom Roberts, Rick Sawyer]
Failure is Predictable
[One-line diagram, Utility Option 2N: 2 Utilities, 2 Generators, 2 ATS, 2 UPS Systems, STS. Labels: Utility, G, Primary Bus 1, Primary Bus 2, UPS, Bypass, Static Switch, PDU, Critical Load]
MTBF = 315,766 hours
Availability = 99.9985%
Probability of Failure in 5 years = 12.95%
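The 5-year figure is consistent with the standard constant-failure-rate (exponential) reliability model, P = 1 - exp(-t / MTBF). The slide does not say which model was used, so treat this as a sketch under that assumption, with an 8,760-hour year. The function name is hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: 365-day year


def probability_of_failure(mtbf_hours, years):
    """Chance of at least one failure, assuming a constant failure rate."""
    mission_hours = years * HOURS_PER_YEAR
    return 1.0 - math.exp(-mission_hours / mtbf_hours)


# MTBF value quoted on the slide.
p_fail = probability_of_failure(315_766, 5)
print(f"5-year probability of failure: {p_fail:.2%}")
```

Under these assumptions the result is about 12.95%, the same value shown on the slide. Even a system with an MTBF of roughly 36 years has a better than one-in-eight chance of failing within five.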
Good News! Failure is Manageable
STRATEGY TO SURVIVE:
• Design to Survive
• Map Foreseeable Failures
• Develop SOP’s, MOP’s, EOP’s
• Commission! Test, Document, Train
Design to Survive
Using ITSM Capability Maturity Model to assess Facility Infrastructure Design:
• Optimizing: Fault tolerant system features (2N)
• Managed: Data center has concurrent maintainability features
• Defined: Data center systems have redundant features for resiliency (N+1)
• Repeatable: Data center has dedicated cooling, generators, UPS, fire, security and monitoring systems
• Initial: Data center is basic server or network room, in a dedicated space having minimal dedicated infrastructure systems
• Absence: No dedicated data center, processing is in office space
Zoned Availability: Scalable Mission Critical infrastructure using Central UPS and Rack based UPS for 2N redundancy
[Floor-plan diagram. Labels: Cold Aisle, Hot Aisle, CRAC, UPS, pdu, EPO, HEAT REJECT, FIRE, SECUR, SYSTEM MONITOR, Battery, WEBLINK]
• Rack based UPS Systems as needed for 2N redundancy
• Central UPS for one “N” side, scalable UPS System
• Site Availability: 99.995%
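An availability percentage is easier to judge when converted to expected downtime. The sketch below converts the 99.995% site availability into minutes per year. It assumes an 8,760-hour year and treats availability as a long-run average. The function name is illustrative.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year


def annual_downtime_minutes(availability):
    """Expected minutes of downtime per year for a given availability (0 to 1)."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


site_downtime = annual_downtime_minutes(0.99995)
print(f"Expected downtime: {site_downtime:.1f} minutes per year")
```

This is an average only: a real site may run for years without an outage and then lose several hours in a single event.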
Develop MOP’s, SOP’s, EOP’s
• Optimizing: Real time monitoring, continuous improvement features
• Managed: Documentation is complete, available, compliance is measured and trended
• Defined: Procedures are associated with asset management systems and are tracked to completion, effectiveness
• Repeatable: Standard, Maintenance and Emergency Operating Procedures exist and are site specific
• Initial: Maintenance and operations are not site specific or complete, ad hoc and depend on staff memory/knowledge
• Absence: No operational processes formally in place or measured
[Diagram labels: Runbook Automation; Automate Networks; Automate Servers; Automate Storage]
O&M MGE EPS 8000 UPS System A, Module 01
• Simplified One-Line power supply diagram
• Simplified One-Line UPS system diagram
• Normal power flow diagrams
• Emergency power flow diagrams
• Automatic Transfer Control diagram
• Location of equipment
• Start-Up and Shut-Down procedure
• Emergency response procedure
• Recommended maintenance practices
• Reference Engineering Prints
• Reference MGE EPS 8000 Operations and Maintenance Manual
Based on best available data 05/11. Verify against As-Builts.
Bypass Power Flow to UPS A01 For Maintenance on Modules or Module Failure Mode
[One-line diagram, UPS Systems A01 & B01. Labels: 13.8 kV, 480V, Automatic Transfer Control, Load Bus Synchronization Control, ATS-31A01, ATS-31B01, T-31A01, T-31B01, CB-01A001, CB-01A002, CB-01B001, CB-01B002, B-3A04, B-3B04, B-3A29, B-3B33, SG-3A01, SG-3A02, SG-3B01, SG-3B02, SG-01A01, SG-01A02, SG-01B01, SG-01B02, with NO/NC breaker states. To SG-01A03 Critical UPS Load A; To SG-01B03 Critical UPS Load B; From SG-0A04]
Based on best available data 05/11. Verify against As-Builts.
Process for installing a new IT server
[Flow diagram. Step labels: Burn-in functional test; Physical Inspection; Delivery; Order; Firmware verification; Integration with existing systems; Network assignment; Install in rack; Software verification; Data test of software; Online production]
Process for “installing” a new datacenter
[Flow diagram. Step labels: Equipment startup; Design; Construct; Physical inspection; Equipment tests; “Pull-the-plug” integrated test; Controls and monitoring tests; Failure mode tests; System-level tests; Capacity tests; Turn over to IT and Operations]
The Value of Commissioning
• Assures design performance is achieved following construction
• Verifies performance levels: capacity and availability (redundancies)
• Provides documentation base for SOP’s, MOP’s, and EOP’s
• Opportunity for “hands-on” training of operations staff on events they may otherwise not see for years!
• Video taping of procedures
• Monitoring and alarm testing with response procedures
• “New Employee” training guide development
IT investment is 3-5X the data center investment. Commissioning assures the IT architecture support systems work, and can be recovered quickly when they fail.
Leverage Facility Commissioning Involve everyone: IT, management, vendors, contractor, engineers and operating staff. Manage your documents – capture everything methodically. Test everything that can be safely tested. Video tape procedures, especially risk mitigation procedures for SPOF’s. Know your data center!
Commissioning Trends
• Standardized procedures to test standardized systems
• Capacity testing to verify efficiency at all load levels
• Staff training during the commissioning process
• Video taping of test procedures for future training
• Integrated testing of raised floor areas before IT equipment is installed
• Digital data logging of system performance during commissioning to lower cost and provide better information
Typical Integrated Test
[One-line diagram. Labels: Utility, G, Primary Bus 1, UPS, Bypass, Static Switch, PDU, Critical Load]
• Generator capacity and redundancy is tested by failing units
• Utility is failed to test transfer switch and generator performance
• UPS redundancy is tested by failing modules and system
• Static switch sources are failed to test performance
• Digital meters record performance at critical load
• Load banks are installed to simulate critical load
Use Failure as an Opportunity
• When you’re down, you’re down.
• Use the downtime to access, maintain or modify systems you can’t get to any other time
• Verify breaker operation: “retro commission”!
• Inspect and repair equipment in a powered down condition
• Tie in valves and breakers for future use
• Test systems and operations procedures
Plan recovery procedures to leverage downtime opportunity for maintenance, testing and training!
Summary
• Modern office buildings contain high power data center spaces
• Availability of those spaces is a key client demand
• Design can only do so much; performance must be proven through commissioning!
• Actual availability is an operational issue.
• Data center performance is contingent on a strong commissioning program from the start!
Questions? Richard L. Sawyer Strategist, HP Critical Facility Services rsawyer@hp.com 518-857-9751