ASGC Site Report. Jason Shih, ASGC/OPS. HEPiX Fall 2009, Umeå, Sweden
Overview • Fire incident • Hardware • Network • Storage • Future remarks
Fire incident – event summary • Damage analysis: fire was limited to the power room • Severe damage to the UPS • Wiring of the power system, AHR • Smoke and dust pervaded and smudged almost everywhere, including the computing & storage systems • History and planning • 16:53 Feb. 25 UPS battery burning • 19:50 Feb. 25 Fire extinguished by the fire department • 10:00 Feb. 26 Fire scene investigation by the fire department • 15:00 Feb. 26 ~ Mar. 23 DC cleaning, re-partitioning, re-wiring, deodorization, and re-installation • from the ceiling to the ground under the raised floor, from the power room to the machine room, from the power system, air conditioning, and fire prevention system to the computing system • All facilities moved outside for cleaning • Mar. 23 Computing system installation • Mar. 23 ~ Apr. 9 Recovery of the monitoring, environment control, and access control systems
Fire incident – recovery plan • The DC consultant will review the re-design on Mar. 11; the schedule will be revised based on that inspection • Tier1/Tier2 services will be collocated at an IDC for 3 months from Mar. 20
Fire incident – review/lessons (I) • DC infrastructure standards to comply with • ANSI TIA/EIA • ASHRAE thermal guidelines for data processing environments • Guidelines for green data centers are available, e.g., LEED • NFPA: fire suppression system • Capacity and type of UPS (min. scale) • Varies with the response time of the generators (see the rough sizing sketch below) • Adjust the rating of all breakers (NFB and ACB) • Location of UPS (open space, outside the power room) • Regular maintenance of batteries • Internal resistance measurement
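A rough sizing sketch for the UPS point above: the battery only needs to bridge the generator response time, so a faster generator start-up directly shrinks the required battery bank. All numbers here are hypothetical, not ASGC's actual figures.

    # Hypothetical UPS sizing: battery bridges the gap until generators take over.
    it_load_kw = 200.0           # assumed critical IT load (kW)
    generator_start_min = 10.0   # assumed generator response time (minutes)
    derating = 1.25              # margin for battery aging / end-of-discharge limits

    required_kwh = it_load_kw * (generator_start_min / 60.0) * derating
    print(f"Usable battery energy needed: {required_kwh:.0f} kWh")   # ~42 kWh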
Fire incident – review/lessons (II) • Smoke damage: fire stopping • Improvement of the monitoring system • Re-design the monitoring system • Earlier pre-action: consider VESDA • Emergency response and procedures • Routine fire drills are indispensable • A disaster recovery plan is necessary • Other improvements: • PP and hot/cold aisle splitting • Fiber panels: MDF and FOR • Overhead cable tray (existing: power tray in the subfloor) + fiber guide • Raised floor grommets
Photos: • Move out all facilities for cleaning • Protect racks from dust • Container as storage and humidification • Ceiling removal
Fire incident - Tape system • Snapshots of decommissioned tape drives after the incident
DC recovered – mid-May • FOR in area #1 • MDF moved to the center of the DC area • Hot/cold aisles fully split • Plan to replace racks to provide 1100 mm depth
IDC Collocation (I) • Site selection and paperwork – one week • Preparation at the IDC – one week • 15 racks + a reservation for the tape system (6 racks) • Power (14 kW per rack) • Cooling (perforated raised floor) • 10G protected SDH STM-64 networking between the IDC and ASGC
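A back-of-the-envelope power budget for the collocation, using the 14 kW per rack figure from this slide; whether the six tape racks would actually draw the full 14 kW is an assumption.

    compute_racks = 15
    tape_racks = 6              # reserved for the tape system
    kw_per_rack = 14.0          # per-rack power quoted above

    compute_kw = compute_racks * kw_per_rack
    upper_bound_kw = (compute_racks + tape_racks) * kw_per_rack
    print(f"compute racks: {compute_kw:.0f} kW")                    # 210 kW
    print(f"all 21 racks (upper bound): {upper_bound_kw:.0f} kW")   # 294 kW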
IDC collocation (II) • Relocation of 50+% of computing/storage – one week • 2k job slots (3.2 MSI2k), 26 chassis of blade servers • 2.3 PB storage (1 PB allocated dynamically) • Cabling + setup + reconfiguration – one week
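A quick consistency check of the relocation figures above, derived only from the numbers on this slide.

    job_slots = 2000
    total_msi2k = 3.2
    blade_chassis = 26

    per_slot_ksi2k = total_msi2k * 1000 / job_slots     # ~1.6 kSI2k per job slot
    slots_per_chassis = job_slots / blade_chassis       # ~77 slots per chassis
    print(f"{per_slot_ksi2k:.1f} kSI2k/slot, {slots_per_chassis:.0f} slots/chassis")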
IDC collocation (III) • Facility installation completed on Mar. 27 • Tape system delayed until after Apr. 9 • Realignment • RMA for faulty parts
T1 performance • 7 Gb/s peak reached to Amsterdam • 9 Gb/s peak observed between the IDC and ASGC
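For scale, the peaks above translate into the following daily volumes if they could be sustained around the clock (they normally cannot, so treat these as ceilings):

    def tb_per_day(gbit_per_s):
        # Gb/s -> GB/s (divide by 8) -> GB/day (x 86400 s) -> TB/day (/1000)
        return gbit_per_s / 8 * 86400 / 1000

    print(f"7 Gb/s to Amsterdam : ~{tb_per_day(7):.0f} TB/day")   # ~76 TB/day
    print(f"9 Gb/s IDC <-> ASGC : ~{tb_per_day(9):.0f} TB/day")   # ~97 TB/day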
Network – before May • [Topology diagram] Core routers: M320 (Sinica, Taipei), M120 (JP, KDDI Otemachi), M120 (HK, Mega-iAdvantage), M20 (SG, KIM CHUNG) • Peers and exchanges: SINet, APAN-JP/KEK, JPIX (GE*2), KREONET2, CERNet, CSTNet, WIDE, HKIX, HARNet, NUS, AARNet, Pacnet, TWGate, IP transit • Links: GE and 100M circuits, a 2.5G non-protected wavelength, NCIC 2.5G (STM-16) SDH, and 622M (STM-4) SDH on APCN2
Network – 2009 • [Topology diagram] Core routers: M320 (Sinica, Taipei), M120 (JP, KDDI Otemachi), M120 (HK, Mega-iAdvantage), M20 (Singapore, Global Switch) • Peers and exchanges: SINet, APAN-JP/KEK, JPIX (GE*2), KREONET2, CERNet, CSTNet, WIDE, HKIX, HARNet, NUS, SingAREN, AARNet, Pacnet, TWGate, IP transit • Links: GE and 100M circuits, STM-16 SDH, 2.5G (STM-16) SDH, and 622M (STM-4) SDH on EAC
ASGC Resource Level Targets • 2008 • 0.5 PB expansion of the tape system in Q2 • Met MOU target in mid-Nov. • 1.3 MSI2k per rack based on the recent E5450 processor • 2009 • 150 QC blade servers • 2 TB per drive for the RAID subsystem • 42 TB net capacity per chassis and 0.75 PB in total (see the sketch below)
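One plausible way the 42 TB net per chassis figure works out, assuming 24-bay chassis filled with 2 TB drives in RAID 6 plus one hot spare; the RAID layout is an assumption, not stated on the slide.

    bays, drive_tb = 24, 2
    raid6_parity, hot_spares = 2, 1

    net_tb = (bays - raid6_parity - hot_spares) * drive_tb   # 42 TB net per chassis
    chassis_needed = 750 / net_tb                            # 0.75 PB target -> ~18 chassis
    print(f"net per chassis: {net_tb} TB, chassis for 0.75 PB: ~{chassis_needed:.0f}")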
Hardware Profile and Selection (I) • CPU: • 2K8 expansion: 330 blade servers providing ~3.6 MSI2k • 7U-height chassis • SMP Xeon E5430 processors, 16 GB FB-DIMM • Each blade provides 11 kSI2k • 2 blades/U density, Web/SOL management • Current capacity: 2.4 MSI2k • Year-end total computing power: ~5.6 MSI2k • 22 kSI2k/U (24 chassis in 168U; cross-checked below)
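A cross-check of the capacity figures on this slide, using only the numbers quoted above:

    blades, ksi2k_per_blade = 330, 11
    chassis, chassis_height_u = 24, 7

    expansion_msi2k = blades * ksi2k_per_blade / 1000        # ~3.6 MSI2k expansion
    rack_units = chassis * chassis_height_u                  # 168U of chassis
    density = blades * ksi2k_per_blade / rack_units          # ~22 kSI2k per U
    print(f"{expansion_msi2k:.1f} MSI2k in {rack_units}U -> {density:.0f} kSI2k/U")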
Tape system • Before the incident: • 8 LTO3 + 4 LTO4 drives • 720 TB with LTO3 • 530 TB with LTO4 • May 2009: • Two loaned LTO3 drives • MES: 6 LTO4 drives at the end of May • Capacity: 1.3 PB (old) + 0.8 PB (LTO4) • New S54 model introduced • 2K slots with the tiered model • Upgraded ALMS • Enhanced gripper
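Rough cartridge counts behind the capacities above, assuming native (uncompressed) media capacities of 400 GB for LTO3 and 800 GB for LTO4:

    lto3_tb, lto4_tb = 0.4, 0.8   # native cartridge capacities

    print(f"720 TB on LTO3  -> ~{720 / lto3_tb:.0f} cartridges")   # ~1800
    print(f"530 TB on LTO4  -> ~{530 / lto4_tb:.0f} cartridges")   # ~660
    print(f"0.8 PB new LTO4 -> ~{800 / lto4_tb:.0f} cartridges")   # ~1000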
Roadmap – Host I/F • 2009: 3U/16-bay FC-SAS in May, 2U/12-bay and 4U/24-bay in June • U320 SCSI (≈ 320 MB/sec) • 4G FC (≈ 400 MB/sec) • 8G FC (≈ 800 MB/sec) • SAS 3G (4-lane ≈ 1200 MB/sec) • SAS 6G (4-lane ≈ 2400 MB/sec) • iSCSI – 1 Gb • iSCSI – 10 Gb
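Where the approximate MB/sec figures above come from: these host interfaces use 8b/10b encoding, so usable bandwidth is roughly the line rate divided by 10 bits per byte (a simplification; 4G FC, for example, actually signals at 4.25 Gbaud).

    def mb_per_s(line_rate_gbps, lanes=1):
        # 8b/10b encoding: 10 line bits carry one payload byte
        return line_rate_gbps * 1000 / 10 * lanes

    print(f"4G FC     ~{mb_per_s(4):.0f} MB/s")       # 400
    print(f"8G FC     ~{mb_per_s(8):.0f} MB/s")       # 800
    print(f"SAS 3G x4 ~{mb_per_s(3, 4):.0f} MB/s")    # 1200
    print(f"SAS 6G x4 ~{mb_per_s(6, 4):.0f} MB/s")    # 2400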
Roadmap – Drive I/F • 2009: U320 SCSI • 4G FC • SAS 3G • SAS 6G • SATA-II • 2.5” SSD (B12F series)
Est. Density • 2009 H1: 1 TB drives, 1 rack (42U) = 240 TB • 2009 H2: 2 TB drives, 1 rack (42U) = 480 TB • 2010 H1: 2 TB drives, 1 rack (42U) = 480 TB • 2010 H2: 3 TB drives, 1 rack (42U) = 720 TB • 2012: 5 TB drives …
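The per-rack figures above are consistent with ten 4U/24-bay enclosures per 42U rack (240 drive bays); the enclosure mix is an assumption based on the roadmap slide.

    bays_per_rack = 10 * 24   # assumed: ten 4U/24-bay enclosures per 42U rack

    for label, drive_tb in [("2009 H1", 1), ("2009 H2", 2), ("2010 H2", 3), ("2012", 5)]:
        print(f"{label}: {drive_tb} TB drives -> {bays_per_rack * drive_tb} TB per rack")
    # -> 240, 480, 720 TB per rack as above, and 1200 TB with 5 TB drives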
Future remarks • DC fully restored by the end of May • Restart round-the-clock operation • Relocated resources fully involved in STEP09 • Facility relocation back from the IDC at the end of Jun • New resource expansion at the end of Jul • Improve DC monitoring
Water mist • Fire suppression system • Review the implementation of the gas suppression system • Consider water mist in the power room • Wall cabinet outside the data center area