BINP/GCF Status Report | A.S.Zaytsev@inp.nsk.su | Jan 2010
Overview • Current status • Resource accounting • Summary of recent activities and achievements • BINP/GCF & NUSC (NSU) integration • BINP LCG site related activities • Proposed hardware upgrades • Future prospects
BINP LCG Farm: Present Status • CPU: 40 cores (100 kSI2k) | 200 GB RAM • HDD: 25 TB raw (22 TB visible) • Input power limit: 15 kVA • Heat output: 5 kW
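The headline figures above imply a few per-core numbers used throughout this report; the short sketch below derives them. The ~2.5 kSI2k/core rating is an inference from the quoted totals, not a benchmarked value.

```python
# Back-of-the-envelope figures derived from the numbers quoted on this slide.
# The per-core rating is inferred by spreading 100 kSI2k evenly over 40 cores;
# it is not an officially benchmarked per-core value.

CORES = 40           # CPU cores in the farm
TOTAL_KSI2K = 100    # total computing power, kSI2k
RAM_GB = 200         # total RAM, GB
HDD_RAW_TB = 25      # raw disk capacity, TB
HDD_VISIBLE_TB = 22  # usable (visible) capacity, TB

print(f"per-core rating : {TOTAL_KSI2K / CORES:.1f} kSI2k/core")             # ~2.5
print(f"RAM per core    : {RAM_GB / CORES:.1f} GB/core")                     # ~5.0
print(f"RAID/FS overhead: {100 * (1 - HDD_VISIBLE_TB / HDD_RAW_TB):.0f} %")  # ~12 %
```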
Resource Allocation Accounting (up to 80 VM slots are now available within 200 GB of RAM) • Computing Power: 90% full, 150% reserved (200% limit) • LCG: 4 host systems now (40%); a 70% share is expected for production with ATLAS VO in the near future • KEDR: 4.0-4.5 host systems (40-45%) • VEPP-2000, CMD-3, SND, test VMs, etc.: 1.5-2.0 host systems (15-20%) • Centralized Storage: 35% full, 90% reserved (100% limit) • LCG: 0.5 TB (VM images) + 15 TB (DPM + VO SW) • KEDR: 0.5 TB (VM images) + 4 TB (local backups) • CMD-3: 1 TB reserved for the scratch area & local home • NUSC/NSU: up to 4 TB reserved for the local NFS/PVFS2 buffer
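As a reading aid, the sketch below shows the kind of accounting behind the "full / reserved (limit)" figures on this slide: VM slots may be over-committed against RAM (hence a limit above 100%), while storage reservations are capped by the visible capacity. The absolute input numbers are back-solved from the quoted percentages purely for illustration.

```python
# Minimal sketch of the accounting behind the "full / reserved (limit)" figures.
# The example inputs are back-solved from the percentages quoted on the slide
# and are illustrative only, not actual monitoring data.

def utilization(used, reserved, capacity, overcommit_limit=1.0):
    """Return percent-full, percent-reserved and the over-commit ceiling."""
    return (100 * used / capacity,
            100 * reserved / capacity,
            100 * overcommit_limit)

# VM slots can be over-subscribed in RAM (limit 200 %); storage cannot (limit 100 %).
print("Computing power    : %.0f%% full, %.0f%% reserved (%.0f%% limit)"
      % utilization(72, 120, 80, 2.0))    # 80 VM slots taken as the capacity unit
print("Centralized storage: %.0f%% full, %.0f%% reserved (%.0f%% limit)"
      % utilization(7.7, 19.8, 22, 1.0))  # 22 TB visible capacity
```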
BINP/GCF Activities in 2009Q4 (sorted by priority, from highest to lowest) • [done] Testing and tuning the 10 Gbps NSC/SCN channel to NSU and getting it to production state • [done] Deploying a minimalistic LCG site locally at BINP • [done] Integrating the BINP/GCF and NUSC (NSU) cluster network and virtualization systems • [done] Probing the feasibility of efficient resource use under VMware with native KEDR VMs deployed in various ways • [done] Finding a long-term stable configuration of KEDR VMs running on several host systems in parallel • [in progress] Getting to production with ATLAS VO with the 25 kSI2k / 15 TB SLC4 based LCG site configuration • [in progress] Preparing LCG VMs for running on the NUSC (NSU) side • [in progress] Studying the impact of BINP-MSK & BINP-CERN connectivity issues on GStat & SAM test failures
BINP/GCF & NUSC (NSU) Integration • BINP/GCF: XEN images • NUSC: VMware images (converted from XEN) • Various deployment options were studied: • IDE/SCSI virtual disk (VD) • VD performance/reliability tuning • Local/central deployment • 1:1 and 2:1 VCPU / real CPU core modes (see the capacity sketch below) • Optionally disabling swap on the host system • Up to 2 host systems with 16 VCPUs combined have been tested (1 GB RAM/VCPU) • Long-term stability (up to 5 days) has so far been demonstrated only for locally deployed VMs; the remaining problems are most likely related to the centralized storage system of the NUSC cluster • Work is currently suspended due to the hardware failure of the NSC/SCN switch on the BINP side (more news by the end of the week)
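The capacity sketch referenced above illustrates the estimate behind the 1:1 and 2:1 VCPU-per-core tests; the 1 GB RAM per VCPU figure comes from the slide, while the per-host core and RAM counts are assumptions used only to make the example concrete.

```python
# Sketch of the VM capacity estimate behind the 1:1 / 2:1 VCPU-per-core tests.
# The 1 GB RAM per VCPU figure is taken from the slide; the per-host core and
# RAM numbers below are illustrative assumptions.

RAM_PER_VCPU_GB = 1.0

def vcpu_capacity(host_cores, host_ram_gb, overcommit):
    """VCPUs a host can run, limited by CPU over-commit and by RAM per VCPU."""
    by_cpu = int(host_cores * overcommit)
    by_ram = int(host_ram_gb / RAM_PER_VCPU_GB)
    return min(by_cpu, by_ram)

# Hypothetical 8-core / 16 GB host:
for ratio in (1, 2):  # 1:1 and 2:1 VCPU-to-real-core modes
    print(f"{ratio}:1 mode -> up to {vcpu_capacity(8, 16, ratio)} VCPUs per host")
```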
BINP LCG Site Related Activities • STEP 1: DONE • Defining the basic site configuration, deploying the LCG VMs, going through the GOCDB registration, etc. • STEP 2: DONE • Refining the VM configuration, tuning up the network, getting new RDIG host certificates, registering the VOs, handling the errors reported by SAM tests, etc. • STEP 3: IN PROGRESS • Get OK for all the SAM tests (currently being dealt with) • Confirm the stability of operations for 1-2 weeks • Upscale the number of WNs to the production level (from 12 up to 32 CPU cores = 80 kSI2k max) • Ask ATLAS VO admins to install the experiment software on the site • Test the site's ability to run ATLAS production jobs • Check whether the 110 Mbps SB RAS channel can carry the load of an 80 kSI2k site (see the estimate below) • Get to production with ATLAS VO
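The estimate referred to in STEP 3 can be sketched as follows; the 2.5 kSI2k/core rating follows from the current farm figures, while the average WAN bandwidth per running ATLAS job is an assumption used only to show the shape of the calculation.

```python
# Rough estimate for the "can the 110 Mbps SB RAS channel carry an 80 kSI2k
# site" question.  The per-job bandwidth figure is an assumption; real ATLAS
# production I/O patterns would have to be measured on the site itself.

CHANNEL_MBPS = 110    # SB RAS uplink capacity
SITE_KSI2K = 80       # target LCG site capacity
KSI2K_PER_CORE = 2.5  # inferred from 100 kSI2k / 40 cores
MBPS_PER_JOB = 2.0    # assumed average WAN I/O per running job (illustrative)

slots = SITE_KSI2K / KSI2K_PER_CORE  # 32 concurrent job slots
demand = slots * MBPS_PER_JOB        # aggregate WAN demand, Mbps
print(f"{slots:.0f} job slots -> ~{demand:.0f} Mbps needed, "
      f"i.e. {100 * demand / CHANNEL_MBPS:.0f} % of the {CHANNEL_MBPS} Mbps channel")
```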
BINP/GCF Activities in 2010Q1-2 (sorted by priority, from highest to lowest) • Recovering from the 10 Gbps NSC/SCN failure on the BINP side • Getting to production with 32-64 VCPUs for KEDR VMs on the NUSC side • Recovering BINP LCG site visibility under GStat 2.0 • Getting to production with ATLAS VO with the 25 kSI2k / 15 TB LCG site configuration • Testing LCG VMs on the NUSC (NSU) side • Finding a stable configuration for LCG VMs on NUSC • Upscaling the LCG site to 80-200 kSI2k by using both BINP/GCF and NUSC resources (see the sketch below) • Migrating the LCG site to SLC5 x86_64 and CREAM CE as suggested by ATLAS VO and RDIG • Quantifying how the existing NSC network channel limits our LCG site performance/reliability • Allowing other local experiments to access NUSC resources via GRID farm interfaces (using the farm as a pre-production environment)
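For the upscaling item above, the sketch below converts the 80-200 kSI2k target into CPU core counts, assuming NUSC cores have roughly the same ~2.5 kSI2k rating as the current farm hardware (they may well be rated differently).

```python
# Sketch of what the 80-200 kSI2k upscaling target means in CPU cores,
# assuming ~2.5 kSI2k per core as on the current farm (an assumption: NUSC
# nodes may have a different rating).

KSI2K_PER_CORE = 2.5
BINP_MAX_CORES = 32   # the 80 kSI2k ceiling quoted for the BINP-only setup

for target_ksi2k in (80, 200):
    total_cores = target_ksi2k / KSI2K_PER_CORE
    nusc_cores = max(0, total_cores - BINP_MAX_CORES)
    print(f"{target_ksi2k} kSI2k -> {total_cores:.0f} cores in total, "
          f"of which ~{nusc_cores:.0f} would have to come from NUSC")
```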
Future Prospects • Major upgrade of the BINP/GCF hardware, focusing on storage system capacity and performance • Up to 0.5 PB of online storage • Switched SAN fabric • Further extension of the SC Network and virtualization environment • TSU with 1100+ CPU cores is the most attractive target • Solving the NSK-MSK connectivity problem for the LCG site • A dedicated VPN to MSK-IX seems to be the best solution • Start acquiring the next generation hardware this year • 8x increase in CPU core density • Adding a DDR IB (20 Gbps) network to the farm • 8 Gbps FC based SAN • 2x increase in storage density • Establish private 10 Gbps links between the local experiments and the BINP/GCF farm, thus allowing them to use NUSC resources
680 CPU cores / 540 TB Configuration, 2012 (projected) • 95 kVA UPS subsystem • 1.4 M$ in total • 16 CPU cores / 1U, 4 GB RAM / CPU core • 8 Gbps FC SAN fabric • 20 Gbps (DDR IB) / 10 Gbps (Ethernet) / 4x 1 Gbps (Ethernet) interconnect
168 CPU cores / 300 TB Configuration, 2010 (proposed) • 55 kVA UPS subsystem • +14 MRub • 5x CPU power, 10x storage capacity • DDR IB & 8 Gbps FC already introduced at this stage
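The sketch below cross-checks the headline scaling factors of the two configurations above against the present 40-core / 25 TB farm; the exact meaning of "5x CPU power" (core count plus per-core improvements) is an interpretation, not a benchmarked figure.

```python
# Cross-check of the scaling factors quoted on the 2010/2012 configuration
# slides against the present farm (40 cores, 25 TB raw).  "5x CPU power" is
# assumed to include per-core improvements on top of the raw core count.

present         = {"cores": 40,  "storage_tb": 25}
proposed_2010   = {"cores": 168, "storage_tb": 300}
prospected_2012 = {"cores": 680, "storage_tb": 540,
                   "ram_gb_per_core": 4, "cores_per_1u": 16}

print(f"2010: {proposed_2010['cores'] / present['cores']:.1f}x cores, "
      f"{proposed_2010['storage_tb'] / present['storage_tb']:.0f}x raw storage")
print(f"2012: {prospected_2012['cores'] * prospected_2012['ram_gb_per_core']} GB RAM in total, "
      f"{prospected_2012['cores'] / prospected_2012['cores_per_1u']:.1f} U of compute nodes")
```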
PDU & Cooling Requirements • PDU • 15 kVA are available now (close to the limit; no way to plug in the proposed 20 kVA UPS devices!) • 170-200 kVA (0.4 kV) & APC EPO subsystems are needed (a draft of the technical specifications was prepared in 2009Q2) • Engineering drawings of the BINP/ITF hall have been recovered by CSD • The list of requirements is yet to be finalized • Cooling • 30-35 kW are available now (7 kW modules, open technical water circuit) • 120-150 kW of extra cooling is required, assuming an N+1 redundancy scheme (see the sizing sketch below) • Various cooling schemes were studied; locally installed water-cooled air conditioners seem to be the best solution (18 kW modules, closed water loop) • No final design yet • Once the 2010 hardware purchasing plans are settled, the upgrade must be initiated
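The sizing sketch referenced above: given the 120-150 kW extra cooling demand and the 18 kW module size quoted on the slide, the N+1 module count is simple rounding.

```python
# N+1 sizing arithmetic for the extra cooling demand quoted on this slide
# (120-150 kW) with 18 kW water-cooled air conditioner modules.

import math

MODULE_KW = 18  # capacity of one water-cooled air conditioner module

for extra_kw in (120, 150):
    needed = math.ceil(extra_kw / MODULE_KW)  # modules required to cover the load
    print(f"{extra_kw} kW extra -> {needed} modules + 1 spare = {needed + 1} total (N+1)")
```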
Projected 10 Gbps SC Network Layout (network diagram: clusters with 1100+ CPU cores, in operation since 2007, and 1000+ CPU cores, expected in 2010Q3-4)
Summary • Major progress has been achieved in integrating the BINP/GCF and NUSC (NSU) computing resources • The scheme tested with KEDR VMs should be exploited by other experiments as well (e.g. CMD-3) • The 10 Gbps channel (once restored) will allow direct use of NUSC resources from the BINP site (e.g. ANSYS for the needs of VEPP-2000) • The LCG site may take advantage of the NUSC resources as well (200 kSI2k would give us a much better standing) • An upgrade of the BINP/ITF infrastructure (at least of the PDU subsystem) is required before the new hardware can be installed • If we are able to get the extra networking hardware as proposed, we may start connecting the experiments to the GRID farm and NUSC resources with 10 Gbps Ethernet uplinks this year