ASGC Tier1 Center & Service Challenges activities ASGC, Jason Shih
Outline • Tier1 center operations • Resource status, QoS and utilization • User support • Other activities at ASGC (non-HEP) • Biomed DC2 • Service availability • Service challenges • SC4 disk-to-disk throughput testing • Future remarks • SA improvement • Resource expansion
Computing resources • Instability of the information system (IS) caused ASGC service endpoints to be removed from the experiments' BDII • High load on the CE affects the site information being published (the site GIIS runs on the CE)
Job execution at ASGC • Instability of the site GIIS causes errors in publishing dynamic information • High load on the CE leads to abnormal functioning of Maui
OSG/LCG resource integration • Mature technology helps integrate resources • GCB introduced to integrate with the IPAS T2 computing resources • CDF/OSG users can submit jobs by gliding in through the GCB box • T1 computing resources accessed via the “twgrid” VO • Customized UI to ease access to backend storage resources • helps local users not yet ready for the grid • HEP users access T1 resources
ASGC Helpdesk • Currently supports the following service queues: • CIC/ROC • PRAGMA • HPC • SRB • Sub-queues of CIC/ROC: • T1 • CASTOR • SC • SSC
Biomed DC2 • First run on a subset of the 36,690 ligands (started 4 April 2006); fourth run started 21 April • Added 90 KSI2k dedicated to DC2 activities, with an additional subcluster introduced in the IS • Maintained site functionality to support grid jobs from DC2 • Troubleshooting grid-wide issues • Collaborating with Biomed in AP operations • AP: GOG-Singapore devoted resources to DC2
Biomed DC2 (cont'd) • Two frameworks used: DIANE and WISDOM • ~30% average contribution from ASGC across the 4 runs (DIANE)
SC4 disk-to-disk transfer • Problems observed at ASGC: • system crashed immediately when the TCP buffer size was increased • CASTOR experts helped with troubleshooting, but the problem remained for the 2.6 kernel + XFS • downgraded the kernel to 2.4 + 1.2.0rh9 gridftp + XFS • again, crashes when the window size was tuned • problem resolved only after downgrading gridftp to the same version used for the SC3 disk rerun (Apr. 27, 7 AM) • tried on one disk server first, then rolled out to the remaining three • 120+ MB/s observed since • continued running for one week • (see the window-size sketch below)
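As a rough sanity check on the window-size tuning above, a bandwidth-delay-product estimate shows why multi-megabyte TCP windows matter on this path. This is a minimal sketch; the RTT value is an assumption for a long-haul Taiwan-to-CERN link, not a figure from the slides.

```python
# Bandwidth-delay-product sketch for the SC4 disk-to-disk transfers.
# Assumption: RTT of ~300 ms on the ASGC <-> CERN path (not stated in the slides).

PER_SERVER_RATE_MB_S = 30   # per-disk-server rate observed after the gridftp downgrade
ASSUMED_RTT_S = 0.30        # assumed round-trip time (hypothetical value)

# TCP window needed per stream to keep the pipe full: bandwidth * RTT
window_mb = PER_SERVER_RATE_MB_S * ASSUMED_RTT_S
print(f"Window needed per stream: ~{window_mb:.0f} MB")      # ~9 MB

# Aggregate over the four round-robin disk servers quoted on the GridView slide
print(f"Aggregate rate: ~{4 * PER_SERVER_RATE_MB_S} MB/s")   # ~120 MB/s
```

With windows of only a few hundred kilobytes the same path would sustain well under 30 MB/s per stream, which is consistent with the need for the 128 MB maximum window quoted on the troubleshooting slide.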
Castor troubleshooting • Notes: • * gridftp bundled in CASTOR • + ver. 2.4 kernel, 2.4.21-40.EL.cern, adopted from CERN • ** ver. 2.4 kernel, 2.4.20-20.9.XFS1.3.1, introduced by SGI • ++ exact version 2.6.9-11.EL.XFS • $ TCP window size tuned, max 128 MB • Stack size recompiled to 8k for each experimental kernel adopted
SC CASTOR throughput: GridView • Disk-to-disk nominal rate • ASGC has currently reached 120+ MB/s sustained throughput • Round-robin SRM head nodes associated with 4 disk servers, each providing ~30 MB/s • Kernel/CASTOR software issues debugged early in SC4 (throughput reduced to ~25% of nominal until then, without further tuning)
Castor2@ASGC • Testbed expected to be deployed by end of March • Delayed due to: • obtaining the LSF license from Platform • DB schema troubleshooting • manpower overlapping with debugging of CASTOR SC throughput • Plan revised in the 2006 Q1 quarterly report • Split into two phases: • Phase (I): without tape functional testing; plan to connect to the tape system in the next phase; expected to complete by mid-May • Phase (II): planned to finish by mid-June
Future remarks • Resource expansion plan • QoS improvement • Castor2 deployment • New tape system installed • Continue with disk-to-tape throughput validation • Resource sharing with local users • for users more ready to use the grid • large storage resources required
Resource expansion: MoU • *FTT: Federated Taiwan Tier2
Resource expansion (I) • CPU • Current status: • 430 KSI2k (composed of IBM HS20 and Quanta blades) • Goal: 950 KSI2k • Quanta blades: • 7U chassis, 10 blades, dual CPU, ~1.4 KSI2k/CPU • ratio ~30 KSI2k per 7U; 19 chassis needed (~4 racks) to meet 950 KSI2k • IBM blades: • LV model available (~70% lower power consumption) • higher density, 54 processors (dual-core + SMP Xeon) • ratio ~80 KSI2k per 7U; only 13 chassis needed (~3 racks) • (rough arithmetic sketch below)
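A rough arithmetic sketch of the chassis counts quoted above. The per-chassis rating of ~28 KSI2k for Quanta (10 blades x 2 CPUs x 1.4 KSI2k), the reading that the expansion covers the gap from 430 to 950 KSI2k, and the 5 chassis per rack figure are assumptions, not slide facts.

```python
import math

# CPU expansion arithmetic; per-chassis ratings and the "cover the gap" reading
# are assumptions used only to reproduce the slide's ballpark numbers.
CURRENT_KSI2K = 430
TARGET_KSI2K = 950
gap = TARGET_KSI2K - CURRENT_KSI2K            # 520 KSI2k still needed

QUANTA_KSI2K_PER_CHASSIS = 10 * 2 * 1.4       # ~28 KSI2k per 7U chassis (assumed)
IBM_KSI2K_PER_CHASSIS = 80                    # ~80 KSI2k per 7U chassis (slide figure)
CHASSIS_PER_RACK = 5                          # assumed 7U chassis per rack incl. overhead

quanta_chassis = math.ceil(gap / QUANTA_KSI2K_PER_CHASSIS)   # 19, matching the slide
quanta_racks = math.ceil(quanta_chassis / CHASSIS_PER_RACK)  # ~4 racks
ibm_chassis = math.ceil(gap / IBM_KSI2K_PER_CHASSIS)         # 7 by this simple ratio;
                                                             # the slide quotes 13,
                                                             # presumably sized for the
                                                             # full target plus headroom
print(quanta_chassis, quanta_racks, ibm_chassis)
```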
Resource expansion (II) • Disk • Current status: • 3U array, 400 GB drives, 14 drives per array • ratio: ~4.4 TB per 6U • Goal: • 400 TB, ~90 arrays needed • ~9 racks (assuming 11 arrays per rack) • Tape • New 3584 tape library to be installed mid-May • 4 x LTO4 tape drives providing ~80 MB/s throughput • originally expected in mid-March; delayed due to: • internal procurement • updating project items with the funding agency • new tape system now expected mid-May • full system in operation within two weeks of installation • (see the array/rack arithmetic below)
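The disk figures above can be cross-checked with simple division; the only inputs are the ~4.4 TB per array and 11 arrays per rack quoted on the slide.

```python
import math

# Cross-check of the disk expansion figures (all inputs taken from the slide).
TARGET_TB = 400
TB_PER_ARRAY = 4.4
ARRAYS_PER_RACK = 11

arrays = math.ceil(TARGET_TB / TB_PER_ARRAY)     # 91, i.e. ~90 arrays as quoted
racks = math.ceil(arrays / ARRAYS_PER_RACK)      # 9 racks
print(arrays, racks)
```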
Resource expansion (III) • New machine room in the C2 area of IPAS • Rack space design • AC/cooling requirements: • 20 compute racks (~2,800 KSI2k): 1,360,000 BTU/h, or 113.3 tons of cooling • 36 storage racks (~1,440 TB): 1,150,000 BTU/h, or ~95 tons • HVAC: ~800 kVA estimated (HS20 blades: 4,000 W x 5 chassis x 20 racks; STK arrays: 1,000 W x 11 arrays x 36 racks) • backup generator • (conversion check below)
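A quick unit check of the cooling and power estimates above, using the standard conversion of 12,000 BTU/h per ton of refrigeration; all input numbers are taken from the slide.

```python
# Unit check for the machine-room cooling and power estimates.
BTUH_PER_TON = 12_000   # standard definition of a ton of refrigeration

print(1_360_000 / BTUH_PER_TON)   # ~113.3 tons for the 20 compute racks
print(1_150_000 / BTUH_PER_TON)   # ~95.8 tons for the 36 storage racks

# Electrical load behind the ~800 kVA HVAC/UPS estimate
hs20_w = 4_000 * 5 * 20           # HS20 blade chassis: 400 kW
stk_w = 1_000 * 11 * 36           # STK disk arrays:    396 kW
print((hs20_w + stk_w) / 1_000)   # ~796 kW, consistent with ~800 kVA
```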
Summary • New tape system ready mid-May, full operation within two weeks • plan disk-to-tape throughput testing • Split the batch system from the CE • helps stabilize scheduling functionality (mid-May) • Site GIIS sensitive to high CPU load; moving it to an SMP box • CASTOR2 to be deployed mid-June • connect to the new tape library • migrate data from the disk cache
Acknowledgment • CERN: • SC: Jamie, Maarten • Castor: Olof • Atlas: Zhong-Liang Ren • CMS: Chia-Ming Kuo • ASGC: • Min, Hung-Che, J-S • Oracle: J.H. • Network: Y.L., Aries • CA: Howard • IPAS: P.K., Tsan, & Suen
Disk server snapshot (I) • Host: lcg00116 • Kernel: 2.4.20-20.9.XFS1.3.1 • Castor gridftp ver.: VDT1.2.0rh9-1
Disk server snapshot (II) • Host: lcg00118 • Kernel: 2.4.21-40.EL.cern • Castor gridftp ver.: VDT1.2.0rh9-1
Disk server snapshot (III) • Host: sc003 • Kernel version: 2.6.9-11.EL.XFS • Castor gridftp ver.: VDTALT1.1.8-13d.i386