
ASGC Tier1 Center & Service Challenges activities


Presentation Transcript


  1. ASGC Tier1 Center & Service Challenges activities ASGC, Jason Shih

  2. Outline • Tier1 center operations • Resource status, QoS and utilization • User support • Other activities in ASGC (excluding HEP) • Biomed DC2 • Service availability • Service challenges • SC4 disk-to-disk throughput testing • Future remarks • SA improvement • Resource expansion

  3. ASGC T1 operations

  4. WAN connectivity

  5. ASGC Network

  6. Computing resources • Instability of the information system (IS) caused ASGC service endpoints to be dropped from the experiments' BDIIs • High load on the CE affects the site information being published (the site GIIS runs on the CE); a query sketch follows below
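A quick way to check whether the site's service endpoints are still being published is to query the information system over LDAP. The sketch below (Python, using the ldap3 package) is a minimal illustration only: the hostname is a placeholder, and port 2170 with the base mds-vo-name=local,o=grid are the conventional GLUE/BDII settings, assumed here rather than taken from ASGC's actual configuration.

    from ldap3 import ALL, Connection, Server

    BDII_HOST = "bdii.example.org"         # placeholder host, not a real ASGC endpoint
    BDII_PORT = 2170                       # conventional BDII port (assumption)
    BASE_DN = "mds-vo-name=local,o=grid"   # conventional GLUE base DN (assumption)

    server = Server(BDII_HOST, port=BDII_PORT, get_info=ALL)
    conn = Connection(server, auto_bind=True)

    # List every GlueService entry the information system currently publishes.
    conn.search(
        search_base=BASE_DN,
        search_filter="(objectClass=GlueService)",
        attributes=["GlueServiceType", "GlueServiceEndpoint"],
    )
    for entry in conn.entries:
        print(entry.GlueServiceType, entry.GlueServiceEndpoint)

If the expected CE or SRM endpoints are missing from the output, the experiment BDIIs will drop the site, which is the failure mode described above.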

  7. Job execution at ASGC • Instability of the site GIIS causes errors in publishing dynamic information • High load on the CE leads to abnormal behaviour of Maui

  8. OSG/LCG resource integration • Mature technology helps integrate resources • GCB introduced to help integrate with the IPAS T2 computing resources • CDF/OSG users can submit jobs by gliding in through the GCB box • Access to T1 computing resources via the "twgrid" VO • Customized UI to help users access the backend storage resources • Helps local users who are not yet ready for the grid • HEP users access T1 resources

  9. ASGC Helpdesk • Currently supports the following service queues: • CIC/ROC • PRAGMA • HPC • SRB • Sub-queues under CIC/ROC: • T1 • CASTOR • SC • SSC

  10. ASGC TRS: Accounting

  11. Biomed DC2 • First run on a subset of the 36,690 ligands (started 4 April 2006); fourth run started 21 April • Added 90 KSI2k dedicated to DC2 activities, introducing an additional subcluster in the IS • Keeping the site functional to help absorb grid jobs from DC2 • Troubleshooting grid-wide issues • Collaborating with Biomed on AP operations • AP: GOG-Singapore devoted resources to DC2

  12. Biomed DC2 (cont'd) • Two frameworks introduced: DIANE and WISDOM • Average ~30% contribution from ASGC over the four DIANE runs

  13. Service Availability

  14. Service Challenge 4 (SC4)

  15. SC4 Disk-to-disk transfer • Problems observed at ASGC: • The system crashes immediately when the TCP buffer size is increased (a buffer-tuning sketch follows below) • CASTOR experts helped troubleshoot, but the problem remains with the 2.6 kernel + XFS • Downgraded to a 2.4 kernel + 1.2.0rh9 GridFTP + XFS • Again crashes once the window size is tuned • Problem resolved only after downgrading GridFTP to the same version used for the SC3 disk rerun (27 April, 7 AM) • Tried on one disk server first, then rolled out to the remaining three • 120+ MB/s has been observed • Will continue running for one week
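For illustration only: the window-size tuning above was done at the kernel and GridFTP level, but the effect of requesting larger TCP buffers can be sketched with plain Python sockets. The 8 MB value below is an arbitrary example; the SC4 tuning pushed the maximum to 128 MB.

    import socket

    BUF_SIZE = 8 * 1024 * 1024  # example request; SC4 tuning went up to 128 MB

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel for larger per-socket send/receive buffers.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)

    # The kernel clamps requests to net.core.wmem_max / rmem_max, so read the
    # effective values back (Linux reports roughly twice the granted size).
    print("SO_SNDBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    print("SO_RCVBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))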

  16. Castor troubleshooting • GridFTP bundled with CASTOR • Kernels tested: 2.4.21-40.EL.cern (2.4, adopted from CERN), 2.4.20-20.9.XFS1.3.1 (2.4, introduced by SGI), 2.6.9-11.EL.XFS (2.6) • TCP window size tuned, up to a maximum of 128 MB • Stack size recompiled to 8 kB for each experimental kernel adopted

  17. SC Castor throughput: GridView • Disk-to-disk nominal rate • ASGC has currently reached 120+ MB/s sustained throughput • Round-robin SRM headnodes are associated with 4 disk servers, each providing ~30 MB/s (see the sketch below) • Debugging kernel/CASTOR software issues in the early part of SC4 (throughput dropped to only 25%, without further tuning)
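A rough sketch of how the round-robin setup adds up, assuming four disk servers at ~30 MB/s each; the server names and the number of concurrent transfers are placeholders, not the actual ASGC hosts.

    from itertools import cycle

    # Hypothetical per-server sustained rates in MB/s.
    servers = {"ds1": 30, "ds2": 30, "ds3": 30, "ds4": 30}

    # Round-robin assignment of concurrent transfers across the disk servers,
    # mimicking the SRM headnode alias rotating over its backends.
    assignment = {name: [] for name in servers}
    pool = cycle(servers)
    for transfer_id in range(8):          # e.g. 8 concurrent transfers
        assignment[next(pool)].append(transfer_id)

    print(assignment)
    print("aggregate ~", sum(servers.values()), "MB/s")   # ~120 MB/s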

  18. Tier-1 Accounting: Jan – Mar 2006

  19. Accounting: VO

  20. Overall Accounting: CMS/Atlas

  21. CMS usage: CRAB monitoring

  22. SRM QoS monitoring: CMS Heartbeat

  23. Castor2@ASGC • Testbed was expected to be deployed at the end of March • Delayed due to: • Obtaining the LSF license from Platform • DB schema troubleshooting • Manpower overlapping with debugging of the Castor SC throughput • Revised 2006 Q1 quarterly report: work separated into two phases • Phase (I): without tape functional testing; the tape system will be connected in the next phase; expected to complete in mid-May • Phase (II): planned to finish in mid-June

  24. Future remarks • Resource expansion plan • QoS improvement • Castor2 deployment • New tape system installed • Continue with disk-to-tape throughput validation • Resource sharing with local users • For users more ready to use the grid • Large storage resources required

  25. Resource expansion: MoU *FTT: Federated Taiwan Tier2

  26. Resource expansion (I) • CPU • Current status: • 430 KSI2k (composed of IBM HS20 and Quanta blades) • Goal: • Quanta blades • 7U, 10 blades, dual CPU, ~1.4 KSI2k/CPU • ratio 30 KSI2k per 7U; to meet 950 KSI2k, 19 chassis are needed (~4 racks) • IBM blades • LV model available (saving 70% power consumption) • Higher density: 54 processors (dual-core + SMP Xeon) • ratio ~80 KSI2k per 7U, only 13 chassis needed (~3 racks)

  27. Resource expansion (II) • Disk • Current status: • 3U arrays, 400 GB drives, 14 drives per array • ratio: 4.4 TB per 6U • Goal: • 400 TB, so ~90 arrays needed • ~9 racks (assuming 11 arrays per rack); see the sketch below • Tape • New 3584 tape library installed in mid-May • 4 x LTO4 tape drives providing ~80 MB/s throughput • Originally expected to be installed in mid-March; delayed due to internal procurement and updating the project items with the funding agency • New tape system now expected in mid-May, with the full system in operation within two weeks of installation
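The array and rack counts follow directly from the per-array capacity quoted above; a minimal check, assuming 4.4 TB usable per array and 11 arrays per rack:

    import math

    TARGET_TB = 400        # required disk capacity
    TB_PER_ARRAY = 4.4     # usable capacity per array (slide figure)
    ARRAYS_PER_RACK = 11

    arrays = math.ceil(TARGET_TB / TB_PER_ARRAY)
    racks = math.ceil(arrays / ARRAYS_PER_RACK)
    print(f"{arrays} arrays in {racks} racks")   # 91 arrays in 9 racks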

  28. IBM 3584 vs. STK SL8500

  29. Resource expansion (III) • New machine room in the C2 area of IPAS • Rack space design • AC / cooling requirements (see the sketch below): • 20 racks (2800 KSI2k): 1,360,000 BTU/h, or 113.3 tons of cooling • 36 racks (1440 TB): 1,150,000 BTU/h, or 95 tons • HVAC: ~800 kVA estimated (HS20: 4000 W x 5 x 20, plus STK arrays: 1000 W x 11 x 36) • Generator
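The cooling figures can be reproduced from the quoted rack power draw using the standard conversions of roughly 3.412 BTU/h per watt and 12,000 BTU/h per ton of cooling; the small difference from the slide's 1,360,000 BTU/h / 113.3 tons for the compute racks is just rounding of the conversion factor.

    WATT_TO_BTUH = 3.412     # 1 W is roughly 3.412 BTU/h
    BTUH_PER_TON = 12_000    # 1 ton of cooling removes 12,000 BTU/h

    # 20 compute racks, 5 HS20 chassis per rack, 4000 W per chassis (slide figures).
    compute_watts = 4000 * 5 * 20

    btuh = compute_watts * WATT_TO_BTUH
    tons = btuh / BTUH_PER_TON
    print(f"{btuh:,.0f} BTU/h ~ {tons:.1f} tons of cooling")
    # -> 1,364,800 BTU/h ~ 113.7 tons (slide quotes 1,360,000 BTU/h / 113.3 tons)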

  30. Summary • New tape system ready in mid-May, full operation two weeks later • Plan to run disk-to-tape throughput testing • Split the batch system and CE • Helps stabilize the scheduling functionality (mid-May) • Site GIIS is sensitive to high CPU load; move it to an SMP box • Castor2 deployed by mid-June • Connect to the new tape library • Migrate data from the disk cache

  31. Acknowledgment • CERN: • SC: Jamie, Maarten • Castor: Olof • Atlas: Zhong-Liang Ren • CMS: Chia-Ming Kuo • ASGC: • Min, Hung-Che, J-S • Oracle: J.H. • Network: Y.L., Aries • CA: Howard • IPAS: P.K., Tsan, & Suen

  32. SRM & MSS deployed at each Tier-1

  33. Nominal network/disk rates by sites

  34. Target disk – tape throughput

  35. Disk server snapshot (I) • Host: lcg00116 • Kernel: 2.4.20-20.9.XFS1.3.1 • Castor gridftp ver.: VDT1.2.0rh9-1

  36. Disk server snapshot (II) • Host: lcg00118 • Kernel: 2.4.21-40.EL.cern • Castor gridftp ver.: VDT1.2.0rh9-1

  37. Disk server snapshot (III) • Host: sc003 • Kernel version: 2.6.9-11.EL.XFS • Castor gridftp ver.: VDTALT1.1.8-13d.i386

  38. Accounting: normalized CPU time
