
The RHIC-ATLAS Computing Facility at BNL

  1. The RHIC-ATLAS Computing Facility at BNL HEPIX – Edinburgh May 24-28, 2004 Tony Chan RHIC Computing Facility Brookhaven National Laboratory

  2. Outline • Background • Mass Storage • Central Disk Storage • Linux Farm • Monitoring • Security & Authentication • Future Developments • Summary

  3. Background • Brookhaven National Lab (BNL) is a U.S. government-funded, multi-disciplinary research laboratory. • The RHIC-ATLAS Computing Facility (RACF) was formed in the mid-1990s to address the computing needs of the RHIC experiments; it became the U.S. Tier 1 Center for ATLAS in the late 1990s. • The RACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc.).

  4. Background (continued) • Currently 29 staff members (4 new hires in 2004). • RHIC Year 4 just concluded. Performance surpassed all expectations.

  5. Staff Growth at the RACF

  6. RACF Structure

  7. Mass Storage • 4 StorageTek tape silos managed via HPSS (v 4.5). • Using 37 9940B drives (200 GB/tape). • Aggregate bandwidth up to 700 MB/s. • 10 data movers with 10 TB of disk. • Total over 1.5 PB of raw data in 4 years of running (capacity for 4.5 PB).
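
A quick back-of-the-envelope check of these figures (a minimal Python sketch using only the numbers quoted on this slide, with decimal units assumed; it is illustrative, not part of the facility's tooling):

    # Sanity-check the mass-storage figures quoted above
    # (37 drives, 700 MB/s aggregate, 1.5 PB stored); illustrative only.
    DRIVES = 37
    AGGREGATE_MB_S = 700            # quoted aggregate bandwidth
    STORED_PB = 1.5                 # raw data accumulated over 4 years of running

    per_drive = AGGREGATE_MB_S / DRIVES
    print(f"Average per-drive rate: {per_drive:.0f} MB/s")        # ~19 MB/s

    stored_mb = STORED_PB * 1e9     # 1 PB ~ 10^9 MB in decimal units
    days_at_full_rate = stored_mb / AGGREGATE_MB_S / 86400
    print(f"Time to stream 1.5 PB at full rate: {days_at_full_rate:.0f} days")  # ~25 days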

  8. The Mass Storage System

  9. Central Disk Storage • Large SAN served via NFS → DST + user home directories + scratch area. • 41 Sun servers (E450 & V480) running Solaris 8 and 9. Plan to migrate all to Solaris 9 eventually. • 24 Brocade switches & 250 TB of FB RAID5 managed by Veritas. • Aggregate 600 MB/s data rate to/from Sun servers on average.

  10. Central Disk Storage (cont.) • RHIC and ATLAS AFS cells → software repository + user home directories. • Total of 11 AIX servers with 1.2 TB for RHIC and 0.5 TB for ATLAS. • Transarc on the server side, OpenAFS on the client side. • Considering OpenAFS for the server side.

  11. The Central Disk Storage System

  12. Linux Farm • Used for mass processing of data. • 1359 rack-mounted, dual-CPU (Intel) servers. • Total of 1362 kSpecInt2000. • Reliable (about 6 hardware failures per month at current farm size). • Combination of SCSI & IDE disks with aggregate of 234+ TB of local storage.
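
That reliability figure works out to a low per-node rate; a quick illustrative calculation (Python, using only the counts quoted on this slide):

    # Rough per-node failure rate implied by the slide's figures
    # (1359 servers, ~6 hardware failures per month); illustrative only.
    NODES = 1359
    FAILURES_PER_MONTH = 6

    annual_rate = FAILURES_PER_MONTH * 12 / NODES
    mtbf_years = NODES / (FAILURES_PER_MONTH * 12)
    print(f"Per-node annual failure rate: {annual_rate:.1%}")   # ~5.3% per year
    print(f"Implied MTBF per node: ~{mtbf_years:.0f} years")    # ~19 node-years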

  13. Linux Farm (cont.) • Experiments making significant use of local storage through custom job schedulers, data repository managers and rootd. • Requires significant infrastructure resources (network, power, cooling, etc). • Significant scalability challenges. Advance planning and careful design a must!

  14. The Growth of the Linux Farm

  15. The Linux Farm in the RACF

  16. Linux Farm Software • Custom RH 8 (RHIC) and 7.3 (ATLAS) images. • Installed with Kickstart server. • Support for compilers (gcc, PGI, Intel) and debuggers (gdb, Totalview, Intel). • Support for network file systems (AFS, NFS) and local data storage.
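
A minimal sketch of how such a Kickstart-driven installation might be scripted (hypothetical Python; the NFS server name, release label, partition layout and package list are assumptions for illustration, not the RACF configuration):

    # Hypothetical generator of per-host Kickstart files. The directives are
    # standard Red Hat Kickstart syntax; all values are illustrative only.
    import textwrap

    KICKSTART_TEMPLATE = textwrap.dedent("""\
        install
        nfs --server={ks_server} --dir=/export/{release}
        lang en_US
        keyboard us
        network --bootproto dhcp
        rootpw --iscrypted {crypted_pw}
        timezone America/New_York
        bootloader --location=mbr
        clearpart --all --initlabel
        part /boot --size 100
        part swap --size 2048
        part / --size 1 --grow
        reboot

        %packages
        @ base
        gcc
        openafs-client

        %post
        # site-specific post-install steps (monitoring agents, batch client, ...)
        """)

    def write_kickstart(hostname, release="rh8-rhic",
                        ks_server="ks.example.bnl.gov",
                        crypted_pw="$1$example$hash"):
        """Write a Kickstart file for one farm node (placeholder values)."""
        with open(f"{hostname}.ks", "w") as f:
            f.write(KICKSTART_TEMPLATE.format(ks_server=ks_server, release=release,
                                              crypted_pw=crypted_pw))

    write_kickstart("node0001")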

  17. Linux Farm Batch Management • New Condor-based batch system with a custom Python front-end to replace the old batch system. Fully deployed in the Linux Farm. • Use of Condor DAGMan functionality to handle job dependencies. • New system solves the scalability problems of the old system. • Upgraded to Condor 6.6.5 (latest stable release) to implement advanced features (queue priority and preemption).
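
To illustrate the kind of front-end described here (a hedged sketch, not the actual RACF tool: hypothetical Python that writes Condor submit files plus a DAGMan file expressing a two-step dependency, then hands the DAG to condor_submit_dag; all executables and file names are placeholders):

    # Hypothetical front-end sketch: generate Condor submit descriptions and a
    # DAGMan file so the merge step runs only after reconstruction succeeds.
    import subprocess, textwrap

    SUBMIT_TEMPLATE = textwrap.dedent("""\
        universe   = vanilla
        executable = {exe}
        arguments  = {args}
        output     = {name}.out
        error      = {name}.err
        log        = {name}.log
        queue
        """)

    def write_submit(name, exe, args=""):
        """Write a one-job Condor submit file (paths are placeholders)."""
        with open(f"{name}.sub", "w") as f:
            f.write(SUBMIT_TEMPLATE.format(name=name, exe=exe, args=args))

    write_submit("reco",  "/usr/local/bin/reco",  "run1234.raw")
    write_submit("merge", "/usr/local/bin/merge", "run1234")

    with open("run1234.dag", "w") as f:
        f.write("JOB reco reco.sub\n")
        f.write("JOB merge merge.sub\n")
        f.write("PARENT reco CHILD merge\n")   # merge waits for reco to finish

    subprocess.run(["condor_submit_dag", "run1234.dag"], check=True)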

  18. Linux Farm Batch Management (cont.)

  19. Linux Farm Batch Management (cont.) • LSF v5.1 widely used in the Linux Farm, especially for data analysis jobs. Peak rate of 350 K jobs/week. • LSF possibly to be replaced by Condor if the latter can scale to similar peak job rates. Current Condor peak rate of 7 K jobs/week. • Condor and LSF accepting jobs through GLOBUS. • Condor scalability to be tested in ATLAS DC 2.
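
Both batch systems sit behind a Globus (GT2-era) gatekeeper; a hedged sketch of what a remote test submission could look like from a client (the gatekeeper host and job-manager name are placeholders, and a valid Grid proxy is assumed to exist already):

    # Illustrative remote submission through a GT2 gatekeeper; the contact
    # string is a placeholder, not an actual RACF endpoint.
    import subprocess

    CONTACT = "gatekeeper.example.bnl.gov/jobmanager-condor"   # assumed contact string

    # Run a trivial command on the remote resource as a connectivity test
    # (assumes a proxy from grid-proxy-init is already in place).
    subprocess.run(["globus-job-run", CONTACT, "/bin/hostname"], check=True)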

  20. Condor Usage at the RACF

  21. Monitoring • Mix of open-source, RCF-designed and vendor-provided monitoring software. • Persistence and fault-tolerance features. • Near real-time information. • Scalability requirements.
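
The custom pieces of such a setup tend to be simple pollers; a hypothetical sketch of one (all paths, intervals and the transport stub are invented for illustration), showing the persistence idea of buffering samples locally so nothing is lost if the central collector is briefly unreachable:

    # Hypothetical monitoring poller: sample local load, append to a local
    # spool file (persistence), and periodically ship it to a collector.
    import json, os, socket, time

    SPOOL = "/var/tmp/monitor.spool"     # invented path
    INTERVAL = 60                        # seconds between samples

    def take_sample():
        load1, load5, load15 = os.getloadavg()
        return {"host": socket.gethostname(), "time": int(time.time()),
                "load1": load1, "load5": load5, "load15": load15}

    def ship(samples):
        """Placeholder for the real transport (e.g. HTTP or a socket to the collector)."""
        return False                     # pretend the collector is unreachable

    while True:
        with open(SPOOL, "a") as f:      # buffer locally first: survives collector outages
            f.write(json.dumps(take_sample()) + "\n")
        with open(SPOOL) as f:
            pending = [json.loads(line) for line in f]
        if ship(pending):
            open(SPOOL, "w").close()     # clear the spool only after a successful ship
        time.sleep(INTERVAL)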

  22. Mass Storage Monitoring

  23. Central Disk Storage Monitoring

  24. Linux Farm Monitoring

  25. Temperature Monitoring

  26. Security & Authentication • Two layers of firewall with limited network services and limited interactive access through secure gateways. • Migration to Kerberos 5 single sign-on and consolidation of password DB’s. NIS passwords to be phased-out. • Integration of K5/AFS with LSF to solve credential forwarding issues. Will need similar implementation for Condor. • Implemented Kerberos certificate authority.

  27. Future Developments • HSI/HTAR deployment for UNIX-like access to HPSS. • Moving beyond the NFS-served SAN with more scalable solutions (Panasas, IBRIX, Lustre, NFS v4.1, etc). • dCache/SRM being evaluated as a distributed storage management solution to exploit the high-capacity, low-cost local storage in the 1300+ node Linux Farm. • Linux Farm OS upgrade plans (RHEL?).
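
For a flavour of what UNIX-like HPSS access via HSI/HTAR looks like (a hedged sketch driven from Python; the HPSS paths and file names are placeholders):

    # Illustrative HSI/HTAR usage; HPSS paths are placeholders.
    import subprocess

    # Copy a single file into HPSS ("put <local> : <hpss path>" is HSI syntax)
    subprocess.run(["hsi", "put run1234.root : /hpss/user/run1234.root"], check=True)

    # Bundle a whole directory into a tar-format archive stored in HPSS
    subprocess.run(["htar", "-cvf", "/hpss/user/run1234.tar", "run1234/"], check=True)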

  28. US ATLAS Grid Testbed • [Diagram: Grid job requests from the Internet arrive via Globus (gatekeeper, job manager, GridFTP at 70 MB/s, RLS and GIIS information servers) and feed a local Condor pool, backed by HPSS data movers, AFS servers and ~17 TB of disk.] • Local Grid development currently focused on monitoring, user management and support for DC2 production activities.

  29. Summary • RHIC run very successful. • Staff levels increasing to support a growing level of computing activities. • On-going evaluation of scalable solutions (dCache, Panasas, Condor, etc) in a distributed computing environment. • Increased activity to support upcoming ATLAS DC 2 production.
