The RHIC-ATLAS Computing Facility at BNL. HEPIX – Edinburgh, May 24-28, 2004. Tony Chan, RHIC Computing Facility, Brookhaven National Laboratory.
Outline • Background • Mass Storage • Central Disk Storage • Linux Farm • Monitoring • Security & Authentication • Future Developments • Summary
Background • Brookhaven National Lab (BNL) is a U.S. gov't funded multi-disciplinary research laboratory. • The RACF (RHIC-ATLAS Computing Facility) was formed in the mid-90's to address the computing needs of the RHIC experiments. Became the U.S. Tier 1 Center for ATLAS in the late 90's. • RACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc).
Background (continued) • Currently 29 staff members (4 new hires in 2004). • RHIC Year 4 just concluded. Performance surpassed all expectations.
Mass Storage • 4 StorageTek tape silos managed via HPSS (v 4.5). • Using 37 9940B drives (200 GB/tape). • Aggregate bandwidth up to 700 MB/s. • 10 data movers with 10 TB of disk. • Total over 1.5 PB of raw data in 4 years of running (capacity for 4.5 PB).
Central Disk Storage • Large SAN served via NFS: DST + user home directories + scratch area. • 41 Sun servers (E450 & V480) running Solaris 8 and 9. Plan to migrate all to Solaris 9 eventually. • 24 Brocade switches & 250 TB of FB RAID5 managed by Veritas. • Aggregate 600 MB/s data rate to/from the Sun servers on average.
Central Disk Storage (cont.) • RHIC and ATLAS AFS cells: software repository + user home directories. • Total of 11 AIX servers with 1.2 TB for RHIC and 0.5 TB for ATLAS. • Transarc on the server side, OpenAFS on the client side. • Considering OpenAFS for the server side.
Linux Farm • Used for mass processing of data. • 1359 rack-mounted, dual-CPU (Intel) servers. • Total of 1362 kSpecInt2000. • Reliable (about 6 hardware failures per month at current farm size). • Combination of SCSI & IDE disks with aggregate of 234+ TB of local storage.
Linux Farm (cont.) • Experiments making significant use of local storage through custom job schedulers, data repository managers and rootd (see the sketch below). • Requires significant infrastructure resources (network, power, cooling, etc). • Significant scalability challenges. Advance planning and careful design a must!
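The first bullet above mentions rootd access to the farm nodes' local disks. As a hedged illustration only (the host name, file path and the use of PyROOT are assumptions for the example, not details from the slides), opening such a file remotely might look like this:

```python
# Sketch only: reading a data file served from a farm node's local disk
# via rootd, using PyROOT.  The hostname and path are invented, and this
# assumes PyROOT and a rootd service are available on the node.
import ROOT

# TFile.Open dispatches on the URL scheme; a "root://" URL goes through
# the remote-file protocol instead of local POSIX I/O.
f = ROOT.TFile.Open("root://farmnode042.example.gov//data/run1234/events.root")
if f and not f.IsZombie():
    print("opened", f.GetName(), "-", f.GetSize(), "bytes")
    f.Close()
```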
Linux Farm Software • Custom RH 8 (RHIC) and 7.3 (ATLAS) images. • Installed with Kickstart server. • Support for compilers (gcc, PGI, Intel) and debuggers (gdb, Totalview, Intel). • Support for network file systems (AFS, NFS) and local data storage.
Linux Farm Batch Management • New Condor-based batch system with custom Python front-end to replace the old batch system. Fully deployed in the Linux Farm. • Use of Condor DAGMan functionality to handle job dependencies (see the sketch below). • New system solves the scalability problems of the old system. • Upgraded to Condor 6.6.5 (latest stable release) to implement advanced features (queue priority and preemption).
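As a rough sketch of the kind of front-end described above (not the actual RACF code; the job names, executables and two-step DAG are invented for illustration), a Python wrapper can emit Condor submit description files plus a DAGMan input file and hand them to condor_submit_dag:

```python
#!/usr/bin/env python3
"""Illustrative sketch only: a tiny Python front-end that writes Condor
submit files plus a DAGMan input file and submits them via
condor_submit_dag.  Executable paths and job names are hypothetical."""

import subprocess

SUBMIT_TEMPLATE = """universe   = vanilla
executable = {exe}
output     = {name}.out
error      = {name}.err
log        = {name}.log
queue
"""

def write_submit(name, exe):
    """Write a Condor submit description file and return its file name."""
    path = f"{name}.sub"
    with open(path, "w") as f:
        f.write(SUBMIT_TEMPLATE.format(name=name, exe=exe))
    return path

def write_dag(jobs, edges, dag_path="pipeline.dag"):
    """Write a DAGMan input file: one JOB line per job plus PARENT/CHILD
    lines expressing the dependencies."""
    with open(dag_path, "w") as f:
        for name, sub in jobs.items():
            f.write(f"JOB {name} {sub}\n")
        for parent, child in edges:
            f.write(f"PARENT {parent} CHILD {child}\n")
    return dag_path

if __name__ == "__main__":
    jobs = {
        "reco": write_submit("reco", "/usr/local/bin/run_reco.sh"),       # hypothetical
        "analyze": write_submit("analyze", "/usr/local/bin/run_ana.sh"),  # hypothetical
    }
    dag = write_dag(jobs, edges=[("reco", "analyze")])
    subprocess.call(["condor_submit_dag", dag])  # DAGMan enforces the ordering
```

DAGMan releases each job only after its parents have completed, which is how the job dependencies mentioned above are enforced.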
Linux Farm Batch Management (cont.) • LSF v5.1 widely used in the Linux Farm, especially for data analysis jobs. Peak rate of 350 K jobs/week. • LSF possibly to be replaced by Condor if the latter can scale to similar peak job rates. Current Condor peak rate of 7 K jobs/week. • Condor and LSF accepting jobs through GLOBUS (example below). • Condor scalability to be tested in ATLAS DC 2.
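Since both schedulers sit behind Globus, Grid jobs reach them through a gatekeeper rather than by direct LSF or Condor submission. A hypothetical GT2-style example is sketched below (the gatekeeper host name is invented); the jobmanager suffix selects which local batch system runs the job.

```python
# Hypothetical GT2-style submission through a Globus gatekeeper; the
# gatekeeper hostname is invented.  The jobmanager suffix routes the job
# to the local batch system (Condor or LSF) behind the gatekeeper.
import subprocess

GATEKEEPER = "gatekeeper.example.gov"  # invented hostname

# Route a simple test job to the Condor pool behind the gatekeeper.
subprocess.call(["globus-job-run", f"{GATEKEEPER}/jobmanager-condor", "/bin/hostname"])

# The same job routed to LSF instead.
subprocess.call(["globus-job-run", f"{GATEKEEPER}/jobmanager-lsf", "/bin/hostname"])
```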
Monitoring • Mix of open-source, RCF-designed and vendor-provided monitoring software. • Persistency and fault-tolerant features. • Near real-time information. • Scalability requirements.
Security & Authentication • Two layers of firewall with limited network services and limited interactive access through secure gateways. • Migration to Kerberos 5 single sign-on and consolidation of password DB's. NIS passwords to be phased out. • Integration of K5/AFS with LSF to solve credential-forwarding issues (sketch below). Will need a similar implementation for Condor. • Implemented Kerberos certificate authority.
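To make the credential-forwarding problem concrete, the sketch below shows one way a batch job wrapper can obtain a Kerberos 5 ticket from a keytab and derive an AFS token before running its payload. This is an illustration rather than the RACF integration itself; the principal, keytab path and payload command are placeholders.

```python
#!/usr/bin/env python3
"""Hypothetical job wrapper illustrating K5/AFS credential handling on a
batch node: acquire a ticket from a keytab, convert it to an AFS token
with aklog, run the user payload, then clean up.  The principal, keytab
path and payload are placeholders, not RACF values."""

import os
import subprocess
import sys
import tempfile

PRINCIPAL = "someuser@EXAMPLE.REALM"     # placeholder principal
KEYTAB    = "/path/to/someuser.keytab"   # placeholder keytab

def run(cmd):
    """Run a command and abort the wrapper if it fails."""
    rc = subprocess.call(cmd)
    if rc != 0:
        sys.exit(rc)

def main(payload):
    # Give the job its own credential cache so concurrent jobs on the
    # same node do not clobber each other's tickets.
    ccache = tempfile.NamedTemporaryFile(prefix="krb5cc_job_", delete=False)
    os.environ["KRB5CCNAME"] = "FILE:" + ccache.name
    try:
        run(["kinit", "-k", "-t", KEYTAB, PRINCIPAL])  # Kerberos 5 ticket
        run(["aklog"])                                 # AFS token from the ticket
        rc = subprocess.call(payload)                  # the actual batch job
    finally:
        subprocess.call(["unlog"])                     # drop AFS tokens
        subprocess.call(["kdestroy"])                  # drop Kerberos tickets
        if os.path.exists(ccache.name):
            os.unlink(ccache.name)
    sys.exit(rc)

if __name__ == "__main__":
    main(sys.argv[1:] or ["/bin/true"])
```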
Future Developments • HSI/HTAR deployment for UNIX-like access to HPSS (illustrated below). • Moving beyond the NFS-served SAN with more scalable solutions (Panasas, IBRIX, Lustre, NFS v4.1, etc). • dCache/SRM being evaluated as a distributed storage management solution to exploit high-capacity, low-cost local storage in the 1300+ node Linux Farm. • Linux Farm OS upgrade plans (RHEL?).
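HSI and HTAR provide FTP-like and tar-like command-line access to HPSS, which is what "UNIX-like access" refers to. A hedged example of driving them from a script is shown below; the local files and HPSS paths are invented, and exact behaviour depends on the local installation.

```python
# Hedged illustration of UNIX-like HPSS access through HSI and HTAR,
# driven from a script.  File names and HPSS paths are invented.
import subprocess

HPSS_DIR = "/hpss/home/someuser"  # invented HPSS path

# Store a single file with hsi's put command ("local : remote").
subprocess.call(["hsi", f"put run1234.root : {HPSS_DIR}/run1234.root"])

# Bundle a whole directory into a tar archive created directly in HPSS.
subprocess.call(["htar", "-cvf", f"{HPSS_DIR}/run1234.tar", "run1234/"])

# Later: list the archive and pull back one member file.
subprocess.call(["htar", "-tvf", f"{HPSS_DIR}/run1234.tar"])
subprocess.call(["htar", "-xvf", f"{HPSS_DIR}/run1234.tar", "run1234/events_001.root"])
```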
US ATLAS Grid Testbed • [Diagram of the testbed showing: giis01 information server, Condor pool, amds HPSS mover, Globus RLS server, aftpexp00 AFS server, gatekeeper and job manager, Globus client, aafs, GridFTP (70 MB/s), Grid job requests from the Internet, atlas02, and 17 TB of disks.] • Local Grid development currently focused on monitoring, user management and support for DC2 production activities.
Summary • RHIC run very successful. • Increasing staff levels to support a growing level of computing support activities. • On-going evaluation of scalable solutions (dCache, Panasas, Condor, etc) in a distributed computing environment. • Increased activity to support upcoming ATLAS DC 2 production.