Site Report: The RHIC Computing Facility. HEPIX – Amsterdam, May 19-23, 2003. A. Chan, RHIC Computing Facility, Brookhaven National Laboratory.
Outline • Background • Mass Storage • Central Disk Storage • Linux Farms • Software Development • Monitoring • Security • Other Services • Summary
Background • Brookhaven National Lab (BNL) is a U.S. government-funded multi-disciplinary research laboratory • RCF formed in the mid-1990s to address computing needs of RHIC experiments • Became U.S. Tier 1 Center for ATLAS in the late 1990s • RCF is a multi-purpose facility (NHEP and HEP)
Background (continued) • Currently 25 staff members (need more) • RHIC first collisions in 2000, now in year 3 of operations • 5 RHIC experiments (BRAHMS, PHENIX, PHOBOS, PP2PP and STAR)
Mass Storage • 4 StorageTek tape silos managed via HPSS (9940A and 9940B) • Peak raw data rate to silos 350 MB/s (can do better) • Peak data rate to/from Linux Farm 180 MB/s (can do better) • Experiments have accumulated 618 TB of raw data (capacity for 5x more) • 5 staff members oversee Mass Storage operations
Central Disk Storage • 24 Sun E450 servers running Solaris 8 • 140 TB of disks managed by Sun servers via Veritas • Fast access to processed (DST) data via NFS (back-up in HPSS) • Aggregate 600 MB/s data rate to/from Sun servers on average • 5 staff members oversee Central Disk Storage operations
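As a rough illustration of what the totals above imply per server, the sketch below divides the quoted disk capacity and aggregate data rate evenly across the 24 Sun servers. The even split is an assumption made only for this example (the slide does not say how disks or load are distributed), and the names are hypothetical.

```python
# Figures quoted on the slide.
SUN_SERVERS = 24
TOTAL_DISK_TB = 140          # disk managed by the Sun servers
AGGREGATE_RATE_MB_S = 600    # average aggregate data rate to/from the servers


def per_server_share(total: float, servers: int) -> float:
    """Return an even per-server share of a facility-wide total.

    Assumption: load and capacity are spread uniformly, which the
    slide does not state.
    """
    return total / servers


disk_per_server_tb = per_server_share(TOTAL_DISK_TB, SUN_SERVERS)
rate_per_server_mb_s = per_server_share(AGGREGATE_RATE_MB_S, SUN_SERVERS)

print(f"Disk per server: {disk_per_server_tb:.2f} TB")
print(f"Data rate per server: {rate_per_server_mb_s:.1f} MB/s")
```

Under that assumption, each server would hold a little under 6 TB and carry 25 MB/s on average.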
Linux Farms • Provide the majority of CPU power in the RCF • Used for mass processing of RHIC data • Listed as the 3rd-largest cluster by http://www.clusters500.org • 5 staff members oversee all Linux Farm operations
Linux Farm Hardware • Built with commercially available Intel-based servers • 1097 rack-mounted, dual-CPU servers • 917,728 SpecInt2000 • Reliable (0.0052 hardware failures per machine per month, about 6 failures/month at current size)
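The reliability figure above can be checked with a short calculation. This is a minimal sketch using only the numbers quoted on the slide; the per-server SpecInt2000 average is a derived value for illustration, not a figure from the talk, and the function names are hypothetical.

```python
# Figures quoted on the slide.
SERVERS = 1097                # rack-mounted, dual-CPU servers
FAILURE_RATE = 0.0052         # hardware failures per machine per month
TOTAL_SPECINT2000 = 917_728   # aggregate SpecInt2000 rating


def expected_failures_per_month(servers: int, rate: float) -> float:
    """Expected hardware failures per month for the whole farm."""
    return servers * rate


def average_rating_per_server(total_rating: int, servers: int) -> float:
    """Average SpecInt2000 per server (a derived value, not from the slide)."""
    return total_rating / servers


failures = expected_failures_per_month(SERVERS, FAILURE_RATE)
rating = average_rating_per_server(TOTAL_SPECINT2000, SERVERS)

print(f"Expected failures per month: {failures:.1f}")
print(f"Average SpecInt2000 per server: {rating:.0f}")
```

The product 1097 × 0.0052 comes to roughly 5.7, which agrees with the "about 6 failures/month" stated on the slide.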
Linux Farm Software • RedHat 7.2 (RHIC) and 7.3 (ATLAS) • Image installed with Kickstart • Support for compilers (gcc, PGI, Intel) and debuggers (gdb, Totalview, Intel) • Support for network file systems (AFS, NFS)
Linux Farm Software (continued) • Support for LSF and RCF-designed batch software • System administration software to monitor & control hardware, software and infrastructure • GRID-like software (Ganglia, Condor, GLOBUS, etc.) • Scalability an important operational requirement
Software Development • GRID-like services for RHIC and ATLAS • GRID monitoring tools • GRID user management issues • 4 staff members involved
Monitoring • Mix of open-source, RCF-designed and vendor-provided monitoring software • Persistency and fault-tolerant features • Near real-time information • Scalability requirements
Security • Firewall to minimize unauthorized access • Most servers closed to direct, external access • User access through security-enhanced gateway systems • Security in the GRID environment a big challenge
Other Services • E-mail • Limited printer support • Off-site data transfer services (bbftp, rftp, etc.) • Nightly backups of critical file systems
Summary • Implementation of GRID-like services increasing • Hardware & software scalability more important as RCF grows • Security in the GRID era an important issue