
Site Report: The RHIC Computing Facility


Presentation Transcript


  1. Site Report: The RHIC Computing Facility • HEPIX – Amsterdam, May 19-23, 2003 • A. Chan, RHIC Computing Facility, Brookhaven National Laboratory

  2. Outline • Background • Mass Storage • Central Disk Storage • Linux Farms • Software Development • Monitoring • Security • Other services • Summary

  3. Background • Brookhaven National Lab (BNL) is a U.S. government-funded, multi-disciplinary research laboratory • RCF formed in the mid-1990s to address the computing needs of the RHIC experiments • Became the U.S. Tier 1 Center for ATLAS in the late 1990s • RCF is a multi-purpose facility (NHEP and HEP)

  4. Background (continued) • Currently 25 staff members (need more) • RHIC first collisions in 2000, now in year 3 of operations • 5 RHIC experiments (BRAHMS, PHENIX, PHOBOS, PP2PP and STAR)

  5. Mass Storage • 4 StorageTek tape silos managed via HPSS (9940A and 9940B drives) • Peak raw data rate to the silos ~350 MB/s (can do better) • Peak data rate to/from the Linux Farm ~180 MB/s (can do better) • Experiments have accumulated 618 TB of raw data (capacity for 5x more) • 5 staff members oversee Mass Storage operations
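
A quick back-of-the-envelope check of what the capacity figures above imply, using only the slide's own numbers. The "capacity for 5x more" phrasing is ambiguous, so both readings are shown:

# Back-of-the-envelope check of the slide's capacity figures.
raw_data_tb = 618   # TB of raw data accumulated so far

# Reading 1: room for five times the current holdings on top of what is stored.
print(f"Total if 5x more fits on top: ~{raw_data_tb * 6 / 1000:.1f} PB")   # ~3.7 PB
# Reading 2: total capacity is five times the current holdings.
print(f"Total if capacity is 5x current: ~{raw_data_tb * 5 / 1000:.1f} PB")  # ~3.1 PB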

  6. The Mass Storage System (1)

  7. The Mass Storage System (2)

  8. Central Disk Storage • 24 Sun E450 servers running Solaris 8 • 140 TB of disk managed by the Sun servers via Veritas • Fast access to processed (DST) data via NFS (backed up in HPSS) • Average aggregate data rate of 600 MB/s to/from the Sun servers • 5 staff members oversee Central Disk Storage operations
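
For scale, the per-server averages implied by the totals above (simple division of the slide's own numbers):

# Per-server averages implied by the slide's totals.
servers = 24            # Sun E450 servers
disk_tb = 140           # TB managed via Veritas
aggregate_mb_s = 600    # average aggregate NFS data rate

print(f"~{disk_tb / servers:.1f} TB of disk per server")           # ~5.8 TB
print(f"~{aggregate_mb_s / servers:.0f} MB/s per server on average")  # 25 MB/s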

  9. Central Disk Storage (1)

  10. Central Disk Storage (2)

  11. Linux Farms • Provide the majority of the CPU power in the RCF • Used for mass processing of RHIC data • Listed as the 3rd-largest cluster according to http://www.clusters500.org • 5 staff members oversee all Linux Farm operations

  12. Linux Farm Hardware • Built with commercially available Intel-based servers • 1097 rack-mounted, dual-CPU servers • 917,728 SpecInt2000 aggregate • Reliable: 0.0052 hardware failures per machine-month – about 6 failures/month at the current size
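
The parenthetical failure figure is easy to verify from the slide's own numbers:

# Verify the "about 6 failures/month" figure quoted on this slide.
machines = 1097      # rack-mounted dual-CPU servers
rate = 0.0052        # hardware failures per machine-month

print(f"Expected failures/month: {machines * rate:.1f}")   # ~5.7, i.e. about 6
print(f"Average rating: ~{917_728 / machines:.0f} SpecInt2000 per server")  # ~837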

  13. The growth of the Linux Farm

  14. The Linux Farm in the RCF (1)

  15. The Linux Farm in the RCF (2)

  16. Linux Farm Software • RedHat 7.2 (RHIC) and 7.3 (ATLAS) • Images installed with Kickstart • Support for compilers (gcc, PGI, Intel) and debuggers (gdb, TotalView, Intel) • Support for network file systems (AFS, NFS)
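
The slides do not show the actual image definition, but a minimal sketch of the Kickstart approach might look like the following. The hostnames, partition sizes, password hash, and package list are hypothetical, not the RCF's real configuration:

# Illustrative only: emit a minimal RedHat 7.x Kickstart file per farm node.
# Hostnames, sizes, the password hash and the package list are hypothetical.
KICKSTART_TEMPLATE = """install
lang en_US
network --bootproto dhcp --hostname {hostname}
rootpw --iscrypted {crypted_pw}
clearpart --all
part / --size 4096
part swap --size 1024
%packages
gcc
gdb
"""

def write_kickstart(hostname, crypted_pw):
    # One Kickstart file per node lets the whole farm be reinstalled unattended.
    with open(f"{hostname}.ks.cfg", "w") as f:
        f.write(KICKSTART_TEMPLATE.format(hostname=hostname, crypted_pw=crypted_pw))

for i in range(1, 4):   # e.g. hypothetical nodes rcf0001..rcf0003
    write_kickstart(f"rcf{i:04d}", "$1$examplehash")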

  17. Linux Farm Software (continued) • Support for LSF and RCF-designed batch software • System administration software to monitor & control hardware, software and infrastructure • GRID-like software (Ganglia, Condor, GLOBUS, etc.) • Scalability is an important operational requirement
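
The RCF-designed batch layer is not detailed in the slides; as one hedged illustration of the LSF side, a small Python wrapper around bsub (the queue name, job name, and script path are hypothetical):

# Illustrative LSF submission; queue, job name and script path are hypothetical.
import subprocess

def submit(script, queue="rhic_prod", job_name="reco"):
    """Submit a script to LSF via bsub and return its acknowledgement line."""
    result = subprocess.run(
        ["bsub", "-q", queue, "-J", job_name, "-o", f"{job_name}.%J.out", script],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # e.g. "Job <12345> is submitted to queue <rhic_prod>."

print(submit("/path/to/reco_job.sh"))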

  18. Batch Jobs in the Linux Farm (1)

  19. Batch Jobs in the Linux Farm (2)

  20. Software Development • GRID-like services for RHIC and ATLAS • GRID monitoring tools • GRID user management issues • 4 staff members involved

  21. The USATLAS GRID Testbed

  22. GRID Monitoring

  23. GRID User Management (1)

  24. GRID User Management (2)

  25. Monitoring • Mix of open-source, RCF-designed and vendor-provided monitoring software • Persistence and fault-tolerance features • Near real-time information • Scalability requirements
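
None of the RCF's monitoring code is shown in the slides; purely as an illustration of the near-real-time polling style described above, a minimal liveness poller (the hostnames and the choice of port are hypothetical):

# Illustrative node-reachability poller; hostnames and port are hypothetical.
import socket, time

NODES = [f"rcf{i:04d}" for i in range(1, 4)]
PORT = 22   # checking that sshd answers is a cheap liveness probe

def node_up(host, port=PORT, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for host in NODES:
        print(f"{stamp} {host} {'UP' if node_up(host) else 'DOWN'}")
    time.sleep(60)   # near real-time: poll every minute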

  26. Mass Storage Monitoring

  27. Central Data Storage Monitoring

  28. Linux Farm Monitoring

  29. Batch Job Control & Monitoring

  30. Infrastructure Monitoring

  31. Security • Firewall to minimize unauthorized access • Most servers closed to direct, external access • User access through security-enhanced gateway systems • Security in the GRID environment is a big challenge
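
As a hedged sketch of the gateway access pattern only (hostnames are hypothetical, and the gateways' actual hardening is not described in the slides), a login that hops through the gateway to an internal node:

# Illustrative two-hop login through a security gateway.
# Hostnames are hypothetical; "-t" keeps the inner ssh session interactive.
import subprocess

subprocess.run(["ssh", "-t", "user@rcfgate.example.gov", "ssh", "rcf0001"])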

  32. Security at the RCF

  33. Other Services • E-mail • Limited printer support • Off-site data transfer services (bbftp, rftp, etc.) • Nightly backups of critical file systems
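
bbftp and rftp have their own command-line syntax not shown in the slides; as a sketch of the nightly-backup bullet only (the paths are hypothetical, and this is not the RCF's actual procedure), a dated archive of one critical file system:

# Illustrative nightly backup: dated tar.gz of a critical file system.
# Paths are hypothetical; run from cron, e.g. "0 2 * * * python backup.py".
import tarfile, time

SOURCE = "/etc"   # hypothetical critical file system
DEST = f"/backups/etc-{time.strftime('%Y%m%d')}.tar.gz"

with tarfile.open(DEST, "w:gz") as tar:
    tar.add(SOURCE, arcname="etc")
print(f"Wrote {DEST}")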

  34. Summary • Implementation of GRID-like services is increasing • Hardware & software scalability become more important as the RCF grows • Security in the GRID era is an important concern
