This site report provides an overview of the Linux Farm at the RHIC Computing Facility (RCF), which offers computing facilities for RHIC users. It includes information on the general computing environment, data analysis facility, code development and distribution, as well as the storage and infrastructure for RHIC experiments. The report also covers the hardware and software configuration of the Linux Farm, security and monitoring measures, and plans for future expansion. Written by Ofer Rind.
HEPIX-HEPNT, October 22-25, 2002
Ofer Rind
RHIC Computing Facility, Brookhaven National Laboratory
Site Report: The Linux Farm at the RCF
RCF - Overview
Provide computing facilities for RHIC users:
• General computing environment
• General interactive tasks (email, document processing, web)
• Data analysis facility
• Computing infrastructure for RHIC experiments
• Code development, repository & distribution
• Raw data recording & reconstruction
• Data analysis
ACF: US Atlas Tier 1 Computing Facility
• Shared infrastructure and synergy with RCF
Support staff: 25 FTEs (4 dedicated to the Linux Farm)
Ofer Rind - RHIC Computing Facility Site Report
RCF - Structure
[Figure: RCF structure diagram]
RCF - Component Summary
Mass Storage Subsystem
• StorageTek library managed by HPSS
• 4 silos, 1.2PB capacity (expanding to 4.5PB)
• In Run-2, raw data recorded at a common rate of 70MB/sec for a total of 170TB
• Total data store ~300TB
Disk Storage
• Fibre Channel SAN served by NFS
• ~110TB RAID 5
• 14 Sun 450, Solaris 8 [2-02] (5 Sun 480 coming online)
• IBM AFS servers (AIX)
Linux Server Farm
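As a rough cross-check of the Run-2 figures, the quoted rate and total can be turned into an equivalent continuous recording time. This is only a sketch: it assumes decimal units (1 TB = 1,000,000 MB) and treats the 70MB/sec figure as if it were sustained, which the slide does not claim. The function name is hypothetical.

```python
RAW_DATA_TB = 170       # Run-2 raw data total, from the slide
RATE_MB_PER_S = 70      # recording rate, from the slide
MB_PER_TB = 1_000_000   # assumption: decimal units
SECONDS_PER_DAY = 86_400


def equivalent_recording_days(total_tb, rate_mb_per_s):
    """Days needed to write total_tb at a constant rate_mb_per_s."""
    seconds = total_tb * MB_PER_TB / rate_mb_per_s
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    days = equivalent_recording_days(RAW_DATA_TB, RATE_MB_PER_S)
    print(f"{days:.1f} days of continuous recording")
```

Under these assumptions the total corresponds to roughly four weeks of uninterrupted recording at the quoted rate.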
Linux Farm Hardware
• 840 1U and 2U servers (pre-'99 towers have been retired)
• 69 kSPECint95, expanding to 100 kSPECint95 (2+ TFLOPS)
• Most nodes have 1GB of memory (all have at least 500MB)
• Local SCSI disks up to 140GB/node
• Allocated by experiment
• Further allocated for Raw Data Reconstruction (CRS) and Reconstructed Data Analysis (CAS)

Vendor     CPU            Nodes   Date
VA Linux   PIII 450MHz    148     Jun 99
VA Linux   PIII 700MHz     48     Aug 00
VA Linux   PIII 800MHz    168     Nov 00
IBM        PIII 1000MHz   316     Aug 01
IBM        PIII 1400MHz   160     Oct 02
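The per-batch node counts listed above can be checked against the 840-server total with a few lines of Python. The data is copied from the slide; the function names are hypothetical and exist only for this illustration.

```python
# Purchase batches as listed on the slide: (vendor, cpu, node count, date).
BATCHES = [
    ("VA Linux", "PIII 450MHz", 148, "Jun 99"),
    ("VA Linux", "PIII 700MHz", 48, "Aug 00"),
    ("VA Linux", "PIII 800MHz", 168, "Nov 00"),
    ("IBM", "PIII 1000MHz", 316, "Aug 01"),
    ("IBM", "PIII 1400MHz", 160, "Oct 02"),
]


def total_nodes(batches):
    """Sum the node counts over all purchase batches."""
    return sum(count for _vendor, _cpu, count, _date in batches)


def nodes_by_vendor(batches):
    """Group node counts by vendor name."""
    totals = {}
    for vendor, _cpu, count, _date in batches:
        totals[vendor] = totals.get(vendor, 0) + count
    return totals


if __name__ == "__main__":
    print(total_nodes(BATCHES))
    print(nodes_by_vendor(BATCHES))
```

The five batches add up to the 840 servers quoted in the first bullet, which suggests the list includes the 160 IBM nodes delivered in Oct 02.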
Linux Farm Software Configuration
• RedHat 7.2, upgraded to the 2.4.9-31 kernel
• Image(s) installed via Kickstart server and customized for the RCF environment via rpm
• NFS + AFS home directory and file access
• Interactive login allowed on selected nodes
• Job management:
  • (CAS) LSF 4.2, slightly re-architected for robustness. Peak throughput before summer conferences was >150K jobs/week.
  • (CRS) Locally produced Perl-based batch system (AIX needed for HPSS API). Approx. 670K jobs processed for Run-2.
• Expanding use of distributed disk models (rootd, ??)
• Atlas Grid testbed
Tracking LSF Usage
[Figure: Star queues weekly job statistics (week of Oct. 10). Panels: job starts/hr, avg runtime/hr, runtime]
Security and Monitoring
Security:
• RCF firewall within BNL site firewall
• SSH2-only access through gateway bastion nodes (Solaris x86)
• User access restricted to a subset of systems (CAS only)
Monitoring:
• 24 hr. on-call staff for critical systems during RHIC operation
• Cluster mgmt. software:
  • VACM (VA Linux)
  • xCAT (IBM, http://www.x-cat.org)
• Cron scripts to "clean" nodes and head off possible problems (memory leaks, full disks, etc.)
• CTS system for problem reports
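The slide does not show the cron scripts themselves. As a minimal sketch of the "full disks" check they describe, the following Python reports any filesystem above a usage threshold; the function names, the path list, and the 90% threshold are all assumptions made for illustration, not the RCF's actual settings.

```python
import shutil


def disk_fraction_used(path="."):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_node(paths=(".",), threshold=0.90):
    """Return one warning string per path whose filesystem is at or above threshold."""
    warnings = []
    for path in paths:
        fraction = disk_fraction_used(path)
        if fraction >= threshold:
            warnings.append(f"{path} is {fraction:.0%} full")
    return warnings


if __name__ == "__main__":
    # Hypothetical usage: a cron entry could run this and mail any output.
    for line in check_node(paths=(".",), threshold=0.90):
        print(line)
```

A script like this only detects the condition; any actual clean-up (removing scratch files, restarting leaking processes) would be site-specific and is not sketched here.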
Farm Alert System
• Web monitoring (user-accessible) plus paging/email alerts
• Python scripts running locally transfer node status information to a MySQL database
• Notification of problems with NFS/AFS (e.g. stale file handles), LSF daemons, high load, etc.
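The general pattern of collecting node status locally and recording it in a database can be sketched as follows. This is not the RCF code: it uses Python's built-in sqlite3 module as a stand-in for MySQL so the example runs anywhere, and the table name `node_status`, its columns, and the load threshold are hypothetical.

```python
import os
import socket
import sqlite3
import time


def collect_status():
    """Gather a small status record for the local node."""
    try:
        load1, _load5, _load15 = os.getloadavg()
    except (AttributeError, OSError):
        load1 = None  # load average is not available on every platform
    return {"node": socket.gethostname(), "ts": int(time.time()), "load1": load1}


def open_db(path=":memory:"):
    """Open the status database (sqlite3 here, standing in for MySQL)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_status (node TEXT, ts INTEGER, load1 REAL)"
    )
    return conn


def record_status(conn, status):
    """Insert one status record."""
    conn.execute(
        "INSERT INTO node_status (node, ts, load1) VALUES (:node, :ts, :load1)",
        status,
    )
    conn.commit()


def nodes_over(conn, max_load):
    """Return the sorted names of nodes with any recorded load above max_load."""
    rows = conn.execute(
        "SELECT DISTINCT node FROM node_status WHERE load1 > ? ORDER BY node",
        (max_load,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_db()
    record_status(conn, collect_status())
    print(nodes_over(conn, max_load=8.0))
```

Keeping the checks on the node and the alerting logic in database queries means a web page and a pager script can both read the same table.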
Network Operation Status
• Perl scripts monitor network service connectivity for all nodes (ssh, yp, etc.)
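The farm's checks are written in Perl and are not shown. A minimal Python sketch of the same idea, testing whether a TCP service accepts connections, might look like this. The function names and the example host are hypothetical, and the sketch covers only fixed-port TCP services such as ssh on port 22; RPC-based services such as yp would need a different probe.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host, services, timeout=2.0):
    """Map each service name to whether its TCP port is reachable on host."""
    return {name: port_open(host, port, timeout) for name, port in services.items()}


if __name__ == "__main__":
    # Hypothetical usage against the local machine.
    print(check_services("127.0.0.1", {"ssh": 22}))
```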
Load Monitoring and History
• MySQL database for usage history
• History available back to Sept. '01 via web interface
[Figure: CPU load averaged over (98) Phenix machines during the month of September]
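A plot of load averaged over a group of machines implies a per-timestamp mean across nodes. The sketch below shows that aggregation on in-memory samples; the record layout `(timestamp, node, load)` and the function name are assumptions, since the slide does not describe the database schema.

```python
from collections import defaultdict
from statistics import mean


def average_load_by_time(samples):
    """Average load across nodes for each timestamp.

    samples: iterable of (timestamp, node, load) tuples.
    Returns a dict {timestamp: mean load}, ordered by timestamp.
    """
    by_time = defaultdict(list)
    for timestamp, _node, load in samples:
        by_time[timestamp].append(load)
    return {ts: mean(loads) for ts, loads in sorted(by_time.items())}


if __name__ == "__main__":
    # Hypothetical samples from two nodes at two times.
    demo = [(1, "node-a", 1.0), (1, "node-b", 3.0), (2, "node-a", 2.0)]
    print(average_load_by_time(demo))
```

In a real deployment the same grouping would more likely be done in SQL with a GROUP BY on the timestamp column.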
Plans for the Near Future
• 160 newly delivered IBM nodes to be brought online
• Expect purchase bid to go out for ~220 more nodes at beginning of FY03 (pending funding approval)
• Scaling up data storage capacity and throughput for Run-3 (up to 10X data increase over Run-2, starting in December)
• Evaluation of LSF 5 and Condor ongoing, with an eye towards distributed disk services
• Expanding Atlas GRID services