BNL Site Report
Ofer Rind, Brookhaven National Laboratory
rind@bnl.gov
Spring HEPiX Meeting, CASPUR, April 3, 2006
(Brief) Facility Overview
• The RHIC/ATLAS Computing Facility is operated by the BNL Physics Department to support the scientific computing needs of two large user communities
  • RCF is the "Tier-0" facility for the four RHIC experiments
  • ACF is the Tier-1 facility for ATLAS in the U.S.
  • Both are full-service facilities
• >2400 users, 31 FTE
• RHIC Run 6 (polarized protons) started March 5th
Mass Storage
• Soon to be in full production...
  • Two SL8500s: 2 x 6.5K tape slots, ~5 PB capacity
  • LTO-3 drives: 30 x 80 MB/sec; 400 GB/tape (native) (see the arithmetic below)
  • All Linux movers: 30 RHEL4 machines, each with 7 Gbps Ethernet connectivity and an aggregate 4 Gbps direct-attached connection to DataDirect S2A fibre channel disk
  • This is in addition to the 4 STK Powderhorn silos already in service (~4 PB, 20K 9940B tapes)
• Transition to HPSS 5.1 is complete
  • "It's different"... learning curve due to numerous changes
  • PFTP: client incompatibilities and cosmetic changes
• Improvements to the Oak Ridge Batch System optimizer
  • Code fixed to remove a long-time source of instability (no crashes since)
  • New features being designed to improve access control
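A quick back-of-the-envelope check of the LTO-3 figures quoted above (a minimal Python sketch; the constants are simply the numbers on this slide):

    # Sanity check of the SL8500/LTO-3 figures quoted above.
    N_DRIVES = 30            # LTO-3 drives
    DRIVE_MB_S = 80          # native MB/s per drive
    SLOTS = 2 * 6500         # tape slots across the two SL8500s
    TAPE_GB = 400            # native GB per LTO-3 cartridge

    aggregate_mb_s = N_DRIVES * DRIVE_MB_S          # 2400 MB/s across all drives
    capacity_pb = SLOTS * TAPE_GB / 1.0e6           # ~5.2 PB native

    print("Aggregate drive bandwidth: %d MB/s (~%.0f TB/day)"
          % (aggregate_mb_s, aggregate_mb_s * 86400 / 1.0e6))
    print("Library capacity: ~%.1f PB native" % capacity_pb)

The result (~5.2 PB native) is consistent with the ~5 PB capacity quoted above.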
Centralized Storage
• NFS: currently ~220 TB of FC SAN; 37 Solaris 9 servers
  • Over the next year, plan to retire ~100 TB of mostly NFS-served storage (MTI, ZZYZX)
• AFS: RHIC and USATLAS cells
  • Looking at Infortrend disk (SATA + FC frontend + RAID6) for an additional 4 TB (raw) per cell
  • Future: upgrade to OpenAFS 1.4
• Panasas: 20 shelves, 100 TB, heavily used by RHIC
Panasas Issues
• Panasas DirectFlow (version 2.3.2)
  • High performance and fairly stable, but... problematic from an administrative perspective:
    • Occasional stuck client-side processes left in uninterruptible sleep (a generic check for such processes is sketched below)
    • DirectFlow module causes kernel panics from time to time
    • Can always panic a kernel with panfs mounted by running a Nessus scan on the host
    • Changes in ActiveScale server configuration (e.g. changing the IP addresses of non-primary director blades), which the company claims are innocuous, can cause clients to hang
• Server-side NFS limitations
  • NFS mounting was tried as a fallback option and found to be infeasible with our configuration: heavy NFS traffic causes director blades to crash; Panasas suggests limiting to <100 clients per director blade
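As an illustration of how the stuck client-side processes can be spotted, here is a minimal Python sketch that walks /proc and lists processes in uninterruptible sleep ('D' state); it is a generic Linux check, not the facility's actual tooling:

    import os

    def d_state_processes():
        """List (pid, command) pairs for processes in uninterruptible sleep."""
        stuck = []
        for pid in filter(str.isdigit, os.listdir('/proc')):
            try:
                with open('/proc/%s/stat' % pid) as f:
                    data = f.read()
            except IOError:
                continue                      # process exited while scanning
            # The command name is parenthesized and may itself contain spaces.
            comm = data[data.index('(') + 1:data.rindex(')')]
            state = data[data.rindex(')') + 2:].split()[0]
            if state == 'D':                  # 'D' = uninterruptible sleep
                stuck.append((int(pid), comm))
        return stuck

    if __name__ == '__main__':
        for pid, comm in d_state_processes():
            print("PID %d (%s) stuck in uninterruptible sleep" % (pid, comm))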
Update on Security
• Nessus scanning program implemented as part of the ongoing DOE C&A process
  • Constant low-level scanning
  • Quarterly scanning: more intensive, with a port exclusion scheme to protect sensitive processes
• Samhain
  • Filesystem integrity checker (akin to Tripwire) with central management of monitored systems (the basic idea is illustrated below)
  • Currently deployed on all administrative systems
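For context, the core idea behind Samhain/Tripwire-style integrity checking is a stored baseline of file digests that later scans are compared against. The sketch below is a bare-bones Python illustration of that idea, not Samhain itself; the use of SHA-256 and a JSON baseline file are arbitrary choices:

    import hashlib, json, os, sys

    def digest(path, blocksize=1 << 20):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(blocksize), b''):
                h.update(block)
        return h.hexdigest()

    def snapshot(root):
        """Map every readable regular file under root to its SHA-256 digest."""
        digests = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path):
                    try:
                        digests[path] = digest(path)
                    except (IOError, OSError):
                        pass          # unreadable; a real checker would log this
        return digests

    if __name__ == '__main__':
        root, baseline_path = sys.argv[1], sys.argv[2]
        current = snapshot(root)
        if os.path.exists(baseline_path):
            with open(baseline_path) as f:
                baseline = json.load(f)
            for path in sorted(set(baseline) | set(current)):
                if baseline.get(path) != current.get(path):
                    print("CHANGED: %s" % path)
        else:
            with open(baseline_path, 'w') as f:
                json.dump(current, f)
            print("Baseline written to %s" % baseline_path)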
Linux Farm Hardware
• >4000 processors, >3.5 MSI2K
• ~700 TB of local storage (SATA, SCSI, PATA)
• SL 3.05 (3.03) for RHIC (ATLAS)
• Evaluated dual-core Opteron and Xeon for the upcoming purchase
  • Recently encountered problems with Bonnie++ I/O tests using 64-bit RHEL4 with software RAID+LVM on Opteron
  • Xeon (Paxville) gives poor SI/watt performance
Power & Cooling
• Power and cooling are now significant factors in purchasing
• Added 240 kW to the facility for the '06 upgrades
• Long term: possible site expansion
• Liebert XDV vertical top cooling modules to be installed on new racks
• CPU and ambient temperature monitoring via dtgraph and custom Python scripts (see the sketch below)
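The sketch below shows the kind of temperature check such a script might perform; it assumes lm_sensors is installed (the `sensors` output format varies by chip and driver), and the 60 C threshold is a placeholder, not the facility's actual alarm level:

    import re
    import subprocess

    THRESHOLD_C = 60.0           # placeholder alarm threshold
    TEMP_LINE = re.compile(r'^(?P<label>[^:\n]+):\s+\+?(?P<temp>[0-9.]+)\W*C', re.M)

    def read_temperatures():
        """Parse `sensors` output into {label: degrees C}."""
        out = subprocess.check_output(['sensors'], universal_newlines=True)
        return dict((m.group('label').strip(), float(m.group('temp')))
                    for m in TEMP_LINE.finditer(out))

    if __name__ == '__main__':
        for label, temp in sorted(read_temperatures().items()):
            flag = '  <-- over threshold' if temp > THRESHOLD_C else ''
            print('%-20s %5.1f C%s' % (label, temp, flag))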
Distributed Storage
• Two large dCache instances (v1.6.6) deployed in a hybrid server/client model:
  • PHENIX: 25 TB disk, 128 servers, >240 TB data
  • ATLAS: 147 TB disk, 330 servers, >150 TB data
  • Two custom HPSS backend interfaces
  • Performance tuning on ATLAS write pools
    • Peak transfer rates of >50 TB/day (put in context below)
• Other: large deployments of Xrootd (STAR), rootd, and anatrain (PHENIX)
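To put the peak dCache rate in context, a few lines of arithmetic (assuming decimal units and a rate sustained over a full day):

    # >50 TB/day expressed as a sustained rate
    tb_per_day = 50.0
    mb_per_s = tb_per_day * 1.0e6 / 86400      # ~580 MB/s
    gbit_per_s = mb_per_s * 8 / 1000.0         # ~4.6 Gb/s
    print("%.0f TB/day ~ %.0f MB/s ~ %.1f Gb/s sustained"
          % (tb_per_day, mb_per_s, gbit_per_s))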
Batch Computing
• All reconstruction and analysis batch systems have been migrated to Condor, except STAR analysis (which still awaits features such as global job-level resource reservation) and some ATLAS distributed analysis; these use LSF 6.0
• Configuration (a quick pool-status check is sketched below):
  • Five Condor (6.6.x) pools on two central managers
  • 113 available submit nodes
  • One monitoring/CondorView server and one backup central manager
• Lots of performance tuning
  • Autoclustering of jobs for scheduling; timeouts; negotiation cycle; socket cache; collector query forking; etc.
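A minimal sketch of the kind of quick check one can run against a pool of this size; it assumes a condor_status that supports the -format option and is not the site's actual tooling:

    import subprocess
    from collections import Counter

    def slot_states():
        """Count Condor resources by State (Claimed, Unclaimed, Owner, ...)."""
        out = subprocess.check_output(
            ['condor_status', '-format', '%s\n', 'State'],
            universal_newlines=True)
        return Counter(line.strip() for line in out.splitlines() if line.strip())

    if __name__ == '__main__':
        counts = slot_states()
        total = sum(counts.values())
        for state, n in counts.most_common():
            print('%-12s %6d (%.1f%%)' % (state, n, 100.0 * n / total))
        print('%-12s %6d' % ('Total', total))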
Condor Usage
• Use of a heavily modified CondorView client to display historical usage
Condor Flocking
• Goal: full utilization of computing resources on the farm
• Increasing use of a "general queue," which allows jobs to run on idle resources belonging to other experiments, provided that no local resources are available to run the job
  • Currently, such "opportunistic" jobs are immediately evicted if a local job places a claim on the resource (this policy is sketched below)
• >10K jobs completed so far
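The general-queue behaviour described above boils down to a simple start/evict rule. The sketch below expresses that rule as plain Python for illustration only; the real policy lives in Condor startd expressions, and the experiment names are hypothetical:

    def may_start(node_experiment, job_experiment, node_is_idle):
        """A general-queue job may run on another experiment's node only while it is idle."""
        return job_experiment == node_experiment or node_is_idle

    def must_evict(node_experiment, running_job_experiment, local_job_waiting):
        """An opportunistic job is evicted as soon as a local job claims the resource."""
        return running_job_experiment != node_experiment and local_job_waiting

    if __name__ == '__main__':
        # An idle STAR node may pick up an ATLAS general-queue job...
        print(may_start('star', 'atlas', node_is_idle=True))           # True
        # ...but that job is evicted the moment a STAR job wants the node back.
        print(must_evict('star', 'atlas', local_job_waiting=True))     # True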
Condor Monitoring
• Nagios and custom scripts provide live monitoring of critical daemons
• Job history from ~100 submit nodes is placed into a central database (see the sketch below)
  • This model will be replaced by Quill
  • Custom statistics are extracted from the database (e.g. general queue, throughput, etc.)
• Custom startd, schedd, and "startd cron" ClassAds allow quick viewing of the state of the pool using Condor commands
  • Some information is accessible via a web interface
• Custom startd ClassAds allow any node to be turned off remotely and peacefully
  • Not available in Condor itself
  • Note that the "condor_off -peaceful" command (v6.8) cannot be canceled; one must wait until running jobs exit
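A minimal sketch of the kind of harvesting script implied by the job-history bullet above. It assumes condor_history supports the -format option, that the MySQL-python (MySQLdb) bindings are available, and a hypothetical job_history table; it is not the facility's actual code:

    import subprocess
    import MySQLdb  # MySQL-python bindings (assumed installed)

    # Hypothetical central table:
    #   CREATE TABLE job_history (submit_node VARCHAR(64), cluster INT,
    #                             proc INT, owner VARCHAR(32), status INT);

    def harvest(submit_node, db):
        """Copy completed-job records from the local condor_history into the central DB."""
        out = subprocess.check_output(
            ['condor_history',
             '-format', '%d ', 'ClusterId',
             '-format', '%d ', 'ProcId',
             '-format', '%s ', 'Owner',
             '-format', '%d\n', 'JobStatus'],
            universal_newlines=True)
        cur = db.cursor()
        for line in out.splitlines():
            cluster, proc, owner, status = line.split()
            cur.execute(
                "INSERT INTO job_history (submit_node, cluster, proc, owner, status) "
                "VALUES (%s, %s, %s, %s, %s)",
                (submit_node, int(cluster), int(proc), owner, int(status)))
        db.commit()

    if __name__ == '__main__':
        conn = MySQLdb.connect(host='dbhost', user='condor', passwd='secret', db='condor')
        harvest('submit01.example.org', conn)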
Nagios Monitoring
• 13,958 services total on 1,963 hosts: an average of about 7 services checked per host
• Originally had one Nagios server (dual 2.4 GHz)...
  • Tremendous latency: services reported down many minutes after the fact
  • Web interface completely unusable (due to the number of hosts and services)
  • ...all of this despite a lot of Nagios and system tuning:
    • Nagios data written to a ramdisk
    • Increased the number of file descriptors and processes allowed
    • Monitoring data read from a MySQL database on a separate host
    • Web interface replaced with a lightweight interface to the database server
• Solution: split the services roughly in half between two Nagios servers
  • Latency is now very good
  • Events from both servers are logged to one MySQL server (a sample query is sketched below)
  • With two servers there is still room for many more hosts and a handful of additional service checks
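A minimal sketch of a query against the central MySQL event log mentioned above; the service_events table and its columns are hypothetical stand-ins for whatever schema the two Nagios servers actually write to:

    import MySQLdb  # MySQL-python bindings (assumed installed)

    # Hypothetical schema:
    #   CREATE TABLE service_events (logged_at DATETIME, host VARCHAR(64),
    #                                service VARCHAR(64), state VARCHAR(16));

    def recent_problems(db, hours=1):
        """Return non-OK service events logged within the last `hours` hours."""
        cur = db.cursor()
        cur.execute(
            "SELECT logged_at, host, service, state FROM service_events "
            "WHERE state <> 'OK' AND logged_at > NOW() - INTERVAL %s HOUR "
            "ORDER BY logged_at DESC",
            (hours,))
        return cur.fetchall()

    if __name__ == '__main__':
        conn = MySQLdb.connect(host='nagiosdb', user='nagios', passwd='secret', db='nagios')
        for logged_at, host, service, state in recent_problems(conn):
            print('%s  %s/%s -> %s' % (logged_at, host, service, state))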
Nagios Monitoring (figure slide)
ATLAS Tier-1 Activities
• OSG 0.4, LCG 2.7 (this week)
• ATLAS Panda (Production and Distributed Analysis) used for production since Dec. '05
  • Good performance in scaling tests, with a low failure rate and low manpower requirements
• Network upgrade
  • 2 x 10 Gig LAN and WAN
• TeraPaths QoS/MPLS (BNL, UM, FNAL, SLAC, ESnet)
  • DOE-supported project to introduce end-to-end QoS networking into data management
  • Ongoing intensive development with ESnet (SC 2005)
ATLAS Tier-1 Activities
• SC3 Service Phase (Oct-Dec '05)
  • Functionality validated for the full production chain to the Tier-1
  • Exposed some interoperability problems between BNL dCache and FTS (since fixed)
  • Needed further improvement in operation, performance, and monitoring
• SC3 Rerun Phase (Jan-Feb '06)
  • Achieved performance (disk-disk, disk-tape) and operations benchmarks
ATLAS Tier-1 Activities
• SC4 plan:
  • Deployment of the storage element, grid middleware (LFC, LCG, FTS), and the ATLAS VO box
  • April: data throughput phase (disk-disk and disk-tape); goal is T0-to-T1 operational stability
  • May: T1-to-T1 data exercise
  • June: ATLAS data distribution from T0 to T1 to select T2s
  • July-August: limited distributed data processing, plus analysis
  • Remainder of '06: increasing scale of data processing and analysis
Recent p-p Collision in STAR (figure slide)