ATLAS Tier 1 Meeting Linux Farm & Facility Operations May 22, 2007 Lyon, France
Outline • Overview • Linux Farm • Condor • Virtualization • Monitoring • Facility Operations • Experience with RHIC operations • A New Operational Model for the RACF • Summary
Overview • RACF is a heterogeneous, large-scale, multi-purpose facility with 24x7, year-round support at Brookhaven. • Supports computing activities for the RHIC experiments, ATLAS and LSST (processing, storage, web, email, backup, printing, etc.). • Primary computing facility for the RHIC experiments; U.S. Tier 1 Center for ATLAS computing; a component of LSST computing. • Over 7 PB of tape storage capacity, 200+ TB of centralized disk storage, 1.3+ PB of distributed storage and almost 5 million SI2K of computing capacity. • Maintained and operated by 36 staff members (and growing).
Linux Farm (cont.) • Majority of processing power and distributed storage at the RACF. • Consumes significant infrastructure resources (power, cooling, space and network). • Selected interactive systems provide access to Linux Farm resources. • The challenge is managing large numbers of similar servers, not high-availability systems; innovation and automation keep the staff workload manageable. • Migration to dual-core in 2006 (quad-core evaluation in 2008). • Migration to 32-bit SL-4.x (for now).
Linux Farm (cont.) • Established procedure for evaluation and procurement of Linux Farm computing and Linux Farm-based distributed storage. • Adding 1.5 million SI2K and 800 TB of distributed storage in 2007. • Total of 5,500 CPUs (Intel and AMD) in nearly 1,700 rack-mounted servers by mid-2007. • Productivity gains from: • Condor • Virtualization • Meeting projected resource requirements for ATLAS will be a challenge.
Condor • Replaced LSF in 2004 (still some legacy LSF support for the STAR experiment). • Steep learning curve: Condor configuration required significant man-hours in meetings with the developers. • Custom configuration for each experiment. • Common (general) queue for all experiments allows back-fill of unused CPU cycles on a low-priority basis (see the sketch below). • Higher utilization, but also more complexity.
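To make the back-fill idea concrete, here is a minimal sketch (Python) of how a general-queue job might be tagged and handed to condor_submit. The custom ClassAd attribute name and the executable are hypothetical, and the execute-node START/PREEMPT policy that actually enforces back-fill is site configuration not shown here.

```python
#!/usr/bin/env python
"""Minimal sketch: submit a back-fill job to a shared 'general' queue.

Assumes a pool whose execute-node policy (START/PREEMPT expressions,
configured separately by the site) only runs jobs carrying the custom
ClassAd attribute below when an experiment's own slots would otherwise
sit idle. Attribute and file names are hypothetical.
"""
import subprocess
import tempfile

SUBMIT_TEMPLATE = """\
universe   = vanilla
executable = {executable}
output     = {executable}.out
error      = {executable}.err
log        = {executable}.log
# Site-specific tag so node policy can identify low-priority back-fill work.
+JobQueue  = "general"
# Lower priority relative to the submitter's other jobs.
priority   = -10
queue
"""

def submit_general_job(executable):
    """Write a submit description and pass it to condor_submit."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(SUBMIT_TEMPLATE.format(executable=executable))
        submit_file = f.name
    subprocess.check_call(["condor_submit", submit_file])

if __name__ == "__main__":
    submit_general_job("./analysis_job.sh")   # hypothetical payload
```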
[Figure: RCF/ACF Farm Occupancy. The Condor general queue was fully enabled in Aug. 2006.]
[Figure: Condor Jobs in the RACF Linux Farm. The transition from LSF to Condor began in June 2004. (*) May 1-15, 2007 only.]
Virtualization • Condor is built for throughput, not performance, so there is built-in inefficiency. • Addressed with the general-queue concept, blurring the boundaries between experiment-exclusive resources. • Not entirely successful: different experiments require different software environments. • Utilization can be raised further by supporting multiple software environments on the same physical hardware, i.e., virtualization (see the sketch below). • Increasing hardware/software support for virtualization (e.g., Red Hat and Intel). • Pursuing both Xen and VMware. • Xen testbed created in February 2007; VMware testbed available in summer 2007. • Virtualization can also be used to create a software testbed without taking resources from the production environment.
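As a small illustration of running several experiment environments on one physical host, the sketch below uses the libvirt Python bindings with the Xen driver to list the guests on a farm node. The connection URI and the idea that each guest corresponds to one experiment's environment are assumptions, not a description of the RACF setup.

```python
#!/usr/bin/env python
"""Sketch: list the running Xen guests on a farm node and their resources.

Assumes the libvirt Python bindings with the Xen driver; the premise is
that each guest carries the software environment of one experiment, so
several environments can share one physical host.
"""
import libvirt

def list_guests(uri="xen:///"):
    conn = libvirt.open(uri)              # connect to the local Xen hypervisor
    for dom_id in conn.listDomainsID():   # IDs of running domains
        dom = conn.lookupByID(dom_id)
        # info() returns (state, maxMem KB, memory KB, nrVirtCpu, cpuTime ns)
        state, max_mem, mem, vcpus, cpu_time = dom.info()
        print("%-20s  %2d vCPUs  %6d MB" % (dom.name(), vcpus, mem // 1024))
    conn.close()

if __name__ == "__main__":
    list_guests()
```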
Monitoring • The evolution of the RACF from a local to a globally available resource highlights the importance of a reliable, well-instrumented monitoring system. • RACF monitors service availability, system performance and facility infrastructure (power and cooling). • Mixture of commercial, open-source and RACF-written components: • RT • Ganglia • Nagios • Infrastructure • Condor • Choices guided by desired features: historical logs, alarm escalation, real-time information.
RT • Trouble ticket system. • Historical records available. • Currently coupled to the monitoring software for alarm escalation and event logging. • Integration into the SLA for ATLAS.
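A minimal sketch, assuming RT's standard e-mail gateway is enabled, of how a monitoring alarm could be turned into an RT ticket; the queue address, sender and SMTP host are placeholders.

```python
#!/usr/bin/env python
"""Sketch: open an RT trouble ticket from a monitoring alarm.

Uses RT's conventional e-mail gateway: mail sent to a queue address
becomes a ticket. Addresses and the SMTP host are placeholders for
site-specific values.
"""
import smtplib
from email.mime.text import MIMEText

RT_QUEUE_ADDR = "rt-linuxfarm@example.bnl.gov"   # placeholder queue address
SMTP_HOST = "localhost"

def open_ticket(subject, body, sender="monitor@example.bnl.gov"):
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = RT_QUEUE_ADDR
    s = smtplib.SMTP(SMTP_HOST)
    s.sendmail(sender, [RT_QUEUE_ADDR], msg.as_string())
    s.quit()

if __name__ == "__main__":
    open_ticket("ALARM: farm node unreachable",
                "Monitoring reports a node down; escalating via RT.")
```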
Ganglia • Open-source, distributed, hierarchical monitoring tool for federations of clusters. • Leverages existing tools (e.g., XML for data representation and RRDtool for data storage and visualization) for ease of management. • Low overhead and scalable to the thousands of systems at the RACF. • Monitors cluster performance (storage capacity, computing throughput, etc.).
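A minimal sketch of pulling the XML stream that a gmond daemon publishes and extracting one metric per host; the gmond host name, the default TCP port 8649 and the choice of metric are assumptions for illustration.

```python
#!/usr/bin/env python
"""Sketch: read the XML that gmond publishes and report one metric per host."""
import socket
import xml.etree.ElementTree as ET

def read_gmond_xml(host="gmond.example.bnl.gov", port=8649):
    # gmond answers a plain TCP connection with an XML dump of cluster state.
    data = []
    s = socket.create_connection((host, port))
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        data.append(chunk)
    s.close()
    return b"".join(data)

def report_metric(xml_blob, metric="load_one"):
    root = ET.fromstring(xml_blob)
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                print("%-25s %s = %s" % (host.get("NAME"), metric, m.get("VAL")))

if __name__ == "__main__":
    report_metric(read_gmond_xml())
```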
Nagios • Open-source software used to monitor service availability. • Host-based daemons configured to use externally supplied "plugins" to obtain service status (see the sketch below). • Host-based alarms configured to take specified actions (e-mail notification, system reboot, etc.). • Native web interface is not scalable. • Connected to the RT ticketing system for alarm escalation and logging.
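For reference, a minimal sketch of an externally supplied plugin in the standard Nagios style: print a one-line status and exit 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The host, port and service checked are placeholders.

```python
#!/usr/bin/env python
"""Sketch of a Nagios plugin: check that a TCP service answers.

Follows the standard plugin convention: one-line status on stdout and
an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).
"""
import socket
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_tcp(host, port, timeout=5.0):
    try:
        s = socket.create_connection((host, port), timeout)
        s.close()
    except socket.timeout:
        print("WARNING: %s:%d slow to answer" % (host, port))
        return WARNING
    except socket.error as err:
        print("CRITICAL: %s:%d unreachable (%s)" % (host, port, err))
        return CRITICAL
    print("OK: %s:%d answering" % (host, port))
    return OK

if __name__ == "__main__":
    # Placeholder host/port; a real check would come from the Nagios config.
    sys.exit(check_tcp("rnis01.example.bnl.gov", 111))
```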
Infrastructure • The growth of the RACF has put considerable strain on the building's power and cooling infrastructure. • UPS back-up power for RACF equipment. • Custom RACF-written script monitors power and cooling issues (see the sketch below). • Alarm escalation through the RT ticketing system. • Automatic shutdown of the Linux Farm during cooling or power failures.
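A minimal sketch of the kind of watchdog described above, assuming a temperature reading is available from some site-specific command and that nodes can be shut down over ssh; every command, threshold and host name here is a placeholder rather than the actual RACF script.

```python
#!/usr/bin/env python
"""Sketch of a cooling watchdog for the Linux Farm.

Polls a temperature reading and, above a threshold, shuts farm nodes
down over ssh. The sensor command, threshold, node list and shutdown
mechanism are all placeholders.
"""
import subprocess
import time

TEMP_LIMIT_C = 35.0                    # placeholder threshold
FARM_NODES = ["rcrs0001", "rcrs0002"]  # placeholder node list

def read_room_temperature():
    """Placeholder: parse the output of a site-specific sensor command."""
    out = subprocess.check_output(["read_machine_room_temp"])  # hypothetical command
    return float(out.strip())

def shutdown_farm(nodes):
    for node in nodes:
        subprocess.call(["ssh", node, "shutdown", "-h", "now"])

def watchdog(poll_seconds=60):
    while True:
        if read_room_temperature() > TEMP_LIMIT_C:
            shutdown_farm(FARM_NODES)
            break
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watchdog()
```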
Condor • The Condor batch system does not provide a monitoring interface. • RACF created its own web-based monitoring interface. • Interface available to staff for performance tuning and to facility users. • Connected to RT for critical servers. • Monitoring functions: • Throughput • Service availability • Configuration optimization
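A minimal sketch of the kind of occupancy summary such an interface might sit on top of, using condor_status to count slot states; grouping by experiment and the HTML rendering are omitted.

```python
#!/usr/bin/env python
"""Sketch: summarize Condor slot occupancy for a monitoring page.

Calls condor_status and counts slot states (Claimed, Unclaimed, ...).
A real interface would also group by experiment and render HTML.
"""
import subprocess
from collections import Counter

def slot_states():
    out = subprocess.check_output(
        ["condor_status", "-format", "%s\n", "State"])
    return Counter(out.decode().split())

if __name__ == "__main__":
    counts = slot_states()
    total = sum(counts.values())
    for state, n in sorted(counts.items()):
        print("%-12s %5d  (%4.1f%%)" % (state, n, 100.0 * n / total))
    print("Total slots: %d" % total)
```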
Facility Operations • Facility operations is a manpower-intensive activity at the RACF. • Careful choice of technologies is required to scale capacity and services. • Operational responsibility is divided among the major support groups within the facility (tape storage, disk storage, Linux Farm, general computing): • Software upgrades • Hardware lifecycle management • Integrity of facility services • User account lifecycle management • Cyber-security • Eight years of experience with RHIC operations. • Can be used as a starting point for ATLAS Tier 1 facility operations.
Experience with RHIC Operations • 24x7, year-round operations already in place with the RHIC experiments since 2000. • Facility components classified into 3 categories: non-essential, essential and critical. • Response to component failure is commensurate with component classification: • Critical components are covered 24x7 year-round; immediate response is expected from on-call staff. • Essential components have built-in redundancy/duplication and are addressed the next business day; escalated to "critical" if a large number of essential components fail and compromise service availability. • Non-essential components are addressed the next business day. • Staff provides primary coverage during normal business hours. • Operators are the first point of contact during off-hours and weekends.
Experience with RHIC Operations (cont.) • Operators are responsible for contacting the appropriate on-call person. • Users report problems via the e-mail-based trouble ticketing system, pagers and phone. • Monitoring software instrumented with an alarm system. • Alarm system connected to selected pagers and cell phones. • Alarm escalation procedure for staff during off-hours and weekends (e.g., contact the back-up if the primary is not available; see the sketch below). • Periodic rotation of the primary and back-up on-call lists for each subsystem. • Automatic response to alarm conditions in certain cases (e.g., shutdown of the Linux Farm cluster in case of a cooling failure). • Facility operations for RHIC have worked well over the past 8 years.
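A minimal sketch of the off-hours escalation logic (page the primary, fall back to the back-up); the on-call list, pager gateway addresses and the acknowledgement check are placeholders.

```python
#!/usr/bin/env python
"""Sketch of off-hours alarm escalation: page primary, then back-up.

The on-call list, the e-mail-to-pager gateway addresses and the way an
acknowledgement is detected are placeholders for site-specific pieces.
"""
import smtplib
import time
from email.mime.text import MIMEText

ON_CALL = {"linuxfarm": [("primary", "1234567@pager.example.gov"),
                         ("backup",  "7654321@pager.example.gov")]}
ACK_WAIT_SECONDS = 15 * 60   # placeholder: wait before escalating

def page(address, message):
    msg = MIMEText(message)
    msg["Subject"] = "RACF alarm"
    msg["From"] = "alarms@example.bnl.gov"
    msg["To"] = address
    s = smtplib.SMTP("localhost")
    s.sendmail(msg["From"], [address], msg.as_string())
    s.quit()

def acknowledged():
    """Placeholder: in practice, check the RT ticket for a reply."""
    return False

def escalate(subsystem, message):
    for role, address in ON_CALL[subsystem]:
        page(address, "[%s] %s" % (role, message))
        time.sleep(ACK_WAIT_SECONDS)
        if acknowledged():
            return
    # Nobody answered: the ticket stays open for the next business day.

if __name__ == "__main__":
    escalate("linuxfarm", "Cooling alarm in the Linux Farm area")
```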
Service Level Agreement

Table 1: Summary of RCF Services and Servers

Service                  Server              Rank  Comments
Network to Ring                              1
Internal Network                             1
External Network                             1     ITD handles
RCF firewall                                 1     ITD handles
HPSS                     rmdsXX              1
AFS Server               rafsXX              1
AFS File systems                             1
NFS Server                                   1
NFS home directories     rmineXX             1
CRS Management           rcrsfm, rcras       1     rcrsfm is 1, rcras is 2
Web server (internet)    www.rhic.bnl.gov    1
Web server (intranet)    www.rcf.bnl.gov     1
NFS data disks           rmineXX             1
Instrumentation                              2
SAMBA                    rsmb00
DNS                      rnisXX              2     Should fail over
NIS                      rnisXX              2     Should fail over
NTP                      rnisXX              2     Should fail over
RCF gateways                                 2     Multiple gateway machines
ADSM backup                                  2
Wincenter                rnts00              2/3
CRS Farm                                     2
LSF                      rlsf00              2
CAS Farm                                     2
rftp                                         2
Oracle                                       2
Objectivity                                  2
MySQL                                        2
Email                                        2/3
Printers                                     3
A New Operational Model for the RACF • RHIC facility operations follows a system-based approach. • ATLAS needs support for (mostly) remote users. • A service-based operational approach is better suited to a distributed computing environment. • Dependencies between services and systems are mapped for an integrated approach. • Service Coordinators are responsible for the availability of services. • A new SLA for the RACF will incorporate the service-based approach. • Implementation details and timelines are not yet finalized.
Summary • Growth in scale and complexity of the RACF operations. • Virtualization can help increase cluster productivity. • Monitoring integrated with alarm system for increased productivity. • Well-established procedures from RHIC operational experience. • New service-based SLA for distributed computing model needed. • New operational approach to meet ATLAS requirements to be implemented.