

  1. ATLAS Tier 1 Meeting Linux Farm & Facility Operations May 22, 2007 Lyon, France

  2. Outline • Overview • Linux Farm • Condor • Virtualization • Monitoring • Facility Operations • Experience with RHIC operations • A New Operational Model for the RACF • Summary

  3. Overview • RACF is a heterogeneous, large-scale, multi-purpose facility with 24x7 year-round support at Brookhaven. • Supports computing activities for RHIC experiments, ATLAS and LSST (processing, storage, web, email, backup, printing, etc.). • Primary computing facility for RHIC experiments; Tier 1 Center for ATLAS computing in the U.S.; a component of LSST computing. • Over 7 PB of tape storage capacity, 200+ TB of centralized disk storage, 1.3+ PB of distributed storage and almost 5 million SI2K of computing capacity. • Maintained and operated by 36 staff members (and growing).

  4. Tape Storage System

  5. Disk Storage

  6. Linux Farm

  7. Linux Farm (cont.) • Majority of processing power and distributed storage at RACF. • Consumes significant infrastructure resources (power, cooling, space and network). • Selected interactive systems provide access to Linux Farm resources. • Challenge is managing large numbers of similar servers, not high-availability systems → innovation and automation to keep staff workload manageable. • Migration to dual-core in 2006 (quad-core evaluation in 2008). • Migration to 32-bit SL-4.x (for now).

  8. Linux Farm (cont.) • Established procedure for evaluation and procurement of Linux Farm computing and Linux Farm-based distributed storage. • Adding 1.5 million SI2K and 800 TB of distributed storage in 2007. • Total of 5500 CPUs (Intel and AMD) in nearly 1700 rack-mounted servers by mid-2007. • Productivity gains from: • Condor • Virtualization • Meeting projected resource requirements for ATLAS will be a challenge.

  9. RACF Linux Farm Computing Power

  10. RACF Linux Farm Distributed Storage

  11. Expected Computing Capacity Evolution

  12. Expected Storage Capacity Evolution

  13. Projected Power Requirements

  14. Projected Space Requirements

  15. Condor • Replaced LSF in 2004 (still some legacy LSF support for the STAR experiment). • Steep learning curve for Condor configuration → required significant man-hours in meetings with developers. • Custom configuration for each experiment. • Common (general) queue for all experiments to allow back-fill of unused CPU cycles on a low-priority basis. • Higher utilization but also more complexity.
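The ATLAS-specific policy and configuration shown on the next two slides are not reproduced in this transcript. Purely as an illustration of the general-queue idea, the sketch below renders a Condor startd policy fragment that prefers jobs from a node's owning experiment while still admitting back-fill jobs; the experiment name, accounting-group convention and RANK weighting are assumptions, not the RACF configuration.

```python
# Illustrative sketch only: emit a Condor startd policy fragment in the spirit
# of the general-queue back-fill described above.  START and RANK are standard
# Condor startd expressions (higher RANK is preferred when several jobs match);
# the group naming convention and numbers here are assumptions.

def render_policy(owner_experiment: str) -> str:
    return "\n".join([
        # Accept any job: dedicated-experiment jobs and general-queue back-fill.
        "START = True",
        # Prefer jobs submitted under the owning experiment's (hypothetical)
        # accounting group; everything else, including the general queue,
        # ranks lower and effectively back-fills idle cycles.
        f'RANK = ifThenElse(TARGET.AccountingGroup =?= "group_{owner_experiment}", 10, 0)',
    ])

if __name__ == "__main__":
    print(render_policy("atlas"))
```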

  16. Condor Policy for ATLAS

  17. Condor configuration for ATLAS

  18. RCF/ACF Farm Occupancy (Condor general queue fully enabled in Aug. 2006)

  19. Condor Jobs in the RACF Linux Farm (transition from LSF to Condor began in June 2004; (*) May 1–15, 2007 only)

  20. Virtualization • Condor is built for throughput, not performance → built-in inefficiency. • Addressed with the general queue concept → blurring boundaries between experiment-exclusive resources. • Not entirely successful → different software environments for different experiments. • Can reach higher utilization levels if we support multiple software environments on the same physical hardware → virtualization. • Increasing hardware/software support for virtualization (e.g., RedHat and Intel). • Pursuing both Xen and VMware. • Xen testbed created in February 2007; VMware testbed available summer 2007. • Virtualization can also be used to create a software testbed without taking resources from the production environment.
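As a flavour of the tooling this enables, here is a minimal sketch that lists the guest domains on a Xen dom0 host by parsing the default tabular output of the standard `xm list` command; it assumes a Xen 3.x-era host where `xm` is available and the script has sufficient privileges.

```python
"""Minimal sketch: list Xen guest domains on a dom0 host via `xm list`.

Assumes Xen's standard `xm` command-line tool is available and that its
default columns are "Name  ID  Mem(MiB)  VCPUs  State  Time(s)".
"""
import subprocess


def list_guests():
    out = subprocess.run(["xm", "list"], capture_output=True,
                         text=True, check=True).stdout
    guests = []
    for line in out.splitlines()[1:]:              # skip the header line
        fields = line.split()
        if len(fields) >= 6 and fields[0] != "Domain-0":
            guests.append({"name": fields[0],
                           "mem_mib": int(fields[2]),
                           "vcpus": int(fields[3])})
    return guests


if __name__ == "__main__":
    for g in list_guests():
        print(f"{g['name']}: {g['mem_mib']} MiB, {g['vcpus']} VCPU(s)")
```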

  21. Virtualization

  22. Monitoring • Evolution of the RACF from a local to a globally available resource highlights the importance of a reliable, well-instrumented monitoring system. • RACF monitors service availability, system performance and facility infrastructure (power and cooling). • Mixture of commercial, open-source and RACF-written components. • RT • Ganglia • Nagios • Infrastructure • Condor • Choices guided by desired features: historical logs, alarm escalation, real-time information.

  23. RT • Trouble ticket system. • Historical records available. • Currently coupled to the monitoring software for alarm escalation and event logging. • Integration into the SLA for ATLAS.
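As a sketch of this coupling, the snippet below opens a ticket by mailing an alarm to an RT queue address, which is a common way of feeding RT from monitoring scripts; the addresses and SMTP host are placeholders, not the RACF setup.

```python
"""Minimal sketch: open an RT ticket for an alarm via RT's e-mail gateway.

Assumes RT is configured (as is common) to create a ticket from mail sent
to a per-queue address; the addresses and SMTP host are placeholders.
"""
import smtplib
from email.message import EmailMessage


def open_ticket(subject, body,
                queue_addr="linuxfarm-alarms@example.bnl.gov",   # placeholder
                sender="monitor@example.bnl.gov",                # placeholder
                smtp_host="localhost"):
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = queue_addr        # RT's mail gateway turns this into a ticket
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(smtp_host) as s:
        s.send_message(msg)


if __name__ == "__main__":
    open_ticket("ALARM: NFS server unreachable",
                "Automatically generated by the monitoring system.")
```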

  24. Ganglia • Open-source, distributed hierarchical monitoring tool for federations of clusters. • Leverages existing tools (e.g., XML for data representation and RRDtool for data storage and visualization) for ease of management. • Low-overhead and scalable to thousands of systems at RACF. • Monitors cluster performance (storage capacity, computing throughput, etc.).
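Because gmond publishes its cluster state as an XML document to anyone who connects to its TCP port (8649 by default), metrics can also be pulled programmatically; a minimal sketch follows (the host name is a placeholder).

```python
"""Minimal sketch: read Ganglia's XML cluster report directly from gmond.

gmond serves an XML dump of its metrics on its TCP port (8649 by default);
the host name below is a placeholder.
"""
import socket
import xml.etree.ElementTree as ET


def read_gmond(host="gmond.example.bnl.gov", port=8649):
    with socket.create_connection((host, port), timeout=10) as sock:
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return ET.fromstring(b"".join(chunks))


if __name__ == "__main__":
    report = read_gmond()
    for host in report.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        print(host.get("NAME"), "load_one =", metrics.get("load_one"))
```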

  25. Nagios • Open-source software used to monitor service availability. • Host-based daemons configured to use externally-supplied “plugins” to obtain service status. • Host-based alarm configured to take specified actions (e-mail notification, system reboot, etc). • Native web-interface not scalable. • Connected to RT ticketing system for alarm escalation and logging.
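A minimal sketch of such a plugin, following the standard Nagios plugin convention of printing one status line and exiting with 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN; the target host and port are placeholders.

```python
#!/usr/bin/env python
"""Minimal Nagios-style plugin sketch: check that a TCP service answers.

Follows the standard plugin convention: one status line on stdout and an
exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).
"""
import socket
import sys
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, warn_seconds=1.0):
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=5):
            elapsed = time.time() - start
    except OSError as exc:
        print(f"CRITICAL - {host}:{port} unreachable ({exc})")
        return CRITICAL
    if elapsed > warn_seconds:
        print(f"WARNING - {host}:{port} answered in {elapsed:.2f}s")
        return WARNING
    print(f"OK - {host}:{port} answered in {elapsed:.2f}s")
    return OK


if __name__ == "__main__":
    # Placeholder target; a real plugin would take host/port arguments.
    sys.exit(check_tcp("nfs01.example.bnl.gov", 2049))
```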

  26. Nagios (cont.)

  27. Infrastructure • The growth of the RACF has put considerable strain on the building's power and cooling infrastructure. • UPS back-up power for RACF equipment. • Custom RACF-written script to monitor power and cooling issues. • Alarm escalation through the RT ticketing system. • Automatic shutdown of the Linux Farm during cooling or power failures.
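A minimal sketch of the automatic-shutdown idea; the temperature source (a hypothetical file fed by machine-room sensors), the threshold and the per-node shutdown mechanism are placeholders, not the actual RACF script.

```python
"""Minimal sketch of the cooling-failure response described above.

The sensor feed, threshold and shutdown mechanism are placeholders.
"""
import subprocess

ROOM_TEMP_FILE = "/var/run/machine_room_temp"   # hypothetical sensor feed
SHUTDOWN_THRESHOLD_C = 32.0                     # illustrative threshold


def room_temperature():
    with open(ROOM_TEMP_FILE) as f:
        return float(f.read().strip())


def shutdown_farm(nodes):
    for node in nodes:
        # Hypothetical: power off a worker node over ssh.
        subprocess.run(["ssh", node, "shutdown", "-h", "now"], check=False)


if __name__ == "__main__":
    if room_temperature() > SHUTDOWN_THRESHOLD_C:
        shutdown_farm(["node%04d.example.bnl.gov" % i for i in range(1, 4)])
```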

  28. Condor • The Condor batch system does not provide a monitoring interface. • RACF created its own web-based monitoring interface. • Interface available to staff for performance tuning and to facility users. • Connected to RT for critical servers. • Monitoring functions • Throughput • Service Availability • Configuration Optimization
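A sketch of the kind of data such an interface collects, using only the summary line that condor_q prints at the end of its output; the exact wording of that line varies between Condor releases, so the pattern below is an assumption.

```python
"""Minimal sketch: scrape queue totals from condor_q's summary line.

condor_q ends its output with a summary such as
"1234 jobs; 200 idle, 1000 running, 34 held"; the exact wording differs
between Condor releases, so the regular expression is an assumption.
"""
import re
import subprocess

SUMMARY_RE = re.compile(
    r"(\d+)\s+jobs;.*?(\d+)\s+idle,\s*(\d+)\s+running,\s*(\d+)\s+held")


def queue_totals():
    out = subprocess.run(["condor_q"], capture_output=True,
                         text=True, check=True).stdout
    match = SUMMARY_RE.search(out)
    if match is None:
        raise RuntimeError("could not find the condor_q summary line")
    total, idle, running, held = map(int, match.groups())
    return {"total": total, "idle": idle, "running": running, "held": held}


if __name__ == "__main__":
    print(queue_totals())
```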

  29. Condor (cont.)

  30. Condor (cont.)

  31. Facility Operations • Facility operations is a manpower-intensive activity at the RACF. • Careful choice of technologies required for scaling of capacity and services. • Operational responsibility divided among major support groups within the facility (tape storage, disk storage, Linux Farm, general computing): • Software upgrades • Hardware lifecycle management • Integrity of facility services • User account lifecycle management • Cyber-security • Experience from RHIC operations over the past 8 years. • Can be used as a starting point for ATLAS Tier 1 facility operations.

  32. Experience with RHIC Operations • 24x7 year-round operations already in place with RHIC experiments since 2000. • Facility components classified into 3 categories: non-essential, essential and critical. • Response to component failure commensurate with component classification: • Critical components are covered 24x7 year-round. Immediate response is expected from on-call staff. • Essential components have built-in redundancy/duplication and are addressed the next business day. Escalated to “critical” if a large number of essential components fail and compromise service availability. • Non-essential components are addressed the next business day. • Staff provides primary coverage during normal business hours. • Operators are the first point of contact during off-hours and weekends.
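A sketch of how such a classification and its response policy might be encoded; only the three-way classification and the escalation rule come from the model above, while the component names and the escalation threshold are illustrative assumptions.

```python
"""Sketch: component classification and the matching response policy.

Component names and the escalation threshold are illustrative assumptions.
"""

RESPONSE = {
    "critical":      "page on-call staff immediately, 24x7 year-round",
    "essential":     "address the next business day (redundant component)",
    "non-essential": "address the next business day",
}

CLASSIFICATION = {                 # hypothetical component names
    "hpss":          "critical",
    "afs_server":    "critical",
    "condor_schedd": "essential",
    "worker_node":   "non-essential",
}


def response_for(component, failed_essential_fraction=0.0):
    rank = CLASSIFICATION.get(component, "non-essential")
    # If enough essential components fail at once to compromise the service,
    # the incident is escalated and handled as critical.
    if rank == "essential" and failed_essential_fraction > 0.25:
        rank = "critical"
    return rank, RESPONSE[rank]


if __name__ == "__main__":
    print(response_for("condor_schedd", failed_essential_fraction=0.4))
```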

  33. Experience with RHIC Operations (cont.) • Operators are responsible for contacting the appropriate on-call person. • Users report problems via an e-mail-based trouble ticketing system, pagers and phone. • Monitoring software instrumented with an alarm system. • Alarm system connected to selected pagers and cell phones. • Alarm escalation procedure for staff (e.g., contact the back-up if the primary is not available) during off-hours and weekends. • Periodic rotation of the primary and back-up on-call list for each subsystem. • Automatic response to alarm conditions in certain cases (e.g., shutdown of the Linux Farm cluster in case of cooling failure). • Facility operations for RHIC has worked well over the past 8 years.
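A minimal sketch of the primary/backup escalation step; the rota, the acknowledgement check and the page() transport are placeholders for whatever pager or cell-phone gateway is actually used.

```python
"""Minimal sketch of off-hours alarm escalation: page the primary on-call
person, then the backup if the primary does not acknowledge in time.

The rota, acknowledgement check and page() transport are placeholders.
"""
import time

ONCALL = {"linux_farm": {"primary": "alice", "backup": "bob"}}   # hypothetical


def page(person, message):
    # Placeholder: a real implementation would talk to a pager/SMS gateway.
    print(f"PAGE {person}: {message}")


def acknowledged(person):
    # Placeholder: a real implementation would check the ticketing system.
    return False


def escalate(subsystem, message, ack_timeout_s=900, poll_s=60):
    rota = ONCALL[subsystem]
    page(rota["primary"], message)
    deadline = time.time() + ack_timeout_s
    while time.time() < deadline:
        if acknowledged(rota["primary"]):
            return
        time.sleep(poll_s)
    page(rota["backup"], message)      # escalate to the backup on-call person


if __name__ == "__main__":
    # Short timeout so the demonstration escalates quickly.
    escalate("linux_farm", "Cooling alarm in the Linux Farm area",
             ack_timeout_s=2, poll_s=1)
```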

  34. Table 1: Summary of RCF Services and Servers (Service Level Agreement)

Service | Server | Rank | Comments
Network to Ring | | 1 |
Internal Network | | 1 |
External Network | | 1 | ITD handles
RCF firewall | | 1 | ITD handles
HPSS | rmdsXX | 1 |
AFS Server | rafsXX | 1 |
AFS File systems | | 1 |
NFS Server | | 1 |
NFS home directories | rmineXX | 1 |
CRS Management | rcrsfm, rcras | 1 | rcrsfm is 1, rcras is 2
Web server (internet) | www.rhic.bnl.gov | 1 |
Web server (intranet) | www.rcf.bnl.gov | 1 |
NFS data disks | rmineXX | 1 |
Instrumentation | | 2 |
SAMBA | rsmb00 | |
DNS | rnisXX | 2 | Should fail over
NIS | rnisXX | 2 | Should fail over
NTP | rnisXX | 2 | Should fail over
RCF gateways | | 2 | Multiple gateway machines
ADSM backup | | 2 |
Wincenter | rnts00 | 2/3 |
CRS Farm | | 2 |
LSF | rlsf00 | 2 |
CAS Farm | | 2 |
rftp | | 2 |
Oracle | | 2 |
Objectivity | | 2 |
MySQL | | 2 |
Email | | 2/3 |
Printers | | 3 |

  35. A New Operational Model for the RACF • RHIC facility operations is a system-based approach. • ATLAS needs support for (mostly) remote users. • A service-based operational approach is better suited for a distributed computing environment. • Dependencies between services and systems mapped for an integrated approach. • Service Coordinators responsible for the availability of services. • New SLA for the RACF to incorporate the service-based approach. • Implementation details and timelines not yet finalized.

  36. A Dependency Matrix
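The matrix itself is shown as a figure on this slide and is not reproduced in the transcript. As a sketch of the idea, with placeholder service and system names, the same information can be held as a mapping from services to the systems they depend on, which also gives the reverse lookup a Service Coordinator needs when a system fails.

```python
"""Sketch of a service-to-system dependency matrix and its reverse lookup.

Service and system names are placeholders; the point is only that a system
failure can be mapped onto the services it affects.
"""

DEPENDS_ON = {                       # service -> systems it requires
    "batch processing": {"condor", "nfs_home", "network"},
    "grid transfers":   {"network", "disk_storage", "tape_storage"},
    "user login":       {"gateways", "nfs_home", "network"},
}


def services_affected_by(system):
    return sorted(svc for svc, systems in DEPENDS_ON.items()
                  if system in systems)


if __name__ == "__main__":
    print(services_affected_by("nfs_home"))   # ['batch processing', 'user login']
```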

  37. A New Response Approach

  38. Summary • Growth in scale and complexity of the RACF operations. • Virtualization can help increase cluster productivity. • Monitoring integrated with alarm system for increased productivity. • Well-established procedures from RHIC operational experience. • New service-based SLA for distributed computing model needed. • New operational approach to meet ATLAS requirements to be implemented.
