Developing & Managing A Large Linux Farm – The Brookhaven Experience. CHEP2004 – Interlaken, September 27, 2004. Tomasz Wlodek - BNL.
Background • Brookhaven National Lab (BNL) is a multi-disciplinary research laboratory funded by the US government. • BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments. • The RHIC Computing Facility (RCF) was formed in the mid-1990s to address the computing needs of the RHIC experiments.
Background (cont.) • BNL has also been chosen as the site of the Tier-1 ATLAS Computing Facility (ACF) for the ATLAS experiment at CERN. • RCF/ACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc.).
Background (cont.) • The Linux Farm is the main source of CPU (and increasingly storage) resources in the RCF/ACF • RCF/ACF is transforming itself from a local resource into a national and global resource • Growing design and operational complexity • Increasing staffing levels to handle additional responsibilities
The Pre-Grid Era • Rack-mounted commodity hardware • Self-contained, localized resources • Resources available only to local users • Little interaction with external resources at remote locations • Considerable freedom to set own usage policies
The (Near-Term) Future • Resources available globally • Distributed computing architecture • Extensive interaction with remote resources requires closer software inter-operability and higher network bandwidth • Constraints on freedom to set own policies
How do we get there? • Change in management philosophy • Evolution in hardware requirements • Evolution in software packages • Different security protocol(s) • Change in access policy
Change in Management Philosophy • Automated monitoring & management of servers in large clusters is a must • Remote power management, predictive hardware failure analysis and preventive maintenance are important • High availability is based on a large number of identical servers, not on 24-hour support • Increasingly large clusters are only manageable if servers are identical; avoid specialized servers
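The monitoring idea on this slide can be illustrated with a minimal sketch. It is not the RCF/ACF tooling: the `NodeStatus` fields, the thresholds and the function names are all hypothetical, chosen only to show how automated triage plus a pool of identical servers can stand in for round-the-clock support.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    name: str
    seconds_since_heartbeat: float  # hypothetical: age of the last monitoring report
    corrected_disk_errors: int      # hypothetical: counter used as a failure predictor


def triage(nodes, heartbeat_timeout=300.0, error_threshold=50):
    """Sort nodes into healthy / at_risk / down buckets (thresholds are assumptions)."""
    result = {"healthy": [], "at_risk": [], "down": []}
    for node in nodes:
        if node.seconds_since_heartbeat > heartbeat_timeout:
            # No recent report: treat the node as down (candidate for a remote power cycle).
            result["down"].append(node.name)
        elif node.corrected_disk_errors >= error_threshold:
            # Still running, but the error trend suggests preventive maintenance.
            result["at_risk"].append(node.name)
        else:
            result["healthy"].append(node.name)
    return result


def farm_is_available(result, min_usable_fraction=0.9):
    """With identical servers, availability is a property of the pool, not of one node."""
    total = sum(len(names) for names in result.values())
    usable = len(result["healthy"]) + len(result["at_risk"])
    return total > 0 and usable / total >= min_usable_fraction
```

Because every node is interchangeable in this sketch, a node in the `down` bucket can wait for the next working day as long as `farm_is_available` still holds.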
Evolution in Hardware Requirements • Early acquisitions emphasized CPU power over local storage capacity • Increasing affordability of local disk storage has changed this philosophy • Hardware chosen by optimal combination of CPU power, storage capacity, server density and price • Buy from high-quality vendors to avoid labor-intensive maintenance issues
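One way to picture "optimal combination of CPU power, storage capacity, server density and price" is a simple benefit-per-cost ranking. The sketch below is only an illustration under stated assumptions: the weights, the per-rack-unit cost term and the `ServerOption` fields are hypothetical and do not describe the procurement procedure actually used at BNL.

```python
from dataclasses import dataclass


@dataclass
class ServerOption:
    name: str
    cpu_score: float  # hypothetical relative benchmark score
    disk_tb: float    # local disk capacity
    rack_units: int   # physical size, a proxy for server density
    price: float      # purchase price per server


def value_per_cost(opt, cpu_weight=1.0, disk_weight=1.0, rack_unit_cost=0.0):
    """Weighted benefit divided by purchase price plus a per-rack-unit space cost."""
    benefit = cpu_weight * opt.cpu_score + disk_weight * opt.disk_tb
    cost = opt.price + rack_unit_cost * opt.rack_units
    return benefit / cost


def rank(options, **weights):
    """Return the options ordered from best to worst value per cost."""
    return sorted(options, key=lambda o: value_per_cost(o, **weights), reverse=True)
```

Raising `disk_weight` models the shift described on the slide: as local storage becomes more valuable, a disk-heavy server can overtake a CPU-heavy one at the same price, while a non-zero `rack_unit_cost` penalizes low-density hardware.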
The Factors Enforcing Evolution in Software Packages • Cost • Farm size / scalability • Security • External influences / wide acceptance
Cost • Red Hat Linux → Scientific Linux • LSF → Condor
Farm Size / Scalability • Home-built batch system for data reconstruction → Condor-based batch system • Home-built monitoring system → Ganglia
Security • Started with NIS/telnet in the 1990s • Cyber-security threats prompted the installation of firewalls and gatekeepers and a migration to ssh; security standards are stricter than in the past • Ongoing change to Kerberos 5 and ongoing phase-out of NIS passwords • Testing GSI; currently limited support for GSI
Security Changes (cont.) • Authorization & authentication controlled by local site (NIS and Kerberos) • Migration to GSI requires a central CA and regional VOs for authentication; the local site performs final authentication before granting access • Accept certificates from multiple CAs? • Difficult transition from complete to partial control over security issues
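The layered decision described on this slide (trusted CA, then VO, then a final check by the local site) can be sketched as follows. This is a toy model, not GSI itself: it performs no cryptographic verification, and the `Credential` fields, the `authorize` function and the mapping from certificate subject to local account are hypothetical, loosely modeled on the idea of a site-maintained user mapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Credential:
    subject: str  # certificate subject (distinguished name)
    issuer: str   # certificate authority that signed the certificate
    vo: str       # virtual organization the user claims membership in


def authorize(cred, trusted_cas, supported_vos, local_accounts) -> Optional[str]:
    """Return the local account for this credential, or None if access is refused.

    Assumes the certificate signature was already verified elsewhere.
    """
    if cred.issuer not in trusted_cas:
        return None  # the site does not accept this CA
    if cred.vo not in supported_vos:
        return None  # the VO is not supported at this site
    # Final word stays with the local site: only subjects it has mapped get in.
    return local_accounts.get(cred.subject)
```

Accepting certificates from multiple CAs corresponds to enlarging `trusted_cas`; the last lookup is where the local site keeps its remaining, partial control.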
External Influences / Wide Acceptance • Ganglia – used by RHIC experiments to monitor the RCF and external farms in order to manage their job submission. • HRM / dCache – used by other labs • Condor – widely used by the ATLAS community
Summary • RCF/ACF is going through a transition from a local facility to a regional (global) facility, which brings many changes • A Linux Farm built with commodity hardware is increasingly affordable and reliable • Distributed storage is also increasingly affordable, but raises management software issues
Summary (cont.) • Inter-operability with remote sites (software and services) plays an increasingly important role in our software choices • The transition brings security and access issues • Migration will take longer and be more difficult than generally expected; change in hardware and software needs to be complemented by a change in management philosophy