HEPiX Trip Report: Jefferson Laboratory, 9-13 October 2006
Martin Bly – RAL Tier1
HEPSysMan – Cambridge, 23 October 2006
Introduction
• Site issues
• Subject talks
Sites: CERN
• Successfully negotiated new LCG-wide licences for Oracle
• All Physics databases now migrated to Oracle RAC hosting
• SLC4 for LHC start-up; SLC3 support ends October 2007
• Lemon Alarm System (LAS) replacing SURE
• Central CVS service running well
  • Looking at Subversion
• First Opteron systems in CERN CC
• Insecure mail protocols forbidden/blocked
  • POP/IMAP etc. must use SSL (see the sketch below)
• No compromise on performance of disk servers to get 'fat' systems
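As an illustration of what an SSL-only mail policy means for clients, here is a minimal sketch using Python's standard imaplib/poplib; the hostname is a placeholder, not any real CERN gateway.

```python
import imaplib
import poplib

MAIL_HOST = "mail.example.org"  # placeholder, not the real CERN gateway

# SSL-wrapped connections on the standard secure ports are what a
# policy like CERN's requires...
imap = imaplib.IMAP4_SSL(MAIL_HOST, 993)
pop = poplib.POP3_SSL(MAIL_HOST, 995)

# ...while a plaintext connection on the legacy port should simply be
# refused at the network level.
try:
    imaplib.IMAP4(MAIL_HOST, 143)
except OSError:
    print("plaintext IMAP blocked, as expected")
```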
Sites: FermiLab
• Multiple 10Gb/s connections to Starlight
• Efforts to automate computer security
  • Replacing home-grown tools with commercial utilities
• New computer rooms
  • Overhead power and networking
  • Plastic curtains to trap cold air in front of machines
• US-CMS
  • 700TB dCache space, expected to be 2.5PB by autumn 2007
  • 700-node cluster expanding to 1600 nodes
  • BlueArc NAS for online storage (expensive…)
Sites: GridKa
• Issues with recent Opteron procurement
  • MSI K1-1000D motherboards, AMD Opteron 270s
  • BIOS issues; BMC and NIC firmware updates needed
• Issues with water-cooled racks traced to leaks in the chillers
• NEC supplying 4500TB of storage
  • 28 storage controllers, RAID 6, 60 file servers
• Report on latest benchmarks
  • Woodcrest performs very well
Sites: NERSC (National Energy Research Scientific Computing Center, Berkeley)
• NERSC Global Filesystem (NGF) in production
  • 70TB of project file space (subject of a separate talk)
  • Aim to procure 'just storage'
• 10Gb/s internal/external networks; 10Gb/s 'jumbo' network
• Cray 'Hood' system
  • 19000+ CPUs, 70TB disk, 102 cabinets
• Nagios for monitoring, being extended to the Cray (see the plugin sketch below)
• Computer room full; need more power and space
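Extending Nagios to unusual hardware mostly means writing new check plugins, and the plugin contract is just an exit code plus one line of output. A minimal sketch follows; the metric and thresholds are invented for illustration, not NERSC's actual checks.

```python
#!/usr/bin/env python
"""Minimal Nagios check-plugin skeleton.

Nagios only cares about the exit code (0=OK, 1=WARNING, 2=CRITICAL,
3=UNKNOWN) and the first line printed to stdout.
"""
import sys

def check_free_disk_tb(free_tb, warn=10.0, crit=2.0):
    # Thresholds are illustrative only.
    if free_tb < crit:
        print("CRITICAL - only %.1f TB free" % free_tb)
        return 2
    if free_tb < warn:
        print("WARNING - %.1f TB free" % free_tb)
        return 1
    print("OK - %.1f TB free" % free_tb)
    return 0

if __name__ == "__main__":
    sys.exit(check_free_disk_tb(float(sys.argv[1])))
```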
Sites: INFN
• 10Gb/s link to GARR backbone
  • T2s now at 1Gb/s
• GPFS now robust enough to be adopted by many sites
  • Lustre also being tested by a few sites
• Testing iSCSI
  • Satisfactory but not completely satisfying
  • Looking at a new EMC device and home-grown solutions to try to resolve the issues
Sites: GSI Darmstadt
• Issues with large storage farm
  • 100 of 120 nodes failed to boot after a move to new racks
  • Had been OK for the previous 6 months in the old racks
  • Traced to vibration resonance between disk and CPU cooling fans
• Issues with cooling in racks
  • Keep cold and warm air flows separate
  • Blanking plates are important
Sites: SLAC
• SLAC now a US-Atlas site
  • Procurements to start soon
• Non-HEP experiment computing building up
  • Many old clusters being decommissioned to make space
• Plan for a 150/200-node InfiniBand cluster
  • Model checkpointing is a challenge
• Testing Lustre
• Need to move away from AFS (K4) token passing
  • SSH/K5 with GSSAPI to pass K5 tickets
• New wireless registration scheme so that users can be contacted should their machines cause problems
Sites: INFN-CNAF
• CPU capacity upgrade delayed while the cooling system is upgraded, following cooling issues during the summer
• Using Quattor/Lemon
  • CERN customisations sometimes a problem
• Staying with SLC3 (v3.0.8 supports Woodcrest)
  • Will move to SLC4 when EGEE does
Sites: LAL
• VMware still the preferred Linux-on-desktop solution
• Installed gLite 3 on SL4 without modification
• Using Quattor and Lemon, having removed the CERN specifics
Sites: General
• Moving to specifying computing capacity requirements for CPUs in performance terms
  • Needs 'common' benchmarking
  • Require vendors to do it (and prove it!)
• Corresponding interest in benchmarking and in how to do it so that it means something
• 10Gb/s links now very common
• Big Condor pools in use at some sites
• Waiting for Grid middleware to be ported to SL4
Scientific Linux Update
• UK top by downloads (no stats from mirrors)
• FTP repository moved from GFS to NAS
• New Plone version for the scientificlinux.org site
• SL 4.4 released October 2006 for i386 and x86_64
• SL 3.0.8 release candidate available soon
  • Now available…
• Bug-fix repositories for SL variants
  • bugfixNN, where NN is the version
• SL 3.0.8 should be the last of the 3 series
  • Support plan as previously published: until Autumn 2007
• Working on SL5 (installers etc.)
  • SL5 alphas to be based on TUV beta releases
Core Services/Infrastructure (1)
• Tale of FermiLab's run-in with SpamCop
  • SpamCop don't respond to any requests
  • Takes 24 hrs to 'fall off' the list
  • Remove bounce messages and verify local addresses (see the sketch after this list)
  • Trap obvious spam
  • Have alternative IP addresses for the email gateways
  • Proposal for a 'white list' of HEP sites…
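Two of those measures, verifying local addresses and avoiding bounce messages, amount to the same check done at SMTP time: reject unknown recipients during the dialogue so the gateway never generates a bounce to a forged sender (bounces to forged senders are what tend to land a gateway on blocklists). A hedged sketch of the idea; the address list and responses are illustrative, not FermiLab's actual setup.

```python
# Sketch: gateway-side recipient validation. Rejecting at RCPT time
# means the sending MTA handles the failure and we never emit a bounce.

VALID_LOCAL_ADDRESSES = {"alice@example.org", "bob@example.org"}  # illustrative

def smtp_rcpt_decision(recipient: str) -> str:
    """Return the SMTP response for a RCPT TO: command."""
    if recipient.lower() in VALID_LOCAL_ADDRESSES:
        return "250 OK"
    return "550 No such user"

print(smtp_rcpt_decision("alice@example.org"))  # 250 OK
print(smtp_rcpt_decision("eve@example.org"))    # 550 No such user
```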
Core Services/Infrastructure (2)
• Service Level Status service
  • CERN tool for displaying the status of services rather than of individual nodes
  • Status defined by managers in terms of dependencies and dependants, and of what the service availability levels mean
  • Covers services and meta-services
  • Displays Key Performance Indicators of service levels compared to targets (a dependency-propagation sketch follows)
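Deriving a service's displayed status from its dependencies can be as simple as propagating the worst state up the dependency tree. A minimal sketch of that idea; the service names, states and propagation rule are invented, not CERN's actual definitions.

```python
# Sketch: a service's effective status is the worst status among the
# service itself and everything it depends on.

SEVERITY = {"available": 0, "degraded": 1, "down": 2}

dependencies = {
    "batch": ["network", "shared-fs"],
    "shared-fs": ["network"],
    "network": [],
}
own_status = {"batch": "available", "shared-fs": "degraded", "network": "available"}

def effective_status(service):
    worst = own_status[service]
    for dep in dependencies[service]:
        dep_status = effective_status(dep)
        if SEVERITY[dep_status] > SEVERITY[worst]:
            worst = dep_status
    return worst

print(effective_status("batch"))  # degraded, via shared-fs
```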
Core Services/Infrastructure (3)
• RT used to manage installation workflow (SLAC)
• High-availability methods and experiences at GSI
• Scientific Linux Inventory Project (FermiLab)
  • Need to monitor the software inventory and hardware of each machine
Compute Clusters & Storage
• Hazards of Fast Tape Drives (JLab)
  • Is your memory buffer big enough to prevent the tape drive having to stop, rewind and take a run up to speed when more data becomes available to write? (See the worked sketch after this list.)
  • CERN reports 100MB/s using two-stage tape serving, with large (8GB) RAM on the level-1 caches
• NGF: NERSC's Global File System (NERSC)
• Benchmark Updates (CERN)
  • Spec.org results unreliable for HEP purposes: they don't match our conditions
  • Requires vendors to use a 'fixed' configuration of the SPEC2000 benchmark
  • HPL used to benchmark 'power' performance
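The buffer-sizing question reduces to simple arithmetic: the cache must absorb the gap between the incoming data rate and the drive's streaming rate for long enough that the drive never starves. A worked sketch using the 100MB/s and 8GB figures from the talk; the incoming rate is an invented example value.

```python
# How long can a drive keep streaming from cache if data arrives more
# slowly than the drive writes it out?

cache_gb = 8.0          # RAM cache size (from the CERN talk)
drive_mb_s = 100.0      # tape drive streaming rate (from the talk)
incoming_mb_s = 60.0    # illustrative network ingest rate (invented)

drain_mb_s = drive_mb_s - incoming_mb_s   # net rate the cache empties at
seconds = cache_gb * 1024 / drain_mb_s

print("cache sustains streaming for %.0f s" % seconds)  # ~205 s
# If the cache empties before enough new data arrives, the drive must
# stop, rewind and accelerate back up to speed - which is what kills
# throughput.
```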
Security
• No Bob Cowles, therefore no 'scare the pants off everyone' talk. But:
• The Stakkato Intrusion
  • The tale of the long-running intrusion at the Swedish National Supercomputer Centre, 2004-2005
• Network Security Monitoring
  • How it is done at Brookhaven National Lab, with Sguil
Grid Projects
• Issues and problems around Grid site management (+ discussions) – Ian Bird
• Measuring site availability: T1s poor
  • Instabilities in site availabilities observed
  • Strategies: improve sites, improve job direction
• SAM (Site Availability Monitor)
  • An expansion of SFT functionality
  • Sensors integrated with the submission framework, or standalone
  • Integrated tests done by test-job submission
• Analysis of job efficiencies (failure rates): reasons non-trivial
  • 'Good' sites change daily!
• Plan to use job wrappers to test from the submitting VO's view rather than the OPS VO's view (see the sketch after this list)
  • Gives a better view of system 'weather'
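The job-wrapper idea is straightforward: wrap the real payload so that environment checks and the exit status are recorded from the submitting VO's point of view. A hedged sketch; the specific checks and report format are invented for illustration, not SAM's actual ones.

```python
#!/usr/bin/env python
"""Sketch of a VO-side job wrapper: run sanity checks, run the payload,
report the outcome. Checks and output format are illustrative."""
import os
import subprocess
import sys
import time

def sanity_checks():
    results = {}
    results["scratch_writable"] = os.access(os.getcwd(), os.W_OK)
    results["proxy_present"] = "X509_USER_PROXY" in os.environ
    return results

def main(payload_argv):
    checks = sanity_checks()
    start = time.time()
    rc = subprocess.call(payload_argv)          # run the real job
    report = {"checks": checks, "exit_code": rc,
              "wallclock_s": round(time.time() - start, 1)}
    print("WRAPPER-REPORT", report)  # would be shipped to the monitoring DB
    return rc

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```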
IHEPCCC
• IHEPCCC discussing collaboration with HEPiX on areas of mutual interest, particularly benchmarking and global file systems
• RTAG format proposed
  • Short-term study groups, reporting to HEPiX/IHEPCCC
• Lots of interest in participating, particularly in benchmarking and in discussing whether SPEC2006 is appropriate
Next meetings
• Spring 2007:
  • April 23rd to 27th at DESY, Hamburg
  • Topics suggested include benchmarking, cluster file systems, VoIP and, in general, 'discussion topics' (as opposed to LCG workshops) likely to attract LCG Tier 2 sites
• Autumn/Fall 2007:
  • Possibly early November at either Berkeley or FermiLab, hopefully in the week preceding Supercomputing '07 in Reno
• Spring 2008:
  • CERN
References
• Abstracts and slides from HEPiX Fall 2006: https://indico.fnal.gov/conferenceDisplay.py?confId=384
• Alan Silverman's comprehensive trip report: https://www.hepix.org/mtg/fall_06_jlab/HEPiX%20_Lab_Trip_Report_silverman.pdf