
  1. HEPiX Report Helge Meinhard, Pedro Andrade, Giacomo Tenaglia / CERN-IT Technical Forum / Computing Seminar, 11 May 2012

  2. Outline • Meeting organisation; site reports; business continuity (Helge Meinhard) • IT infrastructure; computing (Pedro Andrade) • Security and networking; storage and file systems; grid, cloud and virtualisation (Giacomo Tenaglia) HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  3. HEPiX • Global organisation of service managers and support staff providing computing facilities for HEP • Covering infrastructure and all platforms of interest (Unix/Linux, Windows, Grid, …) • Aim: Present recent work and future plans, share experience, advise managers • Meetings ~ 2 / y (spring in Europe, autumn typically in North America) HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  4. HEPiX Spring 2012 (1) • Held 23 – 27 April at the Czech Academy of Sciences, Prague, Czech Republic • Tier 2 centre in LCG • Excellent local organisation • Milos Lokajicek and his team proved to be experienced conference organisers • Organised events included a reception, an excellent concert with subsequent ‘refreshments’, a guided tour leading to the banquet venue on the river, … • Prague: a town with a very rich history, impressive architecture, good beer, rich food, … • Sponsored by a number of companies HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  5. HEPiX Spring 2012 (2) • Format: pre-defined tracks with conveners and invited speakers per track • Extremely rich, interesting and (over-)packed agenda • Novelty: three weeks before the meeting, the agenda was full – abstracts needed to be rejected, accepted talk slots shortened • New track on business continuity, convener: Alan • Judging by the number of submitted abstracts, the most interesting topic was IT infrastructure (17 talks!); 10 storage, 7 Grid/cloud/virtualisation, 6 network and security, 5 computing, 4 business continuity, 3 miscellaneous… plus 22(!) site reports • Full details and slides: http://indico.cern.ch/conferenceDisplay.py?confId=160737 • Trip report by Alan Silverman available, too: http://cdsweb.cern.ch/record/1447137 HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  6. HEPiX Spring 2012 (3) • 97 registered participants, of which 12/13 from CERN • Andrade, Bonfillou, Cass, Gomez Blanco, Høimyr, Iribarren, Meinhard, Mendez Lorenzo, Mollon, Moscicki, Salter, (Silverman), Tenaglia • Many sites represented for the first (?) time: PIC, KISTI, U Michigan, … • Vendor representation: Western Digital, Dell, IBM, NetApp, EMC, SGI, Proact • Compare with Vancouver (autumn 2011): 98 participants, of which 10/11 from CERN; GSI (spring 2011): 84 participants, of which 14 from CERN HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  7. HEPiX Spring 2012 (4) • 74 talks, of which 22 from CERN • Compare with Vancouver: 55 talks, of which 15 from CERN • Compare with GSI: 54 talks, of which 13 from CERN • Next meetings: • Autumn 2012: Beijing (hosted by IHEP); 15 to 19 October • Spring 2013: Bologna (date to be decided) • Autumn 2013: Interest by U Michigan, Ann Arbor, US HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  8. HEPiX Spring 2012 (5) • HEPiX Board meeting: new European co-chair elected (HM) • Succession of Michel Jouvin, who stepped down following his election to become GDB chair • 4 candidates from 3 European institutes • Large number of participants and talks this time round • Interest for a North-American meeting in 18 months’ time • Interest by WLCG to use HEPiX as advisory body for site matters • HEPiX appears to be very healthy HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  9. Site reports (1): Hardware • CPU servers: same trends • 12...48-core dual-CPU servers, AMD Interlagos and Intel Westmere/Sandy Bridge mentioned equally frequently, 2...4 GB/core. Typical chassis: 2U Twin2, Supermicro preferred, some Dell around (need to shut down all four systems to work on one). HP C-class blades • Disk servers • Everybody hit more or less hard by the disk drive shortage • Still a number of problems in the interplay of RAID controllers with disk drives • External disk enclosures gaining popularity (bad support experience with an A-brand) • DDN used increasingly HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  10. Site reports (2): Hardware (cont’d) • Tapes • An increasing number of sites mentioned T10kC in production • LTO popular, many sites investigating (or moving to) LTO5 • HPC • Smaller IB clusters replaced by single AMD boxes with 64 cores • IB still popular, 10GE ramping up • GPUs ever more popular – Tesla (and Kepler) on top, some sites looking into MIC • Odds and ends • Networking: router picky, only accepting DA cables from the same manufacturer • Networking: a week of outage/instability at a major site • Issues with PXE and Mellanox IB/10GE cards HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  11. Site reports (3): Software • Storage • CVMFS becoming a standard service – only minor issues • Lustre mentioned often, smooth sailing. GSI with IB. IHEP considering HSM functionality. DESY migrating to IBM SONAS • PSI: AFS with object extensions / HDF5 file format • Enstore: small-file aggregation introduced • Gluster investigated in a number of places • Batch schedulers • Grid Engine rather popular; which one – Oracle, Univa, or one of several open-source forks? • GESS: Grid Engine Scientific Sites launched • Repeated reports of (scalability?) issues with PBS Pro / Torque-MAUI • Condor, SLURM on the rise, good experience HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  12. Site reports (4): Software (cont’d) • OS • Windows 7 being rolled out at many sites • Issue with home directories on NAS filer • MacOS support: Kerberos or AD? Home directories in AFS or DFS? • Virtualisation • Most sites experimenting with KVM • Some use of VMware (and complaints about the cost level…) • NERSC: CHOS • Clouds • OpenStack • OpenNebula • Mail/calendaring services • Exchange 2003 and/or Lotus to Exchange 2010 (FNAL: 3’000 accounts total) • Open-Xchange being considered • Web services: GSI moving to TYPO3 HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  13. Site reports (5): Infrastructure • Service management • Service-now introduced in some places • CC-IN2P3 chose OTRS (only considered open-source tools) • Infrastructure • Water-cooled racks ubiquitous • Both active and passive cooling – is PUE meaningful? • Chimney racks • Remote data centre extensions elsewhere • Configuration management • Puppet seems to be the clear winner • Chef, Quattor used as well • Declining interest in cfengine (3) • ManageEngine used for Windows • Monitoring • Some sites migrating from Nagios to Icinga, one site considering Zenoss HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  14. Site reports (6): Miscellaneous • Identity management discussed at DESY • Interest in our unified communication slide • Desktop support: FNAL concluded contract with an A-brand for service desk, deskside including printers, logistics, network cabling • Splunk mentioned a few times • SLAC moving documents into Sharepoint (rather than Invenio!) • Issue tracking: Redmine HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  15. Business continuity • CERN talks by Wayne and Patricia • FNAL: ITIL, multiple power feeds, multiple rooms; plans for power failures. Vital admin data copied to ANL (and vice-versa). Plans for major cut from Internet • RAL: Formal change process introduced, helped by ITIL. Results are coherent, acceptance among service providers rapidly increasing • AGLT2: 2 sites, CPU and disk in both, services only in 1 so far. Addressing this by virtualising services with VMware HEPiX report – Helge.Meinhard at cern.ch – 11-May-2012

  16. Pedro Andrade (IT-GT) IT Infrastructure TRACK 16

  17. Overview 17

  18. Overview • Total of 17 talks in 6 sessions (Mon to Fri), 6 of them on topics recently presented in ITTF • Distribution by topic • 5 talks on computer center facilities • 1 talk on operating system • 8 talks on infrastructure services • 2 talks on database services • 1 talk on service management • Distribution by organization • 11 talks from CERN: CF, DB, DI, GT, OIS, PES • 6 talks from others: CEA, Fermilab, RAL, U. Gent, U. Oxford 18

  19. Overview • 6 talks on topics recently presented in ITTF • CERN Infrastructure Projects Update (1 talk) • Many questions about new CERN remote site: computer center temperature, lowest bid, SLA and penalties, PUE, CERN staff on site, remote management outside CERN • The Agile Infrastructure Project (4 talks) • Focus on the tool chain approach • Most questions related to selected tools • DB on Demand (1 talk) • Most questions related to support and administration 19

  20. ARTEMIS (RAL) • Simple solution for computer center temperature monitoring • Problems with computer center temperature: no monitoring overview, effect of load/failures not understood, temperatures around racks unknown, location of hot spots unknown • Existing solutions are messy to modify and do not support multiple sensor types, so RAL deployed its own solution based on standard open-source tools • Raised the cold-aisle temperature from 15°C to 21°C, with an estimated reduction of 0.3 in PUE 20
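ARTEMIS itself is assembled from standard open-source tools; purely as an illustration of the polling idea (this is not RAL's actual implementation), a minimal sensor poller could look like the sketch below. The sensor URLs and the alarm threshold are hypothetical.

```python
import time
import urllib.request  # assumes networked sensors expose readings over HTTP; purely illustrative

# Hypothetical sensor map: name -> URL returning a plain-text temperature in Celsius
SENSORS = {
    "rack-A1-cold-aisle": "http://sensor-a1.example/temp",
    "rack-A1-hot-aisle":  "http://sensor-a1.example/temp2",
}
WARN_AT = 27.0  # example cold-aisle alarm threshold, not RAL's actual setting

def read_temp(url: str) -> float:
    """Fetch one temperature reading (sketch; a real deployment would add retries)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return float(resp.read().decode().strip())

def poll_once() -> None:
    for name, url in SENSORS.items():
        try:
            t = read_temp(url)
        except Exception as exc:
            print(f"{name}: read failed ({exc})")
            continue
        flag = "WARN" if t > WARN_AT else "ok"
        print(f"{name}: {t:.1f} C [{flag}]")  # a real system would feed a time-series store

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)  # one-minute polling interval (arbitrary choice)
```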

  21. ARTEMIS (RAL) 21

  22. HATS (CERN) • Evolution of hardware testing tools • Simple application aimed at certifying that newly purchased hardware is suitable for production use • Used as part of a recertification process on accepted deliveries where a major hardware update/change has been performed • Successor of the former burn-in test system • Operational overhead was too heavy • Confined software environment of a live OS image prevented detection of complex hardware errors 22

  23. HATS (CERN) Runs on a dedicated server. Tests wrapped in bash scripts. Communication via SSH. 23

  24. HATS (CERN) • Provides an operational environment • Remote power control, remote console • And in addition can also • Upgrade BIOS, BMC, RAID controller and drive firmware • Run performance measurements (HEPSPEC, FIO, etc.) • Execute any system administration task • Runs on any Linux distribution • Moving to production by summer 2012 • Much simpler to evaluate fully configured systems • Operational overhead is significantly reduced 24
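HATS itself was not shown as code, but the pattern described on slide 23 – bash test scripts driven over SSH from a dedicated server, with results collected centrally – can be sketched roughly as follows. Host names, script paths and the pass/fail policy are all illustrative assumptions, not the actual HATS implementation.

```python
import subprocess

# Hypothetical list of test scripts staged on the machine under test
TESTS = ["/opt/hats/tests/memtest.sh", "/opt/hats/tests/disk_stress.sh"]

def run_remote_test(host: str, script: str, timeout: int = 3600) -> bool:
    """Run one wrapped test script over SSH and report pass/fail via the exit status."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "bash", script],
        capture_output=True, text=True, timeout=timeout,
    )
    print(f"{host}:{script} -> rc={result.returncode}")
    if result.returncode != 0:
        print(result.stderr.strip())  # keep the failure output for the operator
    return result.returncode == 0

def certify(host: str) -> bool:
    """A node is certified only if every test passes (simplified policy)."""
    return all(run_remote_test(host, t) for t in TESTS)

if __name__ == "__main__":
    ok = certify("node123.example.org")  # hypothetical newly delivered node
    print("certified" if ok else "needs investigation")
```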

  25. Scientific Linux (Fermilab) Regular status update on Scientific Linux 25

  26. Scientific Linux (Fermilab) • SL 4.9 end of life February 2012 • ftp.scientificlinux.org 4.x trees will be moved to “obsolete” at the end of May 2012 • SL 5.8 released in April 2012 • Wireless drivers, Java, OpenAFS 1.4.14-80, lsb_release -a now reports the same as SL6, yum-autoupdate supports PRERUN and POSTRUN • SL 6.2 released in February 2012 • OpenAFS 1.6.0-97.z2, yum-autoupdate adds PRERUN and POSTRUN, livecd-tools, liveusb-creator 1.3.4, x86_64 Adobe repo added 26

  27. Scientific Linux (Fermilab) • SL future plans • Continue with security updates for all releases • Continue with fastbug updates only for the latest releases • New features for SL5 will stop in Q4 2012 • RH future plans • RHEL 6.3 beta released • RHEL 6.3 will probably be released within 2 months • Extension of lifetime from 7 to 10 years • RHEL 5 until 2017 • RHEL 6 until 2020 27

  28. EasyBuild (U. Gent) • Single presentation related to builds • Software installation framework • Automates building and installing new software versions • Based on specification files (.eb) describing a software package • Allows diverting from the default configure/make procedure (easyblocks) • Builds/installs binaries, creates module files • Key properties • Allows sharing and reusing the .eb files and easyblocks • Installs different versions of a program next to each other • Save .eb specification files under version control • Keep track of installation logs and build statistics 28
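For readers unfamiliar with the format: easyconfig (.eb) files are plain Python-syntax parameter files. The sketch below is a hypothetical minimal example – the package, URLs, toolchain and dependency are invented, and the exact parameter set of the 2012-era EasyBuild presented here may differ from current documentation.

```python
# Hypothetical minimal easyconfig (.eb) sketch for an imaginary package "foo" 1.2.3
name = 'foo'
version = '1.2.3'

homepage = 'https://example.org/foo'
description = "Example scientific library (illustrative only)"

# Toolchain = compiler/MPI/maths-library combination used for the build
toolchain = {'name': 'goolf', 'version': '1.4.10'}

sources = ['foo-1.2.3.tar.gz']
source_urls = ['https://example.org/downloads']

# Build dependencies are resolved first, as slide 29 describes
dependencies = [('zlib', '1.2.7')]

# Fall back to the generic configure / make / make install easyblock
easyblock = 'ConfigureMake'

moduleclass = 'lib'
```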

  29. EasyBuild (U. Gent) • What EasyBuild does • Generate installation path and create build directory • Resolve dependencies (build dependencies first) • Unpack source (optionally try to download it first) • Apply patches • Configure build, Build, Test • Install, Install extra packages • Generate module files • Clean up, verify installation (not there yet) • In use for more than 3 years • >250 supported software packages • Tested under SL5/SL6 29

  30. Other Talks • GridPP Computer Room Experiences (U. Oxford) • History, hot/cold aisles containment, temperature monitoring • Procurement Trends (CERN) • Disk server evolution, SSDs, 2011 overview, 2012 plans • Quattor Update (RAL) • History, requirements, Aquilon, dashboards, community • Monitoring (CEA) • Architecture, Nagios, check_mk, operations • CERN infrastructure services (CERN) • TWiki, VCS, engineering tools, licenses, JIRA, BOINC • Database Access Management (CERN) • Requirements, architecture, interfaces, implementation, execution • Service Management (CERN) • Why, how, service catalog, service desk, service portal 30

  31. Pedro Andrade (IT-GT) COMPUTING TRACK 31

  32. Overview 32

  33. Overview • 5 talks in 1 session (Wed) • Distribution by topic • 2 talks on CPU Benchmarking • 1 talk on Deduplication and Disk Performance • 2 talks on Oracle Grid Engine • Distribution by organization • No CERN talks • DESY, FZU, IN2P3, INFN, KIT 33

  34. CPU Benchmarking (INFN, KIT) • Regular status update on CPU benchmarking • AMD (Bulldozer) • Available in Q4/2011 • Core multithreading (2 cores per module) • Up to 16 cores per chip • Up to 2.6 GHz (+ ~30% turbo frequency) • Intel (Sandy Bridge) • Available in Q1/2012 • Up to 8 cores (16 hyperthreaded cores) per chip • Up to 3.1 GHz (+ ~30-40% turbo frequency) 34

  35. CPU Benchmarking (INFN, KIT) • Comparing with previous CPU generations • Performance per box: better for both Intel and AMD • Performance per core: better for Intel, same for AMD 35

  36. CPU Benchmarking (INFN, KIT) • 64-bit • Better results than what is measured by 32-bit HSPEC06 • Better performance: 25% on AMD and 15-20% on Intel • SL5 and SL6 • Boost when migrating from SL5 to SL6 • Performance boost: 30% on AMD and 5-15% on Intel • Power efficiency • Efficiency improvements measured on Intel • Approximately 1 W per HSPEC06 • Pricing • Surprising pricing of new Sandy Bridge chips (KIT) 36

  37. Deduplication & Disks (FZU) • Targeted performance tests • Deduplication • Big hype around data compression for storage • Performance is very data dependent • Ratios: all zeros 1:1045, ATLAS data set 1:1.07, VM backup 1:11.7 • Testing on your real data is a must • Disk performance • How many disks for 64-core worker nodes? • Sequential read/write using dd, ATLAS analysis jobs • IO scheduler settings have a big impact on IO performance • >8 drives does not help (3-4 drives should be OK) 37
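As a rough illustration of the sequential-throughput test mentioned above (FZU used dd plus real ATLAS analysis jobs; this sketch only mimics the dd part), the file path and sizes below are arbitrary. On Linux, the IO scheduler whose impact the slide highlights can typically be switched via /sys/block/&lt;device&gt;/queue/scheduler.

```python
import os
import time

# Minimal sketch of a dd-style sequential write/read throughput test (illustrative only)
BLOCK = 1024 * 1024        # 1 MiB blocks, comparable to dd bs=1M
COUNT = 1024               # 1 GiB total; use much more than RAM to limit cache effects
PATH = "/tmp/seqtest.bin"  # hypothetical test file on the device under test

def seq_write(path: str = PATH, block: int = BLOCK, count: int = COUNT) -> float:
    """Write block*count bytes sequentially and return MB/s."""
    buf = os.urandom(block)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the disk
    return block * count / (time.time() - start) / 1e6

def seq_read(path: str = PATH, block: int = BLOCK) -> float:
    """Read the file back sequentially and return MB/s."""
    total = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block)
            if not chunk:
                break
            total += len(chunk)
    return total / (time.time() - start) / 1e6

if __name__ == "__main__":
    print(f"write: {seq_write():.1f} MB/s")
    print(f"read:  {seq_read():.1f} MB/s")  # drop the page cache first for a fair read figure
```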

  38. Grid Engine (DESY, IN2P3) • Regular status update on Grid Engine • Work at DESY to evaluate other GE solutions • Univa Grid Engine: doesn't look active • Open Grid Scheduler: SoGE patches, more conservative • Son of Grid Engine: regularly updated, more liberal, seems to work with grid middleware • Work at IN2P3 to move to Oracle GE • Moved from BQS to Oracle GE • Support: good reactivity but a lack of efficiency • FTEs decreased from 3 to 1.5 • First F2F meeting to try to initiate a GE community 38

  39. Final Considerations • Learn from the experience/tests of others • CPU benchmarking, SL status, deduplication tests, etc. • Get feedback and check the impact of our choices • Understand if we are on the right track • Try to understand patterns and trends • All sites improving their facilities and reducing costs • Virtualization has been taken up by everyone • A lot of private tools, but many references to tools on the market • ITIL seen as important but not really adopted 39

  40. Security and Networking HEPiX report – Giacomo Tenaglia – 11-May-2012

  41. Security and Networking IPv6: important for US institutions! http://www.cio.gov/Documents/IPv6MemoFINAL.pdf • All public-facing services to be IPv6-ready by the end of 2012, all internal client applications to use native IPv6 by the end of 2014 • Fermilab optimistic about 2012 (and almost there), 2014 much more difficult to achieve HEPiX report – Giacomo Tenaglia – 11-May-2012

  42. FNAL IPv6 Planning: Strategic view • What you see shouldn’t sink your ship… • What you don’t see might… IPv6 at Fermilab

  43. Security and Networking HEPiX IPv6 WG “reality check” • Use FTS entirely over IPv6 • Implies Oracle (11g is IPv6-friendly) -> commercial dependencies dictate their timelines • Had to relink, patch Globus FTP client -> “works with IPv6” doesn’t imply it’s enabled by default • Maybe good to have “test days” on LS HEPiX report – Giacomo Tenaglia – 11-May-2012
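On the '"works with IPv6" doesn't imply it's enabled by default' point: one quick sanity check is to force a TCP connection restricted to the IPv6 addresses of a service and see whether anything answers. The sketch below is illustrative; the host name is a placeholder and the port is just the usual GridFTP control port.

```python
import socket

def reachable_over_ipv6(host: str, port: int, timeout: float = 5.0) -> bool:
    """Try a TCP connection restricted to IPv6 addresses of the given host."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False                # no AAAA record at all
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True         # something is listening on IPv6
        except OSError:
            continue                # try the next IPv6 address, if any
    return False

if __name__ == "__main__":
    # Placeholder endpoint; point this at the service under test
    print(reachable_over_ipv6("gridftp.example.org", 2811))
```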

  44. Security and Networking Federated IdM for HEP: • Drafted documents • WLCG Security TEG just started on IdF • Looking for a good pilot project HEPiX report – Giacomo Tenaglia – 11-May-2012

  45. Security and Networking Security takeaways: • Mac users are less security-aware • Big intrusions are opportunities to clean up • Poor system administration is still a major problem • Log, log, log, log, correlate, have the logs reviewed by humans HEPiX report – Giacomo Tenaglia – 11-May-2012

  46. Storage and Filesystems HEPiX report – Giacomo Tenaglia – 11-May-2012

  47. Storage and Filesystems RAID5 is dead: • …and RAID6 (8+2) looks bad for >10 PB • Erasure Coding • break data into m fragments, encoded and stored as n fragments (n>m) -> • Redundant Array of Inexpensive Nodes • RAID across nodes instead of across disk arrays • Vendors already offering EC+RAIN • EOS leads by example HEPiX report – Giacomo Tenaglia – 11-May-2012
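To make the erasure-coding bullet concrete: the object is split into m data fragments and stored as n > m fragments, so that a subset of the stored fragments suffices to reconstruct it. The toy sketch below uses the simplest possible code – a single XOR parity fragment (n = m + 1), which survives the loss of any one fragment; production systems such as those mentioned above use Reed-Solomon-style codes tolerating several losses.

```python
from __future__ import annotations
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, m: int) -> list[bytes]:
    """Split data into m equal-size fragments and add one XOR parity fragment (n = m + 1)."""
    frag_len = -(-len(data) // m)                 # ceiling division
    padded = data.ljust(m * frag_len, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(m)]
    return frags + [reduce(xor_bytes, frags)]     # n fragments, to be spread over nodes

def recover(frags: list[bytes | None]) -> list[bytes]:
    """Rebuild at most one missing fragment from the XOR of the surviving ones."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot repair more than one lost fragment")
    if missing:
        frags[missing[0]] = reduce(xor_bytes, [f for f in frags if f is not None])
    return frags

if __name__ == "__main__":
    stored = encode(b"hello erasure coding", m=4)  # 4 data fragments + 1 parity
    stored[2] = None                               # simulate losing one node
    repaired = recover(stored)
    print(b"".join(repaired[:4]).rstrip(b"\0"))    # original object reconstructed
```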

  48. Storage and Filesystems Tapes at CERN: • Hypervisors backing up VM images are starting to become a problem (57 TB/day) • Sometimes a fault-tolerant architecture creates more problems than it solves! • Simplified setup aligning with AFS AFS at CERN: • Need to be aware of other sites’ upgrade plans to mitigate the impact of incidents • We can get 10 GB for $HOME \o/ HEPiX report – Giacomo Tenaglia – 11-May-2012

  49. Storage and Filesystems IB-based Lustre cluster at GSI: • Deployed on “minicube” (allows “3D” IB connections) • Everything over IB (even booting!) • Chef for fabric management • 2400 disks, 50 file servers • Very happy with the results HEPiX report – Giacomo Tenaglia – 11-May-2012

  50. HEPiX report – Giacomo Tenaglia – 11-May-2012
