IHEPCCC Meeting
Wolfgang von Rüden, IT Department Head, CERN
22 September 2006
CERN Site Report, based on the input from many IT colleagues, with additional information in hidden slides
General Infrastructure and Networking
Computer Security (IT/DI)
• Incident analysis
  • 14 compromised computers on average per month in 2006
  • Mainly due to user actions on Windows PCs, e.g. trojan code installed
  • Detected by security tools monitoring connections to IRC/botnets
  • Some Linux systems were compromised by knowledgeable attacker(s)
  • Motivation appears to be money earned from controlled computers
• Security improvements in progress
  • Strengthened computer account policies and procedures
  • Ports closed in the CERN main firewall (http://cern.ch/security/firewall)
  • Controls networks separated and stronger security policies applied
  • Logging and traceability extended to better identify the cause of incidents
  • Investigation of intrusion detection at 10 Gbps based on netflow data (see the sketch below)
• What is the policy of other labs concerning high-numbered ports?
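As a hedged illustration of the netflow-based detection mentioned above, the sketch below flags hosts that repeatedly open connections to well-known IRC ports, a common sign of botnet command-and-control traffic. The flow-record format, the watched port list and the alert threshold are assumptions for illustration, not CERN's actual tooling.

```python
# Illustrative sketch only: flag hosts with many outbound connections to
# IRC-style ports. Flow-record format, port list and threshold are assumed.
from collections import Counter

SUSPECT_PORTS = {6666, 6667, 6668, 6669, 7000}   # classic IRC ports (assumed watch list)
ALERT_THRESHOLD = 20                              # flows per host before raising an alert

def suspicious_hosts(flow_records):
    """flow_records: iterable of dicts with 'src_ip' and 'dst_port' keys."""
    hits = Counter()
    for flow in flow_records:
        if flow["dst_port"] in SUSPECT_PORTS:
            hits[flow["src_ip"]] += 1
    return [ip for ip, count in hits.items() if count >= ALERT_THRESHOLD]

# Tiny synthetic example:
sample = [{"src_ip": "10.0.0.5", "dst_port": 6667}] * 25
print(suspicious_hosts(sample))   # -> ['10.0.0.5']
```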
Timeline for Security Incidents, May 2000 – August 2006
Computing and Network Infrastructure for Controls: CNIC (IT/CO & IT/CS)
• Problem
  • Control systems are now based on TCP/IP and commercial PCs and devices
  • PLCs and other controls equipment cannot currently be secured
  • Consequence: control systems are vulnerable to viruses and hacking attacks
  • Risk: downtime or physical damage to accelerators and experiments
• Constraints
  • Access to control systems by off-site experts is essential
  • Production systems can only be patched during maintenance periods
• Actions taken: set up the CNIC Working Group to
  • Establish multiple separate Campus and Controls network domains
  • Define rules and mechanisms for inter-domain & off-site communications (see the sketch below)
  • Define policies for access to and use of Controls networks
  • Designate persons responsible for controls networks & connected equipment
  • Define & build suitable management tools for Windows, Linux and networks
  • Test the security of COTS devices and request corrections from suppliers
  • Collaborate with organisations & users working on better controls security
• Ref: A 'defence-in-depth' strategy to protect CERN's control systems, http://cnlart.web.cern.ch/cnlart/2006/001/15
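A minimal, hypothetical sketch of how inter-domain communication rules of the kind CNIC defines might be expressed and checked. The domain names, rule table, default-deny stance and helper function are assumptions for illustration only, not CERN's actual configuration.

```python
# Sketch of an inter-domain connection policy check. All rules below are
# illustrative assumptions, not CERN's real CNIC configuration.

# (src_domain, dst_domain) -> set of allowed destination ports.
# None means any port within the listed pair; unlisted pairs are denied.
ALLOWED = {
    ("campus", "controls"): {3389},          # e.g. remote-expert access via a gateway (assumed)
    ("controls", "campus"): {80, 443},       # e.g. monitoring data published outwards (assumed)
    ("controls", "controls"): None,          # unrestricted within the controls domain
}

def connection_allowed(src_domain, dst_domain, dst_port):
    """Return True if a connection is permitted under the (assumed) rule table."""
    ports = ALLOWED.get((src_domain, dst_domain))
    if ports is None and (src_domain, dst_domain) in ALLOWED:
        return True                           # explicitly unrestricted pair
    return ports is not None and dst_port in ports

print(connection_allowed("campus", "controls", 3389))   # True
print(connection_allowed("offsite", "controls", 22))    # False (default deny)
```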
Networking Status (IT/CS)
• Internal CERN network infrastructure progressing on time
  • New campus backbone upgraded
  • Farm router infrastructure in place
  • New infrastructure for external connectivity in place
  • Upgrade of the CERN internal infrastructure (starpoints) in progress to provide better desktop connectivity
  • Management tools to improve security control have been developed and put into production (control of connections between the Campus and Technical networks); this is part of the CNIC project
  • New firewall infrastructure being developed to improve aggregate bandwidth and integrate into a common management scheme
  • Large parts of the experimental areas and pits are now cabled
  • The DANTE PoP for GÉANT2 was installed at CERN towards the end of 2005
• Current work items
  • Improved wireless network capabilities being studied for the CERN site
LHCOPN Status
• LHCOPN links coming online
  • Final circuits to 8 Tier-1s in place
  • Remaining 3 due before the end of the year
• LHCOPN management
  • Operations and monitoring responsibilities shared between EGEE (layer 3) and DANTE (layers 1/2)
  • Transatlantic link contracts passed to USLHCNet (Caltech) to aid DoE transparency
  • 3 links to be commissioned this year: Geneva-Chicago, Geneva-New York and Amsterdam-New York
• Current work items
  • Improve management of multiple services across the transatlantic links using VCAT/LCAS technology; being studied by the USLHCNet group
  • Investigate the use of cross-border fibre for path redundancy in the OPN (see the sketch below)
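The path-redundancy question in the last bullet can be illustrated with a toy check: given a set of OPN circuits, verify that each Tier-1 still reaches CERN if any single circuit fails. The topology, site names and helper function below are placeholders for illustration, not the real OPN; a cross-border fibre would appear as an extra Tier-1 to Tier-1 link.

```python
# Toy single-failure check: does every Tier-1 still reach CERN if one circuit
# fails? The link list is invented; "T1-A" to "T1-B" plays the role of a
# cross-border fibre, while "T1-C" is deliberately single-homed.
links = {("CERN", "T1-A"), ("CERN", "T1-B"), ("T1-A", "T1-B"), ("CERN", "T1-C")}

def reachable(start, target, edges):
    """Simple depth-first search over an undirected edge set."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for a, b in edges:
            nxt = b if a == node else a if b == node else None
            if nxt and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

for failed in links:
    surviving = links - {failed}
    for tier1 in ("T1-A", "T1-B", "T1-C"):
        if not reachable(tier1, "CERN", surviving):
            print(f"{tier1} loses CERN connectivity if {failed} fails")
```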
LHCOPN L2 Circuits [map]: cross-border fibre; link capacities 3x10G, 2x10G, 1x10G and <10G; bandwidth managed
Scientific Linux @ CERN (IT/FIO)
• See https://www.scientificlinux.org/distributions/roadmap
• Many thanks to FNAL for all their hard work
• Support for SL3 ends 31 October 2007
• SLC4
  • CERN-specific version of SL4, binary compatible for the end user
  • Adds AFS, tape hardware support, ... required at CERN
  • Certified for general use at CERN at the end of March
  • Interactive and batch services available since June
  • New CPU servers commissioned in October (1 MSI2K) will all be installed with SLC4
  • Switch of the default from SLC3 to SLC4 foreseen (hoped!) for end October/November; depends on the availability of EGEE middleware
  • Will almost certainly be 32-bit: too much software is not yet 64-bit compatible
• SLC5 is low priority
  • Could arrive 1Q07 at the earliest, and there is no desire to switch OS just before LHC startup
  • But need to start planning for 2008 soon
Internet Services (IT/IS)
• Possible subjects for HEP-wide coordination
  • E-mail coordination for anti-spam, attachments, digital signatures, secure e-mail, and common policies for visitors
  • Single sign-on and integration with Grid certificates
  • Managing vulnerabilities in desktop operating systems and applications; policies concerning "root" and "Administrator" rights on desktop computers; antivirus and anti-spyware policies
  • Common policies for web hosting; role of CERN as a "catch-all" web hosting service for small HEP labs, conferences and activities distributed across multiple organisations
  • Desktop instant messaging and IP telephony? Protocols, integration with e-mail, presence information?
Conference and AV Support (IT/UDS)
• Video conferencing services
  • HERMES H.323 MCU: joint project with IN2P3 (host), CNRS and INSERM
  • VRVS preparing EVO rollout
  • Seamless audio/video conference integration through SIP (beta test)
• SMAC: conference recording (Smart Multimedia Archive for Conferences)
  • Joint project with EIF (engineering school) and the University of Fribourg
  • Pilot in the main auditorium
• Video conference room refurbishment
  • Pilot rooms in B.40: standard (CMS), fully-featured (ATLAS)
  • 12 more requested before LHC turn-on
• Multimedia Archive Project
  • Digitisation: photo / audio / video
  • CDS storage and publication, e.g. http://cdsweb.cern.ch/?c=Audio+Archives
Indico & Invenio Directions (IT/UDS)
• Indico as the "single interface"
  • Agenda migration virtually complete
  • VRVS booking done; HERMES and eDial booking soon
  • CRBS: physical room booking under study
  • Invenio for Indico search
• CDS powered by Invenio
  • Released in collaboration with EPFL
  • Finishing major code refresh into Python
  • Flexible output formatting: XML, BibTeX
  • RSS feeds; Google Scholar interfacing
  • Available in 18 languages (contributions from around the globe)
  • Collaborative tools: baskets, reviewing, commenting, document "add-ons"
  • Citation extraction and linking (SLAC planning to collaborate)
  • Keywording (ontology with DESY)
Open Access
• Preprints: already wholly OA
  • Operational Circular 6 (rev. 2001) requires every CERN author to submit a copy of their scientific documents to the CERN Document Server (CDS)
  • Institutional archive & HEP subject archive
• Publications
  • Tripartite colloquium, December 2005: "OA Publishing in Particle Physics", with authors, publishers and funding agencies
  • Task force (report June 2006) to study and develop sustainable business models for particle physics publishing
  • Conclusion: a significant fraction of particle physics journals are ready for a rapid transition to OA under a consortium-funded sponsoring model
Oracle-related issues (IT/DES)
• Serious bug causing logical data corruption (wrong cursor sharing, a side effect of a new algorithm enabled by default in RDBMS 10.2.0.2)
  • LFC and VOMS affected
  • Problem reported 11 Aug; workaround in place 21 Aug (with a small negative side effect)
  • First pre-patch released 29 Aug; second pre-patch released 14 Sep; production patch expected any day now
• Support request escalated to the highest level
  • "In one of the most complex parts of the product"
  • Regular phone conferences with the Critical Account Manager
• What to learn
  • We feel we got good attention, but it still took time
  • Not always good to be on the latest release!
CERN openlab
• Concept
  • Partner/contributor sponsors the latest hardware, software and brainware (young researchers)
  • CERN provides experts, testing and validation in a Grid environment
  • Partners: 500'000 €/year for 3 years; contributors: 150'000 € for 1 year
• Current activities
  • Platform competence centre
  • Grid interoperability centre
  • Security activities
  • Joint events
WLCG Update
WLCG depends on two major science grid infrastructures:
• EGEE - Enabling Grids for E-Science
• OSG - US Open Science Grid
Grid progress this year
• Baseline services from the TDR are in operation
  • Agreement (after much discussion) on VO Boxes
• gLite 3
  • Basis for startup on the EGEE grid; introduced (just) on time for SC4
  • New Workload Management System now entering production
• Metrics
  • Accounting introduced for Tier-1s and CERN (CPU and storage)
  • Site availability measurement system introduced; reporting for Tier-1s & CERN from May
  • Job failure analysis
• Grid operations
  • All major LCG sites active
  • Daily monitoring and operations now mature (EGEE and OSG), taken in turn by 5 sites for EGEE
  • Evolution of the EGEE regional operations support structure
Data Distribution
• Pre-SC4 April tests, CERN to Tier-1s: the SC4 target of 1.6 GBytes/sec was reached, but only for one day
• Experiment-driven transfers (ATLAS and CMS) sustained 50% of the target (0.8 GBytes/sec) under much more realistic conditions
  • CMS transferred a steady 1 PByte/month between Tier-1s & Tier-2s during a 90-day period
  • ATLAS distributed 1.25 PBytes from CERN during a 6-week period
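As a quick sanity check, the sketch below restates the sustained volumes above as average transfer rates. The calendar assumptions (a 30-day month, a 42-day "6-week" period, decimal petabytes) are mine, not from the slide.

```python
# Back-of-the-envelope check of the sustained-transfer figures.
# Assumptions: 30-day month, 42-day period, 1 PB = 10**15 bytes.
PB = 10**15
SECONDS_PER_DAY = 86400

cms_rate = 1 * PB / (30 * SECONDS_PER_DAY)        # CMS: 1 PB/month between Tier-1s and Tier-2s
atlas_rate = 1.25 * PB / (42 * SECONDS_PER_DAY)   # ATLAS: 1.25 PB from CERN in ~6 weeks

print(f"CMS   sustained average: {cms_rate / 1e6:.0f} MB/s")    # ~386 MB/s
print(f"ATLAS sustained average: {atlas_rate / 1e6:.0f} MB/s")  # ~344 MB/s
```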
Site Availability, August 2006
• Two sites not yet integrated in the measurement framework
• SC4 target: 88% availability
• 10-site average: 74%; best 8 sites average: 85%
• Reliability (which excludes scheduled downtime) is ~1% higher
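A hedged sketch of the availability versus reliability distinction used above: availability counts all downtime against the site, while reliability removes scheduled downtime from consideration. The function names and sample downtime hours below are illustrative assumptions, not the real measurement framework.

```python
# Illustration of availability vs. reliability. Sample numbers are invented
# so that the output is close to the August 2006 10-site average.

def availability(total_hours, unscheduled_down, scheduled_down):
    """Fraction of all hours the site was up."""
    return (total_hours - unscheduled_down - scheduled_down) / total_hours

def reliability(total_hours, unscheduled_down, scheduled_down):
    """Like availability, but scheduled downtime is excluded from the denominator."""
    return (total_hours - scheduled_down - unscheduled_down) / (total_hours - scheduled_down)

hours_in_august = 31 * 24
print(f"availability: {availability(hours_in_august, 180, 12):.1%}")  # ~74%
print(f"reliability:  {reliability(hours_in_august, 180, 12):.1%}")   # ~1% higher
```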
Job Reliability Monitoring
• Ongoing work
  • A system to process and analyse job logs has been implemented for some of the major activities in ATLAS and CMS; errors are identified and their frequency reported to developers and the TCG (see the sketch below)
  • Expect to see results feeding through from development to products in a fairly short time
  • More impact expected when the new RB enters full production (the old RB is frozen)
• Daily report on the most important site problems
  • Allows the operations team to drill down from site, to computing elements, to worker nodes
  • In use by the end of August
  • Intention is to report long-term trends by site and VO
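A minimal sketch of the kind of job-log aggregation described above: count error types per site so that the most frequent failures can be reported to developers and site operators. The record format, field names and error strings are invented for illustration, not the actual ATLAS/CMS log schema.

```python
# Minimal job-log aggregation sketch: tally failure reasons per site.
# Record format and error names are invented placeholders.
from collections import Counter, defaultdict

def summarise_failures(job_records):
    """job_records: iterable of dicts with 'site', 'status' and 'error' keys."""
    per_site = defaultdict(Counter)
    for job in job_records:
        if job["status"] == "failed":
            per_site[job["site"]][job["error"]] += 1
    return per_site

jobs = [
    {"site": "SITE-A", "status": "failed", "error": "stage-out timeout"},
    {"site": "SITE-A", "status": "failed", "error": "stage-out timeout"},
    {"site": "SITE-B", "status": "done",   "error": None},
]
for site, errors in summarise_failures(jobs).items():
    print(site, errors.most_common(3))
```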
Commissioning Schedule, 2006-2008 [timeline]
• SC4 becomes the initial service when reliability and performance goals are met
• Introduce residual services: full FTS services; 3D; SRM v2.2; VOMS roles
• Continued testing of computing models and basic services
• Testing DAQ to Tier-0 (??) and integrating into the DAQ to Tier-0 to Tier-1 data flow
• Building up end-user analysis support
• Exercising the computing systems, ramping up job rates and data management performance, ...
• Initial service commissioning: increase performance, reliability and capacity to target levels; gain experience in monitoring and 24x7 operation, ...
• 01 Jul 07: service commissioned, with full 2007 capacity and performance for first physics
Challenges and Concerns
• Site reliability
  • Achieve MoU targets, with a more comprehensive set of tests
  • Tier-0, Tier-1 and (major) Tier-2 sites
  • Concerns about staffing levels at some sites
  • 24x7 operation needs to be planned and tested; it will be problematic at some sites, including CERN, during the first year when unexpected problems have to be resolved
• Tier-1s and Tier-2s learning exactly how they will be used
  • Mumbai workshop, Tier-2 workshops
  • Experiment computing model tests: storage, data distribution
  • Tier-1/Tier-2 interaction: test out data transfer services and network capability; build operational relationships
• Mass storage
  • Complex systems are difficult to configure
  • CASTOR 2 not yet fully mature
  • SRM v2.2 to be deployed, with storage classes and policies implemented by sites
  • 3D Oracle Phase 2: sites not yet active/staffed
Challenges and Concerns (continued)
• Experiment service operation
  • Manpower intensive
  • Interaction with Tier-1s and large Tier-2s
  • Need a sustained test load to verify site and experiment readiness
• Analysis on the Grid is very challenging
  • Overall growth in usage is very promising
  • CMS has the lead, with over 13k jobs/day submitted by ~100 users across ~75 sites (July 06)
  • Analysis users will continue to have an impact on, and uncover weaknesses in, services at all levels
  • Understanding the CERN Analysis Facility
• DAQ testing looks late
  • The Tier-0 needs time to react to any unexpected requirements and problems
Tier0 Update
CERN Fabric progress this year
• Tier-0 testing has progressed well
  • Artificial system tests, and ATLAS Tier-0 testing at full throughput
  • Comfortable that target data rates and throughput can be met, including with CASTOR 2
  • But DAQ systems are not yet integrated in these tests
• CERN Analysis Facility (CAF)
  • Testing of experiment approaches to the CAF started only in the past few months; includes PROOF evaluation by ALICE
  • Much still has to be understood
  • Essential to maintain Tier-0/CAF hardware flexibility during the early years
• CASTOR 2
  • Performance is largely understood
  • Stability and the ability to maintain a 24x365 service is now the main issue
CERN Tier0 Summary (IT/FIO)
• Infrastructure
  • A difficult year for cooling, but the (long delayed) upgrade to the air conditioning system is now complete
  • The upgrade to the electrical infrastructure should be complete in early 2007 with the installation of an additional 2.4 MW of UPS capacity
  • No spare UPS capacity for physics services until then; the additional UPS systems are required before we install the hardware foreseen for 2007
  • Looking now at a possible future computer centre, as the rise in power demand for computing systems seems inexorable; demand is likely to exceed the current 2.5 MW limit by 2009/10
  • Water-cooled racks as installed at the experiments seem to be more cost-effective than air cooling
• Procurement
  • We have evaluated tape robots from IBM and STK, and also their high-end tape drives, over the past 9 months
  • Re-use of media means high-end drives are more cost-effective over a 5-year period (see the toy cost model below)
  • Good performance seen from equipment from both vendors
  • CPU and disk server procurement continues with regular calls for tender
  • The long time between the start of the process and equipment delivery remains, but the process is well established
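The tape-procurement argument above (re-use of media favours high-end drives over five years) can be illustrated with a toy cost model. Every price, capacity, volume and the 50% re-use factor below is an invented placeholder, not a CERN figure; the sketch only shows the shape of the reasoning, that re-usable higher-density media can amortise a higher drive price.

```python
# Toy 5-year tape cost model. All numbers are invented placeholders.

def five_year_cost(drive_price, n_drives, media_price, media_capacity_tb,
                   total_volume_tb, media_reusable):
    media_needed = total_volume_tb / media_capacity_tb
    media_cost = media_needed * media_price
    if media_reusable:
        # Assumed: high-end drives can rewrite existing cartridges at higher
        # density, so only half the media has to be bought new.
        media_cost *= 0.5
    return n_drives * drive_price + media_cost

volume_tb = 20_000  # assumed volume to store over 5 years
print("high-end drives :", five_year_cost(30_000, 20, 120, 0.7, volume_tb, media_reusable=True))
print("mid-range drives:", five_year_cost(12_000, 20, 100, 0.4, volume_tb, media_reusable=False))
```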
CERN Tier0 Summary (continued)
• Readiness for LHC production
  • CASTOR 2 now seems on track
    • All LHC experiments fully migrated to CASTOR 2
    • Meeting testing milestones [images on the following slides]
    • Still some development required, but the effort is now focussed on the known problem areas rather than firefighting
  • Grid services now integrated with the other production services
    • The service dashboard at https://cern.ch/twiki/bin/view/LCG/WlcgScDash shows the readiness of services for production operation
    • Significant improvement in readiness over the past 9 months [see later for image]
    • Now a single daily meeting for all T0/T1 services
  • Still concerns over a possible requirement for 24x7 support by engineers
    • Many problems still cannot be debugged by on-call technicians
    • For data distribution, full problem resolution is likely to require contact with the remote site
Grid Service dashboard
The EGEE project
• Phase 1: 1 April 2004 to 31 March 2006
  • 71 partners in 27 countries (~32 M€ funding from the EU)
• Phase 2: 1 April 2006 to 31 March 2008
  • 91 partners in 32 countries (~37 M€ EU funding)
• Status
  • Large-scale, production-quality grid infrastructure in use by HEP and other sciences (~190 sites, 30,000 jobs/day)
  • gLite 3.0 Grid middleware deployed
• EGEE provides essential support to the LCG project
EU projects related to EGEE
Sustainability: Beyond EGEE-II
• Need to prepare for a permanent Grid infrastructure
  • Production usage of the grid infrastructure requires long-term planning
  • Ensure reliable and adaptive support for all sciences, independent of short project cycles
  • Modelled on the success of GÉANT
  • Infrastructure managed in collaboration with national grid initiatives
EGEE’06 Conference
• EGEE’06: Capitalising on e-infrastructures
  • Keynotes on the state of the art and real-world use
  • Dedicated business track
  • Demos and a business/industry exhibition
  • Involvement of the international community
• 25-29 September 2006, Geneva, Switzerland, organised by CERN
• http://www.eu-egee.org/egee06