INFN-T1 site report Andrea Chierici On behalf of INFN-T1 staff HEPiX Spring 2014
Outline • Common services • Network • Farming • Storage
Cooling problem in March • A failure in the cooling system forced us to switch the whole center off • Naturally, the problem happened on a Sunday at 1 am • It took almost a week to recover completely and bring the center 100% back on-line • LHC experiment services, however, were reopened after 36 hours • We learned a lot from this (see separate presentation)
New dashboard
Example: Facility
Installation and configuration • CNAF is seriously evaluating a move to Puppet + Foreman as the common installation and configuration infrastructure • INFN-T1 has historically been a Quattor supporter • New manpower, a wider user base and new activities are pushing us to change • Quattor will stay around as long as needed • at least 1 year, to allow the migration of some critical services
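As a flavour of what the new stack offers, here is a minimal sketch that queries the Foreman REST API for the hosts it manages and their Puppet environment; the Foreman URL, credentials and CA path are hypothetical placeholders, not our production setup.

```python
# Minimal sketch: list hosts and their Puppet environment via the Foreman
# REST API (v2). Hostname, credentials and CA path are placeholders.
import requests

FOREMAN_URL = "https://foreman.example.org/api/v2/hosts"

def list_hosts(user, password, per_page=100):
    """Return (name, environment) pairs for the hosts known to Foreman."""
    resp = requests.get(
        FOREMAN_URL,
        auth=(user, password),
        params={"per_page": per_page},
        verify="/etc/pki/tls/certs/ca-bundle.crt",  # site CA bundle (assumption)
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    return [(h["name"], h.get("environment_name", "unknown"))
            for h in resp.json()["results"]]

if __name__ == "__main__":
    for name, env in list_hosts("admin", "secret"):
        print("%-40s %s" % (name, env))
```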
Heartbleed • No evidence of compromised nodes • Updated SSL and certificates on bastion hosts and critical services (grid nodes, Indico, wiki) • Some hosts were not exposed, due to the older OpenSSL version installed
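A minimal sketch of the kind of quick check run on the farm, using only the Python standard library; note that Scientific Linux backports security fixes without bumping the version string, so a match here is only a hint, not proof of exposure.

```python
# Quick Heartbleed hint: the vulnerable upstream releases are OpenSSL
# 1.0.1 through 1.0.1f. Distribution builds may be patched without a
# version bump, so treat a match only as a reason to check the changelog.
import ssl

def heartbleed_suspect():
    major, minor, fix, patch, _status = ssl.OPENSSL_VERSION_INFO
    return (major, minor, fix) == (1, 0, 1) and patch <= 6  # 'f' == patch 6

if __name__ == "__main__":
    verdict = "suspect" if heartbleed_suspect() else "not in vulnerable range"
    print("%s -> %s" % (ssl.OPENSSL_VERSION, verdict))
```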
Grid Middleware status • EMI-3 update status • All core services updated • All WNs updated • Some legacy services (mainly UIs) are still at EMI-1/2 and will be phased out as soon as possible
WAN Connectivity • 40 Gb/s physical link (4x10 Gb) shared between LHCOPN and LHCONE • LHCOPN/LHCONE peers include RAL, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, IN2P3 and SARA • 10 Gb/s general IP connectivity via GARR (Bo1) • 10 Gb/s dedicated CNAF-FNAL link for CDF data preservation • (the slide shows the corresponding network diagram: Cisco 7600 and Nexus routers in front of the T1 resources)
Current connection model • (slide diagram: Cisco 7600, BD8810 and Nexus 7018 at the core; LHCOPN/ONE uplink at 4x10 Gb/s, general Internet at 10 Gb/s; disk servers at 2x10 Gb/s and up to 4x10 Gb/s; old 2009-2010 resources and farming switches uplinked at 4x1 Gb/s, 20 worker nodes per switch) • Core switches and routers are fully redundant (power, CPU, fabrics) • Every switch is connected with load sharing across different port modules • Core switches and routers are covered by a strict maintenance SLA (next calendar day)
Computing resources • 150K HS06 • Reduced compared to the last workshop • Old nodes (2008 and 2009 tenders) have been phased out • Whole farm running on SL6 • A few VOs that still require SL5 are supported via WNoDeS
New CPU tender • 2014 tender delayed • Funding issues • We were running on over-pledged resources • Trying to take TCO (energy consumption) into account, not only the purchase price (see the sketch below) • Support will cover 4 years • Trying to open the tender to as many bidders as possible • Last tender attracted only 2 bidders • “Relaxed” support constraints • We would like an easy way to share specs, experiences and hints about other sites' procurements
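A minimal sketch of the kind of TCO comparison behind the tender evaluation; the prices, power draw, energy cost and PUE below are purely illustrative placeholders, not the actual tender figures.

```python
# Illustrative TCO over the 4-year support period: purchase price plus
# electricity (with cooling overhead via PUE). All numbers are made up.

def tco(purchase_eur, power_w, years=4, eur_per_kwh=0.15, pue=1.5):
    """Total cost of ownership for one node over the support period."""
    hours = years * 365 * 24
    energy_kwh = power_w / 1000.0 * hours * pue
    return purchase_eur + energy_kwh * eur_per_kwh

# Two hypothetical bids: B is pricier to buy but draws less power.
bid_a = tco(purchase_eur=4000, power_w=350)
bid_b = tco(purchase_eur=4400, power_w=280)
print("bid A: %.0f EUR   bid B: %.0f EUR" % (bid_a, bid_b))
```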
Monitoring & Accounting (1)
Monitoring & Accounting (2)
New activities (from last workshop) • Did not migrate to Grid Engine, we are sticking with LSF • Mainly an INFN-wide decision • Manpower considerations • Testing Zabbix as a platform for monitoring computing resources: more time required (see the sketch below) • Evaluating APEL as an alternative to DGAS for grid accounting: not done yet
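A minimal sketch of the kind of check we script while evaluating Zabbix, using its JSON-RPC API to list the monitored hosts; the URL, credentials and parameter names are assumptions tied to the Zabbix versions of the time, not our actual setup.

```python
# Minimal sketch: query the Zabbix JSON-RPC API for monitored hosts.
# URL and credentials are placeholders.
import json
import requests

ZABBIX_URL = "https://zabbix.example.org/api_jsonrpc.php"

def zabbix_call(method, params, auth=None, req_id=1):
    payload = {"jsonrpc": "2.0", "method": method,
               "params": params, "auth": auth, "id": req_id}
    r = requests.post(ZABBIX_URL, data=json.dumps(payload),
                      headers={"Content-Type": "application/json-rpc"})
    r.raise_for_status()
    return r.json()["result"]

token = zabbix_call("user.login", {"user": "monitor", "password": "secret"})
hosts = zabbix_call("host.get", {"output": ["host", "status"]}, auth=token)
for h in hosts:
    print(h["host"], "enabled" if h["status"] == "0" else "disabled")
```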
New activities • Configure oVirt cluster to manage service VMs: done • Standard libvirt mini-cluster for backup, with GPFS shared storage • Upgrade LSF to v9 • Setup of a new HPC cluster (Nvidia GPUs + Intel MIC) • Multicore task force • Implement a log analysis system (Logstash, Kibana), see the sketch below • Move some core grid services to the OpenStack infrastructure (the first will be the site-BDII) • Evaluation of Avoton CPU (see separate presentation) • Add more VOs to WNoDeS
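A minimal sketch of the log-analysis idea: parse a syslog-style line and index it into Elasticsearch, roughly what a Logstash grok + elasticsearch output pipeline does. The host, index name, log format and client version (which still required a document type) are assumptions, not the production setup.

```python
# Minimal sketch: ship one parsed syslog line into Elasticsearch.
import re
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # local test node (assumption)

LOG_RE = re.compile(r"^(?P<ts>\S+ \S+ \S+) (?P<host>\S+) (?P<prog>[^:]+): (?P<msg>.*)$")

def ship(line):
    m = LOG_RE.match(line)
    if not m:
        return
    doc = m.groupdict()
    doc["@timestamp"] = datetime.utcnow().isoformat()
    es.index(index="logstash-%s" % datetime.utcnow().strftime("%Y.%m.%d"),
             doc_type="syslog",  # required by the client versions of the time
             body=doc)

ship("Apr 28 10:15:02 wn-01 sshd[1234]: Accepted publickey for user")
```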
Storage Resources • Disk space: 15 PB-N (net) on-line • 4 EMC2 CX3-80 + 1 EMC2 CX4-960 (~1.4 PB) + 80 servers (2x1 Gb/s connections each) • 7 DDN S2A 9950 + 1 DDN SFA 10K + 1 DDN SFA 12K (~13.5 PB) + ~90 servers (10 Gb/s) • Upgrade of the latest system (DDN SFA 12K) was completed in 1Q 2014; aggregate bandwidth: 70 GB/s • Tape library SL8500: ~16 PB on-line, with 20 T10KB drives, 13 T10KC drives and 2 T10KD drives • 7500 x 1 TB tapes, ~100 MB/s bandwidth per drive • 2000 x 5 TB tapes, ~200 MB/s bandwidth per drive • The 2000 tapes can be “re-used” with the T10KD technology at 8.5 TB per tape • Drives are interconnected to the library and servers via a dedicated SAN (TAN); 13 Tivoli Storage Manager HSM nodes access the shared drives • 1 Tivoli Storage Manager (TSM) server common to all GEMSS instances • A tender for an additional 3000 x 5 TB/8.5 TB tapes for 2014-2017 is ongoing • All storage systems and disk servers are on SAN (4 Gb/s or 8 Gb/s)
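A back-of-the-envelope check of the nominal tape capacity implied by the media counts above (simple arithmetic on the numbers quoted on this slide; decimal TB/PB units assumed):

```python
# Nominal capacity of the tape media listed above (decimal units assumed).
t10kb_tb = 7500 * 1.0      # 7500 tapes x 1 TB
t10kc_tb = 2000 * 5.0      # 2000 tapes x 5 TB
print("current media: %.1f PB" % ((t10kb_tb + t10kc_tb) / 1000.0))
# Re-writing the 2000 tapes with T10KD technology at 8.5 TB each:
print("after T10KD re-use: %.1f PB" % ((t10kb_tb + 2000 * 8.5) / 1000.0))
```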
Storage Configuration • All disk space is partitioned into ~10 GPFS clusters served by ~170 servers • One cluster per main (LHC) experiment • GPFS deployed on the SAN implements a fully high-availability system • The system is scalable to tens of PB and able to serve thousands of concurrent processes with an aggregate bandwidth of tens of GB/s • GPFS coupled with TSM offers a complete HSM solution: GEMSS • Access to storage is granted through standard interfaces (POSIX, SRM, XRootD and WebDAV) • File systems are directly mounted on the WNs
Storage research activities • Studies on more flexible and user-friendly methods for accessing storage over the WAN • Storage federations based on HTTP/WebDAV for ATLAS (production) and LHCb (testing), see the client-side sketch below • Evaluation of different file systems (Ceph) and storage solutions (EMC2 Isilon over OneFS) • Integration between the GEMSS storage system and XRootD, to match the requirements of CMS, ATLAS, ALICE and LHCb using ad-hoc XRootD modifications • This is currently in production
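A minimal client-side sketch of HTTP/WebDAV access to a federated storage endpoint, assuming a hypothetical WebDAV door URL, a VOMS proxy used for authentication and the usual grid CA directory; none of these reflect the actual ATLAS/LHCb federation configuration.

```python
# Minimal sketch: read a file from a WebDAV/HTTP storage endpoint using a
# grid proxy for authentication. Endpoint and paths are placeholders.
import requests

ENDPOINT = "https://webdav.example.org:8443"
PROXY = "/tmp/x509up_u500"  # VOMS proxy (cert + key in one PEM, assumption)

def fetch(lfn, dest):
    """Download a logical file name exposed by the WebDAV door."""
    r = requests.get(ENDPOINT + lfn, cert=PROXY,
                     verify="/etc/grid-security/certificates", stream=True)
    r.raise_for_status()
    with open(dest, "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)
    r.close()

fetch("/atlas/rucio/data/some_dataset/file.root", "file.root")
```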
LTDP • Long Term Data Preservation (LTDP) for the CDF experiment • The FNAL-CNAF data copy mechanism is complete • The copy of the data will follow this timetable: • end 2013 - early 2014 → all data and MC user-level n-tuples (2.1 PB) • mid 2014 → all raw data (1.9 PB) + databases • Bandwidth of 10 Gb/s reserved on the transatlantic link CNAF ↔ FNAL (rough transfer-time estimate below) • 940 TB already at CNAF • Code preservation: the CDF legacy software release (SL6) is under test • Analysis framework: in the future, CDF services and analysis computing resources may be instantiated on demand as pre-packaged VMs in a controlled environment
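A rough estimate of how long each copy takes at the reserved bandwidth (pure arithmetic on the volumes above; the link efficiency is an assumption and protocol overhead and competing traffic are ignored):

```python
# Rough transfer-time estimate for the LTDP copy over the 10 Gb/s link.
LINK_GBPS = 10.0

def days_to_copy(petabytes, efficiency=0.8):
    """Days to move `petabytes` PB at LINK_GBPS with an assumed efficiency."""
    bits = petabytes * 1e15 * 8                    # PB -> bits (decimal units)
    seconds = bits / (LINK_GBPS * 1e9 * efficiency)
    return seconds / 86400.0

print("n-tuples (2.1 PB): %.0f days" % days_to_copy(2.1))
print("raw data (1.9 PB): %.0f days" % days_to_copy(1.9))
```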