270 likes | 459 Views
The INFN-Tier1. Luca dell’Agnello INFN-CNAF FNAL, May 18 2012. INFN-Tier1. Italian Tier-1 computing centre for the LHC experiments ATLAS, CMS, ALICE and LHCb .... … but also one of the main Italian processing facilities for several other experiments:. BaBar and CDF
E N D
The INFN-Tier1 Luca dell’AgnelloINFN-CNAF FNAL, May 18 2012
INFN-Tier1 • Italian Tier-1 computing centre for the LHC experiments ATLAS, CMS, ALICE and LHCb.... • … but also one of the main Italian processing facilities for several other experiments: • BaBar and CDF • Astro and Space physics • VIRGO (Italy), ARGO (Tibet), AMS (Satellite), PAMELA (Satellite) and MAGIC (Canary Islands) • And more (e.g. Icarus, Borexino, Gerda etc…)
INFN-Tier1: numbers • > 20 supported experiments • ~ 20 FTEs • 1000 m2 room with capability for more than 120 racks and several tape libraries • 5 MVA electrical power • Redundant facility to provide 24hx7d availability • Within May 2012 resources ready • 1300 server with about 10000 cores available • 11 PBytes of disk space for high speed access and 14 PBytes on tapes (1 tape library) • Aggregate bandwith to storage: ~ 50 GB/s • WAN link at 30 Gbit/s • 2x10 Gbit/s over OPN • With forthcoming GARR-X bandwith increase is expected
Electrical delivery CED (new room) Chiller floor UPS room
T1-T2’s • CNAF General purpose NEXUS 7600 • RAL • PIC • TRIUMPH • BNL • FNAL • TW-ASGC • NDFGF Current WAN Connections Router CERN per T1-T1 WAN LHC-OPN (20 Gb/s) T0-T1 +T1-T1 in sharing Cross Border Fiber (from Milan) CNAF-KIT CNAF-IN2P3 CNAF-SARA T0-T1 BACKUP 10Gb/s GARR 10Gb/s 20Gb/s 2x10Gb/s LHC ONE T2’s • 20 Gb phisical Link (2x1Gb) for LHCOPN and LHCONE Connectivity • 10 Gigabit Link for General IP connectivity • LHCHONE and LHC-OPN are sharing the same phisical ports now but they are managed as two completely different links (different VLANS are used for the point-to-point interfaces). • All the TIER2s wich are not connected to LHCONE are reached only via General IP. T1 resources A 10 Gbps link Geneva-Chicago for GARR/Geant us foreseen before the end of Q2 2012
T1-T2’s • CNAF General purpose NEXUS 7600 • RAL • PIC • TRIUMPH • BNL • FNAL • TW-ASGC • NDFGF Future WAN Connection (Q4 2012) Router CERN per T1-T1 WAN LHC-OPN (20 Gb/s) T0-T1 +T1-T1 in sharing Cross Border Fiber (from Milan) CNAF-KIT CNAF-IN2P3 CNAF-SARA T0-T1 BACKUP 10Gb/s GARR 10Gb/s 20Gb/s 10Gb/s 2x10Gb/s LHC ONE T2’s • 20 Gb phisical Link (2x1Gb) for LHCOPN and LHCONE Connectivity • 10 Gigabit phisical Link to LHCONE (dedicated to T1-T2’s traffic LHCONE) • 10 Gigabit Link for General IP connectivity • A new 10Gb/s dedcated to LHCONE link will be added. T1 resources
Computing resources • Currently ~ 110K HS06 available • We host other sites • T2 LHCb (~5%) with shared resources • T3 UniBO (~2%) with dedicated resources • New tender (June 2012) will add ~ 31K HS06 • 45 enclosures, 45x4 mb, 192 HS06/mb • Nearly constant number of boxes (~ 1000) • Farm is nearly always 100% used • ~ 8900 job slots ( 9200 job slots) • > 80000 jobs/day • Quite high efficiency (CPT/WCT ~ 83%) • Even with chaotic analysis InstallazioneCPU 2011 Farm usage (last 12 months WCT/CPT (last 12 months)
Storage resources TOTAL of 8.6 PB on-line (net) disk space • 7 EMC2CX3-80 + 1 EMC2CX4-960 (~1.8 PB) + 100 servers • In phase-out nel 2013 • 7 DDN S2A 9950 (~7 PB) + ~60 servers • Phase-out nel 2015-2016 … and under installation 3 Fujitsu EternusDX400 S2 (3 TB SATA) : + 2.8 PB • Tape library Sl8500 9PB + 5PB (just installed) on line with 20 T10KBdrives and 10 T10KC drives • 9000 x 1 TB tape capacity, ~ 100MB/s of bandwidth for each drive • 1000 x 5 TB tape capacity, ~ 200MB/s of bandwidth for each drive • Drives interconnected to library and servers via dedicated SAN (TAN). 13 Tivoli Storage manager HSM nodes access to the shared drives. • 1 Tivoli Storage Manager (TSM) server common to all GEMSS instances. • All storage systems and disk-servers are on SAN (4Gb/s or 8Gb/s) • Disk space is partitioned in several GPFS clusters served by ~100 disk-servers (NSD + gridFTP) • 2 FTE employed to manage the full system;
Whatis GEMSS? • GEMSS is the integration of StoRM, GPFS and TSM • GPFS parallel file-system by IBM, TSM archival system by IBM • GPFS deployed on the SAN implements a full HA system • StoRMis an srm 2.2 implementation developed by INFN • Already in use at INFN T1 since 2007 and at other centers for the disk-only storage • designedtoleverage the advantagesofparallel file systemsand common POSIX file systemsin a Gridenvironment • We combined the features of GPFS and TSM with StoRM, to provide a transparent grid-enabled HSM solution. • The GPFS Information Lifecycle Management (ILM) engine is used to identify candidates files for migration to tape and to trigger the data movement between the disk and tape pools • An interface between GPFS and TSM (named YAMSS) was also implemented to enable tape-ordered recalls
GEMSS resources layout TAPE (14PB avaliable 8.9PB used) GPFS NSD diskserver WAN or TIER1 LAN ~100 Diskservers with 2 FC connections STK SL8500 robot (10000 slots) 20 T10000B drives 10 T10000C drives GPFS client nodes Farm Worker Nodes (LSF Batch System) for 120 HS-06 i.e 9000 job slot DATA access from the FARM Worker Nodes use the TIER1 LAN. The NSD diskserversuse 1 or 10Gb network connections. Fibre Channel (4/8 gb/s) DISK ACCESS Fibre Channel (4/8 gb/s) TAPE ACCESS SAN/TAN Fibre Channel (4/8 gb/s) DISK&TAPE ACCESS • TIVOLI STORAGE MANAGER (TSM) HSM nodes Fibre Channel (4/8 gb/s) DISK ACCESS • 13 server with triple FC connections • 2 FC to the SAN (disk access) • 1 FC to the TAN (tape access) 13 GEMSS TSM HSM nodes provide all the DISK TAPE data migration thought the SAN/TAN fibre channel network. 1 TSM SERVER NODE RUNS THE TSM INSTANCE DISK ~8.4PB net space
CNAF in the grid • CNAF is part of the WLCG/EGI infrastructure, grantingaccesstodistributedcomputing and storageresources • Access tocomputing farm via the EMI CREAM ComputeElements • Access tostorageresources, on GEMSS, via the srmend-points • Also “legacy” access (i.e. local access allowed) • Cloud paradigm also supported ( see Davide’s slides on WnoDeS)
DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE GEMSS DATA MIGRATIONPROCESS GEMSS DATA RECALLPROCESS ILM StoRM TSM TAN GridFTP GPFS SAN TAN SAN SAN WORKER NODE WAN LAN SAN LAN Building blocks of GEMSS system Disk-centric system with five building blocks • GPFS: disk-storage software infrastructure • TSM: tape management system • StoRM: SRM service • TSM-GPFS interface • GlobusGridFTP: WAN data transfers
GEMSS data flow (1/2) Recall TSM DB SRM request StoRM HSM-1 HSM-2 TSM-Server Migration WAN I/O GridFTP Storage Area Network Tape Area Netwok Tape Library Disk
GEMSS data flow (2/2) SortingFilesbyTape TSM DB SRM request StoRM HSM-2 HSM-1 TSM-Server WAN I/O GridFTP SAN TAN 1 2 3 Tape Library Disk GPFS Server LAN WorkerNode
GEMSS layout for a typical Experiment at INFN Tier-1 10x10 Gbps 2x2x2 Gbps 2 GridFTP servers (2 x 10 Gbps on WAN) 8 disk-servers for data (8 x 10 Gbps on LAN) 2 disk-servers for metadata (2 xxGbps) 2.2 PB GPFS file-system
GEMSS in production Running and pending jobs on the farm Running Pending • Gbit technology (2009) • Using the file protocol (i.e. direct access to the file) • Up to 1000 concurrent jobs recalling from tape ~ 2000 files • 100% job success rate • Up to 1.2 GB/s from the disk pools to the farm nodes • 10 Gbit technology (since 2010) • Using the file protocol • Up to 2500 concurrent jobs accessing fileson disk • ~98% job success rate • Up to ~ 6 GB/s from the disk pools to the farm nodes • WAN links towards saturation CMS queue (May 15) Farm- CMS storage traffic Aggregate traffic on eth0 network cards (x2)
DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE DATA FILE GEMSS DATA MIGRATIONPROCESS GEMSS DATA RECALLPROCESS ILM StoRM TSM TAN GridFTP GPFS SAN TAN SAN SAN WORKER NODE WAN LAN SAN LAN Building blocks of GEMSS system Disk-centric system with five building blocks • GPFS: disk-storage software infrastructure • TSM: tape management system • StoRM: SRM service • TSM-GPFS interface • GlobusGridFTP: WAN data transfers
GEMSS data flow (1/2) Recall TSM DB SRM request StoRM HSM-1 HSM-2 TSM-Server Migration WAN I/O GridFTP Storage Area Network Tape Area Netwok Tape Library Disk
GEMSS data flow (2/2) SortingFilesbyTape TSM DB SRM request StoRM HSM-2 HSM-1 TSM-Server WAN I/O GridFTP SAN TAN 1 2 3 Tape Library Disk GPFS Server LAN WorkerNode
GEMSS layout for a typical Experiment at INFN Tier-1 10x10 Gbps 2x2x2 Gbps 2 GridFTP servers (2 x 10 Gbps on WAN) 8 disk-servers for data (8 x 10 Gbps on LAN) 2 disk-servers for metadata (2 xxGbps) 2.2 PB GPFS file-system
GEMSS in production Running and pending jobs on the farm Running Pending • Gbit technology (2009) • Using the file protocol (i.e. direct access to the file) • Up to 1000 concurrent jobs recalling from tape ~ 2000 files • 100% job success rate • Up to 1.2 GB/s from the disk pools to the farm nodes • 10 Gbit technology (since 2010) • Using the file protocol • Up to 2500 concurrent jobs accessing fileson disk • ~98% job success rate • Up to ~ 6 GB/s from the disk pools to the farm nodes • WAN links towards saturation CMS queue (May 15) Farm- CMS storage traffic Aggregate traffic on eth0 network cards (x2)
Yearly statistics Aggregate GPFS traffic (file protocol) Aggregate WAN traffic (gridftp) Tape-disk data movement (over the SAN) Mounts/hour
Why GPFS • Original idea since the very beginning: we did not like to rely on a tape centric system • First think to the disk infrastructure, the tape part will come later if still needed • We wanted to follow a model based on well established industry standard as far as the fabric infrastructure was concerned • Storage Area Network via FC for disk-server to disk-controller interconnections • This lead quite naturally to the adoption of a clustered file-system able to exploit the full SAN connectivity to implement flexible and highly available services • There was a major problem at that time: a specific SRM implementation was missing • OK, we decided to afford this limited piece of work StoRM
Basicsofhow GPFS works • The idea behind a parallel file-system is in general to stripe files amongst several servers and several disks • This means that, e.g., replication of the same (hot) file in more instances is useless you get it “for free” • Any “disk-server” can access every single device with direct access • Storage Area Network via FC for disk-server to disk-controller interconnection (usually a device/LUN is some kindof RAID array) • In a few words, all the servers share the same disks, but a server is primarily responsible to serve via Ethernet just some disks to the computing clients • If a server fails, any other server in the SAN can take over the duties of the failed server, since it has direct access to its disks • All filesystem metadata are saved on disk along with the data • Data and metadata are treated simmetrically, striping blocks of metadata on several disks and servers as if they were data blocks • No need of external catalogues/DBs: it is a true filesystem
Some GPFS key features • Very powerful (only command line, no other way to do it) interface for configuring, administering and monitoring the system • In our experience this is the key feature which allowed to keep minimal manpower to administer the system • 1 FTE to control every operation (and scaling with increasing volumes is quite flat) • Needs however some training to startup, it is not plug and pray… but documentation is huge and covers (almost) every relevant detail • 100% POSIX compliant by design • Limited amount of HW resources needed (see later for an example) • Support for cNFSfilesystem export to clients (parallel NFS server solution with full HA capabilities developed by IBM) • Stateful connections between “clients” and “servers” are kept alive behind the data access (file) protocol • No need of things like “reconnect” at the application level • Native HSM capabilities (not only for tapes, but also for multi-tiered disk storage)
GEMSS in production for CMS • GEMSS went in production for CMS in October 2009 • w/o major changes to the layout • only StoRM upgrade, with checksum and authzsupportbeing deployed soon also Good-performance achieved in transfer throughput • High use of the available bandwidth • (up to 8 Gbps) Verification with Job Robot jobs in different periods shows that CMS workflows efficiency was not impacted by the change of storage system • “Castor + SL4” vs “TSM + SL4” vs “TSM + SL5” As from the current experience, CMS gives a very positive feedback on the new system • Very good stability observed so far CNAF ➝ T1-US_FNAL CNAF ➝ T2_CH_CAF