Storage resources management and access at TIER1 CNAF

Ricci Pier Paolo, Lore Giuseppe, Vagnoni Vincenzo, on behalf of the INFN TIER1 Staff
pierpaolo.ricci@cnaf.infn.it
ACAT 2005, May 22-27 2005, DESY Zeuthen, Germany

TIER1 INFN CNAF Storage
Clients and access:
• Linux SL 3.0 clients (100-1000 nodes), on the WAN or TIER1 LAN
• Access protocols: RFIO, NFS, NFS-RFIO-GridFTP oth...
HSM (400 TB):
• CASTOR HSM servers (H.A.)
• STK L5500 robot (5500 slots), 6 IBM LTO-2, 2 (4) STK 9940B drives
• STK180 with 100 LTO-1 (10 TByte native)
• W2003 Server with LEGATO Networker (Backup)
NAS (20 TB):
• NAS1, NAS4: 3ware IDE SAS, 1800+3200 GByte
• NAS2: PROCOM 3600 FC, 9000 GByte
• NAS3: PROCOM 3600 FC, 4700 GByte
SAN 1 (200 TB) and SAN 2 (40 TB):
• Diskservers with Qlogic FC HBA 2340
• 2 Brocade Silkworm 3900 32-port FC switches
• 2 Gadzoox Slingshot 4218 18-port FC switches
• IBM FastT900 (DS 4500), 3/4 x 50000 GByte, 4 FC interfaces
• STK BladeStore, about 25000 GByte, 4 FC interfaces
• Infortrend A16F-R1A2-M1, 4 x 3200 GByte SATA
• Infortrend A16F-R1211-M2 + JBOD, 5 x 6400 GByte SATA
• AXUS BROWIE, about 2200 GByte, 2 FC interfaces
ACAT 2005 DESY pierpaolo.ricci@cnaf.infn.it

CASTOR HSM
• STK L5500 library: 2000+3500 mixed slots; 6 LTO-2 drives (20-30 MB/s); 2 9940B drives (25-30 MB/s); 1300 LTO-2 tapes (200 GB native); 650 9940B tapes (200 GB native)
• Point-to-point FC 2 Gb/s connections
• 8 tapeservers, Linux RH AS 3.0, Qlogic 2300 HBA
• Sun Blade v100 with 2 internal IDE disks in software RAID-0, running ACSLS 7.0 on Solaris 9.0
• 1 CASTOR (CERN) Central Services server, RH AS 3.0
• 1 ORACLE 9i rel 2 DB server, RH AS 3.0
• 6 stagers with diskserver, RH AS 3.0, 15 TB local staging area
• 8 or more rfio diskservers, RH AS 3.0, min 20 TB staging area
• Clients on the WAN or TIER1 LAN; disk on SAN 1 and SAN 2
• Diagram legend: the marked links indicate fully redundant FC 2 Gb/s connections (dual-controller HW and Qlogic SANsurfer Path Failover SW)

CASTOR HSM (2)
• In general we obtained:
  • Good performance when writing into the staging area (disk buffer) and from the staging area to tape (2 parallel streams to tape give about 40 MB/s)
  • Generally good reliability of the stager service (every LHC experiment has its own dedicated stager and policies) and high reliability of the central CASTOR services
  • Poor reliability of the LTO-2 drives, both when writing and when reading. This results in tapes being marked read-only or disabled when writing, and in locking or failures when trying to stage in files in random order.
• Together with the experiment coordination we could arrange a temporary increase of the staging area (disk buffer) and an optimized sequential stage-in of the data just before the analysis phase. The analysis jobs could then run directly over rfio or the Grid tools on CASTOR with a high probability of finding the files already on disk (LHCb). After the end of the analysis phase the disk buffer could be reassigned to another experiment.
• We decided to acquire and use more STK 9940B drives for random access to the data
• Access to the CASTOR HSM system is:
  • Direct, using rf<command> on the user interfaces or on the WNs (rfcp, rfrm or the API...)
  • Through front-ends with a GridFTP interface to CASTOR and SRM

DISK access
• Clients: WAN or TIER1 LAN; farms of rack-mountable 1U biprocessor nodes (currently about 1000 nodes for 1300 KSpecInt2000)
• Gb Ethernet connections: nfs, rfio, xrootd, GPFS, GridFTP
• Generic diskserver: Supermicro 1U, 2 Xeon 3.2 GHz, 4 GB RAM, Gb Ethernet, 1 or 2 Qlogic 2300 HBA, Linux AS or CERN SL 3.0 OS; LUN0 => /dev/sda, LUN1 => /dev/sdb, ...
• 1 or 2 2 Gb FC connections for every diskserver
• 2 Brocade Silkworm 3900 32-port FC switches, ZONED (50 TB unit with 4 diskservers), 2 x 2 Gb interlink connections
• 50 TB IBM FastT 900 (DS 4500): dual redundant controllers (A, B), internal MiniHubs (1, 2), ports A1 A2 B1 B2, 2 Gb FC connections, RAID5, 2 TB logical disks (LUN0, LUN1, ...)
• FC path failover HA:
  • Qlogic SANsurfer
  • IBM or STK RDAC for Linux
• Application HA:
  • NFS server, rfio server with Red Hat Cluster AS 3.0 (*)
  • GPFS with NSD Primary/Secondary configuration:
    • /dev/sda: Primary Diskserver 1; Secondary Diskserver 2
    • /dev/sdb: Primary Diskserver 2; Secondary Diskserver 3
    • .....
(*) tested but not currently used in production
4 diskservers for every 50 TB unit => every controller can perform a maximum of 120 MByte/s R-W
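
The NSD primary/secondary pattern listed on this slide (/dev/sda on Diskservers 1 and 2, /dev/sdb on Diskservers 2 and 3, and so on) can be read as a rotating assignment. The following is a minimal Python sketch of that reading, not the configuration actually used at CNAF: the function name, the server names and the wrap-around from the last diskserver back to the first are assumptions made for illustration.

```python
def assign_nsd_servers(devices, servers):
    """Return {device: (primary, secondary)} using a rotating assignment.

    Device i gets server i as primary and server i+1 as secondary.
    The wrap-around for the last server is an assumption; the slide
    only shows the first two devices.
    """
    if len(servers) < 2:
        raise ValueError("need at least two diskservers for primary/secondary")
    plan = {}
    for i, dev in enumerate(devices):
        primary = servers[i % len(servers)]
        secondary = servers[(i + 1) % len(servers)]
        plan[dev] = (primary, secondary)
    return plan


# Hypothetical names for the 4 diskservers of one 50 TB unit.
example_plan = assign_nsd_servers(
    ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"],
    ["diskserver1", "diskserver2", "diskserver3", "diskserver4"],
)
for dev, (primary, secondary) in example_plan.items():
    print(f"{dev}: primary={primary} secondary={secondary}")
```

With this layout every diskserver is primary for one device and secondary for another, so the loss of a single diskserver leaves every device reachable through its other server.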

DISK access (2)
• We have different protocols in production for accessing the disk storage. On our diskservers and Grid SE front-ends we currently have:
  • NFS on local filesystem. ADV: Easy client implementation, compatibility and possibility of failover (RH 3.0). DIS: Poor performance scalability for a high number of accesses (1 client: 30 MB/s; 100 clients: 15 MB/s throughput)
  • RFIO on local filesystem. ADV: Good performance, compatibility with Grid tools and possibility of failover. DIS: No scalability of front-ends for the single filesystem, no possibility of load balancing
  • Grid SE GridFTP/rfio over GPFS (CMS, CDF). ADV: Separation between the GPFS servers (accessing the disks) and the SE GPFS clients. Load balancing and HA on the GPFS servers, and possibility to implement the same on the Grid SE services (see next slide). DIS: GPFS layer requirements on OS and certified hardware for support.
  • Xrootd (BABAR). ADV: Good performance. DIS: No possibility of load balancing for the single filesystem backends, not Grid compliant (at present...)
• NOTE: IBM GPFS 2.2 is a CLUSTERED FILESYSTEM, so many front-ends (e.g. GridFTP or rfio servers) can access the SAME filesystem simultaneously. It also allows bigger filesystem sizes (we use 8-12 TB).

CASTOR Grid Storage Element
• GridFTP access through the castorgrid SE, a DNS CNAME pointing to 3 servers
• DNS round-robin for load balancing
• During LCG Service Challenge 2 a load-average selection was also introduced: every M minutes the IP of the most loaded server is replaced in the CNAME (see graph)
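
The load-average selection can be illustrated with a small Python sketch. The slide does not say what the most loaded server's address is replaced with; the sketch assumes it is replaced by the address of the least loaded server, so that the alias keeps the same number of records. The function name and the addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical, and the step that would publish the result to the DNS is left out.

```python
def select_alias_ips(load_by_ip):
    """Return the list of IPs to publish under the alias.

    The IP with the highest load average is replaced by the IP with the
    lowest one (an assumption; the source only says it is "replaced").
    """
    if len(load_by_ip) < 2:
        return list(load_by_ip)
    ips = sorted(load_by_ip)  # deterministic order of the records
    busiest = max(ips, key=lambda ip: load_by_ip[ip])
    idlest = min(ips, key=lambda ip: load_by_ip[ip])
    return [idlest if ip == busiest else ip for ip in ips]


# Hypothetical load averages collected from the 3 servers behind the alias.
loads = {"192.0.2.11": 0.8, "192.0.2.12": 5.2, "192.0.2.13": 1.9}
print(select_alias_ips(loads))
```

A scheduler running this every M minutes would keep plain round-robin behaviour when the loads are equal and steer new connections away from an overloaded server otherwise.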

Monitoring/notifications (Nagios)

[Graphs: LHCb CASTOR tape pool; # processes on a CMS disk SE; eth0 traffic through a CASTOR LCG SE]

Disk accounting
[Graphs: pure disk space (TB); CASTOR disk space (TB)]

Parallel Filesystem Test
Test goal: evaluation and comparison of parallel filesystems (GPFS, Lustre) for the implementation of a powerful disk I/O infrastructure for the TIER1 INFN CNAF.
• A moderately high-end testbed has been used:
  • 6 IBM xSeries 346 file servers connected via FC SAN to 3 IBM FAStT 900 (DS4500) providing a total of 24 TB
  • Maximum available throughput to the client nodes (30) using Gb Ethernet: 6 Gbps
• PHASE 1: Generic test and tuning
• PHASE 2: Realistic physics analysis jobs reading data from a parallel filesystem
• Dedicated tools for testing (PHASE 1) and monitoring have been written:
  • The benchmarking tool allows the user to start, stop and monitor the test on all the clients from a single point
  • Completely automated

PHASE 1 Generic Benchmark
• GPFS: Very stable, reliable, fault tolerant, indicated for storage of critical data, and free of charge for educational or research use
• Lustre: Commercial product, easy to install, but fairly invasive (needs a patched kernel) and has a per-node license cost
• Sequential write/read from a variable number of clients simultaneously performing I/O with 3 different protocols (native GPFS, rfio over GPFS, nfs over GPFS)
• 1 to 30 Gb clients, 1 to 4 processes per client
• Sequential write/read of zeroed files by means of dd
• File sizes ranging from 1 MB to 1 GB

Generic Benchmark
[Plot: results of read/write (1 GB different files); effective average throughput (Gb/s) vs. # of simultaneous reads/writes]
[Plot: raw Ethernet throughput vs. time (20 x 1 GB file simultaneous reads with Lustre)]

Generic Benchmark (here shown for 1 GB files)
• Numbers are reproducible with small fluctuations
• Lustre tests with NFS export not yet performed

PHASE 2 Realistic analysis
• We focus on the analysis jobs since they are generally the most I/O-bound processes of the experiment activity.
• A realistic LHCb analysis algorithm runs on 8 TB of data served by RFIO daemons running on the GPFS parallel filesystem servers
• The analysis algorithm performs a selection of an LHCb physics channel by sequentially reading input DST (Data Summary Tape) files and producing ntuple files as output
• Analysis jobs are submitted to the production LSF batch system of the TIER1 INFN (RFIO was the simplest and most effective choice)
• 14000 jobs submitted, 500 jobs in simultaneous RUN state
Steps of the jobs:
• RFIO-copy the file to be processed to the local WN disk
• Analyze the data
• RFIO-copy back the output of the algorithm
• Clean up the files from the local disk
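
The four job steps can be sketched as a small Python wrapper. This is an illustration under stated assumptions, not the job script used in the test: the function name, the local file names and the `analyze` callback are hypothetical, and the only external command is `rfcp` in its basic `rfcp source destination` form. The command runner is injectable so that the flow can be exercised without an RFIO installation.

```python
import os
import shutil
import subprocess
import tempfile


def run_analysis_job(remote_input, remote_output, analyze, run=subprocess.check_call):
    """Run one job: stage in, analyze, stage out, clean up.

    `analyze(local_in, local_out)` stands in for the analysis algorithm
    (hypothetical callback). `run` executes a command given as a list.
    Returns the path of the (removed) scratch directory.
    """
    workdir = tempfile.mkdtemp(prefix="job_")
    local_in = os.path.join(workdir, "input.dst")
    local_out = os.path.join(workdir, "output.ntuple")
    try:
        run(["rfcp", remote_input, local_in])    # 1. RFIO-copy input to local WN disk
        analyze(local_in, local_out)             # 2. analyze the data
        run(["rfcp", local_out, remote_output])  # 3. RFIO-copy the output back
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # 4. clean up the local disk
    return workdir
```

Putting the cleanup in a `finally` clause means the local scratch area is released even when a copy or the analysis fails, which matters when 500 jobs share the worker nodes' local disks.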

Realistic analysis results
• 8.1 TB of data processed in 7 hours, all 14000 jobs completed successfully
• > 3 Gb/s raw sustained read throughput from the file servers with GPFS (about 320 MByte/s effective)
• Write throughput of output data negligible
  • Just 1 MB per job
The results are very satisfactory and give us a good impression of the whole infrastructure layout. Tests for the Lustre configuration are in progress (we don’t expect a big difference using the rfio protocol over a parallel filesystem).
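
As a quick cross-check of the figures on this slide, the average effective throughput follows directly from the data volume and the elapsed time. The short Python sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the function name is only illustrative.

```python
def average_throughput(bytes_processed, seconds):
    """Return the average throughput as (MB/s, Gb/s), decimal units."""
    bytes_per_s = bytes_processed / seconds
    return bytes_per_s / 1e6, bytes_per_s * 8 / 1e9


# 8.1 TB processed in 7 hours, as quoted on the slide.
mb_s, gb_s = average_throughput(8.1e12, 7 * 3600)
print(f"{mb_s:.0f} MB/s, {gb_s:.2f} Gb/s")  # prints "321 MB/s, 2.57 Gb/s"
```

The result agrees with the quoted effective rate of about 320 MByte/s. The raw figure of more than 3 Gb/s is measured on the network and is higher than this average; the slides do not break down the difference.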

Conclusions
In these slides we presented:
• A general overview of the Italian TIER1 INFN CNAF storage resources, hardware and access methods:
  • HSM software (CERN CASTOR) for tape library mass storage
  • Disk over SAN with different software protocols
• Some simple management implementations for monitoring and optimizing access to our storage resources
• Results from clustered parallel filesystem (Lustre/GPFS) performance measurements:
  • Step 1: Generic filesystem benchmark
  • Step 2: Realistic LHC analysis job results
Thank you to everybody for your attention

Benchmarking tools
• Dedicated tools for benchmarking and monitoring have been written
• The benchmarking tools allow the user to start, stop and monitor the evolution of simultaneous read/write operations from an arbitrary number of clients, reporting the aggregated throughput at the end of the test
  • Realized as a set of bash scripts and C programs
• The tool implements network bandwidth measurements by means of the netperf suite and sequential read/write with dd
• Designed for general use, it can be reused with minimal effort for any kind of storage benchmark
• Completely automated
  • The user does not need to install anything on the target nodes, as all the software is copied by the tool via ssh (and also removed at the end)
  • The user only has to issue a few commands from the shell prompt to control everything
  • Can perform complex, unattended and personalized tests by means of very simple scripts, collect and save all the results, and produce plots
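
The slides do not define how the aggregated throughput is computed. One plausible definition, used in the Python sketch below purely as an assumption, is the total number of bytes moved by all clients divided by the wall-clock window from the first start to the last end. The function name and the sample timings are hypothetical.

```python
def aggregated_throughput(transfers):
    """Return bytes/s over the whole test window.

    `transfers` is an iterable of (start_s, end_s, n_bytes), one entry
    per client process.
    """
    transfers = list(transfers)
    if not transfers:
        return 0.0
    start = min(t[0] for t in transfers)
    end = max(t[1] for t in transfers)
    if end <= start:
        raise ValueError("empty time window")
    total_bytes = sum(t[2] for t in transfers)
    return total_bytes / (end - start)


# Three hypothetical clients, each moving a 1 GB file.
rate = aggregated_throughput([(0.0, 10.0, 10**9), (1.0, 9.0, 10**9), (2.0, 12.0, 10**9)])
print(f"{rate / 1e6:.0f} MB/s aggregated")
```

Using the full window rather than summing per-client rates avoids overstating the result when the clients do not all run at the same time.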

Monitoring tools
• The monitoring tools measure the time dependence of the raw network traffic of each server with a granularity of one second
• Following the time dependence of the I/O gives important insights and can be very important for a detailed understanding and tuning of the network and parallel filesystem operational parameters
• The existing tools did not provide such a fine granularity, so we wrote our own, reusing work done for the LHCb online farm monitoring (consider that writing/reading one file of 1 GB from a single client requires just a few seconds)
• The tool automatically produces a plot, in PDF format, of the aggregated network traffic of the file servers for each test
• The network traffic data corresponding to each file server are saved to ASCII files in case one wants to make a detailed per-server analysis
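
A one-second traffic sampler of this kind can be sketched in Python by reading the Linux interface counters in `/proc/net/dev`. This is not the tool described on the slide, whose implementation is not shown; the function names are illustrative, and the sketch assumes the standard `/proc/net/dev` layout (two header lines, then received bytes in the first column and transmitted bytes in the ninth column after the interface name).

```python
import time


def parse_net_dev(text):
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates(before, after, interval_s=1.0):
    """Return {iface: (rx_bytes_per_s, tx_bytes_per_s)} between two samples."""
    return {
        iface: (
            (after[iface][0] - before[iface][0]) / interval_s,
            (after[iface][1] - before[iface][1]) / interval_s,
        )
        for iface in after
        if iface in before
    }


def sample_rates(iface, samples, path="/proc/net/dev", interval_s=1.0):
    """Yield (rx, tx) byte rates for `iface` once per interval (Linux only)."""
    with open(path) as f:
        previous = parse_net_dev(f.read())
    for _ in range(samples):
        time.sleep(interval_s)
        with open(path) as f:
            current = parse_net_dev(f.read())
        yield rates(previous, current, interval_s)[iface]
        previous = current
```

On a file server one could, for example, iterate over `sample_rates("eth0", 60)` and append each pair to an ASCII file, one line per second.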

GPFS features
• Very stable, reliable, fault tolerant, indicated for storage of critical data, and free of charge for educational or research use
• Commercial product, initially developed by IBM for the SP series and then ported to Linux
• Advanced command line interface for configuration and management
• Easy to install, not invasive
  • Distributed as binaries in RPM packages
  • No patches to standard kernels are required, just a few kernel modules for POSIX I/O to be compiled for the running kernel
• Data and metadata striping
• Possibility to have data and metadata redundancy
  • Expensive solution, as it requires the replication of whole files; indicated for storage of critical data
• Data recovery for filesystem corruption available
• Fault tolerance features oriented to SAN, and internal health monitoring through network heartbeat

Lustre features
• Commercial product, easy to install, but fairly invasive
  • Distributed as binaries and sources in RPM packages
  • Requires its own Lustre patches to standard kernels, but binary distributions of patched kernels are made available
• Aggressive commercial effort: the developers sell it as an “Intergalactic Filesystem” scalable to 10000+ nodes
• Advanced interface for configuration and management
• Possibility to have metadata redundancy and Metadata Server fault tolerance
• Data recovery for filesystem corruption available
• POSIX I/O

Sequential Write/Read benchmarks
• Sequential write/read from a variable number of clients simultaneously performing I/O
  • 1 to 30 Gb clients, 1 to 4 processes per client
• Sequential write/read of zeroed files by means of dd
  • File sizes ranging from 1 MB to 1 GB
• After having been written, files are read back
  • Particular attention is paid to reading the whole files from disk (i.e. no caching at all, neither on the client side nor on the server side)
  • Before starting the tests, appropriate syncs are issued to flush the operating system buffers, so that consecutive tests do not interfere with each other
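
The write-then-read-back cycle of a single client process can be sketched as follows. The original tests used dd; this Python version swaps in plain file I/O to stay self-contained, and it only flushes the written data with `fsync`. It does not bypass the page cache on read, so unlike the tests described on the slide it is not a cache-free measurement. Function names and the block size are assumptions.

```python
import os
import time


def write_zeroed_file(path, size_bytes, block_size=1024 * 1024):
    """Sequentially write `size_bytes` of zeros; return (bytes, seconds)."""
    block = bytes(block_size)
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < size_bytes:
            chunk = block[: min(block_size, size_bytes - written)]
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the OS buffers
    return written, time.perf_counter() - start


def read_file(path, block_size=1024 * 1024):
    """Sequentially read the whole file; return (bytes, seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

Dividing the returned byte count by the elapsed seconds gives the per-process throughput; running several such processes on 1 to 30 clients reproduces the shape of the test matrix, though not its cache handling.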

Hardware testbed
• Disk storage
  • 3 IBM FAStT 900 (DS4500)
  • Each FAStT 900 serves 2 RAID5 arrays, 4 TB each (17 x 250 GB disks + 1 hot spare)
  • Each RAID5 array is further subdivided into two LUNs of 2 TB each
  • In total 12 LUNs and 24 TB of disk space (102 x 250 GB disks + 8 hot spares)
• File system servers
  • 6 IBM xSeries 346, dual Xeon, 2 GB RAM, Gigabit NIC
  • QLogic Fibre Channel PCI card on each server, connected to the DS4500 via a Brocade switch
  • 6 Gb/s available bandwidth to/from the clients
• Clients
  • 30 SuperMicro nodes, dual Xeon, 2 GB RAM, Gigabit NIC
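
The capacity and bandwidth totals on this slide follow from the per-array figures. The Python sketch below redoes the arithmetic, assuming that each RAID5 array spends one disk's worth of capacity on parity and that the 6 Gb/s figure is simply one Gigabit NIC per file server. Hot spares are left out of the calculation, and the function and key names are illustrative.

```python
def testbed_summary(n_controllers=3, arrays_per_controller=2, disks_per_array=17,
                    disk_gb=250, luns_per_array=2, n_servers=6, nic_gbps=1):
    """Recompute the testbed totals from the per-array figures."""
    arrays = n_controllers * arrays_per_controller
    # RAID5: usable capacity is (n - 1) disks per array (assumption stated above).
    array_tb = (disks_per_array - 1) * disk_gb / 1000
    return {
        "arrays": arrays,
        "luns": arrays * luns_per_array,
        "lun_tb": array_tb / luns_per_array,
        "total_tb": arrays * array_tb,
        "array_disks": arrays * disks_per_array,
        "network_gbps": n_servers * nic_gbps,
    }


summary = testbed_summary()
print(summary)
```

With the default values this gives 12 LUNs of 2 TB each, 24 TB in total, 102 array disks and 6 Gb/s towards the clients, matching the numbers quoted above.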

Realistic analysis results (with graphs)
• 8 TB of data processed in about 7 hours, about 14000 jobs submitted, all completed successfully
• 500 analysis jobs in simultaneous RUN state, the rest in PENDING
• 3 Gb/s sustained read throughput from the file servers (with RFIO on top of GPFS)
• Write throughput of output data negligible
  • Just about 1 MB per job
[Graph: LHCb LSF batch queue occupancy during tests]

Abstract
Title: Storage resources management and access at TIER1 CNAF
Abstract: At present, at the LCG TIER1 CNAF we have 2 main mass storage systems for archiving the HEP experiment data: an HSM software system (CASTOR) and about 200 TB of different storage devices over SAN. This paper briefly describes our hardware and software environment and summarizes the simple technical improvements we have implemented in order to obtain better availability and the best data access throughput from the front-end machines. Some test results for different file systems over SAN are also reported.
