Gridka: Xrootd SE with tape backend Artem Trunov Karlsruhe Institute of Technology
LHC Data Flow Illustrated (photo courtesy of Daniel Wang)
WLCG Data Flow Illustrated
ALICE Xrootd SE at GridKa
• Xrootd was requested by ALICE; the proposal was approved by the GridKa Technical Advisory Board.
• The solution implements the ALICE use cases for archiving custodial data at the GridKa T1 centre with an Xrootd SE:
  • Transfer of RAW and ESD data from CERN using FTS and the SRM protocol
  • Serving RAW data to reprocessing jobs via the ROOT protocol (see the sketch below)
  • Receiving custodial reprocessed ESD and AOD data from WNs via the ROOT protocol
  • Archiving custodial RAW and ESD data on tape
  • Recalling RAW data from tape for reprocessing
• GridKa special focus:
  • Low-maintenance solution
  • SRM contingency (in case ALICE needs it)
• Deployment timeline (xrootd has been deployed at GridKa since 2002, for BaBar):
  • Nov 2008: Xrootd disk-only SE, 320 TB
  • Sep 2009: Xrootd SE with tape backend + SRM, 480 TB
  • Nov 2009: ALICE uses xrootd exclusively
  • From July 2010: expansion up to 2500 TB of total space
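The ROOT-protocol read path named above is how reprocessing jobs access RAW data on the SE. A minimal PyROOT sketch of that access follows; the redirector hostname, port and file path are placeholders, not GridKa's real endpoints.

# Minimal sketch: a job opens a RAW file on the Xrootd SE via the ROOT protocol.
# Hostname, port and file path are illustrative placeholders only.
import ROOT

url = "root://alice-xrootd.example.org:1094//alice/raw/2010/run001/file.root"
f = ROOT.TFile.Open(url)            # the redirector sends the client to a data server
if not f or f.IsZombie():
    raise RuntimeError("could not open %s" % url)
print("opened", f.GetName(), "-", f.GetSize(), "bytes")
f.Close()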
Storage setup at GridKa
• Clusters of 2 or more servers with direct- or SAN-attached storage
• GPFS local file systems
  • Not global: not shared across all GridKa servers and WNs
  • All servers in a cluster see all of the cluster's storage pools
• Redundancy
  • Data remains accessible if one server fails
• Most of the data sits behind servers with 10G NICs, plus some older ones with 2x1G
[Diagram: file servers attached to FC storage arrays over a disk SAN]
Storage setup at GridKa + xrootd
• Xrootd maps well onto this setup
• A client can access a file through any server in the cluster
  • Redundant data path
  • If a server fails, the client is redirected to another one
• Scalability
  • Automatic load balancing
  • All xrootd servers can serve the same ("hot") file to different clients
[Diagram: xrootd servers running on the file servers in front of the shared SAN storage]
Admin node redundancy
• Xrootd supports two admin nodes ("redirectors", "managers") in a redundant configuration
  • Support is built into the ROOT client
• Users get a single access address that resolves to two IP addresses (DNS A records)
  • Clients pick one of the managers at random: load balancing
  • If one redirector fails, the ROOT client notices and tries the second address (sketched below)
• Two admin nodes give twice the transactional throughput
[Diagram: a DNS alias (A records) pointing at two xrootd managers, each redirecting to the xrootd servers]
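The client-side behaviour described above (pick one of the alias's A records at random, fall back to the other on failure) can be illustrated with a small Python sketch. The alias name and port are placeholders, and the real logic lives inside the ROOT/xrootd client rather than in user code.

# Sketch of the redirector selection described above: resolve the DNS alias,
# pick one A record at random (load balancing), fall back to the other on failure.
import random
import socket

ALIAS = "alice-xrootd.example.org"   # assumed DNS alias with two A records
PORT = 1094                          # default xrootd port

def pick_redirector(alias=ALIAS, port=PORT, timeout=5.0):
    addrs = [ai[4][0] for ai in
             socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)]
    random.shuffle(addrs)            # spread clients over both managers
    for ip in addrs:
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return ip            # this manager answers, use it
        except OSError:
            continue                 # manager down, try the other A record
    raise RuntimeError("no redirector reachable behind %s" % alias)

print("using redirector", pick_redirector())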
High-Availability
• Any component can fail at any time without reducing uptime or making data inaccessible
• Maintenance can be done without taking the whole SE offline and without announcing downtime
  • Rolling upgrades, one server at a time
• Real case
  • A server failed on Thursday evening; the failed component was replaced on Monday
  • The VO didn't notice anything
  • Site admins and engineers appreciated that such cases could be handled without an emergency
Adding tape backend
• An MSS backend has been part of vanilla Xrootd since day 1:
  • Policy-based migration daemon
  • Policy-based purging daemon (aka garbage collector)
  • Prestaging daemon
  • Migration and purging queues, with 2-level priority
  • On-demand stage-in while the user's job waits on file open
  • Bulk "bring online" requests
  • Asynchronous notification of completed requests via UDP (Hi Tony!)
• The current "mps" scripts are being rewritten by Andy as the File Residency Manager, "frm"
• Adapting to a site's MSS means writing your own glue scripts, a stat command and a transfer command (sketched below), so the mechanism is completely generic
• GridKa uses TSM and in-house middleware, written by Jos van Wezel, to control the migration and recall queues; the same mechanism is used by dCache
• Worth mentioning: ALICE uses this MSS backend to fetch missing files from any other ALICE site, called vMSS ("virtual MSS"); files are located via a global redirector
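A hedged sketch of the site glue just mentioned. The calling convention (arguments, exit codes, what is printed on stdout) and the /tape-staging path are assumptions made for illustration; the real interface is defined by the xrootd mps/frm scripts and by GridKa's TSM middleware.

#!/usr/bin/env python
# Hedged sketch of the two site-specific glue hooks: a "stat" command and a
# "transfer" command. The argument layout, exit codes and staging path below are
# assumptions; the real contract comes from the mps/frm scripts and the site MSS.
import os
import shutil
import sys

TAPE_STAGING = "/tape-staging"       # assumed path where the TSM layer exposes files

def mss_stat(lfn):
    """Print the size of the archived copy; exit non-zero if it is not on tape."""
    path = os.path.join(TAPE_STAGING, lfn.lstrip("/"))
    try:
        print(os.stat(path).st_size)
        return 0
    except FileNotFoundError:
        return 1

def mss_get(lfn, local_path):
    """Recall one file from the tape staging area into the xrootd/GPFS namespace."""
    shutil.copyfile(os.path.join(TAPE_STAGING, lfn.lstrip("/")), local_path)
    return 0

if __name__ == "__main__":
    cmd = sys.argv[1]                # "stat" or "get"
    sys.exit(mss_stat(sys.argv[2]) if cmd == "stat"
             else mss_get(sys.argv[2], sys.argv[3]))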
Adding tape backend – details
• One of the nodes in a GPFS cluster is connected to the tape drives via SAN
• This node migrates all files from the GPFS cluster
  • Reduces the number of tape mounts
• It recalls all files and distributes them evenly across all GPFS file systems in the cluster (sketched below)
[Diagram: one xrootd server in the cluster is attached to both the disk SAN and the tape SAN; it runs the migration and staging daemons plus TSS and drives the FC tape drives]
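One plausible way to picture the "distributes them evenly" step is to place each recalled file on the GPFS file system with the most free space. The mount points below are placeholders and the actual placement policy lives in GridKa's staging daemons; this is only a sketch of the idea.

# Sketch of the even-distribution step: each recalled file goes to the GPFS
# file system that currently has the most free space. Mount points are placeholders.
import os
import shutil

GPFS_FILESYSTEMS = ["/gpfs/fs1", "/gpfs/fs2", "/gpfs/fs3"]   # assumed mount points

def place_recalled_file(staged_path, lfn):
    target_fs = max(GPFS_FILESYSTEMS, key=lambda fs: shutil.disk_usage(fs).free)
    dest = os.path.join(target_fs, lfn.lstrip("/"))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(staged_path, dest)   # hand the file over to the xrootd namespace
    return dest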
Adding tape backend – relevant to Strawman
• Same setup: one node in the GPFS cluster is connected to the tape drives via SAN, migrates all files from the cluster (reducing tape mounts), and recalls files evenly across the GPFS file systems
• When a file is not on disk, it is looked up both in the ALICE cloud and in the local tape archive
  • A recall from tape can thus be avoided by fetching the file over the network instead
  • Still needs to be tested
  • Subject to VO policy as well
[Diagram: the same tape-attached setup, with the ALICE global redirector as an additional source for missing files]
Adding tape backend – another possibility
• One of the nodes in a GPFS cluster is connected to the tape drives via SAN; it migrates all files from the cluster (reducing tape mounts) and recalls files evenly across the GPFS file systems
• vMSS can be used to migrate/recall files to/from another xrootd cluster over the LAN; the second cluster acts as a disk cache in front of the tape system (sketched below)
  • Straight use of ALICE's vMSS mechanism, no customization
  • Separates the xrootd cluster from the tape system: easy upgrades, no regressions
  • The "stager" cluster interacts with the site's Mass Storage System
  • Lots of optimization is possible with a disk cache in front of tape
  • Stable custom setup, not dependent on the ALICE development cycle
  • Can be reused by another storage system
[Diagram: the main xrootd cluster uses vMSS over root:// on the LAN to an xrootd "stager" cluster, which runs the migration and staging daemons plus TSS and drives the FC tape drives]
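At the bottom, the vMSS-style fetch is just a copy over root:// from the stager cluster (or, via the global redirector, from any other ALICE site) into the local namespace. A hedged sketch with xrdcp follows; the hostname and paths are placeholders, and in production the transfer is triggered by the xrootd MSS hooks rather than by user code.

# Hedged sketch of the vMSS-style fetch: a file missing locally is copied over
# the LAN from the stager cluster. Hostname and paths are placeholders; in the
# real setup the xrootd MSS hooks trigger this, not user code.
import subprocess

STAGER = "root://xrootd-stager.example.org:1094"   # assumed stager redirector

def fetch_missing(lfn, local_path):
    # lfn is an absolute name such as "/alice/raw/2010/run001/file.root"
    subprocess.run(["xrdcp", STAGER + "/" + lfn, local_path], check=True)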
Adding grid functionality
• An SRM interface is needed to meet the WLCG requirement
• Adapted the OSG solution
  • Could not use the OSG releases out of the box, but the components are all the same:
  • Xrootd
  • Globus gridftp + POSIX DSI backend + xrootd posix preload library (sketched below)
  • xrootdfs
  • BeStMan SRM in gateway mode
[Diagram: clients reach the xrootd cluster over root:// directly, over gsiftp:// through gridftp with the posix preload library, and over srm:// through BeStMan on top of xrootdfs]
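The preload-library piece works by LD_PRELOAD-ing the xrootd POSIX wrapper under an unmodified binary (the gridftp server in the real setup; a plain POSIX tool below), so that its open()/read() calls are served by the xrootd cluster. The library path and the XROOTD_VMP mapping syntax shown here are assumptions for illustration; check the xrootd/OSG documentation of the installed version for the exact names.

# Hedged sketch of the posix preload idea: an unmodified POSIX program is started
# with the xrootd preload library, so its file I/O is redirected to the cluster.
# The library path and the XROOTD_VMP mapping below are assumptions, not the
# verified GridKa configuration.
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/usr/lib64/libXrdPosixPreload.so"             # assumed path
env["XROOTD_VMP"] = "alice-xrootd.example.org:1094:/alice=/alice"  # assumed syntax

# A plain POSIX tool now reads the file through xrootd instead of a local disk.
subprocess.run(["cat", "/alice/raw/2010/run001/file.root"],
               env=env, check=True, stdout=subprocess.DEVNULL)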
Xrootd SE at GridKa – details of our own development
• Xrootd distro
  • From CERN (Fabrizio), the official ALICE distro
  • Installed and configured according to the ALICE wiki page, with little or no deviation
• Gridftp
  • Used the VDT rpms
  • Took the gridftp posix lib out of the OSG distro and made an rpm
  • Installed host certificates, CAs and the gridmap file from gLite
  • Made sure the DN used for transfers is mapped to a static account (not the group account .alice)
  • Wrote our own startup script; runs as root
• SRM
  • Got the xrootdfs source from SLAC; made an rpm and our own startup script
  • Installed the fuse libs with yum
  • Got the BeStMan tar file from their web site
  • Both run as root (you don't have to)
  • Published in the BDII manually (static files)
High-Availability and Performance for the SRM service
• Gridftp runs in split-process mode
  • The front-end node is co-located with BeStMan and can run in user space
  • The data node is co-located with an xrootd server and doesn't need a host certificate
  • Allows tuning each host's network settings for low latency vs. high throughput
• BeStMan and the gridftp front-end instances can run under a DNS alias
  • Scalable performance
[Diagram: "SRM admin nodes" (BeStMan + xrootdfs + gridftp front-end) under a DNS alias; gridftp back-ends with the posix preload library run on the xrootd servers and talk root:// to the cluster]
More High-Availability
• All admin nodes are virtual machines
  • xrootd managers
  • SRM admin node (BeStMan + gridftp control); one in production at GridKa, but nothing prevents running two
• KVM on SL5.5
• Shared GPFS for the VM images
• VMs can be restarted on the other hypervisor in case of failure (sketched below)
[Diagram: two VM hypervisors, each hosting an xrootd manager VM and an SRM/gridftp-control VM, next to the xrootd/gridftp-data servers]
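The recovery step on this slide (restart an admin VM on the surviving hypervisor, using the image on shared GPFS) could look roughly like the sketch below. The hostname and domain name are placeholders, and whether this is done by hand or by a watchdog is a site choice that the slide does not specify.

# Hedged sketch: start an admin VM on the standby hypervisor after its original
# host has failed. The libvirt URI, hostname and domain name are placeholders;
# the VM image itself sits on shared GPFS, so either hypervisor can boot it.
import subprocess

STANDBY_HYPERVISOR = "kvm-host2.example.org"   # assumed standby node
ADMIN_VM = "xrootd-manager-vm"                 # assumed libvirt domain name

def restart_on_standby(domain=ADMIN_VM, host=STANDBY_HYPERVISOR):
    uri = "qemu+ssh://root@%s/system" % host
    subprocess.run(["virsh", "--connect", uri, "start", domain], check=True)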
Performance
• No systematic measurements on the deployed infrastructure; previous tests on similar setups:
  • 10G NIC testing for HEPiX
    • 650 MB/s FTS disk-to-disk transfer between Castor and a single test gridftp server with GPFS at GridKa
    • Gridftp transfer on the LAN to the server: 900 MB/s
    • See more in the HEPiX @ Umeå talk
  • ALICE production transfers from CERN using xrd3cp
    • ~450 MB/s into three servers, ~70 TB in two days
  • ALICE analysis, root:// on the LAN
    • Up to 600 MB/s from two older servers
• See also the HEPiX storage group report (Andrei Maslennikov)
Problems
• Grid authorization
  • BeStMan works with GUMS or a plain gridmap file
  • Only static account mapping, no group pool accounts
  • Looking forward to future interoperability between authorization tools: ARGUS, SCAS, GUMS
Future outlook
• Impossible to dedicate my time without serious commitments from real users
• Still needs some more Grid integration, e.g. GIP
• Would it be a good idea to support BeStMan in LCG?
• Any takers with secure national funding for such an SE?
• GridKa Computing School, Sep 6-10, 2010 in Karlsruhe, Germany
  • http://www.kit.edu/gridka-school
  • Xrootd tutorial
• More and better packaging and documentation
Summary
• ALICE is happy
  • Second-largest ALICE SE after CERN, in both allocated and used space
  • 25% of GridKa storage (~2.5 PB)
• Stateless, scalable
• Low maintenance, but a good deal of integration effort
  • SRM frontend and tape backend
• No single point of failure
• Proposal: include Xrootd/BeStMan support in WLCG