Echo is a large Ceph cluster at RAL that provides storage for the LHC experiments. This presentation covers its architecture, capabilities and challenges, including monitoring and performance issues.
Utilising Ceph for large-scale, high-throughput storage to support the LHC experiments • Tom Byrne
What is Echo?
• Echo is a large Ceph cluster based at the Rutherford Appleton Laboratory, an STFC site in Oxfordshire, UK
• It is used for storage of LHC experiment scientific data
• 180 nodes, 5,000 OSDs, 40PB raw capacity
• An extra 1,000 OSDs going in this year
• Data pools are erasure coded (8+3), with 64MB objects at the RADOS level (a sketch follows below)
• ~20PB stored data
• 20GB/s sustained transfer rates
• Currently running Luminous
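As a rough sketch of how an 8+3 erasure-coded pool like Echo's might be set up with the ceph CLI (the profile name, pool name and PG count below are illustrative placeholders, not Echo's actual configuration):

# Define an 8+3 erasure-code profile, then create a data pool that uses it.
# Profile name, pool name and PG counts are placeholders.
ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
ceph osd pool create atlas-data 2048 2048 erasure ec-8-3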
The Worldwide LHC Computing Grid
• RAL is a Tier-1 site for the Worldwide LHC Computing Grid (WLCG)
• The Tier-1 sites combined store a complete copy of the experimental data on tape
• They also have the compute capacity to run large-scale data processing jobs close to the data
• RAL has ~800 worker nodes, 32,000 cores
• Large amounts of disk are needed as working space for active data
• Echo provides all of RAL's disk storage
• 200PB of disk and 500PB of tape capacity across all Tier-1s
A typical LHC experiment's computing workflow (from the point of view of a Tier-1 storage sysadmin)
• Diagram: Tier-1 storage (Echo) holds raw data and Analysis Object Data; Tier-1 compute runs reconstruction (production), Monte Carlo simulation (production, driven by the physics model) and user analysis jobs, reading from and writing to that storage.
What does Echo do for ATLAS?
• Chart: ATLAS delete, write and read activity on Echo.
Grid protocols and Ceph
• Grid protocols are used for WLCG transfers
• For WAN transfers between storage endpoints, a protocol called GridFTP is used
• For jobs analysing data, a protocol called XRootD is used
• The HEP community (CERN, RAL and others) developed plugins for the grid protocol servers on top of libradosstriper
• These provide object store functionality only
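The striped-object layout the plugins produce can also be exercised directly with the rados command-line tool's --striper flag, which goes through libradosstriper rather than the plain RADOS interface; a minimal sketch, with a hypothetical pool and object name:

# Store and retrieve a file as a striped set of RADOS objects.
# Pool and object names are placeholders.
rados --pool atlas-data --striper put run123.root ./run123.root
rados --pool atlas-data --striper get run123.root /tmp/run123.root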
Data Access
• Diagram: on each of ~800 worker nodes, user jobs run in containers with XRootD clients that talk to a local gateway container: an XRootD server with the rados plugin on top of libradosstriper, fronted by a small XRootD cache
• A separate set of ~7 dedicated gateway hosts run XRootD and GridFTP servers (also using the rados plugin and libradosstriper) for transfers to and from other sites
• Both paths read and write the experiment data pools on the Ceph cluster
Gateway setup
• At the scale we need to run at, it is not sustainable to transfer data to worker nodes via the gateway cluster
• We created gateways running in containers on every worker node
• An entry in /etc/hosts directs transfer requests to the local gateway (see the sketch below)
• A small cache on each worker node allows pre-fetching of data
• The cache is aligned with the RADOS object size
• This reduces latency and improves throughput from Echo
• The cache also offers some protection against pathological jobs
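As an illustration of the /etc/hosts redirect, a worker node might map the gateway alias to the loopback address so that XRootD clients talk to the gateway container running on the same host (the alias and address below are hypothetical, not RAL's production names):

# /etc/hosts entry on a worker node (hypothetical alias)
127.0.0.1   echo-gateway.example.ac.uk echo-gateway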
Gateways on compute nodes
• In general, the worker node gateways are working very well
• More stable external transfers with the same gateway hardware
• Much better data rates to the batch farm
• But they have added a few challenges
• Having two infrastructures acting as the same endpoint can lead to interesting issues when they differ
• An external user trying to understand why a job transfer failed will not be able to test the same gateway
• We now have 800+ gateways! Suitable monitoring is now a necessity
Monitoring OSD ops

[root@ceph-adm1 ~]# ceph status
  cluster:
    id:     2336dcf1-b1b1-4c6a-b793-61cfb29f014c
    health: HEALTH_ERR
            76 scrub errors
            Possible data damage: 2 pgs inconsistent
            10 slow requests are blocked > 32 sec. Implicated osds 3107

  services:
    mon: 5 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3,ceph-mon4,ceph-mon5
    mgr: ceph-mon3(active), standbys: ceph-mon2, ceph-mon4, ceph-mon5, ceph-mon1
    osd: 5072 osds: 4946 up, 4946 in
    rgw: 8 daemons active

  data:
    pools:   22 pools, 25088 pgs
    objects: 360.31M objects, 18.5PiB
    usage:   25.5PiB used, 12.1PiB / 37.7PiB avail
    pgs:     24964 active+clean
             70    active+clean+scrubbing+deep
             52    active+clean+scrubbing
             2     active+clean+inconsistent

  io:
    client: 21.2GiB/s rd, 2.42GiB/s wr, 22.60kop/s rd, 7.57kop/s wr

• I felt Echo still lacked monitoring for performance issues at the OSD level
• And in particular, what effect these had on the transfers to the worker nodes
• Some of that is due to Echo's architecture and use case, rather than shortcomings in Ceph
• As Echo grew in size and usage we began to see transient slow OSD ops
• With no concrete understanding of what effect this had on transfers
• dump_ops_in_flight and friends give lots of info, but not long-term trends
• I put together a few scripts to collect and dump slow requests into Elasticsearch (a sketch follows below)
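A minimal sketch of that kind of collection, assuming jq is available, the OSD admin sockets live under /var/run/ceph/, and an Elasticsearch endpoint and index exist at the placeholder names used below (the real scripts differ):

# Dump in-flight ops for each OSD on this storage node and push any that
# have been running for more than 30s into Elasticsearch.
# es.example:9200 and the ceph-slow-ops index are placeholders.
for osd in $(ls /var/run/ceph/ | sed -n 's/^ceph-osd\.\([0-9]\+\)\.asok$/\1/p'); do
  ceph daemon osd.$osd dump_ops_in_flight |
    jq -c --arg osd "$osd" '.ops[] | select(.age > 30) | . + {osd: $osd}' |
    while read -r doc; do
      curl -s -XPOST 'http://es.example:9200/ceph-slow-ops/_doc' \
           -H 'Content-Type: application/json' -d "$doc"
    done
done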
Hot files
• Rare occurrences of popular files having issues
• Contention for locks from hundreds of different worker node gateways
• Probably caused by a misbehaving gateway not releasing locks, rather than a Ceph limitation
• Only affects a tiny fraction of popular files
• The blocked requests are short lived and quickly clear
• Affected XRootD transfers eventually succeed
Slow OSDmap creation
• Slow OSDmap creation blocking ops is a consistent factor in slow requests on Echo
• The cause of slow OSDmap creation has been fixed (thanks Xie!)
• It affects operations in flight on peering PGs
• Peering on Echo is mainly due to unhealthy disks crashing OSDs, and can be mitigated with more zealous disk management (sketched below)
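A hedged sketch of what that disk management can look like, assuming the suspect drive is /dev/sdq and backs osd.1234 (both placeholders): check the drive, drain the OSD before it starts crashing and triggering repeated peering, then remove it once backfill has completed.

# Check the suspect drive's SMART health (device path is a placeholder)
smartctl -H /dev/sdq

# Take the OSD out so its data backfills elsewhere before the daemon crashes
ceph osd out 1234

# Once the cluster is healthy again, confirm it is safe to remove and purge it
ceph osd safe-to-destroy osd.1234
ceph osd purge 1234 --yes-i-really-mean-it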
OSD op monitoring findings
• Similar data collection with random sampling of 'normal' operations helped us spot other issues
• Gateway updates changing default block sizes and degrading single-file transfer speeds
• The number of transfers affected by slow requests in Ceph is a tiny fraction of the total throughput of Echo
• Less than 0.1% of transfers affected
• In general, Ceph is working very well at this scale
Conclusions
• If you have a suitable use case, using libradosstriper directly on top of core Ceph can provide fast, simple and scalable object storage
• Having monitoring appropriate to your cluster is important!
• Sampled OSD operation collection and analysis has been great for providing reassurance that Ceph is working well at this scale
• New tooling in Nautilus will improve monitoring capabilities out of the box
• Ceph is working very well for Echo's use case of supporting the LHC experiments, providing the scaling and performance we need
Thank you • Questions?