Echo is a large Ceph cluster at RAL that provides storage for the LHC experiments. This presentation covers its architecture, capabilities and challenges, including monitoring and performance issues.
Utilising Ceph for large-scale, high-throughput storage to support the LHC experiments • Tom Byrne
What is Echo?
• Echo is a large Ceph cluster based at the Rutherford Appleton Laboratory, an STFC site in Oxfordshire, UK
• It is used for storage of LHC experiment scientific data
• 180 nodes, 5,000 OSDs, 40PB raw capacity
• An extra 1,000 OSDs going in this year
• Data pools are erasure coded (8+3), with 64MB objects at the RADOS level (a sketch follows below)
• ~20PB stored data
• 20GB/s sustained transfer rates
• Currently running Luminous
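As a rough sketch of how an 8+3 erasure-coded pool like Echo's might be set up with the ceph CLI (the profile name, pool name and PG count below are illustrative placeholders, not Echo's actual configuration):

# Define an 8+3 erasure-code profile, then create a data pool that uses it.
# Profile name, pool name and PG counts are placeholders.
ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
ceph osd pool create atlas-data 2048 2048 erasure ec-8-3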
The Worldwide LHC Computing Grid
• RAL is a Tier-1 site for the Worldwide LHC Computing Grid (WLCG)
• The Tier-1 sites combined store a complete copy of the experimental data on tape
• They also have the compute capacity to run large-scale data processing jobs close to the data
• RAL has ~800 worker nodes, 32,000 cores
• Large amounts of disk are needed as working space for active data
• Echo provides all of RAL's disk storage
• 200PB of disk and 500PB of tape capacity across all Tier-1s
A typical LHC experiment's computing workflow (from the point of view of a Tier-1 storage sysadmin)
• Diagram: Tier-1 storage (Echo) holds raw data and Analysis Object Data; Tier-1 compute runs reconstruction (production), Monte Carlo simulation (production, driven by the physics model) and user analysis jobs, reading from and writing to that storage.
What does Echo do for ATLAS?
• Chart: ATLAS delete, write and read activity on Echo.
Grid protocols and Ceph
• Grid protocols are used for WLCG transfers
• For WAN transfers between storage endpoints, a protocol called GridFTP is used
• For jobs analysing data, a protocol called XRootD is used
• The HEP community (CERN, RAL and others) developed plugins for the grid protocol servers on top of libradosstriper
• These provide object store functionality only
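The striped-object layout the plugins produce can also be exercised directly with the rados command-line tool's --striper flag, which goes through libradosstriper rather than the plain RADOS interface; a minimal sketch, with a hypothetical pool and object name:

# Store and retrieve a file as a striped set of RADOS objects.
# Pool and object names are placeholders.
rados --pool atlas-data --striper put run123.root ./run123.root
rados --pool atlas-data --striper get run123.root /tmp/run123.root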
Data Access
• Diagram: on each of ~800 worker nodes, user jobs run in containers with XRootD clients that talk to a local gateway container: an XRootD server with the rados plugin on top of libradosstriper, fronted by a small XRootD cache
• A separate set of ~7 dedicated gateway hosts run XRootD and GridFTP servers (also using the rados plugin and libradosstriper) for transfers to and from other sites
• Both paths read and write the experiment data pools on the Ceph cluster
Gateway setup
• At the scale we need to run at, it is not sustainable to transfer data to worker nodes via the gateway cluster
• We created gateways running in containers on every worker node
• An entry in /etc/hosts directs transfer requests to the local gateway (see the sketch below)
• A small cache on each worker node allows pre-fetching of data
• The cache is aligned with the RADOS object size
• This reduces latency and improves throughput from Echo
• The cache also offers some protection against pathological jobs
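As an illustration of the /etc/hosts redirect, a worker node might map the gateway alias to the loopback address so that XRootD clients talk to the gateway container running on the same host (the alias and address below are hypothetical, not RAL's production names):

# /etc/hosts entry on a worker node (hypothetical alias)
127.0.0.1   echo-gateway.example.ac.uk echo-gateway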
Gateways on compute nodes
• In general, the worker node gateways are working very well
• More stable external transfers with the same gateway hardware
• Much better data rates to the batch farm
• But they have added a few challenges
• Having two infrastructures acting as the same endpoint can lead to interesting issues when they differ
• An external user trying to understand why a job transfer failed will not be able to test the same gateway
• We now have 800+ gateways! Suitable monitoring is now a necessity
Monitoring OSD ops

[root@ceph-adm1 ~]# ceph status
  cluster:
    id:     2336dcf1-b1b1-4c6a-b793-61cfb29f014c
    health: HEALTH_ERR
            76 scrub errors
            Possible data damage: 2 pgs inconsistent
            10 slow requests are blocked > 32 sec. Implicated osds 3107

  services:
    mon: 5 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3,ceph-mon4,ceph-mon5
    mgr: ceph-mon3(active), standbys: ceph-mon2, ceph-mon4, ceph-mon5, ceph-mon1
    osd: 5072 osds: 4946 up, 4946 in
    rgw: 8 daemons active

  data:
    pools:   22 pools, 25088 pgs
    objects: 360.31M objects, 18.5PiB
    usage:   25.5PiB used, 12.1PiB / 37.7PiB avail
    pgs:     24964 active+clean
             70    active+clean+scrubbing+deep
             52    active+clean+scrubbing
             2     active+clean+inconsistent

  io:
    client: 21.2GiB/s rd, 2.42GiB/s wr, 22.60kop/s rd, 7.57kop/s wr

• I felt Echo still lacked monitoring for performance issues at the OSD level
• And in particular, what effect these had on the transfers to the worker nodes
• Some of that is due to Echo's architecture and use case, rather than shortcomings in Ceph
• As Echo grew in size and usage we began to see transient slow OSD ops
• With no concrete understanding of what effect this had on transfers
• dump_ops_in_flight and friends give lots of info, but not long-term trends
• I put together a few scripts to collect and dump slow requests into Elasticsearch (a sketch follows below)
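A minimal sketch of that kind of collection, assuming jq is available, the OSD admin sockets live under /var/run/ceph/, and an Elasticsearch endpoint and index exist at the placeholder names used below (the real scripts differ):

# Dump in-flight ops for each OSD on this storage node and push any that
# have been running for more than 30s into Elasticsearch.
# es.example:9200 and the ceph-slow-ops index are placeholders.
for osd in $(ls /var/run/ceph/ | sed -n 's/^ceph-osd\.\([0-9]\+\)\.asok$/\1/p'); do
  ceph daemon osd.$osd dump_ops_in_flight |
    jq -c --arg osd "$osd" '.ops[] | select(.age > 30) | . + {osd: $osd}' |
    while read -r doc; do
      curl -s -XPOST 'http://es.example:9200/ceph-slow-ops/_doc' \
           -H 'Content-Type: application/json' -d "$doc"
    done
done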
Hot files
• Rare occurrences of popular files having issues
• Contention for locks from hundreds of different worker node gateways
• Probably caused by a misbehaving gateway not releasing locks, rather than a Ceph limitation
• Only affects a tiny fraction of popular files
• The blocked requests are short lived and quickly clear
• Affected XRootD transfers eventually succeed
Slow OSDmap creation
• Slow OSDmap creation blocking ops is a consistent factor in slow requests on Echo
• The cause of slow OSDmap creation has been fixed (thanks Xie!)
• It affects operations in flight on peering PGs
• Peering on Echo is mainly due to unhealthy disks crashing OSDs, and can be mitigated with more zealous disk management (sketched below)
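A hedged sketch of what that disk management can look like, assuming the suspect drive is /dev/sdq and backs osd.1234 (both placeholders): check the drive, drain the OSD before it starts crashing and triggering repeated peering, then remove it once backfill has completed.

# Check the suspect drive's SMART health (device path is a placeholder)
smartctl -H /dev/sdq

# Take the OSD out so its data backfills elsewhere before the daemon crashes
ceph osd out 1234

# Once the cluster is healthy again, confirm it is safe to remove and purge it
ceph osd safe-to-destroy osd.1234
ceph osd purge 1234 --yes-i-really-mean-it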
OSD op monitoring findings
• Similar data collection with random sampling of 'normal' operations helped us spot other issues
• Gateway updates changing default block sizes and degrading single-file transfer speeds
• The number of transfers affected by slow requests in Ceph is a tiny fraction of the total throughput of Echo
• Less than 0.1% of transfers affected
• In general, Ceph is working very well at this scale
Conclusions
• If you have a suitable use case, using libradosstriper directly on top of core Ceph can provide fast, simple and scalable object storage
• Having monitoring appropriate to your cluster is important!
• Sampled OSD operation collection and analysis has been great for providing reassurance that Ceph is working well at this scale
• New tooling in Nautilus will improve monitoring capabilities out of the box
• Ceph is working very well for Echo's use case of supporting the LHC experiments, providing the scaling and performance we need
Thank you • Questions?