
Russian academic institutes participation in WLCG Data Lake project



Presentation Transcript


  1. Russian academic institutes participation in WLCG Data Lake project Andrey Kiryanov, Xavier Espinal, Alexei Klimentov, Andrey Zarochentsev

  2. LHC Run 3 and HL-LHC Run 4 Computing Challenges
  [Chart: projected raw data volume per year, with a "We are here" marker]
  • HL-LHC storage needs are a factor of 10 above what the expected technology evolution and flat funding can provide.
  • We need to optimize storage hardware usage and operational costs.

  3. Motivation
  • The HL-LHC will be a multi-Exabyte challenge where the anticipated storage and compute needs are a factor of ten above what the projected technology evolution and flat funding can deliver.
  • The WLCG community needs to evolve current models to store and manage data more efficiently.
  • Technologies that will address the HL-LHC computing challenges may be applicable for other communities managing large-scale data volumes (SKA, DUNE, CTA, LSST, BELLE-II, JUNO, etc.). Cooperation with these communities is in progress.

  4. Storage software that emerged from the HENP scientific community
  • DPM – WLCG storage solution for small sites (T2s). Initially supported GridFTP+SRM, now undergoing a reincarnation as DOME with HTTP/WebDAV/xrootd support as well. No tapes.
  • dCache – a versatile storage system from DESY for both disks and tapes. Used by many T1s.
  • xrootd – both a protocol and a storage system optimized for physics data access. Can be vastly extended by plug-ins. Used as the basis for the ATLAS FAX and CMS AAA federations.
  • EOS – based on xrootd; adds a smart namespace and many extra features such as automatic redundancy and geo-awareness.
  • DynaFed – designed as a dynamic federation layer on top of HTTP/WebDAV-based storage systems.
  • CASTOR – CERN's tape storage solution, to be replaced by CTA.
  • On top of these, various data management solutions exist: FTS, Rucio, etc.

  5. What is a Data Lake
  • Not another software or storage solution.
  • It is a way of organizing a group of data and computing centers so that they can process data effectively.
  • A scientific community defines the "shape" of its Data Lake, which may be different for different communities.
  • We see the Data Lake model as an evolution of the current infrastructure that brings a reduction in storage costs.

  6. Data Lake

  7. We’re not alone
  • There are several storage-related R&D projects conducted in parallel: Data Carousel, Data Lake, Data Ocean (Google), Data Streaming.
  • All of them are in progress as part of the DOMA and/or IRIS-HEP global R&D for HL-LHC.
  • It is important to develop a coherent solution to address the HL-LHC data challenges and to coordinate the above and future projects.

  8. Requirements for a future WLCG data storage infrastructure
  • Common namespace and interoperability
  • Coexistence of different QoS
  • Geo-awareness
  • File transitioning based on namespace rules
  • File layout flexibility
  • Distributed redundancy
  • Fast access to data, with compensation for latency (>20 ms)
  • File R/W cache
  • Namespace cache

  9. WLCG Data Lake Prototype — EUlake
  • Currently based on EOS
  • Implies xrootd as the primary transfer protocol
  • Other storage technologies and their possible interoperability are also being considered
  • Primary namespace server (MGM) is at CERN
    • Deployment of a secondary namespace server at NRC “KI” is planned
    • Due to the EOS transition from the in-memory namespace to QuarkDB, multi-MGM deployment was unsupported for a while
  • Storage endpoints run a simple EOS filesystem (FST) daemon
    • Deployed at CERN, SARA, NIKHEF, RAL, JINR, PNPI, PIC and UoM
  • perfSONAR endpoints are deployed at participating sites
  • Performance tests (HC) are running continuously
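As a rough illustration of how an EOS-based lake endpoint is exercised over xrootd, the minimal Python sketch below stages a 100 MB file out and back with xrdcp and reports the wall-clock times. The MGM hostname and namespace path are hypothetical placeholders, not taken from the slides; the XRootD client tools and a valid authentication token/proxy are assumed to be in place.

```python
#!/usr/bin/env python3
"""Minimal sketch of a functional probe against an EOS-based lake endpoint."""
import subprocess
import time
import uuid

MGM = "root://eulake.cern.ch"        # hypothetical MGM endpoint
REMOTE_DIR = "/eos/eulake/test"      # hypothetical test directory

def timed_copy(src: str, dst: str) -> float:
    """Run xrdcp (force overwrite) and return the elapsed time in seconds."""
    start = time.monotonic()
    subprocess.run(["xrdcp", "-f", src, dst], check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    local = "/tmp/eulake_probe.dat"
    remote = f"{MGM}/{REMOTE_DIR}/probe-{uuid.uuid4()}.dat"

    # 100 MB test file, matching the size used in the synthetic tests
    with open(local, "wb") as f:
        f.write(b"\0" * 100 * 1024 * 1024)

    print(f"stage-out: {timed_copy(local, remote):.2f} s")
    print(f"stage-in : {timed_copy(remote, '/tmp/eulake_probe_back.dat'):.2f} s")
```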

  10. Russian Federated Storage Project
  • Started in 2015, based on EOS + dCache
  • SPb Region: SPbSU, PNPI
  • Moscow Region: JINR, NRC “KI”, MEPhI, SINP, ITEP
  • External sites: CERN

  11. Russia in EUlake (1)
  • Why?
    • Extensive expertise in deployment and testing of distributed storage systems
    • Russian institutes, including the ones that comprise NRC “KI”, are geographically distributed
    • Network interconnection between Russian sites is constantly improving
    • A similar prototype was successfully deployed on Russian sites (Russian Federated Storage Project)
    • An appealing universal infrastructure may be useful not only for the HL-LHC and HEP, but also for other experiments and fields of science relevant to us (NICA, PIK, XFEL)
  • NRC “KI” equipment for EUlake is located at PNPI in Gatchina
    • 10 Gbps connection, IPv6
    • ~100 TB of block storage
    • Storage and compute endpoints on VMs
  • JINR equipment for EUlake is located in Dubna
    • 10 Gbps connection
    • ~16 TB of block storage
    • Storage endpoints on VMs

  12. Russia in EUlake (2)
  • Manpower (NRC “KI” + JINR + SPbSU)
  • Infrastructure deployment
    • FSTs
    • Hierarchical GeoTags
    • Placement-related attributes
  • Synthetic tests
    • File I/O performance
    • Metadata performance
  • Real-life tests
    • HammerCloud
  • Monitoring

  13. NRC “KI” + JINR international network infrastructure
  [Network diagram: connectivity between NRC “KI”, PNPI and JINR]

  14. 100 Gbps routes

  15. NRC “KI” in EUlake
  [Architecture diagram: clients send metadata requests to the primary head node at CERN and are redirected for data transfers to the storage at PNPI; a secondary head node at PNPI handles replication and fall-back. The PNPI back-end is a reliable, redundant Ceph cluster of 20 disk nodes with 128 TB each, attached via a 2x10 Gbps trunk per node to stacked 10 Gbps switches (160 Gbps stack), with replication over a dedicated fabric; FSTs run on VM hosts with 10 Gbps uplinks. Other participants: JINR, PIC, NIKHEF, RAL, SARA, UoM.]

  16. Highlights
  • Why Ceph?
    • Deploying EOS on physical storage is perfectly suitable for CERN, but the PNPI Data Centre is not a dedicated facility for HEP computing
    • Ceph adds the necessary flexibility in block storage management (we also use it for other purposes such as VM images)
  • Storage configuration
    • We started with Luminous but quickly moved to Mimic; CephFS performance improved significantly in the new release
    • We have four different “types” of Ceph storage exposed to EOS: CephFS with a replicated data pool, CephFS with an Erasure Coded data pool, a block device from a replicated pool, and a block device from an Erasure Coded pool (see the sketch below)
    • Functional and performance tests are ongoing
  • Auxiliary infrastructure
    • Repository with stable EOS releases (the CERN repo changes too fast, sometimes breaking functionality)
    • Web server with a visualization framework and test-result storage
    • Compute nodes for HC tests
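To make the four Ceph back-end flavours more concrete, here is a sketch of how such pools and images could be prepared with the stock ceph/rbd CLI (Mimic-era syntax). Pool names, placement-group counts and image sizes are illustrative placeholders, not the actual EUlake configuration; depending on the release, pools may also need `ceph osd pool application enable` before use.

```python
#!/usr/bin/env python3
"""Sketch: replicated and erasure-coded Ceph back-ends for EOS FSTs."""
import subprocess

COMMANDS = [
    # Erasure-code profile shared by the EC pools (4 data + 2 coding chunks)
    "ceph osd erasure-code-profile set eulake_ec k=4 m=2",

    # CephFS: metadata pool, a replicated data pool and an EC data pool
    "ceph osd pool create cephfs_meta 64 64 replicated",
    "ceph osd pool create cephfs_rep_data 128 128 replicated",
    "ceph osd pool create cephfs_ec_data 128 128 erasure eulake_ec",
    "ceph osd pool set cephfs_ec_data allow_ec_overwrites true",
    "ceph fs new eulake_fs cephfs_meta cephfs_rep_data",
    "ceph fs add_data_pool eulake_fs cephfs_ec_data",

    # Block devices for FSTs: one from a replicated pool and one whose
    # data objects live in a separate EC pool (metadata stay replicated)
    "ceph osd pool create rbd_rep 128 128 replicated",
    "ceph osd pool create rbd_ec 128 128 erasure eulake_ec",
    "ceph osd pool set rbd_ec allow_ec_overwrites true",
    "rbd create --size 102400 --pool rbd_rep fst_rep_img",
    "rbd create --size 102400 --pool rbd_rep --data-pool rbd_ec fst_ec_img",
]

for cmd in COMMANDS:
    print(f"+ {cmd}")
    subprocess.run(cmd.split(), check=True)
```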

  17. Ceph performance measurements
  • Metadata performance of CephFS is much lower than that of a dedicated RBD (this is expected)
  • Block I/O performance is on par, but CPU usage is lower with CephFS
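The slides do not include the measurement procedure, so the following is only a rough micro-benchmark sketch in the same spirit: it compares metadata-heavy work (creating and removing many small files) with streaming block writes on two mount points. The paths are hypothetical: /mnt/cephfs would be a CephFS mount, /mnt/rbd a filesystem on a mapped RBD image.

```python
#!/usr/bin/env python3
"""Rough metadata-vs-streaming micro-benchmark sketch (hypothetical mounts)."""
import os
import time

MOUNTS = {"cephfs": "/mnt/cephfs/bench", "rbd": "/mnt/rbd/bench"}
N_SMALL = 5000                      # empty files for the metadata test
BLOCK = b"\0" * (4 * 1024 * 1024)   # 4 MiB write unit
N_BLOCKS = 256                      # 1 GiB of streaming writes

def metadata_ops_per_s(base: str) -> float:
    """Create and immediately unlink N_SMALL empty files; return ops/s."""
    os.makedirs(base, exist_ok=True)
    start = time.monotonic()
    for i in range(N_SMALL):
        path = os.path.join(base, f"f{i}")
        open(path, "w").close()
        os.unlink(path)
    return N_SMALL / (time.monotonic() - start)

def stream_mb_per_s(base: str) -> float:
    """Write and fsync a 1 GiB file in 4 MiB chunks; return MB/s."""
    os.makedirs(base, exist_ok=True)
    target = os.path.join(base, "big.dat")
    start = time.monotonic()
    with open(target, "wb") as f:
        for _ in range(N_BLOCKS):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.monotonic() - start
    os.unlink(target)
    return (N_BLOCKS * len(BLOCK)) / (1024 * 1024) / elapsed

if __name__ == "__main__":
    for name, base in MOUNTS.items():
        print(f"{name}: {metadata_ops_per_s(base):.0f} create+unlink/s, "
              f"{stream_mb_per_s(base):.1f} MB/s streaming write")
```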

  18. Ultimate goals
  • Evaluate the fusion of local (Ceph) and global (EOS, dCache) storage technologies
    • Figure out the strong and weak points
    • Arrive at a high-performance, flexible, yet easily manageable storage solution for major scientific centers participating in multiple collaborations
    • Further plans for testing converged solutions (Compute + Storage)
  • Evaluate the Data Lake as a storage platform for Russian scientific centers and major experiments (NICA, XFEL, PIK)
    • Possibility to have dedicated storage resources with configurable redundancy in a global system
    • Geolocation awareness and dynamic storage optimization
    • Data relocation & replication with proper use of fast networks
    • Federated system with interoperable storage endpoints based on different solutions with a common transfer protocol

  19. Synthetic file locality tests on EUlake
  • The following combinations of layouts and placement policies were put in place and tested from a single client with geotag RU::PNPI (a minimal sketch of one test cell follows below):
    • Layouts: Plain, Replica (2 stripes), RAIN (4+2 stripes)
    • Placement policies: Gathered, Hybrid, Simple (based on client geotag)
  • Expected results
    • Availability of geo-local replicas should improve file read (stage-in) speed
    • The ability to tie directories to local storage (FSTs) should improve write (stage-out) speed
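The sketch below shows how one cell of this test matrix could be driven: a test directory is tagged with the forced layout and placement-policy attributes named on the slides, then stage-out and stage-in of a 100 MB file are timed with xrdcp. The MGM URL and directory path are hypothetical placeholders; the eos and xrdcp invocations follow the standard CLI syntax, and additional attributes (e.g. a forced stripe count for RAIN) may be needed in a real setup.

```python
#!/usr/bin/env python3
"""Sketch of one locality-test cell: RAIN (raid6) layout gathered in RU."""
import subprocess
import time

MGM = "root://eulake.cern.ch"                      # hypothetical MGM endpoint
TEST_DIR = "/eos/eulake/locality/raid6_gathered"   # hypothetical directory

def run(*args: str) -> None:
    subprocess.run(args, check=True)

def timed(*args: str) -> float:
    start = time.monotonic()
    run(*args)
    return time.monotonic() - start

if __name__ == "__main__":
    # Tag the directory with the attributes from the slides
    # (a real setup may also require sys.forced.nstripes; omitted here)
    run("eos", MGM, "mkdir", "-p", TEST_DIR)
    run("eos", MGM, "attr", "set", "sys.forced.layout=raid6", TEST_DIR)
    run("eos", MGM, "attr", "set",
        "sys.forced.placementpolicy=gathered:RU", TEST_DIR)

    # 100 MB test file, as in the measurements on the following slides
    local = "/tmp/locality_100mb.dat"
    with open(local, "wb") as f:
        f.write(b"\0" * 100 * 1024 * 1024)

    remote = f"{MGM}/{TEST_DIR}/sample.dat"
    print(f"stage-out (write): {timed('xrdcp', '-f', local, remote):.2f} s")
    print(f"stage-in  (read) : {timed('xrdcp', '-f', remote, '/tmp/back.dat'):.2f} s")
```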

  20. 100 MB file I/O performance tests with different layouts and placement policies (1)
  [Read/Write timing plots omitted. Panels compare sys.forced.layout=plain and sys.forced.layout=raid6, each with no sys.forced.placementpolicy (placement driven by the client geotag) and with sys.forced.placementpolicy="gathered:RU".]
  • Replica counts, plain layout: one panel reports CERN::0513 1, CERN::HU 3, ES::PIC 2, RU::Dubna 46, RU::PNPI 48; the other CERN::HU 6, ES::PIC 1, RU::PNPI 93.
  • Replica counts, raid6 layout: one panel reports CERN::0513 71, CERN::HU 129, ES::PIC 100, RU::Dubna 100, RU::PNPI 200; the gathered:RU panel keeps all stripes in Russia: RU::Dubna 400, RU::PNPI 200.

  21. 100 MB file I/O performance tests with different layouts and placement policies (2)
  [Read/Write timing plots omitted. Panels compare sys.forced.layout=replica with no sys.forced.placementpolicy (placement driven by the client geotag), with sys.forced.placementpolicy="gathered:RU" and with sys.forced.placementpolicy="hybrid:RU".]
  • Replica counts for the first two panels: RU::Dubna 100, RU::PNPI 100 in one; CERN::0513 13, CERN::HU 21, ES::PIC 35, RU::Dubna 19, RU::PNPI 112 in the other.
  • Replica counts for the hybrid:RU panel: CERN::HU 14, CERN::9918 20, ES::PIC 31, RU::Dubna 65, RU::PNPI 7.
  • Observations: replica scattering (rebalancing) was seen within a couple of days; reads are always redirected to the closest server; RAIN impacts I/O performance the most.

  22. RU in EOS
  [Diagram: ES sites (ES:SiteA, ES:SiteB) and RU sites (RU:SiteA, RU:SiteB, RU:SiteC) hosting EOS scheduling groups with different QoS: group.X replica 2 + CTA, group.Y plain + CTA, group.Z CTA, group.W replica 2, group.U raid6]

  23. Summary
  • EUlake is currently operational as a proof-of-concept
  • Functional and real-life tests are ongoing
  • Effort is being made to publish various metrics into a centralized monitoring system
  • Expansion of network capacity between major scientific centers in Russia enables efficient data management for future experiments

  24. Future plans
  • Extensive testing of different types of QoS (possibly simulated) with different storage groups
  • Exploit different caching schemes
  • Test automatic data migration
  • Evolve from a simple proof-of-concept to an infrastructure capable of measuring the performance of possible future distributed storage models

  25. Acknowledgements
  • This work was supported by the NRC "Kurchatov Institute" (№ 1608)
  • The authors express their appreciation to the computing centers of NRC "Kurchatov Institute", JINR and other institutes for the provided resources
  Thank you!
