This presentation highlights data management strategies and challenges in High Energy Physics, including accounting, consistency between storage elements and file catalogues, distributed data management, and site cleaning.
Data Management Highlights in TSA3.3 Services for HEP
Fernando Barreiro Megino, Domenico Giordano, Maria Girone, Elisa Lanciotti, Daniele Spiga, on behalf of CERN-IT-ES-VOS and SA3
EGI Technical Forum – Data management highlights, 22.9.2011
Outline
• Introduction: WLCG today
• LHCb Accounting
• Storage Element and File Catalogue consistency
• ATLAS Distributed Data Management: Breaking cloud boundaries
• CMS Popularity and Automatic Site Cleaning
• Conclusions
WLCG today
• 4 experiments: ALICE, ATLAS, CMS, LHCb
• Over 140 sites
• ~150k CPU cores
• >50 PB disk
• A few thousand users
• O(1M) file transfers/day
• O(1M) jobs/day
LHCb Accounting
• An agent generates a daily accounting report from the information available in the bookkeeping system
• Metadata breakdown by:
  • Location
  • Data type
  • Event type
  • File type
• The information is displayed in a dynamic web page
• These reports are currently the main input for clean-up campaigns (a minimal sketch of such an agent follows below)
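To make the metadata breakdown concrete, here is a minimal, hypothetical sketch in Python of a daily accounting agent. The bookkeeping query, record fields and report output are assumptions for illustration only; the real LHCb agent is built on the experiment's bookkeeping tooling and differs in detail.

```python
from collections import defaultdict
from datetime import date

# Metadata dimensions used in the breakdown (from the slide).
FIELDS = ("location", "data_type", "event_type", "file_type")

def fetch_bookkeeping_entries():
    """Placeholder for the real bookkeeping query (assumed interface and fields)."""
    return [
        {"location": "CERN", "data_type": "RAW", "event_type": "90000000",
         "file_type": "RAW", "size": 3_000_000_000},
        {"location": "CNAF", "data_type": "DST", "event_type": "90000000",
         "file_type": "DST", "size": 1_500_000_000},
    ]

def build_report(entries):
    """Aggregate file count and total size per metadata combination."""
    report = defaultdict(lambda: {"files": 0, "bytes": 0})
    for entry in entries:
        key = tuple(entry[f] for f in FIELDS)
        report[key]["files"] += 1
        report[key]["bytes"] += entry["size"]
    return report

def main():
    report = build_report(fetch_bookkeeping_entries())
    print(f"LHCb accounting report, {date.today().isoformat()}")
    for (loc, dtype, etype, ftype), stats in sorted(report.items()):
        print(f"{loc:8s} {dtype:6s} {etype:10s} {ftype:6s} "
              f"{stats['files']:8d} files {stats['bytes'] / 1e12:8.3f} TB")

if __name__ == "__main__":
    main()
```

In practice the output of such a breakdown would feed the dynamic web page and the clean-up campaigns mentioned above.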
Storage element and file catalogue consistency
• Grid Storage Elements (SEs) are decoupled from the File Catalogue (FC), so inconsistencies can arise:
  • Dark data: data in the SE but not in the FC (wastes disk space)
  • Lost/corrupted files: data in the FC but not in the SE (causes operational problems, e.g. failing jobs)
• Dark data is identified through consistency checks using full storage dumps
• A common format and procedure is needed that covers:
  • various SE implementations: DPM, dCache, StoRM and CASTOR
  • three experiments: ATLAS, CMS and LHCb
• Agreed format:
  • Text and XML formats
  • Required information: space token, LFN (or PFN), file size, creation time and checksum
  • Storage dumps should be provided on a weekly/monthly basis or on demand
• A minimal sketch of such a consistency check is given below
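As an illustration, here is a minimal sketch of such a consistency check, assuming both the storage dump and the file catalogue dump have been flattened to a simple text layout (one line per file with LFN, size and checksum). The file names and the exact column layout are assumptions; the production checks in the experiments use their own data management tools and the full agreed dump format.

```python
import csv

def load_dump(path):
    """Load a dump with one 'LFN size checksum' entry per line.

    Returns a dict keyed by LFN. The column layout is an assumption for
    illustration (the agreed format also carries space token and creation time).
    """
    entries = {}
    with open(path, newline="") as f:
        for lfn, size, checksum in csv.reader(f, delimiter=" "):
            entries[lfn] = (int(size), checksum)
    return entries

def compare(se_dump_path, fc_dump_path):
    """Split the two views into dark data, lost files and metadata mismatches."""
    se, fc = load_dump(se_dump_path), load_dump(fc_dump_path)
    dark = sorted(set(se) - set(fc))          # on storage, not in the catalogue
    lost = sorted(set(fc) - set(se))          # in the catalogue, not on storage
    mismatched = sorted(lfn for lfn in set(se) & set(fc) if se[lfn] != fc[lfn])
    return dark, lost, mismatched

if __name__ == "__main__":
    dark, lost, mismatched = compare("storage_dump.txt", "catalogue_dump.txt")
    print(f"dark data: {len(dark)} files")
    print(f"lost/corrupted: {len(lost)} files")
    print(f"size/checksum mismatches: {len(mismatched)} files")
```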
Example of good synchronization: LHCb storage usage at CNAF
• CNAF provides storage dumps daily
• Checks are done centrally with the LHCb Data Management tools
• Good SE-LFC agreement
• Preliminary results: small discrepancies (O(1 TB)) are not a real problem; they can be due to the delay between upload to the SE and registration in the LFC, and to the delay in refreshing the information in the LHCb database
Original data distribution model
• Hierarchical tier organization based on the MONARC network topology, developed over a decade ago
• Sites are grouped into clouds for organizational reasons
• Possible communications:
  • Optical Private Network: T0-T1, T1-T1
  • National networks: intra-cloud T1-T2
• Restricted communications (general public network):
  • Inter-cloud T1-T2
  • Inter-cloud T2-T2
• But network capabilities are not the same anymore, and many use cases require breaking these boundaries (see the sketch of such a link policy after this list)
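To make the restrictions concrete, here is a minimal, hypothetical sketch of the link policy implied by the original model: a transfer is "preferred" if it stays on the optical private network or inside a cloud, and "restricted" otherwise. The tier and cloud assignments are illustrative assumptions, not the real ATLAS topology.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Site:
    name: str
    tier: int      # 0, 1 or 2
    cloud: str     # national/regional cloud the site belongs to

def link_class(src: Site, dst: Site) -> str:
    """Classify a transfer link under the original MONARC-style model."""
    if {src.tier, dst.tier} <= {0, 1}:
        return "preferred (optical private network: T0-T1, T1-T1)"
    if src.cloud == dst.cloud:
        return "preferred (national network: intra-cloud T1-T2)"
    return "restricted (general public network: inter-cloud T1-T2 / T2-T2)"

# Illustrative sites only (assumed names and cloud membership).
cern  = Site("CERN",    0, "T0")
t1_de = Site("FZK",     1, "DE")
t2_de = Site("DESY",    2, "DE")
t2_uk = Site("Glasgow", 2, "UK")

print(link_class(cern, t1_de))   # preferred: optical private network
print(link_class(t1_de, t2_de))  # preferred: intra-cloud
print(link_class(t2_de, t2_uk))  # restricted: inter-cloud T2-T2
```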
Machinery in place
• Purpose: generate full-mesh transfer statistics for monitoring, for site commissioning and to feed back into the system
Consequences
• Link commissioning
  • Sites optimizing their network connections, e.g. the UK experience: http://tinyurl.com/3p23m2p
  • Revealed various network issues, e.g. asymmetric network throughput at several sites (also affecting other experiments)
• Definition of T2Ds ("directly connected T2s")
  • Commissioned sites with good network connectivity
  • These sites benefit from closer transfer policies
• Gradual flattening of the ATLAS Computing Model in order to reduce limitations on:
  • Dynamic data placement
  • Output collection of multi-cloud analysis
• Current development of a generic, detailed FTS monitor
  • FTS servers publish file-level information (CERN-IT-GT)
  • The information is exposed through a generic web interface and API (CERN-IT-ES)
• A sketch of a T2D-style commissioning check based on such transfer statistics follows below
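The following is a minimal sketch, under assumed thresholds, of how full-mesh transfer statistics could be turned into a T2D-style commissioning decision. The threshold values, metric names and the statistics source are hypothetical; the actual ATLAS commissioning criteria are defined by the experiment.

```python
from dataclasses import dataclass

# Hypothetical commissioning thresholds, not the real ATLAS criteria.
MIN_THROUGHPUT_MBPS = 5.0    # average rate on large-file transfers
MIN_SUCCESS_RATE = 0.90      # fraction of successful transfers
MIN_LINKS_OK = 8             # links to T1s outside the site's own cloud

@dataclass
class LinkStats:
    remote_t1: str
    throughput_mbps: float
    success_rate: float

def link_ok(stats: LinkStats) -> bool:
    """A single inter-cloud link passes if both metrics meet the thresholds."""
    return (stats.throughput_mbps >= MIN_THROUGHPUT_MBPS
            and stats.success_rate >= MIN_SUCCESS_RATE)

def is_t2d(full_mesh_stats: list[LinkStats]) -> bool:
    """A T2 qualifies as 'directly connected' if enough T1 links pass."""
    return sum(link_ok(s) for s in full_mesh_stats) >= MIN_LINKS_OK

# Example with made-up measurements for one candidate T2.
stats = [LinkStats(f"T1-{i}", 6.0 + i, 0.95) for i in range(10)]
print("T2D candidate:", is_t2d(stats))
```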
CMS Popularity
• To manage storage more efficiently, it is important to know which data (i.e. which files) is accessed most and what the access patterns are
• The CMS Popularity service now tracks the utilization of 30 PB of files over more than 50 sites
• [Architecture diagram: CRAB (the CMS distributed analysis framework) reports jobs to the Dashboard DB; the Popularity service pulls these records and translates them to file-level entities (input files, input blocks, lumi ranges) in the Popularity DB; the popularity information is exposed through the Popularity web frontend and to external systems (e.g. the cleaning agent)]
• A sketch of this job-to-file translation and aggregation step follows below
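Here is a minimal, hypothetical sketch of the aggregation idea: job-level records, as an analysis dashboard might report them, are translated into per-file access counts. The record fields and the data source are assumptions for illustration; the real service works on the CMS Dashboard and Popularity databases.

```python
from collections import Counter
from datetime import date

# Assumed job-record layout: each finished analysis job lists the files it read.
jobs = [
    {"user": "alice", "finished": date(2011, 9, 20),
     "input_files": ["/store/data/Run2011A/AOD/file1.root",
                     "/store/data/Run2011A/AOD/file2.root"]},
    {"user": "bob", "finished": date(2011, 9, 21),
     "input_files": ["/store/data/Run2011A/AOD/file1.root"]},
]

def file_popularity(job_records):
    """Translate job-level records into per-file access counts and reader sets."""
    accesses = Counter()
    readers = {}
    for job in job_records:
        for lfn in job["input_files"]:
            accesses[lfn] += 1
            readers.setdefault(lfn, set()).add(job["user"])
    return accesses, readers

accesses, readers = file_popularity(jobs)
for lfn, n in accesses.most_common():
    print(f"{n:4d} accesses, {len(readers[lfn])} users  {lfn}")
```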
CMS Popularity Monitoring
Automatic site cleaning
• It is equally important to know which data is not accessed!
• Automatic procedures for site clean-up are handled by Victor, an agent running daily on a dedicated machine:
  1. Selection of groups filling their pledge on T2s (using used and pledged space from the group pledges and PhEDEx)
  2. Selection of unpopular replicas (using replica popularity from the Popularity service and PhEDEx)
  3. Publication of decisions (deletion requests are sent to PhEDEx; deleted replicas and group-site association information are published to the Popularity web frontend)
• The project was initially developed for ATLAS and has now been extended to CMS
• Plug-in architecture: a common core plus experiment-specific plug-ins wrapping the experiments' Data Management API calls (see the sketch below)
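As a sketch of the plug-in idea (class names and method signatures are assumptions, not Victor's actual interfaces): a common core decides what to clean, while experiment-specific plug-ins wrap the data management calls.

```python
from abc import ABC, abstractmethod

class ExperimentPlugin(ABC):
    """Experiment-specific plug-in wrapping data management API calls.
    The method names are illustrative, not Victor's real interface."""

    @abstractmethod
    def used_and_pledged_space(self, site: str, group: str) -> tuple[float, float]:
        ...

    @abstractmethod
    def unpopular_replicas(self, site: str, group: str) -> list[str]:
        ...

    @abstractmethod
    def request_deletion(self, site: str, replicas: list[str]) -> None:
        ...

class CleaningCore:
    """Common core: decide which groups and replicas to clean at a site."""

    def __init__(self, plugin: ExperimentPlugin, threshold: float = 0.9):
        self.plugin = plugin
        self.threshold = threshold  # clean when usage exceeds this fraction of the pledge

    def run(self, site: str, groups: list[str]) -> None:
        for group in groups:
            used, pledged = self.plugin.used_and_pledged_space(site, group)
            if pledged and used / pledged > self.threshold:
                victims = self.plugin.unpopular_replicas(site, group)
                self.plugin.request_deletion(site, victims)

class ToyCMSPlugin(ExperimentPlugin):
    """Toy plug-in with canned data, standing in for PhEDEx/Popularity calls."""
    def used_and_pledged_space(self, site, group):
        return 95.0, 100.0  # TB, made-up numbers
    def unpopular_replicas(self, site, group):
        return ["/store/data/old_dataset_A", "/store/data/old_dataset_B"]
    def request_deletion(self, site, replicas):
        print(f"Would request deletion at {site}: {replicas}")

CleaningCore(ToyCMSPlugin()).run("T2_EXAMPLE", ["analysis-group"])
```

The plug-in boundary is what allows the same core, originally written for ATLAS, to be reused for CMS by swapping in a plug-in that talks to the CMS data management services.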
Conclusions
• The first two years of LHC data taking were successful
• Data volumes and user activity keep increasing
• We are learning how to operate the infrastructure efficiently
• Common challenges for all experiments:
  • Automating daily operations
  • Optimizing the usage of storage and network resources
  • Evolving the computing models
  • Improving data placement strategies