
BNL Data Management

Explore current issues in data management at BNL for the ATLAS experiment, including HTTP problems, AOD replication challenges, and source site complications. Discover solutions and recommendations to streamline data workflows and improve efficiency.

santiagop

Presentation Transcript


  1. BNL Data Management
     Hironori Ito, Xin Zhao, Wensheng Deng

  2. ATLAS DQ2 Diagrams
     [Diagram: how a dataset requested at BNL is located and transferred]
     Components shown: DQ2 Client; DQ2 Central Services (CERN); Mysql/lfc; BNL FTS; Source LRC/LFC; BNL DQ2 Site Services (BNLDISK/BNLTAPE/BNLPANDA); Source Site Storage; Dashboard/PANDA; BNL MyProxy; BNL LRC; BNL dCache.
     Arrow labels: HTTP; callback; 3rd party transfer.
     Messages exchanged, in the order they appear in the diagram:
     • "What datasets are for me?" / "XYZ is for you"
     • "What files in XYZ?" / "xyz1, xyz2, xyz3 are in XYZ"
     • "I want a dataset XYZ at BNL"
     • "Where are source sites for XYZ?" / "They are sites A, B, C."
     • "What are pfn for xyz1, xyz2, xyz3?"
     • "Get xyz1, etc.."
     • "Register xyz1, etc…"
     • "Are you done yet?"

  3. BNL DQ2 Setup
     • BNLPANDA
        • US production
     • BNLDISK
        • AOD replication
     • BNLTAPE
        • Archive
        • Non-PANDA US production data
     • BNLVALID
        • Production software validation
     • BNLTEST
        • Test the new version of DQ2 site service

  4. BNL DQ2 with other services
     [Diagram labels, in source order: BNLPANDA, BNLVALID, BNLTEST; LRC web interface, dms02.usatlas.bnl.gov; BNLPANDA; LRC web interface, grid20.usatlas.org; BNLVALID; PANDA Production Server; BNL LRC, lrc.usatlas.bnl.gov; BNLDISK/BNLTAPE]

  5. US ATLAS T2s
     • US Cloud has 25 DQ2 site services (and 22 LRCs).
     • DQ2 is integrated into the US production system (PANDA).
     [Diagram labels: BU_TEST, OUHEP, OU, UTA-SWT2, UTA, SLAC, SLACXRD, AGLT2, SMU, WISC, IU, MWT2_IU, MWT2_UC, BU, UC_VOB, UC_TP, UC, IU_BC, AGLT2_UM, UTA_TEST1. Group labels: EAST, SLAC, Great Lakes, MWT2, SOUTH T2, DQ2 T3s]

  6. Current DQ2 Usage
     • US production system: 20 to 40 MB/s (in + out from BNL)
     • AOD replication: ~10 MB/s (to BNL)
     • BNL LRC currently has about 5.5 million files.

  7. Monitoring
     • Monitoring
        • DQ2 dashboard
        • The production monitor (PANDA)
        • AOD replication monitor
        • Nagios
           • Monitoring with Nagios using low load-generating traffic is under consideration.
     • How to identify the root cause?
        • Too many possible problems: storage, LRC/LFC, MyProxy, FTS, network, grid, DQ2.
     • How to fix the problem?
        • Some T2s lack the expertise.
        • T1 support is essential.
        • T1 experts do not have access to all T2s.
        • Syslog-ng to consolidate the logs on a central log server.
        • Installing gsissh on the DQ2 site server boxes is under consideration.

  8. Current Issues
     • HTTP (DQ2 client problem at BNL)
        • BNL HTTP proxy problem
           • Timeout: the DQ2 central service is slow, so HTTP requests time out.
           • The new version of the DQ2 central catalog will be faster (hopefully).
           • The timeout in the proxy may need to be changed.
        • Curl problem
           • Shell/http_proxy has a character length limit.
           • Use urllib instead of libcurl/curl.
     • HTTP (DQ2 central service)
        • The fetcher for the DQ2 site services stops working quietly.
     • LRC character limit
        • The web interface uses POOL/SEAL, which has a 250 character limit.
        • A work-around to extend the limit to 512 characters exists (but it has only been tested on certain platforms).
        • Effort to remove the POOL/SEAL dependence: direct calls to mysql.
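The "use urllib instead of libcurl/curl" and timeout points above can be illustrated with a minimal sketch. It uses Python 3's standard `urllib.request` with an explicit timeout and a small retry loop. The function name `fetch`, the default timeout, and the retry count are assumptions made for illustration; they are not taken from the DQ2 client code.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout_s=60.0, retries=3, opener=urllib.request.urlopen):
    """Return the response body, retrying only when the request times out.

    `opener` is injectable so the function can be exercised without a
    network; by default it is urllib.request.urlopen.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        try:
            with opener(url, timeout=timeout_s) as response:
                return response.read()
        except socket.timeout as exc:
            # Timed out while reading the response.
            last_error = exc
        except urllib.error.URLError as exc:
            # urllib wraps connection-phase timeouts in URLError.
            if not isinstance(exc.reason, socket.timeout):
                raise
            last_error = exc
    raise last_error
```

Because the URL is passed as a function argument rather than through a shell command line, the shell and `http_proxy` length limits mentioned in the slide do not come into play in this sketch.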

  9. Current Issues 2
     • AOD replication
        • In general, when a destination does not get 100% of the files in a given dataset, the cause is usually at the source site and not at the destination sites.
        • For example, look at the dataset trig1_misal1_csc11.005980.Pythiazz4l.recon.NTUP.v12000601_tid006335
           • Subscribed on 2007/03/08.
           • RALDISK (srm://ralsrmc.rl.ac.uk) is the source site.
           • BNL has 14 of the 20 files in the dataset. None of the other subscribed/destination T1s got the complete list of files.
           • Checking the RAL LFC for the missing files:
              • srm://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/atlas/dq2/trig1_misal1_csc11/NTUP/trig1_misal1_csc11.005980.Pythiazz4l.recon.NTUP.v12000601_tid006335/trig1_misal1_csc11.005980.Pythiazz4l.recon.NTUP.v12000601_tid006335._00006.root.1
              • srm://gallows.dur.scotgrid.ac.uk/dpm/dur.scotgrid.ac.uk/home/atlas/dq2/trig1_misal1_csc11/NTUP/trig1_misal1_csc11.005980.Pythiazz4l.recon.NTUP.v12000601_tid006335/trig1_misal1_csc11.005980.Pythiazz4l.recon.NTUP.v12000601_tid006335._00011.root.3
              • srm://gw-3.ccc.ucl.ac.uk/dpm/ccc.ucl.ac.uk/home/atlas/dq2/trig1_misal1_csc11/NTUP/trig1_misal1_csc11.005980.Pythiazz4l.recon.NTUP.v12000601_tid006335/trig1_misal1_csc11.005980.Pythiazz4l.recon.NTUP.v12000601_tid006335._00012.root.2
           • The RAL T2s have these files, but RALDISK does not!
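The check described above (which files of a dataset are missing at the destination, and whether their catalogued replicas actually sit on the source site's storage) can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the file lists are assumed to have been obtained from the catalogs by some other means.

```python
from urllib.parse import urlparse


def missing_from_destination(dataset_files, destination_files):
    """Return the dataset's file names that the destination does not hold."""
    return sorted(set(dataset_files) - set(destination_files))


def replicas_off_source(replica_urls, source_host):
    """Return the replica URLs whose SRM host is not the source site's host."""
    return [url for url in replica_urls
            if urlparse(url).hostname != source_host]
```

Applied to the example above, a replica URL on `srm.epcc.ed.ac.uk` would be reported as off-source when the source host is `ralsrmc.rl.ac.uk`, which is the situation the slide describes for RALDISK.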

  10. Current Issues 3
     • The US production system highlighted some desired features that should be included in DQ2.
        • Different queues for different priorities.
           • During the AOD replication with US T2s, it was found that US T2s do not get enough input files for their production jobs, although they get good AOD replication.
           • Reason: the number of files in AOD replication was much larger than the number of US production input files.
           • Quick fix: change/hack the local DQ2 code to use different "shares" in the DQ2 site service ("tier0" vs "default").
        • Easier manageability of the storage location.
           • T2s (with gridftp storage) tend to run out of space, requiring a change of the storage location.
        • DQ2 does not remove completed subscription requests from the subscription list.
           • BNLPANDA accumulates too many unnecessary subscriptions that are already completed. The number can reach 15K dataset subscriptions to BNLPANDA.
           • Fixed(?) in the new version of DQ2.
           • Currently, there is a program to change the share of "real" subscriptions from "default" to "tier0".
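The idea of separate queues per priority can be sketched as a weighted pick across shares, so that a small queue is not starved by a much larger one. The share names "tier0" and "default" come from the slide; the weights, the function name `next_batch`, and which traffic is placed in which share are assumptions for illustration and do not reflect the actual DQ2 site service code.

```python
from collections import deque


def next_batch(queues, weights, batch_size):
    """Pick up to batch_size transfers from per-share FIFO queues.

    Each pass visits every non-empty share and takes up to `weight`
    items from it, so a short queue with a high weight is served
    before a long queue can fill the whole batch.
    """
    batch = []
    while len(batch) < batch_size:
        active = [share for share in queues if queues[share]]
        if not active:
            break
        for share in active:
            quota = max(1, weights.get(share, 1))
            for _ in range(quota):
                if queues[share] and len(batch) < batch_size:
                    batch.append(queues[share].popleft())
    return batch
```

With two files queued under "tier0" (weight 2) and many under "default" (weight 1), a batch of four contains both "tier0" files plus two "default" files, rather than four files from whichever queue happens to be longest.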

  11. Data Integrity and Availability
     • Data corruption
        • dCache bugs
           • File size is 0 in pnfs, but correct on HPSS tape.
              • Not fixed in the dCache code. There is a local program to fix it.
           • File size is correct in pnfs (and HPSS), but 0 in read pools.
              • Not fixed in the dCache code. There is a local program to fix it.
           • File is corrupt on HPSS tape as well as in pnfs and read pools.
              • It was fixed after the developer was notified.
     • File availability
        • Are AOD files in dCache's disk cache? If not, dCache is too slow for users when it has to get files from HPSS.
     • File check
        • Check the existence of the files.
           • Compare the LRC with dCache pnfs/HPSS tape.
        • Check the integrity of the files.
           • Compare the md5sum.
           • Problem: DQ2 site services do not pass the md5sum between LRC and LFC sites (or between LFC and LFC sites). It is only passed between LRC and LRC sites.
           • There is a program that gets the md5sum from the (source) LFC and places it in the LRC.
     • Software to monitor/fix all of these situations is under development.
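The file checks listed above (existence, zero-size files, md5sum comparison) can be sketched for a file that is readable on a local path. This is a minimal sketch under that assumption: the function names and status strings are hypothetical, and how the expected size and md5sum are read from the LRC or LFC is left out.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_file(path, catalog_size, catalog_md5):
    """Compare a file on disk with the size and md5sum from the catalog."""
    if not os.path.exists(path):
        return "missing"
    actual_size = os.path.getsize(path)
    if actual_size == 0 and catalog_size > 0:
        return "zero_size"
    if actual_size != catalog_size:
        return "size_mismatch"
    if md5_of_file(path) != catalog_md5:
        return "md5_mismatch"
    return "ok"
```

The size checks run before the checksum so that the zero-size cases described in the slide are reported without reading the file contents.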
