1 / 9

T2/T3 data mangement tool status (lsm, pcache, ccc)

Charles G Waldman USATLAS/University of Chicago cgw@hep.uchicago.edu ATLAS T2/T3 workshop, UTA, Nov 10-12 2009. T2/T3 data mangement tool status (lsm, pcache, ccc). pcache. Uses extra scratch space on WNs as a cache for input files Fully deployed at UC, IU for over 1 year

wayde
Download Presentation

T2/T3 data mangement tool status (lsm, pcache, ccc)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Charles G Waldman USATLAS/University of Chicago cgw@hep.uchicago.edu ATLAS T2/T3 workshop, UTA, Nov 10-12 2009 T2/T3 data mangement tool status (lsm, pcache, ccc)

  2. pcache • Uses extra scratch space on WNs as a cache for input files • Fully deployed at UC, IU for over 1 year • Reduces load on dCache by ~50% (925168 hits, 945831 misses since 8/1/09) • Tested at RAC with good results • Under testing/evaluation at BNL • http://www.mwt2.org/~cgw/talks/pcache.pdf

  3. Integration of pcache with pilot jobs • BNL: replace dccp with wrapper script • pcache.py dccp.bin -flags INFILE OUTFILE • MWT2: use local site mover (lsm-get) • US sites have not all migrated to lsm, and European sites probably never will • pcache needs to be deployed with pilot • (field has been added to scheddb)

  4. Integration of pcache with Panda • Panda server has information about what jobs have run on what nodes • If site runs pcache, then Panda can assume that files are kept on WN, and use this information for assigning jobs to nodes • pcache needs to update Panda server when files are deleted from cache (http POST with list of GUIDs)

  5. Issues: cache cleanup • Cache cleanup is slow,because of inventory (basically “find” and sort by mtime) • New cleanup code developed with help from Pedro Salgado @ BNL • Maintain a second hierarchy of symlinks, organized by YYYY/MM/DD/HH (mru)

  6. Issues: locking • pcache is single-threaded, but multiple instances may be running, hence some locking is needed • Current lock is global (inefficient) • This led to failures during UAT ('interrupted system call') • Working on more fine-grained locking mechanism

  7. Issues: cache corruption • If a checksum or size mismatch occurs, this will be detected by pilot or lsm-get • Bad file remains in cache! (cache poisoning) • Currently, we clean this up by hand (infrequent) • lsm can handle this, but not all sites will use lsm

  8. Issues: ROOT file access • Files may be opened by ROOT I/O plugin rather than staged by pilot • e.g. Conditions data • This defeats caching and job/node matching • Pcache-aware ROOT plugin? • Can we usefully cache 'scraps'?

  9. Integrity check: CCC • http://www.mwt2.org/sys/ccc • Deployed at MWT2 and AGLT2 • Has found tens of TB of 'dark data' (orphans) • Overdue: versions for xrootd & posix • Also need to develop version for sites with non-local LFC (e.g. T3) • No known bugs, but memory footprint is large (2GB for ~3 million files at UC)

More Related