ATLAS Use and Experience of FTS



  1. ATLAS Use and Experience of FTS (FTS workshop, 16 Nov 05)

  2. Outline
  • Intro to ATLAS DDM
  • How we use FTS
  • SC3 Tier 0 exercise experience
  • Things we like
  • Things we would like

  3. ATLAS DDM System
  [Diagram: files grouped into datasets, which are distributed across sites]
  • Moves from a file-based system to one based on datasets
  • Hides file-level granularity from users
  • A hierarchical structure makes cataloging more manageable
  • However, file-level access is still possible
  • Scalable global data discovery and access via a catalog hierarchy
  • No global physical file replica catalog (but a global dataset replica catalog and a global logical file catalog)
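A minimal sketch of the catalog hierarchy described above (the structures and names here are illustrative assumptions, not the actual DQ2 code): the global catalogs only map datasets to sites and to logical file names, while physical replicas (SURLs) are resolved against a site-local catalog, so no global physical file replica catalog is needed.

    # Illustrative sketch of the DDM catalog hierarchy; structures and names are
    # assumptions for this example, not the real DQ2 implementation.

    # Global catalogs: dataset -> sites holding it, and dataset -> logical file names.
    dataset_replica_catalog = {"esd.0003": ["CERN", "ASGC"]}
    dataset_content_catalog = {"esd.0003": ["esd.0003._5645.1"]}

    # Site-local catalogs hold the physical replicas (SURLs) for that site only.
    site_local_replicas = {
        "CERN": {
            "esd.0003._5645.1": "srm://castorgridsc.cern.ch/castor/cern.ch/grid/"
                                "atlas/ddm_tier0/perm/esd.0003/esd.0003._5645.1",
        },
    }

    def locate_replica(dataset, lfn):
        """Walk the hierarchy: global dataset replica catalog -> candidate sites ->
        site-local replica catalog -> physical SURL."""
        for site in dataset_replica_catalog.get(dataset, []):
            surl = site_local_replicas.get(site, {}).get(lfn)
            if surl:
                return site, surl
        return None

    print(locate_replica("esd.0003", "esd.0003._5645.1"))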

  4. ATLAS DDM System
  [Diagram: Subscriptions in the global catalog: Dataset ‘A’ | Site ‘X’, Dataset ‘B’ (Container) | Site ‘Y’; Site ‘X’ holds File1, File2; Site ‘Y’ holds Data block1, Data block2]
  • As well as catalogs for datasets and locations, we have ‘site services’ to replicate data
  • We use ‘subscriptions’ of datasets to sites, held in a global catalog
  • Site services take care of replica resolution, transfer and registration at the destination site
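To illustrate the subscription mechanism, here is a hedged sketch of a site-services loop. The subscription records mirror the slide; the function names and control flow are assumptions for illustration, not the real site-services code.

    # Illustrative subscription records and site-services loop; an assumption about
    # the mechanism, not the real DQ2 site-services code.
    subscriptions = [
        {"dataset": "A", "site": "X"},   # Dataset 'A' subscribed to Site 'X'
        {"dataset": "B", "site": "Y"},   # Dataset 'B' subscribed to Site 'Y'
    ]
    dataset_contents = {"A": ["file1", "file2"], "B": ["block1", "block2"]}

    def resolve_source_surl(dataset, lfn):
        """Stand-in for replica resolution at a source site (hypothetical URL)."""
        return "srm://source.example.org/%s/%s" % (dataset, lfn)

    def transfer_and_register(src_surl, lfn, site):
        """Stand-in for the FTS transfer plus registration in the destination catalog."""
        print("copy %s -> site %s, then register %s locally" % (src_surl, site, lfn))

    def run_site_services(my_site, local_replicas):
        """For each subscription pointing at this site, find the files still missing
        locally, resolve their source SURLs, transfer them and register them."""
        for sub in subscriptions:
            if sub["site"] != my_site:
                continue
            for lfn in dataset_contents[sub["dataset"]]:
                if lfn not in local_replicas:
                    src = resolve_source_surl(sub["dataset"], lfn)
                    transfer_and_register(src, lfn, my_site)

    run_site_services("X", local_replicas={"file1"})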

  5. Subscription Agents
  File states are kept in a site-local MySQL DB. Each agent acts on files in one state and moves them to the next:
  • Fetcher: finds incomplete datasets → unknownSURL
  • ReplicaResolver: finds the remote SURL → knownSURL
  • MoverPartitioner: assigns Mover agents → assigned
  • Mover: moves the file (uses FTS here!) → toValidate
  • ReplicaVerifier: verifies the local replica → validated
  • BlockVerifier: verifies the whole dataset → done
  This is what runs on the VO Boxes.
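The agent chain is effectively a per-file state machine. A small sketch of the flow (state and agent names are taken from the slide; the dispatch table itself is only an illustration, not the real agent code):

    # Per-file state machine driven by the subscription agents. State and agent names
    # come from the slide; this dispatch table is only an illustration of the flow.
    # The Fetcher seeds the chain by finding incomplete datasets and putting their
    # missing files into the 'unknownSURL' state.
    AGENT_CHAIN = {
        # state       : (agent responsible,   state on success)
        "unknownSURL" : ("ReplicaResolver",   "knownSURL"),   # find the remote SURL
        "knownSURL"   : ("MoverPartitioner",  "assigned"),    # assign a Mover agent
        "assigned"    : ("Mover",             "toValidate"),  # move the file (FTS is used here)
        "toValidate"  : ("ReplicaVerifier",   "validated"),   # verify the local replica
        "validated"   : ("BlockVerifier",     "done"),        # verify the whole dataset
    }

    def next_step(state):
        """Return (agent, next state) for a file in the given state; a failed transfer
        is instead put back to 'unknownSURL' to go through the chain again."""
        return AGENT_CHAIN.get(state, (None, state))

    state = "unknownSURL"
    while state != "done":
        agent, state = next_step(state)
        print("%s ran, file is now '%s'" % (agent, state))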

  6. Within the Mover agent
  • The Python Mover agent reads in an XML file catalog of source files to copy
  • The destination file name is based on the SRM endpoint + dataset name + source filename
    <File ID="bc340aff-4057-4dcc-98aa-204432c4bb07">
      <physical>
        <pfn filetype="" name="srm://castorgridsc.cern.ch/castor/cern.ch/grid/atlas/ddm_tier0/perm/esd.0003/esd.0003._5645.1"/>
      </physical>
      <logical/>
      <metadata att_name="destination" att_value="http://vobox.grid.sinica.edu.tw:8000/dq2//esd.0003"/>
      <metadata att_name="fsize" att_value="500000000"/>
      <metadata att_name="md5sum" att_value=""/>
    </File>
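A hedged sketch of the catalog handling: reading such a POOL-style XML entry and building the destination name from SRM endpoint + dataset name + source filename. The parsing code and the exact way the pieces are joined are assumptions for illustration, not the actual Mover agent code.

    import os
    import xml.etree.ElementTree as ET

    def build_transfer_list(catalog_path, dest_srm_endpoint, dataset):
        """Parse a POOL-style XML file catalog and return (source SURL, destination
        SURL, size) tuples. The destination is assumed to be
        <SRM endpoint>/<dataset name>/<source filename>."""
        pairs = []
        root = ET.parse(catalog_path).getroot()
        for f in root.iter("File"):
            src_surl = f.find("physical/pfn").get("name")
            filename = os.path.basename(src_surl)
            dst_surl = "%s/%s/%s" % (dest_srm_endpoint.rstrip("/"), dataset, filename)
            fsize = f.find("metadata[@att_name='fsize']").get("att_value")
            pairs.append((src_surl, dst_surl, int(fsize)))
        return pairs

    # Example usage (hypothetical destination endpoint):
    # build_transfer_list("catalog.xml", "srm://srm.grid.sinica.edu.tw/pnfs/atlas", "esd.0003")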

  7. Within the Mover agent
  • We create a file of source and destination SURLs and submit the bulk job to FTS (using the CLI via the Python commands module)
  • We then query every x seconds using glite-transfer-status to see if the status changes
    • ‘Done’: mark all files as successfully copied
    • ‘Hold’, ‘Failed’: some or all files failed, so look through the output for successes and failures
  • In the case of a failed file:
    • The file is put back into the ‘unknownSURL’ state and goes again through the chain of agents (max 5 times x 3 FTS retries = 15 retries overall)
  • Successful files:
    • The destination file is validated by using SRM commands directly (getFileMetaData) to compare the file size with the source catalog file size
    • We would like to know if this stage is really necessary or if FTS already does it (or will in future?) (more later…)
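A hedged sketch of the submit/poll/verify cycle just described. It uses subprocess in place of the (Python 2 only) commands module, and both the glite-transfer-* command-line options and the output parsing are illustrative assumptions rather than verified FTS CLI usage; the point is the flow: write a bulk file, submit, poll until a terminal state, put failed files back to 'unknownSURL', and size-check successful ones via SRM.

    import subprocess
    import time

    FTS_SERVICE = "https://fts.example.org:8443/FileTransfer"   # placeholder endpoint

    def run(cmd):
        """Run a CLI command and return (exit code, combined output), roughly what
        the old commands module gave us."""
        p = subprocess.run(cmd, capture_output=True, text=True)
        return p.returncode, p.stdout + p.stderr

    def submit_bulk(pairs, bulk_path="bulk.txt"):
        """Write 'source destination' pairs to a file and submit one bulk FTS job.
        The -f bulk-file option is an assumption; check the FTS CLI documentation."""
        with open(bulk_path, "w") as f:
            for src, dst, _size in pairs:
                f.write("%s %s\n" % (src, dst))
        rc, out = run(["glite-transfer-submit", "-s", FTS_SERVICE, "-f", bulk_path])
        if rc != 0:
            raise RuntimeError("submission failed: " + out)
        return out.strip()                       # job id printed on stdout

    def wait_for_job(job_id, interval=60):
        """Poll glite-transfer-status every `interval` seconds until a terminal state."""
        while True:
            _rc, out = run(["glite-transfer-status", "-s", FTS_SERVICE, job_id])
            status = out.strip()
            if status in ("Done", "Hold", "Failed"):
                return status
            time.sleep(interval)

    def verify_destination(dst_surl, expected_size):
        """Size check of the destination replica via SRM metadata; the srm-get-metadata
        call and its output handling are placeholders for the real getFileMetaData step."""
        _rc, out = run(["srm-get-metadata", dst_surl])
        return str(expected_size) in out         # crude check, illustrative only

    def process_bulk(pairs):
        """Submit, wait, then split files into validated and to-retry. On 'Hold' or
        'Failed' the real Mover walks the per-file listing to separate successes from
        failures; here we simply retry everything for brevity. Retried files go back
        to 'unknownSURL' (max 5 passes through the chain x 3 FTS retries)."""
        status = wait_for_job(submit_bulk(pairs))
        validated, retry = [], []
        for src, dst, size in pairs:
            if status == "Done" and verify_destination(dst, size):
                validated.append((src, dst))
            else:
                retry.append((src, dst))          # back to the 'unknownSURL' state
        return validated, retry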

  8. Using FTS within SC3
  • ATLAS’ SC3 is a Tier 0 exercise in which we produce RAW data at CERN and replicate reconstructed data to Tier 1 sites (using FTS!)
  • We started officially on 2nd Nov, so we have been running for ~2 weeks now
  • This followed ~1 month of small-scale testing using the FTS pilot service, which was very useful for testing the integration of FTS and debugging site problems with SRM paths etc.

  9. Results so far 1 - 7 Nov

  10. Results so far (9 - 15 Nov) • Put latest plots here

  11. What worked well
  • The service is very reliable
    • Virtually no failures connecting to the service (apart from when CERN had an unstable network)
    • 99.9% of failures are problems with sites/humans
  • It hasn’t lost any of our jobs’ information
  • The interface is friendly and self-explanatory
  • The throughput rate is fast enough, but we haven’t really stressed it so far
  • Response to reported errors is good (fts-support)

  12. What we would like
  • Staging from tape
    • In theory this is not a problem for us in SC3, but it will be in the future
    • We would like FTS to deal with staging from tape properly (rather than giving SRM get timeouts), to have a ‘staging’ status, and perhaps to let us query through FTS whether files are on tape or disk
  • Integration with replica catalogs
    • We use LFC (LCG) and Oracle/Globus RLS through the POOL FC interface (OSG)
    • So we can say “move LFN x from site y to site z” and FTS calls a service that takes care of resolution and registration
  • Bandwidth monitoring within FTS
  • Error reporting
    • Email lists again… we would like to know who to tell in case of an error. Can you give a hint based on the error?

  13. What we would like
  • TierX to TierY transfers handled by the network fabric, so channels between all sites should exist
  • Support for priorities, with the possibility to do late reshuffling
  • Plugins to allow interactions with experiments’ services. Examples of plugins, or experiment-specific services:
    • Catalog interactions (not exclusively grid catalogs)
    • Plugins to zip files on the fly (transparently to users, but very good for MSS), after the transfer starts and/or before the files are stored on storage
    • One idea is for FTS to provide a callback? We must understand the VO agents framework and what can be done with it!
  • Reliability: keep retrying until told to stop
    • But allow real-time monitoring of transfer errors (parseable errors preferable) so that we can reshuffle transfers, cancel them, etc.
    • Signal conditions such as source missing, destination down, etc.

  14. Some Questions (maybe already answered today!)
  • We would like to understand how to optimise (number of files per bulk job, etc.)
  • Do you distinguish between permanent errors (channel doesn’t exist) and temporary errors (SRM timeout)?
    • I.e. not retrying permanent errors, and is there a way to report this to us so we don’t retry either?
  • Do we need our own verification stage, or are we just repeating what FTS does?
  • ‘Duration’: is this the time from submission to completion, or the ‘Active’ time?

  15. Conclusion
  • We are happy with the FTS service so far; it has given us some good results
  • But we haven’t tested it until it breaks!
  • Probably the most reliable part of SC3 in our experience
  • We would like to see it integrated with more components to reduce our workload (staging, catalogs)
  • We look forward to further developments!
