
Management of User Requested Data in US ATLAS


Presentation Transcript


1. Management of User Requested Data in US ATLAS. Armen Vartapetian, University of Texas, Arlington. US ATLAS Distributed Facility Workshop, UC Santa Cruz, November 14, 2012

2. Outline
• User Analysis Output
• Central Deletion Service
• Victor
• USERDISK cleanup
• Monitoring and Notifications
• DaTRI
• LOCALGROUPDISK policy

3. Storing User Analysis Output
• User analysis output in the US is stored in the USERDISK of the site where the job ran
• Only US sites have USERDISKs; at non-US sites the output destination is SCRATCHDISK
• The US has a specific policy for USERDISK maintenance/cleanup that is more relaxed and user-friendly than the one for SCRATCHDISK (details later)
• Both space tokens are temporary storage, but users can subscribe their data to other locations using the DaTRI request system (details later)
• The typical destination of a DaTRI request is LOCALGROUPDISK or GROUPDISK for longer-term storage, or even SCRATCHDISK for further temporary storage
• Datasets in LOCALGROUPDISK or GROUPDISK have no lifetime limit by default, so these space tokens (unlike some others) are not cleaned up on a regular basis

4. Central Deletion Service
• Cleanup of all space tokens is carried out through the central deletion service
• The most basic way to submit a dataset for deletion is the command: dq2-delete-replicas <dataset> <space-token>
• The command submits the dataset deletion to the central deletion service, which queues it right away (a minimal usage sketch follows this slide)
• The deletion service flow for a dataset is: ToDelete -> Waiting -> Resolved -> Queued -> Deleted. The same ToDelete -> Deleted progression is shown for the file count and for the space, and any errors are shown as well
• Currently the typical deletion rate for US sites is 2-4 Hz for the T2s and 7-8 Hz for the T1
• The deletion rate can be changed/optimized by tweaking site-specific parameters in the deletion service configuration file
• Load, bottlenecks and other SRM issues can create timeouts, reduce the deletion rate and cause errors
• If a site has more than 100 errors in 4 hours, the ADCoS shifter must file a GGUS ticket
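A minimal sketch of how a batch of user datasets might be queued for deletion with the dq2-delete-replicas command quoted above. The helper name, the dataset names and the space token are hypothetical, and the sketch assumes the DQ2 client tools are set up in the environment.

```python
import subprocess

def queue_for_deletion(datasets, space_token):
    """Submit each dataset to the central deletion service via dq2-delete-replicas."""
    for ds in datasets:
        # dq2-delete-replicas <dataset> <space-token> queues the replica deletion;
        # the service then walks it through ToDelete -> Waiting -> Resolved -> Queued -> Deleted
        subprocess.run(["dq2-delete-replicas", ds, space_token], check=True)

# Hypothetical example: queue two user datasets on a site's USERDISK token
queue_for_deletion(
    ["user.jdoe.analysis_output_v1/", "user.jdoe.analysis_output_v2/"],
    "MWT2_UC_USERDISK",
)
```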

5. Cleanup Decision - Victor
• Daily monitoring of the space tokens, to detect low free space and trigger cleanup, is done by the system called Victor
• Victor takes care of only those space tokens that need regular cleanup
• It prepares a list of datasets to be sent to the central deletion service; a grace period of 1 day is applied
• SCRATCHDISK – cleanup is triggered when free space is <50%; the oldest replicas (older than 15 days) are selected for deletion; target free space >55%
• DATADISK – cleanup is triggered when free space gets low; only "secondary" datasets older than 15 days are selected, and dataset popularity is taken into account (thresholds are sketched after this slide)
  • for T2s cleanup is triggered when free space <10%, with target >15%
  • for the T1 cleanup is triggered when free space <500 TB, with target >750 TB
• PRODDISK – cleanup is triggered when free space <10 TB, with target free space >12 TB; only datasets older than 31 days are selected; pandamover files also have to be cleaned up, which is done locally
• GROUPDISK – cleanup is defined by the person responsible for the group
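As an illustration of the trigger/target rules listed above, here is a hedged sketch of the threshold logic. The numbers come from the slide, but the function, its arguments and the token handling are simplifying assumptions, not Victor's actual code or configuration.

```python
# Illustrative encoding of the Victor trigger/target rules from the slide above.
# Thresholds are either fractions of total capacity or absolute TB values.

def needs_cleanup(token, free_tb, total_tb, is_t1=False):
    """Return (trigger, target_free_tb) for the space tokens Victor manages."""
    if token == "SCRATCHDISK":
        return free_tb < 0.50 * total_tb, 0.55 * total_tb
    if token == "DATADISK":
        if is_t1:
            return free_tb < 500.0, 750.0
        return free_tb < 0.10 * total_tb, 0.15 * total_tb
    if token == "PRODDISK":
        return free_tb < 10.0, 12.0
    # USERDISK, GROUPDISK and LOCALGROUPDISK are not cleaned by Victor
    return False, free_tb

# Example: a T2 DATADISK with 80 TB free out of 1000 TB triggers cleanup
print(needs_cleanup("DATADISK", free_tb=80.0, total_tb=1000.0))  # (True, 150.0)
```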

6. USERDISK Cleanup
• The USERDISK cleanup is done on average every 2 months
• We target datasets older than 2 months
• Targeted user datasets are matched with the dataset owner DN from the dq2 catalog, and dataset lists per DN are created (see the sketch after this slide)
• A notification email is sent to users about the upcoming cleanup of their datasets, with a link to the list and some basic information on how to proceed if a dataset is still needed
• We maintain and use a list of DN to email address associations, and regularly take care of missing/obsolete emails
• After the notification email the users have 10 days to save the data they need
• This cleanup procedure has been in use for the last 4 years
• Very smooth operation, no complaints, users happy
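The selection and grouping step could look roughly like the sketch below. The input records, the DN-to-email map and the 60-day cutoff are placeholders standing in for whatever catalog queries and bookkeeping the actual cleanup scripts use.

```python
from collections import defaultdict
from datetime import datetime, timedelta

CUTOFF = datetime.utcnow() - timedelta(days=60)  # "older than 2 months"

def build_cleanup_lists(userdisk_datasets, dn_to_email):
    """Group old USERDISK datasets by owner DN and pair each list with an email address.

    userdisk_datasets: iterable of (dataset_name, owner_dn, creation_datetime)
    dn_to_email: dict mapping owner DN -> email address (may have gaps)
    """
    per_dn = defaultdict(list)
    for name, owner_dn, created in userdisk_datasets:
        if created < CUTOFF:
            per_dn[owner_dn].append(name)
    # Owners whose email is missing/obsolete (None here) have to be followed up by hand;
    # the others get the notification and then have 10 days to save their data
    return {dn: (dn_to_email.get(dn), sorted(datasets))
            for dn, datasets in per_dn.items()}
```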

7. USERDISK Cleanup Notification
• The question is whether the user is well informed about all available options to save the data targeted for deletion
• Excerpt from the notification email with the information for users:
You are advised to save any dataset which is still of interest to your private storage area. You may also use your local group disk storage area xxx_LOCALGROUPDISK if such an area has been defined. Please contact the person responsible for disk storage at your local T1/T2/T3 for further assistance. If the list contains datasets of common interest to a particular physics group, please contact that group's representative to move your datasets to the xxx_ATLASGROUPDISK area.
If you are going to copy your dataset to xxx_LOCALGROUPDISK or xxx_ATLASGROUPDISK, please use the Subscription Request page: http://panda.cern.ch:25980/server/pandamon/query?mode=ddm_req
If you are going to copy your dataset to any private storage area (not known to the grid), please use dq2-get. See this link for help: https://twiki.cern.ch/twiki/bin/view/Atlas/DQ2ClientsHowTo
• This should cover all the practical options…

8. Storage Monitoring, Notifications
• Storage monitoring from the DDM group: http://bourricot.cern.ch/dq2/accounting/site_reports/USASITES/
• Drop-down menus provide other storage tables and plots, grouped by space token, cloud, etc.
• Notifications are also sent with the list of space tokens that are running low on free space, and whenever a space token runs out of space (<0.5 TB) and is blacklisted
• Notification thresholds (encoded in the sketch after this slide):
  • T1 DATADISK < 10 TB
  • T2 DATADISK < 2 TB
  • PRODDISK < 20%
  • USERDISK < 10%
  • Others < 10 TB
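A small sketch of how the notification thresholds above might be applied; note that some are absolute sizes and some are fractions of the token's capacity. The function and its arguments are illustrative assumptions, not the actual monitoring code.

```python
# Illustrative encoding of the notification thresholds listed above.
def low_space_warning(token, tier, free_tb, total_tb):
    """Return True if the space token should appear in the low-space notification."""
    if free_tb < 0.5:
        return True  # about to be blacklisted: less than 0.5 TB free
    if token == "DATADISK":
        return free_tb < (10.0 if tier == "T1" else 2.0)
    if token == "PRODDISK":
        return free_tb < 0.20 * total_tb
    if token == "USERDISK":
        return free_tb < 0.10 * total_tb
    return free_tb < 10.0  # all other tokens

# Example: a T2 DATADISK with 1.5 TB free triggers a notification
print(low_space_warning("DATADISK", "T2", free_tb=1.5, total_tb=400.0))  # True
```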

9. DaTRI
• Data Transfer Request Interface (DaTRI) – used to submit transfer requests; it also provides monitoring of the transfer status
• A request can be placed via the web interface, or automatically as the output destination of an analysis job
• All the links are available in the left bar of the Panda Monitor page, under the Datasets Distribution drop-down menu
• Users need to be registered within DaTRI. The registration link is on the main page, along with a link to check your registration status. If you are not sure, use the opportunity to check that your certificate has the usatlas role
• For a DaTRI request on the web interface you basically fill in the dataset pattern, the destination, and a justification for the transfer

10. DaTRI
• A submitted DaTRI request goes through the following states/stages: PENDING -> AWAITING_APPROVAL -> AWAITING_SUBSCRIPTION -> SUBSCRIBED -> TRANSFER -> DONE
• Once the request is scheduled for approval, a request ID is assigned
• An error message is returned if the dataset pattern is not correct, the dataset is empty, the destination site does not have enough space, the group quota at the destination site is exceeded, etc.
• Each cloud has DaTRI coordinators for manual approval; in the US these are Kaushik De and Armen Vartapetian
• Approval for GROUPDISKs is done by the group representatives
• Approval is automatic if the total size is <0.5 TB, and only if the user has the usatlas role (a very common source of problems); the routing is sketched after this slide
• The monitoring also provides a link to the dashboard, as well as the replica status for each dataset
• There is a plan to provide functionality within the DaTRI web interface to upload a list/pattern of user datasets for deletion, to help users get rid of obsolete data
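A hedged sketch of the approval routing described above, assuming the rules as stated on the slide: group representatives approve requests to GROUPDISK, requests under 0.5 TB from users with the usatlas role are approved automatically, and everything else goes to the US cloud coordinators. Function and argument names are illustrative.

```python
# Sketch of the DaTRI approval routing described above; names are illustrative.
AUTO_APPROVAL_LIMIT_TB = 0.5

def route_request(total_size_tb, has_usatlas_role, destination_token):
    """Decide how a DaTRI request would be routed for approval."""
    if destination_token.endswith("GROUPDISK") and "LOCAL" not in destination_token:
        return "manual: group representative"
    if total_size_tb < AUTO_APPROVAL_LIMIT_TB and has_usatlas_role:
        return "automatic approval"
    # A missing usatlas role is noted on the slide as a very common problem
    return "manual: US cloud DaTRI coordinators"

print(route_request(0.2, has_usatlas_role=False, destination_token="SLACXRD_LOCALGROUPDISK"))
# -> 'manual: US cloud DaTRI coordinators'
```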

11. LOCALGROUPDISK Policy
• Intended as long-term storage for users
• An unpledged resource (the main concern is T1/T2)
• No ADC policy or recommendations for its management
• Central cleaning only for aborted and failed tasks
• The main issue is the absence of a usage and cleanup policy; because of that, there is a tendency to grow in size
• Usage tables for some of the US LOCALGROUPDISKs are in the backup slides
• A common trend is that there are usually 2-3 super-users per site who occupy more than half of the space (there may be a group behind such a user). A dozen top users occupy more than 90% of the space, and there are many more users with much smaller shares
• A similar storage distribution can be seen in other clouds as well
• Part of that data may be more relevant to GROUPDISK or even DATADISK (i.e. move the data to pledged resources)

12. LOCALGROUPDISK Policy
• Some datasets have many replicas, some of them owned by the same top users. The situation will become unsustainable if the number of such top users grows over time
• Some datasets have only a single replica, and a big chunk of those have not been used for a while. A policy/path for their retirement should be put in place
• Popularity analysis may help to identify datasets that may be obsolete and are candidates for retirement
• We may start with a soft space limit of 2-3 TB per user per site (see the sketch after this slide)
• Start asking questions when the usage is above that
• In particular, for datasets not used for N months (1 year?), check whether the user still needs them
• An approval mechanism for sample transfers > N TB (10 TB?): centralized approval and decision on space allocation for big samples
• The LOCALGROUPDISK management policy is currently under discussion at the RAC
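A minimal sketch of the proposed soft-quota check: flag users above roughly 3 TB per site and datasets untouched for about a year. The limits, field names and helper are assumptions drawn from the discussion on the slide, not an agreed policy or existing tool.

```python
from datetime import datetime, timedelta

SOFT_LIMIT_TB = 3.0            # proposed soft limit per user per site (2-3 TB on the slide)
UNUSED_FOR = timedelta(days=365)  # "not used for N months (1 year?)"

def flag_localgroupdisk(usage_by_user, dataset_last_access):
    """Return users over the soft limit and datasets that look like retirement candidates.

    usage_by_user: dict mapping user -> used space in TB on one site's LOCALGROUPDISK
    dataset_last_access: dict mapping dataset name -> last-access datetime (e.g. from popularity data)
    """
    heavy_users = [u for u, tb in usage_by_user.items() if tb > SOFT_LIMIT_TB]
    stale = [ds for ds, last in dataset_last_access.items()
             if datetime.utcnow() - last > UNUSED_FOR]
    return heavy_users, stale
```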

13. BACKUP

  14. BNL localgroupdisk, used space 196TB

  15. SLAC localgroupdisk, used space 355TB

  16. MWT2+ILLINOISHEP localgroupdisk, used space 302TB

  17. AGLT2 localgroupdisk, used space 238TB
