
Management of User Requested Data in US ATLAS


Presentation Transcript


1. Management of User Requested Data in US ATLAS. Armen Vartapetian, University of Texas, Arlington. US ATLAS Distributed Facility Workshop, UC Santa Cruz, November 14, 2012

2. Outline
• User Analysis Output
• Central Deletion Service
• Victor
• USERDISK cleanup
• Monitoring and Notifications
• DaTRI
• LOCALGROUPDISK policy

3. Storing User Analysis Output
• User analysis output in the US is stored in the USERDISK of the site where the job ran
• Only US sites have USERDISKs; at non-US sites the output destination is SCRATCHDISK
• The US has a specific policy for USERDISK maintenance/cleanup that is more relaxed and user-friendly than the one for SCRATCHDISK (details later)
• Both space tokens are temporary storage, but users can subscribe their data to other locations using the DaTRI request system (details later)
• The typical destination of a DaTRI request is LOCALGROUPDISK or GROUPDISK for longer-term storage, or even SCRATCHDISK for further temporary storage
• Datasets in LOCALGROUPDISK or GROUPDISK have no lifetime limit by default, so these space tokens (unlike some others) are not cleaned up on a regular basis

4. Central Deletion Service
• Cleanup of all space tokens is carried out through the central deletion service
• The most basic way to submit a dataset for deletion is the command: dq2-delete-replicas <dataset> <space-token>
• The command submits the dataset deletion to the central deletion service, which queues it right away (a minimal usage sketch follows this slide)
• The deletion service flow for a dataset is: ToDelete -> Waiting -> Resolved -> Queued -> Deleted. The same ToDelete -> Deleted progression is shown for the file count and for the space, and any errors are shown as well
• Currently the typical deletion rate for US sites is 2-4 Hz for the T2s and 7-8 Hz for the T1
• The deletion rate can be changed/optimized by tweaking site-specific parameters in the deletion service configuration file
• Load, bottlenecks and other SRM issues can create timeouts, reduce the deletion rate and cause errors
• If a site has more than 100 errors in 4 hours, the ADCoS shifter must file a GGUS ticket
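A minimal sketch of how a batch of user datasets might be queued for deletion with the dq2-delete-replicas command quoted above. The helper name, the dataset names and the space token are hypothetical, and the sketch assumes the DQ2 client tools are set up in the environment.

```python
import subprocess

def queue_for_deletion(datasets, space_token):
    """Submit each dataset to the central deletion service via dq2-delete-replicas."""
    for ds in datasets:
        # dq2-delete-replicas <dataset> <space-token> queues the replica deletion;
        # the service then walks it through ToDelete -> Waiting -> Resolved -> Queued -> Deleted
        subprocess.run(["dq2-delete-replicas", ds, space_token], check=True)

# Hypothetical example: queue two user datasets on a site's USERDISK token
queue_for_deletion(
    ["user.jdoe.analysis_output_v1/", "user.jdoe.analysis_output_v2/"],
    "MWT2_UC_USERDISK",
)
```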

5. Cleanup Decision - Victor
• Daily monitoring of the space tokens, to detect low free space and trigger cleanup, is done by the system called Victor
• Victor takes care of only those space tokens that need regular cleanup
• It prepares a list of datasets to be sent to the central deletion service; a grace period of 1 day is applied
• SCRATCHDISK – cleanup is triggered when free space is <50%; the oldest replicas (older than 15 days) are selected for deletion; target free space >55%
• DATADISK – cleanup is triggered when free space gets low; only "secondary" datasets older than 15 days are selected, and dataset popularity is taken into account (thresholds are sketched after this slide)
  • for T2s cleanup is triggered when free space <10%, with target >15%
  • for the T1 cleanup is triggered when free space <500 TB, with target >750 TB
• PRODDISK – cleanup is triggered when free space <10 TB, with target free space >12 TB; only datasets older than 31 days are selected; pandamover files also have to be cleaned up, which is done locally
• GROUPDISK – cleanup is defined by the person responsible for the group
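As an illustration of the trigger/target rules listed above, here is a hedged sketch of the threshold logic. The numbers come from the slide, but the function, its arguments and the token handling are simplifying assumptions, not Victor's actual code or configuration.

```python
# Illustrative encoding of the Victor trigger/target rules from the slide above.
# Thresholds are either fractions of total capacity or absolute TB values.

def needs_cleanup(token, free_tb, total_tb, is_t1=False):
    """Return (trigger, target_free_tb) for the space tokens Victor manages."""
    if token == "SCRATCHDISK":
        return free_tb < 0.50 * total_tb, 0.55 * total_tb
    if token == "DATADISK":
        if is_t1:
            return free_tb < 500.0, 750.0
        return free_tb < 0.10 * total_tb, 0.15 * total_tb
    if token == "PRODDISK":
        return free_tb < 10.0, 12.0
    # USERDISK, GROUPDISK and LOCALGROUPDISK are not cleaned by Victor
    return False, free_tb

# Example: a T2 DATADISK with 80 TB free out of 1000 TB triggers cleanup
print(needs_cleanup("DATADISK", free_tb=80.0, total_tb=1000.0))  # (True, 150.0)
```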

6. USERDISK Cleanup
• The USERDISK cleanup is done on average every 2 months
• We target datasets older than 2 months
• Targeted user datasets are matched with the dataset owner DN from the dq2 catalog, and dataset lists per DN are created (see the sketch after this slide)
• A notification email is sent to users about the upcoming cleanup of their datasets, with a link to the list and some basic information on how to proceed if a dataset is still needed
• We maintain and use a list of DN to email address associations, and regularly take care of missing/obsolete emails
• After the notification email the users have 10 days to save the data they need
• This cleanup procedure has been in use for the last 4 years
• Very smooth operation, no complaints, users happy
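The selection and grouping step could look roughly like the sketch below. The input records, the DN-to-email map and the 60-day cutoff are placeholders standing in for whatever catalog queries and bookkeeping the actual cleanup scripts use.

```python
from collections import defaultdict
from datetime import datetime, timedelta

CUTOFF = datetime.utcnow() - timedelta(days=60)  # "older than 2 months"

def build_cleanup_lists(userdisk_datasets, dn_to_email):
    """Group old USERDISK datasets by owner DN and pair each list with an email address.

    userdisk_datasets: iterable of (dataset_name, owner_dn, creation_datetime)
    dn_to_email: dict mapping owner DN -> email address (may have gaps)
    """
    per_dn = defaultdict(list)
    for name, owner_dn, created in userdisk_datasets:
        if created < CUTOFF:
            per_dn[owner_dn].append(name)
    # Owners whose email is missing/obsolete (None here) have to be followed up by hand;
    # the others get the notification and then have 10 days to save their data
    return {dn: (dn_to_email.get(dn), sorted(datasets))
            for dn, datasets in per_dn.items()}
```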

7. USERDISK Cleanup Notification
• The question is whether the user is well informed about all available options to save the data targeted for deletion
• Excerpt from the notification email with the information for users:
You are advised to save any dataset which is still of interest to your private storage area. You may also use your local group disk storage area xxx_LOCALGROUPDISK if such an area has been defined. Please contact the person responsible for disk storage at your local T1/T2/T3 for further assistance. If the list contains datasets of common interest to a particular physics group, please contact that group's representative to move your datasets to the xxx_ATLASGROUPDISK area.
If you are going to copy your dataset to xxx_LOCALGROUPDISK or xxx_ATLASGROUPDISK, please use the Subscription Request page: http://panda.cern.ch:25980/server/pandamon/query?mode=ddm_req
If you are going to copy your dataset to any private storage area (not known to the grid), please use dq2-get. See this link for help: https://twiki.cern.ch/twiki/bin/view/Atlas/DQ2ClientsHowTo
• This should cover all the practical options…

8. Storage Monitoring, Notifications
• Storage monitoring from the DDM group: http://bourricot.cern.ch/dq2/accounting/site_reports/USASITES/
• Drop-down menus provide other storage tables and plots, grouped by space token, cloud, etc.
• Notifications are also sent with the list of space tokens that are running low on free space, and whenever a space token runs out of space (<0.5 TB) and is blacklisted
• Notification thresholds (encoded in the sketch after this slide):
  • T1 DATADISK < 10 TB
  • T2 DATADISK < 2 TB
  • PRODDISK < 20%
  • USERDISK < 10%
  • Others < 10 TB
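A small sketch of how the notification thresholds above might be applied; note that some are absolute sizes and some are fractions of the token's capacity. The function and its arguments are illustrative assumptions, not the actual monitoring code.

```python
# Illustrative encoding of the notification thresholds listed above.
def low_space_warning(token, tier, free_tb, total_tb):
    """Return True if the space token should appear in the low-space notification."""
    if free_tb < 0.5:
        return True  # about to be blacklisted: less than 0.5 TB free
    if token == "DATADISK":
        return free_tb < (10.0 if tier == "T1" else 2.0)
    if token == "PRODDISK":
        return free_tb < 0.20 * total_tb
    if token == "USERDISK":
        return free_tb < 0.10 * total_tb
    return free_tb < 10.0  # all other tokens

# Example: a T2 DATADISK with 1.5 TB free triggers a notification
print(low_space_warning("DATADISK", "T2", free_tb=1.5, total_tb=400.0))  # True
```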

9. DaTRI
• Data Transfer Request Interface (DaTRI) – used to submit transfer requests; it also provides monitoring of the transfer status
• A request can be placed via the web interface, or automatically as the output destination of an analysis job
• All the links are available in the left bar of the Panda Monitor page, under the Datasets Distribution drop-down menu
• Users need to be registered within DaTRI. The registration link is on the main page, along with a link to check your registration status. If you are not sure, use the opportunity to check that your certificate has the usatlas role
• For a DaTRI request on the web interface you basically fill in the dataset pattern, the destination, and a justification for the transfer

10. DaTRI
• A submitted DaTRI request goes through the following states/stages: PENDING -> AWAITING_APPROVAL -> AWAITING_SUBSCRIPTION -> SUBSCRIBED -> TRANSFER -> DONE
• Once the request is scheduled for approval, a request ID is assigned
• An error message is returned if the dataset pattern is not correct, the dataset is empty, the destination site does not have enough space, the group quota at the destination site is exceeded, etc.
• Each cloud has DaTRI coordinators for manual approval; in the US these are Kaushik De and Armen Vartapetian
• Approval for GROUPDISKs is done by the group representatives
• Approval is automatic if the total size is <0.5 TB, and only if the user has the usatlas role (a very common source of problems); the routing is sketched after this slide
• The monitoring also provides a link to the dashboard, as well as the replica status for each dataset
• There is a plan to provide functionality within the DaTRI web interface to upload a list/pattern of user datasets for deletion, to help users get rid of obsolete data
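A hedged sketch of the approval routing described above, assuming the rules as stated on the slide: group representatives approve requests to GROUPDISK, requests under 0.5 TB from users with the usatlas role are approved automatically, and everything else goes to the US cloud coordinators. Function and argument names are illustrative.

```python
# Sketch of the DaTRI approval routing described above; names are illustrative.
AUTO_APPROVAL_LIMIT_TB = 0.5

def route_request(total_size_tb, has_usatlas_role, destination_token):
    """Decide how a DaTRI request would be routed for approval."""
    if destination_token.endswith("GROUPDISK") and "LOCAL" not in destination_token:
        return "manual: group representative"
    if total_size_tb < AUTO_APPROVAL_LIMIT_TB and has_usatlas_role:
        return "automatic approval"
    # A missing usatlas role is noted on the slide as a very common problem
    return "manual: US cloud DaTRI coordinators"

print(route_request(0.2, has_usatlas_role=False, destination_token="SLACXRD_LOCALGROUPDISK"))
# -> 'manual: US cloud DaTRI coordinators'
```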

11. LOCALGROUPDISK Policy
• Intended as long-term storage for users
• An unpledged resource (the main concern is T1/T2)
• No ADC policy or recommendations for its management
• Central cleaning only for aborted and failed tasks
• The main issue is the absence of a usage and cleanup policy; because of that, there is a tendency to grow in size
• Usage tables for some of the US LOCALGROUPDISKs are in the backup slides
• A common trend is that there are usually 2-3 super-users per site who occupy more than half of the space (there may be a group behind such a user). A dozen top users occupy more than 90% of the space, and there are many more users with much smaller shares
• A similar storage distribution can be seen in other clouds as well
• Part of that data may be more relevant to GROUPDISK or even DATADISK (i.e. move the data to pledged resources)

12. LOCALGROUPDISK Policy
• Some datasets have many replicas, some of them owned by the same top users. The situation will become unsustainable if the number of such top users grows over time
• Some datasets have only a single replica, and a big chunk of those have not been used for a while. A policy/path for their retirement should be put in place
• Popularity analysis may help to identify datasets that may be obsolete and are candidates for retirement
• We may start with a soft space limit of 2-3 TB per user per site (see the sketch after this slide)
• Start asking questions when the usage is above that
• In particular, for datasets not used for N months (1 year?), check whether the user still needs them
• An approval mechanism for sample transfers > N TB (10 TB?): centralized approval and decision on space allocation for big samples
• The LOCALGROUPDISK management policy is currently under discussion at the RAC
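A minimal sketch of the proposed soft-quota check: flag users above roughly 3 TB per site and datasets untouched for about a year. The limits, field names and helper are assumptions drawn from the discussion on the slide, not an agreed policy or existing tool.

```python
from datetime import datetime, timedelta

SOFT_LIMIT_TB = 3.0            # proposed soft limit per user per site (2-3 TB on the slide)
UNUSED_FOR = timedelta(days=365)  # "not used for N months (1 year?)"

def flag_localgroupdisk(usage_by_user, dataset_last_access):
    """Return users over the soft limit and datasets that look like retirement candidates.

    usage_by_user: dict mapping user -> used space in TB on one site's LOCALGROUPDISK
    dataset_last_access: dict mapping dataset name -> last-access datetime (e.g. from popularity data)
    """
    heavy_users = [u for u, tb in usage_by_user.items() if tb > SOFT_LIMIT_TB]
    stale = [ds for ds, last in dataset_last_access.items()
             if datetime.utcnow() - last > UNUSED_FOR]
    return heavy_users, stale
```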

13. BACKUP

  14. BNL localgroupdisk, used space 196TB

  15. SLAC localgroupdisk, used space 355TB

  16. MWT2+ILLINOISHEP localgroupdisk, used space 302TB

  17. AGLT2 localgroupdisk, used space 238TB
