220 likes | 389 Views
Accounting . Dave Kant CCLRC, e-Science Centre. Outline. CPU User-Level Accounting Group and Role Accounting Accounting Portal Storage Accounting RUS Summary. Motivation. Accounting for Usage on the Grid
E N D
Accounting Dave Kant CCLRC, e-Science Centre
Outline • CPU • User-Level Accounting • Group and Role Accounting • Accounting Portal • Storage Accounting • RUS • Summary
Motivation • Accounting for Usage on the Grid • If 10,000 CPU production years were consumed by a VO, who are the users that submitted the work? • Are the resources being utilised efficiently? • Kinds of accounting data • Anonymous, statistical accounting information of type “aggregated consumed CPU time per VO / site / month”. • This type of data is mostly uncritical and may be provided in a “world readable” way to scientific boards and communities. • VO-related • Can provide information about usage within the activities of the VO. Access restricted to members of the VO community. • User-related • This type of data can potentially be used to record and control the work of individual persons, and allows conclusions about his/her working methods, results, performance etc.
User-Level Accounting • Focus of activities in EGEE/LCG has been in two areas: • JSPG User-Level Accounting Data Policy (D.Kelsey) • Discusses the treatment of accounting data • User Consent. • What is stored and where? • Who has Access rights? • For what purposes? • Implementation in middleware • Data Collection • APEL, DGAS • Confidentiality • Encryption • Reporting via Accounting Portal • Access control
User-Level Accounting • Request the Tier-1s publish encrypted User DNs. • Policy document isn’t official yet, so there is no actual reporting of the information available. • 18 sites have already done this:- CESGA-EGEE , CIEMAT-LCG2, CNB-LCG2, CSCS-LCG2, FZK-LCG2, IFAE, IFCA-LCG2, INFN-TORINO, ITEP,JINR-LCG2, LIP-Lisbon, RAL-LCG2, RU-SPbSU, TAU-LCG2, UB-LCG2, UKI-NORTHGRID-LANCS-HEP, UPV-GRyCAP , USC-LCG2 • Most of these sites use APEL • Confidentiality based on encrypting the UserDN using 1024 bit public key cryptography.
Group and Role Accounting • Accounting at the User-Level isn’t enough. • Users wear many hats • A requirement to implement a fine-grained management of permissions and privileges in the Grid • “Support for VOMS roles in production services” • Chris Brew et al. WLCG Workshop, CERN, 23 Jan 2007 • Need for VOMS aware grid services • Access to computing resources e.g. Job priorities • Access to data and storage resources • Example • Define how CPU resources should be allocated to different VO-related activities • Allocate 60% of CPU resources to production at a site for the next three months • Measure what has been delivered • Accounting • Implementation for CPU Accounting • On CE via the Grid Accounting Mapping File (Patch #898) • Maps grid credentials to local batch resources • userDN, userFQAN (VOMS), CE_ID, grid_job_ID + LRMS_ID • Implemented for BLAH (gLite CE and CREAM) and LCG-CE • UserFQAN is derived from LCMAPS on the CE.
VOMS Group Tree • Many possible group hierarchies. • Regional groups • /vo/regionX/siteY • Activity-related groups • /vo/analysis/StandardModel • /vo/reconstruction • It is expected to have in future O(100) distinct group/role combinations • What we have seen so far… /cms/Role=production/Capability=NULL /lhcb/lcgprod/Role=NULL/Capability=NULL /atlas/Role=lcgadmin/Capability=NULL • Breakdown of usage according to activity and VO • FQAN Chain with primary and secondary components. • No activity information in the primary • Can only account usage to the VO /dteam/Role=NULL/Capability=NULL;/dteam/cern/Role=NULL/Capability=NULL
Implemention in APEL • Data Collection via Sensors • Transportation (On-the-fly Encryption of UserDN) via RGMA • High level Aggregation and Reporting via Graphical Front-end High Level Reporting: Tables, Pies, Gantts, Metrics, Trees Aggregation
APEL Test Record • What you see in R-GMA GlobalJobID: https://lxb1762.cern.ch:9000/-cfWmtwoGI5CUwGrI5gVmA UserDN: *APEL V.0.2* P7IMQIkPndWbCpeecX/iqqBhqYdsOp+DZl+9+o9GQ3BPXx8uzcHUqZOJQSq8F6KIczAvxSw+I8x8lwHJ6l9kxIyYbLTEkELI3Ul77I6zhzT90zUDLEpgsQ+0XY3tPRGwu5uG/ibtcWxefOtvZoM6FWwf8yZmv8yLpjmNkMxZOtI= UserFQAN: /dteam/Role=NULL/Capability=NULL;/dteam/cern/Role=NULL/Capability=NULL; ExecutingCE: lxb2034.cern.ch:2119/jobmanager-lcgpbs-dteam
Is User FQAN Confidential? • User FQAN Chain readable in R-GMA by anyone with an IGTF approved certificate. • Should we encrypt this information?
MySQL HotCopy RGMA HotCopyInsert (Every Hour) Offline DB Data Life Cycle • From Site to Portal • Desirable to refresh the data shown on the portal at frequent intervals • Every 8 hours • Site Publish • Migration • Extract from R-GMA • Send data to offline Database • Off-line Data Processing • Decrypt UserDN • Extract Group and Role from UserFQAN chain • Off-line Aggregation • Build summaries of data: • Site-Level, VO-level, User-Level, intra-VO 30 Million Job Records (2005-today)
Portal • Accounting Portal • Access Right according to the “Actor Model” • Five Actors: • Users, VO-Resources Manager, VO-Member, Site-Admin GOC Admin • Three are implemented via DN proxy and ACLs • VO-Member access requires the VOMS proxy information. • Display Features • Build Resource statistics as a function of User and FQAN • How much CPU consumed by User X at Site Y • Top Ten Users in a VO • How much production work done by VO X at Site Y
VO Resource Manager • Table shows CPU, WCT and Job Eff. of the Top 10 Anonymised Users • Breakdown of Usage: DN / VO / Group / Role • Work done by Pablo Mayo, Javier Lopez at CESGA
Issues / Status • T1s consider whether they should report their non-Grid Usage • Should we attempt to distinguish grid from non-grid usage? • Not made any clear recommendations to sites for this if they choose to publish • OSG • Deployment of Gratia accounting system across the CMS and ATLAS tier 1 and 2 centres. • Centralised collection of CPU usage records. • Aggregation to deliver Monthly summaries. • Started publishing March 2007 using MySQL client. • NorduGrid • Deployment of SGAS • Have published job records in the early part of 2006 • Comparison between CERN and APEL • Found a large discrepancies in usage numbers which was traced to a bug in multiple CE support in APEL. • Implemented a bug fix. • Waiting for Feedback • Storing job level information centrally may not scale forever. • Push back to sites and only store summaries centrally • DGAS already does this. • Does Pilot jobs and GLEXEC break user-level accounting? • Do we have an answer to this question? • Account for the VO…
INFN Status • Status at end of Nov 2006. • Data from DGAS HLR has been inserted into the GOC accounting database using DGAS2APEL • Transformation from their HLR schema to APEL schema • Use APEL publisher • Tests performed by Rosario Piro (INFN) • ~ 2,000 Test records published for INFN-TORINO-DEV • To be deployed at Torino site first, then other Tier-2s • However, we are not sure why they have not deployed DGAS2APEL across their sites….
Storage Accounting • Display of Storage Information published by GIPs into the information system. • Sensor queries top-level BDII -> MySQL • Capture information every two hours • See talk by Greig Cowan • Process Data • Used and Allocated storage Disk and Tape at the VO level • Visualisation • Limitations: • Only a high level view • No breakdown of these numbers at a deeper level • UserDN, VOMS Group and Role • File Transactions (Uploads, downloads) • Protocols
http://goc02.grid-support.ac.uk/accountingDisplay/view.php • Select Resources via a Tree • Select time interval (last year, last month, last week, last day) • Select VOs, SEArchitecture
Issues • Enhance the LHC view of storage resources • Tier-1 and Tier-2 tree • Extend breakdown per site to the RRD graphs • To show AverageUsage, AverageAllocated per Week or Month per VO per LHC Tier1. • Example: If VO used 50TB tape in the first week of Jan, and then deleted the data… • How do we get data from OSG? • Bring the Storage and CPU reporting together into a single Portal. • Still only a prototype so sometimes doesn’t run. • Need to check correctness of data • See Greig Cowan talk • More work to be done on display • Comments welcome • Stability • Sensor dropouts address at WLCG • Migrate service to its own machine • What about gLite? • See Greig Talk • When will Glue Schema change? • What more can we learn at the BDII level?
RUS • RUS is a specification for a service used to publish and query resource information. • Can be useful for LCG because • Many partners have their own implementations for collecting resource information. • Cannot control how the partners implement their tools, but you can specify – through project specification – how data should be exchanged between them. • Collect usage from each partner, put the data together and give a global view of resource usage
LCG-RUS • User/VO level Accounting • HTTPS based authentication; • Every authenticated user automatically takes the role of Grid user; • The user who claimed to be the role of “VO manager” has to be authorised by query access control file; • Get aggregated usage information via “extraction” service interface; • On-demand job usage tracing through “tracing” service interface on deployed usage storages at sites; Deployed usage storages
XML Job UR XML Job UR Site Site OGSA-DAI RUS architecture These properties are refreshed each time new data inserted in order to ensure efficient query RUS Response Request RUS Implementation XACML Policy authorization OGSA-DAI Request Wrapper OGSA-DAI Response OGSA-DAI Request OGSA-DAI service Server-side Aggregation Extensions Extensions for RUS Implementations as usage Data Sources Properties for User/vo/resource … accounts RUS plolicy Authorizer Other RUS Implementations
Summary • User level accounting is deployed and many sites are publishing. T1s are encouraged to deploy. • This is not sufficient for the role/group based accounting that VOs have requested. • New release of APEL in certification; together with the accounting mapping log file, FQAN accounting is possible. • Storage Accounting Visualisation tools displaying Used and Allocated disk per VO; LHC view is needed. • RUS Prototype is based on the concept of an Aggregated Usage Record. RUS Working group aiming for implementations based on the OGSA-DAI framework.