Accounting in LCG
Dave Kant, CCLRC e-Science Centre
APEL in LCG/EGEE
1. Quick overview
2. The current state of play
3. Integration with OSG
4. Accounting in gLite
Overview
• Data collection via sensors
• Transportation via RGMA
• High-level aggregation and reporting via a graphical front-end: tables, pie charts, Gantt charts, metrics, tree views
Component View of APEL
Sensors (deployed at each site):
• Process batch log files; map DNs to batch usage
• Build accounting records: DN, CPU, WCT, SpecInt2000, etc. (see the sketch below)
• Account for grid usage (jobs) only
• Support PBS, Sun Grid Engine, Condor, and LSF
• Not real-time accounting
Data transport:
• Uses RGMA to send data to a central repository
• 196 sites publishing; 7.7 million job records collected
• Could use other transport protocols
• Lets sites control the export of DN information from the site
Presentation (GOC and regional portals):
• LHC View, EGEE View, GridPP View, Site View
• Reporting based on data aggregation
• Metrics (e.g. time-integrated CPU usage)
• Tables, pie charts, Gantt charts
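As a minimal sketch of what such a sensor does, the Python fragment below parses one batch-log entry, joins it with a DN-to-jobId mapping, and emits an accounting record. The log layout and field names here are illustrative assumptions, not the real PBS/LSF formats or APEL's actual code.

# Sketch of an APEL-style sensor step: combine batch usage with the
# grid DN that submitted the job. All field names are hypothetical.

SITE_SPECINT2000 = 1000  # benchmark figure published by the site

def build_record(batch_entry: dict, dn_map: dict) -> dict:
    """Combine batch usage with the grid identity for one job."""
    job_id = batch_entry["job_id"]
    return {
        "UserDN": dn_map.get(job_id, "unknown"),   # grid identity
        "LocalJobID": job_id,
        "CPUTime": batch_entry["cpu_seconds"],     # CPU seconds consumed
        "WallTime": batch_entry["wall_seconds"],   # wall-clock seconds
        "SpecInt2000": SITE_SPECINT2000,
    }

# Example: one PBS-like usage entry plus the DN mapping for that job id.
entry = {"job_id": "4711.pbs01", "cpu_seconds": 3600, "wall_seconds": 4200}
dns = {"4711.pbs01": "/C=UK/O=eScience/CN=some user"}
print(build_record(entry, dns))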
Demos of Accounting Aggregation
Global views of CPU resource consumption:
• LHC View: http://goc.grid-support.ac.uk/gridsite/accounting/tree/treeview.php
  Shows aggregation for each LHC VO
  Requirements driven by the RRB
  Tier-1 sites and countries are the entry points
  LHC VOs only
  All data normalised in units of 1000 · SI2000 · hours
• GridPP View: http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.php
  Shows aggregation for an organisation at Tier-1/Tier-2 level
• EGEE View (new!): http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
  Regional views and detailed site-level reporting
Active development by CESGA/RAL: Pablo Rey Mayo, Javier Lopez, Dave Kant
VOs/LCG/EGEE Requirements
• One-line summary: "How much was done, and who did it?"
• High-level anonymous reporting (see the aggregation sketch below)
  How much resource has been provided to each VO?
  Aggregation across VOs, countries, regions, grids, and organisations
  Granularity (time frame): weekly, quarterly, annually
• Finer granularity at user level
  If 10,000 CPU hours were consumed by the ATLAS VO, who are the users that submitted the work?
• Data privacy laws
  A grid DN is personal information that could be used to target an individual.
  Who has access to this data, and how is that access obtained?
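To make the two reporting levels concrete, here is a minimal sketch of both views over the same records: the record fields are illustrative, and real APEL aggregation happens in the central MySQL repository, not in-memory Python.

# Sketch: high-level anonymous totals per VO, plus the restricted
# per-user breakdown within one VO. Field names are hypothetical.
from collections import defaultdict

records = [
    {"vo": "atlas", "user_dn": "/CN=alice", "cpu_hours": 120.0},
    {"vo": "atlas", "user_dn": "/CN=bob",   "cpu_hours": 80.0},
    {"vo": "cms",   "user_dn": "/CN=carol", "cpu_hours": 50.0},
]

# High-level anonymous view: total CPU per VO, no DNs exposed.
per_vo = defaultdict(float)
for rec in records:
    per_vo[rec["vo"]] += rec["cpu_hours"]
print(dict(per_vo))  # {'atlas': 200.0, 'cms': 50.0}

# Finer-grained view (access-controlled): per-user breakdown for one VO.
atlas_users = defaultdict(float)
for rec in records:
    if rec["vo"] == "atlas":
        atlas_users[rec["user_dn"]] += rec["cpu_hours"]
print(dict(atlas_users))  # {'/CN=alice': 120.0, '/CN=bob': 80.0}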
APEL Developments
• Extending batch system support (testing phase)
  Support for Condor and SGE; both are being tested: SGE by CESGA and Condor by GridPP.
  Unofficial releases are available on the APEL home page:
  http://goc.grid-support.ac.uk/gridsite/accounting/sge.html
  http://goc.grid-support.ac.uk/gridsite/accounting/condor-prelim.html
• Gap Publisher (testing phase)
  Provides sites with better tools to identify missing data and publish it into the archiver.
  The reporting system uses Gantt charts to identify gaps, and enhancements to the publisher module are being tested (see the gap-finding sketch below).
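The idea behind the gap detection can be sketched in a few lines, assuming one row of published accounting data per site per day; the Gap Publisher itself works against the APEL database, so this is illustrative only.

# Sketch of gap detection: report the days in a range with no
# published records for a site.
from datetime import date, timedelta

published = {date(2006, 3, 1), date(2006, 3, 2), date(2006, 3, 5)}

def find_gaps(days, start, end):
    """Return the days in [start, end] with no published records."""
    gaps, day = [], start
    while day <= end:
        if day not in days:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

print(find_gaps(published, date(2006, 3, 1), date(2006, 3, 5)))
# [datetime.date(2006, 3, 3), datetime.date(2006, 3, 4)]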
APEL Issues (1)
• Normalisation (under investigation, CESGA/RAL)
  Recall that, to account for usage across heterogeneous compute farms, data are scaled to a common reference in LCG.
  Reference scale = 1K · SI2000
  Each job record's scale factor is SI2000_published_by_site / reference.
  Some sites have a large number of job records where the site SI2000 is zero.
  Identify such sites via the reporting tools and provide a recipe to fix them (a worked example follows below).
• APEL memory usage (important, and will become urgent)
  Site databases are growing ever larger: APEL requires more memory to join records (a full build at the RAL Tier-1 requires 2 GB of RAM).
  Plan: implement a scheme to reduce the number of redundant records used in the join process: flag rows used in a successful build and delete them, as they are no longer needed.
• DN accounting?
  Should APEL account for local usage as well as grid usage? BNL recently sent us data that included both grid and local usage.
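As a worked example of the normalisation rule above (a sketch; the function and variable names are illustrative, not APEL's actual code):

# CPU time is scaled by the site's published SpecInt2000 rating
# relative to the 1K·SI2000 reference, so results come out in units
# of 1000·SI2000·hours.
REFERENCE_SI2000 = 1000

def normalised_cpu_hours(cpu_hours: float, site_si2000: float) -> float:
    if site_si2000 <= 0:
        # The pathology noted above: a site publishing SI2000 = 0 makes
        # every one of its job records normalise to zero.
        raise ValueError("site publishes no (or zero) SI2000 benchmark")
    return cpu_hours * (site_si2000 / REFERENCE_SI2000)

# A 10-hour job on a 1500-SI2000 node counts as 15 normalised hours.
print(normalised_cpu_hours(10.0, 1500))  # 15.0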
APEL Issues (2)
• Handling large log files (under investigation)
  Condor history and SGE batch logs are very large (> 1 GB).
  Large logs are problematic: reading and storing records inline needs a large amount of memory, and application run time grows. We also don't want to re-read data that was parsed on a previous run (efficiency).
  Options: develop an efficient way to parse these logs; ask the batch log providers to support log rotation; or provide a recipe to site admins.
  A recipe for site admins only half-works, because events are lost when event data is split over multiple lines (see the incremental-parsing sketch below).
• RGMA queries to the central repository
  Query response time is very slow, which prevents some sites from checking that continuous consumers are actually listening for data.
  We would need to archive data from the central repository to another database to speed up such queries.
  Not an issue for the reporting front-end, and it does not appear to be something sites urgently need (requested by IN2P3-CC).
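One efficient-parsing approach is to remember the byte offset reached on the previous run, so a multi-gigabyte log is never re-read from the start. A minimal sketch, assuming the batch system only appends to the log; the state-file mechanism and file names are assumptions, not APEL's actual approach.

# Sketch of incremental log parsing with a persisted byte offset.
import os

STATE_FILE = "parser.offset"

def read_new_lines(log_path):
    """Yield only the lines appended since the previous invocation."""
    offset = 0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            offset = int(f.read() or 0)
    with open(log_path) as log:
        log.seek(offset)
        for line in log:
            yield line.rstrip("\n")
        offset = log.tell()
    with open(STATE_FILE, "w") as f:
        f.write(str(offset))

# Demo with a throwaway "batch log": the second pass sees nothing new.
with open("demo.log", "w") as f:
    f.write("job=1 cpu=10\njob=2 cpu=20\n")
print(list(read_new_lines("demo.log")))  # ['job=1 cpu=10', 'job=2 cpu=20']
print(list(read_new_lines("demo.log")))  # []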
Integration with Open Science Grid
• A few OSG sites have deployed a minimal LCG front-end to publish accounting data into the APEL database (GOCDB registration + APEL sensors + RGMA MON node).
  Successful deployment at the University of Indiana (PBS and Condor data published).
• Due to (subtle) differences in the grid middleware, APEL's core library must be modified to build accounting records in the OSG environment:
  LCG: DN-to-local-batch-jobId mappings are encoded within three log files (LCG job manager).
  OSG: DN-to-local-batch-jobId mappings are in a single log file (Globus job manager?).
• Main issues under consideration
  There are currently THREE versions of the APEL core library, each sharing common batch-system plugins: the LCG production release, the gLite 3 development version, and the OSG development version.
  Refactor the core library to create a new plugin per middleware (LCG/gLite/OSG)?
  A more sensible approach would be to use a *common* accounting file in BOTH gLite and OSG to provide the grid DN to local batch jobId mapping.
  We need a common agreement on log rotation: prefer logname-YYYYMMDD.gz (static file name) to logname-1.gz (not static; see the sketch below).
• Very much in the early stages; we need some common agreements and more understanding of the OSG middleware before proceeding.
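Why the date-stamped naming matters can be shown in a few lines: a file whose name never changes once written lets a parser keep a reliable record of what it has already processed. A sketch with hypothetical file names, not an agreed convention.

# With static, date-stamped names, "new work" is a simple set difference.
rotated = ["gram-20060401.gz", "gram-20060402.gz", "gram-20060403.gz"]
already_done = {"gram-20060401.gz"}

todo = sorted(set(rotated) - already_done)
print(todo)  # ['gram-20060402.gz', 'gram-20060403.gz']

# With numbered rotation, gram-1.gz names a *different* file each day,
# so a name-based record of processed files is meaningless.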
Accounting in gLite 3
• In gLite, the BLAH daemon (provided by Condor) is used to mediate jobs between the WMS and the compute element. Consequently, the accounting information needed by APEL is no longer in the gatekeeper logs but is found elsewhere, e.g. in the local user's home directory.
• An accounting mapping file has been proposed by DGAS and implemented by the gLite middleware developers to simplify the process of building accounting records:
  It maps grid-related information to the local job ID
  It is independent of the submission procedure (WMS or not)
  No services or clients are required on the WN
  Format (one line per job, daily log rotation):
  timestamp=<submission time to LRMS> userDN=<user's DN> userFQAN=<user's FQAN> ceID=<CE ID> jobID=<grid job ID> lrmsID=<LRMS job ID> localUser=<uid>
• Already implemented for BLAH (and CREAM); work in progress for LCG.
• Did not make it into gLite 3.0, so there is no accounting for the gLite CE.
• APEL development to begin in April (D. Kant); development and testing expected to take most of April (see the parsing sketch below).
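A minimal sketch of parsing the proposed one-line-per-job format above. Quoting and escaping rules are not spelled out in the proposal, so this assumes values with spaces are quoted; the example hostnames and identifiers are invented for illustration.

# Sketch: split a mapping-file line into key=value fields.
import shlex

line = ('timestamp=2006-04-01T12:00:00 "userDN=/C=UK/O=eScience/CN=some user" '
        'userFQAN=/atlas/Role=production ceID=ce01.example.org:2119 '
        'jobID=https://lb.example.org:9000/abc123 lrmsID=4711.pbs01 localUser=501')

def parse_mapping_line(line: str) -> dict:
    fields = {}
    for token in shlex.split(line):          # respects quoted tokens
        key, _, value = token.partition("=") # split at the first '='
        fields[key] = value
    return fields

rec = parse_mapping_line(line)
print(rec["userDN"], rec["lrmsID"])  # /C=UK/O=eScience/CN=some user 4711.pbs01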
DGAS
• DGAS meets some requirements for privacy of user identity: user job info is readable only by the user, the site manager, and the VO manager.
• DGAS cannot aggregate information across the whole grid.
• Solution 1: DGAS sensors also publish anonymous data to the central APEL repository; user details remain available in the DGAS HLR for the VO (see the sketch below).
• Solution 2: a higher-level repository that all HLRs can publish into. The GGF Resource Usage Service is one candidate; RHUL is working on an implementation.
• BUT: DGAS is not in gLite 3.0.
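The essence of Solution 1 is stripping the personal fields from a usage record before publishing it centrally, keeping the full record only in the VO's HLR. A minimal sketch; the field names are illustrative, not DGAS's actual schema.

# Sketch: produce an anonymised copy safe for grid-wide aggregation.
PRIVATE_FIELDS = {"userDN", "userFQAN", "localUser"}

def anonymise(record: dict) -> dict:
    """Return a copy of the record with all personal fields removed."""
    return {k: v for k, v in record.items() if k not in PRIVATE_FIELDS}

full = {"userDN": "/CN=alice", "userFQAN": "/atlas", "localUser": "501",
        "vo": "atlas", "cpu_hours": 12.5, "site": "RAL-LCG2"}
print(anonymise(full))  # {'vo': 'atlas', 'cpu_hours': 12.5, 'site': 'RAL-LCG2'}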
Summary
• We have a working accounting system
• But work is still required to keep it working and to meet the (possibly conflicting) outstanding requirements for:
  • Privacy
  • User information