Status of PDC’07 and user analysis issues (from admin point of view)
L. Betev, August 28, 2007
The ALICE Grid
• Powered by AliEn
• Interfaces to gLite, ARC and (future) OSG WMS
• As of today – 65 entry points (62 sites), 4 continents
  • Africa (1), Asia (4), Europe (53), North America (4)
  • 21 countries, 1 consortium (NDGF)
  • 6 Tier-1 (MSS capacity) sites, 58 Tier-2
• All together – ~5000 CPUs (pledged), 1.5 PB disk, 1.5 PB tape
• Contribution range: from 4 to 1200 CPUs
  • PIII, PIV, Itanium, Xeon, AMD
  • All Linux: from Mandriva and SUSE to Ubuntu, mostly SL3/4 (no Gentoo), plus all possible kernel+gcc combinations
The ALICE Grid (2)
• 62 active sites (world map)
Operation
• ALICE offline is:
  • Hosting the central AliEn services: Grid catalogue, task queue, job handling, authentication, API services, user registration
  • Organising (guided by the requirements of the PWGs) and running the production
  • Updating and operating the AliEn site services (together with the regional experts)
  • Supporting user analysis
• Sites are:
  • Hosting the VO-boxes (interface to the site services)
  • Operating the local services (gLite and site fabric)
  • Providing CPU and storage
• This model:
  • Has been in operation with minor modifications for several years and works quite well for production
  • Requires only minor modifications, mostly in the area of user support, to serve a large user community
History of PDCs
• Exercise of the ALICE production model
  • Data production / storage / replication
  • Validation of AliRoot
  • Validation of Grid software and operation
  • User analysis (not yet an integral part of the PDC)
• Since April 2006 the PDC has been running continuously
PDC job history
• An average of 1500 CPUs running continuously since April 2006
PDC job history - zoom on the last 2 months
• 2900 jobs on average, saturating all available resources
Site performance
• Typical operation:
  • Up to 10% of the sites (roughly 6 of the 62) are not in production at any given moment
  • Half of these are undergoing scheduled upgrades
  • The other half suffer from Grid or local services failures
  • T1s are in general more stable than T2s
  • Some T2s are much better than any of the T1s
• Achieving better stability of the services at the computing centres is a top priority for all parties involved
• Availability of the central services is better than 95%
Production status
• Total of 85,837,100 events as of 26/08/2007, 24:00 hours
Site contributions
• Standard distribution: 50/50 T1/T2 contribution
Relative contribution - Germany
• Standard distribution: 50/50 T1/T2 contribution
• Germany: 15% of the total
Efficiencies/debugging
• Workload management for production
  • Under control and near production quality
  • We keep saying that, but this time we really mean it
  • Improvements (speed, stability) are expected with the new gLite version 3.1, still untested
• Support and debugging
  • The overall situation is much less fragile now
  • Substantial improvements in AliEn and monitoring are making the work of the experts supporting the operations easier
  • gLite services at the sites are (mostly) well understood and supported
• User support is still very much in need of improvement
  • The issues with user analysis are often unique and sometimes lead to the development of new functionality
  • But at least the response time (if not the solution) is quick
General
• The Grid is getting better
  • Running conditions are improving
  • The Grid middleware in general and AliEn in particular are quite stable, after long and hard work by the developers
  • Even user analysis, much derided in the past few months, is finally not a painful exercise
• The operation is more streamlined now
  • Better understanding of running conditions and problems by the experts
• We continue with the usual PDC’07 programme
  • Simulation/reconstruction of MC events
  • Validation of new middleware components
  • User analysis
  • And in addition the Full Dress Rehearsal (FDR)
User analysis issues - short list
• Major issues - February/June 2007
  • Jobs do not start / are lost / output is missing
  • Input data collections are difficult to handle and impossible to process at once
  • Priorities are not set - a single user can ‘grab’ all resources
  • Unclear definition of storage elements (disk/MSS)
User analysis issues - short list (2)
• What has been done
  • Failover CE for the user queue (Grid partition ‘Analysis’)
    • Since 20 June - 100% availability
  • Pre-staging of data (available on spinning media) and central creation of XML collections
    • The availability of the pre-staged files is checked periodically (see the sketch after this slide)
  • More robust central services (see previous slides)
  • Use of a dedicated SE for user files - this will be transparently extended to multiple SEs with quotas
  • Priority mechanism (not the final version) put in place
    • We haven’t had reports of unfair use
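To make the periodic availability check concrete, below is a minimal ROOT macro sketch of the idea: try to open each file of a collection over xrootd and count how many respond. The redirector host and file paths are hypothetical placeholders, and the check actually run by the central services is not necessarily implemented this way.

// check_staged.C - sketch of a pre-staging availability check.
// Hypothetical xrootd host and paths; run inside ROOT: .x check_staged.C
#include "TFile.h"
#include <cstdio>

void check_staged()
{
   const char *files[] = {
      "root://xrootd-se.example.org//alice/sim/2007/run001/AliESDs.root",
      "root://xrootd-se.example.org//alice/sim/2007/run002/AliESDs.root"
   };
   const int nTot = sizeof(files) / sizeof(files[0]);
   int nOK = 0;

   for (int i = 0; i < nTot; ++i) {
      TFile *f = TFile::Open(files[i]);   // opens via the xrootd protocol
      if (f && !f->IsZombie()) {          // file is staged and readable
         ++nOK;
         f->Close();
      } else {
         fprintf(stderr, "not available: %s\n", files[i]);
      }
      delete f;
   }
   printf("%d / %d pre-staged files available\n", nOK, nTot);
}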
Job completion chart
• User jobs
User analysis issues - current
• Storage availability and consistency
  • Still very few working SEs - common storage solutions are not yet of ‘production’ quality
  • The effort is now concentrated on CASTOR2 with xrootd
  • Sites (e.g. GSI) are installing large xrootd pools - these are tested and working
  • With more SEs holding replicas of the data, the Grid will naturally become more stable (a client-side failover sketch follows below)
• Availability of specific data sets
  • Dependent on the storage capacity in operation
  • Currently the TPC RAW data is being replicated to GSI
  • With CASTOR2+xrootd working, the number of events on spinning media will increase 20x
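As an illustration of why additional replicas help, the sketch below tries the replicas of a single file in turn and reads from the first storage element that answers. The replica URLs are hypothetical; in a real analysis they would come from the AliEn file catalogue.

// open_with_failover.C - sketch: fall back to another replica when one SE is down.
// Hypothetical replica URLs; in practice taken from the AliEn catalogue.
#include "TFile.h"
#include <cstdio>

TFile *OpenWithFailover()
{
   const char *replicas[] = {
      "root://xrootd.gsi.example.de//alice/raw/2007/run12345/chunk0.root",
      "root://castorsrv.cern.example.ch//alice/raw/2007/run12345/chunk0.root"
   };
   const int n = sizeof(replicas) / sizeof(replicas[0]);

   for (int i = 0; i < n; ++i) {
      TFile *f = TFile::Open(replicas[i]);
      if (f && !f->IsZombie()) {
         printf("reading from %s\n", replicas[i]);
         return f;                       // first SE that answers wins
      }
      delete f;
      fprintf(stderr, "replica unavailable: %s\n", replicas[i]);
   }
   return 0;                             // all replicas down
}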
User analysis issues - current (2)
• User applications
  • Compatibility of the user installation of ROOT, gcc version, OS - a locally compiled application will not necessarily run on the Grid
  • All sites are installed with ‘lowest common denominator’ middleware and packages - currently SLC3, gcc v.3.2, while most users have gcc v.3.4
  • There is no easy way out until the centres migrate to SL(C)4 and gcc v.3.4
  • Meanwhile, the experts are looking into repackaging the Grid applications (most notably gshell)
  • Currently the only solution is to always compile ROOT and the user application with the same compiler before submitting to the Grid (a simple consistency check is sketched below)
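A quick way to spot this mismatch is to compare the compiler used for the user code with the one the local ROOT build reports, along the lines of the sketch below. It assumes a ROOT version that provides TSystem::GetBuildCompilerVersion(); if that method is not available in the installed ROOT, the root-config script gives similar build information.

// gcc_check.C - sketch: is the user code compiled with the same gcc as ROOT?
// Assumes TSystem::GetBuildCompilerVersion() exists in this ROOT version.
// Compile the macro with ACLiC so the preprocessor macros reflect the real compiler:
//   root -l -q gcc_check.C+
#include "TSystem.h"
#include <cstdio>

void gcc_check()
{
#if defined(__GNUC__)
   printf("this code compiled with : gcc %d.%d.%d\n",
          __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
#endif
   printf("ROOT build compiler     : %s\n",
          gSystem->GetBuildCompilerVersion());   // assumption: method available here
   // If the two disagree (e.g. gcc 3.4 locally vs gcc 3.2 on the Grid worker
   // nodes), rebuild ROOT and the analysis code with one and the same compiler
   // before submitting to the Grid.
}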