
Status of PDC’07 and user analysis issues (from admin point of view)





1. Status of PDC’07 and user analysis issues (from admin point of view)
L. Betev, August 28, 2007

2. The ALICE Grid
• Powered by AliEn
• Interfaces to gLite, ARC and (future) OSG WMS
• As of today – 65 entry points (62 sites), 4 continents
  • Africa (1), Asia (4), Europe (53), North America (4)
  • 21 countries, 1 consortium (NDGF)
• 6 Tier-1 (MSS capacity) sites, 58 Tier-2
• All together – ~5000 CPUs (pledged), 1.5 PB disk, 1.5 PB tape
• Contribution range: from 4 to 1200 CPUs
  • PIII, PIV, Itanium, Xeon, AMD
  • All Linux: from Mandriva and SuSE to Ubuntu, mostly SL3/4 (no Gentoo), plus all possible kernel+gcc combinations

3. The ALICE Grid (2)
[Map slide: the 62 active sites]

4. Operation
• ALICE offline is:
  • Hosting the central AliEn services: Grid catalogue, task queue, job handling, authentication, API services, user registration
  • Organising and running the production, guided by the requirements of the PWGs
  • Updating and operating the AliEn site services (together with the regional experts)
  • Supporting user analysis
• Sites are:
  • Hosting the VO-boxes (interface to the site services)
  • Operating the local services (gLite and site fabric)
  • Providing CPU and storage
• This model:
  • Has been in operation with minor modifications for several years and works quite well for production
  • Requires minor modifications to support a large user community, mostly in the area of user support

5. History of PDCs
• Exercise of the ALICE production model
  • Data production / storage / replication
  • Validation of AliRoot
  • Validation of Grid software and operation
  • User analysis (not yet an integral part of the PDC)
• Since April 2006 the PDC has been running continuously

6. PDC job history
Average of 1500 CPUs running continuously since April 2006

7. PDC job history: zoom on the last 2 months
2900 jobs on average, saturating all available resources

8. Site performance
• Typical operation:
  • Up to 10% of the sites are out of production at any given moment
  • Half of these are undergoing scheduled upgrades
  • The other half suffer Grid or local service failures
• T1s are in general more stable than T2s, but some T2s are much better than any of the T1s
• Achieving better stability of the services at the computing centres is a top priority for all parties involved
• The availability of the central services is better than 95% (a toy calculation of such figures follows this slide)
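
As a toy illustration (not from the slides), availability figures like the ones above come from periodic "in production" samples collected by the monitoring system. The C++ sketch below computes the percentage per site; the site names and samples are entirely hypothetical:

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        // Hypothetical samples: true = site was in production at that
        // sampling interval (real data would come from the monitoring system)
        std::map<std::string, std::vector<bool>> samples = {
            {"T1-A", {true, true, true, true, false}},  // one bad interval
            {"T2-B", {true, true, true, true, true}},   // a very stable T2
            {"T2-C", {true, false, false, true, true}}, // service failures
        };
        for (const auto& [site, s] : samples) {
            int up = 0;
            for (bool ok : s) up += ok;                 // count good intervals
            std::printf("%-5s availability: %3.0f%%\n",
                        site.c_str(), 100.0 * up / s.size());
        }
        return 0;
    }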

9. Production status
Total of 85,837,100 events as of 26/08/2007, 24:00 hours

10. Site contributions
Standard distribution: 50/50 T1/T2 contribution

11. Relative contribution: Germany
Standard distribution: 50/50 T1/T2 contribution; Germany provides 15% of the total

12. Efficiencies/debugging
• Workload management for production
  • Under control and near production quality
  • We keep saying that, but this time we really mean it
  • Improvements (speed, stability) are expected with the new gLite version 3.1, still untested
• Support and debugging
  • The overall situation is much less fragile now
  • Substantial improvements in AliEn and monitoring are making the work of the experts supporting the operations easier
  • gLite services at the sites are (mostly) well understood and supported
• User support is still very much in need of improvement
  • The issues with user analysis are often unique and sometimes lead to the development of new functionality
  • But at least the response time (if not the solution) is quick

13. General
• The Grid is getting better
  • Running conditions are improving
  • The Grid middleware in general, and AliEn in particular, is quite stable
    • After long and hard work by the developers
  • Even user analysis, much derided in the past few months, is finally not a painful exercise
• The operation is more streamlined now
  • Better understanding of running conditions and problems by the experts
• We continue with the usual PDC’07 programme
  • Simulation/reconstruction of MC events
  • Validation of new middleware components
  • User analysis
  • And, in addition, the Full Dress Rehearsal (FDR)

14. User analysis issues: short list
• Major issues, February/June 2007:
  • Jobs do not start / are lost / output is missing
  • Input data collections are difficult to handle and impossible to process at once
  • Priorities are not set: a single user can ‘grab’ all resources
  • Unclear definition of storage elements (disk/MSS)

15. User analysis issues: short list (2)
• What has been done:
  • Failover CE for the user queue (Grid partition ‘Analysis’)
    • Since 20 June: 100% availability
  • Pre-staging of data (available on spinning media) and central creation of XML collections
    • The availability of the pre-staged files is checked periodically (a sketch of such a check follows this slide)
  • More robust central services (see previous slides)
  • Use of a dedicated SE for user files; this will be transparently extended to multiple SEs with quotas
  • Priority mechanism (not the final version) put in place
    • We have had no reports of unfair use
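
The periodic check of the pre-staged files could look like the following ROOT macro. This is a sketch of the idea, not the actual ALICE tool, and the SE hostname and file paths are hypothetical; TFile::Open understands root:// (xrootd) URLs and returns a null or "zombie" handle when a replica is unreachable:

    // check_staged.C: verify that pre-staged replicas are still readable
    #include "TFile.h"
    #include <cstdio>
    #include <string>
    #include <vector>

    void check_staged() {
        // Hypothetical entries from a centrally created XML collection
        std::vector<std::string> urls = {
            "root://se.example.org//alice/sim/2007/run1234/AliESDs.root",
            "root://se.example.org//alice/sim/2007/run1235/AliESDs.root",
        };
        for (const auto& u : urls) {
            TFile* f = TFile::Open(u.c_str());   // nullptr/zombie if gone
            const bool ok = (f && !f->IsZombie());
            std::printf("%s : %s\n", u.c_str(), ok ? "available" : "MISSING");
            if (f) { f->Close(); delete f; }
        }
    }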

16. Job completion chart (user jobs)
Standard distribution: 50/50 T1/T2 contribution

17. User analysis issues: current
• Storage availability and consistency
  • Still very few working SEs; the common storage solutions are not yet of ‘production’ quality
  • The effort is now concentrated on CASTOR2 with xrootd
  • Sites (e.g. GSI) are installing large xrootd pools; these are tested and working
  • With more SEs holding replicas of the data, the Grid will naturally become more stable (sketched below)
• Availability of specific data sets
  • Dependent on the storage capacity in operation
  • Currently the TPC RAW data is being replicated to GSI
  • With CASTOR2+xrootd working, the number of events on spinning media will increase 20x
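
Why additional SEs holding replicas improve stability can be seen from a few lines of client logic: if one SE is down, the next copy is tried. A minimal sketch in ROOT, assuming hypothetical redirector hostnames and paths:

    // replica_fallback.C: open the first healthy replica of a data set
    #include "TFile.h"
    #include <cstdio>
    #include <string>
    #include <vector>

    TFile* OpenAnyReplica(const std::vector<std::string>& replicas) {
        for (const auto& url : replicas) {
            TFile* f = TFile::Open(url.c_str());
            if (f && !f->IsZombie()) return f;   // first working SE wins
            delete f;                            // unreachable, try next SE
        }
        return nullptr;                          // no replica available
    }

    void replica_fallback() {
        TFile* f = OpenAnyReplica({
            "root://castor.example.cern.ch//alice/raw/2007/run5678/tpc.root",
            "root://xrd.example.gsi.de//alice/raw/2007/run5678/tpc.root",
        });
        std::printf("TPC data %s\n", f ? "opened" : "not available on any SE");
        if (f) { f->Close(); delete f; }
    }

Once several replicas are registered in the catalogue, this kind of fallback can in principle happen transparently to the user.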

18. User analysis issues: current (2)
• User applications
  • Compatibility of the user’s installation of ROOT, gcc version and OS: a locally compiled application will not necessarily run on the Grid
  • All sites are installed with ‘lowest common denominator’ middleware and packages, currently SLC3 with gcc v.3.2, while most users have gcc v.3.4
  • There is no easy way out until the centres migrate to SL(C)4 and gcc v.3.4
  • Meanwhile, the experts are looking into repackaging the Grid applications (most notably gshell)
  • Currently the only solution is to always compile ROOT and the user application with the same compiler before submitting to the Grid (see the compile-time guard sketched below)
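
One defensive measure (our illustration, not an official recipe from the slides) is to let the mismatch fail at build time rather than at run time on a worker node. A header like the sketch below, included from the analysis code, refuses to compile with anything but the Grid baseline gcc v.3.2:

    // grid_gcc_guard.h: abort the build if the compiler does not match the
    // 'lowest common denominator' the Grid sites provide (SLC3, gcc 3.2)
    #ifndef GRID_GCC_GUARD_H
    #define GRID_GCC_GUARD_H

    #if defined(__GNUC__)
      #if (__GNUC__ != 3) || (__GNUC_MINOR__ != 2)
        #error "Grid packages are built with gcc 3.2: rebuild ROOT and the analysis with the same compiler before submitting"
      #endif
    #else
      #error "Unknown compiler: Grid sites expect gcc 3.2 builds"
    #endif

    #endif // GRID_GCC_GUARD_H

The version numbers in the guard would of course have to be updated once the centres migrate to SL(C)4 and gcc v.3.4.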
