
Next LCG-2 middleware release

Next LCG-2 middleware release. Zdenek Sekera (for the LCG GD-CT section) GDA 7 Jun 2004. Outline. Grid Deployment group: Certification and Testing section What are we doing? Who are the members? How are we organized? LCG support, who is helping us? What is the Certification process?


Presentation Transcript


  1. Next LCG-2 middleware release Zdenek Sekera (for the LCG GD-CT section) GDA 7 Jun 2004 GDA Jun/6, 2004/ZS 1

  2. Outline • Grid Deployment group: Certification and Testing section • What are we doing? • Who are the members? • How are we organized? • LCG support: who is helping us? • What is the Certification process? • What is the purpose of that process? • What is going to be in the next LCG-2 middleware release? • What after RH7.3?

  3. Certification & Testing: what are we doing? • Integrate middleware software from several different sources into a homogeneous package • Provide production-quality software • Verify that: • it can actually be installed following the installation instructions provided by us • it can be configured to create a proper environment allowing a site to connect to the world-wide LCG grid • it is fully functional as a production system • Our tools: • “big” certification testbed (~60 machines) • “small” certification testbed (~10 machines) • quite an extensive set of tests

  4. LCG Certification Goal • Provide reliable releases of the LCG software for production use • We want to make sure that when YOU download LCG software from the LCG deployment Web site, you have a guarantee that: • it has been certified • it installs correctly when you follow the installation instructions supplied by LCG • it will work as specified in the various user and system documentation supplied by LCG • If it does NOT, we want to hear from you and we will correct it

  5. Certification & Testing: section members • Piera BETTINI: GridICE, R-GMA, integration • Jean-Philippe BAUD: DataMgt, GFAL, dCache • Frederique CHOLLET: Testing, test suites development • Gilbert GROSDIDIER: Testing, test suites development • Mila KATZAROVA: WEB redesign • Maarten LITMAATH: VDT, dCache, general debugging • Carlos OSUNA: CVS, Autobuild, porting • Louis PONCET: CVS, Web, sysadmin, HW, porting • Marco SERRA: CTB architect, integration, debugging • David SMITH: Workload Mgt, debugging • Di QING: Integration, debugging • Zdenek SEKERA: Management and the rest • Plus temporary visitors (1-3 months): E. Slabospitskaja, A. Kirianov, D. Olejnik, M. Sapunov, G.-T. Chiang, H.-L. Shih, M.-H. Tsai, others

  6. LCG support • Workload Management: primary contact: Massimo Sgaravatto (EDG WP1) • Data Management: primary contact: James Casey (now member of GD, formerly EDG WP2) • dCache: primary contact: dCache support mailing list • R-GMA: primary contact: WP3 mailing list, S.Fisher, S.Traylen

  7. Certification, Testing and Release Cycle [Flowchart; recoverable labels: EGEE and VDT (fix problems, new releases); LCG C&T section (add features, fix problems, transmit problems); certification testbed steps: Integrate, Basic Functionality Tests, Run C&T test suites and site test suites, Run Certification Matrix, each followed by an "errors?" decision (yes: fix problems); Release Candidate tagged; Experiments Integration Testbed (fix problems, candidate not acceptable); certified release tagged; Release: Pre-deployment, General Release; Deployment (deployment feedback)]

  8. What is “production quality”? It is all of the following in no particular order: • availability 24 x 7 • performance • stability, robustness • user friendliness • maintainability • user support

  9. LCG-2 certification • basic grid functionality • connectivity • grid services • security • resource brokering • data management (replication, catalog) • configurability • error recovery • real world applications • site verification suite

  10. LCG-2 May/31 release • Consolidation/maintenance activities • New VDT 1.1.14 (Globus 2.4.3) • Workload Management maintenance • Data Management maintenance and features • GridICE monitoring improvements • New features • Data Management • lcg-utils - tools requested by experiments • GFAL integration • “long names” Castor client • Possibly • dCache integration • R-GMA integration • Accounting

  11. VDT 1.1.14 • 1.1.14 == 1.1.13 + a few patches implementing Globus "Advisories" (e.g. OpenSSL security upgrade). • It is based on Globus 2.4.3 with 48 patches applied on top, fixing bugs (memory and file descriptor leaks, race conditions, logic errors) and adding needed functionality (gridmapdir, gatekeeper accounting and logfile rotation). • Almost all our patches (31) have been submitted by VDT to Globus; many have already been incorporated into Globus 2.4.3, and a few are on the to-do list for future releases. • It is compatible with VDT 1.1.8-14.edg4 currently used in the production system; the only problem is with round-robin IP address load-balancing (e.g. castorgrid), but there is an easy work-around. • It has been running on the CTB for a month without any problems.

  12. Workload Management maintenance (1/3) • In two steps: • move from lcg2-1-20 to lcg2-1-21-1 (only EDG bug fixes) • move from lcg2-1-21-1 to lcg2-1-25-1 (only LCG bug fixes) • gradual testing on the “small” testbed first • so when we installed it on the “big” CTB we knew the upgrade would be painless • lcg2-1-25-1 is the first version that was fully built on the LCG CVS server

  13. Workload Management maintenance (2/3) 2004-05-06: patch 150 - WMS lcg2_1_21-1. WMS changes with respect to lcg2_1_20. Fix EDG bugzilla bugs:
      1992 - UI ignores TCP port range variable for interactive jobs
      1997 - edg-job-list-match "error" messages
      2357 - Job refusal at NS is reported incorrectly
      2440 - edg-job-submit always produces "edglog.log" file in cwd
      2469 - Brokerinfo file only lists a file once
      2487 - Error occurred during mkdir for reduced part
      2493 - OutputSE not working?
      2540 - duplicated entries in ACL list
      2566 - SocketAgent::close check for wrong no_error return code by close method

  14. Workload Management maintenance (3/3) 2004-05-18: patch 164 - WMS lcg2_1_25-1. WMS changes with respect to lcg2_1_21_1. Fix LCG savannah bugs:
      2682 - workload manager ranking queries
      2701 - WP1 & GlueCEUniqueID, GlueClusterUniqueID and GlueSubClusterUniqueID
      2715 - edg-wl-lm init.d script and lockfile
      2792 - edg-wl-renewd cannot handle change of MyProxy host
      2909 - Error in edg-job-get-chkpt
      2991 - WM crash possible when specifying OutputData
      3258 - edg-wl-ns start takes a long time due to unneeded chown -R
      3286 - Timezone for --from & --to options of edg-job-status
      3372 - Include WMS job id in the job's globus RLS

  15. Data Management maintenance (1/3) The main focus areas of this release are: • Upgrade of the GSoap runtime to 2.3 for all C++ clients (needed by ATLAS) • Addition of extra methods to the catalogs for bulk operations (requested by CMS/POOL) • Refactoring of the info system interaction and the printInfo command in Replica Manager (internal request to rewrite a buggy component that caused many error reports) • Integration of EDG-SE StorageResource for AFS interaction at RAL

  16. Data Management maintenance (2/3) We have ended up with the following new versions: • edg-replica-manager v1.7.2 • edg-local-replica-catalog v2.2.7 • edg-replica-metadata-catalog v2.2.7 • LRC/RMC C++ clients v2.3.0

  17. Data Management maintenance (3/3) Bugs fixed:
      2858 - RM misbehavior if -d option not used (and default SE not available)
      2875 - edg-rm pi with -f option prints bad service endpoints.
      2887 - edg-rm does not accept port number in SRM SURL
      2890 - edg-rm requires VO directory to be absolute path
      2947 - If unknown SE turl protocol is in MDS, edg-rm malfunctions
      2996 - POOL (RLS) : Allow guid/pfname/lfname as valid query fields
      2998 - POOL (RLS) : Array based getMappings() methods
      2999 - POOL (RLS) : setAttributes bulk method
      3014 - edg-rm and directory creation against edg-se
      3428 - edg-rm cr NPEs if a bad URI is given.
      3265 - WP2 C+ clients should use gSoap 2.3
      3282 - edg-rm cannot handle certain LFNs that should be accepted.
      3296 - edg-rm sets SRM FileStatus to Active, not Running
      3300 - edg-rm pi displays httpg URIs as https

  18. GridICE monitoring improvements • A new version of edt-sensor (edt_sensor-1.4.17-0) was integrated into LCG-2 and installed on all CTB clients

  19. GFAL - Grid File Access Library • GFAL version 1.3.7 • better error handling in MDS interface (avoid core dump when info in MDS is missing or incorrect) • several new routines in LRC and RMC interface to support the new lcg_util tools. • the interface to the ADS (Rutherford) has been developed but may not be part of the release yet due to insufficient testing

  20. LCG utilities – tools requested by experiments lcg_util version 1.0.6 We now provide the following 11 methods (C API and CLI): • lcg-aa: add Alias in RMC • lcg-cp: copy file (Atlas) • lcg-cr: copy and register file with optionally specified GUID (Atlas) • lcg-del: delete file on a given SE or all replicas • lcg-gt: get TURL (Atlas) • lcg-lg: get the GUID for a given LFN or SURL • lcg-lr: list all replicas for a file having a specified GUID • lcg-ra: remove Alias in RMC • lcg-rep: replicate files between 2 Storage Elements • lcg-rf: register file with optionally specified GUID (Alice) • lcg-uf: unregister file (Alice)

  21. Castor client supporting “long names” • CASTOR-client-1.7.1.4-1.longname was installed on the CTB and will be released with the May LCG-2 upgrade. • A potential problem exists: in future it will be necessary (and prudent) to find a way of synchronizing Castor server/client releases with CERN.

  22. R-GMA integration (1/3) R-GMA is required for: • accounting • specific monitoring by some experiments (e.g. CMS) The installation could be done in one of two ways: A. R-GMA people do everything: • packaging, testing, distribution We will not get involved at all. B. R-GMA people will do: • packaging, testing, installation and configuration instructions • provide some simple tests and instructions on how to use them to verify the installation • install an R-GMA registry for us so we don't use the RAL production one for testing We will do: • certification for LCG-2, using the supplied install & config instructions • testing on our C&T testbed • inclusion in the LCG-2 distribution, as RPMs to be downloaded by sites; the installation & config instructions would be supplied by the R-GMA people

  23. R-GMA integration (2/3) The R-GMA group has chosen option B, which was also our preferred solution. In this case the R-GMA packaging has to conform to LCG standards: • the RPMs must be relocatable • they should not use any pre- or post-installation scripts • if an environment that is not part of LCG needs to be included (such as different versions of Java, Tomcat, MySQL etc.), it has to be included in such a way that it doesn't interfere with the deployed LCG-2. • Installed software must be tested on a real-life LCG-2 (or the C&T testbed) before it can be released. • Installation & configuration must be batch-like, via a script, with no interactive updating of parameters. • It is preferable to have one configuration file as a template, which may need a manual update at each site, and to have the installation & configuration script(s) take all their information from that one file. • The configuration has to consist of two parts: • the proper R-GMA config • the "system" config (setting up various services that should start on boot etc.). Two clearly separated scripts are required.
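The "one configuration file as a template" requirement above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration only: the key names (`REGISTRY_HOST`, `SERVLET_PORT`), the `key = value` file syntax and the two output templates are hypothetical and do not come from R-GMA or LCG. The sketch reads a single site file, fills two separate templates from it (the R-GMA part and the "system" part), and fails with an error instead of prompting when a value is missing.

```python
from string import Template


def parse_site_config(text):
    """Parse 'KEY = value' lines; blank lines and '#' comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {raw!r}")
        values[key.strip()] = value.strip()
    return values


def render(template_text, values):
    """Fill a template; a missing key raises KeyError (no interactive prompt)."""
    return Template(template_text).substitute(values)


# Hypothetical single site file, edited by hand once per site.
SITE_CONFIG = """
# site-specific values (illustrative names only)
REGISTRY_HOST = registry.example.org
SERVLET_PORT = 8080
"""

# Two clearly separated outputs, both fed from the same site file.
RGMA_TEMPLATE = "registry=${REGISTRY_HOST}\nport=${SERVLET_PORT}\n"
SYSTEM_TEMPLATE = "start_on_boot=yes\nservice_port=${SERVLET_PORT}\n"

values = parse_site_config(SITE_CONFIG)
rgma_part = render(RGMA_TEMPLATE, values)
system_part = render(SYSTEM_TEMPLATE, values)
print(rgma_part)
print(system_part)
```

Keeping the parser and the two templates apart mirrors the slide's request for two clearly separated scripts that both read the same file.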

  24. R-GMA integration (3/3) Current status: • we have been working for a long time with the R-GMA developers • list of services that must be published to enable job monitoring: still waiting • We have provided a category called "RGMA" for bug reporting in savannah • new bugs opened; we haven’t finished checking the new rpm’s yet: • 3645 - /tmp is not the best place to put logs • 3647 – rgma default log level is debug • 3648 – confusing configuration file • 3655 – edg-rgma-servlets overwrite configuration file • Currently unclear if it can be included in the release; no serious testing yet

  25. Accounting integration • Three weeks ago, we installed one rpm which should do the work; it had bugs. • Some patches have been provided since then by one of the R-GMA developers, not by the accounting group. They had to be installed by hand. • We have received no new rpm’s since. • We do not have any news from the accounting people. • We could not test it, obviously. • Consequently the accounting package could not be part of the release.

  26. dCache integration • dCache includes an SRM 1.1 interface and a diskpool manager • It is needed to provide managed disk space • LCG has been working with the FNAL/DESY developers to integrate their software into LCG-2 for about 3 months now • current dCache status: • the old version with patches has survived a few stress tests • but each dCache server sooner or later gets into a bad state, requiring a reboot • the latest version is not (yet) usable because SRM does not advertise the "gsidcap" protocol • 38 open problems, 13 are major • New dCache rpm’s received over the last weekend

  27. dCache integration – problem summary. Status of dCache problems (2004/05/18):

                      | Major (*) | Normal | Total
      ----------------+-----------+--------+------
      Fixed (#)       |     3     |    3   |    6
      ----------------+-----------+--------+------
      In progress (@) |     3     |    1   |    4
      No news yet     |     5     |   21   |   26
      New             |     5     |    3   |    8
      ----------------+-----------+--------+------
      Total open      |    13     |   25   |   38
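As a quick consistency check on the summary above, the "Total open" row can be recomputed from the three open categories. The counts in this small Python sketch are copied from the slide; the variable names are ours.

```python
# (major, normal) counts per open status, copied from the 2004/05/18 summary.
open_problems = {
    "In progress": (3, 1),
    "No news yet": (5, 21),
    "New": (5, 3),
}

# "Fixed" problems are listed separately and are not part of "Total open".
fixed = (3, 3)

major_open = sum(major for major, _ in open_problems.values())
normal_open = sum(normal for _, normal in open_problems.values())
total_open = major_open + normal_open

print(f"open: {major_open} major + {normal_open} normal = {total_open}")
```

The result agrees with the "38 open problems, 13 are major" figure quoted on the preceding slide.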

  28. dCache integration – problem list (1/6) • *@ RPMs: should be cleaned and automatically released. We should not get TAR files. See also points 12, 13, 14, 16, 28, 35. Packaging almost OK now (2004/05/17) • # slow response time on SRM and GridFTP to be investigated (18/2/2004). Fix by David Smith has been incorporated in latest RPMs. • path too short (24/2/2004), supposed to be fixed, to be tested, important for GFAL filesystem • perror in dcap_url.c (24/2/2004) • gfalfs/fuse/dCache integration (24/2/2004) • O_TRUNC or overwrite of existing file (25/2/2004) • dcau (25/2/2004) • pinning to be tested (24/2/2004)

  29. dCache integration – problem list (2/6) • grid-map-file conversion (10/03/2004) --> The standard grid-map-file should be used, and any other parameters (e.g. VO root dir) should be put into a separate config. file • error message when missing VO directory (10/03/2004) • * hang when writing a file and the disk is full (12/03/2004) • *# version number (including libdcap) (16/03/2004). Fixed in latest RPMs (2004/05/17) • *@ templates should be provided, configuration files should not be overwritten (17/03/2004). Almost OK in latest RPMs (2004/05/17) • relocatable RPMs (17/03/2004) • file naming (18/03/2004) • *# some files still accessed thru their /usr/d-cache name (symbolic links currently needed). Fixed in latest RPMs (2004/05/17)

  30. dCache integration – problem list (3/6) • @ host proxy and srm-storage-element-info (29/03/2004). Latest code should fix it, to be tested • srmcp and X509_USER_PROXY (currently needs complicated command line options) • # pnfs config. scripts need non-interactive mode (31/03/2004). Fixed in latest RPMs (2004/05/17) • IOTunnel library for kdcap + port number for kdcap • core dump when port not specified (31/03/2004). Now getting obscure error message: "Failed to create a control line" • # /opt/grid/gsint/gsint (01/04/2004). If it is not used, it should be removed from the RPM. If it is used, it should be moved to /opt/d-cache or /opt/gsint or ... Fixed in latest RPMs (2004/05/17) • manual garbage collection (02/04/2004) • * missing entry points: dc_chmod, dc_mkdir, dc_rename, dc_rmdir and dc_unlink (02/04/2004)

  31. dCache integration – problem list (4/6) • * non working dc_opendir • * dCache SRM returns a TURL even if no space available (02/04/2004) • dCache totalSpace vs. usedSpace vs. availableSpace (05/04/2004) Feature? We have a work-around (2004/05/17). • *# better "srm" script (06/04/2004) please use Maarten's one. Fixed in latest RPMs (2004/05/17) • pnfs mountd incompatible with normal mountd • getFileMetaData srm://lxshare0282.cern.ch:8443/pnfs/cern.ch/data/cms gives java exception while the directory exists • getFileMetaData does not return ownership • Admin Guide + Installation Guide

  32. dCache integration – problem list (5/6) • dcap User Guide (only a few APIs are currently documented; protocols and port numbers should also be documented) • We propose that a hierarchy be implemented for setting port numbers: user specified, environment variable, /etc/services, default set at compile time • *@ an object should be defined in one and only one RPM. This is currently not the case: the dCache and dCache-pool RPMs provide the same objects. Almost OK in latest RPMs (2004/05/17) • * a reboot is needed after a parameter change or software change. The recipe of restarting all java services does not work.
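The proposed port-number hierarchy can be sketched in a few lines of Python. This is only an illustration of the lookup order on the slide, not dCache code: the environment variable name, the service name and the fallback port value are placeholders chosen for the example, and the `lookup` parameter exists so the `/etc/services` step can be replaced in tests.

```python
import os
import socket

# Placeholder names and values: the slide does not specify any of these.
ENV_VAR = "DCAP_PORT_EXAMPLE"
SERVICE_NAME = "dcap-example"
DEFAULT_PORT = 22125  # stands in for the "default set at compile time"


def resolve_port(user_port=None, environ=None, lookup=socket.getservbyname):
    """Return a port using the order: user, environment, services db, default."""
    environ = os.environ if environ is None else environ

    # 1. A value given explicitly by the user wins.
    if user_port is not None:
        return int(user_port)

    # 2. Otherwise use the environment variable, if set and non-empty.
    env_value = environ.get(ENV_VAR)
    if env_value:
        return int(env_value)

    # 3. Otherwise ask the services database (/etc/services on Unix).
    try:
        return lookup(SERVICE_NAME, "tcp")
    except OSError:
        # 4. Nothing found: fall back to the built-in default.
        return DEFAULT_PORT


print(resolve_port(user_port=1234))
```

Each step only applies when all earlier ones gave no answer, so a site can override the built-in default without touching user scripts.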

  33. dCache integration – problem list (6/6) New since previous list: • * many (> ~15) parallel clients cause SRM to hang • * dcache-lcg-v1.2.2 SRM does not publish gsidcap protocol (2004/05/17). This makes that version unusable. • * SRM put error reporting if the file already exists (2004/04/14) • if an unsupported protocol is given for get/put, the request state is failed, but the file state remains pending (2004/04/23) • * libgsiTunnel.so needs globus_module_activate/deactivate gssapi module to work around a Globus bug (patch available) (2004/05/18) • * dcache stop script can leave /pnfs mounted, possibly causing an RPM upgrade or a shutdown to hang • * RPMs should come with release notes saying which bugs were fixed, what new functionality exists etc. • logfiles must be cleaned up: time stamps and request parameters must be added, harmless errors must be removed

  34. dCache integration - conclusion • A considerable amount of time (~3 months) has already been spent on dCache integration into LCG-2 • A significant number of problems remain unresolved, sometimes for many weeks • Support from the dCache developers exists but it is very irregular • Due to the many existing problems the dCache software could not yet be thoroughly tested on the CTB • Consequently it cannot be deployed yet. • Probably the best way to finish the dCache integration would be to bring the relevant dCache developers to CERN for some period of time

  35. DWS: Developers Workstation Syndrome? • Developer: It works in my environment so it must work everywhere. • Reality: It works in my environment so there is a non-zero probability it will work elsewhere, too.

  36. LCG deployment Web page redesign (1/3) Current status: • Official Web page • Template - done • Documentation management implementation is ready • Internal Web page • Upload files - done • News - under construction • Sections' web pages template - under construction

  37. LCG deployment Web page redesign (2/3) Issues: • Upload file problems: • How are we going to upload html files? They normally consist of more than one file, and the dll offered by "Web Services" at CERN is able to update only one file at a time. Testing possible solutions. • Permission problem - setting up permissions to the Documentation directory only for our group (for upload) seems impossible. Need CERN help.

  38. LCG deployment Web page redesign (3/3) Schedule: • Documentation management ready by the end of next week (24.05 - 28.05). That will include news management too (add, delete, update news). • Sections' web pages template - (31.05 - 04.06) • Release of the static information on the official web site (07.06 - 11.06) • Map with the participating institutes (~18/06) • First internal release: middle of June, for internal feedback. • Public release: ~end of June.

  39. LCG-2 – what’s next? • We think LCG-2 is now fairly stable; we have no plans for major middleware upgrades, only the obvious bugfix maintenance. • We wish to add new services, hopefully as add-on upgrades: • R-GMA (including accounting) • dCache • VOMS – generate gridmapfile • What else do YOU need? Tell us!

  40. What after RH 7.3? • In the absence (hopefully temporary) of consensus, we have chosen to port LCG-2 to: • RH Enterprise Server 3.0 IA32, the CERN variant • This should be the original RH, recompiled by CERN (license issues), consequently with “CERN” logo • It should be freely downloadable when certified by CERN • Already well integrated in autobuild • We have started to install a small testbed which we will later connect to the big C&T testbed for interoperability testing • we have a WN working (manual installation) • RH Enterprise Server 3.0 IA64 • Needed to support OpenLab • External OpenLab partners involved (HP, IBM) • Most of the work has already been done manually by OpenLab people; work is progressing to integrate it into the autobuild system

  41. [Diagram of the build system; recoverable labels: CVS server; build hosts RH 7.3 i386, CEL3 i386 and CEL3 ia64; HTTP server; CVS checkout; list of modules required; building; publishing of RPMs and reports of the build]

  42. After RH 7.3 • Porting to platforms other than RH 7.3 has become a much higher priority • We are working on some ports ourselves • We have already started a collaboration with the Irish people (they have some experience with non-Linux systems such as IRIX) • We will initiate further collaborations with QMUL and LeSC, who have offered their resources • We provide anonymous access to our CVS server and will advise on how to set up the build process • We will then introduce all changes into the CVS server for all LCG C&T tested architectures • If we do not have the necessary hardware, we will solicit help in providing access to such resources and help in certification

  43. After RH 7.3 Issues to consider for testing: • HW availability (IRIX, Solaris, others) • Interoperability between different O/S We will start with Worker Nodes, leaving service nodes on IA32 (probably RH 7.3 for now) and adding CE, SE, and other services later

  44. QUESTIONS ??

  45. Distribution process [Diagram; recoverable labels: CVS/int, CVS/dev; certification team, developers; Rpms list (details in next slide); compilation; configuration; Rpms list & configuration; Rpms; Web; LCFGng install; manual install]

  46. LCG Certification, Testing and Release Cycle [Flowchart; recoverable labels: phases CERTIFICATION TESTING, EXPERIMENTS INTEGRATION, DEPLOYMENT; EGEE and VDT (fix problems, new releases); LCG C&T section (add features, fix problems, transmit problems); Integrate; Basic Functionality Tests; Run C&T test suites, site test suites; Run Certification Matrix; Release candidate tagged; Experiments software installation; Testing experiments specific features; Certified release tag; Release: Pre-deployment, General Release; deployment feedback]

  47. EGEE Certification, Testing and Release Cycle [Flowchart; recoverable labels: JRA1, SA1; phases DEVELOPMENT & INTEGRATION, UNIT & FUNCTIONAL TESTING, CERTIFICATION TESTING, EXPTS INTEGR, DEPLOYMENT PREPARATION, PRE-PRODUCTION, PRODUCTION; SERVICES; Integrate; Basic Functionality Tests; Run tests: C&T suites, site suites; Run Certification Matrix; APPS SW Installation (LHC EXPTS, MEDICAL, OTHER TBD); tags: Dev Tag, Release candidate tag, Certified release tag, Deployment release tag, Production tag]
