This presentation discusses the organization, infrastructure, and testing activities of the JRA1 Test Team for the gLite R1 services. It covers job management, data management, information system, security, and future plans.
JRA1 testing status Maite Barroso, CERN On behalf of JRA1 test team 3rd EGEE Conference, Athens, Greece April 18-22, 2005
Outline • JRA1 test team organization and infrastructure • What we test • Test reports: • Job management • Data management • Information system • Security • Plans for the future • Summary and conclusions
Organization and infrastructure • JRA1 Testing team: • 3 members plus team leader based at CERN • One system administrator at each external site • Additional (and very active) contributor: S. Burke • Distributed testbed, 4 sites: • CERN: running all available services except the VOMS server • NIKHEF: UI, WN, CE, R-GMA, VOMS server • RAL: R-GMA, IO server being installed • Imperial College (UK): just joined; installing WMS, CE, WN, UI • Each site runs a binary-compatible version of Red Hat Enterprise Linux • CERN: SLC3, NIKHEF: CentOS 3.2, RAL: Scientific Linux, IC: RHE3
What we test • Deployment testing • Test the installation and configuration methods (deployment modules: installers plus configuration scripts, site config mechanism, apt-cache) provided by the Integration team for all services • Weekly releases from January until gLite R1 (April) • Release candidate from the Integration team -> 1 week of testing -> release to SA1 • Functional testing • Develop test suites to verify the basic functionality of all services and the interactions between them. Done in collaboration with ARDA and the JRA1 development clusters • Test suites distributed with the release • Will be covered in detail in the coming slides • Run the test suites after each release, analyze the results and produce test reports • Regression testing • Verify bugs in “ready to test” status and develop tests for the most relevant ones
gLite R1 services • Job management Services • Workload Management • Computing Element • Logging and Bookkeeping • Data management Services • File and Replica catalog • File Transfer and Placement Services • gLite I/O • Information Services • R-GMA • Service Discovery • Security • Deployment Modules • Distribution available as RPMs, binary tarballs, source tarballs and APT cache
gLite R1 services: test status • Job management Services • Workload Management • Computing Element • Logging and Bookkeeping • Data management Services • File and Replica catalog • File Transfer and Placement Services • gLite I/O • Information Services • R-GMA • Service Discovery • Security • Deployment Modules • Distribution available as RPMs, binary tarballs, source tarballs and APT cache
Job mgt services: tests • Test suite: ported LCG-2 job submission certification test suite • Additional tests run for each release, being plugged into the test suite: • DAG job • 1000-job storm • Job storm with input/output sandbox use • StorageIndex interface • Proxy renewal • Tests run in both pull and push mode, and in combinations of both • 2 VOs
Job mgt services: test results
Job mgt services: problems found • Bug summary (WMS, CE, LB, Condor): • 145 bugs submitted (50 by JRA1 test team) • 85 bugs Open, 46 of which “Ready to test” • Problems: • Failure rate ~12% with retrycount=0; 100% success with the default retrycount=3 (measured with 1 CE, 2 WNs) • The failure rate drops considerably (to 1%-5%) when the number of CEs and of WNs per CE is increased • Several causes are being investigated (e.g. race conditions) • Shallow re-submission (i.e. retry of submission, not execution) might help • Matchmaking is sometimes blocked • Quick Fix provided (QF1.0.12.04.2005 Thu Apr 14 2005) • The very first job of every user fails since the schedd is not there
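A rough reading of the retry figures above, as a minimal sketch. It assumes, which the slides do not state, that every attempt fails independently with the same ~12% probability and that retrycount=3 means up to three re-submissions after the first attempt. The function name `residual_failure` is illustrative only.

```python
def residual_failure(p_attempt: float, retry_count: int) -> float:
    """Probability that a job fails on the first attempt and on every retry.

    Assumption (not from the slides): attempts are independent and share
    the same failure probability p_attempt.
    """
    if not 0.0 <= p_attempt <= 1.0:
        raise ValueError("p_attempt must be a probability in [0, 1]")
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    # One initial attempt plus retry_count retries must all fail.
    return p_attempt ** (retry_count + 1)


if __name__ == "__main__":
    for retries in (0, 3):
        p = residual_failure(0.12, retries)
        print(f"retrycount={retries}: residual failure probability {p:.6f}")
```

Under these assumptions the residual failure probability with three retries is about 0.0002, which is compatible with seeing no failures in a run of a few hundred or a thousand jobs. If the failures are correlated (for example, caused by a race condition tied to one CE), the real figure could be higher.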
Data Management: tests • Test suites developed together with ARDA, distributed with the release • Fireman test suite: • Creation of 1000 entries (one by one, and in bulk operations of 100 files) • Read entries in a particular directory • Create directories / Remove directories • Test boundary conditions, such as long LFNs • Permission tests being added • gLite I/O test suite: • open, close, create, read files, fstat, lseek • Bulk operations (transfer 1000 files of 1 KB max in 100 cycles, transfer 5000 files of 1 KB max in 100 cycles, transfer 1000 files of 1 MB max in 1000 cycles) • Regression tests for Savannah bugs 4414, 4415, 4873, 5101 and 5329
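The one-by-one versus bulk creation test listed above can be sketched as follows. This is only an illustration: `InMemoryCatalog` is a hypothetical stand-in written for this example, not the Fireman API, and the LFN paths are made up.

```python
from typing import Iterator, List, Sequence, Set


class InMemoryCatalog:
    """Hypothetical stand-in for a file catalog (NOT the Fireman interface)."""

    def __init__(self) -> None:
        self.entries: Set[str] = set()
        self.calls = 0  # number of catalog operations issued

    def create_bulk(self, lfns: Sequence[str]) -> None:
        """Register several LFNs in a single operation."""
        self.calls += 1
        for lfn in lfns:
            if lfn in self.entries:
                raise ValueError(f"duplicate LFN: {lfn}")
            self.entries.add(lfn)

    def create(self, lfn: str) -> None:
        """Register a single LFN (one operation per entry)."""
        self.create_bulk([lfn])


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up LFNs for the example.
lfns = [f"/test/file{i:04d}" for i in range(1000)]

one_by_one = InMemoryCatalog()
for lfn in lfns:
    one_by_one.create(lfn)

bulk = InMemoryCatalog()
for batch in chunks(lfns, 100):
    bulk.create_bulk(batch)

# Both paths must leave the catalog with the same 1000 entries.
assert one_by_one.entries == bulk.entries
print(one_by_one.calls, bulk.calls)  # 1000 10
```

A real test would replace the stand-in with calls to the catalog client and compare timings as well as contents.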
Data Management: test results
Data Management: problems found • Bug summary: • 170 bugs submitted (50 by JRA1 test team) • 96 bugs Open, 31 of which “Ready to test” • Problems: • gLite I/O: • Sensitive to SRM failures • Error codes not always explanatory • Errors when downloading/retrieving several files simultaneously via the IO server (#6043) • Fireman: • Oracle SQL schema problems (#7785, fix will be included in R1.1) • Entries created in the Oracle Fireman catalog (secure) are all owned by the user's group instead of the user's certificate subject (#7928) • Problems with security configuration
R-GMA tests • Test plan defined in February by Stephen Burke • Test suite implemented by the JRA1 UK cluster. It is now in CVS and will be included in R1.1 • Tests include: • Publish a tuple with a simple predicate and check that it can be consumed • Create a secondary (latest) producer and check that it can be consumed • Create a secondary (history) producer and check that it can be consumed • Publish and consume 1000 tuples, measure the response time and check that no information is lost • Set a retention period, and check that the data continue to be available for that period • Verify that there is some information in the Site, Service, ServiceStatus and GlueXXX tables
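The "publish and consume 1000 tuples" check above has a simple shape: send numbered tuples, read them back, time the round trip, and compare what was sent with what arrived. The sketch below illustrates that shape only. It uses Python's standard `queue.Queue` as a stand-in for the producer/consumer pair and does not use or represent the R-GMA API; the function name and tuple layout are assumptions made for this example.

```python
import queue
import time
from typing import Set, Tuple


def publish_and_consume(n_tuples: int) -> Tuple[float, Set[int]]:
    """Publish n_tuples numbered tuples, consume them, and report
    the elapsed time in seconds and the set of tuple ids that were lost.
    """
    channel: "queue.Queue[Tuple[int, str]]" = queue.Queue()  # stand-in only

    start = time.perf_counter()

    # "Publish": each tuple carries a unique id so losses can be detected.
    for tuple_id in range(n_tuples):
        channel.put((tuple_id, f"value-{tuple_id}"))

    # "Consume": drain everything that is available.
    received = []
    while not channel.empty():
        received.append(channel.get())

    elapsed = time.perf_counter() - start

    sent_ids = set(range(n_tuples))
    received_ids = {tuple_id for tuple_id, _ in received}
    return elapsed, sent_ids - received_ids


if __name__ == "__main__":
    elapsed, lost = publish_and_consume(1000)
    print(f"round trip took {elapsed:.4f} s, lost tuples: {len(lost)}")
```

Numbering the tuples is the design choice that matters here: a plain count would miss the case where one tuple is lost and another is delivered twice.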
R-GMA: problems found • Bug summary: • 175 bugs submitted (121 by JRA1 test team) • 109 bugs Open, 30 of which “Ready to test” • Problems: • Difficult to deploy • Handling of proxies on WNs (fixed) • No opportunity so far for significant stress, stability or scalability testing • Creation of new tables is not currently supported except with a special program available from the R-GMA web page • R-GMA fails frequently because Tomcat servlets hang: a JVM/dual-processor issue, a known Java issue on Linux; the latest Java version (j2sdk1.4.2_08) should fix it
VOMS tests • Test plan defined in February by JRA3 (Oscar Koeroo), including test cases for voms-admin and VOMS Core Services • Test suite being implemented • voms-admin: SA1 (Karoly Lorentey), not started • VOMS Core Services: INFN (Valerio Venturi and Vincenzo Ciaschini), 80% finished • Very little testing so far • Problems: • Incompatibility with previous VOMS versions • Due to RFC compliance • Limited deployment options (single VO only) • Due to lack of communication among the involved parties • Fixed in Release 1.1
Future Plans • Continue deployment testing • Detailed functional and regression testing of all gLite services. This will have to be done in collaboration with other activities (as with VOMS and R-GMA) due to limited manpower in the JRA1 test team • Additional Job mgt services functionality (DAG, MPI) • File Transfer and Placement Services • Security • New functionality (if any) added from now on • Further development of existing test suites • Common and automated presentation of all test results • Stress, stability and scalability testing
Summary and conclusions • Testing is a task spread across different activities: JRA1, SA1, NA4. Coordination between us is essential to cover as much as possible. • The JRA1 test team does: • Deployment testing • Development of functional/regression test suites, which are run after each release (developing test suites takes time and is slower than ad-hoc testing) • Stress, stability and scalability testing • It is very important to get collaboration from experienced people in other clusters/activities to cover all gLite services (R-GMA is a good example). Without it, we cannot cover them all. • It is important to have a stable distributed testbed. This requires support and commitment from the sites.