The DPM Testing Framework
Gilbert Grosdidier, LAL-Orsay/IN2P3/CNRS & LCG
DPM Testing Suites @ CERN - AllHands - GG
Required Resources
• Build machine (managed by LP)
  • Used for both the MySQL and Oracle flavors
  • Requires special shareable libs, often temporary ones
  • lcg-build-sl3
• Two main test clusters (mostly managed by GG)
  • Installed through YAIM + manual upgrades
  • MySQL flavor: 1 pool, 4 filesystems in total on 3 nodes
    • Master node: lxb1727
    • Disk servers: lxb1903 + lxb1904
  • Oracle flavor: 1 pool, 2 filesystems in total on 2 nodes
    • Master node: lxb1902
    • Disk server: lxb1901
• Additional production installation
  • MySQL flavor: 1 pool, 3 filesystems, all servers and pools on the same node
  • FIO managed, through Quattor (+ GG for the DPM stuff, through YAIM)
  • lxdpm01
Miscellaneous Resources (2)
• Rather specific setup:
  • All servers are allowed to core dump (requires special tricks for GridFTP)
• Tests also use the MySQL LFC: lxb1941 (CTB)
  • And eventually the Oracle LFC: lxb1782 (CTB?)
• The central CTB BDII is not used any more
  • But each master node is required to run an up-to-date Information System
• Of course, a fully installed UI is also required to run the tests
• And also the GSIPWD of the operator!
Building Process
• For each major update provided by JPB
  • RPMs are rebuilt for both DB flavors
  • But there is an additional fast-track build tree available
    • To allow for rebuilding each MySQL-flavor server separately
    • And also for rebuilding each of the test areas
• Then the RPMs are reinstalled
  • For MySQL only most of the time (lxb1727)
  • The pool nodes (disk servers) are reinstalled much less often
    • Only RFIOD and GRIDFTP are involved in this case
• When fixes are provided for a single server, only that one is rebuilt
  • This provides a very short life cycle
    • Except for the DPM-enabled GRIDFTP server, which is a pain …
  • Servers can be reinstalled/restarted several times an hour
• The current build system is the ‘good ol’ one’ (no gLite, no ETICS)
  • This is not a provocation: I actually tried to move to each of the above, but either it was a dead end or it was far too slow
  • No time yet to investigate why it was like that ;-( the last try was in August
Brief overview of the current DPM engine
• The test install is currently made of 7 servers
  • The Name Server (DPNS)
  • The DPM itself (the main stuff)
  • The SRMv1 server
    • (old, frozen, simplistic, but used by the current lcg-XXX and GFAL)
  • The SRMv2.1 server
    • more sophisticated, but already deprecated
    • will soon be replaced by the next one
  • The SRMv2.2 server
    • the state of the art, but not yet in production
    • (the pseudo-standard is not even frozen yet)
  • The RFIOD server (with GSI authentication)
  • The GRIDFTP server (likewise with GSI)
  • The latter 2 are replicated on each disk server
• An additional SLAPD server is used for the Information System
• There is roughly one test suite for each of these servers
  • Excluding the DPNS (and SLAPD)
Testing Suites Contents
• The structure of each test suite is more or less identical
  • Each method to be tested gets a companion C module
  • A Perl driver then merges these C modules into various combinations (see the sketch below)
    • This makes it easier to add new use cases and to build a pseudo-job
  • Inside a given suite, a command failure ought to break the suite
    • Because the next commands in the flow need to reuse the results/objects created upstream
• Most of the test suites are plugged into yet another Perl module
  • globalSuite
  • They are almost independent from each other
  • They are individually allowed to fail, in which case control goes to the next one
  • The main suite is callable with only 2 arguments from the command line
    • globalSuite node-name proxy-type
  • The result is a simple score displayed by the suite
    • The score must be 31 for a standalone DPM, 34 if there are additional pool servers
• ADMIN commands issue (DPM socket)
  • They are now required to run on the server node itself (not from a UI)
  • Meaning they are not systematically tested inside these suites
  • Ex: dpm-modifyfs, dpm-drain
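For illustration, here is a minimal Perl sketch of the driver pattern described above: chain the compiled C modules in order, abort the current suite on the first failure (since later commands reuse objects created upstream), and contribute to an overall score. The module names, the command-line layout and the scoring shown here are assumptions for the sketch, not the actual globalSuite code.

  #!/usr/bin/perl
  # Illustrative sketch only: run each compiled C test module in sequence,
  # stop the suite on the first failure, and report a partial score.
  use strict;
  use warnings;

  my ($node, $proxyType) = @ARGV;   # globalSuite-style call: node-name proxy-type
  die "Usage: $0 node-name proxy-type\n" unless defined $proxyType;

  # Each entry stands for one companion C module (names are hypothetical).
  my @steps = ("./srmv2_mkdir $node", "./srmv2_put $node", "./srmv2_get $node");

  sub run_suite {
      my (@commands) = @_;
      foreach my $cmd (@commands) {
          print scalar(localtime), "  running: $cmd\n";
          if (system($cmd) != 0) {
              # Later commands reuse objects created upstream, so stop here.
              print "FAILED: $cmd -- aborting this suite\n";
              return 0;
          }
      }
      return 1;   # this suite contributes its share to the global score
  }

  my $score = 0;
  $score += run_suite(@steps);
  print "Partial score: $score (proxy type: $proxyType)\n";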
Testing Suites timing
• The globalSuite timing split:
  • Operation: rfioSuite with DPM = [OK] Duration: 28 sec.
  • Operation: rfioSuite with NODPM = [OK] Duration: 18 sec.
  • Operation: gsiftpSuite with DPM = [OK] Duration: 81 sec.
  • Operation: gsiftpSuite with NODPM = [OK] Duration: 64 sec.
  • Operation: gfal_test = [OK] Duration: 77 sec.
  • Operation: srmv1Suite with RFIO = [OK] Duration: 56 sec.
  • Operation: srmv1Suite with GSIFTP = [OK] Duration: 61 sec.
  • Operation: srmv2Suite = [OK] Duration: 265 sec.
  • Operation: socketSuite = [OK] Duration: 115 sec.
• The overall suite lasts about 14 min.
• The SRMv2.2 suite now requires about 400 sec.
  • It is NOT included in the previous one
  • It will supersede the current srmv2Suite in the above globalSuite rather soon
The gory details
• The SRMv2.2 suite
  • Performs 240 operations (more to come) in about 400 sec.
  • 36 different C modules implement 39 methods (no more coming)
    • + 3 miscellaneous methods (rfcp, GUC, diff)
  • All available methods are implemented and tested
• The SRMv2 suite (elder brother of the above one)
  • Performs 160 operations in about 250 sec.
  • 26 different C modules implement 28 methods
    • + 6 miscellaneous methods (rfcp, GUC, diff, dpm-xx, dpns-xx)
• The SRMv1 suite (Jiri Kosina)
  • Is implemented in one single C module merging 9 methods
• The socket suite
  • Performs 60 operations in about 120 sec.
  • 9 different C modules are used in the suite and implement 10 methods
    • + 2 miscellaneous methods (rfcp, GUC)
  • 13 additional modules implement 13 more methods, but are not tested regularly
    • The relevant functionalities are however tested through the SRM front-ends above
More details about globalSuite
• It also includes
  • RFIO suites for standard and DPM-like transfers
    • Standard transfer is:
      • rfcp stand.flxb1706S1 lxb1727.cern.ch:/tmp/grodid/fil135732S1
    • DPM-like transfer is:
      • rfcp some.lxb1706 /dpm/cern.ch/home/dteam/ggtglobrfil135732
  • GridFTP suites for standard and DPM-like transfers
  • A GFAL suite merging 6 different methods in one C module
  • Two lcg-util commands (lcg-cr and lcg-gt)
• The log file is rather extensive
  • For each command (C module call), it displays
    • A short help about the module call
    • The full command line actually used, with every argument
    • The output of the command and a timestamp
    • The status and duration of the command
    • (a minimal logging sketch follows below)
  • This allows for digging into the server log files to spot the origin of a failure, when required :-)
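As a rough sketch of the per-command logging described above, the wrapper below records a short help line, the full command line, the command output, a timestamp, and the status/duration. It is a hypothetical illustration (the run_and_log helper and the example rfcp paths are made up for the sketch), not the actual globalSuite logging code.

  #!/usr/bin/perl
  # Illustrative logging wrapper: one entry per C-module call.
  use strict;
  use warnings;
  use POSIX qw(strftime);

  sub run_and_log {
      my ($help, $cmd) = @_;
      my $start = time();
      print "# $help\n";
      print strftime("%Y-%m-%d %H:%M:%S", localtime($start)), "  CMD: $cmd\n";
      my $output = `$cmd 2>&1`;        # capture stdout and stderr for the log
      my $status = $? >> 8;
      my $elapsed = time() - $start;
      print $output;
      printf "Status: %s  Duration: %d sec.\n\n",
             ($status == 0 ? "[OK]" : "[FAILED:$status]"), $elapsed;
      return $status;
  }

  # Example of a DPM-like RFIO transfer, as on the slide above (paths are hypothetical):
  run_and_log("copy a local file into the DPM namespace via rfcp",
              "rfcp /tmp/some.file /dpm/cern.ch/home/dteam/testfile");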
What is not covered in these tests?
• The DPNS is not tested per se, only indirectly
• The LFC is not covered either, only briefly through indirect commands
• The LCG-UTIL package is not tested per se
  • Only for a few commands which involve the DPM back-end
  • Mostly to check that the DPM is smoothly integrated into the Info System
• However, the GFAL package is tested extensively
  • In its current version, connected to the SRMv1 back-end
  • The relevant test module is recompiled in place during the tests
    • Recompilation is part of the test (see the sketch below)
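A minimal sketch of "recompilation is part of the test": rebuild the GFAL test module in place before running it, and count a build failure as a test failure. The source/binary names, compiler flags and the -lgfal library name are assumptions for the sketch, not the actual build line used by the suite.

  #!/usr/bin/perl
  # Illustrative sketch: compile then run the GFAL test module, failing fast.
  use strict;
  use warnings;

  my $build = "gcc -o gfal_test gfal_test.c -L/opt/lcg/lib -lgfal";   # hypothetical flags
  my $run   = "./gfal_test";

  foreach my $step ($build, $run) {
      print "CMD: $step\n";
      if (system($step) != 0) {
          print "Operation: gfal_test = [FAILED] (at step: $step)\n";
          exit 1;
      }
  }
  print "Operation: gfal_test = [OK]\n";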
Where to find the source code?
• Everything is available in the official CVS
  • http://glite.cvs.cern.ch:8180/cgi-bin/glite.cgi/LCG-DM/
  • Merged within the LCG-DM package
    • Heavy dependencies on other DPM stuff
  • Useful directories are:
    • LCG-DM/socket/dpmcli
    • LCG-DM/test/dpm, LCG-DM/test/srmv1, LCG-DM/test/srmv2
• The tests are not packaged in any RPM
• The last commit includes all the material required for testing up to DPM-1.5.10
  • The latest released version
• Nothing about SRMv2.2 has been committed yet
  • Should come along with DPM-1.6.x
How to build the test stuff?
• Here are the commands required to set up the DPM testing area:
  • Log on to lcg-build-sl3 (ask LP about it if you're not allowed yet)
  • cd to an AFS public area of yours
  • Point your CVS env to the new lcgware area (glite.cvs.cern.ch:/cvs/glite)
  • cvs checkout -r LCG-DM_R_1_5_10 LCG-DM
  • cd LCG-DM
  • setenv LIBRARY_PATH /opt/lcg/lib
  • setenv LD_LIBRARY_PATH /opt/globus/lib
  • make -f Makefile.ini Makefiles
  • make clobber
  • make
  • cd socket
  • make
  • cd ../test
  • make
• The main suite Perl script is: LCG-DM/test/dpm/globalSuite
  • It should run off the shelf :-)
  • Ex: globalSuite node-name [globus|voms|vomsR]
Stress testing
• For the socket, srmv1 & srmv2 suites, an upper layer was built to allow for stress testing one specific server type at a time
  • It launches in one shot several tens (from 10 up to 40-50) of generic suites of the selected type, making them run in parallel (see the sketch below)
  • BTW, it also stresses the UI, not only the target servers :-)
• This type of test was very useful to spot weaknesses in inter-server communications
• It is not advisable to submit more than 50 suites at the same time
  • The TCP stack on the target node will be “overflowed”
  • One can submit these stress tests from several UIs at a time
    • But one has to respect the 50-job limit
  • In addition, a single UI seems sufficient to saturate the target node
• It is often required to restart the DPM servers after such a bombardment
  • Debugging the server logs after such a storm is rather painful … ;-)
  • This is not required for functional testing
• The relevant Perl drivers are: socketStress, srmv1Stress & srmv2Stress
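The sketch below shows the general shape of such a stress layer: fork N copies of one generic suite against the same target node and wait for all of them. It is an illustration only; the suite name, its arguments and the 50-suite guard are taken from the slide, but this is not the actual socketStress/srmv1Stress/srmv2Stress code.

  #!/usr/bin/perl
  # Illustrative stress driver: launch N parallel copies of a generic suite.
  use strict;
  use warnings;

  my ($node, $count) = @ARGV;
  die "Usage: $0 node-name [count]\n" unless defined $node;
  $count ||= 10;
  die "Refusing to launch more than 50 parallel suites\n" if $count > 50;

  my @pids;
  for my $i (1 .. $count) {
      my $pid = fork();
      die "fork failed: $!\n" unless defined $pid;
      if ($pid == 0) {                          # child: run one generic suite
          exec("./srmv2Suite", $node) or die "exec failed: $!\n";
      }
      push @pids, $pid;
  }

  my $failures = 0;
  foreach my $pid (@pids) {
      waitpid($pid, 0);
      $failures++ if $? != 0;
  }
  print "$count suites launched, $failures failed\n";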
Are these tests tested?
• They have been run against
  • My own install nodes, + lxdpm01
  • Most of the CTB DPM nodes, at various times (lxb1921 currently)
    • Very useful to spot misconfiguration issues
  • The LAL-Orsay DPM site (GRIF), which was in addition a multi-domain installation
• It is usually not a problem to target a remote DPM, once the firewall issues have been cleared
• QUESTIONS?