Functionality Tests and Stress Tests on a StoRM Instance at CNAF
by Elisa Lanciotti (CNAF-INFN, CERN IT/PSS), Roberto Santinelli (CERN IT/PSS), Vincenzo Vagnoni (INFN Bologna)
Storage Workshop at CERN, 2-3 July 2007
Contents
• Setup description
• Goal of the tests: tune the StoRM parameters in order to use it as the T0D1 storage system at the CNAF T1
• Tests covering:
  • data throughput
  • access to data stored in StoRM
  • response of the system under stress
• Summary and next steps
Setup description
• Two different instances have been used:
  • storm02.cr.cnaf.infn.it: a very small instance (intended for functionality tests and small-scale stress tests) where all StoRM services run in a single box (4 TB)
  • storm-fe.cr.cnaf.infn.it: a large instance (36 TB) used for throughput tests in write and read mode and for carrying out the stress tests. Most of the results below refer to this setup.
Architecture of storm-fe.cr.cnaf.infn.it (diagram):
• DNS-balanced front-ends (storm03, storm04): the FE accepts requests, authenticates them and queues the data into the DB (MySQL on storm01)
• The back-end reads the requests from the DB and executes them on GPFS (running the GPFS clients)
• 4 gridftpd servers, which also run as GPFS servers
• Please note: the UI is a very old machine (PIII 1 GHz, 512 MB)
More details on the testbed
• Front-ends: storm03, storm04
  • dual AMD Opteron 2.2 GHz, 4 GB RAM
• Back-end: storm01
  • dual Intel Xeon 2.4 GHz, 2 GB RAM
  • also runs mysqld
• 4 GPFS disk servers
  • dual Intel Xeon 1.6 GHz, 4 GB RAM
  • also running gridftpd
• StoRM version 1.3-15
Throughput test description
• Tests using low-level tools (preliminary to the FTS tests; FTS is what LHCb will eventually use)
• Multithreaded script: each thread keeps transferring the same source files (real LHCb DST and DIGI files, O(100 MB)) to always different destination files
• Each thread does the following sequentially, for a configurable period of time (as sketched below):
  • PtP (PrepareToPut) on StoRM
  • polling of StoRM until the destination TURL is ready (up to 10 retries, with exponentially increasing wait time between retries)
  • globus-url-copy (source TURL → destination TURL) or lcg-cp (source SURL → destination TURL)
  • PutDone on StoRM
  • Ls on StoRM to compute the size transferred
  • iterate the previous actions until the total time is reached
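A minimal sketch of what one such transfer thread could look like, assuming a generic SRM v2.2 command-line client; the clientSRM sub-command names, options and output parsing below are placeholders for illustration, not the actual test script:

#!/bin/bash
# One transfer thread: keeps copying the same source file to always new
# destination files until TOTAL_TIME seconds have elapsed (placeholder paths).
SRC_TURL="gsiftp://source.example.org/lhcb/test/file001.dst"      # hypothetical source
SRM_EP="httpg://storm-fe.cr.cnaf.infn.it:8444"
DEST_BASE="srm://storm-fe.cr.cnaf.infn.it/lhcb/test/thread_$$"
TOTAL_TIME=43200                                                  # e.g. 12 hours
start=$(date +%s); i=0
while [ $(( $(date +%s) - start )) -lt $TOTAL_TIME ]; do
  i=$((i+1)); dest_surl="${DEST_BASE}/file_${i}"
  # 1) PrepareToPut on StoRM (placeholder sub-command and output parsing)
  token=$(clientSRM ptp -e "$SRM_EP" -s "$dest_surl" | awk '/requestToken/ {print $NF}')
  # 2) Poll StoRM until the destination TURL is ready (exponential back-off, max 10 retries)
  turl=""; wait_s=2
  for retry in $(seq 1 10); do
    turl=$(clientSRM sptp -e "$SRM_EP" -t "$token" | awk '/TURL/ {print $NF}')
    [ -n "$turl" ] && break
    sleep $wait_s; wait_s=$((wait_s * 2))
  done
  [ -z "$turl" ] && continue                         # counted as a StoRM failure
  # 3) Copy the source TURL to the destination TURL
  globus-url-copy "$SRC_TURL" "$turl" || continue    # counted as a transfer failure
  # 4) PutDone and 5) Ls on StoRM to accumulate the size actually transferred
  clientSRM pd -e "$SRM_EP" -t "$token" -s "$dest_surl"
  clientSRM ls -e "$SRM_EP" -s "$dest_surl"
done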
Throughput test description (cont'd)
• Tuning of the optimal nstreams × nprocesses × sources combination:
  • short tests varying the number of streams and the number of source files (= threads) for each of the 7 sources (= T1 site endpoints), and varying the sources […] (see the sketch below)
• Use-case test: run with the best files × streams combination for a non-negligible time (>12 hours), using/emulating the full transfer chain (i.e. SRM v1 at the 7 source sites, SRM v2 at the StoRM destination, with lcg_utils and the StoRM clients)
• Evaluation of the maximum throughput (from CERN, using the source TURLs directly, so that no delay is added by an SRM at the source)
• Running under these last (server-)extreme conditions for a sustained period of time (14 hours and more…)
• Testing the removal capability offered by SRM v2.2, and more specifically by StoRM; LHCb (struggling with the DC06 clean-up) would be very interested
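A minimal sketch of how the streams × files scan could be driven with globus-url-copy (-p sets the number of parallel TCP streams per file, -vb prints the transfer rate); hostnames and paths are placeholders:

#!/bin/bash
# Short scan over (parallel streams, concurrent files) combinations from one
# source site; endpoints and paths are placeholders.
SRC="gsiftp://diskserver.source.example/lhcb/test/file.dst"
DST_BASE="gsiftp://gridftp.storm.example/gpfs/lhcb/test"
for streams in 1 2 5 10; do
  for nfiles in 1 5 10 20; do
    start=$(date +%s)
    for f in $(seq 1 $nfiles); do
      # -p: parallel TCP streams per file; -vb: print throughput while copying
      globus-url-copy -vb -p $streams "$SRC" "${DST_BASE}/s${streams}_f${nfiles}_${f}" &
    done
    wait
    echo "streams=$streams files=$nfiles elapsed=$(( $(date +%s) - start ))s"
  done
done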
Throughput tests
• Only CERN is used, since previous tests confirmed it as the site with the best WAN connection to CNAF-StoRM; the throughput test is carried out without an SRM at the source (i.e. only input TURLs)
• Use-case test: 150 MB/s, 7.7 TB transferred in 14 h
• StoRM failures (no TURL returned, no PutDone status set, no size information retrieved): <0.5%
• Transfer failures: 1-5% (depending on the source)
Throughput tests (from CERN only)
• Total handled files: ~100 k; at least 400 k SRM interactions
• Failure rate in copying: 0.2%**; failure rate due to StoRM: <0.1%
• Amount of data (in 50 k s): >14 TB; bandwidth peak: 370 MB/s
• Note: the Linux out-of-memory killer kicked in on my desktop (the auxiliary client instance) and we started losing processes
• ** number of non-zero exit codes from globus-url-copy
• (Plot from GridView: throughput from CERN)
Removal test
• 17 TB of data spread over 50 directories deleted in 20 minutes
• (Plot: disk occupancy vs time during the removal; a simple way to record this curve is sketched below)
• Command used:
for i in $(seq -w 1 50); do
  clientSRM rmdir -r -e httpg://storm-fe.cr.cnaf.infn.it:8444 \
    -s srm://storm-fe.cr.cnaf.infn.it/lhcb/roberto/testrm/$i
done
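A simple way to record the occupancy-vs-time curve while the removal loop runs, assuming the StoRM storage area is a GPFS filesystem mounted on the node where the check is performed (/gpfs/lhcb is a placeholder mount point):

# Sample the used space of the storage area every 30 s into a plain data file.
while true; do
  used_kb=$(df -P -k /gpfs/lhcb | awk 'NR==2 {print $3}')   # used space in kB
  echo "$(date +%s) $used_kb" >> occupancy_vs_time.dat
  sleep 30
done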
Access to data stored in StoRM (I)
• Preliminary operation: transfer of some LHCb datasets (1.3 TB) to StoRM
• A test suite has been set up for basic functionality tests:
  • submit a job which opens a dataset in StoRM with ROOT
  • submit a job which runs DaVinci on datasets in StoRM
• The test job ships an executable (bash script) and an options file containing a list of SURLs
• The executable (sketched below):
  • downloads a tarball with an SRM v2.2 client and installs it locally on the WN
  • executes some client commands to get a TURL list from StoRM
  • runs DaVinci, which takes the TURL list as input and opens and reads the files stored in StoRM
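A minimal sketch of what such a worker-node executable could look like; the tarball URL, the client sub-commands and the DaVinci invocation are schematic placeholders, not the actual test-suite code:

#!/bin/bash
# Worker-node executable shipped with the test job: reads a file with one SURL
# per line ($1), resolves each SURL to a TURL through StoRM, then runs DaVinci.
SURL_LIST=$1
SRM_EP="httpg://storm-fe.cr.cnaf.infn.it:8444"

# 1) Download and install the SRM v2.2 client locally on the WN (placeholder URL)
wget -q http://repository.example.org/srm22-client.tar.gz
tar xzf srm22-client.tar.gz
export PATH=$PWD/srm22-client/bin:$PATH

# 2) Resolve each SURL to a TURL (placeholder sub-command and output parsing)
> turls.txt
while read surl; do
  clientSRM ptg -e "$SRM_EP" -s "$surl" | awk '/TURL/ {print $NF}' >> turls.txt
done < "$SURL_LIST"

# 3) Build a job options fragment listing the TURLs (schematic options syntax)
while read turl; do
  echo "EventSelector.Input += {\"DATAFILE='$turl' TYP='POOL_ROOTTREE' OPT='READ'\"};"
done < turls.txt > data.opts

# 4) Run DaVinci on the TURL list (schematic invocation)
cat DaVinci.opts data.opts > job.opts
DaVinci.exe job.opts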
Access to data stored in StoRM (II)
• Results: on both storm02 and storm-fe the functionality tests are successful
• Ongoing activity:
  • repeat the test with many (~hundreds of) jobs accessing the data at the same time, to prove the feasibility of HEP experiment analysis
• Next step:
  • use the Ganga and DIRAC interfaces to submit the jobs (on the PPS)
First stress tests
• Objectives:
  • test how many simultaneous requests the system can handle
  • see what happens when saturation is reached
• Tests done on both systems: storm02 and storm-fe
• First test: load the system with an increasing number of parallel processes issuing PtP (PrepareToPut) requests
Testbed description
• A main script running on the UI launches NPROC parallel processes; process i writes into its own directory (/lhcb/../dir_i/) on storm-fe (see the sketch below)
• Each process:
  • first phase: lists the content of its destination directory in StoRM and removes all the files in it
  • second phase: performs N PtP requests to the system and polls it to get the TURLs (no data transfer)
• Measurements:
  • total time to perform the N requests
  • percentage of failed requests
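A minimal sketch of the driver and of a single stress-test process, with placeholder sub-command names, output parsing and directory layout:

#!/bin/bash
# Launch NPROC parallel processes; process i works in its own StoRM directory.
NPROC=${1:-100}     # number of parallel processes
NREQ=${2:-50}       # PtP requests performed by each process
SRM_EP="httpg://storm-fe.cr.cnaf.infn.it:8444"

stress_proc() {
  local dir="srm://storm-fe.cr.cnaf.infn.it/lhcb/stresstest/dir$1"
  # Phase 1: list the destination directory and remove everything in it
  clientSRM ls -e "$SRM_EP" -s "$dir"
  clientSRM rmdir -r -e "$SRM_EP" -s "$dir"
  clientSRM mkdir -e "$SRM_EP" -s "$dir"
  # Phase 2: N PrepareToPut requests, each polled once for its TURL (no data transfer)
  local ok=0 start=$(date +%s)
  for i in $(seq 1 $NREQ); do
    token=$(clientSRM ptp -e "$SRM_EP" -s "$dir/file$i" | awk '/requestToken/ {print $NF}')
    turl=$(clientSRM sptp -e "$SRM_EP" -t "$token" | awk '/TURL/ {print $NF}')
    [ -n "$turl" ] && ok=$((ok + 1))
  done
  echo "proc $1: $(( $(date +%s) - start )) s, failed $(( NREQ - ok ))/$NREQ requests"
}

for p in $(seq 1 $NPROC); do stress_proc $p & done
wait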
Preliminary results on storm-fe
• (Plot: mean time per request vs number of parallel requests; a slight increase with the number of parallel requests is observed)
• (Plots: examples of the distribution of the time per request for 500 (left) and 600 (right) parallel processes)
Results (II)
• Failed requests vs number of parallel requests:
  • almost no failures up to 500 parallel processes
  • for 600 parallel processes a non-negligible failure rate is observed
• Causes of the failed requests: mainly 3 types of error found:
  • "CGSI-gSoap: error reading token data: connection reset by peer" and "CGSI-gSoap: could not open connection! TCP connect failed in tcp_connect()"
  • Ls returns SRM_INTERNAL_ERROR: "client transport failed to execute the RPC. HTTP response: 0"
  • some client commands hung for hours (mainly StatusPtP)
• Almost 100% of the gSOAP-timeout failures occur in the first phase, i.e. when creating the destination directory or when listing the content of the directories and deleting the files ► specific tests are needed for rm, ls, mkdir
• Almost no failures in the PtP-StatusPtP phase
Ongoing activity on stress tests
• Specific tests on the functionalities which have shown problems: Ls, rm, mkdir
• Preliminary results:
  • mkdir: 2% failures with 600 parallel jobs
  • rm: 6-7% failures with 600 parallel jobs; all failures are due to gSOAP timeouts. More systematic tests are needed to study the dependency of the failure rate on the load of the system
• A very high load was noticed on the front-end during the test: 85% CPU usage (back-end only at 15-20%)
• During the tests: collaboration with the StoRM developers to investigate and fix the problems found
• Some optimization of the DB has been done on the basis of the results of these tests
Data transfer using FTS 2.0
• Simple tests of data transfer using FTS 2.0 from CASTOR at CERN (SRM v1) to StoRM at CNAF (SRM v2.2) (already proved by Flavia Donno)
• FTS service endpoint of the CERN PPS: https://pps-fts.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer
• Aim: running throughput tests with the production instance of FTS (see the sketch below)
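A minimal sketch of such a transfer with the gLite FTS command-line clients; the SURLs are placeholders, and the option names and job states should be checked against the installed client version:

# Submit one CASTOR -> StoRM transfer through the PPS FTS 2.0 instance and poll it.
FTS="https://pps-fts.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer"
SRC="srm://srm.cern.ch/castor/cern.ch/grid/lhcb/test/file001.dst"      # placeholder SURL
DST="srm://storm-fe.cr.cnaf.infn.it/lhcb/ftstest/file001.dst"          # placeholder SURL

JOBID=$(glite-transfer-submit -s "$FTS" "$SRC" "$DST")
echo "submitted FTS job $JOBID"
while true; do
  state=$(glite-transfer-status -s "$FTS" "$JOBID")
  echo "$(date '+%H:%M:%S') $state"
  case "$state" in Done|Finished*|Failed|Canceled) break ;; esac
  sleep 60
done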
Summary and next steps
• So far:
  • estimation of the throughput: very good results obtained from several sites to StoRM
  • access to data: proved that DaVinci can access files stored in StoRM
  • first stress tests: very promising results; some tuning of the StoRM parameters is still ongoing
  • file transfer from CERN CASTOR to StoRM (SRM v2.2) via FTS 2.0
• Next steps:
  • ongoing stress tests for tuning the service parameters, in collaboration with the StoRM developers
  • about access to data: run many (~hundreds of) parallel jobs on the CNAF local batch system; include Ganga and DIRAC in the job submission sequence
Acknowledgements
• Thanks to the CNAF storage staff and to the StoRM team for providing the resources for these tests and for their support