Running Jobs on the Grid - An ATLAS Software Developer's Perspective -
Advantages to Running on the Grid
• I am involved in helping the B-Physics group validate their Bs→J/ψφ samples.
• I do this by taking generated events and running simulation, digitization, reconstruction and analysis on each major release of Athena.
• This is very resource intensive: up to 125,000 events need processing.
• It usually takes more than a week to run on the lxplus or csf batch nodes.
• Running many large jobs at frequent intervals greatly reduces my priority in the batch queue, leading to ever longer job start delays.
• On the Grid, jobs usually start within a few minutes of submission.
• Greater resource availability means that queue prioritisation is not a problem.
Getting Set Up
• Make sure you have a Grid certificate
  • instructions at https://twiki.cern.ch/twiki/bin/view/Atlas/WorkBookStartingGrid
• Join the ATLAS Virtual Organisation
  • see http://www.usatlas.bnl.gov/twiki/bin/view/Support/HowToAtlasVO
• Log into a grid User Interface (e.g. lxplus.cern.ch, csf.rl.ac.uk)
• Set up your local grid environment, e.g. on lxplus use:
  • source /afs/cern.ch/project/gd/LCG-share/sl3/etc/profile.d/grid_env.sh
  • I haven't needed to do this on csf.
  • Ask your local expert what you should use elsewhere.
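As a quick sanity check that the UI environment is working, a minimal session setup on lxplus might look like the sketch below (grid-proxy-info is assumed to be available on the UI alongside grid-proxy-init):

    # Sketch: one-off session setup on an LCG User Interface such as lxplus
    source /afs/cern.ch/project/gd/LCG-share/sl3/etc/profile.d/grid_env.sh   # grid client tools
    grid-proxy-init              # create a short-lived proxy from your Grid certificate
    grid-proxy-info -timeleft    # seconds of validity left on the proxy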
Prepare Scripts
A typical Grid job has a basic structure built from a small number of scripts:
• JDL script
  • A Grid job needs to carry certain information about your job (e.g. Athena release, virtual organisation, CPU requirements, input and output files…) to the remote grid site.
  • All of this information is contained in a file of type <YourJDLFileName>.jdl.
• Executable
  • You need an executable script: a shell script which runs the commands on the remote grid site. It sets up CMT and other environment variables, and it acts as a wrapper for your Athena job.
  • Parameters such as the number of events, the Athena release number and the geometry version are set in the JDL Arguments string and passed to this script (see the sketch after this list).
• CompileReleaseCode.def, setup-release.sh & dq.tar.gz (these don't need editing)
  • some definitions for the release code
  • sets some environment variables
  • file I/O handlers (optional)
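As a rough sketch of how the JDL and the executable fit together (the argument list here is shortened and purely illustrative; the real wrapper on the following slides takes more parameters, and myinput.pool.root is a hypothetical file name):

    # In the JDL:   Arguments = "12.0.1 castorgrid.cern.ch steved myinput.pool.root 500";
    # The Executable is then invoked on the worker node roughly as
    #   ./MyScript.sh 12.0.1 castorgrid.cern.ch steved myinput.pool.root 500
    # so inside MyScript.sh the values arrive as positional parameters:
    T_RELEASE="$1"     # Athena release, e.g. 12.0.1
    T_SE="$2"          # storage element
    T_LCN="$3"         # catalogue user name
    T_INFN="$4"        # input file name
    T_NEVT="$5"        # number of events to process
    echo "Release ${T_RELEASE}: processing ${T_NEVT} events from ${T_INFN}"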
The JDL File
[
  Executable = "MyScript.sh";
  InputSandbox = {"<MyGridDir>/scripts/MyScript.sh",
                  "<MyGridDir>/scripts/setup-release.sh",
                  "<MyGridDir>/scripts/CompileReleaseCode.def",
                  "<MyGridDir>/scripts/dq.tar.gz"};
  OutputSandbox = {"stdout","stderr"};
  stdoutput = "stdout";
  stderror = "stderr";
  Arguments = "12.0.1 castorgrid.cern.ch steved <input_location> <input_file> <output_file> <nevent>";
  Environment = {"T_LCG_GFAL_INFOSYS=atlas-bdii.cern.ch:2170"};
  VirtualOrganisation = "atlas";
  Requirements = Member("VO-atlas-offline-12.0.1", other.GlueHostApplicationSoftwareRunTimeEnvironment)
                 && other.GlueHostNetworkAdapterOutboundIP == TRUE
                 && other.GlueHostMainMemoryRAMSize >= 512
                 && other.GlueCEPolicyMaxCPUTime > 1000
                 && (!RegExp("dgce0.icepp.jp:2119", other.GlueCEUniqueId));
  Rank = (other.GlueCEStateWaitingJobs == 0 ? other.GlueCEStateFreeCPUs : -other.GlueCEStateWaitingJobs);
]
• In the Arguments string, "12.0.1" is the ATLAS software release version and "castorgrid.cern.ch" is the storage element.
• "VO-atlas-offline-12.0.1" in the Requirements is the ATLAS software tag version, similar to the release version. For releases prior to 11.5.0 the ATLAS software tag version format is VO-atlas-release-xx-xx-xx.
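Before submitting, it is worth confirming that at least one computing element actually publishes the tag named in the Requirements expression; a small check, reusing the lcg-infosites command shown later in this talk, might be:

    # Sketch: check that some CE advertises the required ATLAS software tag
    lcg-infosites --vo atlas tag | grep -B 5 "VO-atlas-offline-12.0.1"
    # -B 5 prints a few preceding lines, which is often enough to see the "Name of the CE:" header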
The JDL File – Interactive Mode
Interactive jobs can be run on the grid, which can be useful for debugging purposes.
N.B. The lines
  OutputSandbox = {"stdout","stderr"};
  stdoutput = "stdout";
  stderror = "stderr";
that appear in a standard jdl file must be absent here in order for the job to run.
The JobType line is what makes the job run interactively.
[
  JobType = "Interactive";
  Executable = "MyScript.sh";
  InputSandbox = {"<MyGridDir>/scripts/MyScript.sh",
                  "<MyGridDir>/scripts/setup-release.sh",
                  "<MyGridDir>/scripts/CompileReleaseCode.def",
                  "<MyGridDir>/scripts/dq.tar.gz"};
  Arguments = "12.0.1 castorgrid.cern.ch steved <input_location> <input_file> <output_file> <nevent>";
  Environment = {"T_LCG_GFAL_INFOSYS=atlas-bdii.cern.ch:2170"};
  VirtualOrganisation = "atlas";
  Requirements = Member("VO-atlas-offline-${release}", other.GlueHostApplicationSoftwareRunTimeEnvironment)
                 && other.GlueHostNetworkAdapterOutboundIP == TRUE
                 && other.GlueHostMainMemoryRAMSize >= 512
                 && other.GlueCEPolicyMaxCPUTime > 1000
                 && (!RegExp("dgce0.icepp.jp:2119", other.GlueCEUniqueId));
  Rank = (other.GlueCEStateWaitingJobs == 0 ? other.GlueCEStateFreeCPUs : -other.GlueCEStateWaitingJobs);
]
The Executable (Part 1)
#!/bin/sh
# Some environment stuff: helper functions for the DQ file-staging tools

dq_prep() {
    ATLAS_PYTHON="`(cd $T_DISTREL/AtlasOfflineRunTime/cmt; cmt show macro_value Python_home)`"
    POOL_home="`(cd $T_DISTREL/AtlasOfflineRunTime/cmt; cmt show macro_value POOL_home)`"
    SEAL_home="`(cd $T_DISTREL/AtlasOfflineRunTime/cmt; cmt show macro_value SEAL_home)`"
    DQ_LD_LIBRARY_PATH=${ATLAS_PYTHON}/lib:${LD_LIBRARY_PATH}
    DQ_PYTHONPATH=${POOL_home}/bin:${POOL_home}/lib:${PYTHONPATH}
    export SEAL_KEEP_MODULES="true"
    export SEAL_PLUGINS=${SEAL_home}/lib/modules:${POOL_home}/lib/modules:${SEAL_PLUGINS}
    export POOL_OUTMSG_LEVEL=8
}

dq() {
    STAGER=$PWD/run_dqlcg.sh
    STAGERARGS="$@"
    # Setup DQ: swap in the DQ library/python paths, run the stager, then restore them
    PYTHONPATH_SAVE=$PYTHONPATH
    LD_LIBRARY_PATH_SAVE=$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=${DQ_LD_LIBRARY_PATH}
    export PYTHONPATH=${DQ_PYTHONPATH}
    source $STAGER $STAGERARGS
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH_SAVE
    export PYTHONPATH=$PYTHONPATH_SAVE
}
#--------------------------------------------------------------------------------------------------
The Executable (Part 2)
These are the variables that get entered in the JDL file. They are for a generic reconstruction job and get set later.

#---------------------------
export T_RELEASE="${1}"      # ATLAS software release, e.g. 12.0.2
export T_SE="${2}"           # storage element
export T_LCN="${3}"          # user name
export T_INFN="${4}"         # input file name
export T_OUTFN_ESD="${5}"    # output ESD file name
export T_OUTFN_AOD="${6}"    # output AOD file name
export T_NEVT="${7}"         # number of events
export T_SKIP="${8}"         # events to skip over
export T_GEO_VER="${9}"      # geometry version, e.g. ATLAS-DC3-07
shift 9                      # number of arguments that come from the JDL file
#---------------------------
The Executable (Part 3)
# Set up Athena and set some environment variables
export LCG_GFAL_INFOSYS=$T_LCG_GFAL_INFOSYS
CLOSESE="`edg-brokerinfo getCloseSEs | head -n 1 | awk '{ print $1}'`"

echo "## source setup-release.sh"
source setup-release.sh

echo "Unpack DQ"
tar xvfz dq.tar.gz

# Setup the release and prepare DQ
export T_DISTREL=${SITEROOT}/AtlasOffline/${T_RELEASE}
dq_prep

echo "## Setting up the release:"
echo "## source ${T_DISTREL}/AtlasOfflineRunTime/cmt/setup.sh"
source ${T_DISTREL}/AtlasOfflineRunTime/cmt/setup.sh
echo "## blah-de-blah"

export LCG_CATALOG_TYPE="lfc"
export LFC_HOST="lfc-atlas-test.cern.ch"
The Executable (Part 4)
# Stage files and set up directories

# Stage input file
echo ">>> STAGE-IN: ${T_INFN}"
actual_filename=$( echo ${T_INFN} | cut -d "/" -f 10)
current_dir=`pwd`
echo `pwd`
echo ">>> lcg-cp --vo atlas lfn:${T_INFN} file:${current_dir}/${actual_filename}"
lcg-cp --vo atlas lfn:${T_INFN} file:${current_dir}/${actual_filename}
echo "ls file here"
#ls ${T_INFN}
ls ${actual_filename}
echo "have ls'd file"

rm PoolFileCatalog.xml
echo "XXXXXXXXXXXXXXXXXXXX"
pool_insertFileToCatalog ${T_INFN}
cat PoolFileCatalog.xml 2> /dev/null
echo "XXXXXXXXXXXXXXXXXXXX"

#if [ ! -f ${T_INFN} ] ; then
if [ ! -f ${actual_filename} ]; then
    echo "Unable to stage-in input file ${actual_filename}"
    exit 33
fi

# Working directory
T_HOMEDIR=${PWD}
T_TMPDIR=${PWD}/atlas.tmp$$
mkdir -p ${T_TMPDIR}
cd ${T_TMPDIR}

# Move the input file to the working dir
mv ${T_HOMEDIR}/${actual_filename} ${T_TMPDIR}
mv ${T_HOMEDIR}/PoolFileCatalog.xml ${T_TMPDIR}
mv ${T_HOMEDIR}/CompileReleaseCode.def ${T_TMPDIR}
The Executable (Part 5)
#--------------------------------------------------------------------------
# transformation script call
#--------------------------------------------------------------------------
echo
echo "======================="
echo "TRANSFORMATION STARTING"
echo "======================="
echo
cat PoolFileCatalog.xml 2> /dev/null

# Use the Athena job transforms from the local software release
csc_reco_trf.py "${actual_filename}" "${T_OUTFN_ESD}" "${T_OUTFN_AOD}" ${T_NEVT} ${T_SKIP} "${T_GEO_VER}"

echo
echo "======================="
echo " END OF TRANSFORMATION"
echo "======================="
echo
\ls -l ${T_OUTFN_ESD}
\ls -l ${T_OUTFN_AOD}
The Executable (Part 6)
echo "==============================="
echo " REGISTERING OUTPUT FILES "
echo "==============================="
mv PoolFileCatalog.xml ${T_HOMEDIR}
mv ${T_OUTFN_ESD} ${T_HOMEDIR}
mv ${T_OUTFN_AOD} ${T_HOMEDIR}
cd ${T_HOMEDIR}
cat PoolFileCatalog.xml 2> /dev/null

echo ">> STAGE-OUT: ${T_OUTFN_AOD}"
if [ -f ${T_OUTFN_AOD} -a -f PoolFileCatalog.xml ] ; then
    #echo ">> . $STAGER -d ${T_SE} output /datafiles/${T_LCN} ${T_OUTFN_AOD}"
    #. $STAGER -d ${T_SE} output datafiles/${T_LCN} ${T_OUTFN_AOD}
    echo ">> lcg-cr -v -l /grid/atlas/users/steved/bphys/postroma/rec/017700.Bs_Jpsi_mu6mu3_phi_KplusKminus11041/${T_OUTFN_AOD} -n 8 -d ${T_SE} -t 3000 --vo atlas file:${PWD}/${T_OUTFN_AOD}"
    lcg-cr -v -l /grid/atlas/users/steved/bphys/postroma/rec/017700.Bs_Jpsi_mu6mu3_phi_KplusKminus11041/${T_OUTFN_AOD} -n 8 -d ${T_SE} -t 3000 --vo atlas file:${PWD}/${T_OUTFN_AOD}
else
    echo ">> Not found"
fi

echo
echo "==============================="
echo " END OF JOB "
echo "==============================="
exit

# Put all six parts together for a complete script.
Running
• grid-proxy-init (get a proxy certificate)

[lcgui0361] /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/scripts > grid-proxy-init
Your identity: /C=UK/O=eScience/OU=CLRC/L=RAL/CN=stephen dallison
Enter GRID pass phrase for this identity:
Creating proxy .......................................................... Done
Your proxy is valid until: Tue Aug 15 04:41:17 2006

• edg-job-list-match -rank <MyJDLFile>.jdl (find grid sites where your job can run)

[lcgui0361] /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/jdl > edg-job-list-match -rank RecoTest.jdl
Selected Virtual Organisation name (from JDL): atlas
Connecting to host lcgrb01.gridpp.rl.ac.uk, port 7772
***************************************************************************
                      COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:

   *CEId*                                                     *Rank*
ce02.esc.qmul.ac.uk:2119/jobmanager-lcgpbs-lcg2_long            433
cclcgceli02.in2p3.fr:2119/jobmanager-bqs-atlas_long             369
fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-atlas          257
lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-atlas        119
lcgce.ijs.si:2119/jobmanager-pbs-atlas                           71
grid003.ft.uam.es:2119/jobmanager-lcgpbs-atlas                   64
grid109.kfki.hu:2119/jobmanager-lcgpbs-atlas                     64
gw39.hep.ph.ic.ac.uk:2119/jobmanager-lcgpbs-atlas                44
etc… etc….
Running Cont.
• edg-job-submit --vo atlas -o <OUT_URL_FILE> <MyJDLFile>.jdl (submit the job to the grid)

[lcgui0361] /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/jdl > edg-job-submit --vo atlas -o RECOFILES_TEST RecoTest.jdl
Selected Virtual Organisation name (from --vo option): atlas
Connecting to host lcgrb01.gridpp.rl.ac.uk, port 7772
Logging to host lcgrb01.gridpp.rl.ac.uk, port 9002
================================ edg-job-submit Success ===========================
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status.
Your job identifier (edg_jobId) is:
- https://lcgrb01.gridpp.rl.ac.uk:9000/7el4yHLhwVYIAAPLZD1PdQ
The edg_jobId has been saved in the following file:
/afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/jdl/RECOFILES_TEST
==============================================================================
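Each submission adds its edg_jobId to the file named with -o, so a batch of jobs can be tracked together. A small sketch (the RecoTest_*.jdl names are hypothetical placeholders):

    # Sketch: submit several JDL files, collecting all job IDs in one URL file
    for jdl in RecoTest_*.jdl ; do
        edg-job-submit --vo atlas -o RECOFILES_TEST ${jdl}
    done
    edg-job-status -i RECOFILES_TEST    # check the whole batch later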
Running Cont..
• edg-job-status -i <OUT_URL_FILE> (check the status of your jobs)

[lcgui0361] /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/jdl > edg-job-status -i RECOFILES_TEST
------------------------------------------------------------------
1 : https://lcgrb01.gridpp.rl.ac.uk:9000/37K1ydLTafuPkV2_CbMm1g
2 : https://lcgrb01.gridpp.rl.ac.uk:9000/urcnVuPGlK0hC3ir9p2yCw
3 : https://lcgrb01.gridpp.rl.ac.uk:9000/uf3o-LTwAUoYuGeCqzLBOQ
4 : https://lcgrb01.gridpp.rl.ac.uk:9000/dKwQxBekGfRDYeXfuS8lPA
5 : https://lcgrb01.gridpp.rl.ac.uk:9000/a99gO9ncDIWO7R-kzb16aQ
6 : https://lcgrb01.gridpp.rl.ac.uk:9000/7el4yHLhwVYIAAPLZD1PdQ
a : all
q : quit
------------------------------------------------------------------
Choose one or more edg_jobId(s) in the list - [1-6]all: 6

*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://lcgrb01.gridpp.rl.ac.uk:9000/7el4yHLhwVYIAAPLZD1PdQ
Current Status:     Scheduled
Status Reason:      Job successfully submitted to Globus
Destination:        ce02.esc.qmul.ac.uk:2119/jobmanager-lcgpbs-lcg2_long
reached on:         Mon Aug 14 20:52:18 2006
*************************************************************

• edg-job-get-output -dir <output_dir> -i <OUT_URL_FILE> (retrieve the output files)
Running Cont…
• edg-job-get-output -dir <output_dir> -i <OUT_URL_FILE> (retrieve the output files)
• N.B. Running ATLAS software releases earlier than 11.5.0 sometimes requires the old bash-style JobTransform scripts and the old-style cmt set-up:
  • the cmt setup is different
  • download a tar ball of the JobTransforms
  • tar -zxvf <JT>.tar.gz and edit the relevant files if necessary, then tar -zcvf <JT>.tar.gz
• Including extra packages (see the sketch after this list)
  • cmt co <AnyExtraPackageYouMightNeed>
  • include it in the InputSandbox
  • do the relevant cmt setup and build steps in the executable
• Getting expert help
  • https://gus.fzk.de/pages/home.php
  • This is a web-based forum which users must register with.
  • You submit a "ticket" and your problem is assigned to an expert who then liaises with you directly.
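As a rough, hedged sketch of the "extra packages" steps above (the package and tarball names are hypothetical, and the exact CMT commands may differ between releases):

    # On the UI: check out the extra package and pack it up for the InputSandbox
    cmt co PhysicsAnalysis/MyExtraPackage          # hypothetical package name
    tar -zcvf MyExtraPackage.tar.gz PhysicsAnalysis/MyExtraPackage
    # ...then add MyExtraPackage.tar.gz to the InputSandbox list in the JDL.

    # In the executable, after the release has been set up:
    tar -zxvf MyExtraPackage.tar.gz
    cd PhysicsAnalysis/MyExtraPackage/cmt
    cmt config                                     # regenerate the setup scripts for this site
    source setup.sh
    gmake                                          # build the package against the local release
    cd -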
Useful LCG Commands
• List files in the catalogue
  • lfc-ls `lcg-infosites --vo atlas lfc`:/grid/atlas/users/<MyCatalogueDir>
• Make a new directory in the catalogue
  • lfc-mkdir `lcg-infosites --vo atlas lfc`:/grid/atlas/users/<MyNewCatalogueDir>
• Register an existing file in the catalogue
  • lcg-rf -v --vo atlas -l lfn:/grid/atlas/users/<MyCatalogueDir>/<file> srm://castorsrm.cern.ch/castor/<dir>/<file>
• Make a replica
  • lcg-rep -v --vo atlas lfn:/grid/atlas/users/<MyCatalogueDir>/<file> -d srm://dcache.gridpp.rl.ac.uk/pnfs/gridpp.rl.ac.uk/data/atlas/<dir>/<file>
• See this file at RAL using:
  • export LD_PRELOAD=/opt/dcache/dcap/lib/libpdcap.so
  • cd dcache.gridpp.rl.ac.uk/pnfs/gridpp.rl.ac.uk/data/atlas/<dir>
  • cat <file>
• See the LCG manual for further information:
  • http://egee.itep.ru/User_Guide.html
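Putting two of these together, a hedged example of pulling a catalogued file back to local disk (the catalogue path and file name are placeholders) might be:

    # Sketch: list a catalogue directory, then copy one of its files locally
    lfc-ls `lcg-infosites --vo atlas lfc`:/grid/atlas/users/<MyCatalogueDir>
    lcg-cp -v --vo atlas lfn:/grid/atlas/users/<MyCatalogueDir>/<file> file:$PWD/<file>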
Check ATLAS Tag Availability
• lcg-infosites --vo atlas tag (lists the ATLAS software tags installed at each Grid site)

[lcgui0361] /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/jdl > lcg-infosites --vo atlas tag
Name of the CE: g03n04.pdc.kth.se
Name of the CE: lcgce.ijs.si
VO-atlas-release-11.0.42
VO-atlas-release-11.0.5
VO-atlas-offline-11.5.0
VO-atlas-offline-12.0.1
Name of the CE: lpnce.in2p3.fr
VO-atlas-offline-12.0.1
VO-atlas-release-11.0.42
VO-atlas-release-11.0.5
etc...

• Notice the different formats: e.g. VO-atlas-release-11.0.42 for releases prior to 11.5.0, and VO-atlas-offline-11.5.0 for 11.5.0 and after. This difference must be taken into account in the *.jdl file.
Examples
The following examples of grid jobs can be found in the ATLAS Physics Workbook. They range from a very basic "Hello World" job…
• https://twiki.cern.ch/twiki/bin/view/Atlas/WorkBookRunningGrid
…to running an Athena job…
• https://twiki.cern.ch/twiki/bin/view/Atlas/WorkBookAthenaGrid
…and then a Full Chain production:
• https://twiki.cern.ch/twiki/bin/view/Atlas/WorkBookFullChainGrid