
Running Jobs on the Grid: An ATLAS Software Developer's Perspective


chelsi


  1. Running Jobs on the Grid: An ATLAS Software Developer's Perspective

  2. Advantages to Running on the Grid
     • I am involved in helping the B-Physics group validate their Bs→J/ψφ samples.
     • I do this by taking generated events and running simulation, digitization, reconstruction and analysis on each major release of Athena.
     • This is VERY resource intensive. Up to 125,000 events need processing.
     • It usually takes more than a week to run on lxplus or csf batch nodes.
     • Running many large jobs at frequent intervals greatly lowers my priority in the batch queue, leading to longer and longer job start delays.
     • Jobs on the Grid usually start within a few minutes of submission.
     • Greater resource availability means that queue prioritization is not a problem.

  3. Getting Set Up
     • Make sure you have a Grid certificate
        • instructions at https://twiki.cern.ch/twiki/bin/view/Atlas/WorkBookStartingGrid
     • Join the ATLAS Virtual Organisation
        • see http://www.usatlas.bnl.gov/twiki/bin/view/Support/HowToAtlasVO
     • Log into a grid User Interface (e.g. lxplus.cern.ch, csf.rl.ac.uk)
     • Set up your local grid environment. E.g. on lxplus use:
        • source /afs/cern.ch/project/gd/LCG-share/sl3/etc/profile.d/grid_env.sh
        • I haven't had to do this on csf.
        • Ask your local expert what you should use elsewhere.
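Once the environment script has been sourced, a quick check that the grid commands are actually on the PATH can save a failed submission. The following is a minimal sketch; the helper name missing_tools is hypothetical, and the commands in the usage comment are simply the ones used later in this talk.

```sh
# Hypothetical helper: print each named command that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Possible use on a User Interface machine (prints nothing if all are present):
# missing_tools grid-proxy-init edg-job-list-match edg-job-submit lcg-cp lcg-cr
```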

  4. Prepare Scripts
     A typical Grid job has a basic structure made up of a small number of scripts.
     • JDL Script
        • Grid jobs need to carry certain information about your job specification (e.g. Athena release, virtual organisation, cpu requirements, input and output files…) to the remote grid site.
        • All of this information is contained in a file named <YourJDLFileName>.jdl.
     • Executable
        • You need an Executable script. This is a shell script which executes commands on the remote grid site. It sets up CMT and other environment variables, and it acts as a wrapper for your Athena job.
        • Parameters such as event number, Athena release number, and geometry version are handled here, with their values passed in from the jdl file.
     • CompileReleaseCode.def, setup-release.sh & dq.tar.gz (these don't need editing)
        • CompileReleaseCode.def: some definitions for the release code
        • setup-release.sh: sets some environment variables
        • dq.tar.gz: file I/O handlers (optional)
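Since the JDL file is mostly fixed text with a few job-specific values, it can be generated rather than edited by hand. The sketch below is one possible way to do that, not the method used in this talk: make_jdl is a hypothetical helper, it emits only a subset of the attributes that appear in the JDL on the next slide, and it assumes a release of 11.5.0 or later (the VO-atlas-offline tag format).

```sh
# Hypothetical generator: write a minimal JDL to stdout.
# $1 = path to the executable script
# $2 = Athena release, e.g. 12.0.1 (assumed >= 11.5.0)
# $3 = remaining arguments to pass to the executable
make_jdl() {
    script=$1
    release=$2
    args=$3
    cat <<EOF
[
Executable = "$(basename "$script")";
InputSandbox = {"$script"};
OutputSandbox = {"stdout","stderr"};
stdoutput = "stdout";
stderror = "stderr";
Arguments = "$release $args";
VirtualOrganisation = "atlas";
Requirements = Member("VO-atlas-offline-$release", other.GlueHostApplicationSoftwareRunTimeEnvironment);
]
EOF
}

# Possible use:
# make_jdl "$HOME/scripts/MyScript.sh" 12.0.1 "castorgrid.cern.ch steved" > RecoTest.jdl
```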

  5. The JDL File

         [
         Executable = "MyScript.sh";
         InputSandbox = {"<MyGridDir>/scripts/MyScript.sh",
                         "<MyGridDir>/scripts/setup-release.sh",
                         "<MyGridDir>/scripts/CompileReleaseCode.def",
                         "<MyGridDir>/scripts/dq.tar.gz"};
         OutputSandbox = {"stdout","stderr"};
         stdoutput = "stdout";
         stderror = "stderr";
         Arguments = "12.0.1 castorgrid.cern.ch steved <input_location> <input_file> <output_file> <nevent>";
         Environment = {"T_LCG_GFAL_INFOSYS=atlas-bdii.cern.ch:2170"};
         VirtualOrganisation = "atlas";
         Requirements = Member("VO-atlas-offline-12.0.1", other.GlueHostApplicationSoftwareRunTimeEnvironment)
                        && other.GlueHostNetworkAdapterOutboundIP == TRUE
                        && other.GlueHostMainMemoryRAMSize >= 512
                        && other.GlueCEPolicyMaxCPUTime > 1000
                        && (!RegExp("dgce0.icepp.jp:2119",other.GlueCEUniqueId));
         Rank = ( other.GlueCEStateWaitingJobs == 0 ? other.GlueCEStateFreeCPUs : -other.GlueCEStateWaitingJobs) ;
         ]

     Callouts on the slide:
     • JobOptions file
     • Atlas software release version (the 12.0.1 in Arguments)
     • Storage element (the castorgrid.cern.ch in Arguments)
     • Atlas software tag version, similar to the release version* (the VO-atlas-offline-12.0.1 entry in Requirements)

     * For releases prior to 11.5.0 the ATLAS software tag version format is VO-atlas-release-xx.xx.xx

  6. The JDL File: Interactive Mode
     Interactive jobs can be run on the grid, which can be useful for debugging.
     N.B. The lines

         OutputSandbox = {"stdout","stderr"};
         stdoutput = "stdout";
         stderror = "stderr";

     that appear in a standard jdl file must be absent here in order for the job to run. Add the JobType line to make the job run interactively.

         [
         JobType = "Interactive" ;
         Executable = "MyScript.sh";
         InputSandbox = {"<MyGridDir>/scripts/MyScript.sh",
                         "<MyGridDir>/scripts/setup-release.sh",
                         "<MyGridDir>/scripts/CompileReleaseCode.def",
                         "<MyGridDir>/scripts/dq.tar.gz"};
         Arguments = "12.0.1 castorgrid.cern.ch steved <input_location> <input_file> <output_file> <nevent>";
         Environment = {"T_LCG_GFAL_INFOSYS=atlas-bdii.cern.ch:2170"};
         VirtualOrganisation = "atlas";
         Requirements = Member("VO-atlas-offline-12.0.1",other.GlueHostApplicationSoftwareRunTimeEnvironment)
                        && other.GlueHostNetworkAdapterOutboundIP == TRUE
                        && other.GlueHostMainMemoryRAMSize >= 512
                        && other.GlueCEPolicyMaxCPUTime > 1000
                        && (!RegExp("dgce0.icepp.jp:2119",other.GlueCEUniqueId)) ;
         Rank = ( other.GlueCEStateWaitingJobs == 0 ? other.GlueCEStateFreeCPUs : -other.GlueCEStateWaitingJobs) ;
         ]
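The two changes described on this slide (drop the three output lines, add the JobType line) are mechanical, so an interactive JDL can be derived from a standard one. This is a sketch only: to_interactive_jdl is a hypothetical filter, and it assumes the JDL has one attribute per line, the opening "[" on a line of its own, and the attribute spellings used on these slides.

```sh
# Hypothetical filter: read a standard JDL on stdin, write an interactive one.
# Assumes one attribute per line and "[" alone on the first bracket line.
to_interactive_jdl() {
    awk '
        # Drop the three lines that must be absent in interactive mode.
        /^[ \t]*(OutputSandbox|stdoutput|stderror)[ \t]*=/ { next }
        { print }
        # Right after the opening bracket, add the JobType line (once).
        /^[ \t]*\[[ \t]*$/ && !done { print "JobType = \"Interactive\";"; done = 1 }
    '
}

# Possible use:
# to_interactive_jdl < RecoTest.jdl > RecoTestInteractive.jdl
```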

  7. The Executable (Part 1)

         #!/bin/sh
         # Some environment stuff:

         dq_prep() {
             ATLAS_PYTHON="`(cd $T_DISTREL/AtlasOfflineRunTime/cmt; cmt show macro_value Python_home)`"
             POOL_home="`(cd $T_DISTREL/AtlasOfflineRunTime/cmt; cmt show macro_value POOL_home)`"
             SEAL_home="`(cd $T_DISTREL/AtlasOfflineRunTime/cmt; cmt show macro_value SEAL_home)`"
             DQ_LD_LIBRARY_PATH=${ATLAS_PYTHON}/lib:${LD_LIBRARY_PATH}
             DQ_PYTHONPATH=${POOL_home}/bin:${POOL_home}/lib:${PYTHONPATH}
             export SEAL_KEEP_MODULES="true"
             export SEAL_PLUGINS=${SEAL_home}/lib/modules:${POOL_home}/lib/modules:${SEAL_PLUGINS}
             export POOL_OUTMSG_LEVEL=8
         }

         dq() {
             STAGER=$PWD/run_dqlcg.sh
             STAGERARGS="$@"
             # Setup DQ
             PYTHONPATH_SAVE=$PYTHONPATH
             LD_LIBRARY_PATH_SAVE=$LD_LIBRARY_PATH
             export LD_LIBRARY_PATH=${DQ_LD_LIBRARY_PATH}
             export PYTHONPATH=${DQ_PYTHONPATH}
             source $STAGER $STAGERARGS
             # Restore the original paths
             export LD_LIBRARY_PATH=$LD_LIBRARY_PATH_SAVE
             export PYTHONPATH=$PYTHONPATH_SAVE
         }
         #--------------------------------------------------------------------------------------------------

  8. The Executable (Part 2)
     These are the variables that get entered in the jdl file. They are for a generic reconstruction job and get set later.

         #---------------------------
         export T_RELEASE="${1}"      # ATLAS software release e.g. 12.0.2
         export T_SE="${2}"           # Storage element
         export T_LCN="${3}"          # User name
         export T_INFN="${4}"         # Input file name
         export T_OUTFN_ESD="${5}"    # Output ESD file name
         export T_OUTFN_AOD="${6}"    # Output AOD file name
         export T_NEVT="${7}"         # Number of events
         export T_SKIP="${8}"         # Events to skip over
         export T_GEO_VER="${9}"      # Geometry version e.g. ATLAS-DC3-07
         shift 9                      # number of arguments that go in jdl file
         #---------------------------
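Because the executable reads its settings purely by position, a mismatch between the Arguments string in the jdl file and the script only shows up as a confusing failure on the remote site. A minimal guard, assuming the count of 9 from the shift 9 on this slide (require_args is a hypothetical helper, not part of the original script), could look like this:

```sh
# Hypothetical guard: succeed only if at least $1 further arguments are given.
require_args() {
    needed=$1
    shift
    if [ "$#" -lt "$needed" ]; then
        echo "Expected at least $needed arguments, got $#" >&2
        return 1
    fi
}

# Possible use at the top of the executable, before the exports:
# require_args 9 "$@" || exit 1
```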

  9. The Executable (Part 3)

         # Set up Athena and set some environment variables
         export LCG_GFAL_INFOSYS=$T_LCG_GFAL_INFOSYS
         CLOSESE="`edg-brokerinfo getCloseSEs | head -n 1 | awk '{ print $1}'`"

         echo "## source setup-release.sh"
         source setup-release.sh

         echo "Unpack DQ"
         tar xvfz dq.tar.gz

         # Setup the release and prepare DQ
         export T_DISTREL=${SITEROOT}/AtlasOffline/${T_RELEASE}
         dq_prep

         echo "## Setting up the release:"
         echo "##source ${T_DISTREL}/AtlasOfflineRunTime/cmt/setup.sh"
         source ${T_DISTREL}/AtlasOfflineRunTime/cmt/setup.sh
         echo "## blah-de-blah"

         export LCG_CATALOG_TYPE="lfc"
         export LFC_HOST="lfc-atlas-test.cern.ch"

  10. The Executable (Part 4)

         # Stage files and set up directories

         # Stage input file
         echo ">>> STAGE-IN: ${T_INFN}"
         actual_filename=$( echo ${T_INFN} | cut -d "/" -f 10)
         current_dir=`pwd`
         echo `pwd`
         echo ">>> lcg-cp --vo atlas lfn:${T_INFN} file:${current_dir}/${actual_filename}"
         lcg-cp --vo atlas lfn:${T_INFN} file:${current_dir}/${actual_filename}
         echo "ls file here"
         #ls ${T_INFN}
         ls ${actual_filename}
         echo "have ls'd file"

         rm PoolFileCatalog.xml
         echo "XXXXXXXXXXXXXXXXXXXX"
         pool_insertFileToCatalog ${T_INFN}
         cat PoolFileCatalog.xml 2> /dev/null
         echo "XXXXXXXXXXXXXXXXXXXX"

         #if [ ! -f ${T_INFN} ] ; then
         if [ ! -f ${actual_filename} ]; then
             echo "Unable to stage-in input file ${actual_filename}"
             exit 33
         fi

         # Working directory
         T_HOMEDIR=${PWD}
         T_TMPDIR=${PWD}/atlas.tmp$$
         mkdir -p ${T_TMPDIR}
         cd ${T_TMPDIR}

         # Move the input file to the working dir
         mv ${T_HOMEDIR}/${actual_filename} ${T_TMPDIR}
         mv ${T_HOMEDIR}/PoolFileCatalog.xml ${T_TMPDIR}
         mv ${T_HOMEDIR}/CompileReleaseCode.def ${T_TMPDIR}
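The cut -d "/" -f 10 on this slide picks out the file name only when the logical file name has exactly the directory depth used in this example. A depth-independent alternative, offered as a sketch (lfn_basename and staged_ok are hypothetical helper names), is to take the last path component and to treat an empty file as a failed stage-in:

```sh
# Hypothetical helper: last path component of a logical file name, any depth.
lfn_basename() {
    printf '%s\n' "${1##*/}"
}

# Hypothetical helper: true only if the staged file exists and is not empty.
staged_ok() {
    [ -s "$1" ]
}

# Possible use in the executable:
# actual_filename=$(lfn_basename "${T_INFN}")
# staged_ok "${actual_filename}" || { echo "Unable to stage-in ${actual_filename}"; exit 33; }
```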

  11. The Executable (Part 5)

         #--------------------------------------------------------------------------
         # transformation script call
         #--------------------------------------------------------------------------
         echo
         echo "======================="
         echo "TRANSFORMATION STARTING"
         echo "======================="
         echo
         cat PoolFileCatalog.xml 2> /dev/null

         # Use Athena job transforms from the local software release
         csc_reco_trf.py "${actual_filename}" "${T_OUTFN_ESD}" "${T_OUTFN_AOD}" ${T_NEVT} ${T_SKIP} "${T_GEO_VER}"

         echo
         echo "======================="
         echo " END OF TRANSFORMATION"
         echo "======================="
         echo
         # List the output files written by the reconstruction transform
         \ls -l ${T_OUTFN_ESD}
         \ls -l ${T_OUTFN_AOD}

  12. The Executable (Part 6)

         echo "==============================="
         echo " REGISTERING OUTPUT FILES "
         echo "==============================="
         mv PoolFileCatalog.xml ${T_HOMEDIR}
         mv ${T_OUTFN_ESD} ${T_HOMEDIR}
         mv ${T_OUTFN_AOD} ${T_HOMEDIR}
         cd ${T_HOMEDIR}
         cat PoolFileCatalog.xml 2> /dev/null

         echo ">> STAGE-OUT: ${T_OUTFN_AOD}"
         if [ -f ${T_OUTFN_AOD} -a -f PoolFileCatalog.xml ] ; then
             #echo ">> . $STAGER -d ${T_SE} output /datafiles/${T_LCN} ${T_OUTFN_DIG}"
             #. $STAGER -d ${T_SE} output datafiles/${T_LCN} ${T_OUTFN_DIG}
             # Reconstruction output is registered under the rec/ directory
             echo ">> lcg-cr -v -l /grid/atlas/users/steved/bphys/postroma/rec/017700.Bs_Jpsi_mu6mu3_phi_KplusKminus11041/${T_OUTFN_AOD} -n 8 -d ${T_SE} -t 3000 --vo atlas file:${PWD}/${T_OUTFN_AOD}"
             lcg-cr -v -l /grid/atlas/users/steved/bphys/postroma/rec/017700.Bs_Jpsi_mu6mu3_phi_KplusKminus11041/${T_OUTFN_AOD} -n 8 -d ${T_SE} -t 3000 --vo atlas file:${PWD}/${T_OUTFN_AOD}
         else
             echo ">> Not found"
         fi

         echo
         echo "==============================="
         echo " END OF JOB "
         echo "==============================="
         exit

     Put all six parts together for a complete script.
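The registration command on this slide is written out twice, once for the log and once for execution, so the two copies can drift apart. One way to keep them identical is to build the command line in a single place. This is a sketch under assumptions: register_cmd is a hypothetical helper, the flags and the values -n 8 and -t 3000 are copied from the slide, and paths are assumed to contain no spaces.

```sh
# Hypothetical helper: print the lcg-cr command line instead of running it.
# $1 = LFN directory in the catalogue, $2 = file name, $3 = storage element.
# Flags and the values 8 and 3000 are taken from the slide.
register_cmd() {
    printf 'lcg-cr -v -l %s/%s -n 8 -d %s -t 3000 --vo atlas file:%s/%s\n' \
        "$1" "$2" "$3" "$PWD" "$2"
}

# Possible use in the executable (paths without spaces assumed):
# cmd=$(register_cmd /grid/atlas/users/<user>/<dir> "${T_OUTFN_AOD}" "${T_SE}")
# echo ">> $cmd"
# $cmd
```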

  13. Running
     • grid-proxy-init (get a proxy certificate)

         [lcgui0361] /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/scripts > grid-proxy-init
         Your identity: /C=UK/O=eScience/OU=CLRC/L=RAL/CN=stephen dallison
         Enter GRID pass phrase for this identity:
         Creating proxy .......................................................... Done
         Your proxy is valid until: Tue Aug 15 04:41:17 2006

     • edg-job-list-match -rank <MyJDLFile>.jdl (find grid sites where your job can run)

         [lcgui0361] /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/jdl > edg-job-list-match -rank RecoTest.jdl
         Selected Virtual Organisation name (from JDL): atlas
         Connecting to host lcgrb01.gridpp.rl.ac.uk, port 7772
         ***************************************************************************
                                   COMPUTING ELEMENT IDs LIST
         The following CE(s) matching your job requirements have been found:

         *CEId*                                                      *Rank*
         ce02.esc.qmul.ac.uk:2119/jobmanager-lcgpbs-lcg2_long        433
         cclcgceli02.in2p3.fr:2119/jobmanager-bqs-atlas_long         369
         fal-pygrid-18.lancs.ac.uk:2119/jobmanager-lcgpbs-atlas      257
         lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-atlas    119
         lcgce.ijs.si:2119/jobmanager-pbs-atlas                      71
         grid003.ft.uam.es:2119/jobmanager-lcgpbs-atlas              64
         grid109.kfki.hu:2119/jobmanager-lcgpbs-atlas                64
         gw39.hep.ph.ic.ac.uk:2119/jobmanager-lcgpbs-atlas           44
         etc…
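The ranked list can also be read by a script, for example to see which site the broker is most likely to choose. The sketch below assumes the output format shown on this slide (a CE identifier containing ":port/" followed by an integer rank); best_ce is a hypothetical helper, not an LCG command.

```sh
# Hypothetical helper: print the CE with the highest rank.
# Reads "CEId Rank" lines on stdin; header and banner lines are ignored
# because they do not look like "<host>:<port>/<queue> <integer>".
best_ce() {
    awk 'NF == 2 && $1 ~ /:[0-9]+\// && $2 ~ /^-?[0-9]+$/ {
             if (!seen || $2 + 0 > best) { best = $2 + 0; ce = $1; seen = 1 }
         }
         END { if (seen) print ce }'
}

# Possible use:
# edg-job-list-match -rank RecoTest.jdl | best_ce
```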

  14. Running Cont.
     • edg-job-submit --vo atlas -o <OUT_URL_FILE> <MyJDLFile>.jdl (submit job to grid)

         [lcgui0361] /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/jdl > edg-job-submit --vo atlas -o RECOFILES_TEST RecoTest.jdl
         Selected Virtual Organisation name (from --vo option): atlas
         Connecting to host lcgrb01.gridpp.rl.ac.uk, port 7772
         Logging to host lcgrb01.gridpp.rl.ac.uk, port 9002
         ================================ edg-job-submit Success ===========================
         The job has been successfully submitted to the Network Server.
         Use edg-job-status command to check job current status.
         Your job identifier (edg_jobId) is:
         - https://lcgrb01.gridpp.rl.ac.uk:9000/7el4yHLhwVYIAAPLZD1PdQ
         The edg_jobId has been saved in the following file:
         /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/jdl/RECOFILES_TEST
         ==============================================================================
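Each submission with -o appends its identifier to the same file, so that file becomes the record of all jobs in a batch. A small sketch for listing them, assuming the file holds one https:// identifier per line, possibly alongside other lines (job_ids is a hypothetical helper):

```sh
# Hypothetical helper: print the job identifiers stored in an -o file.
# Assumes one identifier per line, each starting with https://.
job_ids() {
    grep '^https://' "$1"
}

# Possible use: count the jobs recorded so far.
# job_ids RECOFILES_TEST | wc -l
```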

  15. Running Cont..
     • edg-job-status -i <OUT_URL_FILE> (check status of jobs)

         [lcgui0361] /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/jdl > edg-job-status -i RECOFILES_TEST
         ------------------------------------------------------------------
         1 : https://lcgrb01.gridpp.rl.ac.uk:9000/37K1ydLTafuPkV2_CbMm1g
         2 : https://lcgrb01.gridpp.rl.ac.uk:9000/urcnVuPGlK0hC3ir9p2yCw
         3 : https://lcgrb01.gridpp.rl.ac.uk:9000/uf3o-LTwAUoYuGeCqzLBOQ
         4 : https://lcgrb01.gridpp.rl.ac.uk:9000/dKwQxBekGfRDYeXfuS8lPA
         5 : https://lcgrb01.gridpp.rl.ac.uk:9000/a99gO9ncDIWO7R-kzb16aQ
         6 : https://lcgrb01.gridpp.rl.ac.uk:9000/7el4yHLhwVYIAAPLZD1PdQ
         a : all
         q : quit
         ------------------------------------------------------------------
         Choose one or more edg_jobId(s) in the list - [1-6]all:6
         *************************************************************
         BOOKKEEPING INFORMATION:
         Status info for the Job : https://lcgrb01.gridpp.rl.ac.uk:9000/7el4yHLhwVYIAAPLZD1PdQ
         Current Status:     Scheduled
         Status Reason:      Job successfully submitted to Globus
         Destination:        ce02.esc.qmul.ac.uk:2119/jobmanager-lcgpbs-lcg2_long
         reached on:         Mon Aug 14 20:52:18 2006
         *************************************************************

     • edg-job-get-output -dir <output_dir> -i <OUT_URL_FILE> (retrieve output files)
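When many jobs are in flight, only the "Current Status:" line of each bookkeeping block is usually of interest. The following sketch extracts it, assuming the output layout shown on this slide; current_status is a hypothetical helper.

```sh
# Hypothetical helper: print the value of every "Current Status:" line on stdin.
current_status() {
    sed -n 's/^[[:space:]]*Current Status:[[:space:]]*//p'
}

# Possible use: summarise saved status output by state.
# current_status < status.log | sort | uniq -c
```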

  16. Running Cont…
     • edg-job-get-output -dir <output_dir> -i <OUT_URL_FILE> (retrieve output files)
     • N.B. Running ATLAS software releases earlier than 11.5.0 sometimes requires the old bash-style JobTransform scripts and the old-style cmt setup
        • the cmt setup is different
        • download the tarball of jobTransforms
        • tar -zxvf <JT>.tar.gz and edit the relevant files if necessary, then tar -zcvf <JT>.tar.gz
     • Including extra packages
        • cmt co <AnyExtraPackageYouMightNeed>
        • include it in the Input Sandbox
        • do the relevant cmt setup in the executable
     • Getting expert help
        • https://gus.fzk.de/pages/home.php
        • This is a web-based forum which users must register with.
        • You submit a "ticket" and your problem is assigned to an expert who then liaises with you directly.

  17. Useful LCG Commands
     • List files in catalogue
        • lfc-ls `lcg-infosites --vo atlas lfc`:/grid/atlas/users/<MyCatalogueDir>
     • Make a new directory in the catalogue
        • lfc-mkdir `lcg-infosites --vo atlas lfc`:/grid/atlas/users/<MyNewCatalogueDir>
     • Register an existing file in the catalogue
        • lcg-rf -v --vo atlas -l lfn:/grid/atlas/users/<MyCatalogueDir>/<file> srm://castorsrm.cern.ch/castor/<dir>/<file>
     • Make a replica
        • lcg-rep -v --vo atlas lfn:/grid/atlas/users/<MyCatalogueDir>/<file> -d srm://dcache.gridpp.rl.ac.uk/pnfs/gridpp.rl.ac.uk/data/atlas/<dir>/<file>
     • See this file at RAL using:
        • export LD_PRELOAD=/opt/dcache/dcap/lib/libpdcap.so
        • cd dcache.gridpp.rl.ac.uk/pnfs/gridpp.rl.ac.uk/data/atlas/<dir>
        • cat <file>
     • See the LCG manual for further info.
        • http://egee.itep.ru/User_Guide.html

  18. Check ATLAS Tag Availability
     • lcg-infosites --vo atlas tag

         [lcgui0361] /afs/rl.ac.uk/user/s/steved/Atlas/SteveGridTest/jdl > lcg-infosites --vo atlas tag
         Name of the CE: g03n04.pdc.kth.se
         Name of the CE: lcgce.ijs.si
             VO-atlas-release-11.0.42
             VO-atlas-release-11.0.5
             VO-atlas-offline-11.5.0
             VO-atlas-offline-12.0.1
         Name of the CE: lpnce.in2p3.fr
             VO-atlas-offline-12.0.1
             VO-atlas-release-11.0.42
             VO-atlas-release-11.0.5
         etc...

     The output lists the ATLAS software tags installed at each Grid site. Notice the different formats: e.g. VO-atlas-release-11.0.42 for releases prior to 11.5.0, VO-atlas-offline-11.5.0 for 11.5.0 and after. This difference must be taken into account in the *.jdl file.
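Both points on this slide lend themselves to a small script: choosing the right tag format for a release, and finding which sites advertise that tag. The sketch below is illustrative only. atlas_tag and ces_with_tag are hypothetical helpers; the first assumes a release number of the form x.y.z, and the second assumes the listing puts each tag on its own line under a "Name of the CE:" line, as in the output shown on this slide.

```sh
# Hypothetical helper: build the site tag for a release number x.y.z.
# "release" format before 11.5.0, "offline" format from 11.5.0 onwards.
atlas_tag() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    if [ "$major" -lt 11 ] || { [ "$major" -eq 11 ] && [ "$minor" -lt 5 ]; }; then
        echo "VO-atlas-release-$1"
    else
        echo "VO-atlas-offline-$1"
    fi
}

# Hypothetical helper: list the CEs advertising tag $1.
# Reads the lcg-infosites tag listing on stdin.
ces_with_tag() {
    awk -v tag="$1" '
        /^Name of the CE:/ { ce = $NF; next }   # remember the current CE
        $1 == tag          { print ce }         # tag line belongs to that CE
    '
}

# Possible use:
# lcg-infosites --vo atlas tag | ces_with_tag "$(atlas_tag 12.0.1)"
```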

  19. Examples
     The following examples of grid jobs can be found in the ATLAS Physics Workbook. They range from
     • a very basic "Hello World" job: https://twiki.cern.ch/twiki/bin/view/Atlas/WorkBookRunningGrid
     • to running an Athena job: https://twiki.cern.ch/twiki/bin/view/Atlas/WorkBookAthenaGrid
     • and then a Full Chain production: https://twiki.cern.ch/twiki/bin/view/Atlas/WorkBookFullChainGrid
