Parrot and ATLAS Connect Rob Gardner Dave Lesny
ATLAS Connect • A Condor and PanDA-based batch service to easily connect resources • Connect to ATLAS-compliant resources such as a Tier2 • Connect to opportunistic resources such as campus clusters • Stampede cluster at the Texas Advanced Computing Center • Midway cluster at the University of Chicago • Illinois Campus Cluster at UIUC/NCSA • Each is RHEL6 or equivalent, with either SLURM or PBS as the local scheduler
Accessing Stampede • Use a simple Condor submit via the BLAHP protocol (ssh login to the Stampede local submit host); the factory is based on BOSCO (http://bosco.opensciencegrid.org) – see the submit sketch below • Test for prerequisites • APF uses the same mechanism • PanDA queues – operated from MWT2 • APF for pilot submission • CONNECT: production queue • ANALY_CONNECT: analysis queue • MWT2 storage for DDM endpoints • Frontier squid service
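A minimal sketch (not the actual ATLAS Connect factory configuration) of submitting a test job to a remote SLURM cluster through HTCondor's BOSCO/BLAHP "batch" grid type; the login host, username, and test script below are hypothetical placeholders.

# Build a BOSCO-style grid-universe submit file and hand it to condor_submit.
import subprocess
import textwrap

submit_description = textwrap.dedent("""\
    universe      = grid
    grid_resource = batch slurm someuser@login.stampede.example.edu
    executable    = check_prerequisites.sh
    output        = job.$(Cluster).out
    error         = job.$(Cluster).err
    log           = job.$(Cluster).log
    queue 1
""")

with open("stampede_test.sub", "w") as f:
    f.write(submit_description)

# condor_submit returns non-zero on failure, so check=True surfaces errors.
subprocess.run(["condor_submit", "stampede_test.sub"], check=True)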
Challenges • Additional system libraries ("ATLAS compatibility libraries") as packaged in HEP_Oslibs_SL6 • Access to CVMFS clients and cache • Environment variables normally set up by an OSG CE and needed by the pilot • $OSG_APP, $OSG_GRID, $VO_ATLAS_SW_DIR • The approach was to provide these components via the user job wrapper (see the sketch below)
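An illustrative sketch only: a job wrapper exporting the environment variables an OSG CE would normally provide before handing off to the pilot. The local install path and the pilot launch command are hypothetical placeholders.

import os
import subprocess

LOCAL_INSTALL = "/tmp/atlas-connect"  # hypothetical per-job scratch area

# Variables the pilot expects from an OSG CE, set here by the wrapper instead.
os.environ["VO_ATLAS_SW_DIR"] = "/cvmfs/atlas.cern.ch/repo/sw"
os.environ["OSG_APP"] = os.path.join(LOCAL_INSTALL, "osg/app")
os.environ["OSG_GRID"] = os.path.join(LOCAL_INSTALL, "osg/wn-client")

# Launch the payload (e.g. the PanDA pilot) with the emulated CE environment.
subprocess.run(["./run_pilot.sh"], env=os.environ, check=True)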
Approaches • Linux image with all libraries, built using fake[ch]root • Deploy this image locally via tarball or via a CVMFS repo • Use the CernVM 3 image in /cvmfs/cernvm-prod.cern.ch • Use Parrot to provide access to CVMFS repositories • Use Parrot "--mount" to map file references into the image, e.g. /usr/lib64 → /cvmfs/cernvm-prod.cern.ch/cvm3/usr/lib64 • Install a Certificate Authority bundle and the OSG WN Client • Emulate the CE by defining environment variables • Some defined in APF ($VO_ATLAS_SW_DIR, $OSG_SITE_NAME) • Others defined in the "wrapper" ($OSG_APP, $OSG_GRID) • A Parrot invocation along these lines is sketched below
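A rough sketch of wrapping a payload with parrot_run so that lookups in /usr/lib64 resolve inside the CernVM 3 image on CVMFS; the payload script and the exact set of mappings are assumptions to be checked against the local cctools version, not the production wrapper.

import subprocess

CVM3 = "/cvmfs/cernvm-prod.cern.ch/cvm3"

cmd = [
    "parrot_run",
    "--mount=/usr/lib64=%s/usr/lib64" % CVM3,   # map compatibility libraries into the image
    "--mount=/usr/lib=%s/usr/lib" % CVM3,       # hypothetical additional mapping
    "./run_pilot.sh",                            # payload started under Parrot
]
subprocess.run(cmd, check=True)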
Problems (1) • Symlinks cannot be followed between repositories • Not possible with Parrot due to restrictions in libcvmfs • e.g. /cvmfs/osg.mwt2.org/atlas/sw → /cvmfs/atlas.cern.ch/repo/sw • In general, we find cross-referencing CVMFS repos unreliable • A python script located in atlas.cern.ch needs a lib.so • If lib.so resides in another repo, it might get "File not found" • Solution was to use local disk for the Linux image • Download a tarball and install it locally on disk (as sketched below) • Also install a local OSG worker-node client and CA in the same location
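A minimal sketch, assuming the image is published as a tarball at a site URL (the URL and target directory below are hypothetical): fetch it and unpack it onto local disk so no cross-repository CVMFS symlinks are needed.

import os
import tarfile
import urllib.request

IMAGE_URL = "http://example.mwt2.org/images/atlas-compat-image.tar.gz"  # hypothetical location
TARGET = "/tmp/atlas-image"

os.makedirs(TARGET, exist_ok=True)
local_tarball, _ = urllib.request.urlretrieve(IMAGE_URL)

with tarfile.open(local_tarball) as tar:
    tar.extractall(TARGET)   # the OSG WN client and CA certs can be unpacked alongside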
Problems (2): Parrot stability • Parrot is very sensitive to the kernel version • When used on 2.x kernels, many ATLAS programs hang • Parrot uses ptrace and the clone system call • A ptrace bug in some kernels causes a timing problem • The program being traced is awakened with SIGCONT before it should be • The result is that the program stays in "T" state forever • Kernels known to have issues with Parrot: • ICC: 2.6.32-358.23.2.el6.x86_64 • Stampede: 2.6.32-358.18.1.el6.x86_64 • Midway: 2.6.32-431.11.2.el6.x86_64 • A custom kernel at MWT2 which seems to work is "3.2.13-UL3.el6"
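An illustrative diagnostic, not taken from the slides: scan /proc for processes stuck in the stopped ("T") state, which is the symptom of the ptrace/SIGCONT timing bug described above.

import os

def stopped_processes():
    """Return (pid, command) pairs for processes currently in state 'T'."""
    hung = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/stat" % pid) as f:
                fields = f.read().split()
            comm, state = fields[1], fields[2]   # third field of /proc/<pid>/stat is the state
            if state == "T":
                hung.append((int(pid), comm))
        except (FileNotFoundError, IndexError):
            continue   # process exited while we were scanning
    return hung

print(stopped_processes())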
Towards a solution: Parrot 4.1.4rc5 • To work around the hangs, the CCTools team provided a feature: --cvmfs-enable-thread-clone-bugfix • Stops many (not all) hangs, with a huge performance penalty • A simple ALRB setup with an asetup of a release takes 10x to 100x longer • Needed on 2.x kernels to avoid many of the hangs • Programs which tend to run on 2.x without the "bugfix": • ATLAS Local ROOT Base setup (and the diagnostics db-readReal and db-fnget) • Reconstruction • PanDA pilots • Validation jobs • Programs which tend to hang: • Sherpa (always) • Release 16.x jobs • Some HammerCloud tests (16.x always, 17.x sometimes) • A conditional invocation is sketched below
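A sketch of the policy described above (not the production wrapper): enable Parrot's thread-clone bugfix only on 2.x kernels, accepting the performance penalty there, and omit it on 3.x kernels where it is not needed. The payload script name is hypothetical.

import platform
import subprocess

parrot_cmd = ["parrot_run"]
if platform.release().startswith("2."):
    # Known-problematic kernel family: take the slow-but-safer path.
    parrot_cmd.append("--cvmfs-enable-thread-clone-bugfix")

parrot_cmd.append("./run_payload.sh")   # hypothetical payload script
subprocess.run(parrot_cmd, check=True)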
Alternatives to Parrot? • The CCTools team will be working on Parrot to fix bugs • May need to use a 3.x kernel on the target site for reliability • Three solutions we are pursuing: • Parrot with Chirp (avoids libcvmfs) • NFS mounting of a local CVMFS (requires admin) • Use Environment Modules, common on HPC facilities • Treat the CVMFS client as a user application • Jobs run "module load cvmfs-client" (see the sketch below) • The prefix has privileges – can load the needed FUSE modules • Cache re-use by multi-core job slots • Might be more palatable to HPC admins
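A hedged sketch of the Environment Modules route: "module" is a shell function, so the load is done inside a login shell before the payload runs. The module name "cvmfs-client" follows the slide; the visibility check and payload command are assumptions.

import subprocess

shell_cmd = (
    "module load cvmfs-client && "
    "ls /cvmfs/atlas.cern.ch/repo/sw > /dev/null && "   # crude check that the repo is visible
    "./run_pilot.sh"
)
subprocess.run(["bash", "-lc", shell_cmd], check=True)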
Conclusions • Good experience accessing opportunistic resources without WLCG or ATLAS services • A general problem for campus clusters • Would greatly help if we: • Relied on only one CVMFS repo + stock SL6 (like CMS) • Will continue pursuing the three alternatives • Hope we can learn from others here!