
Parrot and ATLAS Connect



  1. Parrot and ATLAS Connect
  Rob Gardner, Dave Lesny

  2. ATLAS Connect
  • A Condor- and PanDA-based batch service to easily connect resources (a minimal user submit file is sketched below)
  • Connect to ATLAS-compliant resources like a Tier2
  • Connect to opportunistic resources such as campus clusters:
    • Stampede cluster at the Texas Advanced Computing Center
    • Midway cluster at the University of Chicago
    • Illinois Campus Cluster at UIUC/NCSA
  • Each is RHEL6 or equivalent, with either SLURM or PBS as the local scheduler
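  As a rough illustration of the user-facing side, a minimal HTCondor submit file on the Connect login host might look like the following sketch; the executable and file names are illustrative assumptions, not ATLAS Connect documentation:

    # Minimal sketch of an HTCondor submit description file
    # (names are illustrative assumptions)
    universe   = vanilla
    executable = run_analysis.sh
    output     = job.$(Cluster).$(Process).out
    error      = job.$(Cluster).$(Process).err
    log        = job.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue 1

  The job would then be queued with condor_submit in the usual way.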

  3. Accessing Stampede
  • Use a simple Condor submit via the BLAHP protocol (SSH login to the Stampede local submit host; factory based on http://bosco.opensciencegrid.org) (see the submit-file sketch below)
    • Test for prerequisites
    • APF uses the same mechanism
  • PanDA queues – operated from MWT2
    • APF for pilot submission
    • CONNECT: production queue
    • ANALY_CONNECT: analysis queue
  • MWT2 storage for DDM endpoints
  • Frontier squid service
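  On the factory side, BOSCO-style submission uses HTCondor's grid universe with the "batch" grid type, which drives the remote SLURM scheduler over SSH via BLAHP. A minimal sketch, assuming a hypothetical account and login hostname:

    # Hypothetical BOSCO-style submit file for a SLURM site such as Stampede
    # (the account and hostname are illustrative assumptions)
    universe      = grid
    grid_resource = batch slurm osguser@login.stampede.tacc.utexas.edu
    executable    = pilot_wrapper.sh
    output        = pilot.out
    error         = pilot.err
    log           = pilot.log
    queue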

  4. Challenges
  • Additional system libraries (“ATLAS compatibility libraries”) as packaged in HEP_OSlibs_SL6
  • Access to CVMFS clients and cache
  • Environment variables normally set up by an OSG CE, needed by the pilot:
    • $OSG_APP, $OSG_GRID, $VO_ATLAS_SW_DIR
  • Our approach was to provide these components via the user job wrapper (sketched below)
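  A minimal sketch of such a wrapper fragment, assuming hypothetical paths (the production values lived in APF and the site configuration):

    #!/bin/bash
    # Hypothetical wrapper fragment: emulate the environment an OSG CE
    # would normally provide (all paths are illustrative assumptions)
    export VO_ATLAS_SW_DIR=/cvmfs/atlas.cern.ch/repo/sw
    export OSG_APP=$VO_ATLAS_SW_DIR          # ATLAS software area
    export OSG_GRID=$HOME/osg-wn-client      # locally installed WN client
    export OSG_SITE_NAME=${OSG_SITE_NAME:-CONNECT}
    exec "$@"                                # hand off to the pilot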

  5. Approaches
  • Linux image with all libraries, built using fake[ch]root
    • Deploy this image locally via tarball or via a CVMFS repo
    • Or use the CernVM 3 image in /cvmfs/cernvm-prod.cern.ch
  • Use Parrot to provide access to CVMFS repositories
    • Use Parrot “--mount” to map file references into the image, e.g. /usr/lib64 → /cvmfs/cernvm-prod.cern.ch/cvm3/usr/lib64 (see the invocation sketch below)
  • Install a Certificate Authority and OSG WN Client
  • Emulate the CE by defining environment variables
    • Some defined in APF ($VO_ATLAS_SW_DIR, $OSG_SITE_NAME)
    • Others defined in the “wrapper” ($OSG_APP, $OSG_GRID)
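  A hedged sketch of the Parrot invocation this describes; parrot_run's --mount option remaps a path, and the exact set of mappings shown here is an illustrative assumption:

    # Hypothetical Parrot invocation: map system library paths into the
    # CernVM 3 image and run the job wrapper underneath Parrot
    parrot_run \
      --mount=/usr/lib64=/cvmfs/cernvm-prod.cern.ch/cvm3/usr/lib64 \
      --mount=/usr/lib=/cvmfs/cernvm-prod.cern.ch/cvm3/usr/lib \
      /bin/bash ./pilot_wrapper.sh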

  6. Problems (1)
  • Symlinks cannot be followed between repositories
    • Not possible with Parrot due to restrictions in libcvmfs
    • e.g. /cvmfs/osg.mwt2.org/atlas/sw → /cvmfs/atlas.cern.ch/repo/sw
  • In general, we find cross-referencing CVMFS repos unreliable
    • A python script located in atlas.cern.ch needs a lib.so
    • If lib.so resides in another repo, it might get “File not found”
  • Solution was to use local disk for the Linux image:
    • Download a tarball and install it locally on disk (see the staging sketch below)
    • Also install a local OSG worker-node client and CA in the same location
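  A minimal sketch of the local-disk staging step, assuming hypothetical tarball URLs and a scratch directory (none of these names are from the original deployment):

    # Hypothetical staging step: unpack the Linux image, OSG WN client,
    # and CA certificates onto node-local disk (URLs/paths are assumptions)
    IMAGEDIR=${TMPDIR:-/tmp}/atlas-image
    mkdir -p "$IMAGEDIR"
    curl -sL http://example.org/atlas-compat-image.tar.gz | tar xz -C "$IMAGEDIR"
    curl -sL http://example.org/osg-wn-client.tar.gz      | tar xz -C "$IMAGEDIR"
    export OSG_GRID="$IMAGEDIR/osg-wn-client"
    export X509_CERT_DIR="$IMAGEDIR/certificates"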

  7. Problems (2): Parrot stability
  • Parrot is very sensitive to the kernel version
    • When used on 2.x kernels, many ATLAS programs hang
  • Parrot uses ptrace and clones the system call
    • A bug in ptrace in some kernels causes a timing problem
    • The traced program is awakened with SIGCONT before it should be
    • Result is that the program stays in “T” (stopped) state forever (a quick check is sketched below)
  • Kernels known to have issues with Parrot:
    • ICC: 2.6.32-358.23.2.el6.x86_64
    • Stampede: 2.6.32-358.18.1.el6.x86_64
    • Midway: 2.6.32-431.11.2.el6.x86_64
  • A custom kernel at MWT2 which seems to work is “3.2.13-UL3.el6”
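  The stuck state is easy to spot from the batch slot; a quick check like the following (a sketch, not part of the original tooling) lists any processes hung in the stopped state:

    # List processes stuck in the stopped ("T") state, the signature of
    # the ptrace/SIGCONT race described above
    ps -u "$USER" -o pid,stat,comm | awk '$2 ~ /^T/'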

  8. Towards a solution: Parrot 4.1.4rc5
  • To work around the hangs, the CCTools team provided a feature: --cvmfs-enable-thread-clone-bugfix
    • Stops many (not all) hangs, with a huge performance penalty
    • A simple ALRB setup plus an asetup of a release takes 10x to 100x longer
    • Needed on 2.x kernels to avoid many of the hangs (see the wrapper sketch below)
  • Programs which tend to run on 2.x without the “bugfix”:
    • ATLASLocalRootBase setup (and the diagnostics db-readReal and db-fnget)
    • Reconstruction
    • PanDA pilots
    • Validation jobs
  • Programs which tend to hang:
    • Sherpa (always)
    • Release 16.x jobs
    • Some HammerCloud tests (16.x always, 17.x sometimes)
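  In a wrapper, the workaround can be applied only where it is needed; a sketch, assuming the same hypothetical mounts as above:

    # Hypothetical wrapper logic: enable the thread-clone workaround only
    # on 2.x kernels, accepting the slowdown there to avoid hangs
    PARROT_OPTS=""
    case "$(uname -r)" in
      2.*) PARROT_OPTS="--cvmfs-enable-thread-clone-bugfix" ;;
    esac
    parrot_run $PARROT_OPTS \
      --mount=/usr/lib64=/cvmfs/cernvm-prod.cern.ch/cvm3/usr/lib64 \
      /bin/bash ./pilot_wrapper.sh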

  9. Alternatives to Parrot?
  • The CCTools team will keep working on Parrot to fix bugs
    • May need kernel 3.x on the target site for reliability
  • Three solutions we are pursuing:
    • Parrot with Chirp (avoids libcvmfs)
    • NFS mounting of a local CVMFS (requires admin)
    • Use Environment Modules, common on HPC facilities
      • Treat the CVMFS client as a user application
      • Jobs run “module load cvmfs-client” (see the sketch below)
      • The module prefix has privileges – it can load the needed FUSE modules
      • Cache re-use by multi-core job slots
      • Might be more palatable to HPC admins
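  Under the Environment Modules approach, the job prologue would look something like this sketch (the module name cvmfs-client is taken from the slide; the repo check is an illustrative assumption):

    # Hypothetical job prologue on an HPC site with Environment Modules:
    # the privileged module prefix loads the FUSE pieces, after which the
    # repository is visible to the unprivileged job
    module load cvmfs-client
    ls /cvmfs/atlas.cern.ch/repo/sw    # sanity check that the mount is live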

  10. Conclusions
  • Good experience accessing opportunistic resources without WLCG or ATLAS services
    • A general problem for campus clusters
  • Would greatly help if we:
    • Relied on only one CVMFS repo + stock SL6 (like CMS)
  • Will continue pursuing the three alternatives
  • Hope we can learn from others here!
