Learn how to create an opportunistic OSG site inside the PRP Kubernetes cluster, maximizing GPU usage while preserving the low-contention experience existing users expect, and see how Kubernetes serves as the resource management system.
Creating an opportunistic OSG site inside the PRP Kubernetes cluster
HEPiX Spring 2019 Workshop
Igor Sfiligoi, UCSD, together with Edgar Fajardo Hernandez and Dima Mishin, also UCSD
PRP and Kubernetes

• Pacific Research Platform (PRP) originally created as a regional networking project
  • Connect all UC campuses with 10G to 100G links
  • And keep the links in top-notch condition, through continuous testing
• But scientists really need more than bandwidth tests
  • They need to share their data at high speed and compute on it, too
• PRP v2, with CHASE-CI and Partners, added disk and GPUs
  • Most CPUs reserved for high-speed networking and IO
  • But GPUs fully dedicated for user compute
• Kubernetes chosen as the resource management system
  • Makes it easy to mix services and user computation
Kubernetes (k8s) as a resource manager

• Originally created by Google, now maintained by the Cloud Native Computing Foundation
  • Open source, with a very large and active development community
  • Can be deployed on-prem, but also available out-of-the-box on all major Clouds (AWS, Azure and GCP)
• Kubernetes is a container orchestration system
  • Containers are great for services (you can be root)
  • But containers are great for users, too. Many popular pre-made containers available (e.g. tensorflow)
• Provides all the usual orchestration functionality
  • Node selection
  • Network provisioning
  • Configuration management
  • Storage provisioning and re-mapping
• A scheduling system is provided
  • Mostly based on quotas and priorities (see the quota sketch below)
  • Works best when there is little contention
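For illustration only, a namespace quota of the kind this scheduling model relies on might look roughly like the manifest below; the namespace name and the numbers are made-up examples, not values from the PRP cluster.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: example-group        # hypothetical user namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"  # cap on the GPUs the namespace may request
    pods: "50"                    # cap on concurrently scheduled pods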
PRP GPU k8s use before Igor showed up

• PRP has put in place a nice Web interface, called Nautilus
  • Facilitates self-service use of the k8s cluster
  • Provides a k8s-integrated JupyterLab instance for ease of use
  • Command line access through kubectl available for advanced users
• Many users doing really interesting science
  • Mostly ML/AI, but not only
• There was very little contention in the system
  • Which makes for very happy users
  • But some GPU cycles went unused

[Example figures: image processing; object segmentation and tracking]
Opportunistic use to maximize GPU use

• We wanted to maximize GPU usage
  • Without affecting the low-contention experience Nautilus users were used to
  • Which implies opportunistic pods should be preempted with little or no notice
• Kubernetes supports this model
  • Just set the Pod priority low enough (see the priority classes below)
• Open Science Grid (OSG) users are used to preemption
  • May not like it, but will tolerate it
  • Especially if it means access to more resources
• We had an OSG user group who could use more GPUs
  • The IceCube experiment (Direct Photon Propagation)

> kubectl get priorityclasses
NAME                      VALUE         GLOBAL-DEFAULT   AGE
opportunistic             -2000000000   false            31d
nice                      -10           false            31d
owner                     10            false            24d
system-cluster-critical   2000000000    false            265d
system-node-critical      2000001000    false            265d
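For reference, a priority class like the opportunistic one listed above could be defined with a manifest roughly like the following; this is a sketch reconstructed from the kubectl output, not the actual PRP manifest.

apiVersion: scheduling.k8s.io/v1beta1   # scheduling.k8s.io/v1 on k8s 1.14 and later
kind: PriorityClass
metadata:
  name: opportunistic
value: -2000000000      # matches the VALUE column above
globalDefault: false
description: "Lowest-priority pods, preempted whenever any other pod needs the resources"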
Creating an OSG site in a k8s cluster

• OSG does not (yet) natively support Kubernetes
  • So I had to do some gluing
• I had some experience with HTCondor glideins
  • Which is a fancy word for running a dynamically provisioned HTCondor pool
  • And OSG does support resource providers using HTCondor as a resource manager
• The HTCondor pool in PRP k8s was born
  • One collector/negotiator pod
  • One schedd pod, which also runs the OSG site portal, called HTCondor-CE
  • One worker node pod per available GPU
• Worker node pods are assigned low priority
  • So they will be preempted the moment any other user needs the GPU
  • HTCondor gracefully deals with this
OSG site on k8s in a picture

[Diagram: the HTCondor-CE and the Schedd run together in one k8s pod, exposed to the WAN; the Collector and Negotiator run in a separate pod; Startd worker-node pods run on the hosts alongside other k8s services, all communicating over the cluster VPN.]
HTCondor collector node

• The major config choice is security
  • Cannot be IP based (dynamic WNs)
  • Picked the so-called pool password method
  • Injected into the image using a Secret object (a sketch of the Secret follows the manifests below)
• Using a k8s Deployment, to deal with any Pod failures
• Registered as a Service, so it gets a persistent private IP and DNS name

kind: Deployment
metadata:
  name: osg-collector-prp-sdsc
  namespace: osg
  labels:
    k8s-app: osg-collector-prp
spec:
  template:
    metadata:
      labels:
        k8s-app: osg-collector-prp
    spec:
      containers:
      - name: osg-collector-prp
        image: sfiligoi/prp-osg-pool:collector
        ports:
        - containerPort: 9618
        volumeMounts:
        - name: configpasswd
          mountPath: /var/lock/condor/pool_password
          subPath: pool_password
          readOnly: true
      volumes:
      - name: configpasswd
        secret:
          secretName: osg-pool-sdsc-config
          items:
          - key: pool_password
            path: pool_password
          defaultMode: 256
---
kind: Service
metadata:
  labels:
    k8s-app: osg-collector-prp
  name: osg-collector-prp
  namespace: osg
spec:
  ports:
  - port: 9618
    protocol: TCP
    targetPort: 9618
  selector:
    k8s-app: osg-collector-prp
  type: ClusterIP

https://github.com/sfiligoi/prp-osg-pool
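The pool_password key comes from a Secret that is created separately; a minimal sketch of what such a Secret could look like is shown below (the value is a placeholder, the real pool password is of course not published).

apiVersion: v1
kind: Secret
metadata:
  name: osg-pool-sdsc-config
  namespace: osg
type: Opaque
data:
  # base64-encoded HTCondor pool password file; placeholder value only
  pool_password: cGxhY2Vob2xkZXI=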
HTCondor-CE and Schedd

• HTCondor-CE is external facing
  • Must bind to the public network interface (x509-influenced requirement)
  • Requires elevated privilege
• Chose to mount the x509 cert from the host
• Mounting fixed host partitions for persistence
  • Using local persistent volumes (a sketch of the backing PersistentVolume follows below)
• User list currently hardcoded during pod startup
• Plus normal OSG SE setup

kind: Deployment
metadata:
  name: osg-head-prp-sdsc
spec:
  template:
    spec:
      hostNetwork: true
      nodeSelector:
        kubernetes.io/hostname: prpce.sdsc.edu
      containers:
      - name: osg-head-prp
        image: sfiligoi/prp-osg-head
        volumeMounts:
        - name: hostcert
          mountPath: /etc/grid-security/hostcert.pem
        - name: condordata
          mountPath: /var/lib/condor
        - name: configgridmap
          mountPath: /etc/grid-security/grid-mapfile
          subPath: grid-mapfile
      volumes:
      - name: hostcert
        hostPath:
          path: /etc/grid-security/hostcert.pem
          type: File
      - name: condordata
        persistentVolumeClaim:
          claimName: pvc-persistent-1
      - name: configgridmap
        secret:
          secretName: osg-head-prp-sdsc-config
          items:
          - key: grid-mapfile
            path: grid-mapfile
          defaultMode: 292

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-persistent-1
# Local storage does not need a provisioner
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Retain
volumeBindingMode: Immediate
---
kind: PersistentVolumeClaim
metadata:
  name: pvc-persistent-1
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: local-persistent-1

https://github.com/sfiligoi/prp-osg-head
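Since the local-persistent-1 StorageClass has no provisioner, a matching PersistentVolume has to be created by hand on the CE host. The sketch below shows what it could look like; the volume name and host path are hypothetical, while the node name is the one from the nodeSelector above.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-persistent-1          # illustrative name
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-persistent-1
  local:
    path: /data/osg-head         # hypothetical directory on the host
  nodeAffinity:                  # local volumes must be pinned to a node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - prpce.sdsc.edu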
Worker nodes

• Using the standard OSG WN setup
  • But starting from the NVIDIA base image, to interface with the CUDA kernel driver
• CVMFS was not trivial
  • Since we want to run without any special privileges
  • More on this in the next slide
• Worker nodes need to share
  • Pool password with the collector
  • User list with the CE
  • But no need for a shared file system
• Using a Deployment with a high replica count, one pod per GPU (see the sketch below)
• But pods have very low priority
  • Will be preempted the moment any other user needs the GPU

$ cat Dockerfile
FROM nvidia/cuda:10.0-runtime-centos7
…
# OSG RPMs
RUN yum -y install osg-wn-client && \
    yum -y install condor
…

kind: Deployment
metadata:
  name: osg-wn-gpu-opt
  labels:
    k8s-app: osg-wn-gpu-opt
spec:
  replicas: 150
  template:
    metadata:
      labels:
        k8s-app: osg-wn-gpu-opt
    spec:
      priorityClassName: opportunistic
      containers:
      - name: osg-wn-gpu-opt
        image: sfiligoi/prp-osg-pool:wn-gpu
        …

https://github.com/sfiligoi/prp-osg-pool
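The elided part of the container spec is where the GPU binding happens; with the NVIDIA device plugin, requesting exactly one GPU per pod would look roughly like the fragment below (a sketch, not the exact manifest from the repository).

      containers:
      - name: osg-wn-gpu-opt
        image: sfiligoi/prp-osg-pool:wn-gpu
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per worker-node pod; the scheduler spreads the replicas over free GPUs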
CVMFS inside Kubernetes

• Mounting a FUSE mountpoint from inside a container requires elevated privileges
  • Not something a k8s admin would want to give to pods running user code
• Most hosts do not install CVMFS in PRP
  • Making a fat image with a snapshot of CVMFS was not desirable
• The k8s-recommended way is to use something called the Container Storage Interface (CSI)
  • A DaemonSet pod with the cvmfs code is deployed on each host
  • PersistentVolumeClaims are used to interface them with user pods
• The CERN version uses a deprecated API
  • Dima ported it to the latest k8s API version
  • PR submitted, will continue the dialog

https://github.com/cernops/cvmfs-csi/pull/1
Using CVMFS in worker nodes

• The current CVMFS CSI interface requires explicit mounting of each repository
  • Currently mounting 5 of them just to support IceCube (each needs its own StorageClass and claim; see below)
  • Would need to add several more for a more generic setup
• Some edge cases still need to be ironed out
  • Pods hang if CSI DaemonSet pods get restarted while claimed

kind: Deployment
…
spec:
  template:
    spec:
      containers:
      - name: osg-wn-gpu-opt
        volumeMounts:
        - name: cvmfs-oasis
          mountPath: /cvmfs/oasis.opensciencegrid.org
          readOnly: true
        - name: cvmfs-icecube
          mountPath: /cvmfs/icecube.opensciencegrid.org
          readOnly: true
        …
      volumes:
      - name: cvmfs-oasis
        persistentVolumeClaim:
          claimName: csi-cvmfs-pvc-oasis
          readOnly: true
      - name: cvmfs-icecube
        persistentVolumeClaim:
          claimName: csi-cvmfs-pvc-icecube
          readOnly: true
      …

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cvmfs-icecube
provisioner: csi-cvmfsplugin
parameters:
  repository: icecube.opensciencegrid.org
---
kind: PersistentVolumeClaim
metadata:
  name: csi-cvmfs-pvc-icecube
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-cvmfs-icecube

https://github.com/sfiligoi/prp-osg-cvmfs
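Each of the mounted repositories needs its own StorageClass/PersistentVolumeClaim pair; for example, the oasis claim referenced above would presumably mirror the published icecube definitions, roughly like this sketch.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cvmfs-oasis
provisioner: csi-cvmfsplugin
parameters:
  repository: oasis.opensciencegrid.org
---
kind: PersistentVolumeClaim
metadata:
  name: csi-cvmfs-pvc-oasis
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-cvmfs-oasis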
Other setup issues

• IceCube uses OpenCL
  • The NVIDIA base image does not come with OpenCL enabled out of the box
  • Even though they do provide the OpenCL libraries!
  • Trivial to solve, once you know the trick (see the Dockerfile snippet below)
• No nested containerization
  • Cannot use the OSG-recommended Singularity
  • May not be possible in the foreseeable future, but not sure yet
• Took a few iterations with IceCube to get all the system libraries into the base image
• Some GPU models were seen via CUDA but not OpenCL
  • Added additional validation at pod startup
  • Also helps detect broken GPUs

$ cat Dockerfile
FROM nvidia/cuda:10.0-runtime-centos7

# Enable OpenCL
# As suggested by https://github.com/WIPACrepo/pyglidein/
RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
…
Operational experience

• Mostly running without human intervention
  • After initial kinks were sorted out
  • CVMFS CSI daemon restarts still a concern
• Priority mechanism works as advertised
  • OSG pods use only otherwise unused resources

[Figure: PRP Grafana view; GPUs used for visualization]
The PRP k8s is now a significant IceCube resource

[Figures: OSG GPU GRACC view, last 21 days; OSG Glidein Factory view]
Any complaints with the current setup?

• The current setup is a monitoring nuisance
  • OSG pods sit idle if IceCube has nothing in the queue
  • Hard to distinguish a misbehaving user job from an empty pod
• The current setup also only works with one opportunistic user
  • No easy way to share with two or more communities
  • As mentioned, k8s is not good at handling contention
HTCondor-native k8s support desired

• Engaged the HTCondor group to add k8s support to HTCondor-G
  • In particular, as part of the OSG HTCondor-CE
  • Got assurances that it would happen, but no timeline yet
• That would solve the problem of having idle OSG pods in the system
• Will likely not solve the problem of multiple opportunistic users on its own
  • But it is a pre-requisite for k8s-native improvements
Summary

• Creating an opportunistic OSG site inside the PRP k8s cluster was a major success
  • The OSG IceCube community was able to get significant additional science done
  • Without any adverse effect on the primary PRP users
• The current setup is mostly working fine
  • Although the CVMFS setup could use some hardening
• The lack of native OSG support for k8s is currently just a minor nuisance
  • But having a proper solution in place before it escalates would be highly desirable
Acknowledgments

The PRP Kubernetes cluster is supported by US National Science Foundation (NSF) awards CNS-1730158, ACI-1541349 and OAC-1826967.

The authors are part of the San Diego Supercomputer Center.