CVMFS Post Mortem
Doug Benjamin, Duke University
What happened?
• PoolFileCatalog.xml became corrupt
• The relevant section of the file is:

<File ID="6651E9BA-061E-DD11-8F27-00304879FC6E">
  <physical>
    <pfn filetype="ROOT_All" name="/cvmfs/a<File ID="F80FEF94-CAF8-E011-8FBD-003048F0E7A2">
  <physical>
    <pfn filetype="ROOT_All" name="/cvmfs/atlas-condb.cern.ch/repo/conditions/cond11/cond11_data.000012.gen.COND/cond11_data.000012.gen.COND._0002.pool.root"/>
  </physical>
  <logical>
    <lfn name="cond11_data.000012.gen.COND._0002.pool.root"/>
  </logical>
</File>

The first <pfn filetype="ROOT_All" name="/cvmfs/a ... is bogus.
What happened (2)
• The lead cvmfs developer was cleaning the repository and triggered the publishing of the bogus file.
• He did not know it was bogus (there was no way he could have known).
• Stratum 1 servers picked up the bogus file within 1 hour and published it.
• Cron jobs on Stratum 1 servers fetch files from the Stratum 0 server hourly.
• Cvmfs clients fetch files from the Stratum 1 servers whenever the time-to-live information expires or an automount of a cvmfs area is triggered.
How was the PFC created?
• The PoolFileCatalog.xml is created by a cron script that runs this command in a loop, where $dir_list is

dir_list="oflcond cmccond comcond cond08 cond09 cond10 cond11 cond12 cond13 cond14 cond15 cond16 cond17 cond18 cond19 cond20"

and ATLAS_POOLCOND_PATH is

export ATLAS_POOLCOND_PATH=/cvmfs/atlas-condb.cern.ch/repo/conditions

# loop over the directories
for dir in $dir_list
do
    # determine if there are any data sets
    ls -1 ${ATLAS_POOLCOND_PATH}/${dir}/* > /dev/null 2>&1
    if [ "$?" = "0" ]
    then
        echo "running command - dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir}" >> $LogFile 2>&1
        dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir} >> $PoolFileCatalogLog 2>&1
        retcode=RC$?
        if [ $retcode != "RC0" ] ; then
            echo "Error - failed to update PoolFileCatalog - exiting " >> $LogFile 2>&1
            echo "Error - failed to update PoolFileCatalog - exiting "
            exit 1
        fi
    else
        echo "${ATLAS_POOLCOND_PATH}/${dir} does not have datasets" >> $LogFile 2>&1
    fi
done
What was the immediate fix?
• The bogus lines were removed from the PoolFileCatalog.xml.
• The cron job that does the file checkout and ultimate publishing was stopped and has not been restarted.
Why did it happen?
• Not sure why the PoolFileCatalog creation failed.
• Logs did not give any indication of the failure.
• Did not have a backup PFC file.
Remediation steps
• Ultimately use Alessandro DeSalvo’s sw-mgr code to get the datasets and create the PFC (saves older version)
• Requires ATLAS software releases to be available on the conditions db machine.
• Steve Traylen is working on cvmfs mounts. It is a bit tricky and troublesome.
• Run the XML and file verification step from Misha Borodin in the cron job
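The verification step named above is not reproduced here. As a stand-in, the following is a minimal sketch of what an XML check on the catalog could look like, assuming the failure mode seen in this incident (a <File> entry that is opened but never closed). The function names count_tag, pfc_tags_balanced and verify_pfc are hypothetical. The sketch uses xmllint for a full well-formedness check when it happens to be installed, and otherwise falls back to a tag-count heuristic that is much weaker than real XML parsing.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, not part of any ATLAS tool.

# Print how many times a pattern occurs in a file (0 if none).
count_tag() {
    grep -o "$1" "$2" | wc -l | tr -d ' '
}

# Heuristic: every opening <File ...> should have a closing </File>.
pfc_tags_balanced() {
    local pfc="$1" opened closed
    opened=$(count_tag '<File ' "$pfc")
    closed=$(count_tag '</File>' "$pfc")
    [ "$opened" -eq "$closed" ]
}

# Return 0 only if the catalog is non-empty and passes the checks.
verify_pfc() {
    local pfc="$1"
    [ -s "$pfc" ] || return 1
    # Full well-formedness check, but only if xmllint is available.
    if command -v xmllint >/dev/null 2>&1; then
        xmllint --noout "$pfc" 2>/dev/null || return 1
    fi
    pfc_tags_balanced "$pfc"
}
```

A cron script could call verify_pfc on a freshly generated catalog and refuse to publish when it returns non-zero.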
Short term plans
• Resume fetching of datasets to the machine
• Will be done manually (with the same script, without the publishing step)
• Will run PFC file creation separately.
• Add XML format verification
• PFC file backup (keep a few copies)
• Once everything looks good, publish manually
• Will update every day or so
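The "keep a few copies" backup item could be handled with a small rotation helper. The following is a sketch under assumptions, not the script actually used: rotate_backups is a hypothetical name, and the numbered-suffix naming (.1 newest, .N oldest) is one possible convention.

```shell
#!/bin/bash
# Sketch only: rotate_backups is a hypothetical helper.
# Keeps up to N copies of a file as file.1 (newest) ... file.N (oldest).
rotate_backups() {
    local file="$1" keep="${2:-3}" i prev
    # Nothing to back up yet (for example on the very first run).
    [ -f "$file" ] || return 0
    i=$keep
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        # Shift file.(i-1) to file.i; the oldest copy is overwritten.
        if [ -f "$file.$prev" ]; then
            mv -f "$file.$prev" "$file.$i"
        fi
        i=$prev
    done
    # Save the current file as the newest backup.
    cp -p "$file" "$file.1"
}
```

Calling rotate_backups on the current PoolFileCatalog.xml before a new one is written means a known-good copy is still on disk if the new file turns out to be corrupt.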
Intermediate plans
• Once ATLAS code is available
• Implement sw-mgr creation of PFC and fetch of the datasets.
• Initially will be done by hand
• Ultimately moved to cron job
• Will add e-mail notification in case of failures
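One way the planned failure notification could be wired into a cron script is sketched below. This is illustrative only: run_step and notify_failure are hypothetical names, and LOG_FILE and NOTIFY_ADDR are assumed variables. The sketch also assumes a standard mail command is installed. It logs every step with a timestamp and exit status, which addresses the earlier problem that the logs gave no indication of the failure.

```shell
#!/bin/bash
# Sketch only: names and variables below are assumptions.
LOG_FILE="${LOG_FILE:-/tmp/pfc_update.log}"
NOTIFY_ADDR="${NOTIFY_ADDR:-}"   # e-mail address; empty disables mail

# Send a failure message by e-mail if an address and mail(1) are available.
notify_failure() {
    # $1 = subject, $2 = body
    if [ -n "$NOTIFY_ADDR" ] && command -v mail >/dev/null 2>&1; then
        printf '%s\n' "$2" | mail -s "$1" "$NOTIFY_ADDR"
    fi
}

# Run one named step, log its exit status, and notify on failure.
run_step() {
    local name="$1" rc
    shift
    if "$@" >> "$LOG_FILE" 2>&1; then
        rc=0
    else
        rc=$?
    fi
    printf '%s step=%s rc=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$name" "$rc" >> "$LOG_FILE"
    if [ "$rc" -ne 0 ]; then
        notify_failure "PFC update failed: $name" \
            "Step '$name' exited with status $rc; see $LOG_FILE"
    fi
    return "$rc"
}
```

A step would then be wrapped as, for example, run_step create_pfc followed by the actual command and its arguments, with the script exiting if run_step returns non-zero.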