Data production using CernVM and lxCloud
Dag Toppe Larsen
Belgrade, 2013-05-28
NA61/NA49 meeting, Belgrade

Outline
• New data production scripts
• Virtualised data production
• Data production manager
• Next steps
Data production sequence
New data production scripts
• New set of scripts:
  • prodna61-produce-reaction.sh
  • prodna61-produce-chunk.sh
  • prodna61-find-chunk-errors.sh
  • Details on the next slides
• Exclusively use the xRootd interface to Castor
• Initially the scripts mainly targeted CernVM, but recent involvement in the "normal" data production provided an opportunity to cover lxBatch as well
  • This involvement also gave a much better overview/understanding of the requirements
• The scripts "work", but there are some issues that need to be addressed before fully automated usage
• Will significantly reduce the work and the chance of mistakes, even when run by hand from the command line
• To be executed from the web data production manager
New data production scripts
• prodna61-produce-reaction.sh <reaction>
  • e.g. prodna61-produce-reaction.sh BeBe160
• Initiates the production of a reaction
• Gets lists of:
  • chunks from bookkeeping/Castor
  • software from the file system
  • global keys from the KEY DB
• Takes the latest global key and software by default (additional parameters otherwise)
• Submits jobs to the batch system (either CernVM or lxBatch)
• The jobs run the prodna61-produce-chunk.sh script (next slide)
• Small differences between the lxBatch/CernVM versions (related to the different batch systems)
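The driver loop above can be sketched as follows. This is a hypothetical skeleton, not the actual prodna61-produce-reaction.sh: the queue name, chunk list, and the default key and software version are invented placeholders, whereas the real script resolves them from bookkeeping/Castor, the KEY DB and the file system.

```shell
#!/bin/sh
# Hypothetical sketch of the per-reaction driver; all names below are
# placeholders. The submission is a dry run (echo only).
REACTION=${1:-BeBe160}
GLOBAL_KEY=${2:-latest}   # assumption: "latest" resolved from the KEY DB
SOFTWARE=${3:-v13b}       # assumption: newest version found on the file system

# Build the batch submission command for one chunk.
# On lxBatch this would be a bsub call; CernVM uses its own batch system.
submit_chunk() {
    echo "bsub -q 1nd prodna61-produce-chunk.sh $REACTION $1 $GLOBAL_KEY $SOFTWARE"
}

for chunk in chunk-001 chunk-002; do   # real list comes from bookkeeping/Castor
    submit_chunk "$chunk"
done
```

Keeping the lxBatch/CernVM difference confined to a single function like submit_chunk is one way to maintain the two versions with minimal divergence.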
New data production scripts
• prodna61-produce-chunk.sh
• Several parameters defining paths, the global key, software versions, etc.
• Designed to be flexible, also with regard to processing outside CERN
• Configuration parameters like the magnetic field, etc., are not passed; the job determines them itself
• Modifies a template set-up file (prodna61-setup) to fit the requirements
• Steps:
  • Get the raw file from Castor and unpack it
  • Run the legacy software
  • Run ROOT61
  • Convert legacy output to SHOE
  • Run native Shine reconstruction (PSD)
  • Merge the converted legacy data and the native Shine data
  • Create a mini-SHOE
  • Run Anar's QA (on the chunk, intended to be merged later)
  • Upload the files to Castor and/or local disk
  • Compress the log file and store it on Castor
• The process will become simpler after the switch to native Shine reconstruction, since most complications are related to the legacy chain
• Typically not run by the user directly (though possible), but submitted to the batch system by prodna61-produce-reaction.sh
• Same version for CernVM/lxBatch
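The per-chunk steps above can be outlined as a script skeleton. Every step is only echoed here (dry run), since the real binaries (legacy chain, ROOT61, Shine) are not available outside the production environment; the step names are descriptive labels, not the actual command names.

```shell
#!/bin/sh
# Dry-run outline of the per-chunk processing sequence; step names are
# descriptive, not the real binaries.
run_step() { echo "STEP: $*"; }

process_chunk() {
    run_step "fetch raw file for $1 from Castor (xrdcp) and unpack"
    run_step "run legacy software"
    run_step "run ROOT61"
    run_step "convert legacy output to SHOE"
    run_step "run native Shine reconstruction (PSD)"
    run_step "merge converted legacy data and native Shine data"
    run_step "create mini-SHOE"
    run_step "run per-chunk QA"
    run_step "upload output files to Castor and/or local disk"
    run_step "compress log file and store it on Castor"
}

process_chunk chunk-001
```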
New data production scripts
• prodna61-find-chunk-errors.sh
• Searches for errors for a given chunk
• Errors searched for:
  • Castor checked for too small/empty/non-existent DSPACK, ROOT, SHOE, MINI-SHOE, LOG or QA files
  • Log file scanned for failed events (above a given threshold) and for the job having exited/been killed/terminated
• Intended to be run as an acrontab job for the production manager web page, for all finished jobs
• Same version for CernVM/lxBatch
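The two kinds of checks can be sketched as small shell functions. The size threshold, the demo file path and the grep patterns are assumptions for illustration; the real script inspects the output files on Castor via xRootd rather than a local file system.

```shell
#!/bin/sh
# Sketch of the error checks; threshold and patterns are assumptions.
MIN_SIZE=1024   # assumed minimum plausible size of an output file, in bytes

file_ok() {
    # file exists and is at least MIN_SIZE bytes
    [ -f "$1" ] && [ "$(wc -c < "$1")" -ge "$MIN_SIZE" ]
}

log_errors() {
    # count lines reporting failed events or a killed/terminated job
    grep -c -E 'event failed|exited|killed|terminated' "$1"
}

# demo input: a log with two problem lines
printf 'event failed\njob killed\nall good\n' > /tmp/demo_chunk.log
log_errors /tmp/demo_chunk.log
```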
Remaining issues for DP scripts
• The reconstruction can be run with either "-pA" or "-pp" for the fitting of the primary vertex
  • Which one is preferable for a given reaction typically depends on the target length
• Run-script:
  • run_keys="-d all -256 -pp -keep -minipoint -points -f $setup"
  • run_keys="-d all -256 -pA -keep -minipoint -points -f $setup"
• Set-up file:
  • Exec $v0find -s $DSPACK_SERVER -d all -pp
  • Exec $v0find -s $DSPACK_SERVER -d all -pA
  • Exec $xi_fit -s $DSPACK_SERVER -d all -f 13 -pp
  • Exec $xi_fit -s $DSPACK_SERVER -d all -f 13 -pA
• Questions:
  • Would it be possible to have a KEY for this? Otherwise it needs to be stored in a separate "database", increasing complexity
  • Why is -pp/-pA passed both as a parameter to the run-script and inside the set-up file? Does it ever happen that the two settings differ?
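The KEY idea raised above amounts to deriving the -pp/-pA choice once, from a single lookup, and using it for both the run-script keys and the set-up file. The reaction-to-flag mapping below is invented for illustration; the real choice depends on the target length and would come from the KEY DB.

```shell
#!/bin/sh
# Hypothetical single-point-of-truth for the vertex-fit flag; the mapping
# is invented, not the actual physics criterion.
vertex_fit_flag() {
    case "$1" in
        pp*) printf '%s\n' "-pp" ;;   # short (hydrogen) target
        *)   printf '%s\n' "-pA" ;;   # assumed default for longer targets
    esac
}

flag=$(vertex_fit_flag BeBe160)
# the same $flag would then be substituted into both the run-script keys
# and the v0find/xi_fit lines of the set-up file
run_keys="-d all -256 $flag -keep -minipoint -points -f \$setup"
echo "$run_keys"
```

This removes the risk of the run-script and set-up file disagreeing, which is the consistency question raised on the slide.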
Remaining issues for DP scripts
• KEY5
  • e.g. KEY5=CALC/STD+
  • Question: is there any reason why this is set explicitly in the set-up file, and not from the global key?
• Residual corrections
  • e.g. Exec res_corr -s $DSPACK_SERVER -vt1_chris $CORR_DIR/vt1_2009_pp158.corr -vt2_chris $CORR_DIR/vt2_2009_pp158.corr -mtl_chris $CORR_DIR/mtl_2009_pp158.corr -mtr_chris $CORR_DIR/mtr_2009_pp158.corr -p $CORR_DIR_OLD/vdrift_2009.txt
  • Question: can we have a KEY for this as well? Currently we only have one set of residual correction files; are more envisioned?
Remaining issues for DP scripts
• The xRootd replacement for "nsls <path>" is "xrd castorpublic dirlist <path>"
  • Very slow for directories with many files or deep directory trees: several minutes to return data
  • Not very practical for user interaction
  • Used for obtaining the list of chunks for a reaction from Castor
  • Possible solutions: ask IT if it can be improved, or obtain the data from the bookkeeping database instead
• PSD reconstruction for Shine needs different files for different run conditions
  • PSDReconstructor.xml, PSDCalibXMLConfig.xml
  • Question: can this be done more automatically on the Shine side?
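The "use the bookkeeping database instead" option could look like the sketch below: prefer a cached chunk list (e.g. exported from bookkeeping) and fall back to the slow xrd dirlist only when no cache exists. The cache file name and the Castor path are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical chunk-listing helper; cache location and Castor path are
# invented placeholders.
list_chunks() {
    cache="/tmp/$1.chunks"
    if [ -f "$cache" ]; then
        cat "$cache"                                         # fast local path
    else
        xrd castorpublic dirlist "/castor/cern.ch/na61/$1"   # slow fallback
    fi
}

# demo: a pre-exported list, so the slow fallback is never hit
printf 'chunk-001\nchunk-002\n' > /tmp/BeBe160.chunks
list_chunks BeBe160
```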
Any other parameters?
• The set-up file is currently modified from a template for the magnetic field, residual corrections and -pp/-pA
  • Either by hand for the "old" data production scripts
  • Or automatically for the "new" scripts
• Question: are there any other parameters that need to be taken into account for the data production?
Create/destroy virtual clusters
• Scripts for creating/destroying virtual clusters of virtual machines on lxCloud (or other clouds) have been created
• It will be possible to launch older virtual machines for data preservation (running older versions of the data production software)
• Some tuning is needed for the latest iteration of the test lxCloud
• The final lxCloud is charged per hour a virtual machine is running (whether or not it is doing any processing)
  • Hence it is important to be able to create/destroy virtual clusters on demand
  • The creation/destruction of virtual machines must be controlled by the web production manager
  • Some control logic needs to be developed on the web production manager side
Virtualised data production
• So far, legacy software v12j has been used for testing on virtual machines
• Now installing v13b (or c?) to be able to use the latest versions (also for the global key) for a test of a whole reaction
• The modified version of Anar's QA (ratio, difference) can be used to compare the outputs
  • However, a large part of the differences may be due to "random" missing events in either production
Virtualised data production resource estimate
• Processing time of a chunk depends on the reaction
  • BeBe160 ~1.5 h
  • pp ~45 min
  • Consistent with experience from lxBatch
• Cost estimate based on currently processed data
  • Whole run 15252 (BeBe40, 170 files) produced on virtual machines
  • 10 virtual machines
  • Made sure the chunks were staged on Castor first
  • Processing time on the test lxCloud ~1 h/chunk (slightly less)
• Assuming 1 h/chunk, 10 000 chunks per reaction, and 2 days as a "reasonable" processing time for a reaction:
  • 10 000 / 24 / 2 ≈ 208 virtual machines for the cluster
• A quota of 200 VMs has been allocated by IT for testing
• The production of a full reaction for data validation will give a better estimate
  • Installing the latest legacy software 13b (c?) for this
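The cluster-size arithmetic above (number of chunks divided by the hours available, at an assumed ~1 h per chunk) can be written down directly:

```shell
#!/bin/sh
# Cluster-size estimate from the slide; integer shell arithmetic
# truncates 10000/48 = 208.33... to 208, matching the quoted figure.
vms_needed() {
    chunks=$1; days=$2
    echo $(( chunks / (24 * days) ))
}
vms_needed 10000 2   # 10 000 chunks in 2 days
```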
Web data production manager
• Dimitije created the current production manager
  • Since he left NA61, I have started looking into how it works
• Two "parts":
  • Web page displaying information
  • Background acrontab jobs updating files with information to be accessed by the web page
• Missing/incomplete parts:
  • Interface to the new production scripts
  • Authentication (to verify that a user has the rights to start a production)
  • Production database (stores information about the status of running/finished chunk jobs, initiates the search for chunk errors, and resubmits chunks as needed)
  • Interface to the bookkeeping database (upload of finished reactions)
  • Interface to create/kill virtual clusters
Data production manager database
• Needed to keep track of the status of ongoing/finished jobs (chunks) for productions
• Some initial scripts created, initiated from an acrontab job:
  • Search for finished jobs, update the database
  • Check whether jobs were successful, update the database
  • Resubmit failed jobs, update the database
• Based on SQLite
• These scripts will be the back-end for the web production manager
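A minimal sketch of the SQLite-backed status tracking, assuming the standard sqlite3 command-line client is available; the table layout and the status values are invented for illustration, not the actual schema of the production database.

```shell
#!/bin/sh
# Hypothetical job-status table; schema and status values are assumptions.
DB=/tmp/prodna61_demo.db
rm -f "$DB"
sqlite3 "$DB" 'CREATE TABLE jobs (chunk TEXT PRIMARY KEY, status TEXT);'

# acrontab pass 1: a chunk job was submitted
sqlite3 "$DB" "INSERT INTO jobs VALUES ('chunk-001', 'running');"
# acrontab pass 2: the job finished and passed the error checks
sqlite3 "$DB" "UPDATE jobs SET status = 'done' WHERE chunk = 'chunk-001';"

sqlite3 "$DB" "SELECT chunk, status FROM jobs;"
```

A failed check would instead set the status to something like 'failed', which a later acrontab pass picks up for resubmission.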
Automatic update of the bookkeeping database
• The bookkeeping database (Alexander's) needs to be updated when a production has finished
• Should be done automatically by the web production manager (database)
• The interface between the production manager and the bookkeeping database must work for both CernVM and lxBatch:
  • Must not depend on AFS
  • Must also work for CernVM processing outside CERN
  • Hence HTTP-based
• First step: create scripts to do the update by hand
  • e.g. prodna61-update-bookkeeping.sh <production details>
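A skeleton for such a by-hand prodna61-update-bookkeeping.sh might look as follows. The endpoint URL and form-field names are invented; only the HTTP-over-curl approach (no AFS dependency, works outside CERN) comes from the slide. DRY_RUN=1 (the default) prints the command instead of contacting any server.

```shell
#!/bin/sh
# Hypothetical bookkeeping-update skeleton; URL and field names invented.
BOOKKEEPING_URL=${BOOKKEEPING_URL:-https://example.cern.ch/bookkeeping/update}

update_bookkeeping() {
    cmd="curl -s -X POST -d reaction=$1 -d key=$2 -d software=$3 $BOOKKEEPING_URL"
    if [ "${DRY_RUN:-1}" -eq 1 ]; then
        echo "$cmd"    # dry run: show what would be sent
    else
        $cmd
    fi
}

update_bookkeeping BeBe160 latest v13b
```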
Next steps
• Short-term:
  • Address the remaining issues for the unified CernVM/lxBatch production scripts
  • Install legacy software 13b (c?) on CvmFS for further validation
  • Further investigation of missing events: sometimes an event can fail, but succeed the next time?
  • By-hand way to upload data to the bookkeeping database
• Long-term: web production manager
  • Interface to the production scripts
  • Database for the status of ongoing productions
  • Automatic upload of data to the bookkeeping database
Roadmap
Volunteers
• For participating in the "normal" data production team?
• If anybody is interested, we can have a "mini-workshop" later this week