
Dataflow/workflow with real data through Tiers



  1. Tutorial: Dataflow/workflow with real data through Tiers. N. De Filippis, Department of Physics and INFN Bari

  2. Outline • Computing facilities in the control room, at Tier-0 and at the Central Analysis Facilities (CAF): • Ex: Tracker Analysis Centre (TAC) • Local storage and automatic processing at the TAC • how to register files in DBS/DLS • Automatic data shipping and remote processing at Tier-1/Tier-2 • Injection into PhEDEx for the transfer • Re-reconstruction and skimming with ProdAgent • Data analysis in a distributed environment via CRAB • Simulation of cosmics at a Tier-2 site

  3. What is expected in the CMS Computing Model Dataflow/workflow from Point 5 to the Tiers: • The CAF will support: • diagnostics of detector problems, trigger performance services, • derivation of calibration and alignment data • reconstruction services, interactive and batch analysis facilities • Most of the tasks have to be performed at remote Tier sites in a distributed environment.

  4. Computing facilities in the control room, at Tier-0 and at Central Analysis facilities

  5. Example of a facility for the Tracker • The TAC is a dedicated Tracker Control Room • It serves the needs of collecting and analysing the data from the 25% Tracker test at the Tracker Integration Facility (TIF) • In use since Oct. 1st 2006 by DAQ and detector people • Computing elements at the TAC: • 1 disk server: CMSTKSTORAGE • 1 DB server: CMSTIBDB • 1 wireless/wired router • 12 PCs: • 2 DAQ (CMSTAC02 and CMSTAC02) • 3 DQM, 1 visualization (CMSTKMON, CMSTAC04 and CMSTAC05) • 2 TIB/TID (CMSTAC00 and CMSTAC01) • 3 DCS (PCCMSTRDCS10, PCCMSTRDCS11 and PCCMSTRDCS12) • 2 TEC+ (CMSTAC06 and CMSTAC07) + 1 private PC • The TAC is like a control room + Tier-0 + CAF in miniature

  6. Local storage and processing at the TAC • A dedicated PC (CMSTKSTORAGE) is devoted to storing the data temporarily: • it currently has 2.8 TB of local fast disk (no redundancy) • it allows local caching for about 10 days of data taking (300 GB/day expected for the 25% test) • CMSTKSTORAGE is also used to perform the following tasks: • perform O2O for connection and pedestal runs to fill the offline DB • convert RU files into EDM-compliant formats • write files to CASTOR when ready (a sketch of this step is given below); areas in CASTOR are created under …/store/…: • /castor/cern.ch/cms/store/TAC/PIXEL • /castor/cern.ch/cms/store/TAC/TIB • /castor/cern.ch/cms/store/TAC/TOB • /castor/cern.ch/cms/store/TAC/TEC • register files in the Data Bookkeeping Service (DBS) and Data Location Service (DLS)
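
As an illustration of the CASTOR copy step, the following minimal sketch (not taken from the slides) copies one converted EDM file into the TIB area listed above using the standard CASTOR RFIO tools; the local file name is a placeholder.

  # Sketch only: copy one converted EDM file to the TIB area in CASTOR.
  # EDM0000505_000.root is a placeholder local file name.
  rfcp EDM0000505_000.root /castor/cern.ch/cms/store/TAC/TIB/edm_2007_01_29/EDM0000505_000.root
  nsls -l /castor/cern.ch/cms/store/TAC/TIB/edm_2007_01_29   # verify that the copy arrived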

  7. How to register files in DBS/DLS (1) • A grid certificate with CMS Role=Production is needed: • voms-proxy-init -voms cms:/cms/Role=production • DBS and DLS API: • cvs co -r DBS_0_0_3a DBS • cvs co -r DLS_0_1_2 DLS • One DBS and one DLS instance: please use • MCLocal_4/Writer for DBS • prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4 for DLS • The following info about your EDM-compliant file is needed: • --PrimaryDataset=TAC-TIB-120-DAQ-EDM • --ProcessedDataset=CMSSW_1_2_0-RAW-Run-0000505 • --DataTier=RAW • --LFN=/store/TAC/TIB/edm_2007_01_29/EDM0000505_000.root • --Size=205347982 • --TotalEvents=3707 • One processed dataset is created per run

  8. How to register files in DBS/DLS (2) • --GUID=38ACFC35-06B0-DB11-B463 extracted with EdmFileUtil -u file:file.root • --CheckSum=4264158233 extracted with the cksum command • --CMSSWVersion=CMSSW_1_2_0 • --ApplicationName=FUEventProcess • --ApplicationFamily=Online • --PSetHash=4cff1ae0-1565-43f8-b1e9-82ee0793cc8c extracted with uuidgen • Run the script for the registration in DBS: • python dbsCgiCHWriter.py --DBSInstance=MCLocal_4/Writer --DBSURL="http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery" --PrimaryDataset=$primdataset --ProcessedDataset=$procdataset --DataTier=RAW --LFN=$lfn --Size=$size --TotalEvents=$nevts --GUID=$guid --CheckSum=$cksum --CMSSWVersion=CMSSW_1_2_0 --ApplicationName=FUEventProcess --ApplicationFamily=Online --PSetHash=$psethash • Closure of blocks in DBS: • python closeDBSFileBlock.py --DBSAddress=MCLocal_4/Writer --datasetPath=$dataset • The two scripts dbsCgiCHWriter.py and closeDBSFileBlock.py can be found in /afs/cern.ch/user/n/ndefilip/public/Registration/ • A possible wrapper putting these steps together is sketched below
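
A possible way to glue these steps together is a small wrapper script like the sketch below. It is not part of the official scripts: the file name, dataset names and the way the number of events is obtained are assumptions to be adapted, and the GUID may need to be parsed out of the EdmFileUtil output.

  #!/bin/sh
  # Sketch of a registration wrapper (illustrative, not the official procedure).
  FILE=EDM0000505_000.root                      # placeholder local file name
  LFN=/store/TAC/TIB/edm_2007_01_29/$FILE
  PRIMDS=TAC-TIB-120-DAQ-EDM
  PROCDS=CMSSW_1_2_0-RAW-Run-0000505
  NEVTS=3707                                    # number of events, to be supplied per file

  SIZE=`ls -l $FILE | awk '{print $5}'`         # file size in bytes
  GUID=`EdmFileUtil -u file:$FILE`              # GUID (the output may need parsing)
  CKSUM=`cksum $FILE | awk '{print $1}'`        # checksum
  PSETHASH=`uuidgen`                            # PSet hash generated with uuidgen

  python dbsCgiCHWriter.py --DBSInstance=MCLocal_4/Writer \
    --DBSURL="http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery" \
    --PrimaryDataset=$PRIMDS --ProcessedDataset=$PROCDS --DataTier=RAW \
    --LFN=$LFN --Size=$SIZE --TotalEvents=$NEVTS --GUID=$GUID --CheckSum=$CKSUM \
    --CMSSWVersion=CMSSW_1_2_0 --ApplicationName=FUEventProcess \
    --ApplicationFamily=Online --PSetHash=$PSETHASH

  # Close the block once all files of the run are registered
  # (dataset path format is an assumption; adapt to your DBS instance conventions)
  python closeDBSFileBlock.py --DBSAddress=MCLocal_4/Writer --datasetPath=/$PRIMDS/$PROCDS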

  9. How to register files in DBS/DLS (3) • Once the data are registered in DBS, run the script for the registration of blocks of files in DLS: • python dbsread.py --datasetPath=$dataset • or, for each block of files: • dls-add -i DLS_TYPE_LFC -e prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4 /TAC-TIB-120-DAQ-EDM/CMSSW_1_2_0-RAW-Run-0000505#497a013d-3b49-43ad-a80f-dbc590e593d7 srm.cern.ch • where srm.cern.ch is the name of the SE hosting the data
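
For many blocks the dls-add call can be scripted; the following is only a sketch, assuming the fileblock names have been collected into a hypothetical blocks.txt (one name per line, e.g. obtained from a DBS query such as dbsread.py above).

  # Sketch only: register every fileblock listed in blocks.txt in DLS,
  # with srm.cern.ch as the hosting SE.
  SE=srm.cern.ch
  ENDPOINT=prod-lfc-cms-central.cern.ch/grid/cms/DLS/MCLocal_4
  while read BLOCK; do
    dls-add -i DLS_TYPE_LFC -e $ENDPOINT "$BLOCK" $SE
  done < blocks.txt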

  10. Results in the Data Discovery page: http://cmsdbs.cern.ch/discovery/expert (showing Tracker data and MTCC data)

  11. Automatic data shipping and remote processing at Tier-1/Tier-2

  12. PhEDEx injection (1) • Data published in DBS and DLS are ready to be transferred via the CMS official data-movement tool, PhEDEx • The injection, i.e. the procedure that writes the data into the PhEDEx transfer database, in principle has to be run from CERN, where the data are collected • but it can also be run at a remote Tier-1/Tier-2 site hosting PhEDEx • It runs at Bari via an official PhEDEx agent and a component of ProdAgent modified to “close” blocks at the end of the transfer, in order to enable automatic publishing in DLS (the same procedure used for Monte Carlo data) • complete automation is reached with a script that watches for new Tracker-related entries in DBS/DLS (a possible sketch of such a watcher is given below) • Once data are injected into PhEDEx, any Tier-1 or Tier-2 can subscribe to them
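
The watcher script itself is not shown in these slides; the following is only a rough sketch of the idea, assuming a hypothetical list_tracker_datasets.sh that prints the Tracker dataset paths currently registered in DBS, and reusing the dbsinjectTMDB.py call shown on slide 14.

  #!/bin/sh
  # Sketch of an injection watcher (illustrative; list_tracker_datasets.sh
  # and injected.list are hypothetical).
  while true; do
    ./list_tracker_datasets.sh | while read DATASET; do
      if ! grep -q "$DATASET" injected.list 2>/dev/null; then
        python dbsinjectTMDB.py --datasetPath=$DATASET --injectdir=logs/
        echo "$DATASET" >> injected.list      # remember what was already injected
      fi
    done
    sleep 600                                 # look for new entries every 10 minutes
  done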

  13. PhEDEx injection (2) • ProdAgent_v0XX is needed: • configure PA to use the PhEDEx dropbox /dir/state/inject-tmdb/inbox: • prodAgent-edit-config --component=PhEDExInterface --parameter=PhEDExDropBox --value=/dropboxdir/ • start the PhEDExInterface component of PA: • prodAgentd --start --component=PhEDExInterface • PhEDEx_2.4 is needed: • configure the inject-tmdb agent in your Config file: • ### AGENT LABEL=inject-tmdb PROGRAM=Toolkit/DropBox/DropTMDBPublisher • -db ${PHEDEX_DBPARAM} • -node TX_NON_EXISTENT_NODE • start the inject-tmdb agent of PhEDEx: • ./Master -config Config start inject-tmdb

  14. PhEDEx injection (3) • For each dataset path of a run: • python dbsinjectTMDB.py --datasetPath=$dataset --injectdir=logs/ • (the script is available in /afs/cern.ch/user/n/ndefilip/public/Registration) • In the PhEDEx log you will find messages like the following: 2007-01-31 07:55:05: TMDBInject[18582]: (re)connecting to database Connecting to database Reading file information from /home1/prodagent/state/inject-tmdb/work/_TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869-1170230102.09/_TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869.xml Processing dbs http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery?instance=MCLocal_4/Writer (204) Processing dataset /TAC-TIB-120-DAQ-EDM/RAW (1364) Processing block /TAC-TIB-120-DAQ-EDM/CMSSW_1_2_0-RAW-Run-0000520#353a3ae2-30a0-4f30-86df-e08ba9ac6869 (7634) :+/ 1 new files, 1 new replicas 2007-01-31 07:55:08: DropTMDBPublisher[5828]: stats: _TAC-TIB-120-DAQ-EDM_CMSSW_1_2_0-RAW-Run-0000520_353a3ae2-30a0-4f30-86df-e08ba9ac6869-1170230102.09 3.04r 0.18u 0.08s success

  15. Results in PhEDEx page: http://cmsdoc.cern.ch/cms/aprom/phedex http://cmsdoc.cern.ch/cms/aprom/phedex/prod/Data::Replicas?filter=TAC-T;view=global;dexp=1364;rows=;node=6;node=19;node=44;nvalue=Node%20files#d1364

  16. “Official” reconstruction/skimming (1) • Goal: to run the reconstruction of raw data in a standard and official way, typically using the code of a CMSSW release (no prerelease, no user patch) • The ProdAgent tool was evaluated to perform the reconstruction with the same procedures as for Monte Carlo samples • ProdAgent can be run anywhere, but preferably at a Tier-1/Tier-2 • Running with ProdAgent ensures that RECO data are automatically registered in DBS and DLS, ready to be shipped to Tier-1 and Tier-2 sites and analysed via the computing tools • in the near future the standard reconstruction, calibration and alignment tasks will run on Central Analysis Facility (CAF) machines at CERN, as expected in the Computing Model.

  17. “Official” reconstruction/skimming (2) • Input data are processed run by run and new processed datasets are created as output, one for each run • ProdAgent uses the DatasetInjector component to be aware of the input files to be processed • The workflow file has to be created from the cfg used for the reconstruction; • the following example is for DIGI-RECO processing starting from GEN-SIM input files • no pileup, StartUp and LowLumi pileup can be set for the digitization • splitting of the input files can be done either by event or by file

  18. “Official” reconstruction/skimming (3) • Creating the workflow file for the no-pileup case: python $PRODAGENT_ROOT/util/createProcessingWorkflow.py --dataset=/TAC-TIB-120-DAQ-EDM/RAW/CMSSW_1_2_0-RAW-Run-0000530 --cfg=DIGI-RECO-NoPU-OnSel.cfg --version=CMSSW_1_2_0 --category=mc --dbs-address=MCLocal_4/Writer --dbs-url=http://cmsdbs.cern.ch/cms/prod/comp/DBS/CGIServer/prodquery --dls-type=DLS_TYPE_DLI --dls-address=lfc-cms-test.cern.ch/grid/cms/DLS/MCLocal_4 --same-primary-dataset --only-closed-blocks --fake-hash --split-type=event --split-size=1000 --pileup-files-per-job=1 --pileup-dataset=/mc-csa06-111-minbias/GEN/CMSSW_1_1_1-GEN-SIM-1164410273 --name=TAC-TIB-120-DAQ-EDM-Run-0000530-DIGI-RECO-NoPU • Submitting the jobs: python PRODAGENT/test/python/IntTests/InjectTestSkimLCG.py --workflow=/yourpath/TAC-TIB-120-DAQ-EDM-Run-0000530-DIGI-RECO-NoPU-Workflow.xml --njobs=300

  19. Data analysis via CRAB at the Tiers (1) • Data published in DBS/DLS can be processed remotely via CRAB using the distributed-environment tools • users have to edit crab.cfg and insert the dataset path of the run to be analysed, as obtained from DBS (an illustrative fragment is sketched below) • users have to provide their CMSSW cfg, set up the environment and compile their code via scramv1 • access to the offline DB via Frontier at Tier-1/2 was already tested during CSA06 with alignment data • an example cfg to perform the reconstruction chain starting from raw data can be found in /afs/cern.ch/user/n/ndefilip/public/Registration/TACAnalysis_Run2048.cfg • Thanks to D. Giordano for the support
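
The crab.cfg fragment below is only an illustration of the part the user has to edit; the key names follow the CRAB configuration style of that period and should be checked against the CRAB tutorial, while the dataset path and the cfg name are taken from the examples in these slides.

  [CRAB]
  jobtype   = cmssw
  scheduler = glite                 # grid scheduler; site-dependent assumption

  [CMSSW]
  datasetpath = /TAC-TIB-120-DAQ-EDM/RAW/CMSSW_1_2_0-RAW-Run-0000530   # dataset path from DBS
  pset        = TACAnalysis_Run2048.cfg                                # user CMSSW cfg
  total_number_of_events = -1       # -1 = all events of the dataset
  events_per_job         = 1000
  output_file            = output.root

  [USER]
  return_data = 1                   # bring the output back in the job output sandbox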

  20. Data analysis via CRAB at the Tiers (2) • The piece of the cfg needed to access the offline DB via Frontier (a rough sketch is given below) • The output files produced with CRAB are not registered in DBS/DLS (but the implementation of the code to do so is under development…) • Further details about CRAB are given in the tutorial of F. Fanzago.
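
The actual fragment is not reproduced in this transcript; what follows is only a rough sketch of a Frontier-based PoolDBESSource block in the old .cfg syntax. The connect string, record and tag are placeholders and must be taken from the real example cfg (TACAnalysis_Run2048.cfg above).

  es_source = PoolDBESSource {
      string connect = "frontier://cms_conditions_data/CMS_COND_TIF"
      VPSet toGet = {
          { string record = "SiStripPedestalsRcd" string tag = "placeholder_pedestal_tag" }
      }
  }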

  21. “Official” Cosmics simulation (1) • Goal: to make a standard simulation of cosmics with official code in a CMSSW release (no patch, no prereleases) • CMSSW_1_2_2 is needed: • Download AnalysisExamples/SiStripDetectorPerformance: • cvs co -r CMSSW_1_2_2 AnalysisExamples/SiStripDetectorPerformance • Complete geometry of CMS, no magnetic field, a cosmic filter implemented to select muons triggered by the scintillators: AnalysisExamples/SiStripDetectorPerformance/src/CosmicTIFFilter.cc • The configuration file is: AnalysisExamples/SiStripDetectorPerformance/test/cosmic_tif.cfg • interactively: cmsRun cosmic_tif.cfg • or by using ProdAgent to make large-scale and fully automated productions • Thanks to L. Fanò

  22. “Official” Cosmics simulation (2) • ProdAgent_v012: • create the workflow from the cfg file for GEN-SIM-DIGI: python $PRODAGENT_ROOT/util/createProductionWorkflow.py --cfg /your/path/cosmic_tif.cfg --version CMSSW_1_2_0 --fake-hash • Warnings: • when using createPreProdWorkflow.py the PoolOutputModule name in the cfg should be compliant with the conventions, to reflect the data tier the output file contains (i.e. GEN-SIM, GEN-SIM-DIGI, FEVT), • so download the modified cfg from /afs/cern.ch/user/n/ndefilip/public/Registration/COSMIC_TIF.cfg • the workflow can be found in: /afs/cern.ch/user/n/ndefilip/public/Registration/COSMIC_TIF-Workflow.xml • Submit jobs via the standard ProdAgent scripts: python $PRODAGENT_ROOT/test/python/IntTests/InjectTestLCG.py --workflow=/your/path/COSMIC_TIF-Workflow.xml --run=30000001 --nevts=10000 --njobs=100

  23. Pros and cons • Advantages of the CMS computing approach: • Data are officially published and processed with official tools → results are reproducible • access to a large number of distributed resources • profit from the experience of the computing teams • Cons: • initial effort to learn the official computing tools • possible problems at remote sites, storage issues, instability of grid components (RB, CE), etc… • competition between analysis jobs and production jobs → policies/prioritization to be set at remote sites.

  24. Conclusions • The first real data registered in DBS/DLS are officially available to the CMS community • Data are moved between sites and published by using official tools • Reconstruction, re-reconstruction and skimming could be “standardized” using ProdAgent • Data analysis is performed by using CRAB • Cosmics simulation for the detector communities can be officially addressed • Many thanks to the people of the TAC team (Fabrizio, Giuseppe, Domenico, Livio, Tommaso, Subir, …)
