
Notes on offline data handling

M. Moulson, Frascati, 29 March 2006


Presentation Transcript


  1. Notes on offline data handling. M. Moulson, Frascati, 29 March 2006

  2. Data flow in reprocessing
     • Jobs: 1 per raw file and run (datarec_reproc_ibm.csh)
     • Prefetch: All raw files for run
       • Script: start_recall_files.csh, recall_one_dr.tcl
       • Method: “recall raw (PROD areas)”
       • List of PROD areas specified explicitly (from DB2 query)
       • Files recalled one at a time
       • Jobs started when files on disk
       • Explore option of recalling all files for run at once?
     • Input: raw files (1 per job)
       • Script: runXreproc_ibm.csh
       • Method: KID URL dbraw: GROUP_ID=PROD
     • Output: datarec files (1 per stream and job)
       • Written to /datarec and archived
       • Advance recall to DSTPROD area: “recall datarec dstprod”
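The prefetch behaviour described on this slide (files recalled one at a time, each job started once its file is on disk) can be sketched as a simple loop. This is only an illustration, not the logic of start_recall_files.csh or recall_one_dr.tcl: `reprocess_run`, `recall`, and `start_job` are hypothetical names, and `recall` is assumed to block until the file has been staged to disk.

```python
from typing import Callable, Iterable, List


def reprocess_run(
    raw_files: Iterable[str],
    recall: Callable[[str], None],
    start_job: Callable[[str], None],
) -> List[str]:
    """Recall raw files one at a time and start one job per file.

    `recall` and `start_job` are hypothetical callbacks standing in for
    the real recall and job-submission steps.
    """
    started = []
    for raw_file in raw_files:
        recall(raw_file)     # assumed to return only once the file is on disk
        start_job(raw_file)  # one reprocessing job per raw file
        started.append(raw_file)
    return started
```

Under the same assumptions, the option raised on the slide (recalling all files for a run at once) would amount to issuing every recall before the loop and starting each job as its file arrives.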

  3. Data flow in DST production
     • Jobs: 1 per stream and run
     • Disk mode (i.e. for output from reprocessing: datarec_dstfd_ibm.csh)
       • Prefetch: None
       • Input: All datarec files for stream and run
         • Script: dstXprocfd_ibm.csh
         • Method: KID URL dbdatarec GROUP_ID DSTPROD
     • Tape mode (datarec_dst_ibm.csh)
       • Prefetch: All datarec files for stream and run
         • Script: start_recall_files.csh, recall_one_dr.tcl
         • Method: “recall datarec (PROD areas)”
         • Change to DSTPROD area to avoid inconsistency?
       • Input: datarec files
         • Script: dstXprod_ibm.csh
         • Method: KID URL dbdatarec GROUP_ID DSTPROD
     • Output: DST files (1 per stream and run)
       • Written to /datarec and archived
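The difference between the two DST production modes can be laid out side by side as a small lookup table. The sketch below only restates what the slide lists; the class name `DstMode` and its field names are hypothetical, and the grouping of scripts under each mode follows the reading that each "Script"/"Method" pair belongs to the item just above it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DstMode:
    """Hypothetical record summarising one DST production mode."""
    job_script: str
    prefetch_scripts: Optional[Tuple[str, ...]]  # None means no prefetch
    prefetch_method: Optional[str]
    input_script: str
    input_method: str


DST_MODES = {
    "disk": DstMode(
        job_script="datarec_dstfd_ibm.csh",
        prefetch_scripts=None,
        prefetch_method=None,
        input_script="dstXprocfd_ibm.csh",
        input_method="KID URL dbdatarec GROUP_ID DSTPROD",
    ),
    "tape": DstMode(
        job_script="datarec_dst_ibm.csh",
        prefetch_scripts=("start_recall_files.csh", "recall_one_dr.tcl"),
        prefetch_method="recall datarec (PROD areas)",
        input_script="dstXprod_ibm.csh",
        input_method="KID URL dbdatarec GROUP_ID DSTPROD",
    ),
}


def needs_prefetch(mode: str) -> bool:
    """Return True if the named mode recalls files before the job runs."""
    return DST_MODES[mode].prefetch_scripts is not None
```

Written this way, the inconsistency questioned on the slide is visible directly: in tape mode the prefetch method names the PROD areas while the input method names DSTPROD.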

  4. Data flow in MC production (1/2)
     • Jobs: 1 per run and card type (mcprod.pl)
     • Processes:
       • 1 or more GEANFI processes, each followed by a datarec process
       • 1 DST job per requested DST stream at end
     • GEANFI output: 1 mco file per GEANFI process
       • Written to /datarec, not archived
     • Reconstruction prefetch: All bgg/lsb (datarec) files for run
       • Method: “recall datarec all”
       • Currently, prefetch all files at start of each reconstruction job
       • Instead, prefetch once before first GEANFI process
     • Reconstruction input: mco file
       • Method: KID URL “ybos:” (files on /datarec)
     • Reconstruction input: Subset of bgg/lsb files for run
       • Method: KID URL “dbdatarec:”

  5. Data flow in MC production (2/2)
     • Reconstruction output: 1 mcr file per mco file (GEANFI process)
       • Written to /datarec and archived
       • Advance recall to DSTPROD area: “recall mc dstprod”
         • Is this really a good idea?
         • DSTs start right away from same directory
         • See notes below
     • DST input: All mcr files for job, for each requested DST stream (process)
       • Method: KID URL dbmc DSTPROD
     • DST output: 1 MC DST file per process
       • Written to /datarec and archived
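The process sequence of one MC production job, as described on the two slides above, can be written out as an ordered list of steps. The sketch also contrasts the current prefetch behaviour (once per reconstruction process) with the proposed one (once before the first GEANFI process). It is not taken from mcprod.pl: the function name, the step labels, and the stream names in the usage note are hypothetical.

```python
from typing import List


def mc_job_steps(n_geanfi: int, dst_streams: List[str], prefetch_once: bool) -> List[str]:
    """List the steps of one MC job (one run and card type).

    prefetch_once=False models the current behaviour: bgg/lsb files are
    prefetched at the start of every reconstruction process.
    prefetch_once=True models the proposal: a single prefetch up front.
    """
    if n_geanfi < 1:
        raise ValueError("a job needs at least one GEANFI process")
    steps = []
    if prefetch_once:
        steps.append("prefetch bgg/lsb")
    for i in range(1, n_geanfi + 1):
        steps.append(f"geanfi {i}: write mco {i}")
        if not prefetch_once:
            steps.append("prefetch bgg/lsb")
        steps.append(f"datarec {i}: mco {i} -> mcr {i}")
    # DST jobs run at the end, one per requested stream, over all mcr files
    for stream in dst_streams:
        steps.append(f"dst {stream}: all mcr -> MC DST")
    return steps
```

For example, `mc_job_steps(3, ["streamA", "streamB"], prefetch_once=False)` contains three prefetch steps, while the same call with `prefetch_once=True` contains one.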

  6. Data flow for standalone MC DSTs
     • Jobs: 1 per run and card type (mcprod_dst.pl)
     • Processes: 1 per requested DST stream
     • Prefetch: None
     • Input: All mcr files for run and card type
       • Method: KID URL dbmc DSTPROD
     • Output: 1 MC DST per process
       • Written to /datarec and archived

  7. Standard offline file types
     MC files (mcr) are technically datarec streams of type ALL (stream_id = 0).
     DESCRIPT.STREAM_OFFLINE contains separate DSRV groups for data and MC.
     DSRV groups shown are for data, except for mcr files, for which the MC DSRV group is shown.

  8. DSRV groups for background files
     • All datarec types already have DSRV group dir = USER
     • All DST types (data and MC) already have DSRV group dir = DST
     • By default these are recalled to “DST cache”
     • Exception is background files (bgg, lsb):
       • Change to DST group?
       • Leave as USER and recall to PROD?
       • Leave as USER and recall with kcp?

  9. Summary of recall areas
     • /datarec currently 470 GB SSA disk
     • Must add more disks to string:
       • Access bandwidth:
         • All MC output to /datarec
         • 84 MB/s with 300 B80 for MC
         • 138 MB/s if input to MCDST also from /datarec
         • Adding disks to string helps parallelism
       • Size:
         • Archiver maintains /datarec at 40% full
         • MC requires <90% full to start
         • /datarec filled to 50% within 1 hour
     • Archiving bandwidth:
       • Saturated with 150 B80 for MC
       • Must increase by system tuning:
         • Any amount of /datarec space “immediately” filled if archiving bandwidth insufficient
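The figures on this slide can be combined into a rough back-of-envelope estimate. The sketch below rests on several assumptions that the slide does not state: decimal units (1 GB = 1000 MB), output that scales linearly with the number of B80 CPUs, "saturated with 150 B80" taken to mean that archiving throughput equals the MC output of 150 B80, and "filled to 50% within 1 hour" taken to mean a rise from the 40% maintained level to 50% at a constant rate. All function and variable names are hypothetical.

```python
MB_PER_GB = 1000.0  # assumption: decimal units


def rate_per_cpu(total_mb_s: float, n_cpu: int) -> float:
    """Average write rate per CPU, assuming linear scaling."""
    return total_mb_s / n_cpu


def net_fill_rate_mb_s(disk_gb: float, frac_start: float, frac_end: float, hours: float) -> float:
    """Net rate (production minus archiving) implied by a change in fill level."""
    return disk_gb * MB_PER_GB * (frac_end - frac_start) / (hours * 3600.0)


def hours_to_reach(frac_now: float, frac_limit: float, frac_per_hour: float) -> float:
    """Time to reach a fill limit at a constant fill rate."""
    return (frac_limit - frac_now) / frac_per_hour


# 84 MB/s with 300 B80 for MC
per_b80 = rate_per_cpu(84.0, 300)

# Archiving saturated with 150 B80 (see assumptions above)
archive_limit = per_b80 * 150

# 470 GB disk going from 40% to 50% full in 1 hour
net_fill = net_fill_rate_mb_s(470.0, 0.40, 0.50, 1.0)

# Hours from 50% full to the 90% limit above which MC will not start
hours_left = hours_to_reach(0.50, 0.90, 0.10)
```

Under these assumptions the numbers come out to about 0.28 MB/s per B80, an archiving limit of about 42 MB/s, a net fill rate of about 13 MB/s, and about 4 further hours before /datarec reaches the 90% threshold.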

  10. Transfers to and from /datarec
      Assumes:
      0.5 B80 s to fully produce 1 event, including DSTs
      4 DST processes per job, zero overlap in DSTs
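The first assumption fixes the event rate for a given number of CPUs. A minimal sketch follows, reading "0.5 B80 s" as 0.5 CPU-seconds on one B80 per event. Combining it with the 84 MB/s quoted for 300 B80 on the previous slide gives an implied output volume per event; that figure is an inference from the two slides, not a number stated in the talk. Function names are hypothetical.

```python
def events_per_second(n_b80: int, b80_seconds_per_event: float = 0.5) -> float:
    """Aggregate event rate for n_b80 CPUs running in parallel."""
    return n_b80 / b80_seconds_per_event


def mb_per_event(total_mb_s: float, events_s: float) -> float:
    """Implied data volume written per event."""
    return total_mb_s / events_s


rate_300 = events_per_second(300)            # events/s with 300 B80
volume = mb_per_event(84.0, rate_300)        # MB written to /datarec per event
```

With 300 B80 this gives 600 events per second and, under the stated reading, roughly 0.14 MB written to /datarec per event.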

  11. Recommended tape space allocation
      • Allocations include currently occupied space
      • MC DSTs probably appear as datarec files to archiver
      • Current library system capacity ~720 GB
        • New cassettes will have to be ordered in future
      • Temporary allocation based on 720 GB library
        • Assumes MC production slow
      • Final allocation assumes completion of KLOE offline program
