SAM Production Status • Review of SAMonCAF project needs • Status and Statistics on Testing: limitations • Documentation Readiness and needs • User Feedback • Problems and Schedules: Rampup Thoughts
Project needs • http://projects.fnal.gov/samgrid/cdf/cdfdeploy
DCAF • Sam_cp_config: Art is implementing and distributing • File storage • Alan signs off • Request id: Rick to write procedures • Pinning utilities for datasets • Done: Alan to sign off; migrate to using sam pin dataset, where the dataset is a sam dataset definition (Stefan)
CAF • Proof we can run 50 projects all consuming files: have the script, still need to run it • Proof we can handle/train users to use large datasets • Ran into trouble with >100 procs on 25K files: core dump of pmaster
CAF cont • Ability for users to easily manage failures in their code: sam command imminent • Modifications to the start section as needed for smooth operation; projects should not start until there are slots • Training of sam shifters on errors, and escalation of the importance of sam shifting
Deployment Requirements • 20 TB/day => reached 18, limited by CAF slots • Hbot1i => failed to do the entire set (perhaps too much) • Restart station/servers while running (done, but there can be situations where projects fail)
Plots and Stats • TB and Files Moved • Loads on DBServer, SAM Station
Limitations • 25K files and >100 projects caused problems • <6K files OK with ? projects • Ran most of the golden dataset through • Running Dcache mounts now • Limits one project (batch slot) to 200 GB of Dcache (with DH policy of >250 MB/file, 800 files/project max, batches of 30) • Limit: 200 processes, 50 projects, 8000 files
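The 200 GB figure on this slide follows from the two policy numbers next to it. A minimal arithmetic sketch, assuming decimal units (1 GB = 1000 MB) and files at the 250 MB policy threshold; the function name is illustrative and not part of SAM:

```python
# Numbers taken from the slide; decimal units are an assumption.
POLICY_FILE_SIZE_MB = 250      # DH policy threshold per file
MAX_FILES_PER_PROJECT = 800    # maximum files per project


def dcache_footprint_gb(n_files, file_size_mb=POLICY_FILE_SIZE_MB):
    """Dcache space, in GB, used by n_files files of file_size_mb each."""
    return n_files * file_size_mb / 1000


# One project at the file-count limit, with files at the policy threshold.
print(dcache_footprint_gb(MAX_FILES_PER_PROJECT))  # 200.0
```

Files larger than the threshold would push a full 800-file project above this figure, so 200 GB is the footprint at the threshold size.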
Documentation • In CAF doc • Want to make a similar one for sam: otherwise Florida pages • Better description of sam commands and how to use them • Have a SAM FastNavigator • http://projects.fnal.gov/samgrid
User Feedback • Matthew Jones: first non-expert • Ability to recover: wants one output file per input file • Able to solve with a for loop and sam giving files • Attracted by the idea of saving ntuples, also for TOF calibration studies (1G)
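The "for loop" pattern mentioned on this slide can be sketched generically: handle one delivered file at a time, write one output per input, and record failures so only those files need to be rerun. This is a sketch under stated assumptions, not the SAM interface; `process_files`, `process_one`, and the `.out` suffix are hypothetical names, and the file list stands in for whatever sam delivers.

```python
def process_files(input_files, process_one):
    """Run process_one(in_path, out_path) once per input file.

    Returns (outputs, failed): a dict mapping each successful input to its
    output, and a list of (input, error message) pairs for the failures.
    """
    outputs = {}
    failed = []
    for in_path in input_files:
        out_path = in_path + ".out"  # hypothetical naming: one output per input
        try:
            process_one(in_path, out_path)
            outputs[in_path] = out_path
        except Exception as exc:
            # Keep going; a bad file should not lose the work already done.
            failed.append((in_path, str(exc)))
    return outputs, failed


# Stand-in for user analysis code: fails on one particular file.
def fake_process(in_path, out_path):
    if "bad" in in_path:
        raise RuntimeError("cannot read " + in_path)


done, failed = process_files(["a.root", "bad.root", "c.root"], fake_process)
print(sorted(done))  # ['a.root', 'c.root']
print(failed)        # [('bad.root', 'cannot read bad.root')]
```

Because outputs map one-to-one onto inputs, recovery reduces to rerunning the loop over the failed list.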
Problems and Schedules • Adjust/tune params: no evidence of further need for another station or dbserver machine; have a dedicated process for CAF. Want to move to one dbserver for offsite and one for user queries. • File storage bug: fixed • Slow delivery for CondorCAF • Slow delivery offsite (or disk?)
Bugs • Zombie Pmaster (project done, but process still on station node) • 25K file core dump of pmaster
RampUp • Condor-CAF uses resources (one way or another) • Continue job running for the week, after Dcache is back • Can we control 10% of CAF? Go to 20% next week. • Sam process type: don't allow submission unless <8K files. Samtest type?
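The proposed submission rule (refuse jobs unless the dataset has fewer than 8K files) amounts to a simple pre-submission check. A minimal sketch, assuming "8K" means the 8000-file limit quoted on the Limitations slide; `check_submission` is a hypothetical helper, not an existing sam or CAF command:

```python
MAX_FILES = 8000  # "<8K files" from the slide, read as 8000 (assumption)


def check_submission(n_files, max_files=MAX_FILES):
    """Return (allowed, reason) for a dataset with n_files files."""
    if n_files < 0:
        raise ValueError("n_files must be non-negative")
    if n_files < max_files:
        return True, "ok"
    return False, f"dataset has {n_files} files; must be fewer than {max_files}"


print(check_submission(6000))   # (True, 'ok')
print(check_submission(25000))  # (False, 'dataset has 25000 files; must be fewer than 8000')
```

Rejecting at submission time keeps a 25K-file dataset from ever reaching the station, which is where the pmaster core dump was seen.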