The RunJob Project: A Proposal
Greg Graham, FNAL CD
What is RunJob?
• Automatic job creation and submission
• Metadata description of job steps
• Produces jobs for a variety of environments
• Easy to extend to new applications or new environments
• Metadata model extends to catalogs or services to do production control or tracking
• Links jobs together in tree-like dataflow arrangements
Who Uses RunJob?
• DZero
  • Monte Carlo Challenges (CHEP 2000, CHEP 2001)
  • User Monte Carlo production
  • SAMGrid production (under construction)
  • Data reprocessing
• CMS
  • Monte Carlo Challenges (CHEP 2003)
  • USCMS Grid, LCG-based Grid production
  • User Monte Carlo production (under construction)
  • Data reprocessing
The RunJob Pilot Project
• Begun in early Spring 2003 to “merge” the then divergent DZero and CMS versions
• ShahKar package created and developed during Summer 2003 with input from DZero and CMS reps.
• ShahKar merged with CMS variant MCRunjob in Fall 2003
  • but not propagated.
• DZero integration pushed back to April 2004.
Proposal for a Full RunJob Project
• Increase manpower to accomplish
  • better integration with experiments’ planning processes: CDF, CMS, DZero, others?
  • integration of codebases with the ShahKar code base from the pilot project
  • features and core development happen in the RunJob project to satisfy experiments’ needs and schedule
  • rigorous testing and debugging support, documentation, and release management
Requirements and Features
• The RunJob Project Plan that was distributed comes with some very generally stated “requirements” and a lot of very specific work items
  • Reflects the need to begin talking to the experiments to tighten up the requirements and map them to specific work items.
• The work items reflect developments ongoing within the RunJob pilot project
  • 12 man-years of experience building production processing systems for HEP applications in many different environments.
Requirements and Features
• “It automatically generates jobs to run my application(s) in a variety of environments”
  • The scriptObject design is a way to better abstract the job descriptions away from the jobs themselves and therefore away from the environments. These are like internal “sandboxes”. (Critical, needed by all)
  • Development work will include modules tailored for specific environments such as LSF, FBS, PBS, Condor, etc. (Critical, needed by all)
  • Development work will also include Grid environments and Web Services design work. (TBD, who needs this and when?)
Requirements and Features
• “Later, I can go back and determine how the job was configured.”
  • Physics parameters and defaults should not come from RunJob itself. (Critical)
  • Contexts are documents that can record suitable defaults for various applications, groups of applications, or environments. (Critical)
  • Contexts can currently be combined in a rudimentary fashion; better (algebraic) combination rules lead to more expressiveness and better control in complex environments. (TBD, who needs it and when?)
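As a rough illustration of how layered contexts could supply defaults, here is a minimal Python sketch. It is not RunJob's actual Context implementation: the context names, the parameter keys, and the "most specific context wins" rule are all assumptions made for this example.

```python
from collections import ChainMap

# Hypothetical contexts: plain dictionaries of defaults.
# The keys and values are invented for illustration only.
environment_ctx = {"batch_system": "PBS", "scratch_dir": "/scratch"}
group_ctx = {"events_per_job": 500, "scratch_dir": "/data/scratch"}
application_ctx = {"events_per_job": 250}


def combine(*contexts):
    """Merge contexts; earlier (more specific) contexts override later ones."""
    return dict(ChainMap(*contexts))


# Application defaults override group defaults, which override environment defaults.
job_config = combine(application_ctx, group_ctx, environment_ctx)
print(job_config)
```

Storing the merged `job_config` alongside each job is one way to answer "how was this job configured?" after the fact. The algebraic combination rules mentioned on the slide would replace this simple override rule with something more expressive.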
Requirements and Features
• “I need to build jobs across datasets listed in a catalog using parameters in a control DB.”
  • Observation: everyone comes around to doing this eventually ;-)
  • Uniform interfaces to catalogs and control databases potentially decrease maintenance costs for all experiments and increase adaptability to new systems. (TBD, who needs it and when?)
  • Interfaces to specific catalogs and control DBs are an integration task. (Critical)
Requirements and Features
• “I need to resubmit jobs when they fail”
  • Specification of RunJob state just before job creation/submission; this is the “XML” milestone. (Critical)
  • Storage of RunJob state specifications in an XML database or filesystem. (Critical)
  • Interface to specific job tracking systems designed by the experiments to do this. (TBD, who needs this and when?)
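To make the idea of storing state for resubmission concrete, the following Python sketch writes a job's state to XML and reads it back. The element names (`runjob_state`, `param`) and the example parameters are hypothetical; the slides do not describe the actual RunJob XML schema.

```python
import xml.etree.ElementTree as ET


def state_to_xml(state):
    """Serialize a flat dict of job parameters to an XML string."""
    root = ET.Element("runjob_state")  # hypothetical element name
    for name, value in sorted(state.items()):
        param = ET.SubElement(root, "param", name=name)
        param.text = str(value)
    return ET.tostring(root, encoding="unicode")


def state_from_xml(text):
    """Rebuild the parameter dict from XML produced by state_to_xml."""
    root = ET.fromstring(text)
    return {p.get("name"): p.text for p in root.findall("param")}


# Invented example state, captured just before job creation/submission.
state = {"application": "my_simulation", "events_per_job": "250", "run_number": "1001"}
saved = state_to_xml(state)       # could be written to a file or an XML database
restored = state_from_xml(saved)  # used later to recreate and resubmit a failed job
print(saved)
```

Because the restored state equals the original, a failed job can in principle be regenerated from the stored document without re-deriving its parameters.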
Requirements and Features
• “I need feature X working by my experiment’s milestone Y.”
  • These need to be worked out during the negotiation phase this Spring.
  • The specific work items listed in the plan are probably a good cover of the foreseeable requirements to come during the negotiation phase
  • … so on to the manpower estimates ;-)
Manpower Estimates
• My favorite quote: “The plan is OK except possibly for the schedule and the manpower.”
• For each listed milestone/deliverable/feature, a SWAG estimate was produced. The SWAGs were then summed, and the result was inflated by 25%.
• 40 man-months total effort, not including management or testing.
• The Level of Effort (LOE) was used
  • essentially equal to the average number of “warm bodies” active for the duration of the schedule.
  • Total FTE = LOE * project duration.
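The estimation arithmetic described above can be sketched in a few lines of Python. The per-item SWAG values and the 12-month duration below are invented for illustration; only the 25% inflation and the relation Total FTE = LOE * project duration come from the slide.

```python
def inflated_total(swags, contingency=0.25):
    """Sum per-item SWAG estimates and inflate the sum by a contingency factor."""
    return sum(swags) * (1 + contingency)


def total_fte(loe, duration):
    """Total effort = average level of effort (LOE) * project duration."""
    return loe * duration


def average_loe(total_effort, duration):
    """Invert the relation: average LOE needed to deliver total_effort in duration."""
    return total_effort / duration


# Hypothetical SWAGs in man-months for three work items.
swags = [6, 10, 8]
effort = inflated_total(swags)   # 24 man-months inflated by 25%
loe = average_loe(effort, 12)    # average head count over a 12-month schedule
print(effort, loe, total_fte(loe, 12))
```

Note that effort and duration must share a time unit (here, months) for the LOE to come out as a head count.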
Schedule Changes
• Deferment cost estimates
  • Project management and essential functions LOE remains constant
  • Development-driven functions scale against schedule length
  • Adjusted average LOE = 1.6 + 4.2/(length)
• Risk: Can we satisfy the experiments’ milestones?
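The adjusted-LOE formula can be explored numerically with a short Python sketch. The constants 1.6 and 4.2 are taken from the slide; the unit of `length` is not stated there, so the schedule lengths below are arbitrary sample values in whatever unit the constants assume.

```python
def adjusted_loe(length, fixed_loe=1.6, development_effort=4.2):
    """Adjusted average LOE = fixed LOE + development effort spread over the schedule."""
    if length <= 0:
        raise ValueError("schedule length must be positive")
    return fixed_loe + development_effort / length


def total_effort(length):
    """Total effort implied by the formula: LOE * length = 1.6 * length + 4.2."""
    return adjusted_loe(length) * length


# Sample schedule lengths (unit assumed, not given on the slide).
for length in (1.0, 1.5, 2.0):
    print(f"length={length:.1f}  LOE={adjusted_loe(length):.2f}  "
          f"total={total_effort(length):.2f}")
```

Stretching the schedule lowers the average LOE but raises the total effort, because the constant management component accrues for the whole duration. That growth is the deferment cost.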
Schedule Changes
• Cutting work items
  • Analysis cannot really be done without experiments’ input
• Cutting project roles (e.g., dedicated testing)
  • Analysis cannot really be done without experiments’ input
  • There are probably some savings here: development could be pushed further up the integration food chain and into the experiments’ variant codebases themselves.
  • We recommend against this because it dilutes the benefits of cooperation.
Conclusion
• The RunJob project is an exciting opportunity for the Run II experiments and CMS to collaborate on software.
  • DZero and CMS already use fairly closely related variants.
• The RunJob project can build upon
  • the experience of many people who have been working on it already for years
  • a successful pilot project that minimally satisfies many requirements already
• We are eager to work with the experiments to effectively gather and address their requirements and milestones coherently across the experiments.