Version 1.0 11 September 2008 Rob Kennedy and Adam Lyon. D0 Grid Data Production: Evaluation. Outline. Roadmap Rough Work Plan with Priorities (discussion). Roadmap. September 2008: Planning (verbatim from last week)
Version 1.0 11 September 2008 Rob Kennedy and Adam Lyon D0 Grid Data Production: Evaluation
D0 Grid Data Production Outline • Roadmap • Rough Work Plan with Priorities • (discussion)
D0 Grid Data Production Roadmap
• September 2008: Planning (verbatim from last week)
  • Rob Kennedy, working with Adam Lyon, charged by Vicky White to lead effort to pursue this.
  • First stage is to list, understand, and prioritize the problems and the work in progress.
  • Next, develop a broad coarse-grained plan to address issues and improve the efficiency.
  • Present plan to Vicky and D0 Spokespeople towards end of September 2008.
• October 2008 – December 2008: Phase 1 (in detail in later slides)
  • 1. Server Expansion and Decoupling Data/MC Production at Services
  • 2. Condor 7.0 Upgrade and Support
  • 3. Metrics
  • 4. Small Quick Wins
• January 2009: Re-Assess
  • 1. Formally re-assess D0 Grid Data Production
  • 2. Plan new work as needed…
• January/February 2009 – April-ish 2009: Phase 2 … very rough picture
  • 1. Additional work on systems as needed…
  • 2. Once basic Data Production goals are achieved, recommend moving on to MC Production issues
D0 Grid Data Production Priorities (1 of 4)
• 1. Server Expansion and Decoupling of Data/MC Prod at Services
  • Where are the nodes? Can we horse-trade to get first arrivals?
  • Support: Transfer Queue node support from FGS to FEF
  • Logistics: Location in FCC or GCC? Move of Que1 needed? Or FwdN?
  • Config: Optimal h/w and svc config for data prod and MC prod (start planning this soon too).
  • SAM Station(s): Possible in time frame to move station off Fwd1 to own node; 2 stations for decoupling?
  • (Some VDT/Condor deployment procedure and metrics work assumed)
• 2. Condor 7 Upgrade
• 3. Metrics
• 4. Small Quick Wins
D0 Grid Data Production Priorities (2 of 4)
• 1. Server Expansion and Decoupling of Data/MC Prod at Services
• 2. Condor 7 Upgrade and Support
  • Issues: Eliminates (we are told) the Periodic Expression hangs
  • Communication: establish closer communications with REX/FGS/Condor Supp
    • Good discussion (Steve, Keith, Rob, Adam) after the last FermiGrid Users Meeting (post notes... To be done).
    • Upgrade is prerequisite for working closely with Condor developers on issues
  • Packaging: In the past, depended on VDT packaging by Grid/OSG group. Major dependency. Avoidable?
    • Requires root to install? VDT as part of platform install?
    • CDF Offline also facing this upgrade, opportunity for leveraging?
• 3. Metrics
• 4. Small Quick Wins
D0 Grid Data Production Priorities (3 of 4)
• 1. Server Expansion and Decoupling of Data/MC Prod at Services
• 2. Condor 7 Upgrade
• 3. Metrics: See list in last week’s slides
  • (Basics will be done. This topic covers a more thorough treatment of metrics)
  • Document inputs and definitions, relate to concepts
  • Automate gathering of data and presentation, where not already done
  • Where a historical data repository exists, put it into a usable format and automate input of new data
  • Set goals for each (arbitrary if they must be) to set expectations
  • Tie to the overall sub-goals with a view towards capacity planning
  • Tie into monitoring and operations, and get experts online asap when something happens for specific issues
  • Emphasize sustainability and long-term view of system as a service
• 4. Small Quick Wins
D0 Grid Data Production Priorities (4 of 4)
• 1. Server Expansion and Decoupling of Data/MC Prod at Services
• 2. Condor 7 Upgrade
• 3. Metrics: See list in last week’s slides
• 4. Small Quick Wins: Scale = 1-2 FTE-weeks from 1-2 people
  • Preliminary planning to allow prioritization/selection to be done, then...
  • Select and execute 1 or 2 from list with best benefit/cost:
    • a. Reliable Status Info returned by SAM-Grid *** big problem, easy fix? Pursue planning ASAP... Friday?
    • b. JobID tracking: try out GG's procedure. Simplification? More automated approach later?
    • c. Slow FWD node to CAB Job Transition Investigation: can see when slow, but upstream or downstream cause?
    • d. Intelligent selection of Fwd nodes from a pool of candidates (taking into account which are busy)
    • e. Auto-restart on reboot (or make as restartable as possible)... and also why are nodes rebooting?
  • Integrating D0Farm occupancy into ops reporting.
  • Evaluating setup of an automated quantitative report of occupancy averaged over a day or week as well.
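The occupancy report mentioned on this slide could be prototyped along the following lines. This is only a sketch: the sample format (timestamp, occupied slots, total slots) and the function name `daily_occupancy` are assumptions for illustration, not an existing D0Farm or SAM-Grid interface, and the average is a plain mean of the samples rather than a time-weighted one.

```python
from collections import defaultdict
from datetime import datetime


def daily_occupancy(samples):
    """Average slot occupancy per calendar day.

    samples: iterable of (timestamp, occupied_slots, total_slots) tuples.
    This input format is hypothetical; real monitoring output will differ.
    Returns a dict mapping each date to the mean occupied fraction.
    """
    per_day = defaultdict(list)
    for timestamp, occupied, total in samples:
        if total > 0:  # skip samples taken while no slots were configured
            per_day[timestamp.date()].append(occupied / total)
    return {day: sum(vals) / len(vals) for day, vals in sorted(per_day.items())}


if __name__ == "__main__":
    demo = [
        (datetime(2008, 9, 10, 0, 0), 500, 1000),
        (datetime(2008, 9, 10, 12, 0), 1000, 1000),
        (datetime(2008, 9, 11, 0, 0), 250, 1000),
    ]
    for day, fraction in daily_occupancy(demo).items():
        print(day, f"{fraction:.0%}")
```

A weekly figure could be obtained the same way by grouping on the ISO week instead of the date.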
D0 Grid Data Production Backup Slides • …
D0 Grid Data Production Metrics List
• Resource Utilization
  • Issue: We want to maximize resource utilization, mitigate SPOF, understand system capacity
  • % Job slots occupied from point of view of job queuing system
  • % CPU used for assigned job slots: includes data handling and DB access wait time outside of our scope.
  • MEvts/day successfully processed: another top-level metric. Compare to data-taking rate.
• Effort to Run Data Production
  • Issue: Reduce effort to operate
  • Mean time between touches: how often does coordinator have to interact with system just to keep job queues full, assuming no unusual errors in the system.
  • Hours spent per week: Launching jobs, working on error recovery, debugging jobs, etc. ESTIMATE....
• Metrics to Quantify Data Production Service Quality (what can we measure?)
  • First-pass success rate: Fraction of jobs succeeding on first try
    • May or may not be able to disentangle user executable failure from this
  • N-pass success rate: Fraction of jobs succeeding after N retries – do jobs eventually succeed?
  • Mean tries to success: Average number of tries until job succeeds – quantifies “rework” effort
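To make the three service-quality definitions concrete, here is a minimal sketch of how they might be computed from a log of job attempts. The input format (job ID plus a success flag, in chronological order) and the name `success_metrics` are assumptions. The slide leaves open whether "N retries" counts the first try; this sketch counts a job as N-pass successful if it succeeds within N tries in total.

```python
from collections import defaultdict


def success_metrics(attempts, n):
    """Compute first-pass rate, N-pass rate, and mean tries to success.

    attempts: iterable of (job_id, succeeded) pairs in chronological order
              (a hypothetical log format).
    n: maximum number of tries counted for the N-pass rate (assumption:
       total tries, including the first).
    """
    history = defaultdict(list)
    for job_id, succeeded in attempts:
        history[job_id].append(bool(succeeded))

    total = len(history)
    if total == 0:
        return {"first_pass": None, "n_pass": None, "mean_tries": None}

    first_pass = 0
    n_pass = 0
    tries_to_success = []
    for outcomes in history.values():
        if outcomes[0]:
            first_pass += 1
        if True in outcomes:
            tries = outcomes.index(True) + 1  # 1-based count up to first success
            tries_to_success.append(tries)
            if tries <= n:
                n_pass += 1

    mean_tries = (
        sum(tries_to_success) / len(tries_to_success) if tries_to_success else None
    )
    return {
        "first_pass": first_pass / total,
        "n_pass": n_pass / total,
        "mean_tries": mean_tries,
    }


if __name__ == "__main__":
    log = [("a", True), ("b", False), ("b", True),
           ("c", False), ("c", False), ("c", True), ("d", False)]
    print(success_metrics(log, n=2))
```

Jobs that never succeed lower both rates but are excluded from the mean tries to success, so the mean should be read together with the N-pass rate.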
D0 Grid Data Production Issues List (p.1/4)
• 1) Unreliable state information returned by SAM-Grid: SAM-Grid under some circumstances does not return correct state information for jobs. Fixing this may entail adding some logic to SAM-Grid.
• 2) Cleanup of globus-job-managers on forwarding nodes, a.k.a. “stale jobs”: The globus job managers on the forwarding nodes are sometimes left running long after the jobs have actually terminated. This eventually blocks new jobs from starting.
• 3) Scriptrunner on samgrid needs to be controlled, a.k.a. the Periodic Expression problem: This is now locking us out of all operation for ~1 hour each day. This is due to a feature in Condor 6 which we do not use, but which cannot be fully disabled either. Developers say this is fixed in Condor 7, but this has not been proven yet.
• 4) CORBA communication problems with SAM station: The actual source of all CORBA problems is hard to pin down, but at least some of them seem to be associated with heavy load on samgfwd01 where the SAM station runs. Since the forwarding nodes are prone to getting bogged down at times, the SAM station needs to be moved to a separate node.
• 5) Intelligent job matching to forwarding nodes: SAM-Grid appears to assign jobs to the forwarding nodes at random without regard to the current load on the forwarding nodes. It will assign jobs to a forwarding node that has reached CurMatch max even if another forwarding node has job slots available.
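For issue 5, a load-aware choice of forwarding node might look roughly like the sketch below. It is not how SAM-Grid matches jobs today; the node names, the (current jobs, limit) bookkeeping, and the function `pick_forwarding_node` are all hypothetical, with the limit standing in for the CurMatch maximum.

```python
def pick_forwarding_node(nodes):
    """Pick the forwarding node with the most free job slots.

    nodes: dict mapping node name -> (current_jobs, max_jobs).
           Names and structure are illustrative assumptions only.
    Returns the chosen node name, or None if every node is at its limit.
    """
    best_name = None
    best_free = 0
    # Sort by name so that ties are broken deterministically.
    for name, (current, limit) in sorted(nodes.items()):
        free = limit - current
        if free > best_free:
            best_name, best_free = name, free
    return best_name


if __name__ == "__main__":
    pool = {"fwd-a": (750, 750), "fwd-b": (400, 750)}
    print(pick_forwarding_node(pool))
```

Returning None when all nodes are full lets the caller hold the job back instead of sending it to a node that has already reached its limit, which is the behavior the issue describes.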
D0 Grid Data Production Issues List (p.2/4)
• 6) Capacity of durable location servers: Merge jobs frequently fail due to delivery timeouts of the unmerged thumbnails. We need to examine carefully what functions the durable location servers are providing and limit activity here to production operations. Note that when we stop running Recocert as part of the merge this problem will worsen.
• 7) CurMatch limit on forwarding nodes: We need to increase this limit, which probably implies adding more forwarding nodes. We would also like to have MC and data production separated on different forwarding nodes so response is more predictable.
• 8) Job slot limit on forwarding nodes: The current limit of 750 job slots handled by each forwarding node has to be increased. Ideally this would be large enough that one forwarding node going down only results in slower throughput to CAB rather than a complete cutoff of half the processing slots. Could be addressed by optimizing fwd node config for data production.
• 9) Input queues on CAB: We have to be able to fill the input queues on CAB to their capacity of ~1000 jobs. The configuration coupling between MC and data production that currently limits this to ~200 has to be removed. Could be addressed by optimizing fwd node config for data production.
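The reasoning behind issue 8 can be checked with a little arithmetic. The sketch below assumes two forwarding nodes and 1500 processing slots, which is only inferred from the "half the processing slots" remark; the function name and the alternative limit of 1500 per node are likewise illustrative assumptions.

```python
def reachable_slots(per_node_limit, n_nodes, cluster_slots, failed=1):
    """Slots that can be kept busy before and after forwarding nodes fail.

    per_node_limit: job slots one forwarding node can handle
    n_nodes:        number of forwarding nodes (assumed value in the demo)
    cluster_slots:  processing slots on the cluster (assumed value in the demo)
    failed:         number of forwarding nodes that go down
    Returns (slots_before, slots_after).
    """
    before = min(per_node_limit * n_nodes, cluster_slots)
    surviving = max(n_nodes - failed, 0)
    after = min(per_node_limit * surviving, cluster_slots)
    return before, after


if __name__ == "__main__":
    # Limit of 750 per node: losing one of two nodes cuts off half the slots.
    print(reachable_slots(750, 2, 1500))
    # A higher per-node limit: the surviving node can still reach every slot.
    print(reachable_slots(1500, 2, 1500))
```

With the higher limit, a node failure would cost throughput on the surviving node rather than access to the slots themselves.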
D0 Grid Data Production Issues List (p.3/4)
• 10) 32,001 Directory problem: Band-aid is in place, but we should follow up with Condor developers to communicate the scaling issue of storing job state in a file system given the need to retain job state for tens of thousands of jobs in a large production system.
• 11) Spiral of Death problem: See for example reports from 19-21 July 2008. Rare, but stops all processing. We do not understand the underlying cause yet. The only known way to address this situation is to do a complete kill/cold-stop and restart of the system.
• 12) Various Globus errors: We have repeated episodes where a significant number of jobs lose all state information and fall into a "Held" state due to various Globus errors. These errors are usually something like "Job state file doesn't exist", "Couldn't open std out or std err", "Unspecified job manager error". Mike doesn't think we have ever clearly identified the source of these errors. His guess is they have a common cause. The above errors tend to occur in clusters (about half a dozen showed up last night, which is what brought it to mind). They usually don't result in the job failing, but such jobs have to be tracked by hand until complete, and in some cases all log information is lost.
• 13) Automatic restart of services on reboot: Every node in the system (samgrid, samgfwd, d0cabosg, etc.) needs to be set up to automatically restart all necessary services on reboot. We have lost a lot of time when nodes reboot and services do not come back up. SAM people appear not to get any notification when some of these nodes reboot.
D0 Grid Data Production Issues List (p.4/4)
• 14) SRM needs to be cleanly isolated from the rest of the operation: This might come about as a natural consequence of some of the other decoupling actions. A more global statement would be that we have to ensure that problems at remote sites cannot stop our local operations (especially if the problematic interaction has nothing to do with the data processing operation).
• 15) Lack of Transparency: No correlation between the distinct grid and PBS IDs, together with inadequate monitoring, makes it very difficult to track a single job through the entire grid system, which is especially important for debugging.
• 16) Periods of Slow Fwd node to CAB Job transitions: newly added
• MC-specific 1) File Delivery bottlenecks: use of SRM at site helps
• MC-specific 2) Redundant SAM caches needed in the field
• MC-specific 3) ReSS improvements needed, avoid problem sites,….
• MC-specific 4) Get LCG forwarding nodes up and running reliably
• MC Note) 80% eff. on CAB, but 90% on CMS. Why? Something about CAB/PBS to note here for data production expectations too?