D0 Grid Data Production Initiative: Coordination Mtg
Version 1.0 (meeting edition), 29 January 2009
Rob Kennedy and Adam Lyon
Attending: …
D0 Grid Data Production Overview
• News and Summary
  • System has run less smoothly over the past 1.5 weeks
    • Unplanned Oracle downtime
    • Planned tape library downtime
    • Unplanned D0srv071 durable storage hardware downtimes
    • Coordination role hand-off led to an empty queue condition
  • Follow-up meeting during the week on Keith/Steve's proposal for FermiGrid/CAB work
  • Seeking a CPB time slot for Jan 29
    • Brief summary of Phase 1 progress and the Phase 2 work list; seeking D0 input on priorities
  • Exec meeting w/ D0 Spokespeople and Vicky on Feb. 6 (to be confirmed w/ CD)
    • Phase 1 close-out with more operational experience, investigation into events/day = f(luminosity), etc.
    • Phase 2 proposal (may be in draft form at that time)
• Agenda
  • Phase 2 Work List (discussion 2)
    • Thin proposal for what and when, for discussion; start considering who will be involved.
    • Start to pare down to a "short list", but we want broader input before finalizing the goals list.
D0 Grid Data Production Phase 2 Work List - Feb
• First Tuesday downtime of February 2009 – to be worked out outside the meeting
  • Enable auto-update of gridmap files on Queuing nodes
  • FWD3 has not been rebooted yet, so it has not picked up the ulimit FileMaxPerProcess increase.
• Work in Progress or Leftover from Phase 1
  • Capacity Planning: model nEvents per day – essentially done, some improvements discussed (a minimal sketch of the model follows this slide)
    • Tevatron profile over the next year – a few scenarios to allow simple adjustments to conclusions later
    • Can the existing infrastructure handle the capacity to meet the needs of D0? If not, what is needed?
  • Resource Utilization: top-down investigation as well as bottom-up investigation. Examples:
    • What is the major cause of the difference between calculated max events/day and actual? (8.0E6 vs 6.5E6)
    • Data Handling tuning (AL, RI): tune the SAM station, investigate whether CPUs are waiting on data.
  • AL's Queuing Node monitoring – deploy on all QUE nodes. Defer productizing until a broader evaluation.
• Planning/Observation/Investigation/Testing Work
  • Overall CPU capacity and utilization on CAB
    • Can we add usable CPUs to the system? Can more use be made of idle cycles without too negative an impact?
  • Future configuration of CAB to maximize resource utilization (without degrading the analysis user experience)
    • See Keith and Steve's recommendations related to configuration.
  • Virtualization: is it ready yet? How best to apply it to this system? FWD4 → FWD4a, FWD4b?
  • Monitoring: gather "what we have" and viable requirements
  • Plan/prioritize the next layer of decoupling and/or capacity increase: SAM Station and/or Durable Storage
  • Plan out the Dependent Work and watch to see if/when it can be addressed: list on the next slide
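The capacity-planning item above reduces to a simple throughput model: events/day ≈ slots × efficiency × (seconds per day) / (seconds per event). The sketch below shows only the structure of that model; the slot count, per-event time, and efficiency are illustrative assumptions, not measured D0 numbers.

```python
# Minimal sketch of the "nEvents per Day" capacity model.
# All numbers below are illustrative placeholders, not measured D0 values.

SECONDS_PER_DAY = 86400

def max_events_per_day(n_slots, sec_per_event, efficiency):
    """Upper bound on reconstructed events/day for a given slot count.

    n_slots       -- CAB job slots available to data production (assumed)
    sec_per_event -- mean wall-clock seconds to reconstruct one event (assumed)
    efficiency    -- fraction of wall time actually spent processing
                     (downtimes, empty queues, waits on data delivery, ...)
    """
    return n_slots * efficiency * SECONDS_PER_DAY / sec_per_event

# Example: compare an ideal estimate with one degraded by utilization losses,
# in the spirit of the calculated-vs-actual gap (8.0E6 vs 6.5E6) noted above.
ideal = max_events_per_day(n_slots=2000, sec_per_event=20.0, efficiency=1.00)
real  = max_events_per_day(n_slots=2000, sec_per_event=20.0, efficiency=0.80)
print("ideal: %.2e events/day, at 80%% utilization: %.2e" % (ideal, real))
```

One plausible way to fold in the Tevatron profile scenarios is to let sec_per_event grow with instantaneous luminosity, feeding the events/day = f(luminosity) investigation mentioned earlier.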
D0 Grid Data Production Phase 2 Dependent Work List – Waiting on some condition
• Waiting on a new Condor release with the requisite functionality
  • New SAM-Grid release with support for the new Job status value at the Queuing node
    • A defect in the new Condor (Qedit) prevents the feature from working. Development is done and tested; the feature is disabled by default.
    • Kluge alternative, or wait for the Condor fix? The fix is believed to be in the new Condor (to be checked) and will be in a new VDT release in a few weeks.
    • If the fix is in the new Condor, then we wait on it propagating to us via VDT. PM to check whether the fix is in the new Condor.
• Waiting on PBS Head Node upgrade .OR. a new Condor release with the requisite functionality
  • Optimize configurations separately for Data and MC Production
    • January 2009: first trial of increasing the "nJobs in queue on CAB" limits: trade-off with scheduler polling overhead
    • The PBS head node upgrade may mitigate the impact of polling
    • A new version of Condor may lessen the polling itself
  • Increase Data Production "queue" depth to avoid empty queue conditions. Goal?: "Fill once for the weekend." (see the depth estimate sketched after this slide)
    • Desire to have either ONE of the FWD nodes able to fill the running slots on its own.
• Waiting on a "push to do" by RDK/AL, before major re-installations involving QUE1 or the CAB system
  • Formalize the transfer of QUE1 (samgrid.fnal.gov) to FEF from FGS (before an OS upgrade) and the complementary transfer of the batch system from FEF to FGS.
  • Needs preparation; have each system meet the requirements of its new stewards.
• Waiting on a demonstrated need for the Initiative to be involved, rather than the CD and/or D0 groups already in place
  • MC Production issues: so far REX/Ops is handling these with D0 experimenter involvement.
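For the "fill once for the weekend" goal, the required queue depth follows from how many times each CAB slot turns over during the unattended period. A minimal sketch, with illustrative (not measured) slot counts and job lengths:

```python
# Minimal sketch of the "fill once for the weekend" queue-depth estimate.
# Slot count, job length, and the unattended window are illustrative assumptions.

def queue_depth_needed(n_slots, job_hours, unattended_hours):
    """Jobs to queue up front so the CAB slots stay busy with no refills.

    n_slots          -- data-production job slots to keep full (assumed)
    job_hours        -- mean wall-clock hours per reconstruction job (assumed)
    unattended_hours -- period to cover without intervention, e.g. a weekend
    """
    turnovers = unattended_hours / job_hours     # times each slot drains and refills
    return int(n_slots * max(turnovers, 1.0))    # never queue less than one per slot

# Example: ~1000 slots, ~6 h jobs, Friday evening to Monday morning (~60 h).
print(queue_depth_needed(n_slots=1000, job_hours=6.0, unattended_hours=60.0))  # 10000
```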
D0 Grid Data Production Phase 2 Work List - Mar
• General Grid System Work
  • Implement FWD4 → FWD4a (Data) and FWD4b (MC)
  • Retire FWD5 to test-stand duty, or repurpose it.
  • Next layer of decoupling/capacity increases
    • Decouple SAM Station: one each for Data and MC Production, so SRM use is decoupled as well
    • Decouple Data vs MC local durable storage and add capacity if needed.
  • Next REX/Ops-deployable Condor upgrades (VDT/Condor product config)
    • Model in use elsewhere; takes the developers out of the loop.
• Broader Work based on previous planning tasks
  • FermiGrid/CAB config changes per the plan being developed in February.
    • May be worthwhile to wait on the PBS head node upgrade
  • Monitoring design, early implementation
    • Slow FWD–CAB job transition: install monitoring and an alarm (a minimal sketch follows this slide)
    • Enable alarms and monitoring for all fcpd and Local Cache Storage services
    • Workflow-diagram-oriented monitoring – "see" bottlenecks, aid debugging.
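For the slow FWD–CAB transition alarm, one minimal approach is to poll the forwarding node's Condor queue and flag jobs that remain idle past a threshold. The sketch below assumes the standard condor_q -constraint/-format options; the threshold and the alerting hook are placeholders, not the monitoring design itself.

```python
#!/usr/bin/env python3
# Minimal sketch of a "slow FWD->CAB job transition" alarm. Assumes the
# forwarding node's queue is visible via condor_q with the standard
# -constraint/-format options; threshold and alerting are placeholders.
import subprocess, sys, time

IDLE_THRESHOLD_SEC = 2 * 3600   # illustrative: alarm if a job idles > 2 hours

def idle_jobs():
    """Yield (cluster_id, queue_time) for jobs still idle (JobStatus == 1)."""
    out = subprocess.run(
        ["condor_q", "-constraint", "JobStatus == 1",
         "-format", "%d ", "ClusterId",
         "-format", "%d\n", "QDate"],
        capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        cluster, qdate = line.split()
        yield int(cluster), int(qdate)

def main():
    now = time.time()
    stuck = [c for c, qdate in idle_jobs() if now - qdate > IDLE_THRESHOLD_SEC]
    if stuck:
        # Placeholder alert; a real deployment would feed the existing
        # monitoring/alarm infrastructure instead of printing.
        sys.stderr.write("ALARM: %d jobs idle > %ds: %s\n"
                         % (len(stuck), IDLE_THRESHOLD_SEC, stuck))
        sys.exit(1)

if __name__ == "__main__":
    main()
```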
D0 Grid Data Production Phase 2 Work List - Apr
• Finish Task Chains in Progress
  • Monitoring implementation, robustness to change in the system over time.
    • Example: we currently rely on all of d0farm being on cab2 for simple plotting.
• Address Issues for a Full-Load System?
  • FWD load balancing, or an approximation of it
• Look at Preserving Progress over Time
  • Processes for Change Management, Continuous Service Improvement, Dev and Ops roles, …
  • Provisioning for dev/int/prd environments
  • Virtualization to provide 32/64-bit dev environments
D0 Grid Data Production Phase 2 Deferred Work List – Do not do until proven necessary and worthwhile
• No known need at present
  • Uniform OSes: upgrade FWD1-3 and QUE1 to the latest SLF 4.0, same as FWD4-5.
    • Only a minor OS version difference. Wait until this is needed, to avoid another disruption and the associated risk.
D0 Grid Data Production Discussion Notes
• …
• …
D0 Grid Data Production Background Slides
Original 16+4 Issues List
D0 Grid Data Production Issues List (p.1/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 1) Unreliable state information returned by SAM-Grid: Under some circumstances SAM-Grid does not return correct state information for jobs. Fixing this may entail adding some logic to SAM-Grid.
  • SAM-Grid Job Status development (see discussion on earlier slides). Delayed by the Condor defect.
• 2) Cleanup of globus-job-managers on forwarding nodes, a.k.a. "stale jobs": The globus job managers on the forwarding nodes are sometimes left running long after the jobs have actually terminated. This eventually blocks new jobs from starting.
  • AL: Improved script to identify them and treat the symptoms (a sketch of the approach follows this slide). Has not happened recently, but why is it happening at all?
  • Not specific to SAM-Grid / Grid Production.
• 3) Scriptrunner on samgrid needs to be controlled, a.k.a. the Periodic Expression problem: This is now locking us out of all operation for ~1 hour each day. It is due to a feature in Condor 6 which we do not use, but which cannot be fully disabled either. Developers say this is fixed in Condor 7, but this has not been proven yet.
  • Condor 7 Upgrade – RESOLVED!
• 4) CORBA communication problems with SAM station: The actual source of all CORBA problems is hard to pin down, but at least some of them seem to be associated with heavy load on samgfwd01, where the SAM station runs. Since the forwarding nodes are prone to getting bogged down at times, the SAM station needs to be moved to a separate node.
  • Move SAM station off of FWD1 – DONE! Context Server move as well – DONE!
• 5) Intelligent job matching to forwarding nodes: SAM-Grid appears to assign jobs to the forwarding nodes at random, without regard to the current load on the forwarding nodes. It will assign jobs to a forwarding node that has reached the CurMatch max even if another forwarding node has job slots available.
  • Nothing in Phase 1. A later Phase may include a less effort-intensive approach to accomplish the same result.
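Item 2's symptom treatment can be as simple as scanning for globus-job-manager processes that have outlived any plausible job. The sketch below is not AL's actual script; the age threshold, the use of the ps "etimes" field, and the kill step are assumptions for illustration.

```python
#!/usr/bin/env python3
# Minimal sketch of the item-2 symptom treatment: report globus-job-manager
# processes that have outlived any plausible job. Not the actual operations
# script; the age threshold, ps "etimes" field, and kill step are assumptions.
import subprocess

MAX_AGE_SEC = 3 * 86400   # illustrative: anything older than 3 days is suspect

def stale_job_managers():
    """Yield (pid, age_seconds) for long-running globus-job-manager processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,etimes,args"],   # etimes = elapsed run time in seconds
        capture_output=True, text=True, check=True).stdout
    for line in out.splitlines()[1:]:       # skip the header line
        pid, etimes, args = line.split(None, 2)
        if "globus-job-manager" in args and int(etimes) > MAX_AGE_SEC:
            yield int(pid), int(etimes)

if __name__ == "__main__":
    for pid, age in stale_job_managers():
        print("stale globus-job-manager pid=%d age=%ds" % (pid, age))
        # A real cleanup would first cross-check against the batch system,
        # then terminate, e.g. os.kill(pid, signal.SIGTERM).
```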
D0 Grid Data Production Issues List (p.2/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 6) Capacity of durable location servers: Merge jobs frequently fail due to delivery timeouts of the unmerged thumbnails. We need to examine carefully what functions the durable location servers are providing and limit activity here to production operations. Note that when we stop running Recocert as part of the merge this problem will worsen.
  • Nothing in Phase 1. A later Phase may include decoupling of the durable location servers?
  • No automatic handling of hardware failure: the system keeps trying even if a storage server is down.
• 7) CurMatch limit on forwarding nodes: We need to increase this limit, which probably implies adding more forwarding nodes. We would also like to have MC and data production separated on different forwarding nodes so response is more predictable.
  • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.
  • Can now tune to optimize for Data Production.
• 8) Job slot limit on forwarding nodes: The current limit of 750 job slots handled by each forwarding node has to be increased. Ideally this would be large enough that one forwarding node going down only results in slower throughput to CAB rather than a complete cutoff of half the processing slots. Could be addressed by optimizing the fwd node config for data production.
  • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.
  • Can now tune to optimize for Data Production.
• 9) Input queues on CAB: We have to be able to fill the input queues on CAB to their capacity of ~1000 jobs. The configuration coupling between MC and data production that currently limits this to ~200 has to be removed. Could be addressed by optimizing the fwd node config for data production.
  • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.
  • Can now tune to optimize for Data Production.
D0 Grid Data Production Issues List (p.3/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 10) 32,001 Directory problem: An acceptable band-aid is in place, but we should follow up with the Condor developers to communicate the scaling issue of storing job state in a file system, given the need to retain job state for tens of thousands of jobs in a large production system.
  • A cron job already moves information into sub-directories to avoid this (a minimal sketch of the idea follows this slide).
• 11) Spiral of Death problem: See for example the reports from 19-21 July 2008. Rare, but it stops all processing. We do not yet understand the underlying cause. The only known way to address this situation is to do a complete kill/cold-stop and restart of the system.
  • Condor 7 Upgrade? There may be different causes in other episodes; only one was understood.
  • Decouple FWD nodes between Data and MC Production and tune separately for each (mitigation only).
• 12) Various Globus errors: We have repeated episodes where a significant number of jobs lose all state information and fall into a "Held" state due to various Globus errors. These errors are usually something like "Job state file doesn't exist", "Couldn't open std out or std err", "Unspecified job manager error". Mike doesn't think we have ever clearly identified the source of these errors; his guess is they have a common cause. The errors tend to occur in clusters (about half a dozen showed up last night, which is what brought it to mind). They usually don't result in the job failing, but such jobs have to be tracked by hand until complete, and in some cases all log information is lost.
  • A later Phase is to include more detailed debugging with more modern software in use.
  • At least some issues are not SAM-Grid specific and are known not to be fixed by VDT 1.10.1m (KC).
    • For example: the GAHP server, part of Condor.
• 13) Automatic restart of services on reboot: Every node in the system (samgrid, samgfwd, d0cabosg, etc.) needs to be set up to automatically restart all necessary services on reboot. We have lost a lot of time when nodes reboot and services do not come back up. SAM people appear not to get any notification when some of these nodes reboot.
  • Done during the Evaluation Phase. Make sure this is set up on new nodes as well. – DONE!
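Item 10's band-aid amounts to a periodic sweep that keeps any single directory well under the ext3 ~32k entry limit. The sketch below only illustrates the idea; the spool path, age cut, and bucketing scheme are assumptions, not the actual cron job.

```python
#!/usr/bin/env python3
# Minimal sketch of the cron-job band-aid for the 32,001-directory limit noted
# in item 10: archive old job-state directories into sub-directories so the
# top-level directory stays well below the ext3 ~32k subdirectory limit.
# The spool path, age cut, and bucketing scheme are illustrative assumptions.
import os, shutil, time

SPOOL_DIR   = "/var/condor/spool"   # assumed location of job-state entries
ARCHIVE_AGE = 7 * 86400             # assumed: archive entries idle > 7 days
BUCKETS     = 100

def sweep(spool_dir=SPOOL_DIR):
    now = time.time()
    for name in os.listdir(spool_dir):
        src = os.path.join(spool_dir, name)
        if not os.path.isdir(src) or name.startswith("archive_"):
            continue
        if now - os.path.getmtime(src) < ARCHIVE_AGE:
            continue                                   # still in active use
        idx = sum(name.encode()) % BUCKETS             # stable bucket choice
        dest = os.path.join(spool_dir, "archive_%02d" % idx)
        os.makedirs(dest, exist_ok=True)
        shutil.move(src, os.path.join(dest, name))

if __name__ == "__main__":
    sweep()   # intended to be run periodically from cron
```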
D0 Grid Data Production Issues List (p.4/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 14) SRM needs to be cleanly isolated from the rest of the operation: This might come about as a natural consequence of some of the other decoupling actions. A more global statement would be that we have to ensure that problems at remote sites cannot stop our local operations (especially if the problematic interaction has nothing to do with the data processing operation).
  • Nothing in Phase 1. A later Phase is to include decoupling of the SAM stations, one each for Data and MC Production.
• 15) Lack of transparency: No correlation between the distinct grid and PBS IDs, together with inadequate monitoring, means it is very difficult to track a single job through the entire grid system, which is especially important for debugging.
  • A tool was identified in the Evaluation Phase to help with this. Consider refinement in a later Phase.
• 16) Periods of slow FWD node to CAB job transitions: related to the Spiral of Death issue?
  • Condor 7 Upgrade and increase of the ulimit OpenFileMaxPerProcess to the high value used elsewhere.
  • Does this cure all observed cases? Not yet sure.
• MC-specific Issue 1) File delivery bottlenecks: use of SRM at the site helps.
  • Out of scope for Phase 1. The SRM specification mechanism is inadequate; it should go by the site name or something more specific.
• MC-specific Issue 2) Redundant SAM caches needed in the field
  • Out of scope for Phase 1.
• MC-specific Issue 3) ReSS improvements needed, avoid problem sites, …
  • Out of scope for Phase 1. PM sent a doc and met with Joel.
• MC-specific Issue 4) Get LCG forwarding nodes up and running reliably
  • Out of scope for Phase 1, though this is being worked on outside of Initiative Phase 1.