D0 Grid Data Production Initiative: Coordination Mtg

Version 1.0 (meeting edition) 22 January 2009 Rob Kennedy and Adam Lyon Attending: …

D0 Grid Data Production Overview
• News and Summary
  • System ran smoothly over past 2 weeks for the most part…
  • Some special run processing led to modest resource idle time, as forewarned by Mike D.
  • Incident with Oracle DB + planned long LTO4 tape downtime + … led to downtime 1/21/2009.
  • Resource Utilization metrics remained high, 95%+, up to the downtime.
  • CPU/event = f(Luminosity) appears to be the driver of the events/day processing variations.
  • Still some tasks to follow up on: new SAM-Grid state feature (blocking on Condor fix)
  • Exec Mtg w/D0 Spokespeople and Vicky in late Jan/early Feb. (Yet to be scheduled)
    • Phase 1 close-out with more operational experience, investigation into events/day = f(luminosity), etc.
    • Phase 2 proposal (may be in draft form at that time)
• Agenda
  • Brief: status of system, Jan/Feb tasks and deployments.
  • Progress: modeling nEvents/day = f(CPUs in system, luminosity, etc.)
  • Focus = Phase 2 goals
    • List potential goals and briefly discuss what they mean and the anticipated benefits/costs and time/effort
    • Develop a process to prioritize/select goals...
    • Start to pare down to a “short list”, but want broader input before finalizing the goals list.

D0 Grid Data Production Phase 1 Follow-up Status of Open Tasks

D0 Grid Data Production Phase 1 Follow-up Notes
• Start in January 2009, take in steps to ensure stability. May continue into Phase 2.
• 1. Config: Optimize Configurations separately for Data and MC Production
  • January 2009: first trial of increasing limits
  • Increase Data Production “queue” length to reduce number of “touches” per day, avoid empty queue conditions
  • Cannot increase arbitrarily without hitting CAB queue limits (~3000)… also issue with Grid System polling
  • Batch queue limit now about 1000 for d0farm on CAB. 2640 max queue-able + running
  • Desire to have either ONE of the FWD nodes able to fill the running slots.
• First Tuesday of February 2009
• 2. Queuing nodes auto-update code in place (v0.3), not enabled to avoid possible format confusion.
  • Defer downtime needed to cleanly enable auto-update. Hand edit gridmap files until then.
• 3. FWD3 not rebooted yet, so it has not picked up ulimit-FileMaxPerProcess…
• Sometime in Phase 2 (Feb-Apr 2009)
• 4. AL’s Queuing Node monitoring still being productized, not running on QUE2 yet.
• 5. New SAM-Grid Release with support for new Job status value at Queuing node
  • Defect in new Condor (Qedit) prevents feature from working. Dev done and tested, feature disabled by default.
  • Kluge alternative or wait for Condor fix? Get schedule estimate for fixed release before deciding.
• 6. Uniform OS’s: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5. DEFER?
• 7. Formalize transfer of QUE1 (samgrid.fnal.gov) to FEF from FGS (before an OS upgrade)

D0 Grid Data Production Modeling nEvents/day CPU in system CPU per event First-order estimate

D0 Grid Data Production NEvents/day = f(cpu, L, …)
• Fit table (top right): Final set of parameters, with Asymptotic Standard Error
  • a = 0.00204141 ± 0.0001052 (5.154%)
  • b = -0.0736117 ± 0.03557 (48.32%)
  • c = 11.1163 ± 2.611 (23.49%)
• CPU-sec in system
  • MD: get_pbsnodes.csh, Unit = PIII GHz
  • (Total nodes, process slots, cpu power) = (434, 1606, 4781.33)
• Resource Utilization
  • CPU-sec/Wall-sec ~ 96.5% with steady running. Does not consider edge effects like executable start-up. Use 90% to accommodate this and other endemic inefficiencies under best conditions.
• CPU-sec/event as f(L)
  • MD: 20_timing.ps, parabolic fit of processing time to initial luminosity
  • See top right table. T = aL^2 + bL + c
• L profile of Tevatron
  • Dec 2008: average initial luminosity = 155 E30
• nEvents/day = (CPU-sec in system per wall-sec) * (Resource Utilization) * (sec/day) / (CPU-sec/event as f(L))
  • See bottom right table
  • Millions of events/day (init L = 155E30) = 8.0 MEvts/day
  • We observe about 6.5 MEvts/day for the smoothest ops period in Dec
  • OK… not so far off considering the relative precision of inputs.
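The first-order estimate on this slide can be sketched in a few lines of Python. This is only an illustration, not the calculation behind the slide's table. It assumes that the fitted T(L) is in PIII-GHz-seconds per event with L in units of E30, and that the "cpu power" figure (4781.33) is the total PIII GHz available; the function names are invented for the example. Under these assumptions the 90% and 96.5% utilization cases land on either side of the 8.0 MEvts/day quoted above.

```python
SECONDS_PER_DAY = 86_400

# Parabolic fit T(L) = a*L^2 + b*L + c, central values from the slide.
A, B, C = 0.00204141, -0.0736117, 11.1163


def cpu_sec_per_event(lum_e30):
    """CPU cost per event (assumed PIII-GHz-sec) at initial luminosity L (E30)."""
    return A * lum_e30**2 + B * lum_e30 + C


def events_per_day(cpu_power_ghz, utilization, lum_e30):
    """First-order estimate: usable CPU-seconds per day / CPU-seconds per event."""
    usable_cpu_sec = cpu_power_ghz * utilization * SECONDS_PER_DAY
    return usable_cpu_sec / cpu_sec_per_event(lum_e30)


if __name__ == "__main__":
    # 4781.33 PIII GHz and L = 155 E30 are the slide's inputs.
    for util in (0.90, 0.965):
        n = events_per_day(4781.33, util, 155.0)
        print(f"utilization {util:.3f}: {n / 1e6:.1f} MEvts/day")
```

The fit errors on b and c are large (48% and 23%), so the output is best read as a scale check rather than a prediction.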

D0 Grid Data Production Plots Behind the Numbers
• External Files… shown briefly at meeting
  • CPU/event = f(L) fit
  • Average initial L by Run
• Caveats
  • Assumes L profile of stores is static
    • We have heard that stores may end sooner intentionally to increase the average L with the same average initial L.
  • Recall the L plots shown are when data was recorded, NOT when data was processed
  • Large day-to-day variation not fully understood. May be related to order of processing (High L to Low L).

D0 Grid Data Production Phase 2 Goals List potential goals, benefits/costs, time/effort Develop a process to prioritize/select goals... Start to pare down to a “short list”

D0 Grid Data Production Phase 2 Goal Process
• Guidelines: Resource Utilization, Production Capacity, Effort to Operate
• Process
  • Today: candidate goals wish list, pared down to most likely candidates
  • Next week: lightweight planning – what is possible to do in time/effort allotted, benefits/costs
  • Form draft proposal, present to CPB for comment
  • Present draft plan to D0 spokes and VW … preferably after CPB, depends on availability.
• Overall Categories of Goals
  • Finish Phase 1 Leftovers and remaining 16+4 Issues
  • Address next layer of capacity and performance issues in Grid Data Production
  • Work directly on reducing effort to operate Grid Data Production
  • Look at broadening scope within reason to get the most benefit for cost
    • MC Production issues
    • D0 Offline CPU Farm Utilization in general

D0 Grid Data Production Phase 2 Goals Wish List
• Phase 1 and Re-Assessment Leftovers
  • Items listed in previous slide: Big 2 IMHO are:
    • 1. Config: Optimize Configurations separately for Data and MC Production
    • 2. New SAM-Grid Release with support for new Job status value at Queuing node
  • Resource Utilization analysis done. (add a plot, integrate into ops better?)
  • Effort to Operate analysis not done. Improvements as side effect – good enough?
  • Capacity Planning started. Good enough? Average L per day? L at processing time?
• Ideas collected over time for Phase 2
  • Slow FWD-CAB Job Transition: Install Monitoring and Alarm
  • REX/Ops deployable Condor Upgrades (VDT/Condor product config)
  • Decouple SAM Station: 1 each for Data and MC Prod … means SRM use decoupled
  • Decouple Local Data Processing Storage (Data vs MC durable storage), Add Capacity.
  • Enable alarms, monitoring for all fcpd and Local Cache Storage Services
  • Workflow diagram-oriented monitoring of job status – “see” bottlenecks, debugging
• Broadening Scope
  • Address MC Prod major 4 issues … and remaining relevant Data Production issues.
  • New: Look at D0 Farm CPU utilization overall – part appears 40% idle, many SPOF.
    • <break out to Keith Chadwick’s slides> + <Ganglia plots of d0cabsrv1 CPU and load plots>

D0 Grid Data Production Discussion Notes • … • … • … • … • … • …

D0 Grid Data Production Background Slides Original 16+4 Issues List

D0 Grid Data Production Issues List (p.1/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 1) Unreliable state information returned by SAM-Grid: SAM-Grid under some circumstances does not return correct state information for jobs. Fixing this may entail adding some logic to SAM-Grid.
  • SAM-Grid Job Status development (see discussion on earlier slides). Delayed by Condor defect.
• 2) Cleanup of globus-job-managers on forwarding nodes, a.k.a. “stale jobs”: The globus job managers on the forwarding nodes are sometimes left running long after the jobs have actually terminated. This eventually blocks new jobs from starting.
  • AL: Improved script to identify them and treat symptoms. Has not happened recently. But why is it happening at all?
  • Not specific to SAM-Grid Grid Production
• 3) Scriptrunner on samgrid needs to be controlled, a.k.a. the Periodic Expression problem: This is now locking us out of all operation for ~1 hour each day. This is due to a feature in Condor 6 which we do not use, but which cannot be fully disabled either. Developers say this is fixed in Condor 7, but this has not been proven yet.
  • Condor 7 Upgrade – RESOLVED!
• 4) CORBA communication problems with SAM station: The actual source of all CORBA problems is hard to pin down, but at least some of them seem to be associated with heavy load on samgfwd01 where the SAM station runs. Since the forwarding nodes are prone to getting bogged down at times, the SAM station needs to be moved to a separate node.
  • Move SAM station off of FWD1 – DONE! Context Server move as well – DONE!
• 5) Intelligent job matching to forwarding nodes: SAM-Grid appears to assign jobs to the forwarding nodes at random without regard to the current load on the forwarding nodes. It will assign jobs to a forwarding node that has reached CurMatch max even if another forwarding node has job slots available.
  • Nothing in Phase 1. Later Phase may include a less effort-intensive approach to accomplish same result.
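As an illustration of what issue 5 asks for, the sketch below picks the forwarding node with the most free slots and never picks one that is already at its match limit. It is not SAM-Grid or Condor code: the `ForwardingNode` class, the `pick_node` function, and the node names are hypothetical, and `max_matches` merely stands in for the CurMatch max mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ForwardingNode:
    name: str          # hypothetical node label
    running: int       # jobs currently matched to this node
    max_matches: int   # stand-in for the node's CurMatch max

    @property
    def free_slots(self) -> int:
        return max(self.max_matches - self.running, 0)


def pick_node(nodes: Sequence[ForwardingNode]) -> Optional[ForwardingNode]:
    """Return the node with the most free slots, or None if all are full."""
    candidates = [n for n in nodes if n.free_slots > 0]
    if not candidates:
        return None  # caller should leave the job queued rather than overfill
    return max(candidates, key=lambda n: n.free_slots)


if __name__ == "__main__":
    nodes = [
        ForwardingNode("fwd_a", running=750, max_matches=750),  # saturated
        ForwardingNode("fwd_b", running=200, max_matches=750),
    ]
    chosen = pick_node(nodes)
    print(chosen.name if chosen else "no capacity")
```

A random assignment would send roughly half of new jobs to the saturated node in this example; the load-aware choice sends them all to the node with capacity.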

D0 Grid Data Production Issues List (p.2/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 6) Capacity of durable location servers: Merge jobs frequently fail due to delivery timeouts of the unmerged thumbnails. We need to examine carefully what functions the durable location servers are providing and limit activity here to production operations. Note that when we stop running Recocert as part of the merge this problem will worsen.
  • Nothing in Phase 1. Later Phase may include decoupling of durable location servers?
  • No automatic handling of hardware failure. System keeps trying even if storage server down.
• 7) CurMatch limit on forwarding nodes: We need to increase this limit which probably implies adding more forwarding nodes. We would also like to have MC and data production separated on different forwarding nodes so response is more predictable.
  • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.
  • Can now tune to optimize for Data Production.
• 8) Job slot limit on forwarding nodes: The current limit of 750 job slots handled by each forwarding node has to be increased. Ideally this would be large enough that one forwarding node going down only results in slower throughput to CAB rather than a complete cutoff of half the processing slots. Could be addressed by optimizing fwd node config for data production.
  • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.
  • Can now tune to optimize for Data Production.
• 9) Input queues on CAB: We have to be able to fill the input queues on CAB to their capacity of ~1000 jobs. The configuration coupling between MC and data production that currently limits this to ~200 has to be removed. Could be addressed by optimizing fwd node config for data production.
  • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.
  • Can now tune to optimize for Data Production.

D0 Grid Data Production Issues List (p.3/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 10) 32,001 Directory problem: Acceptable band-aid is in place, but we should follow up with Condor developers to communicate the scaling issue of storing job state in a file system given the need to retain job state for tens of thousands of jobs in a large production system.
  • Already a cron job to move information into sub-directories to avoid this.
• 11) Spiral of Death problem: See for example reports from 19-21 July 2008. Rare, but stops all processing. We do not understand the underlying cause yet. The only known way to address this situation is to do a complete kill/cold-stop and restart of the system.
  • Condor 7 Upgrade? May be different causes in other episodes... Only one was understood.
  • Decouple FWD nodes between Data and MC Production and tune separately for each. (mitigation only)
• 12) Various Globus errors: We have repeated episodes where a significant number of jobs lose all state information and fall into a "Held" state due to various Globus errors. These errors are usually something like "Job state file doesn't exist", "Couldn't open std out or std err", "Unspecified job manager error". Mike doesn't think we have ever clearly identified the source of these errors. His guess is they have a common cause. The above errors tend to occur in clusters (about half a dozen showed up last night, that's what brought it to mind). They usually don't result in the job failing, but such jobs have to be tracked by hand until complete and in some cases all log information is lost.
  • Later Phase to include more detailed debugging with more modern software in use.
  • At least some issues are not SAM-Grid specific and known not fixed by VDT 1.10.1m. (KC).
  • For example: GAHP server... Part of Condor
• 13) Automatic restart of services on reboot: Every node in the system (samgrid, samgfwd, d0cabosg, etc.) needs to be set up to automatically restart all necessary services on reboot. We have lost a lot of time when nodes reboot and services do not come back up. SAM people appear to not get any notification when some of these nodes reboot.
  • Done during Evaluation Phase. Make sure this is set up on new nodes as well. – DONE!
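The band-aid for issue 10 is described only as a cron job that moves information into sub-directories. A minimal sketch of that idea follows; it is not the script in use. It assumes the problem comes from a per-directory cap on the number of sub-directories in the file system, and that per-job state lives in directories named by a numeric job id; the spool layout, the bucket naming, and both function names are hypothetical.

```python
import shutil
from pathlib import Path


def bucket_name(job_id: int, bucket_size: int = 1000) -> str:
    """Name of the sub-directory that holds a given job id, e.g. 00001000-00001999."""
    start = (job_id // bucket_size) * bucket_size
    return f"{start:08d}-{start + bucket_size - 1:08d}"


def archive_job_dirs(spool: Path, bucket_size: int = 1000) -> int:
    """Move numeric job directories in `spool` into fixed-size buckets.

    Returns the number of directories moved. Bucket directories contain a
    hyphen, so they are skipped on later runs and the job is safe to repeat.
    """
    moved = 0
    for entry in sorted(spool.iterdir()):  # sorted() snapshots the listing
        if not entry.is_dir() or not entry.name.isdigit():
            continue
        target = spool / bucket_name(int(entry.name), bucket_size)
        target.mkdir(exist_ok=True)
        shutil.move(str(entry), str(target / entry.name))
        moved += 1
    return moved
```

With a bucket size of 1000, tens of thousands of jobs produce only tens of top-level directories, which keeps the spool well under any per-directory limit. A real deployment would also need to leave directories of still-active jobs in place, which this sketch does not attempt.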

D0 Grid Data Production Issues List (p.4/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 14) SRM needs to be cleanly isolated from the rest of the operation: This might come about as a natural consequence of some of the other decoupling actions. A more global statement would be that we have to ensure that problems at remote sites cannot stop our local operations (especially if the problematic interaction has nothing to do with the data processing operation).
  • Nothing in Phase 1. Later Phase to include decoupling of SAM stations, 1 each for Data and MC Production.
• 15) Lack of Transparency: No correlation between the distinct grid and PBS id’s and inadequate monitoring mean it is very difficult to track a single job through the entire grid system, especially important for debugging.
  • Tool identified in Evaluation Phase to help with this. Consider refinement in later Phase.
• 16) Periods of Slow Fwd node to CAB Job transitions: related to Spiral of Death issue?
  • Condor 7 Upgrade and increase ulimit-OpenFileMaxPerProcess to high value used elsewhere.
  • Cures all observed cases? Not yet sure.
• MC-specific Issue #1) File Delivery bottlenecks: use of SRM at site helps
  • Out of scope for Phase 1. SRM specification mechanism inadequate. Should go by the site name or something more specific.
• MC-specific 2) Redundant SAM caches needed in the field
  • Out of scope for Phase 1
• MC-specific 3) ReSS improvements needed, avoid problem sites,….
  • Out of scope for Phase 1. PM sent doc, met with Joel.
• MC-specific 4) Get LCG forwarding nodes up and running reliably
  • Out of scope for Phase 1. This is being worked on outside of Initiative Phase 1 though.
