D0 Grid Data Production Initiative: Coordination Mtg

Version 1.0 (meeting edition), 08 January 2009
Rob Kennedy and Adam Lyon
Attending: …

Overview
• Summary
  • System ran smoothly over the holidays. SUCCESS!
  • Resource Utilization metrics are all high, 95%+. SUCCESS!
  • Events/day increased, but fell short of the goal level. Higher luminosity data?
  • Still some tasks to follow up on: doc, pkg, new SAM-Grid state feature
• News
  • Exec Mtg w/D0 Spokespeople and Vicky on Dec 12. Much positive feedback.
  • Requested a Phase 1 close-out executive meeting in late Jan/early Feb, after more operational experience, investigation into events/day = f(luminosity), etc.
  • No Coordination Meeting next week (1/15/2009); resumes weekly on 1/22/2009.
• Agenda
  • Phase 1 Open Tasks, Close-out
  • Review of Metrics and (value) Goals
  • Understanding: nEvents/day = f(cpu, L, …)

Phase 1 Close-Out
• Status of Open Tasks
• Current Configuration
• Close-out

Phase 1 Follow-up Notes
• Assign to: January 2009, Phase 2 (Feb-Apr), or Future Wish List
• Deployment 1: Split Data/MC Production Services – Completed, with follow-up on:
  1. Queuing nodes auto-update code is in place (v0.3), but not enabled, to avoid possible format confusion.
     • Defer the downtime needed to cleanly enable auto-update. Hand-edit gridmap files until then.
  2. AL’s Queuing Node monitoring is still being productized; not running on QUE2 yet.
  3. Switch assignments: QUE1 = MC, and QUE2 = DATA PROD
     • Keep Data Production information on QUE1 for expected time.
     • Remote MC users no longer need to change their usage with this switch; simpler to implement.
  4. FWD3 not rebooted yet, so it has not picked up ulimit-FileMaxPerProcess… No hurry.
  5. Integrating experience into installation procedures (see Deployment 1 Review notes)
• Deployment 2: Optimize Data and MC Production Configurations – DEFERRED to January 2009
  1. Config: Optimize Configurations separately for Data and MC Production
     • Increase Data Production “queue” length to reduce the number of “touches” per day and avoid empty queue conditions
  2. New SAM-Grid Release with support for new Job status value at Queuing node
     • Defect in new Condor (Qedit, old version OK) prevents the feature from working.
     • Kluge alternative or wait for Condor fix? Get schedule estimate for fixed release before deciding.
  3. Uniform OS’s: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5
  4. Formalize transfer of QUE1 (samgrid.fnal.gov) to FEF from FGS (before an OS upgrade)

Deployment Configuration (Green = now, Blue = in progress, Yellow = future)
• Reco
  • FWD1: GridMgrMaxSubmitJobs/Resource = 1250 (was 750, default 100)
  • FWD5: 1250
• MC, MC Merge
  • FWD2: 1250 (was 750, default 100)
  • FWD4: 1250
• Reco Merge
  • FWD3: 750/300 grid each
• QUE1: Reco, Reco Merge – keep here to maintain history
• QUE2: MC, MC Merge - not used by MC Prod at first, now is.
  • Future: Switch these to simplify transition... Remote MC and test users then make no change since default in jim client = QUE1.
• SAM Station: All job types
• Jim Client: submit to QUE1 or QUE2 depending on qualifier, QUE1 default
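The routing described on this slide can be written down as a small lookup table. This is only an illustrative sketch of the current (pre-switch) assignments: the job-type keys and the function name `choose_queue` are hypothetical and are not taken from the jim client or SAM-Grid code.

```python
# Hypothetical sketch of the slide's routing table; not jim client code.
# Current assignments from the slide: QUE1 = Reco, Reco Merge; QUE2 = MC, MC Merge.
QUEUE_FOR_JOB_TYPE = {
    "reco": "QUE1",
    "reco_merge": "QUE1",
    "mc": "QUE2",
    "mc_merge": "QUE2",
}

# Forwarding nodes per job type, as listed on the slide.
FORWARDERS_FOR_JOB_TYPE = {
    "reco": ["FWD1", "FWD5"],
    "mc": ["FWD2", "FWD4"],
    "mc_merge": ["FWD2", "FWD4"],
    "reco_merge": ["FWD3"],
}

# The slide states that QUE1 is the default when no qualifier is given.
DEFAULT_QUEUE = "QUE1"


def choose_queue(job_type=None):
    """Return the queuing node for a job type, falling back to the default."""
    return QUEUE_FOR_JOB_TYPE.get(job_type, DEFAULT_QUEUE)


if __name__ == "__main__":
    for job_type in ("reco", "mc", None):
        print(job_type, "->", choose_queue(job_type))
```

With the planned future switch (QUE1 = MC, QUE2 = Data Production), only the table values would change; users who submit without a qualifier would keep landing on QUE1.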

Phase 1 Close-Out Discussion
• Anything else left to do from Phase 1?
  1. …
• Comments before “phase 1 close-out”
  1. …
• Onward!

Metrics and Goals
• CPU Utilization (2)
• Job Slot Utilization
• Unmerged Events/Day Produced

Metrics Relationships
• Top-most Customer (D0) View:
  • Events Produced/Day (for given N job slots)
  • Effort to Coordinate Production/Day (for given level of production)
• Grid/Batch Level:
  • Job Slot Utilization
  • Job Processing Stability
• Compute Level:
  • CPU Utilization
• Infrastructure Level:
  • Timely Input Data Delivery
• Before/After Plots on slides

Metric: CPU Utilization (1)
• Metric: CPU/wall-clock time ratio
  • CPU time / wall-clock time used by “d0farm”, as reported by CAB accounting.
• Before Dec. 11: Unsteady and falling off
  • Fell from ~95% to less than 80%
  • 87% in Oct seems inconsistent with low job slot utilization at that time. Interpretation issue?
• After Dec. 11: Steady and high
  • Since deployment, fcpd stabilized: >95%!
  • SUCCESS. Goal = consistently > 95%
• Note: CPU/Job increased recently
  • Implies increased CPU/event? May help explain the underwhelming nEvents/day increase.
  • Or side effect of high job success rate?
• Source Link: CAB Accounting
[Plot annotations: “Deployment, fcpd fixed”; “CPU/Job Climbing!”]
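As a sketch of how this metric could be computed, the snippet below takes the ratio as CPU time divided by wall-clock time (the direction consistent with the ~95% values quoted) and compares it with the 95% goal. The `JobRecord` type and the sample numbers are hypothetical; they do not reflect the actual CAB accounting format or real measurements, and single-CPU job slots are assumed.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical accounting record for one job (single-CPU slot assumed)."""
    cpu_seconds: float
    wall_seconds: float


def cpu_utilization(records):
    """Aggregate CPU time / wall-clock time over all records."""
    total_wall = sum(r.wall_seconds for r in records)
    if total_wall <= 0:
        raise ValueError("no wall-clock time recorded")
    total_cpu = sum(r.cpu_seconds for r in records)
    return total_cpu / total_wall


def meets_goal(records, goal=0.95):
    """True when utilization exceeds the goal (95% on the slide)."""
    return cpu_utilization(records) > goal


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = [JobRecord(3500.0, 3600.0), JobRecord(7000.0, 7200.0)]
    print(f"utilization = {cpu_utilization(sample):.3f}, goal met: {meets_goal(sample)}")
```

Summing CPU and wall time before dividing weights long jobs more heavily than averaging per-job ratios would; which of the two the accounting page reports is not stated on the slide.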

Metric: CPU Utilization (2)
• Before: Occupancy & Load Dips
  • Metric: convolutes Job Slot & CPU utilization. Looking for CPU inefficiency when load dips but job slot use does not.
  • Top: dips only in load (black trace) are due to a file transfer daemon failure mode (fixed)
  • Side effect of more stable system: Easier to see low-level issues AND debug larger issues. Less entanglement.
• After: Occupancy & Load Steady
  • Bottom: Steady Load (little 4am bump is OK)
  • SUCCESS. Consistently ~100% Utilization
• Source Link
[Plot annotations: “fcpd fails. Jobs wait for data”; “deployment”, Mid-Dec ‘08; “A few minor empty queue instances”]

Metric: Job Slot Utilization
• Before: Job Slot Inefficiency
  • Bad: dips in green trace
  • 10% unused job slots in Sep ‘08 ... Part of resource utilization problem.
  • Smaller effect: job queue going empty (blue trace hits bottom)
• After: Steady and efficient!
  • Bottom: negligible dips in green trace.
  • Issue: Few instances of empty queue. Treatable partly via config tweaks.
  • SUCCESS. Consistently ~100% w/treatable issue.
• Source Link
  • See plot near page bottom.
[Plot annotation: “deployment”]

Metric: Unmerged Events/Day
• Before: Wild swings, “low” average
  • Top-level Metric. Dependent on all...
  • Production output wildly varying
  • Includes >10% “known” utilization inefficiency from job slot utilization
  • May ‘08: 5.8 MEvts/day
  • Sep-Nov ‘08: 5.2 MEvts/day
• After: Not as high as expected
  • Dec 2-Jan 6: 5.7 MEvts/day
  • Low days still, just no ~0 days.
  • Eventual goal: 7.5 – 8 MEvts/day with existing system (node count, etc.) and after addressing more subtleties.
• Ops stable. Resources well used. But Production output is not much greater.
  • Why not much more? And…
  • Why large day-to-day variations?
  • Is CPU/event increasing over “time”?
  • Luminosity effect?
  • Can Production keep up w/Raw?
• Source Link
[Plot annotations: “Sep-Nov ‘08 Average 5.2 MEvts/day”; “Recent Average 5.7 MEvts/day”]
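To put the averages quoted on this slide side by side, the short calculation below works out the relative changes. It uses only the numbers given above; the helper name `pct_change` is an illustrative choice.

```python
# Averages quoted on the slide, in MEvts/day.
MAY_08 = 5.8
SEP_NOV_08 = 5.2
DEC_JAN = 5.7
GOAL_LOW, GOAL_HIGH = 7.5, 8.0


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    print(f"Sep-Nov '08 -> Dec 2-Jan 6: {pct_change(SEP_NOV_08, DEC_JAN):+.1f}%")   # +9.6%
    print(f"May '08     -> Dec 2-Jan 6: {pct_change(MAY_08, DEC_JAN):+.1f}%")       # -1.7%
    print(f"Needed to reach goal (low):  {pct_change(DEC_JAN, GOAL_LOW):+.1f}%")    # +31.6%
    print(f"Needed to reach goal (high): {pct_change(DEC_JAN, GOAL_HIGH):+.1f}%")   # +40.4%
```

So the post-deployment average is roughly 10% above the Sep-Nov ‘08 average, still slightly below May ‘08, and would need to grow by roughly a further 32-40% to reach the eventual goal.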

Understanding nEvents/day
• Identify Dependencies
• Measure Dependencies
• Automate Measurement: Monitoring

nEvents/day = f(cpu, L, …)
• Best case: nEvents/day = (CPU-sec in system) / (CPU-sec/event) * (sec/day)
  • CPU-sec in system = (max CPU-sec in system)
    • Standard unit of processing at D0 = ?
    • Estimate processing power available to D0Farm on CAB. GHz or bogo-mips measurable by Ganglia?
  • CPU-sec/event = (average CPU-sec/event on benchmark machine and data)
    • Benchmark machine and data? Benchmark values for each code release?
• A Little Reality Never Hurt: Overall CPU Utilization Efficiency
  • CPU-sec in system = (max CPU-sec in system) * (CPU-sec/Wall-sec)
  • CPU-sec/Wall-sec > 95% and steady now. Track this, but no longer a large effect.
• (But Full) Reality Bites: CPU/event is not a static value
  • Dependent on: Luminosity, Data stream (min-bias vs. high-pt), Reconstruction code, …
  • Data stream differences average out as all are processed per run, but a source of variation?
  • Dependence on Luminosity a concern as Tevatron breaks record after record
  • Do we have a measure of Luminosity which we can correlate to cpu/event?
  • Can we adapt CPU-hrs/Job from CAB to help fill potential measurement gap here?
  • Rough average events/job? Job defined as an integrated luminosity increment?
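The best-case formula, with the CPU-utilization correction applied, can be sketched as below. One assumption is made to keep the units consistent: “CPU-sec in system” is read as CPU-seconds delivered per wall-clock second, i.e. the number of benchmark-equivalent CPUs. The function and parameter names are illustrative, and the numbers in the example are placeholders, not D0 farm measurements.

```python
SECONDS_PER_DAY = 86_400


def best_case_events_per_day(n_cpus, cpu_sec_per_event, cpu_per_wall=1.0):
    """Events/day = (CPU-sec per wall-sec) * efficiency / (CPU-sec/event) * (sec/day).

    n_cpus            -- CPU-seconds available per wall-clock second (assumed reading
                         of "CPU-sec in system"), in benchmark-machine units
    cpu_sec_per_event -- average CPU-seconds per event on the benchmark machine
    cpu_per_wall      -- CPU-sec/Wall-sec utilization efficiency, between 0 and 1
    """
    if cpu_sec_per_event <= 0:
        raise ValueError("cpu_sec_per_event must be positive")
    if not 0.0 <= cpu_per_wall <= 1.0:
        raise ValueError("cpu_per_wall must be between 0 and 1")
    return n_cpus * cpu_per_wall / cpu_sec_per_event * SECONDS_PER_DAY


def implied_cpu_sec_per_event(n_cpus, events_per_day, cpu_per_wall=1.0):
    """Invert the formula: CPU-sec/event implied by an observed events/day."""
    if events_per_day <= 0:
        raise ValueError("events_per_day must be positive")
    return n_cpus * cpu_per_wall * SECONDS_PER_DAY / events_per_day


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    ideal = best_case_events_per_day(n_cpus=1000, cpu_sec_per_event=10.0)
    real = best_case_events_per_day(n_cpus=1000, cpu_sec_per_event=10.0, cpu_per_wall=0.95)
    print(f"ideal: {ideal:,.0f} events/day, at 95% utilization: {real:,.0f} events/day")
```

The inverse function is included because the slide asks how CPU/event might be estimated: given a known CPU count and an observed events/day, the formula yields an implied average CPU-sec/event that could be tracked over time.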

nEvents/day Discussion
• Identify Dependencies
  • Resource Utilization is largely addressed
  • CPU/event in general, and CPU/event = f(L)
• Measure Dependencies
  • What single-point measurements are available
  • What metrics are available to help identify cause
  • Other experiences, wisdom to apply?
• Automate Measurement: Monitoring
  • What can be done to “build this into the system” so even if a component is not a problem now, it can be watched just in case.
• Next Steps
  • Meetings/tasks before Jan 22 and next D0 Spokes+Vicky meeting?
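One simple way to start measuring CPU/event = f(L) would be a straight-line least-squares fit of per-run CPU/event against a per-run luminosity measure. The sketch below shows that fit in plain Python. The linear form is an assumption chosen for illustration, since the slides do not say what shape the dependence has, and the sample points are invented to exercise the code.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Invented sample points (arbitrary units), NOT measured D0 data.
    luminosity = [1.0, 2.0, 3.0, 4.0]
    cpu_sec_per_event = [10.0, 12.0, 14.0, 16.0]
    slope, intercept = fit_line(luminosity, cpu_sec_per_event)
    print(f"cpu/event ~ {slope:.2f} * L + {intercept:.2f}")
```

If such a fit were run routinely on new runs, a drifting slope or intercept would be one way to “build this into the system” as a monitored quantity.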

Background Slides
• Original 16+4 Issues List

Issues List (p.1/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 1) Unreliable state information returned by SAM-Grid: SAM-Grid under some circumstances does not return correct state information for jobs. Fixing this may entail adding some logic to SAM-Grid.
  • SAM-Grid Job Status development (see discussion on earlier slides). Delayed by Condor defect.
• 2) Cleanup of globus-job-managers on forwarding nodes, a.k.a. “stale jobs”: The globus job managers on the forwarding nodes are sometimes left running long after the jobs have actually terminated. This eventually blocks new jobs from starting.
  • AL: Improved script to identify them and treat symptoms. Has not happened recently. But why is it happening at all?
  • Not specific to SAM-Grid Grid Production
• 3) Scriptrunner on samgrid needs to be controlled, a.k.a. the Periodic Expression problem: This is now locking us out of all operation for ~1 hour each day. This is due to a feature in Condor 6 which we do not use, but which cannot be fully disabled either. Developers say this is fixed in Condor 7, but this has not been proven yet.
  • Condor 7 Upgrade – RESOLVED!
• 4) CORBA communication problems with SAM station: The actual source of all CORBA problems is hard to pin down, but at least some of them seem to be associated with heavy load on samgfwd01 where the SAM station runs. Since the forwarding nodes are prone to getting bogged down at times, the SAM station needs to be moved to a separate node.
  • Move SAM station off of FWD1 – DONE! Context Server move as well – DONE!
• 5) Intelligent job matching to forwarding nodes: SAM-Grid appears to assign jobs to the forwarding nodes at random without regard to the current load on the forwarding nodes. It will assign jobs to a forwarding node that has reached CurMatch max even if another forwarding node has job slots available.
  • Nothing in Phase 1. Later Phase may include a less effort-intensive approach to accomplish same result.
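To make issue 5 concrete, the sketch below contrasts the reported behaviour (random assignment) with a load-aware choice that skips forwarding nodes already at their limit. It is a toy model only: the `ForwardingNode` type, its fields, and the selection rule are hypothetical and do not describe how SAM-Grid or Condor matchmaking is implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ForwardingNode:
    """Toy model of a forwarding node; fields are illustrative only."""
    name: str
    running_jobs: int
    max_jobs: int  # stand-in for a per-node cap such as the CurMatch limit

    @property
    def free_slots(self) -> int:
        return max(self.max_jobs - self.running_jobs, 0)


def pick_forwarding_node(nodes: List[ForwardingNode]) -> Optional[ForwardingNode]:
    """Pick the node with the most free slots; None if every node is full."""
    candidates = [n for n in nodes if n.free_slots > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda n: n.free_slots)


if __name__ == "__main__":
    # Example values chosen for illustration.
    nodes = [
        ForwardingNode("FWD1", running_jobs=750, max_jobs=750),  # at its cap
        ForwardingNode("FWD5", running_jobs=400, max_jobs=750),
    ]
    chosen = pick_forwarding_node(nodes)
    print("submit to:", chosen.name if chosen else "hold job, all nodes full")
```

A random choice over the same two nodes would send about half the jobs to the node that is already full, which is the symptom the issue describes.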

Issues List (p.2/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 6) Capacity of durable location servers: Merge jobs frequently fail due to delivery timeouts of the unmerged thumbnails. We need to examine carefully what functions the durable location servers are providing and limit activity here to production operations. Note that when we stop running Recocert as part of the merge this problem will worsen.
  • Nothing in Phase 1. Later Phase may include decoupling of durable location servers?
  • No automatic handling of hardware failure. System keeps trying even if storage server down.
• 7) CurMatch limit on forwarding nodes: We need to increase this limit which probably implies adding more forwarding nodes. We would also like to have MC and data production separated on different forwarding nodes so response is more predictable.
  • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.
  • Can now tune to optimize for Data Production.
• 8) Job slot limit on forwarding nodes: The current limit of 750 job slots handled by each forwarding node has to be increased. Ideally this would be large enough that one forwarding node going down only results in slower throughput to CAB rather than a complete cutoff of half the processing slots. Could be addressed by optimizing fwd node config for data production.
  • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.
  • Can now tune to optimize for Data Production.
• 9) Input queues on CAB: We have to be able to fill the input queues on CAB to their capacity of ~1000 jobs. The configuration coupling between MC and data production that currently limits this to ~200 has to be removed. Could be addressed by optimizing fwd node config for data production.
  • Decouple FWD nodes between Data and MC Production and tune separately for each – DONE.
  • Can now tune to optimize for Data Production.

Issues List (p.3/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 10) 32,001 Directory problem: Acceptable band-aid is in place, but we should follow up with Condor developers to communicate the scaling issue of storing job state in a file system given the need to retain job state for tens of thousands of jobs in a large production system.
  • Already a cron job to move information into sub-directories to avoid this.
• 11) Spiral of Death problem: See for example reports from 19-21 July 2008. Rare, but stops all processing. We do not understand the underlying cause yet. The only known way to address this situation is to do a complete kill/cold-stop and restart of the system.
  • Condor 7 Upgrade? May be different causes in other episodes... Only one was understood.
  • Decouple FWD nodes between Data and MC Production and tune separately for each. (mitigation only)
• 12) Various Globus errors: We have repeated episodes where a significant number of jobs lose all state information and fall into a "Held" state due to various Globus errors. These errors are usually something like "Job state file doesn't exist", "Couldn't open std out or std err", "Unspecified job manager error". Mike doesn't think we have ever clearly identified the source of these errors. His guess is they have a common cause. The above errors tend to occur in clusters (about half a dozen showed up last night, that's what brought it to mind). They usually don't result in the job failing, but such jobs have to be tracked by hand until complete and in some cases all log information is lost.
  • Later Phase to include more detailed debugging with more modern software in use.
  • At least some issues are not SAM-Grid specific and are known not to be fixed by VDT 1.10.1m. (KC).
  • For example: GAHP server... Part of Condor
• 13) Automatic restart of services on reboot: Every node in the system (samgrid, samgfwd, d0cabosg, etc) needs to be set up to automatically restart all necessary services on reboot. We have lost a lot of time when nodes reboot and services do not come back up. SAM people appear to not get any notification when some of these nodes reboot.
  • Done during Evaluation Phase. Make sure this is set up on new nodes as well. – DONE!
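The band-aid mentioned under issue 10 (a cron job that moves information into sub-directories) could look roughly like the sketch below, which spreads the entries of one crowded directory across a fixed number of bucket sub-directories. This is a guess at the general technique, not the actual cron script: the bucket naming, the hash choice, and the assumption that entries are no longer being written while they are moved are all illustrative.

```python
import os
import shutil
import tempfile
import zlib

BUCKET_PREFIX = "bucket_"  # hypothetical naming scheme for the sub-directories


def bucket_for(name, bucket_count=100):
    """Map an entry name to a stable bucket directory name."""
    index = zlib.crc32(name.encode("utf-8")) % bucket_count
    return f"{BUCKET_PREFIX}{index:03d}"


def shard_directory(root, bucket_count=100):
    """Move every non-bucket entry in root into its bucket; return the count moved.

    Assumes nothing is still writing to the entries being moved.
    """
    moved = 0
    for name in sorted(os.listdir(root)):
        if name.startswith(BUCKET_PREFIX):
            continue  # already a bucket directory, leave it in place
        bucket = os.path.join(root, bucket_for(name, bucket_count))
        os.makedirs(bucket, exist_ok=True)
        shutil.move(os.path.join(root, name), os.path.join(bucket, name))
        moved += 1
    return moved


def demo():
    """Create five fake job directories in a temp dir and shard them twice."""
    with tempfile.TemporaryDirectory() as root:
        for i in range(5):
            os.mkdir(os.path.join(root, f"job_{i}"))
        first = shard_directory(root)
        second = shard_directory(root)  # nothing left to move on the second pass
        leftovers = [n for n in os.listdir(root) if not n.startswith(BUCKET_PREFIX)]
        return first, second, leftovers


if __name__ == "__main__":
    print(demo())
```

Because the bucket is derived from the entry name, a later lookup can recompute where a given job's state was moved without keeping a separate index.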

Issues List (p.4/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 14) SRM needs to be cleanly isolated from the rest of the operation: This might come about as a natural consequence of some of the other decoupling actions. A more global statement would be that we have to ensure that problems at remote sites cannot stop our local operations (especially if the problematic interaction has nothing to do with the data processing operation).
  • Nothing in Phase 1. Later Phase to include decoupling of SAM stations, 1 each for Data and MC Production.
• 15) Lack of Transparency: No correlation between the distinct grid and PBS id’s and inadequate monitoring mean it is very difficult to track a single job through the entire grid system, especially important for debugging.
  • Tool identified in Evaluation Phase to help with this. Consider refinement in later Phase.
• 16) Periods of Slow Fwd node to CAB Job transitions: related to Spiral of Death issue?
  • Condor 7 Upgrade and increase ulimit-OpenFileMaxPerProcess to high value used elsewhere.
  • Cures all observed cases? Not yet sure.
• MC-specific 1) File Delivery bottlenecks: use of SRM at site helps
  • Out of scope for Phase 1. SRM specification mechanism inadequate. Should go by the site name or something more specific.
• MC-specific 2) Redundant SAM caches needed in the field
  • Out of scope for Phase 1
• MC-specific 3) ReSS improvements needed, avoid problem sites,….
  • Out of scope for Phase 1. PM sent doc, met with Joel.
• MC-specific 4) Get LCG forwarding nodes up and running reliably
  • Out of scope for Phase 1. This is being worked on outside of Initiative Phase 1 though.
