D0 Grid Data Production: Initiative Mtg 5
Version 1.0 (draft meeting edition), 09 October 2008
Rob Kennedy and Adam Lyon
Attending: …
Outline
• News Items
• Walk through WBS
• Status Check
• Schedule Review (not all of latest discussion included… will need to update)
• Tasks for Next Week
News Items
• Executive Presentation 8-Oct-2008 with Vicky White and D0 Spokespeople
  • Green light to proceed. Evaluation closed, Initiative begun.
  • Roadmap same as presented in these meetings.
  • Schedule same as presented in these meetings, though encouraged to accomplish as much as possible before the Thanksgiving holidays.
  • Risk point: FWD4, QUE2 schedule
  • Discussion points: Condor maturity; why is the "slow FWD-CAB job transition" not debuggable from tracing? Is a code review applicable?
  • Slides to be posted later today.
• Initiative Schedule
  • Draft version in spreadsheet is firming up… working on some staff commitments and start estimates (based on leveling effort).
  • Will put into MS Project after all to get an easy graphical Gantt chart.
1.1 Server Expansion and Decoupling
• 1.1.1. FWD4, QUE2 Platform (CRITICAL PATH)
  • 1.1.1.1. Server Hardware On-site – FEF
    • Status: reported done 1 Oct 2008
  • 1.1.1.2. Server Hardware Installed (Physical and OS) – FEF
    • Start: 03 Oct 2008  Finish: 12 Oct 2008
    • Issues: Network topology – D0 switch on FCC2?
    • Status: …
  • 1.1.1.3. Server Hardware Burn-in – FEF
    • Start: 13 Oct 2008  Finish: 19 Oct 2008
    • Issue: …
    • Status: …
1.1 Server Expansion and Decoupling
• 1.1.2. FWD4, QUE2 Grid Software (CRITICAL PATH)
  • 1.1.2.1. Receive FWD4, QUE2 – REX
    • Start: 13 Oct 2008  Finish: 19 Oct 2008
    • Includes: Check OS during burn-in… products installed?
  • 1.1.2.2. FWD4 Setup – REX
    • Start: 20 Oct 2008  Finish: 27 Oct 2008
    • Depends: Must have a "Split Data/MC" configuration defined
  • 1.1.2.3. FWD4 Test – REX
    • Start: 27 Oct 2008  Finish: 9 Nov 2008
    • Includes: 2 weeks to test and have absolutely ready for production
  • 1.1.2.4. QUE2 Setup – REX
    • Start: 27 Oct 2008  Finish: 2 Nov 2008
    • Depends: Must have a "Split Data/MC" configuration defined
  • 1.1.2.5. QUE2 Test – REX
    • Start: 3 Nov 2008  Finish: 16 Nov 2008
    • Includes: 2 weeks to test and have absolutely ready for production
    • Status: …
  • 1.1.2.6. Jim_Client 2-QUE Preparation – REX
    • Start: 3 Nov 2008  Finish: 9 Nov 2008
    • Includes: Some kind of pre-release test, perhaps as part of QUE2 testing.
1.1 Server Expansion and Decoupling
• 1.1.3. Grid System Configuration (CRITICAL PATH)
  • New: Planning meeting held. Decided to use a 2-stage approach (not three), and on the basic configuration to be used for the Split Data/MC Production stage.
  • 1.1.3.1. Split Data/MC Production – REX
    • Start: 20 Oct 2008  Finish: 26 Oct 2008
    • Includes: Configuration defined and stored in retrievable form
    • Status: …
  • 1.1.3.2. Optimize Data and MC Production – REX
    • Start: After initial deployment, based on operations experience.
    • Depends: Initial deployment
    • Status: …
1.1 Server Expansion and Decoupling
• 1.1.4. New SAM Station Platform – FEF (CRITICAL PATH)
  • Input: Identify node(s) to prepare. Availability of nodes if they currently serve another role.
  • Effort Required: ? – Move hardware? Wipe and install as station server. Support?
  • Duration: 2 weeks, Start: Soon, End: Soon + 2 weeks
  • Needed/Involved: (more detailed plan)
• 1.1.5. New SAM Station Software and Config – REX (CRITICAL PATH)
  • Input: Station platform prepared
  • Effort Required: ? – Basic station install and internal adaptation is trivial, but should be thought out more given the risks. Adapting the wider world to use the new station may require coordination and some calendar time.
  • Duration: 2 weeks (not 4 weeks), Start: 20 Oct, End: 2 Nov
  • Includes DEV task to create 2-QUE-capable Jim_Client packaging
Area of Work 2
• 2. Condor 7 Upgrade and Support
  • A. VDT Package Preparation – GRID, REX
    • Tactic: Same as before?
    • Input: VDT with necessary features, stability
    • Effort Required: 1 week? (may be non-trivial due to SAM-Grid coupling; requires some testing to know)
    • Duration: 4 weeks??, Start: "Now", End: Wed 29 Oct?? (see discussion above)
    • Needed/Involved: Release of patched VDT 1.10.1, testing by SAM-Grid Dev and REX to evaluate usability.
  • B. Install new VDT and Condor 7 – REX, FGS?
    • Input: VDT/Condor package, SAM-Grid package ready to deploy
    • Scope: Only FWD and QUE nodes would be upgraded in this task.
    • Effort Required: 1 day
    • Duration: 1 day, Start: ?, End: ?
    • Needed/Involved: Downtime preferred? QUE node yes; FWD nodes perhaps done in rolling fashion. Work this out (a rolling-upgrade sketch follows this slide).
    • AL: What is the transition to Condor 7? Do we need to nuke all jobs first and start Condor out fresh?
  • C. Condor Support Communication Processes: Reminder that Steve Timm is the central point of contact for the Condor team. Please reference the unified Condor issues webpage. E-mail on issues should cc him.
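For item 2.B, a minimal sketch of what the rolling FWD-node upgrade could look like, assuming a drain / peaceful-shutdown / install / verify cycle per node. The node list, drain-flag mechanism, installer path, and smoke-test script are placeholders rather than existing SAM-Grid or VDT tooling; only the condor_q / condor_off / condor_master commands are standard Condor.

#!/usr/bin/env python
"""Hypothetical rolling Condor 7 / VDT upgrade of forwarding nodes.

Each FWD node is drained, upgraded, and smoke-tested before the next
one is touched, so data production never loses more than one node.
All remote commands below are placeholders for the real procedures.
"""
import subprocess
import time

FWD_NODES = ["samgfwd01", "samgfwd02", "samgfwd03", "samgfwd04"]  # example names

def ssh(node, cmd):
    """Run a command on a node via ssh and return its output as text."""
    return subprocess.check_output(["ssh", node, cmd]).decode()

def active_jobs(node):
    """Count jobs still known to the schedd on this node (drain check)."""
    out = ssh(node, "condor_q -format '%d\\n' ClusterId | wc -l")
    return int(out.strip() or 0)

def upgrade_node(node, poll=300):
    # 1. Stop matching new work to this node (placeholder drain mechanism).
    ssh(node, "touch /var/tmp/fwd_drain_flag")
    # 2. Wait for the queue to empty before touching Condor.
    while active_jobs(node) > 0:
        time.sleep(poll)
    # 3. Shut down Condor peacefully, install new VDT/Condor 7, restart.
    ssh(node, "condor_off -peaceful")          # lets running work finish
    ssh(node, "/opt/vdt/install_condor7.sh")   # placeholder installer
    ssh(node, "condor_master")                 # bring services back up
    # 4. Verify before moving on to the next node (placeholder smoke test).
    ssh(node, "/opt/samgrid/smoke_test.sh")

if __name__ == "__main__":
    for n in FWD_NODES:
        upgrade_node(n)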
1.1 Server Expansion and Decoupling
• 1.1.6. Deployment Stage 1 (CRITICAL PATH)
  • Planning: 10–16 Nov; be sure all necessary components and the Split Data/MC config are ready.
  • 1.1.6.1. Execute Deployment 1 – REX
    • Start: 17 Nov 2008  Finish: 24 Nov 2008
    • Consider: Deploy as early in the time frame as possible to ensure smooth holiday operations.
  • 1.1.6.2. Sign-Off Deployment 1 – REX
    • Start: After Deployment 1, BEFORE the Thanksgiving holiday
    • Contains: Formal sign-off from involved groups that stable operations have been achieved.
• 1.1.7. Deployment Stage 2 (CRITICAL PATH)
  • Planning: 1–7 Dec; be sure all necessary components and the optimized config are ready.
  • 1.1.7.1. Execute Deployment 2 – REX
    • Start: 8 Dec 2008  Finish: 14 Dec 2008
    • Consider: Deploy as early in the time frame as possible to ensure smooth holiday operations.
  • 1.1.7.2. Sign-Off Deployment 2 – REX
    • Start: After Deployment 2, BEFORE the winter holidays
    • Contains: Formal sign-off from involved groups that stable operations have been achieved.
1.3. Small Quick Wins
• 1.3.1. Reliable status info returned by SAM-Grid – dev by GRID, test by REX
  • Resolution in two parts: unlimited proxy for gridftp (1d) and distinguishing the status of jobs in HELD and DONE states (1w+).
  • Input: Is this specified enough, or was past meeting discussion sufficient?
  • Effort Required: 2 FTE-weeks (SAM-Grid Dev) to address. Overlap with VDT/SAM-Grid changes?
  • Duration: 3 weeks, Start: Soon, End: Wed 22 Oct
  • MD: Another big example of the impact of this recently. Please address this at the highest priority.
  • Needed/Involved: Testing by REX before production deployment. Overlap with VDT/SAM-Grid upgrade testing?
  • Ideally this could be deployed before the Data/MC Prod split, to distinguish problem causes.
• 1.3.2. Slow FWD node to CAB job transition investigation – we can see when it is slow, but what is the upstream or downstream cause?
  • This is a work in progress, PM in collaboration with a Condor developer. Need to catch the slowdown in situ. Work on an alarm to catch this... Metric to measure time to transition? (A measurement sketch follows this slide.)
  • EB: Try applying Andrew's monitoring to this. RDK: Perhaps ask for his effort to apply the monitoring to catch this in situ.
  • Recent "nuke FWD1,2" episodes: any evidence or hints of the cause found?
• 1.3.3. Improved Hardware Uptime – Medium priority
  • Define this a little better as a package of work. Concern about this blowing up into the biggest issue mid-Initiative.
  • Sanity check run after a reboot or restart? Depth of the spare pool? Fast procedure to quickly swap out a flaky/degraded server? Virtualization? Aggressive replacement of servers if repeated errors are seen? Re-qualify pulled servers on a test stand?
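For item 1.3.2, a minimal sketch of the proposed transition-time metric and alarm, assuming a pair of per-job timestamps (arrival at the FWD node, start on CAB) can already be extracted from logs or monitoring; the 30-minute threshold, the alert address, and the sender are illustrative, and the log parsing itself is omitted.

#!/usr/bin/env python
"""Sketch of a FWD -> CAB transition-time metric and alarm (item 1.3.2)."""
import smtplib
from email.mime.text import MIMEText

ALARM_THRESHOLD_SEC = 30 * 60          # illustrative: alarm if median > 30 min
ALERT_TO = "d0-grid-ops@example.com"   # placeholder address

def median(values):
    s = sorted(values)
    return s[len(s) // 2] if s else 0.0

def transition_times(jobs):
    """jobs: iterable of (grid_job_id, t_arrived_fwd, t_started_cab) in seconds."""
    return {jid: t_cab - t_fwd for jid, t_fwd, t_cab in jobs if t_cab >= t_fwd}

def check_and_alarm(jobs):
    """Compute the median FWD->CAB transition time and mail an alarm if slow."""
    times = transition_times(jobs)
    med = median(list(times.values()))
    if med > ALARM_THRESHOLD_SEC:
        worst = sorted(times.items(), key=lambda kv: kv[1], reverse=True)[:10]
        body = "Median FWD->CAB transition %.0f s exceeds threshold.\n" % med
        body += "\n".join("%s  %.0f s" % kv for kv in worst)
        msg = MIMEText(body)
        msg["Subject"] = "ALARM: slow FWD->CAB job transitions"
        msg["To"] = ALERT_TO
        smtplib.SMTP("localhost").sendmail(
            "fwd-monitor@localhost", [ALERT_TO], msg.as_string())
    return med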
Backup Slides
• … not used in the meeting …
Area of Work 1
• 1.1. Server Expansion and Decoupling of Data/MC Prod at Services
  • Related support issues:
    • Transfer support: Queue node support from FGS to FEF – note: no timescale stated here.
    • Transfer support: CAB head nodes (d0cabsrv1,2) from FEF to FGS – no timescale stated here.
    • Resolve details in an FEF/FGS meeting. May wipe/install OS per group's standards. (Discussed at the CD FY09 Budget Review.)
Issues List (p.1/4)
• 1) Unreliable state information returned by SAM-Grid: SAM-Grid under some circumstances does not return correct state information for jobs. Fixing this may entail adding some logic to SAM-Grid.
• 2) Cleanup of globus-job-managers on forwarding nodes, a.k.a. "stale jobs": The globus job managers on the forwarding nodes are sometimes left running long after the jobs have actually terminated. This eventually blocks new jobs from starting.
• 3) Scriptrunner on samgrid needs to be controlled, a.k.a. the Periodic Expression problem: This is now locking us out of all operation for ~1 hour each day. It is due to a feature in Condor 6 which we do not use, but which cannot be fully disabled either. Developers say this is fixed in Condor 7, but this has not been proven yet.
• 4) CORBA communication problems with the SAM station: The actual source of all CORBA problems is hard to pin down, but at least some of them seem to be associated with heavy load on samgfwd01, where the SAM station runs. Since the forwarding nodes are prone to getting bogged down at times, the SAM station needs to be moved to a separate node.
• 5) Intelligent job matching to forwarding nodes: SAM-Grid appears to assign jobs to the forwarding nodes at random, without regard to their current load. It will assign jobs to a forwarding node that has reached its CurMatch max even if another forwarding node has job slots available. (A load-aware selection sketch follows this list.)
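For issue 5, a minimal sketch of what load-aware matching could look like: pick the forwarding node with the most free capacity under its CurMatch limit instead of picking at random. The node names and limits are illustrative, and the query for a node's current match count is stubbed out since it is site-specific.

"""Sketch of load-aware job matching to forwarding nodes (issue 5)."""
import random

# Illustrative CurMatch limits per forwarding node (not real values).
CURMATCH_LIMIT = {"samgfwd01": 200, "samgfwd02": 200, "samgfwd03": 200}

def current_matches(node):
    """Placeholder: return the node's current number of matched jobs."""
    raise NotImplementedError("query the collector / SAM-Grid monitoring here")

def pick_forwarding_node(nodes=CURMATCH_LIMIT):
    """Return the node with the most free CurMatch slots, or None if all are full."""
    free = {n: limit - current_matches(n) for n, limit in nodes.items()}
    candidates = [n for n, f in free.items() if f > 0]
    if not candidates:
        return None          # every node is at CurMatch: hold the job instead
    best = max(free[n] for n in candidates)
    # Break ties randomly so load spreads evenly among equally free nodes.
    return random.choice([n for n in candidates if free[n] == best])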
Issues List (p.2/4)
• 6) Capacity of durable location servers: Merge jobs frequently fail due to delivery timeouts of the unmerged thumbnails. We need to examine carefully what functions the durable location servers are providing and limit activity here to production operations. Note that when we stop running Recocert as part of the merge, this problem will worsen.
• 7) CurMatch limit on forwarding nodes: We need to increase this limit, which probably implies adding more forwarding nodes. We would also like to have MC and data production separated onto different forwarding nodes so response is more predictable.
• 8) Job slot limit on forwarding nodes: The current limit of 750 job slots handled by each forwarding node has to be increased. Ideally it would be large enough that one forwarding node going down only results in slower throughput to CAB rather than a complete cutoff of half the processing slots. Could be addressed by optimizing the fwd node config for data production. (A small sizing check follows this list.)
• 9) Input queues on CAB: We have to be able to fill the input queues on CAB to their capacity of ~1000 jobs. The configuration coupling between MC and data production that currently limits this to ~200 has to be removed. Could be addressed by optimizing the fwd node config for data production.
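For issue 8, a back-of-envelope check of the "one node down should only slow us, not cut us off" sizing, using only the numbers quoted above (the ~1000-job CAB queue capacity and the current 750-slot limit); the forwarding-node counts are assumptions for illustration.

"""Back-of-envelope check of the N-1 sizing idea in issue 8."""
import math

CAB_QUEUE_CAPACITY = 1000   # ~1000 jobs per CAB input queue (issue 9)
CURRENT_SLOT_LIMIT = 750    # current per-FWD-node job slot limit (issue 8)

def slot_limit_for_n_minus_1(n_fwd_nodes, demand=CAB_QUEUE_CAPACITY):
    """Per-node slot limit such that losing any one node still covers demand."""
    if n_fwd_nodes < 2:
        raise ValueError("need at least two forwarding nodes for N-1 coverage")
    return int(math.ceil(float(demand) / (n_fwd_nodes - 1)))

if __name__ == "__main__":
    # With 2 data-production FWD nodes, each must carry the full ~1000 jobs
    # alone, i.e. the limit must exceed the current 750; with 3 nodes, 500
    # slots per node would already suffice.
    for n in (2, 3):
        print("%d nodes -> %d slots/node" % (n, slot_limit_for_n_minus_1(n)))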
Issues List (p.3/4)
• 10) 32,001 Directory problem: A band-aid is in place, but we should follow up with the Condor developers to communicate the scaling issue of storing job state in a file system, given the need to retain job state for tens of thousands of jobs in a large production system.
• 11) Spiral of Death problem: See for example the reports from 19-21 July 2008. Rare, but these episodes stop all processing, and we do not understand the underlying cause yet. The only known way to address the situation is to do a complete kill/cold-stop and restart of the system.
• 12) Various Globus errors: We have repeated episodes where a significant number of jobs lose all state information and fall into a "Held" state due to various Globus errors, usually something like "Job state file doesn't exist", "Couldn't open std out or std err", or "Unspecified job manager error". Mike doesn't think we have ever clearly identified the source of these errors; his guess is that they have a common cause. They tend to occur in clusters (about half a dozen showed up last night, which is what brought it to mind). They usually don't result in the job failing, but such jobs have to be tracked by hand until complete, and in some cases all log information is lost.
• 13) Automatic restart of services on reboot: Every node in the system (samgrid, samgfwd, d0cabosg, etc.) needs to be set up to automatically restart all necessary services on reboot. We have lost a lot of time when nodes reboot and services do not come back up. SAM people appear to get no notification when some of these nodes reboot. (A boot-time check sketch follows this list.)
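For issue 13, a minimal sketch of one possible approach: a check run at boot (from an init script or a cron @reboot entry) that verifies the expected daemons are up and starts any that are missing. The service names and start commands are placeholders for the real per-node-type lists, and actual notification of the SAM operators would replace the final print.

#!/usr/bin/env python
"""Sketch of a boot-time service check and restart for issue 13."""
import socket
import subprocess

# Hypothetical map of process name -> start command for this node type.
SERVICES = {
    "sam_station":   ["/etc/init.d/sam_station", "start"],
    "condor_master": ["/etc/init.d/condor", "start"],
}

def is_running(proc_name):
    """True if a process with exactly this name is found (via pgrep)."""
    return subprocess.call(["pgrep", "-x", proc_name]) == 0

def ensure_services():
    """Start any expected service that is not running; return what was started."""
    restarted = []
    for name, start_cmd in SERVICES.items():
        if not is_running(name):
            subprocess.call(start_cmd)
            restarted.append(name)
    return restarted

if __name__ == "__main__":
    restarted = ensure_services()
    # Always leave a trace so SAM operators can see that the node rebooted;
    # a real deployment would mail or page here instead of printing.
    print("%s rebooted; restarted services: %s"
          % (socket.gethostname(), ", ".join(restarted) or "none needed"))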
Issues List (p.4/4)
• 14) SRM needs to be cleanly isolated from the rest of the operation: This might come about as a natural consequence of some of the other decoupling actions. A more global statement would be that we have to ensure that problems at remote sites cannot stop our local operations (especially if the problematic interaction has nothing to do with the data processing operation).
• 15) Lack of transparency: No correlation between the distinct grid and PBS ids, plus inadequate monitoring, makes it very difficult to track a single job through the entire grid system, which is especially important for debugging. (An id-correlation sketch follows this list.)
• 16) Periods of slow FWD node to CAB job transitions: newly added.
• MC-specific 1) File delivery bottlenecks: use of SRM at a site helps.
• MC-specific 2) Redundant SAM caches needed in the field.
• MC-specific 3) ReSS improvements needed; avoid problem sites, ….
• MC-specific 4) Get LCG forwarding nodes up and running reliably.
• MC Note) 80% efficiency on CAB, but 90% on CMS – why? Something about CAB/PBS to note here for data production expectations too?
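For issue 15, a minimal sketch of a grid-id to PBS-id correlation table. The mapping is assumed to be parseable from forwarding-node jobmanager logs; the regular expression below is illustrative rather than the real log format. Given the mapping, tracing one job end to end reduces to two dictionary lookups.

"""Sketch of a grid-id <-> PBS-id correlation table for issue 15."""
import re

# Illustrative pattern: "... grid id <gridid> ... pbs id <pbsid> ..."
LOG_LINE = re.compile(r"grid id (\S+).*pbs id (\S+)", re.IGNORECASE)

def build_maps(log_lines):
    """Return (grid_to_pbs, pbs_to_grid) dictionaries from log lines."""
    grid_to_pbs, pbs_to_grid = {}, {}
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m:
            gid, pid = m.group(1), m.group(2)
            grid_to_pbs[gid] = pid
            pbs_to_grid[pid] = gid
    return grid_to_pbs, pbs_to_grid

def trace(job_id, grid_to_pbs, pbs_to_grid):
    """Print both ids for a job given either one, for debugging."""
    if job_id in grid_to_pbs:
        print("grid %s -> PBS %s" % (job_id, grid_to_pbs[job_id]))
    elif job_id in pbs_to_grid:
        print("PBS %s -> grid %s" % (job_id, pbs_to_grid[job_id]))
    else:
        print("%s not found in either mapping" % job_id)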